The application was submitted by the Haute école spécialisée de Suisse occidentale HES-SO - Haute Ecole de Gestion, Genève, in cooperation with HTW Chur and the Universitätsbibliothek Basel (project swissbib).
One of the aims of the project is to transform the current swissbib content into a semantic data format (RDF) in order to enable and simplify the linking of swissbib data with external information.
Finally, the services created by linked.swissbib.ch should run on the infrastructure provided by swissbib.ch, giving scientists and general library users easier access to the linked information. Additionally, the infrastructure should be available and usable for any other project or service that wants to work on similar questions (catchphrase: “swissbib as a workbench”).
This first blog contribution reflects my current considerations and experiments regarding the tools selected as part of the future infrastructure. Of course this is work in progress; contributions and suggestions are always welcome.
I. General requirements for future tools and the infrastructure to be created by linked.swissbib.ch (and swissbib classic)
a) The software has to transform any kind of meta-data structure into another (meta-data) structure.
b) In general, the chosen software should be reusable and extensible. The development tools used should be state of the art and future-proof (if that is at all possible in the software world...).
c) The software has to be Open Source – otherwise it is often difficult or nearly impossible for other parties or services to (re)use it.
d) The software shouldn't be in its infancy. Because of our very limited resources we can't afford to build new software from scratch. It should already be in use by other institutions in a way similar to how we intend to use it for linked.swissbib.ch.
e) Ideally, a community should exist around the system that enables further development in the future.
f) The software for meta-data transformation should be usable not only by software developers.
People who deal with meta-data transformations in the library world (scripting knowledge is helpful but not necessary) should be able to define workflows for their own purposes – recently I heard that the phrase “biblio information scientist” is sometimes used for such a group of people.
g) The software tools should be very scalable.
The mechanisms should be available on a desktop (for example, for a person developing a workflow for metadata transformations), and at the same time it should be possible to run the same artifact on a cluster to process millions of items in reasonable time. The classic swissbib content currently consists of around 20 million (merged) records. We are also responsible for including licensed article meta-data in the swissbib data hub. This content (potentially between 30 and 150 million descriptions) should likewise be transformed and serialized into RDF.
II. Possible software alternatives to reach the defined goals
A. Infrastructure developed by the Culturegraph project (German National Library)
One of the main components provided by Culturegraph is called Metafacture, which in turn comprises three main parts: Framework, Morph and Flux. You can find a more detailed explanation of these parts in the Wiki of the software on Github.
To get to know and evaluate the component I created my own repository on Github, where I also included the sources of the lobid project.
1) How are the main components of Culturegraph related to the general requirements defined above?
a) Framework:
The framework makes the solution reusable and extensible. It defines general classes and interfaces that others (swissbib too) can use to extend the solution in the way their aims require.
As an example: the metafacture-core component provides commands which are combined into FLUX workflows (we will see a description of this later). HBZ extended these commands with some of their own to serve their needs.
Besides new commands used as part of a workflow, you can also define new types and functions as part of the Metamorph domain-specific language.
The Culturegraph project itself uses the same mechanism to implement specialized functionality with dedicated commands alongside the metafacture-core commands. There are already connectors for SQL databases, for NoSQL stores like the widely used MongoDB, and for including Mediawiki content as part of the meta-data transformations. All these repositories are available on Github.
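To make this extension point a bit more concrete, here is a minimal, hypothetical sketch of a custom module the way I currently understand the framework. The class and package names on our side are invented, and the imports follow the metafacture-core 2.x layout as far as I can tell, so details may well differ:

```java
package org.swissbib.metafacture.example; // hypothetical package

// Import paths as I recall them from metafacture-core 2.x; they may differ
// in other versions.
import org.culturegraph.mf.framework.DefaultObjectPipe;
import org.culturegraph.mf.framework.ObjectReceiver;

/**
 * Passes every incoming string on to the next module in the chain,
 * trimmed and lower-cased. A module like this, registered as a FLUX
 * command, could be used in workflows next to the metafacture-core commands.
 */
public final class NormalizeString
        extends DefaultObjectPipe<String, ObjectReceiver<String>> {

    @Override
    public void process(final String obj) {
        // forward the normalized value to the downstream receiver
        getReceiver().process(obj.trim().toLowerCase());
    }
}
```

If I understand the wiki correctly, such a class is then mapped to a command name in a properties file so that FLUX can pick it up like any built-in command.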
b) (Meta)-Morph transformation language
Morph is a domain-specific language for defining meta-data transformations, expressed in XML. Here you can find a larger example of a morph script built by HBZ to transform bibliographic MAB descriptions into RDF. For the result of this transformation (done by myself with the evaluation repository on Github) look here.
My guess (at the moment – I'm still in the evaluation phase): Metamorph enables you to define at least most of the transformations required for our needs. For most of the work, knowledge of programming languages isn't necessary; you have to know your meta-data structures and how you want to transform them.
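To give an impression of what such a definition looks like, here is a small, purely illustrative Metamorph script. The field addresses and literal names are my own examples, not taken from an existing swissbib workflow, and the namespace and attributes follow the Metamorph documentation as I read it:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative example: map two MARC fields to named output literals -->
<metamorph xmlns="http://www.culturegraph.org/metamorph"
           entityMarker="." version="1">
  <rules>
    <!-- copy subfield 245 $a into a literal named 'title' -->
    <data source="245.a" name="title"/>
    <!-- copy 020 $a into 'isbn' and strip the hyphens on the way -->
    <data source="020.a" name="isbn">
      <replace pattern="-" with=""/>
    </data>
  </rules>
</metamorph>
```

Such a file only describes what should happen to the fields; the surrounding workflow (reading files, decoding MARC, writing the result) is defined separately in FLUX, which is described under c) below.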
General potential of Meta-Morph:
In the mid-term I have in mind to use this kind of solution not only for transformations into RDF but also for already existing processes in the classic swissbib service (for example the CBS FCV transformations and the search engine – SOLR / Elasticsearch – document processing).
c) FLUX – create workflows for transformation processes
I would describe FLUX as a “macro language” (similar to macros in office programs) for defining processing pipelines. Here you can find an example.
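For illustration, here is a small sketch of such a pipeline, modeled on the examples shipped with metafacture-core (the file names are placeholders, and the exact command names and options may differ between versions):

```
// read a file of MARC21 records, transform each record with a morph
// definition and print the result to stdout
"records.marc21"
| open-file
| as-lines
| decode-marc21
| morph("morph-definition.xml")
| encode-formeta(style="multiline")
| write("stdout");
```

Each command is one module of the stream; replacing, say, the encoder is just a matter of swapping one element of the chain.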
The following picture gives you an overview and an idea of how heterogeneous article meta-data (provided by publishers) could be normalized in processes based on Metafacture.
To support the creation of FLUX workflows (especially for non-developers) HBZ implemented an Eclipse extension. This extension suggests commands (together with their arguments) as part of whole workflows. Additionally, you can start a FLUX process under the control of the extension. Fabian Steeg (HBZ, and creator of the extension) wrote a helpful summary of metafacture-ide and metafacture-core from his point of view.
2) Additional relations to the defined goals
As I have tried to describe, the main parts of Metafacture (and the extended components based on the core framework) address the requirements of goals a), b) and f) (meta-data transformation / reusable, extensible and future-proof / not only for software developers).
But what about the other defined goals:
- The software is Open Source, originally developed by the Deutsche Nationalbibliothek and available via Github and Maven
- The software (Metafacture) isn't in its infancy and a community has been established around it. Currently released as version 2.0, it is the heart of the Linked Data service provided by the DNB. It is already re-used by other institutions for their services (especially by HBZ for the lobid.org service and by SLUB Dresden for their EU-financed “Data Management Platform for the Automatic Linking of Library Data”).
- The infrastructure should have the potential to be very scalable.
Meta-data transformation processes are currently often based on traditional techniques like relational databases, which are proven and mature; for example, a lot of sophisticated meta-data transformation processes have been created by swissbib classic over the last five years.
We don't intend to throw away these possibilities and the experience collected in the past. Even for the considerably larger amount of meta-data for licensed articles, we think it makes sense and is possible to keep using them by increasing our (virtual) hardware.
But (probably soon) there will come a point in time when newer possibilities should be used, at least in parallel. These possibilities are often labeled with the catchphrase “Big Data”, and behind this catchphrase are mostly Hadoop for distributed processing and distributed (NoSQL) stores like HBase.
Because Metafacture workflows are stream-based (combining several modules in a chain), they are well prepared to run on a Hadoop cluster. Culturegraph has already done this with their component called metafacture-cluster.
This technique made it possible for them to bring together the bibliographic resources of all the library networks in Germany and to build clusters of equal or similar data (which is a foundation of their Linked Data service) – really fast. The same is currently done by swissbib classic within our data hub based on traditional technologies (see above). Here you can find a more detailed view of the architecture of the current swissbib.
We have to keep this aspect (processing really large amounts of data) in mind – perhaps not immediately when it comes to article data, but with greater probability if research data is to be incorporated into a data hub like swissbib.
B. Toolset developed by LibreCat (with Catmandu as the Backbone)
I don’t have any experience with LibreCat. I spoke with a colleague (who is involved in the community) about the current status of the project and we tried to compare it on a very high level.
One sentence has stuck in my mind:
Because Culturegraph is based on Java, they (Culturegraph) use a lot of XML, while LibreCat (implemented in Perl) can do the same much faster directly in the code.
I guess this is true as far as it goes... but from my point of view there are some misconceptions in it:
a) XML isn't used because the core of Metafacture is implemented in Java; Java and XML aren't correlated in any way. Metafacture uses XML because it is the means (nothing more) to express the DSL (the domain-specific language Metamorph) used to define meta-data transformations. This makes it possible for people without advanced programming skills to define such transformations at a higher level.
b) A clear separation between the implementation and the definitions of the meta-data transformations (the rules) matters for keeping the system maintainable in the long term, even though it is probably faster and easier for the developer themselves to implement transformations directly in Perl.
- Maybe a very subjective point of view: today I wouldn't implement a system in Perl. If a dynamic scripting language were called for, my recommendation would be something like Python or Ruby. Although things change pretty fast nowadays, my impression is that Perl isn't used much by younger people. It might be different in the current library world, but my assumption is that this is going to change in the future.
- Especially when it comes to processing large amounts of data, I wouldn't use dynamic scripting languages. Execution performance is one thing; the other is that integration with clusters for large amounts of data is less well supported.
Nevertheless, we should keep an eye on the development of the project. A noticeable group of people (able to contribute) is interested in it and, as I tried to depict in my picture above, I guess we could integrate it (where reasonable) into an infrastructure like Metafacture.
Another aspect: if we want to re-use the work on the normalization of licensed article meta-data done by GBV, we have to be “Java compatible”. If I'm not mistaken, that work is based on this programming language.