Monday, 14 March 2016

National Licences and article metadata in Germany, Canada, France and Switzerland


Introduction - the Swiss Context


In 2015, Switzerland launched the Swiss National Licences project. This two-year project is funded by swissuniversities through the SUK-P2 program "Scientific information: Access, processing and storage", with a total budget of 10 million Swiss francs: 7.5 million to buy content from publishers, 2 million to ensure preservation for the whole of Switzerland (probably with Portico and LOCKSS) and 0.5 million for contract negotiations, overall management and metadata management. The project is led by the Consortium of Swiss University Libraries (at ETH Zurich). The metadata management subproject has been allocated to the Swissbib team at the University of Basel.

Access will be possible for every partner of the Consortium: universities, universities of applied sciences, state libraries, research institutes… One of the goals is therefore to bridge the gap between “rich” and “poor” institutions in Switzerland. Interested private persons living in Switzerland can also access the content directly from home after completing a suitable registration.

The Swiss National Licences will be coupled with the current content licences. The process can be summarized as follows. Let’s say Switzerland has a national licence for journal articles from a given publisher covering the years 1947-2008. At the same time, some universities in Switzerland have a licence for current content covering 2009-2016. At the beginning of 2017, if some universities sign for 2017 content from this publisher, then one more year becomes available as part of the national licence (meaning the national licence expands to 1947-2009). The details (when a new year is added and under which conditions) may vary from publisher to publisher, but that’s the main idea.
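The coupling described above behaves like a fixed-width moving wall. The following is a minimal sketch of that arithmetic, assuming the wall width is simply the number of current-content years in the example (2009-2016); the function name and the fixed-width rule are illustrative assumptions, since the actual conditions vary from publisher to publisher.

```python
def national_licence_end(latest_current_year: int, wall_years: int, archive_end: int) -> int:
    """Last year covered by the national licence under a fixed-width moving wall.

    Assumption: the licence never shrinks below the originally purchased archive.
    """
    return max(archive_end, latest_current_year - wall_years)


# Values taken from the example in the text.
ARCHIVE_START, ARCHIVE_END = 1947, 2008
WALL_YEARS = 2016 - ARCHIVE_END  # 8 years of current content (2009-2016)

for latest in (2016, 2017, 2018):
    end = national_licence_end(latest, WALL_YEARS, ARCHIVE_END)
    print(f"current content signed up to {latest}: "
          f"national licence covers {ARCHIVE_START}-{end}")
```

With current content signed up to 2017, the sketch yields 1947-2009, which matches the example in the text.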

The goals of the metadata management subproject are the following:

  • Private users. Build a search engine that allows private users in Switzerland to search and access the content licensed for them. This will be done using the existing Swissbib infrastructure. The registration and authentication mechanism will be created together with SWITCH.
  • Integration in library discovery tools. For participating libraries which already have some kind of discovery tool, the integration of content should be seamless, for example through the creation of dedicated targets in ExLibris SFX or ProQuest Intota.
  • Gain experience in article metadata management. Publishers’ metadata has not yet been managed at the article level in Switzerland. The goal is to gain experience with this in order to deliver additional services later on (for example within the SLSP project): text and data mining, discovery tools for smaller institutions (universities of applied sciences) as well as collaboration with international partners.

The contracts are not signed yet, but the goal is to sign them in spring 2016 and to have the portal live at the end of 2016. Negotiations are still under way and the contracts will be evaluated by an independent board (“Evaluationsgremium”) in the coming weeks.

At the start of the project, the goal was to gather what has been done abroad in this respect. There was a telephone call with the people from Toronto, a meeting in Göttingen with the people from GBV and Finc, and a meeting in Nancy with the people from INIST. All of this took place in January 2016.




Summary National Licences

 

| | Germany | Ontario/Canada | France | Planned for Switzerland |
|---|---|---|---|---|
| Link | http://www.nationallizenzen.de | http://scholarsportal.info | http://www.istex.fr | - |
| Start year | 2004 | 2006 | 2012 | 2015 |
| Number of articles in the platform | 25 million | 44 million | 20 million | 3.5 million |
| Budget | > 120 million € | - | 60 million € | 10 million CHF |
| Main focus | access | preservation | text and data mining | access |
| Participating publishers | ~30 | 30 | 20 | 4 |
| Pivot format | OCLC PICA+ | NLM JATS | MODS | NLM JATS |
| Fulltext (pdf) delivery on the project platform | yes, but via a proxy to the publisher’s platform | yes | yes | no |
| Access for private users | yes | no | no | yes |
| Rights to share the metadata | yes | no | yes | yes |
| Language for transformation | C and Java | Java in the past, XSLT now | XSLT | XSLT |
| Document store | OCLC CBS | MarkLogic | - | OCLC CBS |
| Search engine technology | SOLR | MarkLogic | Elasticsearch | SOLR |
| Front-end technology | VuFind | MarkLogic | no front-end | VuFind |

 

 

Summary Article Indexes

Article indexes are not bound to national licences. They contain a large number of journal articles, which can then be matched with libraries’ e-journal holdings.

| | GBV Zentral | Finc |
|---|---|---|
| Link | http://findex.gbv.de | http://finc.info |
| Start year | 2012 | 2014 |
| Number of articles in the platform | 130 million | 80 million |
| Participating publishers | ~40 | mostly via Crossref |
| Pivot format | MARC21 | local format |
| Rights to share the metadata | yes | yes |
| Language for transformation | C and Java | Go |
| Search engine technology | SolrCloud | SOLR |
| Front-end technology | no front-end | VuFind |

 

 

Germany Nationallizenzen & Allianzlizenzen

Germany was a pioneer in national licensing. The DFG (Deutsche Forschungsgemeinschaft) project German National Licences was launched in 2004 and followed in 2011 by the so-called Allianz-Lizenzen. The National Licences were funded solely by the DFG, and the licensed content is accessible throughout Germany. The Allianz-Lizenzen are funded 25% by the DFG and 75% by participating universities, and the licensed content is available only to participating institutions. After a couple of years (a so-called moving wall, whose duration depends on the publisher), an Allianz licence becomes a National Licence, and access is then granted to the whole of Germany. Since the beginning of the project, Germany has spent more than 120 million euros on the various projects. The negotiations are done by 8 libraries in Germany and the metadata handling is done by the GBV (Gemeinsamer Bibliotheksverbund) in Göttingen.
So far, they have managed the metadata of ~25 million journal articles (and other kinds of content such as book chapters). They get the metadata directly from the publishers. In the first years, processing this metadata was really a pain, but it has improved in recent years. They transform all metadata into the OCLC PICA+ format and store it in the OCLC CBS software. Access is provided through the Suchkiste via a VuFind interface and a SOLR index. They loaded all content into the Suchkiste in 2010, but have not updated it since because this was not a priority.
Interested libraries can download article metadata in MARC21 format (as an export from OCLC CBS) from the GBV. They can also directly access the GBV Zentral index (SOLR). The metadata is available to everybody. The description of the collections (journal holdings) is in EZB, the German electronic journals database.
Regarding the fulltexts (pdf), access is provided directly on the publisher’s platform, but the GBV stores the pdf as well in case of failure. This currently amounts to 60 TB of data. In the contracts for national licences, it was mandatory for the publishers to deliver the pdf on their own platform for the next 10 years.
Access is also possible for private users who live in Germany. The person needs to register online and give their private home address. The registration is then quickly checked by one of the 8 libraries (depending on the first letter of the last name), which subsequently sends a letter by post to the person. The login password is enclosed in the letter. This process can take up to 10 days. At the time of registration, the user must say which publisher they are interested in, because each publisher has different conditions. Access is then provided via Shibboleth authentication to a proxy at GBV, which gives access to the National Licences content. On some publishers’ platforms, it is also possible to log in directly via Shibboleth. Since the beginning of the project, there have been roughly 8000 active private users each year. Every year about 2000 are added, and 2000 are removed because they are inactive.

 

 

Ontario/Canada ScholarsPortal project

The ScholarsPortal project from Ontario/Canada started in 2006. It delivers content (metadata + fulltext) to the members of all universities in the province of Ontario, in Canada. By now, they have worked with over 35 academic publishers and aggregators that are able to deliver article metadata and fulltext. After focusing on access in the first years, they have recently invested more in preservation. They are now an audited trustworthy digital repository (ISO 16363, known as the Trusted Digital Repository Checklist) and act as a preservation agent for all university libraries of Ontario. Currently ~3 persons (or rather 3 FTE) are working on the journals part of Scholars Portal.
They have managed the metadata and fulltext of ~44 million journal articles. They get the metadata directly from publishers. They transform everything into the NLM JATS format using either plain Java programming or, in recent years, XSLT stylesheets. Data is then stored in a MarkLogic Server, which is also used to search (via XQuery) and deliver content. For the last couple of years, they have also been processing eBooks, using the NLM BITS format. They even provide navigation by volume/issue, using additional normalization rules. For libraries, they create dedicated targets in the ExLibris SFX link resolver as well as in the Serials Solutions knowledge base, 360 Core. A researcher in Ontario can access the fulltexts (pdf or XML) either on the publisher’s platform or directly in Scholars Portal. They observed differences in behavior between small and big universities: users from larger universities tend to prefer publishers’ platforms, whereas users from smaller institutions download the majority of content directly from Scholars Portal. On Scholars Portal, even the links (citations) between articles can be resolved internally in the portal.
They don’t have the rights to share metadata with libraries outside the consortium, but if a specific publisher agrees, they are ready to share.

 

 

France ISTEX project

The ISTEX project started in 2012. It is planned to run until 2017 and is funded with 60 million euros (55 of which are allocated to buying content). It is a collaboration between:
  • the Couperin consortium, which focuses on collecting the needs of researchers and libraries
  • the ABES in Montpellier, which negotiates with the publishers and signs the contracts. The integration in knowledge bases of library IT providers (such as ExLibris SFX) is also done by the ABES via the BACON project
  • the University of Lorraine in Nancy, which represents the universities in the project
  • the INIST (from CNRS) in Nancy, which is building the platform and processing publishers’ metadata
They have no plans to build a portal; the aim is rather to be a data hub, with a strong focus on text and data mining. Currently ~15 persons are working at INIST for the ISTEX project. There are 16 million documents currently in ISTEX and this will go up to ~20 million by the end of the project. They get the metadata and fulltext directly from the publishers. They deliver the fulltext (pdf) to the researchers as well (which is not the case in the German project, for example). In the contracts, the publishers only need to deliver the fulltext on their own platform for the next five years. The ISTEX team even analyzes the text content of pdf files, using among others the GROBID software. They enrich the content (for example with GeoNames from France).
They have had a lot of problems with metadata and fulltexts from publishers: XML that is invalid with respect to the attached DTDs, undocumented formats, strong heterogeneity of formats (even within the same journal over time), pdf of very poor quality, missing content… In some cases, it was impossible to match the metadata and the pdf files. In recent months, they have asked the publishers to deliver the full content before signing the contract, to make sure that it is possible to work with the delivered content. They also ask the publishers to deliver a whole data package at once: metadata, fulltexts, a description of the structure of the directories and filenames, a list of journals with years and number of articles published, DTDs, and the contact details of a person responsible for technical matters. Preservation is done by another institution, the CINES. Everything received from the publishers is kept.
They offer all the content (metadata + fulltext) via a REST API. Researchers can then mine the whole content or a specific collection. The first users work in automated text analysis or in the history of science. Up to now, usage and interest are still timid, but they are growing. Libraries can also use this API to integrate the content directly into their online tools, or use the widgets provided by the project. They plan to load all ISTEX metadata into tools like Ebsco Discovery Service or ExLibris Primo Central.
They analyze the incoming metadata with Elasticsearch and Kibana and report the problems to the publishers. After that, they transform everything to MODS for metadata (using XSLT stylesheets) and TEI for fulltext. If the XML is invalid, they do what is necessary to deliver valid XML in the end. Researchers who do text and data mining can also use the original format from the publisher.
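The kind of incoming-metadata check described above can be illustrated with a much simpler stand-in. The sketch below only tests XML well-formedness with the Python standard library and counts problems per publisher; it does not validate against DTDs and does not use Elasticsearch or Kibana as the ISTEX team does. The publisher names, file names and sample records are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_deliveries(deliveries):
    """Return a list of (publisher, filename, error message) for unparsable XML.

    `deliveries` is an iterable of (publisher, filename, xml_text) tuples.
    Only well-formedness is checked here, not validity against a DTD.
    """
    problems = []
    for publisher, filename, xml_text in deliveries:
        try:
            ET.fromstring(xml_text)
        except ET.ParseError as err:
            problems.append((publisher, filename, str(err)))
    return problems


def summarize(problems):
    """Count problem files per publisher, e.g. as input for a report."""
    return Counter(publisher for publisher, _, _ in problems)


# Hypothetical sample delivery; the second file has a mismatched tag.
SAMPLE = [
    ("publisher-a", "art1.xml", "<article><title>Ok</title></article>"),
    ("publisher-a", "art2.xml", "<article><title>Broken</article>"),
    ("publisher-b", "art3.xml", "<article><title>Fine</title></article>"),
]

problems = check_deliveries(SAMPLE)
for publisher, filename, message in problems:
    print(f"{publisher}\t{filename}\t{message}")
print(dict(summarize(problems)))
```

A per-publisher count like this is the sort of summary that could be sent back to a publisher before any transformation to MODS is attempted.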
Almost all metadata is licensed with an Etalab licence (a French licence very similar to Creative Commons Zero). This means that it is possible to share metadata with other libraries, even outside France.

 

 

Article Index : GBV Zentral

After building the Suchkiste for the German national licences, the GBV went one step further and decided to build an article index that contains current content as well. In 2012, they launched GBV Zentral. There are now 130 million articles available in GBV Zentral. GBV Zentral has no front-end; it is an index based on SolrCloud. The print collections of all GBV libraries are also included, so in total there are 158 million documents in GBV Zentral. 51 million of them have some kind of searchable enrichment (tables of contents, reviews, front or back matter from publishers). If a library manages its journal holdings in EZB (the German electronic journals library), it is possible to match GBV Zentral content with the electronic collections of that library and add a corresponding filter to the search. 76 libraries are using GBV Zentral (for print collections, online content or both).
Every year more than 1.8 billion searches are run in GBV Zentral. All the metadata comes from the publishers or from databases (like PubMed). The process is more or less the same as for the German National Licences. GBV Zentral is also used as a central place to deliver content to ExLibris Primo, Summon and Ebsco Discovery Service. GBV Zentral is updated daily.
Any interested institution in the world can use GBV Zentral at no cost.
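The holdings-based filter mentioned above (matching index records against the journal holdings a library maintains in EZB) can be sketched as follows. This is a simplified illustration, assuming that a holding is just an ISSN with a first and last licensed year; the record layout, the `Holding` class and the ISSN values are hypothetical and do not reflect the actual EZB or GBV Zentral data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Holding:
    """One licensed year range of a journal (hypothetical, simplified model)."""
    issn: str
    first_year: int
    last_year: int


def index_holdings(holdings):
    """Group holdings by ISSN so that lookups per article are cheap."""
    by_issn = {}
    for holding in holdings:
        by_issn.setdefault(holding.issn, []).append(holding)
    return by_issn


def is_licensed(record, holdings_by_issn):
    """True if the article's ISSN and year fall inside one of the holdings."""
    for holding in holdings_by_issn.get(record["issn"], []):
        if holding.first_year <= record["year"] <= holding.last_year:
            return True
    return False


# Made-up holdings and article records.
holdings = index_holdings([
    Holding("1234-5678", 1995, 2010),
    Holding("8765-4321", 2000, 2016),
])
articles = [
    {"id": "a1", "issn": "1234-5678", "year": 2005},
    {"id": "a2", "issn": "1234-5678", "year": 2012},
    {"id": "a3", "issn": "0000-0000", "year": 2005},
]

available = [a["id"] for a in articles if is_licensed(a, holdings)]
print(available)
```

In a real index, the result of such a match would more likely be stored as a per-library flag on each record at indexing time, so that the filter can be applied as a plain search facet.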

 

 

Article Index : the Finc Project (Leipzig / Germany)

The Finc project (from the University of Leipzig in Germany) is not bound to national licences in any way. It has a very different focus: the goal was to build a local article index rather than buying one (for example ExLibris Primo Central or ProQuest Summon). The focus was really on efficiency and current availability, without having to take care of preservation or to worry much about metadata quality. There are ~3 persons working on the project.
With this in mind, they decided to get all metadata from Crossref. Crossref is the organization which provides DOIs to scientific publishers, which means that Crossref has some metadata for every journal article with a DOI registered there. The main advantage is that all metadata is already in a common format (the Crossref unified schema) and there is only one provider to deal with. The disadvantage is that the metadata is somewhat poorer than the metadata available directly from the publishers. For example, abstracts are often missing. After this initial step, Finc decided to get metadata directly from other sources as well (for example from JSTOR or De Gruyter).
Currently, the Finc project has gathered roughly 80 million journal articles from Crossref. They transform everything into a very simple internal flat format, using the Go programming language. They then index all the metadata in Apache SOLR and present it to users through VuFind. There is growing interest in Germany in the Finc article index: a library which uses VuFind as an online discovery tool can use it without too many complications. The Finc index is updated more or less on a monthly basis.
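To give an idea of what flattening Crossref metadata into a simple internal format can look like, here is a minimal sketch. It is written in Python rather than Go, it assumes a record shaped like the JSON returned by the Crossref REST API (fields such as `DOI`, `title`, `container-title`, `ISSN`, `issued` and `author`), and the flat output fields are invented for illustration; it is not the actual Finc format.

```python
def flatten_crossref(work):
    """Reduce a Crossref-style work record to a flat dict of plain strings."""

    def first(key):
        # Crossref delivers titles, journal names and ISSNs as lists.
        values = work.get(key) or []
        return values[0] if values else ""

    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in work.get("author", [])
    ]
    date_parts = work.get("issued", {}).get("date-parts") or [[]]
    has_year = bool(date_parts[0]) and date_parts[0][0] is not None
    return {
        "doi": work.get("DOI", ""),
        "title": first("title"),
        "journal": first("container-title"),
        "issn": first("ISSN"),
        "year": str(date_parts[0][0]) if has_year else "",
        "authors": "; ".join(authors),
    }


# Made-up sample record; the DOI and names are placeholders.
sample = {
    "DOI": "10.1234/example.2015.001",
    "title": ["An example article"],
    "container-title": ["Journal of Examples"],
    "ISSN": ["1234-5678"],
    "issued": {"date-parts": [[2015, 3, 14]]},
    "author": [{"given": "Ada", "family": "Example"}, {"family": "Consortium"}],
}

flat = flatten_crossref(sample)
for key, value in flat.items():
    print(f"{key}: {value}")
```

A flat record of this kind has no abstract field, which reflects the limitation noted above: abstracts are often missing from Crossref metadata.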

 

 

Implications for Switzerland

As the coverage dates for the Swiss national licences (probably going up to 2015) won’t completely overlap with the projects from the other countries, we will need to process at least some metadata on our side. We can count on international partners for specific problems (a very bad metadata set that has already been processed, or a publisher-specific format that has already been transformed into a more standard one).
Here is what we plan to do for Switzerland. First, we have set up a collection of metadata requirements for publishers. Here are some of the important points:
  • all metadata should be delivered under a Creative Commons Zero licence to allow Switzerland to process it as needed (this allows, for example, a transformation to linked open data at a later stage)
  • the publisher needs to deliver the whole metadata set before the contract is signed. This allows the Swissbib team to check the quality of the metadata. Experience has shown that after signature, it is often too late
On the technical side, the plans are the following: incoming metadata will be processed using Metafacture from the German National Library and analyzed with Elasticsearch. It will then be transformed using XSLT into a standardized NLM JATS format, with some additional requirements on mandatory metadata. The documents will then be stored in OCLC CBS and delivered to users using SOLR and VuFind (the same as the current Swissbib architecture).
Additionally, the metadata will be available via an OAI-PMH interface, an SRU web service and possibly a REST API. At the journal level, holdings in KBART format will be created and delivered to vendors of library software for the creation of dedicated targets in link resolvers and discovery tools.
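The "additional requirements on mandatory metadata" could be checked along the following lines. This is only a sketch: the list of mandatory fields is an assumption made for illustration (the actual requirements are not spelled out here), and the element paths follow common JATS usage for journal and article metadata.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of mandatory fields, expressed as ElementTree paths.
MANDATORY = {
    "journal title": ".//journal-meta//journal-title",
    "ISSN": ".//journal-meta/issn",
    "article title": ".//article-meta/title-group/article-title",
    "publication year": ".//article-meta/pub-date/year",
    "DOI": ".//article-meta/article-id[@pub-id-type='doi']",
}


def missing_fields(jats_xml):
    """Return the labels of mandatory fields that are absent or empty."""
    root = ET.fromstring(jats_xml)
    missing = []
    for label, path in MANDATORY.items():
        element = root.find(path)
        if element is None or not "".join(element.itertext()).strip():
            missing.append(label)
    return missing


# Made-up JATS record with placeholder values.
COMPLETE = """<article>
  <front>
    <journal-meta>
      <journal-title-group><journal-title>Journal of Examples</journal-title></journal-title-group>
      <issn pub-type="epub">1234-5678</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1234/example.2015.001</article-id>
      <title-group><article-title>An example article</article-title></title-group>
      <pub-date pub-type="epub"><year>2015</year></pub-date>
    </article-meta>
  </front>
</article>"""

# Same record without a DOI and with an empty year.
INCOMPLETE = (
    COMPLETE
    .replace('<article-id pub-id-type="doi">10.1234/example.2015.001</article-id>', "")
    .replace("<year>2015</year>", "<year></year>")
)

print(missing_fields(COMPLETE))
print(missing_fields(INCOMPLETE))
```

Running such a check on the full delivery before a contract is signed fits the requirement above that publishers deliver the whole metadata set in advance.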
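As an illustration of the journal-level holdings, the sketch below writes a tab-separated file with a subset of the KBART columns. The choice of columns is an assumption (a complete KBART file has more fields), and the journal, identifiers and URL are placeholders.

```python
import csv
import io

# Subset of KBART column names; a complete KBART file contains more fields.
KBART_COLUMNS = [
    "publication_title", "print_identifier", "online_identifier",
    "date_first_issue_online", "date_last_issue_online",
    "title_url", "title_id", "coverage_depth", "publisher_name",
]


def write_kbart(journals, out):
    """Write journal holdings as tab-separated text with a header row."""
    writer = csv.DictWriter(out, fieldnames=KBART_COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for journal in journals:
        # Unknown keys are dropped and missing ones are written as empty cells.
        writer.writerow({col: journal.get(col, "") for col in KBART_COLUMNS})


# Made-up holding for a national licence covering 1947-2008.
journals = [{
    "publication_title": "Journal of Examples",
    "print_identifier": "1234-5678",
    "online_identifier": "8765-4321",
    "date_first_issue_online": "1947",
    "date_last_issue_online": "2008",
    "title_url": "https://example.org/journals/joe",
    "title_id": "joe",
    "coverage_depth": "fulltext",
    "publisher_name": "Example Publisher",
}]

buffer = io.StringIO()
write_kbart(journals, buffer)
print(buffer.getvalue())
```

When the national licence gains a year, only `date_last_issue_online` needs to change for the affected journals, which keeps the files delivered to link resolver vendors easy to regenerate.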


Examples of metadata (JATS, MODS, MARC21, Crossref, finc)

As an example, the same article in various metadata formats.