Frequently Asked Questions - Metadata in CLARIN: harvesting and VLO

Yes. Bringing all metadata descriptions together ("harvesting") and making them searchable ("indexing") is an important part of the infrastructure that CLARIN is building. When you provide CMDI metadata, CLARIN can harvest it and make it available via the Virtual Language Observatory (VLO).

If you have many metadata records or records that frequently change:

  • Use OAI-PMH
  • Provide them preferably as CMDI (details about how to serve CMDI over OAI-PMH are documented separately)
  • If that is not possible, provide them as OLAC
  • Send a mail to notify us about your OAI-PMH access point
  • Depending on the situation we will add it to the centre registry (in case of a registered CLARIN centre) or add it manually to our harvester
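
To illustrate what a harvester asks of an OAI-PMH access point, here is a minimal sketch that builds a ListRecords request URL. The `verb`, `metadataPrefix` and `resumptionToken` arguments come from the OAI-PMH protocol; the endpoint address and the prefix name `cmdi` are assumptions made for the example, and your own access point may use different values.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is a hypothetical access point; "cmdi" is an assumed prefix name.
    """
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # together with the verb only, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only for illustration.
print(list_records_url("https://repository.example.org/oai"))
# prints https://repository.example.org/oai?verb=ListRecords&metadataPrefix=cmdi
```

Opening such a URL in a browser is a quick way to check that your access point answers before you notify us about it.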

If you have only a few static records and setting up an OAI-PMH access point is not feasible:

  • Submit your records to the Language Resource Inventory (for data, corpora, lexica, web services, software, ...)
  • All the records in the LRT inventory will be automatically converted into CMDI records. Note that this process can take a while.

Data provided over OAI-PMH or via the LRT inventory will be made available and searchable via the Virtual Language Observatory.

No, we are not. It is a term used for gathering metadata descriptions from several locations and storing them in a central database. You can find the results of such a harvesting process at

More information about harvesting metadata can be found at

There are several packages available to set up an OAI provider. Some popular examples:

  • file-based, Tomcat web application written in Java: jOAI (some centres have good experiences with this one)

  • database-based, written in PHP: oai-pmh-2

  • library to connect to a Java application: OAIProvider

More tools to set up an OAI data provider can be found at the OAI webpage.

The VLO uses concepts from the CMDI profiles (and in some cases XPaths as a fallback). A detailed overview is available at:

More information about this topic can be found in this paper

The harvester uses the namespace URL to detect the provided metadata format ( for CMDI), so all prefixes that map to this namespace get included automatically. If you nevertheless encounter unexpected behaviour, please contact
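
The idea of selecting formats by namespace rather than by prefix can be sketched as follows. This is not the harvester's actual code, only an illustration of the principle: it reads an OAI-PMH ListMetadataFormats response and returns every prefix whose `metadataNamespace` equals a given value. The namespace in the sample data is a placeholder, not the real CMDI namespace, and the prefix names are made up.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"  # namespace of OAI-PMH responses


def prefixes_for_namespace(list_formats_xml, namespace):
    """Return all metadataPrefix values whose metadataNamespace matches."""
    root = ET.fromstring(list_formats_xml)
    prefixes = []
    for fmt in root.iter(f"{{{OAI_NS}}}metadataFormat"):
        ns = fmt.findtext(f"{{{OAI_NS}}}metadataNamespace", default="").strip()
        if ns == namespace:
            prefix = fmt.findtext(f"{{{OAI_NS}}}metadataPrefix", default="")
            prefixes.append(prefix.strip())
    return prefixes


# Placeholder namespace and prefixes, for illustration only.
PLACEHOLDER_NS = "http://example.org/ns/placeholder-cmdi"
SAMPLE = f"""<OAI-PMH xmlns="{OAI_NS}">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>cmdi</metadataPrefix>
      <schema>http://example.org/placeholder.xsd</schema>
      <metadataNamespace>{PLACEHOLDER_NS}</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>cmdi_alt</metadataPrefix>
      <schema>http://example.org/placeholder.xsd</schema>
      <metadataNamespace>{PLACEHOLDER_NS}</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

print(prefixes_for_namespace(SAMPLE, PLACEHOLDER_NS))
# prints ['cmdi', 'cmdi_alt']
```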

You can use the harvester yourself. Its source code is available on GitHub.

By the way, if you only need access to the harvested files, you can also download these as a tarball.

Yes. If you contact us, we can try to import your newly created CMDI metadata into the VLO Alpha instance, so that you get an idea of how well the mapping to the VLO facets works.

The metadata harvester runs with different configurations at different times:

  • Monday and Thursday 20:00 CET: harvesting of CLARIN providers
  • Friday 20:00 CET: harvesting of non-CLARIN providers
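
If you want to estimate when your provider will be visited next, the schedule above can be turned into a small calculation. This is a sketch only: it treats CET as a fixed UTC+1 offset (ignoring daylight saving time), the function and constant names are made up for the example, and the schedule itself may change.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET as a fixed UTC+1 offset, without daylight saving time.
CET = timezone(timedelta(hours=1), "CET")

CLARIN_DAYS = {0, 3}    # Monday and Thursday (datetime.weekday() numbering)
NON_CLARIN_DAYS = {4}   # Friday


def next_run(now, weekdays, hour=20):
    """Return the next scheduled start strictly after `now`.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(CET)
    for offset in range(8):  # today plus the following seven days
        candidate = (now + timedelta(days=offset)).replace(
            hour=hour, minute=0, second=0, microsecond=0
        )
        if candidate.weekday() in weekdays and candidate > now:
            return candidate
    raise ValueError("weekdays must contain at least one value in 0..6")


# Tuesday 2 January 2024, 12:00 CET
print(next_run(datetime(2024, 1, 2, 12, 0, tzinfo=CET), CLARIN_DAYS))
# prints 2024-01-04 20:00:00+01:00
```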

Harvester runs typically take about 12-24 hours to complete, depending primarily on the response rate of the metadata providers.

The VLO importer for the production VLO runs after every completed harvest (importing all current metadata, both CLARIN and non-CLARIN). This means that it can take a couple of days before provided metadata becomes available in the VLO. If it takes longer than that, please send a message to

Be aware that the harvester and VLO schedule may be subject to change!

If some or most of your records have been harvested and imported into the VLO without problems, but some records seem to have been omitted, they were probably explicitly skipped by the VLO importer. The import process skips records that are too large. The limit currently lies at roughly 10 megabytes.

Note that metadata files that are not in the CMDI 1.2 version get converted before import. The limit applies to the converted file, so certain records may get skipped even if the original file is within the size limit. Depending on the formatting, a converted record should not be more than twice the size of the original file; typically the difference is 20% or less.
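
As a rough aid, the figures above can be combined into a quick check for records that need conversion. This is a sketch under stated assumptions: the limit is taken as exactly 10,000,000 bytes (the real limit is only approximately 10 megabytes), and the growth factors 1.2 and 2 are the typical and upper estimates mentioned above, not guarantees. The function name and the returned labels are made up for the example.

```python
# Assumption: "roughly 10 megabytes" taken as exactly 10,000,000 bytes.
SIZE_LIMIT_BYTES = 10 * 1000 * 1000

TYPICAL_GROWTH = 1.2  # converted file is typically at most 20% larger
WORST_GROWTH = 2.0    # converted file should not exceed twice the original


def import_risk(original_size_bytes, limit=SIZE_LIMIT_BYTES):
    """Classify a record that still has to be converted to CMDI 1.2."""
    if original_size_bytes * TYPICAL_GROWTH > limit:
        return "likely skipped"
    if original_size_bytes * WORST_GROWTH > limit:
        return "at risk"
    return "ok"


for size in (4_000_000, 6_000_000, 9_000_000):
    print(size, import_risk(size))
```

Records classified as "at risk" are worth checking in the VLO after the next import.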

If files seem to be omitted from hierarchies, but do appear in the VLO in isolation, the most likely reason is that the Resource Proxy reference value (a URL or PID) used in the parent record does not match the self link value (in the MdSelfLink header item) in the referenced record. The VLO will only consider a record to be a parent of another record if it uses the exact self link of the latter to link to it.
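
The matching rule can be checked locally before harvesting. The sketch below is not the VLO's implementation: it compares the `ResourceRef` values of a parent record with the `MdSelfLink` of a child record, matching elements by local name so that it does not depend on a particular CMDI namespace. Trimming surrounding whitespace is an assumption of this example, and the sample records are heavily simplified.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def self_link(record_xml):
    """Return the MdSelfLink value of a record, or None if it has none."""
    for el in ET.fromstring(record_xml).iter():
        if _local(el.tag) == "MdSelfLink":
            return (el.text or "").strip()
    return None


def resource_refs(record_xml):
    """Return all ResourceRef values of a record."""
    return [
        (el.text or "").strip()
        for el in ET.fromstring(record_xml).iter()
        if _local(el.tag) == "ResourceRef"
    ]


def is_parent_of(parent_xml, child_xml):
    """True only if the parent links to the child's exact self link."""
    link = self_link(child_xml)
    return link is not None and link in resource_refs(parent_xml)


# Simplified, hypothetical records.
PARENT = """<CMD><Resources><ResourceProxyList>
  <ResourceProxy id="c1"><ResourceType>Metadata</ResourceType>
    <ResourceRef>hdl:1234/child-1</ResourceRef></ResourceProxy>
</ResourceProxyList></Resources></CMD>"""
CHILD_OK = "<CMD><Header><MdSelfLink>hdl:1234/child-1</MdSelfLink></Header></CMD>"
CHILD_MISMATCH = (
    "<CMD><Header><MdSelfLink>http://hdl.handle.net/1234/child-1</MdSelfLink>"
    "</Header></CMD>"
)

print(is_parent_of(PARENT, CHILD_OK))        # prints True
print(is_parent_of(PARENT, CHILD_MISMATCH))  # prints False
```

The second case shows the typical pitfall: two notations of the same PID are different strings, so the records are not linked.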

The result ranking is based on relevancy with respect to the query (if applicable) and a number of general record properties. This is further explained in the section Understanding Search Results of the VLO's help page.