As there are many different repository systems, we cannot offer detailed instructions for all use cases. However, depending on the exact solution, repository managers may have to consider the aspects elaborated on in the sections below.
Ingestion of metadata
Users will enter or upload metadata, which then needs to be processed and stored. It may be submitted as CMDI, converted to CMDI at this point and stored as such, or stored in another way for (on-demand) conversion at a later point. Some points to consider:
- Ideally, the record receives a persistent identifier (PID). This PID must be encoded in the CMDI record as its “self link”, provided that the system allows retrieval of the metadata as CMDI via this PID (by default or as an option by means of content negotiation).
- A correctly resolving reference to the described resource(s) must be included in the metadata, as should a reference to a landing page. Often these references can only be determined at or after the point of ingestion, so they must be added and/or validated at some point in the ingestion pipeline.
- Metadata should be XSD validated to ensure correctness and detect potential issues as soon as possible.
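The link-related points above can be sketched as a small check that runs inside an ingestion pipeline. This is a minimal sketch, not a definitive implementation. It assumes a CMDI 1.2 record (envelope namespace `http://www.clarin.eu/cmd/1`, self link in `Header/MdSelfLink`, references in `ResourceProxy` elements typed `Resource` or `LandingPage`). The function name `check_cmdi_links` is hypothetical. It only checks that the references are present. Whether they resolve, and whether the record is valid against its XSD, still has to be verified separately.

```python
import xml.etree.ElementTree as ET

# Assumption: CMDI 1.2 envelope namespace; CMDI 1.1 records use a different one.
CMD_NS = {"cmd": "http://www.clarin.eu/cmd/1"}


def check_cmdi_links(xml_text):
    """Return a list of human-readable problems found in a CMDI record."""
    root = ET.fromstring(xml_text)
    problems = []

    # The self link should hold the PID under which the record can be retrieved.
    self_link = root.findtext(
        "cmd:Header/cmd:MdSelfLink", default="", namespaces=CMD_NS
    )
    if not self_link.strip():
        problems.append("missing self link (MdSelfLink)")

    # Collect the non-empty references of all resource proxies, grouped by type.
    refs_by_type = {}
    proxy_path = "cmd:Resources/cmd:ResourceProxyList/cmd:ResourceProxy"
    for proxy in root.iterfind(proxy_path, CMD_NS):
        rtype = proxy.findtext("cmd:ResourceType", default="", namespaces=CMD_NS)
        ref = proxy.findtext("cmd:ResourceRef", default="", namespaces=CMD_NS)
        if ref.strip():
            refs_by_type.setdefault(rtype.strip(), []).append(ref.strip())

    if "Resource" not in refs_by_type:
        problems.append("no reference to a described resource")
    if "LandingPage" not in refs_by_type:
        problems.append("no reference to a landing page")
    return problems
```

A pipeline could reject or flag a record whenever the returned list is non-empty. Records that deliberately reference only other metadata records (for example collection records) would need a more lenient rule than the one assumed here.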
Conversion to/from CMDI
Most use cases require some form of metadata translation, conversion or export at one or more points in the metadata lifecycle. The nature of these operations depends on how metadata is accepted, stored and distributed.
- Metadata ingested and/or stored as non-CMDI XML: Convert to CMDI on ingestion, or on the fly when requested. Stylesheets are available for conversion from several formats to CMDI in CLARIN’s metadata conversion repository. For other formats, a crosswalk will have to be implemented in order to support CMDI.
- Metadata stored as CMDI XML: Conversion is not needed to integrate with the CLARIN infrastructure, but to operate a fully compliant provider, a Dublin Core (DC) representation of the metadata has to be offered. There is currently no generic conversion from CMDI to DC, but a basic stylesheet producing minimal DC should be straightforward to implement for the profile(s) used. The same applies to other generic representations commonly required such as DataCite, which is a requirement for DOIs. With such a conversion in place, CMDI can be offered both on request by an individual user and as a metadata format in the OAI-PMH provider.
- Metadata not stored as XML: This typically means a ‘flat’ representation that can be expressed as key-value pairs. A crosswalk of these pairs to a matching CMD profile should normally be straightforward. If the metadata is available from the system in JSON format, an intermediate conversion to XML might be useful; alternatively, the native JSON support in libraries implementing more recent XML standards can be used. With such an export in place, CMDI can be offered both on request by an individual user and as a metadata format in the OAI-PMH provider.
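For the last case, a crosswalk from flat key-value pairs can be as simple as a lookup table. The sketch below is illustrative only. The source keys, the element names (`Title`, `Creator`, `Language`), the profile root `ExampleProfile` and the function name `flat_record_to_components` are all hypothetical and do not correspond to a real CMD profile. Namespaces and the CMDI envelope (header and resource proxies) are omitted and would have to be added for a real record.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: source key -> element name in a hypothetical CMD profile.
CROSSWALK = {
    "title": "Title",
    "creator": "Creator",
    "language": "Language",
}


def flat_record_to_components(json_text, profile_root="ExampleProfile"):
    """Map a flat JSON record onto the Components payload of a CMD-like record."""
    record = json.loads(json_text)
    components = ET.Element("Components")
    profile = ET.SubElement(components, profile_root)

    for key, element_name in CROSSWALK.items():
        value = record.get(key)
        if value is None:
            continue  # key absent in the source record: emit nothing
        # Repeatable fields arrive as lists; single values are wrapped in one.
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(profile, element_name).text = str(item)

    return ET.tostring(components, encoding="unicode")
```

Keys without an entry in the table are silently dropped in this sketch. A production crosswalk would more likely log them, so that gaps in the mapping become visible.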
Providing CMDI for harvesting
This requires an endpoint that can offer metadata on request on the basis of the OAI-PMH protocol. CLARIN “harvests” such endpoints (“providers”) to collect the metadata that gets imported into the VLO and processed by the Curation Dashboard on a regular basis.
- An OAI-PMH provider may be available as a built-in feature of the repository solution used, or else needs to be set up separately using one of the existing solutions or alternatively implemented according to the specification.
- OAI-PMH providers that provide Dublin Core (as required by the OAI-PMH 2.0 specification) can be harvested by CLARIN. However, B-centres are required to provide CMDI directly from their provider. The exact way in which this can be achieved depends on the solution used and the way in which metadata is available to the provider. In some cases, CMDI will need to be produced from some other source, either on the fly or beforehand.
- To have your metadata harvested by CLARIN, either register as a C-centre or B-centre (see requirements) in the Centre Registry if you are affiliated with one of CLARIN’s national consortia, or else contact firstname.lastname@example.org and request to have your endpoint evaluated for addition as a non-CLARIN endpoint.
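Before registering an endpoint, it can be useful to check which metadata formats it actually offers. The following is a minimal sketch under stated assumptions. The base URL is a placeholder, the helper names `oai_request_url` and `offered_prefixes` are hypothetical, and the prefix `cmdi` is a common convention rather than something mandated by the OAI-PMH specification. Only `oai_dc` is required by the protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL, e.g. for ListMetadataFormats or ListRecords."""
    return base_url + "?" + urlencode({"verb": verb, **arguments})


def offered_prefixes(list_metadata_formats_xml):
    """Extract the metadataPrefix values from a ListMetadataFormats response."""
    root = ET.fromstring(list_metadata_formats_xml)
    path = "oai:ListMetadataFormats/oai:metadataFormat/oai:metadataPrefix"
    return {el.text.strip() for el in root.iterfind(path, OAI_NS) if el.text}


# Placeholder endpoint; replace with the base URL of the provider to be checked.
url = oai_request_url(
    "https://repo.example.org/oai", "ListRecords", metadataPrefix="cmdi"
)
```

Fetching the `ListMetadataFormats` response (for example with `urllib.request.urlopen`) and passing its body to `offered_prefixes` shows whether both `oai_dc` and a CMDI format are available.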
Some other useful tips for repository managers:
- For metadata harvested by CLARIN from B-centres and C-centres, regularly updated results of an automatic evaluation are available from the Curation Dashboard. These will help you identify potential issues with the metadata, understand the representation of the metadata in the VLO, and find broken resource links.
- Harvesting results and logs are available via the harvest viewer and as an open directory listing.