Frequently Asked Questions - Metadata in CLARIN: basics
Metadata is data about data: information describing properties of linguistic resources, for instance the size of a corpus, the recording date of a speech file, the purpose for which annotations were created.
A fixed set of elements for the description of resources. Think of the traditional filing cards in the library, specifying the writer and title of each book.
Good question. In fact, there is no such thing as a single CLARIN metadata scheme. Practice showed that using a single scheme for a large community (e.g. the Humanities) often results in a mismatch between the chosen elements and the needs of the user.
CLARIN proposes a component-based approach: you can combine several metadata components (sets of metadata elements) into a self-defined scheme, called a profile, that suits your particular needs. You can share your profile with others, and we strongly advise doing so. If sharing the full profile is not an option, you can still use common components, e.g. a component to describe a sound recording. If that still does not address your needs, you can create components yourself.
Link from the parent .cmdi file to the child .cmdi file with a ResourceProxy that has the ResourceType Metadata.
- E.g. http://infra.clarin.eu/cmd/example/collection/collection_root.cmdi has 2 child collections:
- http://infra.clarin.eu/cmd/example/collection/collection_lrt_inventory.cmdi, this file contains in turn links to files like:
The recommended profile to use for collection description is clarin.eu:cr1:p_1345561703620 ("Collection").
All files of this example collection can be accessed and explored via http://infra.clarin.eu/cmd/example/collection/
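As a rough illustration of how such a hierarchy can be followed, the sketch below collects the child links from a parent record. The embedded parent record is a hypothetical, trimmed stand-in (only the child URL is taken from the example above), the function name is made up for this sketch, and the `{*}` namespace wildcard assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical parent record, reduced to its Resources section.
# Only the first ResourceRef is taken from the example collection.
PARENT = """
<CMD xmlns="http://www.clarin.eu/cmd/">
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="child_1">
        <ResourceType>Metadata</ResourceType>
        <ResourceRef>http://infra.clarin.eu/cmd/example/collection/collection_lrt_inventory.cmdi</ResourceRef>
      </ResourceProxy>
      <ResourceProxy id="photo">
        <ResourceType mimetype="image/jpeg">Resource</ResourceType>
        <ResourceRef>http://www.example.org/photo.jpg</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
</CMD>
"""


def child_metadata_links(cmdi_xml):
    """Return the ResourceRef of every ResourceProxy typed as Metadata."""
    root = ET.fromstring(cmdi_xml)
    links = []
    # "{*}" matches the element in any namespace, or in none.
    for proxy in root.findall(".//{*}ResourceProxy"):
        rtype = (proxy.findtext("{*}ResourceType") or "").strip()
        ref = (proxy.findtext("{*}ResourceRef") or "").strip()
        if rtype == "Metadata" and ref:
            links.append(ref)
    return links


print(child_metadata_links(PARENT))
```

Calling the same function on each child record in turn would walk the whole hierarchy.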
Below is a graphical representation (as shown by Arbil) of the CMDI file hierarchy used above as an example.
There are multiple suitable profiles, as described in the CMDI core model for web services (and extended documentation).
See also the following paper:
Windhouwer, M., Broeder, D., & Van Uytvanck, D. (2012). A CMD core model for CLARIN web services. In Proceedings of the workshop on Describing Language Resources with Metadata: Towards Flexibility and Interoperability in the Documentation of Language Resources at LREC 2012 (pp. 41-48).
Each CMDI file consists of 3 parts:
- a (fixed) Header, containing administrative information:
- MdCreator: the author of the file
- e.g. "Eric Carlson"
- MdCreationDate: the creation date of this file
- e.g. "2016-12-31"
- MdSelfLink: the URL or PID of this file
- MdProfile: the unique identifier of a CMDI profile, as generated by the component registry
- e.g. "clarin.eu:cr1:p_1290431694484"
- MdCollectionDisplayName: an (optional but recommended) plain-text indication of the collection to which this file belongs; it is used for the Collection facet in the VLO
- a (fixed) Resources section, containing links to:
- external files (e.g. an annotation file or a sound recording)
- and/or other CMDI metadata files (to build hierarchies)
- a (flexible) Components section, where the actual components that this profile contains will appear
This example CMDI file illustrates the use of the 3 parts.
<Resources>
  <!-- List of external resource files and (CMDI) metadata files -->
  <ResourceProxyList>
    <ResourceProxy id="a_photo">
      <ResourceType mimetype="image/jpeg">Resource</ResourceType>
      <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
      <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
    </ResourceProxy>
    <ResourceProxy id="a_text">
      <ResourceType mimetype="text/plain">Resource</ResourceType>
      <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
    </ResourceProxy>
    ...
As you can see, for each link to an external resource a ResourceProxy (= file) is added to the ResourceProxyList (= file list). For each ResourceProxy you need to specify the ResourceType:
- Resource, the default, for a link to a web-accessible file (e.g. text file, MPEG video, TEI file)
- Metadata in case you want to build a hierarchy of CMDI files
- SearchPage, to link to a specialised website where the described resource can be queried (more details...)
- LandingPage, to link to the "original context", e.g. the URL of a repository system displaying the digital object that is described (more details...)
- SearchService, to link to a specialised webservice where the described resource can be queried (more details...)
With an optional (but very useful) mimetype attribute you can (surprise!) indicate the file's mime type. The ResourceRef contains either a normal URL or a handle PID.
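The following sketch shows one way to read these fields back out of a record. It uses a closed copy of the Resources excerpt above so that it parses on its own. The function name and the list of handle prefixes are assumptions made for this illustration, not part of CMDI, and the `{*}` wildcard assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Copy of the Resources excerpt above, closed so that it is well-formed.
RESOURCES = """
<Resources>
  <ResourceProxyList>
    <ResourceProxy id="a_photo">
      <ResourceType mimetype="image/jpeg">Resource</ResourceType>
      <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
    </ResourceProxy>
    <ResourceProxy id="a_text">
      <ResourceType mimetype="text/plain">Resource</ResourceType>
      <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
    </ResourceProxy>
  </ResourceProxyList>
</Resources>
"""

# Assumption: references starting with one of these are treated as handle PIDs.
HANDLE_PREFIXES = ("hdl:", "http://hdl.handle.net/", "https://hdl.handle.net/")


def list_proxies(xml_text):
    """Return one dict per ResourceProxy: id, type, mimetype, ref, is_handle."""
    root = ET.fromstring(xml_text)
    proxies = []
    for proxy in root.findall(".//{*}ResourceProxy"):
        rtype = proxy.find("{*}ResourceType")
        ref = (proxy.findtext("{*}ResourceRef") or "").strip()
        proxies.append({
            "id": proxy.get("id"),
            "type": (rtype.text or "").strip() if rtype is not None else None,
            # The mimetype attribute is optional, so this may be None.
            "mimetype": rtype.get("mimetype") if rtype is not None else None,
            "ref": ref,
            "is_handle": ref.startswith(HANDLE_PREFIXES),
        })
    return proxies


for entry in list_proxies(RESOURCES):
    print(entry)
```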
E.g. in the example CMDI file we can add a textual description of the photo. First the relevant ResourceProxy gets the id "a_photo":
<ResourceProxy id="a_photo"> <ResourceType mimetype="image/jpeg">Resource</ResourceType> <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef --> <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef> </ResourceProxy>
Then, later on in the same CMDI file, we have an explanatory component example-component-photo with a description element:
<example-component-photo ref="a_photo"> <description>a suitable textual description of this photo</description> </example-component-photo>
Thanks to the reference from this component to the ResourceProxy with the ref attribute we know that the description relates to the photo.
Note that the id attribute should be unique for each ResourceProxy.
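A small consistency check along these lines can catch duplicate ids and references that point nowhere. This is a minimal sketch: the record is a hypothetical, reduced combination of the two snippets above, the function name is made up, and splitting the ref value on whitespace is an assumption that allows for several ids in one attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical minimal record combining the two snippets above.
RECORD = """
<CMD>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="a_photo">
        <ResourceType mimetype="image/jpeg">Resource</ResourceType>
        <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <example-component-photo ref="a_photo">
      <description>a suitable textual description of this photo</description>
    </example-component-photo>
  </Components>
</CMD>
"""


def check_refs(xml_text):
    """Return (duplicate proxy ids, refs that match no ResourceProxy id)."""
    root = ET.fromstring(xml_text)
    ids = [p.get("id") for p in root.findall(".//{*}ResourceProxy")
           if p.get("id") is not None]
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    known = set(ids)
    dangling = []
    components = root.find(".//{*}Components")
    if components is not None:
        for elem in components.iter():
            # Assumption: a ref value may list several ids separated by spaces.
            for ref in (elem.get("ref") or "").split():
                if ref not in known:
                    dangling.append(ref)
    return duplicates, dangling


print(check_refs(RECORD))
```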
The MdProfile element (in the Header section) contains a unique profile code (e.g.: clarin.eu:cr1:p_1290431694484). Alternatively you can also find the profile identifier as part of the schema location, for example (CMDI 1.1):
<CMD ... xsi:schemaLocation="http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1290431694484/xsd">
or (CMDI 1.2):
<cmd:CMD ... xsi:schemaLocation="http://www.clarin.eu/cmd/1 https://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1290431694484 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1290431694484/xsd">
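Both lookups can be automated. The sketch below first tries MdProfile and then falls back to the schema location. The regular expression is an assumption derived from the profile identifiers shown in this FAQ, the function name is made up, and the embedded record is a hypothetical, reduced CMDI 1.1 file.

```python
import re
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
# Assumption: profile identifiers look like the ones shown in this FAQ.
PROFILE_ID = re.compile(r"clarin\.eu:cr1:p_\d+")

# Hypothetical, reduced CMDI 1.1 record without an MdProfile element.
DOC = """
<CMD xmlns="http://www.clarin.eu/cmd/"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1290431694484/xsd">
  <Header/>
</CMD>
"""


def find_profile_id(cmdi_xml):
    """Return the profile ID from MdProfile, else from xsi:schemaLocation."""
    root = ET.fromstring(cmdi_xml)
    md_profile = root.findtext(".//{*}MdProfile") or ""
    match = PROFILE_ID.search(md_profile)
    if match:
        return match.group(0)
    # Fall back to the first profile ID found in the schema location.
    location = root.get(f"{{{XSI}}}schemaLocation", "")
    match = PROFILE_ID.search(location)
    return match.group(0) if match else None


print(find_profile_id(DOC))
```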
You can find the profile in the component registry with the following URL:
The isProfile="true" attribute indicates that a CMD_ComponentSpec defines a profile and not just a component.
<!-- Note the support for multilingual fields, using the xml:lang attribute --> <title xml:lang="eng">mister</title> <title xml:lang="fra">monsieur</title> <title xml:lang="nld">mijnheer</title>
To indicate the language, we strongly advise using the ISO 639-3 language code.
Please note that enabling Multilingual will make the element repeatable, even if the Maximum number of occurrences is set to 1.
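When reading such a record, the language of each value is available through the xml:lang attribute. The following sketch wraps the three title elements above in a hypothetical parent element and picks a title by language preference. Both function names are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The title elements from the example, inside a hypothetical wrapper element.
SNIPPET = """
<example-component>
  <title xml:lang="eng">mister</title>
  <title xml:lang="fra">monsieur</title>
  <title xml:lang="nld">mijnheer</title>
</example-component>
"""


def titles_by_language(xml_text):
    """Map each language code to the text of the matching title element."""
    root = ET.fromstring(xml_text)
    return {t.get(XML_LANG): t.text for t in root.findall(".//{*}title")}


def pick_title(xml_text, preferred=("eng",)):
    """Return the first title available in the preferred languages, else None."""
    titles = titles_by_language(xml_text)
    for code in preferred:
        if code in titles:
            return titles[code]
    return None


print(pick_title(SNIPPET, preferred=("deu", "nld")))
```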
As a starting point, see the list below. We are working to extend it.
Components with controlled vocabularies
- For languages:
- For language families: iso-languagefamiliy-639-5
- For countries: iso-country
- For continents: iso-continent
- For mimetypes: cmdi-mimetype
- For license types: License
- Collection descriptions: collection
- TEI headers: teiHeader
- OLAC: OLAC-DcmiTerms
- IMDI session: imdi-session; IMDI corpus: imdi-corpus
- Phonetic collections/corpora: media-session-profile (recording session) and media-corpus-profile (collection level)
- Lexical resources: LexicalResourceProfile
- Interview: OralHistoryInterview
This can be done with a ResourceProxy where:
- ResourceType = SearchService
- mimetype = application/sru+xml
<ResourceProxy id="d55"> <ResourceType mimetype="application/sru+xml">SearchService</ResourceType> <ResourceRef>http://cqlservlet.mpi.nl/</ResourceRef> </ResourceProxy>
For a complete example file see: http://www.clarin.eu/cmd/example/example-cgn-sru.cmdi
- In the Component Registry: open the drop down menu on the far right column of the profile's row and select Download XSD (CMDI 1.1) or Download XSD (CMDI 1.2) depending on the support and requirements of your tools, repository etc (more information).
- Or use the web service directly and download the XSD from one of the following URLs (if you know the profile ID):
- For CMDI 1.1: http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1288172614017/xsd - where clarin.eu:cr1:p_1288172614017 stands for the unique profile ID
- For CMDI 1.2: http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.2/profiles/clarin.eu:cr1:p_1288172614017/xsd - where clarin.eu:cr1:p_1288172614017 stands for the unique profile ID
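The URL pattern in the list above can be captured in a small helper. This is a minimal sketch that only assembles the URL and does not contact the registry. The function name is made up, and the base URL is copied from the two examples above.

```python
# Base URL as it appears in the two example URLs above.
REGISTRY = "http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry"


def xsd_url(profile_id, cmdi_version="1.1"):
    """Build the XSD download URL for a profile ID and a CMDI version."""
    if cmdi_version not in ("1.1", "1.2"):
        raise ValueError(f"unsupported CMDI version: {cmdi_version!r}")
    return f"{REGISTRY}/{cmdi_version}/profiles/{profile_id}/xsd"


print(xsd_url("clarin.eu:cr1:p_1288172614017"))
print(xsd_url("clarin.eu:cr1:p_1288172614017", cmdi_version="1.2"))
```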
This can be done with a ResourceProxy where:
- ResourceType = SearchPage
<ResourceProxy id="d55"> <ResourceType>SearchPage</ResourceType> <ResourceRef>http://corpus1.mpi.nl/ds/trova/search.jsp?nodeid=MPI86949%23</ResourceRef> </ResourceProxy>
For a complete example file see: http://catalog.clarin.eu/metadata/cmdi/collections/collection-cgn.cmdi
If you want to add a link to the original context of the metadata file, e.g. to the repository where it is hosted (example), add a ResourceProxy of the type LandingPage, e.g.:
<ResourceProxy id="lp"> <ResourceType>LandingPage</ResourceType> <ResourceRef>http://hdl.handle.net/11858/00-097C-0000-0008-E130-A</ResourceRef> </ResourceProxy>
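A ResourceProxy of any type has the same shape, so it can also be generated programmatically. The sketch below is one possible helper, not an official API: the function name is made up, the set of allowed types is taken from the ResourceType list earlier in this FAQ, and the elements are created without a namespace, which a real CMDI file would still need to declare.

```python
import xml.etree.ElementTree as ET

# ResourceType values as listed earlier in this FAQ.
RESOURCE_TYPES = {"Resource", "Metadata", "SearchPage", "LandingPage", "SearchService"}


def make_resource_proxy(proxy_id, resource_type, ref, mimetype=None):
    """Build a ResourceProxy element (no namespace is set in this sketch)."""
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown ResourceType: {resource_type!r}")
    proxy = ET.Element("ResourceProxy", {"id": proxy_id})
    rtype = ET.SubElement(proxy, "ResourceType")
    if mimetype is not None:
        rtype.set("mimetype", mimetype)  # optional attribute
    rtype.text = resource_type
    ET.SubElement(proxy, "ResourceRef").text = ref
    return proxy


landing = make_resource_proxy(
    "lp", "LandingPage", "http://hdl.handle.net/11858/00-097C-0000-0008-E130-A"
)
print(ET.tostring(landing, encoding="unicode"))
```

Remember that each proxy id passed to such a helper must be unique within the file.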
The fields of a profile are fixed, so you will need to use a different profile. Don't worry, you can create your own. Since you found a profile that seems to almost match your needs, the most logical thing to do is to create a new profile based on that one.
You can do this yourself, as long as you have a way to log in to the Component Registry. Click the 'login' link and select your home institute or another provider you have an account with. If none is in the list, create an account with CLARIN (more info).
When logged in, select the base profile and click the 'Edit as new' button. Save it in your private workspace (under a different name and/or group). The profile consists of links to a number of components (some of which in turn consist partially of links to components), so you will have to identify the components that you need to change. Edit these 'as new' as well, and make the required changes. You may have to do this recursively for deeper hierarchies. Then, in your profile, replace the references to the original components with references to your new versions of these components. Save the profile and test it in an editor (e.g. oXygen or Arbil) before publishing. You can get the XSD link by selecting the profile in the component browser and choosing 'Show Info' from the drop-down menu on the far right. You can open this link in an XML editor or validator; in Arbil you can add it via the 'Profiles and templates' settings.
There are currently two supported versions of CLARIN's component metadata framework: CMDI 1.1 and CMDI 1.2. The former has been in active use for many years and is widely supported within the CLARIN infrastructure. CMDI 1.2 was introduced in 2016 and provides a number of new features and improvements over its predecessor. However, its support throughout the infrastructure is still limited (at the time of writing this FAQ, July 2016).
To decide which version of CMDI to use, first determine which tools need to process your metadata. More details about CMDI 1.2, including current information on its support throughout the infrastructure, can be found at the CMDI 1.2 page.