Frequently Asked Questions - Metadata in CLARIN: basics
Metadata is data about data: information describing properties of linguistic resources, for instance the size of a corpus, the recording date of a speech file, or the purpose for which annotations were created.
A fixed set of elements for the description of resources. Think of the traditional filing cards in the library, specifying the writer and title of each book.
Good question. In fact, there is no such thing as a single CLARIN metadata scheme. Practice showed that using a single scheme for a large community (e.g. the Humanities) often results in a mismatch between the chosen elements and the needs of the user.
CLARIN proposes a component-based approach: you can combine several metadata components (sets of metadata elements) into a self-defined scheme, called a profile, that suits your particular needs. Of course you can share your profile with others (in fact, we strongly advise it). If sharing the full profile is not an option, you can still use common components, e.g. a component to describe a sound recording. If that still does not address your needs, you can even create components yourself.
Link from the parent .cmdi file to the child .cmdi file with a ResourceProxy that has the ResourceType Metadata.
E.g. http://infra.clarin.eu/cmd/example/collection/collection_root.cmdi has 2 child collections:
- http://infra.clarin.eu/cmd/example/collection/collection_lrt_inventory.cmdi, this file contains in turn links to files like:
The recommended profile to use for collection description is clarin.eu:cr1:p_1345561703620 ("Collection").
All files of this example collection can be accessed and explored via http://infra.clarin.eu/cmd/example/collection/
Below is a graphical representation (as shown by Arbil) of the CMDI file hierarchy used above as an example.
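The parent-to-child link described above can be sketched in a few lines of Python. This is only an illustration: the helper name metadata_proxy and the id value "child_inventory" are made up for this example, and the CMDI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET


def metadata_proxy(proxy_id, child_ref):
    """Build a ResourceProxy that points from a parent .cmdi file to a child .cmdi file."""
    proxy = ET.Element("ResourceProxy", {"id": proxy_id})
    # The ResourceType "Metadata" marks the target as another CMDI file.
    ET.SubElement(proxy, "ResourceType").text = "Metadata"
    ET.SubElement(proxy, "ResourceRef").text = child_ref
    return proxy


# "child_inventory" is a hypothetical id chosen for this sketch.
proxy = metadata_proxy(
    "child_inventory",
    "http://infra.clarin.eu/cmd/example/collection/collection_lrt_inventory.cmdi",
)
proxy_xml = ET.tostring(proxy, encoding="unicode")
print(proxy_xml)
```

The resulting element would go into the ResourceProxyList of the parent file.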
There are multiple suitable profiles, as described in the CMDI core model for web services (and extended documentation).
See also the following paper:
Windhouwer, M., Broeder, D., & Van Uytvanck, D. (2012). A CMD core model for CLARIN web services. In Proceedings of the workshop on Describing Language Resources with Metadata: Towards Flexibility and Interoperability in the Documentation of Language Resources at LREC 2012 (pp. 41-48).
Each CMDI file consists of 3 parts:
- a (fixed) Header, containing administrative information:
  - MdCreator: the author of the file
    - e.g. "Eric Carlson"
  - MdCreationDate: the creation date of this file
    - e.g. "2016-12-31"
  - MdSelfLink: the URL or PID of this file
  - MdProfile: the unique identifier of a CMDI profile, as generated by the component registry
    - e.g. "clarin.eu:cr1:p_1290431694484"
  - MdCollectionDisplayName: an (optional but recommended) plain text indication to which collection this file belongs. Used for the Collection facet in the VLO
- a (fixed) Resources section, containing links to:
  - external files (e.g. an annotation file or a sound recording)
  - and/or other CMDI metadata files (to build hierarchies)
- a (flexible) Components section, where the actual components that this profile contains will appear
This example CMDI file illustrates the use of the 3 parts.
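To make the three-part layout concrete, here is a minimal Python sketch that reads a CMDI 1.1 document and lists its top-level parts and Header fields. The SAMPLE document is a stripped-down illustration assembled from the values mentioned above; it is not schema-valid and is not the example file itself. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

CMD_NS = "http://www.clarin.eu/cmd/"  # CMDI 1.1 namespace

# Reduced sample for illustration only; a real file contains more elements.
SAMPLE = """<CMD xmlns="http://www.clarin.eu/cmd/">
  <Header>
    <MdCreator>Eric Carlson</MdCreator>
    <MdCreationDate>2016-12-31</MdCreationDate>
    <MdProfile>clarin.eu:cr1:p_1290431694484</MdProfile>
  </Header>
  <Resources>
    <ResourceProxyList/>
  </Resources>
  <Components/>
</CMD>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.split("}", 1)[-1]


def top_level_parts(cmdi_xml):
    """Return the names of the direct children of the root element, in document order."""
    root = ET.fromstring(cmdi_xml)
    return [local_name(child.tag) for child in root]


def header_fields(cmdi_xml):
    """Return the Header elements as a {name: text} dictionary."""
    root = ET.fromstring(cmdi_xml)
    header = root.find("{%s}Header" % CMD_NS)
    return {local_name(el.tag): el.text for el in header}


parts = top_level_parts(SAMPLE)
fields = header_fields(SAMPLE)
print(parts)
print(fields["MdProfile"])
```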
OK, so how can you refer to an external file from a CMDI metadata description? That is what the Resources section is for.
In the example CMDI file, the resources section looks like:
<Resources>
  <!-- List of external resource files and (CMDI) metadata files -->
  <ResourceProxyList>
    <ResourceProxy id="a_photo">
      <ResourceType mimetype="image/jpeg">Resource</ResourceType>
      <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
      <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
    </ResourceProxy>
    <ResourceProxy id="a_text">
      <ResourceType mimetype="text/plain">Resource</ResourceType>
      <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
    </ResourceProxy>
    ...
As you can see, for each link to an external resource a ResourceProxy (= file) is added to the ResourceProxyList (= file list). For each ResourceProxy you need to specify the ResourceType:
- Resource, the default, for a link to a web-accessible file (e.g. text file, MPEG video, TEI file)
- Metadata in case you want to build a hierarchy of CMDI files
- SearchPage, to link to a specialised website where the described resource can be queried (more details...)
- LandingPage, to link to the "original context", e.g. the URL of a repository system displaying the digital object that is described (more details...)
- SearchService, to link to a specialised webservice where the described resource can be queried (more details...)
With an optional (but very useful) mimetype attribute you can (surprise!) indicate the file's mime type. The ResourceRef contains either a normal URL or a handle PID.
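As a rough illustration of how a tool might read these entries, the following Python sketch collects the id, ResourceType, mimetype and ResourceRef of every ResourceProxy. The fragment is the one shown above, closed off so that it parses and with the namespace omitted; the function name list_proxies is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# The fragment from the example file, completed with closing tags for parsing.
RESOURCES = """<Resources>
  <ResourceProxyList>
    <ResourceProxy id="a_photo">
      <ResourceType mimetype="image/jpeg">Resource</ResourceType>
      <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
    </ResourceProxy>
    <ResourceProxy id="a_text">
      <ResourceType mimetype="text/plain">Resource</ResourceType>
      <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
    </ResourceProxy>
  </ResourceProxyList>
</Resources>"""


def list_proxies(resources_xml):
    """Return one dictionary per ResourceProxy in the given Resources section."""
    root = ET.fromstring(resources_xml)
    proxies = []
    for proxy in root.iter("ResourceProxy"):
        resource_type = proxy.find("ResourceType")
        proxies.append({
            "id": proxy.get("id"),
            "type": resource_type.text,
            "mimetype": resource_type.get("mimetype"),  # None if the attribute is omitted
            "ref": proxy.findtext("ResourceRef"),
        })
    return proxies


proxies = list_proxies(RESOURCES)
for entry in proxies:
    print(entry["id"], entry["type"], entry["mimetype"], entry["ref"])
```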
The information that a ResourceProxy can contain (a URL and mimetype) is kept very minimal, on purpose. However, you can use any CMDI component to add more details about such a ResourceProxy, using the id attribute. Take, for example, this ResourceProxy:
<ResourceProxy id="a_photo">
  <ResourceType mimetype="image/jpeg">Resource</ResourceType>
  <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
  <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
</ResourceProxy>
Then, later on in the same CMDI file, we have an explanatory component example-component-photo with a description element:
<example-component-photo ref="a_photo">
  <description>a suitable textual description of this photo</description>
</example-component-photo>
Thanks to the reference from this component to the ResourceProxy with the ref attribute we know that the description relates to the photo.
Note that the id attribute should be unique for each ResourceProxy.
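The ref/id mechanism and the uniqueness requirement can be checked with a short script. The sketch below is only one possible approach: the document DOC is a reduced, namespace-free sample built from the two fragments above, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Reduced sample combining the ResourceProxy and the component shown above.
DOC = """<CMD>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="a_photo">
        <ResourceType mimetype="image/jpeg">Resource</ResourceType>
        <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <example-component-photo ref="a_photo">
      <description>a suitable textual description of this photo</description>
    </example-component-photo>
  </Components>
</CMD>"""


def duplicate_ids(root):
    """Return the ResourceProxy ids that occur more than once (should be empty)."""
    counts = Counter(p.get("id") for p in root.iter("ResourceProxy"))
    return sorted(i for i, n in counts.items() if n > 1)


def resolve_refs(root):
    """Map each component that has a ref attribute to the ResourceRef it points at."""
    by_id = {p.get("id"): p.findtext("ResourceRef") for p in root.iter("ResourceProxy")}
    components = root.find("Components")
    return {el.tag: by_id.get(el.get("ref")) for el in components.iter() if el.get("ref")}


root = ET.fromstring(DOC)
duplicates = duplicate_ids(root)
resolved = resolve_refs(root)
print(duplicates)
print(resolved)
```

A ref that matches no ResourceProxy id would show up here with the value None.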
The MdProfile element (in the Header section) contains a unique profile code (e.g.: clarin.eu:cr1:p_1290431694484). Alternatively you can also find the profile identifier as part of the schema location, for example (CMDI 1.1):
<CMD ... xsi:schemaLocation="http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1290431694484/xsd">
or (CMDI 1.2):
<cmd:CMD ... xsi:schemaLocation="http://www.clarin.eu/cmd/1 https://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1381926654508 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1290431694484/xsd">
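If you need to do this programmatically, the profile identifier can be pulled out of the schema location with a regular expression. The sketch below assumes that identifiers always look like the examples on this page (clarin.eu:cr1:p_ followed by digits) and that the XSD URL ends in /xsd; the function name is made up for this example.

```python
import re

# Assumed identifier shape, based only on the examples shown above.
PROFILE_ID = re.compile(r"/profiles/(clarin\.eu:cr\d+:p_\d+)/xsd")


def profile_id_from_schema_location(schema_location):
    """Return the profile ID found in the XSD URL, or None if there is no match."""
    match = PROFILE_ID.search(schema_location)
    return match.group(1) if match else None


CMDI_1_1 = (
    "http://www.clarin.eu/cmd/ "
    "http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/"
    "profiles/clarin.eu:cr1:p_1290431694484/xsd"
)
CMDI_1_2 = (
    "http://www.clarin.eu/cmd/1 https://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd "
    "http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1381926654508 "
    "https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/"
    "profiles/clarin.eu:cr1:p_1290431694484/xsd"
)

id_1_1 = profile_id_from_schema_location(CMDI_1_1)
id_1_2 = profile_id_from_schema_location(CMDI_1_2)
print(id_1_1)
print(id_1_2)
```

Because the pattern requires the trailing /xsd, it picks the identifier from the XSD URL rather than from the profile namespace in the CMDI 1.2 example.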
You can find the profile in the component registry with the following URL:
Technically there is no real difference. A profile is a component that can be converted into an XSD file. A normal component can only be used within other components or profiles and can never be transformed into an XSD.
The isProfile="true" attribute indicates that a CMD_ComponentSpec defines a profile and not just a component.
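A tool could test for this attribute as follows. This is a minimal sketch: the two one-element specifications are placeholders rather than real registry output, and it assumes that a plain component carries isProfile="false" or omits the attribute.

```python
import xml.etree.ElementTree as ET


def is_profile(spec_xml):
    """Return True if the specification's root element declares isProfile="true"."""
    root = ET.fromstring(spec_xml)
    return root.tag == "CMD_ComponentSpec" and root.get("isProfile") == "true"


# Placeholder specifications, reduced to the root element for illustration.
PROFILE_SPEC = '<CMD_ComponentSpec isProfile="true"/>'
COMPONENT_SPEC = '<CMD_ComponentSpec isProfile="false"/>'

profile_result = is_profile(PROFILE_SPEC)
component_result = is_profile(COMPONENT_SPEC)
print(profile_result, component_result)
```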
Yes. If you tick the checkbox next to Multilingual for an element in the Component Registry, it will result in a multilingual field. With the xml:lang attribute you can then indicate the language in which an element has been described; see e.g. the following fragment in this example CMDI file:
<!-- Note the support for multilingual fields, using the xml:lang attribute -->
<title xml:lang="eng">mister</title>
<title xml:lang="fra">monsieur</title>
<title xml:lang="nld">mijnheer</title>
To indicate the language, we strongly advise using the ISO 639-3 language code.
Please note that enabling Multilingual will make the element repeatable, even if the Maximum number of occurrences is set to 1.
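On the reading side, a multilingual field can be handled by grouping the repeated elements by their xml:lang value. The following Python sketch shows one way to do that; the wrapper element example, the function names and the choice of "eng" as fallback language are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The title fragment shown above, wrapped in a hypothetical parent element.
FRAGMENT = """<example>
  <title xml:lang="eng">mister</title>
  <title xml:lang="fra">monsieur</title>
  <title xml:lang="nld">mijnheer</title>
</example>"""


def titles_by_language(fragment_xml):
    """Return a {language code: title text} dictionary."""
    root = ET.fromstring(fragment_xml)
    return {title.get(XML_LANG): title.text for title in root.iter("title")}


def pick_title(titles, preferred, fallback="eng"):
    """Return the title in the preferred language, or the fallback language if missing."""
    return titles.get(preferred, titles.get(fallback))


titles = titles_by_language(FRAGMENT)
print(pick_title(titles, "nld"))
print(pick_title(titles, "deu"))
```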
As a starting point, see the list below. We are working to extend it.
Components with controlled vocabularies
- For languages:
- For language families: iso-languagefamiliy-639-5
- For countries: iso-country
- For continents: iso-continent
- For mimetypes: cmdi-mimetype
- For license types: License
- Collection descriptions: collection
- TEI headers: teiHeader
- OLAC: OLAC-DcmiTerms
- IMDI session: imdi-session , IMDI corpus: imdi-corpus
- Phonetic collections/corpora: media-session-profile (recording session) and media-corpus-profile (collection level)
- Lexical resources: LexicalResourceProfile
- Interview: OralHistoryInterview
This can be done with a ResourceProxy where:
- ResourceType = SearchService
- mimetype = application/sru+xml
<ResourceProxy id="d55">
  <ResourceType mimetype="application/sru+xml">SearchService</ResourceType>
  <ResourceRef>http://cqlservlet.mpi.nl/</ResourceRef>
</ResourceProxy>
For a complete example file see: http://www.clarin.eu/cmd/example/example-cgn-sru.cmdi
- In the Component Registry: open the drop-down menu in the far right column of the profile's row and select Download XSD (CMDI 1.1) or Download XSD (CMDI 1.2), depending on the support and requirements of your tools, repository etc. (more information).
- Or use the web service directly and download the XSD from the following URL (if you know the profile ID):
  - For CMDI 1.1: http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1288172614017/xsd - where clarin.eu:cr1:p_1288172614017 stands for the unique profile ID
  - For CMDI 1.2: http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.2/profiles/clarin.eu:cr1:p_1288172614017/xsd - where clarin.eu:cr1:p_1288172614017 stands for the unique profile ID
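The URL pattern above is easy to generate from a profile ID. The following Python sketch only builds the URL string and does not contact the registry; the function name xsd_url and the constant REGISTRY are hypothetical, and the pattern is taken directly from the two URLs listed above.

```python
REGISTRY = "http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry"


def xsd_url(profile_id, cmdi_version="1.1"):
    """Build the XSD download URL for a profile ID and CMDI version ("1.1" or "1.2")."""
    if cmdi_version not in ("1.1", "1.2"):
        raise ValueError("unsupported CMDI version: %s" % cmdi_version)
    return "%s/%s/profiles/%s/xsd" % (REGISTRY, cmdi_version, profile_id)


url_1_1 = xsd_url("clarin.eu:cr1:p_1288172614017")
url_1_2 = xsd_url("clarin.eu:cr1:p_1288172614017", cmdi_version="1.2")
print(url_1_1)
print(url_1_2)
```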
This can be done with a ResourceProxy where:
- ResourceType = SearchPage
<ResourceProxy id="d55">
  <ResourceType>SearchPage</ResourceType>
  <ResourceRef>http://corpus1.mpi.nl/ds/trova/search.jsp?nodeid=MPI86949%23</ResourceRef>
</ResourceProxy>
For a complete example file see: http://catalog.clarin.eu/metadata/cmdi/collections/collection-cgn.cmdi
If you want to add a link to the original context of the metadata file, e.g. to the repository where it is hosted (example), add a ResourceProxy of the type LandingPage, e.g.:
<ResourceProxy id="lp">
  <ResourceType>LandingPage</ResourceType>
  <ResourceRef>http://hdl.handle.net/11858/00-097C-0000-0008-E130-A</ResourceRef>
</ResourceProxy>
The fields of a profile are fixed, so you will need to use a different profile. Don't worry, you can create your own. Since you found a profile that seems to almost match your needs, the most logical thing to do is to create a new profile based on that one.
You can do this yourself, as long as you have a way to log in to the Component Registry. Click the 'login' link and select your home institute or another provider you have an account with. If none is in the list, create an account with CLARIN (more info).
When logged in, select the base profile and click the 'Edit as new' button. Save it in your private workspace (under a different name and/or group). The profile consists of links to a number of components (some of which in turn consist partially of links to components), so you will have to identify the components that you need to change. Edit these 'as new' as well, and make the required changes. You may have to do this recursively for deeper hierarchies. Then, in your profile, replace the references to the original components with references to your new versions of these components. Save the profile, and test it in an editor (e.g. oXygen or Arbil) before publishing. You can get the XSD link by selecting the profile in the component browser and choosing 'Show Info' from the drop-down menu on the far right. You can open this link in an XML editor or validator; in Arbil you can add it via the 'Profiles and templates' settings.
There are currently two supported versions of CLARIN's component metadata framework: CMDI 1.1 and CMDI 1.2. The former has been in active use for many years and is widely supported within the CLARIN infrastructure. CMDI 1.2 was introduced in 2016 and provides a number of new features and improvements compared to its predecessor. However, its support throughout the infrastructure is still limited (at the time of writing this FAQ, July 2016).
Therefore, before deciding which version of CMDI to use, first determine which tools need to process your metadata. More details about CMDI 1.2, including current information on its support throughout the infrastructure, can be found at the CMDI 1.2 page.
CMDI 1.2 is the successor to the CMDI 1.1 metadata framework and is one of the two currently supported versions of CMDI. More information about this specific version can be found at the CMDI 1.2 page. How the introduction of CMDI 1.2 affects you depends on your role within CLARIN. Click one of the following links to find detailed information about the transition to CMDI 1.2 that is relevant to you:
What it means to switch to CMDI 1.2 and whether you should depends on your role within CLARIN. The following pages provide answers to this and other questions for various groups: