Frequently Asked Questions - Metadata: basics

Metadata is data about data: information describing properties of linguistic resources. Think of the size of a corpus, the recording date of a speech file, or the purpose for which annotations were created.

A metadata scheme is a fixed set of elements for the description of resources. Think of the traditional filing cards in the library, specifying the writer and title of each book.

Quite a few metadata schemes exist. Examples are: Dublin Core, OLAC (which is an enriched version of Dublin Core), IMDI, the TEI header, ...

Good question. In fact there is no such thing as a single CLARIN metadata scheme. Practice has shown that using one particular scheme for a large community (e.g. the humanities) often results in a mismatch between the chosen elements and the needs of the user.

CLARIN proposes a component-based approach: you can combine several metadata components (sets of metadata elements) into a self-defined scheme, called a profile, that suits your particular needs. Of course you can share your profile with others (in fact we strongly advise that). If sharing the full profile is not an option, you can still use common components, e.g. a component to describe a sound recording. If that still does not address your needs, you can even create components yourself.

Each CMDI file consists of 3 parts:
  • a (fixed) Header, containing administrative information:
    • MdCreator: the author of the file
    • MdCreationDate: the creation date of this file
    • MdSelfLink: the URL or PID of this file
    • MdProfile: the unique identifier of a CMDI profile, as generated by the component registry (e.g. clarin.eu:cr1:p_1290431694484)
    • MdCollectionDisplayName: an (optional but recommended) plain text indication to which collection this file belongs. Used for the Collection facet in the VLO
  • a (fixed) Resources section, containing links to:
    • external files (e.g. an annotation file or a sound recording)
    • and/or other CMDI metadata files (to build hierarchies)
  • a (flexible) Components section, where the actual components that this profile contains will appear
This example CMDI file illustrates the use of the 3 parts.

OK, so how can you refer to an external file from a CMDI metadata description? That is what the Resources section is for.

In the example CMDI file, the resources section looks like:

<Resources>
     
      <!-- List of external resource files and (CMDI) metadata files -->
      <ResourceProxyList>
        
         <ResourceProxy id="a_photo">
            <ResourceType mimetype="image/jpeg">Resource</ResourceType>
            <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
            <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
         </ResourceProxy>
        
         <ResourceProxy id="a_text">
            <ResourceType mimetype="text/plain">Resource</ResourceType>
            <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
         </ResourceProxy>

...

As you can see, for each link to an external resource a ResourceProxy (= file) is added to the ResourceProxyList (= file list). For each ResourceProxy you need to specify the ResourceType (either Resource, the default, or Metadata in case you want to build a hierarchy of CMDI files). With an optional (but very useful) mimetype attribute you can (surprise!) indicate the file's mime type. The ResourceRef contains either a normal URL or a handle PID.
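
To make the structure concrete, here is a minimal Python sketch that reads such a Resources section and lists each ResourceProxy with its type, mime type and reference. The sample is the fragment above with its closing tags added so that it parses. The function name list_resource_proxies is made up for this illustration, and the sketch assumes the elements carry no XML namespace, which may not hold for a real CMDI file.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the Resources fragment above, closed so that it parses.
# Assumption: no XML namespace is declared; a real CMDI file may declare one.
SAMPLE = """
<Resources>
  <ResourceProxyList>
    <ResourceProxy id="a_photo">
      <ResourceType mimetype="image/jpeg">Resource</ResourceType>
      <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
    </ResourceProxy>
    <ResourceProxy id="a_text">
      <ResourceType mimetype="text/plain">Resource</ResourceType>
      <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
    </ResourceProxy>
  </ResourceProxyList>
</Resources>
"""


def list_resource_proxies(xml_text):
    """Return one dict per ResourceProxy: id, type, mimetype (or None), ref."""
    root = ET.fromstring(xml_text)
    proxies = []
    for proxy in root.iter("ResourceProxy"):
        resource_type = proxy.find("ResourceType")
        resource_ref = proxy.find("ResourceRef")
        proxies.append({
            "id": proxy.get("id"),
            "type": resource_type.text.strip(),          # Resource or Metadata
            "mimetype": resource_type.get("mimetype"),   # optional attribute
            "ref": resource_ref.text.strip(),            # URL or handle PID
        })
    return proxies


if __name__ == "__main__":
    for p in list_resource_proxies(SAMPLE):
        print(p["id"], p["type"], p["mimetype"], p["ref"])
```

Because the mimetype attribute is optional, the sketch returns None for it when it is absent rather than failing.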

The information that a ResourceProxy can contain (a URL and a mimetype) is kept minimal on purpose. However, you can use any CMDI component to add more details about such a ResourceProxy, using the id attribute.

E.g. in the example CMDI file we can add a textual description of the photo. First the relevant ResourceProxy gets the id "a_photo":
<ResourceProxy id="a_photo">
    <ResourceType mimetype="image/jpeg">Resource</ResourceType>
    <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
    <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
</ResourceProxy>
Then, later on in the same CMDI file, we have an explanatory component example-component-photo with a description element:
<example-component-photo ref="a_photo">
     <description>a suitable textual description of this photo</description>
</example-component-photo>

Thanks to the reference from this component to the ResourceProxy with the ref attribute we know that the description relates to the photo.

Note that the id attribute should be unique for each ResourceProxy.
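
The id/ref mechanism can be checked mechanically. The following Python sketch reports duplicate ResourceProxy ids and ref attributes that point at no ResourceProxy. It is an illustration only: the function name check_proxy_links and the small sample document are assumptions, and the sketch ignores XML namespaces, which a real CMDI file may use.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical minimal document combining the two fragments above.
# Assumption: no XML namespace is declared.
SAMPLE = """
<CMD>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="a_photo">
        <ResourceType mimetype="image/jpeg">Resource</ResourceType>
        <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <example-component-photo ref="a_photo">
      <description>a suitable textual description of this photo</description>
    </example-component-photo>
  </Components>
</CMD>
"""


def check_proxy_links(xml_text):
    """Return (duplicate_ids, dangling_refs) for a CMDI-like document."""
    root = ET.fromstring(xml_text)
    ids = [p.get("id") for p in root.iter("ResourceProxy")
           if p.get("id") is not None]
    # An id that occurs more than once violates the uniqueness rule.
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    known = set(ids)
    # Any element may carry a ref attribute; collect those with no target.
    dangling = sorted({
        e.get("ref") for e in root.iter()
        if e.get("ref") is not None and e.get("ref") not in known
    })
    return duplicates, dangling


if __name__ == "__main__":
    print(check_proxy_links(SAMPLE))
```

For the sample above both lists are empty, since the single ref value matches the single ResourceProxy id.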



Check out the CLARIN specification document: http://www.clarin.eu/files/wg2-4-metadata-doc-v5.pdf

PLEASE NOTE: the information in this document might be partially outdated. http://www.clarin.eu/cmdi (including these FAQs) is certainly more up to date and should be considered authoritative.

The MdProfile element (in the Header section) contains a unique profile code (e.g.: clarin.eu:cr1:p_1290431694484).

You can find the profile in the component registry with the following URL:

http://catalog.clarin.eu/ds/ComponentRegistry?item=code

e.g. http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:p_1290431694484
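
As a small illustration, the URL pattern above can be built from the MdProfile value like this. The helper name profile_url is made up for this sketch, and the base URL is simply the one quoted above; whether the registry still answers at that address is not checked here.

```python
# Base URL as quoted in this FAQ; the profile code is appended unchanged.
REGISTRY_URL = "http://catalog.clarin.eu/ds/ComponentRegistry?item="


def profile_url(md_profile):
    """Return the component registry URL for the content of an MdProfile element."""
    return REGISTRY_URL + md_profile.strip()


if __name__ == "__main__":
    print(profile_url("clarin.eu:cr1:p_1290431694484"))
```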

Technically there is no real difference between a profile and a component. A profile is a component that can be converted into an XSD file. A normal component can only be used within other components or profiles and can never be transformed into an XSD.

The isProfile="true" attribute indicates that a CMD_ComponentSpec defines a profile and not just a component.

Yes. If you tick the checkmark next to Multilingual for an element in the Component Registry, the element becomes a multilingual field. With the xml:lang attribute you can then indicate the language in which an element has been described; see e.g. the following fragment in this example CMDI file:
<!-- Note the support for multilingual fields, using the xml:lang attribute -->
<title xml:lang="eng">mister</title>
<title xml:lang="fra">monsieur</title>
<title xml:lang="nld">mijnheer</title>

For indicating the language we strongly advise using the ISO 639-3 language code.
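
A consumer of such a file can use xml:lang to select the right variant. The Python sketch below shows one way to do that; the wrapper element example, the function names and the fallback to "eng" are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The three title elements from the fragment above, inside a hypothetical wrapper.
SAMPLE = """
<example>
  <title xml:lang="eng">mister</title>
  <title xml:lang="fra">monsieur</title>
  <title xml:lang="nld">mijnheer</title>
</example>
"""


def titles_by_language(xml_text):
    """Map each xml:lang value to the text of its title element."""
    root = ET.fromstring(xml_text)
    return {t.get(XML_LANG): t.text for t in root.iter("title")}


def pick_title(xml_text, preferred, fallback="eng"):
    """Return the title in the preferred language, else in the fallback language."""
    titles = titles_by_language(xml_text)
    return titles.get(preferred, titles.get(fallback))


if __name__ == "__main__":
    print(pick_title(SAMPLE, "nld"))
```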

As a starting point, see the list below. We are working to extend it.

Components with controlled vocabularies

Other components

The CMDI core model for web services (and extended documentation) is available at:

http://www.isocat.org/clarin/ws/cmd-core/

Send a mail to vlw@clarin.eu and we will incorporate your metadata as soon as possible.

If you have old records in the IMDI format you can use the following profiles:

From that profile you can generate the XSD:
And then you can transform your IMDI files into CMDI files that comply with the profile with the following XSLT:

http://www.clarin.eu/cmd/xslt/imdi2clarin.xsl

An example IMDI input file:
http://corpus1.mpi.nl/IFAcorpus/IMDI/Session_F20N1FY.imdi

The corresponding CMDI output file:
http://www.clarin.eu/cmd/example/example-phonological-corpus.cmdi

There is no general procedure to do this, as TEI has many variants and extensions. However, you could use the following general workflow:
  • Inspect your TEI headers and decide what the relevant parts are. Some information (e.g. layout tags etc.) might be lost during the conversion.
  • Compare your needs with the TEI profile in the CMDI component registry. If it fulfills your needs, go to the next steps. If it does not, use the TEI profile as a basis to create your own CMDI profile.
  • Create an XSLT that generates CMDI instances (according to the profile that you chose in the previous step) from the TEI files. (Have a look at olac2cmdi.xsl and imdi2clarin.xsl for some inspiration).
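
The mapping step of this workflow can be prototyped before writing the XSLT. The sketch below uses Python instead of XSLT, purely to illustrate the idea: it copies the title and author from a TEI header into a Components section. The component name example-tei-header, the choice of fields and the function name are hypothetical; a real conversion must follow the profile chosen in the previous step.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny TEI input invented for this illustration.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An example text</title>
        <author>A. Author</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>
"""


def tei_to_components(tei_text):
    """Build a Components element from the titleStmt of a TEI header."""
    tei = ET.fromstring(tei_text)
    stmt = "tei:teiHeader/tei:fileDesc/tei:titleStmt/"
    components = ET.Element("Components")
    # Hypothetical component name; the real one comes from the CMDI profile.
    record = ET.SubElement(components, "example-tei-header")
    for field in ("title", "author"):
        value = tei.findtext(stmt + "tei:" + field, default=None, namespaces=TEI_NS)
        if value:  # fields missing from the TEI header are simply skipped
            ET.SubElement(record, field).text = value.strip()
    return components


if __name__ == "__main__":
    print(ET.tostring(tei_to_components(SAMPLE_TEI), encoding="unicode"))
```

Anything in the TEI header that is not mapped explicitly is dropped, which mirrors the information loss mentioned in the first step.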

There are indeed issues with searching if people aren't using matching descriptions. Think of someone calling a collection of texts a "text archive", while someone else might be searching for a "text corpus". Or think of all the variants that people can use for one and the same country: the Netherlands, Netherlands, Holland, etc. The same goes for linguistic annotations: "noun" and "substantive" can both be used to describe the same part-of-speech tag. To counter these problems the metadata components contain links to a kind of database that contains atomic concepts (say "country" or "resource type"). We call them data categories. Smart software will later on be able to "see" that a user who searches for nouns might also be interested in substantives.

In a data category registry - a server that can be reached via the internet, both by human users and computer programs.

Yes, go to the site mentioned above and click through to Public > Athens Core > Metadata