Frequently Asked Questions - Technical infrastructure

Centres

Centres will form the backbone of the persistent and stable CLARIN service infrastructure. For further information we refer to the Requirements Specification Document and to the Short Guides.

Researchers will only use resources and tools/services offered via the web when they are sure that they can still access them over a longer period. Currently, researchers mostly download resources to their own computer first to ensure accessibility, but in the cyberinfrastructure scenario, with its many large resources and collections, this approach is no longer suitable. Availability and accessibility therefore have to be guaranteed by institutions with a clear service-oriented attitude that impose as few restrictions as possible on usage by researchers. Only this new type of centre can give such guarantees.

Can everyone act as a CLARIN centre in the emerging network? No - centres need to fulfil a number of criteria, which are listed in the documents mentioned above. In particular, these centres need to make a commitment statement that they will provide their services for a defined period of time at a certain service level, which depends on the type of service. Setting up centres that adhere to these requirements costs money, so centres need to have a clear funding basis.

The answer to this question depends on the country, although some services are provided at a European level. For detailed questions about the centres, see www.clarin.eu/centres

Metadata
Metadata: basics

It is data about data: information describing properties of linguistic resources. Think of the size of a corpus, the recording date of a speech file, the purpose for which annotations were created.

A fixed set of elements for the description of resources. Think of the traditional filing cards in the library, specifying the writer and title of each book.

Quite a few, examples are: Dublin Core, OLAC (which is an enriched version of Dublin Core), IMDI, the TEI header, ...

Good question. In fact there is no such thing as a single CLARIN metadata scheme. Practice has shown that using one particular scheme for a large community (e.g. the humanities) often results in a mismatch between the chosen elements and the needs of the user.

CLARIN proposes a component-based approach: you can combine several metadata components (sets of metadata elements) into a self-defined scheme that suits your particular needs. Of course you can share your profile with others (in fact we strongly advise that). If sharing the full profile is not an option, you still can use common components, e.g. a component to describe a sound recording. In case that still does not address your needs, it is even possible to create components yourself.

Each CMDI file consists of 3 parts:
  • a (fixed) Header, containing administrative information:
    • MdCreator: the author of the file
    • MdCreationDate: the creation date of this file
    • MdSelfLink: the URL or PID of this file
    • MdProfile: the unique identifier of a CMDI profile, as generated by the component registry (e.g. clarin.eu:cr1:p_1290431694484)
    • MdCollectionDisplayName: an (optional but recommended) plain text indication to which collection this file belongs. Used for the Collection facet in the VLO
  • a (fixed) Resources section, containing links to:
    • external files (e.g. an annotation file or a sound recording)
    • and/or other CMDI metadata files (to build hierarchies)
  • a (flexible) Components section, where the actual components that this profile contains will appear
This example CMDI file illustrates the use of the 3 parts.

Ok, so how can you refer to an external file from a CMDI metadata description? That is what the Resources section is for.

In the example CMDI file, the resources section looks like:

<Resources>
     
      <!-- List of external resource files and (CMDI) metadata files -->
      <ResourceProxyList>
        
         <ResourceProxy id="a_photo">
            <ResourceType mimetype="image/jpeg">Resource</ResourceType>
            <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
            <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
         </ResourceProxy>
        
         <ResourceProxy id="a_text">
            <ResourceType mimetype="text/plain">Resource</ResourceType>
            <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
         </ResourceProxy>

...

As you can see, for each link to an external resource a ResourceProxy (= file) is added to the ResourceProxyList (= file list). For each ResourceProxy you need to specify the ResourceType (either Resource, the default, or Metadata in case you want to build a hierarchy of CMDI files). With an optional (but very useful) mimetype attribute you can (surprise!) indicate the file's mime type. The ResourceRef contains either a normal URL or a handle PID.
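The structure described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the fragment has no XML namespace (as printed in this FAQ), and the closing tags in the sample string are added purely so that the snippet parses on its own. The function name list_resource_proxies is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the FAQ example. Assumptions: no XML namespace, and
# the closing tags are added here only to make the sample well-formed.
CMDI_RESOURCES = """
<Resources>
  <ResourceProxyList>
    <ResourceProxy id="a_photo">
      <ResourceType mimetype="image/jpeg">Resource</ResourceType>
      <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
    </ResourceProxy>
    <ResourceProxy id="a_text">
      <ResourceType mimetype="text/plain">Resource</ResourceType>
      <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
    </ResourceProxy>
  </ResourceProxyList>
</Resources>
"""


def list_resource_proxies(xml_text):
    """Return one dict per ResourceProxy: id, type, mimetype and reference."""
    root = ET.fromstring(xml_text.strip())
    proxies = []
    for proxy in root.iter("ResourceProxy"):
        resource_type = proxy.find("ResourceType")
        proxies.append({
            "id": proxy.get("id"),
            "type": resource_type.text,                 # Resource or Metadata
            "mimetype": resource_type.get("mimetype"),  # optional attribute
            "ref": proxy.findtext("ResourceRef"),       # URL or handle PID
        })
    return proxies


proxies = list_resource_proxies(CMDI_RESOURCES)
for entry in proxies:
    print(entry["id"], entry["type"], entry["mimetype"], entry["ref"])
```

Files that use an XML namespace would need namespace-qualified tag names in the iter and find calls.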

The information that a ResourceProxy can contain (a URL and mimetype) is kept very minimal, on purpose. However you can use any CMDI component to add more details about such a ResourceProxy, using the id attribute.

E.g. in the example CMDI file we can add a textual description of the photo. First the relevant ResourceProxy gets the id "a_photo":
<ResourceProxy id="a_photo">
    <ResourceType mimetype="image/jpeg">Resource</ResourceType>
    <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
    <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
</ResourceProxy>
Then, later on in the same CMDI file, we have an explanatory component example-component-photo with a description element:
<example-component-photo ref="a_photo">
     <description>a suitable textual description of this photo</description>
</example-component-photo>

Thanks to the reference from this component to the ResourceProxy with the ref attribute we know that the description relates to the photo.

Note that the id attribute should be unique for each ResourceProxy.
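The link between the ref attribute and the id attribute, and the uniqueness requirement, can be checked mechanically. The sketch below is an illustration under stated assumptions: the wrapping CMD and Components elements, the absence of an XML namespace and the function name check_references are all hypothetical and chosen only to make the sample self-contained.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, namespace-free sample combining the two fragments shown above.
SAMPLE = """<CMD>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="a_photo">
        <ResourceType mimetype="image/jpeg">Resource</ResourceType>
        <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <example-component-photo ref="a_photo">
      <description>a suitable textual description of this photo</description>
    </example-component-photo>
  </Components>
</CMD>"""


def check_references(xml_text):
    """Return (duplicate ids, component tag -> ResourceRef, dangling refs)."""
    root = ET.fromstring(xml_text)
    proxies = list(root.iter("ResourceProxy"))
    id_counts = Counter(p.get("id") for p in proxies)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    by_id = {p.get("id"): p for p in proxies}
    links, dangling = {}, []
    for element in root.iter():
        ref = element.get("ref")
        if ref is None:
            continue  # element does not point at a ResourceProxy
        if ref in by_id:
            links[element.tag] = by_id[ref].findtext("ResourceRef")
        else:
            dangling.append(ref)
    return duplicates, links, dangling


duplicates, links, dangling = check_references(SAMPLE)
print(duplicates, links, dangling)
```

An empty list of duplicates and an empty list of dangling references indicate that every ref points at exactly one ResourceProxy.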



Check out the CLARIN specification document: http://www.clarin.eu/files/wg2-4-metadata-doc-v5.pdf

PLEASE NOTE: the information in this document might be partially outdated. http://www.clarin.eu/cmdi (including these FAQs) is certainly more up to date and should be considered authoritative.

The MdProfile element (in the Header section) contains a unique profile code (e.g.: clarin.eu:cr1:p_1290431694484).

You can find the profile in the component registry with the following URL:

http://catalog.clarin.eu/ds/ComponentRegistry?item=code

e.g. http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:p_1290431694484
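As a small illustration, the lookup URL can be derived from the MdProfile value of a file. The sketch assumes a namespace-free Header fragment with all other fields omitted; the function name profile_registry_url is hypothetical.

```python
import xml.etree.ElementTree as ET

# URL pattern as given in this FAQ; the profile code is appended to it.
REGISTRY_ITEM_URL = "http://catalog.clarin.eu/ds/ComponentRegistry?item="

# Minimal Header fragment (assumption: no XML namespace, other fields omitted).
HEADER = "<Header><MdProfile>clarin.eu:cr1:p_1290431694484</MdProfile></Header>"


def profile_registry_url(header_xml):
    """Build the component registry URL for the profile named in MdProfile."""
    profile_code = ET.fromstring(header_xml).findtext("MdProfile")
    if not profile_code:
        raise ValueError("Header contains no MdProfile element")
    return REGISTRY_ITEM_URL + profile_code.strip()


url = profile_registry_url(HEADER)
print(url)
```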
Technically there is no real difference. A profile is a component that can be converted into an XSD file. A normal component can only be used within other components or profiles and can never be transformed into an XSD.

The isProfile="true" attribute indicates that a CMD_ComponentSpec defines a profile and not just a component.
Yes. If you tick the checkmark next to Multilingual for an element in the Component Registry, it will result in a multilingual field. With the xml:lang attribute you can then indicate the language in which an element has been described, see e.g. the following fragment in this example CMDI file:
<!-- Note the support for multilingual fields, using the xml:lang attribute -->
<title xml:lang="eng">mister</title>
<title xml:lang="fra">monsieur</title>
<title xml:lang="nld">mijnheer</title>

For indicating the language we strongly advise using the ISO 639-3 language code.
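A consumer of such a multilingual field can select a value by language code. The following is a minimal sketch; the wrapping record element, the fallback to "eng" and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The three title elements from the fragment above, wrapped in a
# hypothetical <record> element so that the sample has a single root.
FRAGMENT = """<record>
  <title xml:lang="eng">mister</title>
  <title xml:lang="fra">monsieur</title>
  <title xml:lang="nld">mijnheer</title>
</record>"""


def titles_by_language(xml_text):
    """Map each ISO 639-3 code to the title given in that language."""
    root = ET.fromstring(xml_text)
    return {t.get(XML_LANG): t.text for t in root.iter("title")}


def pick_title(titles, preferred, fallback="eng"):
    """Return the title in the preferred language, else the fallback one."""
    return titles.get(preferred, titles.get(fallback))


titles = titles_by_language(FRAGMENT)
print(pick_title(titles, "nld"))
print(pick_title(titles, "deu"))  # no German title, falls back to "eng"
```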

As a starting point, see the list below. We are working to extend it.

Components with controlled vocabularies

Other components

The CMDI core model for web services (and extended documentation) is available at:

http://www.isocat.org/clarin/ws/cmd-core/

Send a mail to vlw@clarin.eu and we will incorporate your metadata as soon as possible

If you have old records in the IMDI format you can use the following profiles:

From that profile you can generate the XSD:
And then you can transform your IMDI files into CMDI files that comply with the profile with the following XSLT:

http://www.clarin.eu/cmd/xslt/imdi2clarin.xsl

An example IMDI input file:
http://corpus1.mpi.nl/IFAcorpus/IMDI/Session_F20N1FY.imdi

The corresponding CMDI output file:
http://www.clarin.eu/cmd/example/example-phonological-corpus.cmdi
There is no general procedure to do this, as TEI has many variants and extensions. However, you could use the following general workflow:
  • Inspect your TEI headers and decide what the relevant parts are. Some information (e.g. layout tags etc.) might be lost during the conversion.
  • Compare your needs with the TEI profile in the CMDI component registry. If it fulfills your needs, go to the next steps. If it does not, use the TEI profile as a basis to create your own CMDI profile.
  • Create an XSLT that generates CMDI instances (according to the profile that you chose in the previous step) from the TEI files. (Have a look at olac2cmdi.xsl and imdi2clarin.xsl for some inspiration).

There are indeed issues with searching if people aren't using matching descriptions. Think of someone calling a collection of texts a "text archive", while someone else might be searching for a "text corpus". Or think of all the variants that people can use for one and the same country: the Netherlands, Netherlands, Holland, etc. The same goes for linguistic annotations: "noun" and "substantive" can both be used to describe the same part-of-speech tag. To counter these problems the metadata components contain links to a kind of database that contains atomic concepts (say "country" or "resource type"). We call them data categories. Smart software will later on be able to "see" that if a user searches for nouns, he might also be interested in substantives.

In a data category registry - a server that can be reached via the internet, both by human users and computer programs.

Yes, go to the site mentioned above and click through to Public > Athens Core > Metadata

Metadata: harvesting and VLO

No, we are not. It is a term used for gathering metadata descriptions from several locations and storing them in a central database. You can find the results of such a harvesting process at http://catalog.clarin.eu/ds/vlo/

More information about harvesting metadata can be found at http://en.wikipedia.org/wiki/Open_Archives_Initiative_Protocol_for_Metad...

Yes - bringing all metadata descriptions together ("harvesting"), making them searchable ("indexing") and citable ("creating persistent identifiers") is an important part of the infrastructure that CLARIN is building. When you provide CMDI metadata, CLARIN can harvest it and make it available via the Virtual Language Observatory.

If you have many metadata records or records that frequently change:

  • Use OAI-PMH
  • Provide them preferably as CMDI (click here for details about how to serve CMDI over OAI-PMH)
  • If that is not possible provide them as OLAC
  • Send a mail to vlw@clarin.eu to notify us about your OAI-PMH access point

If you have only a few static records and setting up an OAI-PMH access point is not feasible:

Data provided over OAI-PMH or via the LRT inventory will be made available and searchable via the Virtual Language Observatory.
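For orientation, this is roughly what a harvester's request to an OAI-PMH access point looks like. The sketch only builds the request URL and does not contact any server. The base URL http://example.org/oai is a placeholder, and the metadataPrefix values "cmdi" and "olac" are assumptions: the prefix that an access point actually advertises may differ.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="cmdi", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional: restrict harvesting to one set
    return base_url + "?" + urlencode(params)


# Placeholder access point; replace with the real OAI-PMH base URL.
cmdi_request = list_records_url("http://example.org/oai")
# Fallback request for providers that can only serve OLAC records.
olac_request = list_records_url("http://example.org/oai", metadata_prefix="olac")
print(cmdi_request)
print(olac_request)
```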

Short answer: as indicated in the OAI-PMH protocol, you need to offer all records. There is no automatic harvesting of any of the CMDI child nodes.
 
Long answer: in the case of our toy example hierarchy (http://www.clarin.eu/faq/3454)
 
You would need to provide the following files over OAI-PMH:
 
 
Providing collection_root.cmdi (or even collection_olac.cmdi and collection_lrt_inventory.cmdi) is not enough, as all OAI harvesters are protocol-agnostic and thus do not know about CMDI’s hierarchy building! CMDI-consuming applications, such as the VLO, also need the physical files locally.
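To find out which files have to be offered, you can walk the hierarchy yourself by following every ResourceProxy of type Metadata. The sketch below works on an in-memory dictionary instead of real files. The layout of the hierarchy, the file name olac_record_1.cmdi, the namespace-free XML and the helper names are all assumptions made for illustration; they are not taken from the toy example itself.

```python
import xml.etree.ElementTree as ET


def cmdi_with_children(*child_refs):
    """Build a tiny namespace-free CMDI stub linking to child metadata files."""
    proxies = "".join(
        f'<ResourceProxy id="p{i}"><ResourceType>Metadata</ResourceType>'
        f"<ResourceRef>{ref}</ResourceRef></ResourceProxy>"
        for i, ref in enumerate(child_refs)
    )
    return (f"<CMD><Resources><ResourceProxyList>{proxies}"
            "</ResourceProxyList></Resources></CMD>")


# Assumed layout: the root links to two collections, one of which has a child.
FILES = {
    "collection_root.cmdi": cmdi_with_children(
        "collection_olac.cmdi", "collection_lrt_inventory.cmdi"),
    "collection_olac.cmdi": cmdi_with_children("olac_record_1.cmdi"),
    "collection_lrt_inventory.cmdi": cmdi_with_children(),
    "olac_record_1.cmdi": cmdi_with_children(),  # hypothetical leaf record
}


def records_to_offer(root_name, files):
    """Breadth-first walk over Metadata links; returns every file reached."""
    seen, queue = [], [root_name]
    while queue:
        name = queue.pop(0)
        if name in seen:
            continue  # already visited, guards against cycles
        seen.append(name)
        for proxy in ET.fromstring(files[name]).iter("ResourceProxy"):
            if proxy.findtext("ResourceType") == "Metadata":
                queue.append(proxy.findtext("ResourceRef"))
    return seen


offered = records_to_offer("collection_root.cmdi", FILES)
print(offered)
```

Every file in the resulting list needs its own record in the OAI-PMH access point.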

The CLARIN-D team in Leipzig has written an excellent guide explaining how to do this. (in German, see the attachment)
Metadata: Arbil as CMDI editor
Arbil is indeed a metadata editor with support for CMDI files.

It used to be an editor for IMDI only; the CMDI functionality was added later on (at the beginning of 2010). This means that the support for CMDI files was initially not as extensive as that for IMDI. However, since release 2.3 of Arbil the support for CMDI has been significantly improved.
We recommend using at least version 2.3.

At the time of writing (2012-01-05) this is the "testing" version.
  • Download Arbil (2.3 or higher) and start it
  • Go to Options > Templates & Profiles
  • Select in "Clarin Profiles" which profile(s) you want to use as the basis for a CMDI file and click on Close
  • Right-click on Local corpus, choose Add and select the relevant profile (the CMDI profiles are marked with an icon)

Some profiles (obvious tests and the ones not intended for manual metadata creation) have been excluded from the default profile list in Arbil (testing). You can see them by disabling the "only load profiles selected for manual editing" option in the Available Templates & Profiles dialog.

By default, if you create a new profile in the component registry, it will show up in Arbil.

As of Arbil 2.3 there is an item 'Insert Manual Resource Location' in the context menu of CMDI instances in the tree.

See http://www.lat-mpi.eu/tools/arbil/manual/ch01s04.html

For CMDI, some additional icons are used and some icons have a slightly different meaning:

link to an external file (ResourceProxy)
grouping icon for repeatable elements (more information...)
root node of a CMDI file



When a component can occur multiple times (= CardinalityMax higher than 1), Arbil automatically groups all occurrences of these components in the CMDI file. You can recognize these by the following properties:
  • they have the grey club icon
  • the text is shown in grey
  • after the node a number indicates how many times the component occurs
E.g. in this example CMDI file there is a fragment that looks like:
<ISO639>
   <iso-639-3-code>cat</iso-639-3-code>
</ISO639>
<ISO639>
   <iso-639-3-code>spa</iso-639-3-code>
</ISO639>
Arbil shows this as:




Right click on the file in the "local corpus" panel and select Edit all Metadata



Correct observation. The elements that are optional (= have a CardinalityMin of 0) are not shown by default. You need to add them explicitly. To do this, right click on the CMDI file in the "local corpus" panel and select Add


ISOcat

An ISO Data Category Registry is a step in the direction of interoperability at the level of linguistic encoding (tag sets, metadata elements, etc.). The basic idea is to register all widely used concepts/terminology so that everyone can refer to them. It is all based on the ISO 12620 standard, which is a generic model not restricted to linguistics.

ISOcat is the software and database that implements the ISO 12620 standard and data model. In theory it is one of the many implementations of this standard, in practice it is the only one that currently exists. It can be accessed via http://www.isocat.org/

Currently it is being filled with many categories from for example the EAGLES project, various metadata initiatives and hopefully other sub-disciplines and initiatives. There are bodies made up by linguists that take care that the content of the ISOcat registry is not too fragmented and meets a number of criteria.

No - it is just a start to offer a reference, so that users creating new resources can use the registered categories, and schemas describing legacy data can refer to them. But we will find that not all tag sets which are in use for various purposes can easily be mapped onto one another. It will also largely depend on the intended usage. For searching, an imperfect mapping may result in less precision, but for a researcher this may not be a problem.

There is much debate about this and other questions and there is no good universal answer yet. However, we need to start using the ISOcat registry to find out how the definitions can be improved, which categories are missing and which granularity should be chosen for metadata, morphology and semantic annotation, to mention just a few examples.

No - the data model was set up with the explicit intention not to include relations, since in most cases these depend on theories and practical intentions.

To deal with relations between data categories a framework will be offered, RELcat, that allows users to easily manipulate and share relations according to their needs. From CLARIN we intend to offer at least one set of relations with a large coverage which users may want to use or manipulate.

In origin you can mention the inspiration source for the creation of the data category. If you do not know what to enter, please enter CLARIN

For the source of the language section, please enter CLARIN
No, unless it is required by certain language rules (e.g. nouns in German), the name of a data category should not contain capitals.
  • Register at isocat.org
  • Send a mail with your name to dieter.vanuytvanck@mpi.nl
  • You will be added to the CLARIN group
  • My Workspace > button “create new data category”
  • My Workspace > Private > CLARIN > MD > button “edit this data category selection”
  • My Workspace > Private > + (add this data category to selection)
  • Click on the icon for “save the selected data categories”
  • After inspection the new data categories can be moved to the Metadata thematic view
  • (Finally, and optionally, the data categories in the Metadata thematic view can be submitted to the Thematic Domain Group for official approval)

It is a data category without conceptual domain but intended to group complex data categories (or another container).

As such it can be used to combine semantics. For instance, if a CMDI component "actor" has a reference to the container data category "actor", and it contains an element with a reference to the complex data category "name", then a search engine could infer that the name is that of the actor ( = "actor" + "name").

Persistent Identifiers (PIDs)

Persistent identifiers are increasingly seen as a core component for the many references we are creating at various levels. These range from references between metadata descriptions and their resources up to references between semantic assertions made using RDF (Resource Description Framework). For more information please read the requirements specification document or the short guide.

In the emerging cyberinfrastructure we are creating more and more references between resources, resource fragments and services. Creating these references is very costly, and they are often essential for the interpretation of a resource. Therefore we need proper mechanisms to ensure that these references survive despite all the changes that happen, for example, in repositories. It is known that URLs are not appropriate: they are not persistent, even when we believe that they are proper URIs. This is where special PIDs come into play, which identify an object and are maintained by reliable institutions.

Handling PIDs is very simple. First you need to register a PID for a resource or service. You do this by providing the required information to the PID service site, in particular the path to access the resource, such as a URL. You then receive a PID which you can enter, for example, into the metadata description, so that everyone can use it for referencing. When a user finds such a PID in a resource, he/she can click on this reference and the service will resolve the PID and give access to (one of the copies of) the resource. Normally, as a user, you don't see the intermediate transactions.

If the PIDs cannot be resolved at a certain moment, one simply cannot access a resource. Think of a situation where hundreds of users are waiting for the resolution of a PID and nothing happens - a nightmare for any cyberinfrastructure scenario! Since this would not be acceptable, we need to make sure that the PID service is based (a) on very robust and reliable software offering sufficient functionality, and (b) on a proper service run by redundant centres with a high availability and persistency guarantee.

PIDs are, as said, unique and persistent identifiers of objects that are made available by proper repositories. For many resources there are additional characteristics, such as multiple copies for preservation reasons, a string (such as an MD5 checksum) that can be used to check authenticity, simple metadata for citation purposes, a reference to the access permission record, etc. A proper PID system should offer such information immediately when resolving a PID. PURLs cannot offer this functionality; for URNs we do not know of a well-proven and robust resolver, although the big libraries agreed on using URNs for their publications. DOIs are also based on the proven Handle System, and DOI is certainly a proper service, used in particular by the big publishing companies. However, DOI also comes with a business model that will not be acceptable for many research organizations.

A handle consists of 2 parts:
  • a prefix (e.g. 1839)
  • a suffix (e.g. 00-0000-0000-0009-3C7E-F)
The official way of referring to a handle is:

hdl: + prefix + / + suffix

e.g.:
hdl:1839/00-0000-0000-0009-3C7E-F

To resolve such a handle (=make it a clickable link that redirects to the resource itself) use the following formula:

http://hdl.handle.net/prefix/suffix

e.g.: http://hdl.handle.net/1839/00-0000-0000-0009-3C7E-F
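The two notations can be converted into each other mechanically. The following is a minimal sketch; the function names parse_handle and resolver_url are hypothetical.

```python
RESOLVER = "http://hdl.handle.net/"


def parse_handle(handle):
    """Split 'hdl:prefix/suffix' (or 'prefix/suffix') into (prefix, suffix)."""
    if handle.startswith("hdl:"):
        handle = handle[len("hdl:"):]
    # The prefix ends at the first slash; the suffix is everything after it.
    prefix, separator, suffix = handle.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a valid handle: {handle!r}")
    return prefix, suffix


def resolver_url(handle):
    """Turn a handle into a clickable link via the hdl.handle.net resolver."""
    prefix, suffix = parse_handle(handle)
    return f"{RESOLVER}{prefix}/{suffix}"


link = resolver_url("hdl:1839/00-0000-0000-0009-3C7E-F")
print(link)
```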

 

 


CLARIN has an arrangement with the EPIC consortium that CLARIN members will be able to register PIDs and of course resolve them. This consortium groups a number of reliable European service providers that want to participate in providing a redundant service for the research world, i.e. we are speaking about millions of PIDs and a service at very low costs. The service is based on the Handle System which according to our investigations is the only robust system meeting all requirements. No one is obliged to register Handles, but of course CLARIN centres will need to demonstrate that their PIDs can be resolved in a robust manner and offer the required functionality.

As described at the EPIC website it is sufficient to send a mail to handle /at/ gwdg.de with a motivated request.

(Answer taken from the ISO citer draft, p. 11)

This International Standard supports different levels of granularity. The following recommendations are designed to encourage efficiency and promote interoperability with other naming schemes:

1) If there is an existing identifier scheme for a type of resources, for instance, ISBN, this level of granularity should be retained, which is to say that no new PIDs should be issued without very good reasons, such as for chapters. Chapters would preferably be addressed using part identifiers in conjunction with the PID of the book.

2) If the resource is associated with the complete content of a digital file, an individual PID should probably be assigned for this resource.

3) If the resource is autonomous and exists outside a larger context, an individual PID should probably be assigned for this resource.

4) If a resource should be citable apart from any containing resource, an individual PID should probably be assigned for this resource.

These recommendations are, however, subject to the needs of resource creators with respect to the level of granularity they deem suitable to the specific resource environment.

The rewriting behaviour of part identifiers can be configured per handle prefix (it can also be done per individual handle, but this is not supported for EPIC at this point). For EPIC (version 1, so with prefix 11858) the choice was made to rewrite @[suffix] to ?[suffix]

 So suppose that 11858/1234 resolves to http://clarin.eu then

 11858/1234@test=a will be resolved to http://clarin.eu?test=a
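The rewriting in this example can be mimicked with a short function. This is only a sketch of the behaviour shown in the example, not the resolver's actual implementation; the lookup table and the function name rewrite_part_identifier are hypothetical.

```python
# Hypothetical lookup table standing in for the handle resolver's database.
TARGETS = {"11858/1234": "http://clarin.eu"}


def rewrite_part_identifier(handle, targets):
    """Resolve a handle and append its part identifier as a query string."""
    # Everything after the first "@" is treated as the part identifier.
    base, separator, part = handle.partition("@")
    target = targets[base]
    if not separator:
        return target  # plain handle without a part identifier
    return f"{target}?{part}"


resolved = rewrite_part_identifier("11858/1234@test=a", TARGETS)
print(resolved)
```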

Please note that when you offer PIDs with part identifiers, you are responsible for maintaining the part identification fragment as well. Remember that users will use it to link to your resources and that the resulting end point should always be available.

Standards

CLARIN does not create linguistic resources; its purpose is to offer rapid access to existing resources and to facilitate their reuse in new contexts. When resources and tools are produced for individual usage, interoperability, and therefore the need to adhere to standards or best practices, is of little relevance. The problem of interoperability only emerges when linguists are ready to offer their resources and tools to other researchers. One of the requirements of interoperability is the ability to connect different resources to the same tool. This can be done using standards, but that would imply having all the resources standardized (an ideal situation, which cannot always be achieved in reality). When needed, a standard can also play the role of a pivot format (resources are converted to the standard before they are used).

standards

An open list follows:

    • character encoding: ISO 10646 (Unicode), UTF-8
    • country codes: ISO 3166
    • language codes: ISO 639-1/2/3
    • codes for the representation of names of scripts: ISO 15924
    • text format: XML
    • text format: CSV (comma separated with "-quotes, with a header line and preferably a line of ISOcat URIs for each column)
    • feature structure representation: ISO 24610-1:2006
    • representation of primary sources: TEI (Text Encoding Initiative)
    • knowledge engineering: RDF, RDF-S, SKOS, OWL
    • audio/speech: PCM (Pulse Code Modulation) for digitizing sound waves, the Alphabet of the International Phonetic Association for phonetic transcriptions;
    • video/multimodality: MJPEG2000 lossless as backend format, MPEG2 or H.264 for handling and processing
    • data categories: ISO DCR and ISOcat
    • annotation of temporal entities: TimeML (part of TC 37/SC 4)
    • morpho-syntactic annotation: MAF (Morpho-syntactic Annotation Framework), ISO/DIS 24611
    • syntactic annotation: SynAF (Syntactic Annotation Framework), ISO/CD 24615
    • lexical annotation: LMF (Lexical Markup Framework), ISO 24613:2008
    • linguistic annotation: LAF (Linguistic Annotation Framework), ISO/DIS 24612
For more information on each of these standards, please take a look at the CLARIN Standardization Action Plan.

CLARIN actively tracks a number of ongoing standardisation activities at two major levels: linguistic structures/formats and linguistic encoding. CLARIN as an infrastructure project has the duty to evaluate, test and comment on these proposals in close cooperation with the relevant standardisation bodies. When necessary, CLARIN may take the lead in initiating new standardisation activities when a clear gap in coverage is identified. For more information see the CLARIN Standardization Action Plan.

No linguist should be required to read long documents about standards; it is primarily the task of the tool, service and converter developers to provide frameworks that help the researcher and that hide complex formalisms as much as possible.

Very good question. A researcher will always face this issue. Research moves a field on, and in no-man's-land there are no standards (yet). The standards remain behind the research. Industry always stays on firm ground, and therefore on well-established conventions. Although it may appear that research has no means to make use of standards, it should base itself on well-established foundations, expressed in standardized form whenever possible. Only for the head of the arrow, the really fresh things that have just been invented, should the researcher look for his own ad-hoc conventions. Applied to linguistic data, this means that in an annotated corpus, for example, one will find a mixture of standard and invented markings. CLARIN can be used for that part of the processing that involves existing tools and resources that have been converted to a standard format.