On the following pages CLARIN answers some of the questions it is asked most frequently. The FAQ should give interested readers a quick entry point to a number of essential issues that have been discussed within CLARIN.
About CLARIN
CLARIN is an acronym for Common Language Resources and Technology Infrastructure.
CLARIN aims to unite existing digital archives in Europe that contain language-based material into a federation that gives the social sciences and humanities research communities unified access to the content. It wants to make the wealth of language and speech processing tools developed in recent years available to interested researchers, with a view to opening up new research avenues. Another goal of CLARIN is to provide web-based services that allow non-expert users (especially humanities and social sciences researchers without a technological background) to perform complex tasks on the materials contained in the archives, such as ‘Summarize Le Monde of March 17 2008 – in Polish’.
How do we unite existing archives from all over Europe into a single federation? How can we manage the wide variability of conventions for describing resources, in many different languages? How might we chain together existing tools and applications, with their specific expectations in terms of input and output formats? Can we help to provide a minimum level of technological coverage for all languages, irrespective of the number of speakers? How do we protect the intellectual property rights of those who have provided data or tools? How do we ensure that whatever infrastructure facilities we manage to build will be sustainable?
These questions should be answered by the end of the CLARIN preparatory phase, at the end of 2010.
Not yet. CLARIN is presently approaching the end of the preparatory phase (2008-2010), in which the technological and organizational specifications are being finalized. The infrastructure will be set up during the construction phase (2011 - 2015) and will be in exploitation from 2016 on. However, in the meantime, we do aim at creating some prototype services.
At the end of the preparatory phase, CLARIN intends to make the transition towards a more permanent structure. CLARIN is now investigating the possibility that this be an ERIC - European Research Infrastructure Consortium - a legal entity based on EU law (Article 171 of the EC Treaty). It was designed to facilitate the joint establishment and operation of research facilities of European interest (see the European Commission site for more details).
CLARIN aims at bringing together producers and consumers of language resources and technologies. A producer is a contributor of linguistic data, tools or services. Mainly these are research organisations, such as a university or an industrial company that is collecting and annotating textual or speech corpora, or is inventing and implementing natural language processing technologies or applications. A consumer is a researcher or a group in need of linguistic data or processing technologies. In CLARIN, this person is seen mainly as a scholar in the humanities or social sciences. Although the consumer may have no computational linguistics background, CLARIN will aim to help to address research problems involving processing of linguistic data.
In the preparatory phase (2008-2010) CLARIN is funded by the EU through the 7th Framework ESFRI programme. One of the objectives of the preparatory phase is to come up with cost estimates for the construction and exploitation phases. The main funders will then be the national governments, with a possible minor contribution from the EU for some generic costs of the infrastructure. The cost estimate will include Europe-wide aspects such as comprehensive training and education programmes.
CLARIN publishes a newsletter every 3 months and a newsflash every month. Have a look at the newsletter and at the newsflash (you can also subscribe to them, if you would like to receive these updates in your email).
Joining CLARIN
In the preparatory phase (until the end of 2010) CLARIN is still an EU project, so the number and names of consortium partners are strictly limited by the contract. But you can become a CLARIN Member by filling in the form here. You also need recommendations from two existing CLARIN partners or members. If yours is the first institution in your country to join CLARIN, you also need to appoint a national contact person for your country. For further details, please consult the document on CLARIN membership criteria and procedures. You can also contact the CLARIN Office (clarin@clarin.eu).
Starting with the construction phase (from 2011), CLARIN will become a self-supporting organization with a legal entity (the CLARIN-ERIC), and the members will be the governments of the countries involved.
The preparatory phase of CLARIN is funded by the European Union, and the number and names of consortium partners are defined in the contract signed with the EU. Still, you can apply for CLARIN membership. Becoming a CLARIN member is free of charge and means that you support its goals. Membership requests are subject to acceptance by the national representative and the executive board of CLARIN. For further information on becoming a member in the preparatory phase, please have a look at the previous question, "How can I join CLARIN?". You can also contact the CLARIN Office (clarin@clarin.eu) for additional information.
In the construction phase, country membership will be bound to a yearly fee. The level of this fee will be decided before the construction phase begins and will be laid down in the CLARIN-ERIC (the legal entity). The majority of CLARIN's activities will be funded through these contributions from the national governments.
You should ask your government for a support letter to join the CLARIN Consortium, and to appoint an institute or a person as the national contact. You can find an overview of what is expected in terms of financial and other types of support here. For more suggestions and help you can also contact the CLARIN Office (clarin@clarin.eu).
Only countries and institutions can be CLARIN members. Private persons are not allowed.
If you are an institution in Latvia, or in any other CLARIN member country, you should contact your national representative - the CLARIN national contact person. If you don't know who the national contact point for your country is, please have a look at the list here.
Yes, this is possible. Institutions from non-member countries, including countries outside Europe, can become associated members in the preparatory phase. Technical details on how to apply for CLARIN membership in this phase can be found under the "How can I join CLARIN?" section. The procedure will change after 2010, in the construction phase, when the rights and duties of associated members will be laid down in the statutes of the new legal entity (CLARIN-ERIC).
The actual way to access the infrastructure will be established during the construction phase.
CLARIN resources
CLARIN is not the direct owner of any resource. It only provides access to a series of repositories where information about existing resources can be found. There are three main categories of resources: data, services and tools.
Several ways to access these resources are available: Another source of information regarding existing resources is LT-World. LT World is also a valuable source of information about linguistic terms and types of resources from a theoretical point of view.
Note that this is not the entry point into the CLARIN world of information. Once the CLARIN centres are in place, an extensive search service will provide access to all types of resources.
All kinds of relevant linguistic data: textual and speech corpora, either raw or accompanied by metadata (annotated), lexica, grammars, video (sign-language recordings) and multimedia (text and speech, video, speech and subtitles), etc.
Yes, although not in CLARIN, but through CLARIN... You can use the faceted search tool for data to find a specific type of linguistic data.
If you are a member, you can add data by filling in the form here.
The procedure to update existing (meta)data is described here.
Tools are programs doing specific transformations over linguistic data.
Yes, you can use the faceted search interface for tools to search for various NLP tools. You can filter these tools by type, language, platform, license and author organization.
If you are a member, you can add tools by filling in the form here.
Services are the on-line equivalent of tools. The difference between a tool and a service is that a tool needs to be run locally (where the data is), while a service runs remotely. When using a web service the input data and the program that does the processing can reside on different machines. The data is transferred via protocols to the remote server, and the output results are transferred back when the processing is complete.
If you are a member, you can add services by filling in the same form as for tools, here. You need to specify the type "Web service".
The Virtual Language World (VLW) is a way to browse the resources and tools in the CLARIN repository using the Google Earth interface. You can find it here (note that you need to have Google Earth installed in order to use it). The VLW contains data from multiple sources: from the CLARIN LRT inventory, from OLAC providers, from IMDI and from the DFKI software registry.
Technical infrastructure
Standards
CLARIN does not create linguistic resources; its purpose is to offer rapid access to existing resources and to facilitate their reuse in new contexts. When resources and tools are produced for individual use, interoperability, and therefore the need to adhere to standards or best practices, is of little relevance. The problem of interoperability only emerges when linguists are ready to offer their resources and tools to other researchers. One of the requirements of interoperability is being able to connect different resources to the same tool. This can be done using standards, but that would imply having all the resources standardized (an ideal situation that cannot always be achieved in reality). When needed, a standard can also play the role of a pivot format (resources are converted to the standard before they are used).

An open list follows:
- character encoding: ISO 10646 (Unicode), UTF-8
- country codes: ISO 3166
- language codes: ISO 639-1/2/3
- codes for the representation of names of scripts: ISO 15924
- text format: XML
- text format: CSV (comma separated with "-quotes, with a header line and preferably a line of ISOcat URIs for each column)
- feature structure representation: ISO 24610-1:2006
- representation of primary sources: TEI (Text Encoding Initiative)
- knowledge engineering: RDF, RDF-S, SKOS, OWL
- audio/speech: PCM (Pulse Code Modulation) for digitizing sound waves, the Alphabet of the International Phonetic Association for phonetic transcriptions
- video/multimodality: MJPEG2000 lossless as backend format, MPEG2 or H.264 for handling and processing
- data categories: ISO DCR and ISOcat
- annotation of temporal entities: TimeML (part of TC 37/SC 4)
- morpho-syntactic annotation: MAF (Morpho-syntactic Annotation Framework), ISO/DIS 24611
- syntactic annotation: SynAF (Syntactic Annotation Framework), ISO/CD 24615
- lexical annotation: LMF (Lexical Markup Framework), ISO 24613:2008
- linguistic annotation: LAF (Linguistic Annotation Framework), ISO/DIS 24612
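The CSV convention in the list above (double-quoted fields, a header line, and preferably a second line carrying one ISOcat URI per column) can be sketched roughly as follows. This is only an illustration: the column names and the two `DC-0001`/`DC-0002` URIs are made-up placeholders, not real registry entries.

```python
import csv
import io

# Hypothetical sample: header line, then a line of ISOcat URIs (placeholders),
# then the data rows. Embedded quotes are doubled, as usual in CSV.
SAMPLE = '''"token","pos"
"http://www.isocat.org/datcat/DC-0001","http://www.isocat.org/datcat/DC-0002"
"Haus","noun"
"said ""hi""","verb"
'''


def read_annotated_csv(text):
    """Return (column name -> ISOcat URI, list of row dicts)."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    header = next(reader)   # first line: column names
    uris = next(reader)     # second line: one data category URI per column
    categories = dict(zip(header, uris))
    rows = [dict(zip(header, row)) for row in reader]
    return categories, rows


categories, rows = read_annotated_csv(SAMPLE)
```

Keeping the URI line directly under the header means that a consumer can look up what each column means without consulting separate documentation.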
CLARIN actively tracks a number of ongoing standardisation activities at two major levels: linguistic structures/formats and linguistic encoding. As an infrastructure project, CLARIN has the duty to evaluate, test and comment on these proposals in close cooperation with the relevant standardisation bodies. When a clear gap in coverage is identified, CLARIN may take the lead in initiating new standardisation activities. For more information see the CLARIN Standardization Action Plan.
No linguist should be required to read long documents about standards; it is primarily the task of the tool, service and converter developers to provide frameworks that help the researcher and that hide complex formalisms as much as possible.
Very good question. A researcher will always face this issue. Research moves a field forward, and in no-man's-land there are no standards (yet): standards lag behind research. Industry always stays on firm ground, and therefore on well-established conventions. Although it may appear that research has no means to make use of standards, it should build on well-established foundations, which should be expressed in standardized form whenever possible. Only for the tip of the arrow, the really fresh, newly invented things, should the researcher look for his or her own ad hoc conventions. Applied to linguistic data, this means that in an annotated corpus, for example, one will find a mixture of standard and invented markings. CLARIN can be used for the part of the processing that involves existing tools and resources that have been converted to a standard format.
ISO Data Category Registry
The ISO Data Category Registry is an attempt to take a step towards interoperability at the level of linguistic encoding (tag sets). The basic idea is to register all widely used concepts/terminology so that everyone can refer to them or relate their own categories to them. It is all based on the ISO 12620 standard, which is a generic model not restricted to linguistics. If you want to read more, please go to the ISOcat website (www.isocat.org).
ISOcat
Use the ISOcat forum - see http://www.isocat.org/forum/viewtopic.php?f=3&t=4&p=6#p6 for details.
In "origin" you can mention the source that inspired the creation of the data category. If you do not know what to enter, please enter CLARIN.
For the source of the language section, please enter CLARIN
No, unless it is required by certain language rules (e.g. nouns in German), the name of a data category should not contain capitals.
- Register at isocat.org
- Send a mail with your name to dieter.vanuytvanck@mpi.nl
- You will be added to the CLARIN group
- My Workspace > button “create new data category”
- My Workspace > Private > CLARIN > MD > button “edit this data category selection”
- My Workspace > Private > + (add this data category to selection)
- Click on the icon for “save the selected data categories”
- After inspection the new data categories can be moved to the Metadata thematic view
- (Finally, and optionally, the data categories in the Metadata thematic view can be submitted to the Thematic Domain Group for official approval)
ISOcat is the software and database that implements the ISO 12620 standard and data model. It is currently being filled with many categories from, for example, the EAGLES project and various metadata initiatives, and hopefully from other sub-disciplines and initiatives as well. Bodies made up of linguists take care that the content of the ISOcat registry does not become too fragmented and meets a number of criteria.
No - the data model was set up with the explicit intention of not including relations, since in most cases these depend on theories and practical intentions. Instead, a framework will be offered that allows users to easily manipulate and share relations according to their needs. CLARIN intends to offer at least one set of relations with a large coverage, which users may want to use or manipulate.
No - it is just a start to offer a reference, so that users creating new resources can use the registered categories, and schemas describing legacy data can refer to them. But we will find that not all tag sets in use for various purposes can easily be mapped onto one another. It will also largely depend on the intended usage. For searching, an imperfect mapping may result in lower precision, but for a researcher this may not be a problem.
There is much debate about this and other questions, and there is no good universal answer yet. However, we need to start using the ISOcat registry to find out how the definitions can be improved, which categories are missing and which granularity should be chosen for metadata, morphology and semantic annotation, to mention just a few examples.
Metadata
It is data about data: information describing properties of linguistic resources. Think of the size of a corpus, the recording date of a speech file, the purpose for which annotations were created.
A fixed set of elements for the description of resources. Think of the traditional filing cards in the library, specifying the writer and title of each book.
Quite a few. Examples are: Dublin Core, OLAC (which is an enriched version of Dublin Core), IMDI, the TEI header, ...
Good question. In fact there is no such thing as a single CLARIN metadata scheme. Practice has shown that using one particular scheme for a large community (e.g. the humanities) often results in a mismatch between the chosen elements and the needs of the user.
CLARIN proposes a component-based approach: you can combine several metadata components (sets of metadata elements) into a self-defined scheme that suits your particular needs. Of course you can share your profile with others (in fact we strongly advise that). If sharing the full profile is not an option, you still can use common components, e.g. a component to describe a sound recording. In case that still does not address your needs, it is even possible to create components yourself.
Each CMDI file consists of three parts:
- a (fixed) Header, containing administrative information:
  - MdCreator: the author of the file
  - MdCreationDate: the creation date of this file
  - MdSelfLink: the URL or PID of this file
  - MdProfile: the unique identifier of a CMDI profile, as generated by the component registry (e.g. clarin.eu:cr1:p_1290431694484)
  - MdCollectionDisplayName: an (optional but recommended) plain text indication of the collection to which this file belongs. Used for the Collection facet in the VLO
- a (fixed) Resources section, containing links to:
  - external files (e.g. an annotation file or a sound recording)
  - and/or other CMDI metadata files (to build hierarchies)
- a (flexible) Components section, where the actual components that this profile contains will appear
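The three-part layout described above can be sketched with Python's standard library. This is a simplified illustration rather than a schema-valid CMDI file: the root element name `CMD` is an assumption, XML namespaces and schema attributes are left out, and the author name, self link and collection name are placeholders.

```python
import xml.etree.ElementTree as ET


def build_cmdi_skeleton(creator, creation_date, self_link, profile_id,
                        collection=None):
    """Build the Header / Resources / Components skeleton of a CMDI file."""
    # Assumption: "CMD" as root element name; namespaces are omitted here.
    root = ET.Element("CMD")

    # Fixed Header with administrative information.
    header = ET.SubElement(root, "Header")
    ET.SubElement(header, "MdCreator").text = creator
    ET.SubElement(header, "MdCreationDate").text = creation_date
    ET.SubElement(header, "MdSelfLink").text = self_link
    ET.SubElement(header, "MdProfile").text = profile_id
    if collection is not None:
        # Optional but recommended.
        ET.SubElement(header, "MdCollectionDisplayName").text = collection

    # Fixed Resources section; links are added to the ResourceProxyList.
    resources = ET.SubElement(root, "Resources")
    ET.SubElement(resources, "ResourceProxyList")

    # Flexible Components section; its content depends on the profile.
    ET.SubElement(root, "Components")
    return root


root = build_cmdi_skeleton(
    creator="A. Author",                        # placeholder
    creation_date="2012-01-05",
    self_link="http://example.org/record.cmdi",  # placeholder
    profile_id="clarin.eu:cr1:p_1290431694484",
    collection="Example collection",            # placeholder
)
xml_text = ET.tostring(root, encoding="unicode")
```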
OK, so how can you refer to an external file from a CMDI metadata description? That is what the Resources section is for.
In the example CMDI file, the resources section looks like:
<Resources>
  <!-- List of external resource files and (CMDI) metadata files -->
  <ResourceProxyList>
    <ResourceProxy id="a_photo">
      <ResourceType mimetype="image/jpeg">Resource</ResourceType>
      <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
      <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
    </ResourceProxy>
    <ResourceProxy id="a_text">
      <ResourceType mimetype="text/plain">Resource</ResourceType>
      <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
    </ResourceProxy>
    ...
As you can see, for each link to an external resource a ResourceProxy (= file) is added to the ResourceProxyList (= file list). For each ResourceProxy you need to specify the ResourceType (either Resource, the default, or Metadata in case you want to build a hierarchy of CMDI files). With an optional (but very useful) mimetype attribute you can (surprise!) indicate the file's mime type. The ResourceRef contains either a normal URL or a handle PID.
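As a rough sketch of how a program might read such a Resources section, the snippet below lists every ResourceProxy with its type, mime type and reference. The XML is the excerpt from the example with the closing tags filled in so that it parses; namespaces, which a complete CMDI file would normally carry, are assumed to be absent.

```python
import xml.etree.ElementTree as ET

# The excerpt from the example, completed with closing tags.
RESOURCES_XML = """<Resources>
  <ResourceProxyList>
    <ResourceProxy id="a_photo">
      <ResourceType mimetype="image/jpeg">Resource</ResourceType>
      <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
    </ResourceProxy>
    <ResourceProxy id="a_text">
      <ResourceType mimetype="text/plain">Resource</ResourceType>
      <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
    </ResourceProxy>
  </ResourceProxyList>
</Resources>"""


def list_resource_proxies(xml_text):
    """Return one dict per ResourceProxy found in the given XML."""
    root = ET.fromstring(xml_text)
    proxies = []
    for proxy in root.iter("ResourceProxy"):
        resource_type = proxy.find("ResourceType")
        proxies.append({
            "id": proxy.get("id"),
            "type": resource_type.text,                  # Resource or Metadata
            "mimetype": resource_type.get("mimetype"),   # None if not given
            "ref": proxy.findtext("ResourceRef"),        # URL or handle PID
        })
    return proxies


proxies = list_resource_proxies(RESOURCES_XML)
```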
The information that a ResourceProxy can contain (a URL and mimetype) is kept very minimal, on purpose. However you can use any CMDI component to add more details about such a ResourceProxy, using the id attribute.
E.g. in the example CMDI file we can add a textual description of the photo. First the relevant ResourceProxy gets the id "a_photo":
<ResourceProxy id="a_photo">
  <ResourceType mimetype="image/jpeg">Resource</ResourceType>
  <!-- note that both a normal URL and a handle Persistent Identifier can be used for the ResourceRef -->
  <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
</ResourceProxy>
Then, later on in the same CMDI file, we have an explanatory component example-component-photo with a description element:
<example-component-photo ref="a_photo">
  <description>a suitable textual description of this photo</description>
</example-component-photo>
Thanks to the reference from this component to the ResourceProxy with the ref attribute we know that the description relates to the photo.
Note that the id attribute should be unique for each ResourceProxy.
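The two rules above (every id is unique, and every ref points at an existing ResourceProxy) lend themselves to a small consistency check. The following is a sketch under simplifying assumptions: the surrounding `CMD` root element is assumed, namespaces are ignored, and any attribute named `ref` is treated as a reference to a ResourceProxy id.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CMDI_XML = """<CMD>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="a_photo">
        <ResourceType mimetype="image/jpeg">Resource</ResourceType>
        <ResourceRef>hdl:1839/00-0000-0000-0009-3C7E-F</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <example-component-photo ref="a_photo">
      <description>a suitable textual description of this photo</description>
    </example-component-photo>
  </Components>
</CMD>"""


def check_proxy_references(xml_text):
    """Return (duplicate proxy ids, refs that match no proxy id)."""
    root = ET.fromstring(xml_text)
    ids = [p.get("id") for p in root.iter("ResourceProxy")
           if p.get("id") is not None]
    duplicates = sorted(i for i, count in Counter(ids).items() if count > 1)
    known = set(ids)
    dangling = sorted({
        element.get("ref")
        for element in root.iter()
        if element.get("ref") is not None and element.get("ref") not in known
    })
    return duplicates, dangling


duplicates, dangling = check_proxy_references(CMDI_XML)
```

For the example file both lists come back empty; a typo in a ref attribute would show up in the second list.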
Check out the CLARIN specification document: http://www.clarin.eu/files/wg2-4-metadata-doc-v5.pdf
Link from the parent .cmdi file to the child .cmdi file with a ResourceProxy that has the ResourceType Metadata.
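A minimal sketch of how such parent-to-child links could be read back: the function below collects the ResourceRef of every ResourceProxy whose ResourceType is Metadata. The parent document shown is a hand-written approximation for illustration (not the actual content of collection_root.cmdi), with an assumed `CMD` root, invented ids and no namespaces.

```python
import xml.etree.ElementTree as ET

# Hand-written approximation of a parent file with two child CMDI files
# and one ordinary resource; ids are made up for this example.
PARENT_XML = """<CMD>
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="child_1">
        <ResourceType>Metadata</ResourceType>
        <ResourceRef>http://www.clarin.eu/cmd/example/collection/collection_lrt_inventory.cmdi</ResourceRef>
      </ResourceProxy>
      <ResourceProxy id="child_2">
        <ResourceType>Metadata</ResourceType>
        <ResourceRef>http://www.clarin.eu/cmd/example/collection/collection_olac.cmdi</ResourceRef>
      </ResourceProxy>
      <ResourceProxy id="a_text">
        <ResourceType mimetype="text/plain">Resource</ResourceType>
        <ResourceRef>http://www.clarin.eu/sometext.txt</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
</CMD>"""


def child_metadata_links(xml_text):
    """Return the references of all child CMDI files (type Metadata)."""
    root = ET.fromstring(xml_text)
    links = []
    for proxy in root.iter("ResourceProxy"):
        resource_type = (proxy.findtext("ResourceType") or "").strip()
        if resource_type == "Metadata":
            links.append(proxy.findtext("ResourceRef"))
    return links


children = child_metadata_links(PARENT_XML)
```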
All files of this example collection can be accessed and explored via http://www.clarin.eu/cmd/example/collection/
Below is a graphical representation (as shown by Arbil) of the CMDI file hierarchy used above as an example.

- E.g. http://www.clarin.eu/cmd/example/collection/collection_root.cmdi has 2 child collections:
- http://www.clarin.eu/cmd/example/collection/collection_lrt_inventory.cmdi, this file contains in turn links to files like:
- http://www.clarin.eu/cmd/example/collection/collection_olac.cmdi
- http://www.clarin.eu/cmd/example/collection/olac/oai_childes_psy_cmu_edu...
- http://www.clarin.eu/cmd/example/collection/olac/oai_childes_psy_cmu_edu...
- http://www.clarin.eu/cmd/example/collection/olac/oai_childes_psy_cmu_edu...
- http://www.clarin.eu/cmd/example/collection/olac/oai_childes_psy_cmu_edu...

The MdProfile element (in the Header section) contains a unique profile code (e.g.: clarin.eu:cr1:p_1290431694484).
You can find the profile in the component registry with the following URL:
http://catalog.clarin.eu/ds/ComponentRegistry?item=code
e.g. http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:p_1290431694484
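Building that URL from the MdProfile value is a one-liner; the sketch below only wraps it in a function. The function name is made up for this example, and the registry address is the one quoted above.

```python
from urllib.parse import quote

REGISTRY_URL = "http://catalog.clarin.eu/ds/ComponentRegistry"


def profile_url(md_profile):
    """Return the component registry URL for an MdProfile code."""
    code = md_profile.strip()
    # Colons are kept as-is, matching the URL form shown in the text.
    return REGISTRY_URL + "?item=" + quote(code, safe=":")


url = profile_url("clarin.eu:cr1:p_1290431694484")
```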
Technically there is no real difference. A profile is a component that can be converted into an XSD file. A normal component can only be used within other components or profiles and can never be transformed into an XSD.
The isProfile="true" attribute indicates that a CMD_ComponentSpec defines a profile and not just a component.
Yes - bringing all metadata descriptions together ("harvesting"), making them searchable ("indexing") and citeable ("creating persistent identifiers") is an important part of the infrastructure that CLARIN is building. When you provide CMDI metadata CLARIN can harvest it and make it available via the Virtual Language Observatory.
The CMDI core model for web services (and extended documentation) is available at:
http://www.isocat.org/clarin/ws/cmd-core/
Short answer: as indicated in the OAI-PMH protocol, you need to offer all records. There is no automatic harvesting of any of the CMDI child nodes.
Long answer: in the case of our toy example hierarchy (http://www.clarin.eu/faq/3454)
You would need to provide the following files over OAI-PMH:
Providing collection_root.cmdi (or even collection_olac.cmdi and collection_lrt_inventory.cmdi) is not enough, as all OAI harvesters are protocol-agnostic and thus do not know about CMDI’s hierarchy building! CMDI-consuming applications, such as the VLO, also need the physical files locally.
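Since a harvester will not follow the links itself, a provider has to enumerate the whole hierarchy before exposing it. The sketch below flattens a hierarchy into the full list of records. The hierarchy is given as a plain dictionary rather than parsed from files, and the leaf names (`lrt/record_1.cmdi`, `olac/record_a.cmdi`, `olac/record_b.cmdi`) are hypothetical placeholders, not the files of the actual example collection.

```python
def all_records(hierarchy, root):
    """Return every record reachable from root, parents before children."""
    seen = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # a record linked twice is still offered only once
        seen.append(node)
        # Push children in reverse so they are visited in listed order.
        stack.extend(reversed(hierarchy.get(node, [])))
    return seen


# Parent file -> child CMDI files (leaf names are placeholders).
hierarchy = {
    "collection_root.cmdi": [
        "collection_lrt_inventory.cmdi",
        "collection_olac.cmdi",
    ],
    "collection_lrt_inventory.cmdi": ["lrt/record_1.cmdi"],
    "collection_olac.cmdi": ["olac/record_a.cmdi", "olac/record_b.cmdi"],
}

records = all_records(hierarchy, "collection_root.cmdi")
```

Every entry in the resulting list, including the intermediate collection files, would have to be served as its own OAI-PMH record.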
If you have many metadata records or records that frequently change:
- Use OAI-PMH
- Provide them preferably as CMDI (click here for details about how to serve CMDI over OAI-PMH)
- If that is not possible provide them as OLAC
- Send a mail to vlw@clarin.eu to notify us about your OAI-PMH access point
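To give an idea of what a harvester will ask of such an access point, the sketch below builds an OAI-PMH ListRecords request URL. `verb`, `metadataPrefix` and `resumptionToken` are standard OAI-PMH arguments; the base URL is a placeholder and the prefix names `cmdi` and `olac` are assumptions, since the prefix actually offered is up to each provider.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # In OAI-PMH a resumptionToken is an exclusive argument: it is
        # sent without metadataPrefix to fetch the next batch of records.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Placeholder access point; replace with your own base URL.
first_request = list_records_url("http://example.org/oai")
olac_request = list_records_url("http://example.org/oai", metadata_prefix="olac")
```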
If you have only a few static records and setting up an OAI-PMH access point is not feasible:
- Login at the CLARIN website (if you do not have an account yet, create one here)
- Add your records to the LRT inventory:
- for data, corpora, lexica, etc: http://www.clarin.eu/node/add/resource
- for web services and software: http://www.clarin.eu/node/add/tool
- All the records in the LRT inventory will be automatically converted into CMDI records. Note that this process can take a while.
Data provided over OAI-PMH or via the LRT inventory will be made available and searchable via the Virtual Language Observatory.
Send a mail to vlw@clarin.eu and we will incorporate your metadata as soon as possible
If you have old records in DC (or OLAC, a linguistic extension of DC) you can use the following profile:
http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:p_1288172614026
From that profile you can generate the XSD:
And then you can transform your DC XML files into CMDI files that comply with the profile with the following XSLT:
http://www.clarin.eu/cmd/xslt/olac2cmdi.xsl
An example (DC) input file:
http://catalog.clarin.eu/oai-harvester/olac-and-dc-providers/harvested/oai-pmh/olacx/Jo_ef_Stefan_Institute/oai_jsi_e8_JSI_Resource_ELAN.xml
The corresponding (CMDI) output file:
http://catalog.clarin.eu/oai-harvester/olac-and-dc-providers/harvested/results/cmdi/Jo_ef_Stefan_Institute/oai_jsi_e8_JSI_Resource_ELAN.xml
If you have old records in the IMDI format you can use the following profiles:
- for sessions: http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:p_12881...
- for corpus nodes: http://catalog.clarin.eu/ds/ComponentRegistry?item=clarin.eu:cr1:p_12748...
From those profiles you can generate the XSD:
- for sessions: http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/cla...
- for corpus nodes: http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/cla...
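And then you can transform your IMDI files into CMDI files that comply with the profile with the following XSLT:
http://www.clarin.eu/cmd/xslt/imdi2clarin.xsl
An example IMDI input file:
http://corpus1.mpi.nl/IFAcorpus/IMDI/Session_F20N1FY.imdi
The corresponding (CMDI) output file: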
http://www.clarin.eu/cmd/example/example-phonological-corpus.cmdi
No, we are not. It is a term used for gathering metadata descriptions from several locations and storing them in a central database. You can find the results of such a harvesting process at http://catalog.clarin.eu/ds/vlo/
Yes - have a look at http://www.clarin.eu/node/3026
There are indeed issues with searching if people aren't using matching descriptions. Think of someone calling a collection of texts a "text archive", while someone else might be searching for a "text corpus". Or think of all the variants that people can use for one and the same country: the Netherlands, Netherlands, Holland, etc. The same goes for linguistic annotations: "noun" and "substantive" can both be used to describe the same part-of-speech tag. To counter these problems the metadata components contain links to a kind of database that contains atomic concepts (say "country" or "resource type"). We call them data categories. Smart software will later on be able to "see" that if a user searches for nouns, he might also be interested in substantives.
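The idea can be illustrated with a toy lookup: different labels point at the same data category, so a search for one label can be widened to all labels that share its category. The identifiers `DC-noun` and `DC-NL` are invented for this sketch and are not real registry entries; a real system would resolve links to the data category registry instead of using a hard-coded table.

```python
# Hypothetical table: free-text label -> data category identifier.
LABEL_TO_CATEGORY = {
    "noun": "DC-noun",
    "substantive": "DC-noun",
    "the netherlands": "DC-NL",
    "netherlands": "DC-NL",
    "holland": "DC-NL",
}


def expand_query(term):
    """Return all labels that share a data category with the search term."""
    label = term.strip().lower()
    category = LABEL_TO_CATEGORY.get(label)
    if category is None:
        return [label]  # unknown label: search for it unchanged
    return sorted(
        other for other, cat in LABEL_TO_CATEGORY.items() if cat == category
    )


noun_labels = expand_query("Noun")
country_labels = expand_query("Holland")
```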
In a data category registry - a server that can be reached via the internet, both by human users and computer programs.
Yes it does - you can have a look at it via http://www.isocat.org/interface/
Yes, go to the site mentioned above and click through to Public > Athens Core > Metadata
Arbil as CMDI editor
Arbil is indeed a metadata editor with support for CMDI files.
It used to be an editor for IMDI only; the CMDI functionality was added later on (from the beginning of 2010). This means that the support for CMDI files was not as extensive as that for IMDI. However, since release 2.3 of Arbil the support for CMDI has been significantly improved.
We recommend using at least version 2.3.
At the time of writing (2012-01-05) this is the "testing" version.
- Download Arbil (2.3 or higher) and start it
- Go to Options > Templates & Profiles
- Select in "Clarin Profiles" which profile(s) you want to use as the basis for a CMDI file and click on Close
- Right-click on Local corpus, choose Add and select the relevant profile (the CMDI profiles are marked with their own icon)

Some profiles (obvious tests and the ones not intended for manual metadata creation) have been excluded from the default profile list in Arbil (testing). You can see them by disabling the "only load profiles selected for manual editing" option in the Available Templates & Profiles dialog.
By default, if you create a new profile in the component registry, it will show up in Arbil.
As of Arbil 2.3 there is an item 'Insert Manual Resource Location' in the context menu of CMDI instances in the tree.
See http://www.lat-mpi.eu/tools/arbil/manual/ch01s04.html
For CMDI, some additional icons are used and some icons have a slightly different meaning:
- link to an external file (ResourceProxy)
- grouping icon for repeatable elements (more information...)
- root node of a CMDI file
When a component can occur multiple times (= CardinalityMax higher than 1), Arbil automatically groups all occurrences of these components in the CMDI file. You can recognize these by the following properties:

- they have the grey club icon
- the text is shown in grey
- after the node a number indicates how many times the component occurs
For example, take the following fragment:
<ISO639>
  <iso-639-3-code>cat</iso-639-3-code>
</ISO639>
<ISO639>
  <iso-639-3-code>spa</iso-639-3-code>
</ISO639>
Arbil shows this as:

Right click on the file in the "local corpus" panel and select Edit all Metadata


Correct observation. The elements that are optional (= have a CardinalityMin of 0) are not shown by default. You need to add them explicitly. To do this, right click on the CMDI file in the "local corpus" panel and select Add
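Which elements are optional or repeatable is determined by the CardinalityMin and CardinalityMax values in the profile definition. The sketch below reads them from a hand-written component specification. The specification layout (`CMD_Component` and `CMD_Element` entries carrying `name`, `CardinalityMin` and `CardinalityMax` attributes, a default of 1 when an attribute is missing, and `unbounded` as a possible maximum) is an assumption made for this illustration, as are the profile and element names other than ISO639.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified specification; layout and names are assumptions.
SPEC_XML = """<CMD_ComponentSpec isProfile="true">
  <CMD_Component name="example-profile" CardinalityMin="1" CardinalityMax="1">
    <CMD_Element name="title" CardinalityMin="1" CardinalityMax="1"/>
    <CMD_Element name="description" CardinalityMin="0" CardinalityMax="1"/>
    <CMD_Component name="ISO639" CardinalityMin="0" CardinalityMax="unbounded">
      <CMD_Element name="iso-639-3-code" CardinalityMin="1" CardinalityMax="1"/>
    </CMD_Component>
  </CMD_Component>
</CMD_ComponentSpec>"""


def classify_by_cardinality(spec_xml):
    """Return (names with CardinalityMin 0, names with CardinalityMax > 1)."""
    root = ET.fromstring(spec_xml)
    optional, repeatable = [], []
    for node in root.iter():
        if node.tag not in ("CMD_Component", "CMD_Element"):
            continue
        name = node.get("name")
        minimum = node.get("CardinalityMin", "1")   # assumed default
        maximum = node.get("CardinalityMax", "1")   # assumed default
        if int(minimum) == 0:
            optional.append(name)      # must be added explicitly in the editor
        if maximum == "unbounded" or int(maximum) > 1:
            repeatable.append(name)    # occurrences get grouped in the tree
    return optional, repeatable


optional, repeatable = classify_by_cardinality(SPEC_XML)
```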


Please send a mail with your question to cmdi@clarin.eu
Centres
Centres will form the backbone of the persistent and stable CLARIN service infrastructure. For further information we refer to the Requirements Specification Document and to the Short Guides.
Researchers will only use resources and tools/services offered via the web when they are sure that they can still access them over a longer period. Currently, researchers mostly download resources to their own computer first to ensure accessibility, but in the cyberinfrastructure scenario, with its many big resources and collections, this approach is no longer suitable. So availability and accessibility have to be guaranteed by institutions with a clear service-oriented attitude that impose as few restrictions as possible on usage by researchers. Only this new type of centre can give such guarantees.
Can everyone act as a CLARIN centre in the emerging network? No, since centres need to fulfil a number of criteria that are mentioned in the documents. In particular, these centres need to make a commitment statement that they will offer their services for a defined period of time at a certain service level, which depends on the type of service. Setting up centres that adhere to these requirements will cost money, so centres obviously need to have a clear funding basis.
The answer to this question depends on the country, although some services are provided at a European level. For detailed questions about the centres, see www.clarin.eu/centres
Persistent Identifiers (PIDs)
Persistent identifiers are increasingly seen as a core component for the many references we are creating at various levels. These range from references between metadata descriptions and their resources to references between semantic assertions made using RDF (Resource Description Framework). For more information, please read the requirements specification document or the short guide.
In the emerging cyberinfrastructure we are creating more and more references between resources, resource fragments and services. Creating these references is very costly, and they are often essential for the interpretation of a resource. We therefore need proper mechanisms to ensure that these references survive despite all the changes that happen, for example, in repositories. URLs are known to be unsuitable: they are not persistent even when we believe that they are proper URIs. This is where special PIDs come into play, which identify an object and are maintained by reliable institutions.
Handling PIDs is very simple. First you need to register a PID for a resource or service. You do this by providing the required information to the PID service site, in particular the path used to access the resource, such as a URL. In return you receive a PID, which you can enter into the metadata description, for example, so that everyone can use it for referencing. When a user finds such a PID in a resource, he/she can click on this reference, and the service will resolve the PID and give access to (one of the copies of) the resource. Normally, as a user, you don't see the intermediate transactions.
If the PIDs cannot be resolved at a certain moment, one simply cannot access a resource. Think of a situation where hundreds of users are waiting for the resolution of a PID and nothing happens: a nightmare for any cyberinfrastructure scenario! Since this would not be acceptable, we need to make sure that the PID service is based (a) on very robust and reliable software offering sufficient functionality, and (b) on a proper service based on redundant centres with a high availability and persistency guarantee.
CLARIN has an arrangement with the EPIC consortium that CLARIN members will be able to register PIDs and of course resolve them. This consortium groups a number of reliable European service providers that want to participate in providing a redundant service for the research world, i.e. we are speaking about millions of PIDs and a service at very low costs. The service is based on the Handle System which according to our investigations is the only robust system meeting all requirements. No one is obliged to register Handles, but of course CLARIN centres will need to demonstrate that their PIDs can be resolved in a robust manner and offer the required functionality.
PIDs are, as said, unique and persistent identifiers of objects that are made available by proper repositories. For many resources there are additional characteristics, such as multiple copies for preservation reasons, a checksum string (such as MD5) that can be used to check authenticity, simple metadata for citation purposes, a reference to the access permission record, etc. A proper PID system should offer such information immediately when resolving a PID. PURLs can't offer this functionality, and for URNs we do not know of a well-proven and robust resolver, although the big libraries agreed on using URNs for their publications. DOIs are also based on the proven Handle System, and DOI is certainly a proper service, used in particular by the big publishing companies. However, DOI also comes with a business model that will not be acceptable for many research organizations.
A handle consists of 2 parts:
- a prefix (e.g. 1839)
- a suffix (e.g. 00-0000-0000-0009-3C7E-F)
hdl: + prefix + / + suffix
e.g.:
hdl:1839/00-0000-0000-0009-3C7E-F
To resolve such a handle (=make it a clickable link that redirects to the resource itself) use the following formula:
http://hdl.handle.net/prefix/suffix
e.g.: http://hdl.handle.net/1839/00-0000-0000-0009-3C7E-F
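The formula above can be written as a small helper. This is only a minimal sketch in Python: the function name `handle_to_url` is hypothetical, and it assumes that the leading `hdl:` marker may be present or absent in the input.

```python
RESOLVER = "http://hdl.handle.net/"  # resolver base URL taken from the formula above


def handle_to_url(handle: str) -> str:
    """Turn 'hdl:prefix/suffix' (or plain 'prefix/suffix') into a resolver URL."""
    # Drop the optional 'hdl:' marker (assumption: it may be omitted).
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]
    # The first slash separates the prefix from the suffix.
    prefix, sep, suffix = handle.partition("/")
    if not (prefix and sep and suffix):
        raise ValueError(f"not a valid handle: {handle!r}")
    return f"{RESOLVER}{prefix}/{suffix}"


print(handle_to_url("hdl:1839/00-0000-0000-0009-3C7E-F"))
# prints: http://hdl.handle.net/1839/00-0000-0000-0009-3C7E-F
```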
(Answer taken from the ISO citer draft, p. 11)
This International Standard supports different levels of granularity. The following recommendations are designed to encourage efficiency and promote interoperability with other naming schemes:
1) If there is an existing identifier scheme for a type of resources, for instance, ISBN, this level of granularity should be retained, which is to say that no new PIDs should be issued without very good reasons, such as for chapters. Chapters would preferably be addressed using part identifiers in conjunction with the PID of the book.
2) If the resource is associated with the complete content of a digital file, an individual PID should probably be assigned for this resource.
3) If the resource is autonomous and exists outside a larger context, an individual PID should probably be assigned for this resource.
4) If a resource should be citable apart from any containing resource, an individual PID should probably be assigned for this resource.
These recommendations are, however, subject to the needs of resource creators with respect to the level of granularity they deem suitable to the specific resource environment.
The rewriting behaviour of part identifiers can be configured per handle prefix (it can also be done per individual handle, but this is not supported for EPIC at this point). For EPIC (version 1, so with prefix 11858) the choice was made to rewrite @[suffix] to ?[suffix].
So suppose that 11858/1234 resolves to http://clarin.eu then
11858/1234@test=a will be resolved to http://clarin.eu?test=a
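The rewriting rule in this example can be sketched in a few lines of Python. This is an illustration only, not the Handle System's implementation: the `HANDLE_TARGETS` table and the function name `resolve_with_part` are hypothetical stand-ins for the real resolver.

```python
# Hypothetical lookup table standing in for the records held by the handle server.
HANDLE_TARGETS = {"11858/1234": "http://clarin.eu"}


def resolve_with_part(pid: str) -> str:
    """Resolve 'handle@part' by appending '?part' to the handle's target URL."""
    # Everything after the first '@' is treated as the part identifier.
    handle, sep, part = pid.partition("@")
    target = HANDLE_TARGETS[handle]  # raises KeyError for an unknown handle
    return f"{target}?{part}" if sep else target


print(resolve_with_part("11858/1234@test=a"))
# prints: http://clarin.eu?test=a
```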
Please note that when you offer PIDs with part identifiers, you are also responsible for maintaining the part identification fragment. Remember that users will use it to link to your resources and that the resulting end point should always be available.
Intellectual Property Rights and Business Models
CLARIN has already sketched out three types of license that could be used when an individual or an institution contributes resources to the CLARIN Infrastructure: Additionally, there can be other conditions imposed, like non-commercial use, usage report (the end user must report the published articles which use the resources) and grant back (the end user must license back the modified versions of the resources).
Additional details will continue to be worked out during the preparatory phase. The official licensing model will be made available during the construction phase.
To distribute tools and resources to the research community through CLARIN, a Content Provider and a CLARIN Service Provider have to sign a Deposition Licensing Agreement. An End User interested in the tools and resources distributed through CLARIN has to sign a Terms of Service agreement. In addition, the user may need to sign one or more End-User License Agreements to obtain access to individual tools or resources, and a Data User Agreement in case there are additional ethical restrictions on the data.
You do not have to be authenticated to use publicly available resources, even if they are distributed through the CLARIN Infrastructure.
No. In the construction phase, CLARIN will only provide access to the resources. Each resource will be entrusted to CLARIN together with its specific EULA (End-User License Agreement), which you will need to agree to.
Yes, depending on the nature of the end use of the resource, you will have access to resources licensed for academic or commercial purposes.
CLARIN maintains a catalogue of resources. Providers of these resources may be either CLARIN Members or outsiders. However, only resources submitted by CLARIN Members will be integrated into the infrastructure, since a certain amount of effort must be put into the integration. Resources provided by non-members will only be listed in the catalogue, they will not be available to the processing infrastructure.
CLARIN for the Humanities and Social Sciences
Mainly because CLARIN is an infrastructure of language technologies, and the primary users of this technology are the humanities and social sciences. In addition, one of the goals of CLARIN is to make NLP technology easily available, without the need to download, install and get acquainted with all the tools that researchers might need for their work. Finally, CLARIN will also facilitate the wide distribution of the language resources created by HSS researchers.
Yes. At the beginning of 2009 CLARIN organised a call for collaboration with HSS projects. Three HSS projects were selected as a result of this call. CLARIN will collaborate with these projects during their development and offer language resources and tools, as well as advice on how to use the provided language technology to enhance the work of the projects.
It depends on what you want to do.
If you are searching for a specific corpus, you can find it using the VLW or the facetted search tool. The same happens if you are looking for a specific kind of tool.
If you know the processing steps that would solve your problem, you can use the CLARIN repository to assemble a solution yourself (either browsing through the VLW or searching using the facetted search tool).
CLARIN is currently looking for more advanced solutions that would help even the totally innocent HSS researcher (for instance, guiding him along the process of building a solution).
Basic language resource kit - terminology
Annotation
Generally a word, but also an abbreviation. In some agglutinative languages it could be just a morpheme, part of a compound.
Used XML marking:
<tok tid="...">word</tok> or <w wid="...">word</w>.
Parts of speech (POS) seem to occur in every natural language. The usual categories are: noun, verb, article, adjective, pronoun, preposition, adverb, conjunction, etc. Sometimes POS is also taken to mean morphological and syntactic classes.
The canonical form of a word. It represents all the various forms of a morphological paradigm. It is marked on tokens with a specialized attribute, for instance:
<tok id="w10" pos="det" lemma="the">the</tok>
<tok id="w11" pos="n" lemma="wind">winds</tok>
<tok id="w12" pos="prep" lemma="of">of</tok>
<tok id="w13" pos="n" lemma="change">change</tok>
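To show how such lemma annotation might be consumed, here is a minimal Python sketch that reads the fragment above with the standard library. The wrapper element `<s>` and the function name `read_tokens` are assumptions made for the example; the attribute names follow the fragment.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, with straight quotes and wrapped in a single
# root element (<s> is an assumption; well-formed XML needs one root).
SAMPLE = """<s>
<tok id="w10" pos="det" lemma="the">the</tok>
<tok id="w11" pos="n" lemma="wind">winds</tok>
<tok id="w12" pos="prep" lemma="of">of</tok>
<tok id="w13" pos="n" lemma="change">change</tok>
</s>"""


def read_tokens(xml_text: str) -> list[tuple[str, str, str]]:
    """Return (surface form, lemma, POS) for every <tok> element."""
    root = ET.fromstring(xml_text)
    return [(tok.text, tok.get("lemma"), tok.get("pos")) for tok in root.iter("tok")]


for form, lemma, pos in read_tokens(SAMPLE):
    print(form, lemma, pos)
```

The second token illustrates the point of the annotation: the surface form "winds" is mapped to the lemma "wind".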
A group of words acting as a unit surrounding at least one noun. One noun of the group acts as representative - the head. It gives the morpho-syntactic properties of the group.
Common XML notation:
<np np_id="np1" head_id="t3"><tok tid="t1">the</tok><tok tid="t2">black</tok><tok tid="t3">cat</tok></np>.
A complex tag delimiting sentence boundaries.
For instance:
<seg sid="...">This is a sentence.</seg>.
Syntactic descriptions constitute a huge chapter of computational linguistics. A very general classification, however, sees a syntactic description notating either constituents or functional dependencies. In an FDG type of notation, for instance, one possibility is to mark on each word its parent and the type of the syntactic relation to its parent:
<tok id="w10" pos="det" lem="the" link="w11" linktype="det">the</tok>
<tok id="w11" pos="n" lem="wind" link="…" linktype="…">winds</tok>
<tok id="w12" pos="prep" lem="of" link="w11" linktype="mod">of</tok>
<tok id="w13" pos="n" lem="change" link="w12" linktype="pcomp">change</tok>
As words are often polysemous, only the context can define their senses. Accordingly, notations used to disambiguate word senses (WS) indicate the sense in context, as given by specialized repositories (for instance WordNet).
A common XML notation for WS uses an attribute (for instance, 'wsd' or 'sense') complementing a token marking:
<tok tid="..." pos="v" lemma="run" wsd="s1">running</tok> or
<tok tid="..." pos="v" lemma="run" sense="s1">running</tok>.
We say that an anaphor and an antecedent are coreferential if both are text spans (usually NPs) referring to the same discourse entity.
Below is an example of the notation, using an attribute (coref) inside an 'np' element:
<np np_id="np1"><tok tid="t1">John</tok></np>
<tok tid="t2">hit</tok><np np_id="np2">
<tok tid="t3" coref="np1">himself</tok></np>
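As an illustration of how the coref attribute links an anaphor to its antecedent, the following Python sketch follows each coref value back to the 'np' element it names. The wrapper element `<s>` and the function name `coreference_pairs` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, with straight quotes and wrapped in a single
# root element (<s> is an assumption; well-formed XML needs one root).
SAMPLE = """<s>
<np np_id="np1"><tok tid="t1">John</tok></np>
<tok tid="t2">hit</tok><np np_id="np2">
<tok tid="t3" coref="np1">himself</tok></np>
</s>"""


def span_text(elem: ET.Element) -> str:
    """Collect the text of an element and its children, normalising whitespace."""
    return " ".join("".join(elem.itertext()).split())


def coreference_pairs(xml_text: str) -> list[tuple[str, str]]:
    """Return (anaphor text, antecedent text) for every element carrying 'coref'."""
    root = ET.fromstring(xml_text)
    # Index every <np> by its identifier so that coref values can be looked up.
    nps = {np.get("np_id"): np for np in root.iter("np")}
    pairs = []
    for elem in root.iter():
        target = elem.get("coref")
        if target is not None:
            pairs.append((span_text(elem), span_text(nps[target])))
    return pairs


print(coreference_pairs(SAMPLE))
# prints: [('himself', 'John')]
```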
Processes
The process of segmenting texts into word tokens. In all modern languages that use Latin, Cyrillic or Greek writing systems, word tokens are recognized by the delimiting blank or punctuation. Numbers, alphanumerics and special format expressions (dates, measures, abbreviations) are also recognized as tokens, traditionally by using regular expressions. Tokenization in non-segmented languages, such as many Oriental languages, requires more sophisticated algorithms (lexical look-up of longest matching sequences, hidden Markov models, n-gram methods and other statistical techniques).
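The regular-expression approach described above for blank-delimited languages can be sketched as follows. This is a deliberately small Python illustration, not a production tokenizer: the individual patterns (date, number and abbreviation formats) are assumptions chosen for the example, and the name `tokenize` is hypothetical. It does not cover non-segmented languages.

```python
import re

# Alternatives are tried left to right, so special formats come before plain words.
TOKEN_RE = re.compile(
    r"""
    \d{1,2}[./-]\d{1,2}[./-]\d{2,4}   # dates such as 17.03.2008
  | \d+(?:[.,]\d+)*                   # numbers, including 1,000 or 3.14
  | (?:[^\W\d_]\.){2,}                # dotted abbreviations such as e.g. or U.S.
  | \w+(?:[-']\w+)*                   # words, alphanumerics, hyphenated forms
  | [^\w\s]                           # any single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Split text into tokens; whitespace only separates tokens and is dropped."""
    return TOKEN_RE.findall(text)


print(tokenize("The winds of change, 17.03.2008."))
# prints: ['The', 'winds', 'of', 'change', ',', '17.03.2008', '.']
```

Because `\w` matches Unicode letters in Python 3, the same patterns apply to Cyrillic and Greek text as well as Latin.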



