Frequently Asked Questions - Standards

Very good question. A researcher will always face this issue. Research moves a field forward, and in no-man's-land there are no standards (yet); standards lag behind research. Industry always stays on firm ground, and therefore on well-established conventions. Although it may appear that research has no way to make use of standards, it should build on well-established foundations, which should be expressed in standardized form whenever possible. Only at the tip of the arrow, for the really fresh things that have just been invented, should the researcher devise his or her own ad hoc conventions. Applied to linguistic data, this means that an annotated corpus, for example, will contain a mixture of standard and invented markings. CLARIN can be used for the part of the processing that involves existing tools and resources that have been converted to a standard format.

No linguist should be required to read long documents about standards; it is primarily the task of the tool, service and converter developers to provide frameworks that help the researcher and that hide complex formalisms as much as possible.

An open list follows:

  • character encoding: ISO/IEC 10646 (Unicode), UTF-8
  • country codes: ISO 3166
  • language codes: ISO 639-1 and 639-3
  • codes for the representation of names of scripts: ISO 15924
  • text format: XML
  • text format: CSV (comma-separated, with double-quote (") quoting, a header line and preferably a line of ISOcat URIs, one for each column)
  • feature structure representation: ISO 24610-1:2006
  • representation of primary sources: TEI (Text Encoding Initiative)
  • knowledge engineering: RDF, RDF-S, SKOS, OWL
  • audio/speech: PCM (Pulse Code Modulation) for digitizing sound waves; the International Phonetic Alphabet (IPA) of the International Phonetic Association for phonetic transcriptions
  • video/multimodality: Motion JPEG 2000 (lossless) as backend format, MPEG-2 or H.264 for handling and processing
  • annotation of temporal entities: TimeML (part of TC 37/SC 4)
  • morpho-syntactic annotation: MAF (Morpho-syntactic Annotation Framework), ISO/DIS 24611
  • syntactic annotation: SynAF (Syntactic Annotation Framework), ISO/CD 24615
  • lexical annotation: LMF (Lexical Markup Framework), ISO 24613:2008
  • linguistic annotation: LAF (Linguistic Annotation Framework), ISO/DIS 24612
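The CSV convention in the list above (double-quote quoting, a header line, then a line of category URIs) can be illustrated with a minimal Python sketch. The column names and the URIs are placeholders invented for this example; they are not real ISOcat identifiers, which would have to be looked up in the registry. The helper names write_table and read_table are likewise hypothetical.

```python
import csv
import io

# Hypothetical column names and PLACEHOLDER URIs (not real ISOcat entries).
COLUMNS = ["token", "lemma", "pos"]
CATEGORY_URIS = [
    "http://example.org/datcat/token",
    "http://example.org/datcat/lemma",
    "http://example.org/datcat/pos",
]


def write_table(rows, out):
    """Write a header line, a line of category URIs, then the data rows."""
    # QUOTE_ALL wraps every field in double quotes; embedded quotes are doubled.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerow(CATEGORY_URIS)
    writer.writerows(rows)


def read_table(src):
    """Return (column name -> URI mapping, list of row dictionaries)."""
    reader = csv.reader(src)
    header = next(reader)  # first line: column names
    uris = next(reader)    # second line: one category URI per column
    rows = [dict(zip(header, row)) for row in reader]
    return dict(zip(header, uris)), rows


buffer = io.StringIO()
write_table([["houses", "house", "noun"], ['say "hi"', "say", "verb"]], buffer)
csv_text = buffer.getvalue()

column_uris, records = read_table(io.StringIO(csv_text))
```

Keeping the URI line directly under the header lets a reading tool resolve each column to a shared data category without relying on the column name alone.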

CLARIN actively tracks a number of ongoing standardisation activities at two major levels: linguistic structures/formats and linguistic encoding. As an infrastructure project, CLARIN has the duty to evaluate, test and comment on these proposals in close cooperation with the relevant standardisation bodies. Where a clear gap in coverage is identified, CLARIN may take the lead in initiating new standardisation activities. For more information see the CLARIN Standardization Action Plan.

CLARIN does not create linguistic resources; its purpose is to offer rapid access to existing resources and to facilitate their reuse in new contexts. When resources and tools are produced for individual use, interoperability, and therefore the need to adhere to standards or best practices, is of little relevance. The problem of interoperability only emerges when linguists are ready to offer their resources and tools to other researchers. One of the requirements of interoperability is that different resources can be connected to the same tool. This can be achieved by using standards, but that would imply having all the resources standardized (an ideal situation that cannot always be achieved in reality). When needed, a standard can also play the role of a pivot format: resources are converted to the standard before they are used.
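The pivot idea can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the two source formats ("slash" and "tsv"), the in-memory pivot representation (a list of token/pos dictionaries) and the function names are invented here and do not correspond to any CLARIN tool or ISO format.

```python
from collections import Counter


def from_slash_tagged(text):
    """Convert 'word/TAG word/TAG' text to the (hypothetical) pivot form."""
    pivot = []
    for item in text.split():
        # Split on the last slash so tokens containing '/' stay intact.
        token, _, pos = item.rpartition("/")
        pivot.append({"token": token, "pos": pos})
    return pivot


def from_tab_separated(text):
    """Convert 'word<TAB>TAG' lines to the same pivot form."""
    pivot = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        token, pos = line.split("\t")
        pivot.append({"token": token, "pos": pos})
    return pivot


# One converter per source format; tools only ever see the pivot form.
TO_PIVOT = {"slash": from_slash_tagged, "tsv": from_tab_separated}


def count_pos(pivot):
    """An example tool that is written once, against the pivot form."""
    return Counter(entry["pos"] for entry in pivot)


def run_tool(text, source_format):
    """Convert the resource to the pivot form, then apply the tool."""
    return count_pos(TO_PIVOT[source_format](text))


slash_counts = run_tool("the/DET cat/NOUN sleeps/VERB", "slash")
tsv_counts = run_tool("the\tDET\ncat\tNOUN\nsleeps\tVERB\n", "tsv")
```

With a pivot, each resource format needs one converter and each tool is written once, rather than one adapter for every pairing of format and tool.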