Standards and Formats

Basic principles

CLARIN adheres to the following principles:

  • Open standards are preferred over proprietary standards
  • Formats and protocols should be:
    • well-documented
    • verifiable
    • proven (being used in practice)
  • Text-based formats are (where possible) preferred over binary formats
  • When digitising an analogue signal, either no compression or lossless compression is recommended
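
The last principle can be illustrated with a small round-trip check. The following is a minimal sketch, not part of the CLARIN guidance: it assumes a synthetic 16-bit PCM sine wave as a stand-in for a digitised signal, uses Python's standard zlib module as one example of a lossless codec, and uses a hypothetical lossy_roundtrip helper (which simply discards low-order bits) as a crude stand-in for lossy coding.

```python
import math
import struct
import zlib


def synth_pcm(n_samples: int = 8000, rate: int = 8000, freq: float = 440.0) -> bytes:
    """Build a synthetic 16-bit little-endian PCM sine wave (illustrative data only)."""
    samples = [int(32767 * math.sin(2 * math.pi * freq * i / rate)) for i in range(n_samples)]
    return struct.pack(f"<{n_samples}h", *samples)


def lossless_roundtrip(data: bytes) -> bytes:
    """Compress and decompress with zlib; a lossless codec returns the exact input."""
    return zlib.decompress(zlib.compress(data, 9))


def lossy_roundtrip(data: bytes, drop_bits: int = 8) -> bytes:
    """Hypothetical lossy step: zero the low-order bits of every sample."""
    n = len(data) // 2
    samples = struct.unpack(f"<{n}h", data)
    mask = ~((1 << drop_bits) - 1)  # e.g. drop_bits=8 clears the lowest 8 bits
    return struct.pack(f"<{n}h", *(s & mask for s in samples))


if __name__ == "__main__":
    pcm = synth_pcm()
    print("lossless round trip identical:", lossless_roundtrip(pcm) == pcm)
    print("lossy round trip identical:", lossy_roundtrip(pcm) == pcm)
```

Only the lossless path reproduces the original bytes exactly. This is why it is the safer choice for archived recordings: information discarded by a lossy step cannot be recovered later.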

Learning more

Relevant formats

Several CLARIN centres have published excellent guidance on formats suitable for depositing language research data.

Relevant standards

The table below gives a non-exhaustive list of standards relevant to the CLARIN community.

Source: CLARIN standard guidance website (provided by the IDS)

Abbreviation/Name               Topic(s)                                                      Standard body   CLARIN centre(s)
CES                             Generic Corpus Annotation                                     EAGLES
CHAT                            File Formats | Transcription                                  Other
CMDI                            Metadata                                                      ISO             CLARIN-PL, IDS, MPI-PL, UC
Controlled Vocabulary           Controlled Vocabulary | Knowledge Representation | Thesaurus  NISO            CLARIN-PL, MPI-PL
CQLF                            Query                                                         ISO             IDS
DCAM                            Metadata                                                      DCMI
DCMES                           Metadata                                                      DCMI            MPI-PL
DCR                             Data Categorization                                           ISO             CLARIN-PL, IDS, MPI-PL
DiAML                           Markup Language | Semantic Annotation                         ISO
DictionaryEntry-RePresentation  Controlled Vocabulary | Terminology                           ISO
DITA                            Generic Corpus Annotation                                     OASIS
DOL                             Knowledge Representation | Ontology                           ISO
DSSSL                           Formatting | Transformation                                   ISO
Feature structures              Feature Structure                                             ISO             CLARIN-PL
GOLD                            Ontology | Terminology                                        GOLD Community
HTML                            Meta Language                                                 ISO             CLARIN-PL, IDS, MPI-PL
HyTime                          Markup Language                                               ISO
IMDI                            Metadata                                                      Other           CLARIN-PL, MPI-PL, UC
ISBD                            Metadata                                                      IFLA
ISO-Thesauri                    Controlled Vocabulary | Thesaurus                             ISO
ITS                             Data Categorization                                           W3C
JATS                            Generic Corpus Annotation                                     NISO
LAF                             Generic Corpus Annotation                                     ISO             CLARIN-PL
LMF                             Lexical Knowledge                                             ISO             CLARIN-PL, MPI-PL, UC
MAF                             Morpho-syntactic Annotation                                   ISO
METS                            Metadata                                                      LoC
MLIF                            Multilingual data annotation                                  ISO             CLARIN-PL
Multilingual Thesaurus          Controlled Vocabulary | Thesaurus                             IFLA
NISO MIX                        Metadata | Schema                                             LoC
NLM JAITS                       Generic Corpus Annotation                                     Other
OLAC Metadata                   Metadata                                                      OLAC            CLARIN-PL, MPI-PL, UC
OLiA                            Ontology | Terminology
OntoIOp                         Knowledge Representation | Ontology                           ISO
OWL                             Knowledge Representation | Ontology                           W3C             UC
PDF/A                           File Formats                                                  ISO
PISA                            Terminology                                                   ISO             IDS, MPI-PL
RDF                             Knowledge Representation | Metadata                           W3C             CLARIN-PL, IDS, UC
RDF/XML                         Meta Language | Serialization                                 W3C             CLARIN-PL
RDFS                            Constraint Language | Schema                                  W3C
RELAX NG                        Constraint Language                                           ISO             MPI-PL, UC
RTF                             File Formats                                                  Other
SemAF                           Semantic Annotation                                           ISO
SemRoleML                       Markup Language | Semantic Annotation                         ISO
SGML                            Meta Language                                                 ISO
SimpL-1                         Terminology                                                   ISO
SKOS                            Knowledge Representation | Thesaurus                          W3C
SPARQL                          Query                                                         W3C
SRX                             Segmentation                                                  LISA            CLARIN-PL
Structured Vocabulary           Controlled Vocabulary | Knowledge Representation | Thesaurus  BSi             CLARIN-PL
SynAF                           Syntactic Annotation                                          ISO
TBX                             File Formats | Markup Language | Terminology                  ISO
TEI Guidelines                  Generic Corpus Annotation | Markup Language                   TEI             CLARIN-PL, MPI-PL, UC
TextMD                          Metadata                                                      LoC
TimeML                          Markup Language | Semantic Annotation                         ISO             CLARIN-PL
TMF                             Data Categorization                                           ISO
TMS                             Data Categorization | Terminology                             ISO
TMX                             File Formats | Markup Language                                LISA
Topic Maps                      Knowledge Representation                                      ISO
Turtle                          Serialization                                                 W3C
WordSeg                         Generic Corpus Annotation | Segmentation                      ISO
XCES                            Generic Corpus Annotation                                     EAGLES          CLARIN-PL, IDS
XHTML                           Meta Language                                                 ISO             CLARIN-PL, IDS
XML                             Meta Language                                                 W3C             CLARIN-PL, MPI-PL, UC
XMLNS                           Meta Language                                                 W3C             IDS, MPI-PL
XPath                           Markup Language | Transformation                              W3C             CLARIN-PL, IDS, MPI-PL, UC
XQuery                          Markup Language | Query                                       W3C             IDS, MPI-PL, UC
XSD                             Constraint Language                                           W3C             IDS, MPI-PL
XSL-FO                          Formatting                                                    W3C             MPI-PL
XSLT                            Transformation                                                W3C             CLARIN-PL, MPI-PL, UC