Standards and Formats

Basic principles

CLARIN adheres to the following principles:

  • Open standards are preferred over proprietary standards
  • Formats and protocols should be:
    • well-documented
    • verifiable
    • proven (already used in practice)
  • Text-based formats are preferred over binary formats where possible
  • When an analogue signal is digitised, either no compression or lossless compression is recommended
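The last principle can be checked mechanically: a compression step is lossless only if decompressing returns exactly the bytes that went in. The following is a minimal sketch of that check, not a CLARIN-provided tool. It assumes zlib (DEFLATE) as a stand-in for whatever lossless codec a centre actually uses, and the helper names `synthetic_pcm` and `is_lossless_round_trip` are hypothetical.

```python
import math
import struct
import zlib


def synthetic_pcm(num_samples: int = 8000, rate: int = 8000, freq: float = 440.0) -> bytes:
    """Return 16-bit little-endian mono PCM for a sine tone (illustrative data only)."""
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * n / rate))
        for n in range(num_samples)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def is_lossless_round_trip(raw: bytes) -> bool:
    """Compress with zlib and check that decompression restores every byte."""
    compressed = zlib.compress(raw, 9)  # 9 = highest compression level
    return zlib.decompress(compressed) == raw


pcm = synthetic_pcm()
print(len(pcm))                     # 8000 samples * 2 bytes each
print(is_lossless_round_trip(pcm))
```

A lossy codec would fail this byte-for-byte comparison by design, which is why it is unsuitable as an archival master under the principle above.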

Learning more

Relevant formats

Several CLARIN centres have published excellent guidance on formats suitable for depositing language research data.

Relevant standards

The table below contains a (non-exhaustive) list of standards that are relevant for the CLARIN community.

Source: CLARIN standard guidance website (provided by the IDS)

Abbreviation/Name               Topic(s)                                                      Standards body  CLARIN centre(s)
CES                             Generic Corpus Annotation                                     EAGLES
CHAT                            File Formats | Transcription                                  Other
CMDI                            Metadata                                                      ISO             CLARIN-PL, IDS, MPI-PL, UC
Controlled Vocabulary           Controlled Vocabulary | Knowledge Representation | Thesaurus  NISO            CLARIN-PL, MPI-PL
CQLF                            Query                                                         ISO             IDS
DCAM                            Metadata                                                      DCMI
DCMES                           Metadata                                                      DCMI            MPI-PL
DCR                             Data Categorization                                           ISO             CLARIN-PL, IDS, MPI-PL
DiAML                           Markup Language | Semantic Annotation                         ISO
DictionaryEntry-RePresentation  Controlled Vocabulary | Terminology                           ISO
DITA                            Generic Corpus Annotation                                     OASIS
DOL                             Knowledge Representation | Ontology                           ISO
DSSSL                           Formatting | Transformation                                   ISO
Feature structures              Feature Structure                                             ISO             CLARIN-PL
GOLD                            Ontology | Terminology                                        GOLD Community
HTML                            Meta Language                                                 ISO             CLARIN-PL, IDS, MPI-PL
HyTime                          Markup Language                                               ISO
IMDI                            Metadata                                                      Other           CLARIN-PL, MPI-PL, UC
ISBD                            Metadata                                                      IFLA
ISO-Thesauri                    Controlled Vocabulary | Thesaurus                             ISO
ITS                             Data Categorization                                           W3C
JATS                            Generic Corpus Annotation                                     NISO
LAF                             Generic Corpus Annotation                                     ISO             CLARIN-PL
LMF                             Lexical Knowledge                                             ISO             CLARIN-PL, MPI-PL, UC
MAF                             Morpho-syntactic Annotation                                   ISO
METS                            Metadata                                                      LoC
MLIF                            Multilingual data annotation                                  ISO             CLARIN-PL
Multilingual Thesaurus          Controlled Vocabulary | Thesaurus                             IFLA
NISO MIX                        Metadata | Schema                                             LoC
NLM JAITS                       Generic Corpus Annotation                                     Other
OLAC Metadata                   Metadata                                                      OLAC            CLARIN-PL, MPI-PL, UC
OLiA                            Ontology | Terminology
OntoIOp                         Knowledge Representation | Ontology                           ISO
OWL                             Knowledge Representation | Ontology                           W3C             UC
PDF/A                           File Formats                                                  ISO
PISA                            Terminology                                                   ISO             IDS, MPI-PL
RDF                             Knowledge Representation | Metadata                           W3C             CLARIN-PL, IDS, UC
RDF/XML                         Meta Language | Serialization                                 W3C             CLARIN-PL
RDFS                            Constraint Language | Schema                                  W3C
RELAX NG                        Constraint Language                                           ISO             MPI-PL, UC
RTF                             File Formats                                                  Other
SemAF                           Semantic Annotation                                           ISO
SemRoleML                       Markup Language | Semantic Annotation                         ISO
SGML                            Meta Language                                                 ISO
SimpL-1                         Terminology                                                   ISO
SKOS                            Knowledge Representation | Thesaurus                          W3C
SPARQL                          Query                                                         W3C
SRX                             Segmentation                                                  LISA            CLARIN-PL
Structured Vocabulary           Controlled Vocabulary | Knowledge Representation | Thesaurus  BSi             CLARIN-PL
SynAF                           Syntactic Annotation                                          ISO
TBX                             File Formats | Markup Language | Terminology                  ISO
TEI Guidelines                  Generic Corpus Annotation | Markup Language                   TEI             CLARIN-PL, MPI-PL, UC
TextMD                          Metadata                                                      LoC
TimeML                          Markup Language | Semantic Annotation                         ISO             CLARIN-PL
TMF                             Data Categorization                                           ISO
TMS                             Data Categorization | Terminology                             ISO
TMX                             File Formats | Markup Language                                LISA
Topic Maps                      Knowledge Representation                                      ISO
Turtle                          Serialization                                                 W3C
WordSeg                         Generic Corpus Annotation | Segmentation                      ISO
XCES                            Generic Corpus Annotation                                     EAGLES          CLARIN-PL, IDS
XHTML                           Meta Language                                                 ISO             CLARIN-PL, IDS
XML                             Meta Language                                                 W3C             CLARIN-PL, MPI-PL, UC
XMLNS                           Meta Language                                                 W3C             IDS, MPI-PL
XPath                           Markup Language | Transformation                              W3C             CLARIN-PL, IDS, MPI-PL, UC
XQuery                          Markup Language | Query                                       W3C             IDS, MPI-PL, UC
XSD                             Constraint Language                                           W3C             IDS, MPI-PL
XSL-FO                          Formatting                                                    W3C             MPI-PL
XSLT                            Transformation                                                W3C             CLARIN-PL, MPI-PL, UC