Tour de CLARIN: the Czech CLARIN consortium presents Universal Dependencies (UD)

Submitted by karolina@clarin.eu on 22 May 2018

Blog post written by Barbora Hladka, edited by Darja Fišer and Jakob Lenardič

Universal Dependencies (UD) is an open collaboration project in the field of Natural Language Processing (NLP). It is motivated by multi- and cross-lingual research, and its goal is to develop a universal approach to grammatical annotation that is applicable to as many languages as possible. UD is administered by an international team under the supervision of Joakim Nivre. The project has been up and running since the spring of 2014.

UD provides a universal inventory of part-of-speech categories and syntactic relations for consistent cross-linguistic annotation, as well as a number of existing treebanks that are richly annotated with grammatical features. The following picture shows a UD tree structure for the sentence "Mary loves John." Three part-of-speech categories occur in the tree: PROPN (proper noun), VERB (verb), and PUNCT (punctuation). They are connected by four syntactic relations: root (predicate), nsubj (nominal subject), obj (object), and punct (punctuation).
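The same tree can also be written down in CoNLL-U, the tab-separated text format in which the UD treebanks are distributed. The following Python sketch is an illustration only: it assumes the usual UD analysis of the sentence (both nouns and the full stop attached to the verb), and the helper names Token, to_conllu, parse_conllu and describe are hypothetical, not part of any UD tool.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    idx: int      # position of the word in the sentence (1-based)
    form: str     # the word as it appears in the text
    upos: str     # universal part-of-speech category
    head: int     # idx of the governing word, 0 for the root
    deprel: str   # syntactic relation to the head


# Hand-written annotation of "Mary loves John." (assumed analysis):
# (ID, FORM, LEMMA, UPOS, HEAD, DEPREL)
ROWS: List[Tuple[int, str, str, str, int, str]] = [
    (1, "Mary", "Mary", "PROPN", 2, "nsubj"),
    (2, "loves", "love", "VERB", 0, "root"),
    (3, "John", "John", "PROPN", 2, "obj"),
    (4, ".", ".", "PUNCT", 2, "punct"),
]


def to_conllu(rows) -> str:
    """Render rows as CoNLL-U lines with the 10 standard columns."""
    lines = []
    for tid, form, lemma, upos, head, deprel in rows:
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(tid), form, lemma, upos, "_", "_",
                str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines)


def parse_conllu(text: str) -> List[Token]:
    """Read the word lines of a single CoNLL-U sentence."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        tokens.append(
            Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7])
        )
    return tokens


def describe(tokens: List[Token]) -> List[str]:
    """Format each dependency as relation(head, dependent)."""
    forms = {t.idx: t.form for t in tokens}
    forms[0] = "ROOT"  # artificial root node above the predicate
    return [f"{t.deprel}({forms[t.head]}, {t.form})" for t in tokens]


tokens = parse_conllu(to_conllu(ROWS))
for dependency in describe(tokens):
    print(dependency)
```

Each printed line names one relation together with its head and dependent, so the three categories and four relations mentioned above can be read off directly.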

UD is also accompanied by detailed guidelines for carrying out the annotation, with examples from numerous languages. The following picture illustrates the criteria UD uses to recognize nominal modifiers. These criteria often take into account complex grammatical interdependencies known from formal grammar, such as case assignment/checking.

To search the UD treebanks, researchers can use the online PML-TQ (PML Tree Query) service. To annotate their own texts, they can use UDPipe, an automatic UD annotation pipeline with models trained for nearly all the treebanks, which makes it an easy access point to Universal Dependencies. A number of graphical user interfaces for manual UD annotation are also available. One of them is TrEd, a fully customizable and programmable editor and viewer of tree structures developed at the Institute of Formal and Applied Linguistics. The editor, which offers an extension for UD annotation illustrated in the following picture, has been successfully used to annotate thousands of sentences in the Prague Dependency Treebanks.

A new version of the UD treebanks is released every six months. The latest version (2.1) came out at the end of 2017 and comprises 102 treebanks for 60 languages. That is roughly ten times as many treebanks and six times as many languages as in the very first release in 2014, which shows how rapidly new language data is being added. All versions can be downloaded from the LINDAT/CLARIN repository.

Following a period of rapid growth in 2014–2017, LINDAT has organized a series of events dedicated to training, to parsing experiments with UD treebanks, and to discussions of UD-related topics. Among them were a tutorial on UD at the EACL 2017 conference in Valencia, Spain; the first workshop on Universal Dependencies in Gothenburg, Sweden, in May 2017; and the CoNLL 2017 and 2018 Shared Tasks, in which the UD treebanks served as the basis for developing advanced dependency parsers.
