Tour de CLARIN: CINTIL-DependencyBank

Submitted by Karina Berger on 13 August 2021

Written by João Silva

CINTIL-DependencyBank is a corpus of Portuguese utterances annotated with grammatical dependency relations. Roughly speaking, this kind of linguistic information captures the fact that, for a sentence to be grammatical, the occurrence and position of words depend on, and are constrained by, the occurrence and position of other words in the sentence. The annotation is represented in a machine-readable tabular format.
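To make the idea of a tabular dependency annotation concrete, here is a minimal sketch in Python. It assumes a simplified, CoNLL-style layout with one token per line and four tab-separated columns (position, word form, position of the head, relation label). The column layout, the relation labels and the helper names `parse_table` and `dependents` are illustrative assumptions, not the actual format or tag set of CINTIL-DependencyBank.

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int      # 1-based position of the token in the sentence
    form: str       # the word as it occurs in the text
    head: int       # position of the governing word; 0 marks the root
    relation: str   # label of the dependency relation (hypothetical labels)


# Hypothetical annotation of 'Todos os computadores têm um disco'.
SAMPLE = "\n".join([
    "1\tTodos\t3\tquant",
    "2\tos\t3\tdet",
    "3\tcomputadores\t4\tsubj",
    "4\ttêm\t0\troot",
    "5\tum\t6\tdet",
    "6\tdisco\t4\tobj",
])


def parse_table(text: str) -> list[Token]:
    """Read one sentence from the assumed four-column tabular layout."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, form, head, relation = line.split("\t")
        tokens.append(Token(int(index), form, int(head), relation))
    return tokens


def dependents(tokens: list[Token], head_index: int) -> list[str]:
    """Return the forms of all tokens directly governed by `head_index`."""
    return [t.form for t in tokens if t.head == head_index]


tokens = parse_table(SAMPLE)
print(dependents(tokens, 4))  # words that depend directly on the verb
```

With the annotation in this shape, a search for a syntactic pattern (for instance, every verb that governs an object) reduces to filtering rows by head and relation.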

Such annotated corpora are important resources for the study of natural languages and for the development of natural language processing tools. In the former, they support, for instance, concordancing and the search for syntactic patterns in corpora, which are necessary to check whether theory fits with observed data; while in the latter, they are used, for instance, as training and evaluation data in the development of machine learning parsers (such as LX‑DepParser, also presented in this Tour de CLARIN).

The development process of CINTIL-DependencyBank is worth noting, as it sets the corpus apart from other dependency corpora. Generally, the manual annotation of corpora is a very time-consuming process that requires expert knowledge and, for large corpora, it is easy for errors and inconsistencies to occur. Because of this, and because of a general lack of expert annotators, many corpora are automatically annotated and then manually corrected. While this can help, inconsistencies can still easily occur, and the amount of effort required to correct the annotation depends on the quality of the annotation tool.

In the NLX-Group, the Natural Language and Speech Group of the Faculty of Sciences of the University of Lisbon, we have developed LXGram, a symbolic deep processing grammar of Portuguese under the HPSG framework (Pollard and Sag 1994); 'deep' here means that the analysis goes all the way to a semantic representation of meaning. We have used LXGram to support the annotation process. Figure 1 gives a striking impression of the amount of data and the complexity involved in a full deep analysis of what is a relatively simple sentence.

Figure 1: Printout in 6 pt font of the output of LXGram for the sentence Todos os computadores têm um disco ('All computers have a disk'). The hand holding a pen is provided to give a sense of scale.

To use LXGram to support the annotation of corpora, we rely on the fact that the grammar produces all grammatically valid analyses of a sentence, known as the parse forest. Then, instead of having to correct the output of the grammar, the human annotator only has to disambiguate by picking which of the possible analyses in the parse forest is the correct one in that particular case. Note that the annotator does not need to scan all possible analyses one by one, which would be tiresome and error-prone, as parse forests often contain many hundreds of analyses. Instead, the annotation platform that was used, [incr tsdb()] (Oepen 2001), automatically provides a set of discriminants, which are binary decisions (something like 'does this constituent attach at point A or at point B?'). Each choice the annotator makes discards every analysis that is incompatible with it, which can cut the size of the forest roughly in half. Through this process, a parse forest can be reduced to a single analysis with only a few discriminant choices, greatly speeding up annotation. Each sentence is disambiguated by two annotators, following a process of double-blind annotation, with disagreements adjudicated by a third annotator, which greatly reduces errors and improves the consistency of the annotation.
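The discriminant-based disambiguation described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the algorithm implemented in [incr tsdb()]: each analysis is modelled as a set of property labels, a discriminant is any property that holds in some but not all of the remaining analyses, and the next question is chosen as the one that splits the forest most evenly. The property names and the functions `discriminants`, `most_balanced` and `disambiguate` are hypothetical.

```python
from itertools import product


def discriminants(forest):
    """Properties that hold in some, but not all, of the remaining analyses."""
    all_props = set().union(*forest)
    return {p for p in all_props
            if 0 < sum(p in analysis for analysis in forest) < len(forest)}


def most_balanced(forest):
    """Pick the discriminant whose yes/no answer splits the forest most evenly."""
    # Assumes the analyses are pairwise distinct, so at least one discriminant exists.
    return min(sorted(discriminants(forest)),
               key=lambda p: abs(2 * sum(p in a for a in forest) - len(forest)))


def disambiguate(forest, answer):
    """Ask binary questions until one analysis is left; return it and the count."""
    decisions = 0
    while len(forest) > 1:
        prop = most_balanced(forest)
        keep = answer(prop)  # the annotator's yes/no decision
        forest = [a for a in forest if (prop in a) == keep]
        decisions += 1
    return forest[0], decisions


# Toy forest: three independent two-way ambiguities give 2 * 2 * 2 = 8 analyses.
CHOICES = [("pp:A", "pp:B"), ("coord:wide", "coord:narrow"), ("adv:high", "adv:low")]
FOREST = [frozenset(combo) for combo in product(*CHOICES)]

# Simulated annotator who has the intended analysis in mind.
GOLD = frozenset({"pp:B", "coord:wide", "adv:low"})
result, decisions = disambiguate(FOREST, lambda prop: prop in GOLD)
print(sorted(result), decisions)
```

In this toy forest every answer removes exactly half of the remaining analyses, so the eight analyses are resolved in three decisions. Real parse forests are rarely this regular, but the same logic explains why a forest with hundreds of analyses can be resolved with only a handful of choices.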

From the rich deep semantic representation (DeepBank) produced by LXGram we extracted sub-representations, which we term 'vistas' (Silva and Branco 2012). We have thereby extracted not only the DependencyBank presented here, but also a TreeBank (with constituency trees), a PropBank (with semantic roles) and a LogicalFormBank (with semantic representations). Importantly, all these vistas are consistent and aligned among themselves, a major advantage in studying, for instance, the correspondence between constituency and dependency analyses. Figure 2 shows an example of a sentence whose constituency and dependency analyses have both been extracted from the same source deep analysis provided by LXGram.

Figure 2: A view of a sentence in the CINTIL corpus ('Each city is an identity and there are no architecturally finished models'), showing the constituency and dependency vistas.

References:

Oepen, S. 2001. [incr tsdb()]—competence and performance laboratory. User manual. Technical report. Saarland University: Saarbrücken.

Pollard, C., and I. Sag. 1994. Head-driven phrase structure grammar. Chicago University Press and CSLI Publications: Stanford.

Silva, J., and A. Branco. 2012. Deep, consistent and also useful: extracting vistas from deep corpora for shallower tasks. In Proceedings of the Workshop on Advanced Treebanking at the 8th International Conference on Language Resources and Evaluation (LREC'12), 45–52.
