Blog post written by Artūrs Znotiņš, and edited by Darja Fišer and Jakob Lenardič
Working with large volumes of text usually requires multiple linguistic annotation steps, which become increasingly difficult to integrate when they are based on different technologies. NLP-PIPE is a modular toolchain that allows researchers to combine multiple natural language processing tools in a unified framework. It provides the glue code needed to combine tools even if they are written in different programming languages and rely on conflicting library versions. It was created to make the technology more accessible to linguists, and to make creating and integrating new tools easier for researchers and software developers.
NLP-PIPE supports a wide range of annotation services for Latvian, including tokenization, morphological tagging, lemmatisation, universal dependency parsing, and named entity recognition. The easiest way to start using the toolchain is via the online demo version. In the web-based interface, a user simply selects the required processing tools and enters the text they want to annotate. The results can then be viewed either directly on the website (see Figure 1) or exported in several formats.
Figure 1: NLP-PIPE applied to the sentence “In this school year Marisa Butnere from America was studying in the 8th grade of Aizkraukle County gymnasium.” The results of the annotation process are displayed in the CoNLL-U format with standardised columns. The XPOSTAG column contains tags from the Latvian morphological tag set, which is based on the MULTEXT-East format. For example, the tag npfsg5 for the proper noun Aizkraukles in the fourth row translates to n – noun, p – proper, f – feminine, s – singular, g – genitive case, 5 – 5th declension. The results of named entity recognition are visualized as highlighted text spans.
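The positional reading of the npfsg5 tag described in the caption can be sketched in a few lines of Python. This is only an illustration and not part of NLP-PIPE: the function name decode_noun_tag and the position labels (category, type, and so on) are assumptions made for this example, and the lookup tables contain only the values spelled out in the caption, not the full Latvian tag set.

```python
# Illustrative sketch only: a partial decoder for six-character noun tags.
# The position labels and the function name are hypothetical, and each table
# lists only the value mentioned in the Figure 1 caption.
NOUN_POSITIONS = [
    ("category", {"n": "noun"}),
    ("type", {"p": "proper"}),
    ("gender", {"f": "feminine"}),
    ("number", {"s": "singular"}),
    ("case", {"g": "genitive"}),
    ("declension", {"5": "5th declension"}),
]


def decode_noun_tag(tag):
    """Map each character of a positional noun tag to a readable value."""
    if len(tag) != len(NOUN_POSITIONS):
        raise ValueError(f"expected {len(NOUN_POSITIONS)} characters, got {len(tag)}")
    decoded = {}
    for char, (name, values) in zip(tag, NOUN_POSITIONS):
        # Values missing from this partial table are reported, not guessed.
        decoded[name] = values.get(char, f"unknown ({char})")
    return decoded


for name, value in decode_noun_tag("npfsg5").items():
    print(f"{name}: {value}")
```

Each character is interpreted by its position in the tag, which is why the tables are kept in an ordered list. A complete decoder would need the full tag set definition for every part of speech.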
The NLP-PIPE web interface has been successfully used to perform named entity recognition on autobiographical texts, as well as to extract person mentions from an archive of photo descriptions. NLP-PIPE has also been used by CLARIN Latvia to create a multilayer corpus for full-stack natural language understanding (NLU), which is of crucial importance for advancing machine reading comprehension. The tool also allows the annotation results to be post-edited, which helps to create reliable training datasets.
NLP-PIPE is developed at the Institute of Mathematics and Computer Science at the University of Latvia. It is available on GitHub and can be freely used for non-commercial purposes. For more details on NLP-PIPE, see Znotins and Cirule (2018) and Gruzitis and Znotins (2018).
- Znotins A. and Cirule E. NLP-PIPE: Latvian NLP Tool Pipeline. In Proceedings of Human Language Technologies – The Baltic Perspective, IOS Press, 2018.
- Gruzitis N. and Znotins A. Multilayer Corpus and Toolchain for Full-Stack NLU in Latvian. In Proceedings of the CLARIN Annual Conference 2018.