
CLARIN-DK presents the CST Lemmatizer


Lemmatizers generalize over the different forms of a word used in free text and provide its lemma, which is the base or dictionary look-up form. They are therefore basic tools that matter not only for NLP itself, but also for lexicographic work and all text-based studies. They are especially indispensable for morphologically rich languages, where a single lemma can have a large number of word forms, which severely hinders querying or processing all of them in running text.

The CST lemmatizer has been developed over many years and as part of various projects, especially the Danish STO (Jongejan and Haltrup 2005) and the Nordic Tvärsök (Jongejan and Dalianis 2009). While it was initially used as a tool to support Danish lexicographic work, it has gradually been extended with a dynamic self-learning algorithm which learns new lemmatization rules from morphological lexica that contain the relations between word forms and their corresponding lemmas. The lemmatization rules are organized in a decision tree.
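The idea of learning rules from a lexicon of word form and lemma pairs can be illustrated with a minimal Python sketch. It is not the CST lemmatizer's actual algorithm or rule format: it only learns suffix replacements, keeps them in a flat table rather than a decision tree, and uses a toy English lexicon and function names (`learn_suffix_rules`, `lemmatize`) invented for this illustration.

```python
from collections import Counter, defaultdict


def common_prefix_len(a, b):
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def learn_suffix_rules(pairs):
    """Map each word-form ending to the lemma ending it most often becomes."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        k = common_prefix_len(form, lemma)
        counts[form[k:]][lemma[k:]] += 1
    return {old: c.most_common(1)[0][0] for old, c in counts.items()}


def lemmatize(word, rules):
    """Apply the rule with the longest matching ending; else return the word."""
    for old in sorted(rules, key=len, reverse=True):
        if old and word.endswith(old) and len(word) > len(old):
            return word[: len(word) - len(old)] + rules[old]
    return word


# Toy lexicon (hypothetical data, for illustration only).
lexicon = [("cats", "cat"), ("dogs", "dog"), ("boxes", "box"), ("walked", "walk")]
rules = learn_suffix_rules(lexicon)

print(lemmatize("foxes", rules))   # unseen word, handled by the learned "es" rule
print(lemmatize("talked", rules))  # unseen word, handled by the learned "ed" rule
```

Trying the longest ending first is a crude stand-in for the specificity ordering that a decision tree provides: a more specific rule takes precedence over a more general one.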

Unlike other state-of-the-art stemmers and rule-based lemmatizers, the current version of the CST lemmatizer learns lemmatization rules not only from word endings but also from changes elsewhere in the word, and so recognizes a wide variety of derivational patterns, e.g., prefixation, infixation and suffixation. It can therefore deal with languages with different morphological systems. Currently, the CST lemmatizer has been trained on 25 languages. The language-trained versions of the CST lemmatizer available from the Center for Language Technology are listed in Figure 1.
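To make the difference from suffix-only approaches concrete, the following Python sketch applies hand-written pattern rules in which `*` marks an unchanged part of the word, so that a single rule can change a prefix, an infix and a suffix at once. The rule notation, the German example rules and the function names are assumptions made for this illustration; they are not taken from the CST lemmatizer's rule files, and the rules here are written by hand rather than learned.

```python
import re

# Hypothetical rules: (word-form pattern, lemma pattern).
RULES = [
    ("ge*t", "*en"),    # prefix and suffix change: gemacht -> machen
    ("*ä*te", "*a*t"),  # infix and suffix change: Gäste -> Gast
    ("*te", "*en"),     # suffix change only: machte -> machen
    ("*", "*"),         # default: leave the word unchanged
]


def apply_rule(word, pattern, replacement):
    """Return the rewritten word, or None if the pattern does not match."""
    literals = [re.escape(part) for part in pattern.split("*")]
    regex = "^" + "(.+)".join(literals) + "$"
    match = re.match(regex, word)
    if match is None:
        return None
    result = replacement
    for kept in match.groups():
        # Fill the wildcards of the lemma pattern from left to right.
        result = result.replace("*", kept, 1)
    return result


def lemmatize(word, rules=RULES):
    """Apply the matching rule that has the most literal characters."""
    def specificity(rule):
        return len(rule[0].replace("*", ""))

    for pattern, replacement in sorted(rules, key=specificity, reverse=True):
        result = apply_rule(word, pattern, replacement)
        if result is not None:
            return result
    return word


for w in ["gemacht", "Gäste", "machte", "Haus"]:
    print(w, "->", lemmatize(w))
```

Ordering the rules by the number of literal characters is one simple way to let specific rules override general ones; a decision tree achieves the same effect more efficiently when there are many rules.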


Figure 1: The languages for which a trained version of the CST lemmatizer is available.

Danish and English texts can be lemmatized online with the CST lemmatizer. The lemmatizer is also available for download via GitHub. Figure 2 shows the CLARIN-DK web service for the CST lemmatizer, while Figure 3 shows a Danish example sentence that was lemmatized with the tool.


Figure 2: The online CST lemmatizer on CLARIN-DK.


Figure 3: Lemmatization of the Danish sentence Dog, året der er gået, kan også have budt på tunge stunder -- ikke alt er glæde for os alle (“However, the past year can also have provided sad moments – not everything can give happiness to all of us”), which is taken from the 2017 New Year's Eve speech by the Danish Queen.

The CST lemmatizer trained for Danish has been used in many NLP projects, but also outside the NLP community. Frederik Hjorth, a political science researcher at the Department of Political Science, University of Copenhagen, has applied the CST lemmatizer to political speeches as one of the preprocessing steps in a study of how members of the established political parties have addressed the right-wing populists who have been challenging the order of the established political system (Hjorth 2018). The results of the study indicate that young politicians are often willing to engage with the populists, as well as with other politicians across the political spectrum, in the name of democratic freedom (which Hjorth calls the strategy of engagement), while older politicians often describe the populist challengers as morally illegitimate (which Hjorth calls the strategy of disparagement) and refuse to enter into discussion with them.

The CST lemmatizer has also been used for many other languages in various linguistic projects. For example, it was trained on Russian (Sharoff and Nivre 2011) and then used for event identification (Solovyev and Ivanov 2016) and for anaphora and coreference resolution (Toldova et al. 2014).

References

Jongejan, Bart and Dorte Haltrup. 2005. The CST Lemmatiser. Center for Sprogteknologi, University of Copenhagen, version 2.7. http://cst.dk/online/lemmatiser/cstlemma.pdf

Jongejan, Bart and Hercules Dalianis. 2009. Automatic Training of Lemmatization Rules That Handle Morphological Changes in Pre-, in- and Suffixes Alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - ACL-IJCNLP ’09. Vol. 1 Suntec, Singapore: Association for Computational Linguistics p. 145.

Hjorth, Frederik. 2018. Establishment Responses to Populist Challenges: Evidence from Legislative Speech. 2018 Annual Meeting of the Danish Political Science Association. http://fghjorth.github.io/papers/responses.pdf

Sharoff, Serge and Joakim Nivre. 2011. The proper place of men and machines in language technology: Processing Russian without any linguistic knowledge. In Proc. Computational Linguistics and Intelligent Technologies DIALOGUE2011, Bekasovo, 591–604. https://pdfs.semanticscholar.org/36df/5fbe04f425e9b089437e979581d1f5375a94.pdf

Solovyev, Valery and Vladimir Ivanov. 2016. Knowledge-driven event extraction in Russian: corpus-based linguistic resources. Computational Intelligence and Neuroscience, 11 pages. https://doi.org/10.1155%2f2016%2f4183760

Toldova, Svetlana et al. 2014. RU-EVAL-2014: Evaluating Anaphora and Coreference Resolution for Russian. Computational Linguistics and Intellectual Technologies, Vol. 13 (20), pp. 681-694.


Blog post written by Bart Jongejan and Costanza Navarretta, edited by Darja Fišer and Jakob Lenardič.