Tour de CLARIN: Interview with Jack Rueter

Submitted by Jakob Lenardič on 18 June 2021

This interview is with Jack Rueter, who is involved with the SAFMORIL Knowledge Centre. He is a computational linguist whose work primarily focuses on finite-state descriptions of highly endangered languages with complex morphological systems.

Interview by Jakob Lenardič

1. Please introduce yourself – your academic background and current position. What inspired you to become a computational linguist?

My name is Jack Rueter, and I have a doctoral degree in general linguistics from the University of Helsinki. My earlier studies helped me to specialize in Uralic languages, and I have in fact been able to apply this knowledge analogically to other languages of the world, including indigenous languages of the Americas. Over the past decade I have worked as a university researcher in Digital Humanities, which to a great extent has involved finite-state descriptions of highly endangered languages with complex morphological systems, over twenty languages in all: Komi-Zyrian (since 1996), Erzya (since 1999), Olonets-Karelian, Livonian, Moksha, Hill Mari, Tundra Nenets (since 2013), Skolt Saami (2014), etc. This work has been made possible through collaboration with the Giellatekno infrastructure at the Norwegian Arctic University in Tromsø, which hosts well over 150 languages and where much of the Helsinki Finite-State Technology (HFST) has been used in practice, and of course through funding from the Kone Foundation as well as FIN-CLARIN itself, where the technology has been developed.

Ever since I was a little boy, I have been fascinated by foreign languages. The University of Helsinki provided me with the environment and possibility to pursue language learning. It was not until the 1990s, however, that I was introduced to the possibilities of describing languages in a way that could also be used for their facilitation. That was when I made my first description of the Komi-Zyrian language. It all started with creating a lexicon and adding glosses in two languages for a university language class. Subsequently, I introduced regular inflection of nouns and verbs, utilizing the two-level model concept behind HFST under the mentorship of Kimmo Koskenniemi.

When I traveled to the Komi Republic and the Republic of Mordovia in the second half of the 1990s, I was encouraged by Pirkko Suihkonen to gather text corpora for the University of Helsinki Language Corpus Server (UHLCS), which predates FIN-CLARIN and is now one of the many elements available through FIN-CLARIN and the Language Bank of Finland. Collecting language corpora naturally required obtaining releases from the individual authors and the publishers. Learning to speak Komi and Erzya (one of the two Mordvin languages) made it easier to negotiate work with language corpora. It even helped me to establish and develop connections with speakers of Moksha (the other Mordvin language) and Udmurt (a close relative of Komi).

It was this work with finite-state description of languages and corpus collecting that brought me to my dissertation on a portion of Erzya regular morphology and strengthened my close relation with HFST and the Giellatekno infrastructure. Regular morphology is where each incremental part of morphology is directly related to one or more increments of semantics, and the meaning of the resulting word form can be deduced from its morphological structure. While regular morphology in the English language might be equated to four regular forms in verbs and four in nouns, e.g. “to talk” in “talk”, “talks”, “talked” and “talking” and “cat” in “cat”, “cat’s (name)”, “cats” and “cats’ (names)”, regular morphology in Erzya might easily take us to figures of over 200 for nouns and verbs alike.

It can be stated that my work with HFST tools has contributed greatly to the multilingual facilitation of minority Uralic languages. Directly associated projects include mini-paradigms and derivation information for a new Finnish-Skolt Saami dictionary, the FSTs behind the Python libraries in UralicNLP, online dictionaries, spell checkers and keyboards from GiellaLT, and lemmatization for corpora at the Language Bank of Finland and the University of Turku, not to mention Niko Partanen’s work on integrating HFST into ELAN for Komi language forms and, analogously, for other languages.

2. What is your current role at SAFMORIL? How did you get involved?

I am currently continuing the development of morphological descriptions for Finno-Ugric languages, as well as acquiring and developing additional resources for research in these languages in general. This involves acquiring new corpus data, developing Jupyter courses for morphological descriptions using HFST, and has lately also involved efforts to make transliterated speech data available. The ultimate goal should of course be to capture the spoken varieties of a language as variations of the HFST descriptions.

3. Could you briefly present HFST? What does it do? Why is it important for computational linguistics?

The Helsinki Finite-State Transducer (HFST) toolkit is intended for processing natural language morphologies. The toolkit is demonstrated by wide-coverage implementations of a number of languages of varying morphological complexity. HFST can be used via command line tools or a Python interface.

HFST can be used for compiling various formalisms into finite-state transducers and then combining these transducers with lexicons. In other words, linguistic processes are described with a set of rules and combined with the lexicon, and the result is then applied to a text stream at the token, morpheme, word or sentence level. This approach can be used, for example, for morphological analysis and generation, tokenizing and tagging text, fast look-up of strings in a transducer, spell checking, and matching/transformation with an RTN (Recursive Transition Network) system.

From a linguistic perspective, a finite-state description of morphology involves a closed set of headwords or lemmas that are associated with their individual stems and finite sets of affixes (prefixes, infixes or suffixes) to form inflectional paradigms where regular semantic meaning and morphology are conjoined. The structure of these descriptions predominantly follows canonical synchronic and diachronic treatises of an individual language and is therefore open to extensive variation.
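The idea of a closed lexicon combined with finite sets of affixes can be illustrated with a minimal sketch in plain Python. It does not use HFST or any of its formalisms; the lexicon, the tag names and the helper functions `build_pairs`, `analyze` and `generate` are all hypothetical and chosen only to mirror the English “talk” and “cat” forms mentioned earlier. A real description would also need rules for stem alternations, which this toy omits.

```python
# Toy lexicon: lemma -> part of speech (illustrative only, not an HFST lexicon).
LEXICON = {"talk": "V", "cat": "N"}

# Finite suffix sets per part of speech: surface suffix -> tag.
# The tag names are hypothetical and follow no particular tagging standard.
SUFFIXES = {
    "V": {"": "+Inf", "s": "+Prs+Sg3", "ed": "+Pst", "ing": "+Prog"},
    "N": {"": "+Sg", "'s": "+Sg+Poss", "s": "+Pl", "s'": "+Pl+Poss"},
}


def build_pairs():
    """Enumerate every (surface form, analysis) pair the description licenses."""
    pairs = []
    for lemma, pos in LEXICON.items():
        for suffix, tag in SUFFIXES[pos].items():
            pairs.append((lemma + suffix, f"{lemma}+{pos}{tag}"))
    return pairs


PAIRS = build_pairs()


def analyze(form):
    """Map a surface form to all of its analyses (analysis direction)."""
    return sorted(a for f, a in PAIRS if f == form)


def generate(analysis):
    """Map an analysis back to its surface forms (generation direction)."""
    return sorted(f for f, a in PAIRS if a == analysis)


print(analyze("talked"))
print(generate("cat+N+Pl"))
```

The same relation between forms and analyses is read in both directions, which is the property that lets one description serve both analysis and generation.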

Since there might arguably be an infinite number of finite-state descriptions for any given language, it is important that a responsible infrastructure be maintained, such as Giellatekno, where testing and development practices with applied HFST tools guide the language modelers to follow and conform to mutual descriptive notations and structuring. At Giellatekno, where HFST tools have been applied most extensively, Northern Saami is the language that has the longest history of development, hence the most extensive set of applications for language research and facilitation of the Finno-Ugric languages. Among these applications are morphological analysis and generation at the word level and contextual disambiguation at the sentence level.

Word-level analysis contributes to morphologically savvy dictionaries. It allows you to find the meaning of a word without knowledge of the headword or even language-specific alphabetical order; just input the form you have found or click on it in a text. When you have a descriptive analyzer for the linguist to delve into standard literary, archaic, dialectal and even regular formations of nonce words, you also have the makings of orthographic as well as basic learning tools. Here is where context-based disambiguation comes into play.

Sentence-level disambiguation, also a target of continued research, is what addresses the multi-ambiguity generated by robust regular descriptions of the language morphology. With contextual disambiguation implemented, work with spellers and intelligent, computer-assisted language learning tools can be extended. Disambiguated morphological readings for all words of a sentence mean we can hypothesize text-to-speech readings, work with machine translation, and even introduce user group-specific spell-checkers with correction suggestions specific to the individual context. In a similar vein, context awareness means that language learning can introduce contextual morphological exercises and even chatbots.
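As a rough illustration of what contextual disambiguation does with ambiguous word-level readings, here is a small plain-Python sketch. It is not Constraint Grammar and not part of HFST; the readings, the tags and the two context rules in the hypothetical function `disambiguate` are assumptions made up for this example.

```python
# Hypothetical word-level readings, as an analyzer might return them.
READINGS = {
    "the": ["the+Det"],
    "she": ["she+Pron"],
    "talks": ["talk+N+Pl", "talk+V+Prs+Sg3"],
    "ended": ["end+V+Pst"],
}


def disambiguate(tokens):
    """Prune ambiguous readings using the readings of the preceding token."""
    result = []
    previous = []  # surviving readings of the previous token
    for token in tokens:
        readings = list(READINGS.get(token, [token + "+?"]))
        if len(readings) > 1:
            if any("+Det" in r for r in previous):
                # Toy rule 1: after a determiner, discard verb readings.
                kept = [r for r in readings if "+V" not in r]
            elif any("+Pron" in r for r in previous):
                # Toy rule 2: after a pronoun, discard noun readings.
                kept = [r for r in readings if "+N" not in r]
            else:
                kept = readings
            # Never remove the last remaining reading.
            readings = kept or readings
        result.append((token, readings))
        previous = readings
    return result


for token, readings in disambiguate(["the", "talks", "ended"]):
    print(token, readings)
```

The rules only remove readings and never delete the last one, so an imperfect rule leaves ambiguity in place rather than producing an empty analysis.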

In short, a sharing platform of applied HFST tools can and does provide for language researchers and facilitators in tandem with their language communities...

4. What makes Finno-Ugric languages particularly difficult for automatic morphological analysis? How does HFST overcome this?

A majority of the Finno-Ugric languages have few annotated corpora and few or no embeddings. Unlike Hindi or German, these languages are low-resourced; they actually have few or no natural language processing resources. There are, however, limited but feasible resources such as lexical, morphological and syntactic descriptions that have been collected in fieldwork and research over the past two centuries. These have yet to be applied to the facilitation of the languages by their relevant communities.

One hidden asset of HFST is that we actually attempt to utilize previous research results and outcomes. This may mean copying word lists with their research glosses as notes and introducing research paradigms as testing materials. Actually, these can be seen as the first steps of a workflow for working with finite-state descriptions. Here even limited research materials for a language can provide for an extensible approach to language facilitation.

We have noted that Finno-Ugric languages are extremely rich morphologically, which is different from many languages often studied in Natural Language Processing. Using finite-state technology for both a challenging morphology and a simple one alike allows for detailed descriptions of these phenomena while they are being modeled. It also allows the modeler a choice: to avoid or not to avoid any problems ensuing from rare forms that might not be found when creating a model based on annotated corpora alone.

5. What is the importance of applying HFST in other projects focusing on morphological analysis?

HFST is a very useful component in linguistic research, as we customarily need to analyze texts that have not been annotated at all. We want to be able to select our research data on the basis of what is actually useful for our questions and not be limited by what has been annotated. For this reason, technology to annotate materials automatically is very important, and HFST provides means for doing this.

Since developing workflows with HFST might not initially be obvious to all, it is important that we provide an example. When we annotate our materials with HFST, we often find ourselves in a loop where the technology needs to be continuously improved and texts reannotated. In the end, we may also need to do some corrections manually. Hence, how to document and repeat these types of research workflows is also an important question to solve.

This is especially so in projects where HFST is integrated into tools linguists are already using, such as Niko Partanen’s integration work with HFST and the transcription tool ELAN. Solutions for tasks such as this have already been developed, and these methods are currently in use in several universities.
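The annotate, improve and reannotate loop described above can be sketched in a few lines of plain Python. Everything here is a hypothetical stand-in: the analyzer is an ordinary dictionary rather than an HFST transducer, and the function names `annotate` and `coverage` as well as the sample tokens are invented for the illustration.

```python
def annotate(tokens, analyzer):
    """One annotation pass: return the annotations and the unanalyzed tokens."""
    annotations = {token: analyzer.get(token, []) for token in tokens}
    unknown = sorted(token for token, readings in annotations.items() if not readings)
    return annotations, unknown


def coverage(tokens, analyzer):
    """Share of running tokens that receive at least one analysis."""
    known = sum(1 for token in tokens if analyzer.get(token))
    return known / len(tokens)


# A tiny text and a tiny analyzer (a dictionary standing in for a transducer).
text = ["cats", "talk", "cats", "dogs"]
analyzer = {"cats": ["cat+N+Pl"], "talk": ["talk+V+Inf"]}

# First pass: measure coverage and collect what the description is missing.
first_coverage = coverage(text, analyzer)
_, unknown = annotate(text, analyzer)

# Improve the description, then reannotate the same text.
analyzer["dogs"] = ["dog+N+Pl"]
second_coverage = coverage(text, analyzer)

print(first_coverage, unknown, second_coverage)
```

Recording the coverage figure and the list of unanalyzed tokens at each pass is one simple way to document such a workflow so that it can be repeated.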

Of course we see machine learning methods becoming more popular every day, but we think that a variety of approaches can enrich and optimize what we are trying to achieve. Eventually we will be able to train well-functioning neural taggers and parsers for various Uralic languages. Still, tools such as HFST have proven invaluable when we start creating new resources for a new language, and we believe that even in the future, rule-based methods that provide precise, verifiable results and analyses will remain relevant.

6. Could you describe your work together with Niko Partanen on Finno-Ugric languages using HFST? Which resources did you build and what is their intended application, both in computational linguistics and beyond (i.e., wider digital humanities and social sciences)?

As I mentioned, many Finno-Ugric languages do not have very large computational resources. This is of course something we aim to change. One of the larger projects we have been involved in is Universal Dependencies (Czech CLARIN), where we create annotated treebanks in different languages of the world, using essentially the same annotation scheme. This, we hope, allows further comparative work. A treebank is simply a set of materials that also have annotations at the syntactic level, following dependency grammar, where each sentence has a central root element, usually a finite verb, and the other constituents of the sentence are connected to it as leaves.
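A dependency-annotated sentence of this kind can be pictured with a minimal Python sketch. The English sentence, its head assignments and the helper functions `root` and `dependents` are made up for illustration; the relation labels (`det`, `nsubj`, `obj`, `root`) are standard Universal Dependencies labels, and a head value of 0 marks the root, as in the CoNLL-U format used by the project.

```python
# Each token: (id, form, head id, dependency relation); head 0 marks the root.
SENTENCE = [
    (1, "The", 2, "det"),
    (2, "student", 3, "nsubj"),
    (3, "reads", 0, "root"),
    (4, "a", 5, "det"),
    (5, "book", 3, "obj"),
]


def root(sentence):
    """Return the form of the single root token of the sentence."""
    roots = [form for _, form, head, _ in sentence if head == 0]
    if len(roots) != 1:
        raise ValueError("a dependency tree must have exactly one root")
    return roots[0]


def dependents(sentence, head_id):
    """Return the forms of all tokens attached directly to the given head."""
    return [form for _, form, head, _ in sentence if head == head_id]


print(root(SENTENCE))
print(dependents(SENTENCE, 3))
```

Note that only some tokens attach directly to the root; “The” and “a” attach to their nouns, which in turn attach to the verb.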

These materials will be useful not only in computational linguistics, but also for basic linguistic research. Naturally, to show that this can be done is partly our own responsibility, and we are continuously working with the description of various Uralic languages using these tools to prove exactly this.

HFST plays an important role here, because despite the dearth of corpus materials we can make robust descriptions, which as the linguist desires can be utilized or alternatively left out of the analyzer. In work with UD, robust analysis allows us to introduce new extensively annotated corpora for morphologically complex languages which have not been derived from previous corpus projects. Many of the larger majority language projects are basically going through code conversion and annotation alignment, whereas we are starting from scratch. At present Niko and I have dealt extensively with Komi-Zyrian, where we have addressed both literary and dialect language descriptions with one FST. Although the morphophonological description is relatively simplistic from a phonological alternation perspective, Komi does have an extensive combinatory morphology that may even operate on the syntactic level, where verbs can be derived from complex noun phrases, e.g. ‘the student put on a bright red shirt’.

7. Aside from yourself, do you know whether HFST has been successfully used by other researchers?

As I mentioned earlier, the Giellatekno infrastructure at the Norwegian Arctic University in Tromsø is a platform where the HFST tools have been applied quite extensively. Since the primary target languages for facilitation are the minority languages of Northern Norway, i.e. Saami languages and Kven, work with other Uralic languages falls into a different category, but by no means is this other category less facilitated. In fact, it introduces collaboration with the Võro Institute in southern Estonia, the Livonian Institute in Latvia, Mari language research in Vienna and the Mari El Republic, work with the Komi language in the Komi Republic, Karelian research in Eastern Finland and Saami languages in Oulu, as well as initial steps in work with Indigenous Amazon studies in Belém, Brazil.

Work with the linguistic description of the languages of Greenland also brings contributions and feedback to HFST development from the Institute of Language and Communication (ISK), University of Southern Denmark, through work on Constraint Grammar disambiguation of the HFST analyzer output.

Research at the Alberta Language Technology Laboratory, headed by Antti Arppe, applies HFST in its development of tools for several indigenous language families of Canada.

Helsinki Finite-State Technology has also found its way into the development of the open-source GATE DictLemmatizer developed in the universities of Sheffield and Duisburg-Essen.

Further networking with FST technologies also draws our attention to the University of Latvia and Introduction to NLP courses taught there for computer science students.

8. Together with Erik Axelson, you have been developing an online self-study tutorial for morphologically rich languages. Could you present this tutorial? What is its target audience?

The tutorial “Morphologically Rich Languages with HFST” demonstrates how HFST tools can be used for generating finite-state morphologies for morphologically rich languages. It is implemented as Python notebooks which use the HFST Python interface. The tutorial is hosted at CSC - Finnish IT center for science.

This web course is based on a course by Sjur Moshagen and myself organized at the University of Helsinki named “Language Technology for Finno-Ugric Languages - Methods, Tools and Applications”. The course is part of the MA Programme “Linguistic Diversity in the Digital Age”.

This course gives an introduction to mainly rule-based language technology as used in many full-scale, production projects using the GiellaLT and Apertium infrastructures. The technologies and methodologies presented can be used on any language, although the focus is on morphologically complex ones.

The course assumes that the user knows the fundamentals of general linguistics and has basic knowledge of how to use a computer. Some programming experience is desirable, and knowledge of Natural Language Processing (NLP) is also a plus.

(At the moment, access requires a HAKA account. If you do not have a HAKA account, you can contact SAFMORIL Helpdesk to arrange for local accounts or request a visitor account directly from the CSC service desk. You also need a join code that you can request from the SAFMORIL Helpdesk.)