Tour de CLARIN: Interview with Jóhannes Gísli Jónsson

Submitted by Jakob Lenardič on 21 December 2020

Jóhannes Gísli Jónsson is Professor of Icelandic Linguistics at the University of Iceland. He has used the CLARIN-IS Gigaword Corpus and the Icelandic Parsed Historical Corpus for his research into theoretical syntax.

1. Could you please introduce yourself and describe your main research interests?

I am Professor of Icelandic Linguistics at the University of Iceland. My main research interests are theoretical syntax and the syntax-semantics interface. I have mainly worked on Icelandic and Faroese, and also a little bit on Icelandic Sign Language.

2. What has inspired you to research Icelandic syntax by using corpora?

I like to use corpora to get information that goes beyond my native speaker intuitions and raises new questions. Of course, there is no alternative if you are working on older stages of Icelandic for which you do not have any reliable intuitions, but I have found that corpora are also very useful for the study of Modern Icelandic. Still, corpus data must be complemented with experimental data, such as judgement tasks on grammatical constructions that cannot be found in corpora, and you also need a good theoretical framework to make sense of all the data in linguistic corpora.

3. In your research, you have used the Icelandic Gigaword Corpus. Could you summarize some of the work you have done on the basis of this corpus, and explain which features made it crucial for your work?

In the past 16 months or so I have been looking at inversion in Icelandic, i.e. word order where the direct object precedes the indirect object in active clauses. An example of this would be Ég gaf bókina Jóni “I gave the book (to) John”. The new Gigaword Corpus has provided me with a lot of information that I could not have obtained through pure introspection. For instance, it turns out, quite surprisingly, that inversion is much more common with some ditransitive verbs than with others: it is much more common with afhenda “deliver, hand over” than with gefa “give”. Another surprising finding is that, when the indirect object follows the direct object, it is heavier than the direct object in about 90% of the cases in the corpus, in the sense that it contains more words, or a stressed word as opposed to an unstressed pronoun. Thus, the Gigaword Corpus has opened up all kinds of questions that I did not even have at the beginning of this study.

The crucial feature of the Gigaword Corpus for this work is the possibility of searching for strings where the first word is a particular ditransitive verb, followed by some word bearing accusative case and then another word with dative case. Furthermore, the Gigaword Corpus is morphosyntactically tagged, using around 700 different tags. Since Icelandic is a highly inflectional language, a lot of syntactic information can be deduced from morphological tagging, which can compensate for the lack of syntactic parsing. Finally, since the Gigaword Corpus has both an Icelandic and an English user interface, it is also useful for researchers who are not fluent in Icelandic and are not familiar with Icelandic linguistic terminology (which is rather special, since we do not use the Latin-based terms most languages do).
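The kind of search described above can be illustrated with a minimal sketch. It assumes a toy representation in which every token carries a lemma, a coarse category and a case label. The Token type, the label values and the find_inverted function are all hypothetical, and the real Gigaword Corpus tagset and search interface are considerably richer than this.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str   # word as it appears in the text
    lemma: str  # dictionary form
    pos: str    # hypothetical coarse category, e.g. "verb" or "noun"
    case: str   # hypothetical case label: "nom", "acc", "dat", "gen" or ""


def find_inverted(sentence: List[Token], verb_lemma: str) -> List[Tuple[str, str, str]]:
    """Return (verb, accusative word, dative word) triples of adjacent tokens."""
    hits = []
    for i in range(len(sentence) - 2):
        verb, second, third = sentence[i:i + 3]
        if (verb.pos == "verb" and verb.lemma == verb_lemma
                and second.case == "acc" and third.case == "dat"):
            hits.append((verb.form, second.form, third.form))
    return hits


# The example sentence from the interview, tagged by hand for illustration.
sentence = [
    Token("Ég", "ég", "pronoun", "nom"),
    Token("gaf", "gefa", "verb", ""),
    Token("bókina", "bók", "noun", "acc"),
    Token("Jóni", "Jón", "noun", "dat"),
]

print(find_inverted(sentence, "gefa"))
```

Running the same function once per ditransitive verb over a tagged corpus would give the raw counts needed to compare how often inversion occurs with, say, afhenda and gefa.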

4. You have also done research on the basis of the Icelandic Parsed Historical Corpus (IcePaHC). Could you summarize your findings from this work? Which linguistic phenomena were you looking at, and what were you able to establish?

I have used the corpus in a paper on subjecthood in Old Icelandic, where I argued that word order is a reliable subject test in Old Icelandic, even if the word order is freer than in Modern Icelandic. Similarly, Brynhildur Stefánsdóttir and I have used IcePaHC to study the incorporation of prepositions, which refers to leftward syntactic movement across lexical verbs, participles, nouns, and adjectives, in the history of Icelandic. An example of this grammatical process, taken from a 15th-century text in the corpus, can be seen in the clause og hefir Oddur af virðing málunum (“and Oddur gains respect from the affairs”), where the preposition af “from” is displaced from its complement málunum “the affairs” via leftward movement so that the noun virðing “respect” now intervenes between the preposition and its object complement. Such instances of incorporation have disappeared as a productive process in contemporary Icelandic, which is attested by the data in IcePaHC. Furthermore, we have argued on the basis of such examples with prepositional incorporation that Old Icelandic was uniformly a Verb-Object language in terms of word order.

5. Could you discuss the IcePaHC corpus itself? How did you use the corpus to extract the relevant syntactic data? 

IcePaHC is a large diachronic corpus containing Icelandic texts from the 12th century until the early 21st century, so it is a crucial resource for researchers who are interested in historical changes that span many centuries. IcePaHC has excerpts from many different texts from each century, and these texts belong to many different genres. The medieval subpart of IcePaHC is mostly represented by narratives like Old Norse sagas, but biographical, religious, judicial and scientific texts are also included in the corpus. I do, however, wish that there were a syntactically parsed corpus for Old Icelandic with search capabilities similar to those of IcePaHC, which would comprehensively include all the relevant texts from that time period instead of just excerpts.

Iris Edda Nowenstein, who is currently a PhD student in Icelandic linguistics, performed the searches in IcePaHC for my study on word order and subjecthood in Old Icelandic. I was mostly interested in strings where the finite verb immediately precedes two determiner phrases, and finding such strings is easy to do in IcePaHC because it is syntactically parsed.
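To illustrate why syntactic parsing makes such a search easy, here is a minimal sketch that walks a parse tree and reports every finite verb immediately followed by two determiner phrases. The nested-tuple tree format, the labels "CLAUSE", "V-FIN" and "DP", and the verb_dp_dp function are assumptions made for this illustration only. They are not IcePaHC's actual annotation labels or query tools, and the toy tree reuses the Modern Icelandic example from above rather than an Old Icelandic sentence.

```python
from typing import List, Tuple, Union

# A leaf is a word (str); a node is a tuple (label, child, child, ...).
Tree = Union[str, tuple]


def label(tree: Tree) -> str:
    """Return the node label, or an empty string for a bare word."""
    return "" if isinstance(tree, str) else tree[0]


def leaves(tree: Tree) -> List[str]:
    """Collect the words dominated by a node, in order."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def verb_dp_dp(tree: Tree) -> List[Tuple[str, str, str]]:
    """Find sister sequences V-FIN, DP, DP anywhere in the tree."""
    if isinstance(tree, str):
        return []
    hits = []
    children = tree[1:]
    for i in range(len(children) - 2):
        a, b, c = children[i:i + 3]
        if label(a) == "V-FIN" and label(b) == "DP" and label(c) == "DP":
            hits.append(tuple(" ".join(leaves(x)) for x in (a, b, c)))
    for child in children:
        hits.extend(verb_dp_dp(child))  # also search embedded constituents
    return hits


# Hand-built toy tree with a deliberately flat, hypothetical structure.
toy = ("CLAUSE", ("DP", "Ég"), ("V-FIN", "gaf"), ("DP", "bókina"), ("DP", "Jóni"))

print(verb_dp_dp(toy))
```

Because the match is stated over constituents rather than single words, a determiner phrase of any length counts as one unit, which is what a purely tag-based string search cannot guarantee.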

6. How does the Icelandic CLARIN research infrastructure benefit your research community?

Up to now, digital language resources for Icelandic have been scarce and scattered. It is very convenient to be able to use CLARIN-IS as a hub and access all existing resources from there, both online databases (where available) and downloadable resources in the CLARIN-IS repository. Since new resources are constantly being added to the repository, its value for researchers will increase greatly in the future. For instance, a large machine-parsed corpus of Modern Icelandic (The Icelandic Contemporary Treebank) has recently been added. I have not yet had the opportunity to use it in my research, but it is potentially highly valuable for syntacticians.


