Discovering Slovenian Language Structure Using Corpora

Jakob Lenardič
Submitted by Julia Misersky on 14 September 2022

The Project

In recognition of its outstanding quality, Jakob Lenardič’s PhD thesis was recently awarded best thesis of the year 2021/2022 at the Faculty of Arts, University of Ljubljana. What made Lenardič's thesis stand out was the fact that he combined two different approaches: he used a robust theoretical foundation rooted in formalism and paired it with corpus-based methodology more often associated with functionalism.

Lenardič’s thesis is deeply rooted in theoretical linguistics, and the details of his work are difficult to simplify while also doing the theoretical framework justice. From an outsider’s perspective it might be impossible to reflect on the real-world applicability of his work. But in Lenardič’s view, ‘science is inherently good’, and any question is worth exploring, regardless of whether it fits the ‘impact’ buzzword at first sight. Importantly, Lenardič's method is explicit and scientifically robust, and can be reused in other research contexts. The wider applicability of the method may therefore suggest impact that goes beyond Lenardič’s own niche and that is detached from the theoretical approach to grammar.

Apart from its method, Lenardič’s thesis stands out for another reason: it provides an explicit compositional semantics for Slovenian grammatical structure related to the extended verbal domain. For English, many such structures have already been formalised. Consider the distinction between a passive sentence such as 'The door was opened', which necessarily entails some kind of event initiator, and 'The door opened', which has a wider meaning, namely 'The door opened by itself'. Lenardič explored the role that grammatical features play in meaning-making of such sentences. He did so by trying to formally capture a piece of syntactic structure, which likely necessitates the use of specific grammatical features that govern how event initiation is realised both syntactically and semantically (i.e., participle morphemes). In addition, Lenardič also focused on similar sentence constructions in Slovenian, studying grammatical voice, aspectual interpretation, and the interpretation of person and number features. This approach, he explains, has not been taken before in the case of Slovenian, so Lenardič’s thesis ‘is a bit foundational in this sense’.

'Science is inherently good.' 
Jakob Lenardič on the importance of asking questions, regardless of the wider impact


Lenardič holds a BA and MA in English literature and linguistics from the University of Ljubljana. When he started his PhD in 2016, he had little experience with computational linguistics or digital humanities (DH). Alongside his research, he was offered a job with Darja Fišer as an administrative assistant in the Department of Translation in 2016. That same year, Fišer was appointed Director of User Involvement at CLARIN. Over time, working together had an impact not only on Lenardič’s understanding of DH and CLARIN, but also on his linguistic research. Lenardič explains: ‘Even though formally my main job concerned mostly CLARIN-related things such as Tour de CLARIN and CLARIN Resource Families until about 2020, in practice Darja also helped me pursue corpus linguistics research by getting me involved in relevant research projects at the national level, so my role slowly but surely shifted into that of a researcher that does both corpus and theoretical linguistics, often combining the two.’

In a nutshell, Lenardič’s thesis focused on two topics. First, he explored the pronominal system and case assignment in Slovenian. Second, he focused on the syntax-semantics interface of both English and Slovenian in relation to the so-called middle construction, which in English concerns structures like 'The book reads well'.

Lenardič’s work is based on a syntax-only, formalist approach to grammar, which he claims is underrepresented in linguistics at the University of Ljubljana. More functionalist approaches, he feels, can be vague and speculative in describing interactive factors of context, in other words ‘fuzzy when they don’t need to be fuzzy’. In his view, it is a misconception that formalist approaches do not consider context, and he believes that corpus-based approaches to grammar could benefit from taking formal aspects and the associated methodologies into account. His thesis is evidence that combining the two approaches leads to outstanding work.


To explore his research questions, Lenardič used the tools developed at CLARIN.SI, such as the noSketch Engine concordancer, on corpora relevant to his research interests. Specifically, Lenardič went on to investigate two sets: Gigafida, which is the reference corpus for written standard Slovenian, and the corpora of the JANES family, which contain Slovenian computer-mediated communication on platforms such as Twitter and Facebook. Over time, and not least thanks to the expertise at CLARIN.SI, he developed more sophisticated skills using the noSketch Engine concordancer, which were essential for exploring the linguistic structures he was interested in.

Working with corpora was essential, as it helped Lenardič to infer subtle characteristics of Slovenian language structure that he says he would never have figured out ‘by resorting to [...] intuition alone.’ In his view, corpus work requires robust assumptions and should take an advanced approach to querying that goes beyond simple keyword searches, as this is crucial for a highly inflected language with pragmatic word order such as Slovenian.

Future directions - CLARIN and DH

Though his own work has undoubtedly benefited from the CLARIN infrastructure, Lenardič also sees outreach and collaboration as important. CLARIN’s benefits are not limited to the opportunistic use of resources or tools but, he argues, are relevant for the field of DH more broadly. Lenardič has experienced hesitation in the field; not all humanities scholars would agree that DH is the way forward. In Lenardič’s view, DH needs to be integrated at Bachelor and Master level, to ‘get them while they’re young’. This year, Lenardič became involved with a new research programme starting at the University of Ljubljana, which is solely dedicated to Slovenian DH. Led by Darja Fišer, the programme is interdisciplinary but rooted in corpus and computational linguistics and machine learning, and aims to develop six new open-source datasets containing text, speech and images.

'Get them while they're young.' 
Jakob Lenardič on motivating students to engage with digital humanities, so the benefits of CLARIN are not limited to opportunistic use of resources and tools
University of Ljubljana

Lenardič says: ‘In Slovenia, there is a sizable research community which does not seem to be aware of our national consortium and the services and wealth of data that it offers. Funding opportunities such as the Mobility Grants could be especially useful for young researchers.’ To spread the word, he recently led a face-to-face workshop as part of the JTDH 2022 pre-conference programme, which introduced both the CLARIN.SI and CLARIN infrastructures to PhD students in linguistics and the wider humanities. 

Lenardič also plans to continue collaborating with CLARIN in other ways. He says: ‘I hope to continue with the CLARIN Resource Families, which became much broader in scope this year due to the project funding, where a couple of projects are already underway.’

Dr Jakob Lenardič, Faculty of Arts, University of Ljubljana