Tour de CLARIN: Interview with Zrinka Kolaković

Submitted by Jakob Lenardič on 26 November 2021

This interview is with Zrinka Kolaković, who holds a postdoctoral position at the Department of Slavic Languages and Literatures at the University of Klagenfurt. Before that, she worked at the University of Gießen, the University of Regensburg, and the University of Zagreb. Her current and future research interests are mainly focused on clitics and aspect. The interview was additionally edited and proofread by Krystyna Kupiszewska.
Interview by Jakob Lenardič
Please introduce yourself – your academic background and current position.
In 2008, I graduated from the University of Zagreb with a degree in Educational Science (Pedagogy) and Croatian Language and Literature. The profound theoretical linguistic knowledge of my mother tongue that I acquired during my schooling was further expanded during my time as a PhD student not only at the University of Regensburg but also at the University of Zagreb, where I pursued a Cotutelle de Thèse, i.e. joint supervision thesis (the thesis was supervised by Professor Björn Hansen and Professor Zrinka Jelaska).
A vital catalyst that led me to use corpus linguistics for my PhD thesis was a corpus-linguistic course organized at the University of Regensburg by my colleague Edyta Jurkiewicz-Rohrbacher, who is currently a Postdoctoral Assistant at the Department of Slavic Languages and Literatures at the University of Regensburg. Later on, as the main research assistant, I had the opportunity to work with Edyta even more closely in the project HA 2659/6-1 Microvariation of the Pronominal and Auxiliary Clitics in Bosnian, Croatian and Serbian. Empirical Studies of Spoken Languages, Dialects and Heritage Languages, financed by the German Research Foundation (Professor Björn Hansen was the head of the project). In this project, I learned how to design corpus-linguistic experiments in order to empirically and statistically prove or disprove formulated hypotheses. Also, to overcome the drawbacks of corpus-linguistic methods (primarily the problem of negative evidence), Dušica Filipović Đurđević, who now holds an Assistant Professorship at the Department for Psychology at the University of Belgrade, taught me how to create a fully crossed factorial design for a psycholinguistic experiment, to design stimuli and fillers for acceptability judgment tasks, and to conduct experiments. The results of our several corpus studies and seven psycholinguistic experiments, in which we tested 336 Štokavian native speakers from all over Croatia, are already partially published in Slavistics journals like Rasprave and Jazykovedný časopis, and will very soon be fully published by Language Science Press in a book entitled Clitics in the Wild: Empirical Studies on the Microvariation of the Pronominal, Reflexive and Auxiliary Clitics in Bosnian, Croatian and Serbian.
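To illustrate what a fully crossed factorial design involves, here is a minimal Python sketch. The factor names and levels are hypothetical placeholders, not the factors tested in the project; the point is only that every level of each factor is combined with every level of all the others.

```python
from itertools import product

# Hypothetical factors and levels, chosen only for illustration;
# they are not the factors used in the project's experiments.
factors = {
    "clitic_type": ["pronominal", "reflexive", "auxiliary"],
    "clitic_position": ["second_position", "delayed"],
}


def fully_crossed(factors):
    """Return one condition per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


conditions = fully_crossed(factors)
# A 3 x 2 design yields 6 conditions.
for condition in conditions:
    print(condition)
```

In an acceptability judgment task, each such condition would then be realized by several stimulus sentences and interleaved with fillers.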
Work on this project, together with my passion for a better understanding of fascinating clitic phenomena, really made me think about the fundamental principles of empirical research in linguistics and led me to adopt a research philosophy that is based on the triangulation approach to methodology. Currently, when I try to find answers to my research questions, I combine introspective, corpus-linguistic, and psycholinguistic methods. However, I must emphasize that corpus-linguistic methods always hold a central place in my research.
I presently hold a postdoctoral position at the Department of Slavic Languages and Literatures at the University of Klagenfurt. Although currently on maternity leave, I am planning to work on my own project proposal in which I would pursue not only a synchronic but also a diachronic study of so-called phrase splitting in various Slavic languages. The sentence Sestrina mi prijateljica sutra dolazi u posjet, ‘My sister’s friend is going to visit me tomorrow’, is an example of the phenomenon, with the pronominal dative clitic mi inserted between the possessive adjective sestrina and the head noun prijateljica.
Your PhD thesis discussed biaspectual verbs at the crossroads of descriptive theories, prescriptive rules, and actual usage. Could you briefly present the thesis? What was the research question?
In my dissertation, entitled Biaspectual Verbs: the Difference between Description, Prescription and Real Use, I investigated a topic in South Slavistics that is empirically poorly studied: biaspectual verbs (BVs) in Croatian on four language levels. The levels were lexical (with a focus on actionality, i.e. (a)telicity, durativity, dynamics and phasality), morphological (with a focus on the formation of new overtly aspectually marked derivatives and factors that influence affixation of BVs), sentential (with a focus on the usage of BVs and their derivatives in aspectual sentential functions, i.e. the concrete-factual, general-factual, iterative/habitual) and textual (with a focus on the usage of BVs and their derivatives in aspectual taxis functions, i.e. sequence of events and coincidence of events).
I addressed only one research question on the lexical level: whether morphologically stable (without any overtly aspectually marked derivatives; e.g. apstrahirati ‘to abstract’, ilustrirati ‘to illustrate’, veljeti ‘to say’) and unstable (with overtly aspectually marked derivatives; e.g. karakterizirati → okarakterizirati ‘to characterize’, savjetovati → posavjetovati ‘to advise/to counsel’, častiti → počastiti, čašćavati ‘to pay for dinner/lunch/to honour’, eksplodirati → eksplodiravati ‘to explode’) BVs differ on the lexical level. That is, are their actional properties significantly different? On the morphological level, I dealt with five research questions, one of which was whether prefixed derivatives of base BVs are equally present in different corpora of the Croatian language, i.e., corpora reflecting the standard and colloquial language use. I also addressed one research question concerning the sentential and textual levels: whether base BVs and their perfective (henceforth, PFV) derivatives differ significantly in respect of their usage on the sentential level (i.e., their distribution in aspectual sentential functions), and whether base BVs and their PFV derivatives differ significantly in respect of their usage on the textual level, i.e., their distribution in aspectual taxis functions.
How did you go about it methodologically, i.e., what corpus-linguistic methods did you use? Which corpora did you use for your thesis (any CLARIN corpora)?
The corpus-linguistic data were used as the primary data source in studies on the morphological, sentential, and textual levels. While studying BVs on the morphological, sentential, and textual levels, I used two Croatian CLARIN.SI corpora: the web corpus hrWaC and the reference corpus Riznica. First, these corpora were used to find out which BVs do and which do not form prefixed and suffixed derivatives. Then, if it turned out that they do have such derivatives, I established precisely which derivatives are formed this way and how frequent they are in comparison to their base verbs. In addition, the two corpora were used to sample random sentences with BVs and their PFV derivatives to check the differences in their distribution in aspectual sentential functions (i.e., concrete-factual, general-factual, iterative/habitual) and aspectual taxis functions (i.e., sequence of events and coincidence of events). However, I must emphasize that the results obtained on the sentential and textual levels should be complemented either with more corpus-linguistic data or with experimental data, and preferably with both types to allow greater control over factors that were not studied and might possibly turn out to be confounding. Especially on the textual level, i.e., for the usage of BVs and their derivatives in aspectual taxis functions, the obtained results were inconclusive.
Nevertheless, I can happily say that for the very first time I have managed to demonstrate that factors influencing the prefixation of BVs of both Slavic and foreign origin in Croatian can be analysed on the morphological level with the help of advanced statistical analysis, such as the generalized linear mixed model. The application of statistical methods revealed several exciting results. First, the origin of the base biaspectual lemma has a statistically significant impact as a factor on prefixation: BVs of Slavic origin (e.g. savjetovati ‘to counsel’) are more likely to be prefixed (e.g. posavjetovati) than biaspectual borrowings (ilustrirati ‘to illustrate’). Second, prefixation of BVs with a synchronic and/or diachronic prefix (e.g., doručkovati ‘to have breakfast’) and prefixation of BVs that do not have such a prefix (e.g., karakterizirati ‘to characterize’) differ significantly: having a synchronic and/or diachronic prefix, as in the case of doručkovati, has a negative impact on prefixation of BVs. Third, BVs for which suffixed derivatives are attested (e.g., parkirati ‘to park’, vezati ‘to bind’) are more prone to prefixation. Fourth, BVs with different numbers of meanings differ significantly with respect to prefixation, in the sense that the more polysemous BVs (e.g., častiti ‘to pay for dinner/lunch/to honour’, vezati ‘to bind’) seem to be more prone to prefixation. Fifth, prefixation of BVs is more frequent in corpora containing colloquial and unproofread texts than in corpora compiled from texts written in the standard Croatian variety.
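As a rough illustration of what such a model looks like, a generalized linear mixed model with a logit link could be written as below. The exact specification used in the thesis is not given in this interview, so the binary coding of the outcome, the random intercept per base lemma, and all symbol names are assumptions; only the five predictors come from the findings listed above.

```latex
\begin{align*}
y_{ij} &\sim \mathrm{Bernoulli}(p_{ij}) \\
\operatorname{logit}(p_{ij}) &= \beta_0
  + \beta_1\,\mathrm{origin}_j
  + \beta_2\,\mathrm{prefix}_j
  + \beta_3\,\mathrm{suffixed}_j
  + \beta_4\,\mathrm{meanings}_j
  + \beta_5\,\mathrm{corpus}_{ij}
  + u_j \\
u_j &\sim \mathcal{N}(0, \sigma_u^2)
\end{align*}
```

Here $y_{ij} = 1$ if the $i$-th sampled occurrence of base lemma $j$ is a prefixed derivative, the fixed effects $\beta_1, \dots, \beta_5$ correspond to the five factors, and $u_j$ is an assumed lemma-specific random intercept.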
Furthermore, on the sentential level, by analysing the distribution of BVs and their perfective derivatives in aspectual sentential functions, I showed that derivation of overtly aspectually marked derivatives from base BVs could be motivated in some way by aspectual sentential functions. My assumption is that PFV derivatives might be formed to explicitly distinguish between the progressive and the concrete-factual aspectual sentential functions. Specifically, if a BV is uttered in the future or past tense and obvious taxis signals (such as co-temporality) and other cues are missing, the addressee will not be sure whether the speaker’s intention was to convey the progressive or the concrete-factual function (e.g., Analizirat ćemo te tvrdnje ‘We will analyse those claims/We will be analysing those claims’). However, the aspectual vagueness is eliminated when a perfective derivative is employed (Proanalizirati ćemo te tvrdnje ‘We will analyse those claims’). In that case, the addressee can be certain that only the concrete-factual meaning was conveyed.
And finally, in the study of BVs on the lexical level, I was the first to empirically show that the morphological stability of biaspectuality (non-formation of overtly aspectually marked derivatives) is interrelated with actional features, i.e., the lexical aspect of the base BV. Specifically, the meanings of the analysed BVs without attested overtly marked aspectual derivatives have, almost without exception, telic actional properties. Conversely, BVs with overtly marked aspectual derivatives have many more meanings with atelic actional properties.
What were the novel empirical findings that contradicted the prescriptive rules or shed new light on the descriptive theory?
My results strongly suggest that aspectual affixation of BVs cannot be categorized as the formation of mere pleonasms (i.e., redundant forms), as argued in the normative literature. Quite the contrary, the mentioned phenomenon is highly functionally motivated and triggered or suppressed by various factors, of which I only identified the ones mentioned above. The results are (or are about to be) published in the journals Russian Linguistics, Suvremena lingvistika (‘Contemporary Linguistics’) and Zeitschrift für Slavische Philologie (‘Journal of Slavic Philology’) as well as in the book Glagolski aspekt i dvoaspektni glagoli u hrvatskome jeziku: formalno-funkcionalni pristup (‘Verbal Aspect and Biaspectual Verbs in Croatian: A Formal-Functional Approach’).
How has the CLASSLA K-centre helped you in your research? Have you used any specific CLARIN/CLASSLA-related tools (e.g., NoSketch Engine concordancer)? What makes these tools important for researchers working with South Slavic languages?
The CLASSLA K-centre has helped me in many ways. First, in 2013, when I started working on my PhD thesis, I knew that I wanted to compare prefixation of BVs in corpora which reflect standard Croatian language use, and in corpora which contain texts without any external proofreading. This was necessary in order to establish whether the usage of BVs and their derivatives by average users is significantly different from usage by those who strictly follow the norm or have their text externally proofread and, of course, to obtain non-skewed data, which could give me a more precise picture of which factors govern prefixation of BVs in Croatian. In 2013, the CLARIN.SI hrWaC web corpus with a mixture of proofread and non-proofread Croatian language had already been compiled, and it also contained texts from the subdomain with exclusively user-generated non-proofread text. The Croatian National Corpus was also available at that time. Still, I was not entirely happy with it, and I wanted to use an additional corpus representing standard Croatian language usage, Riznica. But at that time, Riznica was neither lemmatized nor morphosyntactically annotated. Moreover, the concordancer via which it was searchable was not really user-friendly, i.e., it did not permit quick queries and sorting of results. Luckily, before I finished and submitted my PhD thesis, Riznica became fully annotated and available at Slovene CLARIN via my favourite NoSketch Engine concordancer. The abovementioned CLARIN corpora were also extremely important and relevant for the project Microvariation of the Pronominal and Auxiliary Clitics in Bosnian, Croatian and Serbian. Empirical Studies of Spoken Languages, Dialects and Heritage Languages. Unfortunately, some corpora that could have been relevant for our work, such as Torlak, appeared after we had to stop the empirical phase of the project and concentrate on finishing the manuscript.
Although until March 2022 the CLARIN South Slavic corpora are also available via Sketch Engine, I personally prefer and always suggest that my students use the NoSketch Engine provided by the Slovene CLARIN instead. One of the advantages is that the number of downloadable example sentences, sorted lemmas, etc. is not restricted to 1,000, whereas in Sketch Engine obtaining more than 1,000 concordances is not free of charge. The COVID pandemic also demonstrated the tremendous value of having corpora available at any time and place. Gathering empirical and big data by linguists is hardly possible otherwise. This was also one of the reasons why, at the Department of Slavic Languages and Literatures at the University of Klagenfurt in 2021, in the first online semester, we decided to welcome Tomaž Erjavec and Nikola Ljubešić from CLARIN.SI/CLASSLA for an invited lecture. And finally, I would like to emphasize that online corpora are especially important for those like me who hold PhD, postdoctoral, and similar non-permanent positions and will one day have to compete for new (hopefully permanent) positions, and even more so for those among us with (small) children. I am very grateful for the work of the whole CLARIN.SI team because I can (try to) conduct empirical research from my home even when I am on maternity leave.
Do you feel that South Slavic languages are under-resourced? Why is the CLASSLA K-centre essential for overcoming this issue for these languages?
When I look back at the time when I started working on my PhD thesis or in the project Microvariation of the Pronominal and Auxiliary Clitics in Bosnian, Croatian and Serbian, I have to say that many new interesting corpora have appeared in the meantime and the existing ones are being monitored, enlarged and better annotated. In our book on clitics there is one whole chapter dedicated to the review of existing BCS corpora, which I will summarize here. If I compare, for instance, the corpora available for Slovene and Russian, their tagging accuracy and the concordancers via which they are searchable, then it is clear that David has won over Goliath. The Russian National Corpus is far smaller than Gigafida and its concordancer does not allow for powerful CQL queries which make it possible to look for syntactic patterns. The ruTenTen is impressive in size, but as a user I am not happy either with its tagging accuracy or with its MSD tagset description, since it does not correspond to actual tags. So when trying to formulate CQL queries for ruTenTen one cannot rely on the MSD tagset, but instead has to look up the actual tags in the corpus first.
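As an illustration of the kind of syntactic pattern a CQL query can capture, the sketch below assembles a query for the phrase-splitting example mentioned earlier in the interview (Sestrina mi prijateljica). It assumes a corpus with `tag` and `word` attributes and MULTEXT-East-style tags in which `As` marks possessive adjectives and `N` marks nouns; as the ruTenTen case shows, the actual tags should be checked in the corpus before relying on them. The helper name `phrase_splitting_query` is hypothetical.

```python
def phrase_splitting_query(clitics, adjective_tag="As.*", noun_tag="N.*"):
    """Build a CQL query: possessive adjective + clitic + noun.

    The default tag patterns are assumptions (MULTEXT-East-style MSDs);
    verify them against the tags actually used in the target corpus.
    """
    clitic_pattern = "|".join(clitics)
    return f'[tag="{adjective_tag}"] [word="{clitic_pattern}"] [tag="{noun_tag}"]'


# Dative pronominal clitics, used here purely as an example set.
query = phrase_splitting_query(["mi", "ti", "mu"])
print(query)
# [tag="As.*"] [word="mi|ti|mu"] [tag="N.*"]
```

Because forms such as mi and ti are homographs of other words, hits from a query like this would still need manual checking.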
Croatian is still lagging behind Slovene; however, the situation is not entirely bad. For instance, there is a diachronic corpus (CroDi) as well as a corpus of spoken Croatian (HrAL). Still, they are neither fully annotated nor available via a user-friendly concordancer such as NoSketch Engine.
In my opinion, Serbian and Bosnian are definitely the most under-resourced South Slavic languages, although the situation with Serbian is slightly better; there is, for instance, the Torlak dialectal corpus (also hosted by CLARIN.SI). The CLASSLA K-centre is essential since it maintains the existing corpora and improves them, handles the construction of new corpora and tools, and strives to put all available tools and corpora under one roof by providing access to them via the free-of-charge, user-friendly NoSketch Engine concordancer.
You are helping (or plan to help) CLASSLA in building the new generation of web corpora for South Slavic languages. Could you describe the goals of this endeavour?
After email communication with Nikola Ljubešić, a group of Croatian linguists who were keen to help had a kind of brainstorming kick-off meeting where we decided to split into two teams. One team is interested in improving tagging accuracy and the other in splitting hrWaC into subcorpora for different registers. I would like to take part in both groups. Regarding the work of the first team, I am especially interested in the MSD tagging accuracy of clitics, but also other (word)forms which have homonym/homograph (word)forms with other clitics or other parts of speech. However, the current version of the CLASSLA ReLDI tagger is more than satisfactory in dealing with these phenomena, especially in comparison to taggers that were used to tag the Corpus of Contemporary Serbian Language (for instance, all instances of the BCS wordform je are tagged only as a verb and none as a pronoun) and the Croatian National Corpus (where for example the tagging accuracy of the word form te requires checking). As for the other team’s tasks, I am most interested in “cleaning” the hrWaC from all user-generated content (not only content from the subdomain but also other kinds of comments, blogs, and fora) by building a separate subcorpus containing only the CMC data, so that it would then resemble the Slovene JANES corpus. Ideally, this should also be done for the Bosnian bsWaC and Serbian srWaC web corpora, since their current versions entirely lack this kind of CMC subcorpus (for Croatian, at least the Forum subcorpus exists).
What can be done (by CLASSLA) to facilitate the creation of new comparable resources in these languages?
I already mentioned that for Bosnian and Serbian, or, more precisely, for the CLARIN.SI-hosted bsWaC and srWaC corpora, large-scale subcorpora with user-generated content are missing (I am familiar with the and corpora, but these are small in size and due to the nature of tweets the language in them is very specific in its syntax and other characteristics). Such content would be very useful, especially for those linguists who are interested in language variation and generally in establishing differences between the prescribed norm and the actual language use. But for Bosnian and Serbian in general, decent, widely available and fully annotated corpora of standard varieties which would be searchable via a user-friendly concordancer are also missing. It would also be nice to have (comparable) dialectal and diachronic corpora of spoken Bosnian, Croatian, and Serbian searchable via NoSketch Engine.