
Tour de CLARIN: Interview with Kristīna Korneliusa

Kristīna Korneliusa is a Ph.D. student and junior lecturer at the University of Latvia. Her research interests include corpus linguistics, systemic functional linguistics, and stylistics.

Please introduce yourself – what is your academic background?

In 2021, I became involved in the LexEcon project as an Erasmus+ trainee. The project was carried out by the University of Pisa, the University of Padua, and the University of Palermo. As part of the Pisa team, I worked on a multilingual corpus of texts on political economy published between 1750 and 1900. I was responsible for the texts published in English between 1841 and 1850: I first conducted bibliographical research and then performed corpus editing, which involved cleaning up errors caused by imprecise OCR. Participating in the project inspired me to pursue my academic journey in corpus linguistics.

Members of the LexEcon team introduced me to sentiment analysis – I later explored this topic in my MA thesis, in which I used the subcorpus that I had created in the project.

When working on my MA thesis, I discovered VARD, a CLARIN-UK tool for dealing with spelling variation in diachronic corpora. I recommended it to other members of the Pisa team, and they are now using it for the further development and analysis of the LexEcon corpus.

After completing my MA studies at the University of Latvia (UL), I applied for the PhD programme there, because my adviser Zigrīda Vinčela encouraged me to continue working on the topic of my MA thesis. It was also my dream to teach at UL one day. I expanded the boundaries of my MA thesis and started exploring not only sentiment, but also subjectivity in language.

I was not satisfied with the tools and resources I had used before, especially the sentiment lexicons – they only list words in isolation and do not take context into account. The tagging in these lexicons is also limited to the three-way distinction between positive, negative, and neutral, which I felt was too general for exploring subjectivity in language.

I also realised that I needed to learn the theoretical background so that I could identify which linguistic features I was interested in and which tools to use to extract them.

So, I turned to theoretical sources and started looking for definitions of subjectivity across different areas of expertise, and in linguistics in particular. This is how I came across Douglas Biber’s work on the dimensions of English. Two of these dimensions relate to subjectivity and caught my interest – overt expression of persuasion on the one hand, and informational versus involved production on the other, where informational refers to written documents that convey factual details on a particular topic, while involved refers to interactive, spontaneous, and (in the majority of cases) spoken texts.


How did you get to know CLARIN? Are you involved with the Latvian CLARIN consortium in any way?

During the first semester of my PhD studies, I signed up for a MOOC called Corpus Linguistics: Method, Analysis, Interpretation, which is offered every year by Lancaster University. There, I was introduced to two CLARIN-UK tools, the #LancsBox software for corpus analysis and the CQPweb concordancer. In addition to these two tools, I now also use the semantic tagger Wmatrix (also a CLARIN-UK tool), which my adviser Zigrīda Vinčela had introduced me to. I later mastered these tools, along with CLAWS and Voyant, in a PhD course taught at UL by my adviser together with Ilze Auziņa, a member of CLARIN’s User Involvement Committee. It was Ilze who introduced me and my coursemates to the wider CLARIN infrastructure.

As one of the few students regularly using #LancsBox, Wmatrix and Voyant in my research and daily work, I was approached by Ilze in spring 2023 about participating in the PhD section of the CLARIN conference, and I was ultimately selected to take part. I would also like to thank our national coordinator Inguna Skadiņa for her support and mentorship in this.

Since September 2023, I have been involved in the project Language Technology Initiative, which also involves CLARIN Latvia. I am a part of the working group of educators developing study courses involving digital technologies. My advisor Zigrīda Vinčela and I are developing study materials for the Master’s study programme course Corpus Linguistics. The tasks we are developing involve CLARIN tools such as #LancsBox, CLAWS and Wmatrix.


Please present the Latvian and American Political and Sports News corpus. How did you build it – what were the corpus creation tools; what were the sources; was anyone else involved in the preparation of the corpus?

Latvian and American Political and Sports News (LAPIS) is a micro-corpus consisting of 24 texts and around 12,000 tokens. I compiled it on my own, simply by creating txt files and uploading them to #LancsBox and Wmatrix. The main source is the English version of the lsm.lv portal (Public Broadcasting of Latvia), from which I manually extracted six samples of political news, based on their date of publication (I selected the most recent ones at the time of compilation), and six samples of sports news focusing specifically on the bronze medal that the Latvian hockey team had won at the 2023 World Championships – an event that was also covered internationally. The American political news was extracted from the Financial Times; the sports news related to our achievement in hockey was extracted from various sources, including Sports Illustrated, Washington Times and NBC Sports.


Is the corpus available in a CLARIN repository – if not, are there any plans to make it available there?

It is not yet available, as LAPIS is only a small subcorpus of what I’m collecting for my PhD research – the final corpus will be larger in scope, expanded with other genres such as TED talks, on the basis of which I’ll be able to compare texts written by native and non-native speakers. I’ll likely upload it to the Latvian repository once this larger corpus is finalised.


How concretely did you avail yourself of CLARIN tools for the preparatory extraction?

I used the CLARIN-UK tool #LancsBox to extract features relevant to Biber’s aforementioned informational vs. involved dimension. The features were extracted from my political and sports news corpus LAPIS. Apart from basic syntactic categories, such as prepositions, nouns, and adjectives, the tool allowed me to extract fine-grained morphosyntactic features, such as first and second person pronouns, the use of the auxiliary ‘do’ in elided verb phrases (e.g., he did that too), and contracted forms. These features are among those that are more predictive of involved texts, which are more interpersonal and perhaps less formal than informational ones. The tool is also easy to use – for instance, to extract second person pronouns, I simply specified the regular expression ‘you|your|yourself|yours|thou|thee|thine|thy|thyself’, or ‘I|my|me|myself|we|our|ours|ourselves+mine P.* + us P.*’ for first person pronouns.
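To make this kind of regex-based feature extraction concrete, here is a minimal Python sketch. It only mirrors the general approach described in the interview, not #LancsBox’s own query syntax; the function name and the sample sentence are invented for illustration:

```python
import re

# Second person pronoun forms, as in the alternation pattern quoted above.
# \b word boundaries keep the match to whole tokens; re.IGNORECASE covers
# sentence-initial capitalisation.
SECOND_PERSON = re.compile(
    r"\b(?:you|your|yourself|yours|thou|thee|thine|thy|thyself)\b",
    re.IGNORECASE,
)

def count_second_person(text: str) -> int:
    """Count second person pronoun tokens in a raw text string."""
    return len(SECOND_PERSON.findall(text))

# Invented example sentence: matches 'You', 'your', and 'you'.
sample = "You said that your team did that too, didn't you?"
print(count_second_person(sample))  # → 3
```

A corpus tool applies such patterns over tokenised, POS-tagged text rather than raw strings, which is what makes queries like the first person pattern with POS constraints possible.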

There are some things that I wasn’t able to do, however. One pertains to cases where the #LancsBox tagset does not define a category that would otherwise be relevant for the informational vs. involved dimension, such as discourse particles. Another pertains to the lack of syntactic annotation: it was impossible to accurately extract prepositions that are separated from their nominal complements (who … to?), a construction that is less formal than the variant where the preposition and the wh-word are directly adjacent (to whom?). Lastly, it was difficult to extract constructions with ‘deleted’ features (e.g., the optional deletion of the subordinator ‘that’ in dependent statements, such as ‘John said (that)’, which is assumed to be more characteristic of less formal language).


How concretely did you avail yourself of CLARIN tools for the semantic tagging?

My first semantic tagging experience was with the CLAWS demo tool, but it did not allow me to upload large volumes of texts. Wmatrix, which largely uses the same syntax as CLAWS, does not have this problem, and is the best semantic tagging tool I have discovered so far.

With Wmatrix, I compared the use of emotional expressions and expressions denoting psychological processes in four subcorpora: Latvian politics (texts from LSM portal), Latvian sports (again the LSM portal), US politics (Financial Times), and US sports (NBC Sports, Sports Illustrated, and Washington Times).

What I was able to show, for instance, is that the US politics subcorpus contains a greater number of expressions of emotions and psychological processes than the Latvian politics subcorpus: the average number of emotional expressions was 26.31 tokens in the Latvian subcorpus, compared to 104.67 in the US one. This presumably has to do with the fact that the articles in the Financial Times aim to provide an analysis of events rather than simply report them (which means that they constitute ‘involved’ rather than ‘informational’ discourse in Biber’s terms). By contrast, the Latvian sports subcorpus contains more expressions of emotion than the US sports subcorpus, presumably because the event described in the corpus – the victory of the Latvian hockey team at the 2023 World Championships – was such an important and joyful event for our country.


Do you see any room for improvement for the tools that you’ve used?

The tools that I have used – that is, #LancsBox and Wmatrix – are not designed to perform statistical analysis, so you cannot use them to calculate the weight of each linguistic feature, which would show how generalisable the feature is from the selected group of texts. In order to do the calculations, you have to build a correlation matrix of all the features, and there is software specifically designed for that. The applications I have tried are not a part of the CLARIN infrastructure, and they also have a serious drawback – they are designed specifically for Biber’s dimensions. They do not allow me to add my own linguistic features, for instance my own dimension of subjectivity that I am working on and that does not wholly correspond to any of the dimensions in Biber’s original proposal.
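The correlation-matrix step mentioned above can be sketched in plain Python. This is only an illustration of the idea, assuming hypothetical normalised per-text feature counts – the feature names and all numbers below are invented:

```python
from math import sqrt

# Hypothetical normalised counts of three linguistic features in five texts
# (rows = texts). Real multidimensional analysis would use many more
# features and texts.
texts = [
    {"nouns": 12.0, "first_person": 3.0, "contractions": 7.0},
    {"nouns": 15.0, "first_person": 1.0, "contractions": 4.0},
    {"nouns": 9.0,  "first_person": 6.0, "contractions": 9.0},
    {"nouns": 14.0, "first_person": 2.0, "contractions": 5.0},
    {"nouns": 10.0, "first_person": 5.0, "contractions": 8.0},
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

names = ["nouns", "first_person", "contractions"]
columns = {n: [t[n] for t in texts] for n in names}

# Correlation matrix over all feature pairs: the input to the factor
# analysis that groups co-occurring features into dimensions.
matrix = {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

print(round(matrix[("nouns", "first_person")], 2))  # → -0.99 in this toy data
```

In Biber-style multidimensional analysis, such a matrix is then fed into a factor analysis, so that features which co-occur across texts cluster into interpretable dimensions.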

I would therefore be happy to find a multidimensional analysis tool in the CLARIN infrastructure that would allow me to add new dimensions and linguistic features, and would help me to do the necessary statistical calculations.

Features that I would like to see added to Wmatrix specifically include tags related to sentiment (for instance, capturing the difference in connotation between the near synonyms unpleasant and horrible), a distinction between comparative and superlative forms of adjectives, and a more fine-grained classification of modals, such as the distinction between epistemic and non-epistemic uses.


What are your plans for future work, especially regarding the use of CLARIN tools and resources?

I am creating other subcorpora, and plan to create new ones, which I expect to analyse in my research on subjectivity. Once the corpus is finished, I intend to publish it in our national CLARIN repository.

I will continue exploring semantic tagging in Wmatrix – although it lacks some features, as mentioned above, the tool is great for grouping words that belong to the same semantic fields. I hope that I will be able to incorporate semantic tagging into multidimensional analysis – currently I use these two methods separately.

I am also eager to start exploring other tools for multidimensional analysis and sentiment analysis provided by CLARIN. So far I have encountered several sentiment analysis tools in the context of the CLARIN Resource Families, but they are all for languages other than English, while the Etuma Customer Feedback Analysis tool, which does cover English, is limited to analysing customer feedback specifically. I am nevertheless curious to see how it compares to tools such as Wmatrix for researching subjectivity.


Read the introduction to the CLARIN-LV B-centre