Using a Monitor Newspaper Corpus to Trace Changing Language as a Result of COVID-19

Koenraad De Smedt
Submitted by Karina Berger on 9 December 2021

The Project

This project illustrates how a monitor newspaper corpus can be used to trace, almost in real time, changes in language in response to a crisis. The study ‘Contagious “Corona” Compounding by Journalists in a CLARIN Newspaper Monitor Corpus’ examines the linguistic changes that occurred in Norwegian during the first wave of the COVID-19 pandemic.

In response to the dramatic developments of early 2020, the vocabulary expanded suddenly within a very short period of time, most notably through the creation of new compounds with the stem corona (or its spelling variant korona). Based on up-to-date CLARIN newspaper monitor corpora covering 9 January 2020 to 8 March 2021, this study tracks and analyses these new compounds to determine both the number and the frequency of new creations. It also traces the spelling change from corona to korona after the Language Council of Norway normalised the spelling in January 2020.

‘The pandemic provided an exceptional opportunity to demonstrate the use of this CLARIN monitor newspaper corpus.’
Koenraad De Smedt



This study used the Norwegian Newspaper Corpus as its data source. All occurrences of words starting with corona/korona in the period from 9 January 2020 to 8 March 2021 were retrieved using the Corpuscle corpus management and search system and downloaded as a tab-separated file with keywords, newspaper codes and dates. After some forms and errors had been removed, the final word list contained 167,957 tokens. Pre-processing, analysis and plotting were performed with a shell script that called programs in Awk, Python and R.
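As a rough illustration of the pre-processing step, the following Python sketch reads a tab-separated export and keeps only words that start with corona or korona. It is not the study's actual script. The column order (keyword, newspaper code, date), the ISO date format, the newspaper codes and the sample rows are all assumptions made for this example.

```python
import csv
import io
from collections import Counter
from datetime import date

# Hypothetical sample in an assumed column order: keyword, newspaper code, date.
# The newspaper codes and rows are invented for illustration only.
SAMPLE_TSV = (
    "coronavirus\tAP\t2020-01-09\n"
    "coronaviruset\tAP\t2020-01-09\n"
    "Koronatelefon\tVG\t2020-03-12\n"
    "koronafast\tDB\t2020-03-20\n"
    "koronafast\tVG\t2020-03-21\n"
)

def read_hits(handle):
    """Parse rows into (word, newspaper, date) tuples, lower-casing the word."""
    hits = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue  # skip malformed lines
        word, paper, day = row
        word = word.strip().lower()
        # Keep only words built on the stem corona/korona.
        if word.startswith(("corona", "korona")):
            hits.append((word, paper.strip(), date.fromisoformat(day.strip())))
    return hits

hits = read_hits(io.StringIO(SAMPLE_TSV))
token_count = len(hits)                              # number of tokens
type_counts = Counter(word for word, _, _ in hits)   # frequency per word form
```

With a real export, `io.StringIO(SAMPLE_TSV)` would be replaced by an opened file, and the column handling would need to match the actual download format.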

As no pre-made tools are currently available for building a timeline, this part of the programming was done by hand. The resulting extensive script produced the graphics.
Cumulative increase of the corona compound vocabulary.
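A cumulative vocabulary timeline of this kind can be sketched in a few lines of Python. This is a minimal sketch and not the script used in the study. The observations below are invented, and spelling variants with c- and k- are counted as distinct word forms, which is an assumption of this example.

```python
from datetime import date

# Hypothetical (word, date) observations; not data from the study.
observations = [
    ("coronavirus", date(2020, 1, 9)),
    ("coronaviruset", date(2020, 1, 9)),
    ("coronavirus", date(2020, 2, 26)),
    ("koronatelefon", date(2020, 3, 12)),
    ("koronafast", date(2020, 3, 12)),
    ("koronatelefon", date(2020, 3, 13)),
]

def cumulative_vocabulary(observations):
    """Return [(day, distinct words seen up to and including that day)]."""
    seen = set()
    timeline = []
    for word, day in sorted(observations, key=lambda obs: obs[1]):
        seen.add(word)
        if timeline and timeline[-1][0] == day:
            # Same day as the previous entry: update its running total.
            timeline[-1] = (day, len(seen))
        else:
            timeline.append((day, len(seen)))
    return timeline

timeline = cumulative_vocabulary(observations)
for day, size in timeline:
    print(day.isoformat(), size)
```

The resulting list of (date, vocabulary size) pairs can be passed to any plotting library to draw a growth curve.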



Not only was the occurrence of new compounds with the stem corona/korona in the studied timeframe very high, but the speed of vocabulary growth and the diversity of new words were also noteworthy. The earliest relevant compounds in the Norwegian Newspaper Corpus appeared on 9 January 2020: coronavirus and its definite form coronaviruset. Initially, the use of these and other compounds remained modest. However, on 26 February of the same year, when the virus was detected in Norway, there was a marked increase.

This explosion of new compounds included words such as koronatelefon (corona telephone), koronadødsfall (corona death), koronafrykten (fear of corona), coronacruiset (corona cruise) and coronatider (corona times). Most compounds were nouns, such as koronapsyken (corona psyche), some were verbs, such as koronastenge (close down due to corona), some were adjectivally used participles, such as coronastanset (stopped by corona) and some were adjectives, notably koronafast (stuck due to corona).
Distribution of c- (light) and k- (dark) over time. No bars are shown for days without occurrence of either.

Many of the new compounds are heavily context-dependent: for instance, koronatelt (corona tent), koronautsettelsene (corona postponements), coronalov (corona law) and coronakompensasjon (corona compensation). Several of the compounds are metaphorical and have emotional connotations, such as those with the final parts knekken (breakdown), knipen (pinch), spøkelset (ghost), tsunamien (tsunami) and tabu (taboo).

The study also shows that creativity in coining new compounds did not stop or slow during the studied timeframe: new words continued to emerge throughout the entire period.

The change in spelling was, perhaps surprisingly, rapid: while the spelling with c- was still very dominant in January 2020, the majority of the media had adopted the new spelling with k- within about a month of the intervention by the Language Council of Norway. However, despite the initial sharp rise, the change was never complete: use of the k- spelling plateaued at about seventy to eighty per cent.
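The proportion of k- spellings over time can be computed along the following lines. This is a sketch under stated assumptions: the observations are invented, and grouping by calendar month is a choice made for this example rather than the binning used in the study.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (word, date) observations; not data from the study.
observations = [
    ("coronavirus", date(2020, 1, 9)),
    ("coronaviruset", date(2020, 1, 20)),
    ("koronavirus", date(2020, 2, 27)),
    ("coronavirus", date(2020, 2, 28)),
    ("koronatelefon", date(2020, 3, 12)),
    ("koronafast", date(2020, 3, 20)),
    ("coronatider", date(2020, 3, 21)),
    ("koronastenge", date(2020, 3, 25)),
]

def k_share_by_month(observations):
    """Return {(year, month): fraction of tokens spelled with initial k-}."""
    counts = defaultdict(lambda: [0, 0])  # month -> [k- tokens, all tokens]
    for word, day in observations:
        month = (day.year, day.month)
        counts[month][1] += 1
        if word.lower().startswith("k"):
            counts[month][0] += 1
    return {month: k / total for month, (k, total) in sorted(counts.items())}

shares = k_share_by_month(observations)
for (year, month), share in shares.items():
    print(f"{year}-{month:02d}: {share:.0%} k-")
```

On real data, a plateau like the one described would show up as shares that level off instead of approaching one hundred per cent.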

‘This is the first study to demonstrate the effect of such a spelling change in various Norwegian media sources.’
Koenraad De Smedt


CLARIN Tools and Resources

The Norwegian Newspaper Corpus, the data source for this study, is part of the CLARIN Resource Family ‘Newspaper Corpora’. It is updated every night by harvesting publicly accessible articles from ten major Norwegian online newspapers. At every automatic update, boilerplate is removed so that nearly clean text is left, and each article is tagged with the date and the source.

The corpus was accessed through the Corpuscle corpus management and search system, which was developed at the CLARINO Bergen Centre. This system has a user-friendly interface and a powerful and efficient query system. It allows the specification of arbitrary start and end dates in queries and also offers download of matching strings, with optional annotation features, to a file with tab-separated values.

De Smedt says: 'The useful thing about Corpuscle is that it has powerful search possibilities and that it allows the download of the data in a very usable format. Those are the advances of Corpuscle.'


Views on CLARIN

'Newspaper monitor corpora, which incorporate new materials on a regular basis, are particularly useful for tracking linguistic changes spurred by current developments.' 
Koenraad De Smedt
'What I’d like to see is a bigger, more up-to-date media corpus. If you look at the ParlaMint project, I think something similar could be done for the media. Then you could do comparable searches, for example look for corona compounds in all compounding languages, such as Norwegian, Dutch and German. It would be more difficult, because unlike parliamentary records, media material is not public. But it’s still possible, I think. I’m currently involved in a project trying to put together a new media monitor corpus with the help of the National Library of Norway. This would also include material from TV and radio. Potentially it would be a lot bigger and include material that is behind a paywall. If there were several of these then you could think about investigating corona creativity across different languages. That would be really interesting.' 
Koenraad De Smedt


Koenraad De Smedt, Professor of Computational Linguistics, Department of Linguistic, Literary and Aesthetic Studies, University of Bergen, Norway

See here for more information on how CLARIN has responded to COVID-19.