
Tour de CLARIN: Resource from Sweden - the Riksdag's open data

Submitted by karolina@clarin.eu on

Blog post written by Darja Fišer and Jakob Lenardič 


Because parliamentary speech has a great societal impact through both its language and its content, the creation and availability of large multimodal parliamentary corpora (the topic of a recent CLARIN-PLUS workshop) plays a pivotal role in research in the humanities and social sciences.

The Riksdag's open data is one such corpus. It is the digitized collection of Swedish parliamentary data and consists of roughly 30,000 documents pertaining to Sweden’s national political decision-making processes. It has been made available for download on the website of the Swedish parliament. In addition, the Swedish National Library has digitized the public reports of inquiry for the period between 1922 and 1999 and published them under the CC0 license on the parliamentary website, with newer reports now being digitized from the very outset.

This parliamentary corpus is available in Korp and consists of 1.25 billion tokens. It can also be downloaded in XML format from the resource page of Språkbanken. The corpus was annotated with SWE-CLARIN’s tool Sparv; the annotation comprises tokenisation, lemmatisation, lemgram (inflectional paradigm) identification, word sense identification, and compound splitting.
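To give a feel for what working with the downloaded XML might look like, here is a minimal Python sketch that counts selected lemmas per year. The element and attribute names (text, year, w, lemma), the pipe-delimited lemma values, and the sample sentences are all assumptions made for illustration; the actual export from Språkbanken may be structured differently, so check the downloaded files before relying on any of this.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample. The element names, attribute names and the
# pipe-delimited lemma values are assumptions, not the confirmed format.
SAMPLE = """<corpus id="sample">
  <text year="1969">
    <sentence>
      <w pos="NN" lemma="|information|">informationen</w>
      <w pos="VB" lemma="|nå|">når</w>
      <w pos="NN" lemma="|medborgare|">medborgarna</w>
    </sentence>
  </text>
  <text year="1971">
    <sentence>
      <w pos="NN" lemma="|upplysning|">upplysningar</w>
      <w pos="NN" lemma="|information|">information</w>
    </sentence>
  </text>
</corpus>"""


def lemma_counts_by_year(xml_text, targets):
    """Return {year: Counter} with counts of the target lemmas per year."""
    root = ET.fromstring(xml_text)
    counts = {}
    for text in root.iter("text"):
        year = text.get("year")
        counter = counts.setdefault(year, Counter())
        for token in text.iter("w"):
            # Split the assumed "|a|b|" value and drop the empty strings.
            lemmas = [l for l in token.get("lemma", "").split("|") if l]
            for lemma in lemmas:
                if lemma in targets:
                    counter[lemma] += 1
    return counts


if __name__ == "__main__":
    result = lemma_counts_by_year(SAMPLE, {"information", "upplysning"})
    for year in sorted(result):
        print(year, dict(result[year]))
```

For a corpus of this size, loading a whole file with ET.fromstring would probably not be practical; a streaming parser such as ET.iterparse is likely a better fit, but the counting logic would stay the same.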

The resource has been successfully used by scholars working in the Social Sciences and Digital Humanities. Fredrik Norén from the Department of Culture and Media Studies at Umeå University has researched how social information in Sweden was structured in the period between 1965 and 1975, with a focus on uncovering how the government informed its citizens and communicated with them during this period. He has used Korp to search through SOU, a subset of the parliamentary corpus that contains the official reports of the government.

Norén has also collaborated with Roger Mähler from the Center of Digital Humanities at Umeå University to analyse changes in governmental discourse on the basis of the distribution of nouns. Using topic modelling, they were able to identify how information discourse arose in the 1960s and infiltrated governmental policies. Norén and Pelle Snickars have also used similar methods to analyse policies related to Swedish film in the 20th century on the basis of 4,500 reports in the SOU corpus. All in all, digitized language data such as the Riksdag’s open data corpus have made it possible to study the evolution of concepts like information in great detail and, by extension, to reveal historical change in a more precise and nuanced manner than before.

IMAGE: Frequencies of the lemmas information, upplysning ("information"), underrättelse ("notification"), meddelande ("message") and propaganda throughout the 20th century.

 

