Written by Arne Jönsson
The CLARIN Knowledge Centre for Swedish in a Multilingual Setting (CLARIN-SMS) is primarily directed at researchers in the Social Sciences and Humanities (SSH) and beyond with a need for analysis, annotation or data mining of Swedish or multilingual texts, and of Swedish Sign Language.
CLARIN-SMS makes resources available for research in the Humanities and Social Sciences, both tools for linguistic processing and corpora. The resources include monolingual (mainly Swedish) and multilingual corpora across several domains, and tools for the basic processing of text, including tokenisation, morphological analysis, part-of-speech tagging, syntactic parsing, and named entity recognition.
Main Areas of Expertise
CLARIN-SMS offers special expertise for:
- Researchers interested in exploring Swedish texts: support for the creation and processing of Swedish texts with a variety of computational methods, such as linguistic annotation at different levels or sentiment analysis.
- Researchers interested in comparative analyses: support for the creation and processing of parallel and comparable corpora, including alignment and machine translation, as well as cross-linguistically consistent annotation within the framework of Universal Dependencies, which allows for easy comparative analyses.
- Researchers interested in education and content accessibility: support for the computation and evaluation of measures of text complexity.
- Researchers and users of Swedish Sign Language (SSL): support for the creation of lexicons and corpora for SSL, and for the annotation of SSL (including glosses, part-of-speech tagging and syntactic structure).
The support is provided by several partners participating in the CLARIN-SMS distributed Knowledge Centre:
- Linköping University, Department of Computer and Information Science
- Stockholm University, Department of Linguistics
- Uppsala University, Department of Linguistics and Philology.
Each CLARIN-SMS node works as a separate unit and promotes its services and resources in various ways, including promotion tours at universities, web pages presenting projects and resources, and presentations at CLARIN-related events. The K-centre is nevertheless a common resource. CLARIN-SMS is a vibrant community and, in accordance with CLARIN’s general mission of creating and promoting language resources, the nodes have carried out a variety of activities, including tool and resource development for language analysis, both multilingual and Swedish-only.
An Active Research Hub
A number of activities focus especially on promoting the use of language technology in SSH. For instance, one project analyses the development of the concept of “handicapped” from a Swedish parliamentary perspective. In this project, we help researchers process and analyse the Swedish Government’s official reports from the early 1900s to the present day with a variety of SweClarin resources and language technology tools, such as the SPARV pipeline.
Another example is the analysis of the protocols of the Swedish National Bank (Sw. Riksbanken), where we compare protocols from the period when they were anonymous to protocols from the period when they were not. One of the goals of this study is to see if we can identify individual speakers from the period of anonymous protocols. Another goal is to provide the National Bank with information about potential differences and similarities in argumentation between the two types of protocols. To this end, we use a variety of SweClarin resources, such as the sensaldo-v02 sentiment lexicon or the SPARV pipeline for parsing, in combination with, for instance, topic and sentiment analysis models.
A further example is a project carried out in cooperation with management researchers, in which we are analysing Swedish companies’ adherence to and adoption of the information security standard ISO 27001. The aim of the project is to examine the communicative constitution of preventive innovation in organisations. For this project, we helped create a corpus and analyse it from multiple interdisciplinary perspectives using SweClarin tools and resources, such as the sensaldo-v02 sentiment lexicon or the SPARV pipeline for parsing, as well as other language technology tools, including word clouds.
Some Signature CLARIN-SMS Tools and Resources
Tools and models:
- SWEGRAM: Aims to provide a tool for text analysis in Swedish and English. You can upload one or several texts and annotate them at different linguistic levels with morphological and syntactic information. The annotated texts can then be used to extract statistics about the text properties with respect to text length, number of words, readability measures, part-of-speech, and much more.
- Sapis - StilLett Service: A web service (API) including tools for measuring text complexity and for text simplification.
- Gold alignments for 1164 English-Swedish sentence pairs for the purpose of testing word alignment software. Source data from Europarl v.2.
- Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with more than 300 contributors producing nearly 200 treebanks in over 100 languages. CLARIN-SMS provides support for applying both Swedish-specific and UD-based annotations. The resource is useful for applications such as multilingual parsing and for studies of language typology. Moreover, parallel UD treebanks can be used for studies of human and machine translation.
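As an illustration of the kind of statistics mentioned for SWEGRAM above, the following is a minimal Python sketch that reads a text annotated in CoNLL-U, the ten-column tab-separated format used by UD treebanks, and computes a word count, a part-of-speech distribution, and LIX, a readability index commonly used for Swedish (average sentence length plus the percentage of words longer than six letters). The sample sentences, the helper names `row`, `parse_conllu` and `text_statistics`, and the choice of LIX are assumptions made for illustration; this is not the SWEGRAM or Sapis implementation.

```python
from collections import Counter

# The ten CoNLL-U columns, in order.
COLUMNS = ("id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc")


def row(*fields):
    """Build one tab-separated CoNLL-U token line (hypothetical helper)."""
    return "\t".join(fields)


# Hypothetical two-sentence sample; real input would come from an annotation tool.
SAMPLE = "\n".join([
    "# text = Hunden sover.",
    row("1", "Hunden", "hund", "NOUN", "_", "_", "2", "nsubj", "_", "_"),
    row("2", "sover", "sova", "VERB", "_", "_", "0", "root", "_", "_"),
    row("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    "",
    "# text = Riksdagen debatterar förslaget.",
    row("1", "Riksdagen", "riksdag", "NOUN", "_", "_", "2", "nsubj", "_", "_"),
    row("2", "debatterar", "debattera", "VERB", "_", "_", "0", "root", "_", "_"),
    row("3", "förslaget", "förslag", "NOUN", "_", "_", "2", "obj", "_", "_"),
    row("4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    "",
])


def parse_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        if not fields[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


def text_statistics(sentences):
    """Compute simple text statistics, including the LIX readability index."""
    words = [t["form"] for s in sentences for t in s if t["upos"] != "PUNCT"]
    if not words:
        raise ValueError("no words found")
    long_words = [w for w in words if len(w) > 6]
    upos = Counter(t["upos"] for s in sentences for t in s)
    # LIX = words per sentence + percentage of long words.
    lix = len(words) / len(sentences) + 100 * len(long_words) / len(words)
    return {"sentences": len(sentences), "words": len(words),
            "long_words": len(long_words), "lix": lix, "upos": upos}


sentences = parse_conllu(SAMPLE)
stats = text_statistics(sentences)
print(stats)
```

Punctuation is excluded from the word count here, which is one common convention; other tools may count tokens differently, so absolute values are only comparable when computed the same way.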
Corpora:
- A corpus of corporate texts from Swedish company websites that mention the information security standard ISO 27001.
- A parallel corpus with some 4000 English original sentences from different sources and their Swedish translations, primarily created for the study of the usage of function words and grammatical constructions.
- Svensk Diakronisk korpus (Swedish Diachronic Corpus): A corpus of texts covering the time period from Old Swedish to the present day, with a wide variety of text types. It is freely available for download and can be searched through the Korp web interface https://spraakbanken.gu.se/korp.
- SOU corpus: This corpus contains cleaned and processed versions of Swedish Government Official Reports - Statens offentliga utredningar (SOU). The documents are based on HTML versions from Riksdagens öppna data and cover the years 1994 to 2020.
- Data sets for causality recognition: Three data sets of Swedish text annotated for the presence of causality. The sets are annotated with two different tasks in mind, namely causality recognition and causality ranking with respect to a query prompt containing at least a cause or an effect.