Tour de CLARIN: The CLARIN-PL B-Centre

Submitted by Karina Berger on 18 December 2021

Written by Krzysztof Hwaszcz and Jan Wieczorek

The Polish consortium CLARIN-PL, which is a founding member of CLARIN, operating since 2012, was already presented in Tour de CLARIN in 2018 (Volume 1). Not much has changed since then in terms of organisation and scope: the centre is based at the Wrocław University of Science and Technology (Department of Computational Intelligence) and is still coordinated by Maciej Piasecki. From the user’s point of view, the B-centre constitutes the pillar of the Polish CLARIN infrastructure, as it focuses on the development and maintenance of the existing services from both the scientific and the technical perspective. The main objective of the centre is to maintain and make available a unique infrastructure of computing equipment and data repositories integrated with the European CLARIN infrastructure, as well as natural language processing tools and services. Although CLARIN-PL focuses on Polish, the centre is open to working with other languages as well. Thus far, it has incorporated more than ten languages in its tools and services.

The mission of the B-centre is not only to maintain and promote, but also to develop tools and services that are already available within the CLARIN-PL infrastructure, such as the Polish wordnet plWordNet, the valency lexicon Walenty, and the topic modelling tool Topic, and to create completely new tools and resources. The development model is bottom-up – that is, the directions of activities at the centre respond to user demand. For instance, there has been a recent increase in demand for processing texts that were previously transcribed using optical character recognition. Automatic conversion typically introduces various spelling errors. In response, the centre provided a set of new standardisation tools, which are designed to correct such documents. Similarly, there is an increasing demand to work with texts that originally contained personal data. To avoid legal difficulties, these texts need to be anonymised; consequently, the Anonimizer service has been developed to address this need. Other recently developed tools and resources may be categorised into four groups: (i) corpus tools, (ii) tools for sentiment analysis, (iii) text standardisation and (iv) multilingual use. Let us present them briefly below:

New Tools for Working with Text Corpora:

Topic for topic modelling
ComCorp for comparing the linguistic features of corpora
Cat for simple text classification

Sentiment Analysis:

Sentemo for determining the text sentiment
Wydźwięk for the analysis of emotional overtone
MultiEmo for multilingual sentiment analysis in eleven languages

Tools for Text Standardisation and Correction:

Paragraph for dividing the text into sentences or paragraphs
Symspell for removing redundant spaces
Wordifier for abbreviation expansion
Txt Clean for document cleaning
Speller for improving the spelling
Punctuator for improving the punctuation
Anonimizer for removing sensitive data

Implementation of New Languages in Existing Services:

WebSim for detecting text similarity and clustering (Polish, English)
InterLem for processing literary texts (Polish, English, German, Spanish, Russian)
Topic ML for topic modelling (Polish, English, German, Spanish, Hungarian, Russian)
WebSty ML for stylometric analysis (Polish, English, French, German, Spanish, Hungarian, Russian, Hebrew)
MultiEmo for multilingual sentiment analysis (Polish, English, Chinese, Italian, Japanese, Russian, German, Spanish, French, Dutch, Portuguese)

An integral part of the CLARIN B-centre is the Knowledge Centre for Polish Language Technology (PolLinguaTec), founded in 2017. It has been involved in the implementation of language services in research projects. The main objective of PolLinguaTec is to provide knowledge on the application of tools and systems for natural language analysis. The scope of research conducted with the use of the CLARIN-PL infrastructure includes areas such as economics, political science, sociology, social psychology and linguistics, to mention only a few. A more detailed description of the CLARIN K-centre was published in the 3rd volume of Tour de CLARIN.

PolLinguaTec provides experts able to solve problems related to the use of language processing resources. Apart from that, a number of instructions, guidelines and tutorials have been created to enable or to facilitate the autonomous application of the CLARIN tools by our users. To maximise the efficiency of the CLARIN initiative, both centres (B and K) are oriented towards operating in close collaboration. Overall, since November 2019, PolLinguaTec has helped to plan the implementation of the CLARIN infrastructure in about thirty-five research projects; the authors of fifteen of the projects have also benefited from our support in preparing their grant applications.

For instance, Agnieszka Hess collaborated with PolLinguaTec in the project on the identification of social representations of civil dialogue in Poland and the description of intended and unintended consequences of activities of civil dialogue participants.

More than 42,000 plenary sessions, commissions and interpellations were extracted from the Polish Parliamentary Corpus. PolLinguaTec provided the tools (Topic, TermoPL, MeWeX, Sentemo, Wydźwięk, MultiEmo) and assisted with analysing the frequency of certain terms, topic modelling, identifying domain-specific terms and their contexts, and analysing the sentiment of participants’ statements in civil dialogue.
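To give a rough idea of what the term frequency part of such an analysis involves, here is a minimal sketch in Python. It is not the code used in the project and does not call any CLARIN-PL service; the function name term_frequencies and the sample sentences are made up for illustration. It simply counts exact, case-insensitive word matches across a list of documents.

```python
import re
from collections import Counter


def term_frequencies(documents, terms):
    """Count exact, case-insensitive word matches of each term across documents.

    Hypothetical helper for illustration only; it does no lemmatisation.
    """
    wanted = {term.lower() for term in terms}
    counts = Counter()
    for text in documents:
        # \w+ matches Unicode word characters, so Polish letters are kept intact.
        for token in re.findall(r"\w+", text.lower()):
            if token in wanted:
                counts[token] += 1
    # Report every requested term, including those that never occurred.
    return {term.lower(): counts[term.lower()] for term in terms}


# Invented sample sentences, not taken from the Polish Parliamentary Corpus.
sample = [
    "Dialog obywatelski wymaga zaufania.",
    "Dialog to nie monolog.",
]
result = term_frequencies(sample, ["dialog", "zaufanie"])
```

In this sample, "zaufania" is an inflected form of "zaufanie" and is therefore not counted. For a highly inflected language such as Polish, a realistic analysis would first reduce words to their base forms, which is one reason dedicated language tools are needed for this kind of work.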

Another research project was conducted in collaboration with the Educational Research Institute (IBE), which carries out interdisciplinary research into the functioning and effectiveness of the education system in Poland. It was assumed that the descriptions of learning outcomes could provide a basis for comparing qualifications. The research team needed a tool which would automatically compare qualifications using natural language processing techniques. The work of PolLinguaTec was divided into the following stages: (i) developing a qualification classifier, (ii) creating labels for fields common in the entire data set, (iii) describing learning outcomes, (iv) clustering, and (v) preparing the interface for target users. The results of the cooperation between PolLinguaTec and both Agnieszka Hess and the IBE team were presented during the conference 'CLARIN-PL-Biz – Language Technologies for Learning and Business II'.
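The idea of comparing qualifications through the descriptions of their learning outcomes can be pictured with a small sketch. The article does not say which similarity method PolLinguaTec used, so the code below is only an assumption: it compares texts as bags of words with cosine similarity. The function names and the sample descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Turn a text into a word-count vector (lowercased, no lemmatisation)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, descriptions):
    """Return (name, score) for the description closest to the query text."""
    query_vector = bag_of_words(query)
    scored = [
        (name, cosine_similarity(query_vector, bag_of_words(text)))
        for name, text in descriptions.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Invented learning outcome descriptions, for illustration only.
descriptions = {
    "baker": "prepares dough and bakes bread",
    "welder": "joins metal parts by welding",
}
best_name, best_score = most_similar("bakes bread and rolls", descriptions)
```

A pairwise score of this kind could also feed a clustering step such as stage (iv), although a production system would more likely rely on lemmatised text or learned representations than on raw word counts.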

In relation to sociology – or social media, to be more precise – PolLinguaTec helped in the ComAnCE (Combat Anti-Semitism in Central Europe) project supervised by Viera Žúborová, which aimed to present the phenomenon of discrimination in the Visegrád Group countries (Czechia, Hungary, Poland, and Slovakia) to European organisations, state authorities, police forces, fellow researchers and institutions. The participating countries developed a unique categorisation of anti-Semitic statements classified according to the keywords collected from Facebook users. The concordancer (KonText) and programming tools (CMC and Morphodita, used to correct errors) from our centre and from the CLARIN infrastructure were among the software used in the implementation of the project.

Corpora