ParlaMint: Towards Comparable Parliamentary Corpora


Introduction

ParlaMint is a project, financially supported by CLARIN, that contributes to the creation of comparable and uniformly annotated multilingual corpora of parliamentary sessions. The project is being conducted in two stages:

  • ParlaMint I (July 2020 – May 2021)
  • ParlaMint II (December 2021 – May 2023)

ParlaMint I created and made available corpora for 17 languages, and started to use them in training and research. 

ParlaMint II will upgrade the XML schema and validation, extend the existing corpora to cover data at least up to July 2022, add corpora for new languages, further enhance the corpora with additional metadata, and improve the usability of the corpora.

Motivation

National parliamentary data are a verified communication channel between elected political representatives and members of society in any democracy. They need to be made accessible and comprehensive, especially in times of global crises. With the recent advances of artificial intelligence, analytics over unstructured parliamentary data for many languages is rapidly becoming a prerequisite for reliable and trustworthy approaches to checking the veracity of information in contemporary society.

One of the most important characteristics of parliamentary data is its direct correspondence to the most recent events, including the ones with global impact on human health, social life and economics, such as the current COVID-19 pandemic. By comparing the data synchronically and diachronically in a cross-lingual context, scientific and civil communities from various disciplines are able to track the pan-European discussion.

Goals of the Project

The goal of the ParlaMint project is to turn the diverse existing contemporary national parliamentary data into resources that are:

  • Comparable
  • Interpretable, and
  • Highly communicative with respect to society (researchers, journalists, NGOs, citizens, etc.).

The project provides data for focused observations on trends, opinions, decisions on lockdowns and restrictive measures as well as on the consequences with respect to health, medical care systems, employment, etc. in times of emergencies. For the ParlaMint project, the emergency case is the COVID-19 pandemic. However, the methodology is scalable to other events as well, such as economic crises, environmental issues, etc. Thus, the main tasks are:

  • Compiling a collection of parliamentary datasets (corpora) in a number of languages and in a harmonised format, covering both the current data and older, reference data
  • Processing the compiled corpora linguistically 
  • Indexing the data with popular concordancers so that interested parties can search and extract the relevant comparable information
  • Showing through appropriate use cases that the ParlaMint corpora and related technologies as part of the CLARIN resource families serve a variety of societal needs.

Democratic processes, as reflected in parliamentary debates, are therefore studied through the following related strategies:

  • Speaker and party statistics (for instance, who spoke more and on which topic; who changed their mind on a certain topic; which party defends/opposes what proposals, etc.)
  • Topic modeling (which topics are most popular at what time; how topics change and interrelate, etc.)
  • Time and context-bound social tendencies (tendencies in policy making over time).
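As a rough illustration of the first strategy, speaker and party statistics can be computed from speech records with a few counters. The sketch below is only a minimal example: the records, the speaker and party names, and the helper functions `words_by` and `mentions_by_party` are hypothetical and not part of any ParlaMint tooling, and tokens are split on whitespace rather than taken from the linguistic annotation.

```python
from collections import Counter

# Hypothetical toy records: (speaker, party, text).
# Real corpora carry far richer metadata and linguistic annotation.
SPEECHES = [
    ("Speaker A", "Party X", "We must support the health system during the pandemic"),
    ("Speaker B", "Party Y", "The lockdown measures hurt employment"),
    ("Speaker A", "Party X", "Employment support is part of the plan"),
]


def words_by(records, key_index):
    """Sum whitespace-separated token counts per speaker (0) or party (1)."""
    totals = Counter()
    for record in records:
        totals[record[key_index]] += len(record[2].split())
    return totals


def mentions_by_party(records, term):
    """Count case-insensitive whole-token occurrences of `term` per party."""
    counts = Counter()
    for _speaker, party, text in records:
        counts[party] += text.lower().split().count(term.lower())
    return counts


speaker_totals = words_by(SPEECHES, 0)
party_totals = words_by(SPEECHES, 1)
employment_mentions = mentions_by_party(SPEECHES, "employment")

print(speaker_totals.most_common())  # who spoke more
print(employment_mentions)           # which party raised the topic
```

The same grouping idea extends to time slices (for example, counting per month) when tracking how topics change over time.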

How to Cite

The first stage of the ParlaMint project has already produced freely available comparable and interoperable corpora of 17 European parliaments with almost half a billion words. The project and corpora up to 2021 are described in:

Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darģis, Orsolya Ring, Ruben van Heusden, Maarten Marx, and Darja Fišer. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, 2022.

Links to Current ParlaMint Related Data: Corpora, Standards, Services

All corpora follow the same TEI-based encoding schema. It was developed as a specialisation of the existing Parla-CLARIN schema and adds rich metadata, such as biographical information on members of parliament, information on parties, ruling vs. opposition status, non-verbal behaviour (applause, shouting, laughing, entering or leaving the room, etc.), and the relations among all of these.
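To give a feel for how such an encoding can be consumed, the sketch below reads a tiny hand-written XML fragment with Python's standard library. It assumes a TEI-style layout in which utterances are `u` elements with a `who` attribute, text segments are `seg` elements, and non-verbal events are `kinesic` elements; these are general TEI elements, and the fragment, speaker identifiers and exact structure are illustrative assumptions rather than an excerpt from the actual corpora.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace, mapped to a prefix for use in the path expressions below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written fragment (not taken from a real corpus file).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA"><seg>We propose new measures.</seg></u>
    <kinesic type="applause"><desc>Applause</desc></kinesic>
    <u who="#SpeakerB"><seg>We disagree.</seg><seg>The costs are too high.</seg></u>
  </div></body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Count text segments per speaker, stripping the leading '#' of the reference.
segments_per_speaker = Counter()
for utterance in root.iterfind(".//tei:u", TEI_NS):
    speaker = utterance.get("who", "").lstrip("#")
    segments_per_speaker[speaker] += len(utterance.findall("tei:seg", TEI_NS))

# Collect the types of non-verbal events in document order.
nonverbal_types = [k.get("type") for k in root.iterfind(".//tei:kinesic", TEI_NS)]

print(dict(segments_per_speaker))
print(nonverbal_types)
```

Because every corpus shares one schema, a reader of this kind would in principle work unchanged across languages, which is what makes the corpora comparable.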

The corpora are available as data (via GitHub and CLARIN.SI entries) as well as through concordancers. See below for the respective links:

Multilingual comparable corpora of parliamentary debates ParlaMint 2.1

Corpora as data:

Available through the CLARIN.SI repository:

Available on GitHub:

Corpora in concordancers: 

ParlaSpeech-HR: Training Dataset for Automatic Speech Recognition in Croatian (aligned with the ParlaMint-HR 2.1 corpus). The data are available from http://hdl.handle.net/11356/1494

Dissemination

Events

  • CLARIN Café: ParlaMint Unleashed on 28 June 2021 from 14:00 to 16:00, organised by Tomaž Erjavec, Darja Fišer, Maciej Ogrodniczuk, Petya Osenova. Eight months after the introductory CLARIN Café on ParlaMint we present the results, lessons learnt and showcases of the project. 
  • CLARIN Café - Join Our Parliamentary-flavoured Coffee: ParlaMINT. The ParlaMint project team presented current results and provided information about the opportunities to join as a contributor, as a user, or both. Organised by Petya Osenova (Sofia University and IICT-BAS) and Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences).
    • The dedicated call can be found here.
    • Watch the video recordings of the Café here.

Long-Term Impact

Strategy and data availability: the project will establish a strategy for handling and processing parliamentary data in times of emergency (COVID-19 is just a showcase). Thus, different reference corpora could be produced from parliamentary records of earlier periods of global crisis, e.g. the great economic recession, periods of floods in Europe, the Ebola outbreak, etc.

Standard development: the Parla-CLARIN encoding standard will be further developed to cover more detailed and specific metadata across languages and parliaments. The corpora will serve as a baseline for further updates. Such uniform updates across the corpora would strongly support various methods of comparative research.

From showcasing to real applications: the availability of comparable multilingual parliamentary data (also made visible through concordancers and services like Parlameter) will boost research in digital humanities, linguistics, political science, sociology, psychology and related disciplines.

ParlaMint II

Work Plan

The second stage of the project consists of 5 work packages.

WP1: Documentation, Interoperability, Metadata 

Lead: Tomaž Erjavec (IJS), Matyáš Kopp (UFAL)

  • T1.1: harmonisation of encoding
  • T1.2: git management
  • T1.3: adding metadata to existing corpora

WP2: Corpus Expansion

Lead: Tomaž Erjavec (IJS)

  • T2.1: adding new corpora
  • T2.2: extending existing corpora
  • T2.3: data distribution

WP3: Corpus Enrichment

Lead: Nikola Ljubešić (IJS)

  • T3.1: machine translation and semantic tagging
  • T3.2: multimodality

WP4: Engagement Activities

Lead: Darja Fišer (INZ), Çağrı Çöltekin (TUB)

  • T4.1: tutorial 
  • T4.2: hackathon 
  • T4.3: shared task 
  • T4.4: showcases

WP5: Coordination

Lead: Maciej Ogrodniczuk, Petya Osenova

  • T5.1: management 
  • T5.2: dissemination 
  • T5.3: external monitoring

Project Partners

In-kind contributing partners who extend their corpora from ParlaMint I:

  • Belgium (Jesse de Does)
  • Bulgaria (Petya Osenova, Kiril Simov)
  • Croatia (Nikola Ljubešić, Michal Mochtak)
  • Czechia (Matyáš Kopp)
  • Denmark (Costanza Navarretta, Dorte Haltrup Hansen, Bart Jongejan)
  • France (Giancarlo Luxardo, Sascha Diwersy)
  • Hungary (Miklós Sebők - ParlaMint I, Noémi Ligeti-Nagy - ParlaMint II)
  • Iceland (Starkaður Barkarson)
  • Italy (Tommaso Agnoloni, Giulia Venturi)
  • Latvia (Roberts Darģis)
  • Lithuania (Tomas Krilavičius, Andrius Utka, Vaidas Morkevičius, Mindaugas Petkevičius, Monika Briedienė)
  • Netherlands (Maarten Marx, Ruben van Heusden)
  • Poland (Maciej Ogrodniczuk, Michał Rudolf, Danijel Korzinek)
  • Slovenia (Tomaž Erjavec, Andrej Pančur, Darja Fišer)
  • Spain (María Calzada Pérez)
  • Turkey (Çağrı Çöltekin)
  • UK (Paul Rayson, Matt Coole)

Partners who will develop new corpora:

  • Austria (Hannes Pirker, Tanja Wissik)
  • Basque Country (Mikel Iruskieta)
  • Bosnia and Herzegovina (Michal Mochtak, Nikola Ljubešić)
  • Catalonia (Nuria Bel)
  • Estonia (Kadri Vider, Neeme Kahusk, Martin Mölder)
  • Finland (Eero Hyvönen, Jouni Tuominen)
  • Galicia (Adina Vladu, Carmen Magariños)
  • Greece (Maria Gavriilidou)
  • Norway (Magnus Breder Birkenes, Jon Arild Olsen, Koenraad De Smedt)
  • Portugal (Amália Mendes)
  • Romania (Petru Rebeja, Madalina Chitez, Cornelia Ilie)
  • Serbia (Michal Mochtak, Nikola Ljubešić)
  • Sweden (Fredrik Norén)
  • Ukraine (Anna Kryvenko, Matyáš Kopp) 

Financial Support

  • CLARIN ERIC: budget of 135,000 EUR for ParlaMint I and 163,000 EUR for ParlaMint II
  • In-kind contributing partners: please see sections Project Partners and Acknowledgements

Acknowledgements

  • CLARIN ERIC – 'ParlaMint: Towards Comparable Parliamentary Corpora'
  • ARRS (Slovenian Research Agency) P2-0103 'Knowledge Technologies'
  • Ministry of Education and Science Republic of Bulgaria DO01-272/16.12.2019 'Bulgarian National Interdisciplinary Research e-Infrastructure for Resources and Technologies CLaDA-BG'
  • ARRS (Slovenian Research Agency) P6-0411 'Language Resources and Technologies for Slovene'
  • LINDAT/CLARIAH-CZ LM2018101 'Digital Research Infrastructure for Language Technologies, Arts and Humanities'
  • Spanish Ministry of Science and Innovation PID2019-108866RB-I0 / AEI / 10.13039/501100011033 'Original, Translated and Interpreted Representations of the Refugee Cris(e)s: Methodological Triangulation within Corpus-Based Discourse Studies'
  • The Research Council of Lithuania P-MIP-20-373 'Policy Agenda of the Lithuanian Seimas and its Framing: The Analysis of the Seimas Debates in 1990–2020'
  • CLARIN-LV, European Regional Development Fund project 1.1.1.5/18/I/016 'University of Latvia and Institutes in the European Research Area – Excellency, Activity, Mobility, Capacity'
  • CLARIN-PL-Biz, financed by the European Regional Development Fund as a part of the 2014-2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19

Contacts