ParlaMint is a project, financially supported by CLARIN, that contributes to the creation of comparable and uniformly annotated multilingual corpora of parliamentary sessions. The project is being conducted in two stages:
- ParlaMint I (July 2020 – May 2021)
- ParlaMint II (December 2021 – May 2023)
ParlaMint I created and made available corpora for 17 languages, and started to use them in training and research.
ParlaMint II will upgrade the XML schema and validation, extend the existing corpora to cover data at least to July 2022, add corpora for new languages, further enhance the corpora with additional metadata, and improve the usability of the corpora.
National parliamentary data are a verified communication channel between elected political representatives and members of society in any democracy. They need to be made accessible and comprehensive, especially in times of global crises. With the recent advances in artificial intelligence, analytics over unstructured parliamentary data for many languages is rapidly becoming a prerequisite for reliable and trustworthy approaches to checking the veracity of information in contemporary society.
One of the most important characteristics of parliamentary data is its direct correspondence to the most recent events, including the ones with global impact on human health, social life and economics, such as the current COVID-19 pandemic. By comparing the data synchronically and diachronically in a cross-lingual context, scientific and civil communities from various disciplines are able to track the pan-European discussion.
Goals of the Project
The goal of the ParlaMint project is to turn the existing, diverse contemporary national parliamentary data into resources that are:
- Interpretable and
- Highly communicative with respect to the society (researchers, journalists, NGOs, citizens, etc.).
The project provides data for focused observations on trends, opinions, decisions on lockdowns and restrictive measures as well as on the consequences with respect to health, medical care systems, employment, etc. in times of emergencies. For the ParlaMint project, the emergency case is the COVID-19 pandemic. However, the methodology is scalable to other events as well, such as economic crises, environmental issues, etc. Thus, the main tasks are:
- Compiling a collection of parliamentary datasets (corpora) in a number of languages and in a harmonised format, covering both the current data and older, reference data
- Processing the compiled corpora linguistically
- Indexing the data with popular concordancers so that interested parties can search and extract the relevant comparable information
- Showing through appropriate use cases that the ParlaMint corpora and related technologies as part of the CLARIN resource families serve a variety of societal needs.
Democratic processes are thus observed through parliamentary data, using the following related strategies:
- Speaker and party statistics (for instance, who spoke more and on which topic; who changed their mind on a certain topic; which party defends/opposes what proposals, etc.)
- Topic modeling (which topics are most popular at what time; how topics change and interrelate, etc.)
- Time and context-bound social tendencies (tendencies in policy making over time).
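The speaker statistics strategy can be sketched in a few lines of Python. The sample below is an invented fragment, loosely modelled on TEI utterance markup (`u` elements with a `who` attribute). The speaker identifiers, the text and the function name `words_per_speaker` are assumptions made for illustration; they are not taken from the ParlaMint corpora or tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace; the sample document below is hypothetical.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA"><seg>We must protect the health system.</seg></u>
    <u who="#SpeakerB"><seg>The lockdown hurts employment.</seg><seg>We need a plan.</seg></u>
    <u who="#SpeakerA"><seg>Agreed.</seg></u>
  </div></body></text>
</TEI>"""


def words_per_speaker(xml_text):
    """Count whitespace-separated tokens per speaker over all utterances."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for u in root.iterfind(".//tei:u", TEI_NS):
        # The speaker reference is assumed to be a '#'-prefixed identifier.
        speaker = u.get("who", "#unknown").lstrip("#")
        text = " ".join(u.itertext())
        counts[speaker] += len(text.split())
    return counts


counts = words_per_speaker(SAMPLE)
for speaker, n_words in counts.most_common():
    print(speaker, n_words)
```

Whitespace tokenisation is a deliberate simplification; the linguistically annotated versions of the corpora would allow counting proper tokens or lemmas instead.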
How to Cite
The first stage of the ParlaMint project has already produced freely available comparable and interoperable corpora of 17 European parliaments with almost half a billion words. The project and corpora up to 2021 are described in:
Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darģis, Orsolya Ring, Ruben van Heusden, Maarten Marx, and Darja Fišer. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, 2022.
Links to Current ParlaMint Related Data: Corpora, Standards, Services
All corpora follow the same TEI-based encoding schema. It has been developed as a specialised schema on top of the existing Parla-CLARIN one, and rich metadata has been added, such as: biographies of members of parliament, information on parties, ruling vs. opposition status, non-verbal behaviour (applause, shouting, laughing, entering or leaving the room, etc.) and the relations among all of them.
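As a hedged illustration of how such non-verbal annotations might be consumed, the sketch below tallies events by kind in an invented fragment. The element names `kinesic`, `vocal` and `incident` come from the TEI guidelines; that the corpora use them exactly in this way, and the `type` values shown, are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace prefix as it appears in ElementTree tag names.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment; not copied from any ParlaMint corpus.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <u who="#SpeakerA"><seg>Thank you, colleagues.</seg></u>
  <kinesic type="applause"><desc>Applause</desc></kinesic>
  <u who="#SpeakerB"><seg>I disagree.</seg></u>
  <vocal type="shouting"><desc>Shouting from the floor</desc></vocal>
  <kinesic type="applause"><desc>Applause</desc></kinesic>
  <incident type="leaving"><desc>A member leaves the room</desc></incident>
</div>"""

# TEI elements assumed here to carry non-verbal behaviour.
NON_VERBAL = ("kinesic", "vocal", "incident")


def non_verbal_events(xml_text):
    """Count non-verbal events, keyed by (element name, type attribute)."""
    root = ET.fromstring(xml_text)
    events = Counter()
    for elem in root.iter():
        local_name = elem.tag.replace(TEI, "")
        if local_name in NON_VERBAL:
            events[(local_name, elem.get("type", "unspecified"))] += 1
    return events


events = non_verbal_events(SAMPLE)
for (name, kind), n in sorted(events.items()):
    print(name, kind, n)
```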
The corpora are available as data (via GitHub and CLARIN.SI entries) as well as through concordancers. See below for the respective links:
Multilingual comparable corpora of parliamentary debates ParlaMint 2.1
Corpora as data:
Available through the CLARIN.SI repository:
- The complete corpora: hdl.handle.net/11356/1432
- The complete corpora with added linguistic annotations: hdl.handle.net/11356/1431
Available on GitHub:
- Samples of the corpora, the XML schema and various processing and validation scripts: https://github.com/clarin-eric/ParlaMint
Corpora in concordancers:
- Erjavec T. et al. (2021). Multilingual comparable corpora of parliamentary debates ParlaMint 2.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1432
- Erjavec T. et al. (2021). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1431
- Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Andrej Pančur, Nikola Ljubešić, Tommaso Agnoloni, Starkaður Barkarson, María Calzada Pérez, Çağrı Çöltekin, Matthew Coole, Roberts Dargis, Luciana D. de Macedo, Jesse de Does, Katrien Depuydt, Sascha Diwersy, Dorte Haltrup Hansen, Matyáš Kopp, Tomas Krilavičius, Giancarlo Luxardo, Maarten Marx, Vaidas Morkevičius, Costanza Navarretta, Paul Rayson, Orsolya Ring, Michał Rudolf, Kiril Simov, Steinþór Steingrímsson, István Üveges, Ruben van Heusden, Giulia Venturi. ParlaMint: Comparable Corpora of European Parliamentary Data. In: (M. Monachini and M. Eskevich, eds.) Proceedings of CLARIN Annual Conference 2021. 2021, pp. 20-25.
- Fišer D., Pahor de Maiti K., Osenova P., Ogrodniczuk M. (202x). Parliaments in focus: Language, Gender and the Pandemic. Gender and Language. (SUBMITTED FOR REVIEW)
- Tutorial by Darja Fišer and Kristina Pahor de Maiti: Voices of the Parliament: A Corpus Approach to Parliamentary Discourse Research.
- Invited talks
- Invited talk of Tomaž Erjavec, at the 1st Workshop on Computational Linguistics for Political Text Analysis at KONVENS2021.
- Pieters M. (2021). A Comparative Analysis on the ParlaMint Corpus. MSc thesis.
- A Return of Science? Mapping Attitudes Towards Science and Expertise in COVID-19 Parliamentary Debates by Ruben Ros for CLARIN Café: ParlaMint Unleashed, June 2021. GitHub repository with code and research report.
- A Comparative Analysis on the ParlaMint Project by Miguel Pieters for CLARIN Café: ParlaMint Unleashed, June 2021.
- Showcase presentation: ParlaMint and ParlaMeter: How Standardised Data Formats Empower End Users by Filip Dobranić for CLARIN Café: ParlaMint Unleashed, June 2021.
- CLARIN Bazaar Poster: ParlaMint: Towards Comparable Parliamentary Corpora presented at the Virtual CLARIN Annual Conference 2020.
- News from work packages: Early results in WP3 (T3.2) thanks to collaboration between CLARIN.SI (Nikola Ljubešić) and CLARIN-PL (Danijel Koržinek) on building the first Automatic Speech Recognition (ASR) training dataset for Croatian, ParlaSpeech-HR. The data consist of 1816 hours of 8-20 second segments, with original and normalized transcripts and alignment to the ParlaMint-HR 2.1 corpus. The current best ASR results (XLS-R), fine-tuned on just 200 hours, achieve WER 7.77% and CER 2.62%. The dataset is available here: http://hdl.handle.net/11356/1494.
- CLARIN Café: ParlaMint Unleashed on 28 June 2021 from 14:00 to 16:00, organised by Tomaž Erjavec, Darja Fišer, Maciej Ogrodniczuk, Petya Osenova. Eight months after the introductory CLARIN Café on ParlaMint we present the results, lessons learnt and showcases of the project.
- CLARIN Café - Join Our Parliamentary-flavoured Coffee: ParlaMINT. The ParlaMint project team presented current results and provided information about the opportunities to join as a contributor, as a user, or both. Organised by Petya Osenova (Sofia University and IICT-BAS) and Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences).
Strategy and data availability: the project will establish a strategy for handling and processing parliamentary data in times of any emergency (COVID-19 is just a showcase). Thus, different reference corpora could be produced from parliamentary records of earlier periods of global crisis, e.g. the great economic recession, periods of floods in Europe, the Ebola outbreak, etc.
Standard development: the Parla-CLARIN encoding standard will be further developed to cover more detailed and specific metadata across languages and parliaments. The corpora will serve as a baseline for further updates. Such uniform updates across the corpora would strongly support various methods of comparative research.
From showcasing to real applications: the availability of comparable multilingual parliamentary data (also made visible through concordancers and services like Parlameter) will boost research in digital humanities, linguistics, political science, sociology and psychology, as well as in all related branches of science.
The second stage of the project consists of 5 work packages.
WP1. Lead: Tomaž Erjavec (IJS), Matyáš Kopp (UFAL)
- T1.1: harmonisation of encoding
- T1.2: git management
- T1.3: adding metadata to existing corpora
WP2. Lead: Tomaž Erjavec (IJS)
- T2.1: adding new corpora
- T2.2: extending existing corpora
- T2.3: data distribution
WP3. Lead: Nikola Ljubešić (IJS)
- T3.1: machine translation and semantic tagging
- T3.2: multimodality
WP4. Lead: Darja Fišer (INZ), Çağrı Çöltekin (TUB)
- T4.1: tutorial
- T4.2: hackathon
- T4.3: shared task
- T4.4: showcases
WP5. Lead: Maciej Ogrodniczuk, Petya Osenova
- T5.1: management
- T5.2: dissemination
- T5.3: external monitoring
In-kind contributing partners who extend their corpora from ParlaMint I:
- Belgium (Jesse de Does)
- Bulgaria (Petya Osenova, Kiril Simov)
- Croatia (Nikola Ljubešić, Michal Mochtak)
- Czechia (Matyáš Kopp)
- Denmark (Costanza Navarretta)
- France (Giancarlo Luxardo, Sascha Diwersy)
- Hungary (Miklós Sebők) - participated only in ParlaMint I
- Iceland (Starkaður Barkarson)
- Italy (Tommaso Agnoloni, Giulia Venturi)
- Latvia (Roberts Darģis)
- Lithuania (Tomas Krilavičius, Andrius Utka, Vaidas Morkevičius, Mindaugas Petkevičius, Monika Briedienė)
- Netherlands (Maarten Marx, Ruben van Heusden)
- Poland (Maciej Ogrodniczuk, Michał Rudolf, Danijel Koržinek)
- Slovenia (Tomaž Erjavec, Andrej Pančur, Darja Fišer)
- Spain (María Calzada Pérez)
- Turkey (Çağrı Çöltekin)
- UK (Paul Rayson, Matt Coole)
Partners who will develop new corpora:
- Basque Country (Mikel Iruskieta)
- Bosnia and Herzegovina (Michal Mochtak, Nikola Ljubešić)
- Catalonia (Nuria Bel)
- Estonia (Kadri Vider, Neeme Kahusk, Martin Mölder)
- Finland (Eero Hyvönen, Jouni Tuominen)
- Greece (Maria Gavriilidou)
- Norway (Magnus Breder Birkenes, Jon Arild Olsen, Koenraad De Smedt)
- Portugal (Amália Mendes)
- Romania (Petru Rebeja, Madalina Chitez, Cornelia Ilie)
- Serbia (Michal Mochtak, Nikola Ljubešić)
- Sweden (Fredrik Norén)
- CLARIN ERIC: budget of 135,000 EUR for ParlaMint I and of 163,000 EUR for ParlaMint II
- In-kind contributing partners: please see sections Project Partners and Acknowledgements
- CLARIN ERIC – 'ParlaMint: Towards Comparable Parliamentary Corpora'
- ARRS (Slovenian Research Agency) P2-103 'Knowledge Technologies'
- Ministry of Education and Science Republic of Bulgaria DO01-272/16.12.2019 'Bulgarian National Interdisciplinary Research e-Infrastructure for Resources and Technologies CLaDA-BG'
- ARRS (Slovenian Research Agency) P6-0411 'Language Resources and Technologies for Slovene'
- LINDAT/CLARIAH-CZ LM2018101 'Digital Research Infrastructure for Language Technologies, Arts and Humanities'
- Spanish Ministry of Science and Innovation PID2019-108866RB-I0 / AEI / 10.13039/501100011033 'Original, Translated and Interpreted Representations of the Refugee Cris(e)s: Methodological Triangulation within Corpus-Based Discourse Studies'
- The Research Council of Lithuania P-MIP-20-373 'Policy Agenda of the Lithuanian Seimas and its Framing: The Analysis of the Seimas Debates in 1990–2020'
- CLARIN-LV, European Regional Development Fund project 1.1.1.5/18/I/016 'University of Latvia and Institutes in the European Research Area – Excellency, Activity, Mobility, Capacity'
- CLARIN-PL-Biz, financed by the European Regional Development Fund as a part of the 2014-2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19