Introduction
ParlaMint is a CLARIN Flagship project that focuses on creating comparable and uniformly annotated corpora of parliamentary debates in Europe. The first stage of the project (ParlaMint I: 2020 – 2021) resulted in the compilation of 17 corpora. The second stage (ParlaMint II: 2022 – 2023) is extending the time span of the corpora, adding corpora for new countries and autonomous regions, providing a machine-translated English version of the corpora, enriching the corpora with additional metadata, and improving their usability.
The corpora developed in the first stage of the project are described in the open access paper:
- Tomaž Erjavec et al. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, 2022. https://doi.org/10.1007/s10579-021-09574-0
There are also shorter publications on the second stage of the project (cf. Publications and Presentations), and the first version of the ParlaMint II corpora is already available (cf. ParlaMint corpora).
A part of the ParlaMint project is also the dissemination and uptake of the results, including a tutorial, showcases, etc. (cf. Tutorials and Showcases).
ParlaMint corpora

ParlaMint corpora are openly available under the CC BY license, as well as freely available for analysis through noSketch Engine. The latest version of the corpora is:
- Tomaž Erjavec et al. (2023) Multilingual comparable corpora of parliamentary debates ParlaMint 4.0. http://hdl.handle.net/11356/1859
- Tomaž Erjavec et al. (2023) Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.0. http://hdl.handle.net/11356/1860
- Taja Kuzman et al. (2023) Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.0. http://hdl.handle.net/11356/1864
The ParlaMint project also has a GitHub repository, where samples of the corpora, the XML schema and corpus processing and validation scripts are available.
Tutorials and Showcases
- Voices of the Parliament: A Corpus Approach to Parliamentary Discourse Research. Tutorial by Darja Fišer and Kristina Pahor de Maiti.
- ParlaMint and ParlaMeter: How Standardised Data Formats Empower End Users. Filip Dobranić, CLARIN Café: ParlaMint Unleashed, June 2021.
- ParlaMint - A Resource for Democracy. Dario Del Fante and Virginia Zorzi, 'Who Is the Enemy Now?', CLARIN Impact Stories, January 2023.
- Networks of Power - Gender Analysis in European Parliaments. Jure Skubic, Alexandra Bruncrona, Jan Angermeier, Bojan Evkoski and Larissa Leiminger, CLARIN Impact Stories, February 2023.
Publications and Presentations
- Tomaž Erjavec et al. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, 2022. https://doi.org/10.1007/s10579-021-09574-0
- Skubic, Jure, Angermeier, Jan, Bruncrona, Alexandra, Evkoski, Bojan and Larissa Leiminger. (2022). "Networks of Power: Gender Analysis in Selected European Parliaments." In: Proceedings of the 2nd Workshop on Computational Linguistics for Political Text Analysis (CPSS-2022), Potsdam, Germany. (https://old.gscl.org/en/arbeitskreise/cpss/cpss-2022/workshop-proceedings-2022)
- Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden (2022): ParlaMint II: The Show Must Go On. In: Proceedings of the LREC 2022 ParlaCLARIN III Workshop on Creating, Enriching and Using Parliamentary Corpora, pp. 1-6, European Language Resources Association (ELRA), Paris, France, ISBN 979-10-95546-85-6 (http://www.lrec-conf.org/proceedings/lrec2022/workshops/ParlaCLARINIII/pdf/2022.parlaclariniii-1.1.pdf)
- Skubic, Jure, and Darja Fišer. "Parliamentary discourse research in sociology: Literature review." In Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, pp. 81-91. 2022. (https://aclanthology.org/2022.parlaclarin-1.12/)
- Agnoloni T., Bartolini R., Frontini F., Montemagni S., Marchetti C., Quochi V., Ruisi M. and Venturi G. (2022) "Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus", Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, Marseille, France, 20/06/2022, published by European Language Resources Association ELRA (Paris, FRA), pp. 117-124. (https://aclanthology.org/2022.parlaclarin-1.17.pdf)
- Per Erik Solberg, Pierre Beauguitte, Per Egil Kummervold, Freddy Wetjen (2023) A Large Norwegian Dataset for Weak Supervision ASR. In: Dana Dannélls, Simon Dobnik, Nikolai Ilinykh, Beáta Megyesi, Felix Morger, Joakim Nivre (eds.) Proceedings from The Second Workshop on Resources and Representations for Under-Resourced Languages and Domains, May 22, 2023, Tórshavn, Faroe Islands, pp. 48-52, ©2023 Association for Computational Linguistics, ISBN 978-1-959429-73-9. (https://aclanthology.org/2023.resourceful-1.7/)
- Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Andrej Pančur, Nikola Ljubešić, Tommaso Agnoloni, Starkaður Barkarson, María Calzada Pérez, Çağrı Çöltekin, Matthew Coole, Roberts Dargis, Luciana D. de Macedo, Jesse de Does, Katrien Depuydt, Sascha Diwersy, Dorte Haltrup Hansen, Matyáš Kopp, Tomas Krilavičius, Giancarlo Luxardo, Maarten Marx, Vaidas Morkevičius, Costanza Navarretta, Paul Rayson, Orsolya Ring, Michał Rudolf, Kiril Simov, Steinþór Steingrímsson, István Üveges, Ruben van Heusden, Giulia Venturi. Fišer D., Pahor de Maiti K., Osenova P., Ogrodniczuk M. (202x). Parliaments in focus: Language, Gender and the Pandemic. Gender and Language. (SUBMITTED FOR REVIEW)
- Tomaž Erjavec, Matyáš Kopp, and Katja Meden (2023). "Experience of remote collaborative work in the ParlaMint project using Git". In: TwinTalks Workshop at DH2023, book of abstracts. Graz, Austria. (https://www.clarin.eu/event/2023/twintalks-workshop-dh2023)
- Tomaž Erjavec, Katja Meden and Jure Skubic (2023). "Adding political orientation metadata to ParlaMint corpora". CLARIN annual conference 2023 (in print).
- Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden and Taja Kuzman. (2023). "The ParlaMint Project: Ever-growing Family of Comparable and Interoperable Parliamentary Corpora". CLARIN annual conference 2023 (in print).
- Invited talk of Tomaž Erjavec, at the 1st Workshop on Computational Linguistics for Political Text Analysis at KONVENS2021.
- Maciej Ogrodniczuk: The Impact of Parliamentary Datasets for Society and (Data) Science. At: DH Forum, European Parliament's STOA Panel, Brussels, April 26, 2023. (https://drive.google.com/drive/folders/13ieQuPhr6jl88U4JXACn9-l8Zf4DpR_L)
- Marilina Pisani. (2022) Árboles, Gráficos y Matrices de Datos. Codificación en TEI de un Corpus de Interacciones Parlamentarias con Python. Final Master Thesis supervised by Núria Bel. Máster en Humanidades y Patrimonio Digitales. Universidad Autónoma de Barcelona. (https://github.com/marilinapisani/)
- Pieters M. (2021). A Comparative Analysis on the ParlaMint Corpus. MSc thesis.
- A Return of Science? Mapping Attitudes Towards Science and Expertise in COVID-19 Parliamentary Debates by Ruben Ros for CLARIN Café: ParlaMint Unleashed, June 2021. GitHub repository with code and research report.
- A Comparative Analysis on the ParlaMint Project by Miguel Pieters for CLARIN Café: ParlaMint Unleashed, June 2021.
- ParlaMint II: The show must go on presented at the CLARIN Annual Conference 2022.
- ParlaMint: Towards Comparable Parliamentary Corpora presented at the Virtual CLARIN Annual Conference 2020.
- Early results in WP3 (T3.2) thanks to collaboration between CLARIN.SI (Nikola Ljubešić) and CLARIN-PL (Danijel Koržinek) on building the first Automatic Speech Recognition (ASR) training dataset for Croatian – ParlaSpeech-HR. The data consist of 1816 hours of 8-20 second segments, with original and normalized transcripts and alignment to the ParlaMint-HR 2.1 corpus. The current best ASR model (XLS-R), fine-tuned on just 200 hours, achieves a word error rate (WER) of 7.77% and a character error rate (CER) of 2.62%. The dataset is available here: http://hdl.handle.net/11356/1494.
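As background for the WER (word error rate) and CER (character error rate) figures quoted above, the following is a minimal sketch of how such metrics are commonly computed: the edit distance between a reference transcript and the ASR output, divided by the length of the reference. It is an illustration only, not the evaluation code used for ParlaSpeech-HR; the function names and the example sentences are hypothetical, and real evaluations usually normalize text (case, punctuation) first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one of five words is misrecognized.
reference = "the session is now open"
hypothesis = "the session is not open"
print(f"WER = {wer(reference, hypothesis):.2%}")  # 1 error / 5 words
print(f"CER = {cer(reference, hypothesis):.2%}")  # 1 error / 23 characters
```

Because a single wrong character makes the whole word count as an error, CER is typically lower than WER for the same output, which is consistent with the figures reported above.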
ParlaMint I work plan
WP 1: Testing the approach for four languages (Lead: Maciej Ogrodniczuk (IPI-PAN), Petya Osenova (IICT-BAS))
- T1.1: Preparation of the reference parliamentary corpora
- T1.2: Creation of COVID-19 parliamentary corpora
- T1.3: Mounting of the corpora on the NoSketch Engine and KonText concordancers
- T1.4: Preparation of guidelines and mini-grant procedure
WP 2: Extending the corpora and showcasing (Lead: Tomaž Erjavec (IJS))
- T2.1: Adding additional corpora to the infrastructure
- T2.2: Preparation of showcases
- T2.3: Preparation of the documentation for usage by interested parties
ParlaMint II work plan
- T1.1: Harmonization of encoding
- T1.2: Git management
- T1.3: Adding metadata to existing corpora
- T2.1: Adding new corpora
- T2.2: Extending existing corpora
- T2.3: Data distribution
- T3.1: Machine translation and semantic tagging
- T3.2: Multimodality
- T4.1: Tutorial
- T4.2: Hackathon
- T4.3: Shared task
- T4.4: Showcases
- T5.1: Management
- T5.2: Dissemination
- T5.3: External monitoring
Project Partners
In-kind contributing partners
Belgium | Jesse de Does |
Bulgaria | Petya Osenova, Kiril Simov |
Croatia | Nikola Ljubešić, Michal Mochtak |
Czechia | Matyáš Kopp |
Denmark | Costanza Navarretta, Dorte Haltrup Hansen, Bart Jongejan |
France | Giancarlo Luxardo, Sascha Diwersy |
Hungary | Miklós Sebők - ParlaMint I, Noémi Ligeti-Nagy - ParlaMint II |
Iceland | Starkaður Barkarson |
Italy | Tommaso Agnoloni, Giulia Venturi |
Latvia | Roberts Darģis |
Lithuania | Tomas Krilavičius, Andrius Utka, Vaidas Morkevičius, Petkevičius Mindaugas, Monika Briedienė |
Netherlands | Maarten Marx, Ruben van Heusden |
Poland | Maciej Ogrodniczuk, Michał Rudolf, Danijel Korzinek |
Slovenia | Tomaž Erjavec, Andrej Pančur, Darja Fišer |
Spain | María Calzada Pérez, Ruben de Libano, Monica Albini |
Turkey | Çağrı Çöltekin |
UK | Paul Rayson, Matt Coole |
New partners
Austria | Hannes Pirker, Tanja Wissik |
Basque Country | Mikel Iruskieta |
Bosnia and Herzegovina | Michal Mochtak, Nikola Ljubešić |
Catalonia | Nuria Bel |
Estonia | Kadri Vider, Neeme Kahusk, Martin Mölder |
Finland | Eero Hyvönen, Jouni Tuominen |
Galicia | Adina Ioana Vladu, Carmen Magariños, Daniel Bardanca, Mario Barcala, Marcos Garcia, María Pérez Lago, Pedro García Louzao, Ainhoa Vivel Couso, Marta Vázquez Abuín, Noelia García Díaz, Adrián Vidal Miguéns, Elisa Fernández Rei |
Greece | Maria Gavriilidou |
Norway | Magnus Breder Birkenes, Jon Arild Olsen, Koenraad De Smedt |
Portugal | Amália Mendes |
Romania | Petru Rebeja, Madalina Chitez, Cornelia Ilie |
Serbia | Michal Mochtak, Nikola Ljubešić |
Sweden | Fredrik Norén |
Ukraine | Anna Kryvenko, Matyáš Kopp |
Financial Support
ParlaMint 1 financial support
- CLARIN ERIC: Budget: 135,000 EUR
- ARRS (Slovenian Research Agency) P2-103 "Knowledge Technologies"
- ARRS (Slovenian Research Agency) P6-0411 "Language Resources and Technologies for Slovene"
- CLARIN-LV, European Regional Development Fund project 1.1.1.5/18/I/016 "University of Latvia and institutes in the European Research Area - Excellency, activity, mobility, capacity"
- LINDAT/CLARIAH-CZ LM2018101 "Digital Research Infrastructure for Language Technologies, Arts and Humanities"
- Ministry of Education and Science Republic of Bulgaria DO01-272/16.12.2019 "Bulgarian National Interdisciplinary Research e-Infrastructure for Resources and Technologies CLaDA-BG"
- Spanish Ministry of Science and Innovation PID2019-108866RB-I0 / AEI / 10.13039/501100011033 "Original, translated and interpreted representations of the refugee cris(e)s: methodological triangulation within corpus-based discourse studies"
- The Research Council of Lithuania P-MIP-20-373 "Policy Agenda of the Lithuanian Seimas and its Framing: The Analysis of the Seimas Debates in 1990-2020"
ParlaMint 2 financial support
- CLARIN ERIC: Budget: 163,000 EUR
- In-kind contributing partners: please see in section Project Partners
- ARRS (Slovenian Research Agency) P6-0411 "Language Resources and Technologies for Slovene"
- Nederlandse Organisatie voor Wetenschappelijk Onderwijs CISC.CC.016 "Access to City Councils using Exploratory Search Systems"
- ARRS (Slovenian Research Agency) J7-4642 "MEZZANINE"
- ARRS (Slovenian Research Agency) N6-0099 "Flemish-Slovenian bilateral basic research project ‘Linguistic landscape of hate speech online’ (2019-2023)"
- ARRS (Slovenian Research Agency) N6-0288 "the MSCA Seal of Excellence postdoctoral project 'The Changing Discursive Semantics of EU Representations' (2022-2024)"
- Austrian Academy of Sciences - "ÖAW"
- Bulgarian Ministry of Education and Science DO1-301/17.12.21 "Bulgarian National Interdisciplinary Research e-Infrastructure for Resources and Technologies in favor of the Bulgarian Language and Cultural Heritage, part of the EU infrastructures CLARIN and DARIAH"
- Department of Nordic Studies and Linguistics (NorS), University of Copenhagen CLARIN-DK "CLARIN-DK"
- Dutch Language Institute
- European Commission POIR.04.02.00-00C002/19 "European Regional Development Fund as a part of the 2014-2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure"
- Fundação para a Ciência e a Tecnologia UIDP/00214/2020
- Galician Language Institute, University of Santiago de Compostela
- Hungarian Research Centre for Linguistics
- Institute for Language and Speech Processing / ATHENA RC
- Institute of Computer Science, Polish Academy of Sciences - "statutory research"
- Jožef Stefan Institute CLARIN "CLARIN.SI"
- Ministry of Education, Youth and Sports of the Czech Republic LM2023062 "LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities"
- National Library of Norway
- Polish Ministry of Education and Science 2022/WK/09 "National contribution to CLARIN ERIC – European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure 2022–2023 (CLARIN Q)"
- Slovenian Research Agency (ARRS) P6-0436 "Basic national research program 'Digital Humanities' (2022-2027)"
- The Árni Magnússon Institute for Icelandic Studies
- Xunta de Galicia - University of Santiago de Compostela 2021-CP080 "Nós: Galician in the society and economy of artificial intelligence (2021-CP080), agreement between Xunta de Galicia and the University of Santiago de Compostela"
Contact Persons
ParlaMint I (July 2020 - May 2021)
- Creating a multilingual set of uniformly annotated corpora of parliamentary proceedings dating from November 2019 to July 2020 (thus covering the current COVID-19 pandemic situation).
- Creating a set of comparable multilingual reference corpora of parliamentary data from 2015 to October 2019.
- Processing the corpora linguistically to add syntactic structures of Universal Dependencies as well as Named Entities annotation.
- Making the corpora available through concordancers and Parlameter.
- Building use cases in Political Sciences and Digital Humanities based on the corpus data.
Phase 1 (July 1, 2020 - September 30, 2020)
In Phase 1, the approach described in tasks 1, 2, 3 and 4 in the Tasks section will be tested for 4 pilot languages: Bulgarian, Croatian, Slovene and Polish.
- Creation of the COVID-19 parliamentary corpora (Nov. 2019 - July 2020) as well as the reference parliamentary corpora (2015 - Oct. 2019).
- The data will be processed to adhere to the Parla-CLARIN TEI annotation scheme. [Responsible persons for gathering and conversion of data: Bulgarian (Petya Osenova and Kiril Simov); Croatian (Nikola Ljubešić); Slovene (Andrej Pančur); Polish (Maciej Ogrodniczuk and Michał Rudolf)].
- The data will be processed linguistically with Universal Dependencies and Named Entities. (Responsible person: Nikola Ljubešić)
- Mounting of the corpora on the NoSketch Engine and KonText concordancers. (Responsible person: Tomaž Erjavec)
- Mounting of the corpora to Parlameter. (Responsible persons: Filip Muki Dobranić and Tomaž Kunst)
- Preparation of guidelines. (Responsible persons: All)
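To give a rough idea of what working with TEI-encoded parliamentary data can look like, here is a minimal sketch that counts speeches per speaker. It assumes that speeches are encoded as TEI `u` (utterance) elements carrying a `who` attribute and containing `seg` segments; the inline sample document and the speaker identifiers are hypothetical, so the actual corpus files and the schema in the project repository should be consulted before relying on this structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The TEI namespace is standard; the sample below is an invented illustration.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA"><seg>First sentence. </seg><seg>Second one.</seg></u>
    <u who="#SpeakerB"><seg>A reply.</seg></u>
    <u who="#SpeakerA"><seg>Closing remark.</seg></u>
  </div></body></text>
</TEI>"""


def utterances(xml_text):
    """Yield (speaker reference, plain text) for every utterance element."""
    root = ET.fromstring(xml_text)
    for u in root.iterfind(".//tei:u", TEI_NS):
        segments = ("".join(seg.itertext()).strip()
                    for seg in u.iterfind("tei:seg", TEI_NS))
        yield u.get("who"), " ".join(segments)


# Count how many speeches each (hypothetical) speaker gave.
speech_counts = Counter(who for who, _ in utterances(SAMPLE))
for who, n in speech_counts.most_common():
    print(who, n)
```

The same loop could be extended to read files from disk with `ET.parse`, provided the element names match those used in the released corpora.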
Phase 2 (October 1, 2020 - May 30, 2021)
Also, three showcases are envisaged:
- Availability in Parlameter: facilitating the corpus graphical exploration suited for politological investigations and investigations by journalists and active citizens. (Responsible person: Filip Muki Dobranić)
- The linguistic showcase will extend the CLARIN tutorial "Voices of the Parliament", showing how corpora can be used to investigate language use and communication practices in the specialised socio-cultural context of political discourse. (Responsible persons: Darja Fišer and Kristina Pahor de Maiti). http://doi.org/10.3828/mlo.v0i0.295
- The DH-related showcase will be prepared by a digital historian. (Responsible person: Ruben Ros)
The ParlaMint project started by creating recent corpora of parliamentary sessions for 4 parliaments: Bulgarian, Croatian, Polish and Slovene. In 2021 the project was extended with data for 13 additional parliaments from the following countries: Belgium, Czech Republic, Denmark, France, Hungary, Iceland, Italy, Latvia, Lithuania, Romania, the Netherlands, Turkey and the UK. Release 2.1 of the dataset is now available.
Parliamentary Corpora for 17 languages (release: Q2 2021)
Corpora as Data
- Multilingual comparable corpora of parliamentary debates ParlaMint 2.1 are available in the CLARIN.SI repository: hdl.handle.net/11356/1432
- The linguistically annotated version of these corpora is available at: hdl.handle.net/11356/1431
Corpora in Concordancers
- NoSketch: clarin.si/noske
- KonText: https://www.clarin.si/kontext/corpora/corplist
Parliamentary Corpora for 4 pilot languages (release: Q4 2020)
Corpora as Data
- Multilingual comparable corpora of parliamentary debates ParlaMint 1.0 in CLARIN.SI repository: hdl.handle.net/11356/1345
The proposals of the following applicants were assessed and approved by the ParlaMint team in consultation with representatives of CLARIN Board of Directors:
Name | Affiliation | Language |
---|---|---|
Paul Rayson | Lancaster University | English |
Ruben van Heusden | University of Amsterdam – ILPS research group | Dutch |
Steinþór Steingrímsson | The Árni Magnússon Institute for Icelandic Studies | Icelandic |
Tomas Krilavičius | Applied Informatics dept., Vytautas Magnus University (Vytauto Didžiojo University) | Lithuanian |
Barbora Hladká | Charles University | Czech |
Giulia Venturi | Institute for Computational Linguistics "A. Zampolli" (ILC-CNR) | Italian |
Çağrı Çöltekin | University of Tübingen | Turkish |
Costanza Navarretta | University of Copenhagen | Danish |
Miklós Sebők | Centre for Social Sciences, Budapest, Hungary | Hungarian |
Giancarlo Luxardo | Praxiling UMR 5267 | French |
Roberts Darģis | Institute of Mathematics and Computer Science, University of Latvia | Latvian |
Petru Rebeja | Alexandru Ioan Cuza University of Iași | Romanian |
Jesse de Does | Instituut voor de Nederlandse Taal | Belgian Dutch/French |
- Showcase presentation: A Return of Science? Mapping attitudes towards science and expertise in COVID-19 parliamentary debates by Ruben Ros for CLARIN Café: ParlaMint Unleashed. GitHub repository with code and research report.
- Showcase presentation: A Comparative Analysis on the ParlaMint Project by Miguel Pieters for CLARIN Café: ParlaMint Unleashed.
- Showcase presentation: ParlaMint and ParlaMeter: How standardised data formats empower end users by Filip Dobranić for CLARIN Café: ParlaMint Unleashed.
- CLARIN Bazaar Poster: ParlaMint: Towards Comparable Parliamentary Corpora presented at the Virtual CLARIN Annual Conference 2020.
- CLARIN Café: ParlaMint Unleashed on 28 June 2021 from 14:00 to 16:00, organised by Tomaž Erjavec, Darja Fišer, Maciej Ogrodniczuk, Petya Osenova. Eight months after the introductory CLARIN Café on ParlaMint we present the results, lessons learnt and showcases of the project.
- CLARIN Café - Join Our Parliamentary-flavoured Coffee: ParlaMINT. The ParlaMint project team presented current results and provided information about the opportunities to join as a contributor, as a user, or both. Organized by Petya Osenova (Sofia University and IICT-BAS) and Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences).
- Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences (Coordinator) (also: Michał Rudolf)
- Petya Osenova, Institute for Information and Communication Technologies, Bulgarian Academy of Sciences (also: Kiril Simov)
- Tomaž Erjavec, Jožef Stefan Institute, Ljubljana (also: Darja Fišer and Kristina Pahor de Maiti)
- Andrej Pančur, Institute of Contemporary History, Ljubljana
- Nikola Ljubešić, Jožef Stefan Institute, Ljubljana
- Filip Muki Dobranić, Parlameter.org (also: Tomaž Kunst)
- Ruben Ros, Luxembourg University