ParlaMint: Comparable and Interoperable Parliamentary Corpora

ParlaMint, a CLARIN flagship project, resulted in the creation of comparable corpora of parliamentary debates of 29 European countries and autonomous regions, covering at least the period from 2015 to 2022, and containing over 1 billion words. The corpora are uniformly encoded, contain rich metadata about their 24 thousand speakers, and are linguistically annotated up to the level of Universal Dependencies syntax and named entities. 

The second stage of the project, ParlaMint II, enhanced the encoding infrastructure, the use of GitHub, the production of individual corpora, the common pipeline for producing their distribution, and the use of CLARIN services for dissemination. Qualitative additions made within ParlaMint II include metadata localisation, new metadata such as the political orientation of political parties, machine translation of the corpora into English with semantic-class tagging of the translations, and the production of pilot speech corpora.

 

The ParlaMint dataset includes uniformly encoded transcriptions of speeches from 29 parliaments across Europe. 

 

ParlaMint Corpora 

ParlaMint corpora are openly available under the CC BY license, and can be freely analysed and browsed through noSketch Engine and TEITOK. The latest version of the corpora is 4.1.

The ParlaMint project also has a GitHub repository, where samples of the corpora, the XML schema and corpus processing and validation scripts are available.

Showcases

Emotions Running High?: Average emotion scores for five parties in the Turkish parliament throughout the period available.

Tutorials

Tutorial by Darja Fišer and Kristina Pahor de Maiti: Voices of the Parliament: A Corpus Approach to Parliamentary Discourse Research

This tutorial shows how corpora can be used to investigate language use and communication practices in a specialised socio-cultural context of political discourse in order to explore socio-cultural phenomena. It demonstrates the potential of a richly annotated diachronic corpus of Slovenian parliamentary debates for investigating the characteristics and dynamics of the representation of women and their language use in the Slovenian Parliament.

 



Publications and Presentations


  • Tomaž Erjavec et al. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, 2022. https://doi.org/10.1007/s10579-021-09574-0
  • Skubic, Jure, Angermeier, Jan, Bruncrona, Alexandra, Evkoski, Bojan and Larissa Leiminger. (2022). "Networks of Power: Gender Analysis in Selected European Parliaments." In: Proceedings of the 2nd Workshop on Computational Linguistics for Political Text Analysis (CPSS-2022), Potsdam, Germany. (https://old.gscl.org/en/arbeitskreise/cpss/cpss-2022/workshop-proceedings-2022)
  • Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden (2022): ParlaMint II: The Show Must Go On. In: Proceedings of the LREC 2022 ParlaCLARIN III Workshop on Creating, Enriching and Using Parliamentary Corpora, pp. 1-6, European Language Resources Association (ELRA), Paris, France, ISBN 979-10-95546-85-6 (http://www.lrec-conf.org/proceedings/lrec2022/workshops/ParlaCLARINIII/pdf/2022.parlaclariniii-1.1.pdf)
  • Skubic, Jure, and Darja Fišer. "Parliamentary discourse research in sociology: Literature review." In Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, pp. 81-91. 2022. (https://aclanthology.org/2022.parlaclarin-1.12/)
  • Agnoloni T., Bartolini R., Frontini F., Montemagni S., Marchetti C., Quochi V., Ruisi M. and Venturi G. (2022) “Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus”, Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, Marseille, France, 20/06/2022, published by European Language Resources Association ELRA (Paris, France), pp. 117-124. (https://aclanthology.org/2022.parlaclarin-1.17.pdf)
  • Per Erik Solberg, Pierre Beauguitte, Per Egil Kummervold, Freddy Wetjen (2023) A Large Norwegian Dataset for Weak Supervision ASR. In: Dana Dannélls, Simon Dobnik, Nikolai Ilinykh, Beáta Megyesi, Felix Morger, Joakim Nivre (eds.) Proceedings from The Second Workshop on Resources and Representations for Under-Resourced Languages and Domains, May 22, 2023, Tórshavn, Faroe Islands, pp. 48-52, ©2023 Association for Computational Linguistics, ISBN 978-1-959429-73-9. (https://aclanthology.org/2023.resourceful-1.7/)
  • Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Andrej Pančur, Nikola Ljubešić, Tommaso Agnoloni, Starkaður Barkarson, María Calzada Pérez, Çağrı Çöltekin, Matthew Coole, Roberts Dargis, Luciana D. de Macedo, Jesse de Does, Katrien Depuydt, Sascha Diwersy, Dorte Haltrup Hansen, Matyáš Kopp, Tomas Krilavičius, Giancarlo Luxardo, Maarten Marx, Vaidas Morkevičius, Costanza Navarretta, Paul Rayson, Orsolya Ring, Michał Rudolf, Kiril Simov, Steinþór Steingrímsson, István Üveges, Ruben van Heusden, Giulia Venturi. Fišer D., Pahor de Maiti K., Osenova P., Ogrodniczuk M. (202x). Parliaments in focus: Language, Gender and the Pandemic. Gender and Language. (SUBMITTED FOR REVIEW)
  • Tomaž Erjavec, Matyáš Kopp, and Katja Meden (2023). "Experience of remote collaborative work in the ParlaMint project using Git". In: TwinTalks Workshop at DH2023, book of abstracts. Graz, Austria. (https://www.clarin.eu/event/2023/twintalks-workshop-dh2023)
  • Tomaž Erjavec, Katja Meden and Jure Skubic (2023). "Adding political orientation metadata to ParlaMint corpora". CLARIN annual conference 2023 (in print).
  • Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden and Taja Kuzman. (2023). "The ParlaMint Project: Ever-growing Family of Comparable and Interoperable Parliamentary Corpora". CLARIN annual conference 2023 (in print).

 

ParlaMint I and ParlaMint II 

For detailed information about ParlaMint I and ParlaMint II, including work plans, project partners and financial support, please see this overview.
 

Contact Persons 

Maciej Ogrodniczuk: maciej.ogrodniczuk [at] gmail.com
Petya Osenova: petya [at] bultreebank.org