ParlaMint: Towards Comparable Parliamentary Corpora



National parliamentary data is a verified communication channel between the elected political representatives and society members in any democracy. It needs to be made accessible and comprehensive - especially in times of a global crisis. With the recent advances of artificial intelligence, analytics over unstructured parliamentary data for many languages is rapidly becoming a prerequisite for reliable and trustworthy approaches in checking the veracity of information in contemporary society.

One of the most important characteristics of new parliamentary data is its direct correspondence to the most recent events, including the ones with global impact on human health, social life and economics such as the current COVID-19 pandemic. By comparing the data synchronically and diachronically within a cross-lingual context, scientific and civil communities will be able to track pan-European discussion and can be quickly updated on any emerging topic.

Mission and Goal

The mission of the ParlaMint project is to turn existing contemporary multilingual and diverse cross-national parliamentary data into resources that are:
  • comparable, 
  • interpretable and 
  • highly communicative with respect to society (NGOs, citizens, researchers, etc.). 
The project will provide data and tools for focused observation of trends, opinions, and decisions on lockdowns and restrictive measures, as well as of their consequences for health, medical care systems, employment, etc. in times of emergency. For the ParlaMint project the emergency case is the COVID-19 pandemic. However, the methodology will be scalable to other events, such as economic crises. Thus, the main aims are:
  • to compile a collection of parliamentary datasets (corpora) in a number of languages and in a harmonized format, covering both the current data and older, reference data, 
  • to process the corpora linguistically, 
  • to index the data with popular concordancers so that interested parties can search and extract the relevant comparable information, 
  • to show through appropriate use cases that the CLARIN resources and technology serve societal needs.
To accomplish this mission, a TEI-based standard schema has been developed and rich metadata has been added, such as biographies of members of parliament, party descriptions, ruling vs. opposition information and the relations among all of them.
Thus, observations over democratic processes are approached through parliaments as digital bodies viewed through the following related strategies:
  • speaker and party statistics (for instance, who spoke more and on which topic; who changed their mind on a certain topic; which party defends or opposes which proposals, etc.)
  • topic modeling (which topics are most popular at what time; how topics change and interrelate, etc.)
  • time and context-bound social tendencies (tendencies in policy making over time).
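
As an illustration of the first strategy, speaker and party statistics can be sketched in a few lines of Python. This is only a minimal sketch: the record layout, speaker names, parties and texts are invented for the example and do not come from the ParlaMint corpora, and whitespace splitting stands in for real tokenisation.

```python
from collections import Counter

# Hypothetical speech records; names, parties and texts are invented.
speeches = [
    {"speaker": "A. Example", "party": "Party X",
     "text": "We must support the health care system now."},
    {"speaker": "B. Sample", "party": "Party Y",
     "text": "The lockdown measures need a clear end date."},
    {"speaker": "A. Example", "party": "Party X",
     "text": "Employment support is equally urgent."},
]


def word_counts(records, key):
    """Sum whitespace-separated token counts per value of `key`."""
    totals = Counter()
    for record in records:
        totals[record[key]] += len(record["text"].split())
    return totals


by_speaker = word_counts(speeches, "speaker")
by_party = word_counts(speeches, "party")

# Who spoke more, by speaker and by party.
print(by_speaker.most_common())
print(by_party.most_common())
```

The same function serves both groupings because the grouping field is passed in as `key`; any other metadata field attached to a speech could be used the same way.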
The ParlaMint project started by creating recent corpora of parliamentary sessions for 4 parliaments: Bulgarian, Croatian, Polish and Slovene. In 2021 the project was extended with data for 13 additional parliaments from the following countries: Belgium, Czech Republic, Denmark, France, Hungary, Iceland, Italy, Latvia, Lithuania, Romania, the Netherlands, Turkey and the UK. Release 2.1 of the dataset is now available (see below under Results). A follow-up project is likely to start in Q4 2021.

Expected Outcome

Strategy and Data availability: The project will establish a strategy for handling and processing parliamentary data in times of emergency (COVID-19 is just a showcase). Thus, different reference corpora could be produced from parliamentary records of earlier periods of global crisis, e.g. the great economic recession, periods of floods in Europe, the Ebola outbreak, etc.

Standard development: The Parla-CLARIN encoding standard will be further developed to cover more detailed and specific metadata across languages and parliaments. The corpora will serve as a baseline for further updates. Such uniform updates across the corpora would strongly support various methods of comparative research.

From showcasing to real applications: The availability of comparable multilingual parliamentary data (also made visible through concordancers and Parlameter) will boost research in digital humanities, linguistics, political science, sociology and psychology, as well as in all related branches of science.


Tasks

1. Creating a multilingual set of uniformly annotated corpora of parliamentary proceedings dating from November 2019 to July 2020 (thus covering the current COVID-19 pandemic situation).

2. Creating a set of comparable multilingual reference corpora of parliamentary data from 2015 to October 2019.

3. Processing the corpora linguistically to add Universal Dependencies syntactic structures as well as Named Entity annotation.

4. Making the corpora available through concordancers and Parlameter.

5. Building use cases in Political Sciences and Digital Humanities based on the corpus data.
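
The split between the COVID-19 corpora (task 1) and the reference corpora (task 2) can be expressed as a simple date test. The sketch below assumes that the periods are whole calendar months with inclusive boundaries and that the reference period starts on 1 January 2015; the constant and function names are made up for the example.

```python
from datetime import date

# Assumed boundaries: whole calendar months, inclusive.
REFERENCE_START = date(2015, 1, 1)
COVID_START = date(2019, 11, 1)
COVID_END = date(2020, 7, 31)


def subcorpus(session_date):
    """Return 'covid', 'reference', or None for dates outside both periods."""
    if COVID_START <= session_date <= COVID_END:
        return "covid"
    if REFERENCE_START <= session_date < COVID_START:
        return "reference"
    return None


for day in (date(2019, 10, 31), date(2019, 11, 1), date(2020, 8, 1)):
    print(day.isoformat(), subcorpus(day))
```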


Phase 1 (July 1, 2020 - September 30, 2020)

In Phase 1 the approach described in tasks 1, 2, 3 and 4 of the Tasks section will be tested for 4 pilot languages: Bulgarian, Croatian, Slovene and Polish.

  1. Creation of the COVID-19 parliamentary corpora (Nov. 2019 - July 2020) as well as the reference parliamentary corpora (2015 - Oct. 2019).
    1. The data will be processed to adhere to the Parla-CLARIN TEI annotation scheme. (Responsible persons for gathering and conversion of data: Bulgarian (Petya Osenova and Kiril Simov); Croatian (Nikola Ljubešić); Slovene (Andrej Pančur); Polish (Maciej Ogrodniczuk and Michał Rudolf))
    2. The data will be processed linguistically with Universal Dependencies and Named Entities. (Responsible person: Nikola Ljubešić)
  2. Mounting of the corpora on the NoSketch Engine and KonText concordancers. (Responsible person: Tomaž Erjavec)
  3. Mounting of the corpora to Parlameter. (Responsible persons: Filip Muki Dobranić and Tomaž Kunst)
  4. Preparation of guidelines. (Responsible persons: All) 
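
To give a feel for what working with TEI-encoded proceedings can look like, here is a minimal Python sketch that reads speeches from a stripped-down TEI fragment. It assumes that speeches are encoded as TEI `u` elements with a `who` attribute and `seg` children, which is common TEI practice for transcribed speech but is not spelled out on this page. The sample fragment and speaker identifiers are invented, and the fragment is not a schema-valid Parla-CLARIN document (it has no header, for instance).

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the prefix "tei" is only a local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented, stripped-down sample; real files carry a header and far more metadata.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA"><seg>We open today's session.</seg></u>
    <u who="#SpeakerB"><seg>Thank you.</seg><seg>I have a question.</seg></u>
  </div></body></text>
</TEI>"""


def utterances(xml_text):
    """Return a list of (speaker reference, text) pairs, one per <u> element."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # Join each segment's text and normalise internal whitespace.
        segments = [
            " ".join("".join(seg.itertext()).split())
            for seg in u.iterfind("tei:seg", TEI_NS)
        ]
        result.append((u.get("who"), " ".join(segments)))
    return result


for who, text in utterances(SAMPLE):
    print(who, "->", text)
```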

Phase 2 (October 1, 2020 - May 30, 2021)

In Phase 2 the parliamentary corpora were extended to more languages and parliaments. For this phase a special Call for interest in participation was published in October 2020. 

Also, three showcases are envisaged:

  • Availability in Parlameter: facilitating graphical exploration of the corpora, suited to politological investigations and to investigations by journalists and active citizens. (Responsible person: Filip Muki Dobranić)
  • The linguistic showcase will extend the CLARIN tutorial "Voices of the Parliament", showing how corpora can be used to investigate language use and communication practices in the specialised socio-cultural context of political discourse. (Responsible persons: Darja Fišer and Kristina Pahor de Maiti)
  • The DH-related showcase will be prepared by a digital historian. (Responsible person: Ruben Ros)


Results

Parliamentary Corpora for 17 languages (release: Q2 2021)

Corpora as Data 
Corpora in Concordancers 

Parliamentary Corpora for 4 pilot languages (release: Q4 2020)

Corpora as Data 
Corpora in Concordancers 

ParlaMint Call Results

The proposals of the following applicants were assessed and approved by the ParlaMint team in consultation with representatives of CLARIN Board of Directors:

Name | Affiliation | Language
Paul Rayson | Lancaster University | English
Ruben van Heusden | University of Amsterdam – ILPS research group | Dutch
Steinþór Steingrímsson | The Árni Magnússon Institute for Icelandic Studies | Icelandic
Tomas Krilavičius | Applied Informatics dept., Vytautas Magnus University (Vytauto Didžiojo university) | Lithuanian
Barbora Hladká | Charles University | Czech
Giulia Venturi | Institute for Computational Linguistics "A. Zampolli" (ILC-CNR) | Italian
Çağrı Çöltekin | University of Tübingen | Turkish
Costanza Navarretta | University of Copenhagen | Danish
Miklós Sebők | Centre for Social Sciences, Budapest, Hungary | Hungarian
Giancarlo Luxardo | Praxiling UMR 5267 | French
Roberts Dargis | Institute of Mathematics and Computer Science, University of Latvia | Latvian
Petru Rebeja | Alexandru Ioan Cuza University of Iași | Romanian
Jesse de Does | Instituut voor de Nederlandse Taal | Belgian Dutch/French


  • María Calzada Pérez - from the Translation Studies Unit, Universitat Jaume I, Castellón de la Plana



  • CLARIN Café: ParlaMint Unleashed on 28 June 2021 from 14:00 to 16:00, organised by Tomaž Erjavec, Darja Fišer, Maciej Ogrodniczuk, Petya Osenova. Eight months after the introductory CLARIN Café on ParlaMint we present the results, lessons learnt and showcases of the project. 
  • CLARIN Café - Join Our Parliamentary-flavoured Coffee: ParlaMINT. The ParlaMint project team presented current results and provided information about the opportunities to join as a contributor, as a user, or both. Organised by Petya Osenova (Sofia University and IICT-BAS) and Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences).

    • The dedicated call can be found here.

    • Watch the video recordings of the Café here.


● Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences (Coordinator) (also: Michał Rudolf)

● Petya Osenova, Institute for Information and Communication Technologies, Bulgarian Academy of Sciences (also: Kiril Simov)

● Tomaž Erjavec, Jožef Stefan Institute, Ljubljana (also: Darja Fišer and Kristina Pahor de Maiti)

● Andrej Pančur, Institute of Contemporary History, Ljubljana

● Nikola Ljubešić, Jožef Stefan Institute, Ljubljana

● Filip Muki Dobranić, (also: Tomaž Kunst)

● Ruben Ros, Luxembourg University

Financial support

  • CLARIN Budget: 135,000 EUR
  • The Spanish parliamentary corpus was financially supported by the Spanish Ministry of Science and Innovation, PID2019-108866RB-I0 / AEI / 10.13039/501100011033, "Original, translated and interpreted representations of the refugee cris(e)s: methodological triangulation within corpus-based discourse studies".

Contact persons

Maciej Ogrodniczuk:

Petya Osenova: