ParlaMint: Towards Comparable Parliamentary Corpora


National parliamentary data is a verified communication channel between elected political representatives and members of society in any democracy. It needs to be made accessible and comprehensive, especially in times of global crisis. With recent advances in artificial intelligence, analytics over unstructured parliamentary data is rapidly becoming a prerequisite for reliable and trustworthy approaches to checking the veracity of information in contemporary society.

One of the most important aspects of processing new parliamentary data is its direct correspondence to the most recent events with global impact on human health, social life and economics, such as the current COVID-19 pandemic. By comparing the data synchronically and diachronically within a cross-lingual context, the scientific and civil communities will be able to track pan-European discussion and be quickly updated on any emerging topic.


Our goal is to provide resources and tools for focused observation of trends, opinions, and decisions on lockdowns and restrictive measures, as well as of their consequences for health, medical care systems, employment, etc. in times of emergency. In our case this emergency is the COVID-19 pandemic. However, our methodology will be scalable to other events, such as economic crises. Thus, our main aims are:

  • to compile a collection of parliamentary datasets (corpora) in a number of languages and in a harmonized format, covering both the current data and older, reference data, 
  • to process the corpora linguistically, 
  • to index the data with popular concordancers so that interested parties can search and extract the relevant comparable information, 
  • to show through appropriate use cases that our resources and technology serve society's needs.

Expected Outcome

Strategy and Data availability: The project will establish a strategy for handling and processing parliamentary data in times of any emergency (COVID-19 is just a showcase). Thus, different reference corpora could be produced from parliamentary records of earlier periods of global crisis, e.g. the great economic recession, periods of floods in Europe, the Ebola outbreak, etc.

Standard development: The Parla-CLARIN encoding standard will be further developed to cover more detailed and specific metadata across languages and parliaments. The corpora will serve as a baseline for further updates. Such uniform updates across the corpora would strongly support various methods of comparative research.

From showcasing to real applications: The availability of comparable multilingual parliamentary data (also made visible through concordancers and Parlameter) will boost research in the areas of digital humanities, linguistics, politology, sociology and psychology, as well as in related branches of science.


Tasks

1. Creating a multilingual set of uniformly annotated corpora of parliamentary proceedings dating from November 2019 to July 2020 (thus covering the current COVID-19 pandemic).

2. Creating a set of comparable multilingual reference corpora of parliamentary data from 2015 to October 2019.

3. Processing the corpora linguistically to add Universal Dependencies syntactic annotation as well as named entity annotation.

4. Making the corpora available through concordancers and Parlameter.

5. Building use cases in Political Sciences and Digital Humanities based on the corpus data.
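
The split between the COVID-19 corpora (task 1) and the reference corpora (task 2) is defined purely by date, so it can be sketched as a small routing function. This is only an illustrative sketch: the function name `subcorpus_for`, the labels `"covid"` and `"reference"`, and the exact first and last day of each period are assumptions, since the tasks above give the boundaries by month and year only.

```python
from datetime import date
from typing import Optional

# Period boundaries follow tasks 1 and 2; the exact days are assumptions,
# because the task descriptions name only months and years.
REFERENCE_START = date(2015, 1, 1)
COVID_START = date(2019, 11, 1)
COVID_END = date(2020, 7, 31)


def subcorpus_for(session_date: date) -> Optional[str]:
    """Return the (hypothetical) subcorpus label for a sitting date.

    Returns "covid" for November 2019 to July 2020, "reference" for
    2015 to October 2019, and None for dates outside both periods.
    """
    if COVID_START <= session_date <= COVID_END:
        return "covid"
    if REFERENCE_START <= session_date < COVID_START:
        return "reference"
    return None
```

For example, a sitting held on 15 March 2020 would be routed to the COVID-19 corpus, while one held on 31 October 2019 would go to the reference corpus.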


Phase 1 (July 1, 2020 - September 30, 2020)

In Phase 1 the approach described in tasks 1, 2, 3 and 4 of the Tasks section will be tested for four pilot languages: Bulgarian, Croatian, Slovene and Polish.

  1. Creation of the COVID-19 parliamentary corpora (Nov. 2019 - July 2020) as well as the reference parliamentary corpora (2015 - Oct. 2019).

    1. The data will be processed to adhere to the Parla-CLARIN TEI annotation scheme. [Responsible persons for gathering and conversion of data: Bulgarian (Petya Osenova and Kiril Simov); Croatian (Nikola Ljubešić); Slovene (Andrej Pančur); Polish (Maciej Ogrodniczuk and Michał Rudolf)]
    2. The data will be processed linguistically with Universal Dependencies and Named Entities. (Responsible person: Nikola Ljubešić)
  2. Mounting of the corpora on the NoSketch Engine and KonText concordancers. (Responsible person: Tomaž Erjavec)
  3. Mounting of the corpora to Parlameter. (Responsible persons: Filip Muki Dobranić and Tomaž Kunst)
  4. Preparation of guidelines. (Responsible persons: All) 
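
To give a rough idea of what working with data converted in step 1.1 might look like, the sketch below reads speeches from a small TEI fragment. It is a minimal illustration under stated assumptions, not the project's conversion or processing code: it assumes that speeches are encoded as TEI `u` elements with a `who` attribute and `seg` children, and the sample document, the speaker identifiers and the function name `utterances` are invented for the example.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; element and attribute names below (u, seg, who) are
# assumptions about how a Parla-CLARIN-style document encodes speeches.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample data, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div type="debateSection">
    <u who="#SpeakerA" xml:id="u1"><seg>First remark.</seg><seg>Second remark.</seg></u>
    <u who="#SpeakerB" xml:id="u2"><seg>A reply.</seg></u>
  </div></body></text>
</TEI>"""


def utterances(xml_text):
    """Return a list of (speaker reference, speech text) pairs."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # Join the text of each segment, normalising internal whitespace.
        segments = [
            " ".join("".join(seg.itertext()).split())
            for seg in u.iterfind("tei:seg", TEI_NS)
        ]
        result.append((u.get("who", ""), " ".join(segments)))
    return result
```

Calling `utterances(SAMPLE)` yields one pair per speech, which is the kind of flat view that later steps (linguistic annotation, indexing in a concordancer) could start from.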

Phase 2 (October 1, 2020 - May 30, 2021)

In Phase 2 the parliamentary corpora will be extended to more languages and parliaments. For this phase a special Call for interest in participation will be published in October 2020. The results will be announced in November 2020.

Also, three showcases are envisaged:

  • Availability in Parlameter: facilitating graphical exploration of the corpora, suited to politological investigations and to investigations by journalists and active citizens. (Responsible person: Filip Muki Dobranić)
  • The linguistic showcase will extend the CLARIN tutorial "Voices of the Parliament", showing how corpora can be used to investigate language use and communication practices in the specialised socio-cultural context of political discourse. (Responsible persons: Darja Fišer and Kristina Pahor de Maiti)
  • The DH-related showcase will be prepared by a digital historian. (Responsible person: Ruben Ros)


Parliamentary Corpora from Phase 1 

Corpora as Data 
Corpora in Concordancers 



  • CLARIN Café - Join Our Parliamentary-flavoured Coffee: ParlaMINT. The ParlaMint Project team will present current results and provide information about the opportunities to join as a contributor, as a user, or both. Organized by Petya Osenova (Sofia University and IICT-BAS) and Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences).

    • The dedicated call can be found here.

    • Watch the video recordings of the Café here.


● Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences (Coordinator) (also: Michał Rudolf)

● Petya Osenova, Institute for Information and Communication Technologies, Bulgarian Academy of Sciences (also: Kiril Simov)

● Tomaž Erjavec, Jožef Stefan Institute, Ljubljana (also: Darja Fišer and Kristina Pahor de Maiti)

● Andrej Pančur, Institute of Contemporary History, Ljubljana

● Nikola Ljubešić, Jožef Stefan Institute, Ljubljana

● Filip Muki Dobranić, (also: Tomaž Kunst)

● Ruben Ros, Luxembourg University

Financial support


Budget: 98,000 EUR

Contact persons

Maciej Ogrodniczuk:

Petya Osenova: