National parliamentary data is a verified communication channel between elected political representatives and members of society in any democracy. It needs to be made accessible and comprehensive - especially in times of a global crisis. With the recent advances in artificial intelligence, analytics over unstructured parliamentary data in many languages is rapidly becoming a prerequisite for reliable and trustworthy approaches to checking the veracity of information in contemporary society.
One of the most important characteristics of new parliamentary data is its direct correspondence to the most recent events, including the ones with global impact on human health, social life and economics such as the current COVID-19 pandemic. By comparing the data synchronically and diachronically within a cross-lingual context, scientific and civil communities will be able to track pan-European discussion and can be quickly updated on any emerging topic.
- interpretable and
- highly communicative with respect to society (NGOs, citizens, researchers, etc.).
- to compile a collection of parliamentary datasets (corpora) in a number of languages and in a harmonized format, covering both the current data and older, reference data,
- to process the corpora linguistically,
- to index the data with popular concordancers so that interested parties can search and extract the relevant comparable information,
- to show through appropriate use cases that the CLARIN resources and technology serve societal needs.
Thus, observations over democratic processes are approached through parliaments as digital bodies viewed through the following related strategies:
- speaker and party statistics (for instance, who spoke more and on which topic; who changed their mind on a certain topic; which party defends/opposes which proposals, etc.)
- topic modeling (which topics are most popular at what time; how topics change and interrelate, etc.)
- time and context-bound social tendencies (tendencies in policy making over time).
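The first and last of these strategies can be illustrated with a minimal sketch. It assumes speeches have already been extracted into simple records; the `SPEECHES` list, the speaker and party names, and the helper functions are hypothetical and do not come from the ParlaMint corpora.

```python
from collections import Counter

# Hypothetical toy records: (date, speaker, party, text). Not real corpus data.
SPEECHES = [
    ("2020-03-10", "Speaker A", "Party X", "The pandemic requires new health measures"),
    ("2020-03-12", "Speaker B", "Party Y", "We oppose the budget proposal"),
    ("2020-04-02", "Speaker A", "Party X", "Health measures and the pandemic response"),
    ("2020-04-15", "Speaker C", "Party Y", "The pandemic affects the economy"),
]


def words_per_key(speeches, index):
    """Sum whitespace-separated word counts, grouped by the field at `index`."""
    totals = Counter()
    for record in speeches:
        totals[record[index]] += len(record[3].split())
    return totals


def monthly_mentions(speeches, keyword):
    """Count, per month (YYYY-MM), the speeches that contain `keyword` as a word."""
    counts = Counter()
    for date, _speaker, _party, text in speeches:
        if keyword.lower() in text.lower().split():
            counts[date[:7]] += 1
    return counts


by_speaker = words_per_key(SPEECHES, 1)   # who spoke more
by_party = words_per_key(SPEECHES, 2)     # which party spoke more
pandemic_by_month = monthly_mentions(SPEECHES, "pandemic")  # tendency over time

print(by_speaker.most_common())
print(by_party.most_common())
print(sorted(pandemic_by_month.items()))
```

Real analyses would work on lemmatised corpus text rather than raw whitespace tokens, and topic modeling would need a dedicated library; the sketch only shows the shape of the aggregation.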
Strategy and Data availability: The project will establish a strategy for handling and processing parliamentary data in times of any emergency (COVID-19 is just a showcase). Different reference corpora could thus be produced from parliamentary records of earlier periods of global crisis, e.g. the great economic recession, periods of flooding in Europe, the Ebola outbreak, etc.
Standard development: The Parla-CLARIN encoding standard will be further developed to cover more detailed and specific metadata across languages and parliaments. The corpora will serve as a baseline for further updates. Such uniform updates across the corpora would strongly support various methods of comparative research.
From showcasing to real applications: The availability of comparable multilingual parliamentary data (also made visible through concordancers and Parlameter) will boost research in digital humanities, linguistics, political science, sociology and psychology, as well as in related branches of science.
1. Creating a multilingual set of uniformly annotated corpora of parliamentary proceedings dating from November 2019 to July 2020 (thus covering the current COVID-19 pandemic).
2. Creating a set of comparable multilingual reference corpora of parliamentary data from 2015 to October 2019.
3. Processing the corpora linguistically to add syntactic structures of Universal Dependencies as well as Named Entities annotation.
4. Making the corpora available through concordancers and Parlameter.
5. Building use cases in Political Sciences and Digital Humanities based on the corpus data.
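To make task 3 more concrete, the following is a minimal sketch of reading Universal Dependencies annotation from the CoNLL-U format (ten tab-separated columns per token). The example sentence is made up, and storing named-entity labels as `NER=...` in the last (MISC) column is an assumption made for illustration, not a confirmed property of the project's files.

```python
# Illustrative CoNLL-U fragment for a single sentence; the content is invented.
# Assumption: named-entity labels are stored as NER=... in the MISC column.
CONLLU = "\n".join([
    "# text = Parliament adopted measures",
    "\t".join(["1", "Parliament", "parliament", "NOUN", "_", "Number=Sing", "2", "nsubj", "_", "NER=B-ORG"]),
    "\t".join(["2", "adopted", "adopt", "VERB", "_", "Tense=Past", "0", "root", "_", "NER=O"]),
    "\t".join(["3", "measures", "measure", "NOUN", "_", "Number=Plur", "2", "obj", "_", "NER=O"]),
])

# Column names as defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Return the tokens of one sentence as dicts keyed by the CoNLL-U column names."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        tokens.append(dict(zip(FIELDS, cols)))
    return tokens


def dependency_triples(tokens):
    """Return (head form, relation, dependent form) for each token of one sentence."""
    forms = {tok["id"]: tok["form"] for tok in tokens}
    forms["0"] = "ROOT"  # head 0 marks the syntactic root
    return [(forms[tok["head"]], tok["deprel"], tok["form"]) for tok in tokens]


tokens = parse_conllu(CONLLU)
triples = dependency_triples(tokens)
entities = [tok["form"] for tok in tokens if tok["misc"].startswith("NER=B-")]

print(triples)
print(entities)
```

The sketch handles one sentence at a time; a full reader would split the input on blank lines first, since token IDs restart in every sentence.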
Phase 1 (July 1, 2020 - September 30, 2020)
In Phase 1, the approach described in tasks 1, 2, 3 and 4 of the Tasks section will be tested for 4 pilot languages – Bulgarian, Croatian, Slovene and Polish.
- Creation of the COVID-19 parliamentary corpora (Nov. 2019 - July 2020) as well as the reference parliamentary corpora (2015 - Oct. 2019).
- The data will be processed to adhere to the Parla-CLARIN TEI annotation scheme. [Responsible persons for gathering and conversion of data: Bulgarian (Petya Osenova and Kiril Simov); Croatian (Nikola Ljubešić); Slovene (Andrej Pančur); Polish (Maciej Ogrodniczuk and Michał Rudolf)].
- The data will be processed linguistically with Universal Dependencies and Named Entities. (Responsible person: Nikola Ljubešić)
- Mounting of the corpora on the NoSketch Engine and KonText concordancers. (Responsible person: Tomaž Erjavec)
- Mounting of the corpora to Parlameter. (Responsible persons: Filip Muki Dobranić and Tomaž Kunst)
- Preparation of guidelines. (Responsible persons: All)
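As a rough illustration of what working with TEI-encoded proceedings can look like, the sketch below reads speaker-attributed utterances from a small TEI fragment using Python's standard library. The fragment is a simplified, made-up example: it assumes utterances are encoded as `<u>` elements with a `who` attribute and `<seg>` children, and the speaker IDs and the `utterances` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; element names are looked up with the "tei" prefix.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simplified, invented fragment; real corpus files carry far more metadata.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div type="debateSection">
    <u who="#SpeakerA" xml:id="u1"><seg xml:id="u1.s1">First remark.</seg><seg xml:id="u1.s2">Second remark.</seg></u>
    <u who="#SpeakerB" xml:id="u2"><seg xml:id="u2.s1">A reply.</seg></u>
  </div></body></text>
</TEI>"""


def utterances(xml_text):
    """Return (speaker id, text) pairs for every <u> element in the document."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        speaker = u.get("who", "").lstrip("#")  # drop the leading '#' of the reference
        segments = [seg.text.strip() for seg in u.iterfind("tei:seg", TEI_NS) if seg.text]
        result.append((speaker, " ".join(segments)))
    return result


speeches = utterances(SAMPLE)
for speaker, text in speeches:
    print(speaker, "->", text)
```

Pairs of this kind are the sort of input that speaker and party statistics can then be computed from.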
Phase 2 (October 1, 2020 - May 30, 2021)
In Phase 2 the parliamentary corpora will be extended to more languages and parliaments. For this phase a special Call for interest in participation will be published in October 2020. The results will be announced in November 2020.
Also, three showcases are envisaged:
- Availability in Parlameter: facilitating graphical exploration of the corpora, suited for investigations by political scientists, journalists and active citizens. (Responsible person: Filip Muki Dobranić)
- The linguistic showcase will extend the CLARIN tutorial "Voices of the Parliament", showing how corpora can be used to investigate language use and communication practices in the specialised socio-cultural context of political discourse. (Responsible persons: Darja Fišer and Kristina Pahor de Maiti) http://doi.org/10.3828/mlo.v0i0.295
- The DH-related showcase will be prepared by a digital historian. (Responsible person: Ruben Ros)
Parliamentary Corpora from Phase 1
Corpora as Data
- Multilingual comparable corpora of parliamentary debates ParlaMint 1.0 in CLARIN.SI repository: hdl.handle.net/11356/1345
Corpora in Concordancers
- NoSketch Engine: clarin.si/noske/parlamint.cgi
- NoSketch Engine (public): clarin.si/noske/index-en.html. Look for:
- ParlaMint-SI 1.0 (parliament: COVID)
- ParlaMint-BG 1.0 (parliament: COVID)
- ParlaMint-HR 1.0 (parliament: COVID)
- ParlaMint-PL 1.0 (parliament: COVID)
- KonText: clarin.si/kontext
The proposals of the following applicants were assessed and approved by the ParlaMint team in consultation with representatives of CLARIN Board of Directors:
|Ruben van Heusden||University of Amsterdam – ILPS research group||Dutch|
|Steinþór Steingrímsson||The Árni Magnússon Institute for Icelandic Studies||Icelandic|
|Tomas Krilavičius||Applied Informatics dept., Vytautas Magnus University (Vytauto Didžiojo university)||Lithuanian|
|Barbora Hladká||Charles University||Czech|
|Giulia Venturi||Institute for Computational Linguistics "A. Zampolli" (ILC-CNR)||Italian|
|Çağrı Çöltekin||University of Tübingen||Turkish|
|Costanza Navarretta||University of Copenhagen||Danish|
|Miklós Sebők||Centre for Social Sciences, Budapest, Hungary||Hungarian|
|Giancarlo Luxardo||Praxiling UMR 5267||French|
|Roberts Dargis||Institute of Mathematics and Computer Science, University of Latvia||Latvian|
|Petru Rebeja||Alexandru Ioan Cuza University of Iași||Romanian|
|Jesse de Does||Instituut voor de Nederlandse Taal||Belgian Dutch/French|
- María Calzada Pérez - from the Translation Studies Unit, Universitat Jaume I, Castellón de la Plana
- CLARIN Bazaar Poster: ParlaMint: Towards Comparable Parliamentary Corpora presented at the Virtual CLARIN Annual Conference 2020
CLARIN Café - Join Our Parliamentary-flavoured Coffee: ParlaMINT. The ParlaMint project team will present current results and provide information about the opportunities to join as a contributor, as a user, or both. Organized by Petya Osenova (Sofia University and IICT-BAS) and Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences).
● Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences (Coordinator) (also: Michał Rudolf)
● Andrej Pančur, Institute of Contemporary History, Ljubljana
● Nikola Ljubešić, Jožef Stefan Institute, Ljubljana
● Ruben Ros, Luxembourg University
Budget: 135,000 EUR
Maciej Ogrodniczuk: email@example.com
Petya Osenova: firstname.lastname@example.org