
Parliamentary corpora

Introduction

Parliamentary corpora are an important multidisciplinary language resource that can be approached from many research perspectives, including political science, sociology, history, psychology, and applied approaches to linguistics such as critical discourse analysis. The wide availability of parliamentary proceedings in digitized form, together with the right of access to public information in EU countries, has motivated a number of national and international initiatives to compile, process and analyse parliamentary corpora.

The CLARIN ERIC infrastructure offers access to 18 parliamentary corpora, covering almost all of the languages spoken in countries that are either members or observers of CLARIN ERIC. In the vast majority of cases, the corpora can be downloaded directly from the national repositories or queried through easy-to-use online search environments. They are also richly tagged and mostly available under public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.

This website was last updated on 18 June 2018.

Parliamentary corpora in the CLARIN infrastructure

Corpus Language Description Availability

Czech Parliamentary Meetings

Size: 88 hours, 0.5 million tokens
Annotation: error correction of transcriptions, division into speech sections with speaker information
Licence: CC-BY
 

Czech

The corpus contains recordings of the parliamentary sessions as well as corresponding transcriptions. 

The corpus is available for download from LINDAT and through the concordancer KonText. 

Concordancer

Download

The Danish Parliament Corpus 2009 - 2017, v1

Size: 40.6 million words
Annotation: no annotation
Licence: CC-BY
 

Danish

The corpus contains Danish parliamentary debates from 2009 to 2017. 


The corpus is available for download from the DK-CLARIN repository.

Download

Hansard corpus

Size: 1.6 billion tokens
Annotation: tokenised, PoS-tagged, lemmatised, semantic tagging

English

The corpus contains British parliamentary debates from 1803 to 2005. It is semantically tagged with the USAS semantic tagger and the Historical Thesaurus Semantic Tagger (HTST).

The corpus is available through a dedicated concordancer.

For the relevant publication, see Rayson et al. (2015).

Concordancer

Parliamentary Debates on Europe at the House of Commons (1998-2015)

Size: 190,000 tokens
Annotation: tokenised
Licence: CC-BY
 

English

The corpus contains British parliamentary debates from 1998 to 2015.

The corpus is available for download from Ortolang.

Download

Transcripts of Riigikogu (Estonian Parliament)

Size: 13 million tokens
Annotation: tokenised
Licence: CLARIN_ACA
 

Estonian

The corpus contains Estonian parliamentary debates from 1995 to 2001.

The corpus is available for download from a dedicated webpage and through a concordancer on the same webpage.

Concordancer

Download

Plenary Sessions of the Parliament of Finland

Size: 22.4 million tokens
Annotation: tokenised
Licence: CC-BY
 

Finnish

The corpus contains Finnish parliamentary debates from 2008 to 2016.

 The corpus is available through the concordancer Korp.

Concordancer

Parliamentary Debates on Europe at the Assemblée nationale (2002-2012)

Size: 137,000 tokens
Annotation: tokenised
Licence: CC-BY

French

The corpus contains French parliamentary debates from 2002 to 2012.

The corpus is available for download from Ortolang.

Download

Parliamentary Debates on Europe at the Bundestag (1998-2015)

Size: 417,000 tokens
Annotation: tokenised
Licence: CC-BY

German

The corpus contains German parliamentary debates from 1998 to 2015.

The corpus is available for download from Ortolang.

Download

ParlAT beta

Size: 75.2 million tokens
Annotation: tokenised, linked data (e.g., speaker information)

German (Austrian)

This corpus contains Austrian parliamentary proceedings from 1996 to 2017.

Currently in development, ParlAT is planned to be a monitor corpus with new material added over time.

For the relevant publication, see Wissik and Pirker (2018).

 

Hellenic Parliament Sittings (2011-2015)

Size: 28.7 million tokens
Annotation: tokenised
Licence: CC-BY

 

Greek

The corpus contains Greek parliamentary debates from 2011 to 2015.

 The corpus is available for download from the CLARIN:el repository.

Download

Lithuanian Parliament Corpus for Authorship Attribution

Size: 23.9 million tokens
Annotation: tokenised, PoS-tagged, lemmatised

Lithuanian

The corpus contains Lithuanian parliamentary debates from 1990 to 2013.  It is annotated with Lemuoklis (morphological analyzer for lemmatization) and MaltParser (generation of dependency tags).

The corpus is available for download from the repository of CLARIN-LT.

Download

Proceedings of Norwegian Parliamentary Debates

Size: 29 million tokens
Annotation: tokenised

Norwegian

The corpus contains Norwegian parliamentary debates from 2008 to 2015.

The corpus is available through the concordancer Corpuscle.

Concordancer

Talk of Norway

Size: 63.8 million tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: NLOD
 

Norwegian

The corpus contains Norwegian parliamentary debates from 1998 to 2016.

The corpus is available for download from the CLARINO repository.

Download

Polish Parliamentary Corpus

Size: 300 million tokens
Annotation: tokenised, MSD-tagged, named entities, etc.

Polish

The corpus contains Polish parliamentary debates from 1991 to 2017. It is annotated with Morfeusz SGJP (morphological analyser), Pantera (disambiguating tagger), Spejd (shallow parser) and Nerf (named entity recognizer).

The corpus is available for download from a dedicated webpage and through the concordancer NKJP. 

For the relevant publication, see Ogrodniczuk (2012) and Ogrodniczuk (2018).

Concordancer

Download

PTPARL Corpus

Size: 1 million tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: ELRA END USER/ELRA VAR

Portuguese

The corpus contains Portuguese parliamentary debates from 1970 to 2008. It is annotated with LX-Tokenizer, LX-Tagger, MBT and MBLEM (lemmatisation).

The corpus is available for download from the ELRA catalogue.

For the relevant publication, see Généreux et al. (2012).

Download

Slovenian parliamentary corpus SlovParl 2.0

Size: 3.2 million tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: CC-BY

Slovenian

The corpus contains Slovenian parliamentary debates from 1990 to 1992.

The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText.

For the relevant publication, see Pančur and Šorn (2016).

Concordancer

Download

Riksdag’s Open Data

Size: 1.25 billion tokens
Annotation: tokenised, lemmatised
Licence: CC-BY

Swedish

The corpus contains Swedish parliamentary debates from 1971 to 2016. It is annotated with Sparv.

The corpus is available for download from Språkbanken (all entries with "Riksdag's Open Data" in the subtitle) and through the concordancer Korp.

 For the relevant publication, see Borin et al. (2016).

Concordancer

Download

Europarl: European Parliament Proceedings Parallel Corpus 1996-2011

Annotation: sentence-aligned

21 languages

This corpus contains parliamentary debates from the European Parliament from 1996 to 2011.

The corpus is available for download from a dedicated webpage.

Download
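To illustrate what sentence alignment means in a parallel corpus such as Europarl, here is a minimal Python sketch. It assumes a line-aligned layout in which line i of one language file corresponds to line i of the other. The function name and the sample sentences are hypothetical and are not taken from the corpus distribution.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_aligned_pairs(
    source_lines: Iterable[str], target_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Assumption: line i of one stream is the translation of line i of the
    other. Pairs in which either side is empty are skipped.
    """
    for src, tgt in zip_longest(source_lines, target_lines):
        if src is None or tgt is None:
            # One stream ended early, so the inputs cannot be line-aligned.
            raise ValueError("streams differ in length; not line-aligned")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield src, tgt


# Hypothetical in-memory sample standing in for two aligned files.
english = ["Resumption of the session\n", "\n", "Please rise.\n"]
german = ["Wiederaufnahme der Sitzungsperiode\n", "\n", "Bitte erheben Sie sich.\n"]

pairs = list(read_aligned_pairs(english, german))
for en, de in pairs:
    print(f"{en}\t{de}")
```

Because the function accepts any iterable of strings, the same code works with open file handles in place of the in-memory lists.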

Other parliamentary corpora

Corpus Language Description Availability

Korpusbasierte Analyse österreichischer Parlamentsreden

Size: 1.2 million tokens
Annotation: tokenised, PoS-tagged

German (Austrian)

The corpus contains Austrian parliamentary debates from 2013 to 2015. It is annotated with the Stanford Tagger.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Sippl et al. (2016).

Download

Corpus of Bulgarian Political and Journalistic Speech

Size: 10 million tokens
Annotation: tokenised, PoS-tagged, lemmatised

Bulgarian

The corpus contains Bulgarian parliamentary debates from 2006 to 2012.

The corpus is available through a dedicated concordancer.

Concordancer

CzechParl

Size: 81.9 million tokens
Annotation: tokenised, MSD-tagged and lemmatised
 

Czech

The corpus contains Czech parliamentary debates from 1993 to 2010. It is annotated with ajka.

The corpus is available through the Sketch Engine.

For the relevant publication, see Jakubíček and Kovář (2010).

Concordancer

DutchParl

Size: 800 million tokens
Annotation: tokenised, PoS-tagged, lemmatised

Dutch

The corpus contains Dutch parliamentary debates from 1814 to 2014. It is annotated with Frog.

The corpus is available for download (the authors need to be contacted) and is also accessible online through the Political Mashup environment.

For the relevant publication, see Marx and Schuth (2010).

Concordancer

Download

HanDeSeT: Hansard Debates with Sentiment Tags

Size: 1251 motion-speech units taken from 129 separate debates
Annotation: sentiment tags
Licence: Open Parliament Licence V3.0 and Open Data Commons Open Database License (ODbL)

English

This corpus contains English parliamentary debates from 1997 to 2017.

The corpus is available for download from a dedicated webpage. 

For the relevant publication, see Abercrombie and Batista-Navarro (2018).

Download

UKParl Dataset

Size: 354,400 tokens
Annotation: fine-grained topic annotation, additional semantic information (entity links)

English

This corpus contains British parliamentary debates of the House of Commons from 2013 to 2016.

The corpus is available for download from Google Drive.

For the relevant publication, see Nanni et al. (2018).

Download

polmineR corpus

Only a small sample available



 

German

A small sample is available for download from the GitHub webpage of the corpus.

Download

SEIMA corpus

Size: 12.5 million tokens
Annotation: tokenised, lemmatised

Latvian

The corpus contains Latvian parliamentary debates from 1993 to 2016.

The corpus is available through noSketchEngine. 

Concordancer

Additional materials

CLARIN-PLUS Workshop "Working with Parliamentary Records". 27-29 March 2017, Sofia, Bulgaria. [html]

Videolectures of the CLARIN-PLUS workshop. [html]

ParlaCLARIN@LREC2018. 7 May 2018, Miyazaki, Japan. [html]

Videolectures of the ParlaCLARIN workshop. [html]

Darja Fišer, Maria Eskevich, Franciska de Jong (eds.), Proceedings of the ParlaCLARIN Workshop at LREC2018. [pdf]

Publications on the parliamentary corpora

[Abercrombie and Batista-Navarro 2018] Gavin Abercrombie and Riza Theresa Batista-Navarro. 2018. A Sentiment-labelled Corpus of Hansard Parliamentary Debate Speeches. 

[Borin et al. 2016] Lars Borin, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, Anne Schumacher. 2016. Sparv: Språkbanken’s corpus annotation pipeline infrastructure. 

[Branco and Silva 2006] António Branco and João Silva. 2006. A Suite of Shallow Processing Tools for Portuguese: LX-Suite.

[Généreux et al. 2012] Michel Généreux, Iris Hendrickx, Amália Mendes. 2012. A Large Portuguese Corpus On-Line: Cleaning and Preprocessing.

[Jakubíček and Kovář 2010] Miloš Jakubíček, Vojtěch Kovář. 2010. CzechParl: Corpus of Stenographic Protocols from Czech Parliament.

[Marx and Schuth 2010] Maarten Marx and Anne Schuth. 2010. DutchParl: The Parliamentary Documents in Dutch.

[Nanni et al. 2018] Federico Nanni, Mahmoud Osman, Yi-Ru Cheng, Simone Paolo Ponzetto, Laura Dietz. 2018. UKParl: A Data Set for Topic Detection with Semantically Annotated Text.

[Ogrodniczuk 2012] Maciej Ogrodniczuk. 2012. The Polish Sejm Corpus.

[Ogrodniczuk 2018] Maciej Ogrodniczuk. 2018. The Polish Parliamentary Corpus. 

[Pančur and Šorn 2016] Andrej Pančur, Mojca Šorn. 2016. Smart Big Data: use of Slovenian parliamentary papers in digital history.

[Rayson et al. 2015] Paul Rayson, Alistair Baron, Scott Piao, Steve Wattam. 2015. Large-scale Time-sensitive Semantic Analysis of Historical Corpora. 

[Sippl et al. 2016] Colin Sippl, Manuel Burghardt, Christian Wolff, Bettina Mielke. 2016. Korpusbasierte Analyse österreichischer Parlamentsreden.

[Wissik and Pirker 2018] Tanja Wissik and Hannes Pirker. 2018. ParlAT beta: Corpus of Austrian Parliamentary Records.