Newspaper corpora

Introduction

Digital newspaper collections are a rich source of information for researchers across the Humanities and Social Sciences. They are especially valuable for synchronic and diachronic studies in fields ranging from history and media and communication studies to lexicography, where newspapers provide evidence of neologisms and other lexicographic phenomena.

The CLARIN ERIC infrastructure gives access to 27 newspaper corpora, 4 of which are multilingual and 23 monolingual. The monolingual corpora cover 8 languages: Arabic, Czech, Finnish, French, German, Norwegian, Polish and Swedish. Almost a third of the newspaper corpora are historical, with the oldest articles dating from the 18th century. The majority of them are richly tagged and are available under public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

To comment, suggest changes to the existing content or propose new corpora, send us an email.

This website was last updated on 17 August 2018.

Newspaper corpora in the CLARIN infrastructure

Monolingual corpora

Corpus Language Description Availability

An-Nahar Newspaper Text Corpus

Size: 24 million tokens
Annotation: tokenised
Licence: ELRA END USER

Arabic

This corpus contains articles from the Arabic newspaper An-Nahar from 1995 to 2000.

The corpus is available for download from the ELRA catalogue.

Download

SYN2006PUB: corpus of Czech newspapers

Size: 300 million tokens
Annotation: tokenised, lemmatised, PoS-tagged
Licence: CC-BY

Czech

This corpus contains articles from 11 Czech newspapers from 1989 to 2004.

The corpus is available for download from the Czech repository LINDAT.

Download

SYN2013PUB: corpus of written Czech newspapers

Size: 935 million tokens
Annotation: tokenised, lemmatised, morphologically tagged
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus contains articles from Czech newspapers from 2005 to 2009. 

The corpus is available for download from the Czech repository LINDAT.

Download

The Karelian Finnish Newspaper Corpus

Size: 500,000 tokens
Licence: CLARIN ACA

Finnish

This corpus contains articles from the Finnish newspaper Karjalan Sanomat from 2012 to 2014.

The corpus is available through the concordancer Korp.

Concordancer

BREF-80

Size: 100 hours of speech materials
Licence: ELRA END USER/ELRA VAR

French

This corpus contains recorded readings of articles from the French newspaper Le Monde.

The corpus is available for download from the ELRA catalogue.

Download

Corpus journalistique issu de l'Est Républicain

Licence: CC-BY

French

This corpus contains articles from the French newspaper L'Est Républicain from 1999 to 2003.

The corpus is available for download from Ortolang.

Download

Tübingen Treebank of Written German / Newspaper Corpus

Size: 1.8 million tokens
Annotation: tokenised, MSD tagged, lemmatised, syntactic constituency, named entities
Licence: CLARIN RES

German

This corpus contains articles from the German newspaper Die Tageszeitung.

The corpus is available through a dedicated concordancer with an institutional account.

Concordancer

TIGER Corpus

Size: 900,000 tokens
Annotation: tokenised, PoS-tagged, parsed, lemmatised
Licence: CLARIN PUB

German

This corpus contains articles from the German newspaper Frankfurter Rundschau.

The corpus is available for download from a dedicated webpage.

Download

MTP Annotated German corpus - tagged version

Size: 500,000 tokens
Annotation: tokenised, MSD tagged
Licence: ELRA END USER/ELRA VAR

German

This corpus contains articles from two German newspapers, Die Frankfurter Allgemeine Zeitung and Die Zeit, from 1992.

The corpus can be downloaded from the ELRA catalogue.

Download

Mannheim Corpus of Historical Newspapers and Magazines

Size: 4.1 million tokens
Annotation: tokenised

German

This corpus contains articles from 21 German newspapers from the 18th and 19th centuries.

The corpus is available for download from the CLARIN-D repository.

Download

The Norwegian Newspaper Corpus

Size: 700 million tokens
Annotation: multitagged

Norwegian

This corpus contains articles from 24 Norwegian newspapers from 1998 onwards.

The corpus is available through the concordancer Corpuscle.

Concordancer

ChronoPress Corpus of Polish Press Texts

Size: 20 million tokens
Annotation: tokenised, PoS-tagged, named entities
Licence: CLARIN PUB

Polish

This corpus contains articles from various Polish newspapers from 1945 to 1962.

The corpus is available through a dedicated concordancer.

Concordancer

8 sidor

Size: 678,000 tokens
Annotation: tokenised, PoS-tagged, parsed, compounds
Licence: CC-BY

Swedish

This corpus contains articles from the Swedish newspaper 8 sidor from 2003 to 2012.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Dagny

Size: 8.1 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Dagny from 1886 to 1913. 

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

DN 1987

Size: 5 million tokens
Annotation: tokenised, PoS-tagged, parsed, compounds
Licence: CC-BY

Swedish

This corpus contains articles from the Swedish newspaper Dagens Nyheter from 1987. 

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

GP 1994 and 2001-2011

Size: 271 million tokens
Annotation: tokenised, PoS-tagged, parsed, compounds
Licence: CC-BY

Swedish

This corpus contains articles from the Swedish newspaper Göteborgsposten from 1994 and from 2001 to 2011.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Hertha

Size: 3.8 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Hertha from 1914 to 2015. 

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Idun

Size: 2 million tokens
Annotation: tokenised, PoS-tagged, parsed

Swedish

This corpus contains articles from the newspaper Idun from 1887 to 1917. 

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Kvinnornas Tidning

Size: 5.5 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Kvinnornas Tidning for the period between 1921 and 1925. 

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Morgonbris

Size: 3.5 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Morgonbris from 1904 to 1924. 

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Rösträtt för Kvinnor

Size: 2.2 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Rösträtt för Kvinnor from 1912 to 1919. 

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Smittskydd

Size: 691,000 tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Smittskydd from 2002 to 2010.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

The Webbnyheter corpus

Size: 272 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from various Swedish online newspapers from 2001 to 2013. 

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Multilingual corpora

Corpus Language Description Availability

MLCC Multilingual and Parallel Corpora

Size: 100 million tokens
Annotation: tokenised
Licence: ELRA END USER

Dutch, English, French, German, Italian, Spanish

This corpus contains articles from newspapers in Dutch, English, French, German, Italian and Spanish from 1986 to 1994.

The corpus is available for download from the ELRA catalogue.

Download

The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 8.8 billion tokens
Annotation: tokenised
Licence: CC-BY

Swedish and Finnish

This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1770 to 2011.

The corpus can be accessed through the concordancer Korp.

Concordancer

The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874)

Licence: CC-BY

Swedish and Finnish

This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1771 to 1874. 

The corpus can be downloaded from FIN-CLARIN.

Download

Corpora of Newspaper Texts

Size: 435 million tokens
Annotation: tokenised

Swedish, English and Finnish

This corpus contains articles from a variety of Swedish, English and Finnish newspapers. 

The corpus can be found in the FIN-CLARIN repository, although its availability and licence are still under negotiation.

Other newspaper corpora

Monolingual corpora

Corpus Language Description Availability

Zurich English Newspaper Corpus

Size: 1.6 million tokens
Annotation: tokenised
Licence: public

English

This corpus contains articles from various English newspapers (mainly from London) from the 17th and 18th centuries.

For access, contact the authors.

deu_newscrawl_2011

Size: 426 million tokens
Annotation: tokenised

German

This corpus contains articles from various German newspapers from 2011.

The corpus is available through a dedicated concordancer.

Concordancer

CRIPCO

Size: 43,000 documents
Annotation: coreference resolution
Licence: proprietary

Italian

This corpus contains articles from the Italian newspaper L’Adige from 1999 to 2006. 

The corpus is available for download through META-SHARE.

Download

"LA REPUBBLICA" CORPUS

Size: 380 million tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: CC-BY

Italian

This corpus contains articles from the Italian newspaper La Repubblica.

The corpus is available through a dedicated concordancer.

Concordancer

WItaC - NewsReader Wikinews Italian Corpus

Size: 40,231 tokens
Annotation: entities, events, event factuality, temporal information, semantic roles, and intra-document and cross-document event and entity coreference
Licence: CC-BY

Italian

This corpus contains Italian translations of 120 English Wikinews articles.

The corpus is available for download from a dedicated website.

For a related publication, see Minard et al. (2016).

Download

Corpus of Contemporary Serbian Newspapers and Magazines

Size: 916 million tokens
Annotation: tokenised, PoS-tagged and lemmatised
Licence: CC-BY

Serbian

This corpus contains articles from over 100 Serbian newspapers from 2004 to 2012.

For access, contact the resource manager.

Multilingual corpora

Corpus Language Description Availability

Europeana Newspapers NER Corpora

Size: 500,000 tokens (182,483 Dutch; 207,000 French; 96,735 German)
Annotation: named entities
Licence: CC-ZERO

Dutch, French and German

This corpus contains articles from Europeana newspapers for the following time periods: 1811-1856 for the Dutch subcorpus, 1871-1916 for the French subcorpus, and 1926 for the German subcorpus.

The corpus is available for download from the KB Lab.

For a related publication, see Neudecker (2016).

Download

Timestamped JSI web corpus

Size: 35 billion tokens
Annotation: tokenised, PoS-tagged

18 languages

This corpus contains newsfeed articles from 2014 to 2017.

The corpus is available through NoSketch Engine.

For a related publication, see Bušta et al. (2017).

Concordancer

Additional materials

CLARIN-PLUS workshop: "Working with Digital Collections of Newspapers", 19-21 September 2016, Leuven, Belgium. [html]

Videolectures of the CLARIN-PLUS workshop. [html]

Workshop "Hacking the News: from digitised newspapers to the archived-web: an introductory workshop to text and data-mining", 5-6 March 2018, Helsinki, Finland. [html]

Slides for "Hacking the News" workshop. [gdoc]

Publications on the newspaper corpora

[Bušta et al. 2017] Jan Bušta, Ondřej Herman, Miloš Jakubíček, Simon Krek, Blaž Novak. 2017. JSI Newsfeed Corpus. [pdf]

[Minard et al. 2016] Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begoña Altuna, Marieke van Erp, Anneleen Schoen, Chantal van Son. 2016. MEANTIME, the NewsReader Multilingual Event and Time Corpus.

[Neudecker 2016] Clemens Neudecker. 2016. An Open Corpus for Named Entity Recognition in Historic Newspapers.