Tour de CLARIN: CLARIAH-AT Presents the Austrian Media Corpus

Written by Hannes Pirker

A public-private cooperation between the Austrian Academy of Sciences (ÖAW) and the Austria Press Agency (APA) has made it possible to provide the scientific community with the Austrian Media Corpus (amc), a unique corpus that covers almost all of the country’s print media production of the past 30 years.

The collection of the original data was initiated by the Austria Press Agency (APA), beginning with its own press releases in 1986. From 1992 onwards, the content of print media, such as newspapers and weekly or monthly magazines, was added to the collection, along with selected transcriptions of interviews and news stories from several television channels. From 1998 onwards, the collection covers Austria’s print media production almost completely. The amc is a plain text corpus comprising born-digital data only.

As of 2022, the corpus contained 47 million articles from 58 different media outlets, amounting to more than 11 billion tokens. The amc thus ranks among the largest collections of its kind. It is updated annually with new data provided by APA, which adds approximately 500 million tokens a year.

Annotations, Legal Aspects and Applications

The texts in the corpus are provided with basic metadata, i.e. the name of the media outlet, the date of publication, the geographical region of origin and a rough classification of texts on the basis of the newspaper section they were published in. In terms of processing and linguistic annotation, the texts have been de-duplicated heuristically, tokenised, lemmatised, and annotated with several part-of-speech taggers, a dependency parser and a named entity recogniser.

The conditions of use are specified by APA as the collector of the original data and holder of the rights to the collection. Access to the amc is granted exclusively for the purpose of linguistic research. The majority of users, i.e. academics and students, can access the corpus free of charge. Fees apply when the amc is used within a funded project.

Most projects that use the amc are concerned with the analysis of linguistic variation in general, and lexical variation in particular. The amc is also used professionally as a source of information by the Digitales Wörterbuch der Deutschen Sprache (DWDS) project, and by the Council for German Orthography to monitor how orthographic rules are applied in everyday usage.