
CMC corpora

Introduction

Computer-mediated communication (CMC) covers public and private online communication, such as blog posts, forum posts, comments on online news sites, social media and networking sites such as Twitter and Facebook, chat rooms, mobile phone applications such as WhatsApp, and e-mail. Because corpora of computer-mediated communication often include very informal styles of writing, they are of interest to a wide range of research fields, including language variation, pragmatics, and media and communication studies. They are also very important for developing robust NLP tools that can handle non-standard spelling, vocabulary and grammar. The compilation and dissemination of such corpora are hindered by the unclear legal status of CMC data when it is distributed as a resource to the scientific community, a problem exacerbated by content providers' rapidly changing terms of service.

The CLARIN infrastructure offers 12 CMC corpora: five for Slovenian and one each for Czech, Dutch, Estonian, Finnish, French, German and Lithuanian. Most of the corpora are richly annotated and available under public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

To comment, suggest changes to the existing content or propose new corpora for inclusion, send us an email.

This website was last updated on 29 May 2018.

CMC corpora in the CLARIN infrastructure

Corpus Language Description Availability

Corpus of contemporary blogs

Size: 1 million tokens
Annotation: tokenised
Licence: CC-BY
 

Czech

This corpus contains blog posts.

The corpus is available for download from LINDAT.

Download

SoNaR New Media

Size: 35 million tokens
Annotation: tokenised, PoS-tagged, lemmatised

Dutch

This corpus contains tweets, chats and SMS from 2005 to 2012.

The corpus is available for searching online through the OpenSONAR environment.

For the relevant publication, see Sanders (2012).

Concordancer

The Mixed Corpus: New Media

Size: 25 million tokens
Annotation: tokenised

Estonian

This corpus contains chat room messages, forum posts and news comments from 2000 to 2008.

The corpus is available for download from a dedicated webpage associated with CLARIN Estonia and can be searched through a dedicated concordancer.

Concordancer

Download

Suomi 24 Corpus

Size: 2.6 billion tokens
Annotation: tokenised, MSD-tagged
Licence: CLARIN ACA

Finnish

This corpus contains forum posts from the Suomi24 website from 2001 to 2016.

The corpus is available for download from the FIN-CLARIN repository and through the concordancer Korp.

Concordancer

Download

CoMeRe repository

Size: 80 million tokens
Annotation: tokenised
Licence: CC-BY

French

This corpus contains e-mails, forum posts, online chats, tweets, and SMS.

The corpus is available for download from Ortolang.

Download

Dortmund Chat Corpus

Size: 1 million tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: CC-BY

German

This corpus contains online chats from 2000 to 2006.

The corpus is available for download from the repository of CLARIN-D.

For the relevant publication, see Beißwenger (2013).

Download

LITIS v.1

Size: 190,000 comments
Licence: CLARIN ACA

Lithuanian

This corpus contains forum posts from the portals delfi.lt and lrytas.lt from 2010 to 2014.

The corpus is available for download from the CLARIN-LT repository.

Download

Blog post and comment corpus Janes-Blog 1.0

Size: 34 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains blog posts from RTV Slovenija and Publishwall.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

Concordancer

Download

Forum corpus Janes-Forum 1.0

Size: 47 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains forum posts from Avtomobilizem.com, MedOver.net and RTV Slovenija.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

Concordancer

Download

News comment corpus Janes-News 1.0

Size: 14 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains news comments from RTV Slovenija, Mladina and Reporter.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

Concordancer

Download

Twitter corpus Janes-Tweet 1.0

Size: 139 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains tweets written by Slovenian Twitter users from 2013 to 2017.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

Concordancer

Download

Wikipedia talk corpus Janes-Wiki 1.0

Size: 5 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains Slovenian Wikipedia user and talk pages.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

Concordancer

Download

Other CMC corpora

Corpus Language Description Availability

Flemish Online Teenage Talk

Size: 2.9 million tokens
Annotation: tokenised

Dutch

This corpus contains Facebook posts and WhatsApp messages from 2015 and 2016. 

For the relevant publication, see Hilte et al. (2016).

Dereko – News and Wikipedia subcorpus

Size: 670 million tokens
Annotation: tokenised

German

This corpus contains content from newsgroup posts and Wikipedia.

The corpus is available through a dedicated concordancer.

Concordancer

DWDS – Blogs

Size: 102 million tokens
Annotation: tokenised

German

This corpus contains blog posts. 

The corpus is available through a dedicated concordancer.

Concordancer

Monitor corpus of tweets from Austrian users

Size: 40 million tweets
Annotation: tokenised, lemmatised

German and English

The corpus contains tweets from 2007 to 2017.

For the relevant publication, see Barbaresi (2016).

 

FORUMAS_INDV corpus

Size: 600,000 tokens
Annotation: tokenised

Lithuanian

The corpus contains forum posts from the lrytas.lt portal from 2014.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Kapočiūtė-Dzikienė et al. (2015).

Download

INT_KOMETARAI_INDV2 corpus

Size: 4 million tokens
Annotation: tokenised

Lithuanian

This corpus contains comments from the delfi.lt portal from 2015.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Kapočiūtė-Dzikienė et al. (2015).

Download

NTAP climate change blog corpus

Size: 21 million tokens (Norwegian subcorpus)
Annotation: tokenised

Norwegian, English, French

The corpus contains blog posts focusing on climate change from 2000 to 2014. 

For the relevant publication, see Salway et al. (2016).

 

Corpus of Highly Emotive Internet Discussions

Size: 160 million tokens
Annotation: tokenised

Polish

The corpus contains tweets.

For the relevant publication, see Sobkowicz (2016).

For access, contact the authors.

sms4science

Size: 0.5 million tokens
Annotation: tokenised, PoS-tagged, lemmatised
 

Swiss German, German, French, Italian, Romansh

This corpus contains around 25,000 SMS messages from 2009.

The corpus comes in two versions, which are available through separate concordancers: SMS Navigator and ANNIS. The version accessible through ANNIS is more richly annotated and includes PoS-tagging, normalisation, annotation of nonce borrowings, etc. Access through the concordancers requires free registration.

For the related publication, see Dürscheid and Stark (2011).

Concordancer

What's up, Switzerland?

Size: 5 million tokens
 

Swiss German, German, French, Italian, Romansh

This corpus contains 216 WhatsApp chats from 2014.

For the related publication, see Ueberwasser and Stark (2017).

The corpus is available upon request until 31 December 2018 and freely available after that.

The Corpus of Welsh Language Tweets

Size: 7 million tokens
Annotation: tokenised
Licence: unclear

Welsh

The corpus contains tweets.

The corpus is available for download from a dedicated webpage.

 

Download

Additional materials

Tutorial at CMC-Corpora 2017: "How to use TEI for the annotation of CMC and social media resources: a practical introduction", 4 October 2017, Bolzano, Italy. [html]

CLARIN-PLUS workshop "Creation and Use of Social Media Resources", 18-19 May 2017, Kaunas, Lithuania. [html]

Video lectures of the CLARIN-PLUS workshop. [html]

Publications on the CMC corpora

[Barbaresi 2016] Adrien Barbaresi. 2016. Collection and Indexing of Tweets with a Geographical Focus.

[Beißwenger 2013] Michael Beißwenger. 2013. Das Dortmunder Chat-Korpus: ein annotiertes Korpus zur Sprachverwendung und sprachlichen Variation in der deutschsprachigen Chat-Kommunikation.

[Dürscheid and Stark 2011] Christa Dürscheid and Elisabeth Stark. 2011. sms4science: An international corpus-based texting project and the specific challenges for multilingual Switzerland.

[Hilte et al. 2016] Lisa Hilte, Reinhild Vandekerckhove, Walter Daelemans. 2016. Expressiveness in Flemish Online Teenage Talk: A Corpus-Based Analysis of Social and Medium-Related Linguistic Variation.

[Kapočiūtė-Dzikienė et al. 2015] Jurgita Kapočiūtė-Dzikienė, Ligita Šarkutė, Andrius Utka. 2015. The Effect of Author Set Size in Authorship Attribution for Lithuanian.

[Salway et al. 2016] Andrew Salway, Dag Elgesem, Knut Hofland, Øystein Reigem, Lubos Steskal. 2016. Topically-focused Blog Corpora for Multiple Languages. 

[Sanders 2012] Eric Sanders. 2012. Collecting and Analysing Chats and Tweets in SoNaR.

[Sobkowicz 2016] Antoni Sobkowicz. 2016. Political Discourse in Polish Internet - Corpus of Highly Emotive Internet Discussions. 

[Ueberwasser and Stark 2017] Simone Ueberwasser and Elisabeth Stark. 2017. What’s up, Switzerland? A corpus-based research project in a multilingual country.