
CLARIN-PLUS workshop "Creation and Use of Social Media Resources"


Goals of the workshop


With the increasing volume and impact of communication on social media, social media analysis has become one of the most active topics in natural language research, as shown by the growing number of dedicated workshops and conferences, funded projects, and newly established research centers. As a result, a number of social media resources containing chats, online commentaries, reviews, blogs, emails, forums, etc., as well as audio and video recordings, have accumulated in the repositories of CLARIN centers. Moreover, due to their distinct communicative characteristics, these resources pose new technical challenges for standard natural language processing tools, as well as new legal and ethical challenges for their dissemination. CLARIN has also addressed these challenges, which makes the available infrastructure an important means of attracting new users to the CLARIN community.


The aims of the workshop are: to demonstrate the possibilities of social media resources and natural language processing tools for researchers with a diverse research background who are interested in empirical research of language and social practices in computer-mediated communication; to promote interdisciplinary cooperation possibilities; to initiate a discussion on the various approaches to social media data collection and processing. 


This workshop is the fourth and last in a series of workshops funded in the context of the CLARIN-PLUS project. We aim to attract about 30 specialists in discourse analysis, psychology, sociology, political science, media studies, corpus linguistics and sociolinguistics from all CLARIN member countries. Researchers who currently work outside CLARIN projects and networks but have a background relevant to the topic of the workshop are very welcome, since one of the main aims of these workshops is to reach new users and start new collaborations. Places are limited and mostly allocated via the national CLARIN consortia, but please get in touch at clarin[at]clarin[dot]eu if you are interested in participating, in adding information to the overview of relevant data sets and initiatives, or in contributing otherwise.


Videos of invited talks, presentations by participants and interviews with participants have been published on Videolectures.

Topics and perspectives

The workshop is open to the discussion of these and other questions proposed by the participants:

Creation questions

  • What is the extent and balance of different genres and registers in different social media channels?
  • How are ethical and legal issues of data mining and dissemination dealt with?
  • How can data be annotated to be re-usable and useful to researchers in different fields, or other interested parties?

User involvement questions

  • How can cooperation and feedback between the different parties involved in social media analysis be increased? These parties include producers of data; computational linguists; researchers in the social, medical and technical sciences and the humanities; and interested parties from the educational, commercial, legal, political, medical and other sectors of society.
  • What are new emerging theoretical and methodological approaches to social media data analysis?
  • How could establishing a knowledge-sharing infrastructure for social media within CLARIN help to involve new scientific or commercial users interested in social media research?


  • Overview of communities of use of CLARIN's social media data analysis tools, their needs and possible directions of interdisciplinary cooperation.
  • Suggested recommendations for metadata and resource search interfaces for social media data.
  • Overview of candidate institutes that could become a centre of expertise in the CLARIN knowledge sharing infrastructure.

Practical details

Format: The 2-day workshop will consist of a combination of plenary talks, practical hands-on sessions, presentations by participants, and group discussions.

Venue and accommodation: Kaunas Hotel

Local organizer: CLARIN-LT, Vytautas Magnus University

Travel: You can book flights to Kaunas or Vilnius.




Thursday 18 May

08:45-09:15 Registration
09:15-09:30 Opening (Rūta Petrauskaitė, Darja Fišer) (slides)
09:30-10:15    Invited talk 1 by prof. dr. Michael Beißwenger (University of Duisburg-Essen): Creation of Standards for Social Media Corpora: a Digital Humanities Topic Par Excellence (slides)
10:15-11:00 Presentations by participants  (10 minutes):
  • Ana Slavec: Legal and Ethical Challenges Related to the Use of Social Media Data and Related Data (slides)
  • Simone Ueberwasser: What's up, Switzerland?: Challenges of a Large, Multilingual CMC Corpus (slides)
  • Adrien Barbaresi: Corpora from the Blogosphere: Why and How? (slides)
  • Andrew Salway: Creating and Using Topically-Focused Blog Corpora (slides)
11:00-11:15 Q&A session
11:15-11:45  Coffee break
11:45-12:30 Invited talk 2 by dr. Dirk Hovy (University of Copenhagen): NLP meets Computational Social (Media) Science (slides)
12:30-13:15  Presentations by participants (10 minutes):
  • Andrea Cimino: Analysis of Italian Social Media Texts: from Tools and  Resources to Applications (slides)
  • Nikola Ljubešić: The JANES Project: Tools and Resources for Linguistic Analysis and Automatic Processing of User-Generated Content in Slovene (slides)
  • Jurgita Kapočiūtė-Dzikienė: A Comparison of Authorship Attribution Approaches Applied on the Morphologically Complex Language Using Internet Comments (slides)
  • Steven Wilson: Social Media as Social Science Data (slides)
13:15-13:30 Q&A session
13:30-14:30 Lunch
14:30-15:15 Invited talk 3 by dr. Rebekah Tromble (Leiden University): Thinking Critically About Digital Data Collection: Twitter and Beyond (slides)
15:15-15:45 Presentations by participants (10 minutes):
  • Eric Fleury: SoSweet: the sociolinguistics of Twitter (slides)
  • Jiyoung Ydun Kim: Challenges of cross-national comparative research on Facebook (slides)
15:45-16:00 Q&A session
16:00-16:30 Coffee break
16:30-18:00 Hands-on session by dr. Nikola Ljubešić (Jožef Stefan Institute): Harvesting, Processing and Visualising Geo-Encoded Data from Social Media (slides 1, slides 2, slides 3)
19:00  Dinner


Friday 19 May

09:00-09:30   Invited talk 4 by dr. Reinhild Vandekerckhove (University of Antwerp): Collection, Storage and Analysis of Online Teenage Talk: Assets and Challenges (slides)
09:30-10:15 Presentations by participants (10 minutes):
  • Jūratė Ruzaitė: New Media as a New Empirical Resource in Discourse Studies and Applied Linguistics (slides)
  • Yin Yin Lu: #VoteLeave or #StrongerIn: Resonance and Rhetoric in the EU Referendum (slides)
  • Steven Coats: Multilingual Clusters and Gender in Nordic Twitter (slides)
  • Ester Appelgren: The Reasons Behind Tracing Audience Behavior: A Matter of Paternalism and Transparency (slides)
10:15-10:30 Q&A session
10:30-11:00   Coffee break
11:00-11:45 Demo session by dr. Diana Maynard (University of Sheffield): An Open Source GATE Toolkit for Social Media Analysis (slides)
11:45-12:30  Invited talk 5  by prof. dr. Els Lefever (Ghent University): Text Analysis for Social Media Cybersecurity: the AMiCA Project (slides)
12:30-12:45  Q&A session
12:45-13:30  Workshop findings and future plans, closing (Darja Fišer, Andrius Utka)
13:30-14:30  Lunch


Abstracts of invited talks

Invited talk 1: “Creation of standards for social media corpora: a digital humanities topic par excellence”

Prof. dr. Michael Beißwenger, University of Duisburg-Essen

Even though empirical research of computer-mediated communication (CMC) has a tradition of almost two decades, there are still only very few annotated CMC/social media corpora which are available to the scientific community and the public. The major reason for that situation is the lack of standards and tools for collecting, representing, annotating and providing resources of that type.

One crucial issue is the unclear legal situation with respect to CMC/social media data. Using the example of a legal opinion sought for the integration of an existing German chat corpus into CLARIN-D, the talk will highlight this issue (according to German law) and describe how it has been handled in the project. Another crucial issue arises from the fact that, due to the distinct communicative characteristics of CMC/social media discourse, standards and tools for the representation and annotation of text corpora cannot be adopted for CMC/social media corpora without modifications. The creation of standards and the adaptation of NLP tools for this new type of language resource is a digital humanities topic par excellence since (1) it focuses on data which are born digital while at the same time (2) it requires a combination of expertise from the humanities and computational sciences.

Invited talk 2: “NLP meets Computational Social (Media) Science”

Dr. Dirk Hovy, University of Copenhagen

Language is the ultimate social medium: We don't just communicate to convey information, but also to entertain, to gossip, to console, and much more. Social media data is one of the purest expressions of all of these aspects of language, and often includes additional information about the place, time, and author of a message. This combination has allowed NLP to work on real-life, situated, individual language, rather than on abstract general corpora, and led it into areas that were previously the sole domain of the social sciences. These areas open up a wide range of exciting new applications, but also present a host of new challenges - technically, linguistically, and ethically.

In this talk, I will illustrate both opportunities and problems, and end with a number of open questions that I believe will challenge computational social science for the years to come.

Invited talk 3: “Thinking critically about digital data collection: Twitter and beyond”

Dr. Rebekah Tromble, Leiden University

This talk will offer a critical perspective on some of the most common techniques used to collect data from the internet and social media platforms--with particular concern for how these techniques potentially influence (i.e., bias) our analyses. While the talk will primarily rely on examples from Twitter, it will also consider how these same issues affect data collection from other platforms.

Invited talk 4: Collection, storage and analysis of online teenage talk: assets and challenges

Dr. Reinhild Vandekerckhove, University of Antwerp, Belgium

I will address a range of issues based on 10 years of experience with sociolinguistic research on informal computer-mediated communication (CMC) produced by youngsters. Starting from the two main datasets we are currently working with (corpus 2007-2013 and corpus 2015-2016), I’ll discuss some challenges with respect to gathering data on the social profile of the informants and some ethical issues. Next, attention will be devoted to the consequences of the size and (often imbalanced) composition of CMC-corpora for the data processing. In order to illustrate the challenges of the genre I'll briefly deal with a specific methodological issue: whether or not to operationalize the occurrence of CMC-features as binary or ordinal variables. Finally, while large corpora generally trigger (and necessitate) quantitative data processing, I want to stress that supplementary qualitative research may be indispensable if we do not want to get alienated from CMC-pragmatics.

Invited talk 5: “Text Analysis for Social Media Cybersecurity: the AMiCA Project”

Prof. dr. Els Lefever, Ghent University

The text analysis part of the AMiCA project, a cooperation between the University of Antwerp and the University of Ghent, developed methods and software to help moderators detect occurrences of unwanted or dangerous situations in their social networks. More specifically, the project developed prototype systems for the detection of cyberbullying, suicide announcements, and sexually transgressive behavior. In this talk I will focus on the text analysis methods that were used for normalization of social media text, for profiling users, and for detecting dangerous content. I will describe the architectures and results of the three resulting applications.

Demo session: An open source GATE toolkit for social media analysis

Dr. Diana Maynard, University of Sheffield, UK

In this presentation we will demonstrate our open source toolkit based on GATE, which covers the whole lifecycle of social media analysis, from Twitter collection to analysis and finally indexing, querying and visualisation of results. Analysis includes entity and topic recognition, semantic annotation and entity linking, and sentiment analysis, among other things. We will demonstrate case studies based on previous and current work analysing Brexit, Trump tweets, the UK and French elections, and Earth Hour. Tools for crowdsourcing training and evaluation data are also available as part of the kit.

Hands-on session: “Harvesting, processing and visualising data from Twitter”

Dr. Nikola Ljubešić, Jožef Stefan Institute

In this hands-on session the participants will learn how to use the newest version of the TweetCat tool, which enables collecting tweets that are either (1) written in a low-frequency language or (2) geotagged and published in a specific geographical perimeter. The hands-on will include the stages of (1) setting up a project, (2) running the data harvesting procedure, (3) defining specific variables of interest and extracting them, and (4) visualising some of the extracted variables on maps.

Practical requirements:
- PC (Windows, Linux) or Mac connected to the Internet
Software (installation before the trip to Kaunas is recommended):
- Python 2.7
- Python tweepy v3.5.0
- Python (
- Twitter account
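
To give a feel for stages (2) and (3) of the hands-on session, here is a minimal sketch of keeping only tweets geotagged inside a geographical perimeter and extracting one variable from them. It is not TweetCat's code and makes no calls to the Twitter API. It assumes each tweet is a dict with a `text` field and a GeoJSON-style `coordinates` field holding `[longitude, latitude]`; the function names, the sample tweets, the approximate bounding box and the `has_feature` variable are all hypothetical.

```python
# Illustrative sketch only; NOT TweetCat's implementation.
# Assumption: each tweet is a dict whose "coordinates" field is either None
# or a GeoJSON-style point: {"type": "Point", "coordinates": [lon, lat]}.


def in_perimeter(tweet, west, south, east, north):
    """Return True if the tweet is geotagged inside the bounding box."""
    geo = tweet.get("coordinates")
    if not geo:
        return False  # tweet carries no geotag
    lon, lat = geo["coordinates"]
    return west <= lon <= east and south <= lat <= north


def extract_variable(tweets, pattern):
    """Hypothetical variable of interest: does the text contain `pattern`?"""
    rows = []
    for t in tweets:
        lon, lat = t["coordinates"]["coordinates"]
        rows.append({"lon": lon, "lat": lat,
                     "has_feature": pattern in t["text"].lower()})
    return rows


# Made-up sample data; the box is a rough approximation of Lithuania.
LITHUANIA = dict(west=20.9, south=53.9, east=26.9, north=56.5)

sample = [
    {"text": "Labas rytas, Kaunas!",
     "coordinates": {"type": "Point", "coordinates": [23.9, 54.9]}},
    {"text": "Good morning from Ljubljana",
     "coordinates": {"type": "Point", "coordinates": [14.5, 46.05]}},
    {"text": "No location here", "coordinates": None},
]

kept = [t for t in sample if in_perimeter(t, **LITHUANIA)]
rows = extract_variable(kept, "labas")
print(rows)
```

The resulting rows (longitude, latitude, one extracted variable) are the kind of table that could then be plotted on a map in stage (4). A rectangular bounding box is used here only for simplicity; a real perimeter may need a more precise shape.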

As a follow-up to the workshop, we published two blog posts:

Adrien Barbaresi's reflections on the CLARIN-PLUS workshop "Creation and Use of Social Media Resources"

Impressions of CLARIN-LT on the CLARIN-PLUS Workshop on Social Media Resources - blog post written by the organisers



Kaunas Hotel
Laisvės alėja 79