Blog post by Adrien Barbaresi, Research Associate at the Austrian Academy of Sciences (Academy Corpora) and Berlin-Brandenburg Academy of Sciences (Zentrum Sprache).
The H2020 project CLARIN-PLUS is related to the European research infrastructure CLARIN-ERIC. One of its goals is to promote networking and scientific events among various communities of researchers. In this context, the workshop dedicated to "Creation and Use of Social Media Resources", held on 18 and 19 May 2017 in Kaunas, Lithuania, was a good occasion to gather specialists from various countries working on comparable language data, from microblogs and short messages to online communication platforms. The participants came from Austria, Belgium, Denmark, Finland, France, Germany, Italy, Lithuania, Latvia, the Netherlands, Norway, Slovenia, Sweden, Switzerland and the UK.
A series of invited talks covered varied topics, from ethical constraints on sensitive information to standards for archiving and republication, including processing issues and methods. Approaches vary according to research focus and disciplinary tradition, ranging from opportunistic collection and data-driven analysis to project-based curation and analysis of computer-mediated communication (CMC) data. In the case of sensitive topics such as cyberbullying or online harassment, it is even feasible to simulate online interaction in order to generate research data.
Workshop discussions focused on a comparison of methodological issues across the disciplines. It was interesting to observe that researchers working with CMC data have gathered comparable experience in spite of the diversity of topics and European languages they deal with. In that sense, ethical concerns as well as questions of copyright and republication made for lively debates.
CMC research in general proves to be a broad topic, with various research questions raised during the talks, such as social, gender, and generation-related biases on different platforms. Such biases have to be described and accounted for in studies. Likewise, phenomena exceeding the traditional range of alphabetic characters, such as emojis and emoticons, need to be considered not only from a practical point of view (access and processing) but also in a systematic way (are there frequent patterns of use, is there a "grammar"?).
Twitter data played a preponderant role, as it was the focus of a significant number of researchers at the workshop. How to work with it is still very much an open issue, and there are diverging approaches concerning methodology and data. Besides the apparent consensus regarding tweet IDs as the "exchange currency" for scientific cooperation and replication studies, open questions remain concerning data reuse for existing "wild" archives or derivative resources, such as linguistically annotated data. In general, the ethical and legal aspects of such databases have to be clarified, which could be a task for CLARIN, as could gathering information on and classifying the increasing number of processing tools dedicated to different platforms and languages.
All in all, gathering CMC data in one place and making it accessible on a massive scale to scientific apparatuses (for example indexing or user-related metadata) understandably raises concerns related to the human lives and interactions which are captured by, hidden in, or unfold beyond the data. The debate within the research community (especially among institutions funded by public money) is all the more necessary since corporations whose business model rests on the ongoing collection and storage of information made publicly available or propagated through a social network are not likely to voice concerns about these digital aggregates. The necessity to study language use in computer-mediated communication (CMC) appears to be of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, technological and techno-scientific issues for the general public.