Eva Gredel is a postdoctoral researcher at the Chair of German Linguistics of the University of Mannheim. Yana Strakatova is a PhD student at the University of Tübingen who works in the project “MoCo – Lexical-Semantic Modelling of Collocations”.
1. Could you tell us a little bit about yourselves, your academic background and your current work?
Eva Gredel (EG): My name is Eva Gredel and I am a postdoctoral researcher at the Chair of German Linguistics of the University of Mannheim. After my first degree in German, Romance and Media and Communication Studies, I received my doctorate at the University of Mannheim in 2014. In my doctoral thesis I evaluated corpus linguistic methods for discourse analysis.
Currently, I am a substitute professor at the Chair of German Linguistics in Mannheim and I am responsible for the coordination of the linguistic courses offered in German Studies. In Mannheim, teaching in linguistics is not only carried out by lecturers of the university, but also by colleagues from the Mannheim Leibniz Institute for the German Language (IDS), which is a CLARIN-D Centre.
Yana Strakatova (YS): My name is Yana Strakatova and I received my first degree in Russia at the Vladimir State University, where I studied English and German and received a diploma in teaching these languages. My education in the field of linguistics really started in Germany at the University of Dresden, where I completed the Master’s program in European Languages. The syllabus had a few courses in corpus linguistics and programming that got me interested in computational linguistics. My Master’s thesis was interdisciplinary: it was situated at the intersection of neonatology, psychology, and linguistics. I focused on the linguistics part and studied the language of birth stories using corpora and statistics. The variety of research directions in the field of corpus and computational linguistics motivated me to pursue a PhD. I am currently in the last year of my doctoral studies at the University of Tübingen, where I work in the project MoCo – Lexical-Semantic Modelling of Collocations funded by the German Research Foundation (DFG).
2. How did you hear about CLARIN-D and how did you get involved?
EG: Linguistics at the University of Mannheim has a corpus linguistic focus in research and teaching. In teaching, the lecturers use the CLARIN corpus infrastructure that is developed at the IDS in Mannheim. I therefore came into contact with CLARIN resources and the CLARIN infrastructure at a very early stage of my own studies and used them for empirical language analysis at all qualification levels.
In my doctoral thesis, I then evaluated the usability of various CLARIN resources for discourse linguistic studies in the tradition of Foucault. In my post-doc project, I now extend discourse-linguistic models in such a way that they can be used to analyse digital discourses, for example in Wikipedia. In 2016, 14 colleagues and I applied for a project with the German Research Foundation (DFG) (in the funding line “scientific network”) in which corpus linguistic approaches to digital discourses also play a central role. In the context of the network, I was invited to participate in CLARIN-D working group 1, “German Philology”. The task of this working group is to create documentation on applying digital data and tools to research questions (e.g. screencasts with use cases).
3. How has CLARIN-D influenced your way of working? How does your research benefit from the CLARIN-D infrastructure?
EG: The CLARIN-D infrastructure has always been a good starting point for my own study projects and qualification work to analyse language empirically. In my doctoral thesis, I investigated newspaper language for metaphorical patterns in media discourses on the discourse object “virus” – a topic that has recently taken on a whole new relevance – and evaluated, on a methodological level, various CLARIN-D resources with regard to their potential for such questions.
YS: CLARIN-D provides resources and tools for researchers with different backgrounds and research goals. I have never studied computer science and find myself somewhere in between computational and theoretical linguistics. The tools that assist me in my PhD project have been created in a way that anyone can benefit from them by adjusting them to their needs. If there is a word or phrase that seems ambiguous and causes a lot of discussion, the user-friendly interface can give a quick answer to resolve the issue. I can, moreover, make use of the raw data and obtain statistics on the combinations and patterns I am interested in.
Moreover, CLARIN-D provides an opportunity to store and share the new data collected and annotated in the research process. One of the crucial steps in my PhD was creating a gold standard dataset of collocations that we could use for different experiments. Its creation was based on the tools and resources already integrated in the CLARIN-D infrastructure, and now this dataset (codenamed “GerCo”) can be used by other researchers interested in the topic.
CLARIN-D is not just a collection of resources, it is also a community that offers great opportunities for exchanging experience and knowledge. In the first year of my PhD, I took part in the CLARIN Annual Conference in Pisa. If I am not mistaken, it was the first time there was a PhD students’ poster session at the conference, and it was a success. I presented some experiments I conducted for our project and got a lot of useful feedback from more experienced researchers.
4. Which CLARIN-D tools and corpora have you used and how did you integrate them into your existing research?
EG: Several CLARIN resources play a central role in my research projects and publications. In my doctoral thesis, I evaluated corpora that are available via the German Reference Corpus (DeReKo) and the Digital Dictionary of the German Language (DWDS) and that contain newspaper texts, while the focus of my postdoctoral projects has shifted to corpora of computer-mediated communication (CMC corpora). Especially the Wikipedia corpora in DeReKo, provided by the IDS, are the basis of my current research projects. In case studies, I investigate how Wikipedia can be analysed as a discursive space. Not only the encyclopaedic texts of Wikipedia (article pages) are relevant here, but also the hypertextually linked pages (talk pages), where Wikipedia authors exchange ideas about collaborative text production. In particular, the subcorpora of talk pages provide empirical access to internet-based communication and, for example, to Wikipedia-specific net jargon, which is characterised by numerous word formation products.
YS: In the MoCo project, I investigate the semantic properties of collocations, i.e. combinations of two words that co-occur more often than by chance, e.g. “black coffee”. For now, we focus on the German language since there are several lexical resources that can serve as an empirical basis for our research. The Digital Dictionary of the German Language (DWDS), a digital resource developed at the CLARIN-D centre BBAW (Berlin-Brandenburg Academy of Sciences and Humanities), contains numerous German corpora, statistical data based on these corpora, detailed dictionary entries with grammatical, etymological and semantic information, and examples from the corpora. We work a lot with polysemous words and use the DWDS dictionary for word sense disambiguation. For collecting the data for our research, we used the Wortprofil, a Sketch Engine-like application of the DWDS. The Wortprofil provides ranked lists of co-occurrences based on the statistical data from the DWDS corpora. This kind of automatic preprocessing made it feasible to collect and study a lot of data rather quickly. Another CLARIN-D resource that is a part of my daily work is the German wordnet GermaNet. It provides information about the lexical and conceptual relations between words. All the entries in GermaNet are manually created by experts in the field and, therefore, are very reliable.
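The idea of words co-occurring “more often than by chance” is commonly operationalised with an association measure such as pointwise mutual information (PMI). The sketch below is not the measure used by MoCo or the DWDS Wortprofil; it is a minimal illustration, with made-up counts, of how such a score compares observed co-occurrence against what independence would predict:

```python
import math
from collections import Counter

def pmi(pair_count, w1_count, w2_count, total_pairs):
    """Pointwise mutual information for a word pair.

    Compares the observed probability of the pair with the probability
    expected if the two words occurred independently of each other.
    """
    p_pair = pair_count / total_pairs
    p_w1 = w1_count / total_pairs
    p_w2 = w2_count / total_pairs
    return math.log2(p_pair / (p_w1 * p_w2))

# Toy adjective-noun pair counts from a hypothetical corpus.
pairs = Counter({("black", "coffee"): 40, ("strong", "coffee"): 30,
                 ("black", "car"): 5, ("new", "car"): 60})
total = sum(pairs.values())

# Marginal counts for the first and second slot of each pair.
w1, w2 = Counter(), Counter()
for (adj, noun), count in pairs.items():
    w1[adj] += count
    w2[noun] += count

score = pmi(pairs[("black", "coffee")], w1["black"], w2["coffee"], total)
```

A positive score means the pair occurs more often than chance; applications like the Wortprofil rank candidate co-occurrences by scores of this kind.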
5. To what extent are the tools and corpora you described important for your research? What specific features do they have?
EG: The above-mentioned Wikipedia corpora of the IDS serve me as a database for empirical case studies; for example, I use them to analyse the distribution of word formation products in the digital discourse of Wikipedia. A feature of these corpora is that especially the different types of talk pages of Wikipedia are characterised by CMC features and that these different types of pages are each available as subcorpora: the availability of subcorpora containing article talk pages, user talk pages and redundancy talk pages allows for extensive analysis of language use in digital discourses on Wikipedia. A special feature of the available Wikipedia corpora is certainly also that the Leibniz Institute for the German Language provides Wikipedia corpora in eight additional languages (including English, French, and Spanish). This makes it possible to analyse Wikipedia texts corpus-linguistically even in contrastive studies.
YS: All the resources I described above were not just used at a single stage of my PhD project: I use them every day to resolve new issues and test new hypotheses.
Those resources provide not only the empirical data for my research, but also serve as the basis for developing a theoretical framework for describing this data. I get the statistics about the use of natural language from the corpora, and all the lexical and semantic information comes from the work of many experts in lexicography, lexical semantics, and syntax. I know I can rely on this data and it allows me to make quick progress in my own research and make new discoveries.
6. What are the methodological and technical challenges that you face in your particular field?
EG: One challenge is certainly the enormous complexity of discourses in Wikipedia, which arises from their non-linearity – specifically from hyperlinks. Thus, there are numerous hyperlinks between the above-mentioned types of talk pages that are relevant for analysing digital discourses. By splitting the Wikipedia data into subcorpora according to the different types of talk pages – which makes sense in many places, both technically and in terms of content – these links, and thus some of the complexity of the discourses, are nevertheless lost to a certain extent.
YS: In computational linguistics, we need large amounts of text data for training and testing different models. Getting access to the required data can sometimes be very challenging, mainly due to the strict copyright laws. Another challenge is that language is not static; rather, it is constantly changing, so there can never be a resource that can be considered complete and account for all the variety there is in a language.
7. Which CLARIN-D resources, tools and services would you recommend to your colleagues?
EG: I can unreservedly recommend the CLARIN resources and infrastructure described above for language analysis to my colleagues.
YS: Apart from the resources and tools that I have already described, I would definitely recommend taking a look at the variety of tools provided by WebLicht. It can assist researchers in such tasks as tokenizing, POS-tagging, and parsing text corpora.
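To give a concrete sense of what the first step of such a pipeline does, here is a deliberately simplified, regex-based tokenizer in Python. This is not WebLicht’s implementation, just a sketch of the task of splitting running text into word and punctuation tokens:

```python
import re

# Very simplified tokenizer: a token is either a run of letters
# (including German umlauts and ß) or a single non-space,
# non-letter character. Real tools handle far more cases
# (abbreviations, numbers, clitics, multi-word tokens, ...).
TOKEN_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+|[^\sA-Za-zÄÖÜäöüß]")

def tokenize(text):
    """Return the list of word and punctuation tokens in text."""
    return TOKEN_RE.findall(text)

tokens = tokenize("Die Sprache ändert sich ständig, nicht wahr?")
```

Tokenized output like this is the typical input to the next pipeline stages, such as POS-tagging and parsing.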
8. Do you integrate CLARIN-D tools and resources into your university teaching? If so, how exactly do you integrate them?
EG: CLARIN-D tools and resources play a very important role in my university teaching. In my seminars, I introduce students to empirical work with corpora and show how, e.g., morphological analyses can be performed using corpora. Screencasts with use cases, which CLARIN-D has made publicly available on YouTube, for example, have proven to be very helpful. Students and learners in general can then follow in detail how individual CLARIN tools and resources can be used for corpus linguistic studies. The spectrum of possible studies ranges from grammatical and morphological to semantic questions. Students greatly appreciate the empirical work using corpora and achieve impressive results. For example, it is very popular to investigate neologisms using CLARIN resources.
9. What was the latest course with CLARIN-D resources and tools that you taught? How did you experience the students’ reactions?
EG: Apart from my current course “Media Linguistics” in the spring term 2020, my last course in which I accessed corpora with students was the seminar “Digital Discourses” in the fall term 2019. In this seminar, the students were very much involved in the corpus linguistic analysis of language using CLARIN resources. Some very good student research papers were produced.
10. What would you recommend students who are interested in the Digital Humanities?
EG: For students interested in Digital Humanities it is, of course, first of all useful to get a good overview of existing resources and infrastructures. CLARIN’s offers and services provide an excellent overview. Secondly, it is important to reflect on the specific characteristics of the available language material and how the respective infrastructure deals with these specifics (for example, the availability of Wikipedia subcorpora for the different types of (talk) pages). In the case of Wikipedia corpora, for example, it is then relevant to understand that language on the article pages of Wikipedia is fundamentally different from the language on the talk pages. CMC features of Wikipedia texts can only be analysed to a limited extent using the subcorpora of article pages, which are characterised by a special encyclopaedic style. For questions in this area, it is useful to use the subcorpora for talk pages, where Wikipedia authors engage in social interaction. Finally, it is then relevant to match good research questions with the appropriate resources.
11. What’s your vision for CLARIN-D and the Digital Humanities 10 years from now?
EG: One great wish would certainly be to continue working on the idea of building multimodal corpora. With Wikipedia as the object of investigation, for example, the relevance of image material becomes clear in many areas and it would often be appropriate to take the multimodality of the data material into account. The further expansion of CMC corpora that allow contrastive analyses would also be a great wish.
YS: I would like to see a uniform, user-friendly platform connecting all the resources and tools and providing free access for all researchers. I am sure that a lot of researchers would benefit immensely from a well-structured environment with clearly defined pipelines for the basic tasks. That way, one would have more time for more complicated tasks.