Tour de CLARIN highlights prominent User Involvement (UI) activities of a particular CLARIN national consortium. This time the focus is on the Czech Republic and Dr Radim Hladík, a postdoctoral researcher at the Institute of Philosophy at the Academy of Sciences of the Czech Republic in Prague and at the National Institute of Informatics in Japan. The following interview took place via Skype and was conducted and transcribed by Jakob Lenardič.
1. Please describe your academic background and your current position(s).
I received my PhD in sociology at the Faculty of Social Sciences in Prague. Currently, I’m a JSPS post-doctoral fellow at the National Institute of Informatics in Japan, where I am representing my sending organization, the Institute of Philosophy of the Czech Academy of Sciences. Many of my colleagues in Japan are computer scientists, so this is a wonderful opportunity for me to improve my coding skills and to be inspired about how to combine computational methodologies with social science research topics.
2. How did you get involved with the Czech CLARIN consortium? Could you describe your collaboration with the consortium?
Two years ago, as a delegate of the Institute of Philosophy, I was one of the coordinators in a fairly large Digital Humanities project by the Library of the Czech Academy of Sciences. The proposal was ambitious, since we wanted to strike up a collaboration between many Czech institutes relevant for Digital Humanities, such as libraries, universities, and various institutes for linguistics and social sciences. Sadly, the project never left the planning stages, but it nevertheless brought together proponents of Digital Humanities, including me and the colleagues from Czech CLARIN. I was very inspired by their work and soon started learning how to code and apply computational approaches to my own research, which is otherwise rooted in sociology and media studies. Since then, I’ve been using tools and resources that Czech CLARIN provides and am in contact with their experts like Pavel Straňák, with whom I discuss my work and who has often helped me with technical issues.
As for concrete collaborations, we’ve recently established the Czech Association of Digital Humanities, for which I currently serve as the Chair. Several people from Czech CLARIN are very active in this association, like Eva Hajičová and Silvie Cinková. We also submitted a project under the Czech DARIAH node last year with Czech CLARIN as the principal investigator. Its goal is to conduct an extensive corpus-based analysis of modern Czech texts from various domains (e.g., 20th-century philosophy). I’ll be involved as a representative of the Institute of Philosophy, which aims to contribute its historical and philosophical corpora and text collections to Czech CLARIN. I believe that such a collaboration is of great importance for both sides. On the one hand, Czech CLARIN will give us an invaluable platform for the curation and sustainability of our resources, while on the other hand, they’ll be able to expand the applicability of their tools to new domains and across historical language variations on the basis of our resources.
3. Which are the tools and resources provided by Czech CLARIN that you use in your research? Could you discuss how you use them in your own work?
If you work with texts in a language that is as morphologically complex as Czech, lemmatisation and morphosyntactic annotation of texts is needed even for the simplest analyses. In this sense, the tools that Czech CLARIN provides are absolutely essential for my current work.
I’d like to point out MorphoDiTa, which is a tool for tokenisation, lemmatisation and morphological analysis. What I especially appreciate about MorphoDiTa is its flexibility: you don’t need to install it as a stand-alone program on your computer, but can use it as an API service that you easily integrate into your own code. This way, I don’t need to worry about installing additional components and their dependencies. I often come across tools that require a complicated installation process, which dissuades me from using them.
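As an illustration of the API-based workflow described above, the following Python sketch builds a tagging request and reads lemmas out of a JSON reply. It is only a sketch under stated assumptions: the service URL, the `data` and `output` parameters, and the shape of the JSON response are assumptions that should be checked against the MorphoDiTa web service documentation, and the sample reply is hand-written with placeholder tags rather than real service output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# ASSUMPTION: address of the hosted tagging endpoint; verify before use.
MORPHODITA_TAG_URL = "https://lindat.mff.cuni.cz/services/morphodita/api/tag"


def build_tag_request(text, base_url=MORPHODITA_TAG_URL):
    """Return a GET URL asking the service to tag `text` and reply in JSON.

    ASSUMPTION: the service accepts `data` (the text) and `output` (the format).
    """
    return base_url + "?" + urlencode({"data": text, "output": "json"})


def extract_lemmas(response_body):
    """Return (token, lemma, tag) triples from a JSON reply.

    ASSUMPTION: the reply looks like
    {"result": [[{"token": ..., "lemma": ..., "tag": ...}, ...], ...]},
    i.e. a list of sentences, each a list of token records.
    """
    payload = json.loads(response_body)
    return [
        (tok["token"], tok["lemma"], tok["tag"])
        for sentence in payload["result"]
        for tok in sentence
    ]


def tag_text(text):
    """Send the request over the network and parse the reply (not run here)."""
    with urlopen(build_tag_request(text), timeout=30) as response:
        return extract_lemmas(response.read().decode("utf-8"))


# Offline demonstration with a hand-written reply (tags are placeholders).
request_url = build_tag_request("Děti spí.")
sample_reply = json.dumps({
    "result": [[
        {"token": "Děti", "lemma": "dítě", "tag": "TAG_A"},
        {"token": "spí", "lemma": "spát", "tag": "TAG_B"},
        {"token": ".", "lemma": ".", "tag": "TAG_C"},
    ]]
})
triples = extract_lemmas(sample_reply)
print(request_url)
print(triples)
```

Keeping the network call in its own function (`tag_text`, a hypothetical helper name) means the request building and response parsing can be tried out without any local installation and without contacting the service.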
What I also appreciate is that the Czech CLARIN repository keeps track of all the versions of a resource you upload. I believe this takes a lot of pressure off the whole publishing process, since I know that I can always publish a newer version of a dataset if I do some additional work on it. This makes me more confident in releasing a dataset sooner, especially as the repository also welcomes non-final versions, which are then automatically linked to newer ones.
4. Your research scope is very broad; among others, you apply a Digital Humanist approach to the study of scientific writing in social sciences. Could you briefly describe how you conduct your research in connection with this topic?
In my postgraduate work, I have been interested in how historical events are represented through mediated communication and why only certain statements about the past are regarded as a truthful representation. Currently, I’ve been tackling similar questions in connection with scientific writing, where I’m mostly interested in how scientists establish the validity of their claims. However, most sociological research on this topic has been purely qualitative or conducted on a handful of sampled texts. I find such an approach limited, since you can’t really make general claims about whole decades of scientific writing in a particular domain based on a few dozen papers.
Consequently, I soon started wondering what a proper digital approach would reveal about this topic and I began working on creating a corpus of Czech sociological articles from scratch. Currently, my corpus is fairly small – after the clean-up it consists of around 500 articles, but it will hopefully grow with time.
5. Have there been any significant results yet?
I’ve obtained some interesting results by combining my corpus with a corpus of literary texts that I downloaded from the repository of Czech CLARIN. I brought the two corpora together by creating a vector space model of the documents consisting of very low-level features – the most frequent verbs that are shared between the corpora. I then applied clustering methods to the combined corpus to see which specific sociological texts have the most in common with the literary texts. As an example, clustering showed that such sociological texts often give voice to their data, by providing quotes of the people who are the subjects of the study in question. But the clusters do not only differ in language use. What I found out is that such texts are also more likely to be written by female authors and often tend to be cited less than those texts which have little in common with fiction. Both of these observations turned out to be statistically significant. I plan to release this sociological corpus through the Czech CLARIN repository once it’s completed.
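The procedure described in this answer (a vector space built from the most frequent verbs shared by two corpora, followed by clustering) can be sketched in Python as follows. This is a minimal illustration, not the method actually used in the study: the toy documents, the function names, and the choice of a simple two-cluster k-means with a deterministic start are all assumptions made for the example, and each document is assumed to be already reduced to a list of verb lemmas.

```python
from collections import Counter


def shared_top_verbs(corpus_a, corpus_b, n):
    """Return the n most frequent verb lemmas that occur in BOTH corpora."""
    freq_a = Counter(v for doc in corpus_a for v in doc)
    freq_b = Counter(v for doc in corpus_b for v in doc)
    total = freq_a + freq_b
    shared = set(freq_a) & set(freq_b)
    # Sort by combined frequency, breaking ties alphabetically.
    return sorted(shared, key=lambda v: (-total[v], v))[:n]


def vectorize(doc, features):
    """Represent a document by the relative frequency of each feature verb."""
    counts = Counter(doc)
    size = len(doc) or 1
    return [counts[f] / size for f in features]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def two_means(vectors, iterations=20):
    """Tiny k-means for k=2, started from the first vector and the one farthest from it."""
    first = vectors[0]
    second = max(vectors, key=lambda v: sq_dist(first, v))
    centroids = [list(first), list(second)]
    labels = []
    for _ in range(iterations):
        new_labels = [
            0 if sq_dist(v, centroids[0]) <= sq_dist(v, centroids[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data (invented for illustration): each document is a list of verb lemmas.
sociology = [
    ["analyse", "measure", "show", "analyse", "measure"],  # report-like text
    ["say", "tell", "say", "feel", "show"],                # quote-heavy text
]
fiction = [
    ["say", "tell", "feel", "say", "show"],
    ["say", "feel", "tell", "tell", "smile"],
]

features = shared_top_verbs(sociology, fiction, n=4)
vectors = [vectorize(doc, features) for doc in sociology + fiction]
labels = two_means(vectors)
print(features)
print(labels)
```

In this toy data the quote-heavy sociological text ends up in the same cluster as the two fiction texts, while the report-like text stands apart. A real analysis would use many more documents and features, and would more likely rely on an established clustering library.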
6. Why is an infrastructure like Czech CLARIN (or CLARIN ERIC in general) important for the general research community?
I’ve met quite a few researchers from non-technical disciplines who oppose the use of quantitative methods in what they perceive to be qualitative research questions. I understand their point of view, which I used to share to an extent. But now that I have some experience with using language tools and resources myself, I find that such opposition often isn’t really justified, although researchers must be aware of potential limitations and make sure to use the right tools for their purposes. In other words, there are many misconceptions about quantitative research and I believe that Czech CLARIN can help a lot in this regard through its User Involvement events. After my personal experience of auditing the CLARIN-PLUS workshop “Working with Digital Collections of Newspapers”, I think that the workshops are especially important because they’re a platform where CLARIN experts can show how their tools work and how they not only answer specific research questions from various disciplines, but also open up many approaches to doing research. An event that directly involves its participants is definitely much more convincing than a dry lecture on Digital Humanities that does not provide any concrete examples.
Additionally, such events are often the starting points of many fruitful cross-disciplinary collaborations in which social scientists or humanities researchers team up with computer science experts. Due to such collaborations, getting involved in Digital Humanities does not necessarily mean that you have to become an expert programmer yourself; you often only need to get intuitively acquainted with the computational methodologies and learn the basic skills, just enough to find common ground for conversations with the specialists.
7. How do your students and fellow researchers embrace the Digital Humanist approach? How are Digital Humanities in general represented in the Czech academic environment?
At the Institute of Philosophy, there is quite a lot of enthusiasm for Digital Humanities, since the management and many researchers see it as a step forward in scientific research. At universities, it depends a lot on the particular department. For instance, I once attended a course on programming in R that was given by Silvie Cinková from Czech CLARIN. Many students who also attended this course were from various humanities disciplines. They were very enthusiastic about learning how to programme and potentially applying programming skills to research questions within their own domains. Consequently, I think there are more students who are interested in such quantitative approaches than the management of humanities departments might realise. The problem, of course, is that the faculty at such departments doesn’t usually have the required skills to teach a Digital Humanities course, so they often invite external teachers from the industry to teach a course or two. However, fully embracing the Digital Humanities would probably require a revamping of the curriculum with a greater number of DH courses tailored to topics that are directly relevant to Humanities research interests.
8. What is your vision for the future of Czech CLARIN?
What I really appreciate about Czech CLARIN is that they have managed to develop tools for Czech that can easily compete with state-of-the-art language technologies developed for larger languages like English. At LREC 2018, it was obvious to me that language technologies are rapidly becoming more and more advanced worldwide. I’m confident that Czech CLARIN will continue to keep up and make sure that their tools always stay in touch with the state of the art.
If there’s one thing that I’d like to see improved, it’s the documentation of the tools and resources, which could be made more user-friendly and contain more examples of use because learning a new tool can be very intimidating.