Tour de CLARIN: Interview with Ondřej Tichý

Submitted by Jakob Lenardič on 28 August 2019

In this Tour de CLARIN blog post, we present an in-depth interview with Ondřej Tichý, a corpus linguist who is deputy chair of the Department of English Linguistics at the Faculty of Arts at Charles University. Dr Tichý collaborates with and is a regular user of the Czech National Corpus. The interview was conducted via e-mail.

1. Please describe your academic background and current position. What inspired you to take a digital humanist approach to linguistics?

I earned my PhD in English Linguistics at the Faculty of Arts, Charles University in 2014. I have been teaching and conducting research at the same faculty since 2008, specializing in historical and corpus linguistics, quantitative and computational linguistics, digitization and digital humanities. Between 2014 and 2018, I served as a vice-dean for information resources, and since 2018 I have been the deputy head of the Department of English Linguistics. Parallel to my academic career, I have been working in IT since the late 1990s, and it was primarily my background in IT combined with my academic interest in diachronic linguistics that led me to the digital approach. Another motivation was to make important resources that I used for my own research available to the wider public as well, which led me to digitize an Anglo-Saxon dictionary for my MA thesis and then to conduct an automatic analysis of Old English morphology for my PhD. Finally, the projects based on the Helsinki corpora, which were compiled when corpus linguistics started to emerge as one of the major methodologies in diachronic linguistics in the 1990s, have been very inspiring to me from the very beginning.

2. What is your involvement with the CNC K-Centre?

I am both a dedicated user of their infrastructure and a collaborating researcher. I have been invited by the Centre to give talks on diachronic corpus topics (for instance, on lexical obsolescence in Late Modern English or on the quantification of orthographical variation in Early Modern English, which are two of my current research interests). I have also consulted a number of colleagues at the centre on some of these projects, and I hope our fruitful collaboration will continue in the future. But mainly, I use the centre’s infrastructure, tools and expertise to host and analyse corpora I need for my own projects. Many of these corpora are not in the public domain (either by the decision of their compilers or due to the licensing restrictions of their source material) and are only hosted for licensed users for research and teaching. In cooperation with the centre, however, we have also started publicly hosting data from the Early English Books Online (EEBO) project, and we are about to host the Old Bailey Corpus, which is based on a selection of the Proceedings of the Old Bailey, the published version of the trials at London's Central Criminal Court.

3. Which data collections in CNC do you use in your own research? Could you present and discuss some of your research that has resulted from your use of the CNC corpora?

I mostly use English diachronic corpora that the centre specifically processed and hosts for our department and students, but I have also used the DiaKorp, InterCorp and the SYN corpora for a contrastive angle.

One example of the research I do using the centre’s infrastructure is my recent work on spelling variation in Early Modern English based on the Parsed Corpus of Early English Correspondence. I introduced a novel methodology for the quantification of spelling regularity, which allowed a more objective assessment of its progression in time and which also makes use of the metadata provided by the CEEC such as gender, letter authenticity or relationship/kinship between the author and the recipient. I have explored interactions of such variables from the diachronic perspective using quantified levels of spelling regularity.

The measure introduced for this purpose is based on weighted information (Shannon) entropy, used as a measure of how predictable the spelling of individual functionally defined types is, and its calculation is partly based on the morphological tagging of the parsed version of the corpus. I have also tackled the problem of underrepresentation in certain periods by establishing a size-based sampling for scalar variables like time. For instance, I was able to show that letters written by women showed a greater degree of entropy – so a greater degree of variability – in spelling than letters written by men throughout the whole period (roughly from 1410 to 1680). However, this difference turned out to be a function of another sociolinguistic variable that I was accounting for besides the author’s gender, namely the relationship between the author and the recipient. Female authors corresponded significantly more with other members of the family than male authors, who mostly corresponded with acquaintances outside the family. In a familial context, there might be less pressure to conform to spelling standards, hence the greater degree of variation.
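To make the idea of an entropy-based measure more concrete, here is a minimal Python sketch. It is not the measure used in the study, whose exact definition and weighting are not given in this interview. It simply assumes that each token is a pair of a functionally defined type (here a hypothetical "lemma/tag" key) and its attested spelling, computes the Shannon entropy of the spelling variants of each type, and weights each type by its share of all tokens. The function name and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def weighted_spelling_entropy(tokens):
    """Frequency-weighted mean Shannon entropy (in bits) of spelling variants.

    tokens: iterable of (type_key, spelling) pairs, where type_key is an
    assumed identifier for a functionally defined type, e.g. "have/V".
    Returns 0.0 for fully regular spelling; higher values mean more variation.
    """
    variants = defaultdict(Counter)
    for type_key, spelling in tokens:
        variants[type_key][spelling] += 1

    total = sum(sum(counts.values()) for counts in variants.values())
    if total == 0:
        return 0.0

    result = 0.0
    for counts in variants.values():
        n = sum(counts.values())
        # Shannon entropy of this type's spelling distribution
        entropy = -sum((k / n) * log2(k / n) for k in counts.values())
        # Weight each type by its share of all tokens (an assumed weighting)
        result += (n / total) * entropy
    return result


# Toy data (invented for illustration, not taken from any corpus)
sample = [
    ("have/V", "haue"),
    ("have/V", "have"),
    ("the/D", "the"),
    ("the/D", "the"),
]
score = weighted_spelling_entropy(sample)
```

In this toy sample, the type with two equally frequent spellings contributes one bit of entropy and the consistently spelled type contributes none, so the weighted score lies halfway between the two.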

Another example is an older study on measuring the typological change in English that was based on the parsed versions of the Helsinki corpora. In this paper, my colleague Jan Čermák and I proposed a quantitative, but also holistic, methodology for establishing a level of morphological syntheticity within a language – that is, how much a language relies on morphological markings to convey syntactic information. The methodology is based on a series of corpus-based probes into the morphological behaviour of selected high-frequency nouns, adjectives and verbs from Old English to Present-Day English in corpora hosted by the CNC. We thereby managed to establish several levels of syntheticity that correspond to the well-known typological re-shaping that happened in the history of English, which shifted from a heavily synthetic language in its early days to an analytic one in the present day. For instance, Old English was highly synthetic, its nouns ending in 7 different inflections corresponding to the complex case system, whereas Present-Day English nouns only use the -s affix to mark plurality, and our proposed methodology was able to capture this quite precisely.

It should also be noted that the CNC often consults and helps out indirectly, not with its corpora or tools, but with its scientific and technical expertise. For example, in my research into the obsolescence of multi-word expressions in the history of English, it was only thanks to a colleague at the CNC and the centre’s computing resources that I was able to pre-process most of the Google Ngram dataset (about 2 terabytes of data).

4. Which challenges does one face when doing diachronic linguistics with corpora? Do CNC corpora employ any features that are specifically tailored to diachronic analysis? Is there any additional feature that you would like to see implemented in the future?

The specific challenges of diachronic corpus linguistics are numerous. Those that often trouble me are scarcity of data coupled with their representativeness, the quality of the data and, in the case of English, the formal variation that can be found on almost all levels of linguistic description. Such variation is more often than not problematic for tools that are geared for the analysis of Present-Day English. The CNC tools (rather than corpora), while not specifically tailored towards diachronic analysis (except perhaps for SyD), nevertheless lend themselves to it quite well. I am very happy with KonText and with how our colleagues at the CNC are both able and willing to tweak it to make things work for specialized users. In particular, the treatment of metadata and the ways these can be analysed and searched seem better to me than in, e.g., CQPweb or Sketch Engine.

Another advantage we are about to make use of is the possibility of analysing metadata at utterance level, which means that we will associate metadata with parts of texts rather than only with entire texts. As an example: a user can start by limiting the query (a search for a particular form or function) by the gender of the speaker or a specific timeframe, then view the frequency analysis based on the properties of the text containing the direct speech (e.g. by the type of offence in trial proceedings) and finally create a table interrelating two attributes (like the social class of the speaker and the orthography of the keyword). This makes corpora like the Old Bailey Corpus much more approachable to less experienced users, since they do not have to overcome the steep learning curve of CQL or similar query languages, and they can also see some of their results in a neat tabular format without the need to export the results and run a statistical tool on them. It should also be noted that while many similar features may be available in similar tools, KonText is open source and free to use.
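For readers who want to see what a table interrelating two attributes amounts to, the following Python sketch cross-tabulates a handful of invented concordance hits. It does not use KonText or its API. The attribute names `speaker_class` and `spelling`, the records and the function name are all hypothetical and are not the actual metadata of the Old Bailey Corpus.

```python
from collections import Counter


def cross_tabulate(hits, row_attr, col_attr):
    """Count hits for every combination of two attributes.

    hits: list of dicts, one per concordance hit (hypothetical structure).
    Returns a nested dict: table[row_value][col_value] -> frequency.
    """
    counts = Counter((hit[row_attr], hit[col_attr]) for hit in hits)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # Fill in zeros so that every row has the same columns
    return {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}


# Invented toy records, for illustration only
hits = [
    {"speaker_class": "higher", "spelling": "show"},
    {"speaker_class": "higher", "spelling": "shew"},
    {"speaker_class": "lower", "spelling": "shew"},
    {"speaker_class": "lower", "spelling": "shew"},
]
table = cross_tabulate(hits, "speaker_class", "spelling")
```

The point of the feature described above is precisely that users get this kind of table inside the interface, without having to write a script like this one.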

5. What are the main benefits of the KonText search interface? Do you use any of the other CNC tools, such as SyD, Morfio, in your work?

In my research, I mainly use KonText and recently the brand-new Corpus Calculator. I use SyD for teaching – as a tool roughly comparable to Google Ngram Viewer – since it provides a very user-friendly way to compare lexemes across the CNC corpora both synchronically and diachronically. As I noted in my previous answer, I like KonText because it allows me to quickly search and analyse metadata, which matters to me because I like to focus on the social aspects of language change. I also like CQL since it is easy to teach and learn. Furthermore, it is very well documented in the CNC Wiki and it is very similar to query languages used in other concordancers. From a teacher’s perspective, CQL and the other search options make KonText easy to start with and yet very powerful at the same time.

6. What kind of feedback have you provided on the CNC corpora and their user interface? What is your experience with the CNC User Forum? Why is it important for the CNC K-Centre to offer such user support?

Since the CNC often accommodates me by hosting all kinds of corpora that tend to be different from the Czech corpora they are predominantly focused on, I often request changes or new features – mostly by e-mail to specific colleagues but also through GitHub. While the CNC may not always immediately implement all my outlandish ideas, it has in general been very forthcoming about my requests. In one example, I requested that headers be added to the .csv and .xlsx files exported from KonText, and the CNC team quickly implemented the change.

7. How do you use the CNC corpora in your teaching? Have your students obtained any interesting results from the CNC corpora?

I use KonText in most of my classes focused on the History of English to showcase specific changes, and I also teach how to use the interface in my English Diachronic Corpora course. Almost all of the students at our department learn to use KonText and InterCorp, and the majority of theses in our linguistic programmes are corpus-based, so most of the final theses (several dozen a year) are based on CNC-hosted corpora. A lot of the theses take a contrastive approach focusing on features of Present-Day English and Czech, but there have been a number of diachronic theses and papers as well. One of my PhD students is now working with the CNC-hosted EEBO data to research lexical losses in Early Modern English. Another of our PhD students is developing a parallel corpus of Old English and Latin translations that will again be hosted by the CNC, which has already extended its support for this project. Finally, one of my PhD students prepared lessons in English, available on the CNC wiki, on using the diachronic EEBO corpus. They show how KonText can be used to account for spelling variation, look at diachronically competing word forms and analyse morphology, among other uses. We hope that some of our students will develop a similar online course for the Old Bailey Corpus.
