Tour de CLARIN: Interview with Sidsel Boldsen

Submitted by Karina Berger on 12 April 2022

Sidsel Boldsen is a PhD student in Natural Language Processing (NLP) and digital humanities, with a special interest in historical languages and linguistic knowledge representation. She is part of the interdisciplinary research project ‘Script and Text in Time and Place’ at the University of Copenhagen.

1. Please describe your academic background and current position.

I’ve just handed in my PhD thesis, and am due to defend it at the beginning of May. My background is in historical linguistics and comparative linguistics, which I did for my BA. But then I moved on to language technology and did a Master’s in IT and Cognition at the University of Copenhagen. In my PhD, I focus on language technology and I have a special interest in language change, which comes from my background in comparative linguistics. I became interested in language technology because I thought that programming and scripting could offer interesting research avenues for linguistic studies involving digital corpora.

2. You are part of the research project ‘Script and Text in Time and Place’ at the Department of Nordic Research at the University of Copenhagen. Could you describe the project?

The project is very interdisciplinary and is a qualitative and quantitative study of about 300 medieval Danish charters from the thirteenth to the sixteenth centuries. The goal is to study the script and language of medieval Denmark through these resources. These charters are very interesting from both a historical and linguistic point of view because they have not been edited in any way, so they’re direct sources of language and history. They are also dated and geographically localised, based on where they were produced, so you can get a very nuanced picture.

Often when you work with historical texts, it’s an edition of an edition, and it can be difficult to say what the ‘real’ language actually is, and what the later additions are. So there are philologists working on the project, but also historians looking into the monastic history.

The project will end in May 2022. The main output will be an open-source, digital scholarly edition of the charters. This scholarly edition will enable scholars within philology and history to search these charters – to see the texts in different layers, where we have annotated the different features of the script, and other levels, too. For instance, we lemmatised the texts so that you can search for word forms, and we also annotated the people and places that occur in the text.

3. What has your role in the project been?

One focus of the project has been to develop tools for automatic linguistic analysis of texts, as well as automatic dating, localising and identifying scribal schools. Such customised digital tools should improve the quantitative analysis of historical sources. I am involved in that part.

There aren’t many tools available for dating or localising Danish script, and little systematic analysis has been conducted on Danish medieval texts so far. We have been working towards new methods for automated dating, localisation and grouping of texts based on machine learning techniques (MLT), which will improve our understanding of what the relevant factors are for establishing the date, place and scribe of primary sources from the Middle Ages. The benefit of using MLT in the project is that it allows us to take advantage of the dated charter material for building a reference and training corpus.

What I did was to look at how language changed and whether it was possible to develop tools to date texts without a date. These corpora are all dated, so can you use them to develop tools that automatically date texts that don’t have a date? Perhaps this tool could help to make undated texts more accessible. We started to develop the tool – that was the starting point for my thesis. But then my work became more theoretical and I began to study how language change is actually captured in language models, and what kind of features they recognise or are sensitive to.

I’ve tried to look at different layers. Of course, there’s topical change, with different places being named or different expressions being mentioned differently through time, a sort of topic model. But then I also looked at sound change. In that period, we know that some sound changes were supposed to have happened, but when we look in the corpus, can we actually identify those? So I was also interested in change on a phonological level.

4. Can you say a little more about the tool you have been developing?

The tool – it’s not a fully developed tool yet – uses support vector machines (SVMs) in order to automatically assign the manuscripts to a specific time period, or bin. It represents a text in a vector space, using either the words it contains or smaller segments such as character n-grams as features, and then tries to learn decision boundaries between the different classes in that space. In this case, the classes are the specific time periods – it could be centuries, or it could be spans of 50 years. The tool then tries to learn how to separate the documents that are projected into that space.
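The representation step described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the trigram size, the 50-year bin width and the toy snippets with their years are all assumptions made for the example, and the SVM itself is left out (in practice it would come from a machine learning library and be trained on `features` and `labels`).

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lower-cased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def vectorise(text: str, vocabulary: list[str], n: int = 3) -> list[int]:
    """Map a text to a fixed-length count vector over a shared n-gram vocabulary."""
    counts = char_ngrams(text, n)
    return [counts[gram] for gram in vocabulary]


def time_bin(year: int, width: int = 50) -> tuple[int, int]:
    """Return the half-open period [start, end) of the given width that contains the year."""
    start = (year // width) * width
    return (start, start + width)


# Invented toy snippets and invented years, for illustration only.
charters = [("alle men thette breff", 1420), ("thet scal alle men wide", 1475)]

# The vocabulary is every trigram seen in the training texts, in a fixed order.
vocabulary = sorted(set().union(*(char_ngrams(text) for text, _ in charters)))

# One count vector and one time-period label per charter.
features = [vectorise(text, vocabulary) for text, _ in charters]
labels = [time_bin(year) for _, year in charters]
```

Each charter becomes one point in a space with one dimension per n-gram, and each label is a period such as `(1400, 1450)`; a classifier would then be trained to separate the points by period.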

When you receive a new document, you map that new document into that space, and then you can evaluate how well the space was constructed by how well documents are divided into those periods. We achieved pretty good accuracy, around 75%, which means we were able to date almost 75% of the charters within a 25-year error margin, which philologists use as a standard for the precision with which medieval texts can be dated manually. But how these so-called bins are actually constructed is a bit complex, and we didn’t really find a way to address that in our paper on the topic.
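One possible reading of "accuracy within a 25-year error margin" is sketched below. The interview does not say how a predicted bin is compared with a charter's true date, so taking the bin's midpoint as the predicted year is purely an assumption here, as are the function names and the toy numbers.

```python
def bin_midpoint(period: tuple[int, int]) -> float:
    """Represent a predicted period [start, end) by its midpoint (an assumption)."""
    start, end = period
    return (start + end) / 2


def accuracy_within(true_years, predicted_bins, margin=25):
    """Share of documents whose predicted midpoint lies within `margin` years of the true year."""
    if len(true_years) != len(predicted_bins) or not true_years:
        raise ValueError("need equally long, non-empty lists")
    hits = sum(
        abs(bin_midpoint(period) - year) <= margin
        for year, period in zip(true_years, predicted_bins)
    )
    return hits / len(true_years)


# Invented toy values: three of the four predictions fall within 25 years.
true_years = [1410, 1440, 1490, 1380]
predicted_bins = [(1400, 1450), (1400, 1450), (1400, 1450), (1350, 1400)]
score = accuracy_within(true_years, predicted_bins)
```

With 50-year bins, any charter assigned to the correct bin is automatically within 25 years of the midpoint, which is one reason the choice of bin width and boundaries affects the reported figure.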

If I were to develop the tool further, I think the work should focus on what type of features it actually recognises. For it to be useful, it would have to work for another corpus, which had been annotated using different schemas, for example. We have yet to test that. But that would be the question: how well is it able to generalise across corpora?

We have another corpus of charters or medieval documents called Diplomatarium Danicum, and it would be interesting to test the tool on this resource because it is a much bigger corpus spanning a broad period. It would be interesting to see how well the tool that has been trained on this very specific corpus would transfer to that. In principle, I think the tool could be useful for other corpora as well, at least within the same domain. We are currently working on a contribution investigating the generalisability of such methods, in which we also plan to make the tool available for scholars to test on their own data. So, stay tuned!

5. You are applying machine learning techniques to the analysis of medieval Danish texts with the cooperation of CLARIN-DK. How did you start collaborating with CLARIN-DK? Have you used any specific CLARIN tools as part of your research?

One of our project members, Bart Jongejan, is in the CLARIN-DK team. So when we needed tools, we used one that was in the CLARIN-DK repository. Although the charters had already been transcribed, they were annotated in a CSV-like format, in which each row represents one token, and they needed to be converted to XML format for the actual edition. We used the workflow manager for NLP called Text Tonsorium in order to automatically convert the format.
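To give a sense of what a one-token-per-row to XML conversion involves, here is a minimal sketch using only the Python standard library. It does not show how Text Tonsorium works: the tab-separated layout, the column names `form`, `lemma` and `pos`, the element names `text` and `w`, and the sample rows are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tokens_to_xml(table: str) -> ET.Element:
    """Turn tab-separated text with one token per row into a flat XML element tree."""
    root = ET.Element("text")
    for row in csv.DictReader(io.StringIO(table), delimiter="\t"):
        # Each row becomes one <w> element carrying its annotations as attributes.
        word = ET.SubElement(root, "w", lemma=row["lemma"], pos=row["pos"])
        word.text = row["form"]
    return root


# Hypothetical sample: a header row, then one token per row (values invented).
sample = "form\tlemma\tpos\nalle\tal\tPRON\nmen\tman\tNOUN\n"
xml_string = ET.tostring(tokens_to_xml(sample), encoding="unicode")
```

Here `xml_string` holds a `<text>` element with one `<w>` child per token. A real edition format would need more structure (documents, lines, annotations for people and places), which is the kind of work a dedicated conversion workflow takes care of.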

We used the same tool for automatic part-of-speech (PoS) tagging of the Latin charters. For the Danish charters, we wanted to annotate PoS and lemma manually and, in this case, we used the tools offered in Text Tonsorium as a starting point for the annotation. This was very useful as it dramatically reduced the workload of the manual annotation. One of the great features of the Text Tonsorium is that it offers many different pipelines and workflows, so you can quickly test out different parsers or lemmatisers. Otherwise, you would have to set up all the different tools and try them out, but this is one common format where you can try them all out in one go.

For my research area, CLARIN-DK provides everything I need. They have trained both an Old Danish lemmatiser and a Latin one, and also offer PoS-tagging.

6. Why is it important to take a computational approach, such as natural language processing, in the humanities?

I think the contribution from computational methods is two-fold: the most important, in my view, is to make corpora more accessible and searchable, so that qualitative researchers can use these resources in a more focused way. I think it’s more about how language technology can assist certain research questions that qualitative researchers work with. So, a way to assist, not to replace different methods. You can work with larger resources and filter them, for example.

And the other is to actually use those tools in order to carry out humanities research, which can also be very interesting. For example, if you develop a tool to date text automatically, you could maybe also learn from those tools: what are the predictive features of language change? In that way you use those tools not only to be able to annotate, but also to learn from those models and use them to actually study language and text. But I think the first contribution is the more important one.

7. What is in store for your future collaboration with CLARIN-DK?

I don’t have a project in mind at the moment, but I’d definitely be open to working with CLARIN again. The collaboration with CLARIN-DK was a very positive experience for me. I’m busy finishing the project, getting the whole scholarly edition online. My defence is at the beginning of May and after that I’ll be able to think about what lies ahead.

The scholarly digital edition will be accessible online at the end of May 2022.

Text Tonsorium is also included in the Text Normalisers CLARIN Resource Family.