Tour de CLARIN: Interview with Menno van Zaanen

Submitted by Karina Berger on 12 October 2021

Written by Jakob Lenardic

Menno van Zaanen is Professor of Digital Humanities at the South African Centre for Digital Language Resources (SADiLaR).

1. Could you please introduce yourself, your academic background and current work?

I am Menno van Zaanen, research manager and Professor in Digital Humanities at the South African Centre for Digital Language Resources (SADiLaR). I received a Master’s degree in computer science (focusing on 'low-level' computer science, such as operating systems and computer networks) at the Vrije Universiteit, and a Master’s degree in computational linguistics at the University of Amsterdam (both in Amsterdam, the Netherlands). In these educational programmes, I noticed that there is an overlap of techniques used in both fields. I was wondering if this could be extended, so I focused on the interaction between computational linguistics and computer science. My graduation project for computer science dealt with applying a robust parsing technique from computational linguistics in the area of error correction in the compilers of computer programmes.

For my PhD (University of Leeds, UK), I initially wanted to develop a grammar checking tool by applying error correction techniques from compilers in natural language processing. However, I realised that for that to work, a complete grammar of the natural language would be required. So instead, I started working on language learning tools. I developed a grammatical inference system that aims to learn syntactic grammars for natural language. After that, I realised that such systems can also be used for other sequential information, such as music.

Following that, I have always tried to reuse techniques, such as grammatical inference and machine learning, in different fields. This has led to interesting research and wonderful collaborations in a range of research areas. Currently, I am working together with several of my colleagues to see how far digital humanities techniques that are successful, say for English, can also be applied to South African languages. For these languages, limited amounts of resources (datasets and tools) are available, and the quality of the tools is not always very high (this seems to be mostly to do with the domain on which the tools have been trained). We aim to identify approaches that are robust, in the sense that errors in earlier steps in the process do not have a major impact on the final result.

2. How did you hear about CLARIN and how did you get involved?

When I was still working in the Netherlands (Tilburg University, Tilburg), I regularly received information on the events and activities organised by CLARIN. Several of my colleagues were active in CLARIN (and later CLARIAH) funded projects, and I was also involved in the OpenSoNaR project. This project aimed to provide a user-friendly interface for the Dutch SoNaR corpus, which is a 500-million-word corpus of Dutch and Flemish texts. Having access to such a corpus is wonderful; however, given the size of the corpus, handling such large amounts of data is non-trivial, especially if you also take the annotations that are present in the dataset into account. A resource like OpenSoNaR is therefore essential.

Here in South Africa, I am involved on a different level. Working at SADiLaR means that I now see more of what is needed to be a CLARIN centre. This includes providing detailed documentation of the services and facilitating various aspects of standardisation for the resources.

3. How has CLARIN influenced your way of working? How does your research benefit from the CLARIN infrastructure? Which CLARIN resources, tools and services would you recommend to your colleagues?

The fact that CLARIN exists has made me realise more that we as researchers do not work in isolation, but are part of a larger network. As such, I especially enjoy the events organised by CLARIN. For instance, the Twin Talks events in which people present how they work together with researchers from different fields provide wonderful ideas on how to tackle the interdisciplinary communication problems. Inge van de Ven (Tilburg University, Tilburg) and I also presented how we collaborate as researchers with different backgrounds (culture studies and computational linguistics/digital humanities) and for the preparation of this presentation we had to reflect on what problems we ran into and how we solved these. Hopefully, such presentations also help other researchers with their collaborations.

Additionally, events such as the CLARIN Bazaar are nice places to meet other researchers in the field. During the informal discussions at the Bazaar, I have learned about new resources that I did not know existed and I have met people working in the same field that I did not know yet.

Another CLARIN service that I like is the Virtual Language Observatory. This is a tool that allows for searching in a wide range of repositories. Not many South African language resources are available, so this tool is very useful to identify the resources that are out there. Having a central point, like the Virtual Language Observatory, where different resources can be found, makes it easier for me as well as interested researchers to start working with, for instance, different South African languages. As such, I also find it very important that SADiLaR’s resources are findable in these services.

At the moment, many digital humanities researchers in South Africa are interested in using more computationally oriented tools, but do not know where to start. CLARIN provides training and resources for such researchers, and SADiLaR specifically provides training (currently in the form of workshops, but we are also developing online course material) to boost the computational skills of researchers from a South African language perspective. This includes information on digitisation, but also on the practical use of the computational linguistic tools and general computational skills.

4. Which CLARIN tools and corpora have you used and how did you integrate them into your existing research?

Recently, I started using OCR systems and named entity recognisers developed for the South African languages, which are available through the SADiLaR repository. As I do not speak any of the indigenous South African languages personally, I rely on my colleagues who do speak them to interpret the results and evaluate the quality of the output. The discussions based on the output of the OCR systems and named entity recognisers have made the collaboration between us much more concrete.

For instance, we have experimented with identifying social networks in fiction, mainly in novels and plays, in different South African languages. We first apply the OCR system to scans of the physical books and then use named entity recognition to identify the characters in the books. The next step is to try to identify the relationships between the characters (based on co-occurrence). These relationships are then visualised in graph form. Running the whole pipeline in this way gives us a sense of how robust the approach is. For instance, we know the named entity recognisers are not perfect (they miss characters and also tag words that are not names of characters), but we hope that the errors made by the named entity recognisers do not have a major impact on the identification of the relationships. This can be evaluated by looking at the resulting social networks.
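The co-occurrence step described above can be illustrated with a minimal Python sketch. It is not the code used in the project: it assumes that the named entity recogniser's output has already been reduced to one list of tagged character names per text segment (for example, per paragraph), and the function name `cooccurrence_network`, the `min_weight` threshold, and the sample names are all hypothetical. Two characters are linked each time they are tagged in the same segment, and the link weight counts how many segments they share.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(segments, min_weight=1):
    """Count how often each pair of names is tagged in the same segment.

    segments: list of lists of names (hypothetical, simplified NER output).
    min_weight: drop pairs seen together in fewer segments than this
                (an assumed, simple way to dampen spurious tags).
    """
    weights = Counter()
    for names in segments:
        # A sorted set counts each unordered pair once per segment.
        for pair in combinations(sorted(set(names)), 2):
            weights[pair] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Made-up example: one list of tagged names per paragraph.
segments = [
    ["Thandi", "Sipho"],
    ["Thandi", "Sipho", "Lerato"],
    ["Lerato"],
    ["Sipho", "Thandi", "Thandi"],
]

network = cooccurrence_network(segments, min_weight=2)
for (a, b), weight in sorted(network.items()):
    print(f"{a} -- {b}: {weight}")
```

The resulting dictionary of weighted pairs is an edge list that can be handed to any graph visualisation tool. Raising `min_weight` is one possible way to check whether occasional recogniser errors change the overall shape of the network.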

5. Why are the tools and corpora you described important for your research? Which specifics do they possess?

In general, there are not many language technologies available for the South African languages. This holds for corpora as well as tools. If you want to do research on one (or more) of the South African languages, you quickly end up using tools that are in SADiLaR’s repository. Without the available tools, much of the research in the field of Digital Humanities for the South African languages simply cannot be done or will require huge amounts of manual annotation. If, for instance, the Afrikaans, Xitsonga, and Tshivenḓa OCR systems and named entity recognisers were not available, then the identification of social networks that we have done would be impossible.

6. What are the methodological and technical challenges that you face in your particular field?

Unfortunately, for many of the South African languages the tools that are available do not always provide high-quality output. I think there are several reasons for this, but the main complicating factor is the limited availability of (annotated) corpora. These linguistic datasets are essential in the development and training of the computational linguistic tools. For some languages and tasks it is seriously difficult to find suitable datasets, limiting the training and evaluation of these tools. With the research that I described earlier pertaining to identifying social networks in South African fiction, I am trying to come up with methods that still yield useful results even if the quality of the output is not always perfect, such as by performing minimal manual corrections.

Another approach that several of my colleagues take is to use language-independent tools. For instance, they take the text of the constitution of South Africa, which is available in all eleven official languages, and align the different language versions. Based on the aligned texts, they try to identify the terminology that is used in the English version of the constitution and analyse how the terminology is translated in the other languages to see if similar translation strategies have been used. No language-specific tools are needed for this kind of analysis, although linguistic knowledge of the different languages is essential.

7. What would you recommend to students who are interested in the digital humanities?

I think the field of digital humanities in South Africa is still rather fresh. This also means that there is a wide range of opportunities. When more people become active in the field, the field itself will grow in turn, leading again to new opportunities. This is thus the right time to start with research in the field of digital humanities, and a perfect time for a research infrastructure such as SADiLaR to be fully embedded in the discipline from the very beginning!

Practically, I think what is needed for students interested in the field is to acquire the right set of skills. For students from the field of humanities, this most likely means learning some computational skills: how to handle and convert files in different formats, how to execute (computational linguistic) tools, and so on. For students who already have a computational background, this means learning more of the open problems, methodologies, terminology, and theories used in the field of humanities. SADiLaR is organising workshops (currently mostly aimed at humanities researchers) to help boost their computational skill set. During these workshops, participants learn to use language-specific and language-independent tools. Typically, South African language datasets are used during these workshops, making it easier for participants to understand the applicability. This training programme will be extended in the near future with the aim of involving all universities in the country. Unfortunately, during the lockdown no face-to-face training events have taken place. To still allow for training, SADiLaR is currently making course material available online. These training events are excellent starting points for researchers interested in digital humanities.

8. What is your vision for CLARIN and the digital humanities ten years from now?

I hope that in the next ten years, the field of digital humanities will have grown considerably in South Africa, as well as in the rest of Africa. There are still so many open questions, for instance related to African culture, African languages, and so on, that need further attention. This is not something researchers can do by themselves. This requires a solid research field with knowledgeable researchers, which is exactly what SADiLaR as the South African CLARIN consortium aims to foster. Once the field is more active, I expect that more datasets will become available, which in turn will lead to the development of more computational tools, which again allows for more research.

I think that CLARIN can have an important function in boosting digital humanities research. I hope that SADiLaR will be able to make the South African resources that are out there more findable and accessible, which is especially crucial given the fact that the Bantu languages are generally under-resourced and under-researched from a digital humanities perspective. Training will also help in getting more researchers active in interdisciplinary digital research, and I think that collaboration, together with really making use of the worldwide network provided by CLARIN at the level, will improve the sharing of information and, more importantly, experiences.