Tour de CLARIN: Interview with Mikel Iruskieta

            

            
                                      

Mikel Iruskieta is a computational linguist who is part of the Ixa Research Group and the Didactics of Language and Literature Department at the University of the Basque Country. He has collaborated with the CLARIN IMPACT CKC K-Centre, which has helped him and his colleagues digitize Basque texts. The interview was conducted via e-mail.

            

1. Could you briefly describe your academic and research background?

My current research focuses on the didactics and analysis of the Basque language, mostly regarding discourse parsing and the evaluation of discourse structure. For the last five years, I have mainly worked on adapting language technologies for teaching and learning purposes. With that goal, I have created and now co-lead a postgraduate programme in Basque (University Specialist in ICT and Digital Competences in Education, Continuing Education and Language Teaching), as well as a research group working in Digital Humanities and Education. Our aim is to build a research community that will conduct research and teach in Basque by adopting a critical approach and using language technologies in a pedagogical context. In this postgraduate programme (a summary and student projects can be accessed here), my colleagues and I are developing a new framework of socio-tech pedagogy for Basque that will cover the following topics:

  • The Basics of Technology and Pedagogy
  • Formal Education and Technology
  • Continuing Education and Technology
  • Language Teaching and Technology Development
  • Society and Education, Opportunities and Risk of Technology
  • E-learning: approaches and resources, and
  • Digital research: Methods and resources.

2. Does the fact that Basque is a language isolate have any bearing on the development of language tools tailored to it?

The history and current situation of the Basque language are both complex and interesting. Basque has a relatively small community of speakers (751,700 active and 1,185,500 passive speakers) that lives in contact with three powerful language communities: Spanish and French (as official languages in the Basque Country) and English (as a foreign language). It is also not sufficiently supported by official language policies. As a result, Basque is still considered an under-resourced language. In this context, the work of the Ixa research group for NLP is highly valuable. They have developed basic resources for Basque (as well as for other languages) which are used by the research community, for example IXApipes (a modular set of NLP tools that provides easy access to NLP technology for several languages; it can be used as is, or its modularity can be exploited to pick and change individual components) and ANALHITZA (a web service that lets users without any technical experience analyze Basque, Spanish and English texts). Many more basic and advanced tools and resources for Basque can be found on the website of HiTZ: Basque Center for Language Technology.

3. How did you get involved with the IMPACT K-Centre and how did they help you with your research?

I learned about the IMPACT K-Centre when they joined CLARIN. Because I was working on several different digitization projects for Basque and for Spanish, I immediately got in touch with them and asked for their help. Isabel Martinez Sempere, the manager of IMPACT, helped me solve a digitization issue that I encountered when I was analyzing the most frequently occurring words in Pulgarcito, which is a Cuban children’s magazine from 1919–1920. This magazine consists of very diverse materials, such as drawings and handwritten texts, which are normally very difficult to digitize. I first tried a commercial OCR tool, but the results were very poor. I then got in touch with IMPACT, telling them that I needed good-quality OCR results presented in a machine-readable format like XML. IMPACT promptly responded to my request and managed to digitize the entire magazine within a week, with significantly fewer errors than when I had used the commercial OCR tool.

In another project, which was led by the Ixa Group but also involved the Basque Ikastola Schools and the Faculty of Informatics, we had three corpora that contained texts for 4–6-year-old children. The first corpus is a Basque collection of stories that is used in education. The second is a corpus of old European fairytales, such as Rapunzel, Beauty and the Beast, Sleeping Beauty, and Snow White, which have been translated and adapted into Basque. The third corpus is a modern version of the European fairytales which have been adapted for co-educative purposes, meaning that they are suitable for mixed-gender classrooms.

However, the modern co-educative version wasn't machine-readable, so we asked IMPACT if they could give us an OCR version of this collection. IMPACT were again happy to do so, and their experts extracted all the pages from the corpus and performed OCR with ABBYY FineReader (version SDK 11) on the Basque texts.

4. Can you share any interesting results?

As soon as IMPACT digitized the fairytale corpora, my colleagues and I used the ANALHITZA tool to determine whether the texts in the corpora contained gender-inclusive language from the perspective of the characters’ roles in the narrative. To this end, we performed an analysis of several expressions, such as eder (beautiful), polit (beautiful), gaizto (evil), and indarra (power), which we extracted with Voyant Tools from the OCRed corpora.
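To make the kind of lookup behind such an analysis concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not how Voyant Tools or ANALHITZA work internally: the function name `kwic`, the five-word window and the simple prefix match on the stem are all assumptions made for illustration, and prefix matching is only a rough stand-in for proper lemmatization of Basque inflected forms.

```python
import re


def kwic(text, stem, window=5):
    """Return (left, term, right) for every token that starts with `stem`.

    Hypothetical helper: tokens are split on whitespace, and a token
    matches when its lowercased form, stripped of surrounding
    punctuation, begins with the given stem.
    """
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        # Strip leading/trailing punctuation and lowercase before comparing.
        word = re.sub(r"^\W+|\W+$", "", token).lower()
        if word.startswith(stem):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, word, right))
    return rows


# Sample sentence fragment taken from the traditional fairytale corpus.
sample = ("eta orduan, braust!, Gretelek bere indar guztiarekin "
          "bultzatu zuen sorgina labe")

for left, term, right in kwic(sample, "indar"):
    print(f"{left} | {term} | {right}")
```

Because the match is a plain prefix test, inflected forms such as indarrak, indarrik or indartsua are picked up as well, at the cost of possibly matching unrelated words that happen to share the prefix. Deciding whether each hit refers to a female or a male character still requires manual interpretation.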

In the traditional fairytale corpus, it turned out that expressions associated with concepts such as beauty and fear (e.g., eder “beautiful”) were almost exclusively used in reference to female characters, while expressions related to concepts such as power (e.g., indarra “strength”) were used to refer to male characters. Such a sharp linguistic division between the two genders in learning materials for very young children reinforces problematic gender dichotomies like the idea that male characters inherently play an “active” and adventurous role in the story, whereas female characters are “passive”, dependent characters associated with concepts such as home but not power.

Let’s look at concrete examples from the two corpora. In the traditional fairytale corpus (Table 1), the noun indar (power) and its inflectional variants refer to male characters 4 out of 5 times. By contrast, in the modern co-educative corpus (Table 2), indar is used 6 out of 10 times in reference to female characters, so the usage is almost evenly split between female and male characters, which is desirable if one wants to ensure that the language is used gender-inclusively.

| Document | Left | Term | Right | Manual interpretation |
|---|---|---|---|---|
| 7 | bat ikusi zuen eta, azkeneko | indarrak | ateraz, haraino joan zen. Hondartza | Male character |
| 5 | eta orduan, braust!, Gretelek bere | indar | guztiarekin bultzatu zuen sorgina labe | Female character |
| 5 | eta orduan, braust!, Gretelek bere | indar | guztiarekin bultzatu zuen sorgina labe | Male character |
| 7 | asko nekatu zen. Ez zuen | indarrik | igerian jarritzeko eta itoko zela | Male character |
| 7 | txiki-txiki batzuk ziren. Gulliver | indarka | hasi zen bere burua askatzeko | Male character |


Table 1: Usage of the expression indar (power) in the traditional fairytale corpus, where it is associated with male characters in 4 out of 5 cases. For instance, the first KWIC line – bat ikusi zuen eta, azkeneko indarrak ateraz, haraino joan zen – roughly translates into English as “He saw one other person, and, drawing his last strength, he went on”, which describes the action of a male character. By contrast, the second KWIC line – eta orduan, braust!, Gretelek bere indar guztiarekin bultzatu zuen sorgina labe – roughly translates as “And then, Gretel with all of her strength pushed the witch”, which this time describes the action of a female character.

| Document | Left | Term | Right | Manual interpretation |
|---|---|---|---|---|
| 2 | Edurne Zuri erabat suspertu zen, | indarrez | bete zen eta bizitza berriari | Female character |
| 4 | nagusia zen jadanik; bera, ordea, | indartsua | eta arina zen. Murruetatik gora | Male character |
| 1 | bihurri batetik gora hasi zen. | indar | bitxi batek tira egiten zion | Female character |
| 5 | atera zuen leihotik eta bere | indar | guztiekin egin zuen garrasi: -Kaixoooooo | Female character |
| 4 | zitekeen herrixkara. Aitak ez zituen | indarrak | sobera zituen Ederrak. -Ederki, halaxe | Male character |
| 4 | ere handik joan nahi. Neskak | indarrez | estutu zion eskua, eta eskatu | Female character |
| 4 | eta gerritik zintzilik zituen giltzak | indarrez | kentzen zizkion bitartean- Eta niri | Female character |
| 4 | etorri zenetik, Ederra piztia baino | indartsuago | sentitu zen. Bira egin, eta | Female character |
| 0 | hunkituta. Galtza igo zion, zangoak | indartsuak | eta ile ugariz beterik zeuden | Male character |
| 0 | jarri, eztarria garbitu eta ahots | indartsuz | esan zuen: -Gustuko zaitut, Monty | Male character |

Table 2: Usage of the expression indar (power) in the modern, co-educative corpus, where it is used 6 times with female and 4 times with male characters. For instance, the first KWIC line – Edurne Zuri erabat suspertu zen, indarrez bete zen eta bizitza berriari – roughly translates as “Snow White was completely revived, full of strength (indarrez) and new life”, where the concept of power is associated with Snow White, a female character, while the second line – nagusia zen jadanik; bera, ordea, indartsua eta arina zen. Murruetatik gora – roughly translates as “he was already the boss; but he was strong (indartsua) and agile”, where strength is associated with a male character.
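The proportions reported for Tables 1 and 2 can be re-derived with a short tally. The sketch below simply transcribes the "Manual interpretation" column of each table into Python lists. The variable names and the `shares` helper are illustrative assumptions, not part of the original analysis workflow.

```python
from collections import Counter

# Manual interpretations transcribed row by row from Table 1 and Table 2.
traditional = ["male", "female", "male", "male", "male"]
modern = ["female", "male", "female", "female", "male",
          "female", "female", "female", "male", "male"]


def shares(labels):
    """Return the share of each label as a fraction of all occurrences."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


for name, labels in [("traditional", traditional), ("modern", modern)]:
    for label, share in sorted(shares(labels).items()):
        print(f"{name}: {label} {share:.0%}")
```

With only 5 and 10 occurrences respectively, these shares illustrate the contrast between the two corpora rather than establish a statistically robust difference.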

5. What’s your vision for the future of the IMPACT K-Centre?

Generally speaking, Artificial Intelligence and the work of the IMPACT K-Centre are crucial for exploring humanity’s past. There are many documents which are still not available in a machine-readable form. Making these information sources accessible, analyzing the languages in them, linking objects, data and documents, enriching texts with metadata, and making the resulting virtual corpus accessible or referenceable through decentralised collections will enable a new way to interact with our past, understand the present and plan for the future. Researchers will be able to manage large amounts of data and finish their research in less time, allowing them to spend more of their time on the truly interesting, ground-breaking research questions rather than on non-innovative technical tasks.

As for my more concrete wishes for the future development of the IMPACT infrastructure, an important topic that would be highly valuable to tackle next is handwritten texts. For us, this would be particularly valuable because we have a large handwritten learner corpus of Basque annotated with errors (some sample material can be consulted here) by professional language testers who have passed a rigorous ALTE audit. Digitizing handwritten text, however, is notoriously difficult in comparison to printed text. Nevertheless, I think the process can be streamlined with the development of new machine learning techniques. Luckily, IMPACT already provides an enormous amount of OCRed data which could be used to train new models for digitizing handwritten texts.