Tomaž Erjavec Awarded the 2021 Steven Krauwer Award for CLARIN Achievements


The 2021 Steven Krauwer Award for CLARIN Achievements was awarded to Tomaž Erjavec (Jožef Stefan Institute / CLARIN.SI).

Motivation of the Jury


Tomaž Erjavec is an expert on language technologies with a focus on multilingual applications, methods and standards for the compilation and annotation of language resources. He completed his BSc, MSc and PhD at the Faculty of Computer and Information Science at the University of Ljubljana and completed an MSc at the Centre for Cognitive Sciences at the University of Edinburgh.

Tomaž Erjavec is an Associate Professor in the field of language technologies at the Faculty of Arts at the University of Ljubljana. He has worked at the Jožef Stefan Institute since 1984 and is now a senior researcher at its Department of Knowledge Technologies. He previously held positions at the University of Edinburgh, the University of Tokyo and the EU Joint Research Centre in Ispra, Italy. He has taught at the Jožef Stefan International Postgraduate School, the Faculty for the Humanities at the University of Nova Gorica, and the Faculty for the Humanities at the University of Graz. He has supervised several PhDs and has served as a member of several Masters and PhD committees at home and abroad.

Research Interests

His research interests lie in the field of computational and formal linguistics and language technologies, especially in the compilation and annotation of language resources. A large part of his work is devoted to the Slovene language: he has collaborated in the production of most of the Slovene reference corpora and many specialised corpora. In recent years, his work has also included the digital humanities – he is active in the area of complex digital editions and digital libraries, and in bridging the gap between humanities research and computer science. A continuing thread of his work is the emancipation of the Slovene language through the compilation of language resources and tools, and through enabling free, open and stable access to such research data and programs, as well as to the written cultural heritage of Slovene. He is the top-ranked researcher in the field of linguistics nationwide in terms of the number of citations and the h-index.

Involvement in CLARIN

In 2014, Tomaž Erjavec became the national coordinator of the Slovene research infrastructure for language resources and technologies CLARIN.SI, which became a member of CLARIN in 2015. CLARIN.SI inter alia maintains and develops a CLARIN-certified repository of language tools and resources. Tomaž Erjavec participates in the work of SIST, the Slovene Institute for Standardisation, of which he is the Slovene representative, and collaborates in the preparation of standards in the area of the encoding of language resources.

He has coordinated and participated in more than thirty national, bilateral and EU projects in the fields of human language technologies, corpus linguistics and digital humanities. Currently, he is coordinating the project 'Development of Research Infrastructure for the International Competitiveness of the Slovenian RRI Space -SI-CLARIN' (2018–2021), funded by the Ministry of Science and Education.

Recently, Tomaž Erjavec participated in the CLARIN ERIC-funded ParlaMint Project. This ambitious data engineering task included both creating a multilingual set of uniformly annotated corpora of parliamentary proceedings and processing the corpora linguistically to add Universal Dependencies syntactic structures and named entity annotation. He devised the interoperable annotation format used for the corpora, which is based on the Parla-CLARIN recommendations, created validation schemata and conversion scripts, and managed the repository and distribution of the resulting datasets. This unique data collection presents a crucial milestone for research in the digital humanities and political sciences.

ParlaMint showcases CLARIN's relevance for transnational and comparable data-driven research that is applicable in a wide spectrum of disciplines, thereby contributing the resources, technologies, infrastructure and know-how to achieve a better understanding of past and contemporary European societies, addressing societal needs in times of crises, and facilitating access and increasing transparency of democratic institutions for scholars, journalists, NGOs and active citizens.

One of the most recent contributions by Tomaž Erjavec, which will go beyond the scope of the ParlaMint Project, is the Parla-CLARIN framework for encoding corpora of parliamentary proceedings. It consists of guidelines, a formal schema, and derived XML schemas in various schema languages. Its components are intended to be used for encoding corpora of parliamentary proceedings, regardless of the language or country of origin, for the purposes of scholarly investigations, be they from the field of linguistics, political science, history or other humanities and social science disciplines. The Parla-CLARIN recommendations adopt the descriptive approach (i.e. keeping as much as possible of the original data distinctions in the target encoding) while trying to limit the encoding options available in TEI to those that could be sensibly applied to corpora of parliamentary proceedings.

Tomaž Erjavec's contributions to Parla-CLARIN and ParlaMint establish an innovative strategy for handling and processing parliamentary data. Its novelties relate to the proper and unified handling of data that is comparable across languages and parliaments, and to making this data uniformly available. The resulting ParlaMint framework is becoming a de facto standard for national parliamentary data and will be further developed to cover more detailed and specific metadata across languages and parliaments.

The ParlaMint corpora encoded in ParlaMint format and all Tomaž Erjavec’s contributions – scripts, guidelines and documentation – are publicly available in CLARIN repositories and on GitHub, or are integrated into the CLARIN website. The corpora were recently used in one of the tasks of the Helsinki Digital Humanities Hackathon DHH21, which focused on the comparison of parliamentary debates before and during COVID across Europe from a linguistic, sociological, politological and/or computational perspective.

The visibility of Tomaž Erjavec’s work clearly goes beyond ParlaMint: he maintains the certified CLARIN.SI repository, which currently contains more than 200 language resources and tools, 65 of which were (co-)authored by Tomaž Erjavec himself, amounting to approximately 200 GB of data for 80 languages. Besides his proficiency in language resources and evaluation, linguistic standards and parliamentary corpora, what makes Tomaž Erjavec a great colleague, according to the jury, is his knowledge and scientific professionalism, commitment, sense of humour and confidence, which make working with him a real pleasure.

Interview with Tomaž Erjavec

Watch the interview with Tomaž Erjavec on the occasion of the Steven Krauwer Award.

The Award Ceremony

The 7th edition of the award ceremony took place during the virtual CLARIN Annual Conference of 2021.

The Steven Krauwer Award for CLARIN Achievements was initiated in 2017. As of this year, multiple awards can be given. This year there were no nominations for the Steven Krauwer Award for Young Scientists.

The awards, named in honor of Steven Krauwer (the first Executive Director of CLARIN ERIC), are given annually to exceptional scientists or engineers in recognition of outstanding contributions towards CLARIN goals in the areas of language resource building, tools or services.