
Tour de CLARIN: Tutorial and workshop on automatic sentence selection from corpora


One of the most valuable aspects of an international research infrastructure such as CLARIN ERIC is the knowledge sharing among the national consortia. A successful example of this is the tutorial and workshop on automatic sentence selection for dictionary construction. The event, which was organized by Ildikó Pilán and Elena Volodina from Språkbanken, took place at the University of Gothenburg from 26 May to 1 June and brought together researchers from the Swedish, Estonian and Slovenian consortia.

The aim of the tutorial was to introduce corpus data processing with Python and machine learning approaches for lexicography, and to offer an opportunity for practical hands-on sessions with scikit-learn and WEKA.

At the workshop, Ildikó Pilán described the HitEx extraction system, which is being developed at Språkbanken and is tailored to the automatic identification of corpus sentences for exercises aimed at learners of Swedish as a second language. Adapted to various language-proficiency levels on the basis of the CEFR criteria, HitEx supports dynamic machine-assisted learning by giving teaching professionals, lexicographers and students options to set their own parameters, such as the difficulty level of the words they wish to learn. Iztok Kosem presented how the automatic extraction of corpus data has been successfully implemented in Slovene lexicography. He introduced the Collocations Dictionary of Slovene project, which is compiling the first corpus-based dictionary of collocations for Slovene. Kristina Koppel presented the ongoing work on compiling the Estonian Collocations Dictionary, which is scheduled for release in 2018 and will primarily be aimed at learners of Estonian at the B2-C1 levels.

The HitEx user interface for sentence selection with advanced search options.

Corpus example sentence selection results for fisk “fish” at B1 (intermediate) level.


Blog post written by Darja Fišer and Jakob Lenardič. 
