On December 6-8, 2017, Swe-Clarin / Språkbanken at the University of Gothenburg hosted a CLARIN workshop on Interoperability of Second Language Resources and Tools.
Idea of a workshop
The workshop on interoperability of second language resources and tools started as an idea from a very pragmatic question raised in the new project, “SweLL - research infrastructure for Swedish as a second language”. The question was: do we want to create a corpus of second language (L2) learner essays that we will be able to compare with some other L2 corpora? What exactly do we need to consider to ensure that?
The hunt for “those out there” with experience and ready answers started, and the first “witch” in our hunt was the ASK corpus - a corpus of L2 Norwegian. Luckily, the more experienced ASK researchers (Kari Tenfjord, Paul Meurer, Silje Ragnhildstveit) were keen to pass on their experiences and recommendations, and in June 2017 the two groups - SweLL and ASK - met in Gothenburg. There and then, Koenraad de Smedt proposed that we extend the audience to include other CLARIN countries in this discussion, to spread the idea of “interoperability” among projects that are just starting, and to hear recommendations from experienced groups.
If corpora are interoperable, it means, for example, that it is possible to compare how learners of different target languages (e.g. German, Japanese) with a certain mother tongue (e.g. Swedish) perform on different tasks (grammatical, lexical, etc.), or whether the same categories (e.g. forming the plural versus forming the present tense) are learnt in the same order despite different mother tongues. There are a lot of potential research scenarios for, among others, Second Language Acquisition and Language Testing researchers. To achieve this interoperability, we need to make sure that L2 corpora have comparable error taxonomies (deviations in orthography, tense, etc.), associated metadata variables (age, gender, task, etc.), file formats, mother tongue groups, etc.
It is, of course, rather utopian to hope that everyone will have the same error taxonomy (it is impossible, simply because languages are typologically different), the same set of personal metadata (this is steered, among other things, by national legislation and personal data protection regulations), or that different projects will use the same tools to ensure the same data format (here technology development and the evolution of standards can’t be downplayed). But it is necessary to take steps in that direction so that we are able at some point to make generalizations across different target languages. And the most important step of them all is creating a network of relevant researchers and research groups who are interested in discussing these issues.
After the invitation to the workshop was sent out to CLARIN national networks, we got 41 registrations within three days. This fact alone showed that there is a need for a meeting space where L2 corpora researchers and developers can discuss relevant questions and shape the future of the field. All in all, the workshop hosted 27 participants from 15 European countries, with a keynote speaker, Sylviane Granger, coming from Belgium. The number of participants was limited by the available funding.
The first day was devoted to presentations of ongoing L2 corpus projects, featuring target (L2) languages such as Croatian, Czech, Finnish, Latvian, Lithuanian and Swedish - most of them the first corpora of this kind for these languages. The day also included an inspiring talk about the need for infrastructure for continuous collection, processing and maintenance of learner corpora.
On day two, we devoted the first half of the day to metadata questions, legal issues, and error taxonomies, with a powerful keynote talk by Sylviane Granger on the need for standardization of metadata. During the second half of the day, we talked about tools and software used in the L2 projects, for example for transcribing handwritten texts, for error annotation, or for adding target hypotheses. An eye-opener was a talk on lessons learned in the MERLIN project, where issues with annotation freedom and “lossy” conversions between the output of different tools adopted from other projects demanded a huge investment of time. Annotation task management was suggested as a way of avoiding some of these problems.
Sylviane Granger is giving her invited talk
On day three, we devoted the morning session to the “Happy user” theme. Among other things, it showcased the importance of user-friendly search interfaces, looked at a case study with a real-life example of using an L2 corpus to answer questions posed in Second Language Acquisition research, and offered an NLP perspective on L2 corpora with examples of applications that can be built on top of them. The day was wrapped up by a fruitful discussion session organized in a “World Café” fashion by Nives Mikelic Preradovic, Maarten Janssen, Therese Lindström Tiedemann and Silje Ragnhildstveit.
World Café discussions
The details of the workshop, the talks and slides, a summary of the World Café discussions, and information on the participants and the countries represented can be found on the workshop website.
Koenraad de Smedt, the National Coordinator for the Norwegian CLARIN node (CLARINO), was present for the whole workshop and greatly helped us set the workshop discussions into a wider CLARIN perspective and drive these discussions towards usable outcomes.
Koenraad de Smedt is presenting CLARIN perspectives on resources and tools
During the workshop we started a document where participants could add their suggestions on what to do next, i.e. after the workshop. As a result, we have a whole spectrum of suggestions and initiatives that are worth undertaking in a hopefully not-too-far-away future: goo.gl/bW24Sq
It is no overstatement to say that all workshop participants gained a lot from the workshop. To start with, we have laid the groundwork for this network; we already have a volunteer to organize a follow-up workshop, as well as a volunteer to apply for EU funding to organize a COST action on the topic of learner corpora.
The workshop served as a knowledge-sharing event. Many participants expressed gratitude for getting information about tools that they can potentially use in their projects, as well as for getting input on, among other things, metadata needs and error taxonomy considerations. This will undoubtedly help us foster a new generation of L2 corpora that will be (at least slightly) easier to compare with each other across languages. Ideas for collaboration between participants were also very much in the air.
An obvious strength of the workshop was that it brought together people with different “profiles” who work with L2 corpora: technical ones (like software engineers, language engineers, NLP specialists) and non-technical ones (like linguists, Second Language Acquisition researchers, language testing researchers, corpus linguists, teachers). It was useful for the two groups to hear and discuss each other’s considerations and problems - and thus build a better understanding of how to work together.
The social part
It goes without saying that the social program helped us create a relaxed and friendly atmosphere that was very much appreciated by the participants. We had three wonderful dinners in (rainy, sigh!) Gothenburg with some Swedish specialties like fresh “catch of the day” and köttbullar (Swedish meatballs); and though the budget was not high, we managed to get a couple of glasses of wine with the food. On day three we celebrated the end of the workshop with a Christmas cake.
Elena Volodina and the Christmas cake
Gothenburg offered plenty of opportunities to explore and get into a Christmas mood. Below, you can see pictures taken by the workshop participant Inga Znotina from Latvia (courtesy of the author).
Gothenburg in December 2017
As it happened, I was the only one among the workshop organizers with an affiliation in Gothenburg, and thus had to take care of the local organization myself. I would like to say special thanks to the two volunteers who didn’t have to, but still took the time to help with the local organization of the workshop: Dan Rosén (Språkbanken, University of Gothenburg) and Julia Prentice (Department of Swedish, University of Gothenburg).
Many thanks go to the workshop co-organizers - Kari Tenfjord, Nives Mikelic Preradovic, Maarten Janssen, Therese Lindström Tiedemann and Silje Ragnhildstveit - who made the program run smoothly, organized a wonderful World Café event, and worked on plenty of other things!
And a big thank-you goes to Koenraad de Smedt who was the mastermind and the driving force behind this workshop organization.