CLARIN is a European Research Infrastructure Consortium composed of a number of expert centres across Europe and beyond which create and curate digital language resources, and support their use in research, teaching and learning. This workshop will introduce a number of ways in which CLARIN can support teachers by helping with the discovery, use and sustainability of resources, and by providing materials for teaching.
This half-day workshop is a pre-conference event for participants at the Teaching and Language Corpora (TaLC) conference, to be held at the University of Limerick (Ireland), 13-16 July 2022. Full conference details are available here. TaLC conferences have taken place every two years since 1992, and are the principal international event where issues relating to language corpora and teaching are explored.
A Short Introduction to CLARIN (Francesca Frontini)
An introduction to the CLARIN European Research Infrastructure Consortium, with a focus on its existing and planned activities and services relating to teaching and learning with digital language resources.
Tutorial 1: UPSKILLS: Integration of Research Infrastructures into Teaching (Iulianna van der Lek and Martin Wynne)
The UPSKILLS project is an Erasmus+ strategic partnership for higher education that seeks to identify and tackle the gaps and mismatches in skills for linguistics and language students through the development of a new curriculum component and supporting materials to be embedded in existing programmes of study. The role of CLARIN in UPSKILLS is to provide guidelines and learning content that help teachers and trainers integrate the language resources, tools and services distributed via the research infrastructure into their teaching. The goal of this tutorial is to show how teachers can use the CLARIN infrastructure to collect an annotated corpus from scratch using the following web services:
- The Virtual Language Observatory (VLO)
The VLO contains references to more than 700,000 resources, the majority of which are hosted at CLARIN centres, but it also contains references to relevant resource collections maintained by other organisations. Metadata from repositories in the 25 CLARIN member and observer countries are harvested on a weekly basis, covering many languages, both national and regional. Each centre may use different but interoperable metadata profiles. Through advanced metadata searches, the VLO enables fast identification of relevant resources, allowing researchers, lecturers and students to reuse resources that already exist, rather than having to produce their own from scratch.
- The Language Resource Switchboard
Due to the abundance of online resources, researchers and students may have difficulty identifying suitable data, tools or software that match their specific research requirements. The general need for guidance and recommendation is addressed by catalogues, such as the EOSC Marketplace or the SSH Open Marketplace. Successful searching in catalogues relies, to a large extent, on the ability of the users to abstract away from their specific information needs and to understand the relevance of resources with a wider applicability. It is more convenient for users to start the search for suitable tools with what they easily have at hand: representative snippets or fragments of content (text, audio, video). This functionality is offered by the Language Resource Switchboard.
- WebLicht
WebLicht (“Web-based Linguistic Chaining Tool”) is a user-friendly web service for the automatic annotation of text corpora, hosted by the CLARIN centre at the University of Tübingen. Linguistic tools (e.g. sentence splitting, tokenization, lemmatization, POS tagging, morphological analysis, named entity recognition, dependency parsing, constituency parsing) are made interoperable and encapsulated as web services, which can be combined by the user into custom processing chains. The resulting chain can then be visualised. WebLicht is tightly integrated into the CLARIN infrastructure. It uses information from the Centre Registry to harvest tool metadata from all CLARIN centre repositories. The tool metadata are harvested automatically several times each day, ensuring that all tool information is up to date. WebLicht also supports login with CLARIN Federated Identity, which allows users to log in through their academic institutions.
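For readers unfamiliar with the idea of a processing chain, the following is a minimal Python sketch of the general principle: each step adds one annotation layer and later steps build on earlier ones. It is an illustration only and does not use or reproduce the WebLicht API; the function names (`split_sentences`, `tokenize`, `run_chain`), the dictionary-based document format and the regular expressions are all assumptions made for this example.

```python
import re
from typing import Callable, Dict, List

# A document is a plain dict of annotation layers (hypothetical format).
Document = Dict[str, object]
# A "service" takes a document and returns it with one more layer added.
Service = Callable[[Document], Document]


def split_sentences(doc: Document) -> Document:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    text = str(doc["text"])
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return {**doc, "sentences": sentences}


def tokenize(doc: Document) -> Document:
    """Naive tokenizer: depends on the 'sentences' layer from an earlier step."""
    if "sentences" not in doc:
        raise ValueError("tokenize needs the 'sentences' layer")
    tokens = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return {**doc, "tokens": tokens}


def run_chain(text: str, chain: List[Service]) -> Document:
    """Apply each service in order, passing the growing document along."""
    doc: Document = {"text": text}
    for service in chain:
        doc = service(doc)
    return doc


result = run_chain(
    "CLARIN supports teachers. It also curates corpora!",
    [split_sentences, tokenize],
)
print(result["tokens"])
```

The order of the chain matters: running `tokenize` before `split_sentences` raises an error here, which mirrors the way a tool in a chain can only be added once the annotation layers it needs are available.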
Tutorial 2: ParlaMint – Parliamentary Discourse in the Classroom (Darja Fišer and Kristina Pahor de Maiti)
In this demo session we will present the award-winning tutorial Voices of the Parliament, which teaches key corpus linguistics techniques through the research problem of women's representation in parliament. We will also introduce the ParlaMint family of 17 comparable corpora of parliamentary proceedings, whose rich metadata and linguistic annotations make it possible to answer sociolinguistic research questions. The practical examples will be performed on the ParlaMint-GB corpus, which contains proceedings of the UK parliament from 2015 to March 2021. We will explore the data with the NoSketch Engine concordancer, which provides free and easy access to all of the ParlaMint corpora.
The showcase will combine quantitative and qualitative analysis to explore the role and representation of women in the parliament and the impact of the pandemic on their activity. In the demo, we will use frequency data, keywords and collocation candidates together with metadata on MPs' gender and the time of the sittings, and linguistic annotations of syntactic relations to investigate the production of MPs, the prevalent topics addressed by female and male MPs, and the characterisation of women and women-related issues. The final part of the session will be dedicated to valuable lessons learned during the production of the tutorial, and an open discussion about the potential for, and obstacles to, reusing and adapting the tutorial and the ParlaMint corpora for specific classroom settings (e.g. other languages, research topics).
Tutorial 3: CLiC (Corpus Linguistics in Context) - Between Close and Distant Reading (Michaela Mahlberg and Michele McIntosh)
The web application CLiC is a tool for reading and analysing narrative fiction. Unlike general concordancers, CLiC has been optimised to run searches across full texts as well as within particular sections of texts, i.e. across direct speech, narration and narratorial interruptions of character speech (‘suspensions’). Therefore, CLiC is particularly suited to address research questions and educational applications around properties of narrative fiction, e.g. on characterisation. The CLiC corpora mainly contain texts from the nineteenth century, including a corpus of Dickens’s novels, a corpus of children’s fiction, and a collection of African American Writers. CLiC currently offers access to over 150 books and 16 million words. The CLiC interface has been designed to be user-friendly, aimed at both research and educational applications. It has a mobile-friendly version, too, for ‘concordancing on the go’. CLiC has been successfully used in secondary school contexts for the teaching of language and literature, and it has direct applications in second language teaching through fiction. The tutorial will give users hands-on experience of basic functionalities, as well as the KWICGrouper and the annotation tool that supports the analysis of concordance lines through user-defined categories. CLiC comes with a downloadable Activity Book (Mahlberg et al., 2017) and the CLiC blog illustrates example applications through a range of guest posts written by researchers and educators. Participants of the workshop will have the opportunity to submit a guest post.
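For participants new to concordancing, the idea of a concordance line (a search word shown with its left and right context, often called KWIC, "key word in context") can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions and not how CLiC is implemented; the function name `kwic`, the `window` parameter and the sample sentence are invented for this example.

```python
import re
from typing import List, Tuple


def kwic(text: str, node: str, window: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, node word, right context) for each match of `node`.

    Matching is case-insensitive and works on simple word tokens;
    punctuation is discarded by the tokenization.
    """
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Made-up sample text, not taken from any CLiC corpus.
sample = (
    "She put her hands in her pockets. "
    "His hands were cold, and her hands were warm."
)

# Right-align the left context so the node words line up in a column.
for left, node, right in kwic(sample, "hands"):
    print(f"{left:>20}  {node}  {right}")
```

Aligning the node word in a central column is what makes repeated patterns on either side easy to spot, and grouping or annotating such lines is the kind of step that tools like the KWICGrouper support interactively.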
Organisers and Speakers
Francesca Frontini is a Research Scientist at the Institute for Computational Linguistics of the Consiglio Nazionale delle Ricerche (ILC-CNR) in Pisa and a member of the CLARIN Board of Directors, where she oversees activities related to user involvement and outreach. Her research interests lie in Language Resources and Technologies, NLP and linguistic knowledge representation. In addition, she has published extensively on issues relating to language resource documentation, preservation and standardisation.
Darja Fišer is Associate Professor at the Department of Translation Studies at the Faculty of Arts, University of Ljubljana, and Senior Research Fellow at the Institute of Contemporary History and at the Department of Knowledge Technologies at the Jožef Stefan Institute. She is also National Coordinator of DARIAH-SI, and Chair of the Steering Committee of ESSLLI. She leads the national basic research programme for Digital Humanities.
Michaela Mahlberg is Professor of Corpus Linguistics at the University of Birmingham, UK, where she is also the Director of the Centre for Corpus Research, a CLARIN-UK centre. She is the editor of the International Journal of Corpus Linguistics (John Benjamins) and together with Gavin Brookes she edits the book series Corpus and Discourse (Bloomsbury). Michaela is a Fellow of the Alan Turing Institute in London, and she hosts the podcast “Life and Language”. She has been leading the development of the CLiC web application, which was funded by the AHRC.
Kristina Pahor de Maiti is a PhD student in Linguistics at CY Cergy Paris University and the University of Ljubljana. Her research interests include spoken language and computer-mediated communication. Currently, she is focusing on corpus-based analysis of political and socially unacceptable online discourse with a special interest in figurative language. In addition, her activities include developing training materials to promote corpus linguistics research approaches in the Social Sciences and Humanities.
Iulianna van der Lek is Training and Education Officer at CLARIN, closely collaborating with the CLARIN community and university lecturers from the humanities and social sciences disciplines to accelerate the further integration of CLARIN into university curricula. In addition, she is coordinating CLARIN's activities in the UPSKILLS project, an Erasmus+ initiative that aims to upgrade the skills of linguistics and language students. In the past, Iulianna worked as a trainer in Computer-Aided Translation and Terminology Management.
Martin Wynne is a Senior Researcher in Corpus Linguistics at the University of Oxford, National Coordinator of CLARIN-UK, and a former Director of User Involvement for CLARIN ERIC. He is a former member of the Organizing Committee of the TaLC conferences and has organized a number of pre-conference workshops at corpus linguistics conferences. He also teaches a Masters-level course on Text Analysis at the University of Oxford.
University of Limerick