|11:00 - 12:00||Presentations of Accepted Abstracts|
|12:00 - 12:10||Q & A|
|12:10 - 13:00||Reusing the UPSKILLS Learning Content|
(The participants attending this workshop are invited to browse the UPSKILLS learning content on Moodle prior to the workshop. During the workshop, we will discuss possible scenarios of how the materials can be reused in teaching and training.)
This presentation shares the experience of reusing and adapting the educational materials from the CLARIN Learning Hub and the DELAD project for a workshop at AITLA 2023. The workshop was inspired by, and tuned to, the sensitive data typically associated with atypical speech that we deal with at CLARIN's Knowledge Center for Atypical Communication Expertise (ACE: https://ace.ruhosting.nl/), coordinated by Henk van den Heuvel. The Privacy by Design in Linguistic Research learning content is built around three components: an introduction to the GDPR and its impact on linguistic research, a group discussion of two use cases, and a DPIA role play. Parts 1 and 2 are based on the author’s experience as a data steward at the Faculty of Arts at Radboud University. For more details about these educational materials, see the CLARIN Impact Story: Navigating GDPR with Innovative Educational Materials.
What if learning syntax could become a gamified, highly engaging activity instead of a boring topic? Have you ever dreamed of generating tons of language exercises for your Moodle class, based on authentic texts rather than made-up sentences?
Join Antonio Balvet as he introduces a new platform that seamlessly transforms CoNLL-structured syntactic annotations into Moodle-compatible quizzes. The demonstration will centre on French, but the scripts can be applied to any CoNLL corpus available from the Universal Dependencies web repository. The code for the current version of the corpus2quiz processing chain is available at https://github.com/abalvet/ACE.
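The general idea behind such a conversion can be sketched in a few lines of Python. The snippet below is a hypothetical illustration, not the actual corpus2quiz code: it reads CoNLL-U token lines and emits a part-of-speech question in Moodle's GIFT import format, with randomly sampled distractor tags.

```python
import random

# Hypothetical sketch of the CoNLL-to-quiz idea; this is NOT the actual
# corpus2quiz code. It parses CoNLL-U lines and emits one Moodle
# GIFT-format multiple-choice question per selected token.

UPOS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP"]

def parse_conllu(lines):
    """Yield sentences as lists of (form, upos) pairs."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if cols[0].isdigit():  # skip multiword-token and empty-node lines
            sentence.append((cols[1], cols[3]))
    if sentence:
        yield sentence

def gift_pos_question(sentence, index):
    """Build a GIFT multiple-choice item asking for one token's POS tag."""
    form, gold = sentence[index]
    text = " ".join(f"*{f}*" if i == index else f
                    for i, (f, _) in enumerate(sentence))
    distractors = random.sample([t for t in UPOS_TAGS if t != gold], 3)
    wrong = "".join(" ~" + d for d in distractors)
    return f'::POS of "{form}":: In "{text}", the highlighted word is a {{={gold}{wrong}}}'
```

Run over the CoNLL-U files distributed in the Universal Dependencies repository, functions like these would yield a GIFT file that Moodle's question bank can import directly.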
This virtual course offers basic knowledge and skills in programming. There are many Python courses, but this one mostly focuses on text processing and data analysis related to linguistics, language studies, digital humanities and cognitive science. The core of the course consists of a series of Jupyter notebooks that combine examples of Python code with explanatory text. The notebooks demonstrate simple language processing and aggregating and visualising qualitative and quantitative data, including data from real language research, CLARINO, and other sources. They also suggest exercises that should be solvable based on the given examples and explanations. Ideally, the course should be presented by a teacher, and the exercises should be supervised, but the modules are also suitable for self-study.
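To give a flavour of the notebook style (this snippet is illustrative, not taken from the course itself), a typical early exercise combines simple tokenisation with aggregation of quantitative data:

```python
# Illustrative notebook-style cell (not from the actual course):
# tokenise a text and aggregate word frequencies.
import re
from collections import Counter

text = "The cat sat on the mat. The mat was flat."
tokens = re.findall(r"[a-z]+", text.lower())
freq = Counter(tokens)

for word, count in freq.most_common(3):
    print(word, count)
# "the 3" is printed first, since "the" occurs three times
```

A follow-up exercise in this vein might ask students to plot the resulting frequency distribution or to rerun the cell on a text of their own choosing.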
Experience with the course shows that students get started quickly because they do not have to install any software. The combination of text and code in Jupyter Notebooks makes the course largely self-explanatory. Students in linguistics and language studies benefit from the focus on language. Still, the course has also attracted students from information science, communication and media studies, cognitive science, digital culture, computer science, and digital security.
The exercises stimulate active learning. Although the course is self-contained and suitable for self-study, experience shows that its use in classroom teaching is preferable, especially for absolute beginners, because such teaching allows interaction through questions and answers if something is not well understood. Also, sessions offering help with the exercises were appreciated.
After several iterations, the course is now fairly stable, but further improvements are possible. Several examples could be made even more relevant to language studies, and using more language datasets from CLARIN is being considered. Solutions to the exercises are not provided, although students have requested them, especially for self-study. The addition of quizzes may be considered. The Google Colaboratory platform is very easy to use, but it has limitations, and its conditions of use may change. Alternative platforms, such as Deepnote, Kaggle or Binder, have been successfully tested but are not essentially better. Some students prefer to run the Jupyter notebooks on their own machines. Ideally, the code and runtime should be hosted on an academic cloud service, such as NIRD (Norway), but this posed too many administrative hurdles and did not provide all the required packages.
The course materials are licensed under Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). Citation: Introduction to programming for NLP with Python. Web-based course at the University of Bergen. https://mitt.uib.no/courses/38115.
We are proposing a mentoring and assistance pipeline for individual scholars or groups from the Digital Humanities (DH) community to create training data for NLP tools in their languages and specific domains (e.g., poetry, historical texts), drawing on Universal Dependencies as the current standard for linguistic annotation.
DH typically deals with out-of-domain texts. Since corpus annotation is no longer a topic for NLP research, it is now the DH researchers themselves who must create training corpora for their languages and domains, to be included in the regular updates of established NLP workflows (e.g. UDPipe) as well as in ad hoc models (e.g. the diverse spaCy models on Hugging Face). However, DH researchers need a friendly nudge to get going. In our previous teaching events, we have demonstrated to our students that this is daunting work which, on the other hand, does not require the extent of technical (programming) skills they tend to fear.
We continue elaborating on 'NLP annotation for scholars' as a pedagogical concept. The corpus data can be random data for the given domain, but the more typical use case is that of a scholar already having a corpus they are interested in, for which no adequate tagger exists. This corpus can already be loaded with other annotations, e.g. XML-TEI links to a facsimile or an audio track. We teach our students to operationalise their research questions in common linguistic terms, formulate them in terms of Universal Dependencies, and query them with a corpus query language. For all these steps, we use TEITOK - an online environment explicitly designed to integrate the NLP annotation in complex document structures and make the resulting corpus searchable across the different annotation layers.
The first step is to select a portion of text in the pre-established corpus collection and annotate it manually, either from scratch or starting from the output of a (sub-optimal) tagger. For the manual annotation, the system asks for the correct lemma, POS tag, and morphological features of each word. By default, syntactic dependencies are not annotated in this step, since scholars with no linguistic background are easily discouraged by mentions of syntax, while they usually have a good grasp of morphological categories. The manual annotation is iteratively used to train and improve the tagger, which in turn facilitates further annotation through better pre-processing. Thanks to deep learning, less training data is needed to reach adequate tagger accuracy.
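Concretely, each manually annotated token ends up as one line of the standard ten-column CoNLL-U format used by Universal Dependencies. The sketch below (an illustration, not TEITOK code) shows the three fields the annotator supplies in this step, with the syntactic columns left unfilled:

```python
# Illustration (not TEITOK code): one manually annotated token rendered
# in the ten-column CoNLL-U format, with lemma, UPOS and morphological
# features filled in and the dependency columns left as "_".
def conllu_token(idx, form, lemma, upos, feats):
    return "\t".join([str(idx), form, lemma, upos, "_",
                      feats, "_", "_", "_", "_"])

line = conllu_token(1, "chats", "chat", "NOUN", "Gender=Masc|Number=Plur")
print(line)
```

Files of such lines are exactly what tagger trainers such as UDPipe consume, which is why the manual annotation can feed directly into the iterative training loop.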
This process thus provides, from the very start, an automatic tagger and lemmatiser, which becomes increasingly accurate as more training data is added. In the set-up we provide for this, the model will be available for download, and there will also be an online interface where anyone can use the newly trained tagger. As with UDPipe, the model will include the authors' details to ensure due credit. The session will demonstrate the current setup and its future prospects.
The National Interdisciplinary Research E-Infrastructure for Bulgarian Language and Cultural Heritage Resources and Technologies, integrated within the European CLARIN and DARIAH infrastructures (ClaDa-BG), has already made available on its website a rich collection of services, resources and tools (https://clada-bg.eu/en/centers-and-services/language-technologies/services-and-tools.html). These have proved highly useful in providing university teaching with solid data collections and with opportunities to enrich it through modern, student-oriented work formats.
The present article focuses on some of the free-access associative dictionaries of LABLASS, a web-based system for presenting and studying word associations: the Dictionary of Child Word Associations (DCHWA 2022), the New Dictionary of Child Word Associations (NDCHWA 2023) and the Bulgarian Norms of Word Associations (BNWA 1984). The article examines the advantages of their data representation over traditional associative dictionaries with respect to educational and research use. The application of these resources in writing theses and curriculum course projects in disciplines such as lexicology, psycholinguistics, and child speech linguistics is particularly interesting. The interdisciplinary potential of the presented digital resources is discussed in terms of student research in linguistics and logopedics.
ERICs such as CLARIN and DARIAH are well-placed to provide a conduit between industry and education, given our wide-ranging contacts with both communities. DARIAH and CLARIN already collaborate closely within the context of the DH Course Registry, maintained by both infrastructures. The registry, a platform that collects metadata on digital humanities programmes across Europe, has served as glue between the research infrastructures and the DH programmes, leading to a new joint initiative.
In the spring of 2023, we set out to explore effective strategies and best practices for facilitating the career success of graduates of DH Master’s programmes in the private sector. The skills acquired within Digital Humanities (DH) postgraduate degrees are interdisciplinary and, therefore, transferable, something that has been recognised among larger multinational companies. Moreover, a strong humanities background and familiarity with our methods can benefit the commercial sector. Yet among small and medium enterprises (SMEs), employing a graduate from a field still in its relative infancy compared with more traditional disciplines can be considered a risk. It therefore becomes necessary to identify the gaps between the current provision of training among DH scholars at a Master’s level and the needs of companies and future employers of DH graduates.
It is necessary to foster internships that encourage and nurture experimental data spaces between cultural heritage, industry and academia, and CLARIN and DARIAH are ideal forums for cultivating these synergies. To this end, we contacted a series of DH Master’s programme leaders (directors, coordinators and/or representatives) to learn how they approach internships with private industry. At the DARIAH Annual Event 2023 in Budapest, Hungary, these efforts culminated in a joint initiative led by CLARIN and DARIAH that brought together 25 DH Master’s heads (https://doi.org/10.5281/zenodo.8071224) to examine these questions more closely during a pre-conference workshop, which resulted in a White Paper. This presentation is, therefore, a continuation of this conversation with the CLARIN community, to ensure that our impact is as wide and representative as possible.
If you have questions about this event, please get in touch with Iulianna van der Lek at email@example.com.
Location: Irish College Leuven