Workshop objectives, content and target audience
The objective of the two-and-a-half-day workshop is to foster collaboration between social sciences and humanities researchers in Central and Eastern Europe and the research communities in these fields represented in CLARIN (the Common Language Resources and Technology Infrastructure, involving 25 countries) and in the EU-funded PARTHENOS Infrastructure project (16 partners in 9 countries).
For the workshop we have selected the following topics, which by their very nature lend themselves to collaborative, cross-border and cross-disciplinary research, as well as to educational purposes:
- Working with Parliamentary Records
- Challenges in Literary History
- Oral History: working with interview data
The target audience consists of researchers and lecturers in the social sciences and humanities in a broad sense (such as literary studies, history, political science, communication science, media studies, etc.) who use language data in their research and/or teaching and who come from countries in Central and Eastern Europe that are not participating in the PARTHENOS project, with a special focus on Albania, Belarus, Bosnia, Bulgaria, Croatia, Hungary, Moldova, Montenegro, North Macedonia, Romania, Serbia, Slovakia, Slovenia and Ukraine.
The number of places is limited to 25, with a maximum of 2 participants from each of the countries listed above. To participate, please apply using the application form linked below. If there are more than 25 applicants, a selection will be made on the basis of (a) the participants' profile, (b) their motivation, and (c) the overall balance, including geographical balance.
Eligibility conditions: candidates should be researchers, teachers, or students at postgraduate level or above in the humanities or social sciences, or practitioners in these fields (e.g. in cultural heritage institutions or libraries). They should work or study in one of the countries listed above.
Accommodation and meals will be provided by the PARTHENOS project, and support for travel is available up to a maximum of 300 euros.
Application deadline and notification
Application form: https://forms.gle/pMppE6W5Zb8KgYWW6
Applications can be submitted until Monday, July 15.
Notification of acceptance: Thursday, August 1.
For more information please contact the organisers: Darja Fišer and Steven Krauwer at firstname.lastname@example.org.
|Time||Monday 7 October|
|09.00 - 09.30||Welcome and Registration|
|09.30 - 10.00||Opening, Introduction to CLARIN and PARTHENOS (Steven Krauwer)|
|10.00 - 10.30||Introduction to CLARIN Resource Families and PARTHENOS Training Module (Darja Fišer)|
|10.30 - 11.00||Coffee break|
|11.00 - 11.30||Presentation of CLADA (Petya Osenova and Kiril Simov)|
|11.30 - 12.00||Presentation of workshop participants|
|12.00 - 13.00||Lunch|
|13.00 - 14.30||Compiling parliamentary corpora, Lecture (Tomaž Erjavec and Andrej Pančur)|
|14.30 - 15.00||Coffee break|
|15.00 - 17.00||Compiling parliamentary corpora, Hands-on (Tomaž Erjavec and Andrej Pančur)|
|Time||Tuesday 8 October|
|09.00 - 10.30||Corpus Design Principles and Challenges in the COST Action 'Distant Reading for European Literary History', Lecture (Roxana Patras and Carolin Odebrecht)|
|11.00 - 13.00||Corpus Design Principles and Challenges in the COST Action 'Distant Reading for European Literary History', Hands-on (Roxana Patras and Carolin Odebrecht)|
|14.00 - 15.30||A multidisciplinary approach to the use of technology in research: the case of interview data, Lecture (Louise Corti and Christoph Draxler)|
|15.30 - 16.00||Coffee break|
|16.00 - 18.00||A multidisciplinary approach to the use of technology in research: the case of interview data, Hands-on (Louise Corti and Christoph Draxler)|
|Time||Wednesday 9 October|
|09.00 - 10.30||Breakout session: Consultation with lecturers (creating resources for another language, adapting a tool to another language, methodological issues; in small groups)|
|10.30 - 11.00||Coffee break|
|11.00 - 12.00||Discussion and Closing (feedback, next steps)|
|12.00 - 13.00||Lunch|
Compiling parliamentary corpora
Tomaž Erjavec, Jožef Stefan Institute, Ljubljana, Slovenia
Andrej Pančur, Institute of Contemporary History, Ljubljana, Slovenia
The session introduces corpora of parliamentary proceedings, with a focus on building and especially encoding such corpora. We give the motivation for research on parliamentary proceedings, mention the formats in which they are typically available, sketch the tool-chain needed for their download, clean-up, and structural and linguistic annotation, and discuss existing and emerging encoding schemes for their mark-up. Here we concentrate on the Text Encoding Initiative (TEI) Guidelines: we first introduce the Guidelines and then demonstrate the mark-up of parliamentary corpora on several existing cases, discussing issues such as annotating sessions, speeches and interruptions, encoding metadata on speakers and sittings, using typologies, and including linguistic mark-up.
The session also includes a hands-on part. Before the workshop we will enquire about the expectations and technical skills of the participants, but the default scenario is that participants bring a short excerpt from a parliamentary debate as a Word file, which they first annotate roughly, then convert automatically to TEI, and finally annotate in TEI using the Oxygen XML editor (which can be used free of charge for one month). In this way participants get some hands-on experience with the TEI structure of parliamentary corpora. We will also demonstrate the use of such corpora on some pre-existing ones with noSketch Engine.
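As a rough illustration of the kind of structure such an encoding captures, the following Python sketch builds a small TEI-style fragment in which each speech is a `<u>` (utterance) element pointing to a speaker, spoken text is held in `<seg>` elements, and an interruption is recorded as a `<note>`. It is only a minimal sketch under stated assumptions: the speaker identifiers, the sample text, the `type` values and the helper names `tei` and `build_debate` are hypothetical, and the layout is not necessarily the scheme that will be presented at the workshop.

```python
import xml.etree.ElementTree as ET

# TEI namespace; registered as the default so the output has no prefixes.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name of a TEI element (hypothetical helper)."""
    return f"{{{TEI_NS}}}{tag}"


def build_debate(turns):
    """Build a <div> holding one <u> per speech.

    `turns` is a list of (speaker_id, items) pairs, where each item is
    ("seg", text) for spoken text or ("note", text) for an interruption.
    The attribute values used here are illustrative assumptions.
    """
    div = ET.Element(tei("div"), {"type": "debateSection"})
    for speaker_id, items in turns:
        # @who points to a speaker description kept elsewhere in the corpus.
        u = ET.SubElement(div, tei("u"), {"who": f"#{speaker_id}"})
        for kind, text in items:
            if kind == "seg":
                child = ET.SubElement(u, tei("seg"))
            else:
                child = ET.SubElement(u, tei("note"), {"type": "interruption"})
            child.text = text
    return div


# Invented sample data, for illustration only.
example = build_debate([
    ("spk1", [("seg", "I open the sitting."),
              ("note", "Applause"),
              ("seg", "We continue with item one.")]),
    ("spk2", [("seg", "Thank you, Mr President.")]),
])
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

In a real corpus the speaker identifiers would resolve to person entries in the TEI header, and the sitting metadata would be recorded there as well.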
Corpus Design Principles and Challenges in the COST Action 'Distant Reading for European Literary History'
Roxana Patras, Alexandru Ioan Cuza University, Iaşi, Romania
Carolin Odebrecht, Humboldt University, Berlin, Germany
In its first part, the lecture outlines the challenges of corpus building for the Romanian ELTeC collection. Some of them originate in the indisputable linguistic and cultural specificity of Romanian texts, while others derive from post-communist policies concerning the digitization and open-access treatment of cultural heritage: (a) scarcity of digitized resources from the period 1850-1920, which makes the extraction and checking of metadata difficult; (b) analysis and automation tools that are not yet adjusted to the diachronic particularities of Romanian; (c) eligibility and composition principles, most of them derived from Western literary traditions (i.e. book length, number of editions, 30% canonical works, 10-30% female authors, sampling according to various time-slots, etc.), that are rather inapplicable to Eastern European literary phenomena and institutions; (d) in the case of Romanian, the slow process of language standardization, which raises difficult issues concerning clean-up and normalization. For instance, novels published between 1850 and 1865 are printed in a Cyrillic-Latin alphabet that cannot be read by regular OCR engines, while novels published after 1865, albeit in the Latin alphabet, still contain some special glyphs that result in bad OCR output.
In the second part of the lecture, I introduce a few practical solutions that I have tested and that have proven effective in addressing these issues: customized digitization (focused on novel subgenres); customized datasets and DOI assignment on Zenodo; support repositories for different text formats on GitHub and Zenodo; and HTR models for specific prints (such as those using the Romanian Transition Alphabet).
The tutorial introduces sampling and balancing criteria as well as encoding principles for the multilingual European Literary Text Collection (ELTeC). We will look at the ELTeC-TEI encoding principles and we will use text examples to work with the encoding schemas for metadata and markup. The tutorial will also present the Action's goals and our working environments.
A multidisciplinary approach to the use of technology in research: the case of interview data
Louise Corti, University of Essex, UK
Arjan van Hessen, University of Twente, Netherlands
In the first part of this session, the lecture introduces different scholarly approaches to working with interview data as a primary or secondary data source. We set out some of the distinct traditions and differences in analytic practices and use of tools across the disciplines. The wide CLARIN family of digital methods and tools used by linguists and speech technologists, such as automated speech recognition, annotation, text analysis and emotion recognition tools, is open to wider exploitation, for example by digital humanities scholars, historians and social scientists. We show how these tools can be used to support different phases of the research process, from data preparation to analysis and presentation. Connecting tools to help meet the needs of a researcher’s analytic journey can also be beneficial. In this respect, we describe the work of the CLARIN Oral History ‘Transcription Chain’ (TChain), a tool that supports transcription, alignment and editing of audio and text in multiple languages.
The second part of the session offers a hands-on workshop, giving participants the opportunity to work with the TChain: using a dedicated portal, they convert audio-visual material into a suitable format, run automatic speech recognition (ASR), correct the ASR results, and download them.
The venue is the Rectorate of Sofia University “St. Kl. Ohridski”, situated in the center of the city.