Electronic Literature: Documenting and Archiving Multimodal Computational Writing.
Scott Rettberg (University of Bergen, Norway)
The field of Electronic Literature comprises new forms of literary creation that merge writing, computation, interactivity, and design in the creation of writing that is specific to the context of the computer and the global network. While electronic literature is a field of experimental writing with a history that stretches back to the 1950s, it has grown most expansively in the last two decades. Forms of electronic literature such as combinatory poetics, hypertext fiction, kinetic and interactive poetry, and network writing bridge the 20th century avant-garde and practices specific to the 21st century networked society. Yet electronic literature has faced significant hurdles as it has developed as a field of study, related to the comparative instability of complex computational objects, which because of their formal diversity are often not easily accommodated by standardized methods of digital archiving, and are subject to cycles of technological obsolescence. Rettberg's presentation will address efforts to disseminate, document, and archive the field of electronic literature. After providing some examples of genres of electronic literature, Rettberg will discuss projects such as the Electronic Literature Collections, the ELMCIP Electronic Literature Knowledge Base, and the Electronic Literature Archive that seek to preserve a corpus of work and criticism for the future.
Corpus-Driven Investigation of Language Use, Variation and Change - Resources, Models, Tools.
Elke Teich (University of the Saarland, Saarbrücken, Germany)
When we set out to study language use we will immediately observe two core properties of language: it varies according to context (register, dialect), and it changes over time. While linguistic variation and change may be considered an annoyance from the perspective of computational processing, to a linguistic scholar, variation and change are fascinating research topics that involve a number of challenging questions: What are the mechanisms of variation and change? What are the linguistic features involved in variation and change? Why does change happen? How does change proceed and what are its effects?
Taking the perspective of a “humanist-as-scientist”, in the talk I will reflect on the requirements of empirically investigating language use, variation and change with special regard to computational resources, models and tools. As an example, I will focus on the diachronic development of scientific English, showing how data-driven methods using state-of-the-art computational language models (e.g. n-gram models, word embeddings) combined with information-theoretic measures (e.g. entropy, surprisal) can be effectively integrated with linguistic micro-analysis for comparison of corpora/models along relevant dimensions of variation, such as time and register.
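To make the information-theoretic measures mentioned above concrete, here is a minimal sketch of how surprisal and entropy can be computed from token counts. It is an illustration only, not the speaker's method: the function names `train_bigram`, `surprisal` and `entropy` are hypothetical, the bigram model uses simple add-alpha smoothing, and no probability mass is reserved for unseen words.

```python
import math
from collections import Counter


def train_bigram(tokens, alpha=1.0):
    """Estimate an add-alpha smoothed bigram model from a token list.

    Returns a function prob(prev, cur) giving P(cur | prev).
    Simplifying assumption: the vocabulary is the set of training tokens.
    """
    vocab_size = len(set(tokens))
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(prev, cur):
        numerator = bigram_counts[(prev, cur)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        return numerator / denominator

    return prob


def surprisal(prob, prev, cur):
    """Surprisal in bits of `cur` given the preceding token `prev`."""
    return -math.log2(prob(prev, cur))


def entropy(tokens):
    """Shannon entropy in bits of the unigram distribution of `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    corpus = "a b a b a b".split()
    model = train_bigram(corpus)
    print(surprisal(model, "a", "b"))  # frequent continuation: low surprisal
    print(surprisal(model, "a", "a"))  # unattested continuation: higher surprisal
    print(entropy(corpus))
```

Comparing corpora along time or register would then amount to training such a model on one sub-corpus (for example, one time slice) and measuring surprisal on another, under the assumptions stated above.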
Towards a Universe of Local Time Machines - building an open eco-system for applied heritage fuelled by common language resources and existing infrastructure
Toine Pieters (Utrecht University, The Netherlands)
CLARIN ambassador Toine Pieters will connect past experiences from his CLARIN-related project portfolio with current challenges of building an open ecosystem for digital heritage organized within the context of Time Machine Europe. Issues of sustainability, scalability and interoperability will be raised and put up for discussion. He will argue that honoring shared values and principles is key to a successful alignment between existing data repositories and technological infrastructures.
CLARIN Blog writing Masterclass
Mićo Tatalović (chairman of the Association of British Science Writers)
The seminar is designed to introduce blogs as an increasingly important communication channel in academia and research infrastructures. Participants will learn more about the characteristics of the emerging genre, guidelines and best practices in blog writing for the academic community. The seminar will include hands-on activities on pitching original ideas and writing attractive headlines and blog post openings. The seminar, which will take place before the CLARIN Annual Conference on 30 September 14:00-16:00 in Leipzig, will be given by Mićo Tatalović, Chair of the Association of British Science Writers and an expert in science communication.
This paper presents experimental work on Named Entity Recognition and Annotation for ancient Greek using INCEpTION, a web-based annotation platform built on the CLARIN tool WebAnno. Data described in the paper is extracted from the Deipnosophists of Athenaeus of Naucratis.
Enriching Lexicographical Data for Lesser Resourced Languages: A Use Case.
Dirk Goldhahn, Thomas Eckart, Sonja Bosch
This paper presents a use case for enriching lexicographical data for lesser-resourced languages employing the CLARIN infrastructure. The presented work is based on newly prepared lexicographical data sets for under-resourced Bantu languages spoken in southern regions of the African continent. These data sets have been made digitally available using well-established standards of the Linguistic Linked Open Data (LLOD) community. To overcome the insufficient amount of freely available reference material, a crowdsourcing Web portal for collecting textual data for lesser-resourced languages has been created and incorporated into the CLARIN infrastructure. Using this portal, the number of available text resources for the respective languages was significantly increased in a community effort. The collected content is used to enrich lexicographical data with real-world samples to increase the usability of the entire resource.
This paper analyses data to address a specific linguistic problem, i.e. the acquisition of the modification potential of the three more or less synonymous Dutch degree modifiers heel, erg and zeer, all meaning ‘very’, which show syntactic differences in modification potential. It continues the research reported on in (Odijk, 2016). The analysis makes crucial use of linguistic applications developed in the CLARIN infrastructure, in particular the treebank search applications PaQu (Parse and Query) and GrETEL Version 4.00. The analysis benefits from the use of parsed corpora (treebanks) in combination with the search and analysis options offered by PaQu and GrETEL. Earlier work showed that despite little data for zeer modifying adpositional phrases, adult speakers end up with a generalised modification potential for this word. In this paper, we extend the dataset considered, and find more (but still little) data for this phenomenon. However, we also find a similar amount of data that form counterexamples to the non-generalisation of the modification potential of heel. We argue that the examples with heel concern constructions with idiosyncratic semantics and therefore are not counted as evidence for the general rule of modification. We suggest a simple statistical analysis to account for the fact that children ‘learn’ that heel cannot modify verbs or adpositions though there is no explicit evidence for this and they are not explicitly taught so.
Training Workshops in the Bi-directional Model of the Language Technology Infrastructure Development.
Maciej Piasecki and Jan Wieczorek.
In this paper we describe the evolution of training workshops offered by CLARIN-PL. We focus on the types of workshops, the competences of participants and the role which the workshops are meant to fulfil in a bi-directional model of the language technology infrastructure development assumed for CLARIN-PL. The paper also discusses our experience gathered over four years and examples of the influence of the workshops on users and their cooperation with CLARIN.
OpeNER and PANACEA: Web Services for the CLARIN Research Infrastructure.
Riccardo Del Gratta and Davide Albanesi.
This paper describes the necessary steps for the integration of OpeNER and PANACEA Web Services within the CLARIN research infrastructure. The original Web Services are wrapped into a framework and re-implemented as REST APIs to be further exploited through both the Language Resource Switchboard and WebLicht and made available for the CLARIN community.
CLARIAH chaining search: A platform for combined exploitation of multiple linguistic resources.
Peter Dekker, Mathieu Fannee and Jesse De Does.
In this paper, we introduce CLARIAH chaining search, a Python library and Jupyter web interface to easily combine exploration of linguistic resources published in the CLARIN/CLARIAH infrastructure, such as corpora, lexica and treebanks. We describe the architecture of our framework and give a number of code examples. Finally, we present a case study to show how the platform can be used in linguistic research.
Manually PoS tagged corpora in the CLARIN infrastructure.
Tomaž Erjavec, Jakob Lenardič and Darja Fišer.
This paper provides a comparison of corpora that are manually annotated for word-level morphosyntactic information, i.e. part-of-speech (PoS) tags, and are available for download within the CLARIN infrastructure. Since such corpora provide gold-standard data, they are an important resource for training new PoS taggers as well as testing the accuracy of the existing ones. It is therefore valuable to have a better understanding of the languages that are supported in this way through CLARIN, under what licences such corpora are available for download and to compare their encodings and PoS tagsets used in order to see to what extent they are interoperable. The rest of the paper is structured as follows: Section 2 gives an overview of the manually PoS tagged corpora available through the CLARIN infrastructure and compares their encodings and PoS tagsets; Section 3 compares the corpora against the most comprehensive multilingual dataset of PoS annotated corpora, namely the Universal Dependencies treebanks; and Section 4 concludes the paper.
Use Case for Open Linguistic Research Data in the CLARIN Infrastructure. The Open Access Database for Adjective-Adverb Interfaces in Romance.
Gerlinde Schneider, Christopher Pollin, Katharina Gerhalter and Martin Hummel.
The AAIF project is establishing appropriate ways to make linguistic research data on adjective-adverbs in Romance languages openly accessible and reusable. Special focus is set on adhering to the FAIR data principles. Using a project-specific annotation model, it annotates corpora of linguistic phenomena related to adjectives with adverbial functions in Romance languages. This paper documents the approaches we use to accomplish these goals. An important part of this is the use and provision of data via the formats and interfaces defined by the CLARIN infrastructure.
CLARIN Web Services for TEI-annotated Transcripts of Spoken Language.
Bernhard Fisseni and Thomas Schmidt.
We present web services implementing a workflow for transcripts of spoken language following TEI guidelines, in particular ISO 24624:2016 “Language resource management – Transcription of spoken language”. The web services are available at our website and will be available via the CLARIN infrastructure, including the Virtual Language Observatory and WebLicht.
This article presents some applications of the open-source software tool DiaCollo for historical research. Developed in a cooperation between computational linguists and historians within the framework of CLARIN-D’s discipline-specific working groups, DiaCollo can be used to explore and visualize diachronic collocation phenomena in large text corpora. In this paper, we briefly discuss the constitution and aims of the CLARIN-D discipline-specific working groups, and then introduce and demonstrate DiaCollo in more detail from a user perspective, providing concrete examples from the newspaper “Die Grenzboten” (“messengers from the borders”) and other historical text corpora. Our goal is to demonstrate the utility of the software tool for historical research, and to raise awareness regarding the need for well-curated data and solutions for specific scientific interests.
Corpus-Preparation with WebLicht for Machine-made Annotations of Examples in Philosophical Texts.
This paper is an outline of an architecture used for harvesting examples from a corpus of philosophical writings. CLARIN-DE’s WebLicht is used for preprocessing the corpus. The operation mode of a two-stage process on top of it is described. The produced data on example usage in philosophical works is valuable for recent research in literary studies and philosophy.
Lifespan change and style shift in the Icelandic Gigaword Corpus.
Lilja Björk Stefánsdóttir and Anton Karl Ingason.
We demonstrate research on syntactic lifespan change and style shift in Icelandic that is made possible by recent advances in Language Technology infrastructure for Icelandic. Our project extracts data from the Icelandic Gigaword corpus and allows us to shed light on how social meaning shapes the linguistic performance of speakers using big data methods that would not have been feasible for us to use without a corpus of this type.
Studying disability related terms with Swe-Clarin resources.
Lars Ahrenberg, Henrik Danielsson, Staffan Bengtsson, Hampus Arvå Linhem, Lotta Holme and Arne Jönsson.
In Swedish, as in other languages, the words used to refer to disabilities and people with disabilities are manifold. Recommendations as to which terms to use have been changed several times over the last hundred years. In this exploratory paper we have used textual resources provided by Swe-Clarin to study such changes quantitatively. We demonstrate that old and new recommendations co-exist for long periods of time, and that usage sometimes converges.
To Ask or not to Ask: Informed Consent to Participate and Using Data in the Public Interest.
Krister Lindén, Aleksei Kelli and Alexandros Nousias.
The development and use of language resources often involve the processing of personal data. Processing has to have a legal basis. The General Data Protection Regulation (GDPR) provides several legal grounds. In the context of scientific research, consent and public interest are relevant. The main question is when researchers should rely on consent and when on public interest to conduct research. Both grounds have their advantages and challenges. For comparing them, the Clinical Trial Regulation is used as an example.
Data collection for learner corpus of Latvian: copyright and personal data protection.
Inga Kaija and Ilze Auziņa.
Copyright and personal data protection are two of the most important legal aspects of collecting data for a learner corpus. The paper explains the challenges in data collection for the learner corpus of Latvian “LaVA” and describes the procedure undertaken to ensure protection of the texts’ authors’ rights. An agreement / metadata questionnaire form was created to inform the authors of the ways their texts are used and to receive the authors’ permission to use them in the stated way. The information, permission, and the metadata questionnaire are printed on one side of an A4 size paper sheet, and the author is supposed to write the text on the other side by hand, thus eliminating the need to identify the author of the text separately. After scanning and adding to the corpus, the text originals are returned to authors.
Liability of CLARIN Centres as Service Providers: What Changes with the New Directive on Copyright in the Digital Single Market?
Pawel Kamocki, Andreas Witt, Erik Ketzan and Julia Wildgans.
Providing online repositories for language resources is one of the main activities of CLARIN centres. The legal framework regarding liability of Service Providers for content uploaded by the service users has very recently been modified by the new Directive on Copyright in the Digital Single Market. A new category of Service Providers — Online Content-Sharing Service Providers (OCSSPs) — is subject to a complex and strict framework, including the requirement to obtain licenses from rightholders. The proposed paper discusses these recent developments and aims to initiate a debate on how CLARIN repositories should be organised to fit within this new framework.
The extent of legal control over language data: the case of language technologies.
Aleksei Kelli, Arvi Tavast, Krister Lindén, Kadri Vider, Ramūnas Birštonas, Penny Labropoulou, Irene Kull, Gaabriel Tavits and Age Värv.
The article aims to increase legal clarity concerning the impact of data containing copyrighted content and personal data on the development of language technologies. The question is whether legal rights covering data affect language models.
In this article we describe a user support solution for digital humanities. As a case study we show the development of the CLARIN-D helpdesk from 2013 into the current support solution, which has been extended for DARIAH-ERIC as well as a number of other non-CLARIN-D software and projects, and describe a way forward for a common support platform for CLARIAH-DE that we are currently building towards as well.
CLARIN AAI and DARIAH AAI Interoperability.
Peter Gietz and Martin Haase.
Both CLARIN and DARIAH have developed an Authentication and Authorization Infrastructure (AAI) which allows language and humanities researchers, respectively, to access on-line resources using their institutional accounts. Both AAIs are based on the SAML2 OASIS standard, and by virtue of this fact, lend themselves to interoperability. While CLARIN has established a "Service Provider federation" leveraging the German DFN-AAI federation and the international eduGAIN meta-federation, the DARIAH AAI has been built solely on top of eduGAIN via membership in the DFN-AAI, recently enhanced by the introduction of an IdP-SP Proxy for DARIAH services according to the AARC Blueprint Architecture. Both AAIs were successfully interconnected in 2018, already allowing many CLARIN and DARIAH users to access services from both communities today. However, there are some possible optimizations, which are also detailed in this paper.
Word at a Glance is a highly customizable web application serving as a word profile generator aggregating a diverse set of possible data resources and analytic tools. It focuses on providing means for expert-based interpretation and presentation of the data and, at the same time, it makes the results easily accessible to the general public.
In recent years, the reproducibility of scientific research has more and more come into focus, both from external stakeholders (e.g. funders) and from within research communities themselves. Corpus linguistics and its methods, which are an integral component of many other disciplines working with language data, play a special role here – language corpora are often living objects: they are constantly being improved and revised, and at the same time, the tools for the automatic processing of human language are also regularly updated, both of which can lead to different results for the same processing steps. This article argues that modern software technologies such as version control and containerization can address both issues, namely make reproducible the process of software packaging, installation, and execution and, more importantly, the tracking of corpora throughout their life cycle, thereby making the changes to the raw data reproducible for many subsequent analyses.
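As a rough illustration of why tracking corpus states matters, the following sketch computes a content fingerprint for a corpus directory, so that an analysis can record exactly which state of the data it was run on. This is not the approach of the article, only a simplified stand-in for what a version control system records with each commit; the function name `corpus_fingerprint` and the directory layout are assumptions.

```python
import hashlib
from pathlib import Path


def corpus_fingerprint(root):
    """Return a SHA-256 hex digest over all files below `root`.

    Files are visited in sorted order of their relative paths, so the
    result depends only on file names and contents, not on traversal order.
    """
    root = Path(root)
    files = sorted(
        (path.relative_to(root).as_posix(), path)
        for path in root.rglob("*")
        if path.is_file()
    )
    digest = hashlib.sha256()
    for relative_name, path in files:
        # Hash the relative name and the content, separated by NUL bytes,
        # so that renaming a file also changes the fingerprint.
        digest.update(relative_name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Storing such a fingerprint next to the results of an analysis makes it possible to detect later that the corpus has been revised in the meantime; a real version control system additionally keeps the history needed to restore the earlier state.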
Metadata of a resource is information about a resource but not part of the resource itself. However, providing metadata is a crucial aspect of resource sustainability. In this contribution we show examples of how to collect and provide additional process metadata, i.e. data on the creation process of a resource and decisions made in this process, to further increase the value of a resource.
The Best of Three Worlds: Mutual Enhancement of Corpora of Dramatic Texts (GerDraCor, German Text Archive, TextGrid Repository).
Frank Fischer, Susanne Haaf and Marius Hug.
In most cases when tackling genre-related research questions, several corpora are available that comprise texts of the genre in question (like corpora of novels, plays, poems). This paper describes how to combine the strengths of different corpora to increase corpus sizes, correct mistakes and mutually enhance the depth and quality of the markup. The use case demonstrated regards three TEI-encoded corpora of German-language drama: the dedicated German Drama Corpus (GerDraCor) and the two implicit subcorpora of dramatic texts contained in the CLARIN-D-maintained German Text Archive (DTA) and the DARIAH-DE-run TextGrid Repository.
Mapping METS and Dublin Core to CMDI: Making Textbooks available in the CLARIN VLO.
Francesca Fallucchi and Ernesto William De Luca.
At a time when the amount of digital resources and the complexity of the relations between them are expanding rapidly and unpredictably, it is necessary to manage and to find electronic resources. Descriptive metadata characterise a resource with keyword-value pairs. The use of such descriptions allows researchers clearer and easier access to available resources. In this way, users can manage and find research data beyond traditional publications. The Georg Eckert Institute (GEI) creates and curates various digital resources that are offered to the community of international textbook research and other scientific fields. This paper discusses how to provide a CMDI (Component MetaData Infrastructure) profile for our textbooks, in order to integrate them into the CLARIN infrastructure and thus open them up and make them more FAIR. After adapting a CMDI profile for research project data, we now look into the creation of a new profile for the digitized historical textbooks of "GEI-Digital". We describe our workflows and the adversities and problems faced when trying to convert METS metadata into CMDI.
The RI-cluster project PARTHENOS is coming to an end after four years of intensive work. One of the main goals was the integration of metadata from the diverse domains represented by the partners. To this end, a common semantic model has been devised, aimed to capture the main entities of the knowledge generation process as they are present in resource metadata. In this paper we elaborate on the results of the aggregation process with the (traditional) focus on metadata quality.
Since 2018, Swe-Clarin members have been working cross-institutionally on special themes. In this paper we report ongoing work in a project aimed at the creation of a new gold standard for Swedish Named-Entity Recognition and Categorisation. In contrast to previous efforts, the new resource will contain data from both social media and edited text. The resource will be made freely available through Språkbanken Text.
Documents digitised in mass-digitisation projects end up as high quality images and the text in them represented in one of the standard optical character recognition (OCR) formats. The Text Encoding Initiative (TEI) provides a much better way to encode the digitised content as it offers means to capture the metadata of the document and make detailed annotations. Since the OCR output only contains minimal markup that treats every isolated block of text as a paragraph, we developed models to automatically infer the structural markup and produce a richer TEI document. In particular we developed models to identify titles, subtitles, footnotes and page headers, and label OCR artefacts and surplus contents. In this paper we describe the capabilities of these models, our text encoding choices and the open challenges.
Enhancing Lexicography by Means of the Linked Data Paradigm: LexO for CLARIN.
Andrea Bellandi, Fahad Khan and Monica Monachini.
This paper presents a collaborative web editor for easily building and managing lexical and terminological resources based on the OntoLex-Lemon model. The tool allows information to be easily manually curated by humans. Our primary objective is to enable lexicographers, scholars and humanists, especially those who do not have technical skills and expertise in the Semantic Web and Linked Data technologies, to create lexical resources ex novo even if they are not familiar with the underlying technical details. This is fundamental for collecting reliable, fine-grained, and explicit information, thus allowing the adoption of new technological advances in the Semantic Web by the Digital Humanities.
Aggregating Resources in CLARIN: FAIR Corpora of Historical Newspapers in the German Text Archive.
Matthias Boenig and Susanne Haaf.
Newspapers, though an important text type for the study of language, were not primarily part of the efforts to build a corpus for the New High German language carried out by the Deutsches Textarchiv (German Text Archive, DTA) project. After the finalization of the DTA core corpus, we started our efforts to gather a newspaper corpus for the DTA based on digital data from various sources. From the beginning, this work was done in the CLARIN-D context. Thanks to the willingness of external partners to pass on project results and their cooperation it was possible to gather a corpus of historical newspapers, adapt it to a homogeneous set of guidelines and offer it to the community for free reuse. The poster is intended to provide insights into the newspaper and journal corpus of the DTA and to point out research possibilities which result from the aggregation of the digitized texts from various sources in the DTA. (Poster)
CLARIN and Digital Humanities. A successful integration.
Elisabeth Burr, Marie Annisius and Ulrike Fußbahn.
The collaboration between the European Summer University in Digital Humanities “Culture & Technology” (ESU) and CLARIN-D is a concrete example of the successful integration of a digital language resources and technology research infrastructure for the humanities and social sciences and Digital Humanities. While a thorough analysis of this collaboration, its outcome and its impact needs to be postponed to a later date, we would like to offer at least some insight into this collaboration. In the first part of our presentation, we will outline briefly the foundation and specific nature of the ESU. The second part will explain how the collaboration between the ESU and CLARIN-D came about and what it consists of. The third part is dedicated to results and tries to draw a few conclusions before it expresses some hopes for the future. As the proposal was not discussed with colleagues from CLARIN-D, the view on this collaboration is a personal and partial one.
AcTo: how to build a network for historical Occitan.
Gilda Caiti Russo, Jean-Baptiste Camps, Gilles Couffignal, Francesca Frontini, Hervé Lieutard, Elisabeth Reichle and Maria Selig.
We present the AcTo project, a network of language resources and tools for Medieval Occitan. The proposed poster presentation aims at illustrating the resources in the network, as well as the first steps towards their integration, aiming towards the harmonisation and interoperability of NLP and lexical resources for the annotation of digital editions.
A parsing pipeline for Icelandic based on the IcePaHC corpus.
Tinna Frímann Jökulsdóttir, Anton Karl Ingason and Einar Freyr Sigurðsson.
We describe a novel machine parsing pipeline that makes it straightforward to use the Berkeley parser to apply the annotation scheme of the IcePaHC corpus to any Icelandic plain text data. We crucially provide all the necessary scripts to convert the text into an appropriate input format for the Berkeley parser and clean up the output. The goal of this paper is thus not to dive into the theory of machine parsing but rather to provide convenient infrastructure that facilitates future work that requires the parsing of Icelandic text.
Optimizing Interoperability of Language Resources with the Upcoming IIIF AV Specifications.
Jochen Graf, Felix Rau and Jonathan Blumtritt.
In our presentation, we discuss how the upcoming IIIF AV specifications could contribute to interoperability of annotated language resources in the CLARIN infrastructure. After some short notes about IIIF, we provide a comparison between the concepts of the IIIF specifications and the ELAN annotation format. The final section introduces our experimental Media API that intends to optimize interoperability.
Praat (Boersma and Weenink, 2019) is a versatile, open-source platform that provides a multitude of features for annotating, processing, analyzing and manipulating speech and audio data. By using the built-in scripting language, Praat can be easily extended and adjusted for different purposes while reducing manual work. The Speech Corpus Toolkit for Praat (SpeCT) is a collection of Praat scripts that can be used to perform various small tasks when building, processing and analyzing a speech corpus. SpeCT can help both beginners and advanced users solve some common issues in, e.g., semi-automatic annotation or speech corpus management. This work describes some of the general functionalities in SpeCT. A selection of the scripts will also be made available via the Mylly service at the Language Bank of Finland, maintained by FIN-CLARIN.
CLARIN-IT and the definition of a Digital Critical Edition for Ancient Greek Poetry: a new project for Ancient fragmentary texts with a complex tradition.
Anika Nicolosi, Monica Monachini and Beatrice Nava.
Ancient Greek studies, and Classics in general, is a perfect field to demonstrate how Digital Humanities could become the humanist way of building models for complex realities, analysing them with computational methods and communicating the results to a broader public. Ancient texts have a complex tradition, which includes many witnesses (texts that hand down other texts) and different types of supports (papyri, manuscripts and also epigraphy). These texts are fundamental for our cultural heritage, since they are the basis of all European literatures, and it is crucial to spread knowledge of them in a reliable and easy way. Our project on ancient Greek fragmentary poetry (DEA - Digital Edition of Archilochus: New models and tools for authoring, editing and indexing an ancient Greek fragmentary author) develops and grows out of existing experiences and tries to define a new digital and critical edition which includes the use of Semantic Web and Linked Open Data. Our goal is to provide a complete and reliable tool for scholars, suitable for critical study in the field, and also user-friendly and useful for non-specialist users. The project represents one of the attempts within the context of CLARIN-IT to contribute to the wider impact of CLARIN on the specific Italian community interested in Digital Classics and may improve services in fostering new (and sustaining existing) knowledge in SSH digital research.
Research Data of a PhD Thesis Project in the CLARIN-D Infrastructure. “Texts of the First Women’s Movement” as Part of the German Text Archive.
Anna Pfundt, Melanie Grumt Suárez and Thomas Gloning.
The authors of this paper are going to present parts of a PhD thesis that examines the use of words in the German discussion about women’s suffrage around 1900. The study refers to a variety of written texts (including journal articles, books, and controversial writings) that began to condense in the 1880s and developed a complex thematic network until the introduction of women’s suffrage in 1918. The focus of this paper is the presentation of the corpus compilation (ongoing and already published to some extent) for the CLARIN-D infrastructure component German Text Archive (“Deutsches Textarchiv”, hereafter DTA). This project addresses a basic user need: to make new texts available from the very beginning of a project. On the one hand, each new text increases the material basis for the dissertation, which can be analysed with the powerful search tool architecture of the DTA. On the other hand, the textual repertoire of the DTA grows with each text. In the end, it is a win-win situation for the author, the infrastructure and the whole research community.
Granularity versus Dispersion in the Dutch Diachronical Database of Lexical Frequencies TICCLAT.
Martin Reynaert, Patrick Bos and Janneke van der Zwaan.
The Nederlab project collected the digitized diachronical corpora of Dutch and made them available to all researchers in a single, explorable and exploitable portal within the CLARIN infrastructure. We are now building a database of lexical items and their frequencies collected according to the best known year of text production or publication on the basis of the 18.5 billion word tokens in the corpus. We here briefly discuss the corpus contents, major database design decisions we have taken, the tools we use and the approaches we take.
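As a rough illustration of what collecting lexical frequencies by year can look like, the sketch below counts word forms per year of production. It is not the TICCLAT database design: the `count_by_year` function, the whitespace tokenisation and the three sample documents are all assumptions made for this example.

```python
from collections import Counter


def count_by_year(documents):
    """Count word forms per year.

    `documents` is an iterable of (year, text) pairs; tokens are obtained
    by lower-casing and splitting on white space (a simplifying assumption).
    Returns a Counter keyed by (word_form, year).
    """
    freqs = Counter()
    for year, text in documents:
        for token in text.lower().split():
            freqs[(token, year)] += 1
    return freqs


# Hypothetical sample documents, invented for this illustration.
docs = [
    (1650, "de zee is groot"),
    (1650, "de stad"),
    (1900, "de zee"),
]
freqs = count_by_year(docs)
print(freqs[("de", 1650)], freqs[("zee", 1900)])
```

Keying the counts by both word form and year keeps the finest granularity available; coarser views (per decade or per century) can then be derived by summing over years.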
Cross disciplinary overtures with interview data: Integrating digital practices and tools in the scholarly workflow.
Stefania Scagliola, Louise Corti, Silvia Calamai, Norah Karrouche, Jeannine Beeken, Arjan van Hessen, Christoph Draxler, Henk van den Heuvel and Max Broekhuizen.
Progress in computer science with regard to capturing and interpreting forms of textual human expression has obviously affected the research practices of many humanities scholars in the last decades. This does however not seem to be the case when considering the standard scholarly approach to interview data. To set the stage for assessing the potential integration of new technology in this field, a community of experts from the Netherlands, Great Britain, Italy and Germany who engage with interview data from different perspectives, decided to organize a series of CLARIN funded workshops (Oxford, Utrecht, Arezzo and Munich; 2016-2018). This paper presents the preliminary results and envisioned further lines of research. It sketches the goals and the selection of participants, data and tools. It also reflects on how the invited scholars coped with unfamiliar approaches and digital tools. It describes how in the next stages efforts will be made to include new languages, new open source annotation tools, and how research will be conducted on the research behaviour with regard to new technology within the various disciplines. A multilingual archive of oral history interviews covering the topic of migration brought together by the organizers, was the basis for the first exploration, and will be used for further experiments to assess whether and how cross-disciplinary collaboration and the exchange of methods, data and tools can lead to innovation in methodology, use and services.
This paper looks at the Definiteness Effect (DE) in the history of Icelandic and argues, using the Icelandic Parsed Historical Corpus (IcePaHC), that DE in its current form is relatively recent. This is in line with Ingason et al. (2013) who argued that DE played a crucial role in the development of the so-called New Impersonal Passive in Icelandic.
This extended abstract presents the integration of language resources for Bulgarian with knowledge sources like ontologies and linked open data to support joint usage of language resources and cultural and historical heritage objects. We have started with the integration of language resources for Bulgarian. Then, on the basis of the available Bulgarian parts of resources like Wikipedia, DBpedia and Wikidata, we construct the first version of a Bulgarian-centered Knowledge Graph to represent the conceptual information for the Bulgarian E-Infrastructure CLaDA-BG.
Application of a topic model visualisation tool to a second language.
Maria Skeppstedt, Magnus Ahltorp, Andreas Kerren, Rafal Rzepka and Kenji Araki.
We explored the adaptations required for applying a topic modelling tool to a language that is very different from the one for which the tool was originally developed. The tool, which enables text analysis on the output of topic modelling, was developed for English, and we here applied it to Japanese texts. As white space is not used for indicating word boundaries in Japanese, the texts had to be pre-tokenised and white space inserted to indicate a token segmentation before the texts could be imported into the tool. The tool was also extended by the addition of word translations and phonetic readings to support users who are second-language speakers of Japanese.
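The pre-tokenisation step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it uses greedy longest-match segmentation against a tiny hypothetical lexicon (`TOY_LEXICON`), whereas a real workflow would rely on a dedicated Japanese morphological analyser.

```python
# Hypothetical toy lexicon, invented for this illustration only.
TOY_LEXICON = {"私", "は", "日本語", "を", "勉強", "し", "ます"}


def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; characters not covered by any entry become
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


def insert_whitespace(text, lexicon):
    """Return the text with a single space between segmented tokens."""
    return " ".join(segment(text, lexicon))


original = "私は日本語を勉強します"
spaced = insert_whitespace(original, TOY_LEXICON)
print(spaced)
```

Removing the inserted spaces recovers the original string, so the segmentation only adds boundary information and does not alter the text itself.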
CTS-R: Connecting Canonical Text Services with the Statistical Analytics Environment R.
This paper describes a software library for the statistical programming language R that builds an interface to the large-scale implementation of the Canonical Text Service (CTS) protocol (Smith, 2009; Tiepmar, 2018). This way, the vast amount of textual data that has been and will be collected in the Canonical Text Infrastructure is opened up to all the analytics frameworks and workflows that are available in R. Since the data sets become usable for any process that is built in R, this drastically increases the reach that they can gain. Conversely, it also increases the amount of textual data that is available in R for textual analysis.
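CTS identifies texts and passages with URNs of the form `urn:cts:<namespace>:<work>:<passage>`, where the passage component is optional. The sketch below splits such a URN into its components. It is written in Python purely for illustration and is not the API of the R library described here; the `CtsUrn` type, the `parse_cts_urn` function and the example URN are assumptions of this example.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str           # e.g. "greekLit"
    work: str                # text group and work, dot-separated
    passage: Optional[str]   # citation such as "1.1", or None if absent


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work and optional passage."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work = parts[2], parts[3]
    # A fifth, non-empty component is the passage reference.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, work, passage)


# Illustrative URN; whether a given server resolves it is not assumed here.
example = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1")
print(example)
```

Separating the work identifier from the passage reference is what allows a client to request either a whole text or a single cited passage with the same identifier scheme.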
Shapeshifting Digital Language Resources - Dissemination Services on ARCHE.
Martina Trognitz and Matej Durco.
The Austrian Centre for Digital Humanities of the Austrian Academy of Sciences hosts ARCHE – A Resource Centre for the HumanitiEs. ARCHE aims at stable and persistent hosting as well as the dissemination of digital research data and resources for the Austrian humanities community. This paper presents how data in ARCHE can be represented in multiple forms or shapes by using bespoke dissemination services. The focus is on describing dissemination services for digital language resources, such as XML documents; we showcase a few use cases and discuss the possible integration of such services into the Virtual Language Observatory of
Wablieft: An Easy-to-Read Newspaper corpus for Dutch.
Vincent Vandeghinste, Bram Bulté and Liesbeth Augustinus.
This paper presents the Wablieft corpus, a two-million-word corpus of a Belgian easy-to-read newspaper, written in Dutch. The corpus was automatically annotated with CLARIN tools and is made available in several formats for download and online querying through the CLARIN infrastructure. Annotations consist of part-of-speech tagging, chunking, dependency parsing, named entity recognition, morphological analysis and universal dependencies. By making this corpus available we want to stimulate research into text readability and automated text simplification.
Semantic parsing using Interpreted Regular Tree Grammars.
Several common tasks in natural language processing (NLP) involve graph transformation, in particular those that handle syntactic trees, dependency structures such as Universal Dependencies (UD) or semantic graphs such as AMR and 4lang. Interpreted Regular Tree Grammars (IRTGs) encode the correspondence between sets of such structures and have in recent years been used to perform both syntactic and semantic parsing. The poster presents the process of generating such a grammar and our results on Surface Realisation Shared Task 2019. (Poster)
Holistic Approach for the e-Documentation of the ASINOU Church Monument with the use of an immersive hybrid book.
A book is a perfect vehicle for building story, environment and character. Since ancient times people have commonly referred to books in order to gain knowledge, find or seek entertainment. The religious atmosphere, together with the history of books and libraries of the Orthodox Church provides a motivation to use this form of communication along with technology to provide an immersive interactive experience. The potential to add digital content to text on a ‘piece of paper’ creates opportunities and challenges for visualisation and enrichment of written content, storytelling and the composition of interactive narratives which draw on a holistic approach to documentation of a cultural heritage monument or site. Data from this type of memory of the past can generate diverse forms of multimedia such as: 3D models, images, video, audio and text. Therefore, the analogue and digital content create a new engagement with the monument, which is not experienced by visiting the church itself. The creation of an immersive installation requires extensive content creation, through techniques such as 3D modelling, video/image editing, visual design and software development. Transparent complex data have been filtered, assembled and presented in a form visible to the human eye, through an installation adapted and made functional for almost any group of users, becoming a personalised educational environment. These technologies have been here incorporated in an immersive e-book, in the context of a unique monument, Panagia of Asinou church in Cyprus. (Poster)
Narrative Detection for Lithuanian Language.
Automatic narrative detection is an important tool in media analysis. However, narratives are very hard to detect: it is easy to miss related elements of a narrative until it is too late to be of interest, or to unintentionally assign false positives that make the case look stronger. Current research shows a number of promising directions, but most of them are still at an early stage. Almost no results are reported for the Lithuanian language. However, the fast evolution of machine learning and easier, almost real-time access to different media sources look promising for a number of complex applications of language technology, including narrative detection. Based on these assumptions, we present a study on the automatic detection of narrative structure in textual sources in the Lithuanian language using machine learning methods.
This research is novel and challenging for the following reasons: 1) in order to obtain meaningful results, it is necessary to use a large amount of data; 2) the Lithuanian language is morphologically complicated.
The main contribution of this work is a study of narrative structure and the ways in which it affects personal perception. A narrative is a report of connected events, real or imaginary, presented in a sequence of written or spoken words or of graphical elements. Our goal is to label every clause in a narrative with one of the elements of narrative structure.
We analyze a fundamental problem: how to choose automatic methods that could achieve the highest accuracy in solving our narrative detection task. The analysis of related research will help us to select the methods which have demonstrated the best results on other languages and apply them to the Lithuanian corpora. We look forward to examining the relationship between narrative structure and discourse by testing whether first determining the structural element of each sentence fragment can help to detect indirect discourse relationships. (Poster)
Direct Democratic Institutions in Representative Political Systems: An Investigation of the German State Parliaments.
This dissertation investigates direct democratic institutions in representative democratic systems. The overarching research question shall be answered in five parts, all of which shall take the format of individual papers. All papers combined shall provide evidence to answer the question: How do parliaments and parties on the German state level behave in the policy field of direct democracy, and how may their behavior be explained? (Poster)
Annotation of Social Data for Influencer Detection.
We define an influencer as an individual who impacts the decisions of other individuals by interacting with them. Detecting influencers makes it possible to use them as “communication mediators” in applications like marketing or political campaigns and to prevent the dissemination of fake or dangerous information. Since interactions are at the core of social media, they are the best resource for influencer detection. To design an influencer model, we used a generalist forum in English made for public debate and identified three main steps in the influence process, together with their respective linguistic realisations in texts. In order to develop our influencer detection system, we need annotated data. In this paper we describe the annotation process and its output. The produced resource will be made publicly available for research purposes. The annotation task consists in identifying text spans of expressions corresponding to the linguistic instances of our model. The data used for annotation are messages extracted from the forum previously mentioned and tweets in English from individuals that are considered radicalised. We created datasets so that each one contains one textual type and approximately 100 messages. We organised five successive annotation sessions with NLP students as annotators. Each session consisted in the annotation of datasets distributed among groups of two or three annotators. This configuration made it possible to evaluate agreement between annotators and its progression. 20 datasets have been annotated, and 9 of them have been reconciled by the annotators to build gold annotation sets. Although the agreement scores (Kappa) are particularly bad ([-0.93 - 0.00]), we obtained a strong progression throughout the sessions (a gradient up to 60%) and reversed the trend in favour of convergent annotations. (Poster)
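For readers unfamiliar with Kappa scores, the sketch below computes Cohen's kappa for a pair of annotators. The abstract does not state which Kappa variant was used, so this is an assumption made for illustration; groups of three annotators would need a multi-rater variant such as Fleiss' kappa. The label names and the two annotation sequences are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four text spans by two annotators.
annotator_a = ["claim", "other", "claim", "other"]
annotator_b = ["claim", "other", "other", "other"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(kappa)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement, which is how a range such as the one reported above should be read.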
Third-Party Observer Gaze as a Source of Information in Speech Science.
This study aims at developing and implementing the “observer gaze” methodology in order to study the structural organization of multimodal, face-to-face conversations. Tracking the gaze of a third-party human observer can be used to learn about human real-time expectations of the dialogue and provide an online and implicit method for investigating the dynamics of interactions and the relevant features that capture the observer attention according to the amount of information the observer has available and her/his ability to understand the language. The way an outsider visually perceives the interactional exchanges may also suggest a better modelling of human-human communication for human-machine communication systems. Prototype versions of the method have proven promising, but the techniques are still rough and in need of development, quantification, and validation. The current study focuses on this and similar, relevant methods of acquiring annotation and labels of recordings of human interaction. (Poster)
The Digilang metadata portal - simplifying the discovery of linguistic datasets.
Discovering new types of research data as well as making one's own datasets discoverable by others is an important task for today's researchers, and linguists are no exception. Services like Metashare (https://www.metashare.org) have long been used for documenting existing linguistic resources and providing a stable way to reference the resources in actual research. In many cases, however, the process of finding new data suitable for addressing a specific research question could be made easier and more approachable especially for inexperienced users of language data.
The Digilang metadata portal (digilang.utu.fi) developed at the University of Turku is an attempt to simplify the process of discovering linguistic research data. It focuses on 1) representing information about linguistic datasets in a clear and clutter-free manner, 2) providing a reasonable set of filters and search facilities for finding relevant datasets and 3) using up-to-date web development technologies (ReactJS, Django REST framework) for providing a modern, mobile-friendly user interface. This poster presentation describes the ways in which these three goals have been implemented as well as the challenges that have been encountered along the way. The decisions and technical solutions used in the project are divided into two categories. First, we discuss the construction of the dataset insertion form and how it attempts to help the researchers who have produced a dataset to describe its properties in a helpful way. Second, special attention is given to the problem of building the end-user-facing representations of the data and the filters and search facilities provided to the user. These facilities include a question-based wizard that gives the user suggestions based on the user's description of the research problem he or she is trying to solve. (Poster)
Heaviness on the left edge - Observing linguistic processing in historical corpora.
Ingunn Hreinberg Indriðadóttir
This paper examines the relationship between heaviness and optional movement to the edge of a clause -- demonstrating how a digitized and syntactically annotated corpus of historical texts can contribute to the study of phenomena associated with linguistic processing. We focus on so-called weight phenomena in word order variation and find that heaviness draws phrases to both edges of a clause -- not just the right edge as sometimes assumed. For our study, we searched the Icelandic Parsed Historical Corpus (IcePaHC) which is accessible through CLARIN. It is a well known observation that syntactic constituents sometimes appear at the end of a clause rather than in their canonical position when they are heavy/long. This tendency is manifested in Heavy NP shift, the type of alternation shown in (1) where the direct object can shift to the right of the PP adjunct [on the street].
(1) a. I met [my rich uncle from Detroit] [on the street].
b. I met [on the street] [my rich uncle from Detroit].
Despite several studies on weight effects, it still remains a matter of investigation why such movement takes place. Proposed explanations appeal to some aspects of processing and include that such movement facilitates parsing or utterance planning and production. In this paper we make an empirical point that in our opinion seems to escape attention in some of the most important studies on weight effects. Heaviness is not only positively correlated with movement to the right edge of a clause, but also to the left edge, e.g. by left dislocation (2).
(2) a. I forgot about [my rich uncle from Detroit].
b. [My rich uncle from Detroit], I forgot about him.
This is important because it suggests that weight-driven movement is, at least in part, about amending situations where one needs to backtrack from a deeply embedded structure in the middle of an utterance rather than moving to the right. (Poster)
Parliamentary language and its written form.
Within my PhD thesis, I am studying parliamentary language in terms of linguistic analysis and the editing of records of parliamentary debates. I am interested in the characteristics of spoken texts in general, their written form, the concept of authenticity in spontaneous speech, and in particular the practical problems in transposing spoken parliamentary language into a written form. The subject of my research is the minutes of the National Assembly of the Republic of Slovenia, more precisely the records of parliamentary debates, which have been published since 1963.
In parliamentary language – according to experts' testimonies – more and more dialectal, colloquial, and foreign expressions have emerged. A greater degree of spontaneity is also associated with sentence structures that are not characteristic of written language. Among other things, I am interested in how these facts affect the experts’ decisions on how to write spoken language, to what extent the final versions of records of parliamentary debates are still comprehensive and understandable, and whether they preserve the meaning and the sense of what was said, as well as the characteristics of each speaker, which are the main editorial guides.
The research will be done with the help of parliamentary corpora that are included in the CLARIN infrastructure (comparison between parliaments), and especially corpora from CLARIN.SI, such as the corpus siParl, which provides insight into the redacted minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992 and the minutes of the National Assembly of the Republic of Slovenia from the 1st to the 7th legislative period 1992-2018. I assume the wide time frame of the minutes included in the corpus will make it possible to observe the changes in the language and editing during that period. (Poster)
Predicting the unpredictable. Developing a lexicon model for Norwegian multiword expressions.
Gyri Smørdal Losnegaard
The objective of this project is to create a broad-scope, multipurpose and reusable lexical resource of Norwegian multiword expressions (MWEs). This work involves developing a linguistically informed and largely language-independent methodology for identifying, classifying and representing linguistic properties of MWEs. The methodology builds on existing knowledge about MWE properties and draws on principles of word classification. The lexicon model development process is guided by three main perspectives: automatic analysis (the NLP perspective), lexicon development (the lexicographic perspective) and foreign language acquisition and use (the language learning perspective). The framework Lexical Functional Grammar (LFG) is used for linguistic analysis and Lexical Markup Framework (LMF) for MWE description.
The development process can be broken down into several subtasks. Identification is the task of distinguishing MWEs from free combinations, and involves an operationalization of the notions of idiosyncrasy and productivity. Classification is the task of distinguishing types of MWEs. In this project, this involves applying a range of criteria with the aim of arriving at a holistic, extendable and linguistically motivated classification model. Delimitation of the MWE lemma concerns distinguishing between MWE variants and new MWE lemmas by determining the possible variation scope. Finally, MWE description involves determining what are the necessary and sufficient properties to be represented in the lexical resource, and how to represent this information.
The project employs several tools and resources related to CLARINO. The data serving as a basis for model development are approximately 2000 MWE candidates compiled during the construction of NorGramBank, a large LFG treebank for Norwegian that was developed during INESS, a CLARINO collaborative project. NorGramBank is hosted by the INESS infrastructure, which is part of the CLARINO Bergen Centre, and supplementary data is retrieved from the treebank using INESS search – a querying system for treebanks in a variety of formats. (Poster)
Formal and textual narrative properties representing trauma experience as predictors of PTSD development.
Analysis of narratives can bring many advantages, especially to psychologists, who listen to other people. The content of speech can tell many things about other people's moods and troubles. In the past, psychologists used a notebook and pen to write down what their patients had to say. Nowadays we have modern technology which helps not only to take notes but also to analyse what people said. Using the CLARIN infrastructure we can discover non-obvious information hidden in narratives. In my PhD project I will analyze narratives of people who experienced traumatic events. I try to answer the following questions:
- Which characteristics of narratives are predictors of PTSD occurring?
- Which part of language is an indicator of overcoming PTSD?
- Are there any differences between the narratives of people who suffered from PTSD and those who didn’t?
The poster will present a research proposal using CLARIN tools. (Poster)
Transfer learning of language models for argument mining.
Argumentation has long been studied in various social sciences such as philosophy or linguistics. For example, the CLARIN Virtual Language Observatory shows a wide range of argumentation resources: for decision making (Argumentation and Argument Visualisation in Promoting Strategic Reading and Decision-making), problem-solving (Argumentation in Studying Problem-solving Skills in Social Work Education in Finnish Polytechnics) and argument analysis (Araucaria). The automatic computation of argumentation can assist, among others, in querying different points of view or opinions, in helping second language learners improve their essay writing, or in a decision-making process. We describe the development and evaluation of an automatic argumentative discourse classifier. The argument discourse classifier identifies argument and non-argument structures in a written text. We start by replicating the state of the art using a neural network classifier. We then proceed to improve the results using a transfer learning technique. Transfer learning is a machine learning technique that leverages knowledge from multiple classification tasks to improve an algorithm's generalization and thus obtain better results. The training of the machine learning models uses the knowledge shared among tasks. Sharing knowledge among tasks can improve the accuracy of the individual tasks beyond that of single-task learning. We report on the results from the knowledge transfer of language modeling tasks. Language models estimate the relative likelihood of sequences of tokens and can be obtained using distributional semantic models. Distributional semantic models are pre-trained real-valued vectors representing words obtained from word prediction tasks. We also report on the results from the transfer learning of a natural language inference task. The natural language inference task is the classification of sentence pairs with the labels entailment, contradiction or neutral.
The experimentation space included different neural network architectures and hyper-parameterization. (Poster)
Transparent automatic genre classification of newspaper articles.
Systematic study of genre in newspapers sheds light on the development of journalism discourse. The genre conventions that can be discerned in a newspaper text signal the underlying discursive norms and practices of journalism as a profession. Historical newspapers are increasingly becoming available thanks to digital newspaper archives (in the Netherlands available through Delpher.nl), providing the opportunity for large-scale empirical research. However, the digital archives do not contain the fine-grained genre information that is required for this purpose. The NEWSGAC project has adopted a machine learning approach to add genre labels to newspaper articles.
Classifying genre in a standardized and reliable manner is challenging, though, because genre is a typical example of a ‘latent’ content category, which needs considerable interpretation. To ensure the reliability of the results and to evaluate the machine learning approach, it is crucial to make the methodological impact of various machine learning pipelines transparent. Our project, therefore, had a dual aim: (1) to develop and improve machine learning approaches for automatic genre classification by systematically testing and evaluating the performance of different algorithms on historical newspaper articles and (2) to support humanities scholars with the necessary information to make an informed decision about the best preprocessing tools and machine learning algorithms for their research question and source data.
The transparency-driven platform we have developed facilitates running, comparing and critically assessing machine learning pipelines. It allows scholars to explore the underlying decision-making process of the machine learning pipeline, among others through data visualisations that show the performance of the classifier per genre, article-level and classifier-level explanations for the performance, and comparison between various machine learning pipelines. Evaluating pipelines beyond evaluation metrics, such as accuracy scores, enabled us to choose the pipeline most suited for automatic genre classification. (Poster)
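To illustrate why per-genre performance can be more informative than a single accuracy score, the sketch below computes precision and recall for each genre from gold and predicted labels. It is not the NEWSGAC platform's code; the `per_genre_scores` function, the genre names and the example labels are assumptions made for this illustration.

```python
from collections import Counter


def per_genre_scores(gold, predicted):
    """Return {genre: {"precision": p, "recall": r}} for every genre seen.

    Precision or recall is reported as 0.0 when its denominator is zero.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted genre p wrongly
            fn[g] += 1  # missed an article of genre g
    scores = {}
    for genre in sorted(set(gold) | set(predicted)):
        p_den = tp[genre] + fp[genre]
        r_den = tp[genre] + fn[genre]
        scores[genre] = {
            "precision": tp[genre] / p_den if p_den else 0.0,
            "recall": tp[genre] / r_den if r_den else 0.0,
        }
    return scores


# Hypothetical gold and predicted genre labels for four articles.
gold = ["news", "news", "interview", "review"]
predicted = ["news", "interview", "interview", "news"]
scores = per_genre_scores(gold, predicted)
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(accuracy, scores)
```

In this toy case the overall accuracy is 0.5, yet the breakdown shows that every interview is found while no review is, which is the kind of difference a single aggregate score hides.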