The CLARIN Bazaar 2020


Below you will find a list of the virtual stalls that can be visited at the CLARIN Bazaar. Without the need to travel, the CLARIN2020 Bazaar is an even more global event than in previous years, with many countries and continents represented. Please go and talk to the stallholders, see their virtual material, and share your ideas!


Bazaar presentations (sorted by topical category)

CLARIN Core | CLARIN COVID-19 Activities | CLARIN in SSHOC | Country presentations | Other

CLARIN Core

Title of stall
 
Description
     

Core components for CLARIN metadata and other developments

CMDI task force

 

The CMDI task force has been working on the design and implementation of a set of 'core metadata components', with the aim of simplifying the creation of metadata that is both FAIR and optimised for the CLARIN infrastructure, for a wide variety of use cases. At the current development stage we are looking for ideas and feedback from metadata creators and modellers, repository managers, developers of software that processes metadata, and anyone else with experience with or interest in metadata in the context of CLARIN and the broader research infrastructure landscape.

Of course, we are also there to answer your CMDI-related questions, or to discuss any other matters related to metadata in CLARIN. (slides)

     

Sharing corpora of disordered speech and finding relevant use cases

Henk van den Heuvel and Esther Hoorn

 

Corpora of Disordered Speech (CDS) are difficult to find, very costly to collect, and, for privacy reasons, hard to share. At the same time, due to their small size and dedicated purpose, they need to be combined to be suitable for re-use. The research community feels a strong need to bring together existing and new CDS in an interoperable and consistent way that is both legally and ethically safeguarded.

DELAD is an initiative to join forces on sharing CDS. Together with CLARIN ERIC, DELAD organizes workshops to work on guidelines and solutions regarding legal, ethical and technical issues. In this light, steps are being taken to get a clear view of use cases that mitigate the risks for participants under the GDPR. Under the GDPR, a Data Protection Impact Assessment (DPIA) is the required method to assess risks and design mitigation measures.

A team at the University of Groningen received a COMENIUS senior fellowship for educational innovation for the project Privacy In Research: Asking the right questions. The team uses the DPIA method to actively involve students, researchers and support staff in building privacy into the design of research projects, and to explore the dilemmas, risks and use cases in a specific domain. We want to show you a video on a case concerning Parkinson's disease, along with other materials. Please join us and share your insights into field-specific use cases and good ethical practices.

In the Bazaar session the DELAD and COMENIUS team will join forces to explore specific needs and use cases for privacy compliant data sharing in the field of Disordered Speech.

In our contribution we will describe the services and information that DELAD can offer, explain relevant issues in the light of the GDPR, including DPIAs, and present a number of use cases by way of illustration. There will be ample time for questions from the audience. (slides, material)

     

Pursuing the elusive KPI: Filling the gaps in centre self-published standards-related information

CLARIN Standards Committee (CSC)

 

One of the goals of the Standards Committee is, to quote from the Bylaws, "to maintain the set of standards supported by CLARIN and adapt them to new developments within or outside CLARIN".

Given the differences in the perspectives and goals of the users of such a set, we realised early on that, in order to talk about this meaningfully, we need to start small, in a well-defined area: first establish the parameters of description (i.e., the terms with which to characterise the function of a particular standard), and only then expand from such a small initial set to one that can meaningfully be considered "supported by CLARIN".

The initial set that we aim to define luckily emerges from one of the Key Performance Indicators used in measuring CLARIN's impact: the percentage of B-centres that publish explicit information on what formats they accept. From this information, we should be able to see, e.g., what a given centre considers a standard format, which formats are most commonly supported, and which centres support formats that (nearly) no other centre is interested in. We can also try to unify both the presentation format and the content of these sets (concerning, e.g., the granularity of the terms used -- is "XML" really enough to describe what data a centre is prepared to accept?).
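
As a toy illustration of this KPI and of aggregating format support across centres, the bookkeeping could look like the sketch below; the centre names and formats are invented for illustration.

```python
# Hypothetical sketch: computing the KPI from centre-published format lists.
# Centre names and formats are invented for illustration.

all_b_centres = ["Centre-A", "Centre-B", "Centre-C", "Centre-D"]

# Formats each centre explicitly publishes as accepted (None = not published).
published = {
    "Centre-A": {"TEI XML", "CMDI", "WAV"},
    "Centre-B": {"TEI XML", "PDF/A"},
    "Centre-C": None,          # no explicit list published yet
    "Centre-D": {"CMDI", "WAV"},
}

def kpi(centres, published):
    """Percentage of B-centres with an explicit accepted-formats list."""
    n = sum(1 for c in centres if published.get(c))
    return 100.0 * n / len(centres)

def format_support(published):
    """How many centres accept each format (for spotting rare formats)."""
    counts = {}
    for fmts in published.values():
        for f in fmts or ():
            counts[f] = counts.get(f, 0) + 1
    return counts

print(f"KPI: {kpi(all_b_centres, published):.0f}%")   # 3 of 4 centres -> 75%
print(format_support(published))
```

Once every centre publishes its list, the same aggregation would also answer the questions above: which formats are supported almost everywhere, and which only in one place.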

Sadly, we can't blame the pandemic alone for the fact that no complete picture is presentable at CAC-2020. It turns out that some centres do not yet publish the information that we (and, crucially, also users turning to CLARIN for assistance) could use to establish which data formats are the most sought-after, or which centre is willing to accept our "not-so-standard" data format.

We would like to invite centre representatives to have a look at the list of centres that have published explicit lists of accepted formats at https://www.clarin.eu/content/standards-and-formats#formats and to come over to our stall for a chat, especially if your centre is missing from that list. Let us try to bring the KPI in question as close to 100% as we can, and then benefit from the result!

     

CLARIN COVID-19 Activities

Title of stall
 
Description
     

CLARIN Hackathon on COVID-19 Related Disinformation - Mid-term event

Alexander König

 

This is the mid-term event of the ongoing CLARIN Hackathon on COVID-19 Related Disinformation (see https://www.clarin.eu/event/2020/hackathon-covid-19-related-disinformation for details).

With this hackathon, CLARIN intends to bring together cross-disciplinary groups of researchers to work on the task of disinformation detection in the context of the COVID-19 pandemic. They are invited to use existing data sets containing disinformation and fake news in order to create algorithmic solutions to research questions of their choice, e.g. by assigning the likelihood for a text to be disinformation or automatically detecting re-postings of known conspiracy theories even if they are rephrased. Analytical and comparative contributions are also welcome.
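
As a minimal illustration of one of the research questions mentioned above (not an approach prescribed by the hackathon), re-postings of a known conspiracy text often survive mild rephrasing, which even simple bag-of-words cosine similarity can pick up; all example texts are invented.

```python
# Illustrative sketch: flagging near-duplicates of a known conspiracy-theory
# text with bag-of-words cosine similarity, which survives mild rephrasing.
import math
import re
from collections import Counter

def vec(text):
    """Bag-of-words vector as a word -> count mapping."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

known = vec("the virus was engineered in a secret laboratory")
post = vec("they say the virus was engineered in some secret laboratory")
unrelated = vec("local bakery wins award for best sourdough bread")

print(cosine(known, post) > 0.7)       # True: likely a rephrased re-posting
print(cosine(known, unrelated) > 0.7)  # False
```

A real system would of course need stronger methods (paraphrase-robust embeddings, multilingual handling), but the threshold-on-similarity structure is the same.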

This bazaar stall is offered to give all teams active in the hackathon a slot for discussion where they can update each other on their progress and discuss any obstacles they have identified. Each team will be able to present its progress as well as point to problems encountered along the way. This is meant to foster cooperation between the different teams, and to give them the chance to learn from each other how to solve problems that others may already have encountered.

Please note that you have to be in one of the registered hackathon teams to participate.

     

ParlaMint: Towards Comparable Parliamentary Corpora

Maciej Ogrodniczuk and Petya Osenova

 

We will report on the ongoing work within the ParlaMint project (July 2020 - May 2021). One of the most important aspects of processing new parliamentary data is its direct correspondence to the most recent events with global impact on human health, social life and the economy, such as the current COVID-19 pandemic. By comparing the data synchronically and diachronically in a cross-lingual context, the scientific and civil communities will be able to track the pan-European discussion and be quickly updated on any emerging topic. The stall will present parliamentary corpora for four languages: Bulgarian, Croatian, Polish and Slovene. A number of issues will be considered: the ParlaFormat TEI standard extension and its adaptation in a cross-lingual context; the challenges in gathering recent parliamentary data; the comparability of the data, etc. (poster)
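
To give an idea of what processing such corpora involves, the sketch below reads a tiny, invented TEI-style debate fragment. The `<u who="...">` and `<seg>` elements follow common TEI practice for transcribed speech; the actual ParlaMint schema is considerably richer.

```python
# Minimal sketch of reading a TEI-encoded parliamentary debate fragment.
# The sample text and speaker IDs are invented.
import xml.etree.ElementTree as ET

TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA"><seg>The pandemic requires urgent measures.</seg></u>
    <u who="#SpeakerB"><seg>We must also protect the economy.</seg></u>
  </div></body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(TEI)

# Collect (speaker, text) pairs, one per utterance.
utterances = [
    (u.get("who"), " ".join(seg.text for seg in u.findall("tei:seg", NS)))
    for u in root.iter("{http://www.tei-c.org/ns/1.0}u")
]
print(utterances)
# first element: ('#SpeakerA', 'The pandemic requires urgent measures.')
```

With speaker metadata attached, the same loop is the starting point for the synchronic and diachronic comparisons described above.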

     

CLARIN in SSHOC

Title of stall
 
Description
     

CLARIN services for the SSH

Daan Broeder,  Nicolas Larousse, Willem Elbers, Emanuel Dima

 

In the SSHOC project we are generalising the CLARIN LR Switchboard and the CLARIN Virtual Collection Registry into an SSHOC Switchboard and Virtual Collection Registry that should be useful for the broad Social Sciences and Humanities. This requires finding and describing new use cases, adapting the software, producing training and engagement documentation and, not to forget, discussing potential new governance structures for supporting these services. The SSHOC Switchboard and VCR will be integrated with newly developed SSHOC services such as the SSH Open Marketplace and the FAIR SSH Citation infrastructure. (poster, slides)

     

Facilitating Annotation of Collective Bargaining Agreements with Keywords Extraction

Daniela Ceccon

 

 

Since 2012, the WageIndicator Foundation has maintained a Collective Agreements Database, in which the texts of 1338 collective agreements (CBAs) from 52 countries and in 26 languages have been uploaded, coded and annotated. This database is a unique example at the global level. For each agreement, the team answers a series of questions and selects the appropriate piece of text (clause) for each. The coding process is currently very time-consuming. How can it be sped up? The use of keywords has been identified as a possible solution: together with the University of Amsterdam and within the SSHOC project, a script is being written to find the relevant set of keywords for each language and each question, and possibly to find the correct piece of text for each topic automatically in a new collective agreement. The first tests proved that keywords work well on already selected clauses - pre-defined pieces of text. However, when highlighted in new collective agreements, the keywords are too numerous and not specific enough, so it is difficult for the annotators to spot the right pieces of text. Working with so many different languages is also proving to be a hard task, especially for lemmatisation. At this stall, you will hear how far we are in this process and how we plan to proceed, and can (we hope!) give your suggestions and comments. (poster)
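
A minimal sketch of the keyword idea described above: per-question keyword sets are matched against the clauses of a new agreement, and clauses are ranked by how many question-specific keywords they contain. The question names, keyword sets and clauses below are invented.

```python
# Hypothetical keyword sets per annotation question.
keywords = {
    "maternity_leave": {"maternity", "leave", "pregnancy", "weeks"},
    "overtime_pay": {"overtime", "rate", "hours", "premium"},
}

# Invented clauses from a new collective agreement.
clauses = [
    "Employees are entitled to 16 weeks of paid maternity leave.",
    "Overtime is paid at a premium rate of 150 percent.",
    "The agreement enters into force on 1 January.",
]

def best_clause(question, clauses):
    """Return the clause with the most question-specific keywords, if any."""
    kws = keywords[question]
    scored = [(sum(w in kws for w in c.lower().replace(".", "").split()), c)
              for c in clauses]
    score, clause = max(scored)
    return clause if score > 0 else None

print(best_clause("maternity_leave", clauses))
```

The difficulty reported above shows up immediately at scale: broad keywords match too many clauses, so the real script needs language-specific lemmatisation and more discriminative keyword selection.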

     

The SSH Open Marketplace: its relevance for the CLARIN communities and CLARIN services

SSHOC: Stefan Buddenbohm (for SSH Open Marketplace), Laure Barbot (for DARIAH communities), Daan Broeder (for CLARIN communities)

 

The SSH Open Marketplace, built under the SSHOC project, is a discovery portal which pools and harmonises SSH resources (e.g. tools and services, datasets, training materials and workflows, as well as selected research papers), offering a high-quality and contextualised answer at every step of the SSH research data life cycle.

The SSH Open Marketplace will not only be part of the wider EOSC environment - particularly as a subset of the EOSC Catalogue - but also make available and contextualise established research infrastructures’ resources. CLARIN and DARIAH are cooperating closely within SSHOC and exploiting their adjacent approaches to text and language related research domains and communities.

With a more technical look at CLARIN, SSHOC seeks to lift CLARIN services into the EOSC, to scale up their uptake and visibility and, most importantly, to explore interoperability scenarios with other services. Of particular interest are the Virtual Collection Registry (VCR), the Switchboard (LRS) and the CLARIN Resource Families.

The CLARIN resources will be harvested to populate the SSH Open Marketplace with the corpora, lexical resources and tools already curated by CLARIN, and will be a valuable contribution to the marketplace's content.

When it comes to the CLARIN Virtual Collection Registry, several possibilities for a technical integration are being explored. The main goal is to create added value for the users of the services mentioned, be it the VCR, the LRS or the SSH Open Marketplace. For instance, it would be a valuable enhancement of the Marketplace to allow its users to incorporate Marketplace content into an individual virtual collection. This function won't be provided by the Marketplace itself, but it could be provided by the VCR.

A similar use case is at hand with the CLARIN Switchboard. A user of the Marketplace could invoke the Switchboard to explore further tools for a specific resource. Here the integration effort lies not only in directly linking the Switchboard and the Marketplace, but also in enhancing the Switchboard with additional services.

Furthermore, the SSH Open Marketplace involves its potential user community at different stages of the development process, e.g. through public consultations, workshops and webinars, to align the development progress with user requirements. The CLARIN communities are very valuable to us for reviewing and revising the functions, the content and the presentation of content in the SSH Open Marketplace. Direct contributions as curators of the SSH Open Marketplace will be possible by the end of 2020, which marks the official beta release of the service. A shared governance model bringing together several ERICs and partners of the SSHOC project is also under discussion; it will ensure the sustainability of the service and its uptake by the SSH communities already involved in existing networks. (slides)

     

Social Science and Humanities Open Cloud: why vocabularies matter

Iulianna Van der Lek and Monica Monachini

 

Why

The breadth and vastness of the Social Sciences and Humanities sector lead to an increasing diversity of research methods and work practices in the field, and to fragmentation in the use of vocabularies to describe, discover and access research content. The SSHOC EU thematic cluster project aims to remedy this situation and to create the conditions for sharing and optimising research data and services in a sustainable way across domains, thus developing the SSH area of the European Open Science Cloud for the sector.

In this scenario, there is an urgent need to reconcile current practices in the use of vocabularies. As part of the SSHOC project, CLARIN has launched an initiative to collect, register and harmonize SSH vocabularies, terminologies and ontologies in order to allow unified access to research content.

How

Different action points have been identified. A first step foresees an analysis and collection of the requirements for a vocabulary registry based on SSHOC. During the first year, SSHOC has already provided a review of the most widely used vocabulary platforms and identified the technical requirements for the vocabulary management platform to be used in SSHOC. As a second step, a series of online information sessions is being organized to raise awareness in the SSH communities about open-source vocabulary hosting platforms, investigate suitable platform(s) for storing and managing the SSHOC vocabulary registry, and explore platforms that can serve infrastructure-specific needs. As further technical requirements emerge, the preliminary SSHOC results will be re-evaluated and updated recommendations for the SSHOC vocabulary hosting platform will be published.

The last step foresees the selection of a SSH vocabulary hosting platform and the creation of an inventory of known vocabularies, whose metadata will be harmonized and deposited in the SSHOC platform.

Whom

This initiative strongly relies on community participation: the strong involvement of academic and/or industry professionals, together with input from external experts and other organizations that have experience with vocabulary management and publication platforms, is key to the success of this task. (slides)


Country presentations

Title of stall
 
Description
     

Time Entities Matter!?: developing Linked Open Data for temporal entities for humanities research

Go Sugimoto

 

ACDH-CH is developing Linked Open Data for numeric date entities, called Linked Open Date Entities (LODE), initially covering a span of 6,000 years. It provides over 4 million entities in RDF, with the full range of precision between a millennium and a day. It is especially designed to allow users to perform Entity Linking for datasets describing a single point in time in historical research, something that is currently very limited in other Linked Open Data resources. We present the development of the data model and a demo to discuss its scope and potential. The project is highly interdisciplinary; we therefore welcome anybody interested in temporal entities for their research and in the Semantic Web/Linked Open Data. Although some knowledge of the Semantic Web is helpful, the topic should be relatively easy to understand without prior knowledge. This is an opportunity to learn about the project and share opinions and ideas. Critical feedback as well as constructive suggestions are all welcome. We are also looking for potential users and collaborators. (poster)
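
The gist of date entities as Linked Open Data can be sketched as follows: each day gets its own dereferenceable URI, linked to its month and year entities, so datasets mentioning that day can link to the same node. The namespace and property URIs below are invented for illustration and are not LODE's actual scheme.

```python
# Sketch of minting RDF triples (N-Triples style) for a single day entity.
# The base URI and property names are hypothetical.
import datetime

BASE = "https://example.org/date/"   # hypothetical namespace

def date_entity_triples(y, m, d):
    """Triples linking one day entity to its month and year entities."""
    day = datetime.date(y, m, d)
    uri = f"<{BASE}{day.isoformat()}>"
    return [
        f'{uri} <http://www.w3.org/2000/01/rdf-schema#label> "{day.isoformat()}" .',
        f"{uri} <{BASE}year> <{BASE}{y}> .",
        f"{uri} <{BASE}month> <{BASE}{y}-{m:02d}> .",
    ]

for t in date_entity_triples(1683, 9, 12):
    print(t)
```

Because every dataset that links to the shared day URI becomes connected to every other such dataset, queries like "all events recorded for September 1683" fall out of standard graph traversal.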

     

As You Like It: Event Annotation with INCEpTION

Laska Laskova, Kiril Simov, Petya Osenova, Iva Anastasova, Preslava Georgieva

Building a Bulgaria-centric Knowledge Graph is one of the main objectives of CLaDA-BG, the Bulgarian national research infrastructure for resources and technologies for language, cultural and historical heritage. To achieve this objective, the coordinator of CLaDA-BG, the Institute of Information and Communication Technologies at the Bulgarian Academy of Sciences (IICT-BAS), launched a collaborative pilot project with one of the data and expertise partners in the CLaDA-BG consortium, the Institute of Balkan Studies and Centre of Thracology “Alexander Fol” (IBSCT-BAS). The ultimate goal is to link manually annotated Named Entities, events and participants in domain-specific texts to concepts from the CIDOC-CRM ontology, WordNet, and Wikipedia. While IBSCT-BAS provided expert scientific publications as a source of information on persons, organizations, locations, artefacts and events of cultural and historical salience, IICT-BAS offered an event annotation scheme implemented in the INCEpTION platform developed at the UKP Lab at TU Darmstadt. During several months of joint work, the initial annotation scheme has changed to accommodate the needs of annotators who came from different scientific backgrounds and to better model the content of the texts. At our stall, we will present the current version of the annotation scheme, the factors that shaped it, and some insights into the collaboration between humanities scholars and computational linguists. (slides)
     

The Corpus InterLangue project: storing language learner data in a Huma-Num Nakala database for automatic online retrieval

Thomas Gaillat and Leonardo Contreras Roa

 

The Corpus InterLangue (CIL) project is a collection of spoken and written productions from learners of English and French as second languages (L2). The corpus provides various sources of input from learners completing different tasks (Ellis 2003). Learner data have been a source of evidence-based research in Second Language Acquisition for over two decades (Granger, Gilquin, and Meunier 2015). This type of data gives insights into features of learner language which can be analysed in the light of the interlanguage (IL) hypothesis (Selinker 1972).

The CIL data have been collected since 2008 as part of a research programme conducted by the LIDILE research team. The data sources have been stored digitally in non-public spaces. The LIDILE team now wishes to make this data available to the community.

The CIL is divided into two parts which compile data of two learner profiles: learners of L2 French (CIL-FLE) whose L1 is English, Spanish, Mandarin, or Arabic, and learners of L2 English (CIL-ALE) whose L1 is French. The same data collection protocol is followed for both languages and learners perform the same tasks: 1) a 10-to-15-minute semi-structured interview, 2) a reading aloud task, which prompts 3) a writing task. Spoken data are transcribed and time-aligned. Handwritten text data are retyped into digital format. All transcriptions are saved in CHILDES and TEI XML compliant formats. Learner metadata have also been collected and compiled in table-like format in .tsv files. Each column corresponds to a sociolinguistic variable. All publicly accessible data are anonymised.

The database will store three types of documents: WAV audio, UTF-8 text or TSV files, and PDF images of handwritten documents. Each document will be characterised with Dublin Core metadata, including title, author, date, file format and document type (audio, text or image). The database will be a triplestore in the sense that all document metadata, their relationships (isPartOf…) and their collection(s) will be stored according to the RDF standard. A REST API with SPARQL-based queries will be developed to allow up-to-date automatic retrieval of selected data by external applications. Results will be returned as JSON files.
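
The triplestore idea can be illustrated with a toy in-memory version (the file names and collection identifiers below are invented); a SPARQL query against the real Nakala database would play the role of the pattern match here.

```python
# Toy subject-predicate-object store: documents, Dublin Core metadata and
# isPartOf relations as triples. Names are invented for illustration.
triples = [
    ("doc:interview01.wav", "dc:type", "audio"),
    ("doc:interview01.wav", "dc:title", "CIL-ALE interview 01"),
    ("doc:interview01.wav", "isPartOf", "coll:CIL-ALE"),
    ("doc:essay02.txt", "dc:type", "text"),
    ("doc:essay02.txt", "isPartOf", "coll:CIL-FLE"),
]

def match(s=None, p=None, o=None):
    """SPARQL-like triple pattern matching; None acts as a variable."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# 'Which documents are audio recordings?'
print([s for s, _, _ in match(p="dc:type", o="audio")])
# ['doc:interview01.wav']
```

An external application would send the equivalent SPARQL pattern to the REST API and receive the matches as JSON, keeping its local copy of the corpus selection up to date.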

The data storing workflow (see Figure 1) includes three main stages. In stage 1, users (linguists) collect the data in format-compliant files. In stage 2, DC metadata files are created for each document, uploaded in batch mode and stored in RDF. In stage 3, public access will be provided to allow automatic retrieval.

https://www.leonardocontrerasroa.com/images/Workflow-DB.png 

Figure 1: Data workflow used to store the CIL corpus in the Huma-Num Nakala database

Such work can benefit research in three main areas. AI-based Tutoring Systems may avail of up-to-date data for model training tasks. Researchers in SLA may access the data for modeling learner language according to socio-linguistic variables such as learners’ L1 and L2 exposure. Language instructors and course designers may use the retrieved data for Data Driven Learning activities.

     

Ti racconto l’italiano. Italian art, culture and economy as told through speech archives

Sabina Magrini, Piero Cavallari, Cecilia Valentini, Silvia Calamai

Ti racconto l’italiano is a research project run by the Istituto Centrale per i Beni Sonori ed Audiovisivi (ICBSA) in collaboration with Siena University. It aims at cataloguing, labelling, publishing and enhancing a collection of unpublished video interviews with Italian personalities, recorded in the 1980s on behalf of the ICBSA. The same documents are simultaneously analysed by a research group at the Università per Stranieri di Siena for educational use (teaching Italian as a foreign language). The records are divided into three series according to the interests or activities of the interviewees: the first series is dedicated to visual artists, the second to poets and writers, while the third contains interviews with important business figures. The current project work consists of creating finding aids that will help users search the collection easily and efficiently. The documents need to be indexed: each segment is classified at regular time intervals and labelled with keywords and controlled vocabulary. These terms are hierarchically listed in specially created thesauri, based on a system known as Nuovo soggettario, developed by the Central National Library of Florence (https://thes.bncf.firenze.sbn.it/index.html). The thesauri will also include proper nouns and new terms, providing a comprehensive description of the documents. Indexing is done via AVIndexer, a software tool developed by Davide Merlitti that makes use of SKOS (http://www.informaticaumanistica.com/open-source/avindexer/). Finally, the records will be published on an internet portal modelled on the digital library Ti racconto la storia (https://www.tiraccontolastoria.san.beniculturali.it/). The second phase of the project will focus on dissemination activities through the creation of easy-to-access extracts from the interviews, organised by keywords and specific themes. (poster)
     

Profiling-UD: a tool for linguistic profiling of multilingual texts

Dominique Brunato, Andrea Cimino, Felice Dell’Orletta, Simonetta Montemagni and Giulia Venturi, Institute of Computational Linguistics "Antonio Zampolli" (ILC-CNR), ItaliaNLP Lab

We would like to present at the CLARIN Bazaar a recent outcome of the research carried out at the ItaliaNLP Lab of the Institute of Computational Linguistics “Antonio Zampolli” (ILC-CNR): Profiling-UD (Brunato et al., 2020). It is an open-source text analysis tool, available at http://www.italianlp.it/demo/profiling-UD/, inspired by the principles of linguistic profiling (Van Halteren, 2004; Montemagni, 2013), which can support scholars working on language variation from several theoretical and applicative perspectives.

The tool reconstructs a rich linguistic profile of a text by automatically extracting more than 130 features from it, spanning different levels of linguistic description and modeling lexical, grammatical and semantic phenomena that, all together, contribute to characterizing language variation within and across texts. Beyond the large number of features that can be monitored, a main novelty of Profiling-UD is that it has been specifically devised to be multilingual, since it is based on the Universal Dependencies annotation framework (Nivre, 2015). This makes it particularly suitable for comparative and cross-linguistic studies.
To date, Profiling-UD has been successfully tested in several case studies within the computational linguistics, digital humanities and education communities: for instance, in language acquisition, to monitor the evolution of written language competence in L1 and L2 students; in stylometry, to predict demographic characteristics of a text’s author (e.g. gender, age) from its writing style; and in linguistic complexity and readability assessment research, to automatically model readers’ perception of sentence complexity.
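
As a rough illustration of what linguistic profiling over Universal Dependencies input means, the sketch below computes two simple features from a tiny, invented CoNLL-U fragment; the real tool extracts more than 130 such features.

```python
# Two toy profiling features from a CoNLL-U fragment (columns: ID, FORM,
# LEMMA, UPOS, ...). The sentence is invented for illustration.
CONLLU = """\
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_
"""

rows = [line.split("\t") for line in CONLLU.splitlines()]
upos = [r[3] for r in rows]   # universal POS tag is the 4th column

features = {
    "sentence_length": len(rows),
    # lexical density: share of content words (rough UPOS-based proxy)
    "lexical_density": sum(p in {"NOUN", "VERB", "ADJ", "ADV"}
                           for p in upos) / len(upos),
}
print(features)
```

Because the UPOS tags and dependency columns are the same across UD treebanks, the identical feature code runs unchanged on any language, which is what makes the approach inherently multilingual.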

Ongoing work is devoted to specializing the tool in several directions, from the operationalization of further linguistic features to its use in new application scenarios.
     

Arbitrary Collocations of Lithuanian: Identification, Description and Usage
(ARKA)

Erika Rimkutė, Loïc Boizou, Ieva Bumbulienė, Jolanta Kovalevskaitė, Jurgita Vaičenonienė

  We will present an ongoing project “Arbitrary Collocations of Lithuanian: Identification, Description and Usage (ARKA)” which aims to describe the criteria of arbitrary collocation identification for the Lithuanian language. This research is funded by a grant (No. S-LIP-20-18) from the Research Council of Lithuania.

Collocations can be seen as motivated or arbitrary (non-motivated); the latter are especially problematic for foreign language learners and translators. In contrast to motivated collocations, arbitrary ones cannot be identified on the basis of semantic restrictions, which are largely predictable as they depict non-verbal reality (e.g., mėlynas paltas / blue coat). The lexical restrictions which govern the collocability of arbitrary collocations are neither predictable nor necessarily universal (e.g., žila senovė / *grey old times). Combinations with abstract nouns in which one element is used in a figurative sense can be seen as a common structural type of arbitrary collocation (e.g., sveika aplinka / healthy environment; laikas bėga / time flies).

The database of 12,000 Lithuanian multi-word expressions compiled in the project “Automatic Identification of Lithuanian Multi-word Expressions (PASTOVU)” will be used to retrieve arbitrary collocations for the 50-100 most common nouns selected. The list of arbitrary collocations and information on their usage, together with accompanying lexicographical and teaching resources, will be uploaded to the database.
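
For illustration, one common statistical criterion for spotting collocation candidates in a corpus is pointwise mutual information (PMI); distinguishing arbitrary from motivated collocations then still requires the lexical analysis described above. The toy corpus below is invented.

```python
# PMI over a toy corpus: high PMI means the two words co-occur more often
# than their individual frequencies would predict.
import math
from collections import Counter

tokens = "time flies time flies time passes water flies".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    """Pointwise mutual information of an adjacent word pair."""
    p_xy = bigrams[(w1, w2)] / (N - 1)
    p_x, p_y = unigrams[w1] / N, unigrams[w2] / N
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("time", "flies"), 2))
```

Real pipelines combine several association measures with frequency thresholds, since PMI alone overrates rare pairs; the candidates are then filtered against the MWE database.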
     

Demo of SASTA and SASTAdev

Jan Odijk

I will demonstrate the SASTA application and the SASTAdev tool. This demo accompanies the paper on the semi-automatic analysis of spontaneous language for Dutch (see the extended abstract and PowerPoint presentation).

I will demonstrate the public web application version of SASTA and the SASTAdev tool, show the output files SASTAdev generates to assess the performance of the system and to identify problematic cases easily, and illustrate how specific language measures of the assessment methods have been implemented.
     

South African Centre for Digital Language Resources (SADiLaR): CLARIN outside Europe

Liané van den Bergh and Menno van Zaanen

 

As one of the two CLARIN centres outside Europe and the only one in Africa, the South African Centre for Digital Language Resources (SADiLaR) may be less visible than the other centres. SADiLaR consists of a hub, located at North-West University, and five nodes: the Council for Scientific and Industrial Research (CSIR), the NWU Centre for Text Technology (CTexT), the Inter-Institutional Centre for Language Development and Assessment (ICELDA), the University of Pretoria, and the University of South Africa. It runs two parallel programs: a digitization program, which focuses on the creation and dissemination of language resources for all eleven official South African languages, and a digital humanities program, which aims to build research capacity in the field of digital humanities (for instance through training and workshop events).

In this Bazaar stall, we would like to show what SADiLaR is doing. We provide an overview of the digital language resources (data and tools) available for the South African languages through the repository, and of the projects related to the digitization of such data. Additionally, we show the different training programs that SADiLaR offers.

As a centre that is geographically far away from most of the other CLARIN centres, we are interested in practical collaborations, which can take several forms, from collaborative research to participation in colloquia. Visit our virtual stall to discuss possibilities. (video)
     

INTELE: Language Technology Infrastructure

German Rigau

 

The general objective of the strategic INTELE network is to promote activities towards the official participation of Spain in CLARIN and DARIAH. The participation of Spain should contribute to the advancement of Spanish research in the humanities and social sciences, as well as to its strategic positioning in international projects and programs, mainly in the context of the European Research Area.

INTELE intends to connect the groups that are interested in these European research infrastructures, with the objective of promoting new multidisciplinary lines of research in the humanities and social sciences and of supporting their digital transformation with the help of language technologies. (poster)

     
     

Other 

Title of stall
 
Description
     

Corpora for Cybersecurity Term Extraction Project

Andrius Utka (Vytautas Magnus University), Aivaras Rokas (Vytautas Magnus University), Agnė Bielinskienė (Vytautas Magnus University), Sigita Rackevičienė (Mykolas Romeris University), Liudmila Mockienė (Mykolas Romeris University), Marius Laurinaitis (Mykolas Romeris University)

A team of researchers from two universities in Lithuania (Vytautas Magnus University and Mykolas Romeris University) has started the scientific project “Bilingual automatic terminology extraction”, the aim of which is to design a methodology for the automatic extraction of English and Lithuanian terms of a special domain from parallel and comparable corpora, and to create a bilingual termbase. Cybersecurity (CS) terminology was chosen as the special domain for the project. At our stall we present the ongoing work on the compilation of comparable and parallel corpora of the cybersecurity domain covering the period 2010-2020.

The investigation of cybersecurity sources reveals that this domain is highly heterogeneous: it encompasses diverse types of information accumulated in various genres of text. Most sources are suitable for the compilation of the comparable corpus, which will consist of original texts in English and Lithuanian. Meanwhile, the sources suitable for the parallel corpus (English original texts and their translations into Lithuanian) are much sparser.

At the current stage of compilation, the comparable corpus includes the following genres of texts: legislative and executive documents, official reports, academic publications, information publications for the general public, and mass media articles. The parallel corpus mostly includes EU legal acts and other documents extracted from the EUR-Lex database and other EU institutional repositories.

Thus, the collected sources will reflect the usage of CS terminology in a variety of text types developed in national and international settings and will provide rich material for automatic term extraction and the compilation of the cybersecurity termbase. The accumulated resources will be deposited in the CLARIN-LT repository.

The research is funded by the Research Council of Lithuania (LMTLT, agreement No. P-MIP-20-282) and is included as a use case in the COST action “European network for Web-centred linguistic data science” (CA18209). (poster)
     

Linking Latin. The LiLa Knowledge Base of Interlinked Linguistic Resources for Latin

Marco Passarotti, Rachele Sprugnoli, Giovanni Moretti, Flavio Massimiliano Cecchini, Greta Franzini, Eleonora Litta, Francesco Mambrini, Paolo Ruffolo, Marinella Testori

 

Several textual and lexical resources, as well as NLP tools, are currently available for Latin. However, over the years, more attention has been given to making linguistic resources for Latin (much like those for other languages) grow in size, complexity and diversity than to making them interact.

As a consequence, linguistic resources and tools for Latin currently often live in isolation, a condition which prevents them from benefiting the large research community of historians, philologists, archaeologists and literary scholars.

A current approach to providing a comprehensive overview of the annotations available in separate collections of linguistic (meta)data is to interlink linguistic resources following Linked Data principles, so that (meta)data are connected through links that can be queried semantically. The LiLa: Linking Latin project (2018-2023) was awarded funding from the European Research Council to build a Knowledge Base of linguistic resources for Latin based on the Linked Data paradigm, i.e. a collection of multifarious, interlinked data sets described with the same vocabulary of knowledge description (using common data categories and ontologies).

At our stall, we present the structure of the lexical basis of LiLa, which serves as the backbone of the Knowledge Base. We detail the architecture supporting LiLa, with a special focus on how we approach the challenges raised by harmonizing different strategies of lemmatization. Furthermore, we show an online software demo for querying the resources currently interlinked in the LiLa Knowledge Base. (poster)
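The core idea of the Knowledge Base, separately maintained resources that all point to a shared backbone of lemmas so they can be queried together, can be illustrated with a deliberately simplified sketch. All names, identifiers and the namespace below are hypothetical placeholders, not the actual LiLa ontology or data:

```python
# Toy illustration of the Linked Data idea behind a lemma-centred
# Knowledge Base: independent resources reference shared lemma URIs,
# so a query can traverse them jointly. Everything here is invented
# for illustration; it is not the real LiLa vocabulary or data.
LEMMA_BASE = "http://example.org/lemma/"  # placeholder namespace

# Resource 1: a treebank-like token list, each token linked to a lemma URI.
treebank_tokens = [
    {"form": "amat",  "lemma": LEMMA_BASE + "amo"},
    {"form": "amare", "lemma": LEMMA_BASE + "amo"},
]

# Resource 2: a lexicon keyed by the same lemma URIs.
lexicon_entries = {
    LEMMA_BASE + "amo": {"pos": "verb", "gloss": "to love"},
}

def gloss_of(form):
    """Follow a token's lemma link into the lexicon — a toy cross-resource query."""
    for tok in treebank_tokens:
        if tok["form"] == form:
            entry = lexicon_entries.get(tok["lemma"], {})
            return entry.get("gloss")
    return None
```

In an actual Linked Data setting the two resources would be RDF graphs and the traversal a SPARQL query over shared URIs; the sketch only conveys why a harmonized lemma backbone makes such joint queries possible.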

     

Pipeline to process and analyze Paris’s old property address directories (19th-20th centuries)

Gabriela Elgarrista, Frédérique Mélanie-Becquet, Carmen Brando, Mohamed Khemakhem, Laurent Romary, Jean-Luc Pinol

  A considerable volume of address directories has been published over the past centuries, disclosing historical information about European cities, their citizens and their ways of life. Researchers frequently consult such collections in search of historical facts. Nevertheless, the vast majority of these books remain hidden in library collections without being fully exploited by digital tools. The TGIR Huma-Num consortium Paris Time Machine therefore aims to facilitate the systematic analysis of these sources by historians, providing the opportunity to unveil the underlying information.

We are implementing a tool pipeline to process volumes of property address directories published between 1898 and 1923, containing some 350,000 address units along with owners’ names arranged in two-column lists. Our pipeline consists of tools widely used in the digital humanities for extracting textual information from digitized texts (OCR/HTR training). The extracted information is then structured in XML/TEI using GROBID-Dictionaries, and finally processed into a table in order to perform quantitative and spatial analysis with R and GIS.
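The final step of such a pipeline, turning semi-structured directory entries into a table, can be sketched in miniature. This is only an illustrative sketch under invented assumptions (the entry pattern, function names and sample lines are hypothetical); the actual pipeline relies on OCR/HTR, GROBID-Dictionaries and R/GIS:

```python
import csv
import io
import re

# Hypothetical sketch: turn OCR'd directory lines of the assumed form
# "<street>, <number>  <owner>" into structured rows, analogous to the
# XML/TEI -> table step of the pipeline. The pattern is invented for
# illustration and does not reflect the real directories' layout.
ENTRY = re.compile(r"^(?P<street>[^,]+),\s*(?P<number>\d+)\s+(?P<owner>.+)$")

def parse_entries(lines):
    """Parse raw directory lines into street/number/owner dicts,
    skipping lines the pattern does not match (e.g. OCR noise)."""
    rows = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            rows.append(m.groupdict())
    return rows

def to_csv(rows):
    """Serialize parsed rows to CSV for downstream quantitative analysis."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["street", "number", "owner"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample = [
    "rue de Rivoli, 12  Dupont (Veuve)",
    "???garbled ocr line???",
    "boulevard Voltaire, 48  Martin",
]
rows = parse_entries(sample)
table = to_csv(rows)
```

The noisy middle line is silently dropped; in practice such rejects would be logged and reviewed, since OCR errors are the main source of loss in this kind of workflow.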

The need to process these kinds of semi-structured texts is becoming more important in historical research and in DH. In recent years, we have seen a strong interest in dealing with dictionaries, art catalogues and notaries’ registers. The CLARIN Bazaar seems to be the place to stimulate discussion on how the CLARIN infrastructure could help fulfil such needs by providing, for instance, information extraction tools. In our presentation, we intend to showcase the current state of the system and discuss collaborations. (slides)
     

GOTRIPLE: Building an innovative discovery platform for the social sciences and humanities

Francesca Di Donato / Stefanie Pohle

 

TRIPLE - Transforming Research through Innovative Practices for Linked Interdisciplinary Exploration is an ongoing project funded under the European Commission program INFRAEOSC-02-2019 “Prototyping new innovative services”. TRIPLE started in October 2019 and will end in March 2023.

The TRIPLE consortium is made up of 19 partners from 13 different countries and involves almost 90 people with diverse disciplinary backgrounds who collaborate to develop the GOTRIPLE platform, an innovative multilingual and multicultural discovery solution for the social sciences and humanities (SSH). GOTRIPLE will provide a single access point to explore, find, access and reuse materials such as literature, data, projects and researcher profiles at European scale.

The GOTRIPLE platform is composed of a core component, built upon the Isidore search engine developed by Huma-Num (CNRS), and complemented by a variety of connected innovative tools: a web annotation service (Pundit), a Trust Building System and a recommender system. The front-end visualisations are based on open technologies developed by Open Knowledge Maps. Moreover, we plan to connect a crowdfunding platform to GOTRIPLE. To facilitate internal and external discussions among users, a Forum will be implemented as well.

GOTRIPLE will be one of the dedicated services of OPERAS, the research infrastructure supporting open scholarly communication in the social sciences and humanities in the European Research Area.

The proposed poster presents the goals of the TRIPLE project and the ways the project is addressing them, both through the work of its eight intertwined work packages and via collaboration with existing SSH research infrastructures, mainly CESSDA, CLARIN and DARIAH. (poster)

     

For general information on the Conference, see the event page.