Below you will find a list of the stalls that can be visited at the CLARIN Bazaar. Please go and talk to stallholders, see their wares, and share your ideas!
Bazaar presentations | Workshop presentations
Bazaar presentations
Title of stall | Description | |
---|---|---|
The European Open Science Cloud and the TextCrowd Pilot Project Thomas Zastrow |
The "European Open Science Cloud" (EOSC) tries to overcome the fragmentation of existing research infrastructures in Europe. The goal is to develop a collaborative approach that allows research data to be used across discipline-specific borders. In the EOSC pilot projects, various European research infrastructures and organizations are taking on the role of early adopters (Science Demonstrators). One of these Science Demonstrators is the TextCrowd pilot project [1]. TextCrowd will offer advanced text-based services addressing common research needs in the fields of cultural heritage and humanities. One example is enabling the semantic enrichment of text sources through cooperative, supervised crowdsourcing, based on shared semantics in a cloud-based fashion, and making this work available to researchers in other disciplines via EOSC. This would benefit many scientists in the long tail, even if delivering such a service presents real challenges around interoperability and multilingualism. TextCrowd will make use of a virtual research environment (VRE) powered by D4Science [2]. Tools for named entity recognition in various European languages were incorporated. Further cloud-based resources and services provided by EGI will be integrated into the TextCrowd Science Demonstrator. [1] https://eoscpilot.eu/science-demos/textcrowd |
|
Europeana and CLARIN: Cultural Heritage Data for the Digital Humanities Twan Goosen |
One of the lines of action of Europeana is to facilitate research on the digitised content of Europe’s galleries, libraries, archives and museums, especially for the digital humanities and the social sciences. Within the scope of Europeana's Digital Service Infrastructure (DSI), CLARIN implemented an integration of data sourced from Europeana into its infrastructure, allowing members of the CLARIN community to discover cultural heritage data from selected data sets via the familiar user interface of the Virtual Language Observatory and, moreover, apply […]. Europeana currently aggregates, enriches and disseminates over 50 million cultural heritage resources from hundreds of providers, distributed over many data sets. Thus far, a relatively small number of data sets has been selected to be harvested into CLARIN's infrastructure based on relevance and quality criteria. As Europeana has started to cater more specifically to research communities, this number can be expected to grow substantially in the near future. |
|
CLARIAH Media Suite Roeland Ordelman and Julia Noordegraaf |
At the Bazaar we will present the Media Suite, a portal developed as part of the Dutch Common Lab Research Infrastructure for the Arts and the Humanities (CLARIAH), focusing on audio-visual content including multimedia context collections. Earlier projects focused on the development of individual, specialised tools for media research, and were tested as prototypes using small, private collections or subsets extracted from Dutch archival institutions. Within CLARIAH, the aim is to integrate the functionalities of these ‘early prototypes’ into a sustainable research infrastructure that allows access to (ideally) full archives of Dutch institutions securing IPR and privacy restrictions, and provides scholars with the flexibility to deploy a set of robust tools for multimedia content analysis. At the Bazaar, we will present our work towards version 2.0, due December 2017, that includes centralised authentication, collection registry and selection, analytic metadata inspection, advanced single collection search, comparative search, explorative browsing, annotation using free labels, thesauri or links, and content presentation (play-out, viewing). ‘Under the hood’, the research infrastructure deploys automatic transcription and enrichment tools such as speech recognition and entity detection. |
|
CLARIN metadata best practices and support Menzo Windhouwer and Twan Goosen |
After specifying and implementing CMDI 1.2, the CLARIN task force set out to describe a set of best practices for modelling and authoring metadata within CLARIN. While this is still work in progress, a first draft has been completed and will be available for viewing at our bazaar stall. Members of the task force will be present to discuss the best practices with anyone interested. We appreciate any feedback or suggestions in this regard. Of course, we will also be more than happy to answer any other CMDI-related questions you may have! More information about the best practices (work in progress) can be found in our abstract "Component Metadata Infrastructure Best Practices for CLARIN", accepted for and to be presented at this year's annual conference. |
|
Long-term archiving of and access to Oral History data at DANS Ilona von Stein |
Oral history is the collection and study of historical information about individuals, families, important events, or everyday life using audiotapes, videotapes, or transcriptions of planned interviews. At DANS (Data Archiving and Networked Services), the Netherlands Institute for Permanent Access to Digital Research Resources, more than 3000 oral history datasets in over 75 collections are available. Oral history data in the DANS electronic archive EASY (www.easy.dans.knaw.nl) consist of the audiovisual files, metadata at collection and interview level, transcriptions of the audiovisual narratives, and additional related files (such as summaries, biographies and photos). The data are archived for the long term and, if the access/security level permits, made available to users as streaming versions and/or transcriptions. The purpose of the oral history stall at the CLARIN bazaar is to meet people from other centres and countries, exchange ideas and challenges, and talk about future collaborations. |
|
TLA-FLAT: a CLARIN-compatible Repository Solution Menzo Windhouwer |
The CLARIN B Centre and certifications place specific requirements on the repository system of a CLARIN centre. The TLA-FLAT repository solution developed by The Language Archive (a collaboration between the Meertens Institute and the MPI for Psycholinguistics) is designed to meet those requirements, but also to be easily adapted to the specific requirements of the centre itself. The repository is based on Fedora Commons and Islandora, but adds support for Component Metadata and a special software component: the DoorKeeper. The DoorKeeper executes a configurable sequence of actions, e.g., validation of the metadata and assignment of PIDs, to ingest data and metadata into the Fedora Commons repository. It does so in a way that is compatible with Islandora, so the wealth of solution packs for visualizing all types of resources can be used. At the bazaar the latest version of TLA-FLAT will be demonstrated. |
|
Ramble-On Navigator: Tracing Trajectories over Time Stefano Menini |
RAMBLE-ON is a freely available system that integrates a user-friendly visualisation interface to view persons' movements and a background information extraction engine that, given biographies, extracts people’s trajectories and enriches them with information to plot them on a map. |
|
Inforex - a web-based system for qualitative corpora annotation Marcin Oleksy |
Inforex (inforex.clarin-pl.eu/) is a web-based system for collaborative text corpora annotation and analysis. It was developed to construct corpus-based linguistic resources for various tasks in the field of natural language processing, but it is also used by scientists for other purposes. Inforex is part of the Polish CLARIN infrastructure. It is integrated with a digital repository for storing and publishing language resources (clarin-pl.eu/dspace/). Inforex supports manual text annotation on the semantic level, e.g. annotation of Named Entities (NE), anaphora, Word Sense Disambiguation (WSD) and relations between named entities. The system also supports manual text clean-up and automatic text pre-processing, including text segmentation, morphosyntactic analysis and word selection for WSD annotation. Inforex is being gradually developed thanks to constructive feedback from researchers in the Humanities and Social Sciences attending CLARIN-PL workshops on NLP tools and resources. |
|
BlackLab AutoSearch: Corpora for Everyone! Jan Niestadt |
What if linguistically annotating and searching any text was easy? Many researchers have textual data they'd like to search, analyze and share with others. But converting it to a supported format, tagging it with lemma and part-of-speech, and indexing it with a corpus search tool can be a daunting task, requiring detailed technical knowledge. BlackLab AutoSearch is designed to assist non-technical researchers with this process, using smart defaults and point-and-click customization. A simple demo version is already online, and a much improved version is currently in development. We'll demonstrate the next version of AutoSearch and explain how developing it also made BlackLab itself easier to use. |
|
Data Bridge - A Novel Discovery Tool Arcot Rajasekar |
Traditional discovery mechanisms are search-based and rely on keyword- or metadata-based search. These mechanisms help us find documents in a corpus based on word or metadata occurrences. The Data Bridge system provides a novel alternative approach by defining ‘signatures’, based on domain-specific analytical algorithms, for every document and comparing signatures to cluster similar documents using socio-metric network algorithms. The application of analytical algorithms for aiding search can be viewed as “deep indexing” that provides an abstract but focused level of discovery at a finer level of granularity than that provided by textual or schematic metadata. We have applied Data Bridge to social science polling data, clinical trials studies and schizophrenia demographic data to identify clusters of datasets of interest. |
|
Europeana Research Marjolein de Vos |
Europeana Research was established as a link between cultural heritage institutions and researchers. It recognizes that undertaking research on the digitised content of Europe’s galleries, museums, libraries, and archives has huge potential that should be exploited. But issues with regard to licensing, interoperability, and access can often impede the re-use of that data in research. Governed by an Advisory Board comprising renowned digital humanities experts, Europeana Research aims to help with these issues, liberating cultural heritage for meaningful academic re-use. We work on a series of activities to increase the use of Europeana data in research, and develop the content, capacity, and impact of Europeana by fostering collaborations between Europeana and the cultural heritage and research sector, as well as liaising with other digital research infrastructures and networks. |
|
DH Course Registry Tanja Wissik |
The DH Course Registry offers an online platform and underlying information system that provides a searchable registry with information on courses for the digital humanities. It currently covers a selection of DH courses offered by European academic organizations. Students, lecturers and researchers can search the database on the basis of the location, ECTS credits or the academic degrees that are awarded. The goal of the DH Course Registry is to provide information to: (i) students and researchers who intend to take up a study in the field of Digital Humanities, (ii) lecturers who are looking for examples of good practices in the DH field or want to promote their own DH-related teaching activities and material, and (iii) administrators who aim to attract and facilitate staff mobility and exchange. At the moment the registry contains 118 courses from 14 countries, including courses at BA and MA level as well as Summer Schools and other training courses. However, the registry is steadily growing and the bazaar is a good opportunity to enlarge the network. During the bazaar there will be the opportunity to interact with the platform and also to sign up and enter new courses. |
|
TextImager: a Distributed UIMA-based System for NLP Wahed Hemati, Tolga Uslu and Alexander Mehler |
More and more disciplines require NLP tools for performing automatic text analyses on various levels of linguistic resolution. Since computational power is rapidly increasing, analyzing big data in the range of terabytes or even petabytes is getting popular in NLP. However, the usage of established NLP frameworks is often hampered for several reasons: in most cases, they require basic to sophisticated programming skills, they have no mechanism to store big data, interfere with interoperability due to using non-standard I/O-formats and often lack tools for visualizing computational results. This makes it difficult especially for humanities scholars to use such frameworks. In order to cope with these challenges, we present TextImager, a distributed UIMA-based framework that offers a range of NLP and visualization tools by means of a user-friendly GUI. Using TextImager requires no programming skills. |
|
Hosting CLARIN services in the EGI cloud Gergely Sipos, Boris Parak and Willem Elbers |
EGI is an e-Infrastructure collaboration that provides advanced computing and data services for research and innovation. The collaboration operates a federated, publicly funded e-infrastructure that currently comprises more than 300 resource centers from Europe and beyond. Over the last decade this infrastructure has enabled Open Science conducted by over 50,000 researchers across the whole spectrum of science, from High-Energy Physics to Earth Sciences, Life Sciences, Chemistry, Astrophysics, and Humanities. In 2015 CLARIN approached EGI to find a cloud site where the centrally managed CLARIN services could be hosted. The CESNET cloud provider (Czech Republic) was found to be the ideal candidate and the following services were set up there:
CLARIN is planning to expand its cloud resource use by moving additional services to CESNET – such as its SVN and Trac servers. EGI HTC and cloud resources are open for the whole CLARIN community. |
|
Uptake of EUDAT services within the CLARIN infrastructure Daan Broeder and Willem Elbers |
The EUDAT Collaborative Data Infrastructure (CDI) is essentially a European e-infrastructure of integrated data services and resources to support research. This infrastructure and its services have been developed in close collaboration with over 50 research communities spanning across many different scientific disciplines and involved at all stage of the design process. Researchers can rely on innovative data services to support their research collaboration and data management. Additionally they benefit from a common service management framework delivered by CDI service providers and the connection between sites. With the CLARIN uptake plan CLARIN ERIC supports CLARIN centres to increase uptake of the EUDAT services, such as B2SAFE, B2STAGE and B2DROP, by liaising between the CLARIN and EUDAT stakeholders and providing technical support where needed. This poster aims to provide an overview, per EUDAT service, of the progress made on the integration of the EUDAT services into the various CLARIN centres. |
|
The PARTHENOS Project Sheena Bassett |
PARTHENOS aims at strengthening the cohesion of research in the broad sector of Linguistic Studies, Humanities, Cultural Heritage, History, Archaeology and related fields through a thematic cluster of European Research Infrastructures. This objective is pursued through the definition and support of common standards, the coordination of joint activities, the harmonization of policy definition and implementation, and the development of pooled services and of shared solutions to the same problems. Some of the key outputs are:
|
|
Software for the analysis of East Asian languages Martin Wynne |
Do you have software for Chinese, Japanese or other East Asian languages? CLARIN is planning a training workshop on analysis of social media and computer-mediated communication at the conference 'Digital Youth in East Asia: Theoretical, Methodological and Technical Issues' hosted by the East Asia Studies (EASt) research unit of the Université Libre de Bruxelles (ULB). I am gathering information on research software relevant for the analysis and exploration of East Asian languages, in order to be able to provide a brief and high-level overview of the opportunities for the workshop participants. Monolingual software or tools trained or tailored for these languages are all within scope. Come along and tell me about it! |
|
The Thing Recognizer Attila Novák and Borbála Siklósi |
Neural word embedding models trained on sizable corpora have proved to be an efficient means of representing meaning. However, the abstract vectors representing words and phrases in these models are not interpretable for humans by themselves. Although indirect interpretation of vectors based on the known meaning of nearest neighbors is possible, a good command of the language is often necessary for this. We present the Thing Recognizer, which assigns explicit symbolic semantic features from a finite list of terms to words present in an embedding model, making the model interpretable and covering the semantic space by a controlled vocabulary of semantic features. We do this in a cross-lingual manner, applying semantic tags taken from lexical resources in a resource-rich language (English) to the embedding space of a resource-scarce language (Hungarian). Nonetheless, the method is applicable within a single language as well, to assign a relatively accurate feature-based semantic representation to lexical items not present in the original semantic resource. |
|
Tools for corpus-driven research of Hungarian: the Hungarian Gigaword Corpus and the Verb Argument Browser Bálint Sass |
Two corpus query tools are presented which provide access to the Hungarian Gigaword Corpus. The original version of the corpus itself was created 15 years ago. There was a substantial upgrade recently; the second version has been online since 2014. Its size has increased from 200 million to 1 billion words, and the linguistic analysis has also become more detailed. The corpus contains texts from various genres, trying to represent the Hungarian language as well as possible. The first tool is a general-purpose corpus query system. It is essentially the NoSketchEngine system improved by a "detailed search" interface which, besides being able to query individual phonological features, provides easy access to the whole of Hungarian morphology. The other tool represents a different approach. This is the so-called "Verb Argument Browser" for investigating the argument structure of verbs. This tool reveals the typical usages of verbs in terms of how nouns (or adjectives) are collocated with verbs as certain dependents. Both tools are freely accessible with a shared registration at http://hnc.nytud.hu and at http://corpus.nytud.hu/vab. |
|
Discovery of hidden patterns of multimodal communication Laszlo Hunyadi |
Finding patterns in behaviour is challenging for at least five reasons:
Our analysis of multimodal behavioural patterns is based on the software environment Theme (Magnusson 2000, 2017), which aims to handle the above challenges appropriately. As input data we use the manual and machine annotation of the 50-hour Hungarian HuComTech corpus of dialogues. The automatic annotation included the use of the CLARIN-sponsored Webmaus as well as prosody stylization. |
|
DARIAH - Data Re-Use Charter Jessie Labov |
The Cultural Heritage Data Reuse Charter is being developed in order to frame the conditions of collaboration between Cultural Heritage Institutions and scholars. It simplifies information retrieval and transactions related to the scholarly use of cultural heritage data. The Charter is an online environment that allows users to declare their commitment to the reuse conditions expressed for each collection or object registered in the Charter. The key elements each of the parties commits to in terms of access, use and reuse are to be found there. As a trusted network of stakeholders, it generates reciprocal awareness and raises visibility and transparency. The Charter is conceived to complement, not replace, existing infrastructures and catalogues. The initiative comes from DARIAH but strives for a wider involvement of a community of interest that includes, for instance, infrastructures like CLARIN-EU, E-RIHS and Europeana, and affiliated projects such as HaS, IPERION-CH, EHRI and PARTHENOS. The Charter is neither a substitute for CHI content catalogues nor a copyright clearing environment. It provides information on reuse, recommendations and links to content that might be of interest for the user (catalogues, license information, etc.). |
|
CLARIN ERIC: Reaching out to researchers Darja Fišer, Jakob Lenardič, Karolina Badzmierowska and Leon Wessels |
Recently, CLARIN ERIC started a number of new initiatives to reach out to researchers with an interest in language data. By displaying what CLARIN has to offer, we hope to convince even more researchers to use our infrastructure. In this stall, three of these initiatives will be presented:
Are you interested in our overviews of parliamentary, newspaper or computer-mediated corpora? Do you want your consortium to be in the spotlight of Tour de CLARIN? Would you like to know what kind of content you can find on our Videolectures channel? Come to our stall and talk with us! We promise we won't bite (hard). |
|
e-magyar: a free, open, modular processing toolchain for Hungarian Tamás Váradi |
e-magyar.hu is a free, open, interoperable modular toolchain for Hungarian, the result of the collaborative effort of the Hungarian language technology community. The various modules are based on earlier existing tools; some of them, notably the morphological analyser, were thoroughly redesigned, with both the annotation scheme and the engine replaced. The e-magyar pipeline was implemented in the GATE framework, assuring interoperability with the services available under GATE. |
|
Semantic Annotation of Cultural Heritage Content Uldis Bojārs |
||
Workshop presentations
Title of stall | Description | |
---|---|---|
CLARIN DSpace Jozef Mišutka and Pavel Straňák |
CLARIN DSpace evolved from a long term project by the LINDAT/CLARIN centre in the Czech Republic. The reason for the transition from LINDAT/CLARIN DSpace to CLARIN DSpace was the increasing number of installations at various CLARIN centres. Having a sustainable long term solution requires active engagement from multiple stakeholders. Therefore, we started the transition to a CLARIN wide project last year. |
|
Transcribing Oral History Audio Recordings – the Transcription Chain Workflow Christoph Draxler |
There exist many oral history audio recordings, and more are made today. It is a major scientific and technological challenge to transcribe, analyse and archive these recordings, and to make both the audio and its content available. The Transcription Chain Workflow aims to facilitate the transcription of such audio recordings in two ways: 1) where possible, the recordings are processed by automatic speech recognition (ASR) to obtain a raw text transcript, which is then manually corrected by human transcribers; 2) if no ASR is available, human transcribers manually transcribe the recordings using modern web-based tools in a collaborative annotation process. The result of the transcription is a time-aligned text transcript with varying degrees of alignment granularity. The Transcription Chain Workflow faces major challenges:
Finally, the different scientific fields involved, e.g. oral history, linguistics, phonetics, speech technology, etc. have divergent requirements. It will be our task to tackle these challenges. |
|
Multilingual Text Annotation of Slovenian, Croatian and Serbian with WebLicht Tomaž Erjavec |
Linguistic annotation of text corpora is a prerequisite for corpus linguistics or any advanced exploration of the information content of language. While annotation tools do exist for many, if not all, CLARIN languages, they are often not available on-line, making them difficult for humanities researchers to use. In November 2016 a CLARIN workshop was organised in Slovenia, with the aim of gathering CLARIN members that have locally developed annotation tools and setting the stage to offer them as web services in the scope of the WebLicht architecture. Presentations were given on WebLicht and Croatian, Czech, Estonian, Italian, Latvian, Serbian, and Slovenian annotation tools, and an implementation plan was drafted. Currently, we can report on the CLARIN.SI suite of open source trainable tools, such as diacritic restoration, word normalisation and part-of-speech tagging with lemmatisation, which have been trained for three Slavic languages: Slovene, Croatian and Serbian, as well as the trial integration of the tagger/lemmatiser and dependency parser for these three languages with WebLicht. Future work includes offering more tools for more languages in WebLicht, localising basic WebLicht documentation to national languages, stress-testing the functioning of the tools in WebLicht, and a user-centred evaluation. |
|
CLARIN-PLUS Workshop: Working with Parliamentary Records Petya Osenova and Kiril Simov |
From 27 to 29 March 2017 the third CLARIN-PLUS Workshop was held in Sofia with the Institute of Information and Communication Technologies as its local host. The workshop aimed to discover the ways in which NLP technology, developed within CLARIN, would be helpful for: There were three invited contributions:
Two hands-on sessions were organized: on Talk of Europe best practices and on corpus analysis. The main issues detected in this area are as follows:
|
|
CLARIN-PLUS Workshop: Creation and Use of Social Media Resources Andrius Utka, Jolanta Kovalevskaitė, Jurgita Vaičenonienė |
The poster will present the CLARIN-PLUS Workshop “Creation and Use of Social Media Resources”, which took place in Kaunas, May 18-19, 2017. The workshop attracted researchers interested in social media data from computational linguistics, social sciences, psycholinguistics, corpus linguistics, language variation and other research domains. The aims of the workshop were: to demonstrate the possibilities of social media resources and natural language processing tools for researchers with a diverse research background who are interested in empirical research of language and social practices in computer-mediated communication; to promote interdisciplinary cooperation possibilities; to initiate a discussion on the various approaches to social media data collection and processing. |
|
CLARIN-PLUS Workshop: Working with Digital Collection of Newspapers Ineke Schuurman and Bram Vanroy |
In September 2016, the 2nd CLARIN-PLUS Workshop took place in Leuven (Belgium) on the premises of KU Leuven. Three types of participants were present:
The idea behind this workshop was to demonstrate how the application of language and speech technology tools and services on digital language material can advance humanities and social sciences research in fields other than linguistics, like history and sociology. And we also wanted to hear from them what their needs are:
We received lots of answers after the presentation and demonstration sessions, but especially in the discussion session at the end. There was one invited talk: "Tracing conceptual change in messy data (2): Self-reliance as boon and bane" by Joris van Eijnatten (Utrecht University). In this talk it already became clear that there are still problems to solve ... |
|
CLARIN workshop type I: Towards Interoperability of Lexico-Semantic Resources Maciej Piasecki |
The main goal of the workshop was to initiate work on improving the interoperability, usability and ease of access of CLARIN lexico-semantic resources (L-SRs), so as to make them more visible to H&SS users and better utilized in research applications. The key idea was to initiate coordinated development of a system of web services for accessing L-SRs and a common virtual Lexical Platform built on top of them. The platform is intended to be an open generic solution that will allow for effective linking, displaying and browsing of the rich variety of data included in CLARIN L-SRs. One of the functions of the platform will be a kind of federated search for L-SRs. The platform will be an open system, both released as open source code and open to all L-SRs. We can expect many potential installations and many web applications based on them. The main topics discussed during the workshop included: |
|