
The CLARIN Bazaar 2016

The following people will be at the CLARIN Bazaar. Please go and talk to them, see their wares, and share your ideas!

Name of stall-holder(s) Title of stall Description
Sara Tonelli The ALCIDE platform: ongoing work for the analysis of political discourse We would like to demonstrate ongoing work related to the platform ALCIDE, whose goal is to provide tools for the automated analysis of political discourse. In particular, we would like to discuss ongoing work on the extraction and visualisation of information related to the temporal dimension of discourse (i.e. whether the speaker refers to present, past or future, and why).
Federico Boschetti Web Services for Latin Morphological Analysis The objective of this stall at the Bazaar is the presentation of a demo, which provides services for morphological analysis of Latin texts both through a web user interface and a service returning a JSON object with the analyses.
The ideal audience consists of digital philologists, historical linguists, digital epigraphists and students in classics.
Demo online:
Source code on GitHub:
Jörg Knappen Presenting the Royal Society Corpus We present a new resource of interest to both linguists and historians: the Royal Society Corpus (RSC), built from the Proceedings of the Royal Society of London from 1665 to 1869. Our presentation will include the possibility to perform live queries on the corpus using CQPweb.
Alessia Bardi OpenAIRE: Open Access Infrastructure for Research in Europe OpenAIRE is a socio-technical network that supports the implementation and monitoring of Open Science policies, including Open Access to publications and research data:
• Implementation is enabled by a pan-European network of Open Access/Open Science experts – the National Open Access Desks (NOADs), present in every EU country and beyond. The NOADs work together to align national policies, define shared solutions and best practices, and coordinate outreach and advocacy activities through a range of targeted training events and support materials.
• Monitoring is achieved by means of an advanced data infrastructure consisting of a decentralized network of data sources, namely publication repositories, data repositories, and current research information systems, established by research institutions, individual scientific communities, and publishers. By harnessing the contents of “compatible” publication, data, software, and method repositories (both institutional and disciplinary) and linking them to other research entities (researchers, institutions, projects), OpenAIRE produces a 360° picture of the impact of European research funding. More information at
Algirdas Šukys, Rita Butkienė, Linas Ablonskis Editor of Business Vocabularies and its Application for Semantic Search We would like to present a tool (the SBVR editor) for specifying domain vocabularies, and to share our experience of using it and of working with vocabularies specified according to the SBVR standard. SBVR (Semantics of Business Vocabulary and Business Rules) is an adopted standard of the Object Management Group (OMG), intended to be the basis for formal and detailed natural-language declarative description of a complex entity, such as a business. The standard is based on linguistics, logic, and computer science. It provides a way to capture specifications in natural language and represent them in formal logic so that they can be machine-processed.
The SBVR editor can be used as a stand-alone tool or as an integral part of a larger system. We will demonstrate how SBVR-based vocabularies are specified, how they can be transformed into OWL 2 ontology schemas automatically, and how we have used this tool to create the lexicon for a natural language interface in a semantic search system.
Marc Kemps-Snijders Meertens software booth Showcasing the work in progress of the Meertens Institute: OpenSKOS, FLAT archive, MTAS multitier annotation search, Nederlab
Elena González-Blanco EVILINHD, a virtual research environment for creating DH projects including the LINDAT CLARIN repository EVILINHD is a virtual research environment recently created at LINHD, the Digital Humanities Innovation Lab at UNED. It lets users create different types of digital humanities projects: digital editions using XML markup with a self-developed cloud-based tagging tool and TEIPublisher, digital libraries using Omeka, and project websites using WordPress. All projects are registered and published in LINDAT, the CLARIN repository that has just been implemented in our CLARIN-K centre to build a project catalogue, combined with TaDiRAH controlled vocabularies for digital humanities projects.
Twan Goosen, Dieter van Uytvanck Software testing Good software requires good testing. There are various stages of software testing, some of which can be automated and some of which cannot. To deal with the latter, CLARIN proposes to establish a 'testers pool', consisting of people who are interested in trying out new versions of tools and applications at an early stage and reporting their findings to the development team. Software to be tested may be prototypes of new software or alpha or beta versions of upcoming releases of existing software. Depending on the needs of the development team, general feedback regarding functionality, usability or stability may be desired, while in other cases there may be a fixed test plan that needs to be carried out on various devices, operating systems, browsers etc. At our stall we can provide you with information about software testing and the plans for the testers pool, and if you feel like it you can sign up as a CLARIN software tester!
Claire Clivaz eTalks: multimodal literacies and academic publishing Academic publications and pedagogy have been deeply reconfigured by the emergence of a new kind of knowledge produced by multimodal literacies (text, image and sound together). Academic publishing needs a digital multimedia editing platform whose content can be carefully edited and quoted in detail, in the same way that printed sources are. Consequently, the Swiss Institute of Bioinformatics (Vital-IT, Lausanne, CH) is developing such a platform with the “eTalks”. The eTalks application is implemented via an easy-to-use editor interface, designed for use by researchers themselves, to create and edit original eTalks. This permits the linking together of images, sounds and textual materials with hyperlinks, enriching them with relevant information. The final release of eTalks allows complete ‘citability’ of its contents: each and every portion of the researchers’ talks can be precisely referred to and thus cited with a specific identifier, just like any traditional, paper-based academic publication but with all the potential for plural literacies. It is openly accessible and the code is open source, including guidelines for installing eTalks. It is notably developed in collaboration with the Erasmus+ project #dariahTeach. DRM (Digital Rights Management) is a key issue in such an open access editing platform.
João Silva (NLX Group, University of Lisbon) NLX's tools and resources for Portuguese Objective: Present the tools and resources developed in the group.
Audience: People interested in language resources and NLP tools for (European) Portuguese.
The Web demos, services, resources and tools may be found at
Claus Zinn The Language Resource Switchboard The CLARIN Language Resource Switchboard (LRS) aims at helping users to connect resources with the tools that can process them. The LRS lists all applicable tools for a given resource, lists the tasks the tools can achieve, and invokes the selected tool in such a way that processing can start immediately with little or no prior tool parameterization. We show how the LRS can be used in connection with the , and in a standalone fashion. The software demo complements our oral presentation on the LRS and seeks to elicit feedback from users and to encourage tool providers to make their tools available via the LRS.
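The matching step described for the LRS (listing applicable tools for a resource, grouped by the task they achieve) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tool names, the registry layout and the choice of media type plus language as matching criteria are hypothetical and do not reflect the actual LRS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical registry entry; not the real LRS data model."""
    name: str
    task: str
    mimetypes: frozenset
    languages: frozenset


# Toy registry with invented tools, for illustration only.
REGISTRY = [
    Tool("ToyTokenizer", "tokenisation",
         frozenset({"text/plain"}), frozenset({"en", "de", "nl"})),
    Tool("ToyParser", "dependency parsing",
         frozenset({"text/plain"}), frozenset({"nl"})),
    Tool("ToyAligner", "forced alignment",
         frozenset({"audio/x-wav"}), frozenset({"de"})),
]


def applicable_tools(mimetype, language, registry=REGISTRY):
    """Return {task: [tool names]} for tools accepting this resource."""
    by_task = {}
    for tool in registry:
        if mimetype in tool.mimetypes and language in tool.languages:
            by_task.setdefault(tool.task, []).append(tool.name)
    return by_task


if __name__ == "__main__":
    # A Dutch plain-text resource matches two of the three toy tools.
    print(applicable_tools("text/plain", "nl"))
```

Grouping the result by task mirrors the description above, in which the user first sees what can be done with a resource and then picks a tool.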
Arianna Ciula Work in progress: Collaborative Research Project on Modelling between Digital and Humanities In my stall I will present the work in progress conducted as part of an international collaborative project funded by the Volkswagen Foundation (scheme “Original – isn’t it?” New Options for the Humanities and Cultural Studies, Funding Line 2 Constellations), 2016-2017. In addition to myself, the other co-investigators in the project are Øyvind Eide (University of Cologne, Germany), Cristina Marras (ILIESI, CNR, Italy), and Patrick Sahle (University of Cologne, Germany). The project aims to link scholarly modelling as a formal and informal reasoning strategy across disciplinary boundaries, and to bridge between (digital) modelling in research and teaching.
Given that many theorisations on modelling have a linguistic basis or are anchored to concepts developed in philosophy of language, it would be useful to receive feedback from CLARIN participants.
As a basis for discussing the project, I will use the poster presented at the Digital Humanities Conference 2016, which will be available on my laptop screen; the full abstract is available at
Daan Broeder EUDAT services for CLARIN EUDAT is a general eInfrastructure that provides data management services for research communities. CLARIN is one of the founding partners of EUDAT, and we are currently executing a plan to integrate the EUDAT data replication service into CLARIN centre repositories. Besides that, EUDAT has several other services of interest to the CLARIN community, which we will also demonstrate.
Liesbeth Augustinus & Ineke Schuurman Querying parallel treebanks with Poly-GrETEL We present Poly-GrETEL, an online tool which enables syntactic querying in parallel treebanks. It is based on the monolingual GrETEL environment, and it allows users to query parallel treebanks using either an XPath expression or an example sentence in order to look for similar constructions. Currently Poly-GrETEL contains the Europarl parallel treebank for Dutch and English, which is automatically parsed and aligned on sentence and node level. By combining the example-based query functionality with node alignments, we limit the need for users to be familiar with the query language and the structure of the trees in the source and target language. In this way, Poly-GrETEL facilitates the use of parallel treebanks for comparative linguistics and translation studies.
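To give a feel for what querying a treebank with an XPath expression looks like, here is a minimal sketch using Python's standard library. The toy tree and its attribute names (`cat`, `rel`, `pos`, `word`) are assumptions chosen for illustration; they are not taken from Poly-GrETEL or its treebanks, and the standard library supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-made parse tree for "the committee adopted a resolution".
TOY_TREE = """
<node cat="smain">
  <node cat="np" rel="su">
    <node pos="det" rel="det" word="the"/>
    <node pos="noun" rel="hd" word="committee"/>
  </node>
  <node pos="verb" rel="hd" word="adopted"/>
  <node cat="np" rel="obj1">
    <node pos="det" rel="det" word="a"/>
    <node pos="noun" rel="hd" word="resolution"/>
  </node>
</node>
"""


def find_phrases(tree_xml, xpath):
    """Return the word string of every subtree matched by the XPath query."""
    root = ET.fromstring(tree_xml.strip())
    phrases = []
    for match in root.findall(xpath):
        # iter() walks the matched node and its descendants in document order.
        words = [n.get("word") for n in match.iter("node") if n.get("word")]
        phrases.append(" ".join(words))
    return phrases


if __name__ == "__main__":
    # All noun phrases, then only those functioning as subject.
    print(find_phrases(TOY_TREE, ".//node[@cat='np']"))
    print(find_phrases(TOY_TREE, ".//node[@cat='np'][@rel='su']"))
```

An example-based interface spares users from writing such expressions by hand, which is the motivation given in the description above.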
Geneviève Romier on behalf of EGI EGI The objective is to present EGI to the attendees.
EGI is a publicly funded federation of over 300 computing and data centres spread across Europe and worldwide.
EGI has over 45,000 users from a wide range of fields.
EGI provides access to over 650,000 logical CPUs and 500 PB of disk and tape storage.
EGI offers a wide range of services for compute, storage, data and support.
More about EGI:
Go Sugimoto DBpedia enrichment - (I can’t get no) satisfaction, so start me up That’s what the Rolling Stones say, and I have a similar opinion in a different context. Do you think Wikipedia is excellent? Yes. What about DBpedia? Well, yes, but I’m not completely happy… It is good that there are now a lot of projects providing Linked Open Data (LOD) in Cultural Heritage and Digital Humanities, but many of them seem to be satisfied after linking to DBpedia and other services. They often do little afterwards. My question is whether DBpedia really generates the new knowledge that people have expected. And, if so, what inferences can we make, for example?

This pitch is about early work in progress to demonstrate an idea for enriching DBpedia in an attempt to unlock its full potential and achieve fully-fledged knowledge generation, using some Natural Language Processing techniques. You will find out why DBpedia is not satisfactory and how I plan to pave the way to the next level. I am also looking for collaborators who may be able to work on linguistic algorithms and software development for this idea. If you have any good ideas and suggestions, you are more than welcome. “Hey hey hey, that's what I say”.

Jan Niestadt BlackLab: six years of search BlackLab, started in 2010, is a fast, well-tested and feature-rich corpus search library and web service based on Apache Lucene. CLARIN Federated Content Search 2.0 will be fully supported soon. Other current work includes: optimizing particularly complex and demanding queries, scaling to ever larger corpora, and making it even easier to work with for end-users and computational linguists alike. Check out the project webpage, with an introduction, guides and reference documentation, or go straight to the source. Please stop by and chat, whether you're already a happy user or just curious!
Pavel Straňák Open Source at LINDAT/CLARIN We will present an overview of software development at LINDAT/CLARIN and our experience with the Open Source model. We develop all LINDAT/CLARIN services in the open with public GitHub repositories, issue tracking and wiki pages, as well as adoption of the typical GitHub workflow. See projects at

We want to present our positive experience and promote a similar development model, one that fosters cooperation, for all CLARIN centres and services.

Jozef Misutka integration: How-To integrate into your application Meant for web application maintainers and developers who would like to use PIDs (handles) in their system, e.g. to reference a particular query over a database in a specific version. The handles are monitored, have special error pages if not resolvable, and contain metadata that can identify the query.
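As a rough illustration of referencing a database query through a handle, the sketch below pairs a handle with the metadata needed to identify the query and builds the URL under the public Handle.Net proxy. The handle value, the field names and the helper functions are hypothetical; the service presented at this stall may work differently.

```python
import re
from urllib.parse import quote

# Public Handle.Net proxy; handles have the form "prefix/suffix".
HANDLE_PROXY = "https://hdl.handle.net/"
_HANDLE_RE = re.compile(r"^[0-9][0-9.]*/\S+$")


def handle_url(handle):
    """Build the proxy URL for a handle such as '12345/EXAMPLE-1'."""
    if not _HANDLE_RE.match(handle):
        raise ValueError(f"not a valid handle: {handle!r}")
    prefix, suffix = handle.split("/", 1)
    # Percent-encode the suffix so unusual characters survive in a URL.
    return HANDLE_PROXY + prefix + "/" + quote(suffix, safe="")


def describe_query(handle, query, db_version):
    """Hypothetical record tying a PID to a query and a database version."""
    return {
        "pid": handle,
        "url": handle_url(handle),
        "query": query,
        "db_version": db_version,
    }


if __name__ == "__main__":
    # '12345/EXAMPLE-1' is an invented handle used only for illustration.
    print(describe_query("12345/EXAMPLE-1", "lemma = 'bazaar'", "2016-10"))
```

Storing the query text and database version alongside the handle is what allows the same result set to be identified later.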


Michael Beißwenger, Darja Fišer, Tomaž Erjavec Network: CMC and Social Media Corpora for the Humanities The stall represents a bottom-up network of researchers interested in combining language-centered research on computer-mediated communication (CMC) and social media in linguistics, philologies, communication sciences, media and social sciences with research questions from the fields of corpus and computational linguistics, language technology, text technology and machine learning. The network is organizing a conference series (CMCCORPORA), with previous events held in Germany, France and Slovenia, and is driven by the joint interest in exchanging tools and best practices and in creating standards and quality criteria for CMC and social media corpora. Proceedings of the most recent edition of the conference have been published online. We're always looking for new collaborators!
Petya Osenova CLARIN-PLUS Workshop on Parliamentary data in 2017 I would like to introduce the CLARIN community to the CLARIN-PLUS Workshop on Parliamentary data, which will be held in Sofia in March 2017. The aim of the workshop is to demonstrate the application strength of language and speech technology in the area of parliamentary records. Three types of participants are envisaged: 1. SSH researchers who work with parliamentary proceedings data; 2. curators or creators of parliamentary proceedings data; 3. experts in speech and language technologies useful for enhancing or analysing parliamentary data.


Andrius Utka, Jurgita Vaičenonienė CLARIN-PLUS workshop "Creation and Use of Social Media Resources" Event:
From Thursday, 26 to Friday, 27 January, 2017 (preliminary), the CLARIN-PLUS workshop “Creation and Use of Social Media Resources”, organized by the Lithuanian national consortium CLARIN-LT, will take place in Kaunas, Lithuania.
The aims of the workshop are to show what NLP tools and social media resources can offer researchers from diverse backgrounds, not limited to computational linguistics, and to demonstrate possibilities for interdisciplinary cooperation. The target participants of the workshop include:
• Specialists of discourse analysis and corpus linguistics;
• Researchers of social science and medicine;
• Computational linguists and NLP professionals.


Dr. Klára Vicsi Multilingual database collection for different speech disorders: depression, Parkinson’s disease and pathological speech Quality assessment of speech is of increasing interest in diagnostic systems. Identifying the cause or causes of a voice disorder is the first key step in its treatment. Many acoustic characteristics of speech can be extracted that are useful in diagnosing a person’s condition. The general goal is to build a multilingual automatic tool that can recognize disordered speech, whether it is caused by the deterioration of neurons in the brain (Parkinson’s disease), by complex human emotions (depression), or simply by a defect of the vocal organs (pathological speech). To achieve this objective, a multilingual speech corpus must be created. We encourage you to join us in collecting data and creating such a database as part of a joint collaboration.
Arcot Rajasekar Data Bridge - A Sociometric System for Long-tail Science Data Collections The massive number of relatively small datasets gathered and generated by individual scientists and groups forms a distinct class of Big Data called “long tail of science” data, and harnessing their hidden power is crucial for advancing science. The main challenge facing the science community in the Big Data era is the difficulty of discovering relevant datasets across distributed repositories. Effectively discovering and identifying relevant data is the last-mile problem for long tail of science data. We are developing a system called the DataBridge that applies 'signature' and 'similarity' algorithms to semantically bridge large numbers of diverse datasets into a 'sociometric' network. By providing a venue for defining complicated search criteria through pattern analysis, feature extraction and other relevance criteria, DataBridge provides a highly customizable search engine for scientific data.
Tomaž Erjavec, Simon Krek, Darja Fišer, Cyprian Laskowski, Andrej Pančur, Iza Škrjanec, Jakob Lenardič Overview of activities We will present an overview of the activities of the Slovenian national CLARIN consortium in the past year that are relevant to other national CLARIN consortia, potential users of our activities and future collaborators. More information about our activities can be found on our webpage:
Kadri Vare, Olga Gerassimenko, and Sebastian Drude Usability testing and feedback This is related to CLARIN-PLUS Task 3.3.3 on usability testing and improvement.
We will collect feedback on the general usability of CLARIN, in particular the website and some central services.
We will also run a short usability test by asking visitors to carry out a short task, which we will monitor.
Finally, we will present draft instruction videos (screencasts etc.) on central CLARIN services and gather feedback on them.
The details of the test will be finalized at the usability workshop on Wednesday (pre-conference workshop).
Christoph Draxler BAS Web Services The Bavarian Archive for Speech Signals (BAS) provides a number of multilingual web services to the CLARIN community. These include, amongst others, a grapheme-to-phoneme converter (G2P), an automatic phonetic segmentation system (WebMAUS), and a metadata generator for speech corpora (COALA). Furthermore, BAS has developed SpeechRecorder, a platform-independent software for scripted speech recordings, and percy, a tool for online perception experiments.

At the CLARIN-EU conference, BAS will demonstrate the workflow from the recording to a phonetic segmentation using CLARIN tools. The target audience are researchers in the field of spoken language, i.e. phoneticians, linguists and speech technologists, plus developers who want to include the web services in their tools or workflows.

The CMDI Task Force (Twan Goosen, Menzo Windhouwer, et al) Component Metadata: CMDI 1.2 This year, CLARIN's CMDI task force has released a new version of the metadata framework CMDI. Version 1.2 fixes a number of issues and introduces new features designed to improve the quality and usability of metadata within CLARIN. Among the improvements are the option to use external vocabularies (e.g. from CLARIN's vocabulary service CLAVAS) as open vocabularies and provide multilingual documentation. A largely re-implemented toolkit with enhanced validation capacities is part of the release. The task force also has written a specification of CMDI 1.2, providing full technical details for metadata modellers, software developers and other users of CMDI requiring such information. CMDI 1.2 is now ready to be used by anyone, while the infrastructure keeps supporting existing CMDI metadata (version 1.1) alongside the current version. Facilities for upgrading existing metadata are provided in the CMDI toolkit. This toolkit, the specification document and more information can be found at At our stall, we will demonstrate CMDI 1.2 and answer all of your questions regarding component metadata!
Chris Cieri NIEUW - Novel Incentives and Engineering Unique Workflow Novel Incentives and Engineering Unique Workflow (NIEUW) is an initiative, funded by the US National Science Foundation, to increase the supply of language resources by increasing the number and types of incentives that encourage people to contribute both raw data and judgment. It provides tools for language professionals, learning experiences for citizen linguists and entertainment for game players, all while collecting data and annotations. Each activity has a coordinated set of incentives, workflows and processing routines to maximize the benefit of the data. NIEUW is led by LDC and is building international collaboration to increase both its linguistic scope and its impact.

