On this page you can find a list of available language resources that give an example-based impression of what researchers can do with CLARIN's functionality.

CLARIN Resource Families

The aim of the CLARIN Resource Families initiative is to provide a user-friendly overview of the corpora available in the CLARIN infrastructure for researchers in the digital humanities, social sciences and human language technologies. The overviews are organized according to the types of data in the corpora and include listings of corpora sorted by language.


The Portal for the Presentation of Slovene Language Resources and Tools

The Portal for the Presentation of Slovene Language Resources and Tools is a library of video tutorials that present the purpose, content, and structure of various freely available digital language resources and tools for Slovene. 

plWordNet 3.0 – Słowosieć 3.0

plWordNet is a lexico-semantic network which reflects the lexical system of the Polish language. It is now the largest wordnet in the world and is still growing.

Lärka (English LARK) - Language Acquisition Reusing Korp

Showcase: Lärka - “LÄR språket via KorpusAnalys”, Språkbanken's platform designed for learning Swedish based on principles of Intelligent Computer-Assisted Language Learning.

The Old Bailey Corpus 2.0, 1720-1913

The Old Bailey Corpus (OBC) is a sociolinguistically, pragmatically and textually annotated corpus based on a selection of the Proceedings of the Old Bailey, the published version of the trials at London's Central Criminal Court. 

Word level based comparative text analysis

Many questions in the humanities that relate to specific text resources can be reduced to the analysis of vocabulary. The comparison of such vocabularies is of particular interest. This may require comparing two of one's own text resources, or comparing a text resource with a reference corpus. CLARIN makes it easy to perform such comparative analyses using the resources and Web tools it provides. The following showcase demonstrates this on the basis of a simple example. It covers the discovery and selection of resources, their processing and finally their analysis.
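To give an impression of what such a comparison involves, here is a minimal Python sketch that ranks the words of a text by how much more frequent they are than in a reference corpus. It is an illustration only, not the method used by the CLARIN Web tools: the function names, the simple regular-expression tokenizer and the add-one smoothing are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def log_ratio(target_tokens, reference_tokens):
    """Map each word to log2 of (relative frequency in target / in reference).

    Add-one smoothing (an assumption of this sketch) keeps words that are
    missing from one side from causing a division by zero.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    vocab = set(target) | set(reference)
    t_total = sum(target.values()) + len(vocab)
    r_total = sum(reference.values()) + len(vocab)
    return {
        w: math.log2(((target[w] + 1) / t_total) / ((reference[w] + 1) / r_total))
        for w in vocab
    }


def top_keywords(target_text, reference_text, n=5):
    """Return the n words most characteristic of the target text."""
    scores = log_ratio(tokenize(target_text), tokenize(reference_text))
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]
```

Words with a positive score are relatively more frequent in the text under study, and words with a negative score are relatively more frequent in the reference corpus.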

DiaCollo: collocation analysis in diachronic perspective

DiaCollo is a software tool for the discovery, comparison, and interactive visualization of the typical word combinations for a user-specified target term. Characteristic word combination profiles based on various underlying text corpora can be requested for a particular time period, as well as direct comparisons between different time periods. In addition to traditional static tabular display formats, a number of intuitive interactive online visualizations for query result data are also available.
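The idea of comparing word combination profiles across time periods can be sketched in a few lines of Python. This is a simplified illustration, not DiaCollo's own algorithm or interface: the function names, the whitespace tokenization, the fixed context window and the use of raw co-occurrence counts instead of an association score are all assumptions of this example.

```python
from collections import Counter, defaultdict


def collocates_by_period(docs, target, window=2, slice_years=10):
    """Count words within `window` tokens of `target`, grouped by time slice.

    `docs` is an iterable of (year, text) pairs; each slice is labelled by
    its first year (e.g. 1990 for 1990-1999 when slice_years is 10).
    """
    profiles = defaultdict(Counter)
    for year, text in docs:
        tokens = text.lower().split()
        period = year - year % slice_years
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    profiles[period][tokens[j]] += 1
    return dict(profiles)


def profile_diff(profiles, period_a, period_b):
    """Return each collocate's count in period_b minus its count in period_a."""
    a = profiles.get(period_a, Counter())
    b = profiles.get(period_b, Counter())
    return {w: b[w] - a[w] for w in set(a) | set(b)}
```

A positive difference marks a collocate that became more frequent around the target term in the later period, and a negative difference marks one that became less frequent.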

Nederlab, online laboratory for humanities research on Dutch text collections

Nederlab is a user-friendly, tool-enriched open access web interface that aims to contain all digitized texts relevant to the Dutch national heritage and the history of Dutch language and culture (c. 800 - present).

Spokes - a conversational corpus search engine

Corpora of spontaneous conversational speech are an important source of primary data for research in the humanities. Spokes is a multimedia search engine for a unique corpus of conversational Polish, which has been developed by the University of Łódź as part of the Polish CLARIN Infrastructure.

Poly-GrETEL Search Engine for Querying Syntactic Constructions in Parallel Treebanks

Poly-GrETEL is an online tool which enables syntactic querying in parallel treebanks. It is based on the monolingual GrETEL environment.

GrETEL Search Engine for Querying Syntactic Constructions in Treebanks

GrETEL is a query engine in which linguists can use a natural language example, rather than a formal search instruction, as the starting point for searching a treebank. This provides a convenient way for novice and non-technical users to work with treebanks despite limited knowledge of the underlying tree representations and formal query languages. By allowing linguists to search for constructions similar to the example they provide, it aims to bridge the gap between descriptive-theoretical and computational linguistics.

OpenSoNaR

OpenSoNaR is an online system for analyzing and searching the large-scale Dutch reference corpus SoNaR. Due to the size of the corpus (500 million words), accessing the information contained in the dataset has proven to be difficult for less technically inclined researchers. OpenSoNaR facilitates the use of the SoNaR corpus by providing a user-friendly online interface.

ABaC:us – Austrian Baroque Corpus

The Austrian Baroque Corpus is a digital collection of printed German language texts dating from the Baroque era, now freely available through the Austrian Centre for Digital Humanities. The collection holds several texts specific to the memento mori genre written by, or ascribed to, Abraham a Sancta Clara (1644-1710), who was a renowned Augustinian monk and a widely read author throughout Europe in his time.

Gesta Danorum

This showcase is an example of how language technology can be exploited in research within the humanities. The resource that this case is based on is Gesta Danorum written about 1200 by the Danish historian, Saxo. Gesta Danorum is written in High Latin and describes in 16 books the period of time from King Dan to Canute VI of Denmark.

Keeleveeb Query

Although there are several online Estonian dictionaries, both monolingual and bilingual, they are sometimes difficult to find, and using them simultaneously is complicated. The idea behind "keeleveeb" (which translates as "language web" in English) is to carry out a single query over many language resources: dictionaries and corpora.

The Glossa corpus search system

At the Text Laboratory, University of Oslo, we are currently developing a new version of our corpus search and post-processing tool Glossa. Glossa allows a user to search a text corpus, or a set of corpora, for one or more words, phrases or grammatical constructions.

WebMAUS: Automatic Segmentation and Labelling of Speech Signals over the Web

The web application WebMAUS allows the user to automatically align speech recordings to their corresponding text form. Two input files need to be uploaded by the user: a media file containing a recorded speech signal and a file containing some textual encoding of the words spoken in the recording. If the latter is plain text, its contents are text-normalized and tokenized into a chain of words. The application then produces a phonological pronunciation encoding of the content in SAMPA (Speech Assessment Methods Phonetic Alphabet), which essentially reflects the standard citation pronunciation of the content. From this phonological form, a machine-learned expert system creates a statistically weighted graph of all possible realisations (pronunciation variants) within the selected language. Finally, this graph is aligned to the speech signal using standard techniques from automatic speech recognition. The result of this process is an orthographic and a phonetic alignment (segmentation and labelling, S&L) of the recorded speech, which is then rendered into the desired target format (BPF, Emu, TextGrid) and returned to the user via the web browser.

Detailed interactive mapping of migration in The Netherlands in the 20th century

People do not always stay in the place where they were born. The website "Migration in The Netherlands in the 20th century" presents maps that show the dispersion of people during the 20th century at the level of municipalities. The maps show the distances people move from their ancestors, for several generations. This is important to understand the dispersion of, for instance, dialects, family names, traditions and cultural expressions. Per municipality it is shown where the ancestors of the current residents came from, and where descendants of the inhabitants of a century ago were born and where they live nowadays.

WordTies

WordTies is a web interface developed to visualize monolingual wordnets as well as their alignments with wordnets in other languages. Wordnets are a kind of lexical-semantic dictionary in which concepts are related to other concepts in the language via semantic relations. In the WordTies browser, these semantic relations are presented in a more intuitive and graphical fashion than in most other wordnet browsers.