
CLARIN Federated Content Search - version 2

Submitted by Leif-Jöran Olsson on

The CLARIN Federated Content Search, CLARIN-FCS 2.0, has been released. This is a new major release that brings groundbreaking new features and capabilities to the linguistic search of corpora.

The goal of CLARIN-FCS is to provide an interface specification that makes it possible to search different text repositories within a unified framework. It does so by decoupling search engine functionality from its exploitation, which allows services to access heterogeneous search engines in a uniform way.

This means that CLARIN-FCS augments your search engine rather than replacing it. While still keeping a low threshold for integration, version 2.0 introduces a few new features:

  • A new Query language
  • A matching display of query results, the AdvancedDataView, ADV, with layer capabilities
  • Backwards compatibility to earlier versions

We believe that these additions to CLARIN-FCS will not only give power users more possibilities when querying repositories, but also make it easier for less experienced users to explore different corpora.

The Federated Content Search

CLARIN-FCS is a means to search one or more repositories, which may use different search engines, through a set of software components.

The Endpoint is a software component that acts as a bridge between the user and the Search Engine by implementing the CLARIN-FCS Transport Protocol. The Search Engine is a custom software component that can search the language resources in a Repository. The role of the Endpoint is to mediate between the CLARIN-FCS specific formats and the formats used by the custom search engine: it translates the incoming request into the engine's own format and, once the Search Engine returns its result, translates that result back and sends it to the requester.
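The mediating role described above can be illustrated with a minimal sketch. It does not implement the CLARIN-FCS Transport Protocol; the names Hit, ToySearchEngine and Endpoint.search, as well as the quoted-term query form, are assumptions made up for this illustration. The point is only that the Endpoint translates in both directions, so the engine never sees the common format and the caller never sees the engine's native format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Common result format handed back to the caller (hypothetical)."""
    left: str
    keyword: str
    right: str


class ToySearchEngine:
    """Stand-in for a custom search engine with its own native formats."""

    def __init__(self, sentences):
        self.sentences = sentences

    def find(self, needle):
        # Native result format: (sentence index, token index) pairs.
        matches = []
        for s_idx, sentence in enumerate(self.sentences):
            for t_idx, token in enumerate(sentence.split()):
                if token.lower() == needle.lower():
                    matches.append((s_idx, t_idx))
        return matches


class Endpoint:
    """Mediates between the common formats and one specific engine."""

    def __init__(self, engine):
        self.engine = engine

    def search(self, query):
        # Translate the request: here a quoted term becomes a bare needle.
        needle = query.strip().strip('"')
        hits = []
        # Translate the native result back into the common Hit format.
        for s_idx, t_idx in self.engine.find(needle):
            tokens = self.engine.sentences[s_idx].split()
            hits.append(Hit(
                left=" ".join(tokens[:t_idx]),
                keyword=tokens[t_idx],
                right=" ".join(tokens[t_idx + 1:]),
            ))
        return hits


engine = ToySearchEngine(["The cat sat", "A dog and a cat"])
hits = Endpoint(engine).search('"cat"')
```

Swapping in a different engine only requires a different translation inside the Endpoint; the caller keeps working with the same Hit objects.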

At the fingertips of the user is The Aggregator, which represents the user interface of the search and is commonly reachable at a website that is open to the public. The Aggregator connects to Endpoints and keeps track of their capabilities. It also collects the search results that the Endpoints send back and displays them in the UI.
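As a rough sketch of this bookkeeping, the class below keeps a registry of endpoints with their capabilities and only forwards a query to those that support the requested one. The class layout and the capability labels "basic" and "advanced" are hypothetical placeholders chosen for this example, not identifiers from the specification, and each endpoint is represented by a plain callable.

```python
class Aggregator:
    """Toy aggregator: tracks endpoint capabilities and collects results."""

    def __init__(self):
        # endpoint name -> (set of capability labels, search callable)
        self.endpoints = {}

    def register(self, name, capabilities, search_fn):
        self.endpoints[name] = (set(capabilities), search_fn)

    def search(self, query, capability="basic"):
        # Ask only the endpoints that advertise the needed capability,
        # and keep the results grouped per endpoint for display.
        results = {}
        for name, (capabilities, search_fn) in self.endpoints.items():
            if capability in capabilities:
                results[name] = search_fn(query)
        return results


aggregator = Aggregator()
aggregator.register("corpus-a", ["basic"], lambda q: [f"a:{q}"])
aggregator.register("corpus-b", ["basic", "advanced"], lambda q: [f"b:{q}"])

basic_results = aggregator.search("cat")
advanced_results = aggregator.search("cat", capability="advanced")
```

A real aggregator would query the endpoints over the network and concurrently; the sequential loop here is kept only for readability.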

A new Query language

Basic queries in CLARIN-FCS are performed in the Contextual Query Language (OASIS-CQL), but version 2.0 introduces the Federated Content Search Query Language, FCS-QL. FCS-QL is a superset of Corpus Workbench’s CQP version 3.0, a powerful language well known to most linguists, with added functionality for federated searches in different repositories.

CLARIN-FCS 2.0 defines a set of searchable annotation layers with certain semantics and syntax. A Layer identifies an annotation of a certain type. The ones available in CLARIN-FCS 2.0 are:

  • token - An appropriate tokenisation/segmentation of the resource, like words
  • lemma - Lemmatisation of tokens
  • pos - Part-of-Speech (PoS) annotations in the UD-17 tagset
  • orth - Orthographic transcription of (mostly) spoken resources
  • norm - Orthographic normalization of (mostly) spoken resources
  • phonetic - Phonetic transcription in SAMPA
  • text - Annotation layer used in Basic Search.

In CLARIN-FCS version 2.0 you can use several annotation layers of the same type, but the types should be internally non-overlapping.

The graphical user interface of The Aggregator now makes it possible to create queries by choosing annotations and combining layers. Depending on the annotation you choose, e.g. PoS, the graphical Query builder of The Aggregator supports you by offering the valid values in drop-down lists, as illustrated below. Users who are not experienced linguists will thus have a greater opportunity to use the corpus repositories and perform queries with the new CLARIN-FCS 2.0.

As an alternative to the graphical Query builder, experienced users can formulate and enter the query directly in the query language itself. In the figure "Hockey" we made a query expression for the well-known ice hockey player Tomas Surovy, to see which verbs are used to describe his actions: [word='Surovy'][pos='VERB'].
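To make the relation between a query builder and the textual query concrete, here is a minimal sketch that assembles an expression of the shape shown above from (layer, value) pairs, one pair per token position. The function build_query is hypothetical and is not how The Aggregator is implemented. The set of accepted names is an assumption: it contains the layers listed in this article plus "word", which the example query uses. Only the simple "one constraint per token" form is covered.

```python
# Layer names taken from the article's list, plus "word" as used in its
# example query. Treat this set as an assumption, not as the specification.
KNOWN_LAYERS = {"word", "token", "lemma", "pos", "orth", "norm", "phonetic", "text"}


def build_query(constraints):
    """Build a token-sequence query from (layer, value) pairs.

    Each pair constrains one token position, in order.
    """
    parts = []
    for layer, value in constraints:
        if layer not in KNOWN_LAYERS:
            raise ValueError(f"unknown layer: {layer!r}")
        if "'" in value:
            # Quoting rules are out of scope for this sketch.
            raise ValueError("values containing a single quote are not supported")
        parts.append(f"[{layer}='{value}']")
    return "".join(parts)


query = build_query([("word", "Surovy"), ("pos", "VERB")])
```

With the two selections from the hockey example, the function returns the same expression that an experienced user would type by hand.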

Advanced Data View

Another exciting and major new feature of CLARIN-FCS 2.0 is the AdvancedDataView, ADV. This is a display function, part of the Aggregator user interface, that gives a better display of query results. The ADV can highlight the different layers of the search result. This graphical display of results makes it easier to do a first exploration of the data and also serves as a foundation for future hierarchical presentations. The AdvancedDataView holds the information of each available layer of any resource. ADV even supports several layers of the same type, for example annotations produced by different tools such as PoS taggers or phonetic transcribers.
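One way to picture such a layered result is as a set of parallel annotation rows aligned token by token. The sketch below is a plain-text illustration only; it is not the ADV data format. The function render_layers and the tagger labels are hypothetical, and the annotation values are invented sample data. Two PoS layers from different tools are shown side by side to mirror the "several layers of the same type" case.

```python
def render_layers(layers):
    """Render per-token annotation layers as aligned text rows.

    layers maps a layer label to a list of per-token values; all
    lists must have the same length (one value per token).
    """
    if not layers:
        return ""
    lengths = {len(values) for values in layers.values()}
    if len(lengths) != 1:
        raise ValueError("all layers must cover the same tokens")
    token_count = lengths.pop()
    # Column width per token position = widest value in any layer.
    widths = [
        max(len(values[i]) for values in layers.values())
        for i in range(token_count)
    ]
    label_width = max(len(label) for label in layers)
    rows = []
    for label, values in layers.items():
        cells = " ".join(v.ljust(w) for v, w in zip(values, widths))
        rows.append(f"{label.ljust(label_width)} | {cells}".rstrip())
    return "\n".join(rows)


# Invented sample data: two hypothetical taggers disagree on the first token.
view = render_layers({
    "token": ["Surovy", "scores"],
    "pos (tagger A)": ["PROPN", "VERB"],
    "pos (tagger B)": ["NOUN", "VERB"],
})
print(view)
```

Keeping each layer as its own row makes it easy to add or hide layers without touching the others.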

Backwards compatibility

To make the transition to CLARIN-FCS 2.0 as smooth as possible, we have put considerable effort into making it backwards compatible. A Client using CLARIN-FCS 1.0 should use only the Basic Search capability. Clients implement a heuristic to automatically determine which CLARIN-FCS protocol version can be used to talk to an Endpoint.
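The fallback rule can be sketched as a small decision function. This does not reproduce the actual version-detection heuristic, which the article does not describe; it only illustrates the decision a client might take once it has an idea of the protocol version and capabilities of an Endpoint. The function name, its parameters and the capability label "advanced" are assumptions for this example.

```python
def choose_search_mode(endpoint_version, capabilities, wants_advanced):
    """Pick "advanced" only when the endpoint can handle it, else "basic".

    endpoint_version: major protocol version the client detected (1 or 2).
    capabilities: capability labels the endpoint is believed to support.
    wants_advanced: whether the user formulated an advanced query.
    """
    if wants_advanced and endpoint_version >= 2 and "advanced" in capabilities:
        return "advanced"
    # Version 1 endpoints, and any endpoint without the advanced
    # capability, are only sent Basic Search requests.
    return "basic"


legacy_mode = choose_search_mode(1, {"basic"}, wants_advanced=True)
modern_mode = choose_search_mode(2, {"basic", "advanced"}, wants_advanced=True)
```

Defaulting to Basic Search means an older Endpoint never receives a request it cannot interpret.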

Future

Altogether, we see CLARIN-FCS 2.0 as an easy-to-use, easy-to-maintain federated search that will make language resource exploration more open to everyone, whether you are a power user, a less experienced user, or a resource provider.

With the strong foundation of The Aggregator and ADV laid in CLARIN-FCS 2.0, we can now further develop the ways to formulate queries and present their results. This prepares the ground for presenting hierarchical data, e.g. syntactic search, and for using additional layers, e.g. named entities.

The future development of CLARIN-FCS will be guided by the popularity of its features and by the aim of keeping a reasonably low threshold for integrating new corpus search engines. We will try to keep CLARIN-FCS open to as many users and language repository providers as possible.