Federated Content Search (CLARIN-FCS) - Technical Details

The CLARIN Federated Content Search (CLARIN-FCS) introduces an interface specification that decouples search engine functionality from its exploitation and defines data formats for structuring standardized query results (so-called Data Views).

The specification can be used to create user interfaces that give uniform access to heterogeneous search engines. The most important publicly available user interface is the FCS aggregator, a search engine for language resources hosted at a variety of institutions.

More general information about the FCS can be found here.

Technical functionality

The CLARIN-FCS specification defines a set of capabilities, an extensible result format, and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard, which is maintained by the Library of Congress and the OASIS standardization consortium. Additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms.
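Because CLARIN-FCS builds on SRU/CQL, a search request is essentially an HTTP request carrying a CQL query in its parameters. The following is a minimal sketch of how a Client might assemble such a request URL, assuming SRU 1.2-style parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`). The endpoint address `https://fcs.example.org/sru` and the function name `build_search_url` are hypothetical, and the exact parameters an Endpoint expects depend on the SRU and FCS versions it implements.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_search_url(endpoint_url, cql_query, start_record=1, maximum_records=10):
    """Assemble an SRU searchRetrieve URL (SRU 1.2-style parameters assumed)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,                 # the CQL query, URL-encoded below
        "startRecord": start_record,        # 1-based offset into the result set
        "maximumRecords": maximum_records,  # page size requested by the Client
    }
    return endpoint_url + "?" + urlencode(params)


# Hypothetical endpoint address, used for illustration only.
url = build_search_url("https://fcs.example.org/sru", '"cat"')
print(url)

# Round trip: decode the query string again to inspect the CQL query.
decoded = parse_qs(urlparse(url).query)
print(decoded["query"][0])
```

The sketch only builds the URL. Sending the request and parsing the XML response are left out, since they depend on the Endpoint and the Data Views it supports.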

Specifically, the CLARIN-FCS specification consists of two components: a set of formats and a transport protocol. The Endpoint is a software component that acts as a bridge between a Client, which sends requests in these formats using the transport protocol, and a Search Engine. The Search Engine is a custom software component that allows searching the language resources of a specific institution. The Endpoint implements the transport protocol and mediates between the CLARIN-FCS specific formats and the idiosyncrasies of the Search Engines at the individual institutions.

The following figure illustrates the overall architecture:

Figure: FCS overall architecture. The Search Portal / Aggregator sits on top and sends requests to the FCS Endpoints, which translate the requests into their own formats in order to talk to their own (internal) search engines. The FCS Endpoints are hosted by the institutions, while the Aggregator is part of the CLARIN infrastructure.

In general, the workflow in CLARIN-FCS is as follows:

  1. A Client submits a query to an FCS Endpoint.
  2. The Endpoint translates the query from CQL to the query dialect used by the Search Engine and submits the translated query to the Search Engine.
  3. The Search Engine processes the query and generates a result set, i.e. it compiles a set of hits that match the search criterion.
  4. The Endpoint then translates the results from the Search Engine-specific result set format to the CLARIN-FCS result format and sends them to the Client.
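The four steps above can be sketched as a toy program. This is an illustration under strong assumptions, not an FCS implementation: `ToySearchEngine` and `ToyEndpoint` are hypothetical names, the "search engine" is an in-memory list searched with regular expressions, the query translation only handles a single (optionally quoted) term rather than full CQL, and the result dictionaries merely stand in for the CLARIN-FCS result format.

```python
import re


class ToySearchEngine:
    """Hypothetical institution-specific engine; its query dialect is regex."""

    def __init__(self, sentences):
        self.sentences = sentences

    def search(self, pattern):
        # Step 3: process the query and compile a result set of matching hits.
        regex = re.compile(pattern, re.IGNORECASE)
        hits = []
        for sentence in self.sentences:
            match = regex.search(sentence)
            if match:
                hits.append((sentence, match.span()))
        return hits


class ToyEndpoint:
    """Hypothetical endpoint mediating between a Client and the engine."""

    def __init__(self, engine):
        self.engine = engine

    def translate_query(self, cql_query):
        # Step 2: translate the query into the engine's dialect.
        # Simplification: only a single bare or quoted term is supported.
        term = cql_query.strip().strip('"')
        return r"\b" + re.escape(term) + r"\b"

    def translate_results(self, raw_hits):
        # Step 4: convert engine-specific hits into a common result format
        # (a plain dict here, standing in for the CLARIN-FCS result format).
        return [
            {"left": s[:start], "hit": s[start:end], "right": s[end:]}
            for s, (start, end) in raw_hits
        ]

    def search_retrieve(self, cql_query):
        native_query = self.translate_query(cql_query)
        raw_hits = self.engine.search(native_query)
        return self.translate_results(raw_hits)


engine = ToySearchEngine(
    ["The cat sat on the mat.", "Dogs bark.", "A concatenation."]
)
endpoint = ToyEndpoint(engine)

# Step 1: the Client submits a query to the Endpoint.
results = endpoint.search_retrieve('"cat"')
for result in results:
    print(result)
```

The point of the design is that the Client never sees the engine's regex dialect or its tuple-based hits. Only `translate_query` and `translate_results` would need to change if a different search engine sat behind the Endpoint.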

Specifications

CLARIN-FCS is defined in two specifications, the Core specification and the supplementary Data View specification. The first defines the general framework and the latter defines additional Data Views, which allow Endpoints to provide resources in more detailed formats.

The CLARIN FCS Core specification is currently available in version 2.0 and can be downloaded as a PDF document. The outdated CLARIN FCS Core specification version 1.0 is available here.

The CLARIN FCS Data View specification defines additional ways to represent resources. It is currently available in version 1.0 and can be downloaded here.

Helpful Applications & Software Libraries

Search engine / FCS aggregator

The FCS aggregator is CLARIN’s central content search engine. Its source code is available on GitHub.

Testing Endpoint conformance

Compliance of an endpoint with the CLARIN-FCS specification can be evaluated with the SRU/CQL Conformance Tester (login required).

List of available endpoints

The full list of implemented endpoints is available at the CLARIN Centre Registry.

Artifact Repository/Nexus

Some of the above-mentioned projects are also available via the CLARIN Nexus.

Various implementations

In most cases, the implementing data provider only builds a simple wrapper service that translates between the CLARIN-FCS/SRU protocol and the endpoint's software. However, there are efforts to provide default wrappers (or at least sample implementations) for individual persistence systems such as SQL databases or XML databases.

Non-exhaustive list of reference implementations:

  • The IDS has made a reference implementation of a CLARIN-FCS endpoint:
    https://svn.clarin.eu/FCSSimpleEndpoint/trunk/ (NOTE: the library still implements the deprecated version of the specification.)
    The source is divided into two parts: the SRU implementation (de.mannheim.ids.sru) and the Cosmas-specific implementation (de.mannheim.ids.cosmassru). The latter package also contains the Servlet, which shows how to initialize the endpoints programmatically. To use the library, the endpoint's developers need to customize "endpoint-config.xml" and implement the SRUDatabase interface. CosmasSRUDatabase can be used as an example. Furthermore, the endpoint's developers need to create (or change the existing) Servlet, web.xml and context.xml deployment descriptors.
  • Based on the IDS library, Alex Kislev has implemented a CQP/SRU bridge: any CQP-indexed corpus can be integrated quite easily into the CLARIN Federated Content Search.
  • OCLC announced oclcsrw, an open-source implementation of an SRU 1.2 server that exposes a database interface allowing implementers to expose their databases via SRU 1.2. Database implementations are separately available for Apache Lucene and DSpace.
  • The ICLTT (Vienna) is developing corpus_shell, a modular framework for publishing heterogeneous distributed language resources that builds on top of FCS. The system currently contains prototype implementations of FCS wrappers for MySQL databases (in PHP) and for the DDC search engine (in Perl). Additionally, an eXist/XQuery-based solution is being developed, but this code has been moved out of corpus_shell and into SADE as a module. These implementations are work in progress and do not yet fully conform to the FCS specifications.
  • KonText is an advanced corpus query interface and corpus data integration platform built around the corpus search engine Manatee-open. It supports the CLARIN FCS in version 1.0.