Federated Content Search Workshop in Copenhagen

Submitted by Jan Odijk on 12 May 2013

On April 24, 2013 I attended the workshop on Federated Content Search (FCS) at CST, Copenhagen. Federated Content Search aims to make it possible to search, via a single query, for data in multiple datasets that may reside at different locations, without requiring detailed knowledge of the structure of the datasets (which may differ among them). This page sketches the architecture envisaged for FCS in CLARIN. FCS is a very hard problem, and CLARIN is working at the edge of scientific and engineering knowledge in this domain. FCS also imposes very strict requirements of formal and semantic interoperability on the datasets. FCS functionality, however, is highly desirable in the CLARIN infrastructure because it promises to make searching a wide variety of distributed data easy and user-friendly.
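To make the basic idea concrete, here is a minimal sketch in Python of what "one query, many differently structured datasets" could look like. It is an illustration under stated assumptions only, not the CLARIN FCS architecture: the two toy datasets, the adapter functions `search_a` and `search_b`, and the `federated_search` function are all invented for this example, and everything runs in memory rather than over a network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hit:
    source: str  # name of the dataset the hit came from
    text: str    # the matching text, rendered as a plain string


# Two toy datasets with different internal structures (hypothetical data).
CORPUS_A = ["the cat sat on the mat", "a dog barked"]
CORPUS_B = [
    {"id": 1, "tokens": ["the", "dog", "slept"]},
    {"id": 2, "tokens": ["birds", "sing"]},
]


def search_a(term: str) -> List[Hit]:
    """Adapter for a dataset stored as plain sentences."""
    return [Hit("A", s) for s in CORPUS_A if term in s.split()]


def search_b(term: str) -> List[Hit]:
    """Adapter for a dataset stored as tokenised records."""
    return [Hit("B", " ".join(r["tokens"])) for r in CORPUS_B if term in r["tokens"]]


# Registry of endpoints: each one hides its own data structure behind
# the same call signature (assumption: a single-word query is enough).
ENDPOINTS: Dict[str, Callable[[str], List[Hit]]] = {"A": search_a, "B": search_b}


def federated_search(term: str) -> List[Hit]:
    """Send one query to every endpoint and merge the results."""
    hits: List[Hit] = []
    for name in sorted(ENDPOINTS):
        hits.extend(ENDPOINTS[name](term))
    return hits


if __name__ == "__main__":
    for hit in federated_search("dog"):
        print(hit.source, "|", hit.text)
```

The user never sees that one dataset holds plain sentences and the other holds tokenised records; the adapters absorb that difference. The hard part, which this sketch deliberately leaves out, is agreeing on a query language and on annotation semantics that all endpoints can honour.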

The workshop was well organized and held in a pleasant setting, for which we thank the local CST committee and the workshop organizers. It was excellently moderated by Christoph Draxler from the Bavarian Archive for Speech Signals (BAS). The programme consisted of technical presentations, presentations of example use cases, and a lot of fruitful discussion. It became clear that many of the use cases presented are too difficult for FCS at this stage (including, unfortunately, many of the ones I described), and that progress in this area will require a clearly planned, incremental approach.

One way to approach the problem is to distinguish clearly at least three categories of data (and the queries that each makes possible):

  1. corpora with relatively simple annotations of text (e.g. only annotations associated with tokens)
  2. treebanks
  3. lexicons

For example, for annotated text it would be ideal if FCS could deal with the queries allowed by the Corpus Query Language (CQL). Unfortunately, at this stage this is more an ambitious goal than a reality.

Research into and work towards FCS must continue, as must the work on making existing and new datasets formally and semantically interoperable. In the near future, FCS will offer only limited search functionality. CLARIN must therefore systematically offer, and continue to offer, alternatives in the form of resource-specific search engines. Hopefully, these can then gradually be replaced by FCS.