To enable researchers to search for specific patterns across collections of data, CLARIN offers a search engine that connects to the local data collections available at the centres. The data itself stays at the centre where it is hosted, which is why the underlying technique is called federated content search. The search engine summarises and displays what is available. From there, an easy next step is to go to the centre's specialised search interface to perform a more sophisticated query.
The technology behind this federated content search is SRU/CQL, together with a CLARIN-specific extension to this protocol.
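As a rough illustration of what a request in this style can look like, the sketch below builds an SRU (Search/Retrieve via URL) `searchRetrieve` URL that carries a CQL query. The endpoint `https://example.org/sru`, the helper name `build_search_url`, and the choice of SRU version 1.2 are assumptions made for the example; real centres publish their own endpoint addresses, and the CLARIN-specific extension is not shown here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(endpoint: str, cql_query: str, maximum_records: int = 10) -> str:
    """Build an SRU searchRetrieve URL for a CQL query.

    The endpoint is supplied by the caller; nothing here is specific
    to any particular centre.
    """
    params = {
        "operation": "searchRetrieve",   # standard SRU operation name
        "version": "1.2",                # assumed SRU version for this sketch
        "query": cql_query,              # the CQL query string
        "maximumRecords": str(maximum_records),
    }
    # urlencode percent-encodes the query so it is safe inside a URL.
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint, used only for illustration.
url = build_search_url("https://example.org/sru", '"language resource"')

# Decoding the query string again recovers the original CQL query.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["query"][0])
```

Keeping URL construction in one small function makes it easy to send the same CQL query to several endpoints, which is the basic pattern a federated search relies on.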
Federated Content Search vs. Metadata Search
The federated content search approach differs from metadata search, as performed for example in the Virtual Language Observatory, where all metadata is first harvested (copied to a single server) and then centrally indexed. There are several reasons for taking a different approach to content:
- Legal issues make it impossible for some resources to be copied to another location.
- The size of many datasets makes decentralised indexing the most viable option.
- Most language resources are annotated in a collection-specific manner, which makes it hard to use or develop one single search engine that can cope with all of them.
Although more scalable, federated content search comes at the cost of being less powerful than a local search, and certain features, such as ranking, are absent. This is why federated content search will often be particularly useful as a first step: it helps to discover where interesting language resources are hosted and at which centre(s) a more specialised search could be useful.
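The general idea can be sketched in a few lines of Python. This is a toy model only, not CLARIN's implementation: the centre names and texts are invented, and plain in-memory lists stand in for the remote endpoints a real aggregator would query. It shows the same query being sent to every centre in parallel and the results being summarised per centre, without any attempt at a global ranking.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical centres with invented data; each list stays "local" to its centre.
CENTRES: Dict[str, List[str]] = {
    "centre-a": ["the cat sat", "a dog barked"],
    "centre-b": ["cat and mouse", "the cat slept", "birds sing"],
    "centre-c": ["nothing relevant here"],
}


def search_centre(name: str, term: str) -> List[str]:
    """Search one centre's own collection for texts containing the term as a word."""
    return [text for text in CENTRES[name] if term in text.split()]


def federated_search(term: str) -> Dict[str, List[str]]:
    """Send the same query to every centre in parallel and collect the hits per centre."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(search_centre, name, term) for name in CENTRES}
        return {name: future.result() for name, future in futures.items()}


results = federated_search("cat")

# Summarise where hits were found; results are grouped per centre, not ranked globally.
summary = {name: len(hits) for name, hits in results.items()}
for name, count in summary.items():
    print(f"{name}: {count} hit(s)")
```

A summary like this tells a researcher which centre is worth visiting for a more specialised search, which matches the "first step" role described above.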
- Technical details
- Content search tutorial
- Documents from a Federated Content Search workshop