Interoperability

Introduction

This web page provides concrete information and guidelines on interoperability. It is intended for researchers, developers and national CLARIN consortia.

Why interoperability?

Interoperability is driven by the need for

  • Cooperation
  • Cross-domain use
  • Cross-national use
  • Cross-lingual use

These needs arise, for example, in hackathons, hands-on workshops and multinational projects, where several resources have to be handled along similar lines (preferably with the same program) to get comparable results.

In addition, interoperability makes long-term preservation of resources and tools easier, since it increases the chances that they can be reused at a later point in time.

Interoperability is one of the crucial ingredients for the FAIRness of data and software. See the CLARIN pages on FAIR: https://www.clarin.eu/fair 

Interoperability

Interoperability is the degree to which data and software can be used in combination with other data and software without any ad-hoc adaptations. 

In CLARIN we strive towards maximal interoperability, at least for data from the same natural class. As natural classes we distinguish, inter alia: data describing running natural language text, data describing audio (and its subclass speech), data describing video, data describing images, databases with structured information, and combinations of these.

Note that interoperability is more than maintaining and adhering to standards for data sets within natural classes. Support for conversions within a natural class must be supplied, and more advanced modes of interoperability provide support for conversions between natural classes, e.g. from scanned document images to text.

Of course, this is an ambitious goal that cannot be reached easily. Therefore, each national consortium must set its own priorities for the treatment of interoperability cases, and set up a plan accordingly.

Syntactic and Semantic Interoperability

Interoperability comes at two levels: the form, or syntactic, level and the meaning, or semantic, level. At the syntactic level, interoperability ensures that data and tools are compatible with regard to the formats they use. At the semantic level, it ensures that data and tools are compatible with regard to the meanings of the elements in those formats. For more on semantic interoperability, see the section Semantic Interoperability.

Standards

Interoperability is made easier by the use of standards for formats, protocols etc. Standards are very useful, but they are neither sufficient nor always necessary for interoperability. A list of CLARIN-recommended standards can be found here. The more pluses an item has there, the more it is preferred in the CLARIN context.

It is important to use and mention the most specific standard (including version) for a given resource type. For example, for textual resources, XML as a standard is good but not enough, because there are more specific standards for textual resources such as the TEI P5 encoding scheme, version 3.6.0 (an instantiation of the XML standard).

The CLARIN standards document referred to above is relatively old. New developments and recommendations are mentioned on this page. A more recent list is also provided here, where one can also find information about the formats actually used and supported by CLARIN Centres. Formats are often characterised by mimetypes. An inventory of mimetypes in use at the CLARIN Centres can be found here.
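To illustrate how a format can be characterised by a mimetype, here is a minimal sketch that uses the `mimetypes` module from the Python standard library to guess the mimetype of a file from its extension. The function name `describe_format` and the file names are assumptions made for this example. The mapping comes from Python and the local system, not from the CLARIN inventory, so results for less common extensions may differ between machines.

```python
import mimetypes


def describe_format(filename: str) -> str:
    """Return the mimetype guessed from the file extension, or 'unknown'."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or "unknown"


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    for name in ["corpus.pdf", "page.html", "notes.xyz123"]:
        print(name, "->", describe_format(name))
```

An extension-based guess says nothing about the version of a format or about the more specific standard (such as TEI) behind a generic one (such as XML), so it complements rather than replaces an explicit statement of the standard used.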

Legacy Formats

Many data exist in legacy formats that differ from the standard formats preferred in CLARIN. One strategy is to produce a new version of these data in accordance with a CLARIN-supported standard format. But that is not always desirable or even possible.

Conversion is not necessarily desirable when a whole tool suite has been built around a legacy format. That tool suite cannot be applied to the data in the new standard format, so the converted data will simply not be used. Issuing a new version of data in a different format only makes sense if the existing tool suites for the legacy format are adapted accordingly or if suitable alternatives are offered.

It is also often not possible to make such conversions, e.g. for technical reasons, or for reasons of financing. Certain data formats simply cannot be converted to standard formats without loss of information. Issuing a new version of a dataset may require a lot of effort, hence money, which often is not available.

Each national consortium should make an inventory of the legacy formats in use in its country, and set up a plan on how to deal with them and their associated tool suites, if any.

Real world data and data formats

Researchers gather many data in the 'real world', where many different data formats are in use (e.g. doc, docx, xls, xlsx, html, pdf, mdb, csv, epub, …). CLARIN should offer researchers facilities to make such data interoperable as well, e.g. converters that turn data in these formats into data formats that are fully supported by CLARIN. Here is a manually constructed overview of converters that already exist in CLARIN, and this query approximates the set of converters described in the CLARIN Virtual Language Observatory (VLO): 55 as of 2019-07-03. Use these converters; if no suitable one exists yet, create a new one and integrate it in CLARIN. See also this FAQ page on converters.
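As an illustration of what a small converter might look like, the following sketch turns CSV text into a simple XML document using only the Python standard library. The function name `csv_to_xml` and the element names `table`, `row` and `cell` are hypothetical choices for this example, not a CLARIN-recommended target format. A real converter would target a specific standard and version. The sketch refuses input that it cannot represent, in line with the concern about conversions that lose information.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str) -> str:
    """Convert CSV text (first row = column names) to a simple XML string.

    Element and attribute names are illustrative assumptions only.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element("table")
    for record in reader:
        row = ET.SubElement(root, "row")
        for column, value in record.items():
            if column is None:
                # More values than column names: fail rather than lose data.
                raise ValueError("row has more fields than the header")
            cell = ET.SubElement(row, "cell", {"column": column})
            cell.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(csv_to_xml("word,pos\nhuis,NOUN\n"))
```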

Semantic Interoperability

Semantic interoperability is achieved in CLARIN in a number of ways. 

  • Elements in metadata and data can be formally assigned a meaning by providing a link to a concept in the CLARIN Concept Registry (CCR). Especially for elements in CMDI metadata this is already common practice. For more information on the CCR, see this page.
  • A second way to make the meaning of vocabulary elements explicit is by selecting them from vocabularies that are included in CLAVAS (CLARIN Vocabulary Service). For example, CLAVAS contains the vocabulary of the ISO 639-3 language coding system; vocabularies for licenses, organizations and media types (mimetypes) are being considered for incorporation. For more information on CLAVAS, see this page.
  • A third way is to map values from local part-of-speech, feature or relation tagsets to the Universal Dependencies part-of-speech, feature and relation tagsets, especially in the context of multi-layer Federated Content Search (FCS).
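The tagset mapping mentioned in the last item can be sketched as a simple lookup table. The local tags below are an assumed, Penn Treebank-style example tagset, and the names `LOCAL_TO_UPOS`, `to_upos` and `map_tokens` are hypothetical. The target values are Universal Dependencies part-of-speech tags. A real mapping would cover the complete local tagset, and would also have to handle features, relations and tags that depend on context.

```python
# Illustrative excerpt of a mapping from an assumed local tagset
# to Universal Dependencies part-of-speech (UPOS) tags.
LOCAL_TO_UPOS = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "NNP": "PROPN",
    "VB": "VERB",
    "VBD": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "CC": "CCONJ",
    "CD": "NUM",
    "UH": "INTJ",
}


def to_upos(local_tag: str) -> str:
    """Map a local tag to UPOS; unmapped tags become 'X' (UD's 'other')."""
    return LOCAL_TO_UPOS.get(local_tag, "X")


def map_tokens(tagged_tokens):
    """Map a list of (word, local_tag) pairs to (word, upos) pairs."""
    return [(word, to_upos(tag)) for word, tag in tagged_tokens]


if __name__ == "__main__":
    print(map_tokens([("the", "DT"), ("corpora", "NNS"), ("grew", "VBD")]))
```

Falling back to `X` keeps the output valid, but it hides gaps in the mapping, so it can be useful to log unmapped tags while a mapping is being developed.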

Metadata

For metadata CLARIN uses the CLARIN MetaData Infrastructure (CMDI), which is described in more detail on this page and the links one finds there. CMDI offers a lot of flexibility in creating metadata: one can compose one’s own profile to make a metadata schema fully tuned to one’s needs. But this flexibility also comes with the danger that there is little overlap between the metadata of two researchers or research groups. For this reason, we strongly recommend:

 

  • Appoint a national coordinator for metadata creation and checking for completeness and appropriateness. An individual researcher may often not realise that certain types of metadata elements are crucial for discoverability as soon as the metadata show up in the VLO.
  • Reuse existing components and vocabularies wherever possible, and preferably use recommended components.
  • Use closed vocabularies, preferably with language-independent codes rather than natural language words.
  • For natural language words and texts in metadata, use English, and mark explicitly that English has been used. Other languages are allowed in addition, but mark explicitly which languages are involved.  
  • It is desirable that a piece of software that applies to data can also take into account the CMDI metadata associated with these data, and that it can automatically generate appropriate CMDI metadata for any new or enriched data the software generates.
  • Check whether the metadata produced score well on quality, e.g. whether enough information is present for the VLO faceted search to serve its function (discoverability of resources) well. Automated checking tools are available to assess the quality of your profile and of instances of that profile.
  • A proposal for a minimum set of metadata elements for describing data and software is being produced. An initial proposal for a minimum set of metadata for software is available here and here.
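The recommendations on closed vocabularies and on explicit language marking can be sketched as follows. A plain Python dictionary stands in for a metadata element here; this is not CMDI syntax. The names `ALLOWED_LANGUAGE_CODES` and `make_description` are hypothetical, and the set holds only a small excerpt of ISO 639-3 codes chosen for illustration.

```python
# Small illustrative excerpt of ISO 639-3 language codes used as a
# closed vocabulary; a real vocabulary would come from a service
# such as CLAVAS.
ALLOWED_LANGUAGE_CODES = frozenset({"eng", "nld", "deu", "fra"})


def make_description(text: str, language_code: str) -> dict:
    """Build a description whose language is marked explicitly.

    Raises ValueError if the code is not in the closed vocabulary,
    e.g. when a natural language word such as 'English' is used.
    """
    if language_code not in ALLOWED_LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {language_code!r}")
    return {"text": text, "lang": language_code}


if __name__ == "__main__":
    print(make_description("A corpus of spoken Dutch", "eng"))
```

Rejecting free-text values at creation time is one possible design choice; the point is that codes such as `eng` can be compared across research groups, whereas words such as "English", "Engels" or "anglais" cannot.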

Integrating data and tools in CLARIN

There are a number of very concrete ways to integrate data and tools in the CLARIN infrastructure that immediately contribute to interoperability and that can actually be tested.

  • VLO: data and software must be described by means of metadata using the CMDI framework. If these metadata show up in the CLARIN Virtual Language Observatory, it means that they consist of validated CMDI and are stored by a CLARIN Centre in a harvestable location. If the metadata are rich enough to enable discovery of these data, their usage will increase and their interoperability will be tested by users. See this web page for more information on the CMDI framework and this web page for the Virtual Language Observatory.
  • SwitchBoard: Web-based software (web applications, web services) can be integrated into the CLARIN infrastructure by making them compatible with the CLARIN Language Resource SwitchBoard and by adding information on the software in the SwitchBoard Registry. See this link for more information on the CLARIN SwitchBoard.

A useful strategy to increase and test the interoperability of a (web) service is to integrate it into one of the (web) service pipelines that are being developed in CLARIN. Some of these (WebLicht and LaMachine) are already compatible with the SwitchBoard. Examples are:

  • WebLicht: web page and github
  • Galaxy: web site; github; e.g. the Language Analysis Portal (LAP) by CLARINO (presentation: LAP in Galaxy)
  • LaMachine: web page and github
  • Federated Content Search: Web-based content search applications can be integrated into the CLARIN infrastructure by providing a search endpoint that is compatible with the CLARIN approach to Federated Content Search. See this link for more details on federated content search in CLARIN.

Some other concrete guidelines

Here we list a number of concrete guidelines that are lacking in the documents referred to or that were not adhered to in recent cases.

  • Tools for textual resources must support UTF-8 as character encoding. They may support other encodings, but UTF-8 is obligatory.
  • Tools often yield data as output, and many tools also take data as input. In the CLARIN context, a tool should accept not only data as input but also the associated (CMDI) metadata, and it should yield not only data as output but also CMDI metadata on this output. The metadata on the tool should provide enough information to compute the metadata of the output from the metadata of the input (if there is any input) and the metadata of the tool.
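Both guidelines can be sketched together in a toy tool. The example below decodes its input strictly as UTF-8, splits the text on whitespace, and derives the metadata of the output from the metadata of the input and the metadata of the tool. Plain dictionaries stand in for CMDI records, and every name and field here (`TOOL_METADATA`, `run_tool`, `format`, `annotations`, `derived_with`) is a hypothetical choice for illustration, not part of CMDI.

```python
# Hypothetical metadata describing the tool itself.
TOOL_METADATA = {
    "name": "whitespace-tokenizer",
    "version": "0.1",
    "output_format": "text/plain",
    "adds_annotation": "tokens",
}


def run_tool(data: bytes, input_metadata: dict) -> tuple[bytes, dict]:
    """Tokenize UTF-8 text and return (output bytes, output metadata)."""
    # Strict decoding: input that is not valid UTF-8 raises
    # UnicodeDecodeError instead of being silently corrupted.
    text = data.decode("utf-8")
    output = "\n".join(text.split()).encode("utf-8")

    # Output metadata = input metadata + what the tool metadata says
    # the tool changes or adds.
    output_metadata = dict(input_metadata)
    output_metadata["format"] = TOOL_METADATA["output_format"]
    output_metadata["annotations"] = list(
        input_metadata.get("annotations", [])
    ) + [TOOL_METADATA["adds_annotation"]]
    output_metadata["derived_with"] = (
        f"{TOOL_METADATA['name']} {TOOL_METADATA['version']}"
    )
    return output, output_metadata


if __name__ == "__main__":
    out, meta = run_tool(
        "Één twee".encode("utf-8"),
        {"title": "Demo", "language": "nld", "annotations": []},
    )
    print(out.decode("utf-8"))
    print(meta)
```

Properties that the tool does not affect, such as the title and language in this sketch, are simply carried over from the input metadata, which is what makes the output metadata computable without manual intervention.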

Some related pages

Related FAQ pages:

FAQ pages on the CLARIN Concept Registry: here and here 

Short guides:

On interoperability: https://www.clarin.eu/media/1602

On standards: https://www.clarin.eu/media/1412