Standards

Introduction: standards in CLARIN

CLARIN relies on standard procedures, data formats and protocols so that its members can collaborate across institutional boundaries. To this end, CLARIN is actively involved in the development of standards at various levels:

  1. CLARIN develops best practices within its committees for interacting with each other. These are agreements reached by technical working groups in various areas; they are coordinated across CLARIN and are intended to be publicly available, so that other infrastructures and projects can reuse them wherever applicable.
  2. CLARIN defines standards of interoperability between CLARIN centres, selecting from existing standards and best practices. This is done openly, so that external contributors can also assess whether they are CLARIN-compliant.
  3. CLARIN participates in the standardization processes conducted by the International Organization for Standardization (ISO) in Technical Committee 37, Subcommittee 4 “Language Resources” (ISO TC 37 SC 4), under the liaison agreement between CLARIN and ISO TC 37 SC 4. Likewise, CLARIN has strong ties to the Text Encoding Initiative Consortium (maintainer of the TEI Guidelines), which acts as a community-based standards body.
  4. Data deposition and archiving offered by CLARIN is most successful where the applicable standards are adhered to. To this end, CLARIN centres that offer data deposition should publish information about what data formats they accept (unconditionally or conditionally) or discourage.
  5. The information about standards and data formats used and recommended by CLARIN centres is published periodically by the CLARIN Standards Committee (CSC). The CSC maintains the Standards Information System (SIS), accessible at https://standards.clarin.eu/.

CLARIN Standards Committee – information

The main responsibility of the Standards Committee is to advise the Board of Directors on the adoption of standards to be supported by CLARIN ERIC. Its main tasks include:

  • to collect, consolidate and prepare for publication in a single place its findings and recommendations related to standards;
  • to maintain the set of standards supported by CLARIN and adapt them to new developments within or outside CLARIN;
  • to publish and promote the standards supported by CLARIN;
  • to develop and implement procedures for the discussion of recommendations and the adoption of new standards;
  • to ensure harmonisation of standards between CLARIN ERIC and related initiatives;
  • to ensure communication with international standards bodies such as (but not limited to) ISO;
  • to advise the Board of Directors in all matters related to standards.

The list of CSC members is available at https://www.clarin.eu/governance/standards-committee.

The committee should be seen as a collective of experts and researchers who, apart from engaging in theoretical considerations, do not mind the occasional dabble in scripting or data modeling or processing. National consortia are welcome to delegate their researchers to the CSC. These researchers are expected to have some experience working in standardization processes (or to have expertise that allows for such participation as part of the CSC) and to be willing to participate actively in various tasks.

 

Below, we present some of the more 'tangible' results of CSC activities.

 

CLARIN recommendations for data-deposition formats

One of the most pressing tasks of the CSC is to produce and maintain the recommendations for the use of HLT (Human Language Technology) standards within and across CLARIN. This is not an easy task, given the variety of ways in which all kinds of standards are used in all kinds of HLT research (and given that the definition of the very term is fuzzy at best). Several such listings have been produced over the history of CLARIN and are still in circulation. They vary greatly in the coverage and granularity of the standards that they list as recommended.

Currently, the CSC concentrates on a subset of the overall demand: the (mostly standardized) formats for data that can be deposited at CLARIN centres. These are mostly B-centres, because those centres by definition offer data deposition services. The information is harvested from web pages that each B-centre should (in theory) make available, listing the formats in which data should ideally be encoded when deposited with CLARIN. In this way, the ultimate authority for the listing lies with the centres themselves, so this is a bottom-up exercise (with some necessary interpretive steps along the way).

Release 0.1 (January 2021)

In January 2021, the CSC prepared the first, internal, release of data format recommendations. Public releases are going to be made available through the Standards Information System; working releases are most likely going to be published as PDF exports of the underlying spreadsheets, into which data are entered.

The first release (call it version 0.1) is made available in two forms: (a) a concise A4 listing of formats divided into categories, and (b) a more comprehensive version listing the particular B-centres, non-B-centres that accept data depositions, as well as centres that aim for B-centre status. The comprehensive listing is mostly meant for centre representatives and national coordinators, so that they can check whether their information has been made available and how the CSC has interpreted it. The numbers in the "recommendations" column reflect the number of centres that recommend the given format; note that the recommendations are separated into data categories and are presented in descending order.
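The counting behind the "recommendations" column can be illustrated with a minimal Python sketch. The centre names, categories and formats below are invented placeholders rather than actual CLARIN data, and the function name `count_recommendations` is hypothetical; the CSC spreadsheet remains the authoritative source for how the numbers were derived.

```python
from collections import defaultdict

# Hypothetical sample data: (centre, data category, format) triples.
# These are placeholders for illustration, not real centre reports.
MENTIONS = [
    ("Centre A", "Text", "TEI"),
    ("Centre B", "Text", "TEI"),
    ("Centre B", "Text", "Plain text"),
    ("Centre A", "Audio", "WAV"),
    ("Centre C", "Audio", "WAV"),
    ("Centre C", "Audio", "FLAC"),
    ("Centre A", "Text", "TEI"),  # repeated mention, counted only once
]


def count_recommendations(mentions):
    """Return {category: [(format, number_of_centres), ...]}, highest count first."""
    # Collect the set of distinct centres per (category, format) pair.
    centres = defaultdict(set)
    for centre, category, fmt in mentions:
        centres[(category, fmt)].add(centre)

    # Group the per-format counts by data category.
    by_category = defaultdict(list)
    for (category, fmt), who in centres.items():
        by_category[category].append((fmt, len(who)))

    # Descending by count; ties are broken alphabetically by format name.
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))
        for category, rows in by_category.items()
    }


if __name__ == "__main__":
    for category, rows in count_recommendations(MENTIONS).items():
        print(category)
        for fmt, n in rows:
            print(f"  {fmt}: {n}")
```

Using a set of centres per format means that a centre mentioning the same format twice is still counted once, which matches the reading of the column as "number of centres".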

 
 
 

Release 0.2 (July 2021)

The internal release 0.2 marks a change in the structure of the information that is expected of centres. Two dimensions are added: the level of recommendation, divided into (i) recommended, (ii) acceptable, and (iii) deprecated; and the domain in which the given format is going to be used (roughly: whether it is used for documenting the resource, for storing the audio/video artefacts, as harvestable metadata, as annotations, etc.). We distinguish 18 such domains plus a catch-all domain, "Other", for cases where finer granularity is not needed.
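As a rough illustration of these two dimensions, a single recommendation can be thought of as a record combining a centre, a format, a level and a domain. The Python sketch below is only one possible way to model this and is not the data model of the Standards Information System: the class and function names are hypothetical, as are the centre identifiers and the domain labels other than "Other".

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The three levels of recommendation named in release 0.2."""
    RECOMMENDED = "recommended"
    ACCEPTABLE = "acceptable"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class Recommendation:
    centre: str            # hypothetical centre identifier
    format: str            # format name as reported by the centre
    level: Level
    domain: str = "Other"  # catch-all domain when no finer one is given


def formats_for(recs, centre, level):
    """List the formats that a given centre assigns to a given level."""
    return sorted({r.format for r in recs if r.centre == centre and r.level is level})


# Placeholder records; the domain labels are illustrative only.
SAMPLE = [
    Recommendation("Centre A", "TEI", Level.RECOMMENDED, "Annotation"),
    Recommendation("Centre A", "PDF", Level.ACCEPTABLE, "Documentation"),
    Recommendation("Centre A", "DOC", Level.DEPRECATED),
    Recommendation("Centre B", "TEI", Level.ACCEPTABLE, "Annotation"),
]

if __name__ == "__main__":
    print(formats_for(SAMPLE, "Centre A", Level.RECOMMENDED))
```

The same format may appear at different levels for different centres or domains, which is why the level is stored on the record rather than on the format itself.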

This release corresponds to release 2.0.0-beta of the Standards Information System.

Please note that issues and suggestions are best submitted through the GitHub ticketing system that is part of the SIS source code management.

 

The release comes in several parts:

  • Firstly, the PDFs corresponding to the PDFs of the previous release. This is most probably the last time that a release of CLARIN format recommendations uses the spreadsheet that underlies the PDFs; the relevant information will be transferred to the Standards Information System over time. As in release 0.1, these snapshots come in two versions: one shows values split across the particular centres; the other, in the handier A4 format, shows the format names and the derived numbers that indicate how many centres mention that format as recommended or acceptable (recall that this difference is sometimes a matter of interpretation of centre reports). Since not much has changed from release 0.1 (only a few corrections have been made), these snapshots come sorted not by format name but rather by the number of supporting centres.
  • An XML export of the SIS page with format recommendations forms another part of release 0.2. This is simply a snapshot of the current state of the information, which is restricted to a few formats for now. It required a certain degree of interpretation with regard to the two data dimensions that, in most cases, have not yet been explicitly reported by centres: the level of recommendation and the functional domain in which the format is meant to be used. Note that, for security reasons, the CMS has renamed the file: remove "_.doc" from the end of the filename to make it usable as XML.
  • Finally, a PDF snapshot of the corresponding GitHub ticket is provided as auxiliary information about where centres point users for information on formats. This started as internal information for the CLARIN Standards Committee, but may also be useful for the centres listed there. The "pointing elsewhere" in the title of the ticket comes from the observation that, typically, a centre has its own research profile and therefore its own data orientation, favouring some formats over others. That is why (as indicated in the centre assessment criteria) each centre is expected to publish its own recommendations, rather than pointing "elsewhere" for general information. As of version 1.0 of these recommendations, it should be possible for centres to publish those recommendations directly in the Standards Information System.
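The renaming step mentioned for the XML export can also be scripted. The following is a minimal Python sketch; the function name `restore_xml_name` and the example filename are hypothetical, and only the "_.doc" suffix itself comes from the note above.

```python
# Suffix appended by the CMS, as described in the note on the XML export.
CMS_SUFFIX = "_.doc"


def restore_xml_name(filename: str) -> str:
    """Remove the CMS suffix from the end of a filename, if it is present."""
    if filename.endswith(CMS_SUFFIX):
        return filename[: -len(CMS_SUFFIX)]
    # Leave names without the suffix untouched.
    return filename


if __name__ == "__main__":
    # "export.xml_.doc" is an invented filename used only for illustration.
    print(restore_xml_name("export.xml_.doc"))
```

To rename a downloaded file on disk, the returned name can be passed to `os.rename` together with the original name.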
 
 

Release 1.0 (November 2021)

Release 1.0 marks a full switch from the KPI spreadsheet, where the fact that a centre mentioned a format positively was marked with a "1", to the CLARIN Standards Information System, where a format can be "recommended", "acceptable" or "deprecated" by a centre. That status is further qualified by the function that the formatted resource is expected to serve (documentation, textual source, annotated text, etc.).
 
The release was made available in batches between late September and November 2021 at https://standards.clarin.eu/sis/. The goal was to match the then-current recommendations as posted by the individual centres, unify them and make them available to the centres for corrections and updates. At the same time, the SIS itself was expanded to present the information in a user-friendly way and to make it easier for the individual centres to maintain their recommendations within the SIS.
 
Release 1.0 consists of an XML dump of the state of the SIS format recommendation database as of mid-November 2021 and a PDF that reflects what the CSC knew about centres that, often instead of publishing their own format recommendations, pointed the user elsewhere. The PDF is a snapshot of one of the issues posted at the GitHub repository of the SIS meant to collect that information in order to monitor whether centres switch to pointing at the SIS.

Release 1.1

Release 1.1 was tentatively planned for the end of 2021, but that was conditional on how quickly the particular centres took up the SIS-based way of publishing and maintaining recommendations. Since, as of January 2022, no major feedback has been received from any centre, the release should be considered "in the making". Progress towards the release can be traced at the corresponding GitHub milestone.
 
In the meantime, several releases of the Standards Information System code have been published.
 

Namespace assignment

Through the CSC, CLARIN offers namespace assignment to projects and organisations that need stable namespace identifiers. So far, namespaces have been assigned to the following projects:

  • CQLF-2 (requested by ISO TC37 SC4 WG6): https://www.clarin.eu/standards/cqlf
  • SynAF-2 (requested by ISO TC37 SC4 WG6): http://clarin.eu/standards/ns/synaf

Publications and presentations