Introduction: standards in CLARIN
CLARIN relies on standard procedures, data formats and protocols so that its participants can collaborate across institutional boundaries. To this end, CLARIN is actively involved in the development of standards on several levels:
- CLARIN develops best practices for interaction within its committees. These are agreements reached by technical working groups in various areas; they are coordinated and intended to be made publicly available, so that other infrastructures and projects can reuse them wherever applicable.
- CLARIN defines standards of interoperability between CLARIN centres, selected from existing standards and best practices. The selection is made openly, so that even external contributors can assess whether they are CLARIN-compliant.
- CLARIN participates in the standardization processes conducted by the International Organization for Standardization (ISO) in Technical Committee 37, Subcommittee 4 “Language Resources” (ISO TC 37 SC 4); this is done under the liaison agreement between CLARIN and ISO TC 37 SC 4. Likewise, CLARIN has strong ties to the Text Encoding Initiative Consortium (maintainer of the TEI Guidelines), which acts as a community-based standards body.
- Data deposition and archiving offered by CLARIN is most successful where the applicable standards are adhered to. To this end, CLARIN centres that offer data deposition should publish information about what data formats they accept (unconditionally or conditionally) or discourage.
- The information about standards and data formats used and recommended by CLARIN centres is published periodically by the CLARIN Standards Committee (CSC). The CSC maintains the Standards Information System (SIS), accessible at https://standards.clarin.eu/.
CLARIN Standards Committee – information
The main responsibility of the Standards Committee is to advise the Board of Directors on the adoption of standards to be supported by CLARIN ERIC. Its main tasks include:
- to collect, consolidate and prepare for publication in a single place its findings and recommendations related to standards;
- to maintain the set of standards supported by CLARIN and adapt them to new developments within or outside CLARIN;
- to publish and promote the standards supported by CLARIN;
- to develop and implement procedures for the discussion of recommendations and the adoption of new standards;
- to ensure harmonisation of standards between CLARIN ERIC and related initiatives;
- to ensure communication with international standards bodies such as (but not limited to) ISO;
- to advise the Board of Directors in all matters related to standards.
The list of CSC members is available at https://www.clarin.eu/governance/standards-committee .
The committee should be seen as a collective of experts and researchers who, apart from engaging in theoretical considerations, do not mind the occasional dabble in scripting or data modeling or processing. National consortia are welcome to delegate their researchers to the CSC. These researchers are expected to have some experience working in standardization processes (or to have expertise that allows for such participation as part of the CSC) and to be willing to participate actively in various tasks.
Below, we present some of the more 'tangible' results of CSC activities.
CLARIN recommendations for data-deposition formats
One of the most pressing tasks of the CSC is to produce and maintain the recommendations for the use of HLT (Human Language Technology) standards within and across CLARIN. This is not an easy task, given the variety of ways in which all kinds of standards are used in all kinds of HLT research (and given that the definition of the very term is fuzzy at best). Several listings of recommended standards have appeared over the history of CLARIN and are still in circulation. They vary greatly in coverage and in the granularity of the standards that they enumerate.
Currently, the CSC concentrates on a subset of the overall demand: the (mostly standardized) formats for data that can be deposited at CLARIN centres (mostly B-centres, because those centres by definition offer data deposition services). The information is harvested from web pages that each B-centre should (in theory) make available, listing the formats in which data should, optimally, be encoded when deposited with CLARIN. In this way, the ultimate authority for the listing lies with the centres themselves, so this is a bottom-up exercise (with some necessary interpretive steps along the way).
Release 0.1 (January 2021)
In January 2021, the CSC prepared the first, internal, release of data format recommendations. Public releases are going to be made available through the Standards Information System; working releases are most likely going to be published as PDF exports of the underlying spreadsheets, into which data are entered.
The first release (call it version 0.1) is made available in two forms: (a) a concise A4 listing of formats divided into categories, and (b) a more comprehensive version listing the particular B-centres, non-B-centres that accept data depositions, and centres that aim for B-centre status. The comprehensive listing is mostly meant for centre representatives and national coordinators, so that they can check whether their centre's information has been made available and how it was interpreted by the CSC. The numbers in the "recommendations" column give the number of centres that recommend the given format; note that the recommendations are separated into data categories and are presented in descending order.
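The way the numbers in the "recommendations" column are derived can be sketched roughly as follows. This is a minimal illustration, not the CSC's actual procedure: the centre names, categories and formats in `REPORTS` are invented, and the tie-breaking by format name is an assumption.

```python
from collections import defaultdict

# Hypothetical centre reports as (centre, data category, format) triples.
# All names and entries are illustrative, not real CSC data.
REPORTS = [
    ("Centre A", "Text", "TEI XML"),
    ("Centre B", "Text", "TEI XML"),
    ("Centre B", "Text", "Plain text"),
    ("Centre C", "Audio", "WAV"),
    ("Centre A", "Audio", "WAV"),
    ("Centre A", "Audio", "FLAC"),
]


def tally(reports):
    """Count the distinct centres recommending each format, per category."""
    centres = defaultdict(set)
    for centre, category, fmt in reports:
        centres[(category, fmt)].add(centre)

    by_category = defaultdict(list)
    for (category, fmt), who in centres.items():
        by_category[category].append((fmt, len(who)))

    # Descending by number of centres; ties broken by format name (assumption).
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))
        for category, rows in by_category.items()
    }


TALLY = tally(REPORTS)
for category, rows in TALLY.items():
    for fmt, count in rows:
        print(f"{category:6} {fmt:12} {count}")
```

Using a set per (category, format) pair ensures that a centre mentioning the same format twice is counted only once.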
Release 0.2 (July 2021)
The internal release 0.2 marks a switch in the approach to the structure of the information expected from centres. Two dimensions are added: the level of recommendation (divided into (i) recommended, (ii) acceptable, and (iii) deprecated), and the domain in which the given format is used (roughly: whether it serves for documenting the resource, for storing audio/video artefacts, as harvestable metadata, as annotations, etc.). We distinguish 18 such domains plus a catch-all domain, "Other", for cases where finer granularity is not needed.
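The two added dimensions could be modelled along the following lines. This is only a sketch under stated assumptions: the class and field names are hypothetical and do not reflect the actual SIS data model, and the domain labels used in the sample records are illustrative stand-ins rather than the official domain names.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The three levels of recommendation introduced in release 0.2."""
    RECOMMENDED = "recommended"
    ACCEPTABLE = "acceptable"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class Recommendation:
    """One centre's stance on one format in one functional domain (hypothetical model)."""
    centre: str
    format: str
    domain: str  # a functional domain, or the catch-all "Other"
    level: Level


# Illustrative records only: the same format may be rated differently per domain.
RECS = [
    Recommendation("Centre A", "PDF", "Documentation", Level.RECOMMENDED),
    Recommendation("Centre A", "PDF", "Annotation", Level.DEPRECATED),
    Recommendation("Centre A", "WAV", "Audio/video", Level.RECOMMENDED),
    Recommendation("Centre B", "MP3", "Audio/video", Level.ACCEPTABLE),
]


def formats_for(recs, centre, level):
    """Return sorted (format, domain) pairs that a centre lists at a given level."""
    return sorted(
        (r.format, r.domain)
        for r in recs
        if r.centre == centre and r.level is level
    )


print(formats_for(RECS, "Centre A", Level.RECOMMENDED))
```

Keeping the domain on each record, rather than on the format, is what allows a single format to be recommended for one use and deprecated for another.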
This release corresponds to release 2.0.0-beta of the Standards Information System.
Please note that issues and suggestions are best submitted through the GitHub ticketing system that is part of the SIS source code management.
The release comes in several parts:
- Firstly, PDF snapshots corresponding to those of the previous release. This is most probably the last time that a release of CLARIN format recommendations uses the spreadsheet underlying the PDFs; the relevant information will be transferred to the Standards Information System over time. As in release 0.1, these snapshots come in two versions: one shows values split across the particular centres; the other, in the handier A4 format, shows the format names and the derived numbers that indicate how many centres mention that format as recommended or acceptable (recall that this difference is sometimes a matter of interpretation of centre reports). Since not much has changed from release 0.1 (only a few corrections have been made), these snapshots come sorted not by format name but rather by the number of supporting centres.
- An XML export of the SIS page with format recommendations forms another part of release 0.2. This is simply a snapshot of the current state of the information, which is restricted to a few formats for now. It required a certain degree of interpretation with regard to the two data dimensions that, in most cases, centres have not yet reported explicitly: the level of recommendation and the functional domain in which the format is meant to be used. Note that, for security reasons, the CMS has renamed the file: remove "_.doc" from the end of the filename to make it usable as XML.
- Finally, a PDF snapshot of the corresponding GitHub ticket is provided as auxiliary information about where centres point users for information on formats. This started as internal information for the CLARIN Standards Committee, but may also be useful for the centres listed there. The "pointing elsewhere" in the title of the ticket comes from the observation that, typically, a centre has its own research profile and therefore its own data orientation, favouring some formats over others. That is why (as indicated in the centre assessment criteria) each centre is expected to publish its own recommendations, rather than pointing "elsewhere" for general information. As of version 1.0 of these recommendations, it should be possible for centres to publish those recommendations directly in the Standards Information System.
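For the XML export mentioned above, stripping the "_.doc" ending that the CMS appended can be done by hand or with a few lines of script. The sketch below is one way to do it; the function name and the example filename are hypothetical, since the actual name of the exported file is not given here.

```python
def restore_xml_name(filename: str, suffix: str = "_.doc") -> str:
    """Strip the suffix appended by the CMS, if present; otherwise return the name unchanged."""
    if filename.endswith(suffix):
        return filename[: -len(suffix)]
    return filename


# Hypothetical filename, for illustration only.
print(restore_xml_name("format-recommendations.xml_.doc"))  # prints "format-recommendations.xml"
```

To rename a downloaded file on disk, the result can be passed to `os.rename` or `pathlib.Path.rename`.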
Release 1.0 (November 2021)
Through the CSC, CLARIN offers namespace assignment to projects and organisations that need stable namespace identifiers. So far, namespaces have been assigned to the following projects:
- CQLF-2 (requested by ISO TC37 SC4 WG6):
- SynAF-2 (requested by ISO TC37 SC4 WG6):
Publications and presentations
- Bański, Piotr and Hanna Hedeland. Standards in CLARIN. In: ... ; forthcoming in 2022, de Gruyter.
- Bański, Piotr. 2021. CLARIN Standards Committee: recent work and nearest future. Presentation given at the National Coordinator Forum Meeting, in October 2021.
- Piotr Bański, Tomaž Erjavec, Francesca Frontini, Hanna Hedeland, Eliza Margaretha Illig, Neeme Kahusk, Fahad Khan, Karlheinz Mörth, Jan Odijk, Jussi Piitulainen, Christian Thomas, Dieter Van Uytvanck, Menzo Windhouwer, Andreas Witt. 2021. Ask your SIS: Collecting Centre Recommendations on Data Deposition Formats – presentation given at the CLARIN Annual Conference, in September 2021. Available at https://www.clarin.eu/sites/default/files/CLARIN2021_Bazaar_Metadata_CS…
- CLARIN Standards Committee. 2020. Pursuing the elusive KPI: Filling the gaps in centre self-published standards-related information – presentation given at the CLARIN Annual Conference, in September 2020. Available at https://www.clarin.eu/sites/default/files/Clarin%202020_bazaar_CSC_Core…
- Bański, Piotr, Hanna Hedeland & Dieter Van Uytvanck. 2019. "Unified list of standards: next steps forward". Poster presented at the Bazaar, CLARIN Annual Conference 2019, Leipzig, Germany.
- Hedeland, Hanna & Piotr Bański. 2018. Towards CLARIN recommended formats: a bottom-up approach. Interactive poster presented at the Bazaar, CLARIN Annual Conference 2018, Pisa, Italy, 8-10 October.
- Bański, Piotr. 2018. Towards unified CLARIN recommendations for the use of standards: a pilot study on “text formats”. CLARIN document CE-2021-1931, version 0.2, May 2018. Available at https://hdl.handle.net/11372/DOC-164
- Standards Committee. 2015. Standards in CLARIN. Position paper presented at the yearly CLARIN Conference in Wrocław, 2015.
- Stührenberg, Maik, Antonina Werthmann, and Andreas Witt. 2012. Guidance through the Standards Jungle for Linguistic Resources. In Proceedings of the LREC 2012 Workshop on Collaborative Resource Development and Delivery, 9–13.