Blog: Metadata Curation Task Force Meeting in Vienna

Submitted by Jakob Lenardič on 9 February 2018

Curation Task Force Members Met in Vienna to Work on Metadata Vocabularies

From 30 January to 1 February 2018, a metadata curation meeting was held at the Austrian Academy of Sciences in Vienna. The meeting was organised by Matej Ďurčo, head of the Austrian Centre for Digital Humanities (ACDH) technical working group “Tools, Services & Systems”. It was attended by 14 representatives of various CLARIN-affiliated institutions – the ACDH, the Swedish National Data Service, CLARIN, CLARIN-D, LINDAT and the Center of Estonian Language Resources. The primary aim was to establish a value normalisation scheme for selected facets in the Virtual Language Observatory (VLO) on the basis of an agreed-upon controlled vocabulary, as well as to perform hands-on work on normalisation.

The meeting began with a presentation of the current state of the VLO, focusing on the facets. As has been established on several occasions, the multitude of values for each facet makes the facets difficult to navigate. At the time of the meeting, there were 346 unique values for "Resource Type", 1679 for "Genre" and 51869 for "Subject". We focused on the "Resource Type" facet because it is one of the most basic ones, and also because it is much more difficult to establish a semi-closed vocabulary for facets like "Subject" and "Genre".

Darja Fišer and Jakob Lenardič prepared an investigation into a practical VLO use case in which we tried to identify corpora belonging to different resource families, such as parallel and parliamentary resources. The following issues with faceted search exemplify the problems to be solved:

It is difficult to combine simple search with faceted search; for instance, when querying parliament* in simple search, the "Resource Type" facet originally showed only 2 resources under the value Corpus even though there are at least 14 entries for parliamentary corpora in the VLO;
Values exist in languages other than English;
Different values refer to the same resource type (e.g. newspaper vs. newspaper issue) and there are values that could be subsumed under a broader category (e.g. Corpus vs. Diachronic corpus).

On day two, we reduced the great number of "Resource Type" values by mapping them to a smaller controlled vocabulary, which contains 13 values – collection, corpus, text, lexicalResource, grammar, dataset*, annotation, image, audio, video, session, tool service, and physical object. To do this, a value decomposition approach was adopted whereby a resource that had previously been labelled under the unique value of TextAnnotatedCorpus would now be mapped to three separate values – annotation, text, and corpus. Hands-on work was then divided among three working groups, two of which focused on mapping the existing values to the controlled vocabulary set while the remaining group worked on resources that had originally lacked a "Resource Type" value.
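The value decomposition described above can be sketched as a simple lookup table. This is only an illustration, not the task force's actual tooling: the function name and the table are hypothetical, only the TextAnnotatedCorpus entry is taken from the post, the other two entries are assumptions added for the example, and the asterisk on "dataset" is dropped.

```python
# Controlled vocabulary as listed in the post (13 values).
CONTROLLED_VOCABULARY = frozenset({
    "collection", "corpus", "text", "lexicalResource", "grammar",
    "dataset", "annotation", "image", "audio", "video", "session",
    "tool service", "physical object",
})

# Hypothetical mapping table from raw "Resource Type" values to
# controlled values. Only the first entry comes from the post.
RAW_TO_CONTROLLED = {
    "TextAnnotatedCorpus": ("annotation", "text", "corpus"),
    "Diachronic corpus": ("corpus",),  # assumption for illustration
    "Corpus": ("corpus",),             # assumption for illustration
}


def normalise_resource_type(raw_value):
    """Return the controlled values for a raw value, or () if unmapped."""
    mapped = RAW_TO_CONTROLLED.get(raw_value.strip(), ())
    # Guard against typos in the mapping table itself.
    unknown = [value for value in mapped if value not in CONTROLLED_VOCABULARY]
    if unknown:
        raise ValueError(f"not in controlled vocabulary: {unknown}")
    return mapped


print(normalise_resource_type("TextAnnotatedCorpus"))
```

Returning an empty tuple for unmapped values keeps them visible, so that they can be handed over for manual curation rather than being silently dropped.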

On day three, the results of the hands-on normalisation were reviewed. Currently, there are 138 unique values under "Resource Type" in the VLO, which means that more than half have already been successfully mapped to the controlled vocabulary set. Consequently, there has been a marked improvement in faceted search – a simple search string like parliament* now yields 17 results under the value corpus, which correctly reflects the actual state of affairs in relation to the available parliamentary corpora in the VLO.
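As a quick check of the "more than half" figure, the reduction can be computed from the two counts reported in this post. The variable names are ours, and the calculation assumes that the drop in unique values is a fair proxy for the share of values already mapped.

```python
# Unique "Resource Type" values reported in the post.
values_before = 346  # at the time of the meeting
values_after = 138   # after the hands-on normalisation

eliminated = values_before - values_after
share = eliminated / values_before

print(f"{eliminated} of {values_before} values eliminated ({share:.0%})")
# prints: 208 of 346 values eliminated (60%)
```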

To conclude the meeting, a VLO curation task force was established, whose aim is to normalise the remaining "Resource Type" values. The task force will attempt to finish the normalisation by the end of February 2018.