Support for data archiving and dissemination or "Contact-me-if-you-are-interested-in-my-data"

Submitted by Thorsten Trippel on 19 March 2013

Asking researchers with fascinating data whether their data can be reused often results in a "Contact-me-if-you-are-interested-in-my-data" reply. Scholars often do not mind other scholars having access to their unfinished, raw data, as long as they know them personally, can brief them about restrictions, concerns and future work, and receive positive feedback.

At the 2013 spring conference of the German Linguistics Society, I had the opportunity to hear Christian Mair from Freiburg (Germany) speak about the creation of three special corpora. These corpora contain orthographic representations of dialectal language for Jamaican, Cameroonian and Nigerian English, collected from websites, and come with an accompanying tool for visualizing the distribution of sources. The Corpus of "Cyber-Jamaican" (CCJ), the Corpus of "Cyber-Cameroonian" (CCC) and the Corpus of "Cyber-Nigerian" (CCN) are of considerable size, and the audience and the questions asked showed clearly that I am not the only one who would love to have a look at such a resource. When asked about the availability of the resources, he answered that individuals could possibly already have a look, but that a couple of tasks would have to be accomplished before general availability, among them anonymization of the content and some cleaning up of the resources; the tools could be made available in an alpha version to selected users.

Many researchers are in a similar situation. Making resources and tools publicly available requires preparation: of the resource, of the infrastructure, and of the researchers and communities. In the following I will briefly explain why such preparation is required and mention where to find further information. In this format I will try to be concise rather than complete; otherwise a rather bookish article would be the consequence.

Preparation of a resource to be published

Corpora, lexical resources and other types of linguistic material are created with a lot of effort, and a lot of time and money goes into them. Often this continues until a specific research question can be answered or the funding of a project runs out. At that stage, work on the resources stalls and ideas for improvements are no longer implemented. In principle such resources can be made available to a wider audience, but this requires some additional work on the resources themselves:

Anonymization of data
Resources containing names and references to small communities or locations may touch on the privacy of the original data providers. Removing privacy-sensitive information requires a review of the data, either manually or automatically. Much of this information is related to names, both of persons and of locations. Assistance could be provided by a part-of-speech tagger that marks named entities. Within CLARIN, various taggers for different languages are available, at least to the academic public, often as web services or web applications.
Coherent, formalized description of the resource
Usually the researchers creating a resource know best what it is. But for reuse by others it is important to provide a description not unlike a library catalogue entry for a book. This metadata is meant to give other interested researchers a first overview: it provides descriptions and essential information, plus contact information for getting access to the resource and finding more details. Metadata is essential for locating the resource later, but creating it takes additional effort. CLARIN offers tools and supports the Component Metadata Infrastructure (CMDI) to provide descriptions suited to the data creator's type of resource.
Removing internal markup
Often, especially if manual work is involved, resources contain comments and notes on tasks that are still open. These usually result from internal workflows and are not part of the resource. Although such provenance information could be relevant and could be stored elsewhere, these comments, for example by student assistants, are often removed before the resource is made available. If a consistent notation was used, this can be done automatically. Experts in CLARIN can provide assistance in defining such a notation and may help to create a clean-up mechanism.
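The anonymization and clean-up steps above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not a CLARIN tool or convention: the `[[NOTE: ...]]` notation for internal comments, the `PERSON_n` placeholders, and the function names `strip_internal_notes` and `pseudonymize` are all hypothetical. The list of names is assumed to come from a manual review or from a tagger; the sketch does not detect names itself.

```python
import re

# Hypothetical notation for internal comments: [[NOTE: ...]]
# Any notation works as long as it was used consistently in the resource.
NOTE_PATTERN = re.compile(r"\s*\[\[NOTE:.*?\]\]", re.DOTALL)


def strip_internal_notes(text):
    """Remove all [[NOTE: ...]] comments and return (clean_text, notes).

    The removed notes are returned rather than discarded, so that they can
    be stored elsewhere as provenance information.
    """
    notes = [match.strip() for match in NOTE_PATTERN.findall(text)]
    clean_text = NOTE_PATTERN.sub("", text)
    return clean_text, notes


def pseudonymize(text, names):
    """Replace each listed name with a placeholder such as PERSON_1.

    `names` is assumed to be a reviewed list of privacy-sensitive names.
    Longer names are replaced first so that a full name is not partially
    overwritten by one of its parts.
    """
    mapping = {}
    ordered = sorted(set(names), key=lambda name: (-len(name), name))
    for index, name in enumerate(ordered, start=1):
        placeholder = f"PERSON_{index}"
        mapping[name] = placeholder
        text = re.sub(r"\b" + re.escape(name) + r"\b", placeholder, text)
    return text, mapping
```

Called on a text such as "Mary Brown met John in Kingston. [[NOTE: check spelling]] They talked.", `strip_internal_notes` returns the text without the note together with the list of removed notes, and `pseudonymize` then replaces the listed names. The returned mapping has to be kept confidential, since it allows the replacement to be reversed. A simple list-based replacement like this will miss misspellings and indirect references, so a manual review remains necessary.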

Of course other preparation is also required, such as ensuring a high quality of the material, but usually this is in the creator's own interest from the start of a project and not an extra effort before publication. With their expertise, the CLARIN centers can also support data providers in the early stages of resource creation.

Thoughts on the infrastructure

Providing resources to a larger community means that the infrastructure has to be sufficient for it. Server capacity has to be provided; usually the scholar's desktop computer is not appropriate. Servers require maintenance, network access and power. They wear out after some time and need to be replaced. If the data is referenced or cited in publications, it is also important that the references and citations can be resolved so that the audience can find the resources. Why else should someone refer to resources?

For a small resource of a couple of gigabytes, a few hours of recordings, etc., the investment in such an infrastructure may sound too large. But by joining forces, as in the CLARIN infrastructure, it is possible to use a center to store resources even for a long time. Descriptions of the data are provided to search engines, and unique persistent identifiers that can be used to refer to and cite the resources are assigned to them, just as book numbers are assigned to books in a library catalogue.

Data provider vs. potential users: a problematic relation?

Last but not least, it takes courage for a data provider to publish a resource. The reason is simple: there is just no corpus or lexical resource that is finished. There is always something to improve, to enhance, to add, to revise - you get the idea. Knowing about the imperfections of a resource, the provider is aware that releasing it opens it to public criticism, additional demands and requests by the community, etc. And of course the harshest critic can be the creator.

Even an imperfect resource may be valuable for a community and worth using and citing. This calls for the community to be error-tolerant and to take the resource as it is at the time of release, not bashing the provider for providing the material in an imperfect state. Of course the provider may still be criticized openly; this is also something to be prepared for. Serious research with serious data may cause debates, but the debate is usually about the interpretation of the data, not the data itself. And every party involved should be aware that this is what research is about.

Knowing that Christian Mair, with his fascinating resources, is aware of the requirements and the options, I am looking forward to him finishing the "brushing up" of those resources and making them available to the community. And I am positive that researchers like him will have a great impact on the availability of high-quality language-related material for the research of others as well, even if the material is not perfect and preparing a resource for a larger audience takes some additional effort. Here, the CLARIN infrastructure supports the dissemination and archiving of resources.