CLARIN: widening the application of corpora

Martin Wynne
Oxford Text Archive, University of Oxford

Practical applications of linguistic corpora are not limited to linguistics and computer science. They are already used in research in many academic disciplines, and there is considerable potential for further exploitation of corpora and other language resources and technologies, particularly in the Humanities and Social Sciences.

Initiatives are under way to build a working environment that will enable more effective use of corpora across the humanities and social sciences, among them the EC-funded CLARIN Preparatory Phase project. CLARIN is committed to enhancing research across Europe by facilitating the use of language resources and technology by researchers across a wide spectrum of domains in the Humanities and Social Sciences. CLARIN aims to make the following vision a reality: repositories of data with standardized descriptions will be accessible, together with language processing tools that operate on them; legal and access issues will be resolved; and all of this will be available on the Internet using existing and emerging standards. The project is therefore primarily concerned with turning existing, fragmented technology and resources into accessible and stable services that any user can share or customize for their own applications. For language corpora, this means overcoming the current difficulties in finding corpora, accessing relevant documentation, and negotiating permission to use them; addressing the problems caused by the lack of standardisation, such as the poor interoperability of corpora and tools; and, finally, making use of the emerging possibilities for online processing and virtual collaborative environments.

This presentation examines the use of corpora in literary studies as an example of their practical application in another field, reviewing current work and the various ways in which corpora are used. The use of corpora in historical studies is also briefly considered as another potentially fruitful area of inquiry.

This examination of some aspects of the use of corpora outside linguistics raises fundamental questions about the definition and nature of the corpus. Enormous quantities of linguistic content are available to the researcher on the World Wide Web, including large-scale, high-quality scholarly collections. Do these make corpora redundant? Can we still make a case for the carefully selected and crafted corpus of limited size in the current environment? Should corpus builders consider the needs of users in different disciplines when designing corpora? And what are the implications for corpus design if users want to extract information from the content of texts, rather than linguistic patterns? These challenges are considered in the context of the emerging research infrastructure.


United Kingdom