The aim of the CLARIN Resource Families initiative is to provide a user-friendly overview of the available corpora in the CLARIN infrastructure for researchers from digital humanities, social sciences and human language technologies. The overviews are organized according to the types of data in the corpora and include listings of corpora sorted by language.
The listings include the most important metadata and descriptions of corpus size, text sources, time periods, annotations and licences, as well as links to download pages and concordancers, whenever available. In addition to the corpora found in the CLARIN infrastructure, we provide an overview of other valuable corpora that have not yet been integrated into the infrastructure.
We also provide hyperlinks to other relevant materials, such as the thematic CLARIN workshops and tutorials and their accompanying video lectures, as well as a list of key publications on the corpora surveyed.
We currently offer overviews of 7 resource families:
- Computer-mediated communication corpora
- Historical corpora
- L2 learner corpora
- Newspaper corpora
- Parallel corpora
- Parliamentary corpora
- Spoken corpora
In the future, we plan to include other resource families, such as manually annotated corpora, and to add tutorials on how to query, annotate and analyse the data.
The overviews have been prepared by Darja Fišer and Jakob Lenardič and have received funding from the European Union's Horizon 2020 research and innovation programme through the CLARIN-PLUS and PARTHENOS projects. We would like to thank all the User Involvement coordinators, National Coordinators, workshop participants and other individuals who participated in the survey and provided information about the resources.
Comments and suggestions to improve this page are welcome. Please send us an email.
This website was last updated on 5 July 2018.