Corpora of academic texts contain scholarly writing, such as research papers, essays, and abstracts published in academic journals, conference proceedings, and edited volumes; theses written by undergraduate and graduate students; and scientific monographs.
The CLARIN ERIC infrastructure gives access to 22 corpora of academic texts, 2 of which are multilingual and 20 monolingual. The available corpora contain scholarly texts in the following 11 languages: Czech, English, Estonian, Finnish, French, German, Greek, Russian, Slovenian, Spanish, and Swedish. More than 15 different scholarly disciplines are represented, with the most prominent being linguistics, computer science, economics, and medicine. The majority of the corpora are richly tagged and are available under public licences.
We first provide an overview of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.
For comments, changes to the existing content, or the inclusion of new corpora, please send us an email.
This website was last updated on 21 June 2023.
Corpora of academic texts in the CLARIN infrastructure
Monolingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 3 million words | Czech | This corpus contains research papers in sociology published between 1993 and 2016. The corpus data are in the TSV format. The corpus is available for download from the LINDAT repository. | |
ACL Anthology Reference Corpus. Size: 75 million tokens | English | This corpus contains research papers in computational linguistics published between 1979 and 2015. The corpus data are in the XML format. The corpus is available for online querying through the Sketch Engine (log-in required) and for download from a dedicated website. For the relevant publication, see Bird et al. 2008. | |
English Scientific Text Corpus. Size: 35 million tokens | English | This corpus contains journal articles from several disciplines. The articles were published in the 1970s, 1980s, and 2000s. The corpus is available for online querying through CQPweb (CLARIN-D distribution). For the relevant publication, see Degaetano-Ortlieb et al. 2013. | |
Size: 437,000 words | English | This corpus contains journal paper abstracts in biomedicine. The corpus data are in various formats, e.g., PTB. The corpus is available for download from PORTULAN. For the relevant publication, see Su et al. 2008. | |
Size: 200 million tokens | English | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | |
Size: 32 million tokens | English (late and early modern) | This corpus contains journal articles published in Philosophical Transactions of the Royal Society of London between 1665 and 1869. The corpus is available for online querying through CQPweb and for download from the CLARIN-D repository of the University of Saarland. For the relevant publication, see Kermes et al. 2016. | |
Corpus of Estonian scientific texts. Size: 5 million words | Estonian | This corpus contains scientific articles and PhD theses. The corpus data are in the P5 format. | |
Size: 12.5 million tokens | Finnish | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | |
Chambers-Le Baron Corpus of Research Articles. Size: 1 million words | French | This corpus contains research papers from several disciplines. The research papers were published between 1998 and 2006. This is a plain text corpus. The corpus is available for download from the Oxford Text Archive. | |
Size: 580,000 tokens | French | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | |
Size: 560,000 tokens | German | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | |
Modern Greek Dialects: scientific papers. Size: 113,000 words | Greek | This corpus contains scientific texts in linguistics and dialectology. This is a plain text corpus. The corpus is available for download from the CLARIN:EL repository. | |
Size: 2.5 million tokens | Greek | This corpus contains academic texts from several disciplines. The corpus is encoded in XML. The corpus is available for download from the CLARIN:EL repository. For the relevant publication, see Mantzari et al. 1999. | |
The Language of Literature and the Language of Translation (collected scientific papers). Size: 48,300 words | Greek | This corpus contains journal articles in literary and translation studies. This is a plain text corpus. The corpus is available for download from the CLARIN:EL repository. | |
Size: 1.1 million words | Russian | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | |
Corpus of Academic Slovene KAS 2.0. Size: 1.5 billion tokens | Slovenian | This corpus contains BA, MA, and PhD theses in humanities, social sciences, and natural sciences published between 2000 and 2018. The corpus data are in the format. The corpus is available for download from CLARIN.SI. Version 1.0 is also available for online querying through noSketch Engine and KonText (CLARIN.SI distribution). For the relevant publication, see Erjavec et al. 2021. | |
Size: 2.3 million tokens | Spanish | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | |
Size: 14.5 million tokens | Swedish | This corpus contains academic texts from humanities disciplines published between 1997 and 2012. The corpus data are in XML and plain text formats. The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution). | |
Academic texts - social science. Size: 10.8 million tokens | Swedish | This corpus contains academic texts from social science disciplines published between 1997 and 2012. The corpus data are in XML and plain text formats. The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution). | |
Size: 105 million tokens | Swedish | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | |
Multilingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Czech and English abstracts of ÚFAL papers. Size: 2 million words | Czech, English | This parallel corpus contains research paper abstracts in formal and applied linguistics. For each publication, the authors were required to provide both the original abstract, in Czech or English, and its translation into the other language. The corpus data are in the TSV format. The corpus is available for download from the LINDAT repository. | |
Size: 3.9 million tokens | English, French, Norwegian | This comparable corpus contains research articles in economics, linguistics, and medicine published between 1992 and 2003. The corpus is available for online browsing through the concordancer Corpuscle (CLARINO distribution). | |
Corpora outside the infrastructure
Monolingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 3.5 million words | English | This corpus contains journal articles, book chapters, course workbooks, laboratory manuals, and course notes from the following disciplines: arts, commerce, law, and biology. This corpus is not available. | |
Licence: restricted | English | This corpus contains PhD theses from the following disciplines: agriculture, psychology, food science, technology, meteorology, and history. The data are encoded in ASCII and HTML. The corpus is not publicly available: access is currently restricted to staff and researchers at the University of Reading, and only on site. However, people outside the University can use the corpus under a Research Attachment arrangement. | |
Size: 9 million words | Lithuanian | This corpus contains textbooks, scientific monographs, journal articles, abstracts, forewords, research reports, and master's and PhD theses from several disciplines. The materials were published between 1999 and 2009. The corpus is encoded in TEI 5. The corpus is available for online querying through a dedicated website. For the relevant publication, see Usonienė and Linkevičienė (2009). | |
Multilingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
MuchMore Springer Bilingual Corpus. Size: 1 million tokens | English, German | This corpus contains journal paper abstracts from medical disciplines. The corpus is encoded in MuchMore XML. The corpus is available for download from a dedicated website. | |
Size: 20 million words | French, English | This corpus contains scientific texts and argumentative essays in humanities, experimental sciences, and applied/technical sciences. The corpus is available for online querying through a dedicated webpage. | |
Corpus of Romanian Academic Genres – ROGER (bilingual, student papers). Size: 3.3 million words | Romanian, English | The corpus contains academic papers from eight disciplines, written by Romanian students in their native Romanian and in English as a second language (L2). The corpus was collected over a three-year period (2018–2021) with the help of 27 collaborators from nine Romanian universities. The corpus is available for online querying through a dedicated platform developed at the CODHUS research centre of the West University of Timisoara. For the relevant publication, see Striletchi et al. (2022). | |
Spanish-English Research Article Corpus. Size: 5.7 million words | Spanish, English | This corpus contains journal articles published between 2000 and 2010. The corpus is unavailable. | |
Related Publications
[Bird et al. 2008] Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev, and Yee Fan Tan. 2008. The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), edited by Nicoletta Calzolari, 1755–1759.
[Degaetano-Ortlieb et al. 2013] Stefania Degaetano-Ortlieb, Hannah Kermes, Ekaterina Lapshinova-Koltunski, and Elke Teich. 2013. SciTex – A Diachronic Corpus for Analyzing the Development of Scientific Registers. In New Method in Historical Corpus Linguistics, edited by Paul Bennett et al.
[Erjavec et al. 2021] Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2021. The KAS Corpus of Slovenian Academic Writing. Language Resources and Evaluation.
[Kermes et al. 2016] Hannah Kermes, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. The Royal Society Corpus: From Uncharted Data to Corpus. In Proceedings of LREC 2016, edited by Nicoletta Calzolari.
[Mantzari et al. 1999] Elena Mantzari, Maria Gavrilidou, Penny Labropoulou, and George Carayannis. 1999. Collection of digital terminological resources: methodology and results. In Proceedings of the 2nd Conference on Greek Language and Terminology.
[Parodi 2010] Giovanni Parodi. 2010. Academic and Professional genre variation across four disciplines: exploring the PUCB-2006 corpus of written Spanish. Linguagem em (Dis)curso, 10 (3): 535–567.
[Striletchi et al. 2022] Cosmin Strilețchi, Mădălina Chitez, and Karla Csürös. 2022. Building Roger: Technical Challenges While Developing a Bilingual Corpus Management and Query Platform. In Proceedings of the 17th International Conference on Software Technologies - ICSOFT.
[Su et al. 2008] Jian Su, Xiaofeng Yang, Huaqing Hong, Yuka Tateisi, and Jun'ichi Tsujii. 2008. Coreference resolution in biomedical texts: a machine learning approach. In Ontologies and Text Mining for Life Sciences: Current Status and Future Perspectives, edited by Michael Ashburner, Ulf Leser, and Dietrich Rebholz-Schuhmann.
[Usonienė and Linkevičienė 2009] Aurelija Usonienė and Jolė Linkevičienė. 2009. Lietuvių mokslo kalbos tekstynas ir specialioji leksika. Lituanistica, 55 (3–4): 133–143.