Legal corpora

  

Introduction

Legal corpora contain legislation, legal acts, transcriptions of court decisions, and other kinds of materials related to national or supranational law. Such corpora are an important resource for anyone who practises or researches law, as they can be used to investigate issues such as legal phraseology and terminology, variation in legal discourse, legal translation, register and genre perspectives on legal discourse, legal discourse in forensic contexts, and evaluative language in judicial settings (Goźdź-Roszkowski 2021).

The CLARIN infrastructure gives access to 33 legal corpora, most of which are richly annotated both linguistically (e.g., syntactic dependency parsing in addition to PoS-tagging and lemmatisation) and at various domain-specific metalinguistic levels, such as speaker roles in courtroom proceedings (e.g., judge, defendant, prosecutor). As most CLARIN members are European countries, many of the legal corpora draw on the so-called Acquis Communautaire, the legislation, legal acts and court decisions that constitute the law of the European Union.

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.

This website was last updated on 17 October 2022.

Monolingual corpora

Corpus Language Description Availability

Annotated Corpus of Czech Case Law for Reference Recognition Tasks

Annotation: legal references (identifier of court decision; author of law book or article, etc.)

Licence: CC BY 4.0

Czech

This corpus consists of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Each decision has been manually annotated by two trained annotators; the corpus was developed primarily as training and testing material for reference recognition tasks. See also the variant of this corpus annotated for segmentation tasks.

The corpus is available for download from LINDAT.

For the relevant publication, see Harašta et al. (2018)

Download

Czech Court Decisions Corpus (CzCDC 1.0)

Size: 460 million words

Annotation: unannotated

Licence: CC BY-NC 4.0

Czech

This corpus consists of around 237,000 court decisions from three top-tier courts (Supreme, Supreme Administrative, and Constitutional) in Czechia, published between 1993 and 2018.

The corpus is available for download from LINDAT.

For the relevant publication, see Novotná and Harašta (2019)

Download

Czech Legal Text Treebank

Size: 1128 sentences

Annotation: manual syntactic annotation; manual annotation of entities from the accounting domain and of the relations definition, obligation, and right

Licence: CC BY-NC-SA 4.0

Czech

This corpus consists of two legal documents: the Accounting Act (563/1991 Coll., as amended) and the Decree on Double-entry Accounting for undertakers (500/2002 Coll., as amended).

The corpus is available for download from LINDAT and online browsing through the treebank viewer PML-TQ and the concordancer KonText.

For the relevant publication, see Kríž and Hladká (2018)

Download

PML-TQ

KonText

META-NORD Acquis Danish Treebank

Size: 102 sentences; 1799 words

Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation

Licence: CC BY 4.0

Danish

This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).

Download

Browse

Corpus Juridisch Nederlands

Size: 5,856 texts

Annotation: lemmatised, PoS-tagged

Licence: CLARIN PUB

Dutch

This corpus contains legal texts from 1814 to 1989, compiled year by year.

The corpus is available for online browsing on a dedicated webpage.

For the relevant publication, see de Does et al. (2017)

Browse

CABank English SCOTUS Oral Arguments Corpus

Annotation: speaker segmentation, sociolinguistic annotation

Licence: CC BY-NC-SA 3.0

English

This corpus consists of transcripts and recordings of oral arguments at the Supreme Court of the United States.

The transcripts and audio recordings are aligned at the utterance level; the utterances are annotated based on speaker role (the primary one being Justice) and name, as well as gender.

The corpus is part of the CABank collection and available for download from and online browsing through TalkBank.

For the relevant publication, see Johnson and Goldman (2009)

Browse

Download

The English Sub-corpus of MULCOLD, Multilingual Parallel Corpus of Legal Texts

Size: 359,874 tokens

Annotation: lemmatised, MSD-tagged

Licence: CC BY-ND

English

This corpus, which is a subcorpus of MULCOLD (see also the Parallel corpora resource family), contains international conventions and treaties.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN Distribution).

Browse

English Acquis Communautaire

Size: 34.6 million tokens

Licence: MIT (academic)

English

This corpus contains selected texts from the Acquis Communautaire between the 1950s and today, translated into English.

The corpus is available for download from PORTULAN.

For the relevant publication, see Steinberger et al. (2006)

Download

Old Bailey Corpus

Size: 24.4 million words

Annotation: sociolinguistic annotation

Licence: CC BY-NC-SA 4.0

English (Late Modern)

This historical corpus consists of the Proceedings of the Old Bailey, London’s central criminal court between 1674 and 1913. The corpus covers texts from 1720 to 1913 and carries detailed utterance-level sociolinguistic annotation at three levels: sociobiographical speaker information (gender, age, occupation, social class), pragmatic information (speaker role in the courtroom, such as judge or witness), and metatextual information (the scribe, printer, and publisher of the individual Proceeding).

The corpus is available for download from CLARIN-D (Saarland University) and for online browsing through CQPWeb.

Browse

Download

Corpus of Estonian law texts

Size: 11 million tokens

Licence: CLARIN PUB

Estonian

This corpus contains Estonian laws (1.8 million tokens) as well as European legislation (9.6 million tokens) translated into Estonian.

The corpus is available for download from a dedicated webpage hosted by CLARIN Estonia.

Download

META-NORD Acquis Estonian Treebank

Size: 78 sentences; 1443 words

Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation

Licence: CC BY 4.0

Estonian

This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).

Download

Browse

The Finnish Sub-corpus of FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts

Size: 1.5 million tokens

Annotation: lemmatised, MSD-tagged

Licence: CC BY-ND

Finnish

This is the Finnish subcorpus of FiRuLex, which contains juridical texts in Russian and Finnish.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN distribution).

Browse

The Finnish Sub-corpus of the JRC-Acquis Multilingual Parallel Corpus, Downloadable Version

Size: 44.1 million tokens

Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation

Licence: CC BY

Finnish

This is the legal subcorpus of the Helsinki Korp Version of the Finnish TreeBank 3.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN distribution) and for download from the Finnish Language Bank.

Browse

Download

META-NORD Acquis Finnish Treebank

Size: 122 sentences; 1464 words

Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation

Licence: CC BY 4.0

Finnish

This is a subcorpus of the META-NORD Acquis Parallel Treebank. The corpus is syntactically parsed using the FinnTreeBank 2 schema and is available for download and online browsing through INESS (CLARINO).

Download

Browse

The German Sub-corpus of MULCOLD, Multilingual Parallel Corpus of Legal Texts

Size: 198,035 tokens

Licence: CC BY-ND

German

This corpus, which is a subcorpus of MULCOLD (see also the Parallel corpora resource family), contains international conventions and treaties.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN Distribution).

Browse

Corpus of Judicial Rhetoric: cases of rapes and homicides

Licence: CC BY-NC-ND 4.0

Greek

This corpus consists of transcriptions of defendants’ and witnesses’ speeches in criminal cases of rape, attempted rape, murder, and attempted murder.

The corpus is available for download from the CLARIN:EL repository.

Download

META-NORD Acquis Icelandic Treebank

Size: 73 sentences; 1880 words

Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation

Licence: CC BY 4.0

Icelandic

This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).

Download

Browse

IGC-Laws-21.05 (The Icelandic Gigaword Corpus: Law, bills and proposals)

Size: 2.2 million sentences; 40.6 million words

Annotation: lemmatised, MSD-tagged

Licence: CC BY 4.0

Icelandic

IGC-Laws is a subcorpus of the Icelandic Gigaword Corpus (see also CLARIN reference corpora). IGC-Laws contains 1) the Icelandic laws, 2) explanatory reports and observations extracted from bills submitted to Althingi, and 3) parliamentary proposals and resolutions. The corpus comes in two formats: one contains the texts untokenised and untagged, while the other has been tokenised, PoS-tagged and lemmatised.

The corpus is available for download from the CLARIN-IS repository.

For the relevant publication, see Steingrímsson et al. (2018)

Download

Corpus of Legal Acts of the Republic of Latvia (Likumi)

Size: 116 million tokens; 73 million words

Licence: CC BY 4.0

Latvian

The corpus contains all legal acts of the Republic of Latvia published on the website likumi.lv (until February 2022).

The corpus is available for download from the CLARIN.LV repository.

Download

Lithuanian Corpus of the EU Primary and Secondary Law Acts of the Period 2015–2017

Size: 274,460 words

Licence: CLARIN PUB

Lithuanian

This corpus contains primary and secondary European law acts (32 texts) translated into Lithuanian.

The corpus is available for download from CLARIN-LT.

Download

Maltese Acquis Communautaire

Size: 20.9 million tokens

Licence: MIT (academic)

Maltese

This corpus contains selected texts from the Acquis Communautaire between the 1950s and today, translated into Maltese.

The corpus is available for download from PORTULAN.

For the relevant publication, see Steinberger et al. (2006)

Download

META-NORD Acquis Norwegian Treebank

Size: 101 sentences; 1862 words

Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation

Licence: CC BY 4.0

Norwegian

This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).

Download

Browse

Norwegian Acquis Communautaire

Size: 14 million words

Licence: CC BY-NC 4.0

Norwegian (Bokmål and Nynorsk)

This corpus contains Norwegian translations of 5,414 documents from the Acquis Communautaire.

The corpus is available for download from the Norwegian Language Bank.

Download

Legal Documents from Norwegian Nynorsk Municipalities

Size: 127 million words

Licence: CC0 1.0 Universal

Norwegian (Nynorsk and Bokmål)

This corpus contains 50,000 legal documents and meeting minutes collected with the web crawler Veidemann. Around 88.5 million words are in Nynorsk, while the rest are in Bokmål.

The corpus is available for download from the Norwegian Language Bank.

Download

The Russian Sub-corpus of MULCOLD, Multilingual Parallel Corpus of Legal Texts

Size: 198,035 tokens

Annotation: lemmatised, MSD-tagged

Licence: CC BY-ND

Russian

This corpus, which is a subcorpus of MULCOLD (see also the Parallel corpora resource family), contains international conventions and treaties.

The corpus can be accessed online through the concordancer Korp (FIN-CLARIN Distribution).

Browse

The Russian Sub-corpus of FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts

Size: 1.2 million tokens

Annotation: lemmatised, MSD-tagged

Licence: CC BY-ND

Russian

This is the Russian subcorpus of FiRuLex, which contains juridical texts in Russian and Finnish.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN distribution).

Browse

META-NORD Acquis Swedish Treebank

Size: 102 sentences; 1982 words

Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation

Licence: CC BY 4.0

Swedish

This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).

Download

Browse

Multilingual corpora

Corpus Language Description Availability

JRC EU DGT Translation Memory Parsebank DGT-UD

Size: 2.1 billion tokens

Annotation: syntactically parsed (Universal Dependencies)

Licence: CC BY 4.0

Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hungarian, Irish, Italian, Latvian, Lithuanian, Modern Greek (1453-), Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish

This is a syntactically parsed parallel corpus in 23 languages. It consists of the JRC DGT translation memory of European law, automatically annotated with UDPipe 1.2 using Universal Dependencies 2.0 models.

The corpus is available for download from the CLARIN.SI repository and for online browsing through the KonText and noSketch Engine concordancers.

Download

KonText

noSketch Engine

The JRC-Acquis Corpus, version 3.0

Size: 1 billion words

Annotation: paragraph and sentence alignment

Licence: CC BY 4.0

Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish

This is a parallel corpus of the Acquis Communautaire, the total body of European Union law applicable in the EU member states.

Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Because of the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable for carrying out all types of cross-language research, as well as for testing and benchmarking text analysis software across different languages (for instance, for alignment, sentence splitting and term extraction). The sentence-level alignment was done using the hunalign tool.

The corpus is available for download from the CLARIN:EL repository.

For the relevant publication, see Steinberger et al. (2006)

Download

COVID-19 EUR-LEX dataset. Bilingual (EN-PT)

Size: 21,000 units

Licence: CC BY

English, Portuguese

This is a parallel corpus of European Union law pertaining to the COVID-19 period.

The corpus is available for download from the PORTULAN repository.

Download

Legal texts from Estonian Ministry of Justice (Processed)

Size: 47,000 units

Licence: CC BY

Estonian-English

This corpus contains Estonian-English translations of the Acts of Estonian law.

The corpus is available for download from PORTULAN.

Download

MultiEURLEX

Annotation: conceptual annotation

Licence: CC BY

Finnish, Slovak, Lithuanian, Croatian, Slovenian, Estonian, Latvian, Maltese, English, German, French, Italian, Spanish (Castilian), Polish, Romanian (Moldavian, Moldovan), Dutch (Flemish), Modern Greek (1453-), Hungarian, Portuguese, Czech, Swedish, Bulgarian, Danish

This corpus consists of 65,000 European laws in 23 official EU languages. Each law has been annotated with EuroVoc concept labels.

The corpus is available for download from the repository of CLARIN:EL.

For the relevant publication, see Chalkidis et al. (2021)

Download

COVID-19 EUR-LEX dataset. Multilingual (CEF languages)

Size: 475,931 translation pairs

Licence: CC BY

Maltese, Hungarian, Lithuanian, Latvian, Polish, Portuguese, English, Slovenian, Modern Greek, Spanish (Castilian), Romanian, Slovak, Moldavian, Swedish, Bulgarian, Italian, German, Croatian, French, Dutch (Flemish), Czech, Finnish, Danish, Irish, Estonian

This is a multilingual corpus of European Union law pertaining to the COVID-19 period.

The corpus is available for download from the PORTULAN repository.

Download

Publications on the legal corpora

[Chalkidis et al. 2021] Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021. MultiEURLEX – A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. arXiv pre-print.

[de Does et al. 2017] Jesse de Does, Jan Niestadt, and Katrien Depuydt. 2017. Creating research environments with BlackLab. In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries, 151–165. London: Ubiquity Press.

[Goźdź-Roszkowski 2021] Stanisław Goźdź-Roszkowski. 2021. Corpus Linguistics in Legal Discourse. International Journal for the Semiotics of Law 34: 1515–1540. 

[Harašta et al. 2018] Jakub Harašta, Jaromír Šavelka, František Kasl, Adéla Kotková, Pavel Loutocký, Jakub Míšek, Daniela Procházková, Helena Pullmannová, Petr Semenišín, Tamara Šejnová, Nikola Šimková, Michal Vosinek, Lucie Zavadilová, and Jan Zibner. 2018. Annotated Corpus of Czech Case Law for Reference Recognition Tasks. In TSD 2018, volume 11107. Springer, Cham. 

[Johnson and Goldman 2009] Timothy R. Johnson and Jerry Goldman, eds. 2009. A Good Quarrel: America’s Top Legal. University of Michigan Press. 

[Kríž and Hladká 2018] Vincent Kríž and Barbora Hladká. 2018. Czech Legal Text Treebank 2.0. Proceedings of LREC 2018, 4501–4505. 

[Novotná and Harašta 2019] Tereza Novotná and Jakub Harašta. 2019. The Czech Court Decisions Corpus (CzCDC): Availability as the First Step. arXiv pre-print.

[Steinberger et al. 2006] Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. 2006. The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of LREC 2006, 2142–2147.

[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. In Proceedings of LREC 2018, 4361–4366.