Manually annotated corpora

Introduction

Manual corpora are collections of texts containing manually validated or manually assigned linguistic information, such as morphosyntactic tags, lemmas, syntactic parses, named entities etc. These corpora can be used to train new language annotation tools as well as to test the accuracy of existing annotation tools. 
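As a rough illustration of how such a corpus can be used to test an annotation tool, the sketch below computes token-level tagging accuracy against manually assigned tags. The function name and the sample tags are hypothetical and do not come from any corpus listed here; real evaluations typically also report per-tag or sentence-level figures.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical gold tags from a manually annotated sentence and a tool's output.
gold_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
tool_tags = ["DET", "NOUN", "VERB", "DET", "ADJ"]
accuracy = tagging_accuracy(gold_tags, tool_tags)
print(f"accuracy = {accuracy:.2f}")
```

The comparison assumes that both sequences share the same tokenisation; if the tool tokenises differently from the corpus, the tokens have to be aligned first.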

There are 74 manually annotated training corpora and corpus collections in the CLARIN infrastructure, 63 of which are monolingual (accounting for 21 different languages) and 11 multilingual. Among the multilingual corpora, 4 collections were annotated under the following umbrella initiatives: HamleDT 3.0, Treebanks of INESS, Universal Dependencies 2.8.1, and Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1).

The corpora and corpus collections are classified into 6 categories based on the type of manual annotation:

If a corpus is manually annotated for more than one type of linguistic information, it is listed under all the relevant sections. For instance, the xLiMe Twitter Corpus XTC 1.0.1 is manually annotated for PoS tags, Named Entities and sentiment, so it is listed under all three relevant sections.

For comments, changes to the existing content or inclusion of new corpora, send us an email.

This website was last updated on 14 April 2022.

The manually annotated corpora

PoS/MSD tagging

Corpus Language Description Availability

MULTEXT-East "1984" annotated corpus 4.0

Size: 80,000 sentences, 1 million words

Annotation: morphosyntactic tagging, lemmatisation, sentence alignment

Licence: CC BY-NC-SA 4.0

Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian

This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec (2012)

Download
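MULTEXT-East morphosyntactic descriptions (MSDs) are generally described as positional strings: the first character gives the word category and each following character gives the value of one attribute, with a hyphen where an attribute does not apply. The sketch below splits such a string into its parts. The category table, the function name and the sample tags are assumptions made for illustration; the meaning of each position differs by language and should be taken from the tagset specification.

```python
from typing import List, Optional, Tuple

# Assumed category codes; verify against the MULTEXT-East specification.
CATEGORIES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "R": "Adverb", "S": "Adposition", "C": "Conjunction", "M": "Numeral",
    "Q": "Particle", "I": "Interjection", "Y": "Abbreviation", "X": "Residual",
}


def split_msd(msd: str) -> Tuple[str, List[Optional[str]]]:
    """Split an MSD into its category name and positional attribute codes."""
    if not msd:
        raise ValueError("empty MSD")
    category = CATEGORIES.get(msd[0], "Unknown")
    # A hyphen marks a position whose attribute does not apply; store it as None.
    values = [None if ch == "-" else ch for ch in msd[1:]]
    return category, values


# Illustrative tags only, not taken from the corpus.
print(split_msd("Ncmsn"))
print(split_msd("Vm-p"))
```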

The Morphologically Annotated Part of BulTreeBank

Size: 214,000 tokens

Annotation: morphosyntactic tagging

Licence: MS-NC-NoReD

Bulgarian

This corpus is available for download through the concordancer Corpuscle.

Concordancer

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

Size: 89,000 tokens

Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition

Licence: CC BY 4.0

Croatian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

BNC Sampler

Size: 2 million tokens

Annotation: PoS tagging

Licence: BNC Licence

English

The corpus was manually post-edited to correct the PoS tags automatically assigned by CLAWS.

The corpus is available for online querying via CQPWeb (registration required) and for download from the Oxford Text Archive.

Concordancer

Download

Corpus of morphologically disambiguated Estonian texts

Size: 513,000 tokens

Annotation: morphological disambiguation

Licence: CLARIN_ACA-NC

Estonian

This corpus contains texts from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990.

Download

Austrian Baroque Corpus

Size: 200,000 tokens

Annotation: tokenised, PoS-tagged, lemmatised, named entities

German

This historical corpus contains sermons from 1650 to 1750. Each token was automatically assigned to a morphosyntactic word class using the TreeTagger software, with the 54-part Stuttgart-Tübingen TagSet (STTS) as the classification system. For lemmatisation, a normalised basic word form was used for each token, with the Duden and the German dictionary by Jacob and Wilhelm Grimm as reference works. The part-of-speech tagging and lemmatisation were then manually checked.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Resch et al (2016)

Concordancer

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens

Annotation: PoS tagging, Named Entity recognition, sentiment analysis

Licence: MIT License

German, Italian, Spanish

This corpus contains Tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

Szeged Corpus 2.0

Size: 1.5 million tokens

Annotation: morphosyntactic tagging

Licence: Licence agreement

Hungarian

This corpus is available for download from a dedicated webpage.

To download the versions of the Szeged Corpus and Szeged Treebank, you must fill in and send a Licence Agreement.

Download

Lithuanian morphologically annotated corpus - MATAS

Size: 1.6 million words

Annotation: morphosyntactic tagging

Licence: CLARIN ACA

Lithuanian

The corpus contains texts from various domains (documents, fiction, periodicals, scientific texts, wordforms).

This corpus is available for download from the CLARIN-LT repository.

Download

NKJP1M

Size: 1 million tokens

Annotation: morphosyntactic tagging

Licence: GNU GPL 3

Polish

This corpus is a manually annotated subset of the National Corpus of Polish.

The corpus is available for download from the Computational Linguistics in Poland website.

For the relevant publication, see Przepiórkowski and Murzynowski (2011)

Download

CINTIL-Corpus Internacional do Português

Size: 1 million tokens

Annotation: morphosyntactic tagging, Named Entity recognition

Licence: CLARIN RES

Portuguese

The corpus contains transcriptions of spoken communication as well as written texts from several genres (news, literature, magazines, etc.).

The corpus is available for download from the ELRA Catalogue.

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

Size: 92,000 tokens

Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition

Licence: CC BY 4.0

Serbian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016).

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens

Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition

Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus jos1M 1.1

Size: 1 million words

Annotation: morphosyntactic tagging and lemmatisation

Licence: CC BY-NC 4.0

Slovenian

This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec et al. (2010).

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens

Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation (whole corpus); syntactic parsing, Named Entity recognition, and verbal multiword expression tagging (half of the corpus); semantic roles (a quarter of the corpus)

Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Lemmatisation

Corpus Language Description Availability

MULTEXT-East "1984" annotated corpus 4.0

Size: 80,000 sentences, 1 million words

Annotation: morphosyntactic tagging, lemmatisation, sentence alignment

Licence: CC BY-NC-SA 4.0

Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian

This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec (2012)

Download

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

Size: 89,000 tokens

Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition

Licence: CC BY 4.0

Croatian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

Austrian Baroque Corpus

Size: 200,000 tokens

Annotation: tokenised, PoS-tagged, lemmatised, named entities

German

This historical corpus contains sermons from 1650 to 1750. Each token was automatically assigned to a morphosyntactic word class using the TreeTagger software, with the 54-part Stuttgart-Tübingen TagSet (STTS) as the classification system. For lemmatisation, a normalised basic word form was used for each token, with the Duden and the German dictionary by Jacob and Wilhelm Grimm as reference works. The part-of-speech tagging and lemmatisation were then manually checked.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Resch et al (2016)

Concordancer

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

Size: 92,000 tokens

Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition

Licence: CC BY 4.0

Serbian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016).

KonText

noSketch

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens

Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition

Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is now defunct.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Batanović et al. (2018).

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens

Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition

Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus jos1M 1.1

Size: 1 million words

Annotation: morphosyntactic tagging and lemmatisation

Licence: CC BY-NC 4.0

Slovenian

This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec et al. (2010).

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens

Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation (whole corpus); syntactic parsing, Named Entity recognition, and verbal multiword expression tagging (half of the corpus); semantic roles (a quarter of the corpus)

Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Syntactic parsing

Corpus Language Description Availability

Prague Arabic Dependency Treebank 1.0

Annotation: syntactic parsing and morphosyntactic tagging

Licence: CC BY-NC-SA 3.0

Arabic

This corpus is available for download from the LINDAT repository.

For the relevant publication, see Hajič et al. (2004)

Download

Training corpus hr500k 1.0

Size: 500,000 tokens

Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and Named Entity recognition. Half of the corpus also syntactically parsed

Licence: CC BY-SA 4.0

Croatian

This corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Czech Legal Text Treebank 2.0

Size: 1121 sentences

Annotation: syntactic parsing, labelling of semantic entities

Licence: CC BY-NC-SA 4.0

Czech

This corpus contains legal texts.

The corpus is available through the concordancer KonText and the PML-TQ tool, and for download from the LINDAT repository.

For the relevant publication, see Kríž and Hladká (2018)

KonText

PML-TQ

Download

FicTree 1.0

Size: 12760 sentences

Annotation: syntactic parsing and morphosyntactic tagging

Licence: CC BY-NC-SA 4.0

Czech

This corpus contains fictional texts.

The corpus is available for download from LINDAT and through the concordancer KonText.

For the relevant publication, see Jelínek (2017)

KonText

Download

Prague Dependency Treebank 3.5

Size: 2 million words

Annotation: syntactic parsing and morphosyntactic tagging

Licence: CC BY-NC-SA 4.0

Czech

This corpus is manually annotated at several levels: aside from syntactic parsing and morphological information, it is annotated for sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.

The corpus is available for download from the LINDAT repository.

Download

Prague Discourse Treebank 2.0

Size: 49,500 sentences

Annotation: syntactic parsing, mark-up of discourse phenomena enriched by the annotation of secondary connectives

Licence: CC-BY

Czech

This corpus is a subset of the Prague Dependency Treebank 3.5.

The corpus is available through the PML-TQ tool.

PML-TQ

Slovak Dependency Treebank

Size: 106,000 tokens, 10,600 sentences

Annotation: syntactic parsing

Licence: CC BY-SA 4.0

Slovak

The syntactic parsing is modelled after the Prague Dependency Treebank.

The corpus is available for download from the LINDAT repository.

Download

Prague Czech-English Dependency Treebank 2.0 Coref

Size: 49,000 sentences

Annotation: syntactic parsing, mark-up of coreference

Licence: CC-BY-NC-SA + LDC99T42 (restricted use)

Czech, English

This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style.

The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool (Czech part only).

For the relevant publication, see Hajič et al. (2012)

KonText

PML-TQ

Download

Artificial Treebank with Ellipsis

Size: 106,000 tokens, 10,604 sentences

Annotation: syntactic parsing, mark-up of elliptical constructions

Licence: Licence Universal dependencies v2.1

Czech, English, Finnish, Russian, Slovak

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the LINDAT repository.

Download

Lassy Klein-corpus

Size: 1 million tokens

Annotation: PoS tagging, syntactic parsing

Licence: VAGUE

Dutch

This corpus is available for download from the Dutch Language Institute and through the online environments PaQu and GrETEL.

For the relevant publication, see Noord (2009)

Download

Pa-Qu

GrETEL

SoNaR-1

Size: 1 million words

Annotation: PoS tagging, syntactic parsing, semantic role labelling

Dutch

This is a manually annotated subset of the much larger (approx. 500 million words) SoNaR corpus.

The corpus is available for download from the Dutch Language Institute.

Download

Estonian Treebank

Size: 1,000 sentences

Annotation: syntactic parsing

Licence: CLARIN_ACA

Estonian

The corpus contains fictional and newspaper texts.

The corpus is available for download from META-SHARE (CELR distribution).

Download

UD Estonian ver.2.3

Size: 434,000 tokens

Annotation: syntactic parsing

Licence: CC-BY-SA

Estonian

This corpus contains fictional, newspaper and scientific texts. The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from (CELR distribution).

For the relevant publication, see Muischnek et al. (2014)

Download

TimeML annotated corpus of Estonian newspaper articles

Size: 22,000 words

Annotation: morphosyntactic tagging and syntactic parsing

Licence: CC-BY-SA

Estonian

This corpus contains newspaper articles.

The corpus is available for download from META-SHARE (CELR distribution).

For the relevant publication, see Orasmaa (2014)

Download

Finnish TreeBank 1

Size: 160,000 tokens

Annotation: syntactic parsing

Licence: CC-BY 3.0

Finnish

This corpus contains 19,000 sentences from the Large Grammar of Finnish.

The corpus is available for download from the Language Bank of Finland.

Download

Finnish TreeBank 2

Size: 160,000 tokens

Annotation: syntactic parsing

Licence: CC-BY 3.0

Finnish

This corpus contains 19,000 sentences from the Large Grammar of Finnish.

The corpus is available for download from the Language Bank of Finland.

Download

Turku Dependency Treebank

Size: 204,000 tokens

Annotation: syntactic parsing

Licence: CC-BY-SA

Finnish

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the Turku BioNLP Group.

For the relevant publication, see Haverinen et al. (2013)

Download

Syntactic Reference Corpus of Medieval French

Size: 245,000 words

Annotation: syntactic parsing

Licence: CLARIN ACA

French

This corpus contains Old French texts.

The corpus is available for download from the IMS CLARIN-D repository.

For the relevant publication, see Stein and Prévost (2013)

Download

GRUG Parallel Treebank

Size: 10,400 sentence pairs

Annotation: syntactic parsing, PoS tagging

Licence: CC-BY

Georgian, Ukrainian, Russian, German

The corpus is syntactically parsed following the TIGER guidelines.

The corpus is available for download from a dedicated website provided by the CLARIN-D consortium.

Download

B4 Heliand

Size: 3495 tokens

Annotation: PoS tagging, syntactic parsing

Licence: CC-BY

German

This corpus contains historical German texts.

The corpus is available for download from the HZSK repository.

Download

Dependency-Annotated Subset of the CREG Corpus

Size: 109 sentences

Annotation: PoS tagging, syntactic parsing

Licence: CLARIN RES

German

This corpus consists of answers to reading comprehension questions written by American college students learning German.

The corpus is available for download from the Tübingen CLARIN Repository.

Download

Tübingen Treebank of Written German / Newspaper Corpus (TüBa-D/Z)

Size: 1.9 million tokens

Annotation: syntactic parsing

Licence: CLARIN RES

German

This corpus contains newspaper articles.

The corpus is available for download from the Tübingen CLARIN Repository.

Download

Szeged Treebank 2.0

Size: 82,000 sentences

Annotation: syntactic parsing

Licence: licence agreement

Hungarian

This corpus is available for download from a dedicated webpage.

For the relevant publication, see Csendes et al. (2005)

Download

Icelandic Parsed Historical Corpus (IcePaHC)

Size: 1 million tokens

Annotation: morphosyntactic tagging, lemmatisation, syntactic parsing

Licence: GNU LGPL

Icelandic

This corpus contains Icelandic texts from the 12th through the 21st centuries, approximately 100,000 words from each century. The corpus is syntactically parsed following the UPenn scheme for historical texts.

The corpus is available for online search through treebankstudio.org and for download in different formats from a dedicated webpage.

For the relevant publication, see Rögnvaldsson et al. (2012)

Download

Concordancer

Lithuanian Treebank ALKSNIS

Size: 2,355 sentences

Annotation: syntactic parsing

Licence: CLARIN PUB

Lithuanian

The syntactic parsing follows the rules of the Prague Dependency Treebank.

This corpus is available for download from the CLARIN-LT repository. The second version is available upon request.

Download

Polish Dependency Bank in Universal Dependency format

Size: 22,000 trees, 351,000 tokens

Annotation: syntactic parsing

Licence: CC BY-NC-SA 4.0

Polish

This corpus also contains sentences showing certain problematic syntactic phenomena: sentences with ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction jako, etc. The syntactic parsing follows the Universal Dependencies schema.

The first version of the corpus is available for download from the Computational Linguistics in Poland website. The second version is available upon request.

For the relevant publication, see Wróblewska (2018)

Download

CINTIL DependencyBank

Size: 110,000 tokens

Annotation: morphosyntactic tagging and syntactic parsing

Licence: ELRA END USER

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the ELRA catalogue.

Download

CINTIL TreeBank

Size: 110,000 tokens

Annotation: syntactic parsing

Licence: ELRA END USER

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the ELRA catalogue.

Download

CINTIL-DeepBank

Size: 110,000 tokens

Annotation: PoS-tagging, syntactic parsing, grammatical functions, logical forms

Licence: ELRA END USER

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the ELRA catalogue.

Download

CINTIL-PropBank

Size: 110,000 tokens

Annotation: syntactic parsing and phrase semantic roles

Licence: ELRA END USER

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the ELRA catalogue.

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens

Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition

Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is now defunct. The syntactic parsing follows the Universal Dependencies framework.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Tamil Dependency Treebank v0.1

Size: 600 sentences

Annotation: syntactic parsing and morphosyntactic tagging

Licence: CC BY-NC-SA 3.0

Tamil

The syntactic parsing follows the rules of the Prague Dependency Treebank (https://ufal.mff.cuni.cz/pdt/).

The corpus is available for download from the LINDAT repository.

Download

HamleDT 3.0

Size: 19 treebanks

Annotation: syntactic parsing and morphosyntactic tagging

Licence: HamleDT 3.0 Licence Terms

19 languages

This treebank collection is available for download from LINDAT.

The treebanks can be individually queried through KonText and the treebank tool PML-TQ. We list them here by language:

 

  1. Arabic (KonText, PML-TQ)
  2. Bengali (KonText)
  3. Catalan (KonText)
  4. Czech (KonText, PML-TQ)
  5. Dutch (KonText, PML-TQ)
  6. English (KonText)
  7. Estonian (KonText, PML-TQ)
  8. German (KonText)
  9. Greek (KonText)
  10. Hindi (KonText)
  11. Latin (KonText, PML-TQ)
  12. Persian (KonText, PML-TQ)
  13. Polish (KonText, PML-TQ)
  14. Portuguese (KonText, PML-TQ)
  15. Romanian (KonText, PML-TQ)
  16. Russian (KonText)
  17. Slovenian (KonText, PML-TQ)
  18. Spanish (KonText)
  19. Tamil (KonText, PML-TQ)

 

For the relevant publication, see Zeman et al. (2012)

Download

Treebanks of INESS

Size: 532 treebanks

Annotation: syntactic parsing

Licence: CC-BY

71 languages

This is a collection of treebanks made available through the Infrastructure for the Exploration of Syntax and Semantics (INESS).

The corpora are available for online querying through INESS.

For the relevant publication, see Rosén et al. (2012)

 

Universal Dependencies 2.8.1

Size: 27 million tokens

Annotation: morphosyntactic tagging, syntactic parsing

Licence: Licence Universal Dependencies v2.3 (publicly available)

75 languages

This corpus collection contains 126 treebanks.

The corpus collection is available for download from the LINDAT repository.

The individual treebanks in Universal Dependencies 2.3 can also be queried through the concordancer KonText and the treebank query tool PML-TQ. Below we provide links to these search environments for all the treebanks. For a detailed description of the treebanks, see the Universal Dependencies project page.

 

  1. UD_Akkadian-PISANDUB (KonText)
  2. UD_Amharic-ATT (KonText, PML-TQ)
  3. UD_Armenian-ArmTDP (KonText, PML-TQ)
  4. UD_Breton-KEB (KonText, PML-TQ)
  5. UD_Buryat-BDT (KonText, PML-TQ)
  6. UD_Cantonese-HK (KonText, PML-TQ)
  7. UD_Chinese-HK (KonText, PML-TQ)
  8. UD_Chinese-CFL (KonText, PML-TQ)
  9. UD_Coptic-Scriptorium (KonText, PML-TQ)
  10. UD_Croatian-SET (KonText, PML-TQ)
  11. UD_English-ESL (KonText, PML-TQ)
  12. UD_Faroese-OFT (KonText, PML-TQ)
  13. UD_Galician-TreeGal (KonText, PML-TQ)
  14. UD_Hindi_English-HIENCS (KonText)
  15. UD_Kazakh-KTB 2.2 (KonText, PML-TQ)
  16. UD_Komi_Zyrian-Lattice (KonText, PML-TQ)
  17. UD_Komi_Zyrian-IKDP (KonText, PML-TQ) 
  18. UD_Kurmanji-MG (KonText, PML-TQ)
  19. UD_Lithuanian-HSE (KonText, PML-TQ)
  20. UD_Maltese-MUDT (KonText, PML-TQ)
  21. UD_Marathi-UFAL (KonText, PML-TQ)
  22. UD_Naija-NSC (KonText, PML-TQ)
  23. UD_Persian-Seraji (KonText, PML-TQ)
  24. UD_Russian-Taiga (KonText, PML-TQ)
  25. UD_Sanskrit-UFAL (KonText, PML-TQ)
  26. UD_Serbian-SET (KonText, PML-TQ)  
  27. UD_Slovenian-SST (KonText, PML-TQ)
  28. UD_Tagalog-TRG (KonText, PML-TQ)
  29. UD_Telugu-MTG (KonText, PML-TQ)
  30. UD_Ukrainian-IU (KonText, PML-TQ)
  31. UD_Upper_Sorbian-UFAL (KonText, PML-TQ)
  32. UD_Uyghur-UDT (KonText, PML-TQ)
  33. UD_Warlpiri-UFAL (KonText, PML-TQ)
  34. UD_Yoruba-YTB (KonText, PML-TQ)
  35. UD_Afrikaans-AfriBooms (KonText)
  36. UD_Ancient_Greek-PROIEL (KonText)
  37. UD_Ancient_Greek-Perseus (KonText, PML-TQ)
  38. UD_Arabic-PADT (KonText, PML-TQ)
  39. UD_Arabic-PUD (KonText, PML-TQ)
  40. UD_Arabic-NYUAD (KonText)
  41. UD_Bambara-CRB (KonText, PML-TQ)
  42. UD_Basque-BDT (KonText, PML-TQ)
  43. UD_Belarusian-HSE (KonText, PML-TQ)
  44. UD_Bulgarian-BTB (KonText, PML-TQ)
  45. UD_Catalan-AnCora (KonText, PML-TQ)
  46. UD_Chinese-GSD (KonText, PML-TQ)
  47. UD_Chinese-PUD (KonText, PML-TQ)
  48. UD_Czech-PDT (KonText, PML-TQ)
  49. UD_Czech-CAC (KonText, PML-TQ)
  50. UD_Czech-FicTree (KonText, PML-TQ)
  51. UD_Czech-PUD (KonText, PML-TQ)
  52. UD_Czech-CLTT (KonText, PML-TQ)
  53. UD_Danish-DDT (KonText, PML-TQ)
  54. UD_Dutch-Alpino (KonText, PML-TQ)
  55. UD_Dutch-LassySmall (KonText, PML-TQ)
  56. UD_English-ParTUT (KonText, PML-TQ)
  57. UD_English-GUM (KonText, PML-TQ)
  58. UD_English-EWT (KonText, PML-TQ)
  59. UD_English-PUD (KonText, PML-TQ)
  60. UD_English-LinES (KonText, PML-TQ)
  61. UD_Erzya-JR (KonText, PML-TQ)
  62. UD_Finnish-FTB (KonText, PML-TQ)
  63. UD_Finnish-TDT (KonText, PML-TQ)
  64. UD_Finnish-PUD (KonText, PML-TQ)
  65. UD_French-ParTUT (KonText, PML-TQ)
  66. UD_French-GSD (KonText, PML-TQ)
  67. UD_French-Sequoia (KonText, PML-TQ)
  68. UD_French-Spoken (KonText, PML-TQ)
  69. UD_French-PUD (KonText, PML-TQ)
  70. UD_French-FTB (KonText)
  71. UD_Galician-CTG (KonText, PML-TQ)
  72. UD_German-GSD (KonText, PML-TQ)
  73. UD_German-PUD (KonText, PML-TQ)
  74. UD_Gothic-PROIEL (KonText, PML-TQ)
  75. UD_Greek-GDT (KonText, PML-TQ)
  76. UD_Hebrew-HTB (KonText, PML-TQ)
  77. UD_Hindi-HDTB (KonText, PML-TQ)
  78. UD_Hindi-PUD (KonText, PML-TQ)
  79. UD_Hungarian-Szeged (KonText, PML-TQ)
  80. UD_Indonesian-GSD (KonText, PML-TQ)
  81. UD_Indonesian-PUD (KonText, PML-TQ)
  82. UD_Irish-IDT (KonText, PML-TQ)
  83. UD_Italian-ISDT (KonText, PML-TQ)
  84. UD_Italian-ParTUT (KonText, PML-TQ)
  85. UD_Italian-PUD (KonText, PML-TQ)
  86. UD_Japanese-GSD (KonText, PML-TQ) 
  87. UD_Japanese-PUD (KonText, PML-TQ)
  88. UD_Japanese-Modern (KonText, PML-TQ)
  89. UD_Korean-Kaist (KonText, PML-TQ)
  90. UD_Korean-GSD (KonText, PML-TQ)
  91. UD_Korean-PUD (KonText, PML-TQ)
  92. UD_Latin-PROIEL (KonText, PML-TQ)
  93. UD_Latin-ITTB (KonText, PML-TQ)
  94. UD_Latin-Perseus (KonText, PML-TQ)
  95. UD_Latvian-LVTB (KonText, PML-TQ)
  96. UD_North_Sami-Giella (KonText, PML-TQ)
  97. UD_Norwegian-Bokmaal (KonText, PML-TQ)
  98. UD_Norwegian-Nynorsk (KonText, PML-TQ)
  99. UD_Norwegian-NynorskLIA (KonText, PML-TQ)
  100. UD_Old_Church_Slavonic-PROIEL (KonText, PML-TQ)
  101. UD_Old_French-SRCMF (KonText, PML-TQ)
  102. UD_Polish-LFG (KonText, PML-TQ)
  103. UD_Polish-SZ (KonText, PML-TQ)
  104. UD_Portuguese-Bosque (KonText, PML-TQ)
  105. UD_Portuguese-GSD (KonText, PML-TQ)
  106. UD_Portuguese-PUD (KonText, PML-TQ)
  107. UD_Romanian-RRT (KonText, PML-TQ)
  108. UD_Romanian-Nonstandard (KonText, PML-TQ)
  109. UD_Russian-GSD (KonText, PML-TQ)
  110. UD_Russian-PUD (KonText, PML-TQ)
  111. UD_Russian-SynTagRus (KonText, PML-TQ)
  112. UD_Slovak-SNK (KonText, PML-TQ)
  113. UD_Slovenian-SSJ (KonText, PML-TQ)
  114. UD_Spanish-AnCora (KonText, PML-TQ)
  115. UD_Spanish-GSD (KonText, PML-TQ)
  116. UD_Spanish-PUD (KonText, PML-TQ)
  117. UD_Swedish-Talbanken (KonText, PML-TQ)
  118. UD_Swedish-LinES (KonText, PML-TQ)
  119. UD_Swedish-PUD (KonText, PML-TQ)
  120. UD_Swedish_Sign_Language-SSLC (KonText, PML-TQ)
  121. UD_Tamil-TTB (KonText, PML-TQ)
  122. UD_Thai-PUD (KonText, PML-TQ)
  123. UD_Turkish-IMST (KonText, PML-TQ)
  124. UD_Turkish-PUD (KonText, PML-TQ)
  125. UD_Urdu-UDTB (KonText, PML-TQ)
  126. UD_Vietnamese-VTB (KonText, PML-TQ)

 

Download
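Universal Dependencies treebanks are distributed in the CoNLL-U format, where each token occupies one line of ten tab-separated fields, comment lines start with "#", and a blank line ends a sentence. The following is a minimal reading sketch under those assumptions; the function name and the sample sentence are hypothetical, and multiword-token ranges and empty nodes are returned unchanged rather than handled specially.

```python
from typing import Dict, Iterable, Iterator, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Hypothetical two-word sentence, built by hand for illustration.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])
sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

For work beyond a quick inspection, a dedicated CoNLL-U library is likely a better choice than hand-written parsing.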

Named Entity Recognition

Corpus Language Description Availability

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

Size: 89,000 tokens

Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition

Licence: CC BY 4.0

Croatian

This corpus contains Tweets.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

Training corpus hr500k 1.0

Size: 500,000 tokens

Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and Named Entity recognition. Half of the corpus also syntactically parsed

Licence: CC BY-SA 4.0

Croatian

This corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Czech Named Entity Corpus 1.1

Size: 5868 sentences, 35220 NEs

Annotation: Named Entity recognition

Licence: CC BY-NC-SA 3.0

Czech

This corpus is available for download from LINDAT.

For the relevant publication, see Kravalová and Žabokrtský (2009)

Download

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens

Annotation: PoS tagging, Named Entity recognition, sentiment analysis

Licence: MIT License

German, Italian, Spanish

This corpus contains Tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

KPWr (Polish Corpus of Wrocław University of Technology) 1.2

Size: 447,000 tokens

Annotation: chunks and selected predicate-argument relations, Named Entity recognition, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases

Licence: CC BY-SA 3.0

Polish

This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.).

The corpus is available for download from the CLARIN-PL repository.

Download

Polish Spatial Texts 1.0

Size: 46,000 tokens

Annotation: Named Entity recognition (spatial expressions)

Licence: CC BY-SA 4.0

Polish

This corpus contains travel blogs.

The corpus is available for download from the CLARIN-PL repository.

Download

CINTIL-Corpus Internacional do Português

Size: 1 million tokens

Annotation: morphosyntactic tagging, Named Entity recognition

Licence: CLARIN RES

Portuguese

The corpus contains transcriptions of spoken communication as well as written texts from several genres (news, literature, magazines, etc.).

The corpus is available for download from the ELRA Catalogue.

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens

Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition

Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is now no longer being updated.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Batanović et al. (2018)

KonText

noSketch

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

Size: 92,000 tokens

Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition

Licence: CC BY 4.0

Serbian

This corpus contains Tweets.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens

Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition

Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC).

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens

Annotation: the full corpus is annotated with tokenisation, sentence segmentation, morphosyntactic tagging and lemmatisation; half of the corpus with syntactic parsing, Named Entity recognition and verbal multiword expression tagging; a quarter of the corpus with semantic roles

Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Sentiment analysis

Corpus Language Description Availability

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens

Annotation: PoS tagging, Named Entity recognition, sentiment analysis

Licence: MIT License

German, Italian, Spanish

This corpus contains Tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

Twitter sentiment for 15 European languages

Size: 1.6 million tweets

Annotation: sentiment analysis

Licence: CC BY-SA 4.0

Albanian, Bosnian, Bulgarian, Croatian, English, German, Hungarian, Polish, Portuguese, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish

This corpus contains Tweet IDs with sentiment annotations.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Mozetič et al. (2016)

Download

Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0

Size: 407.5 million words

Annotation: sentiment analysis (socially unacceptable discourse)

Licence: CC BY-SA 4.0

Croatian

This corpus contains news comments from the website 24sata.hr.

The corpus is available for download from CLARIN.SI.

Download

Aspect-Term Annotated Customer Reviews in Czech

Size: 2200 reviews

Annotation: sentiment analysis

Licence: CC BY-NC-SA 3.0

Czech

This corpus contains online user-product reviews.

The corpus is available for download from LINDAT.

Download

Facebook Data for Sentiment Analysis

Size: 10,000 Facebook posts

Annotation: sentiment analysis

Licence: CC BY-SA 3.0

Czech

This corpus contains Facebook posts.

The corpus is available for download from LINDAT and through the concordancer KonText.

For the relevant publication, see Habernal et al. (2013)

KonText

Download

NoReC: The Norwegian Review Corpus

Size: 14.8 million tokens

Annotation: sentiment analysis

Licence: CC BY-NC 3.0

Norwegian

This corpus contains reviews in different domains (e.g., literature and video games).

The corpus is available for download from the CLARINO repository.

For the relevant publication, see Velldal et al. (2018)

Download

Manually sentiment annotated Slovenian news corpus SentiNews 1.0

Size: 10,427 articles

Annotation: sentiment analysis

Licence: CC BY-SA 4.0

Slovenian

This corpus contains news articles.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Bučar et al. (2018)

Download

Other annotation layers

Corpus Language Description Availability

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

Size: 89,000 tokens

Annotation: word normalisation

Licence: CC BY 4.0

Croatian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

Czech Legal Text Treebank 2.0

Size: 1121 sentences

Annotation: semantic role labelling

Licence: CC BY-NC-SA 4.0

Czech

This corpus contains legal texts.

The corpus is available through the concordancer KonText, the PML-TQ tool and for download from the LINDAT repository.

KonText

PML-TQ

Download

Prague Discourse Treebank 2.0

Size: 49,500 sentences

Annotation: mark-up of discourse phenomena enriched by the annotation of secondary connectives

Licence: CC-BY

Czech

This corpus is a subset of the Prague Dependency Treebank 3.5.

The corpus is available through the PML-TQ tool.

PML-TQ

Prague Czech-English Dependency Treebank 2.0 Coref

Size: 49,000 sentences

Annotation: mark-up of coreference

Licence: CC-BY-NC-SA + LDC99T42 (restricted use)

Czech, English

This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style.

The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool.

KonText

PML-TQ

Download

Artificial Treebank with Ellipsis

Size: 106,000 tokens, 10,604 sentences

Annotation: mark-up of elliptical constructions

Licence: Universal Dependencies v2.1 licence

Czech, English, Finnish, Russian, Slovak

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the LINDAT repository.

Download

Grundtvig's Works Corpus

Size: 11,417,194 words

Annotation: linked data (places, persons, bible citations, etc.)

Licence: CC BY-NC 4.0

Danish

This corpus contains the literary works of the Danish bishop N.F.S. Grundtvig.

The corpus is available for download from the CLARIN-DK repository.

Download

SoNaR-1

Size: 1 million words

Annotation: semantic role labelling

Dutch

This is a manually annotated subset of the much larger (approx. 500 million words) SoNaR corpus.

The corpus is available for download from the Dutch Language Institute.

Download

The ACL RD-TEC 2.0

Size: 33216 tokens

Annotation: terminology extraction/classification

Licence: CC BY-NC-SA 4.0

English

This corpus contains 6818 terms extracted from abstracts of computational linguistics papers.

The corpus is available for download from LINDAT and through KonText.

For the relevant publication, see QasemiZadeh and Schumann (2016)

KonText

Download

Speech, Thought and Writing Presentation Corpus

Size: 260,000 words

Annotation: identification of reported speech

Licence: CC BY-NC-SA 3.0

English

This corpus contains literary, newspaper and biography texts.

The corpus is available for download from the Oxford Text Archive.

Download

TimeML annotated corpus of Estonian newspaper articles

Size: 22,000 words

Annotation: temporal semantic annotations

Licence: CC-BY-SA

Estonian

This corpus contains newspaper articles.

The corpus is available for download from META-SHARE (CELR distribution).

Download

Estonian Treebank annotated with coreference relations

Size: 107,000 words

Annotation: anaphora relations

Licence: GPL

Estonian

This corpus contains newspaper texts plus one scientific medical text.

The corpus is available for download from META-SHARE (CELR distribution).

Download

Semantically disambiguated corpus of Estonian

Size: 375,733 tokens

Annotation: word sense disambiguation

Licence: CLARIN ACA

Estonian

The corpus is available for download from META-SHARE (CELR distribution).

Download

Greek Coreference Corpus

Size: 62,988 tokens

Annotation: coreference

Licence: CC-BY-NC-SA

Greek

In addition to coreference, the corpus is annotated for identity and bridging relations.

For the relevant publication, see Ogrodniczuk et al. (2015)

Download

Greek Textual Entailment Corpus

Size: 600 sentence-pairs

Annotation: logical entailment

Licence: CC-BY

Greek

This corpus contains texts from the domains of politics, law and travel.

This corpus is available for download from the clarin:el repository.

Download

KPWr (Polish Corpus of Wrocław University of Technology) 1.2

Size: 447,000 tokens

Annotation: selected predicate-argument relations, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases

Licence: CC BY-SA 3.0

Polish

This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.).

The corpus is available for download from the CLARIN-PL repository.

Download

Polish Coreference Corpus

Size: 540,000 tokens

Annotation: coreference

Licence: CC BY 3

Polish

This corpus contains texts in a variety of domains (magazines, fiction literature, non-fiction literature, computer-mediated communication, academic writing, etc.).

The corpus is available for download and online browsing.

Concordancer

Download

Polish Summaries Corpus

Size: 10845 summaries

Annotation: summarization

Licence: CC BY 3

Polish

This corpus is available for download from the ZIL IPI PAN repository.

For the relevant publication, see Ogrodniczuk and Kopeć (2014)

Download

WUT Relations Between Sentences Corpus

Size: 5654 sentences

Annotation: relations between sentences based on Cross-document Structure Theory (CST)

Licence: CC BY-SA 3.0

Polish

This corpus contains news items.

The corpus is available for download from the CLARIN-PL repository.

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

Size: 92,000 tokens

Annotation: word normalisation

Licence: CC BY 4.0

Serbian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens

Annotation: word normalisation

Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Corpus of comma placement Vejica 1.3

Size: 104,000 sentences

Annotation: comma placement

Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains texts from various Slovenian corpora (KUST, Šolar, Lektor, JANES-Vejica and Wikipedia).

The corpus is available for download from CLARIN.SI.

Download

Terminology identification dataset KAS-term 1.0

Size: 22,950 term candidates

Annotation: monolingual term extraction

Licence: CC BY-SA 4.0

Slovenian

This corpus contains term candidates from PhD theses in chemistry, computer science and political science.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Holozan (2018)

Download

CMC training corpus Janes-Norm 1.2

Size: 184,755 tokens

Annotation: normalization

Licence: CC BY-SA 4.0

Slovenian

This corpus is partially also manually annotated with MSD tags and lemmatized.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens

Annotation: verbal multiword expression tagging, semantic role labelling

Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Bilingual terminology extraction dataset KAS-biterm 1.0

Size: 1,950 sentences, 78,500 tokens, 3,700 terms

Annotation: bilingual term extraction

Licence: CC BY-SA 4.0

Slovenian, English

This corpus contains PhD theses.

The corpus is available for download from the CLARIN.SI repository.

Download

Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1)

Size: 5.8 million tokens

Annotation: identification of verbal multi-word expressions (idioms, light-verb constructions, verb-particle constructions, inherently reflexive verbs, multi-verb constructions)

Licence: PARSEME Shared Task Data (v. 1.1) Agreement

16 languages

This corpus collection is available for download from LINDAT.

The PARSEME corpora can be queried individually through KonText. We provide the individual links to each corpus:

 

  1. Parseme VMWE 1.0 – Czech
  2. Parseme VMWE 1.0 – German
  3. Parseme VMWE 1.0 – Greek
  4. Parseme VMWE 1.0 – Spanish
  5. Parseme VMWE 1.0 – Persian
  6. Parseme VMWE 1.0 – French
  7. Parseme VMWE 1.0 – Hungarian
  8. Parseme VMWE 1.0 – Italian
  9. Parseme VMWE 1.0 – Maltese
  10. Parseme VMWE 1.0 – Polish
  11. Parseme VMWE 1.0 – Portuguese
  12. Parseme VMWE 1.0 – Romanian
  13. Parseme VMWE 1.0 – Slovenian
  14. Parseme VMWE 1.0 – Swedish
  15. Parseme VMWE 1.0 – Turkish

 

Download

Publications

[Batanović et al. 2018] Vuk Batanović, Nikola Ljubešić, and Tanja Samardžić. 2018. SETimes.SR – A Reference Training Corpus of Serbian.

[Bučar et al. 2018] Jože Bučar, Martin Žnidaršič, and Janez Povh. 2018. Annotated news corpora and a lexicon for sentiment analysis in Slovene.

[Csendes et al. 2005] Dóra Csendes, János Csirik, Tibor Gyimóthy, and András Kocsor. 2005. The Szeged Treebank.

[Erjavec 2012] Tomaž Erjavec. 2012. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.

[Erjavec et al. 2010] Tomaž Erjavec, Darja Fišer, Simon Krek, and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene.

[Fišer et al. 2018] Darja Fišer, Nikola Ljubešić and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content.

[Habernal et al. 2013] Ivan Habernal, Tomáš Ptáček, and Josef Steinberger. 2013. Sentiment Analysis in Czech Social Media Using Supervised Machine Learning.

[Hajič et al. 2004] Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic Dependency Treebank: Development in Data and Tools.

[Hajič et al. 2012] Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0.

[Haverinen et al. 2014] Katri Haverinen, Jenna Nyblom, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Tapio Salakoski, and Filip Ginter. 2014. Building the essential resources for Finnish: the Turku Dependency Treebank.

[Holozan 2018] Peter Holozan. 2018. Corpus of comma placement Vejica 1.3.

[Kravalová and Žabokrtský 2009] Jana Kravalová and Zdeněk Žabokrtský. 2009. Czech Named Entity Corpus and SVM-based Recognizer.

[Kríž and Hladká 2018] Vincent Kríž and Barbora Hladká. 2018. Czech Legal Text Treebank 2.0.

[Miličević and Ljubešić 2016] Maja Miličević and Nikola Ljubešić. 2016. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets.

[Mozetič et al. 2016] Igor Mozetič, Miha Grčar, and Jasmina Smailović. 2016. Multilingual Twitter Sentiment Classification: The Role of Human Annotators.

[Muischnek et al. 2014] Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Eleri Aedmaa, Riin Kirt, and Dage Särg. 2014. Estonian Dependency Treebank and its annotation scheme.

[van Noord 2009] Gertjan van Noord. 2009. Huge Parsed Corpora in LASSY. 

[Jelínek 2017] Tomáš Jelínek. 2017. FicTree: a Manually Annotated Treebank of Czech Fiction.

[Ogrodniczuk and Kopeć 2014] Maciej Ogrodniczuk and Mateusz Kopeć. 2014. The Polish Summaries Corpus.

[Ogrodniczuk et al. 2015] Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, and Magdalena Zawisławska. 2015. Coreference: Annotation, Resolution and Evaluation in Polish.

[Orasmaa 2014] Siim Orasmaa. 2014. Towards an Integration of Syntactic and Temporal Annotations in Estonian.

[Przepiórkowski and Murzynowski 2011] Adam Przepiórkowski and Grzegorz Murzynowski. 2011. Manual annotation of the National Corpus of Polish with Anotatornia.

[QasemiZadeh and Schumann 2016] Behrang QasemiZadeh and Anne-Kathrin Schumann. 2016. The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods.

[Rei et al. 2016] Luis Rei, Dunja Mladenić, and Simon Krek. 2016. A Multilingual Social Media Linguistic Corpus.

[Resch et al. 2016] Claudia Resch, Ulrike Czeitschner, Eva Wohlfarter, Barbara Krautgartner. 2016. Introducing the Austrian Baroque Corpus: Annotation and Application of a Thematic Research Collection.

[Rögnvaldsson et al. 2012] Eiríkur Rögnvaldsson, Anton Karl Ingason, Einar Freyr Sigurðsson and Joel Wallenberg. 2012. The Icelandic Parsed Historical Corpus (IcePaHC).

[Rosén et al. 2012] Victoria Rosén, Koenraad De Smedt, Paul Meurer, and Helge Dyvik. 2012. An Open Infrastructure for Advanced Treebanking.

[Stein and Prévost 2013] Achim Stein and Sophie Prévost. 2013. Syntactic annotation of medieval texts: the Syntactic Reference Corpus of Medieval French (SRCMF).

[Velldal et al. 2018] Erik Velldal, Lilja Øvrelid, Eivind Alexander Bergem, Cathrine Stadsnes, Samia Touileb, and Fredrik Jørgensen. 2018. NoReC: The Norwegian Review Corpus.

[Wróblewska 2018] Alina Wróblewska. 2018. Extended and enhanced Polish dependency bank in Universal Dependencies format.

[Zeman et al. 2012] Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, and Jan Hajič. 2012. HamleDT: To Parse or Not to Parse?