Manually Annotated Corpora

Manually annotated corpora are collections of texts containing manually validated or manually assigned linguistic information, such as morphosyntactic tags, lemmas, syntactic parses, and named entities. These corpora can be used to train new language annotation tools and to test the accuracy of existing ones.

There are 74 manually annotated training corpora and corpus collections in the CLARIN infrastructure, 63 of which are monolingual (accounting for 21 different languages) and 11 of which are multilingual. Among the multilingual corpora, 4 collections were annotated under the following umbrella initiatives: HamleDT 3.0, Treebanks of INESS, Universal Dependencies 2.8.1, and Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1).

The corpora and corpus collections are classified into 6 categories based on the type of manual annotation:

If a corpus is manually annotated for more than one type of linguistic information, it is listed under all the relevant sections. For instance, the xLiMe Twitter Corpus XTC 1.0.1 is manually annotated for PoS tags, Named Entities and sentiment, so it is listed under all three relevant sections.
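
The cross-listing rule can be illustrated with a minimal Python sketch. The `Corpus` record, the `group_by_annotation` helper and the short annotation labels are hypothetical names chosen for illustration; they are not part of any CLARIN tool or API. The two sample entries are taken from this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    name: str
    annotations: tuple  # manual annotation types (illustrative labels)


def group_by_annotation(corpora):
    """Map each annotation type to the names of the corpora listed under it."""
    sections = defaultdict(list)
    for corpus in corpora:
        # A corpus with several annotation types is listed once per type.
        for annotation in corpus.annotations:
            sections[annotation].append(corpus.name)
    return dict(sections)


corpora = [
    Corpus(
        "xLiMe Twitter Corpus XTC 1.0.1",
        ("PoS tagging", "Named Entity recognition", "sentiment analysis"),
    ),
    Corpus("BNC Sampler", ("PoS tagging",)),
]

sections = group_by_annotation(corpora)
for section, names in sections.items():
    print(f"{section}: {', '.join(names)}")
```

In this sketch the xLiMe corpus appears under three sections, while the BNC Sampler appears only under PoS tagging.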

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.

This website was last updated on 21 June 2023.

The Manually Annotated Corpora

PoS MSD tagging

Corpus Language Description Availability

MULTEXT-East "1984" annotated corpus 4.0

Size: 80,000 sentences, 1 million words
Annotation: morphosyntactic tagging, lemmatisation, sentence alignment
Licence: CC BY-NC-SA 4.0

Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian

This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec (2012)

Download

The Morphologically Annotated Part of BulTreeBank

Size: 214,000 tokens
Annotation: morphosyntactic tagging
Licence: MS-NC-NoReD

Bulgarian

This corpus is available for download through the concordancer Corpuscle.

Concordancer

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

Size: 89,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY 4.0

Croatian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

BNC Sampler

Size: 2 million tokens
Annotation: PoS tagging
Licence: BNC Licence

English

The corpus was manually post-edited to correct the PoS tags automatically assigned by CLAWS.

The corpus is available for online querying via CQPWeb (registration required) and for download from the Oxford Text Archive.

Concordancer

Download

Corpus of morphologically disambiguated Estonian texts

Size: 513,000 tokens
Annotation: morphological disambiguation
Licence: CLARIN_ACA-NC

Estonian

This corpus contains texts from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990.

Download

Austrian Baroque Corpus

Size: 200,000 tokens
Annotation: tokenised, PoS-tagged, lemmatised, named entities

German

This historical corpus contains sermons from 1650 to 1750. For linguistic annotation, each individual token was automatically assigned to a morphosyntactic word class using the TreeTagger software. As a classification system, the 54-part Stuttgart-Tübingen TagSet (STTS) was used. For lemmatization, a normalized basic word form was used for each token, and the Duden and the German dictionary by Jacob and Wilhelm Grimm were used as reference works. The part-of-speech tagging and lemmatization were then manually checked.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Resch et al. (2016).

Concordancer

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens
Annotation: PoS tagging, Named Entity recognition, sentiment analysis
Licence: MIT License

German, Italian, Spanish

This corpus contains Tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

Szeged Corpus 2.0

Size: 1.5 million tokens
Annotation: morphosyntactic tagging
Licence: Licence agreement

Hungarian

This corpus is available for download from a dedicated webpage.

To download the versions of the Szeged Corpus and Szeged Treebank, you must fill in and send a Licence Agreement.

Download

Lithuanian morphologically annotated corpus - MATAS

Size: 1.6 million words
Annotation: morphosyntactic tagging
Licence: CLARIN ACA

Lithuanian

The corpus contains texts from various domains (documents, fiction, periodicals, scientific texts, wordforms).

This corpus is available for download from the CLARIN-LT repository.

Download

NKJP1M

Size: 1 million tokens
Annotation: morphosyntactic tagging
Licence: GNU GPL 3

Polish

This corpus is a manually annotated subset of the National Corpus of Polish.

The corpus is available for download from the Computational Linguistics in Poland website.

For the relevant publication, see Przepiórkowski and Murzynowski (2011)

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

Size: 92,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY 4.0

Serbian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016).

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus jos1M 1.1

Size: 1 million words
Annotation: morphosyntactic tagging and lemmatisation
Licence: CC BY-NC 4.0

Slovenian

This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec et al. (2010).

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens
Annotation: entire corpus: tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation; half of the corpus: syntactic parsing, Named Entity recognition, and verbal multiword expression tagging; a quarter of the corpus: semantic roles
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Lemmatisation

Corpus Language Description Availability

MULTEXT-East "1984" annotated corpus 4.0

Size: 80,000 sentences, 1 million words
Annotation: morphosyntactic tagging, lemmatisation, sentence alignment
Licence: CC BY-NC-SA 4.0

Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian

This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec (2012)

Download

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

Size: 89,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY 4.0

Croatian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

Austrian Baroque Corpus

Size: 200,000 tokens
Annotation: tokenised, PoS-tagged, lemmatised, named entities

German

This historical corpus contains sermons from 1650 to 1750. For linguistic annotation, each individual token was automatically assigned to a morphosyntactic word class using the TreeTagger software. As a classification system, the 54-part Stuttgart-Tübingen TagSet (STTS) was used. For lemmatization, a normalized basic word form was used for each token, and the Duden and the German dictionary by Jacob and Wilhelm Grimm were used as reference works. The part-of-speech tagging and lemmatization were then manually checked.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Resch et al. (2016).

Concordancer

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

Size: 92,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY 4.0

Serbian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016).

KonText

noSketch

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition
Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is now defunct.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Batanović et al. (2018).

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus jos1M 1.1

Size: 1 million words
Annotation: morphosyntactic tagging and lemmatisation
Licence: CC BY-NC 4.0

Slovenian

This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec et al. (2010).

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens
Annotation: entire corpus: tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation; half of the corpus: syntactic parsing, Named Entity recognition, and verbal multiword expression tagging; a quarter of the corpus: semantic roles
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Syntactic parsing

Corpus Language Description Availability

Prague Arabic Dependency Treebank 1.0

Annotation: syntactic parsing and morphosyntactic tagging
Licence: CC BY-NC-SA 3.0

Arabic

This corpus is available for download from the LINDAT repository.

For the relevant publication, see Hajič et al. (2004)

Download

Training corpus hr500k 1.0

Size: 500,000 tokens
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and Named Entity recognition. Half of the corpus also syntactically parsed
Licence: CC BY-SA 4.0

Croatian

This corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Czech Legal Text Treebank 2.0

Size: 1121 sentences
Annotation: syntactic parsing, labelling of semantic entities
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains legal texts.

The corpus is available through the concordancer KonText and the PML-TQ tool, and for download from the LINDAT repository.

For the relevant publication, see Kríž and Hladká (2018)

KonText

PML-TQ

Download

FicTree 1.0

Size: 12760 sentences
Annotation: syntactic parsing and morphosyntactic tagging
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains fictional texts.

The corpus is available for download from LINDAT and through the concordancer KonText.

For the relevant publication, see Jelínek (2017)

KonText

Download

Prague Dependency Treebank 3.5

Size: 2 million words
Annotation: syntactic parsing and morphosyntactic tagging
Licence: CC BY-NC-SA 4.0

Czech

This corpus is manually annotated at several levels: aside from syntactic parsing and morphological information, it is annotated for sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.

The corpus is available for download from the LINDAT repository.

Download

Prague Discourse Treebank 2.0

Size: 49,500 sentences
Annotation: syntactic parsing, mark-up of discourse phenomena enriched by the annotation of secondary connectives
Licence: CC-BY

Czech

This corpus is a subset of the Prague Dependency Treebank 3.5.

The corpus is available through the PML-TQ tool.

PML-TQ

Slovak Dependency Treebank

Size: 106,000 tokens, 10,600 sentences
Annotation: syntactic parsing
Licence: CC BY-SA 4.0

Slovak

The syntactic parsing is modelled after the Prague Dependency Treebank.

The corpus is available for download from the LINDAT repository.

Download

Prague Czech-English Dependency Treebank 2.0 Coref

Size: 49,000 sentences
Annotation: syntactic parsing, mark-up of coreference
Licence: CC-BY-NC-SA + LDC99T42 (restricted use)

Czech, English

This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style.

The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool (Czech part only).

For the relevant publication, see Hajič et al. (2012)

KonText

PML-TQ

Download

Artificial Treebank with Ellipsis

Size: 106,000 tokens, 10,604 sentences
Annotation: syntactic parsing, mark-up of elliptical constructions
Licence: Licence Universal dependencies v2.1

Czech, English, Finnish, Russian, Slovak

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the LINDAT repository.

Download

Lassy Klein-corpus

Size: 1 million tokens
Annotation: PoS tagging, syntactic parsing
Licence: VAGUE

Dutch

This corpus is available for download from the Dutch Language Institute and through the online environments PaQu and GrETEL.

For the relevant publication, see van Noord (2009).

Download

PaQu

GrETEL

SoNaR-1

Size: 1 million words
Annotation: PoS tagging, syntactic parsing, semantic role labelling

Dutch

This is a manually annotated subset of the much larger (approx. 500-million-word) SoNaR corpus.

The corpus is available for download from the Dutch Language Institute.

Download

Estonian Treebank

Size: 1,000 sentences
Annotation: syntactic parsing
Licence: CLARIN_ACA

Estonian

The corpus contains fictional and newspaper texts.

The corpus is available for download from META-SHARE (CELR distribution).

Download

UD Estonian ver.2.3

Size: 434,000 tokens
Annotation: syntactic parsing
Licence: CC-BY-SA

Estonian

This corpus contains fictional, newspaper and scientific texts. The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from META-SHARE (CELR distribution).

For the relevant publication, see Muischnek et al. (2014)

Download

TimeML annotated corpus of Estonian newspaper articles

Size: 22,000 words
Annotation: morphosyntactic tagging and syntactic parsing
Licence: CC-BY-SA

Estonian

This corpus contains newspaper articles.

The corpus is available for download from META-SHARE (CELR distribution).

For the relevant publication, see Orasmaa (2014)

Download

Finnish TreeBank 1

Size: 160,000 tokens
Annotation: syntactic parsing
Licence: CC-BY 3.0

Finnish

This corpus contains 19,000 sentences from the Large Grammar of Finnish.

The corpus is available for download from the Language Bank of Finland.

Download

Finnish TreeBank 2

Size: 160,000 tokens
Annotation: syntactic parsing
Licence: CC-BY 3.0

Finnish

This corpus contains 19,000 sentences from the Large Grammar of Finnish.

The corpus is available for download from the Language Bank of Finland.

Download

Turku Dependency Treebank

Size: 204,000 tokens
Annotation: syntactic parsing
Licence: CC-BY-SA

Finnish

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the Turku BioNLP Group.

For the relevant publication, see Haverinen et al. (2013)

Download

Syntactic Reference Corpus of Medieval French

Size: 245,000 words
Annotation: syntactic parsing
Licence: CLARIN ACA

French

This corpus contains Old French texts.

The corpus is available for download from the IMS CLARIN-D repository.

For the relevant publication, see Stein and Prévost (2013)

Download

GRUG Parallel Treebank

Size: 10,400 sentence pairs
Annotation: syntactic parsing, PoS tagging
Licence: CC-BY

Georgian, Ukrainian, Russian, German

The corpus is syntactically parsed following the TIGER guidelines.

The corpus is available for download from a dedicated website provided by the CLARIN-D consortium.

Download

B4 Heliand

Size: 3495 tokens
Annotation: PoS tagging, syntactic parsing
Licence: CC-BY

German

This corpus contains historical German texts.

The corpus is available for download from the HZSK repository.

Download

Dependency-Annotated Subset of the CREG Corpus

Size: 109 sentences
Annotation: PoS tagging, syntactic parsing
Licence: CLARIN RES

German

This corpus consists of answers to reading comprehension questions written by American college students learning German.

The corpus is available for download from the Tübingen CLARIN Repository.

Download

Tübingen Treebank of Written German / Newspaper Corpus (TüBa-D/Z)

Size: 1.9 million tokens
Annotation: syntactic parsing
Licence: CLARIN RES

German

This corpus contains newspaper articles.

The corpus is available for download from the Tübingen CLARIN Repository.

Download

Szeged Treebank 2.0

Size: 82,000 sentences
Annotation: syntactic parsing
Licence: licence agreement

Hungarian

This corpus is available for download from a dedicated webpage.

For the relevant publication, see Csendes et al. (2005)

Download

Icelandic Parsed Historical Corpus (IcePaHC)

Size: 1 million tokens
Annotation: morphosyntactic tagging, lemmatisation, syntactic parsing
Licence: GNU LGPL

Icelandic

This corpus contains Icelandic texts from the 12th through the 21st centuries (approximately 100,000 words from each century). The corpus is syntactically parsed following the UPenn scheme for historical texts.

The corpus is available for online search through treebankstudio.org and for download in different formats from a dedicated webpage.

For the relevant publication, see Rögnvaldsson et al. (2012)

Download

Concordancer

Lithuanian Treebank ALKSNIS

Size: 2,355 sentences
Annotation: syntactic parsing
Licence: CLARIN PUB

Lithuanian

The syntactic parsing follows the rules of the Prague Dependency Treebank.

This corpus is available for download from the CLARIN-LT repository. The second version is available upon request.

Download

Polish Dependency Bank in Universal Dependency format

Size: 22,000 trees, 351,000 tokens
Annotation: syntactic parsing
Licence: CC BY-NC-SA 4.0

Polish

This corpus also contains sentences showing certain problematic syntactic phenomena – sentences with ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction jako, etc. The syntactic parsing follows the Universal Dependencies schema.

The first version of the corpus is available for download from the Computational Linguistics in Poland website. The second version is available upon request.

For the relevant publication, see Wróblewska (2018)

Download

CINTIL DependencyBank

Size: 110,000 tokens
Annotation: morphosyntactic tagging and syntactic parsing
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the PORTULAN CLARIN repository.

Download

CINTIL TreeBank

Size: 110,000 tokens
Annotation: syntactic parsing
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the PORTULAN CLARIN repository.

Download

CINTIL-DeepBank

Size: 110,000 tokens
Annotation: PoS tagging, syntactic parsing, grammatical functions, logical forms
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the PORTULAN CLARIN repository.

Download

CINTIL-PropBank

Size: 110,000 tokens
Annotation: syntactic parsing and phrase semantic roles
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the ELRA catalogue.

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition
Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is now defunct. The syntactic parsing follows the Universal Dependencies framework.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Tamil Dependency Treebank v0.1

Size: 600 sentences
Annotation: syntactic parsing and morphosyntactic tagging
Licence: CC BY-NC-SA 3.0

Tamil

The syntactic parsing follows the rules of the Prague Dependency Treebank (https://ufal.mff.cuni.cz/pdt/).

The corpus is available for download from the LINDAT repository.

Download

HamleDT 3.0

Size: 19 treebanks
Annotation: syntactic parsing and morphosyntactic tagging
Licence: HamleDT 3.0 Licence Terms

19 languages

This treebank collection is available for download from LINDAT.

The treebanks can be individually queried through KonText and the treebank tool PML-TQ. We list them here by language:

  1. Arabic (KonText, PML-TQ)
  2. Bengali (KonText)
  3. Catalan (KonText)
  4. Czech (KonText, PML-TQ)
  5. Dutch (KonText, PML-TQ)
  6. English (KonText)
  7. Estonian (KonText, PML-TQ)
  8. German (KonText)
  9. Greek (KonText)
  10. Hindi (KonText)
  11. Latin (KonText, PML-TQ)
  12. Persian (KonText, PML-TQ)
  13. Polish (KonText, PML-TQ)
  14. Portuguese (KonText, PML-TQ)
  15. Romanian (KonText, PML-TQ)
  16. Russian (KonText)
  17. Slovenian (KonText, PML-TQ)
  18. Spanish (KonText)
  19. Tamil (KonText, PML-TQ)

 

For the relevant publication, see Zeman et al. (2012)

Download

Treebanks of INESS

Size: 532 treebanks
Annotation: syntactic parsing
Licence: CC-BY

71 languages

This is a collection of treebanks made available through the Infrastructure for the Exploration of Syntax and Semantics (INESS).

The corpora are available for online querying through INESS.

For the relevant publication, see Rosén et al. (2012)

 

Universal Dependencies 2.8.1

Size: 27 million tokens
Annotation: morphosyntactic tagging, syntactic parsing
Licence: Licence Universal Dependencies v2.3, publicly available

75 languages

This corpus collection contains 126 treebanks.

The corpus collection is available for download from the LINDAT repository.

The individual treebanks in Universal Dependencies 2.3 can also be queried through the concordancer KonText and the treebank query tool PML-TQ. Below we provide links to these search environments for all the treebanks. For a detailed description of the treebanks, see the Universal Dependencies project page.

 

  1. UD_Akkadian-PISANDUB (KonText)
  2. UD_Amharic-ATT (KonText, PML-TQ)
  3. UD_Armenian-ArmTDP (KonText, PML-TQ)
  4. UD_Breton-KEB (KonText, PML-TQ)
  5. UD_Buryat-BDT (KonText, PML-TQ)
  6. UD_Cantonese-HK (KonText, PML-TQ)
  7. UD_Chinese-HK (KonText, PML-TQ)
  8. UD_Chinese-CFL (KonText, PML-TQ)
  9. UD_Coptic-Scriptorium (KonText, PML-TQ)
  10. UD_Croatian-SET (KonText, PML-TQ)
  11. UD_English-ESL (KonText, PML-TQ)
  12. UD_Faroese-OFT (KonText, PML-TQ)
  13. UD_Galician-TreeGal (KonText, PML-TQ)
  14. UD_Hindi_English-HIENCS (KonText)
  15. UD_Kazakh-KTB 2.2 (KonText, PML-TQ)
  16. UD_Komi_Zyrian-Lattice (KonText, PML-TQ)
  17. UD_Komi_Zyrian-IKDP (KonText, PML-TQ)
  18. UD_Kurmanji-MG (KonText, PML-TQ)
  19. UD_Lithuanian-HSE (KonText, PML-TQ)
  20. UD_Maltese-MUDT (KonText, PML-TQ)
  21. UD_Marathi-UFAL (KonText, PML-TQ)
  22. UD_Naija-NSC (KonText, PML-TQ)
  23. UD_Persian-Seraji (KonText, PML-TQ)
  24. UD_Russian-Taiga (KonText, PML-TQ)
  25. UD_Sanskrit-UFAL (KonText, PML-TQ)
  26. UD_Serbian-SET (KonText, PML-TQ)  
  27. UD_Slovenian-SST (KonText, PML-TQ)
  28. UD_Tagalog-TRG (KonText, PML-TQ)
  29. UD_Telugu-MTG (KonText, PML-TQ)
  30. UD_Ukrainian-IU (KonText, PML-TQ)
  31. UD_Upper_Sorbian-UFAL (KonText, PML-TQ)
  32. UD_Uyghur-UDT (KonText, PML-TQ)
  33. UD_Warlpiri-UFAL (KonText, PML-TQ)
  34. UD_Yoruba-YTB (KonText, PML-TQ)
  35. UD_Afrikaans-AfriBooms (KonText)
  36. UD_Ancient_Greek-PROIEL (KonText)
  37. UD_Ancient_Greek-Perseus (KonText, PML-TQ)
  38. UD_Arabic-PADT (KonText, PML-TQ)
  39. UD_Arabic-PUD (KonText, PML-TQ)
  40. UD_Arabic-NYUAD (KonText)
  41. UD_Bambara-CRB (KonText, PML-TQ)
  42. UD_Basque-BDT (KonText, PML-TQ)
  43. UD_Belarusian-HSE (KonText, PML-TQ)
  44. UD_Bulgarian-BTB (KonText, PML-TQ)
  45. UD_Catalan-AnCora (KonText, PML-TQ)
  46. UD_Chinese-GSD (KonText, PML-TQ)
  47. UD_Chinese-PUD (KonText, PML-TQ)
  48. UD_Czech-PDT (KonText, PML-TQ)
  49. UD_Czech-CAC (KonText, PML-TQ)
  50. UD_Czech-FicTree (KonText, PML-TQ)
  51. UD_Czech-PUD (KonText, PML-TQ)
  52. UD_Czech-CLTT (KonText, PML-TQ)
  53. UD_Danish-DDT (KonText, PML-TQ)
  54. UD_Dutch-Alpino (KonText, PML-TQ)
  55. UD_Dutch-LassySmall (KonText, PML-TQ)
  56. UD_English-ParTUT (KonText, PML-TQ)
  57. UD_English-GUM (KonText, PML-TQ)
  58. UD_English-EWT (KonText, PML-TQ)
  59. UD_English-PUD (KonText, PML-TQ)
  60. UD_English-LinES (KonText, PML-TQ)
  61. UD_Erzya-JR (KonText, PML-TQ)
  62. UD_Finnish-FTB (KonText, PML-TQ)
  63. UD_Finnish-TDT (KonText, PML-TQ)
  64. UD_Finnish-PUD (KonText, PML-TQ)
  65. UD_French-ParTUT (KonText, PML-TQ)
  66. UD_French-GSD (KonText, PML-TQ)
  67. UD_French-Sequoia (KonText, PML-TQ)
  68. UD_French-Spoken (KonText, PML-TQ)
  69. UD_French-PUD (KonText, PML-TQ)
  70. UD_French-FTB (KonText)
  71. UD_Galician-CTG (KonText, PML-TQ)
  72. UD_German-GSD (KonText, PML-TQ)
  73. UD_German-PUD (KonText, PML-TQ)
  74. UD_Gothic-PROIEL (KonText, PML-TQ)
  75. UD_Greek-GDT (KonText, PML-TQ)
  76. UD_Hebrew-HTB (KonText, PML-TQ)
  77. UD_Hindi-HDTB (KonText, PML-TQ)
  78. UD_Hindi-PUD (KonText, PML-TQ)
  79. UD_Hungarian-Szeged (KonText, PML-TQ)
  80. UD_Indonesian-GSD (KonText, PML-TQ)
  81. UD_Indonesian-PUD (KonText, PML-TQ)
  82. UD_Irish-IDT (KonText, PML-TQ)
  83. UD_Italian-ISDT (KonText, PML-TQ)
  84. UD_Italian-ParTUT (KonText, PML-TQ)
  85. UD_Italian-PUD (KonText, PML-TQ)
  86. UD_Japanese-GSD (KonText, PML-TQ)
  87. UD_Japanese-PUD (KonText, PML-TQ)
  88. UD_Japanese-Modern (KonText, PML-TQ)
  89. UD_Korean-Kaist (KonText, PML-TQ)
  90. UD_Korean-GSD (KonText, PML-TQ)
  91. UD_Korean-PUD (KonText, PML-TQ)
  92. UD_Latin-PROIEL (KonText, PML-TQ)
  93. UD_Latin-ITTB (KonText, PML-TQ)
  94. UD_Latin-Perseus (KonText, PML-TQ)
  95. UD_Latvian-LVTB (KonText, PML-TQ)
  96. UD_North_Sami-Giella (KonText, PML-TQ)
  97. UD_Norwegian-Bokmaal (KonText, PML-TQ)
  98. UD_Norwegian-Nynorsk (KonText, PML-TQ)
  99. UD_Norwegian-NynorskLIA (KonText, PML-TQ)
  100. UD_Old_Church_Slavonic-PROIEL (KonText, PML-TQ)
  101. UD_Old_French-SRCMF (KonText, PML-TQ)
  102. UD_Polish-LFG (KonText, PML-TQ)
  103. UD_Polish-SZ (KonText, PML-TQ)
  104. UD_Portuguese-Bosque (KonText, PML-TQ)
  105. UD_Portuguese-GSD (KonText, PML-TQ)
  106. UD_Portuguese-PUD (KonText, PML-TQ)
  107. UD_Romanian-RRT (KonText, PML-TQ)
  108. UD_Romanian-Nonstandard (KonText, PML-TQ)
  109. UD_Russian-GSD (KonText, PML-TQ)
  110. UD_Russian-PUD (KonText, PML-TQ)
  111. UD_Russian-SynTagRus (KonText, PML-TQ)
  112. UD_Slovak-SNK (KonText, PML-TQ)
  113. UD_Slovenian-SSJ (KonText, PML-TQ)
  114. UD_Spanish-AnCora (KonText, PML-TQ)
  115. UD_Spanish-GSD (KonText, PML-TQ)
  116. UD_Spanish-PUD (KonText, PML-TQ)
  117. UD_Swedish-Talbanken (KonText, PML-TQ)
  118. UD_Swedish-LinES (KonText, PML-TQ)
  119. UD_Swedish-PUD (KonText, PML-TQ)
  120. UD_Swedish_Sign_Language-SSLC (KonText, PML-TQ)
  121. UD_Tamil-TTB (KonText, PML-TQ)
  122. UD_Thai-PUD (KonText, PML-TQ)
  123. UD_Turkish-IMST (KonText, PML-TQ)
  124. UD_Turkish-PUD (KonText, PML-TQ)
  125. UD_Urdu-UDTB (KonText, PML-TQ)
  126. UD_Vietnamese-VTB (KonText, PML-TQ)

 

Download

Named Entity Recognition

Corpus Language Description Availability

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

Size: 89,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY 4.0

Croatian

This corpus contains Tweets.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

Training corpus hr500k 1.0

Size: 500,000 tokens
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and Named Entity recognition. Half of the corpus also syntactically parsed
Licence: CC BY-SA 4.0

Croatian

This corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Czech Named Entity Corpus 1.1

Size: 5868 sentences, 35220 NEs
Annotation: Named Entity recognition
Licence: CC BY-NC-SA 3.0

Czech

This corpus is available for download from LINDAT.

For the relevant publication, see Kravalová and Žabokrtský (2009)

Download

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens
Annotation: PoS tagging, Named Entity recognition, sentiment analysis
Licence: MIT License

German, Italian, Spanish

This corpus contains Tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

KPWr (Polish Corpus of Wrocław University of Technology) 1.2

Size: 447,000 tokens
Annotation: chunks and selected predicate-argument relations, Named Entity recognition, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases
Licence: CC BY-SA 3.0

Polish

This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.).

The corpus is available for download from the CLARIN-PL repository.

Download

Polish Spatial Texts 1.0

Size: 46,000 tokens
Annotation: Named Entity recognition (spatial expressions)
Licence: CC BY-SA 4.0

Polish

This corpus contains travel blogs.

The corpus is available for download from the CLARIN-PL repository.

Download

CINTIL-Corpus Internacional do Português

Size: 1 million tokens
Annotation: morphosyntactic tagging, Named Entity recognition
Licence: CLARIN RES

Portuguese

The corpus contains transcriptions of spoken communication as well as written texts from several genres (news, literature, magazines, etc.).

The corpus is available for download from the CLARIN PORTULAN repository.

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition
Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is no longer updated.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Batanović et al. (2018)

KonText

noSketch

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

Size: 92,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY 4.0

Serbian

This corpus contains Tweets.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC).

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens
Annotation: whole corpus: tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation; half of the corpus: syntactic parsing, Named Entity recognition, and verbal multiword expression tagging; a quarter of the corpus: semantic roles
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Sentiment analysis

Corpus Language Description Availability

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens
Annotation: PoS tagging, Named Entity recognition, sentiment analysis
Licence: MIT License

German, Italian, Spanish

This corpus contains Tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

Twitter sentiment for 15 European languages

Size: 1.6 million tweets
Annotation: sentiment analysis
Licence: CC BY-SA 4.0

Albanian, Bosnian, Bulgarian, Croatian, English, German, Hungarian, Polish, Portuguese, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish

This corpus contains Tweet IDs with sentiment annotations.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Mozetič et al. (2016)

Download

Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0

Size: 407.5 million words
Annotation: sentiment analysis (socially unacceptable discourse)
Licence: CC BY-SA 4.0

Croatian

This corpus contains news comments from the website 24sata.hr.

The corpus is available for download from CLARIN.SI.

Download

Aspect-Term Annotated Customer Reviews in Czech

Size: 2,200 reviews
Annotation: sentiment analysis
Licence: CC BY-NC-SA 3.0

Czech

This corpus contains online user-product reviews.

The corpus is available for download from LINDAT.

Download

Facebook Data for Sentiment Analysis

Size: 10,000 Facebook posts
Annotation: sentiment analysis
Licence: CC BY-SA 3.0

Czech

This corpus contains Facebook posts.

The corpus is available for download from LINDAT and through the concordancer KonText.

For the relevant publication, see Habernal et al. (2013)

KonText

Download

NoReC: The Norwegian Review Corpus

Size: 14.8 million tokens
Annotation: sentiment analysis
Licence: CC BY-NC 3.0

Norwegian

This corpus contains reviews in different domains (e.g., literature, videogames, etc.).

The corpus is available for download from the CLARINO repository.

For the relevant publication, see Velldal et al. (2018)

Download

Manually sentiment annotated Slovenian news corpus SentiNews 1.0

Size: 10,427 articles
Annotation: sentiment analysis
Licence: CC BY-SA 4.0

Slovenian

This corpus contains news articles.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Bučar et al. (2018)

Download

Other annotation layers

Corpus Language Description Availability

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0

Size: 89,000 tokens
Annotation: word normalisation
Licence: CC BY 4.0

Croatian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

Czech Legal Text Treebank 2.0

Size: 1,121 sentences
Annotation: semantic role labelling
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains legal texts.

The corpus is available through the concordancer KonText, the PML-TQ tool and for download from the LINDAT repository.

KonText

PML-TQ

Download

Prague Discourse Treebank 2.0

Size: 49,500 sentences
Annotation: mark-up of discourse phenomena enriched by the annotation of secondary connectives
Licence: CC-BY

Czech

This corpus is a subset of the Prague Dependency Treebank 3.5.

The corpus is available through the PML-TQ tool.

PML-TQ

Prague Czech-English Dependency Treebank 2.0 Coref

Size: 49,000 sentences
Annotation: mark-up of coreference
Licence: CC-BY-NC-SA + LDC99T42 (restricted use)

Czech, English

This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style.

The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool.

KonText

PML-TQ

Download

Artificial Treebank with Ellipsis

Size: 106,000 tokens, 10,604 sentences
Annotation: mark-up of elliptical constructions
Licence: Licence Universal dependencies v2.1

Czech, English, Finnish, Russian, Slovak

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the LINDAT repository.

Download

Grundtvig's Works Corpus

Size: 11,417,194 words
Annotation: linked data (places, persons, bible citations, etc.)
Licence: CC BY-NC 4.0

Danish

This corpus contains the literary works of the Danish bishop N.F.S. Grundtvig.

The corpus is available for download from the CLARIN-DK repository.

Download

SoNaR-1

Size: 1 million words
Annotation: semantic role labelling

Dutch

This is a manually annotated subset of the much larger (approx. 500 million words) SoNaR corpus.

The corpus is available for download from the Dutch Language Institute.

Download

The ACL RD-TEC 2.0

Size: 33,216 tokens
Annotation: terminology extraction/classification
Licence: CC BY-NC-SA 4.0

English

This corpus contains 6,818 terms extracted from abstracts of computational linguistics papers.

The corpus is available for download from LINDAT and through KonText.

For the relevant publication, see QasemiZadeh and Schumann (2016)

KonText

Download

Speech, Thought and Writing Presentation Corpus

Size: 260,000 words
Annotation: identification of reported speech
Licence: CC BY-NC-SA 3.0

English

This corpus contains literary, newspaper and biography texts.

The corpus is available for download from the Oxford Text Archive.

Download

TimeML annotated corpus of Estonian newspaper articles

Size: 22,000 words
Annotation: temporal semantic annotations
Licence: CC-BY-SA

Estonian

This corpus contains newspaper articles.

The corpus is available for download from META-SHARE (CELR distribution).

Download

Estonian Treebank annotated with coreference relations

Size: 107,000 words
Annotation: anaphora relations
Licence: GPL

Estonian

This corpus contains newspaper texts plus one scientific medical text.

The corpus is available for download from META-SHARE (CELR distribution).

Download

Semantically disambiguated corpus of Estonian

Size: 375,733 tokens
Annotation: word sense disambiguation
Licence: CLARIN ACA

Estonian

The corpus is available for download from META-SHARE (CELR distribution).

Download

Greek Coreference Corpus

Size: 62,988 tokens
Annotation: coreference
Licence: CC-BY-NC-SA

Greek

In addition to coreference, the corpus is annotated for identity and bridging relations.

For the relevant publication, see Ogrodniczuk et al. (2015)

Download

Greek Textual Entailment Corpus

Size: 600 sentence-pairs
Annotation: logical entailment
Licence: CC-BY

Greek

This corpus contains texts from the domains of politics, law and travel.

This corpus is available for download from the clarin:el repository.

Download

KPWr (Polish Corpus of Wrocław University of Technology) 1.2

Size: 447,000 tokens
Annotation: selected predicate-argument relations, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases
Licence: CC BY-SA 3.0

Polish

This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.).

The corpus is available for download from the CLARIN-PL repository.

Download

Polish Coreference Corpus

Size: 540,000 tokens
Annotation: coreference
Licence: CC BY 3.0

Polish

This corpus contains texts in a variety of domains (magazines, fiction literature, non-fiction literature, computer-mediated communication, academic writing, etc.).

The corpus is available for download and online browsing.

Concordancer

Download

Polish Summaries Corpus

Size: 10,845 summaries
Annotation: summarization
Licence: CC BY 3.0

Polish

This corpus is available for download from the ZIL IPI PAN repository.

For the relevant publication, see Ogrodniczuk and Kopeć (2014)

Download

WUT Relations Between Sentences Corpus

Size: 5,654 sentences
Annotation: relations between sentences - Cross-document Structure Theory (CST)
Licence: CC BY-SA 3.0

Polish

This corpus contains news items.

The corpus is available for download from the CLARIN-PL repository.

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

Size: 92,000 tokens
Annotation: word normalisation
Licence: CC BY 4.0

Serbian

This corpus contains Tweets. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens
Annotation: word normalisation
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Corpus of comma placement Vejica 1.3

Size: 104,000 sentences
Annotation: comma placement
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains texts from various Slovenian corpora (KUST, Šolar, Lektor, JANES-Vejica, Wikipedia).

The corpus is available for download from CLARIN.SI.

Download

Terminology identification dataset KAS-term 1.0

Size: 22,950 term candidates
Annotation: monolingual term extraction
Licence: CC BY-SA 4.0

Slovenian

This corpus contains term candidates from PhD theses in chemistry, computer science and political science.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Holozan (2018)

Download

CMC training corpus Janes-Norm 1.2

Size: 184,755 tokens
Annotation: normalisation
Licence: CC BY-SA 4.0

Slovenian

Part of this corpus is also manually annotated with MSD tags and lemmatised.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens
Annotation: verbal multiword expression tagging, semantic role labelling
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Bilingual terminology extraction dataset KAS-biterm 1.0

Size: 1,950 sentences, 78,500 tokens, 3,700 terms
Annotation: bilingual term extraction
Licence: CC BY-SA 4.0

Slovenian, English

This corpus contains PhD theses.

The corpus is available for download from the CLARIN.SI repository.

Download

Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1)

Size: 5.8 million tokens
Annotation: identification of verbal multi-word expressions (idioms, light-verb constructions, verb-particle constructions, inherently reflexive verbs, multi-verb constructions)
Licence: PARSEME Shared Task Data (v. 1.1) Agreement

16 languages

This corpus collection is available for download from LINDAT.

The PARSEME corpora can be queried individually through KonText. We provide the individual links to each corpus:

 

  1. Parseme VMWE 1.0 – Czech
  2. Parseme VMWE 1.0 – German
  3. Parseme VMWE 1.0 – Greek
  4. Parseme VMWE 1.0 – Spanish
  5. Parseme VMWE 1.0 – Persian
  6. Parseme VMWE 1.0 – French
  7. Parseme VMWE 1.0 – Hungarian
  8. Parseme VMWE 1.0 – Italian
  9. Parseme VMWE 1.0 – Maltese
  10. Parseme VMWE 1.0 – Polish
  11. Parseme VMWE 1.0 – Portuguese
  12. Parseme VMWE 1.0 – Romanian
  13. Parseme VMWE 1.0 – Slovenian
  14. Parseme VMWE 1.0 – Swedish
  15. Parseme VMWE 1.0 – Turkish

 

Download

Publications

[Batanović et al. 2018] Vuk Batanović, Nikola Ljubešić, and Tanja Samardžić. 2018. SETimes.SR – A Reference Training Corpus of Serbian.

[Bučar et al. 2018] Jože Bučar, Martin Žnidaršič, and Janez Povh. 2018. Annotated news corpora and a lexicon for sentiment analysis in Slovene.

[Csendes et al. 2005] Dóra Csendes, János Csirik, Tibor Gyimóthy, and András Kocsor. 2005. The Szeged Treebank.

[Erjavec 2012] Tomaž Erjavec. 2012. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.

[Erjavec et al. 2010] Tomaž Erjavec, Darja Fišer, Simon Krek, and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene.

[Fišer et al. 2018] Darja Fišer, Nikola Ljubešić and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content.

[Habernal et al. 2013] Ivan Habernal, Tomáš Ptáček, and Josef Steinberger. 2013. Sentiment Analysis in Czech Social Media Using Supervised Machine Learning.

[Hajič et al. 2004] Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic Dependency Treebank: Development in Data and Tools.

[Hajič et al. 2012] Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0.

[Haverinen et al. 2014] Katri Haverinen, Jenna Nyblom, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Tapio Salakoski, and Filip Ginter. 2014. Building the essential resources for Finnish: the Turku Dependency Treebank.

[Holozan 2018] Peter Holozan. 2018. Corpus of comma placement Vejica 1.3.

[Kravalová and Žabokrtský 2009] Jana Kravalová and Zdeněk Žabokrtský. 2009. Czech Named Entity Corpus and SVM-based Recognizer.

[Kríž and Hladká 2018] Vincent Kríž and Barbora Hladká. 2018. Czech Legal Text Treebank 2.0.

[Miličević and Ljubešić 2016] Maja Miličević and Nikola Ljubešić. 2016. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets.

[Mozetič et al. 2016] Igor Mozetič, Miha Grčar, and Jasmina Smailović. 2016. Multilingual Twitter Sentiment Classification: The Role of Human Annotators.

[Muischnek et al. 2014] Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Eleri Aedmaa, Riin Kirt, and Dage Särg. 2014. Estonian Dependency Treebank and its annotation scheme.

[van Noord 2009] Gertjan van Noord. 2009. Huge Parsed Corpora in LASSY.

[Jelínek 2017] Tomáš Jelínek. 2017. FicTree: a Manually Annotated Treebank of Czech Fiction.

[Ogrodniczuk and Kopeć 2014] Maciej Ogrodniczuk and Mateusz Kopeć. 2014. The Polish Summaries Corpus.

[Ogrodniczuk et al. 2015] Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, and Magdalena Zawisławska. 2015. Coreference: Annotation, Resolution and Evaluation in Polish.

[Orasmaa 2014] Siim Orasmaa. 2014. Towards an Integration of Syntactic and Temporal Annotations in Estonian.

[Przepiórkowski and Murzynowski 2011] Adam Przepiórkowski and Grzegorz Murzynowski. 2011. Manual annotation of the National Corpus of Polish with Anotatornia.

[QasemiZadeh and Schumann 2016] Behrang QasemiZadeh and Anne-Kathrin Schumann. 2016. The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods.

[Rei et al. 2016] Luis Rei, Dunja Mladenić, and Simon Krek. 2016. A Multilingual Social Media Linguistic Corpus.

[Resch et al. 2016] Claudia Resch, Ulrike Czeitschner, Eva Wohlfarter, and Barbara Krautgartner. 2016. Introducing the Austrian Baroque Corpus: Annotation and Application of a Thematic Research Collection.

[Rögnvaldsson et al. 2012] Eiríkur Rögnvaldsson, Anton Karl Ingason, Einar Freyr Sigurðsson, and Joel Wallenberg. 2012. The Icelandic Parsed Historical Corpus (IcePaHC).

[Rosén et al. 2012] Victoria Rosén, Koenraad De Smedt, Paul Meurer, and Helge Dyvik. 2012. An Open Infrastructure for Advanced Treebanking.

[Stein and Prévost 2013] Achim Stein and Sophie Prévost. 2013. Syntactic annotation of medieval texts: the Syntactic Reference Corpus of Medieval French (SRCMF).

[Velldal et al. 2018] Erik Velldal, Lilja Øvrelid, Eivind Alexander Bergem, Cathrine Stadsnes, Samia Touileb, and Fredrik Jørgensen. 2018. NoReC: The Norwegian Review Corpus.

[Wróblewska 2018] Alina Wróblewska. 2018. Extended and enhanced Polish dependency bank in Universal Dependencies format.

[Zeman et al. 2012] Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, and Jan Hajič. 2012. HamleDT: To Parse or Not to Parse?