Lexica

Introduction

Lexica are primarily used in NLP applications. They typically contain an extensive lexical inventory annotated with specific linguistic information (e.g., morphosyntax, sentiment). There are 75 lexica in the CLARIN infrastructure. Most (62) are monolingual, covering 17 languages (Arabic, Croatian, Czech, Danish, Dutch, English, Estonian, French, Greek, Icelandic, Italian, Maltese, Polish, Portuguese, Serbian, Slovenian, and Swedish). The rest (13) are multilingual and cover a variety of language combinations. In the vast majority of cases, the lexica can be downloaded directly from the national repositories or queried through easy-to-use online search environments.
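To make the idea of a lexical inventory with attached linguistic information concrete, here is a minimal sketch in Python of how an NLP application might load and query a small morphosyntactic lexicon. The tab-separated layout (word form, lemma, tag), the tag names, and the helper names `load_lexicon` and `analyses` are assumptions made for illustration only. The lexica listed below each have their own format and tagset, so their documentation should be consulted before adapting this.

```python
from collections import defaultdict

# Toy tab-separated lexicon: word form, lemma, tag.
# All rows are invented for illustration; real lexica differ in format and tagset.
TOY_LEXICON = """\
walks\twalk\tVERB
walks\twalk\tNOUN
walked\twalk\tVERB
mice\tmouse\tNOUN
"""


def load_lexicon(text):
    """Map each lower-cased word form to the list of (lemma, tag) analyses it can have."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        form, lemma, tag = line.split("\t")
        lexicon[form.lower()].append((lemma, tag))
    return dict(lexicon)


def analyses(lexicon, form):
    """Return all analyses for a word form, or an empty list if the form is unknown."""
    return lexicon.get(form.lower(), [])


lexicon = load_lexicon(TOY_LEXICON)
print(analyses(lexicon, "Walks"))  # [('walk', 'VERB'), ('walk', 'NOUN')]
print(analyses(lexicon, "cats"))   # []
```

An ambiguous form such as "walks" returns several analyses. Choosing among them is left to a downstream component such as a tagger, which is why a lexicon of this kind is usually combined with corpus-trained tools rather than used on its own.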

To comment, suggest changes to the existing content, or propose new resources for inclusion, send us an email.

This website was last updated on 04 June 2020.

Lexica in the CLARIN infrastructure

Monolingual resources

Resource

Language Description Availability

A machine-readable dictionary of Egyptian Arabic

Size: 2,418 entries
Linguistic information: basic morphological information, usage examples
Licence: CC-BY-NC-SA 3.0

Arabic (Egyptian)

This lexicon presents a more comprehensive version of A machine-readable glossary of Egyptian Arabic. The resource is available for download from ARCHE.

Download

A machine-readable glossary of Egyptian Arabic

Size: 2,204 entries
Linguistic information: basic morphological information, usage examples
Licence: CC-BY-NC-SA 3.0

Arabic (Egyptian)

This lexicon has been compiled for comparative as well as didactic purposes in the on-going VICAV project. The resource is available for download from ARCHE.

Download

Automatically constructed multiword lexicon hrMWELex v0.5

Size: 43,730 entries
Linguistic information: multi-word expressions
Licence: CC-BY 4.0

Croatian

This is a lexicon of multiword expressions available for download from CLARIN.SI.

For a related publication, see Ljubešić et al. (2015).

Download

Word embeddings CLARIN.SI-embed.hr 1.0

Size: 3,147,352 entries
Linguistic information: PoS-tags, lemmas
Licence: CC-BY 4.0

Croatian

This lexicon contains word embeddings extracted from the Croatian web corpus hrWaC and a 400-million-token collection of newspaper texts. The resource is available for download from CLARIN.SI.

Download

DeriNet 1.6

Size: 1,027,832 entries
Licence: CC-BY-NC-SA 3.0

Czech

This is a lexicon of derivational relations (both compounding and inflections). The resource is available for download and online browsing through LINDAT.

Browse

Download

MorfFlex CZ

Size: 124,259,099 lexical types
Linguistic information: MSD-tags, derivational, semantic, NER information
Licence: CC-BY-NC-SA 3.0

Czech

This is a morphological lexicon available for download from LINDAT.

Download

ParaDi 2.0

Size: 1,621 entries
Linguistic information: MSD-tags, syntactic/semantic features
Licence: CC-BY 4.0

Czech

This is a lexicon of single-word paraphrases of Czech verbal multiword expressions. The resource is available for download from LINDAT.

Download

PDT-Vallex

Size: 7,121 entries, 11,933 frames
Linguistic information: verb, adjective and noun valency
Licence: CC-BY-NC-SA 4.0

Czech

This is a valency lexicon linked to several Czech corpora (PDT, PCEDT Cz side, PDTSC, Faust). The resource is available for download and online browsing through LINDAT.

For a related publication, see Urešová (2011).

Browse

Download

VALLEX 3.0

Size: 2,722 entries, 6,711 units, 6,711 frames, 4,586 words
Linguistic information: verb senses (characterized by glosses and examples)
Licence: CC-BY-NC-SA 4.0

Czech

This is a valency lexicon available for download and online browsing through LINDAT.

For a related publication, see Lopatková et al. (2017).

Browse

Download

STO morphology (v2) - LMF format

Size: 87,209 entries
Licence: CC-BY-SA 4.0

Danish

This morphological lexicon is available for download from the CLARIN-DK repository. It is also available in the .csv format.

Download

STO syntax (v2) - LMF format

Size: 84,159 entries
Licence: CC-BY-SA 4.0

Danish

This syntactic lexicon is available for download from the CLARIN-DK repository.

Download

Basilex Lexicon

Dutch

This is a lexicon that comprises all lemmas from the Basilex Corpus. The Basilex Corpus is an annotated collection of texts written for children in elementary school. The resource is available for download from the Dutch Language Institute (INT).

Download

Basiscript Lexicon

Licence: other

Dutch

This is a lexicon that comprises all lemmas from the Basiscript Corpus. The Basiscript Corpus is an annotated collection of texts written by children in elementary school. The resource is available for download from the Dutch Language Institute (INT).

Download

CombiLex

Size: 213,000 lemmas
Linguistic information: lemmas and word forms
Licence: other

Dutch

This is a lexicon of words and word forms available for download from the Dutch Language Institute (INT).

Download

Dutch Electronic Lexicon of Multiword Expressions

Size: 5,000 expressions
Licence: other

Dutch

This is a lexicon of multiword expressions available for download from the Dutch Language Institute (INT).

Download

PAROLE Lexicon

Size: 20,000 entries
Linguistic information: MSD-tags and syntactic complementation patterns
Licence: other

Dutch

This morphosyntactic lexicon is available for download from the Dutch Language Institute (INT).

Download

Reference Lexicon for Belgian-Dutch (RBBN)

Size: 4,000 words and expressions
Licence: other

Dutch

This lexicon, which contains words and expressions typical of Dutch as spoken in Belgium, is available for download from the Dutch Language Institute (INT).

Download

Reference Lexicon for Dutch

Size: 50,000 lemmas
Linguistic information: dialectal information
Licence: other

Dutch

This is a corpus-based monolingual lexicon available for download from the Dutch Language Institute (INT).

Download

BioLexicon

Size: over 2.2 million entries (over 3.3 million semantic relations)
Licence: ELRA END USER

English

This is a large-scale, wide-coverage computational lexicon covering the biomedical domain. The resource is unavailable for download or online browsing, but can be accessed by contacting the resource manager.

 

EngVallex

Size: 4,337 entries, 7,148 frames
Linguistic information: verb valency
Licence: CC-BY-NC-SA 4.0

English

This is a valency lexicon linked to the English side of the PCEDT corpus (WSJ corpus). The resource is available for download from LINDAT and for online browsing.

Browse

Download

The Database of Estonian Multi-Word Expressions

Size: 12,500 words
Licence: proprietary

Estonian

This is a collection of lexica that contain multi-word expressions consisting of a verb and a particle or a verb and its complements. The resource is available for download from META-SHARE (CELR distribution) and for online browsing through a dedicated website.

Browse

Download

Démonette

Size: 96,027 entries
Linguistic information: MSD-tags (grace format), semantic types
Licence: CC-BY 4.0

French

This is a morphological lexicon available for download from ORTOLANG.

Download

Dicovalence

Size: 8,000 entries
Linguistic information: c- and s-selectional restrictions
Licence: GNU Lesser General Public License (LGPL)

French

This is a verb-valency lexicon.

The lexicon specifies certain selectional restrictions, possible term manifestations (pronominal, phrasal), and whether the valency frames can be used in various passive constructions, as well as references to other valency frames for the same infinitive. The resource is available for download from ORTOLANG.

Download

MarsaLex

Size: 595,000,000 inflected forms
Licence: CC-BY 4.0

French

This is a morphological lexicon available for download from ORTOLANG.

Download

Morphalou

Size: 159,261 entries
Linguistic information: spelling, phonetics, mood, tense, MSD-tags, spelling variant, feminine variation, pronominal
Licence: GNU Lesser General Public License (LGPL)

French

This is a morphological lexicon available for download from ORTOLANG.

Download

VfrLPL

Size: 8,800 entries
Linguistic information: conjugation forms, phonetic forms, use frequencies

French

This is a morphosyntactic lexicon available for download from ORTOLANG.

Download

ILSP PsychoLinguistic Resource

Size: 217,664 entries
Linguistic information: phonetic transcription, frequency of usage
Licence: CC-BY-NC-SA

Greek

This is a lexicon for psycholinguistic research. The resource is available for download from clarin:el.

For a related publication, see Protopapas et al. (2010).

Download

Database of Modern Icelandic Inflections

Size: 278,994 entries
Linguistic information: MSD-tags
Licence: other

Icelandic

This is a morphological lexicon available for download and online browsing through CLARIN-IS.

For a related publication, see Bjarnadóttir (2012).

Browse

Download

Italian Content Words v3

Size: 2,342,120 items
Licence: CC-BY-NC-SA 4.0

Italian

This is a morphological lexicon. The resource is available for download from LINDAT.

Download

Italian Function Words v3

Size: 3,510 entries
Licence: CC-BY-NC-SA 4.0

Italian

This is a morphological lexicon. The resource is available for download from LINDAT.

Download

OpeNER Sentiment Lexicon Italian - LMF

Size: 24,293 entries
Linguistic information: positive/negative/neutral polarity
Licence: CC-BY 4.0

Italian

This is a sentiment lexicon available for download from ILC4CLARIN.

Download

PAROLE-SIMPLE-CLIPS

Size: 37,406 syntactic units
Licence: CC-BY-SA 4.0

Italian

This is a morphological lexicon available for download from ILC4CLARIN.

Download

Maltese Speech Engine Lexicon

Size: 39,242 entries
Linguistic information: PoS-tags, orthographic transcription, phonetic forms, syllables, stress position
Licence: MS-BY-NC-SA

Maltese

This is a speech lexicon that is useful for building speech-to-text systems. It is available for download from CLARIN PORTULAN.

Download

Emotional Annotations Dictionary

Size: 178,514 elements
Licence: CC-BY 4.0

Polish

This is a lexicon with emotional annotation extracted from Polish Wordnet. The resource is available for download from the CLARIN-PL repository.

Download

Extended dictionary of named entities NELexicon connected with Linked Open Data

Size: 103,585 entries
Licence: GNU LGPL 3.0

Polish

This lexicon contains Polish named entities connected with terminology from available resources within Linked Open Data (e.g. WordNet, DBPedia, Wikipedia, etc.). The resource is available for download from the CLARIN-PL repository.

Download

MWELexicon 1.1

Size: 56,500 lexical units
Linguistic information: “syntactic behaviour”
Licence: plWordNet

Polish

This is a lexicon of multiword expressions available for download from the CLARIN-PL repository.

Download

Walenty (2018-06-29)

Size: 18,236 entries
Licence: CC-BY-SA 4.0

Polish

This is a lexicon of verb valency that is available for download from the CLARIN-PL repository.

Download

LEX-MWE-PT: Word Combination in Portuguese Language

Size: 1,198 entries, 12,753 multiword units
Linguistic information: lemmas
Licence: MS NC-NoReD-ND

Portuguese

This is a lexicon of multiword expressions. The resource is unavailable for download or online browsing, but can be accessed by contacting the resource manager.

 

LX-Abbreviations

Size: 208 words
Linguistic information: MSD-tags
Licence: MS NC-NoReD-ND

Portuguese

This is a lexicon of abbreviations. It is unavailable for download or online browsing, but can be accessed by contacting the resource manager.

 

LX-DSemVectors

Size: 17,572 words
Linguistic information: word embeddings
Licence: MS NC-NoReD-ND

Portuguese

This lexicon provides distributional semantic representations of Portuguese words. The dataset is available for download from GitHub.

Download

LX-Rare Word Similarity Dataset

Size: 2,034 words
Linguistic information: synonyms
Licence: MS NC-NoReD-ND

Portuguese

This is a word-similarity lexicon that is unavailable for download.

 

LX-SimLex-999

Size: 1,998 words
Linguistic information: MSD-tags, linguistic standardness
Licence: MS NC-NoReD-ND

Portuguese

This is a word-similarity lexicon that is unavailable for download or online browsing, but can be accessed by contacting the resource manager.

 

LX-StopWords

Size: 2,631 words
Linguistic information: MSD-tags, MWEs
Licence: MS NC-NoReD-ND

Portuguese

This is a manually compiled exhaustive list of closed-class words in Portuguese. The resource is unavailable for download or online browsing, but can be accessed by contacting the resource manager.

 

LX-WordSim-353

Size: 706 words
Linguistic information: synonyms, antonyms, identical, hypernym-hyponym, sibling terms, meronym-holonym
Licence: MS NC-NoReD-ND

Portuguese

This is a word-similarity lexicon that is unavailable for download or online browsing, but can be accessed by contacting the resource manager.

 

Multifunctional Computational Lexicon of Contemporary Portuguese

Size: 26,443 entries
Linguistic information: lemmas, MWEs, PoS-tags
Licence: CC-BY-SA

Portuguese

This is a frequency lexicon suitable for NLP-specific purposes (information extraction, lemmatization, PoS tagging). The resource is available for download from META-SHARE (CLARIN PORTULAN distribution).

Download

PAROLE Portuguese Lexicon

Size: 20,000 entries
Linguistic information: MSD tags, lemma
Licence: ELRA EVALUATION

Portuguese

This is a morphosyntactic lexicon available for download from CLARIN PORTULAN.

Download

Porlex

Size: 27,374 words
Linguistic information: orthographic and phonological/phonetic transcriptions, phonetic, MSD-tags, and frequency information
Licence: MS NC-NoReD-ND

Portuguese

This is a lexicon that provides psycholinguistic and cognitive information that is useful to select stimulus materials for experiments and/or training vocabularies. The resource is available for download from CLARIN PORTULAN.

Download

Simple Portuguese Lexicon

Size: 10,438 entries
Linguistic information: qualia structure, semantic relations (hyponymy, synonymy, etc.)
Licence: MS-BY-NC-SA

Portuguese

This semantic lexicon is available for download from CLARIN PORTULAN.

Download

Automatically constructed multiword lexicon srMWELex v0.5

Size: 22,290 entries
Linguistic information: MWEs
Licence: CC-BY 4.0

Serbian

This is a lexicon of multiword expressions available for download from CLARIN.SI.

Download

Word embeddings CLARIN.SI-embed.sr 1.0

Size: 1,480,566 entries
Linguistic information: PoS-tags, lemmas
Licence: CC-BY 4.0

Serbian

The lexicon contains word embeddings from the srWaC web corpus. The resource is available for download from CLARIN.SI.

Download

Automatically constructed multiword lexicon slMWELex v0.5

Size: 47,579 entries
Linguistic information: MWEs
Licence: CC-BY 4.0

Slovenian

This is a lexicon of multiword expressions available for download from CLARIN.SI.

Download

Automatically stress labelled morphological lexicon Sloleks 1.2, version 1.1

Size: 100,805 entries; 2,774,745 words
Linguistic information: wordforms, PoS-tags, lemmas, frequency, prosody
Licence: CC-BY-NC-SA 4.0

Slovenian

This is an extended version of the morphological lexicon Sloleks 1.2 with added information about the stress of each word form. The resource is available for download from CLARIN.SI.

For a related publication, see Krsnik and Robnik Šikonja (2017).

Download

Beseda Corpus Lemmatisation Lexicon

Size: 3,228,127 entries
Linguistic information: wordforms, PoS-tags, lemmas, frequency
Licence: CC-BY 4.0

Slovenian

This lexicon contains inflected open class words from the Dictionary of Standard Slovenian that are augmented by wordforms, their part of speech tags and their lemmas used during the PoS tagging and lemmatization of the Beseda corpus. The resource is available for download from CLARIN.SI and for online browsing.

Browse

Download

Lexicon of historical Slovene imp25k 1.1

Size: 28,034 entries
Linguistic information: MSD-tags, lemmas, etymological glosses
Licence: CC-BY 4.0

Slovenian

This is a morphological lexicon available for download from CLARIN.SI and for online browsing through a dedicated environment.

For a related publication, see Erjavec (2015).

Browse

Download

Morphological lexicon Sloleks 2.0

Size: 100,805 entries
Linguistic information: wordforms, PoS-tags, lemmas, frequency, phonology
Licence: CC-BY-NC-SA 4.0

Slovenian

This is a reference morphological lexicon of the Slovenian language developed to be used in NLP applications and language manuals. The resource is available for download from CLARIN.SI and for online browsing.

For a related publication, see Dobrovoljc et al. (2017).

Browse

Download

Slovene sentiment lexicon JOB 1.0

Size: 25,524 entries
Linguistic information: sentiment tags
Licence: CC-BY-SA 4.0

Slovenian

This is a lexicon of sentiment labels available for download from the CLARIN.SI repository.

For a related publication, see Bučar et al. (2018).

Download

Slovene sentiment lexicon KSS 1.1

Size: 90,620 lexica
Linguistic information: lemmas, sentiment tags
Licence: CC-BY 4.0

Slovenian

This is a lexicon of sentiment labels available for download from the CLARIN.SI repository.

Download

Word embeddings CLARIN.SI-embed.sl 1.0

Size: 4,560,444 entries
Linguistic information: PoS-tags, lemmas
Licence: CC-BY 4.0

Slovenian

This is a lexicon of word embeddings that is available for download from CLARIN.SI.

Download

Old Swedish morphology (2017-10-16)

Size: 41,958 entries
Licence: CC-BY 4.0

Swedish

This is a glossary of Old Swedish that is available for download from the SWE-CLARIN repository and can be queried online through KARP.

Browse

Download

Parole+ (2017-10-16)

Size: 24,523 entries
Licence: CC-BY 4.0

Swedish

This is a lexicon for language technologies which offers access to syntactic information and is connected to SALDO senses. The resource can be downloaded from the SWE-CLARIN repository and queried online through KARP.

Browse

Download

SALDO's morphology (2017-10-16)

Size: 128,036 entries
Licence: CC-BY 4.0

Swedish

This is a semantic and morphological lexicon for language technologies. The resource can be downloaded from the SWE-CLARIN repository and queried online through KARP.

Browse

Download

Simple lexicon

Size: 11,624 entries
Licence: CC-BY 4.0

Swedish

This is a semantic lexicon that is available for download from the SWE-CLARIN repository and can be queried online through KARP.

Browse

Download

Multilingual resources

Resource

Language Description Availability

Concreteness and imageability lexicon MEGA.HR-Crossling

Size: 7,237,589 entries
Linguistic information: concreteness prediction, imageability prediction
Licence: CC-BY-SA 4.0

77 languages

These lexica contain concreteness and imageability predictions for 77 languages. They are available for download from CLARIN.SI.

For a related publication, see Ljubešić et al. (2018).

Download

Emoji Sentiment Ranking 1.0

Size: 751 entries (emojis)
Linguistic information: sentiment labels
Licence: CC-BY-SA 4.0

Albanian, Bulgarian, English, German, Hungarian, Polish, Portuguese, Russian, Serbo-Croatian, Slovak, Slovenian, Spanish, Swedish

This is a lexicon of emojis available for download from CLARIN.SI and for online browsing through a dedicated environment.

For a related publication, see Kralj Novak et al. (2015).

Browse

Download

OMBI Dutch-Arabic

Size: 37,000 entries
Licence: other

Arabic, Dutch

This is a bilingual lexicon that is suitable for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT).

Download

MULTEXT-East free lexicons 4.0

Size: 3,665,864 entries
Linguistic information: MSD-tags, lemmas
Licence: CC-BY-SA 4.0

Bulgarian, Czech, English, Estonian, French, Hungarian, Romanian, Slovak, Slovenian, Ukrainian

These are morphological lexica available for download from the CLARIN.SI repository.

For a related publication, see Erjavec (2011).

Download

CzEngClass 0.2

Size: 200 classes, 3,525 entries
Linguistic information: valency and synonymy
Licence: CC-BY-NC-SA 4.0

Czech, English

This is a valency lexicon linked to PDT-Vallex, EngVallex and external resources, such as FrameNet, VerbNet, WordNet, etc. The resource is available for download and online browsing through LINDAT.

Browse

Download

CzEngVallex

Size: 20,835 pairs (verb senses)
Linguistic information: verb valency
Licence: CC-BY-NC-SA 4.0

Czech, English

This is a valency lexicon linked to the parallel PCEDT corpus. The resource is available for download and online browsing through LINDAT.

For a related publication, see Fučíková et al. (2016).

Browse

Download

OMBI Arabic-Dutch

Size: 37,000 entries
Licence: other

Dutch, Arabic

This is a bilingual lexicon for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT).

Download

OMBI Dutch-Danish

Size: 46,000 entries
Licence: other

Dutch, Danish

This is a bilingual lexicon for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT).

Download

OMBI Dutch-Indonesian

Size: 50,000 entries
Licence: other

Dutch, Indonesian

This is a bilingual lexicon for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT).

Download

QTLeap specialized lexicons

Size: 231,516 entries
Licence: CC-BY

English, Spanish, Bulgarian, Basque, Dutch, Czech, Portuguese

This lexicon is used for the automatic translation of specific IT domain expressions and is available for download from CLARIN PORTULAN.

Download

MULTEXT-East non-commercial lexicons 4.0

Size: 2,288,228 entries
Linguistic information: MSD-tags, lemmas
Licence: CC-BY-NC 4.0

Macedonian, Persian, Polish, Russian, Serbian

These are morphological lexica available for download from the CLARIN.SI repository.

Download

A machine-readable Persian-English dictionary

Size: 1,892 entries
Linguistic information: morphological information, usage examples
Licence: CC-BY-NC-SA 3.0

Persian-English

This bilingual lexicon has been compiled for comparative as well as didactic purposes in the on-going VICAV project. The resource is available for download from ARCHE.

Download

A machine-readable Persian-English glossary of verbs

Size: 429 entries
Linguistic information: basic morphological information
Licence: CC-BY-NC-SA 3.0

Persian-English

This lexicon of single-word verbs in Modern Persian is available for download from ARCHE.

Download

Publications

[Bjarnadóttir 2012] Kristín Bjarnadóttir. 2012. The Database of Modern Icelandic Inflection (Beygingarlýsing íslensks nútímamáls).

[Bučar et al. 2018] Jože Bučar, Martin Žnidaršič, and Janez Povh. 2018. Annotated news corpora and a lexicon for sentiment analysis in Slovene.

[Erjavec 2011] Tomaž Erjavec. 2011. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.

[Erjavec 2015] Tomaž Erjavec. 2015. The IMP historical Slovene language resources.

[Fučíková et al. 2016] Eva Fučíková, Jan Hajič, and Zdeňka Urešová. 2016. Joint search in a bilingual valency lexicon and an annotated corpus.

[Dobrovoljc et al. 2017] Kaja Dobrovoljc, Simon Krek, and Tomaž Erjavec. 2017. The Sloleks Morphological Lexicon and its Future Development.

[Kralj Novak et al. 2015] Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. 2015. Sentiment of Emojis.

[Krsnik and Robnik Šikonja 2017] Luka Krsnik and Marko Robnik Šikonja. 2017. Napovedovanje naglasa slovenskih besed z metodami strojnega učenja.

[Ljubešić et al. 2015] Nikola Ljubešić, Kaja Dobrovoljc, and Darja Fišer. 2015. MWELex – MWE lexica of Croatian, Slovene and Serbian extracted from parsed corpora.

[Ljubešić et al. 2018] Nikola Ljubešić, Darja Fišer, and Anita Peti-Stantić. 2018. Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings.

[Lopatková et al. 2017] Markéta Lopatková et al. 2017. Valenční slovník českých sloves VALLEX.

[Protopapas et al. 2010] Athanassios Protopapas, Marina Tzakosta, Aimilios Chalamandaris, and Pirros Tsiakoulis. 2010. IPLR: an online resource for Greek word-level and sublexical information.

[Urešová 2011] Zdeňka Urešová. 2011. Valenční slovník Pražského závislostního korpusu (PDT-Vallex). 

[Úlfarsdóttir 2014] Thórdís Úlfarsdóttir. 2014. ISLEX – a Multilingual Web Dictionary.