Wordlists

Introduction

Wordlists are lexical resources that provide only alphabetical or frequency-based lexical inventories. There are 53 wordlists in the CLARIN infrastructure. About half (29) of the wordlists are monolingual, accounting for 10 languages (Dutch, Estonian, Finnish, German, Greek, Maltese, Ngbugu, Slovenian, Spanish, Swedish), while the other half (24) cover a variety of bilingual and multilingual language combinations (e.g., English-Greek, French-English-Spanish). In the vast majority of cases, the wordlists can be downloaded directly from the national repositories or queried through easy-to-use online search environments.

For comments, changes to the existing content or the inclusion of new resources, send us an email.

This website was last updated on 20 April 2020.

 

Wordlists in the CLARIN infrastructure

Monolingual resources

Resource

Language Description Availability

INT Historical Word List

Size: 500,000 word forms
Licence: other

Dutch

This wordlist includes historical lexemes for the period between 1550 and 1970. The resource is available for download from the Dutch Language Institute (INT).

For a related publication, see de Does and Depuydt (2012).

Download

Neologisms Online v3

Size: 19,000 words and expressions

Dutch

This wordlist of neologisms is available for online browsing through the Dutch Language Institute (INT).

Download

Estonian Frequency Dictionary (ver. 2.0)

Size: 997,934 word forms
Licence: CLARIN PUB

Estonian

This is a frequency list available for download from META-SHARE (CELR distribution) and for online browsing.

Browse

Download

Names of Countries

Licence: CLARIN ACA

Estonian

This is a wordlist that is based on the Estonian orthography of foreign place names. The resource is available for online browsing.

Browse

The Conceptual File of Estonian Lexis of the Institute of Estonian Language

Licence: CLARIN ACA

Estonian

This is a controlled vocabulary of several more or less related concepts (e.g., gardening, haymaking, weather, fishing, religion). The resource is available for online browsing.

Browse

Finnish Verbal Colorative Constructions

Size: 61,617 words
Licence: CC-BY

Finnish

This is a wordlist that contains Finnish verbal "colorative" (i.e., stylistically marked) constructions. The resource is available for download through FIN-CLARIN.

Download

Frequencies of Early Modern Finnish Words

Size: 4,862,190 words
Licence: EUPL

Finnish

This is a frequency lexicon that consists of words from the Early Modern Finnish text corpus. The resource is available online through FIN-CLARIN.

Browse

Frequencies of Old Literary Finnish Words

Size: 3,425,382 words
Licence: EUPL

Finnish

This is a frequency lexicon that consists of words from the Old Literary Finnish text corpus. The resource is available online through FIN-CLARIN.

Browse

Frequency Lexicon of the Finnish Newspaper Language

Size: 9,996 words
Licence: CC-BY NC ND 1.0

Finnish

This is a frequency lexicon available online through FIN-CLARIN.

Browse

Frequency List of Written Finnish Word Forms

Size: 17,604 lemmas; 1,339,787 word forms
Licence: EUPL

Finnish

This is a frequency lexicon of Finnish word forms that appear in the Finnish Parole text corpus. The resource is available online through FIN-CLARIN.

Browse

Modern Finnish Word List

Size: 94,110 entries
Linguistic information: MSD-tags, lemmas
Licence: GNU LGPL; EUPL v.1.1; CC-BY SA 3.0

Finnish

This is a wordlist of contemporary general vocabulary that is available for download through FIN-CLARIN.

Download

Psycholinguistic Descriptives

Size: 2.5 billion words
Licence: CC-BY 4.0

Finnish

This is a frequency wordlist, accompanied by a query tool, for obtaining commonly used psycholinguistic descriptives for Finnish words, as well as word surface form frequencies, lemma frequencies, syllable frequencies and letter n-gram frequencies. The resource is available for download from FIN-CLARIN.

Download

Relative frequencies of part-of-speech n-grams in native and translated Finnish literary prose

Licence: CC-BY 4.0

Finnish

This is a frequency list of part-of-speech n-grams appearing in the corpus Classics of English and American Literature, an English-Finnish parallel corpus and the Corpus of Translated Finnish. The resource is available for download from FIN-CLARIN.

Download

The Finnish N-grams 1820-2000 of the Newspaper and Periodical Corpus of the National Library of Finland

Licence: CC-BY 4.0

Finnish

This is a frequency list that contains sets of unigrams, bigrams and trigrams extracted from a newspaper corpus. The resource is available for download from FIN-CLARIN.

Download

Deutscher Wortschatz

Size: 5.8 million types
Linguistic information: synonymy, examples of use

German

This resource provides a list of annotated words taken from the deu_newscrawl_2011 corpus. The resource is available for online browsing through CLARIN-D/University of Leipzig.

Download

KELLY word-list Greek

Size: 7,385 entries
Licence: CC-BY-NC

Greek

This wordlist is useful for learning and teaching Greek as a foreign/second language; the words are classified according to the language levels of CEFR. The resource is available for download from clarin:el.

Download

Maltese Fiction Wordlist

Size: 41,251 tokens
Linguistic information: frequency
Licence: MS-NC-NoReD

Maltese

This is a wordlist compiled from 32 works of fiction. The resource is available for download from CLARIN PORTULAN.

Download

Maltese Wordlist

Size: 824,839 words
Licence: LGPL

Maltese

This is a wordlist used for spell-checking. It is available for download from CLARIN PORTULAN.

Download

Ngbugu digital wordlist: Archival form

Size: 204 words

Ngbugu

This is a wordlist used in language documentation, phonetics and lexicography. The resource is available for download from Ortolang.

Download

LCM-PL

Size: 10,000 entries
Licence: CC BY SA 3.0

Polish

This is a wordlist that lists the frequencies and abstraction levels of verbs. The resource is available for download from the CLARIN-PL repository.

Download

Gos corpus n-grams 2.0

Size: 2,598,153 n-grams
Linguistic information: frequency
Licence: CC-BY-SA 4.0

Slovenian

This is a list of n-grams extracted from the Gos corpus of spoken Slovene. The resource is available for download from CLARIN.SI.

Download

IMP corpus n-grams 2.0

Size: 34,668,696 n-grams
Linguistic information: frequency
Licence: CC-BY-SA 4.0

Slovenian

This is a list of n-grams extracted from the IMP corpus of historical Slovene. The resource is available for download from CLARIN.SI.

Download

Janes corpus n-grams 1.0

Size: 351,029,703 n-grams
Linguistic information: frequency
Licence: CC-BY-SA 4.0

Slovenian

This is a list of n-grams extracted from the Janes corpus of Slovenian user-generated content, version 1.0. The resource is available for download from CLARIN.SI.

Download

Kres corpus n-grams 2.0

Size: 211,104,769 n-grams
Linguistic information: frequency
Licence: CC-BY-SA 4.0

Slovenian

This is a list of n-grams extracted from the Kres corpus of written Slovenian. The resource is available for download from CLARIN.SI.

Download

Lexical functions of Spanish verb-noun collocations

Size: 1,000 verb-noun pairs
Linguistic information: lexicological classifications (free-word combinations, errors), semantic information

Spanish

This is a wordlist that consists of the 1,000 most frequent verb-noun pairs extracted automatically from the Spanish Web Corpus, together with their classifications (collocation vs. free word combination). The resource is available for download from Ortolang.

Download

Idioms from the NEO lexicon DB

Size: 4,928 entries
Licence: CC-BY 4.0

Swedish

This is a wordlist of idioms with explanations extracted from the database for the dictionary Nationalencyklopediens ordbok. The resource can be downloaded from SWE-CLARIN.

Download

Kelly (2017-10-16)

Size: 10,510 entries
Licence: CC-BY 4.0

Swedish

This is a list of Keywords for Language Learning for Young and adults alike (KELLY). The resource can be downloaded from the SWE-CLARIN repository and queried online through KARP.

Browse

Download

The Swedish N-grams 1770-1940 of the Newspaper and Periodical Corpus of the National Library of Finland

Licence: CC-BY 4.0

Swedish

This frequency list contains sets of unigrams, bigrams and trigrams extracted from a corpus compiled by the University of Helsinki from the digitized newspapers from the National Library of Finland. The resource is available for download from FIN-CLARIN.

Download

Vocation list (2015-01-10)

Size: 13,833 entries
Licence: CC-BY 4.0

Swedish

This is a wordlist of vocations in Swedish. The resource can be downloaded from the SWE-CLARIN repository.

Download

Multilingual resources

Resource

Language Description Availability

Topics of library and information science

Size: 732 words
Licence: CC-BY-NC

English, Greek

This is a word list of terms from the domain of library and information science. The resource is available for download from clarin:el.

Download

Vocabulaire d'archéologie

Size: 4,431 entries
Linguistic information: preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of archaeology available for download from ORTOLANG.

Download

Vocabulaire d'art et archéologie

Size: 1,960 entries
Linguistic information: preferred forms, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from fine arts and archaeology available for download from ORTOLANG.

Download

Vocabulaire de géographie de l'Amérique du Nord

Size: 4,232 entries
Linguistic information: hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a thesaurus of terms from the geography of North America available for download from ORTOLANG.

Download

Vocabulaire de Nutrition artificielle

Size: 2,500 entries
Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of nutrition available for download from ORTOLANG.

Download

Vocabulaire de Pathologies humaines

Linguistic information: hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of medicine (pathological diseases) available for download from ORTOLANG.

Download

Vocabulaire de philosophie

Size: 4,435 entries
Linguistic information: preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of philosophical terms available for download from ORTOLANG.

Download

Vocabulaire de préhistoire et protohistoire

Size: 3,093 entries
Linguistic information: hierarchical relationship, preferred forms, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of prehistory and protohistory available for download from ORTOLANG.

Download

Vocabulaire de Psychopathologie

Size: 575 terms
Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of psychopathology available for download from ORTOLANG.

Download

Vocabulaire de Sciences de l'éducation

Size: 2,681 entries
Linguistic information: preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of education available for download from ORTOLANG.

Download

Vocabulaire de Sciences du langage

Size: 6,142 entries
Linguistic information: hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of linguistics available for download from ORTOLANG.

Download

Vocabulaire de sociologie

Size: 5,277 entries
Linguistic information: preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of sociology available for download from ORTOLANG.

Download

Vocabulaire de Transferts de chaleur

Size: 1,462 entries
Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of thermodynamic terms available for download from ORTOLANG.

Download

Vocabulaire de Transfusion sanguine

Size: 2,000 entries
Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of medicine (related to blood transfusion) available for download from ORTOLANG.

Download

Vocabulaire d'ethnologie

Size: 9,517 entries
Linguistic information: preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of ethnology available for download from ORTOLANG.

Download

Vocabulaire d'histoire des sciences et des techniques

Size: 3,766 entries
Linguistic information: contextual information, usage examples, preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of technical sciences available for download from ORTOLANG.

Download

Vocabulaire d'histoire et sciences de la littérature

Size: 11,065 entries
Linguistic information: preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of literary studies available for download from ORTOLANG.

Download

Vocabulaire d'Histoire et sciences des religions

Size: 4,581 entries
Linguistic information: preferred form, synonym shape
Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of philosophy and religion available for download from ORTOLANG.

Download

Vocabulaire de sciences de la Terre

Size: 19,707 entries
Linguistic information: hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English, Spanish

This is a controlled vocabulary of expressions from the domain of geology available for download from ORTOLANG.

Download

Vocabulaire d'Electronique et électro-optique

Size: 4,456 entries
Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English, Spanish

This is a controlled vocabulary of expressions from the domain of electronics available for download from ORTOLANG.

Download

Λεξικό Γλωσσολογικών όρων: Γερμανικά – Ελληνικά - Αγγλικά (lexicon of linguistic terms: DE-EL-EN)

Size: 2,000 words

German, Greek, English

This is a wordlist of linguistic terms that is available for download from clarin:el.

Download

Labial vibrants in Mangbetu: Archival form

Licence: CC-BY

Mangbetu, French, English

This is a wordlist of lexical items that exemplify occurrences of bilabial trills and labiodental flaps. The resource is available for download from ORTOLANG.

Download

JRC-Names - a multilingual named entity resource

Linguistic information: spelling varieties of names
Licence: Open for Reuse with Restrictions

Slovenian, Swedish, Bulgarian, English, Greek, Estonian, Spanish (Castilian), Czech, German, Danish, French, Finnish, Italian, Hungarian, Latvian, Lithuanian, Maltese, Dutch (Flemish), Portuguese, Polish, Slovak, Romanian

This is a wordlist of named entities (person and organisation names). The resource is available for download from clarin:el.

Download

Swedish words, LEXIN

Size: 29,111 entries
Licence: CC-BY 4.0

Swedish, Albanian, Bosnian, English, Finnish, Modern Greek, Croatian, Iranian Persian, Russian, Serbian, Somali, Spanish, Turkish

This is a wordlist intended for use by immigrants to Sweden. The resource can be downloaded from the SWE-CLARIN repository.

Download