Wordlists

Introduction

Wordlists are lexical resources that provide only alphabetical or frequency-based lexical inventories. There are 53 wordlists in the CLARIN infrastructure. About half (29) of the wordlists are monolingual, accounting for 11 languages (Dutch, Estonian, Finnish, German, Greek, Maltese, Ngbugu, Polish, Slovenian, Spanish, Swedish), while the other half (24) cover a variety of bilingual and multilingual language combinations (e.g., English-Greek, French-English-Spanish). In the vast majority of cases, the wordlists can be downloaded directly from the national repositories or queried through easy-to-use online search environments.

To comment, suggest changes to the existing content, or propose new resources, send us an email.

This website was last updated on 3 June 2021.

 

Wordlists in the CLARIN infrastructure

Monolingual resources

Resource

Language Description Availability

INT Historical Word List

Size: 500,000 word forms

Licence: other

Dutch

This wordlist includes historical lexemes for the period between 1550 and 1970. The resource is available for download from the Dutch Language Institute (INT).

For a related publication, see de Does and Depuydt (2012).

Download

Neologisms Online v3

Size: 19,000 words and expressions

Dutch

This wordlist of neologisms is available for online browsing through the Dutch Language Institute (INT).

Download

Estonian Frequency Dictionary (ver. 2.0)

Size: 997,934 word forms

Licence: CLARIN PUB

Estonian

This is a frequency list available for download (CELR distribution) and for online browsing.

Browse

Download

Names of Countries

Licence: CLARIN ACA

Estonian

This is a wordlist that is based on the Estonian orthography of foreign place names. The resource is available for online browsing.

Browse

The Conceptual File of Estonian Lexis of the Institute of Estonian Language

Licence: CLARIN ACA

Estonian

This is a controlled vocabulary of several more or less related concepts (e.g., gardening, haymaking, weather, fishing, religion). The resource is available for online browsing.

Browse

Finnish Verbal Colorative Constructions

Size: 61,617 words

Licence: CC-BY

Finnish

This is a wordlist that contains Finnish verbal “colorative” (i.e., stylistically marked) constructions. The resource is available for download through FIN-CLARIN.

Download

Frequencies of Early Modern Finnish Words

Size: 4,862,190 words

Licence: EUPL

Finnish

This is a frequency lexicon that consists of words from the Old Literary Finnish text corpus. The resource is available online through FIN-CLARIN.

Browse

Frequencies of Old Literary Finnish Words

Size: 3,425,382 words

Licence: EUPL

Finnish

This is a frequency lexicon that consists of words from the Old Literary Finnish text corpus. The resource is available online through FIN-CLARIN.

Browse

Frequency Lexicon of the Finnish Newspaper Language

Size: 9,996 words

Licence: CC-BY NC ND 1.0

Finnish

This is a frequency lexicon available online through FIN-CLARIN.

Browse

Frequency List of Written Finnish Word Forms

Size: 17,604 lemmas; 1,339,787 word forms

Licence: EUPL

Finnish

This is a frequency lexicon of Finnish word forms that appear in the Finnish Parole text corpus. The resource is available online through FIN-CLARIN.

Browse

Modern Finnish Word List

Size: 94,110 entries

Linguistic information: MSD-tags, lemmas

Licence: GNU LGPL; EUPL v.1.1; CC-BY SA 3.0

Finnish

This is a wordlist of contemporary general vocabulary that is available for download through FIN-CLARIN.

Download

Psycholinguistic Descriptives

Size: 2.5 billion words

Licence: CC-BY 4.0

Finnish

This is a frequency wordlist, accompanied by a query tool, for acquiring commonly used psycholinguistic descriptives for Finnish words, including word surface form frequencies, lemma frequencies, syllable frequencies and letter n-gram frequencies. The resource is available for download from FIN-CLARIN.

Download

Relative frequencies of part-of-speech n-grams in native and translated Finnish literary prose

Licence: CC-BY 4.0

Finnish

This is a frequency list of part-of-speech n-grams appearing in the corpus Classics of English and American Literature, English-Finnish parallel corpus and the corpus of Translated Finnish. The resource is available for download from FIN-CLARIN.

Download

The Finnish N-grams 1820-2000 of the Newspaper and Periodical Corpus of the National Library of Finland

Licence: CC-BY 4.0

Finnish

This is a frequency list that contains sets of unigrams, bigrams and trigrams extracted from a newspaper corpus. The resource is available for download from FIN-CLARIN.

Download

Deutscher Wortschatz

Size: 5.8 million types

Linguistic information: synonymy, examples of use

German

This resource provides a list of annotated words taken from the deu_newscrawl_2011 corpus. The resource is available for online browsing through CLARIN-D/University of Leipzig.

Download

KELLY word-list Greek

Size: 7,385 entries

Licence: CC-BY-NC

Greek

This wordlist is useful for learning and teaching Greek as a foreign/second language; the words are classified according to the language levels of CEFR. The resource is available for download from clarin:el.

Download

Maltese Fiction Wordlist

Size: 41,251 tokens

Linguistic information: frequency

Licence: MS-NC-NoReD

Maltese

This is a wordlist compiled from 32 fiction books. The resource is available for download from CLARIN PORTULAN.

Download

Maltese Wordlist

Size: 824,839 words

Licence: LGPL

Maltese

This is a wordlist used for spell-checking. It is available for download from CLARIN PORTULAN.

Download

Ngbugu digital wordlist: Archival form

Size: 204 words

Ngbugu

This is a wordlist used in language documentation, phonetics and lexicography. The resource is available for download from ORTOLANG.

Download

LCM-PL

Size: 10,000 entries

Licence: CC-BY-SA 3.0

Polish

This is a wordlist that lists the frequencies and abstraction levels of verbs. The resource is available for download from the CLARIN-PL repository.

Download

Gos corpus n-grams 2.0

Size: 2,598,153 n-grams

Linguistic information: frequency

Licence: CC-BY-SA 4.0

Slovenian

This is a list of n-grams extracted from the Gos corpus of spoken Slovene. The resource is available for download from CLARIN.SI.

Download

IMP corpus n-grams 2.0

Size: 34,668,696 n-grams

Linguistic information: frequency

Licence: CC-BY-SA 4.0

Slovenian

This is a list of n-grams extracted from the IMP corpus of historical Slovene. The resource is available for download from CLARIN.SI.

Download

Janes corpus n-grams 1.0

Size: 351,029,703 n-grams

Linguistic information: frequency

Licence: CC-BY-SA 4.0

Slovenian

This is a list of n-grams extracted from the Janes corpus of Slovenian user-generated content, version 1.0. The resource is available for download from CLARIN.SI.

Download

Kres corpus n-grams 2.0

Size: 211,104,769 n-grams

Linguistic information: frequency

Licence: CC-BY-SA 4.0

Slovenian

This is a list of n-grams extracted from the Kres corpus of written Slovenian. The resource is available for download from CLARIN.SI.

Download

Lexical functions of Spanish verb-noun collocations

Size: 1,000 verb-noun pairs

Linguistic information: lexicological classifications (free-word combinations, errors), semantic information

Spanish

This is a wordlist that consists of the 1,000 most frequent verb-noun pairs extracted automatically from the Spanish Web Corpus, together with their classifications (collocation vs. free word combination). The resource is available for download from ORTOLANG.

Download

Idioms from the NEO lexicon DB

Size: 4,928 entries

Licence: CC-BY 4.0

Swedish

This is a wordlist of idioms with explanations extracted from the database for the dictionary Nationalencyklopediens ordbok. The resource can be downloaded from the SWE-CLARIN repository.

Download

Kelly (2017-10-16)

Size: 10,510 entries

Licence: CC-BY 4.0

Swedish

This is a list of keywords for language learning, intended for young learners and adults alike. The resource can be downloaded from the SWE-CLARIN repository and queried online through KARP.

Browse

Download

The Swedish N-grams 1770-1940 of the Newspaper and Periodical Corpus of the National Library of Finland

Licence: CC-BY 4.0

Swedish

This frequency list contains sets of unigrams, bigrams and trigrams extracted from a corpus compiled by the University of Helsinki from the digitized newspapers from the National Library of Finland. The resource is available for download from FIN-CLARIN.

Download

Vocation list (2015-01-10)

Size: 13,833 entries

Licence: CC-BY 4.0

Swedish

This is a wordlist of vocations in Swedish. The resource can be downloaded from the SWE-CLARIN repository.

Download

Multilingual resources

Resource

Language Description Availability

Topics of library and information science

Size: 732 words

Licence: CC-BY-NC

English, Greek

This is a wordlist of terms from the domain of library and information science. The resource is available for download from clarin:el.

Download

Vocabulaire d'archéologie

Size: 4,431 entries

Linguistic information: preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of archaeology available for download from ORTOLANG.

Download

Vocabulaire d'art et archéologie

Size: 1,960 entries

Linguistic information: preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from fine arts and archaeology available for download from ORTOLANG.

Download

Vocabulaire de géographie de l'Amérique du Nord

Size: 4,232 entries

Linguistic information: hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a thesaurus of terms from the geography of North America available for download from ORTOLANG.

Download

Vocabulaire de Nutrition artificielle

Size: 2,500 entries

Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of nutrition available for download from ORTOLANG.

Download

Vocabulaire de Pathologies humaines

Linguistic information: hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of medicine (pathological diseases) available for download from ORTOLANG.

Download

Vocabulaire de philosophie

Size: 4,435 entries

Linguistic information: preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of philosophical terms available for download from ORTOLANG.

Download

Vocabulaire de préhistoire et protohistoire

Size: 3,093 entries

Linguistic information: hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of prehistory and protohistory available for download from ORTOLANG.

Download

Vocabulaire de Psychopathologie

Size: 575 terms

Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of psychopathology available for download from ORTOLANG.

Download

Vocabulaire de Sciences de l'éducation

Size: 2,681 entries

Linguistic information: preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of education available for download from ORTOLANG.

Download

Vocabulaire de Sciences du langage

Size: 6,142 entries

Linguistic information: hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of linguistics available for download from ORTOLANG.

Download

Vocabulaire de sociologie

Size: 5,277 entries

Linguistic information: preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of sociology available for download from ORTOLANG.

Download

Vocabulaire de Transferts de chaleur

Size: 1,462 entries

Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of thermodynamic terms available for download from ORTOLANG.

Download

Vocabulaire de Transfusion sanguine

Size: 2,000 entries

Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of medicine (related to blood transfusion) available for download from ORTOLANG.

Download

Vocabulaire d'ethnologie

Size: 9,517 entries

Linguistic information: preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of ethnology available for download from ORTOLANG.

Download

Vocabulaire d'histoire des sciences et des techniques

Size: 3,766 entries

Linguistic information: contextual information, usage examples, preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of technical sciences available for download from ORTOLANG.

Download

Vocabulaire d'histoire et sciences de la littérature

Size: 11,065 entries

Linguistic information: preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of literary studies available for download from ORTOLANG.

Download

Vocabulaire d'Histoire et sciences des religions

Size: 4,581 entries

Linguistic information: preferred form, synonym shape

Licence: CC-BY 4.0

French, English

This is a controlled vocabulary of expressions from the domain of philosophy and religion available for download from ORTOLANG.

Download

Vocabulaire de sciences de la Terre

Size: 19,707 entries

Linguistic information: hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English, Spanish

This is a controlled vocabulary of expressions from the domain of geology available for download from ORTOLANG.

Download

Vocabulaire d'Electronique et électro-optique

Size: 4,456 entries

Linguistic information: associative relationship, hierarchical relationship, preferred form, synonym shape

Licence: CC-BY 4.0

French, English, Spanish

This is a controlled vocabulary of expressions from the domain of electronics available for download from ORTOLANG.

Download

Λεξικό Γλωσσολογικών όρων: Γερμανικά – Ελληνικά - Αγγλικά (lexicon of linguistic terms: DE-EL-EN)

Size: 2,000 words

German, Greek, English

This is a wordlist of linguistic terms that is available for download from clarin:el.

Download

Labial vibrants in Mangbetu: Archival form

Licence: CC-BY

Mangbetu, French, English

 

This is a wordlist of lexical items that exemplify occurrences of bilabial trills and labiodental flaps. The resource is available for download from ORTOLANG.

Download

JRC-Names - a multilingual named entity resource

Linguistic information: spelling varieties of names

Licence: Open for Reuse with Restrictions

Slovenian, Swedish, Bulgarian, English, Greek, Estonian, Spanish (Castilian), Czech, German, Danish, French, Finnish, Italian, Hungarian, Latvian, Lithuanian, Maltese, Dutch (Flemish), Portuguese, Polish, Slovak, Romanian

This is a wordlist of named entities (person and organisation names). The resource is available for download from clarin:el.

Download

Swedish words, LEXIN

Size: 29,111 entries

Licence: CC-BY 4.0

Swedish, Albanian, Bosnian, English, Finnish, Modern Greek, Croatian, Iranian Persian, Russian, Serbian, Somali, Spanish, Turkish

This is a wordlist intended for immigrants to Sweden. The resource can be downloaded from the SWE-CLARIN repository.

Download