Tools for named entity recognition

Introduction

Named entity recognition (NER) is an information extraction task which identifies mentions of various named entities in unstructured text and classifies them into predetermined categories, such as person names, organisations, locations, date/time, monetary values, and so forth. NER tools can, for example, help with the classification of news content, content recommendations and search algorithms.
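As a rough illustration of what identifying and classifying mentions means in practice, the following Python sketch looks up mentions in a tiny hand-made dictionary. The dictionary, the function name find_entities and the example sentence are assumptions made for illustration only; the tools listed on this page rely on statistical, neural or rule-based models rather than on a plain lookup.

```python
import re

# Hypothetical toy gazetteer: surface form -> category (illustrative only).
GAZETTEER = {
    "Marie Curie": "person",
    "Sorbonne": "organisation",
    "Paris": "location",
}


def find_entities(text):
    """Return (start, end, mention, category) tuples sorted by position."""
    found = []
    for mention, category in GAZETTEER.items():
        for match in re.finditer(re.escape(mention), text):
            found.append((match.start(), match.end(), mention, category))
    return sorted(found)


sentence = "Marie Curie taught at the Sorbonne in Paris."
for start, end, mention, category in find_entities(sentence):
    print(f"{start:2d}-{end:2d}  {mention:<12} {category}")
```

Each result pairs a character span in the text with one of the predetermined categories, which is the kind of output a NER tool produces regardless of the method behind it.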

The CLARIN infrastructure offers 25 tools for NER. 17 tools are aimed at a single language (4 Dutch, 2 English, 1 Finnish, 2 German, 1 Icelandic, 1 Greek, 1 Hungarian, 1 Latvian, 3 Polish, 1 Portuguese), while the rest cover several languages. In terms of functionality, 16 tools are dedicated exclusively to NER, while 10 are part of tool pipelines that also provide functionalities such as PoS-tagging, lemmatisation and syntactic parsing.

To comment, suggest changes to existing content or propose new tools, send us an email.

This website was last updated on 20 July 2023.

Tools for named entity recognition in the CLARIN infrastructure

Tool Language Description

CTexTools 2

Functionality: tokenization, sentence segmentation, PoS-tagging, phrase chunking, NER
Licence: CC 4.0

Afrikaans, English, South Ndebele, Xhosa, Zulu, Sesotho, Pedi, Setswana, Swazi, Venda, Tsonga

This is a corpus query and manipulation tool primarily for the official South African languages. The tool supports the creation of frequency and word lists, collocation searches and statistical analysis of corpus data.

Availability: download

NER categories: organisation, person, location, miscellaneous, outside
CLARIN Centre: SADiLaR
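
The "outside" category in a list like the one above typically labels tokens that do not belong to any named entity. The sketch below assumes token-level labels in the widespread BIO convention (B- opens an entity, I- continues it, O is outside). Whether CTexTools emits exactly this encoding is not stated here, and the function name and the sample tags are illustrative assumptions.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (category, text) entity spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, etype = "O", None
        else:
            prefix, _, etype = tag.partition("-")
        # Continue the open entity only on an I- tag of the same type.
        if prefix == "I" and etype == current_type:
            current_tokens.append(token)
            continue
        # Otherwise close whatever entity was open ...
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        # ... and open a new one unless the token is outside any entity.
        if etype is not None:
            current_type, current_tokens = etype, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Nelson", "Mandela", "visited", "Cape", "Town", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(tokens, tags))
```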

NCHLT Tagger

Functionality: PoS-tagging, phrase chunking, NER
Platform: cross-platform
Licence: CC-A 2.5 South Africa Licence

Afrikaans, English, Ndebele, Xhosa, Zulu, Pedi, Setswana, Sesotho, Swazi, Venda, Tsonga

This is a graphical user interface and command-line tool for automatic text processing.

Availability: download

NER categories: organisation, person, location, miscellaneous, outside
CLARIN Centre: SADiLaR

FreeLing

Functionality: tokenisation, MSD-tagging, syntactic parsing, lemmatization, NER
Platform: cross-platform
Licence: Affero GPL

Catalan, English, Galician, Italian, Portuguese, Welsh

This is an open source language analysis tool suite that provides several processing components.

Availability: download
CLARIN Centre: LINDAT
NER categories: person, location, organisation, miscellaneous
Publication: Carreras, Màrquez, and Padró (2003)

Frog

Functionality: tokenisation, MSD-tagging, lemmatisation, morphological segmentation, phrase chunking, NER
Platform: Linux, Mac OS X
Licence: GNU General Public Licence

Dutch

Frog is a memory-based pipeline based on Timbl, the Tilburg memory-based learning software package. Frog produces FoLiA XML. 

Availability: download
CLARIN Centre: CLARIAH-NL
NER categories: person, organisation, location, product, event, miscellaneous
Publication: Van den Bosch et al. (2007)

INL labs

Functionality: tokenisation, sentence segmentation, PoS-tagging, lemmatisation, NER
Platform: cross-platform

Dutch

This toolchain currently provides two annotation tools: the Stanford named entity recognizer, which was trained on the historical Dutch newspapers corpus Letters as loot in the context of the IMPACT project (Landsbergen 2012), and a tagger that consists of a tokenizer/sentence boundary detector, a statistical part-of-speech tagger and a lemmatizer. 

This toolchain produces linguistically annotated output from a number of input formats (TEI, plain text, ALTO, .doc files).

Availability: online service
CLARIN Centre: CLARIAH-NL
NER categories: person, organization, location, miscellaneous
 

NameScape: Named Entity Recognition

Functionality: NER
Platform: cross-platform

Dutch

This NER was developed in the Namescape project.

Availability: online service
CLARIN Centre: CLARIAH-NL

The NERD named entity recognizer

Functionality: NER
Platform: cross-platform

Dutch

This NER is now integrated into the PICCL workflow.

Availability: online service
CLARIN Centre: CLARIAH-NL

NameTag

Functionality: NER
Platform: Linux, Windows, OS X
Licence: MPL 2.0 (software), CC BY-NC-SA (models)

Czech, English

NameTag is an open-source tool that recognizes different NER categories per language model. For Czech, it recognizes a complex hierarchy of categories. The English model, which is trained on CoNLL-2003 NER annotations (Sang and De Meulder 2003), distinguishes the following four NER classes: person, organisation, location and miscellaneous.
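
For readers unfamiliar with the CoNLL-2003 annotations mentioned above: the shared-task data is distributed as plain text with one token per line, four whitespace-separated columns (token, PoS tag, chunk tag, NER tag), blank lines between sentences and -DOCSTART- lines between documents. The following sketch reads that layout; the function name and the short sample are illustrative assumptions and are not part of NameTag.

```python
def read_conll2003(lines):
    """Yield sentences as lists of (token, ner_tag) pairs."""
    sentence = []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


# Illustrative sample in the layout described above.
SAMPLE = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

for sentence in read_conll2003(SAMPLE):
    print(sentence)
```

The tag suffixes PER, ORG, LOC and MISC correspond to the four classes named above: person, organisation, location and miscellaneous.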

The trained model for Czech is available through LINDAT: Czech Models (CNEC) for NameTag.

A user manual is also available.

Availability: download, online service, web API
CLARIN Centre: LINDAT
NER categories: per model, see above
Publication: Straková, Straka and Hajič (2013)

Illinois Named Entity Recognizer

Functionality: NER
Platform: cross-platform
Licence: underlying software is open source

English

This NER annotates plain text.

Availability: WebLicht
CLARIN Centre: CLARIN-D
NER categories: person, location, organization, miscellaneous

OpenNLP Name Finder (English)

Functionality: NER
Platform: Linux, Windows
Licence: Apache Licence 2.0

English

This NER can be applied to existing corpora available through the CLARIN:EL infrastructure and to those independently uploaded corpora that are compatible with the tool’s requirements.

Availability: online service
CLARIN Centre: CLARIN:EL
NER categories: person, location, organization

GATE

Functionality: tokenization, PoS-tagging, NER, semantic and orthographic coreference, pronominal coreference
Platform: cross-platform
Licence: LGPL

English, French, German, Romanian, Russian, Welsh, Danish, Chinese, Arabic

This is a complete NLP platform with modules for named entity recognition.

Availability: download, online service
CLARIN Centre: CLARIN-UK
NER categories: person, location, organisation, date, percent, money
Publication: Cunningham et al. (2019)

OpenNLP Named Entity Recognizer

Functionality: NER
Platform: cross-platform
Licence: Apache License version 2.0 (underlying software)

English, Spanish

This NER is based on the OpenNLP NER tool.

Availability: WebLicht
CLARIN Centre: CLARIN-D
NER categories: person, location, organization

Finnish Tagtools 1.4

Functionality: PoS/MSD-tagging, NER
Platform: Linux, Unix
Licence: GPL 3

Finnish

This software package provides finnish-postag, a part-of-speech and morphology tagger for Finnish, and finnish-nertag, a named entity recogniser for Finnish.

Availability: download, online service
CLARIN Centre: FIN-CLARIN
NER categories: person (human, mythological, animal, other); location (political, geographical, street, infrastructure, mythological, astronomical, other); organisation (corporation, political, media, financial, educational, cultural, athletic, other, miscellaneous); product; event; time (dates, times); numerical expressions (measurements, money)
Publication: Ruokolainen et al. (2019)

German Named Entity Recognizer

Functionality: NER
Platform: cross-platform
Licence: Apache License, Version 2.0 (underlying software)

German

This NER is based on the maximum entropy approach using the OpenNLP maxent library. Two models are available: one trained on the CoNLL-2003 training set (conll) and one trained on the TuebaDZ corpus, release 8 (tuebadz).
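
In a maximum entropy classifier, each label receives a score equal to the sum of the weights of the features active for a token, and the scores are turned into probabilities with a softmax. The sketch below shows only that scoring step. The feature names and weights are made up for illustration; it does not use the OpenNLP maxent library or either of the models named above.

```python
import math

# Hypothetical feature weights, one dict per label; a real model learns these.
WEIGHTS = {
    "person": {"is_capitalised": 1.0, "prev=Herr": 2.5},
    "location": {"is_capitalised": 1.0, "prev=in": 2.0},
    "organization": {"is_capitalised": 1.0, "suffix=AG": 2.5},
    "outside": {"is_lowercase": 2.0},
}


def label_probabilities(active_features):
    """Softmax over per-label sums of the weights of the active features."""
    scores = {
        label: sum(weights.get(f, 0.0) for f in active_features)
        for label, weights in WEIGHTS.items()
    }
    normaliser = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / normaliser for label, s in scores.items()}


probs = label_probabilities(["is_capitalised", "prev=Herr"])
print(max(probs, key=probs.get))
```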

Availability: WebLicht
CLARIN Centre: CLARIN-D
NER categories: person, location, organization

Person Name Recognizer

Functionality: NER
Platform: cross-platform
Licence: terms of service

German

This NER is tailored to historical German (optimized for journals and high precision) and is based on weighted finite state transducers.

Availability: WebLicht
CLARIN Centre: CLARIN-D
NER categories: person
Publication: Didakowski and Drotschmann (2009)

Sticker Named Entity Recognizer

Functionality: NER
Platform: cross-platform
Licence: Blue Oak Model Licence 1.0.0 (underlying software)

German, Dutch

This NER is built on a neural-network-based sequence labeller that can label named entities for German and Dutch.

Availability: download, WebLicht
CLARIN Centre: CLARIN-D
NER categories: person, location, organization, geopolitical entity, other

GrNE-Tagger

Functionality: NER
Platform: cross-platform
Licence: terms of service (academic non-commercial use)

Greek (modern)

This NER uses a rule-based engine. It was developed and is maintained by the Institute for Language and Speech Processing / Athena Research Center. This recognizer can be applied to existing corpora available through the CLARIN:EL infrastructure and to those independently uploaded corpora that are compatible with the tool’s requirements.

Availability: online service
CLARIN Centre: CLARIN:EL
NER categories: person, location, organization, facility, gpe (geo-political entity)

hunner - named entity recognizer for Hungarian

Functionality: NER

Hungarian

This NER employs a maximum entropy approach.

Availability: unavailable
CLARIN Centre: HUN-CLARIN
Publication: Simon (2013)

 
Functionality: NER
Licence: Apache 2.0 (models)

Icelandic

This is a dockerized NER for Icelandic. The code for the API is available on GitHub. Two models for this NER are available for download through the CLARIN-IS repository: the ELECTRA-base model, which achieves an F1-score of ~91.9 on the MIM-GOLD-NER test set, and the Ensemble model, which uses the IceBERT language model from Miðeind as its primary model, but also offers the possibility to use 3 other transformer language models with it (ELECTRA-base, convbert-small, and multilingual-BERT) and combines them with CombiTagger.
 
Availability: online service
NER categories: person, location, organization, miscellaneous, date, money, time, percent
CLARIN Centre: CLARIN-IS
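
How the outputs of several taggers can be combined is easiest to see with token-level voting. The sketch below uses a simple majority vote with ties resolved in favour of the primary model. This is an assumption made for illustration: the combination scheme actually applied by CombiTagger in the Ensemble model is not described here, and the tag names are invented.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model tag sequences by token-level majority vote.

    `predictions` is a list of tag sequences, primary model first; ties are
    resolved in favour of the earliest model in that list.
    """
    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        # First tag (in model order) that reaches the highest count.
        combined.append(next(t for t in token_tags if counts[t] == best))
    return combined


# Hypothetical outputs of four models for the same four tokens.
primary = ["B-Person", "I-Person", "O", "B-Location"]
model_2 = ["B-Person", "O", "O", "B-Location"]
model_3 = ["B-Person", "I-Person", "O", "O"]
model_4 = ["O", "O", "O", "B-Organization"]
print(majority_vote([primary, model_2, model_3, model_4]))
```

Voting on each token separately can produce tag sequences that no single model proposed, so a practical ensemble may need an additional consistency check.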
 
NLP-PIPE

Functionality: tokenisation, NER, MSD-tagging, lemmatisation, syntactic parsing (universal dependencies)

Latvian

NLP-PIPE is a modular toolchain that allows researchers to combine multiple natural language processing tools in a unified framework. It supports a wide range of annotation services for Latvian, including tokenization, morphological tagging, lemmatisation, universal dependency parsing, and named entity recognition. In the web-based interface, a user simply selects the required processing tools and inputs the text they want to annotate. The results can then be viewed either directly on the website or exported in several formats (JSON, CoNLL).
 
Availability: online service
CLARIN Centre: CLARIN-LV
Publication: Znotiņš and Cīrule (2018)

Liner2

Functionality: NER
Platform: cross-platform
Licence: GNU General Public License

Polish

This NER uses conditional random fields and a rich set of token features. The tool got third place in the PolEval 2018 Task 2 on named entity recognition. It contains a pre-trained model trained on the National Corpus of Polish (NKJP) and KPWr corpus (Broda et al. 2012).
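
To give an idea of what a set of token features can look like, the sketch below builds a small feature dictionary for one token and its immediate neighbours. The chosen features and the function name are assumptions for illustration; they are not the feature set used by Liner2.

```python
def token_features(tokens, i):
    """Return a feature dict for tokens[i], using a one-token context window."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
    }
    # Neighbouring tokens, with boundary markers at the sentence edges.
    features["prev"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    features["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return features


tokens = ["Adam", "Mickiewicz", "urodził", "się", "w", "Zaosiu"]
print(token_features(tokens, 1))
```

A conditional random field learns one weight per combination of feature value and label, and additionally scores transitions between neighbouring labels.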

The KPWr model distinguishes the following categories: person, location, facility, organization, product, event, adjective.

The NKJP model distinguishes the following NER categories: person, organization, location, date, time.

Availability: download, online service, web API
CLARIN Centre: CLARIN-PL
NER categories: per model, see above.
Publication: Marcińczuk, Kocoń, and Gawor (2018)

Nerf

Functionality: NER
Platform: Haskell Platform
Licence: GPL v.3

Polish

This statistical NER is based on linear-chain conditional random fields.

Availability: download
CLARIN Centre: CLARIN-PL
Trained models: download
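
A linear-chain conditional random field scores a label sequence as the sum of per-token label scores and scores for each pair of neighbouring labels, and the best sequence is usually found with the Viterbi algorithm. The sketch below shows that decoding step in Python with made-up scores. It is a generic illustration and does not reproduce Nerf's Haskell implementation or its models.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][y]      -- score of label y at position t
    transitions[(a, b)]  -- score of moving from label a to label b
    """
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            # Best previous label for ending in y at position t.
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, y)])
            scores[y] = best[-1][prev] + transitions[(prev, y)] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a three-token sentence and two labels.
LABELS = ["O", "PER"]
TRANSITIONS = {("O", "O"): 0.5, ("O", "PER"): 0.0,
               ("PER", "O"): 0.0, ("PER", "PER"): 1.0}
EMISSIONS = [
    {"O": 0.1, "PER": 2.0},
    {"O": 0.8, "PER": 0.6},
    {"O": 2.0, "PER": 0.0},
]
print(viterbi(EMISSIONS, TRANSITIONS, LABELS))
```

In this toy input the second token on its own slightly prefers "O", but the transition score for staying inside a person name outweighs that, which is the kind of context effect a chain model adds over token-by-token classification.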

PolDeepNer

Functionality: NER
Platform: cross-platform
Licence: GNU General Public Licence

Polish

This NER uses deep learning methods. The tool got 2nd place in the PolEval 2018 Task 2 on NER. It contains a pre-trained model on the NKJP corpus.

Availability: download
CLARIN Centre: CLARIN-PL
NER categories: nested annotations of the following types: personal names (forenames, surnames, additional names), organizational names, geographic names, place names (district, settlement, region, country, bloc), date, and time
Publication: Marcińczuk, Kocoń, and Gawor (2018)
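
Nested annotations mean that one entity span can contain another, for example a personal name that contains a forename and a surname. A flat one-label-per-token encoding cannot express this, so nested output is commonly represented as a list of spans. The sketch below assumes (start, end, category) tuples over token offsets; this representation and the example are illustrative and not PolDeepNer's actual output format.

```python
def nesting_pairs(annotations):
    """Return (outer, inner) pairs where inner lies inside outer.

    Each annotation is (start, end, category) with token offsets, end exclusive.
    """
    pairs = []
    for outer in annotations:
        for inner in annotations:
            o_start, o_end, _ = outer
            i_start, i_end, _ = inner
            contained = o_start <= i_start and i_end <= o_end
            # Spans with identical boundaries are not counted as nested.
            if contained and (o_start, o_end) != (i_start, i_end):
                pairs.append((outer, inner))
    return pairs


# Hypothetical annotations for the tokens ["Jan", "Kowalski"].
annotations = [
    (0, 2, "personal name"),
    (0, 1, "forename"),
    (1, 2, "surname"),
]
for outer, inner in nesting_pairs(annotations):
    print(f"{inner[2]} is nested in {outer[2]}")
```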

LX-NER

Functionality: NER
Platform: cross-platform

Portuguese

This NER annotates plain text by identifying and classifying the expressions for named entities it contains.

The name-based module is integrated into the full LX-Suite pipeline (tokenization, POS tagging, parsing).

Availability: online service
CLARIN Centre: PORTULAN
NER categories: name-based: person, organization, location, events, works; number-based: numbers, measures, time

janes-ner

Functionality: NER
Platform: cross-platform
Licence: Apache License 2.0

Slovenian, Croatian, Serbian

This named entity recognizer is a slight modification of the CRF-based reldi-tagger with Brown clusters information added. Input data need to be pre-processed by the reldi-tokeniser and the reldi-tagger for morphosyntactic annotation.

Availability: download, online service, web API
CLARIN Centre: CLARIN.SI
NER categories: person, person derivative, location, organization and miscellaneous
Publication: Fišer, Ljubešić and Erjavec (2018)
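
Brown clustering assigns each word a bit string that encodes its position in a binary cluster hierarchy, so prefixes of the bit string identify coarser clusters. A common way to add this information to a CRF tagger is to use such prefixes as extra token features. The sketch below illustrates the idea; the cluster strings, the prefix lengths and the function name are made up and are not taken from janes-ner.

```python
# Hypothetical word -> Brown cluster bit-string mapping (not from a real model).
BROWN_CLUSTERS = {
    "ljubljana": "110100",
    "zagreb": "110101",
    "janez": "011010",
}


def brown_features(word, prefix_lengths=(2, 4, 6)):
    """Return cluster-path prefix features for a word, or {} if unknown."""
    path = BROWN_CLUSTERS.get(word.lower())
    if path is None:
        return {}
    return {f"brown_{n}": path[:n] for n in prefix_lengths if n <= len(path)}


print(brown_features("Ljubljana"))
print(brown_features("Zagreb"))
```

Words that share a prefix, such as the two city names in this toy mapping, end up with identical coarse features, which lets the tagger generalise to words it rarely saw in training.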

Publications

[Broda et al. 2012] Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a Free Corpus of Polish. In Proceedings of LREC2012.

[Carreras, Màrquez and Padró 2003] Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2003. A simple named entity extractor using AdaBoost. In CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, 152–155.

[Chrupała and Klakow 2010] Grzegorz Chrupała and Dietrich Klakow. 2010. A Named Entity Labeller for German: exploiting Wikipedia and distributional clusters. In Proceedings of LREC2010.

[Cunningham et al. 2019] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, Wim Peters, and Leon Derczynski. 2019. Developing Language Processing Components with GATE Version 8 (a User Guide).

[Derczynski et al. 2015] Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management 51 (2): 32–49.

[Didakowski and Drotschmann 2009] Jörg Didakowski and Marko Drotschmann. 2009. In Finite-State Methods and Natural Language Processing, 50–61.

[Fišer, Ljubešić and Erjavec 2018] Darja Fišer, Nikola Ljubešić, and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources and Evaluation.

[Landsbergen 2012] Frank Landsbergen. 2012. Evaluation of named entity work in IMPACT: NE Recognition and matching. Technical report.

[Marcińczuk, Kocoń and Gawor 2018] Michał Marcińczuk, Jan Kocoń, and Michał Jacek Gawor. 2018. Recognition of Named Entities for Polish-Comparison of Deep Learning and Conditional Random Fields Approaches. In Proceedings of the PolEval 2018 Workshop, 71–86.

[Ruokolainen et al. 2019] Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, and Krister Lindén. 2019. A Finnish news corpus for named entity recognition. Language Resources and Evaluation.

[Sang and De Meulder 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, 142–147.

[Simon 2013] Eszter Simon. 2013. Approaches to Hungarian Named Entity Recognition. PhD Thesis.

[Straková, Straka and Hajič 2013] Jana Straková, Milan Straka, and Jan Hajič. 2013. A New State-of-The-Art Czech Named Entity Recognizer. In TSD 2013: Text, Speech, and Dialogue, edited by I. Habernal and V. Matoušek, 68–75.

[Van den Bosch et al. 2007] Antal van den Bosch, Bertjan Busser, Sander Canisius and Walter Daelemans. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. In Computational Linguistics in the Netherlands 2006: selected papers from the Seventeenth CLIN meeting, edited by Peter Dirix, 191–206.

[Znotiņš and Cīrule 2018] Artūrs Znotiņš and Elita Cīrule. 2018. NLP-PIPE: Latvian NLP Tool Pipeline. Frontiers in Artificial Intelligence and Applications 307: 183–189.