
Tools for named entity recognition

 

Introduction

Named entity recognition (NER) is an information extraction task that identifies mentions of named entities in unstructured text and classifies them into predetermined categories, such as person names, organisations, locations, dates and times, monetary values, and so forth. NER tools can, for example, support the classification of news content, content recommendations and search algorithms.

The CLARIN infrastructure offers 24 tools for NER. 15 tools are aimed at a single language (4 Dutch, 2 English, 1 Finnish, 2 German, 1 Greek, 1 Hungarian, 3 Polish, 1 Portuguese), while the rest are multilingual. In terms of functionality, 16 tools are dedicated exclusively to NER, while 8 are part of tool pipelines that also provide functionalities such as PoS-tagging, lemmatisation and syntactic parsing.

For comments, changes of the existing content or inclusion of new tools, send us an email.

This website was last updated on 30 March 2020.

Tools for named entity recognition in the CLARIN infrastructure

Tool Language Description

CTexTools 2

Functionality: tokenization, sentence segmentation, PoS-tagging, phrase chunking, NER
Licence: CC 4.0

Afrikaans, English, South Ndebele, Xhosa, Zulu, Sesotho, Pedi, Setswana, Swazi, Venda, Tsonga

This is a corpus query and manipulation tool primarily for the official South African languages. The tool supports the creation of frequency and word lists, collocation searches and statistical analysis of corpus data.

Availability: download
CLARIN Centre: SADiLaR

NCHLT Tagger

Functionality: PoS-tagging, phrase chunking, NER
Platform: cross-platform
Licence: CC-A 2.5 South Africa Licence

Afrikaans, English, Ndebele, Xhosa, Zulu, Pedi, Setswana, Sesotho, Swazi, Venda, Tsonga

This is a graphical user interface and command-line tool for automatic text processing.

Availability: download
CLARIN Centre: SADiLaR

FreeLing

Functionality: tokenisation, MSD-tagging, syntactic parsing, lemmatization, NER
Platform: cross-platform
Licence: Affero GPL

Catalan, English, Galician, Italian, Portuguese, Welsh

This is an open source language analysis tool suite that provides several processing components.

Availability: download
CLARIN Centre: LINDAT
NER categories: person, location, organisation, miscellaneous
Publication: Carreras, Màrquez, and Padró (2003)

Frog

Functionality: tokenisation, MSD-tagging, lemmatisation, morphological segmentation, phrase chunking, NER
Platform: Linux, Mac OS X
Licence: GNU General Public Licence

Dutch

Frog is a memory-based NLP pipeline based on Timbl, the Tilburg memory-based learning software package. Frog produces FoLiA XML. 

Availability: download
CLARIN Centre: CLARIAH-NL
NER categories: person, organisation, location, product, event, miscellaneous
Publication: Van den Bosch et al. (2007)

INL labs

Functionality: tokenisation, sentence segmentation, PoS-tagging, lemmatisation, NER
Platform: cross-platform

Dutch

This toolchain currently provides two annotation tools: the Stanford named entity recognizer, which was trained on the historical Dutch newspapers corpus Letters as loot in the context of the IMPACT project (Landsbergen 2012), and a tagger that consists of a tokenizer/sentence boundary detector, a statistical part-of-speech tagger and a lemmatizer. 

This toolchain outputs linguistically annotated TEI from a number of input formats (TEI, plain text, Alto, .doc files).

Availability: online service
CLARIN Centre: CLARIAH-NL
NER categories: person, organization, location, miscellaneous
 

NameScape: Named Entity Recognition

Functionality: NER
Platform: cross-platform

Dutch

This NER was developed in the Namescape project.

Availability: online service
CLARIN Centre: CLARIAH-NL

The NERD named entity recognizer

Functionality: NER
Platform: cross-platform

Dutch

This NER is now integrated into the PICCL workflow.

Availability: online service
CLARIN Centre: CLARIAH-NL

NameTag

Functionality: NER
Platform: Linux, Windows, OS X
Licence: MPL 2.0

Czech, English

NameTag is an open-source tool that recognizes different NER categories per language model. For Czech, it recognizes a complex hierarchy of categories. The English model, which is trained on CoNLL-2003 NER annotations (Sang and De Meulder 2003), distinguishes the following four NER classes: person, organisation, location and miscellaneous.

The trained model for Czech is available through LINDAT: Czech Models (CNEC) for NameTag.

A user manual is also available.

Availability: download, online service, web API
CLARIN Centre: LINDAT
NER categories: per model, see above
Publication: Straková, Straka and Hajič (2013)
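
CoNLL-2003-style annotations, such as those used to train the English NameTag model, are commonly represented as one BIO tag per token (B- opens an entity, I- continues it, O marks tokens outside any entity). The sketch below shows how such tags can be collapsed into entity spans. It is an illustration only: the function name, the sample sentence and the tag strings are assumptions, and it does not reproduce NameTag's own input or output format.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse token-level BIO tags into (entity text, category) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Store the entity collected so far, if there is one.
        if current_tokens and current_type is not None:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_entity:
            # A B- tag (or a stray I- tag of a different type) opens a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Invented sample sentence with the four CoNLL-2003 classes in mind
# (PER, ORG, LOC, MISC); only three of them occur here.
tokens = ["Maria", "Novak", "works", "for", "Acme", "Corp", "in", "Prague", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Treating a stray I- tag as the start of a new entity is one possible design choice; stricter decoders reject such sequences instead.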

Illinois Named Entity Recognizer

Functionality: NER
Platform: cross-platform
Licence: underlying software is open source

English

This NER annotates plain text.

Availability: WebLicht
CLARIN Centre: CLARIN-D
NER categories: person, location, organization, miscellaneous

OpenNLP Name Finder (English)

Functionality: NER
Platform: Linux, Windows
Licence: Apache Licence 2.0

English

This NER can be applied to existing corpora available through the CLARIN:EL infrastructure and to those independently uploaded corpora that are compatible with the tool’s requirements.

Availability: online service
CLARIN Centre: CLARIN:EL
NER categories: person, location, organization

GATE

Functionality: tokenization, PoS-tagging, NER, semantic and orthographic coreference, pronominal coreference
Platform: cross-platform
Licence: LGPL

English, French, German, Romanian, Russian, Welsh, Danish, Chinese, Arabic

This is a complete NLP platform with modules for named entity recognition.

Availability: download, online service
CLARIN Centre: CLARIN-UK
NER categories: person, location, organisation, date, percent, money
Publication: Cunningham et al. (2019)

OpenNLP Named Entity Recognizer

Functionality: NER
Platform: cross-platform
Licence: Apache License version 2.0 (underlying software)

English, Spanish

This NER is based on the OpenNLP NER tool.

Availability: WebLicht
CLARIN Centre: CLARIN-D
NER categories: person, location, organization

Finnish Tagtools 1.4

Functionality: PoS/MSD-tagging, NER
Platform: Linux, Unix
Licence: GPL 3

Finnish

This software package provides finnish-postag, a part-of-speech and morphology tagger for Finnish, and finnish-nertag, a named entity recogniser for Finnish.

Availability: download, online service
CLARIN Centre: FIN-CLARIN
NER categories: person (human, mythological, animal, other); location (political, geographical, street, infrastructure, mythological, astronomical, other); organisation (corporation, political, media, financial, educational, cultural, athletic, other, miscellaneous); product; event; time (dates, times), numerical expressions (measurements, money)
Publication: Ruokolainen et al. (2019)

German Named Entity Recognizer

Functionality: NER
Platform: cross-platform
Licence: Apache License, Version 2.0 (underlying software)

German

This NER is based on the maximum entropy approach using the OpenNLP maxent library. Two models are available: one trained on CoNLL2003 training set (conll), and the one trained on TuebaDZ corpus release 8 (tuebadz).

Availability: WebLicht
CLARIN Centre: CLARIN-D
NER categories: person, location, organization
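
As a rough illustration of the maximum entropy approach mentioned above: a maxent classifier sums the weights of the features active for a token, separately for each label, and normalises the exponentiated sums into probabilities. The sketch below is a minimal example under stated assumptions; the feature names and weights are invented, it does not use the OpenNLP maxent library, and a real tagger would also include a label for tokens outside any entity.

```python
import math
from typing import Dict, List


def maxent_probabilities(
    active_features: List[str],
    weights: Dict[str, Dict[str, float]],
    labels: List[str],
) -> Dict[str, float]:
    """Return p(label | features) for a log-linear (maximum entropy) model."""
    # Score of a label = sum of the weights of the active features for it.
    scores = {
        label: sum(weights.get(f, {}).get(label, 0.0) for f in active_features)
        for label in labels
    }
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(scores.values())
    exp_scores = {label: math.exp(s - max_score) for label, s in scores.items()}
    normaliser = sum(exp_scores.values())
    return {label: value / normaliser for label, value in exp_scores.items()}


# Hypothetical labels, features and weights, chosen only for illustration.
LABELS = ["person", "location", "organization"]
WEIGHTS = {
    "is_capitalized": {"person": 0.5, "location": 0.5, "organization": 0.5},
    "prev_word=in": {"location": 2.0},
    "suffix=burg": {"location": 1.5, "person": 0.2},
}

probs = maxent_probabilities(
    ["is_capitalized", "prev_word=in", "suffix=burg"], WEIGHTS, LABELS
)
best = max(probs, key=probs.get)
print(best, probs)
```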

Person Name Recognizer

Functionality: NER
Platform: cross-platform
Licence: terms of service

German

This NER is tailored to historical German (optimized for journals and high precision) and is based on weighted finite state transducers.

Availability: WebLicht
CLARIN Centre: CLARIN-D
NER categories: person
Publication: Didakowski and Drotschmann (2009)

SemiNER

Functionality: PoS-tagging, syntactic chunking, NER
Platform: cross-platform
Licence: see here

German, English

The SemiNER is part of a sequence labeller called Sequor, which is based on Collins’s (2002) perceptron. Sequor has a flexible feature template language and is meant mainly for NLP applications such as Named Entity recognition, Part of Speech tagging and syntactic chunking. It includes pre-trained models for German and English.

Availability: download
CLARIN Centre: CLARIN-D
Trained models: available
NER categories: person, organisation, location, miscellaneous
Publication: Chrupała and Klakow (2010)

Sticker Named Entity Recognizer

Functionality: NER
Platform: cross-platform
Licence: Blue Oak Model Licence 1.0.0 (underlying software)

German, Dutch

This NER is built on a neural-network-based sequence labeller that can label named entities for German and Dutch.

Availability: download, WebLicht
CLARIN Centre: CLARIN-D
NER categories: person, location, organization, geopolitical entity, other

GrNE-Tagger

Functionality: NER
Platform: cross-platform
Licence: terms of service (academic non-commercial use)

Greek (modern)

This NER operates on a rule-based engine. It was developed and is maintained by the Institute for Language and Speech Processing / Athena Research Center. This recognizer can be applied to existing corpora available through the CLARIN:EL infrastructure and to those independently uploaded corpora that are compatible with the tool’s requirements.

Availability: online service
CLARIN Centre: CLARIN:EL
NER categories: person, location, organization, facility, gpe (geo-political entity)

hunner - named entity recognizer for Hungarian

Functionality: NER

Hungarian

This NER employs a maximum entropy approach.

Availability: unavailable
CLARIN Centre: HUN-CLARIN
Publication: Simon (2013)

Liner2

Functionality: NER
Platform: cross-platform
Licence: GNU General Public License

Polish

This NER uses conditional random fields and a rich set of token features. The tool took third place in the PolEval 2018 Task 2 on named entity recognition. It includes models pre-trained on the National Corpus of Polish (NKJP) and the KPWr corpus (Broda et al. 2012).

The KPWr model distinguishes the following categories: person, location, facility, organization, product, event, adjective.

The NKJP model distinguishes the following NER categories: person, organization, location, date, time.

Availability: download, online service, web API
CLARIN Centre: CLARIN-PL
NER categories: per model, see above
Publication: Marcińczuk, Kocoń, and Gawor (2018)

Nerf

Functionality: NER
Platform: Haskell Platform
Licence: GPL v.3

Polish

This statistical NER is based on linear-chain conditional random fields.

Availability: download
CLARIN Centre: CLARIN-PL
Trained models: download
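
For readers unfamiliar with linear-chain conditional random fields: at tagging time such a model picks the label sequence with the highest total score, where each position contributes a score for its own label and a score for the transition from the previous label. The best sequence is typically found with Viterbi decoding. The sketch below illustrates this under stated assumptions; the labels, tokens and scores are invented, a trained CRF would derive the scores from learned feature weights, and the code is not taken from Nerf (which is written in Haskell).

```python
from typing import Dict, List


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model."""
    # best[i][y]: score of the best path that ends in label y at position i.
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for i in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Choose the previous label that maximises the path score into y.
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[i][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for a three-token sentence such as "Anna Kowalska spała".
LABELS = ["O", "PER"]
EMISSIONS = [
    {"O": 0.0, "PER": 2.0},  # first token: clearly looks like a name
    {"O": 0.6, "PER": 0.5},  # second token: locally ambiguous
    {"O": 2.0, "PER": 0.0},  # third token: clearly not a name
]
TRANSITIONS = {
    "O": {"O": 0.5, "PER": 0.0},
    "PER": {"O": 0.0, "PER": 1.0},  # names tend to continue
}

path = viterbi(EMISSIONS, TRANSITIONS, LABELS)
print(path)
```

In this toy setting, choosing the best label for each token independently would tag the second token as O; the transition score is what lets the sequence model keep the two-token name together.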

PolDeepNer

Functionality: NER
Platform: cross-platform
Licence: GNU General Public Licence

Polish

This NER uses deep learning methods. The tool took second place in the PolEval 2018 Task 2 on NER. It contains a model pre-trained on the NKJP corpus.

Availability: download
CLARIN Centre: CLARIN-PL
NER categories: nested annotations of the following types: personal names (forenames, surnames, additional names), organizational names, geographic names, place names (district, settlement, region, country, bloc), date, and time
Publication: Marcińczuk, Kocoń, and Gawor (2018)

LX-NER

Functionality: NER
Platform: cross-platform

Portuguese

This NER annotates plain text by identifying and classifying the expressions for named entities it contains.

The name-based module is integrated into the full LX-Suite pipeline (tokenization, POS tagging, parsing).

Availability: online service
CLARIN Centre: PORTULAN
NER categories: name-based: person, organization, location, events, works; number-based: numbers, measures, time

janes-ner

Functionality: NER
Platform: cross-platform
Licence: Apache License 2.0

Slovenian, Croatian, Serbian

This named entity recognizer is a slight modification of the CRF-based reldi-tagger with Brown cluster information added. Input data need to be pre-processed by the reldi-tokeniser and the reldi-tagger for morphosyntactic annotation.

Availability: download, online service, web API
CLARIN Centre: CLARIN.SI
NER categories: person, person derivative, location, organization and miscellaneous
Publication: Fišer, Ljubešić and Erjavec (2018)

Publications

[Broda et al. 2012] Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a Free Corpus of Polish. In Proceedings of LREC2012.

[Carreras, Màrquez and Padró 2003] Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2003. A simple named entity extractor using AdaBoost. In CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 – Volume 4, 152–155.

[Chrupała and Klakow 2010] Grzegorz Chrupała and Dietrich Klakow. 2010. A Named Entity Labeller for German: exploiting Wikipedia and distributional clusters. In Proceedings of LREC2010.

[Cunningham et al. 2019] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, Wim Peters, and Leon Derczynski. 2019. Developing Language Processing Components with GATE Version 8 (a User Guide).

[Derczynski et al. 2015] Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management 51 (2): 32–49.

[Didakowski and Drotschmann 2009] Jörg Didakowski and Marko Drotschmann. 2009. In Finite-State Methods and Natural Language Processing, 50–61. 

[Fišer, Ljubešić and Erjavec 2018] Darja Fišer, Nikola Ljubešić, and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources and Evaluation.

[Landsbergen 2012] Frank Landsbergen. 2012. Evaluation of named entity work in IMPACT: NE Recognition and matching. Technical report.

[Marcińczuk, Kocoń and Gawor 2018] Michał Marcińczuk, Jan Kocoń, and Michał Jacek Gawor. 2018. Recognition of Named Entities for Polish-Comparison of Deep Learning and Conditional Random Fields Approaches. In Proceedings of the PolEval 2018 Workshop, 71–86.

[Ruokolainen et al. 2019] Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, and Krister Lindén. 2019. A Finnish news corpus for named entity recognition. Language Resources and Evaluation.

[Sang and De Meulder 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 – Volume 4, 142–147.

[Simon 2013] Eszter Simon. 2013. Approaches to Hungarian Named Entity Recognition. PhD Thesis.

[Straková, Straka and Hajič 2013] Jana Straková, Milan Straka, and Jan Hajič. 2013. A New State-of-The-Art Czech Named Entity Recognizer. In TSD 2013: Text, Speech, and Dialogue, edited by I. Habernal and V. Matoušek, 68–75. 

[Van den Bosch et al. 2007] Antal van den Bosch, Bertjan Busser, Sander Canisius and Walter Daelemans. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. In Computational Linguistics in the Netherlands 2006: selected papers from the Seventeenth CLIN meeting, edited by Peter Dirix, 191–206.