Blog post written by Maria Gavrilidou
- PERSON: person names, family names
- LOCATION: political or geographical names such as continents, countries, cities, etc.
- ORGANIZATION: names of entities such as companies, institutions, organizations, etc.
- FACILITY: names of buildings and other human-created structures, such as streets, bridges, etc.
- GPE (Geo-political entity): entities whose names coincide with a location name, but which actually refer to that location's government or administration.
GrNE-tagger is not a single tool but a predefined pipeline of seamlessly integrated tools, in which the output of each tool serves as the input to the next:
Tokenization > Sentence Segmentation > Part-of-Speech Tagging > Lemmatization > Chunking > Named Entity Recognition
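To make the chaining idea concrete, here is a minimal Python sketch of a pipeline in which each stage receives the previous stage's output. It is not GrNE-tagger's actual code: the stage functions are placeholders that only record that they ran, and the names `make_stage`, `run_pipeline` and the dictionary layout are assumptions made for illustration.

```python
from functools import reduce
from typing import Any, Callable, Dict, Sequence

Doc = Dict[str, Any]          # hypothetical document: raw text plus annotation layers
Stage = Callable[[Doc], Doc]  # every stage maps a document to an enriched document

# Stage order taken from the pipeline described in the post.
STAGE_NAMES = [
    "tokenization",
    "sentence_segmentation",
    "pos_tagging",
    "lemmatization",
    "chunking",
    "named_entity_recognition",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its name as a new layer."""
    def stage(doc: Doc) -> Doc:
        # Return a new document so each stage's input is left untouched.
        return {**doc, "layers": doc["layers"] + [name]}
    return stage


PIPELINE: Sequence[Stage] = tuple(make_stage(n) for n in STAGE_NAMES)


def run_pipeline(text: str, stages: Sequence[Stage] = PIPELINE) -> Doc:
    """Feed the output of each stage into the next, left to right."""
    initial: Doc = {"text": text, "layers": []}
    return reduce(lambda doc, stage: stage(doc), stages, initial)


result = run_pipeline("Η Μαρία ζει στην Αθήνα .")
print(result["layers"])
```

Because every stage has the same signature, a real tokenizer or tagger could replace any placeholder without changing `run_pipeline`.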
The annotation processes before Named Entity Recognition constitute the pre-processing of the text. After the pre-processing stage is completed, the Named Entity Recognition algorithm is applied to the text in two stages: it first uses linguistic rules to identify a set of candidate NEs and subsequently checks them against manually created wordlists of existing proper names. If a proper name in the pre-processed text is not identified in this manner, the tool tags it as UNKNOWN.
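The two-stage procedure can be sketched roughly as follows. This is a toy illustration, not the tool's implementation: the "linguistic rule" is reduced to spotting runs of capitalized tokens, the wordlist `GAZETTEER` is a tiny made-up sample, and the function names are hypothetical. English tokens are used only to keep the example short.

```python
from typing import Dict, List, Tuple

# Hypothetical sample wordlist; the real lists are manually created and much larger.
GAZETTEER: Dict[str, str] = {
    "Maria": "PERSON",
    "Athens": "LOCATION",
    "Athena RC": "ORGANIZATION",
}


def find_candidates(tokens: List[str]) -> List[str]:
    """Stage 1 (toy rule): treat each maximal run of capitalized tokens as a candidate NE."""
    candidates: List[str] = []
    current: List[str] = []
    for tok in tokens:
        if tok[:1].isupper():
            current.append(tok)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


def tag_candidates(candidates: List[str]) -> List[Tuple[str, str]]:
    """Stage 2: look each candidate up in the wordlist; unmatched names become UNKNOWN."""
    return [(c, GAZETTEER.get(c, "UNKNOWN")) for c in candidates]


tokens = "Maria met Nikos at Athena RC in Athens .".split()
tagged = tag_candidates(find_candidates(tokens))
for name, label in tagged:
    print(f"{name}\t{label}")
```

Here "Nikos" passes the candidate rule but is absent from the sample wordlist, so it receives the UNKNOWN label and is left for a later disambiguation step.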
To consolidate a candidate NE or a proper name labelled as UNKNOWN, and to place it in the correct category, GrNE-tagger applies another round of linguistic rules that search for specific keywords in the context of the ambiguous expression. The keywords used for such disambiguation include, for example, professional titles, words denoting nationality, or kinship terms such as "father of" and "sister of" (in the case of PERSON); prefixes or suffixes denoting company types, such as "Corp." and "Ltd." (for ORGANIZATION); and words such as "street" and "bridge" (for LOCATION). Based on shallow syntactic parsing, the system also disambiguates between LOCATION and GPE.
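A keyword-in-context check of this kind might look like the sketch below. It is only an approximation under stated assumptions: the keyword sets are small invented samples based on the examples in the text, the fixed two-token window is an arbitrary choice, and the function `disambiguate` is hypothetical. The LOCATION versus GPE decision, which relies on shallow syntactic parsing, is not modelled.

```python
from typing import Dict, List, Set

# Hypothetical keyword samples, following the examples given in the text.
CONTEXT_KEYWORDS: Dict[str, Set[str]] = {
    "PERSON": {"professor", "doctor", "father", "sister"},
    "ORGANIZATION": {"corp.", "ltd."},
    "LOCATION": {"street", "bridge"},
}


def disambiguate(tokens: List[str], start: int, end: int, window: int = 2) -> str:
    """Label the expression tokens[start:end] using keywords found near it.

    Scans `window` tokens on each side; returns UNKNOWN if no keyword is found.
    """
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    context = {tok.lower() for tok in left + right}
    for category, keywords in CONTEXT_KEYWORDS.items():
        if context & keywords:
            return category
    return "UNKNOWN"


sentence = ["Professor", "Nikos", "visited", "Acme", "Ltd.", "yesterday"]
print(disambiguate(sentence, 1, 2))  # "Nikos", next to the title "Professor"
print(disambiguate(sentence, 3, 4))  # "Acme", next to the suffix "Ltd."
```

The first matching category wins in this sketch; a rule set for real text would need some way to resolve conflicts when keywords from several categories appear in the same window.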
The following picture shows an example of the output of GrNE-tagger (using GATE as a visualization tool), in which different NEs are marked with different colors.
GrNE-tagger has been integrated into the clarin:el infrastructure as a web service, which means that users do not need to install the tool locally; they simply select a resource from the clarin:el inventory (or upload their own) and process it. Once processing is complete, users receive an email with a link to the results. The tool has already been applied successfully to annotate several resources; one such resource enriched with GrNE-tagger is a corpus of interviews conducted with female business entrepreneurs in Athens.
GrNE-tagger has been developed and is maintained by the Institute for Language and Speech Processing / Athena RC and is available under a license that permits Academic – Non Commercial Use.