Tools for normalization

Introduction

Text normalization is the process of transforming parts of a text into a single canonical form. It is one of the key stages of linguistic processing for texts in which spelling variation abounds or deviates from the contemporary norm, such as historical documents or social media posts. Once a text has been normalized, standard tools can be used for all further stages of text processing. Another important advantage of text normalization is improved search: a query for a single, standard variant also retrieves all its spelling variants, be they historical, dialectal, colloquial or slang.

The CLARIN infrastructure offers 14 tools for text normalization. Most of the tools are aimed at normalizing texts within a single language (3 Dutch, 1 English, 3 German, 1 Hungarian, 1 Icelandic, 1 Slovenian, 1 Turkish), while the rest have a very broad multilingual scope. Half of the tools are dedicated normalizers, while the others provide additional functionalities such as PoS-tagging, lemmatization and named entity recognition.

For comments, changes to the existing content or the inclusion of new tools, send us an email.

This website was last updated on 23 July 2020.

Tools for normalization in the CLARIN infrastructure

Tool Language Description

Text Tonsorium

Functionality: tokenization, segmentation, lemmatization, PoS-tagging, normalization, syntax analysis, NER, format transformations
Domain: independent
Licence: GPL

Afrikaans, Albanian, Armenian, Basque, Bosnian, Breton, Bulgarian, Catalan, Chinese, Corsican, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Middle Low German, Haitian, Hindi, Hungarian, Icelandic, Indonesian, Inuktitut, Irish, Italian, Javanese, Kannada, Kurdish, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malay, Malayalam, Maltese, Norwegian, Occitan, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tamil, Turkish, Ukrainian, Uzbek, Vietnamese, Welsh, Yiddish

Automatic construction and execution of several NLP workflows, which include normalisation.

FoLiA-wordtranslate

Functionality: normalization
Domain: historical texts
Licence: GNU Public License v3

Dutch

This tool performs word-by-word lookups in a bilingual lexicon, applies some transformation rules and may additionally consult the INT Historical Lexicon (to be obtained separately due to licensing restrictions). The aim is to use the modernisation layer for further linguistic enrichment with contemporary models. This tool is part of the FoLiA-Utils collection, as it operates on documents in the FoLiA format. On its own, it is of very limited interest to others.

  • Availability: download
  • CLARIN Centre: CLARIAH-NL
  • Platform: Linux/POSIX (C++)
  • Input format: plain text, FoLiA-XML
  • Output format: FoLiA-XML, plain text
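
The lookup-then-rules approach described above can be illustrated with a minimal Python sketch. It is not the tool's own code (FoLiA-wordtranslate is written in C++ and operates on FoLiA documents); the lexicon entries, the rewrite rules and the function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical historical-to-modern lexicon; not the tool's real data.
LEXICON = {"mensch": "mens", "regeering": "regering"}

# Hypothetical rewrite rules (pattern, replacement), applied in order
# to words that are missing from the lexicon.
RULES = [(re.compile(r"sch$"), "s"), (re.compile(r"ae"), "aa")]


def modernise_word(word):
    """Return a lowercased modern form: lexicon first, rules as fallback."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key]
    for pattern, replacement in RULES:
        key = pattern.sub(replacement, key)
    return key


def modernise(text):
    """Pair each whitespace-separated token with its modernised form."""
    return [(token, modernise_word(token)) for token in text.split()]
```

Keeping the original token next to its modern form mirrors the idea of a separate modernisation layer: later tools can work on the modern form without the original spelling being lost.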

Nederlab Pipeline

Functionality: modernisation, normalisation, tokenisation, conversion, PoS-tagging, lemmatisation, NER with entity linking (all functionality is derived from the individual parts rather than an inherent part of the workflow)
Domain: independent
Licence: GNU Public License v3

Dutch

This is a linguistic enrichment pipeline for historical Dutch, developed for and used in the Nederlab project. The workflow, powered by Nextflow, invokes various tools, including Frog and FoLiA-wordtranslate, as well as ucto (a tokeniser), folialangid (language identification) and tei2folia (conversion from a subset of TEI to FoLiA, which serves as the exchange format for all our tooling, as well as the final corpus format for Nederlab). Because of the complexity of the tooling, this workflow and all its dependencies are distributed as part of the LaMachine distribution.

  • Availability: download
  • CLARIN Centre: CLARIAH-NL
  • Platform: Linux/POSIX (workflow itself runs on JVM, underlying components are mostly implemented in C++ and Python)
  • Input format: FoLiA XML
  • Output format: FoLiA XML
  • Publication: Brugman et al. (2016)

TiCClops

Functionality: corpus processing, normalization
Domain: independent

Dutch

This tool is designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. The corpus can be one text, or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, listing for each word type how often the word occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus.

TICCL is a system intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. Errors occur when books or other texts are scanned from paper and the resulting images are converted by machine into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form.

  • CLARIN Centre: CLARIAH-NL
  • Platform: cross-platform
  • Input format: images (TIFF, DjVu), plain text, XML, CSV
  • Output format: XML
  • Publication: Reynaert (2010)
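
As a rough illustration of the ideas above, the following Python sketch undoes known OCR character confusions, keeps a candidate only if it is attested often enough in the corpus, and sums variant frequencies under the corrected form. It is a naive stand-in and not TICCL's actual algorithm; the confusion list, the frequency threshold and all function names are assumptions.

```python
from collections import Counter

# Hypothetical OCR confusions: (misread output, intended characters).
CONFUSIONS = [("m", "in"), ("rn", "m")]


def candidates(word):
    """Yield forms obtained by undoing one confusion at one position."""
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def correct(word, freqs, min_count=2):
    """Map a rare word to its most frequent attested candidate, if any."""
    if freqs[word] >= min_count:
        return word
    attested = [c for c in candidates(word) if freqs[c] >= min_count]
    return max(attested, key=lambda c: freqs[c], default=word)


def normalized_frequencies(tokens):
    """Sum the counts of all variants under their corrected form."""
    freqs = Counter(tokens)
    totals = Counter()
    for form, count in freqs.items():
        totals[correct(form, freqs)] += count
    return totals
```

With the tokens regeering, regeering, regeermg, the misread form is folded into 'regeering', which ends up with a frequency of 3.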

@PhilosTEI

Functionality: corpus processing, normalization
Domain: independent

Dutch, English, Finnish, French, German, German (Fraktur), Classical Greek, Modern Greek, Icelandic, Italian, Latin, Polish, Portuguese, Russian, Spanish, Swedish

This tool uses a combination of a Tesseract web service for text layout analysis and OCR and a multilingual version of TICCL for normalization.

  • CLARIN Centre: CLARIAH-NL
  • Platform: cross-platform
  • Input format: images (TIFF, DjVu), plain text, XML, CSV
  • Output format: XML
  • Related publication: Betti, Reynaert and van den Berg (2017)

PICCL: Philosophical Integrator of Computational and Corpus Libraries

Functionality: OCR, normalization, tokenisation, dependency parsing, shallow parsing, lemmatization, morphological analysis, NER, PoS-tagging
Domain: independent
Licence: GNU GPL

Dutch, Swedish, Russian, Spanish, Portuguese, English, German, French, Italian, Finnish, Modern Greek, Classical Greek, Icelandic, German (Fraktur), Latin, Romanian

This is a set of workflows for corpus building through OCR, post-correction, modernization of historical language and natural language processing. It combines Tesseract optical character recognition, TICCL and Frog functionality in a single pipeline.

  • Availability: download 
  • CLARIN Centre: CLARIAH-NL
  • Platform: cross-platform
  • Input format: images (TIFF, DjVu), plain text, XML
  • Output format: FoLiA XML
  • Publication: Reynaert et al. (2015)

VARD2

Functionality: normalization
Domain: historical texts
Licence: CC-BY-NC-SA 2.0

English

This tool performs manual and automatic spelling normalisation based on letter replacement rules, phonetic matching (extended Soundex), edit distance, and variant mappings.

  • Availability: download
  • CLARIN Centre: CLARIAH-UK
  • Platform: cross-platform (Java)
  • Input format: plain text, rtf, SGML, XML
  • Output format: XML
  • Publications: see here
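
Two of the techniques named above, letter replacement rules and edit distance, can be sketched in a few lines of Python. This is an illustration only and not VARD2's implementation (VARD2 is a Java application and also uses phonetic matching and variant mappings); the rules, the modern word list and the distance threshold are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical letter replacement rules and modern word list.
RULES = [("u", "v"), ("v", "u"), ("ie", "y")]
MODERN_WORDS = ["love", "up", "very", "would"]


def normalise(word, max_distance=1):
    """Try the replacement rules first, then the nearest modern word."""
    if word in MODERN_WORDS:
        return word
    for old, new in RULES:
        candidate = word.replace(old, new)
        if candidate in MODERN_WORDS:
            return candidate
    best = min(MODERN_WORDS, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_distance else word
```

The rules handle systematic variation (for example 'loue' for 'love'), while the edit distance fallback catches forms such as 'woulde' that no rule covers. Words with no close match are left unchanged, which leaves room for manual normalisation.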

CAB historical text analysis

Functionality: normalisation, PoS-tagging, lemmatisation
Domain: historical texts

German

This tool is a WebLicht stub for the DTA::CAB service and provides orthographic normalisation, PoS-tagging and lemmatization for historical German.

  • Availability: web application
  • CLARIN Centre: CLARIN-D
  • Platform: cross-platform
  • Input format: plain text, XML
  • Output format: STTS tagset for PoS

CAB orthographic canonicalizer

Functionality: normalisation
Domain: historical texts

German

This tool is a WebLicht stub for the DTA::CAB service and provides orthographic normalization for historical German.

  • Availability: web application
  • CLARIN Centre: CLARIN-D
  • Platform: cross-platform
  • Input format: plain text, XML
  • Output format: unspecified

DTA::CAB

Functionality: lemmatization, PoS-tagging, normalization
Domain: historical texts
Licence: see here

German

This is an abstract framework for robust linguistic annotation, with a public web service that includes normalization and lemmatization for historical German.

Normo

Functionality: normalization
Domain: historical texts

Hungarian

This tool is an automatic pre-normalizer for Middle Hungarian Bible translations. It employs a memory-based module and a rule-based module, the latter consisting of character- and token-level rewrite rules. The tool was used for building the Old Hungarian Corpus.

  • CLARIN Centre: HUN-CLARIN
  • Input format: unspecified
  • Output format: unspecified
  • Publication: Vadász and Simon (2018)

Skrambi

Functionality: OCR, normalization
Domain: historical texts

Icelandic

This tool is a spell-checking application based on a noisy channel model, which can be used to achieve a true copy of the original spelling of historical OCR texts, and to produce a parallel text with modern spelling.

  • CLARIN Centre: CLARIN-IS
  • Input format: unspecified
  • Output format: unspecified
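
In a noisy channel model, the corrected form is the candidate c that maximizes P(c) · P(observed | c), that is, the prior probability of the word times the probability that the channel (here, OCR or older spelling) turned it into the observed form. The Python sketch below shows the principle only and is not Skrambi's code; it uses a toy English vocabulary with made-up probabilities and a deliberately crude channel model that allows only character substitutions.

```python
# Toy language model: made-up relative frequencies of modern word forms.
WORD_PROB = {"the": 0.6, "then": 0.3, "than": 0.1}


def channel_prob(observed, intended, error_rate=0.1):
    """P(observed | intended) under independent character substitutions.

    Forms of different length get probability 0 in this simplification.
    """
    if len(observed) != len(intended):
        return 0.0
    prob = 1.0
    for seen, meant in zip(observed, intended):
        prob *= (1.0 - error_rate) if seen == meant else error_rate
    return prob


def correct(observed):
    """Return the word maximizing P(word) * P(observed | word)."""
    scored = {w: p * channel_prob(observed, w) for w, p in WORD_PROB.items()}
    best = max(scored, key=scored.get)
    # Keep the observed form when no candidate has a non-zero score.
    return best if scored[best] > 0 else observed
```

Running the model over each token and keeping both the observed and the corrected form would give the kind of parallel text mentioned above, again only as an assumption about how such output could be assembled.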

CSMTiser

Functionality: normalization
Domain: social media
Licence: GNU Lesser General Public License v3.0

Slovenian

This is a trainable tool for text normalisation, based on Moses.

  • Availability: download
  • CLARIN Centre: CLARIN.SI
  • Platform: Linux
  • Input format: unspecified
  • Output format: unspecified
  • Publication: Ljubešić et al. (2016)

Turkish Natural Language Processing Pipeline

Functionality: tokenisation, sentence splitting, normalisation, de-asciification, vowelisation, spelling correction, morphological analysis/disambiguation, named entity recognition, dependency parsing
Domain: independent

Turkish

This is a pipeline of state-of-the-art Turkish NLP tools.

  • Availability: web application, web API
  • CLARIN Centre: LINDAT
  • Platform: cross-platform
  • Input format: plain text
  • Output format: plain text
  • Publication: Eryiğit (2014)

Publications

[Betti et al. 2017] Arianna Betti, Martin Reynaert, and Hein van den Berg. 2017. @PhilosTEI: Building Corpora for Philosophers. In CLARIN in the Low Countries, edited by Jan Odijk and Arjan van Hessen. London: Ubiquity Press.

[Brugman et al. 2016] Hennie Brugman, Martin Reynaert, Nicoline van der Sijs, René van Stipriaan, Erik Tjong Kim Sang, and Antal van den Bosch. 2016. Nederlab: Towards a Single Portal and Research Environment for Diachronic Dutch Text Corpora. In Proceedings of LREC 2016, 1277–1281.

[Eryiğit 2014] Gülşen Cebiroğlu Eryiğit. 2014. ITU Turkish NLP Web Service. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014).

[Jurish 2012] Bryan Jurish. 2012. Finite-State Canonicalization Techniques for Historical German. PhD dissertation. Universität Potsdam.

[Ljubešić et al. 2016] Nikola Ljubešić, Katja Zupan, Darja Fišer, and Tomaž Erjavec. 2016. Normalising Slovene data: historical texts vs. user-generated content. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), 146–155.

[Reynaert et al. 2015] Martin Reynaert, Maarten van Gompel, Ko van der Sloot, and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries.

[Reynaert 2010] Martin Reynaert. 2010. Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition 14 (2): 173–187.

[Vadász and Simon 2018] Noémi Vadász and Eszter Simon. 2018. NORMO: An Automatic Normalization Tool for Middle Hungarian. In Proceedings of the Second Workshop on Corpus-Based Research in the Humanities, 227–236.