ABC - Language Identifier

Organisation (not a CLARIN member): 
RACAI - Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
Short name: 
LangId
Type: 
toolbox
Author(s)/Developer(s): 
Dan Tufiş, Alexandru Ceauşu
Description: 
The application, developed in C#, automatically identifies the language of a text written in one of the 21 European Union languages.
The application is trained on texts in each language (approx. 1.5 MB of text per language). For every word in the training texts, a training module counts the prefix (the first 3 characters) and the suffix (the last 4 characters). Two models are built for each language, containing the weights (percentages) of the prefixes and of the suffixes found in that language's texts. In the prediction phase, two models are built on the fly in the same way for the new text. These models are then compared with the stored models of each language for which the application was trained, and comparison functions are used to choose the best-matching model.
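The procedure described above can be sketched in a few lines. This is an illustrative Python sketch, not the C# implementation of the tool: the function names (affix_models, overlap, identify), the tokenisation, the toy training sentences, and in particular the comparison function (shared weight mass between two models) are assumptions, since the description does not say which comparison functions are used.

```python
import re
from collections import Counter


def affix_models(text, prefix_len=3, suffix_len=4):
    """Build (prefix_weights, suffix_weights) for a text as relative frequencies."""
    # Assumption: words are maximal runs of Unicode letters, lower-cased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    prefixes = Counter(w[:prefix_len] for w in words)
    suffixes = Counter(w[-suffix_len:] for w in words)

    def normalise(counts):
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()} if total else {}

    return normalise(prefixes), normalise(suffixes)


def overlap(a, b):
    """Hypothetical comparison function: weight mass shared by two models."""
    return sum(min(w, b.get(k, 0.0)) for k, w in a.items())


def identify(text, trained):
    """Return the language whose stored models best match the text's models."""
    pre, suf = affix_models(text)
    return max(
        trained,
        key=lambda lang: overlap(pre, trained[lang][0]) + overlap(suf, trained[lang][1]),
    )


# Toy training data for illustration only; the description mentions
# approx. 1.5 MB of training text per language.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is sleeping in the house",
    "ro": "vulpea cea maro sare peste câinele leneș și pisica doarme în casa cea mare",
}
trained = {lang: affix_models(text) for lang, text in samples.items()}

print(identify("the dog is sleeping in the house", trained))
```

With realistic amounts of training text, the same structure applies: one pair of stored models per language, one pair built on the fly for the input, and a comparison step that selects the closest language.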
More detailed descriptions are available in [[http://www.racai.ro/~tufis/papers|the following papers]]:
-- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2008). RACAI's Linguistic Web Services. In Proceedings of the 6th Language Resources and Evaluation Conference - LREC 2008, Marrakech, Morocco, May 2008. ELRA - European Language Resources Association. ISBN 2-9517408-4-0.
-- Dan Tufiş and Alexandru Ceauşu (2007). Diacritics Restoration in Romanian Texts. In Elena Paskaleva and Milena Slavcheva (eds.), A Common Natural Language Processing Paradigm for Balkan Languages - RANLP 2007 Workshop Proceedings, pp. 49-56, Borovets, Bulgaria, September 2007. INCOMA Ltd., Shoumen, Bulgaria. ISBN 978-954-91743-8-0.
-- Dan Tufiş and Adrian Chiţu (1999). Automatic Insertion of Diacritics in Romanian Texts. In Ferenc Kiefer, Gábor Kiss, and Júlia Pajzs (eds.), Proceedings of the 5th International Workshop on Computational Lexicography (COMPLEX 1999), pp. 185-194, Pecs, Hungary, May 1999. Linguistics Institute, Hungarian Academy of Sciences.

Contact person(s): 
Dan Tufiş - tufis@racai.ro
Country: 
Romania
Readily Available: 
Readily available
Language(s) of input data: 
-- any --
Character encoding of input data: 
Unicode (UTF-8)
Language(s) of output data: 
-- any --
Character encoding of output data: 
Unicode (UTF-8)
Documentation language(s): 
English
Availability: 
web service
Open source code: 
no
Implementation language(s): 
C#
Approach: 
language models
URL check result: 
{ "Errors" : [ { "Number" : "0", "Code" : "404", "URL" : "http://nlp.racai.ro/webservices/LangIdWebService.asmx?WSDL", "Column" : "field_tool_document_link_value" }, { "Number" : "1", "Code" : "404", "URL" : "http://nlp.racai.ro/webservices/LanguageId.aspx", "Column" : "field_tool_reference_link_value" }, { "Number" : "2", "Code" : "404", "URL" : "http://nlp.racai.ro/webservices/LanguageId.aspx", "Column" : "field_tool_webservice_link_value" } ] }

Comments

These web services are now available at a different URL:

http://www.racai.ro/webservices/