BulTreeBank Tokenizer

Organisation (not a CLARIN member): 
Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Version: 
1.0
Type: 
annotation tool
written language
single tool
Author(s)/Developer(s): 
Kiril Simov
Description: 
The tokenizer covers all languages that use the Latin1, Latin2, Latin3 and Cyrillic tables of Unicode, and it can be extended to cover other Unicode tables if necessary. It is implemented as a cascaded regular grammar in CLaRK and recognizes over 60 token categories. It can easily be adapted to new token categories.
Contact person(s): 
Kiril Simov (kivs@bultreebank.org)
Country: 
Bulgaria
Language(s) of input data: 
-- any --
Character encoding of input data: 
Unicode (UTF-8)
Language(s) of output data: 
-- any --
Character encoding of output data: 
Unicode (UTF-8)
Availability: 
free
Open source code: 
no
System requirements: 
Java based
Software requirements: 
Java
Platform(s): 
used under MS Windows, Linux
Implementation language(s): 
Java
Approach: 
cascaded regular grammars (finite-state)
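Illustration of the approach (not the tool's code): 
The sketch below shows the general idea of a cascaded regular grammar in plain Java. It is not the CLaRK grammar and does not reproduce the tool's token categories; the class name, the category labels (LETTERS, DIGITS, SPACE, SYMBOL, NUMBER) and the merge rule are all assumptions made for illustration. Stage 1 splits the text into coarse character-class runs with a regular expression, and stage 2 rewrites the stage 1 output, joining DIGITS + "." or "," + DIGITS into one NUMBER token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical two-stage cascade; not the BulTreeBank/CLaRK implementation. */
public class CascadedTokenizerSketch {

    /** A token with an illustrative category label. */
    public static final class Token {
        public final String category;
        public final String text;

        public Token(String category, String text) {
            this.category = category;
            this.text = text;
        }

        @Override
        public String toString() {
            return category + "(" + text + ")";
        }
    }

    // Stage 1 grammar: every character falls into exactly one alternative,
    // so successive matches cover the whole input without gaps.
    private static final Pattern STAGE1 = Pattern.compile(
        "(?<LETTERS>\\p{L}+)|(?<DIGITS>\\p{Nd}+)|(?<SPACE>\\s+)|(?<SYMBOL>.)",
        Pattern.DOTALL);

    private static final String[] STAGE1_CATEGORIES =
        {"LETTERS", "DIGITS", "SPACE", "SYMBOL"};

    /** Stage 1: split raw text into coarse character-class tokens. */
    static List<Token> stage1(String input) {
        List<Token> out = new ArrayList<>();
        Matcher m = STAGE1.matcher(input);
        while (m.find()) {
            for (String category : STAGE1_CATEGORIES) {
                if (m.group(category) != null) {
                    out.add(new Token(category, m.group()));
                    break;
                }
            }
        }
        return out;
    }

    /** Stage 2: operate on stage 1 output, merging decimals and dropping spaces. */
    static List<Token> stage2(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Token t = in.get(i);
            boolean decimal = t.category.equals("DIGITS")
                && i + 2 < in.size()
                && in.get(i + 1).category.equals("SYMBOL")
                && (in.get(i + 1).text.equals(".") || in.get(i + 1).text.equals(","))
                && in.get(i + 2).category.equals("DIGITS");
            if (decimal) {
                out.add(new Token("NUMBER",
                    t.text + in.get(i + 1).text + in.get(i + 2).text));
                i += 2; // skip the separator and the fractional digits
            } else if (!t.category.equals("SPACE")) {
                out.add(t);
            }
        }
        return out;
    }

    /** Run the full cascade: stage 1 output feeds stage 2. */
    public static List<Token> tokenize(String input) {
        return stage2(stage1(input));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Price: 3,50 lv."));
    }
}
```

Because each stage only sees the output of the previous one, a new token category can be added in this sketch by appending another rewriting stage rather than changing the existing patterns.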