BulTreeBank Tokenizer
Added by Kiril Simov
October 21, 2008
Organisation (not a CLARIN member):
Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences
Version:
1.0
Type:
annotation tool
written language
single tool
Author(s)/Developer(s):
Kiril Simov
Description:
The tokenizer covers all languages that use the Latin1, Latin2, Latin3 and Cyrillic tables of Unicode, and it can be extended to cover other Unicode tables if necessary. It is implemented as a cascaded regular grammar in CLaRK. It recognizes over 60 token categories and is easy to adapt to new ones.
Contact person(s):
Kiril Simov (kivs@bultreebank.org)
Country:
Bulgaria
Language(s) of input data:
-- any --
Character encoding of input data:
Unicode (UTF-8)
Language(s) of output data:
-- any --
Character encoding of output data:
Unicode (UTF-8)
Documentation link:
not available
Reference link:
http://www.bultreebank.org/clark/index.html
Webservice link:
http://www.bultreebank.org/clark/index.html
Availability:
free
Open source code:
no
System requirements:
Java based
Software requirements:
Java
Platform(s):
used under MS Windows, Linux
Implementation language(s):
Java
Approach:
cascaded regular grammars (finite-state)
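Illustration of the approach:
The record describes the tokenizer as a cascade of regular grammars that assigns each token a category. The following is only a minimal sketch of that general idea in Python, not the CLaRK grammar itself: the two stages, the category names (LATIN, CYRILLIC, DECIMAL, and so on), the one-letter codes, and the character ranges are all assumptions made for illustration. Stage 1 splits raw text into primitive tokens by script; stage 2 runs a second regular expression over the sequence of stage-1 categories to merge them into larger tokens.

```python
import re

# Stage 1: primitive tokens by character class.
# Category names and ranges are illustrative assumptions, not CLaRK's own.
PRIMITIVE = re.compile(
    r"(?P<LATIN>[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+)"
    r"|(?P<CYRILLIC>[\u0400-\u04FF]+)"
    r"|(?P<NUMBER>[0-9]+)"
    r"|(?P<SPACE>\s+)"
    r"|(?P<DOT>\.)"
    r"|(?P<HYPHEN>-)"
    r"|(?P<SYMBOL>.)",
    re.DOTALL,
)

# One-letter code per stage-1 category, so that stage 2 can be written
# as a regular expression over the category sequence.
CODES = {"LATIN": "L", "CYRILLIC": "C", "NUMBER": "N", "SPACE": "S",
         "DOT": "D", "HYPHEN": "H", "SYMBOL": "X"}

# Stage 2: rules over category codes (hypothetical rule set).
COMPOSITE = re.compile(
    r"(?P<DECIMAL>NDN)"                    # digits . digits
    r"|(?P<HYPHENATED_LATIN>L(?:HL)+)"     # word-word(-word)*
    r"|(?P<HYPHENATED_CYRILLIC>C(?:HC)+)"
    r"|(?P<SINGLE>.)"                      # anything else passes through
)


def stage1(text):
    """Split text into (category, string) primitive tokens."""
    return [(m.lastgroup, m.group()) for m in PRIMITIVE.finditer(text)]


def stage2(tokens):
    """Merge primitive tokens according to the stage-2 rules."""
    codes = "".join(CODES[cat] for cat, _ in tokens)
    merged = []
    for m in COMPOSITE.finditer(codes):
        span = tokens[m.start():m.end()]
        text = "".join(s for _, s in span)
        # Unmerged tokens keep their stage-1 category.
        cat = span[0][0] if m.lastgroup == "SINGLE" else m.lastgroup
        merged.append((cat, text))
    return merged


def tokenize(text):
    """Run both stages and drop whitespace tokens."""
    return [(c, s) for c, s in stage2(stage1(text)) if c != "SPACE"]


if __name__ == "__main__":
    for category, token in tokenize("pi is 3.14 - state-of-the-art"):
        print(category, token)
```

In this sketch a new token category is added by extending one of the two patterns (and `CODES`, for a new primitive category), which mirrors, in a much reduced form, the extensibility the description claims for the real grammar.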