Blog post written by Marie Hinrichs and Christoph Draxler, edited by Nathalie Walker, Darja Fišer, and Jakob Lenardič
WebLicht (“Web-based Linguistic Chaining Tool”) is an environment for building and executing chains of natural language processing tools, with integrated capabilities for visualizing and searching the resulting annotations. It is hosted by the CLARIN centre at the University of Tübingen.
One of the main goals of WebLicht is to make a wide range of text processing tools such as tokenizers, part-of-speech taggers and syntactic parsers easily accessible to researchers in the humanities and social sciences. WebLicht’s annotation tools can be invoked via any web browser, without the need for local software installation or any prior familiarity with the tools. Researchers can select predefined processing chains, called “Easy Chains”, that have been created for the most common annotations and languages. Custom processing chains can also be built easily: the user is guided through each tool choice, and at every step only the tools that are valid for the current state of the processing chain are offered for selection. This is made possible by detailed metadata about the input requirements and output annotations of each tool, and it ensures that custom processing chains are always valid. CLARIN-D has also prepared a set of illustrative use cases and annotation examples which showcase how new users can get started with the tool.
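The idea of offering only valid tools at each step can be illustrated with a minimal sketch. It is not WebLicht's actual implementation or metadata format: the `Tool` class, the layer names and the three example tools are hypothetical, and the sketch simply assumes that each tool declares the annotation layers it requires and the layers it produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical, simplified tool description (not WebLicht's metadata schema)."""
    name: str
    requires: frozenset  # annotation layers the tool needs as input
    produces: frozenset  # annotation layers the tool adds to the document


# Illustrative tools only; real tool metadata is considerably richer.
TOOLS = [
    Tool("tokenizer", frozenset({"text"}), frozenset({"tokens"})),
    Tool("pos_tagger", frozenset({"tokens"}), frozenset({"pos"})),
    Tool("parser", frozenset({"tokens", "pos"}), frozenset({"parse"})),
]


def valid_next_tools(available_layers, tools=TOOLS):
    """Return the tools whose inputs are all present and that add a new layer."""
    return [
        t for t in tools
        if t.requires <= available_layers and not t.produces <= available_layers
    ]


def apply_tool(available_layers, tool):
    """Return the set of layers available after running the tool."""
    return available_layers | tool.produces


# Walk through a chain, always taking the first tool that is offered.
layers = frozenset({"text"})
while True:
    offered = valid_next_tools(layers)
    if not offered:
        break
    print(sorted(layers), "->", [t.name for t in offered])
    layers = apply_tool(layers, offered[0])
```

Because a tool is only offered once all of its required layers exist, any chain assembled this way is valid by construction, which is the property described above.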
WebLicht is tightly integrated into the CLARIN infrastructure. It uses information from the Centre Registry to harvest tool metadata from all CLARIN centre repositories. This harvesting runs automatically several times each day, ensuring that all tool information is up to date. WebLicht also supports login with CLARIN Federated Identity, which allows researchers to log in through their academic institutions and makes the service available to researchers from thousands of institutions.
Another highly relevant and closely related CLARIN-D tool is WebMAUS, a web service for automatic word and phoneme alignment, developed by the Bavarian Archive for Speech Signals (BAS). It is part of the suite of BAS web tools for speech processing and provides word and phoneme alignment for more than 25 languages, including several dialects, as well as a language-independent alignment mode based on phonemic transcripts.
Most aligners are based on forced alignment, i.e. they map a given sequence of phonemes to a signal file. WebMAUS takes a different approach: from a set of pronunciation rules and a language model it generates a large number of phoneme sequences, and then returns the sequence that best matches the signal file. Thus, WebMAUS captures phonetic variation caused by e.g. coarticulation, regional variation or speaking style. In inter-rater comparisons, WebMAUS achieves up to 95% of human transcriber performance, and in an evaluation of automatic aligners for Swiss parliamentary speech, MAUS outperformed the other aligners in terms of boundary precision.
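The generate-and-select idea can be sketched in a few lines. This is a deliberately simplified toy, not how WebMAUS is implemented: the real service scores candidates against the audio signal with statistical models, whereas here a hand-written list of observed symbols stands in for the signal and plain edit distance stands in for the acoustic match. The single rule in `RULES` and the canonical form used below are assumptions made for illustration.

```python
from itertools import product

# Hypothetical pronunciation rule: a "d" directly followed by "t" may be
# kept or deleted ("" marks deletion).
RULES = {("d", "t"): ["d", ""]}


def variants(canonical):
    """Yield every phoneme sequence the rules allow for a canonical form."""
    options = []
    for i, phoneme in enumerate(canonical):
        following = canonical[i + 1] if i + 1 < len(canonical) else None
        options.append(RULES.get((phoneme, following), [phoneme]))
    for combination in product(*options):
        yield [p for p in combination if p]


def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]


def best_variant(canonical, observed):
    """Return the candidate that best matches the stand-in for the signal."""
    return min(variants(canonical), key=lambda v: edit_distance(v, observed))


# Assumed canonical form of "and tells" and an assumed observed realization.
canonical = ["a", "n", "d", "t", "e", "l", "z"]
observed = ["a", "n", "t", "e", "l", "z"]
print(best_variant(canonical, observed))
```

With these inputs the reduced variant without "d" is selected, mirroring the kind of reduction in which “and tells” is realized as [a n t e l z].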
Originally developed for phonetic analysis of speech, WebMAUS has seen growing interest from communities as diverse as speech technology development, language documentation, and research in oral history. Each new application area has led to important extensions and improvements of the service. For example, many widely used annotation tools for oral corpora, such as the Emu Speech Database System, ELAN, EXMARaLDA, and Octra, integrate access to WebMAUS, which greatly facilitates transcription tasks for their users.
At BAS, work on WebMAUS continues, and CLARIN-D is collaborating closely with speech researchers and potential users all over the world. Recently, the first tone language, Thai, has been added, as well as six different Swiss German dialects. The CLARIN BAS team also encourages anyone who works with a language not yet covered by WebMAUS to get in touch so that the language can be added to the service.
Figure 3: Schematic description of WebMAUS input and resulting multi-level time-aligned transcript. Note that the sequence “and tells” is produced as [a n t e l z].
- Dima, E., E. Hinrichs, M. Hinrichs, A. Kislev, T. Trippel, and T. Zastrow (2012). “Integration of WebLicht into the CLARIN Infrastructure.” In: Proceedings of the Joint CLARIN-D/DARIAH Workshop at Digital Humanities Conference 2012: Service-oriented Architectures (SOAs) for the Humanities: Solutions and Impacts. Hamburg, 17–23.
- Hinrichs, E., M. Hinrichs, and T. Zastrow (2010). “WebLicht: Web-Based Services for German.” In: Proceedings of the Systems Demonstrations at the 48th Annual Meeting of the Association for Computational Linguistics (ACL-2010). Uppsala, 25–29.
- Kisler, T., U. Reichel, and F. Schiel (2017). “Multilingual processing of speech via web services.” In: Computer Speech and Language, vol. 45, 326–347.
- Kisler, T., F. Schiel, and H. Sloetjes (2012). “Signal processing via web services: the use case WebMAUS.” In: Proceedings of Digital Humanities Conference 2012. Hamburg, 30–34.