CLASSLA recognizes the need for the development of language resources and technologies not only for Slovene and Bulgarian, but also for the other under-resourced South Slavic languages. That is why the centre aims to support researchers from the fields of Computational and Corpus Linguistics, Digital Humanities, as well as interested individuals from other scientific and business areas that use and produce language data for Slovene, Croatian, Serbian, Bosnian, Montenegrin, Macedonian, and Bulgarian.
Space for a productive cooperation of small nations
The languages supported by CLASSLA each have a relatively small number of speakers. Estimates of the number of speakers worldwide range from fewer than half a million (Montenegrin, Hlavac 2013), through 1.4 million (Macedonian, Wikipedia) and 2.5 million (Slovene, Krek 2012, and Bosnian, Hlavac 2013), to over 5.5 million (Croatian, Tadić et al. 2012) and around 9 million each for Serbian (Hlavac 2013) and Bulgarian (Blagoeva et al. 2012). Together, the seven CLASSLA languages are used by around 30 million speakers. They form a dialect continuum with varying degrees of mutual intelligibility between neighbouring languages.
Producing resources and tools for South Slavic languages costs just as much as producing them for global languages such as English, which has more than a billion speakers. Despite the small number of speakers and the consequently very small language technology communities, South Slavic languages must be supported with the same technologies as global languages if they are to maintain an equal status in future digital environments. This is where CLASSLA plays an important role. The knowledge centre provides a space for cooperation among researchers interested in any of the South Slavic languages, and for a rational and economical approach to solving common problems, especially in light of the mutual intelligibility of most of the languages.
To stimulate the development of language resources and technologies, CLASSLA provides information on freely available dictionaries, corpora, concordancers, (manually annotated) datasets, tools, and pipelines. The information is provided in the form of frequently asked questions (FAQ) and is aimed at both non-technical and more technically educated audiences. FAQs are currently available for Slovene, Croatian, Serbian, Macedonian, and Bulgarian. The information is regularly updated to cover emerging resources and technologies.
In addition, CLASSLA supports researchers in producing resources and technologies for South Slavic languages via its help desk, which can be contacted at firstname.lastname@example.org. So far, it has provided individual help to more than 50 researchers.
To share knowledge and enlarge the South Slavic language technology community, CLASSLA organizes workshops and raises awareness of its activities at home and abroad. The first CLASSLA workshop, organized in 2020, brought together 42 researchers.
Developing and providing freely available technologies and resources for under-resourced languages
The CLASSLA neural pipeline, an adaptation of the highly popular Stanza package, was built recently and offers state-of-the-art language processing of Slovene, Croatian, Serbian, Macedonian, and Bulgarian. It handles both standard and non-standard language, with processing steps ranging from tokenization to syntactic parsing and named entity recognition for most of the supported languages; semantic parsing is currently being added. The pipeline is designed to suit researchers with various backgrounds, from non-technical linguists, who can simply run it as described in the documentation, to more technically sophisticated engineers, who can use it to train their own language models. This year, a state-of-the-art transformer model, BERTić, was also trained; it covers Bosnian, Croatian, Montenegrin, and Serbian. Transformer models are large language models consisting of millions, or even billions, of parameters that produce a general numerical representation of a portion of text, which is then used for various tasks, from part-of-speech tagging, through text classification and machine translation, to text summarization and question answering.
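As an illustration of how the pipeline mentioned above might be run, the following is a minimal Python sketch. It assumes the `classla` package is installed and that its `download`, `Pipeline`, and `to_conll` calls behave as in the package's own documentation; the helper names `pipeline_kwargs` and `annotate`, as well as the `LANG_CODES` table, are hypothetical and introduced here only for the example. Non-standard models may not be available for every language, so the `nonstandard` option should be checked against the documentation.

```python
# Minimal sketch, not the official usage guide: running the CLASSLA pipeline.
# Assumes `pip install classla` and network access for the model download.

# Language codes as assumed for the classla package (verify in its docs).
LANG_CODES = {
    "Slovene": "sl",
    "Croatian": "hr",
    "Serbian": "sr",
    "Macedonian": "mk",
    "Bulgarian": "bg",
}


def pipeline_kwargs(language, nonstandard=False):
    """Build keyword arguments for classla.download / classla.Pipeline."""
    if language not in LANG_CODES:
        raise ValueError(f"unsupported language: {language}")
    kwargs = {"lang": LANG_CODES[language]}
    if nonstandard:
        # Assumption: non-standard models exist only for some languages.
        kwargs["type"] = "nonstandard"
    return kwargs


def annotate(text, language="Slovene", nonstandard=False):
    """Download models if needed, run the pipeline, return CoNLL-U text."""
    import classla  # imported lazily so pipeline_kwargs works without it

    kwargs = pipeline_kwargs(language, nonstandard)
    classla.download(**kwargs)         # fetch the models for this language
    nlp = classla.Pipeline(**kwargs)   # tokenization through parsing and NER
    return nlp(text).to_conll()
```

With the package installed, `print(annotate("France Prešeren je rojen v Vrbi."))` would be expected to print the annotated sentence in CoNLL-U format. The import is kept inside `annotate` so that the configuration helper can be inspected without downloading any models.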
In the two years since the inception of CLASSLA, South Slavic languages have gained many new technologies and resources, and many more are planned for the near future. CLASSLA is currently part of the MaCoCu project, which will produce large, high-quality monolingual and bilingual web corpora for under-resourced languages, South Slavic languages included. CLASSLA is also following current advances in speech technologies and is working to ensure that the technological coverage of South Slavic languages in that area is comparable to that of their larger counterparts. Finally, CLASSLA plans to add a newsflash and other dissemination channels to continue supporting and enlarging the South Slavic language technology community.
Blagoeva, D., S. Koeva, and V. Murdarov. 2012. The Bulgarian Language in the Digital Age. META-NET White Paper Series, edited by G. Rehm and H. Uszkoreit. Berlin, Heidelberg: Springer.
Hlavac, J. 2013. Interpreting in one’s own and in closely related languages: Negotiation of linguistic varieties amongst interpreters of the Bosnian, Croatian and Serbian languages. Interpreting 15 (1): 94–125.
Krek, S. 2012. The Slovene Language in the Digital Age. META-NET White Paper Series, edited by G. Rehm and H. Uszkoreit. Berlin, Heidelberg: Springer.
Tadić, M., D. Brozović-Rončević, and A. Kapetanović. 2012. The Croatian Language in the Digital Age. META-NET White Paper Series, edited by G. Rehm and H. Uszkoreit. Berlin, Heidelberg: Springer.