Tour de CLARIN: Bulgaria

Submitted by Jakob Lenardič on 22 October 2019

Blog post written by Petya Osenova and Kiril Simov

Bulgaria is a founding member of CLARIN and has participated since 2012. In 2014, following the strategic plan of the Bulgarian Government and the Ministry of Education and Science, the CLARIN and DARIAH infrastructures merged into a single infrastructure called CLaDA-BG (CLARIN and DARIAH in Bulgaria), which obtained funding in 2018. Elsewhere in Europe, such merged models have already proved successful in the Netherlands, Austria and Greece. The CLaDA-BG consortium is very heterogeneous: its members come from universities, other academic institutions, museums, libraries, non-governmental organizations and companies. It comprises a group of partners oriented towards language and semantic technology, on the one hand, and a group of expert and content-oriented partners, on the other.

The first group includes: the Institute of Information and Communication Technologies at the Bulgarian Academy of Sciences (Coordinator for CLaDA-BG and CLARIN-BG), the Institute of Mathematics and Informatics at the Bulgarian Academy of Sciences, Ontotext AD (Sirma AI), Sofia University “St. Kliment Ohridski” (Coordinator for DARIAH-BG), New Bulgarian University, Konstantin Preslavsky University – Shumen, and Bulgariana – an NGO promoting cultural heritage in Bulgaria. The second group includes: the South-West University “Neofit Rilski” – Blagoevgrad, Sirma Media, the Cyrillo-Methodian Research Centre at the Bulgarian Academy of Sciences, the Institute of Balkan Studies and Centre of Thracology “Alexander Fol” at the Bulgarian Academy of Sciences, the Institute of Ethnology and Folklore Studies with Ethnographic Museum at the Bulgarian Academy of Sciences, Burgas Free University, the “Ivan Vazov” Public Library – Plovdiv, and the Sofia History Museum.

The mission of the infrastructure is to build a scientific ecosystem that supports research in Social Studies and the Digital Humanities. The main goal is to construct a Bulgaria-Centric Knowledge Graph (BGKG) – a repository in which all types of linguistic and encyclopaedic knowledge are stored and linked, so that content can be extracted from it for particular tasks.

In its first year of operation, the consortium worked on structuring the various resources, extending and building corpora of contemporary and older texts, and modelling cultural objects, contextualizing the knowledge by connecting events, artefacts and descriptions.

Some of the main resources are: the syntactic corpus BulTreeBank (215,000 tokens), the BTB-Wordnet, which is integrated with Wikipedia (22,000 synsets), the Valency Lexicon (6,000 verb frames) and the Inflectional Lexicon (over one million word forms), all from the Institute of Information and Communication Technologies; the large Bulgarian corpus with statistics on collocations spanning one to six tokens, for which eight million webpages have been processed (Ontotext AD); the Corpus of Child Speech, with 33 hours of recordings and 355 pages of transcripts (Shumen University); the Ethnographical Museum exhibition in 3D representation; the epigraphic collection of ancient inscriptions in Greek, TELAMON (Sofia University); and bilingual corpora (New Bulgarian University).

Among the most important tools for Bulgarian are the language processing pipeline and the online concordancer webclark (IICT-BAS). Several other tools are in development: an old-to-new spelling transformation tool, a conceptual and keyword search tool over a large corpus of contemporary Bulgarian, and a semantic annotator for Bulgarian. CLaDA-BG’s plans include establishing CLARIN B and K centres and applying for assessment during the second year of the project.
