Winning the ‘Bridging gaps’ Call allowed us to connect CLARIN to external language technology tools by building web services for monolingual and bilingual NER (Named Entity Recognition) annotation of aligned text, specifically Italian-Serbian, and for geoparsing. The possibility of integrating It-Sr-NER-ws into the CLARIN ERIC Switchboard and the JERTEH (Society for Language Resources and Technologies) infrastructure, following the FAIR Data Principles for sharing project results, was another crucial point of alignment with CLARIN’s strategic priorities.
The project was motivated by the lack of tools and resources for annotating, exploring and analysing bilingual aligned Italian-Serbian texts. Aligned texts enriched with named entities (NE) are a powerful tool for improving the teaching and learning of both Serbian in Italy and Italian in Serbia. In projects that work with parallel corpora, domain-specific terminology recognition tools are of key importance for automatic or traditional translation and for improving the search quality of databases and different types of repositories.
Objectives and Goals
The main goal of the project was to build a web service, integrated into the CLARIN-it infrastructure, that performs NER using spaCy models tested on Italian and Serbian and provides NER for parallel texts in TMX (Translation Memory eXchange) format, aligned at the sentence level. Although primarily developed for aligned, parallel texts in TMX, it can also be used for monolingual texts (in 24 languages), and in both cases it can automatically link the LOC-related entities to the corresponding pages on Wikidata. Publishing the results enables users to gain insight into the models' performance and to understand what kind of input yields the best results.
Project Initiators and Participants
Project initiator and team leader is Olja Perišić, professor at the University of Turin (Università degli Studi di Torino, Dipartimento di Lingue e Letterature Straniere e Culture Moderne), where she teaches Serbian and Croatian. The project team includes members of JERTEH, led by professor Ranka Stanković, this year's winner of the award of the Institute for Standardization of Serbia in recognition of her contribution to the development of national standardisation, in collaboration with professor Duško Vitas, the founder and president of JERTEH, affiliated with the University of Belgrade, Faculty of Mathematics. Professor Vitas is the leading author and developer of the Corpus of contemporary Serbian language. Other members of our team are: professor Cvetana Krstev, professor Saša Moderc, PhD candidate Mihailo Škorić and PhD student Milica Ikonić Nešić.
The project lasted from June 2022 to September 2022. The first step was to enable NE annotation for both languages, Italian and Serbian, separately. To develop the web application, we used the Python programming language and the Django framework. The NE recognisers had previously been trained using the spaCy module for Python, specifically its CNN architecture, and the guidelines in the Developer Manual - WebLichtWiki were followed.
The next step was to accept Italian-Serbian TMX files as input. The system then ran both previously developed NER tools on these TMX files and output the same text, now containing named entity annotations, in the same TMX format.
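The TMX-in, TMX-out step described above can be illustrated with a minimal sketch. It is not the project's implementation: the gazetteer lookup in `find_entities` is a hypothetical stand-in for the spaCy recognisers, and wrapping entities in the TMX inline element `<hi type="...">` is an assumption, since the markup the service actually emits is not specified here. The sketch also assumes that each `<seg>` holds plain text only.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical stand-in for the trained recognisers: a toy gazetteer per language.
GAZETTEER = {
    "it": {"Torino": "LOC", "Marco": "PERS"},
    "sr": {"Torino": "LOC", "Marko": "PERS"},
}


def find_entities(text, lang):
    """Return (start, end, label) spans sorted by position.

    The toy gazetteer contains no overlapping names, so spans never overlap.
    """
    spans = []
    for name, label in GAZETTEER.get(lang, {}).items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def annotate_seg(seg, lang):
    """Rewrite a plain-text <seg> in place, wrapping each entity in <hi>."""
    text = seg.text or ""
    spans = find_entities(text, lang)
    if not spans:
        return
    seg.text = text[:spans[0][0]]  # text before the first entity
    for i, (start, end, label) in enumerate(spans):
        hi = ET.SubElement(seg, "hi", {"type": label})
        hi.text = text[start:end]
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        hi.tail = text[end:next_start]  # text up to the next entity


def annotate_tmx(tmx_string):
    """Annotate every <tuv>/<seg> according to its xml:lang; return TMX text."""
    root = ET.fromstring(tmx_string)
    for tuv in root.iter("tuv"):
        lang = (tuv.get(XML_LANG) or "").lower()
        seg = tuv.find("seg")
        if seg is not None:
            annotate_seg(seg, lang)
    return ET.tostring(root, encoding="unicode")


# Minimal illustrative input; a real TMX header carries more attributes.
SAMPLE_TMX = (
    '<tmx version="1.4"><header/><body><tu>'
    '<tuv xml:lang="it"><seg>Marco ama Torino.</seg></tuv>'
    '<tuv xml:lang="sr"><seg>Marko voli Torino.</seg></tuv>'
    '</tu></body></tmx>'
)

if __name__ == "__main__":
    print(annotate_tmx(SAMPLE_TMX))
```

Because the alignment lives in the `<tu>` units, annotating each `<tuv>` independently leaves the sentence-level pairing untouched, which is what allows the output to keep the same TMX structure as the input.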
The Named Entity Classifier for Wikidata was developed, relying on the ideas and results of three works: the NECKAr tool, which assigns entities present in Wikidata to the NE classes Person, Location, and Organisation; Wikidata based Location Entity Linking; and GeoTxt: A scalable geoparsing system for unstructured text geolocation.
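As a rough illustration of linking a recognised location to a Wikidata page, the sketch below builds a query for Wikidata's public `wbsearchentities` API action and reads the top candidate from a response. The linking strategy used in the project is not described here, so taking the first hit is only a naive assumption; the helper names and the canned response (including its placeholder ID) are hypothetical, and no network call is made.

```python
import json
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name, lang, limit=5):
    """Build a wbsearchentities request URL for an entity name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": lang,
        "format": "json",
        "limit": limit,
    }
    return WIKIDATA_API + "?" + urlencode(params)


def first_candidate(response_text):
    """Return the ID and page URL of the top-ranked hit, or None if no hits.

    Naive assumption: the first search result is the intended entity.
    A real linker would also check that the candidate is a location.
    """
    hits = json.loads(response_text).get("search", [])
    if not hits:
        return None
    entity_id = hits[0]["id"]
    return {"id": entity_id, "url": "https://www.wikidata.org/wiki/" + entity_id}


# Canned response for illustration only; "Q000" is a placeholder, not a real ID.
CANNED_RESPONSE = json.dumps({"search": [{"id": "Q000", "label": "Torino"}]})

if __name__ == "__main__":
    print(build_search_url("Torino", "it"))
    print(first_candidate(CANNED_RESPONSE))
```

In practice the URL would be fetched with an HTTP client and the candidates filtered by entity type before a link is accepted.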
A test corpus of 10,000 aligned segments (sentences) was developed from several translated Italian and Serbian novels, represented by samples in which the segments are shuffled to avoid copyright problems. Some of the novels were already available as aligned text; the others were aligned following the previously used pipeline for parallelisation. The corpus was published at the ILC4CLARIN B Centre and is visible via the VLO (Virtual Language Observatory). Automatically annotated documents were imported into the INCEpTION tool for further manual correction (1,000 sentence pairs). Evaluation included precision, recall, accuracy and F1 measures.
The test set was uploaded to the local installation of INCEpTION as a project that includes NER with a knowledge base.
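The precision, recall and F1 measures mentioned above can be computed by comparing automatically annotated entities with the manually corrected ones. The sketch below assumes exact-match scoring over `(segment_id, start, end, label)` tuples, which is one common convention and not necessarily the one used in the project's evaluation; accuracy is left out because it is usually computed over token-level labels. The function name and the sample data are illustrative.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match entity scores: an entity counts only if span and label agree."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: (segment_id, start, end, label) tuples, not project results.
GOLD = {
    (1, 0, 5, "PERS"),
    (1, 10, 16, "LOC"),
    (2, 0, 7, "ORG"),
    (2, 12, 18, "LOC"),
}
PREDICTED = {
    (1, 0, 5, "PERS"),
    (1, 10, 16, "ORG"),  # right span, wrong label: counts as an error
    (2, 12, 18, "LOC"),
}

if __name__ == "__main__":
    p, r, f = precision_recall_f1(GOLD, PREDICTED)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Exact matching is strict: an entity with correct boundaries but the wrong class lowers both precision and recall, so partial-match schemes would report higher scores on the same data.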
- CLARIN-compatible web services for NER annotation of:
  - bilingual texts, tested on It-Sr with 6 NE classes,
  - monolingual texts with supporting documentation (open source), tested on Serbian and Italian.
  In total 8 services: 4 for bilingual and 4 for monolingual text.
- Corpus with 10,000 aligned segments to be used for testing (open data). LOC-related entities were automatically linked to the corresponding pages on Wikidata, followed by an analysis of the NEL success on a validation subset of parallel sentences. Available in the CLARIN repository: It-Sr-NER-corp at CLARIN (cnr.it). Fig. 1 presents results of the web service on parallel text samples, with geolocations presented in Fig. 2.
- NER evaluation documentation that includes 1,000 manually corrected segments with metrics and visualisation of differences.
Presence at Events and Relevant Materials
During the international conference ‘Incroci linguistici e letterari nel contesto culturale degli Slavi meridionali’, held on 29 and 30 September 2022 at the University of Turin, we presented our project to the leading scholars and professors of Serbian and Croatian language in Italy. They showed an interest in integrating the project results into their teaching practice.
We also participated in the annual CLARIN conference in Prague from 10 to 12 October, where we presented the results of our project at the Bazaar session (Fig. 3). During the poster presentation, we had the opportunity to exchange impressions of the project and ideas for future collaborations with other participants.
On 20 October we presented our project at the Faculty of Mining and Geology of the University of Belgrade, during the seminar of the Society for Language Resources and Technologies (JeRTeh), with the participation of researchers from different fields of study and journalists (Politika, 5 November). Project resources (source code, monolingual and bilingual corpora, and processed output files) are freely available here.
The results of our project are immediately visible and accessible to the scientific community through the publication It-Sr-NER: Web services for named entities recognition, linking and mapping in the scientific journal Infoteka, dedicated to Digital Humanities.
Future developments of the project concern increasing the corpus size, improving the performance of the NER model for Serbian, improving the recognition of geolocations, and linking entities with the knowledge base.
Named entity recognition is essential for many applications, such as identifying clients in business transcripts, determining location in social media posts, anonymising sensitive documents, and automatically classifying electronic media articles and topics. Extracting information and summarising text, which management needs in order to make quick decisions, is not possible without successful recognition of named entities. Likewise, this kind of component is a mandatory part of question answering systems (chatbots) needed to support users of banks, telecommunications centres and e-commerce.