The CLARIN Bazaar 2023

Below you will find a list of the stalls at the CLARIN Bazaar. The CLARIN Bazaar is an informal space at the CLARIN Annual Conference where you can meet people from other centres and countries, find out about their work in progress, exchange ideas, and talk to people about future collaborations. The Bazaar is an in-person, highly interactive conference session.

Bazaar Presentations


CLARIN K-centres

CLARIN Knowledge Centre for Treebanking


Latest developments in treebanking at our centres, including Universal Dependencies and tools for annotating and searching treebanks.


CLASSLA: The CLARIN Knowledge Centre for Language Resources and Technologies for South Slavic Languages


Nikola Ljubešić, Taja Kuzman

The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) offers expertise on language resources and technologies for South Slavic languages. Its basic activities are (1) giving researchers, students, citizen scientists and other interested parties information on the available resources and technologies via its documentation, (2) supporting them in producing, modifying or publishing resources and technologies via its helpdesk, and (3) organising training activities. CLASSLA is run by the Slovenian national consortium CLARIN.SI, the Institute for Croatian Language, and the Bulgarian national consortium CLADA-BG.


CLARIN - Knowledge Centre for Learner Corpora: Objectives and Current (Collaborative) Initiatives


Magali Paquot (with the help of Alexander König)

The CLARIN Knowledge Centre for Learner Corpora offers expert knowledge on the collection and use of learner corpora (i.e. electronic collections of language data produced by second or foreign language learners) for theoretical and applied purposes. Sharing of expertise can take various forms, from answering (theoretical, methodological, technical) questions sent via the helpdesk to sharing resources and providing training services.

The centre has expertise in learner corpus design (metadata, transcription, ethics), annotation (POS tagging, parsing, error annotation), and analysis. Current projects of the CKL2CORPORA members include the development of a Core Metadata Schema for Learner Corpora (Paquot et al., 2023) and the development of FABRA (Wilkens et al., 2022), a tool that was originally developed for readability research but can also be used to compute a wide range of measures of linguistic complexity for French.

At the Bazaar, we would like to showcase the Core Metadata Schema for Learner Corpora, which is the result of extensive collaboration between learner corpus compilers at the CLARIN Knowledge Centre for Learner Corpora and EURAC Research (Bolzano, Italy), and a research data infrastructure expert and member of CLARIN's metadata taskforce.


CorpLingCz K-centre: Overview of Activities


Michal Křen

The contribution will present the activities of the Czech CLARIN Knowledge Centre for Corpus Linguistics hosted at the Faculty of Arts, Charles University, Prague. The K-centre is closely connected to the Czech National Corpus research infrastructure, but it also offers a number of resources for users who need to work with languages other than Czech.
Its activities can be divided into three main categories:

  • Corpus compilation and annotation, with the main focus on continuous mapping of the Czech language and many of its varieties and modalities: contemporary written and spoken language, online Czech, historical Czech, the large InterCorp parallel corpus covering 40+ languages, etc.
  • Development and maintenance of web-based user applications that enable effective and user-friendly work with the language data
  • User support that includes an online help desk with Q&A, comprehensive web documentation and manuals, bibliography of research outputs, corpus-based exercises for language teaching, corpus hosting, provision of customised data packages, consulting, education and training.

PolLinguaTec – CLARIN K-Centre for Polish Language Technology


PolLinguaTec - Polish K-Centre

The Language Technology Centre at the Wrocław University of Technology has the status of a CLARIN K-centre. PolLinguaTec is a place where users can report problems with the operation of the infrastructure, receive professional help from a team of specialists supporting research projects, and receive feedback on enquiries regarding the use of the infrastructure.

PolLinguaTec provides access to knowledge useful in applying tools and systems for natural language analysis, especially for Polish, within the Digital Humanities and Social Sciences. The centre has documentation (instructions, guidelines, tutorials) at its disposal, and its experienced employees are able to solve problems connected with the use of tools, resources and systems. PolLinguaTec also offers a number of research applications built for specific types of research tasks, developed in close cooperation with researchers. It has been one of the leading Polish centres building natural language processing technologies for many years.


CLARIN Knowledge Centre for Systems and Frameworks for Morphologically Rich Languages (SAFMORIL)


Jurgita Vaičenonienė, Andrius Utka, Inguna Skadina

This poster presents the main aims and activities of the CLARIN SAFMORIL Knowledge Centre. The centre focuses on computational morphology and its application in language processing. The focus of SAFMORIL is on actual, working systems and frameworks that are based on linguistic principles and provide linguistically motivated analysis and/or generation on the basis of linguistic categories.

The K-centre operates as a distributed virtual centre supported by CLARIN member institutions: the University of Helsinki, Finland; the University of Tromsø, Norway; the University of Latvia; and Vytautas Magnus University, Lithuania.

CLARIN-MULTISENS at Lund University Humanities Lab


Johan Frid

Lund University Humanities Lab is a department for research infrastructure, interdisciplinary research and training. Since 2017, the Lab has been a certified CLARIN Knowledge Centre with a special focus on multimodal and sensor-based methods. As of 2020, we are also a CLARIN C-centre, meaning that our datasets are integrated with CLARIN's Virtual Language Observatory. The Lab is a member of Swe-Clarin, the Swedish national consortium for language resources and technology.

We provide access to sensor-based technologies, methodological know-how, data management, and archiving expertise. Our mission is to facilitate and help diversify research around the issues of cognition, communication, and culture – traditional domains for the Humanities. That said, many projects undertaken at the Lab are interdisciplinary and conducted in collaboration with the Social Sciences, Medicine, the Natural Sciences, Engineering, and e-Science. The Lab enables researchers to combine traditional and novel methods, and to interact with other disciplines.
We have a wide range of facilities for measurements and recordings: articulography, electrophysiology, EEG, eye-tracking, professional audio and video recording, motion capture and virtual reality. The Lab also offers support and consultancy on statistics, machine learning-related research on language data, and keystroke logging for the study of the writing process.

CKCMC: CLARIN K(nowledge)-Centre for Computer-Mediated Communication and Social Media Corpora


Lionel Nicolas & Egon Stemle (Eurac Research, IT)

The CLARIN Knowledge Centre for Computer-Mediated Communication and Social Media Corpora (CKCMC) offers expertise on language resources and technologies for computer-mediated communication and social media. Its basic activities are a) to give researchers, students, and other interested parties information about the available resources, technologies, and community activities, b) to support interested parties in producing, modifying or publishing relevant resources and technologies and organise training activities, and c) to support and facilitate community-building. For the last point, we have just set up a social media platform including federated login, which we presented at the last community conference.


Corpus Citation. Creating and Using Unique References for Examples Extracted from Language Corpora in Scientific Papers


Christophe Parisse

Language corpora are widely used in scientific research. They may be the very object of the research or a source of emblematic data on particular language uses. Referring to an existing corpus is quite easy and is very similar to the academic practice of citing scientific references. By contrast, citing a particular extract or example from a given corpus is not an easy task. Indeed, corpora do not have page or line numbers as printed texts and books do. Yet the citation of an extract is very important in the research process, in order to allow scientific control and reproducibility. The CORLI CLARIN K-centre is currently developing a methodology and tools for generating 'examples extracted from a corpus' that come with persistent supporting information, so that they can be controlled, used, reused, and found by archiving and search tools.


Terminology for All: Embracing New Audiences


Vesna Lušicky

The Knowledge Center for Terminology Resources and Translation Corpora (TRTR) provides resources, materials, and training opportunities to support users in creating and documenting translation-related resources, with a specific emphasis on terminology resources and translation corpora. In this presentation, we will highlight our latest projects, training activities, and language resources, all tailored to meet the needs of non-academic users and stakeholders beyond the CLARIN community.


Training and Education


Mapping Graduate Skills To Employers’ Needs: Ongoing Work in Enabling Collaborations between Academia and Industry


Amelia Sanz, Vicky Garnett, Iulianna van der Lek, Tom Gheldof

In our presentation at the CLARIN Bazaar session, we aim to start a conversation with the CLARIN conference participants representing DH Master's and other educational programmes, and to invite them to participate in our joint CLARIN-DARIAH initiative. This initiative aims to support DH Master's programmes in enhancing employability opportunities for their graduates through internship programmes in both the cultural heritage sector (GLAM) and industry.
Using a poster and a first draft of a White Paper, we will share the results of our workshop 'Digital Humanities and Industry: Identifying Employment Niches', which we organised at the DARIAH Annual Conference (Sanz et al., 2023).

This is the next step in an important, long-term initiative to improve collaboration between industry, cultural heritage institutions and academia (specifically taught postgraduate DH degrees) in the framework of research infrastructures.


Making the CLARIN Training Materials FAIR-by-Design (poster)


Iulianna van der Lek, Francesca Frontini, Darja Fišer and Alexander König

Within the CLARIN network, training events and workshops take place each year to educate different stakeholders on various topics related to using the infrastructure in research and teaching. However, these materials are often not easily findable and reusable because they are developed using various formats, hosted on different platforms, and may have different access restrictions. Via the Teaching with CLARIN call, launched in 2021, we have collected a small set of training materials developed in the CLARIN community. They are showcased in the CLARIN Learning Hub.
The curation process revealed different practices for hosting and disseminating training materials, and showed that teachers need clear guidelines to help them make their training materials and learning resources findable and reusable by others in the community. In this Bazaar presentation, we would like to propose and discuss a more standardised workflow for developing, hosting and disseminating training materials within the CLARIN Knowledge Infrastructure. Our proposal draws on our experience and collaboration in various projects (e.g. SSHOC Trainers Toolkit, SSH Open Marketplace, UPSKILLS project) and task forces (e.g. ELIXIR FAIR Training Handbook, Community of Practice), as well as on the FAIR-by-Design Methodology for Learning Materials proposed in Skills4EOSC.

How to Use and Contribute to the Digital Humanities Course Registry (poster)

Anna Woldrich, Iulianna van der Lek and Patrick Akkermans

Are you a student or a graduate seeking to improve your Digital Humanities (DH) skills abroad?
Are you a lecturer teaching a course in DH or a related field, seeking to increase the visibility of your course outside your university network?
Are you a researcher interested in using the DH course registry for research purposes?
Are you a programme director seeking to enhance your DH curriculum or looking for quantitative data to support you in the decision-making process?

Then come to our stand at the CLARIN Bazaar and get to know the DH Course Registry, a joint effort of CLARIN ERIC and DARIAH-EU, designed to showcase DH courses and training programmes across Europe and beyond. We will give an update about the latest developments in the registry and showcase several new courses and programmes in the database.

UPSKILLS Final Project Deliverables (poster)


Iulianna van der Lek, Jelena Gledic, Stavros Assimakopoulos, Darja Fišer

We present the final project deliverables produced in UPSKILLS, an Erasmus+ strategic partnership for higher education that aimed to identify and tackle the gaps and mismatches in skills for linguistics and language students through the development of a new curriculum component and supporting materials to be embedded in existing programmes of study. The final deliverables consist of:


CLARIN Terminology - Internship Project


Lesley Messori

The Multilingual CLARIN Terminology project was carried out as part of a curricular internship at CLARIN ERIC. The project aimed to develop an English-Italian resource that serves as a base for a versatile and adaptable multilingual terminology with concepts related to the CLARIN infrastructure, services and tools, which could be employed for various purposes. 

The terminological resource represents a way to standardise the terminology used within the infrastructure, and can also be used to facilitate topic annotation on CLARIN’s main website and improve (multilingual) content searches. Moreover, it could be convenient for CLARIN national consortia and K-centres to provide users with definitions of CLARIN’s services and tools. Finally, the terminology could be useful to localise the CLARIN ERIC website into other languages or translate the learning content developed in the UPSKILLS project.
The methodology for developing this resource is based on the approach of the SSHOC Multilingual Data Stewardship Terminology. At the end of the internship project, the terminology resource contained 152 English concepts with their Italian equivalents. 

Augmenting CLARIN's VLO: An Internship Project (poster)


Elton Pistolia

In this project, we focused on harnessing Natural Language Processing (NLP) to augment the capabilities of the VLO (Virtual Language Observatory). The VLO is an integral part of CLARIN ERIC and a critical resource for researchers and scholars in the social sciences and digital humanities, providing access to a wide range of language-related tools and resources. Our primary objective was to enhance the quality of search results within the VLO through various NLP techniques, including Language Identification, Machine Translation (to translate dataset descriptions), and Named Entity Recognition (to enrich keywords).

To accomplish this, we ran a small-scale experiment focusing on a single collection, the CoCoON collection within the VLO. We extracted the description text from the CoCoON records and implemented a language identification component based on spaCy's language models. Additionally, we integrated Machine Translation using the DeepL API to translate descriptions from French to English, creating a dataset in two languages. Finally, we utilised spaCy's pre-trained models for Named Entity Recognition on the translated record descriptions.
Our approach yielded interesting results highlighting the linguistic variety of the resources available through the VLO, and suggests that the project could be beneficial if applied on a larger scale. Further experiments, using different tools and techniques, could improve this approach. This internship project serves as a baseline for future projects aimed at enhancing CLARIN's services, in line with its mission to support research and facilitate the exploration of linguistic resources.
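The shape of such a record-enrichment pipeline (identify the language, translate French descriptions, extract entities as keywords) can be sketched as follows. This is a minimal standard-library illustration only: the actual project used spaCy models for language identification and NER and the DeepL API for translation, whereas the components below are crude stand-ins, and the record text is invented.

```python
import re

# Crude stopword-based language identification: a stand-in for the
# spaCy-based component used in the project.
STOPWORDS = {
    "fr": {"le", "la", "les", "des", "une", "dans", "et", "du"},
    "en": {"the", "a", "of", "and", "in", "to", "is"},
}

def identify_language(text):
    tokens = re.findall(r"[a-zàâçéèêëîïôûù]+", text.lower())
    scores = {lang: sum(t in sw for t in tokens) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

def translate_fr_to_en(text):
    """Placeholder for the DeepL API call: a toy word-by-word glossary."""
    glossary = {"enregistrements": "recordings", "dans": "in", "le": "the"}
    return " ".join(glossary.get(w.lower(), w) for w in text.split())

def extract_entities(text):
    """Crude 'NER' (capitalised words not at the start of the text),
    standing in for spaCy's pre-trained NER models."""
    return [m.group() for m in re.finditer(r"\b[A-Z][a-z]+\b", text)
            if m.start() > 0]

def process_record(description):
    lang = identify_language(description)
    english = translate_fr_to_en(description) if lang == "fr" else description
    return {"lang": lang, "english": english, "keywords": extract_entities(english)}

record = process_record("Enregistrements dans le village de Dakar")
print(record["lang"], record["keywords"])  # fr ['Dakar']
```

In the real pipeline, each stand-in function would be replaced by the corresponding spaCy or DeepL call, while the per-record control flow stays the same.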

Software Demos and Tools


HuC CMDI Editor - See the Full Potential of CMDI 1.2!


Menzo Windhouwer & Rob Zeeman

The KNAW HuC CMDI editor takes full advantage of the rich feature set of CMDI 1.2 by using its 'cues for tools' functionality to tailor a generic editor into an application- or resource-specific one aimed at maximum usability!

We will show how this is done by combining the CMD profile with dedicated tweak files. This has enabled us to use the HuC CMDI editor for a wide range of data entry and curation tasks, which we will showcase.

Introducing the CLARIN-PL Data Science Platform


Piotr Pęzik

Developing language processing tools for research use cases tends to present significant operational challenges. First of all, it can be expensive to create user interfaces and even programmatic solutions for highly specific experimental scenarios. In the long term, even successful research applications tend to fall behind as a result of accumulating technological debt and insufficient attention to user feedback.

To alleviate this problem, we have developed the CLARIN-PL Data Science Platform (DSP), a growing collection of richly documented interactive notebooks showcasing the use of various resources, web-based microservices and language models, which address well-defined research use cases. The majority of use cases described on the platform are either directly based on or motivated by feedback from the scientific users of the CLARIN-PL infrastructure. In some cases, we integrate our own resources with external tools and models to maximise the usefulness of the proposed data science recipes.
Given the agility with which we can create, update and modify new script-based data processing pipelines, we believe that the platform will play an important role in supporting and empowering the scientific user base of the CLARIN-PL infrastructure. To increase its educational impact, we also organise data science workshops and courses for researchers, based on the instructional resources originally developed for the CLARIN-PL DSP.

Speech Acoustics Modeling to Infer Processes of Linguistic Evolution


Axel G. Ekström & Jens Edlund

The traditional source/filter model (the acoustic theory of speech production) holds that a speech signal can be modeled as the combination of two largely independent components: the rate of vocal fold vibration (the voice 'source') and the shape of the supralaryngeal vocal tract (the 'filter'). In theory, because the principles of vocalisation are conserved across mammals, the same methodology can be applied to the analysis of animal sounds.

We show how this traditional and influential line of speech acoustics research can be used to glean new information bearing on the evolution of speech capacities in humans. In particular, all great apes (chimpanzees, gorillas, orangutans) produce sounds closely overlapping with the human back vowel [u] ('boot'). However, the high larynges and narrow pharynges of these animals mean that these features are likely achieved disparately, with disparate vocal tract configurations. We show that, indeed, configurations inspired by species-unique vocal morphology can achieve features close to those observed in nature.
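The source/filter decomposition can be illustrated with a minimal Klatt-style cascade formant synthesiser: an impulse-train 'source' passed through two-pole resonator 'filters', one per formant. The fundamental frequency, formant frequencies and bandwidths below are illustrative textbook-style values for a back [u]-like vowel, not the authors' measurements or method.

```python
import math

SR = 16000  # sample rate in Hz

def impulse_train(f0, n):
    """Glottal 'source': an impulse train at fundamental frequency f0."""
    period = int(SR / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq, bw):
    """Two-pole digital resonator modelling one formant of the 'filter'."""
    r = math.exp(-math.pi * bw / SR)                  # pole radius from bandwidth
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / SR)
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:                                   # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def synthesise_u(f0=120.0, n=SR // 10):
    """Cascade the source through F1/F2 resonators; the formant values
    are illustrative for a back [u]-like vowel."""
    signal = impulse_train(f0, n)
    for freq, bw in [(300.0, 60.0), (870.0, 90.0)]:
        signal = resonator(signal, freq, bw)
    return signal

samples = synthesise_u()  # 100 ms of a crude [u]-like signal
```

Because source and filter are independent in this model, swapping in a different resonator configuration (e.g. one derived from an ape vocal tract) changes the formants while leaving the source untouched.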

Demo of MWE-Finder


Jan Odijk, Martin Kroon

We will give a demonstration of the MWE-Finder application embedded in GrETEL, which we also present in the oral programme. We will explain how the application works and how it generates queries on the basis of an example of the MWE given as input, show the results, and discuss the differences between the results of different queries. Participants can request queries for the multiword expressions they are interested in.


DigItAnt: A Platform for Creating and Exploring Digital Resource Ecosystems for Ancient Languages


Valeria Quochi, Federico Boschetti, Monica Monachini

Creating lexicons connected to companion materials such as digital representations of witnesses, bibliographic references, shared vocabularies, and other existing lexico-terminological resources is becoming a common practice in the historical digital humanities. Manual encoding and interventions remain indispensable for various reasons, including enabling future discoveries by exploiting the reasoning capabilities of the semantic web and artificial intelligence, as well as preserving and disseminating intangible cultural heritage.

In this context, we introduce a newly developed platform, initiated as part of a CLARIN-IT related project focused on ancient Italian languages. This platform enables both the creation and exploration of a comprehensive set of interconnected digital resources centered around lexical data, which conforms to state-of-the-art representation formats (Ontolex-lemon for lexicons, SKOS for conceptual structures, and TEI-Epidoc for scholarly editions of inscriptions): a multilingual lexicon of ancient languages, a corpus of epigraphic texts, a dataset of bibliographic references, a conceptual system for Indo-European cultures, external lexical knowledge bases. The platform features a user-friendly web application, EpiLexo, that allows users to create and edit ancient lexicons, establish connections between lexical items and relevant internal or external resources, as well as to browse, query and explore the interlinked materials from a user's perspective.
As the current implementation assumes texts to be encoded independently, we will also demonstrate an experimental system based on Domain Specific Languages, which allows scholars to encode new scholarly editions with minimal technical skills and which could be integrated into the platform in future enhancements. While the front-end is necessarily geared to the specific use case of ancient languages, the back-end components are general and expose RESTful APIs. All code is open source, and a Docker image can be provided to facilitate independent installations.

MATEO: MAchine Translation Evaluation Online

Bram Vanroy

At this stall, we present a live showcase of MAchine Translation Evaluation Online (MATEO), a CLARIN Bridging Gaps project that simplifies the process of evaluating machine translation (MT) by providing an intuitive, user-friendly web interface that caters to both novice and expert users. It is equipped with a comprehensive suite of automatic metrics that assess the quality of machine translations against reference translations. MATEO serves a wide range of users with varying levels of experience in MT, such as system builders, educators, students, and researchers in (machine) translation as well as the social sciences and humanities. We take specific steps to make installation and maintenance as straightforward as possible, while also making it easy to add additional metrics. The interface is integrated in the CLARIN B-centre of the Dutch Language Institute (INT), and instructions are available for users to run the tool on their own device or in the cloud for free. It is open source and GPLv3-licensed.

Korpusnik: A Corpus Summarising Tool for Slovene
Iztok Kosem, Jaka Cibej, Kaja Dobrovoljc and Simon Krek
In this paper, we present Korpusnik, a corpus summarising tool for Slovene, which is being developed as part of the CLARIN.SI infrastructure. The tool offers a simple and clear overview of the most relevant information (e.g., collocations, example sentences, distribution by text type, year of publication, and source) from five corpora of Slovene: the Gigafida Corpus of Written Standard Slovene, the Gos Corpus of Spoken Slovene, the Trendi monitor corpus of Slovene, the JANES Corpus of Internet Slovene, and the OSS Corpus of Slovene Scientific Texts. Special attention in the design of the tool has been paid to accessibility, especially for people with disabilities.

Projects, Collaborations and CLARIN Consortia


The CLARIAH-ES Consortium

German Rigau
The CLARIAH-ES strategic network (formerly INTELE) has articulated a common proposal for a CLARIAH-ES consortium that has helped steer Spain’s official incorporation into both CLARIN and DARIAH in September. This joint proposal created a collaborative space for the scientific community and demonstrated to the ministry its potential, critical mass, and capacity.
The CLARIAH-ES consortium currently consists of 10 nodes that will provide support to researchers working in Spanish and Spain’s official languages (Galician, Catalan and Basque): UPV/EHU (HiTZ), CSIC (Centro de Ciencias Humanas y Sociales), USC (Instituto da Lingua Galega and CiTIUS), UA (Cervantes Virtual Library), UNED (LENAR and LINDH), BSC (CNS), UCM, UJA (CEATIC), ULPGC (IATEXT) and BNE. The CLARIAH-ES coordination office will be located at the UPV/EHU.
The CLARIAH-ES nodes bring together centres composed of interdisciplinary groups of language experts centred around the official languages of Spain, including Spanish, Catalan, Basque, and Galician. Additionally, they include experts in computer science, the digital transition within the Social Sciences and Humanities, and the field of Cultural Heritage, including leading libraries and repositories that contain written, graphic, and audiovisual cultural content related to these languages. The consortium's primary goals will be to propose an inventory of data, tools, and services that the ten nodes can offer and to discuss which data and tools will be incorporated into CLARIN-ERIC.

Holocaust Testimonies as Research Data

Martin Wynne

In a collaboration between CLARIN and the European Holocaust Research Infrastructure (EHRI), a workshop was held at King's College London in May 2023 to investigate ways in which language technologies can be used to create research datasets from oral testimonies of the Holocaust. The workshop focused particularly on speech technologies: how a speech-to-text pipeline can be constructed for a variety of languages, and how the resulting text can be aligned, annotated and enhanced in various ways.

Future workshops will focus more on text technologies, including creating annotated text corpora of Holocaust testimonies, investigating OCR for printed and handwritten sources, translation and parallel corpora, named entity recognition, geo-tagging, and stylistic studies of the literature of the Holocaust.

The ParlaMint Project: Ever-Growing Family of Comparable and Interoperable Parliamentary Corpora (poster)


Petya Osenova and Maciej Ogrodniczuk

ParlaMint is a CLARIN Flagship project that focuses on the creation of comparable and uniformly annotated corpora of parliamentary debates in Europe. The first stage of the project (ParlaMint I: 2020–2021) resulted in the compilation of 17 corpora, while the second stage (ParlaMint II: 2022–2023) has increased the time span of the corpora, added corpora for new countries and autonomous regions, provided a machine-translated English version of the corpora, further enhanced the corpora with additional metadata, and improved their usability. The corpora developed in the first stage of the project are described in an open access paper.

ParlaMint corpora are openly available under the CC BY license, as well as freely available for analysis through noSketch Engine. The latest versions of the corpora are:

The ParlaMint project also has a GitHub repository, where samples of the corpora, the XML schema and corpus processing and validation scripts are available.


FAIRCORE4EOSC Social Sciences & Humanities Case Study (poster)


Daan Broeder, Willem Elbers

The FAIRCORE4EOSC project focuses on the development and realisation of core components for the European Open Science Cloud (EOSC), supporting a FAIR EOSC and addressing gaps identified in the Strategic Research and Innovation Agenda (SRIA). Leveraging existing technologies and services, the project will develop nine new EOSC-Core components aimed at improving the discoverability and interoperability of research outputs.

The CLARIN case study will integrate its Digital Object Gateway (DOG), the CLARIN Virtual Collection Registry (VCR) and the Language Resource Switchboard with the EOSC MSCR, the EOSC DTR and the EOSC PIDMR in order to (i) make their functions and content available beyond the SSH Cluster borders to all EOSC users, and (ii) benefit from the services and content (e.g. data types, crosswalks) available from other communities and infrastructures via the EOSC MSCR and DTR. Moreover, the integration with EOSC registries will make CLARIN domain language data resources available via the EOSC PIDGraph and the EOSC RDGraph. The CLARIN Virtual Collection Registry will indirectly benefit from this integration, relying on common PID frameworks, sharing of crosswalks and schemas.

Overall, the discoverability of language data hosted in the CLARIN domain is improved through the use of the proposed EOSC components, improving options for reuse of the data and its augmentation with other relevant data from other sources. This effort will also lead to faster and easier processing of language data that is referred to with a persistent identifier.


Component Metadata Infrastructure (CMDI) Help Desk, News, Discussion and Rumours


CMDI developers

The CMDI developers are here to answer your questions regarding the Component Metadata Infrastructure, or discuss any other matters related to metadata in CLARIN!

Don't forget to pick up your copy of the updated CMDI First Aid Kit!

Federated Content Search - Past, Present and Future (poster)

Erik Körner

The CLARIN Federated Content Search (FCS) is a search engine that connects local data collections. It allows researchers to search for specific patterns in collections from distributed data centres. We provide a brief overview of recent developments and a sneak preview of future plans.
Insights into the work of the German project Text+ show our current efforts to integrate lexical resources into the FCS, enabling search through dictionaries, wordnets and similar resources.

Visualising Yiddish Text Genre Differences with Word Rain

Magnus Ahltorp

In Word Rain, words are placed along a semantic x-axis computed using distributional semantics and dimensionality reduction. Colour is also used to emphasise the semantic axis.

To show word prominence, Word Rain not only uses font size but also places prominent words higher on the y-axis. A bar chart above the words mirrors the word prominence information, making comparisons along the semantic axis easier.

We generate two Word Rains, one from each text genre. By using the same semantic axis and aligning the two Word Rains, the prominence of words at different points along the semantic axis can be compared between the genres.
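The placement idea above (a shared semantic x-axis from dimensionality reduction, prominence on the y-axis) can be sketched in a toy form. The word vectors below are made up, the axis is computed by simple power-iteration PCA, and prominence is just log frequency; the actual Word Rain implementation (embeddings, reduction method, prominence measure) may differ on all three points.

```python
import math
from collections import Counter

# Made-up 3-D "embeddings"; the real tool uses high-dimensional
# distributional vectors and a proper dimensionality-reduction step.
VECTORS = {
    "shtetl":  [0.9, 0.1, 0.2],
    "family":  [0.8, 0.3, 0.1],
    "letter":  [0.1, 0.9, 0.3],
    "press":   [0.2, 0.8, 0.4],
    "theatre": [0.3, 0.2, 0.9],
}

def semantic_axis(vectors):
    """First principal direction of the vectors via power iteration;
    this shared direction serves as the semantic x-axis."""
    words = list(vectors)
    dim = len(vectors[words[0]])
    mean = [sum(vectors[w][d] for w in words) / len(words) for d in range(dim)]
    centred = [[vectors[w][d] - mean[d] for d in range(dim)] for w in words]
    v = [1.0] * dim
    for _ in range(50):
        nv = [0.0] * dim
        for x in centred:                      # multiply v by the covariance
            dot = sum(a * b for a, b in zip(x, v))
            for d in range(dim):
                nv[d] += dot * x[d]
        norm = math.sqrt(sum(c * c for c in nv)) or 1.0
        v = [c / norm for c in nv]
    return mean, v

def word_rain_positions(freqs, vectors):
    """x = projection onto the shared semantic axis; y = prominence."""
    mean, axis = semantic_axis(vectors)
    positions = {}
    for word, freq in freqs.items():
        centred = [vectors[word][d] - mean[d] for d in range(len(axis))]
        x = sum(a * b for a, b in zip(centred, axis))
        positions[word] = (x, math.log(1 + freq))  # frequent => placed higher
    return positions

genre_a = Counter({"shtetl": 40, "family": 25, "theatre": 5})
genre_b = Counter({"press": 30, "letter": 20, "theatre": 15})
pos_a = word_rain_positions(genre_a, VECTORS)  # same axis for both genres,
pos_b = word_rain_positions(genre_b, VECTORS)  # so the x positions align
```

Because both genres are projected onto the same axis, a word occurring in both (here "theatre") lands at the same x position in each Word Rain, which is what makes the aligned comparison possible.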

Text Mining on Applications for Cultural Funding

Paul Rosenbaum and Maria Skeppstedt

We are using text mining to study the content of 20,829 Swedish applications for cultural funding, and how this content varied between 1994 and 2011. We use two methods: (i) topic modelling with the Topics2Themes tool, and (ii) word cloud visualisations with our Word Rain code package.
Topics2Themes is developed and maintained by the CLARIN node the National Language Bank of Sweden (SB Sam). Some topics detected with topic modelling correspond to different branches of culture, e.g., dance, film, books, theatre and music, while other topics correspond to methods for transmitting culture, e.g., museums, performances and festivals. Further, we detected groups creating or experiencing culture, e.g., girls, immigrants, children, pupils, and the youth, as well as locations where cultural events take place, e.g., public spaces, specific cities, or regions.
The content is rather uniform over the years. For the last two years, however, the content is markedly different: many topics do not occur, and there is more focus on some topics, e.g., building international networks and festivals. The same difference for the last two years was shown by the Word Rain visualisations, where some words, e.g. 'international' and 'network', are more prominent in the last two years than previously. This pilot project is a collaboration between the Department of Business Studies and the Centre for Digital Humanities and Social Sciences at Uppsala University.

Generalising Political Leaning Inference to Multi-Party Systems


Joseba Fernandez de Landa

The ability to infer the political leaning of social media users can help in gathering opinion polls, leading to a better understanding of public opinion. While there is a body of research attempting to infer the political leaning of social media users, the task has typically been simplified as a binary classification problem (e.g. left vs. right) and limited to a single location, leading to a dearth of investigation into more complex classification and its generalisability to different locations, particularly those with multi-party systems.
Our work performs the first such effort by studying political leaning inference in three of the UK's nations (Scotland, Wales and Northern Ireland), each of which has a different political landscape composed of multiple parties. To do so, we collect and release a dataset comprising users labelled by their political leaning as well as interactions with one another. We investigate the ability to predict the political leaning of users by leveraging these interactions. We show that interactions in the form of retweets between users can be a very powerful feature to enable political leaning inference, leading to consistent and robust results across different regions with multi-party systems.