Advancing the Reach of Digital Humanities

Maria Skeppstedt
Submitted by Karina Berger on

A recently developed topic modelling tool has not only proven to be versatile in terms of its potential applications, but has also been used as a way to introduce less technical SSH scholars to digital methods. The tool, Topics2Themes, was developed by Maria Skeppstedt in collaboration with several researchers, including colleagues from the Swedish CLARIN node at the government agency Institute for Language and Folklore (Isof). It uses topic modelling to extract recurring content from short texts and also supports their subsequent manual analysis. The tool has been applied to a range of different text collections, including corpora consisting of tweets, editorials, cultural funding applications and medical periodicals. In her latest role, Skeppstedt is using digital humanities methods, including Topics2Themes, to reach out to new communities, helping researchers from the SSH domain who are unfamiliar with such technology to solve their research questions in a practical way. 

Topics2Themes also has the potential for wider impact: By facilitating automated analyses of large corpora of contemporary discussions (tweets, editorials) on important topics such as climate change, this tool can help to address larger socio-economic questions by identifying trends and frequent arguments, as well as ‘framing’, which can occur in the discussion of contentious topics. More widely, introducing humanities scholars to digital methods opens up new opportunities for innovation in the field of digital humanities.      

 

'Collaborating with a CLARIN node [...] provides many advantages. Firstly, researchers can collaborate when developing the tools, which speeds up our work. Secondly, it provides a stability to the tools that are being developed, which increases the likelihood that they are used and reused.'

Maria Skeppstedt 

The Tool: Topics2Themes

Skeppstedt has a background in clinical text mining and currently works as a research engineer at the Centre for Digital Humanities and Social Sciences, Uppsala University (CDHU). She originally developed Topics2Themes in order to search for frequently used arguments for and against vaccination in a large set of discussions in an online forum. Her aim was to build a tool that uses standard topic modelling to automatically extract recurring topics, and then supports the manual categorisation and analysis of them. Thus, once topics have been automatically extracted, the tool suggests labels that can be added to them, and also displays other words associated with the identified topics. Unlike some other tools, Topics2Themes was designed to capture mainly qualitative data. The tool is fully functional, but is still being further developed, mainly in the form of additional visualisations. 

Topics2Themes is a collaborative project, developed first with Skeppstedt’s colleagues in the IsoVis group at Linnaeus University, and then during her time at the Swedish CLARIN node at Isof. The tool is now maintained and further developed at Isof.

Applications

Tweets

In a project with PhD student Robin Schäfer, Skeppstedt applied Topics2Themes to the GerCCT corpus, a collection of 12,000 German tweets on the subject of climate change. Although Schäfer had conducted some natural language processing (NLP) experiments regarding argumentation on the topic, he lacked a full overview of what the frequent arguments or frequent topics in this corpus were.

Topics2Themes provided such an overview by automatically extracting the 15 most frequently recurring topics among the tweets. In a second step, the team used the tool’s graphical user interface to manually search for recurring themes among the tweets most closely associated with the topics extracted. Although the content of the tweets associated with a topic was often diverse, they were able to identify 14 recurring themes.
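The second step described above, finding the texts most closely associated with an extracted topic, can be illustrated with a minimal Python sketch. This is not the Topics2Themes implementation: the function names, the toy tweets and the topic weights are all invented for illustration, and the scoring (a weighted count of topic words) is a simplifying assumption standing in for whatever association measure a real topic model provides.

```python
from collections import Counter

PUNCTUATION = ".,!?;:'\"()"

def tokenize(text):
    """Lowercase a text and split it into words, stripping simple punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def rank_texts_for_topic(texts, topic_terms, top_n=3):
    """Return (score, text) pairs for the texts most associated with a topic.

    topic_terms maps each topic word to a weight (hypothetical values here,
    of the kind a topic model might output).
    """
    scored = []
    for text in texts:
        counts = Counter(tokenize(text))
        score = sum(weight * counts[term] for term, weight in topic_terms.items())
        if score > 0:  # skip texts that share no words with the topic
            scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Toy data, invented for this example.
tweets = [
    "Renewable energy is cheaper than coal now",
    "Coal plants must close, renewable energy can replace coal",
    "The weather was nice today",
    "Energy prices are rising",
]
energy_topic = {"renewable": 1.0, "energy": 0.8, "coal": 0.6}

for score, tweet in rank_texts_for_topic(tweets, energy_topic, top_n=2):
    print(f"{score:.1f}  {tweet}")
```

A researcher would then read the top-ranked texts for each topic and manually note the recurring themes, which is the part of the workflow that the tool's graphical user interface supports.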

The results were presented at the CLARIN conference 2022: ‘A Snapshot of Climate Change Arguments: Searching for Recurring Themes in Tweets on Climate Change’.

Topics2Themes applied to the GerCCT corpus.
 
Editorials

In collaboration with Professor Manfred Stede and a group of students at the University of Potsdam, Skeppstedt also applied Topics2Themes to a corpus of 490 editorials on the topic of climate change, published in Nature and Science over a period of 50 years. An article on this was recently published.

The corpus they worked on had already been manually annotated for thematic framing categories, and the team in Potsdam built a digital version of the corpus and attempted to automatically reconstruct the manually annotated topic frames. 

According to Skeppstedt, the automatic extraction of frequent topics delivered promising results; the team concluded that unsupervised topic modelling can indeed assist in detecting frame categories. The tool was able to reproduce almost all of the temporal ‘trends’ that had been found through manual annotation. She said: ‘Of course, if you have the time to do a full manual analysis of a large corpus, topic modelling will not replace it. However, for this example, we were able to show that similar trends could be found with topic modelling.’

Advancing Digital Humanities

In her current role as research engineer at CDHU, Skeppstedt works with researchers in SSH, teaching them how to apply digital methods and tools practically to their research. CDHU staff either work collaboratively with researchers on smaller projects, or can be hired as consultants when researchers obtain external funding for larger projects that require digital competence.

Skeppstedt reflects on her current role, noting that it has been a different way of working: ‘Everything takes longer than if you were just doing it yourself. It takes time to understand what the researcher wants, and then act on that.’ She says that the interaction with ‘less technical people’ has been useful for her, in order to understand what they need. For instance, it has helped her to improve the visualisations and graphical user interface of Topics2Themes.

The collaborations have also helped her to identify many of the unspoken assumptions which need to be bridged. She says: ‘When you work with non-technical people, they sometimes don’t know which things can be relatively quick to do, and which ones are more difficult and take a long time, or are maybe even impossible. Sometimes we lack a common language. So it’s been useful work.’

In one such recent project, a collaboration between the Department of Business Studies and the CDHU, Skeppstedt assisted a group of researchers in using text mining to study the content of more than 20,000 Swedish cultural funding applications, to determine what the frequent topics were and how these changed over time between 1994 and 2011. They used two methods: topic modelling with the Topics2Themes tool, and word cloud visualisations using the Word Rain code package (see below).

The outcome provided a useful overview of the different areas of culture typically involved, as well as methods for transmitting culture, and also offered insight into trends over the studied period. The group of researchers has applied for further funding to continue the project. 
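To make the idea of tracking topics over time concrete, the following minimal Python sketch computes, for each year, the share of documents assigned to each topic. It is an illustration only and not the method used in the project: the function name, the topic labels and the toy assignments are invented, and it assumes each document has already been given a single topic label.

```python
from collections import Counter, defaultdict

def topic_share_by_year(assignments):
    """Compute per-year topic shares.

    assignments: iterable of (year, topic_label) pairs, one per document.
    Returns {year: {topic_label: share of that year's documents}}.
    """
    counts = defaultdict(Counter)
    for year, topic in assignments:
        counts[year][topic] += 1

    shares = {}
    for year in sorted(counts):  # chronological order
        total = sum(counts[year].values())
        shares[year] = {topic: n / total for topic, n in counts[year].items()}
    return shares

# Toy data, invented for this example (not results from the project).
assignments = [
    (1994, "theatre"), (1994, "music"), (1994, "theatre"), (1994, "theatre"),
    (2011, "music"), (2011, "film"),
]

for year, topic_shares in topic_share_by_year(assignments).items():
    print(year, topic_shares)
```

Plotting such shares against the year gives a simple view of which topics grow or shrink over the studied period.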

Timeline of Topics2Themes applied to texts containing the word 'dietitian' in one of the periodicals studied in ActDisease.

Researchers in History of Science and Ideas

Skeppstedt is also part of the team of an ERC-funded digital humanities project called ActDisease. The project, which is led by Ylva Söderfeldt, is conducted in close collaboration with CDHU and the Centre for Medical Humanities at Uppsala University.

This four-year project investigates how patient organisations contributed to shaping disease concepts, illness experience, and medical practices throughout the 20th century. ActDisease spans different disciplines and methods to capture the long and broad history of patient organisations in Europe. It combines studies in historical archives and close reading of texts with computer-based analysis of sources. Skeppstedt's role is akin to a consultant, helping to apply digital methods to the dataset.

Overall, the project will study patient organisations in four European countries, more specifically the newsletters, reports, and magazines which they have published and through which they communicated with their members and wider audiences. By combining close and distant reading of these sources, the team aims to shed new light on how patients’ involvement in knowledge generation and decision-making developed over the past century.

Once the texts by patient organisations have been digitised, they will be compared to texts written by medical professionals. The team plans to use both classical methods within the humanities and digital methods to see how knowledge flowed between those two worlds - patients and their organisations on the one side, and the professional medical world on the other - and how they affected each other.

Skeppstedt is involved in the task of applying digital methods to this dataset - including the topic modelling tool Topics2Themes. The first results of this work are published in the Selected Papers from the 2023 CLARIN Conference. She says: ‘What makes this project interesting is that it's a research question within the humanities that can possibly be solved using digital methods. It might be too difficult, but it feels like it's possible. We are aiming high.’

Future Plans and Other Projects

In addition to her current role, Skeppstedt is also involved with another project called Word Rain, a collaboration between CDHU, the Swedish CLARIN node at Isof and the iVis group at Linköping University. Together, they are developing a code package for a type of word cloud visualisation, which has been applied to different types of texts, including IPCC reports and other texts about climate change.

The Word Rain text visualisation technique presents an alternative to the standard word cloud. In a standard word cloud, words are positioned randomly, and font size is an imprecise indicator of importance (it seems to give longer words more prominence). Word Rain attempts to rectify this by producing a more semantically aware visualisation of texts, in two ways. First, it uses word2vec models to position words, so that the visualisation shows which words are close to each other semantically. Second, font size is still used to indicate word prominence, but the technique adds a vertical bar to illustrate prominence as well. Visualisations using word rains were presented at the CLARIN 2023 Bazaar, and an article describing the visualisation technique has also been published.
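The two ideas described above, semantic positioning and a bar that reinforces font size, can be sketched in a few lines of Python. This is a toy layout and not the Word Rain code package: the function name, the parameters and the example numbers are invented. It assumes that each word already has a one-dimensional semantic coordinate (in practice this might be derived from word2vec vectors) and uses relative word frequency as a stand-in for prominence.

```python
def word_rain_layout(word_counts, semantic_position, width=800,
                     min_font=10, max_font=40, max_bar=200):
    """Compute a toy layout for a semantically ordered word cloud.

    word_counts: {word: frequency}, used here as a simple prominence measure.
    semantic_position: {word: float}, an assumed precomputed 1-D coordinate
        in which semantically similar words have similar values.
    Returns {word: {"x", "font_size", "bar_height"}} in pixel-like units.
    """
    top = max(word_counts.values())
    lo = min(semantic_position[w] for w in word_counts)
    hi = max(semantic_position[w] for w in word_counts)
    span = (hi - lo) or 1.0  # avoid division by zero if all positions are equal

    layout = {}
    for word, count in word_counts.items():
        prominence = count / top  # 1.0 for the most frequent word
        layout[word] = {
            # Horizontal position follows the semantic coordinate.
            "x": round((semantic_position[word] - lo) / span * width),
            # Font size and bar height both encode prominence.
            "font_size": round(min_font + prominence * (max_font - min_font)),
            "bar_height": round(prominence * max_bar),
        }
    return layout

# Toy data, invented for this example.
counts = {"climate": 10, "warming": 5, "policy": 2}
positions = {"climate": 0.0, "warming": 0.1, "policy": 1.0}

for word, attrs in word_rain_layout(counts, positions).items():
    print(word, attrs)
```

Because the bar height scales directly with prominence and does not depend on word length, it gives a reading of importance that is less affected by long words than font size alone.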

Word Rain can be used as a stand-alone tool, but is also useful for illustrating the outcomes of topic modelling work carried out with Topics2Themes. Skeppstedt will, together with the Swedish CLARIN node at Isof, continue to refine both tools and welcomes feedback from users. For questions regarding the work conducted at Isof, Magnus Ahltorp is the main person responsible. 

Views on CLARIN

'Collaborating with a CLARIN node situated at a government agency provides many advantages. Firstly, we can collaborate when developing the tools, which speeds up our work. Secondly, it provides a stability to the tools that are being developed, which increases the likelihood that they are used and reused. Researchers typically move between different temporary projects, which can mean that tools developed within projects are not taken care of when the project ends. With a connection to a CLARIN node, this can be avoided. In addition, to be able to present our work at CLARIN conferences and receive feedback from this community is very helpful.'

Maria Skeppstedt

Publications and Other Resources

Skeppstedt, M., Aangenendt, G., Danilova, V., Söderfeldt, Y. (2024) Topics in Periodicals from the Swedish Diabetes Association 1949–1990: Extending the Topic Modelling Tool Topics2Themes with a Timeline Visualisation. Selected Papers from the CLARIN Annual Conference 2023

Skeppstedt, M., Ahltorp, M., Kucher, K., Lindström, M. (2024) From Word Clouds to Word Rain: Revisiting the Classic Word Cloud to Visualize Climate Change Texts. Information Visualization, Sage

Stede, M., Bracke, Y., Borec, L., Kinkel, N.L. and Skeppstedt, M. (2023) Framing climate change in Nature and Science editorials: applications of supervised and unsupervised text categorization. Journal of Computational Social Science. London: Springer

Skeppstedt, M. and Schaefer, R. (2022) A Snapshot of Climate Change Arguments: Searching for Recurring Themes in Tweets on Climate Change. CLARIN Annual Conference Proceedings, 2022.

Skeppstedt, M., Ahltorp, M., Eriksson, G., Domeij, R. (2021) A Pipeline for Manual Annotations of Risk Factor Mentions in the COVID-19 Open Research Dataset. Selected Papers from the CLARIN Annual Conference 2020

Skeppstedt, M., Kucher, K., Stede, M., Kerren, A. (2018) Topics2Themes: Computer-Assisted Argument Extraction by Visual Analysis of Important Topics. Proceedings of the 3rd Workshop on Visualization as Added Value in the Development, Use and Evaluation of Language Resources at LREC

CLARIN Conference presentation on Topics2Themes (YouTube)

Open source code for Topics2Themes

Web service to upload text files to generate a Word Rain visualisation

Open source code for Word Rain

Contributors

Maria Skeppstedt, research engineer at the Centre for Digital Humanities and Social Sciences, Uppsala University (CDHU).