State-of-the-Art Speech Recognition for Understanding Oral Histories

Jan Švec, Martin Bulín, Pavel Ircing, Adam Frémund, Filip Polák
Submitted by Karina Berger on 13 July 2023

Many audio and video interviews are very long and unstructured, and not easily usable as research data. A team from CLARIN’s Czech node recently presented their state-of-the-art system that uses speech recognition and other natural language processing (NLP) technologies designed specifically for oral history archives. The technology aims to make oral history data more accessible to researchers as well as the general public, opening up the wide range of audio material that is available in archives across the globe.

Called ‘Semantic Search – Asking Questions’, this new machine learning technology enables users to easily access oral history archives and engage with them in an intuitive and interactive way. The technology makes it possible to navigate long sequences of oral recordings by providing pre-generated, time-stamped questions that guide users through the content. In addition, a specialised search function enables direct interaction with the content in the video.

The technology enables users to interact with the testimonies.
The software was developed by three LINDAT/CLARIAH-CZ team members, Jan Švec, Martin Bulín and Pavel Ircing, from the AI Lab of the Department of Cybernetics, University of West Bohemia in Pilsen, as well as two PhD students, Adam Frémund and Filip Polák. The lab focuses on AI technologies and applied AI, mostly speech technologies and other NLP-related systems, but also machine learning techniques for image processing. Švec and Bulín recently presented their application at the joint EHRI-CLARIN workshop ‘Making Holocaust Oral Testimonies More Usable as Research Data’, which took place in May in London, highlighting its suitability for use in the digital humanities (DH).

‘It all started with the cooperation with Steven Spielberg, who started the Shoah History Foundation after the film Schindler’s List was released. They had hundreds of hours recorded on VHS tapes, and when they digitised them, they realised that it was impossible to find anything in them. This planted the seed of our current project.’
Pavel Ircing

'Asking Questions Framework'

The team’s core project uses data from USC Shoah Foundation's Visual History Archive, which is provided at the Malach Center for Visual History at the Charles University in Prague. The project was originally inspired by a collaboration with Steven Spielberg more than 20 years ago. After releasing the film Schindler’s List, the director established the USC Shoah Foundation in order to collect testimonies of Holocaust survivors. However, when faced with hundreds of hours of unstructured recordings, he sought help to deal with this important, yet inaccessible, historical source. Since then, the group at the Department of Cybernetics has been experimenting with different speech recognition and information retrieval systems.
They began to develop a speech recognition system specifically for oral histories in Czech and English, and later also in German and Slovak, which they presented to an audience of oral historians and other scholars working in DH at the EHRI-CLARIN workshop.
Once the basic technology for speech recognition and thus good-quality transcriptions was developed, the team started to develop other technologies that take those transcriptions as input and provide higher-level information. Their aim was to make the often lengthy testimonies more accessible to users, including the general public and children.
Mindful of not wanting to change the testimonies, the team settled on the concept of guiding users through the videos with artificially generated questions. Švec notes: ‘We didn’t want to change the meaning of the testimonies, because it's a very sensitive subject. And so we decided to artificially generate questions because [the people in the videos] tend to give long monologues about what happened. And it is very hard to start somewhere in the middle of this. So you have to listen to it from the beginning, which takes up a lot of time. But if you artificially create the questions and assign them, they naturally continue with the testimony of the Holocaust survivor.’
In the application, the video and metadata are visible on the left, and all generated questions with their answers on the right, with continuity scores and sorted by timestamps.


The software is based entirely on neural networks. All steps regarding audio and text handling are processed offline on a back-end Python server that also takes care of the metadata for the videos. The speech processing technology involves tailored speech-to-text recognition, search methods (terms and phrases), speech understanding methods (named entities, segmentation) and automatic subtitling, making it useful for a wide range of archives and disciplines.

The team decided against using OpenAI’s Whisper for transcribing the testimonies, for two reasons. First, they wanted to ensure that their technology is suited to the specificities of the Holocaust testimonies: the recorded speakers are relatively old, are not native speakers, and use words and a register that differ from today’s generic data from the web. Often, the testimonies are emotional. Combined, this makes it quite a different task for speech recognition, so the researchers designed the speech recogniser specifically for oral histories, retraining their system from scratch.

Second, the team wanted to safeguard the original meaning of the testimonies. OpenAI’s encoder-decoder architecture can generate words that do not occur in the original speech, because they match the context or the surrounding words (for example, the number 43 may become 1943); a more ‘traditional’ AI model seemed a more appropriate choice.

‘It's clearer what happens with the data with our technology. When using OpenAI technology that can be stored in their cloud, you don't know how it was trained, and how it performs on your data. We measured the performance on oral histories, so we know how it works under these precise conditions.’
Martin Bulín

Semantic Search Feature

The team then added further features to improve the accessibility of the videos in the form of a semantic search. This powerful addition to search interfaces allows the user to search not for specific words or phrases, but for passages with a meaning related to a search phrase. Users can type in a question; it is sent to the back-end server, which finds the 20 closest pre-generated questions and presents them to the user. Using a semantic search significantly increases the chance of finding relevant information, as queries are not limited to a single keyword.
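The lookup of the closest pre-generated questions can be illustrated with a minimal sketch. It is not the team's implementation: the article does not say how questions are embedded or compared, so the `embed` function below is a toy bag-of-words stand-in for a neural sentence encoder, cosine similarity is an assumed distance measure, and all function names are hypothetical. Only the default of 20 results comes from the description above.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a neural sentence encoder (assumption):
    a bag-of-words count vector over lower-cased tokens."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_questions(query, questions, k=20):
    """Return up to k (score, question) pairs, best match first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(q)), q) for q in questions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


pregenerated = [
    "Where were you born?",
    "What happened in 1943?",
    "Who helped your family?",
]
top = closest_questions("Where were you born", pregenerated, k=2)
```

With a real sentence encoder in place of `embed`, a query would also match questions that share its meaning without sharing its words, which is the point of semantic search.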

The semantic search function was then expanded to also enable speech queries, meaning that users can ‘interact’ with the person in the video. The speech is recognised by the team’s in-house technology Speechcloud (an automatic speech recognition module based on Wav2Vec 2.0). The feature was originally developed for English, and it is now also possible to ask questions in Czech. The recognised utterance is automatically translated in real time into English using the LINDAT translator. The process of finding the closest match in the pre-generated questions is the same as before; the answers are the original audio track of the interview.
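The order of the steps in a spoken query (recognition, translation into English, then the same question matching as for typed queries) can be sketched as follows. This is an illustrative outline only: the three callables are hypothetical placeholders passed in by the caller and do not reflect the actual interfaces of Speechcloud or the LINDAT translator.

```python
def answer_spoken_query(audio, recognise, translate, search, spoken_lang="cs"):
    """Route a spoken question through recognition, translation and search.

    recognise, translate and search are hypothetical placeholders
    supplied by the caller, not the team's real API.
    """
    text = recognise(audio, spoken_lang)           # speech -> text
    if spoken_lang != "en":
        text = translate(text, spoken_lang, "en")  # e.g. Czech -> English
    return search(text)                            # closest pre-generated questions


# Toy stand-ins so the sketch runs end to end.
def fake_recognise(audio, lang):
    return audio["transcript"]


def fake_translate(text, source, target):
    return {"Kde jste se narodil?": "Where were you born?"}.get(text, text)


def fake_search(text):
    return [f"match for: {text}"]


result = answer_spoken_query(
    {"transcript": "Kde jste se narodil?"},
    fake_recognise, fake_translate, fake_search,
)
```

Passing the components in as functions keeps the sketch independent of any particular recogniser or translator, so adding a language would only mean supplying different callables.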

The main interface of the application. The online demo is available here.

‘The ultimate goal is that you can interact with the person in the video using your own voice and your own language.’
Jan Švec
In the future, the team would like to expand the set of supported languages and add speech-to-speech translation. That way, users could ask a question in their preferred language and get the answer in the same language, regardless of the original language of the testimony. As a result of the EHRI-CLARIN workshop, Švec and Bulín have begun cooperating with a team in Italy in order to develop an Italian speech recogniser.
A demo of the ‘Semantic Search: Asking Questions Framework’ application is available here. It uses data from USC Shoah Foundation's Visual History Archive, provided at the Malach Center for Visual History at the Charles University in Prague. Code examples are available in the UWB Repository (here).

The main page of the application contains 36 interviews taken from USC Shoah Foundation videos on YouTube. After selecting a video, users see the video and metadata on the left of the screen, and all generated questions, with brief answers, on the right, along with continuity scores and timestamps. Clicking on a question plays the video from the corresponding timestamp. This video outlines the main features of the application.
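One way to picture the question list described above is as a collection of records, each holding a question, a brief answer, a continuity score and a start time. The sketch below is an assumption about how such data could be modelled; the field names, the units and the helper functions are hypothetical and are not taken from the application itself.

```python
from dataclasses import dataclass


@dataclass
class GeneratedQuestion:
    text: str             # the artificially generated question
    answer: str           # brief answer shown next to the question
    start_seconds: float  # where the answer starts in the video (assumed unit)
    continuity: float     # continuity score (scale is an assumption)


def ordered_for_display(questions):
    """Sort questions by their position in the video."""
    return sorted(questions, key=lambda q: q.start_seconds)


def format_timestamp(seconds):
    """Render a playback position as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


questions = [
    GeneratedQuestion("What happened next?", "(brief answer)", 3725.0, 0.8),
    GeneratedQuestion("Where were you born?", "(brief answer)", 42.0, 0.9),
]
listing = [(format_timestamp(q.start_seconds), q.text)
           for q in ordered_for_display(questions)]
```

Clicking a question would then amount to seeking the video player to that record's `start_seconds`.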

Once released, the technology will have a wide range of applications. Bulín says: ‘It could be used for almost any testimony, including recent media formats such as podcasts and broadcast news. In general, it could be applied to any textual data, such as scanned documents, which opens up a lot of possibilities.’


EHRI-CLARIN Workshop ‘Making Holocaust Oral Testimonies More Usable as Research Data’

‘If the technology can be applied within an entire collection, it can really save time and support the explorative phase of your research. It also works as an accelerator for associations and making sense of connections between words and themes.’ 
Stefania Scagliola, historian
Read more about the workshop in the blog 'Using Holocaust Testimonies as Research Data' on the CLARIN-UK website and the news article 'EHRI in the UK | Three Workshops' on the EHRI website.

The team is part of the Department of Cybernetics, NTIS - New Technologies for Information Society, University of West Bohemia, Pilsen, Czech Republic:
Jan Švec, Researcher
Martin Bulín, Researcher  
Pavel Ircing, Associate Professor
Adam Frémund, PhD candidate
Filip Polák, PhD candidate