Blog post written by Fabio Ardolino, Silvia Calamai, Letizia Cirillo, Duccio Piccardi and Caterina Pesce about the CLARIN tutorial “Creating, Managing and Analysing Speech Databases using BAS Services and Emu”, which was given by Christoph Draxler and Florian Schiel during AISV 2019.
AISV 2019 in Arezzo
From 14 to 16 February 2019, the Arezzo campus of the University of Siena hosted the 15th Italian Association of Speech Sciences Conference (AISV 2019). This year the conference was devoted to audio archives, cross-disciplinary artefacts employed in different research fields, such as linguistics and speech technologies, oral history, ethnography, anthropology, sociology, and psychology. About 100 participants attended the conference. The keynote lecture was given by Franciska de Jong (Executive Director of CLARIN ERIC) on “Spoken Word Archives as Societal and Cultural Data”.
During the conference, special emphasis was placed on the legal aspects involved in collecting and (re)using audio archives, on how to ensure the proper preservation and metadata description of archives, and on possible ways to promote closer collaboration between linguists, speech scientists, speech technologists and oral historians.
The event opened with the CLARIN tutorial “Creating, Managing and Analysing Speech Databases using BAS Services and Emu” by Christoph Draxler and Florian Schiel (Phonetik und Sprachverarbeitung - LMU München), which aimed to present and discuss a number of resources available on the BAS web services page for researchers who work with empirical speech data.
The CLARIN tutorial was divided into six sections, each dedicated to a specific tool. The first was SpeechRecorder, an independent multi-channel audio software for the recording of speech, which simplifies the creation of speech databases via recording scripts. A speech database requires high-quality recordings; therefore, it is important to use suitable microphones, to regulate the volume, to focus on the speaker and to be aware of environmental noise. The tutorial then moved on to ASR (Automatic Speech Recognition), a useful tool for obtaining a preliminary automatic transcription. ASR converts audio and video files into plain text. In general, ASR is offered by external providers and runs on their servers, which means that recordings containing sensitive data cannot be processed via ASR. Since ASR output is often not good enough to serve as a usable orthographic transcription, it is often necessary to correct it with a manual transcription editor like OCTRA, which runs in a browser and supports many file formats.
The second part of the tutorial focused on phonetic segmentation. Chunker is a background web service which cuts long signal files into shorter segments for faster processing. Once the signal file is divided into segments, it is possible to move on to a phonetic analysis using WebMAUS, a web service that helps to derive a word, syllable or phone segmentation, and to label it. This tool relies on statistical language models to hypothesise the most likely segmentation of a speech unit. The final part of the tutorial introduced the Emu-SDMS (Emu Speech Database Management System), an all-in-one solution to create, annotate, and manipulate speech databases; participants were also shown how to use the Emu-SDMS tools to perform signal analysis.
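To make the idea of cutting a long recording into shorter segments more concrete, here is a minimal Python sketch. It is not the algorithm used by the BAS Chunker, whose internals are not described in this post; it swaps in a much simpler energy-based heuristic that places each cut in the quietest short window it finds, so that cuts tend to fall in pauses rather than in the middle of a word. The function name `split_at_quiet_points` and all of its parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def split_at_quiet_points(
    samples: Sequence[float],
    sample_rate: int,
    max_chunk_s: float,
    window_s: float = 0.02,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges, each at most max_chunk_s long.

    Illustrative heuristic only (not the BAS Chunker): each cut is placed
    in the quietest analysis window found in the second half of the
    allowed chunk length.
    """
    max_len = round(max_chunk_s * sample_rate)
    win = max(1, round(window_s * sample_rate))
    if max_len < 2 * win:
        raise ValueError("max_chunk_s must cover at least two analysis windows")

    def energy(pos: int) -> float:
        # Sum of squared amplitudes in the window starting at pos.
        return sum(x * x for x in samples[pos:pos + win])

    chunks: List[Tuple[int, int]] = []
    start = 0
    # Keep cutting while the remaining signal is longer than one chunk.
    while len(samples) - start > max_len:
        lo = start + max_len // 2      # earliest allowed window start
        hi = start + max_len - win     # latest allowed window start
        quietest = min(range(lo, hi + 1, win), key=energy)
        cut = quietest + win // 2      # cut in the middle of that window
        chunks.append((start, cut))
        start = cut
    chunks.append((start, len(samples)))
    return chunks


if __name__ == "__main__":
    # Synthetic 5-second "recording" at 100 Hz with two short pauses.
    signal = [1.0] * 500
    for i in list(range(150, 170)) + list(range(330, 350)):
        signal[i] = 0.0
    print(split_at_quiet_points(signal, sample_rate=100, max_chunk_s=2.0, window_s=0.1))
```

With real recordings, the samples would come from an audio file rather than a synthetic list, and the resulting ranges could then be processed one by one by a segmentation tool.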
Interest in the creation, management and analysis of speech databases led 26 researchers (from Italy, Albania, Germany, Switzerland, La Mayotte, and Belgium) to join the tutorial: linguists, oral historians, speech technologists, and professionals working with cultural heritage and audio archives came to Arezzo from numerous universities, institutions, libraries and museums (Siena, Pisa, Naples “Federico II”, Cagliari, Milan Politecnico, Cologne, Bolzano, Ghent, Venice “Ca’ Foscari”, Soprintendenza Archivistica e Bibliografica della Toscana, …).
Among these, there were various members of the Italian Association of Speech Sciences, with different academic positions and research interests (sociophonetics, second and third language acquisition, speech signal analysis, NLP, language variation and so on). The tutorial was a much-needed recap of the available tools to build speech databases and to work with speech signals.
Nine members of the Italian Oral History Association (AISO) took part in the CLARIN tutorial. They were young Italian researchers who had recently graduated or were attending PhD and post-doc programmes. They already had some experience in collecting interviews, but they were not familiar with speech recognition, automatic transcription and alignment tools.
The voice of linguists
The tutorial received positive feedback from linguists. The tools included in the presentation may simplify experimental procedures and resolve some issues in many of the research fields in which the participants are involved. SpeechRecorder proved to be the most versatile tool, as it is useful both for building oral archives and for generating speech samples for more sophisticated analyses. The resources used to perform speech transcription and annotation were known only to a subgroup of the participants, specifically those with a background in phonetics or speech research. Nevertheless, the step-by-step explanation of the tools provided deeper knowledge and resolved some doubts about their application. Some participants asked about the languages covered by the presented tools and how these manage to link signals to language-specific units (e.g., lexemes). Moreover, linguists showed interest in the potential integration of finer levels of phonetic analysis into the segmentation process, such as closure, burst and voice onset time for the plosive class. The open and interactive format of the tutorial made it possible to hold short discussion sessions with the participants. Linguists put forward ideas to overcome some of the limits of the presented technology, such as using YouTube's automatic subtitle generation to improve the quality of the ASR output. Unfortunately, since Florian Schiel could not make it to the tutorial, the planned exploration of statistical resources for phonetics was drastically shortened; some of the attending linguists, while satisfied with the tutorial overall, complained about this change.
The voice of speech technologists
It was rather interesting to discover all the ASR services available in the portal, and it was good to know that transcriptions can be revised by means of other tools such as OCTRA or the recent OH-Portal, which allows the researcher to carry out different types of analysis in a single tool. In future, it might be useful to organise tutorials in which all participants can perform tasks themselves in order to verify the functioning of the tools in a guided manner. In addition, it might be useful to offer more practical and methodological information on the use of Emu for the statistical analysis of values extracted from speech. Young researchers would benefit a lot from that.
The voice of oral historians
Young oral historians were mainly interested in automatic transcription. Everybody asked for more time to practise with the tools under the supervision of Christoph Draxler. However, some of them experimented with ASR by themselves the day after the tutorial. They did not follow Draxler’s suggestions: they recorded a short audio clip without microphones and with a lot of background noise, and they were surprised by the result. While showing a growing interest in these instruments, oral historians are at the moment not able to say to what extent ASR could be useful for the transcription of an oral history interview. The segmentation software integrated in the pipeline, which they used to divide their files, took a very long time to display the fragments, and the transcription produced by ASR is still not good enough, as it requires manual correction. Oral historians wondered whether becoming proficient with these tools would result in a faster transcription process. Moreover, they pointed out other limitations:
- They believe that the precautions taken to obtain high quality recordings may affect the intuitive aspects of oral history’s methodology, which privileges the relationship with the interviewee. Indeed, some oral historians pointed out that even the choice between audio and video recording is still debated among Italian oral historians and that the interviewee should be comfortable with all field instruments. For example, an interviewee who belongs to a radical political movement could be generally suspicious towards the interview itself and the situation could get worse because of the presence of professional instruments.
- Everybody was worried about external servers retaining sensitive data and using it improperly.
Despite these concerns, everybody considered the discovery of these tools a great opportunity and hoped to continue the collaboration with speech scientists and speech technologists.
We would like to thank the following colleagues for contributing to this blog text:
Claudia-Roberta Combei, Carlotta De Sanctis, Maria Di Maro, Jessica Matteo, Chiara Paris, Chiara Scarselletti, Ottavia Tordini, Patrick Urru.
Moreover, we are very grateful to CLARIN ERIC for the opportunity to hold this tutorial.