Corpora of spontaneous conversational speech are an important source of primary data for research in the humanities. Spokes is a multimedia search engine for a unique corpus of conversational Polish, which has been developed by the University of Łódź as part of the Polish CLARIN Infrastructure.
The corpus contains 2.2 million words (over 200 hours) of casual conversations which were recorded in everyday communicative contexts, transcribed, anonymised, annotated with sociolinguistic metadata and time-aligned with the original audio files.
Spokes makes it possible to query and explore this corpus through a graphical web interface with data visualisation features. Programmatic access to the textual and audio data is also possible through a set of dedicated web services. In addition to the database of conversational Polish, an experimental instance of Spokes has been made available for the spoken component of the British National Corpus at http://pelcra.clarin-pl.eu/SpokesBNC/.
By providing tools for text data mining and visualisation, we hope to make these unique data sets more easily accessible to researchers interested in exploring samples of naturally occurring spoken language.
Keywords
conversational corpora, corpus search engines, multimodal corpora
The development of Spokes has been financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.