Encoding parliamentary data in TEI

Submitted by Linda Stokman on 13 May 2019

Blog post by Tanja Wissik (Austrian Centre for Digital Humanities of the Austrian Academy of Sciences), who received a CLARIN Mobility Grant in April 2019 to visit the Jožef Stefan Institute in Slovenia.

I applied for a CLARIN Mobility Grant to visit the Jožef Stefan Institute, which is part of CLARIN Slovenia, to learn more about encoding parliamentary data in TEI. At the Austrian Centre for Digital Humanities of the Austrian Academy of Sciences, I initiated ParlAT, a corpus of Austrian parliamentary records. To date, the data has been analysed only with a corpus query system. I would like to convert the ParlAT corpus into a standard format such as TEI, so that it is interoperable with other data sets and can support research across different parliamentary corpora and other corpora.

During my research visit to Ljubljana, Slovenia, from 14 to 19 April 2019, I was hosted by the Jožef Stefan Institute and very warmly welcomed by Tomaž Erjavec from the institute and Andrej Pančur from the Institute of Contemporary History. These two researchers were the main contributors to the SlovParl corpus, a collection of Slovenian parliamentary records, and have extensive experience with processing and annotating parliamentary data in TEI. In fact, they have even proposed a standard for encoding parliamentary data called teiParla (link: https://www.clarin.eu/event/2019/parlaformat-workshop).


Fig. 1: Tomaž Erjavec, Tanja Wissik, Andrej Pančur at Jožef Stefan Institute

After an initial analysis of the data, we learned that it would need some further cleaning and processing before we could transform it into TEI.

Using Tidy and two small Perl scripts for pre- and post-processing, Tomaž Erjavec produced clean XHTML files. From these, we created a very preliminary TEI conversion, which did not yet include all the annotations that we had in our previous XML files. The next steps, therefore, will be to enrich the TEI encoding and to extract all the required structure.
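To give an idea of what such a preliminary conversion involves, here is a minimal Python sketch. It is not the conversion we actually ran during the visit; it only assumes that the cleaned XHTML keeps the running text in `<p>` elements, and the function name `xhtml_to_tei` is made up for illustration. A valid TEI document would also need a `<teiHeader>`, which is left out here.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Serialise TEI elements with a default namespace instead of an ns0: prefix.
ET.register_namespace("", TEI_NS)


def xhtml_to_tei(xhtml: str) -> ET.Element:
    """Copy the text of every XHTML <p> into a bare TEI <text>/<body>.

    Hypothetical helper: assumes well-formed XHTML as produced by Tidy.
    """
    source = ET.fromstring(xhtml)
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    text = ET.SubElement(tei, f"{{{TEI_NS}}}text")
    body = ET.SubElement(text, f"{{{TEI_NS}}}body")
    for para in source.iter(f"{{{XHTML_NS}}}p"):
        p = ET.SubElement(body, f"{{{TEI_NS}}}p")
        # itertext() flattens inline markup such as <b>; split/join
        # collapses runs of whitespace left over from the HTML layout.
        p.text = " ".join("".join(para.itertext()).split())
    return tei
```

Everything beyond plain paragraphs (speakers, agenda items, comments) is ignored in this sketch, which is exactly the structure that still has to be extracted in the next steps.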

During my stay, I also had the opportunity to meet with Andrej Pančur. We talked about the creation and development of the SlovParl corpus, from scraping the XML files, to encoding them as drama with the TEI module for performance texts, to re-encoding them as speech with the TEI module for transcriptions of speech. We also compared the structure of Slovenian parliamentary records with that of Austrian parliamentary records.

Additionally, we discussed how to apply the SlovParl model to Austrian parliamentary data. Some adaptations had to be made to the model, especially regarding the numerous comments in the Austrian parliamentary records. We decided to split these comments into <incident> for comments indicating applause or laughter and <u> (utterances) for comments indicating interjections (see Fig. 2).
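The split between the two elements can be sketched in a few lines of Python. This is only an illustration, not the procedure we used: the cue words `Beifall` (applause) and `Heiterkeit` (laughter), the helper names and the sample comments are my assumptions, and a real cue list would have to be derived from the records themselves.

```python
import xml.etree.ElementTree as ET

# Assumed cue words for non-verbal events (applause, laughter).
INCIDENT_CUES = ("Beifall", "Heiterkeit")


def encode_comment(comment: str) -> ET.Element:
    """Wrap one comment from the records in <incident> or <u>."""
    if any(cue in comment for cue in INCIDENT_CUES):
        # Non-verbal event: TEI describes it in a <desc> inside <incident>.
        element = ET.Element("incident")
        desc = ET.SubElement(element, "desc")
        desc.text = comment
    else:
        # Anything else is treated as an interjection, i.e. an utterance.
        element = ET.Element("u")
        element.text = comment
    return element


def to_xml(comment: str) -> str:
    """Serialise the encoded comment as a string, for inspection."""
    return ET.tostring(encode_comment(comment), encoding="unicode")
```

For example, `to_xml("Beifall.")` returns `<incident><desc>Beifall.</desc></incident>`, while `to_xml("Abg. Mustermann: Das stimmt nicht!")` returns `<u>Abg. Mustermann: Das stimmt nicht!</u>`. Comments that mix applause with an interjection would need to be split first, which this sketch does not attempt.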

I marked up one file manually in Oxygen, to be used as a model for the semi-automatic processing of all the other files back home.

Fig. 2: Text passages annotated with <u> and <incident>

All our work was documented in a GitLab repository of CLARIN.SI, so, as a side effect, I was able to familiarise myself with Git as a version control system.

The Mobility Grant allowed Tomaž, Andrej, and me not only to compare our two different cases, but also to discuss the encoding of parliamentary records in general and to do some hands-on work. Our knowledge exchange was very fruitful and will continue during the CLARIN ParlaFormat Workshop in May 2019 and beyond.