Reflections on CLARIN-PLUS workshop in Sofia by Martijn Kleppe

Submitted by Karolina Badzm… on 10 April 2017

The CLARIN-PLUS workshop "Working with Parliamentary Records" was the third in a series of four organised as part of the CLARIN-PLUS project. The workshop aimed to explore how NLP technology developed within CLARIN could help with curating parliamentary records and with answering the Digital Humanities research questions that parliamentary datasets give rise to. To find out more about the workshop, and for slides and abstracts, please visit the workshop event page. Below you will find Martijn Kleppe's reflections on the workshop.

A guest blog post written by Martijn Kleppe, National Library of the Netherlands (translated from Dutch)

At the end of March 2017 some 35 researchers gathered in Sofia, Bulgaria, to discuss scholarly work based on parliamentary data in digital format (see image, taken on 28 March 2017). Central questions included: What are the opportunities and challenges for researchers working with parliamentary records within different disciplines? How might digital parliamentary collections be improved in terms of accessibility / usability / searchability / other? And how could parliamentary collections from different parliaments be linked?

I was invited to speak because, prior to my appointment at the National Library of the Netherlands (KB), I was involved in two projects in which such collections played a central role: PoliMedia and Talk of Europe.

The workshop was very inspiring and well connected to a number of current developments at KB. Of course, there was a direct link to the Staten-Generaal Digitaal project. In addition, topics such as the potential of Linked Open Data (LOD) for metadata enrichment, version control and interoperability of the datasets were addressed. Below I discuss some of the presentations I found most interesting.

Talk of Europe & SPARQL

Laura Hollink (CWI) talked about the Talk of Europe project, in which records from the European Parliament in all European languages have been made accessible as Linked Open Data. Previously, researchers had to dig through thousands of PDF files; now they can explore the data themselves through a SPARQL endpoint. Questions that can be asked include: Which political party speaks most in Parliament? Or which topics do Dutch MPs discuss in comparison to their Polish and British colleagues?

During the workshop Laura also gave a tutorial on how to formulate SPARQL queries. That proved quite a challenge, an experience shared by researchers who use the Short-Title Catalogue Netherlands (STCN) via its SPARQL endpoint. The experience of the workshop participants reflected the discussions that regularly take place in the KB Digital Humanities Team: are we serving DH researchers as well as possible with data services, or does working with such services require substantial programming knowledge? If you are interested in experimenting with the SPARQL endpoint of Talk of Europe, check out Laura’s tutorial.
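To give an impression of what formulating such a query involves, here is a minimal sketch in Python for the question "which political party speaks most?". Note that the endpoint URL, the `ex:` vocabulary and the property names (`ex:Speech`, `ex:speaker`, `ex:memberOf`) are placeholders of my own, not the actual Talk of Europe schema; the real names are covered in Laura's tutorial. The sketch only builds the query and the request URL and does not contact any server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and vocabulary: placeholders for illustration,
# NOT the real Talk of Europe endpoint or schema.
ENDPOINT = "https://example.org/sparql"
VOCAB = "http://example.org/vocab/"


def build_party_speech_count_query(limit: int = 10) -> str:
    """Return a SPARQL query that counts speeches per party (assumed schema)."""
    return f"""
PREFIX ex: <{VOCAB}>
SELECT ?party (COUNT(?speech) AS ?speeches)
WHERE {{
  ?speech a ex:Speech ;
          ex:speaker ?speaker .
  ?speaker ex:memberOf ?party .
}}
GROUP BY ?party
ORDER BY DESC(?speeches)
LIMIT {limit}
""".strip()


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request with a 'query' parameter,
    which is one of the forms the SPARQL 1.1 Protocol allows."""
    return f"{endpoint}?{urlencode({'query': query})}"


query = build_party_speech_count_query(limit=5)
url = build_request_url(ENDPOINT, query)
print(query)
```

Even this small example shows where the difficulty lies: before you can ask the question, you need to know how speeches, speakers and parties are modelled in the dataset.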

XMLification of Political Data

I am a big fan of Maarten Marx (UvA), who has been working for many years on structuring parliamentary data from the Netherlands in the Political Mashup project. He has also collaborated with the Research Department of the National Library on a number of projects.

His dream is to 'create a repository of all European parliamentary documents in a machine readable format', and he is already quite close to making that dream come true. The project's search interface lets you search parliamentary data from the following countries: the Netherlands, Canada, the UK, Denmark, Sweden, Norway and Belgium, as well as the EU. A number of tools have been included in the user interface, such as a timeline and a word cloud generator, which you can use to visualize, for example, the use of the term "Koninklijke Bibliotheek" (National Library) in the Dutch parliament (see image below).


[Image: screenshot of 31 March 2017 showing the use of the term "Koninklijke Bibliotheek" in the Dutch parliament]


German researcher Andreas Blätte (University of Duisburg-Essen) presented a similar approach in his PolMine project. This project brings together parliamentary data from a range of German national and regional sources. Blätte defined his aim as: ‘Let’s all do XMLification of Political Data’. Both Marx and Blätte chose to integrate the available data into their own platform. This is understandable, since they both have a research objective and a clear research agenda. However, from the perspective of resource sustainability this is not optimal. They spend a lot of time developing interesting resources and tools that many other researchers would like to use. But what will happen to these resources and tools if the funding runs out or if their creators move on to other jobs and places? In that scenario, the data extensions and improved annotations they created may not be preserved either. These concerns were expressed and discussed during various workshop sessions in Sofia, and they are very similar to the discussions at the National Library about, for example, version control and Linked Open Data.

The four Rs of parliamentary reporting

In addition, there was an interesting contribution by Roberto La Rocca, who works in the Dutch House of Representatives as a parliamentary stenographer/reporter and thus knows better than anyone how reports are actually produced. He warned researchers not to draw overly strong conclusions from such reports. He described how, after the initial reporting, the texts go through four procedures, which he calls the four Rs:

  • removal (e.g. of repetitions)

  • repair (e.g. of grammar mistakes)

  • reorganization (e.g. to keep the text readable)

  • rendering (e.g.?)


More parliamentary data sets

Purely by coincidence, two related services in the Netherlands were updated at the time of the workshop:

- The Open Data portal of the House of Representatives (in Dutch)

- The platform with political news (in Dutch), which now makes it possible to monitor and investigate all recent parliamentary records and reports as well as news reports



The slides of all presentations from the workshop can be found on the CLARIN website.