Blog post by Jan Niestadt: mini-workshop on Korp, Strix and BlackLab in Gothenburg

Submitted by karolina@clarin.eu on 12 December 2017

Språkbanken (the Swedish Language Bank) and the Dutch Language Institute are organisations with similar goals and ambitions. We each have our own corpus search system: Korp (based on CWB) and BlackLab. Both systems are used at institutes in several countries, and connecting as many of these systems as possible to the Federated Content Search (FCS) is important for CLARIN, as is seeing how to evolve these systems and the FCS to serve users looking for more structured ways to search. Providing similar and optimal user experiences in each search frontend is of course also important for FCS users. Furthermore, comparing user interfaces might inform how the FCS aggregator interface could evolve in the future.

All of these subjects were to be touched on while I was in Gothenburg. This visit was initially suggested by Elena Volodina, researcher at Språkbanken, who thought it would be useful to get together and discuss these various topics -- she was right! She organised the three-day programme and showed me around. In addition to collaborating on the interoperability and user experience level, ideally we would actually be able to share (parts of) our implementations and continue developing them together.

This was very much designed as a knowledge sharing mission. We wanted to discuss:

details of each other’s technologies
challenges related to scaling up to larger datasets
challenges of integrating new ways of searching, such as treebank queries
how to better assist CLARIN members who wish to expose their resources to the FCS
how to integrate lexica and treebanks into the Federated Content Search
the feasibility of adding certain features to the FCS to increase its value to researchers
how we might collaborate on one or more of the above in the future

Beforehand we thought we might even have time to start on a prototype of a collaboration project, but it turned out there was simply too much to discuss, leaving very little time for this kind of practical work. We did get to briefly go over some code together that could form the basis of a future collaboration and we think large parts of such a project could very well be done remotely.

Monday: comparing notes

On the first day, I presented BlackLab, the corpus search system I’ve developed at the Dutch Language Institute, including the AutoSearch (early beta; better version in development, see here) application that enables any user to upload data to be indexed and searched. Many of the Språkbanken people were there and even though Monday morning is a difficult time to pay full attention to a technical presentation, I was thankful they not only listened, but had some insightful questions for me at the end, sparking some lively discussions.

Språkbanken has two impressive corpus search applications, Korp (sentence-based, with advanced FCS-QL-like searching) and Strix (document-based). The Korp and Strix team generously took the time to walk me through the more advanced areas of both of these applications. Korp offers a token-based search interface on their corpora, while Strix offers a document-based interface on the same data. After that, we had lively discussions about user interface design, the difficulty of figuring out what it is that (different kinds of) users want and the finer points of corpus query optimization. Also touched on were the challenges of scalability and the promise of distributed search to deal with this, and how to enable users to configure input formats in a way that can be figured out without two PhDs in technical fields.

For more detail about what we discussed on Monday, see this highlights document.

Tuesday: talking Federated Search

I spent Tuesday in a long one-on-one session with Leif-Jöran Olsson, covering all things Federated Content Search (FCS).

We discussed the possibility of providing more generic FCS endpoint code where endpoint implementers could “plug in” an implementation for their corpus engine, either a pre-existing one or a custom one they implement themselves, and can then set up endpoints using a few simple configuration files. This is the direction INT is moving in. We will explore this possibility further, sharing our code and discussing the pros and cons of this approach.
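To make the "plug in" idea a little more concrete, here is a minimal sketch of what such a design could look like. It is not the INT or Språkbanken code: the names CorpusEngine, InMemoryEngine, ENGINES and build_endpoint, and the shape of the configuration, are all hypothetical and only illustrate the principle that generic endpoint code talks to an interface while the configuration selects the engine behind it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hit:
    """One search hit: left context, matched word, right context."""
    left: str
    match: str
    right: str


class CorpusEngine(ABC):
    """Hypothetical plug-in interface that an endpoint implementer fills in."""

    @abstractmethod
    def search(self, query: str, max_hits: int) -> List[Hit]:
        """Return at most max_hits hits for the query."""


class InMemoryEngine(CorpusEngine):
    """Toy engine for illustration: matches a single word, ignoring case."""

    def __init__(self, sentences: List[str]) -> None:
        self.sentences = sentences

    def search(self, query: str, max_hits: int) -> List[Hit]:
        hits: List[Hit] = []
        if max_hits <= 0:
            return hits
        for sentence in self.sentences:
            words = sentence.split()
            for i, word in enumerate(words):
                if word.lower() == query.lower():
                    hits.append(Hit(" ".join(words[:i]), word, " ".join(words[i + 1:])))
                    if len(hits) >= max_hits:
                        return hits
        return hits


# Registry mapping the engine name used in configuration to a class.
# A real setup would register BlackLab, Korp or a custom engine here.
ENGINES: Dict[str, type] = {"memory": InMemoryEngine}


def build_endpoint(config: dict) -> CorpusEngine:
    """Instantiate the engine named in a (hypothetical) configuration dict."""
    engine_class = ENGINES[config["engine"]]
    return engine_class(**config["options"])


if __name__ == "__main__":
    config = {"engine": "memory", "options": {"sentences": ["The cat sat", "A cat slept here"]}}
    for hit in build_endpoint(config).search("cat", 10):
        print(hit)
```

In a design like this, the generic endpoint code would only ever see CorpusEngine, so supporting a new corpus engine means writing one class and adding one line of configuration.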

One important realization was that there may be a slight difference between how CLARIAH (the Dutch CLARIN extension project) views the FCS and the role it currently plays in CLARIN. We established that within CLARIN there is a strong focus on the FCS for discovery purposes, while within CLARIAH (or at least among those colleagues of mine who are implementing FCS as part of CLARIAH) we would like to expose more search features, to use the FCS as a research tool as well. This also suggests functionality like being able to download full result sets from all endpoints, getting detailed information about the differences between tagsets and how the tagset translations happen, and so on. It is good to be aware of the differences in goals, and we have tried to see what parts of the feature requests might be easy to add in the aggregator, or could be enabled by writing a third-party script that talks to the FCS directly.

We found a similar difference in focus on discovery vs. detailed structural usage when talking about lexica. In CLARIAH, lexica are being published using Linked Open Data, and search is done using SPARQL, allowing complex queries on all of the different categories of information. CLARIN, again, is focusing on getting basic FCS support for as many resources as possible, so as to increase the usefulness of the FCS for discovery. Both lofty goals, but not easily addressed using a single approach.

The conclusion here is that CLARIAH may have to resort to its own Federated Content Search solutions, especially when it comes to lexica. Of course, we will try to stay close to the CLARIN FCS and will make sure that our endpoints integrate with that as well (more details here).

Although it is useful to realize these types of differences, so they can be discussed productively, it is also nice to be able to find common ground and arrive at more concrete, practical proposals. This was the case for treebanks. Leif-Jöran and I discussed how tree querying should be added to the FCS. Topics ranged from which query language to use, the minimum capabilities a system needs for it to count as treebank search, and visualisation libraries, to technical considerations when extending token-based search systems with tree querying functionality.

We arrived at an initial idea of extending the FCS-QL with syntax inspired by TIGERsearch’s query language as a practical, low-threshold way of seeing how treebank search might be integrated.

Treebank search is inherently difficult for users, but the most promising GUI seems to be an example-based approach. The choice of query language doesn’t matter that much, although for the FCS we would have to agree on a standard and/or provide a lossless translation library between (subsets of) query languages. Supporting a few low-hanging-fruit operations (basic tree parent/child[/sibling] relationships) first seems sensible, both because it makes endpoints easier to implement and because it makes it easier to add tree querying to existing token-based systems such as BlackLab and Korp. Here’s a document with more details. It contains some thoughts on what an extension of the FCS-QL for basic treebank search might look like, as well as some thoughts on what primitives a corpus search engine needs to execute basic treebank searches.
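As a rough illustration of the parent/child/sibling primitives mentioned above, here is a small sketch. It is not taken from BlackLab, Korp or the linked document. It assumes, purely for illustration, that each token stores the index of its parent (a dependency-style encoding), and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

# A sentence is a list of (word, parent_index) pairs; None marks the root.
# This encoding is an assumption made for the sketch only.
Sentence = List[Tuple[str, Optional[int]]]


def parent(sentence: Sentence, i: int) -> Optional[int]:
    """Index of the parent of token i, or None for the root."""
    return sentence[i][1]


def children(sentence: Sentence, i: int) -> List[int]:
    """Indices of all tokens whose parent is token i."""
    return [j for j, (_, p) in enumerate(sentence) if p == i]


def siblings(sentence: Sentence, i: int) -> List[int]:
    """Indices of the other tokens that share a parent with token i."""
    p = parent(sentence, i)
    if p is None:
        return []
    return [j for j in children(sentence, p) if j != i]


def find_parent_child(sentence: Sentence, parent_word: str, child_word: str) -> List[Tuple[int, int]]:
    """All (parent, child) index pairs where parent_word is the direct parent of child_word."""
    return [
        (p, j)
        for j, (word, p) in enumerate(sentence)
        if p is not None and word == child_word and sentence[p][0] == parent_word
    ]


# "The dog chased a cat": "chased" is the root; "dog" and "cat" depend on it.
EXAMPLE: Sentence = [("The", 1), ("dog", 2), ("chased", None), ("a", 4), ("cat", 2)]

if __name__ == "__main__":
    print(children(EXAMPLE, 2))
    print(siblings(EXAMPLE, 1))
    print(find_parent_child(EXAMPLE, "chased", "cat"))
```

A token-based engine that can answer these three questions efficiently from its index would already cover the basic relationships listed above. Richer operations such as transitive dominance could be layered on later.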

Wednesday: brainstorming about future collaborations

On Wednesday I first met with director Lars Borin and co-director Markus Forsberg to discuss how we might cooperate in the future. We agreed it was important to identify areas where we are both facing challenges and see if we can work together on solutions. Several possible projects were mentioned, including user-uploadable corpora, real-time corpus-updates, researching integration of BlackLab with Korp and/or Strix, and making progress on treebanks.

The possibility of a future hackathon was raised, where developers from INT and Språkbanken would get together and work on a specific challenge for a few days. We will discuss this possibility further over email and, if we can find an area of sufficient common interest, will try to set that up for next year.

Afterwards, I spoke some more with Leif-Jöran and some of the Korp/Strix developers to fill in a few details for the possible collaboration projects. One concrete result of this meeting was that the Korp search backend was switched from private to public on GitHub, so I (and, of course, anyone else who’s interested) could have another look at it later. Here’s a document with some more details about our ideas for what to work on together.

Scribbles: work in progress!

Conclusion

During my visit to Språkbanken in Gothenburg we discussed many different topics: the goals and future of the FCS and treebanks, user research, interface design, scalability and optimization, to name a few. That may sound like a lot for three days, but I think the wide array of topics we discussed was actually the visit’s greatest strength. It means I had the opportunity to meet many enthusiastic, knowledgeable people, and our various discussions have not only sparked our creativity, but hopefully also paved the way for future cooperation. We found that we were running into many of the same issues and had discovered some of the same solutions, but fortunately we also surprised each other from time to time with a way of handling a problem that hadn’t yet occurred to the other. It was a visit densely packed with information and occasional small epiphanies. I’m still processing it all, but it is clear to me that this was a good time investment and we should definitely stay in touch, not just to discuss shared challenges, but to seize opportunities for further collaboration.

Blog post written by Jan Niestadt who received a CLARIN Mobility Grant in November 2017.

More information about the CLARIN Mobility Grants can be found here.

Gothenburg in November 2017