Maintaining Corpus Download Facilities with Changing Technology
Daan Broeder (MPI), Mischa Sallé (NIKHEF), Tobias Valkenhoef (MPI)
At the Max-Planck Institute for Psycholinguistics (MPI), a large set of linguistic corpora from internal MPI researchers and from external projects is archived and made accessible online. The LAT software used for archiving and exploitation was developed by the MPI and is currently used by several “sister” archives.
Some years ago we saw the need, within the linguistic community as elsewhere, to integrate the existing archiving infrastructures, since this would form the basis for a viable e-Science infrastructure for the linguistic domain. A small EU project of four archiving institutions, “DAM-LR”, was created with the aim of integrating the archives at different levels. One requirement was federated login for all the archives’ users, so that, for instance, single sign-on (SSO) for distributed collections would become possible.
To realize this, the DAM-LR identity federation was created and the archives Shibbolized their web servers. A negative consequence was that local tools that had been used to copy data sets from the archives to local storage stopped functioning, since Shibboleth only addresses access by web browsers. This was unsatisfactory, especially since further integration projects in the linguistic domain, such as CLARIN, all plan to continue using federated login based on Shibboleth.
Together with the BiG Grid project and SURFnet, a project was set up to test whether certificates obtained from a Short Lived Credential Service (SLCS) could solve this problem. First, SURFnet set up an SLCS service accessible to the members of the SURFfed identity federation. Second, the MPI’s repository Apache server was configured with mod_ssl and mod_rewrite to allow client-certificate-based authentication in parallel with Shibboleth-based authentication. Third, the “IMDI-Browser”, a local tool that was originally used to download data sets from archives running the LAT archiving software, was modified to perform a handshake with the SLCS to obtain the certificate and to use it to download the items of a data set.
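The client side of this setup can be illustrated with a minimal Python sketch. It is not the IMDI-Browser's actual code. It assumes that a short-lived certificate and its private key have already been obtained from the SLCS and stored as PEM files, and that the server accepts client certificates over HTTPS. All function names, file names and URLs are hypothetical.

```python
import ssl
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def make_client_cert_context(cert_file, key_file):
    """Build a TLS context that presents a client certificate.

    cert_file and key_file are assumed to be PEM files holding the
    short-lived certificate and its private key (hypothetical paths).
    """
    context = ssl.create_default_context()  # still verifies the server
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def local_path_for(url, dest_dir):
    """Map a resource URL to a file below dest_dir, keeping the URL path."""
    relative = urlparse(url).path.lstrip("/")
    return Path(dest_dir) / relative


def download_data_set(urls, dest_dir, context):
    """Fetch every item of a data set over certificate-authenticated HTTPS."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context)
    )
    saved = []
    for url in urls:
        target = local_path_for(url, dest_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        with opener.open(url) as response, open(target, "wb") as out:
            out.write(response.read())
        saved.append(target)
    return saved
```

A caller would first build the context with `make_client_cert_context("slcs-cert.pem", "slcs-key.pem")` and then pass the list of item URLs of a data set to `download_data_set`. Because the certificate is presented during the TLS handshake, no browser-based Shibboleth redirect is involved in the download itself.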
Currently we have a working application, and the developed software can also be used to enable other tools to access resources in a similar way. Applying this in an EU-wide context such as the CLARIN project raises the question of the status of the SLCS service: should it be organized on the basis of NRENs, on an EU-wide basis, or as part of a virtual organization platform? This question may be of interest to similar integration projects.
 LAT, http://www.mpi.nl/tools/LAT
 DAM-LR, http://www.mpi.nl/DAM-LR
 CLARIN, http://www.clarin.eu
 BiG Grid, http://www.biggrid.nl
 SURFnet, http://www.surfnet.nl/
 SWITCH SLCS service, http://www.switch.ch/grid/slcs/
Software available via: