The Glossa corpus search system


At the Text Laboratory, University of Oslo, we are currently developing a new version of Glossa, our corpus search and post-processing tool. Glossa lets a user search a text corpus, or a set of corpora, for one or more words, phrases or grammatical constructions.

Queries can be specified using a combination of concrete word forms, lemmas, and grammatical features. If the corpus texts have been associated with metadata (such as author or publication date), the search results can be filtered by selecting a set of values from one or more metadata categories. Glossa offers an intuitive, graphical search interface that spares the user much of the technical knowledge that other search systems often require, such as the query language of the underlying search engine, the names of grammatical tags, and the names of metadata categories and their values.
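As a rough illustration of how such a query could be assembled behind a graphical interface, the following Python sketch combines word forms, lemmas and grammatical features into a bracketed expression in the style of the CQP query language of the IMS Corpus Workbench, and filters texts by selected metadata values. The class and function names, the attribute names (word, lemma, pos) and the metadata layout are assumptions made for this example; it is not Glossa's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Token:
    """One position in the query; every constraint is optional."""
    word: Optional[str] = None
    lemma: Optional[str] = None
    # Grammatical features, e.g. {"pos": "noun"}. Attribute names are
    # corpus-specific; "pos" is only an assumed example.
    features: Dict[str, str] = field(default_factory=dict)


def token_to_cqp(token: Token) -> str:
    """Render one token as a CQP-style bracket expression.

    Values are inserted as-is; a real implementation would have to
    escape quotes and regular-expression metacharacters.
    """
    constraints = []
    if token.word is not None:
        constraints.append(f'word="{token.word}"')
    if token.lemma is not None:
        constraints.append(f'lemma="{token.lemma}"')
    for name, value in sorted(token.features.items()):
        constraints.append(f'{name}="{value}"')
    # A token without constraints becomes "[]", which matches any token.
    return "[" + " & ".join(constraints) + "]"


def query_to_cqp(tokens: List[Token]) -> str:
    """Join the token expressions into a sequence query."""
    return " ".join(token_to_cqp(t) for t in tokens)


def filter_by_metadata(texts: List[dict], selection: Dict[str, Set]) -> List[dict]:
    """Keep texts whose metadata matches every selected category.

    `selection` maps a category name to the set of accepted values.
    """
    return [
        text for text in texts
        if all(text.get(cat) in allowed for cat, allowed in selection.items())
    ]
```

For example, `query_to_cqp([Token(lemma="go"), Token(features={"pos": "noun"})])` yields `[lemma="go"] [pos="noun"]`, and the user never has to type the bracket syntax or remember the tag names.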

While the old version of Glossa required corpora to be encoded with the IMS Corpus Workbench, the new version is flexible with respect to search engines. This should make it easier to integrate existing corpora, stored in existing search systems, into Glossa (although some programming may be needed in order to set up the communication between Glossa and the existing system).

Also, unlike the old version, the new Glossa version is not restricted to searching corpora that are located on the same server or at the same institution as the Glossa installation itself. In particular, Glossa can be used to search a collection of corpora that are located on different servers in the CLARIN infrastructure. This means that all corpora that are made available online through the so-called CLARIN federated content search will be searchable by any Glossa installation.
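To give an idea of what searching several remote servers involves, the following Python sketch builds one search request per endpoint. It assumes that the remote corpora expose SRU 1.2 "searchRetrieve" endpoints, the protocol on which CLARIN's federated content search is based; the endpoint URLs are placeholders and the function names are invented for this example. No request is actually sent, and nothing here is taken from Glossa's own code.

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_search_url(endpoint: str, query: str, max_records: int = 10) -> str:
    """Build an SRU 1.2 searchRetrieve URL for one endpoint."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "maximumRecords": max_records,
    }
    # urlencode percent-encodes the values and keeps the dict order.
    return endpoint + "?" + urlencode(params)


def fan_out(endpoints: List[str], query: str) -> Dict[str, str]:
    """Prepare the same query for every remote endpoint."""
    return {endpoint: build_search_url(endpoint, query) for endpoint in endpoints}


# Placeholder endpoints, not real CLARIN addresses.
requests_to_send = fan_out(
    ["https://corpus-a.example.org/sru", "https://corpus-b.example.org/sru"],
    "hus",
)
```

A client would then fetch each URL, parse the returned records, and merge the hits into a single result list.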

The new version of Glossa is also much easier to install than the old one. This is particularly important with the possibility of federated content search, since it means that even institutions that do not have any corpora of their own can install Glossa on their servers and use it to search the available corpora in the CLARIN infrastructure. Researchers will even be able to install Glossa on their laptops and use it to search CLARIN corpora or their own private corpora.

The new Glossa version is undergoing heavy development, and significant parts of the functionality of the old version are still missing from the new one. However, querying remote corpora in the CLARIN infrastructure already works, and searches in local corpora support grammatical queries and metadata restrictions through three different search interfaces: a simple ("Google-like") search box, an extended view with menus and check boxes for specifying grammatical searches, and a regular expression view for advanced users. The three views are kept in sync, so the user can switch between them while specifying a query.
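One way to keep several query views in sync is to treat a single query expression as the shared state and let each view translate to and from it. The Python sketch below does this for the simple search box only, assuming a CQP-style bracket expression as the shared form. The function names and the choice of shared representation are assumptions for illustration, not a description of how Glossa implements it.

```python
import re
from typing import Optional

# Matches one token that constrains nothing but the word form.
WORD_TOKEN = re.compile(r'\[word="([^"]*)"\]')


def simple_to_expression(text: str) -> str:
    """Translate the text of the simple search box into the shared expression."""
    return " ".join(f'[word="{w}"]' for w in text.split())


def expression_to_simple(expr: str) -> Optional[str]:
    """Translate the shared expression back into simple-box text.

    Returns None when the expression uses anything beyond plain word
    forms (lemmas, tags, regular-expression operators), in which case
    the simple view cannot display the query faithfully.
    """
    normalized = " ".join(expr.split())
    words = WORD_TOKEN.findall(normalized)
    # Round-trip check: rebuilding from the extracted words must give
    # back exactly the expression we started from.
    if simple_to_expression(" ".join(words)) != normalized:
        return None
    return " ".join(words)
```

The round-trip check makes the limitation explicit: every simple query can be shown in the more expressive views, but a query such as `[lemma="go"]` has no equivalent in the simple box, so an interface would have to decide how to handle the switch back.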

In the future, the new Glossa version will get many additional features such as support for multilingual (parallel) corpora, speech corpora with audio, video, and geographical maps, saving of search results, user annotation of search results, and statistical information such as frequency lists, collocations, and metadata distributions.


Corpus linguistics, general linguistics, phonology, morphology, syntax, semantics, pragmatics, clinical linguistics, discourse analysis, lexicography, literature studies, sociology (especially with respect to speech corpora with audio and video), history, cultural studies

University of Oslo
Project leader: Anders Nøklestad
Contact email

Technical development: Anders Nøklestad (main developer), Joel
Priestley, Michał Kosek, André Lynum
Functionality design and testing: Kristin Hagen, Janne Bondi Johannessen
Organisation: The Text Laboratory at the Department of Linguistics and
Scandinavian Studies, University of Oslo
Project/funding: CLARINO and the Dept. of Linguistics and Scandinavian
Studies, University of Oslo