Sidsel Boldsen is a PhD student in Natural Language Processing (NLP) and digital humanities, with a special interest in historical languages and linguistic knowledge representation. She has successfully collaborated with the Danish CLARIN K-Centre DANSK.
Guest blog post by DEMBA Kandeh that chronicles and reflects on a five-day workshop: “Digital Humanities – the perspective of Africa”, recently organized at the Lorentz Centre in Leiden.
The Austrian Baroque Corpus is a digital collection of printed German language texts dating from the Baroque era, now freely available through the Austrian Centre for Digital Humanities: https://acdh.oeaw.ac.at/abacus/
At present, the digital collection holds several texts specific to the memento mori genre written by, or ascribed to, Abraham a Sancta Clara (1644-1710), a renowned Augustinian monk and a widely read author throughout Europe in his time. All of the texts (sermons, devotional books and works related to the dance-of-death theme) have been enriched with different layers of structural information and tagged using automated tools adapted to the specific needs of the language of the period. One important achievement of the project is that each historic word form occurring in the texts has been electronically mapped to its corresponding lemma in High German and corrected or verified by domain experts. Throughout all phases of the workflow, the interdisciplinary team (literary, linguistic, and text technology specialists) insisted on high-quality linguistic and semantic annotation, creating a sound basis for sophisticated research questions.
The present corpus was compiled between 2010 and 2015 at the Institute for Corpus Linguistics and Text Technology (ICLTT) and at the Austrian Centre for Digital Humanities (ACDH) of the Austrian Academy of Sciences, alongside two associated research projects: “Text‐Technological Methods for the Analysis of Austrian Baroque Literature” (March 2012 – September 2014, supported by funds of the Österreichische Nationalbank, Anniversary Fund) and “Mortuary Cult in 17th Century Vienna: Confraternity Studies in the Digital Age” (June 2014 – May 2015, supported by funds of the City of Vienna).
This showcase is an example of how language technology can be exploited in research within the humanities. The resource that this case is based on is Gesta Danorum, written about 1200 by the Danish historian Saxo. Gesta Danorum is written in High Latin and describes, in 16 books, the period from King Dan to Canute VI of Denmark. Traditionally, the work is divided into two main sections: books 1-9, which deal with Norse mythology, and a historical second part, books 10-16, which describes the introduction of Christianity in Denmark. A competing thesis was put forward by Skovgaard-Petersen (1969). In this analysis the composition of Gesta Danorum is split into books 1-8 and books 9-16. These two competing interpretations can be paraphrased as a question: Is it book 9 or book 10 that represents the transition from the heathen to the Christian period in Gesta Danorum? To find evidence bearing on this question, a platform with embedded linguistic information and advanced search facilities was used to identify subject-area-specific elements in the various books of Gesta Danorum and to display the search results in a manageable way.
The procedure was to take a translation of Gesta Danorum and then compute PoS and lemma information automatically. To give an example of the outcome of the automatic processing, the sentence “Kongen blev kronet på slottet” (“the King was crowned at the castle”) was annotated as follows:
Kongen/konge/NN_COM_SING_DEF blev/blive/V_INDIC_PAST kronet/krone/V_PARTC_PAST på/på/PREP slottet/slot/NN_NEUT_SING_DEF
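The annotated line above follows a word/lemma/tag pattern, with tokens separated by spaces. As a minimal sketch of how such a line could be read back into structured form (the `Token` type and the function name `parse_annotated` are illustrative assumptions, not part of the showcase's tooling):

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    lemma: str  # dictionary form assigned by the lemmatizer
    pos: str    # part-of-speech tag assigned by the tagger


def parse_annotated(line: str) -> List[Token]:
    """Split a 'word/lemma/TAG word/lemma/TAG ...' line into tokens.

    Assumption: tokens are whitespace-separated and the last two
    slashes in each item separate lemma and tag.
    """
    tokens = []
    for item in line.split():
        # Split from the right so a slash inside the word itself
        # would not break the lemma and tag fields.
        word, lemma, pos = item.rsplit("/", 2)
        tokens.append(Token(word, lemma, pos))
    return tokens


sentence = (
    "Kongen/konge/NN_COM_SING_DEF blev/blive/V_INDIC_PAST "
    "kronet/krone/V_PARTC_PAST på/på/PREP slottet/slot/NN_NEUT_SING_DEF"
)
tokens = parse_annotated(sentence)
for token in tokens:
    print(token.word, token.lemma, token.pos, sep="\t")
```

Keeping word, lemma and tag as separate fields is what later allows queries to address each layer independently.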
The final step was to upload the annotated version of Gesta Danorum into the IMS Open Corpus Workbench (open source software). This platform made it possible to make queries that exploit both the linguistic information and the Corpus Query Processor (CQP) search facilities embedded in this platform.
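The Corpus Workbench reads corpora in a verticalized text format: one token per line, annotation columns separated by tabs, and structural units such as books marked with XML-style tags on their own lines. The source does not describe how the upload was actually prepared, so the following is only a sketch under assumptions: the column order (word, pos, lemma), the tag name `book` and the function name `to_vertical` are all hypothetical.

```python
def to_vertical(book_number: int, annotated_line: str) -> str:
    """Convert one 'word/lemma/TAG ...' line to verticalized text.

    Assumptions (not confirmed by the source): columns are written as
    word, pos, lemma, and each book is wrapped in a <book n="..."> region.
    """
    lines = ['<book n="{}">'.format(book_number)]
    for item in annotated_line.split():
        word, lemma, pos = item.rsplit("/", 2)
        # One token per line, attributes separated by tabs.
        lines.append("\t".join((word, pos, lemma)))
    lines.append("</book>")
    return "\n".join(lines)


vertical = to_vertical(
    10,
    "Kongen/konge/NN_COM_SING_DEF blev/blive/V_INDIC_PAST",
)
print(vertical)
```

Marking each book as its own region is what would let a query tool report results per book, as the distribution view described in the steps that follow does.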
- Visit the website: http://cst.dk/cgi-bin/dighumlab/Saxo/form-query.pl?mode=cqp.
- Choose Run Query with the default search pattern, [pos!="RESID_SIGN"], which counts all the words in Gesta Danorum.
- Choose Distribution and check how many words occur in books 8, 9, and 10 respectively.
- Insert the following search pattern, representing Christian language usage:
[lemma="helgen"] | [word="krist.*"] | [word="synd.*" & pos="N.*"] | [word="Herren"] | [word="ang(re|er)"] | [word="hellig.*"] | [word="Gud"]
- Press Run Query and then choose Distribution.
- Check the distribution of words defined as being members of the Christian register across books 8, 9, and 10.
- Does the frequency of Christian language usage differ between books 8, 9, and 10?
- Which thesis do the observations support? The traditional approach that advocates a split between book 9 and 10? Or the competing thesis that speaks in favour of dividing Gesta Danorum between book 8 and 9?
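Because the books differ in length, raw hit counts are easier to compare once they are normalized by the word total of each book. The sketch below approximates the two search patterns from the steps above in Python and computes hits per 1,000 words per book. It is not the CQP implementation: it assumes that each token attribute must match its regular expression in full, and the function names and the toy tokens are invented for illustration and are not taken from Gesta Danorum.

```python
import re
from collections import Counter


def is_word(pos: str) -> bool:
    # Mirrors the default pattern [pos!="RESID_SIGN"].
    return pos != "RESID_SIGN"


def is_christian(word: str, lemma: str, pos: str) -> bool:
    # Mirrors the Christian-register pattern; each regex must match
    # the whole attribute value (assumption about matching behaviour).
    return (
        lemma == "helgen"
        or re.fullmatch(r"krist.*", word) is not None
        or (
            re.fullmatch(r"synd.*", word) is not None
            and re.fullmatch(r"N.*", pos) is not None
        )
        or word == "Herren"
        or re.fullmatch(r"ang(re|er)", word) is not None
        or re.fullmatch(r"hellig.*", word) is not None
        or word == "Gud"
    )


def hits_per_thousand(tokens):
    """tokens: iterable of (book, word, lemma, pos) tuples."""
    totals, hits = Counter(), Counter()
    for book, word, lemma, pos in tokens:
        if not is_word(pos):
            continue  # skip tokens excluded by the default pattern
        totals[book] += 1
        if is_christian(word, lemma, pos):
            hits[book] += 1
    return {book: 1000 * hits[book] / totals[book] for book in totals}


# Invented toy tokens, for illustration only.
toy = [
    (8, "kongen", "konge", "NN_COM_SING_DEF"),
    (8, "synden", "synd", "NN_COM_SING_DEF"),
    (9, "Gud", "Gud", "NN_COM_SING_DEF"),
    (9, "Herren", "herre", "NN_COM_SING_DEF"),
    (9, "på", "på", "PREP"),
    (9, ".", ".", "RESID_SIGN"),
]
rates = hits_per_thousand(toy)
for book in sorted(rates):
    print(book, round(rates[book], 1))
```

With the real per-book counts from the Distribution view, the same normalization shows whether the Christian register becomes markedly more frequent from book 9 or only from book 10.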