Tour de CLARIN: Interview with Yvonne van Baal

Submitted by Jakob Lenardič on 7 February 2020

Yvonne van Baal is a PhD student in linguistics at the University of Oslo. In her research on definiteness marking on the noun phrase in American Norwegian, she has successfully used resources and tools developed at the CLARINO Text Laboratory. The interview was conducted via e-mail.


1. Could you briefly describe your research background and your current position? How did you get involved with Text Laboratory?

Currently, I am a PhD Candidate at the University of Oslo (Department of Linguistics & Scandinavian Studies), and I will defend my dissertation in February. My research interests are bilingualism, language acquisition and language variation. All these topics come together in the field of heritage languages, which is the topic of my dissertation. I am affiliated with the Text Laboratory, and Janne Bondi Johannessen, the leader of the TextLab, is one of the supervisors of my PhD project. The TextLab has been very important for me because of its expertise in data collection, data storage, and corpora.

2. Your PhD project focuses on a (morpho)-syntactic topic, which is definiteness marking in American Norwegian, a heritage language. Could you briefly present your PhD work (aims and results)? What makes American Norwegian an interesting language to study?

American Norwegian is a heritage variety of Norwegian that is spoken in the Midwest of the US by descendants of Norwegian immigrants. Heritage languages are interesting for linguistic research because they are minority languages learnt and used at home, while the larger, national society uses another language. As such, they provide unique insights into the roles that language acquisition as well as language use play in a speaker’s language competence. In addition, they can provide interesting perspectives on language variation, both within and across speakers, and on language change.

The main aim of my project was to investigate how definiteness is marked in American Norwegian, specifically in relation to the use of the so-called “compositional definiteness” structures. This refers to the fact that a semantically definite noun phrase in Norwegian contains both a prenominal determiner and a definite suffix on the noun, but only when the noun phrase also contains an adjectival modifier. Let’s exemplify this with the noun hus (“house”). In the Norwegian spoken in Norway, the equivalent of the definite noun phrase the house would be huset, where definiteness is grammatically marked by the suffix -et rather than by an article as in English. However, a modified definite noun phrase such as the old house would be det gamle huset, because the use of the modifier gamle (“old”) requires the additional use of the definite determiner det.

In my PhD work, I have found that definiteness marking in American Norwegian is in general very similar to that of the Norwegian spoken in Norway, while compositional definiteness is vulnerable to restructuring. This means that the prenominal determiner is often omitted, but the suffix is very stable: speakers of American Norwegian often use structures like gamle huset, which would not be used in homeland Norwegian.

I argue that this language change cannot be caused by transfer from English. If American Norwegian had become more like English, the speakers would have said det gamle hus (like “the old house”), but they do exactly the opposite. Interestingly, it turns out that children who grow up in Norway often omit the determiner while they acquire compositional definiteness. They too say gamle huset! But acquisition in the heritage context is somewhat different from acquisition in a monolingual context. I therefore argue that it is the acquisitional context that caused this difference between American Norwegian and homeland Norwegian.

3. Your work relies on spontaneous speech data from the Corpus of American Nordic Speech, which is made available through Text Laboratory. How have you used the corpus in your PhD work/to study definiteness marking? What kind of new empirical results were you able to achieve on the basis of this corpus?  

The Corpus of American Nordic Speech (CANS) has already been used by other researchers to study definiteness marking: Anderssen, Lundquist & Westergaard (2018) find that the determiner is often omitted, and compare this with the use of pre- and post-nominal possessives. I complement their research with experimental data (see below). I did, however, use the CANS corpus to extract frequency lists of lexical items.

I also used the Nordic Dialect Corpus (NDC). This is a corpus of spoken conversations from many different dialects in Scandinavia, and it is available through the same search interface (Glossa) as CANS. I used this corpus to study those Norwegian dialects that are spoken in the regions where the ancestors of the American Norwegians came from. The Nordic Dialect Corpus was extremely important for my research, because the dialectal data it offers allowed me to establish a proper baseline for comparison with the language spoken by the heritage speakers in the United States. A comparison with Bokmål or Nynorsk Norwegian would be unfair in this case, because most American Norwegian speakers are not familiar with the written language used in Norway.

4. Why is it important that the corpus consists of speech data instead of written materials? Could you discuss how you have complemented the corpus data with experimental data that you have collected during fieldwork in the United States?

The only data source that is available for American Norwegian is spoken data. These speakers grew up speaking Norwegian at home, but the school system in the US is of course completely in English. As a result, American Norwegians are only literate in English, and not in Norwegian (with a few exceptions); we therefore have to rely on speech data.

CANS is a great resource for research on American Norwegian: it is the only corpus of this language, contains speech of many speakers, and is easily searchable. However, as with most corpora that consist of spontaneous conversations, it has some limitations. One of them is that infrequent grammatical constructions are difficult to study, and compositional definiteness is one of these phenomena; in the corpus one only finds a few instances of the construction for each speaker. I therefore complemented the findings from CANS with elicitation experiments that I conducted during fieldwork trips to the Midwestern United States in 2016 and 2018. With the experiments, which included a picture-aided elicitation task and a translation of a short story from English into Norwegian, I was able to obtain many additional phrases that in Norwegian require compositional definiteness. By doing so, I could investigate how each speaker uses this construction.

5. Both the Corpus of American Nordic Speech and the Nordic Dialect Corpus are available through the Glossa search system developed by the Text Laboratory. How have you used the system in your work? Could you highlight any features in Glossa that were especially indispensable for your research purposes?

I found Glossa to be a very user-friendly system that is tailored to researchers who do not have a lot of experience with making complex search queries. Unlike most search interfaces, which require familiarity with the syntax of a query language, Glossa allows you to specify detailed morphosyntactic characteristics of the lemma that you’re looking for by simply selecting them in a drop-down menu. I was thereby able to observe the use of the compositional definiteness construction in the Corpus of American Nordic Speech and the Nordic Dialect Corpus by narrowing the Glossa query to those strings that include a noun marked as definite and immediately preceded by an adjective and possibly a determiner. Furthermore, Glossa allowed me to restrict searches using important extra-linguistic metadata. For instance, in the Nordic Dialect Corpus, I could restrict the search to Norwegian dialects spoken in a specific location and only include the dialects spoken by the ancestors of our American Norwegian speakers.

6. How does the Text Laboratory facilitate the study of heritage languages?

The TextLab has created the technological foundation for the study of American Norwegian. Basically, all the data that we have on American Norwegian are stored, transcribed, and tagged at the TextLab, where they also built the only corpus available for this heritage language. Such preservation is important because the current speakers of American Norwegian are all elderly people and they are the final generation of speakers. Their children do not speak the language, which means that building and preserving collections of their speech is an urgent task for infrastructures like the TextLab and by extension CLARINO. There are more minority languages that are in a situation like this. If these languages are not recorded, valuable data for linguists are lost forever. In order to further facilitate the study of endangered languages, TextLab recently launched the LIA-corpus, which contains a lot of old recordings of Norwegian dialects, and also includes recordings of the Sami languages.
