Tour de CLARIN: Setswana Test Suite and Treebank

Submitted by Karina Berger on 12 October 2021


Written by Liané van den Bergh and Ansu Berg

The Setswana Test Suite and Treebank resource was developed as part of the PhD study of Ansu Berg, who performed a rule-based computational syntactic analysis of Setswana with a specific focus on simple sentences. In order to develop a parser for Setswana, Ansu employed Lexical Functional Grammar (LFG) to frame the description of Setswana grammar. LFG is a non-derivational, constraint-based theory of grammar that distinguishes between two levels of syntactic analysis of a natural language utterance: a constituent structure and a functional structure. Figure 1 shows the LFG constituent structure for the Setswana simple sentence mosadi ga a ketla a reka ditlhako ('The woman will not buy shoes'), while Figure 2 shows its functional structure. The Setswana grammar was implemented in XLE, an environment for parsing and generation with grammars written in the LFG formalism, which offers a rich graphical user interface for writing and debugging such grammars. Setswana is the first Bantu language to be parsed with XLE.

Figure 1: The constituent structure of the simple Setswana sentence mosadi ga a ketla a reka ditlhako, where the NP mosadi ('the woman') is the syntactic sister of the entire VP node and therefore the grammatical subject, while ditlhako ('shoes'), which is embedded deeper in the structure as the complement of the lexical verb, is the grammatical object.

A corpus of simple Setswana sentences was not available, so Ansu developed the first test suite for Setswana. The test suite contains a set of constructed linguistic examples, including both grammatical and ungrammatical variants, corresponding to the main grammatical categories of Setswana. Ansu used the test suite to develop the first computational grammar for Setswana and, on the basis of this grammar, used the functionality provided by the XLE user interface to create the treebank. The resulting treebank is annotated for deep syntactic information corresponding to dependency relations between sentential constituents. For instance, in Figure 1, the simple NP ditlhako ('shoes') is embedded within the verb phrase and is the object complement of the main verb a reka ('buy').

Figure 2: The corresponding functional structure, indicating for instance that the auxiliary system (the syntactic head of the VPAUX node) in Figure 1 expresses a negated future tense in the indicative mood.
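
To make the idea of a functional structure more concrete, the short Python sketch below models the analysis of mosadi ga a ketla a reka ditlhako as a nested attribute-value structure. It is only an illustration based on the description in this post: the attribute names (PRED, SUBJ, OBJ, TENSE, POLARITY, MOOD), their values, and the helper function grammatical_functions are assumptions chosen for readability, and they are not the feature inventory or output format of the actual Setswana grammar or of XLE.

```python
# Illustrative sketch only: attribute names and values are assumptions,
# not the actual features used in the Setswana treebank or produced by XLE.
f_structure = {
    "PRED": "reka<SUBJ, OBJ>",        # main verb 'buy' with its two arguments
    "SUBJ": {"PRED": "mosadi"},       # 'the woman'
    "OBJ": {"PRED": "ditlhako"},      # 'shoes'
    "TENSE": "future",                # expressed by the auxiliary system
    "POLARITY": "negative",           # expressed by the auxiliary system
    "MOOD": "indicative",
}


def grammatical_functions(fs):
    """Return (function, predicate) pairs for every embedded f-structure."""
    return [
        (attribute, value["PRED"])
        for attribute, value in fs.items()
        if isinstance(value, dict)
    ]


for function, predicate in grammatical_functions(f_structure):
    print(f"{function}: {predicate}")
```

The point of the sketch is that grammatical functions such as subject and object are read directly off the functional structure, whereas in the constituent structure of Figure 1 they have to be inferred from the position of each phrase in the tree.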

As one of the first syntactically parsed resources for Bantu languages, the Setswana treebank is important for both the computational and non-computational linguistic research communities in South Africa. As a resource for computational purposes, the treebank serves as a gold standard for future Setswana grammar testing and evaluation. This contribution to Setswana can also enable similar projects for South African languages that share syntactic structures with Setswana. For general linguistic research, it is a crucial resource not only for local South African linguists, who can use the syntactically parsed sentences in the treebank for lexicological and general grammatical purposes, but also for syntacticians, who can use the LFG dependencies in the treebank to research Setswana in the context of Universal Grammar. As an example of the successful use of the treebank in research, Ansu and her colleagues have used it to perform an LFG-style analysis of those auxiliary verbs in Setswana that indicate tense (Pretorius and Berg 2019). Furthermore, they have used the argument structure level of the LFG formalism to determine subcategorisation frames (such as the omission of the logical subject during passivisation) in the lexical verbal system of Setswana (Berg et al. 2020).



Pretorius, R., and Berg, A. 2019. An LFG Analysis of Setswana Auxiliary Verb Phrases Indicating Tense. In Proceedings of the LFG’19 Conference, 233–250.

Berg, A., Pretorius, L., and Pretorius, R. 2020. Using LFG a-structure to determine the subcategorization frames of Setswana verbs. Nordic Journal of African Studies 29 (2): 1–31.