Tour de CLARIN: South Africa

Submitted by Karina Berger

Written by Liané van den Bergh and Juan Steyn

The South African Centre for Digital Language Resources (SADiLaR) forms part of the South African Research Infrastructure Roadmap (SARIR) programme of the South African government’s Department of Science and Innovation (DSI). SADiLaR joined CLARIN as an observer in 2018 and is the only digital language research centre outside Europe that forms part of the European CLARIN research network as a C-Centre (a metadata-providing centre). Current efforts are directed towards SADiLaR becoming a B-Centre (a service-providing centre) in the near future.

SADiLaR’s mandate is to support researchers by facilitating the creation of, and access to, digital data through its Digitization Programme, and by building overall research capacity through training and dissemination activities as part of its Digital Humanities programme.

Langa Khumalo, director of SADiLaR.

SADiLaR is a multi-partner entity hosted at the North-West University (NWU), which functions as the host and hub of a network of linked nodes comprising:

  • the University of Pretoria (Department of African Languages)
  • the University of South Africa (Department of African Languages)
  • the Council for Scientific and Industrial Research (HLT Research Group)
  • the North-West University (Centre for Text Technology)
  • the Inter-Institutional Centre for Language Development and Assessment (ICELDA).

SADiLaR runs two programmes to support the South African research community:

Our digitisation programme entails the systematic creation of relevant digital text, speech, and multi-modal technologies and resources, primarily related to the eleven official languages of South Africa. Currently, SADiLaR facilitates access to 378 resources, of which 238 are downloadable through a dedicated online repository, which is harvested by the Virtual Language Observatory. Prominent tools and technologies include the NCHLT text processing web services and the Autshumato machine translation web services, as well as other downloadable applications and software packages for use within the HLT and domains.

Our Digital Humanities programme facilitates the building of research capacity by promoting and supporting the use of digital data and innovative (computational) methodological approaches within the Humanities and Social Sciences. This has been done primarily through the commissioning and support of more than sixty workshops, conferences and events within Digital Humanities and Social Sciences since 2017. SADiLaR workshops have covered a wide variety of topics, from the use of domain-specific computational tools and approaches to general awareness of what the broad domain of Digital Humanities entails.

These programmes make an impact in three domains:

Language technology domain: Under the South African Constitution, all our official languages are guaranteed parity of esteem and must be treated equitably. In reality, however, almost all of our languages are under-resourced. For this reason, a key part of what the centre does is to create new high-level resources and natural language processing tools for all South African languages. This is needed to ensure that our language communities and researchers have access to technologies and datasets that support research and equitable access in settings where language can be a barrier. Practical technologies that are generated and refined using SADiLaR language resources include machine translation engines for local languages, automatic speech recognition systems, text-to-speech systems, speech-to-speech translation systems, interactive communication systems, as well as a variety of text-related applications such as grammar and spelling checkers, and online electronic dictionaries.

Humanities and Social Sciences domain: Having resources and technologies available is only part of the solution, as scholars must also be able to access them and use them effectively. Therefore, SADiLaR actively works toward establishing communities of practice via initiatives for building centralised research capacity. Capacity building ranges from raising awareness about what is available and how researchers can get involved with SADiLaR activities to practical training in the use of digital data, innovative research methods, and software tools. Through this, SADiLaR hopes to enable South African scholars to ask and pursue previously unanswerable questions within their respective disciplines.

Socio-economic domain: Reusable digital language resources are important building blocks that can be used not just for research activities, but also by commercial entities to build end-user applications that have a direct impact on language communities. A practical example of this is a recent application called AwezaMed COVID-19, which was developed by a SADiLaR node and aims to remove the communication barriers between health providers and patients. This application was initially developed for maternal healthcare and obstetrics, but was adapted as part of the South African response to the COVID-19 pandemic. The application features speech recognition, machine translation, and text-to-speech developed by the Council for Scientific and Industrial Research in partnership with Aweza. The full report is available online. There is also a YouTube video which shows how the system functions. AwezaMed is also a good example of how research outputs are having a translational impact through collaboration with private sector companies and the healthcare sector.