The IceTaboo database is a novel resource for processing offensive words in Icelandic. Developed by a small team at the Language and Technology Lab at the University of Iceland during the summer of 2020, IceTaboo includes 2725 words that are inappropriate or offensive to at least some speakers in some contexts. The database has been designed to be used as part of the automatic proofreading software GreynirCorrect, which the lab developed in close collaboration with an industry partner in commercial software development. IceTaboo can be used to flag contextually inappropriate words when the correction software comes across them in texts, and is already being used as part of the automatic proofreading tool by an Icelandic online news website.
The IceTaboo database consists of a list of words in Icelandic that may be considered inappropriate, taboo or loaded in use or meaning. The list includes words that are biased against certain minorities (for instance, on the basis of race, ability, gender or sexuality), words that are derogatory towards people, unnecessarily gendered or obsolete, and words that are only mildly inappropriate but can be considered politically loaded or unsuitable for children.
The database was compiled manually at the Language and Technology Lab at the University of Iceland in 2020. The project team began with extensive brainstorming sessions, followed by a systematic search on the internet. Social media platforms, especially the comment sections, were also consulted. Different spelling variations were taken into consideration, and some slang words from English and Danish were also included.
The output of this work was used to establish a classification system that groups words into categories according to their meaning, form or use. Categories include swear words, health-related words, nasty adjectives, offensive profession names, offensive words related to religion, offensive descriptions of people’s appearance, and offensive words related to sex. The system also includes a class for words with a nuanced relationship with offensiveness, such as political terms or words that have an alternative, non-offensive meaning.
Once classified, each class of words was systematically studied. Synonyms or related words were noted, and the Database of Modern Icelandic Inflection (DMII) was used to identify compounds that contained inappropriate parts. Each word is coded for part of speech and category, together with information about its meaning, including an explanation of why it may be considered inappropriate, and in what context.
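To make the coding scheme concrete, the following is a minimal Python sketch of how one entry and the compound search could be represented. It is an illustration under stated assumptions, not the actual IceTaboo file format: the field names (`lemma`, `pos`, `category`, `explanation`), the placeholder words and the helper `compounds_with_taboo_part` are all hypothetical, and the compound check is a plain substring match rather than a lookup in DMII.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TabooEntry:
    """One coded word. Field names are assumptions, not the real schema."""
    lemma: str        # the word itself
    pos: str          # part of speech
    category: str     # class from the classification system
    explanation: str  # why and in what context it may be inappropriate


# Placeholder entries: harmless made-up words, not taken from the database.
ENTRIES = [
    TabooEntry("dæmiorð", "noun", "swear word",
               "Placeholder explanation of why the word may offend."),
    TabooEntry("gerviorð", "noun", "politically loaded",
               "Placeholder explanation of the loaded context."),
]


def build_index(entries):
    """Map each lower-cased lemma to its entry for fast lookup."""
    return {entry.lemma.lower(): entry for entry in entries}


def compounds_with_taboo_part(candidates, index):
    """Return candidates that contain an indexed lemma as a proper part.

    Simplification: a substring test stands in for a real compound
    analysis, so it can both over- and under-match.
    """
    found = {}
    for word in candidates:
        lowered = word.lower()
        parts = [lemma for lemma in index
                 if lemma in lowered and lemma != lowered]
        if parts:
            found[word] = parts
    return found


INDEX = build_index(ENTRIES)
```

A call such as `compounds_with_taboo_part(["ofurdæmiorð", "venjulegt"], INDEX)` would report only the first candidate, because it contains the placeholder lemma `dæmiorð`. Candidates found this way would still need manual review, in line with the manual annotation described above.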
As part of the GreynirCorrect automatic proofreading software, the IceTaboo database is already being used to highlight inappropriate words at the Icelandic online news website Kjarninn. This means that the system now flags potentially inappropriate words while reporters are writing content for the website.
Agnes Sólmundsdóttir, the project leader, says: ‘So now, if journalists are writing a story, they might get a pop-up window. It’s part of the correction software – it corrects spelling and grammar, but now they also get a “ping” if they write a word that some readers might find inappropriate.’
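The flagging behaviour described here can be sketched in a few lines of Python. This is a simplified illustration and does not use or reproduce the GreynirCorrect API: the lexicon contents, the `Flag` type and the function `flag_text` are hypothetical, and the lookup matches surface forms only. Because Icelandic is heavily inflected, a real system would need to match inflected forms as well, for example through lemmatisation.

```python
import re
from typing import NamedTuple


class Flag(NamedTuple):
    """A flagged span in the text (hypothetical structure)."""
    start: int     # character offset where the word begins
    end: int       # character offset just past the word
    word: str      # the word as written
    category: str  # category of the matching lexicon entry


# Placeholder lexicon: made-up words mapped to a category.
LEXICON = {
    "dæmiorð": "swear word",
    "gerviorð": "politically loaded",
}

TOKEN = re.compile(r"\w+")


def flag_text(text, lexicon=LEXICON):
    """Return a Flag for every token whose lower-cased form is listed."""
    flags = []
    for match in TOKEN.finditer(text):
        category = lexicon.get(match.group().lower())
        if category is not None:
            flags.append(
                Flag(match.start(), match.end(), match.group(), category)
            )
    return flags
```

An editor integration could use the returned offsets to underline the word and show the category, leaving the decision about whether to change it to the writer.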
The database is released under an open CC BY 4.0 licence on CLARIN. The proofreading system GreynirCorrect, which was developed by the Language and Technology Lab in collaboration with Miðeind, a leading software company in the field of linguistics and artificial intelligence, is under development in an open repository on GitHub under an MIT licence.
According to project leader Agnes Sólmundsdóttir, other Icelandic companies working with text have also shown interest in integrating the correction software, including IceTaboo, into their workflow: ‘It’s already running in the news website, and it will probably be running in more scenarios soon. There’s definitely interest.’
While the project team used manual annotation and focused on integrating IceTaboo into the automatic proofreading system, the database could also be used for different purposes. Apart from research focusing on inappropriateness in Icelandic, it could also be of use in the future development of systems that apply machine learning methods to automatic detection of offensive language. The words in the database can inform feature extraction steps of such systems and potentially make them more effective.
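As a rough sketch of what such a feature extraction step might look like, the Python function below turns a text into a small numeric vector: one count per category plus the share of tokens found in the lexicon. Everything here is an assumption made for illustration, including the function name `lexicon_features`, the placeholder lexicon and the choice of features. The project itself did not build such a system.

```python
import re
from collections import Counter


def lexicon_features(text, lexicon, categories):
    """Return per-category hit counts followed by the overall hit ratio.

    `lexicon` maps lower-cased words to a category name; `categories`
    fixes the order of the count features in the output vector.
    """
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    counts = Counter(lexicon[token] for token in tokens if token in lexicon)
    hits = sum(counts.values())
    ratio = hits / len(tokens) if tokens else 0.0
    return [float(counts[category]) for category in categories] + [ratio]


# Placeholder lexicon with made-up words, not real database entries.
EXAMPLE_LEXICON = {
    "dæmiorð": "swear word",
    "gerviorð": "politically loaded",
}
EXAMPLE_CATEGORIES = ["swear word", "politically loaded"]
```

A vector like this could be appended to other features, such as word or character n-grams, before training a classifier, so that the model has access to the manually curated categories.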
The detection of offensive or contextually inappropriate language could also be important for monitoring freely accessible discussion spaces on social media or for helping a user of a word processing system to avoid inappropriate expressions. For such extended applications, the database could serve as a useful first step. In addition, the classification system could be useful when trying to extend the database to other languages.
Views on CLARIN
Anton Karl Ingason, Associate Professor at the University of Iceland, and Director of the Language and Technology Lab
Agnes Sólmundsdóttir, Research Assistant at the Language and Technology Lab, University of Iceland, and undergraduate student (first author)