IceTaboo: Offensive Word Database with Commercial Application

Anton Karl Ingason, Agnes Sólmundsdóttir, Lilja Björk Stefánsdóttir
Submitted by Karina Berger on 20 December 2021

The Project

The IceTaboo database is a novel resource for processing offensive words in Icelandic. Developed by a small team at the Language and Technology Lab at the University of Iceland during the summer of 2020, IceTaboo includes 2725 words that are inappropriate or offensive to at least some speakers in some contexts. The database has been designed to be used as part of the automatic proofreading software GreynirCorrect, which the lab developed in close collaboration with an industry partner in commercial software development. IceTaboo can be used to flag contextually inappropriate words when the correction software comes across them in texts, and is already being used as part of the automatic proofreading tool by an Icelandic online news website.

'Our lab is really focused on collaborating with industry. We really want our work to benefit the public.'
Agnes Sólmundsdóttir

Methodology

The IceTaboo database consists of a list of words in Icelandic that may be considered inappropriate, taboo or loaded in use or meaning. The list includes words that are biased against certain minorities (for instance, different races, abilities, genders or sexualities), words that are derogatory towards people, unnecessarily gendered or obsolete, and those that are not very inappropriate, but can be considered politically loaded or unsuitable for children.

The database was compiled manually at the Language and Technology Lab at the University of Iceland in 2020. The project team began with extensive brainstorming sessions, followed by a systematic search on the internet. Social media platforms, especially the comment sections, were also consulted. Different spelling variations were taken into consideration, and some slang words from English and Danish were also included.

The output of this work was used to establish a classification system that groups words into categories according to their meaning, form or use. Categories include swear words, health-related words, nasty adjectives, offensive profession names, offensive words related to religion, offensive descriptions of people’s appearance, and offensive words related to sex. The system also includes a class for words with a nuanced relationship with offensiveness, such as political terms or words that have an alternative, non-offensive meaning.

Once classified, each class of words was systematically studied. Synonyms and related words were noted, and the Database of Modern Icelandic Inflection (DMII) was used to identify compounds that contained inappropriate parts. Each word is coded for part of speech and classification, and is accompanied by information about its meaning, including an explanation of why it may be considered inappropriate, and in what context.
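The coding scheme and the compound search described above can be sketched as follows. This is a minimal illustration and not the team's actual tooling: the field names of `TabooEntry`, the helper `find_compounds`, and the category value are assumptions, the word lists are placeholders, and plain substring matching stands in for whatever lookup against DMII was actually performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TabooEntry:
    """One database record; field names are illustrative, not the released schema."""
    word: str
    pos: str          # part of speech
    category: str     # class from the classification system
    explanation: str  # why, and in what context, the word may be inappropriate


def find_compounds(candidates, taboo_parts):
    """Map each candidate word to the taboo parts it contains.

    Uses naive substring matching. A candidate identical to a part is
    skipped, because it is not a compound built on that part.
    """
    hits = {}
    for candidate in candidates:
        lowered = candidate.lower()
        parts = [p for p in taboo_parts if p in lowered and p != lowered]
        if parts:
            hits[candidate] = parts
    return hits


# The category value below is a guess made for illustration only.
entry = TabooEntry(
    word="hjúkrunarkona",
    pos="noun",
    category="profession",
    explanation="Can be considered to enforce gendered stereotypes.",
)

# Placeholder strings only; these are not entries from IceTaboo or DMII.
print(find_compounds(["xbadx", "bad", "good"], ["bad"]))  # {'xbadx': ['bad']}
```

Substring matching will over-generate on short parts that happen to occur inside unrelated words, which is one reason the manual review described in this section matters.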


Outcome

As part of the GreynirCorrect automatic proofreading software, the IceTaboo database is already being used to highlight inappropriate words at the Icelandic online news website Kjarninn. The system now flags potentially inappropriate words while reporters are writing content for the website.

Sólmundsdóttir says: ‘So now, if journalists are writing a story, they might get a pop-up window. It’s part of the correction software – it corrects spelling and grammar, but now they also get a “ping” if they write a word that some readers might find inappropriate.’

This screenshot from the correction software interface shows how it appears to users. Here, IceTaboo flags the word 'hjúkrunarkona', explains why it might be inappropriate, and provides alternative suggestions. Translated, the example sentence says: 'This man is hurt. Is there a nurse here?' The flagged word is 'hjúkrunarkona', which literally means 'nurse-woman' or 'nursemaid'. The (translated) explanation for the term's inappropriateness says: 'An unfortunate or inappropriate choice of words, a better word would be the word "hjúkrunarfræðingur".' The suggested word 'hjúkrunarfræðingur' translates more closely to 'registered nurse', suggesting that the person is educated in this specific field. A further explanation of the inappropriateness adds: 'The word "hjúkrunarkona" can be considered to enforce certain gendered stereotypes and implies that nursing is a job only done by women.'
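The flagging behaviour in this example can be approximated with a simple lexicon lookup. The sketch below is hypothetical and does not use or reproduce the GreynirCorrect API: the `Flag` type, the `LEXICON` layout and the `flag_text` function are assumptions made for illustration, and the single entry is taken from the translated explanation quoted above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flag:
    """A flagged token; an illustrative type, not part of GreynirCorrect."""
    word: str
    start: int                 # character offset of the token in the text
    explanation: str
    suggestion: Optional[str]


# Hypothetical lexicon layout: word -> (explanation, suggested alternative).
LEXICON = {
    "hjúkrunarkona": (
        "Can be considered to enforce certain gendered stereotypes.",
        "hjúkrunarfræðingur",
    ),
}


def flag_text(text, lexicon=LEXICON):
    """Return a Flag for every token whose lowercased form is in the lexicon."""
    flags = []
    for match in re.finditer(r"\w+", text):
        entry = lexicon.get(match.group().lower())
        if entry is not None:
            explanation, suggestion = entry
            flags.append(Flag(match.group(), match.start(), explanation, suggestion))
    return flags


for flag in flag_text("Is there a hjúkrunarkona here?"):
    print(flag.word, "->", flag.suggestion)
```

An exact-form lookup like this one misses inflected forms, which matters for a highly inflected language such as Icelandic. A production system would need to match on lemmas or on full inflection paradigms.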

The database is released under an open CC BY 4.0 licence on CLARIN. The proofreading system GreynirCorrect, which was developed by the Language and Technology Lab in collaboration with Miðeind, a leading software company in the field of linguistics and artificial intelligence, is under development in an open repository on GitHub under an MIT licence.

‘Our lab and the language technology community in Iceland emphasises licences that make all products easily reusable. In this case we used the Creative Commons Attribution Licence, which places almost no restrictions on normal use cases.’
Agnes Sólmundsdóttir

According to project leader Agnes Sólmundsdóttir, other Icelandic companies working with text have also shown interest in integrating the correction software, including IceTaboo, into their workflow: ‘It’s already running in the news website, and it will probably be running in more scenarios soon. There’s definitely interest.’

While the project team used manual annotation and focused on integrating IceTaboo into the automatic proofreading system, the database could also be used for different purposes. Apart from research focusing on inappropriateness in Icelandic, it could also be of use in the future development of systems that apply machine learning methods to automatic detection of offensive language. The words in the database can inform feature extraction steps of such systems and potentially make them more effective.
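One way a word list like this could inform feature extraction is by counting lexicon hits per category and passing the counts to a classifier. The following is a sketch under stated assumptions, not a method described by the project team: the function `lexicon_features`, the feature names and the category labels are hypothetical, and the words are placeholders.

```python
import re
from collections import Counter


def lexicon_features(text, category_by_word, categories):
    """Build per-category hit counts plus an overall hit ratio for one text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(category_by_word[t] for t in tokens if t in category_by_word)
    features = {f"count_{c}": counts.get(c, 0) for c in categories}
    # Share of tokens found in the lexicon; guard against empty input.
    features["taboo_ratio"] = sum(counts.values()) / max(len(tokens), 1)
    return features


# Placeholder words and category labels, not real IceTaboo data.
category_by_word = {"foo": "swear", "bar": "religion"}
categories = ["swear", "religion", "appearance"]

print(lexicon_features("foo bar baz foo", category_by_word, categories))
# {'count_swear': 2, 'count_religion': 1, 'count_appearance': 0, 'taboo_ratio': 0.75}
```

Features of this kind ignore context, so they would normally be combined with other signals, since many words in the database are only inappropriate in some contexts.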

The detection of offensive or contextually inappropriate language could also be important for monitoring freely accessible discussion spaces on social media or for helping users of a word processing system to avoid inappropriate expressions. For such extended applications, the database could serve as a useful first step. In addition, the classification system could be useful when extending the database to other languages.


Views on CLARIN

'We deposited our database at CLARIN. It’s a really well-respected platform for language technology tools. Our lab and the language technology community in Iceland is really focused on making all resources publicly available. And CLARIN is a really good platform for this. It suited the project perfectly.' Agnes Sólmundsdóttir
'Open access availability on CLARIN and an industry-friendly licensing policy ensures that the resource is ready to be used by any software developer that shows interest.' Agnes Sólmundsdóttir

Contributors

Anton Karl Ingason, Associate Professor at the University of Iceland, and Director of the Language and Technology Lab

Agnes Sólmundsdóttir, Research Assistant at the Language and Technology Lab, University of Iceland, and undergraduate student (first author)

Lilja Björk Stefánsdóttir, Project Manager at the Language and Technology Lab, University of Iceland