Catching Cyberbullies with Neural Networks

Digital harassment is a problem in many corners of the internet, like internet forums, comment sections and game chat. In this article you can play with techniques to automatically detect users that misbehave, preferably as early in the conversation as possible. What you will see is that while neural networks do a better job than simple lists of words, they are also black boxes; one of our goals is to help show how these networks come to their decisions. Also, we apologize in advance for all of the swear words :).

According to a 2016 report, 47% of internet users have experienced online harassment or abuse [1], and 27% of all American internet users self-censor what they say online because they are afraid of being harassed. On a similar note, a survey by The Wikimedia Foundation (the organization behind Wikipedia) showed that 38% of the editors had encountered harassment, and over half of them said this lowered their motivation to contribute in the future [2]; a 2018 study found 81% of American respondents wanted companies to address this problem [3]. If we want safe and productive online platforms where users do not chase each other away, something needs to be done.

One solution to this problem might be to use human moderators that read everything and take action if somebody crosses a boundary, but this is not always feasible (nor safe for the mental health of the moderators); popular online games can have the equivalent population of a large city playing at any one time, with hundreds of thousands of conversations taking place simultaneously. And much like a city, these players can be very diverse. At the same time, certain online games are notorious for their toxic communities. According to a survey by League of Legends player Celianna in 2020, 98% of League of Legends players have been 'flamed' (been part of an online argument with personal attacks) during a match, and 79% have been harassed afterwards [4]. The following is a conversation that is sadly not untypical for the game:

Z: fukin bot n this team.... so cluelesss gdam
V: u cunt
Z: wow ....u jus let them kill me
Z: this game is like playign with noobs lol....complete clueless lewl
L: ur shyt noob

For this article, we therefore use a dataset of conversations from this game and show different techniques to separate 'toxic' players from 'normal' players automatically. To keep things simple, we selected 10 real conversations between 10 players that contained about 200 utterances: utterance 1 is the first chat message in the game, utterance 200 one of the last (most of the conversations were a few messages longer than 200, we truncated them to keep the conversations uniform). In each of the 10 conversations, exactly 1 of the 10 persons misbehaves. The goal is to build a system that can pinpoint this 1 player, preferably quickly and early in the conversation; if we find the toxic player by utterance 200, the damage is already done.

Can't we just use a list of bad words?

A first approach for an automated detector might be to use a simple list of swear words and insults like 'fuck', 'suck', 'noob' and 'fag', and label a player as toxic if they use a word from the list more often than a particular threshold. Below, you can slide through ten example conversations simultaneously. Normal players are represented by green faces, toxic players by red faces. When our simple system marks a player as toxic, it gets a yellow toxic symbol. These are all the possible options:

                             Normal players       Toxic players
System says nothing (yet)    Normal situation     Missed toxic player
System says: toxic           False alarm          Detected toxic player

You can choose between detectors with thresholds of 1, 2, 3 and 5 bad words, to see what they do where in the conversation.

As you can see, the detector with the low threshold detects all toxic players early in the game, but has lots of false alarms (false positives). On the other hand, the detector with the high threshold does not have this problem, but misses a lot of toxic players (false negatives). This tension between false positives and false negatives is a problem any approach will have; our goal is to find an approach where this balance is somehow optimal.
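A word-list detector of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the visualization: the word list only contains the four examples named above, the toy conversation is invented, and flagging a player once their count reaches the threshold (rather than exceeds it) is an assumption.

```python
import re
from collections import Counter

# Hypothetical word list: only the four examples named in the article.
BAD_WORDS = {"fuck", "suck", "noob", "fag"}

def first_flags(utterances, threshold):
    """Return {player: index of the utterance at which they were first flagged}.

    `utterances` is a list of (player, text) pairs in chat order. A player is
    flagged once their running count of listed words reaches `threshold`
    (assumption: "reaches", not "exceeds").
    """
    counts = Counter()
    flagged = {}
    for index, (player, text) in enumerate(utterances, start=1):
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[player] += sum(token in BAD_WORDS for token in tokens)
        if player not in flagged and counts[player] >= threshold:
            flagged[player] = index
    return flagged

def outcomes(flagged, players, toxic_players):
    """Assign every player one of the four outcomes listed earlier."""
    result = {}
    for player in players:
        is_toxic = player in toxic_players
        if player in flagged:
            result[player] = "detected toxic player" if is_toxic else "false alarm"
        else:
            result[player] = "missed toxic player" if is_toxic else "normal situation"
    return result

# Invented toy conversation, not taken from the dataset; player B is the toxic one.
chat = [
    ("A", "gj team"),
    ("B", "noob team, you all suck"),
    ("C", "lol i played like a noob there"),
    ("B", "fuck this"),
]
for threshold in (1, 2, 3, 5):
    flagged = first_flags(chat, threshold)
    print(threshold, outcomes(flagged, ["A", "B", "C"], {"B"}))
```

In this toy example, a threshold of 1 catches B at the second message but also raises a false alarm for C, while a threshold of 5 never fires at all. Exact token matching is also brittle: a spelling like 'fukin' from the conversation quoted earlier would not be counted.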

Teaching language to machines

A better solution might be to use machine learning: we give thousands of examples of conversations with toxic players to a training algorithm and ask it to figure out how to recognize harassment by itself. Of course, such an algorithm will learn that swear words and insults are good predictors for toxicity, but it can also pick up more subtle word combinations and other phenomena. For example, if you look at how often the green and red faces open their mouths in the visualization above, you'll see that the average toxic player is speaking a lot more than the other players.
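To make the 'speaking a lot more' observation concrete, here is a small sketch that computes each player's share of the chat messages. The function name and the toy conversation are invented for illustration; whether the trained network actually relies on such a signal is not something this snippet shows.

```python
from collections import Counter

def speaking_share(utterances):
    """Map each player to the fraction of all chat messages they wrote."""
    counts = Counter(player for player, _ in utterances)
    total = sum(counts.values())
    return {player: count / total for player, count in counts.items()}

# Invented toy conversation, not taken from the dataset.
chat = [
    ("A", "gj"),
    ("B", "noob team"),
    ("B", "so clueless"),
    ("C", "mia top"),
    ("B", "wow"),
]
shares = speaking_share(chat)
most_talkative = max(shares, key=shares.get)
print(most_talkative, shares[most_talkative])
```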

We have used 5000 other conversations to train a network that consists of an embedding layer (300 units), two bidirectional GRU layers (16 units), a pooling layer and two dense layers (256 units). The output layer is a single sigmoid unit indicating the network’s confidence that the input text is toxic.

The first layer, the embedding layer, is the lowest-level layer and contains information on which individual words often appear in similar contexts. The idea is that words that are similar in meaning appear in similar contexts, and thus get similar weights in the network. This means that if we visualize the weights in a 3D space with the t-SNE algorithm, words with a similar meaning should appear closer together:

As you explore the 3D space, you will find many interesting clusters of words that indeed are related in meaning. For example, there is a cluster of words related to time, a number cluster, a cluster of adjectives to rate something, but also (and more useful to the current task) a cluster of insults and a cluster of variants of the word fuck.
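One common way to put a number on 'closer together' is cosine similarity between word vectors. The sketch below uses invented 3-dimensional vectors purely for illustration; the real embedding layer has 300 units with learned values, and the similarity measure here is an assumption rather than what the t-SNE visualization computes.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors, NOT weights from the trained network.
toy_embeddings = {
    "noob":   [0.9, 0.1, 0.0],
    "idiot":  [0.8, 0.2, 0.1],
    "minute": [0.0, 0.9, 0.4],
    "second": [0.1, 0.8, 0.5],
}

def nearest(word, embeddings):
    """Return the other word whose vector is most similar to `word`'s."""
    return max(
        (other for other in embeddings if other != word),
        key=lambda other: cosine_similarity(embeddings[word], embeddings[other]),
    )

print(nearest("noob", toy_embeddings))    # an insult lands next to an insult
print(nearest("minute", toy_embeddings))  # a time word lands next to a time word
```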

However, just knowing the rough meaning of relevant words is not enough - we need to know when to act. Higher layers typically pick up increasingly more abstract tasks, like monitoring the temperature of the conversation as a whole. In the interactive visualization below, you can see which neurons in which layers respond positively (green words) or negatively (red words) to different parts of the conversations [5].

In the first layer, we see that example neuron 1 has developed an interest in several abbreviations like 'gj' (good job), 'gw' (good work), 'ty' (thank you) and to a lesser extent 'kk' (okay, or an indication of laughter in the Korean community) and 'brb' (be right back). Example neuron 12 focuses on a number of unfriendly words, activating on 'stupid', 'dumb', 'faggot' and 'piece of shit', and also somewhat for 'dirty cunt'. Note that its results are swapped compared to neuron 1 (red for good predictors of toxicity and green for good predictors of collaborative players instead of vice versa), which will be corrected by a neuron in a later layer. Neuron 16 activates on 'mia' (missing in action), which is typically used to notify your team mates of possible danger, and thus a sign that this person is collaborative and probably not toxic.

The neurons in the second layer monitor the conversation on a higher level. In contrast to the abrupt changes in the first layer, the colors in the second layer are fading more smoothly. Neuron 17 is a good example, where we see that the conversation is green in the beginning and slowly goes from yellow to orange, and then later back to green again. Several neurons, like neuron 6 in the second layer, find the repetitive use of 'go' suspicious since they are activated more with each repetition.

But does it work?

The big question is whether a harassment detector using a neural network instead of a word list performs better. Below you can compare the previous word list-based approach against three neural networks with different thresholds [5]. The threshold now is not the number of bad words, but the network's confidence: a number between 0 and 1 indicating how sure the network is that a particular player is toxic. Here you see the results for a neural network with three different confidence thresholds, compared to a word list-based detector with a threshold of 2 bad words:

As with the word list-based approach, we see that a higher threshold means fewer false positives but also fewer true positives. However, we see that two of the neural network-based detectors find way more toxic players during the conversation while having fewer false positives at the same time... progress!
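The effect of the confidence threshold can be sketched as follows. The confidence traces and the three thresholds are invented for illustration (they are not HaRe output, and the article's own thresholds are not listed here); the sketch assumes a player is flagged the first time their confidence reaches the threshold.

```python
def first_detection(confidences, threshold):
    """Return the 1-based utterance index at which the confidence first
    reaches `threshold`, or None if it never does."""
    for index, confidence in enumerate(confidences, start=1):
        if confidence >= threshold:
            return index
    return None

def evaluate(confidence_by_player, toxic_players, threshold):
    """Count (true positives, false positives) among the flagged players."""
    true_positives = false_positives = 0
    for player, confidences in confidence_by_player.items():
        if first_detection(confidences, threshold) is not None:
            if player in toxic_players:
                true_positives += 1
            else:
                false_positives += 1
    return true_positives, false_positives

# Invented per-player confidence after each utterance; player B is the toxic one.
traces = {
    "A": [0.05, 0.10, 0.08, 0.12],
    "B": [0.30, 0.65, 0.85, 0.95],
    "C": [0.20, 0.55, 0.40, 0.35],
}
for threshold in (0.5, 0.7, 0.99):
    print(threshold, evaluate(traces, {"B"}, threshold))
```

With these toy numbers, the lowest threshold catches B at the second utterance but also flags C, the middle one catches B one utterance later without the false alarm, and the highest one misses B entirely.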

The bigger picture

Besides the technical challenge of detecting bad actors early, automated conversation monitoring raises a number of ethical questions: do we only want to use this technique to study the toxicity within a community, or do we really want to monitor actual conversations in real time? And if so, what does it mean for a community to have an automatic watchdog, always looking over your shoulder? And even if it is okay to have a watchdog for toxicity, something broadly desired by people who spend time online, what if the techniques described here are used to detect other social phenomena?

And say we have a system that can perfectly detect bad behavior in online conversations, what should be done when it detects somebody? Should they be notified, warned, muted, banned, reported to an authority? And at what point should action be taken - how toxic is too toxic? The former director of Riot Games' Player Behavior Unit attributes most toxicity to 'the average person just having a bad day'; is labeling a whole person as toxic or non-toxic not too simplistic?

Whatever the best answer to these questions might be, just doing nothing is not it; the Anti-Defamation League and several scholars who study hate speech argue that toxic behavior and harassment online lead to more hate crimes offline [7]. Automatic detection seems a good first step for the online communities that are fighting this problem. Below you can play with both detection techniques introduced in this article, and set the thresholds for yourself. What threshold do you think would make most sense in what use case? For example, how would you tune a system that will later be judged by humans versus a system that can automatically ban users? Are you willing to allow false positives if you catch all toxic players?



[3] ADL. "Online Hate and Harassment: The American Experience." 2019.


[5] All of the confidence values in this article come out of the software package HaRe, funded by the Language in Interaction project. All of the thresholds were picked manually, based on trial and error.

[6] The network consisted of an embedding layer of 300 units, two bidirectional GRU layers of 16 units, a pooling layer and two dense layers of 256 units, and was trained on 5000 conversations.


Author Bio
Wessel Stoop is a scientific programmer at the Centre for Language & Speech Technology at the Radboud University Nijmegen. His work focuses on what natural language processing can do for the social relations between people and vice versa. He is also passionate about interactive explanations to clarify complicated subjects, and has created and published several.

Florian Kunneman is an assistant professor at the Department of Computer Science at VU Amsterdam, as part of the Social AI group. His research is situated in the field of Artificial Intelligence, with a focus on language technology and conversational agents. He has always enjoyed gaming in his free time, and aims to improve upon the degrees of freedom in conversations with NPCs in his research.

Antal van den Bosch is director of the Meertens Institute of the Royal Netherlands Academy for Arts and Sciences, Amsterdam, and Professor of Language and Artificial Intelligence at the University of Amsterdam. His work is in the cross‐section of artificial intelligence (machine learning, natural language processing) and its applications in the humanities and social sciences, drawing from a range of new ‘big’ data types, such as internet data, speech, and brain imaging.

Ben Miller is Senior Lecturer of Technical Writing and Digital Humanities at Emory University in Atlanta, Georgia, where he works on approaches to collective memory grounded in natural language processing, narrative theory, and data science. Some of his past work has looked at hate speech in the chat systems of online games, at the mechanisms and language of online radicalization, at how communities under threat use technology to better enable collective storytelling, and at how data science can help direct aid to communities in need.

For attribution in academic contexts or books, please cite this work as

Wessel Stoop, Florian Kunneman, Antal van den Bosch, Ben Miller, "Catching Cyberbullies With Neural Networks", The Gradient, 2021.

BibTeX citation:

author = {Stoop, Wessel and Kunneman, Florian and van den Bosch, Antal and Miller, Ben},
title = {Catching Cyberbullies With Neural Networks},
journal = {The Gradient},
year = {2021},
howpublished = {\url{} },
