Building a Transformer-Based Hate Speech Detector
Here's the poster without all the talking:
I will be referring to X as Twitter for the entirety of this post. Apologies to those who are particular about such things.
Table of Contents
- Introduction
- Part 1: Scraping and Hubris
- Part 2: Pivots and Preprocessing
- Part 3: Letting the Machine Learn and Results
- Next Steps
- Conclusion
Introduction
This semester, I joined CAIS++, the student branch of the USC Center for AI in Society (CAIS). New members participate in a semester-long curriculum that serves as an introduction to everything from linear regression to transformers and attention. To cap everything off, curriculum members partner up and build a final project: a model with some application in performing social good.
The norm is to take a nice, clean, pruned Kaggle dataset and train off that, but at this point in the semester I was quite tired of at least some of my classes and obligations and was craving a justified, productive distraction.
My partner, Allison, and I, after several long tangents, decided to build a hate speech classifier trained on appropriately annotated tweets. We found a great dataset of roughly 60,000 tweets sorted into six categories (spam, insult, porn, etc.). The paper had an associated CSV living in a Bitbucket repo. We clicked on it, ready to see a lovely spreadsheet of text, usernames, all the fixings.
It was 7 columns. 6 columns for classes and one for Tweet IDs. No content.
Part 1: Scraping and Hubris
As you probably know, the Twitter API has become prohibitively expensive, at least for a small project like this. Allison and I would have to scrape the tweets to get any meaningful portion of them before the project was due. She was quite confident about our chances; I was less so, but wasn't against giving it a shot.
We first tried all the classics: BeautifulSoup, Selenium, grabbing the raw HTML, and requesting the syndication API URL. Not exhaustively or thoroughly, honestly, but enough of a probe to determine that they wouldn't work well. I ultimately landed on using Playwright to render a displayed Chromium instance (headless doesn't work!) and visit, then extract from, each tweet URL. In our limited attempts, this seemed to work well enough and hit more tweets than the other methods before being outright rejected by Twitter. It was honestly hard to tell when we were being rate limited, though: many of the tweets were deleted or hidden, which returns the same page as a rate limit (considering these were in a cyberbullying dataset, and were also pretty dated, I wasn't surprised).
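For the curious, here's a minimal sketch of that Playwright loop. It's not the exact script; the tweetText selector and the timeout are illustrative assumptions.

```python
# Sketch of the Playwright-based tweet scraper; selector and timeout are assumptions.
from typing import Optional
from playwright.sync_api import sync_playwright

def scrape_tweet_text(url: str, timeout_ms: int = 15000) -> Optional[str]:
    with sync_playwright() as p:
        # headless=False: Twitter served us empty pages when the browser was headless
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            page.goto(url, timeout=timeout_ms)
            # The tweet body is rendered client-side, so wait for it to appear
            page.wait_for_selector('article [data-testid="tweetText"]', timeout=timeout_ms)
            text = page.inner_text('article [data-testid="tweetText"]')
        except Exception:
            # Deleted, hidden, and rate-limited tweets all look the same: no tweet body
            text = None
        finally:
            browser.close()
    return text
```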
This script was tested and iterated on (mouse movement, exponential backoffs) locally. That is, Allison and I would sit in our school library, letting Playwright repeatedly render a subset of random tweets on my 15-inch Mac’s screen blaring at full brightness in a public space— one brimming with students cramming for the end of the semester.
Remember that pornography category? Yeah…
A new CSV with a simple pandas df drop was hurriedly made.
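Something along these lines, assuming the class columns were named after their categories:

```python
# Rough equivalent of the hurried fix: drop porn-labeled rows before rendering
# anything else on a library screen. The column name "porn" is an assumption.
import pandas as pd

df = pd.read_csv("tweets_labeled.csv")
df = df[df["porn"] != 1]          # keep only rows not flagged as porn
df.to_csv("tweets_labeled_safe.csv", index=False)
```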
Once we found something that seemingly worked, the next step was to deploy it. Allison and I made it a competition to maximize the tweet yield from our assigned half of the dataset (I had the first 25,000). My weapon of choice was AWS Batch, a simplified, job-based deployment service that sits on top of compute options like Fargate and EKS.
I containerized the scraper into a Docker image and hosted it on AWS Elastic Container Registry. Batch then runs those images as Fargate jobs under an IAM role with the appropriate permissions. Each job definition can take in arguments, through which I supplied parameters like which rows a single job was responsible for, backoff times, and tweets per browser session (which affects runtime as well as rate limits).
I used a shell script to launch jobs at scale through the AWS CLI, spreading them across regions (again, in an effort to at least evade the scraping detection). Every successfully scraped tweet was synced to a DynamoDB table.
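The real launcher was a shell script over the AWS CLI; here's a rough boto3 sketch of the same idea. The queue name, job definition name, parameter keys, and region list are placeholders, and it assumes the job definition is registered in each region.

```python
# boto3 sketch of launching row-sliced scraper jobs across regions.
import boto3

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]   # spread jobs to vary egress IPs
ROWS_PER_JOB = 500

def launch_jobs(start_row: int, end_row: int) -> None:
    for i, lo in enumerate(range(start_row, end_row, ROWS_PER_JOB)):
        region = REGIONS[i % len(REGIONS)]
        batch = boto3.client("batch", region_name=region)
        batch.submit_job(
            jobName=f"tweet-scrape-{lo}",
            jobQueue="scraper-queue",             # assumed queue name
            jobDefinition="tweet-scraper:1",      # assumed job definition
            parameters={                          # substituted into the container command
                "startRow": str(lo),
                "endRow": str(min(lo + ROWS_PER_JOB, end_row)),
                "backoffSeconds": "30",
                "tweetsPerSession": "20",
            },
        )

launch_jobs(0, 25000)
```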
Besides all this, I also tried setting up both free and paid proxies. I had trouble getting either to be consistent, whether it was the quality of the free addresses or the authentication on the paid services (there are only so many ways to format an HTTP request, especially within the constraints of the docs; surely one of them should've worked...).
The night that all the setup and testing was done, I launched an unholy number of jobs (I was curious how many tweets per instance was optimal, so that was part of it). As I was walking Allison back to her apartment off-campus, we discussed our expectations. I was hoping we would get about 20,000 of the 50,000 tweets. She had similar estimates.
“Watch us wake up to like 200 in the morning.”
I woke up to 700 of 25,000 attempted tweets.
My takeaways:
- People say some pretty wild things on the net. My heart goes out to content moderators everywhere.
- How to dockerize and deploy simple scripts, plus some other neat tricks
- The Twitter API is expensive. I did the math, and even with my horribly inefficient yield, my cost per tweet was still only about 2x Twitter's API pricing.
- Observability is tough even with only hundreds of jobs (aggregating results and managing runs)
- I won the competition! Allison got like 200 with Redis and a couple EC2 instances.
Part 2: Pivots and Preprocessing
700 is unfortunately peanuts compared to what we'd need for a sufficient training dataset. Back to the paper search! We looked at several possible datasets, including some synthetically generated and others human-annotated. We requested a couple of datasets, and got this one. Big thanks to Zeerak Talat for incredible response times and the dataset we ultimately ended up working with!
Some background: the researchers had both expert annotators (feminist and anti-racism activists) and amateur crowdworkers label around 6,900 tweets for racism and sexism, building on top of previous work. Interestingly, they found that experts were more conservative in labeling content as hate speech, and models trained on expert annotations outperformed those trained on crowdworker labels, suggesting that identifying hate speech requires deeper domain knowledge than traditional crowdsourcing might provide.
We received the CSV of our hopes and dreams, filled with the media links, usernames, and more that we had hoped for in the first one. This consisted of about 22,000 tweets marked as containing sexism, racism, both, or none. We decided to remove links, anonymize usernames, and discard outlier n-grams like #mkr (a hashtag for an Australian cooking show that represented almost 2% of sexist tweets). After that, I scraped all of the images (that the API would let me) from the tweets and stored them in an S3 bucket.
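The cleaning boiled down to a few regexes and a filter. Here's a sketch; the regexes, file name, and column names are illustrative rather than the exact ones we used.

```python
# Sketch of the text preprocessing: strip links, anonymize mentions, drop #mkr.
import re
import pandas as pd

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text: str) -> str:
    text = URL_RE.sub("", text)           # remove links
    text = MENTION_RE.sub("@user", text)  # anonymize usernames
    return text.strip()

df = pd.read_csv("annotated_tweets.csv")
df["text"] = df["text"].astype(str).map(clean_tweet)
# Drop the outlier hashtag that dominated the sexism class
df = df[~df["text"].str.contains("#mkr", case=False)]
df.to_csv("annotated_tweets_clean.csv", index=False)
```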
After all the preprocessing, we ended up with:
- 19,770 tweets
- 1,028 images
Were all those tweets from earlier wasted then? Not quite! We asked GPT-4 to classify them and added them to our test set when training the model.
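The labeling pass looked roughly like this. It's a sketch using the current openai Python client; the prompt and the fallback behavior are my reconstruction, not the exact ones we used.

```python
# Sketch of labeling scraped tweets with GPT-4 (openai>=1.0 client).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["neither", "racism", "sexism", "both"]

def label_tweet(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Classify the tweet as exactly one of: neither, racism, sexism, both."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neither"  # fall back to the majority class
```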
Takeaways
- How to do dataframe manipulations
- I should get better at regex
- Competition shows bring out the worst in people
Part 3: Letting the Machine Learn and Results
We fine-tuned two base models: GPT-2 and BERT. Most papers we saw leveraged some version of BERT (HateBERT, ToxDectRoBERTa) or simple n-gram models, so we decided to try both BERT and GPT-2 small to see if the base model made a difference.
I trained the GPT variants, and Allison the BERT ones.
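For reference, here's a minimal sketch of fine-tuning the 4-class GPT-2 variant with Hugging Face transformers. The file name, column names, and hyperparameters are assumptions, not our exact setup; the CSV is assumed to have a "text" column and an integer "label" column (0 to 3).

```python
# Sketch: fine-tune GPT-2 small for 4-way sequence classification.
from datasets import load_dataset
from transformers import (GPT2TokenizerFast, GPT2ForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=4)
model.config.pad_token_id = tokenizer.pad_token_id

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

ds = load_dataset("csv", data_files="annotated_tweets_clean.csv")["train"]
ds = ds.train_test_split(test_size=0.1)

args = TrainingArguments(output_dir="gpt2-hate", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"].map(tokenize, batched=True),
                  eval_dataset=ds["test"].map(tokenize, batched=True))
trainer.train()
```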
I ended up with 4 different models that all classified a tweet as being both racist and sexist, just racist, just sexist, or neither:
- Fine-tuned GPT-2 for Sequence Classification with 4 classes
- Ensemble of two binary classification GPT-2s (one for racism, one for sexism)
- Fine-tuned GPT-2 + CLIP Multimodal Model with 4 classes
- Fine-tuned GPT-2 + CLIP Multimodal Ensemble (Two binary classifiers, once again).
Here’s what the GPT-2 + CLIP model looked like:
We use a simple linear layer sized to the GPT-2 last hidden state concatenated with the CLIP embedding, followed by a classification layer that outputs over either 2 or 4 classes.
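In PyTorch terms, the fusion head amounts to something like this. It's a sketch, not the exact class we trained; the dimensions assume GPT-2 small (768) and CLIP ViT-B/32 image embeddings (512).

```python
# Sketch of the GPT-2 + CLIP fusion head: concatenate, linear layer, classify.
import torch
import torch.nn as nn

class GPT2CLIPClassifier(nn.Module):
    def __init__(self, text_dim: int = 768, image_dim: int = 512, num_classes: int = 4):
        super().__init__()
        fused_dim = text_dim + image_dim
        self.fuse = nn.Linear(fused_dim, fused_dim)    # linear layer over the concatenation
        self.classify = nn.Linear(fused_dim, num_classes)

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([text_emb, image_emb], dim=-1)   # [batch, text_dim + image_dim]
        x = torch.relu(self.fuse(x))
        return self.classify(x)                        # logits over 2 or 4 classes

# text_emb would be GPT-2's final hidden state at the last token; image_emb the
# output of CLIPModel.get_image_features (zeros for text-only tweets).
logits = GPT2CLIPClassifier()(torch.randn(8, 768), torch.randn(8, 512))
```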
Here are the F1 score results on the text only dataset:
| MODEL | NEITHER | RACISM | SEXISM | BOTH |
|---|---|---|---|---|
| BERT Ensemble | 0.8901 | 0.7554 | 0.7722 | 0.0000 |
| BERT | 0.8878 | 0.7505 | 0.7763 | 0.0000 |
| GPT2 Ensemble | 0.9214 | 0.7941 | 0.8581 | 0.0444 |
| GPT2 | 0.8810 | 0.7577 | 0.7623 | 0.0000 |
Results on the image/text tweets:
| MODEL | ACCURACY |
|---|---|
| GPT2/CLIP | 0.864 |
| GPT2/CLIP ENSEMBLE | 0.830 |
| GPT2 | 0.857 |
I didn’t end up getting to calculate the F1 scores per class for the multimodal models due to time constraints, but they weren’t great either, as you might guess based on the F1 scores for text classification. Here’s the distribution of data:
As you can see, the model can get away with a pretty high accuracy just by classifying everything as “none”. It also almost never encounters the “both” category, so it’s never penalized and doesn’t learn it. To improve this, we could’ve leaned into strategies like oversampling or stratification, but decided not to for this project.
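For instance, oversampling could be as simple as a weighted sampler that draws the rare classes more often. A sketch with toy data standing in for the real training set:

```python
# Sketch: oversample rare classes with a WeightedRandomSampler (toy data).
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Toy stand-in for the real training set: features + class ids (0=neither ... 3=both)
labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 2, 3])
features = torch.randn(len(labels), 768)
dataset = TensorDataset(features, labels)

class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]     # rarer classes get drawn more often
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)
```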
Also, embeddings of the image didn’t do much— here’s a visualization of what the (post-normalization) contributions of the respective text and image embeddings look like:
My theory for CLIP’s poor performance is that many of the images were about the same concept, but expressed in varying ways. Consider Islamophobic media, of which there was a lot: political cartoons and news headlines may be used to express the same hate. Truly, hateful speech can exist across many different concepts and semantic spaces, even within the scope of race or gender.
I resuscitated the notebook out of curiosity and did a quick t-SNE map to see what the tweet embeddings looked like:
As you can see, there are some clear clusters, especially for sexism. But the large majority of the data seems interspersed: the embeddings don’t seem to provide much signal to improve the model’s performance, and may instead add noise, which would explain why their learned contribution magnitude ends up lower.
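For anyone wanting to reproduce a map like this, here's a sketch using scikit-learn's t-SNE; the embeddings and labels below are random stand-ins for the real per-tweet embeddings and classes.

```python
# Sketch: 2-D t-SNE map of tweet embeddings, colored by class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(500, 768)        # stand-in for per-tweet embeddings
labels = np.random.randint(0, 4, size=500)    # 0=neither, 1=racism, 2=sexism, 3=both

coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE of tweet embeddings (illustrative)")
plt.show()
```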
Takeaways:
- I sure am glad I had leftover Colab compute credits
- Data matters and accuracy as a metric can be deceiving
Next Steps
- I'd like to see how a GPT-4 class model performs on this classification task. I don't expect it to hit the 95%-and-above range, given a combination of these models being somewhat reluctant to condemn (in my limited experience annotating the scraped tweets) and the task being largely subjective
- Generalizing this to other social media platforms
- Improving the multimodal aspect with VLMs for image captioning (I expect a transcription of the image to capture more meaning), or maybe another embedding model would work better
- Addressing the class distribution in the training data
Conclusion
Overall, this was a really fun project! I learned a lot, but, writing this about a month after finishing it, I find a lot of room for improvement not only in methodologies but also implementations. The big technical challenges were writing a classifier class that could take in multimodal and text-only data and switch between different numbers of classes, as well as building a distributed scraper and all the steps to get there. On the ML side of things, simple CLIP embeddings seem to be insufficient for capturing subtle meanings in images. There are also many more factors to consider before something like this could be deployed in the real world, but it's a solid start.
——————————————————————
If you have ANY questions, suggestions, or critiques at all or would simply like to discuss more, please shoot me an email, DM, etc!