Sarcasm detection in news headlines — on cAInvas

Photo by Su for RaDesign on Dribbble

Sarcasm is the use of words whose literal meaning is the opposite of what the speaker actually intends to convey. It can flip the sentiment of a sentence, which makes sarcasm detection an important part of sentiment analysis.

Most of the datasets available for this purpose rely on tweets written by the public, which can result in noisy data with improper labels. The meaning of a tweet often depends on the thread it belongs to (in the case of replies), so understanding the context of the conversation becomes crucial to labeling the text.

To overcome this, here we use a dataset consisting of news headlines. This is a comparatively reliable dataset as the headlines are written by professionals and the source can be used in labeling the dataset samples.

Implementation of the idea on cAInvas — here!

The dataset

On Kaggle by Rishabh Misra

This dataset is collected from two news websites. The Onion produces sarcastic versions of current events, and headlines from its News in Brief and News in Photos categories (which are sarcastic) were collected. Real (non-sarcastic) news headlines were collected from HuffPost.

Snapshot of the dataset

Looking into the distribution of class values —

Class value distribution

It is an almost balanced dataset. Good to go!
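A minimal sketch of how the class balance could be checked, assuming the dataset is stored as JSON lines with `headline` and `is_sarcastic` fields. The three records below are made-up placeholders, not real samples, and `class_distribution` is a hypothetical helper name.

```python
import json
from collections import Counter

# Made-up records in the JSON-lines layout assumed for the dataset file.
raw_lines = [
    '{"is_sarcastic": 1, "headline": "placeholder sarcastic headline"}',
    '{"is_sarcastic": 0, "headline": "placeholder regular headline"}',
    '{"is_sarcastic": 0, "headline": "another regular headline"}',
]

def class_distribution(lines):
    """Count how many records fall into each is_sarcastic class."""
    records = [json.loads(line) for line in lines]
    return Counter(record["is_sarcastic"] for record in records)

counts = class_distribution(raw_lines)
total = sum(counts.values())
for label, count in sorted(counts.items()):
    print(f"class {label}: {count} samples ({count / total:.1%})")
```

On the real file, each line would be read from disk instead of from the `raw_lines` list.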

Data preprocessing

The headline text has to be cleaned, i.e., URLs and HTML elements are removed (if any). Next, the text is stripped of any character that is not a letter of the alphabet. The functions for this are defined as follows:

Code: Text preprocessing functions
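The cleaning steps described above can be sketched with regular expressions. The patterns and function names here are assumptions for illustration and may differ from the ones used in the notebook.

```python
import re

# Assumed patterns: simple approximations, not exhaustive URL/HTML parsers.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_PATTERN = re.compile(r"<[^>]+>")
NON_LETTER_PATTERN = re.compile(r"[^a-zA-Z]+")

def remove_urls(text):
    """Replace anything that looks like a URL with a space."""
    return URL_PATTERN.sub(" ", text)

def remove_html(text):
    """Replace HTML tags with a space."""
    return HTML_TAG_PATTERN.sub(" ", text)

def keep_letters(text):
    """Replace every run of non-letter characters with a single space."""
    return NON_LETTER_PATTERN.sub(" ", text).strip()

def clean_headline(text):
    """Apply the three cleaning steps in order."""
    return keep_letters(remove_html(remove_urls(text)))

print(clean_headline("<b>Area man</b> wins 3rd award! https://example.com/story"))
```

URLs are removed before the letter filter so that fragments such as "https" and "com" do not survive as words.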

Once the above preprocessing is done, the sentence is converted to lower case and each word is stemmed to its root form. This helps in identifying words that are different forms of the same word (for example, eat, eating, and eats). The Snowball stemmer from the nltk.stem module is used for stemming.

Stopwords are not removed as they can help in delivering the context of the sentence as a whole.

Code: Text preprocessing
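A minimal sketch of the lower-casing and stemming step, assuming the English Snowball stemmer and simple whitespace splitting. `stem_headline` is a hypothetical helper name, and the notebook may split words differently.

```python
from nltk.stem import SnowballStemmer

# The English variant is an assumption; the headlines are in English.
stemmer = SnowballStemmer("english")

def stem_headline(text):
    """Lower-case the text and stem each whitespace-separated word."""
    return " ".join(stemmer.stem(word) for word in text.lower().split())

print(stem_headline("Eating EATS eat"))
```

Stopwords pass through this function unchanged apart from stemming, in line with the decision to keep them.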

The words belonging to each of the classes are stored separately for visualization using WordCloud.

WordCloud: Not sarcastic
WordCloud: Sarcastic

The size of each word is proportional to its frequency in the dataset. No clear difference between the two classes can be inferred from these images.

Train-validation split

The dataset is split into train and validation sets using an 80–20 ratio.

Code: Train-val split
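An 80–20 split can be sketched in plain Python as below. This is an illustration of the idea rather than the notebook's code, which may use a library helper; the seed and the function name `train_val_split` are assumptions.

```python
import math
import random

def train_val_split(samples, val_fraction=0.2, seed=42):
    """Shuffle samples reproducibly and split them into train and validation lists."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(len(samples) * val_fraction)
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val

# Stand-in for the 28,619 headlines (22,895 train + 5,724 validation).
dummy_samples = list(range(28619))
train_set, val_set = train_val_split(dummy_samples)
print(len(train_set), len(val_set))
```

Fixing the seed keeps the split reproducible across runs, which makes validation scores comparable between experiments.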

The train set has 22895 samples and the validation set has 5724 samples.

Tokenization

The Tokenizer class of the keras.preprocessing.text module is used to convert the text into sequences of integers that are given as input to the model. The sequences are then padded with zeros to a fixed length of 200.

Code: Tokenization
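To make the two steps concrete, here is a plain-Python sketch of what tokenization and padding do. It imitates the general behaviour (frequency-ordered word index starting at 1, zeros used for padding) without calling Keras; padding on the left is an assumption, and all function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace known words with their integer index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_sequence(seq, maxlen=200, pad_value=0):
    """Left-pad with zeros (or keep the last maxlen items) to get exactly maxlen items."""
    seq = seq[-maxlen:]
    return [pad_value] * (maxlen - len(seq)) + seq

# Toy, already-stemmed headlines used only for illustration.
train_texts = ["area man win award", "man win"]
word_index = fit_word_index(train_texts)
padded = [pad_sequence(s) for s in texts_to_sequences(train_texts, word_index)]
print(word_index)
print(padded[1][-4:])
```

Index 0 is reserved for padding, which is why the word index starts at 1. The word index should be fitted on the train set only, so that the validation set does not influence the vocabulary.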

The model

The model to be trained consists of an embedding layer that converts the input samples (sparse integer sequences) into dense arrays of fixed size. This is followed by an LSTM layer and three Dense layers; the first two use ReLU activation and the last one uses sigmoid activation.
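Conceptually, an embedding layer is a lookup table from integer token ids to dense vectors. The sketch below shows only that lookup, with made-up sizes and random values; in the real layer the table entries are learned during training, and the function names are hypothetical.

```python
import random

def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """Create one small random vector per token id (a real layer learns these values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

def embed(sequence, table):
    """Replace each token id in the sequence with its dense vector."""
    return [table[token_id] for token_id in sequence]

# Made-up sizes: 5 token ids (0 is padding), 3 values per vector.
table = make_embedding_table(vocab_size=5, embedding_dim=3)
dense = embed([0, 0, 1, 2], table)
print(len(dense), len(dense[0]))
```

Every occurrence of the same token id maps to the same vector, so a padded sequence of 200 ids becomes a 200-row array that the LSTM reads one row at a time.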

The model is compiled using the BinaryCrossentropy loss as there are only two classes, 0 and 1. The Adam optimizer is used and the model’s accuracy is tracked.
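For reference, binary cross-entropy over labels in {0, 1} and predicted probabilities can be written out by hand as follows. This is a sketch of the standard formula, not the Keras implementation; the clipping constant `eps` is an assumption to avoid taking the log of zero.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Toy predictions: the same two labels, predicted well and predicted badly.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss is small when the predicted probability is close to the true label and grows quickly for confident wrong predictions.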

The EarlyStopping callback of the keras.callbacks module monitors a metric (val_loss by default) and stops training if the metric doesn’t improve (decrease for a loss, increase for accuracy) for a given number of epochs, set by the patience parameter. The restore_best_weights parameter ensures that the weights from the epoch with the lowest validation loss (or the best value of the monitored metric) are restored to the model.

Code: Model and training
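The logic behind patience-based early stopping with best-weight restoration can be sketched without Keras. The loss history, the patience value of 3, and the function name below are made up for illustration; the notebook's actual settings are not reproduced here.

```python
import copy

def train_with_early_stopping(val_losses, weights_per_epoch, patience=3):
    """Simulate patience-based early stopping over a precomputed val_loss history."""
    best_loss = float("inf")
    best_weights = None
    epochs_without_improvement = 0
    epochs_run = 0
    for loss, weights in zip(val_losses, weights_per_epoch):
        epochs_run += 1
        if loss < best_loss:
            # New best: remember the loss and a snapshot of the weights.
            best_loss, best_weights = loss, copy.deepcopy(weights)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epochs_run, best_loss, best_weights

# Made-up validation losses; each "weights" snapshot just records its epoch.
history = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]
snapshots = [{"epoch": i} for i in range(1, len(history) + 1)]
epochs_run, best_loss, best_weights = train_with_early_stopping(history, snapshots)
print(epochs_run, best_loss, best_weights)
```

In this toy history, training stops before the final epoch is reached and the snapshot from the best epoch seen so far is returned, which mirrors what restoring the best weights achieves.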

The model achieved ~83.5% accuracy on the test set after training with a learning rate of 0.01.

Test set predictions

Peeking into the confusion matrix to understand the results better —

Confusion matrix
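The four cells of a binary confusion matrix, and the metrics derived from them, can be computed as in the sketch below. The labels and predictions are toy values rather than the article's results, and class 1 (sarcastic) is treated as the positive class by assumption.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 (sarcastic) as positive."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def summarize(counts):
    """Derive accuracy, precision, and recall from the four confusion counts."""
    total = sum(counts.values())
    return {
        "accuracy": safe_ratio(counts["tp"] + counts["tn"], total),
        "precision": safe_ratio(counts["tp"], counts["tp"] + counts["fp"]),
        "recall": safe_ratio(counts["tp"], counts["tp"] + counts["fn"]),
    }

# Toy labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_counts(y_true, y_pred)
metrics = summarize(cm)
print(cm, metrics)
```

Looking at false positives and false negatives separately shows whether the model leans toward flagging regular headlines as sarcastic or toward missing sarcastic ones, which accuracy alone hides.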

The metrics

Plot of accuracies
Plot of losses

Prediction

Performing predictions on random test samples —

Code: Prediction on test samples
Sample prediction on a random test sample
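Since the last layer uses a sigmoid activation, the model outputs a probability that has to be turned into a class label. A minimal sketch, assuming a threshold of 0.5 and that class 1 means sarcastic; the function names are hypothetical.

```python
import math

def sigmoid(x):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def label_from_probability(probability, threshold=0.5):
    """Map a predicted probability to a class name using a fixed threshold."""
    return "Sarcastic" if probability >= threshold else "Not sarcastic"

# Made-up raw scores standing in for the model's pre-activation outputs.
for score in (2.0, -1.5):
    probability = sigmoid(score)
    print(f"{probability:.3f} -> {label_from_probability(probability)}")
```

The threshold can be moved away from 0.5 if false positives and false negatives are not equally costly.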

deepC

The deepC library, compiler, and inference framework is designed to enable and run deep learning neural networks by focusing on the features of small form-factor devices such as microcontrollers, eFPGAs, CPUs, and other embedded devices like Raspberry Pi, ODROID, Arduino, SparkFun Edge, RISC-V, mobile phones, and x86 and Arm laptops, among others.

Compiling the model using deepC —

deepC compilation

Head over to the cAInvas platform (link to notebook given earlier) to run and generate your own .exe file!

Credits: Ayisha D
