Fine-Tuning BERT and DistilBERT for Sentiment Analysis
Sentiment analysis is a crucial Natural Language Processing (NLP) task that aims to understand the sentiment expressed in a piece of text. By categorizing sentiment as positive, negative, or neutral, businesses and organizations can gain insights from customer feedback, product reviews, and social media trends.
In this project, we explore sentiment analysis of movie reviews using two popular pre-trained transformer models: BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT (a smaller, faster adaptation of BERT).
The source code is available on my GitHub: https://github.com/Kili66/SentimentAnalysis_BERT_DistilBERT?tab=readme-ov-file
BERT and DistilBERT
1. BERT (Bidirectional Encoder Representations from Transformers):
- Developed by Google AI, BERT is a powerful pre-trained language model that excels at various NLP tasks, including sentiment analysis.
- BERT’s bidirectional architecture captures context from both the left and the right of each token, which makes it effective at understanding meaning in text.
2. DistilBERT (Distilled Bidirectional Encoder Representations from Transformers):
- DistilBERT is a smaller and faster version of BERT, designed for deployment on devices with limited resources.
- Despite its smaller size, DistilBERT retains good performance, making it an attractive alternative for sentiment analysis.
The Dataset
We utilize a public dataset of movie reviews from Kaggle: the IMDB Dataset of 50K Movie Reviews. It contains 50,000 movie reviews along with their corresponding sentiment labels (positive or negative).
Source: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
Methodology
The sentiment analysis process involves the following steps:
1. Data Loading and Preprocessing:
- Load the movie review dataset.
- Clean and pre-process the text data (a minimal cleaning sketch follows this step), including:
- Converting sentiment labels to numeric values.
- Removing HTML tags and URLs.
- Lowercasing text.
- Removing stop words.
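A minimal cleaning sketch is shown below. It assumes the Kaggle CSV has been loaded into a pandas DataFrame with review and sentiment columns (the column names used in the IMDB dataset) and that NLTK's English stop-word list is available; the exact cleaning rules in the project code may differ slightly.

```python
import re

import pandas as pd
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

df = pd.read_csv("IMDB Dataset.csv")  # columns: review, sentiment

# Convert the sentiment labels to numeric values (positive -> 1, negative -> 0).
df["label"] = df["sentiment"].map({"positive": 1, "negative": 0})

stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = re.sub(r"<.*?>", " ", text)             # remove HTML tags such as <br />
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = text.lower()                            # lowercase everything
    words = [w for w in text.split() if w not in stop_words]  # drop stop words
    return " ".join(words)

df["review"] = df["review"].apply(clean_text)
```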
2. Data Splitting:
- Split the data into training and testing sets.
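A short splitting sketch with scikit-learn follows; the 80/20 ratio and the random seed are assumptions rather than values stated in the write-up.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["review"].tolist(),
    df["label"].tolist(),
    test_size=0.2,           # assumed 80/20 train/test split
    random_state=42,         # assumed seed for reproducibility
    stratify=df["label"],    # keep the positive/negative balance in both sets
)
```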
3. Tokenization and Encoding:
- The BERT/DistilBERT tokenizer from the Hugging Face transformers library is used to convert the text into BERT-compatible subword tokens.
- Padding is applied to ensure all sequences have the same length, and truncation is used for longer reviews to fit within the maximum sequence length allowed by BERT.
- An attention mask is created to distinguish valid word pieces from padding tokens during training.
- Encode the tokenized text into numerical representations suitable for the model.
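A tokenization sketch using the Hugging Face tokenizers is given below; the checkpoint names and max_length=256 are assumptions (BERT's hard limit is 512 tokens).

```python
from transformers import BertTokenizer, DistilBertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# For the DistilBERT variant:
# distil_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

train_encodings = bert_tokenizer(
    X_train,
    padding=True,         # pad all sequences to the same length
    truncation=True,      # truncate reviews longer than max_length
    max_length=256,       # assumed value; BERT accepts at most 512 tokens
    return_tensors="tf",  # returns input_ids and attention_mask as TF tensors
)
test_encodings = bert_tokenizer(
    X_test, padding=True, truncation=True, max_length=256, return_tensors="tf"
)
```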
4. Model Training:
- Define and compile the BERT and DistilBERT models for sentiment classification.
- Configure the main hyperparameters (learning rate, batch size, number of epochs) to find an optimal configuration.
- TensorFlow Datasets: Training and testing data were converted into TensorFlow datasets for efficient training.
- Early Stopping: Early stopping was implemented to prevent overfitting by monitoring validation loss.
- The models were trained on the prepared training dataset in batches.
- Model Compilation:
- Optimizer: Adam with a learning rate of 2e-5.
- Loss: SparseCategoricalCrossentropy. This loss is used when the classes are mutually exclusive and the targets are integer labels; for binary classification it expects labels as integers (0 or 1) and a final layer with two output neurons and a softmax activation (or raw logits with from_logits=True).
- Metric: accuracy.
- The number of epochs was kept small (2) because of the long training time and limited compute resources.
- Train the models on the training data (a combined training sketch follows this list).
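A combined training sketch for the DistilBERT variant is shown below (the BERT version is analogous with TFBertForSequenceClassification). The learning rate (2e-5), loss, early stopping, and 2 epochs follow the settings above; the batch size and the use of the test set as the validation set are assumptions made to keep the sketch short. The Hugging Face classification head outputs raw logits, so the loss is configured with from_logits=True.

```python
import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification

# Wrap the encoded inputs and integer labels as TensorFlow datasets.
train_ds = tf.data.Dataset.from_tensor_slices(
    (dict(train_encodings), y_train)
).shuffle(1000).batch(16)                      # batch size of 16 is an assumption
test_ds = tf.data.Dataset.from_tensor_slices(
    (dict(test_encodings), y_test)
).batch(16)

model = TFDistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2    # two classes: negative / positive
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Early stopping on validation loss to prevent overfitting; here the test set
# doubles as the validation set purely for brevity.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=1, restore_best_weights=True
)

model.fit(train_ds, validation_data=test_ds, epochs=2, callbacks=[early_stop])
```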
5. Model Evaluation:
- Evaluate the trained models on the testing data using accuracy, precision, recall, and F1-score from scikit-learn's classification report.
- Confusion matrices were created to visualize the model’s performance.
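An evaluation sketch using scikit-learn; it reuses the model, test_ds, and y_test objects from the sketches above.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predict logits on the test set and take the most likely class per review.
logits = model.predict(test_ds).logits
y_pred = np.argmax(logits, axis=1)

# Precision, recall, F1-score, and accuracy per class.
print(classification_report(y_test, y_pred, target_names=["negative", "positive"]))

# Confusion matrix (rows: true labels, columns: predicted labels).
print(confusion_matrix(y_test, y_pred))
```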
6. Results:
The figures below are DistilBERT's test-set scores, with BERT's corresponding scores given for comparison.
1. Accuracy: 91% of the reviews in the test set were classified correctly, matching the accuracy achieved by the BERT model (91%).
2. Precision:
- Negative Class: 89% of the predicted negative reviews were actually negative (slightly lower than BERT’s 93%).
- Positive Class: 93% of the predicted positive reviews were actually positive (slightly higher than BERT’s 90%).
3. Recall:
- Negative Class: 93% of the actual negative reviews were correctly classified (slightly higher than BERT’s 89%).
- Positive Class: 89% of the actual positive reviews were correctly classified (slightly lower than BERT’s 93%).
4. F1-Score: The F1-score for both classes is around 0.91, indicating a good balance between precision and recall.
These results suggest that DistilBERT is a viable alternative to BERT for sentiment analysis, offering comparable performance while potentially being faster and more lightweight due to its smaller size.
DistilBERT for Efficiency
While BERT achieved excellent performance, its computational demands posed challenges. To explore a more memory-efficient alternative, the DistilBERT model was investigated.
Conclusion
In conclusion, this project successfully explored sentiment analysis of IMDB movie reviews using pre-trained transformer models, BERT and DistilBERT. Both models demonstrated high accuracy and effectiveness in classifying reviews as positive or negative. While BERT remains powerful, DistilBERT offers comparable performance with potential advantages in efficiency due to its smaller size.
Explore the DistilBERT model on the Hugging Face Model Hub: https://huggingface.co/MariamKili/my_distilbert_model
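A short usage sketch for the published model with the transformers pipeline is shown below; it assumes the repository ships the tokenizer and model weights together (pass framework="tf" if only TensorFlow weights are available), and the printed label names depend on the model's configuration.

```python
from transformers import pipeline

# Load the fine-tuned DistilBERT sentiment model from the Hugging Face Hub.
classifier = pipeline("text-classification", model="MariamKili/my_distilbert_model")

print(classifier("This movie was an absolute delight from start to finish."))
# Example output shape: [{'label': '...', 'score': 0.99}]
```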