
Building a Sentiment Analysis Pipeline with Python



From Raw Text to Clean Tokens

Sentiment analysis sounds simple until you encounter real-world text: tweets with emojis, financial news with ticker symbols, customer reviews mixing five languages, and HTML scraped from product pages. Before any model sees your data, you need a preprocessing pipeline that normalizes this noise without destroying signal. Strip HTML tags with BeautifulSoup, expand contractions (can't → cannot), lower-case everything, and remove non-alphanumeric characters, but keep emoticons if they carry sentiment. Tokenization choices matter too: subword tokenizers like Byte-Pair Encoding handle rare words and typos far better than whitespace splits.

import re
from bs4 import BeautifulSoup

# Order matters: specific contractions first, then the generic "n't" fallback
# (dicts preserve insertion order in Python 3.7+).
CONTRACTION_MAP = {"can't": "cannot", "won't": "will not", "n't": " not"}

def clean_text(text: str) -> str:
    # Strip HTML tags, keeping only the visible text.
    text = BeautifulSoup(text, "html.parser").get_text()
    text = text.lower()
    for contraction, expansion in CONTRACTION_MAP.items():
        text = text.replace(contraction, expansion)
    # Caution: this regex also removes emoticons like ":)" — relax it
    # if punctuation-based sentiment signals matter for your data.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    text = re.sub(r"\s+", " ", text).strip()
    return text

Fine-Tuning a Transformer for Domain-Specific Sentiment

Pre-trained models like distilbert-base-uncased give a strong baseline out of the box, but domain adaptation dramatically improves accuracy. Financial text, for instance, uses “volatile” as a negative signal whereas medical text does not. Fine-tuning on even a few thousand labeled examples in your domain typically closes the gap between a generic model and a specialist one. Use Hugging Face’s Trainer API with a cosine learning rate schedule, a small weight decay for regularization, and early stopping on validation F1. Keep your batch size small if you’re running on a consumer GPU — gradient accumulation lets you simulate larger batches without OOM errors.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import numpy as np
from sklearn.metrics import f1_score

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# texts and labels are your labeled in-domain examples (list[str], list[int])
dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(tokenize, batched=True).train_test_split(test_size=0.15)

# metric_for_best_model="f1" requires a compute_metrics that returns an "f1" key
def compute_metrics(eval_pred):
    logits, gold = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(gold, preds, average="macro")}

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
args  = TrainingArguments(output_dir="./ckpt", num_train_epochs=4, per_device_train_batch_size=16,
                          lr_scheduler_type="cosine", weight_decay=0.01,
                          evaluation_strategy="epoch", save_strategy="epoch",  # must match for load_best_model_at_end
                          load_best_model_at_end=True, metric_for_best_model="f1")
trainer = Trainer(model=model, args=args, train_dataset=dataset["train"], eval_dataset=dataset["test"],
                  compute_metrics=compute_metrics)
trainer.train()
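The gradient accumulation trick mentioned above is a one-line change to the training arguments. A minimal config sketch (the `low_mem_args` name and the specific numbers are illustrative, not from the original setup):

```python
from transformers import TrainingArguments

# Effective batch size = 4 (per device) x 8 (accumulation steps) = 32,
# but only 4 examples' activations sit in GPU memory at any moment.
low_mem_args = TrainingArguments(
    output_dir="./ckpt",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
)
```

The optimizer steps only once per 8 forward/backward passes, so gradients average over 32 examples — the same statistics as a batch of 32, at a quarter of the memory.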

Serving Predictions with FastAPI

A model that lives in a Jupyter notebook delivers no business value. Wrap your fine-tuned model in a FastAPI endpoint, load it once at startup using a module-level singleton, and return structured JSON with the predicted label and confidence score. Add a simple caching layer with functools.lru_cache for repeated inputs, and instrument the endpoint with Prometheus metrics to track latency and throughput. Deploy as a Docker container behind an Nginx reverse proxy and you have a production sentiment service that can handle hundreds of requests per second on a single CPU instance.