From Raw Text to Clean Tokens
Sentiment analysis sounds simple until you encounter real-world text: tweets with emojis, financial news with ticker symbols, customer reviews mixing five languages, and HTML scraped from product pages. Before any model sees your data, you need a preprocessing pipeline that normalizes this noise without destroying signal. Strip HTML tags with BeautifulSoup, expand contractions (can't → cannot), lower-case everything, and remove non-alphanumeric characters — but keep emoticons if they carry sentiment. Tokenization choices matter too: subword tokenizers like Byte-Pair Encoding handle rare words and typos far better than whitespace splits.
import re
from bs4 import BeautifulSoup

# Order matters: specific contractions first, the generic "n't" fallback last
CONTRACTION_MAP = {"can't": "cannot", "won't": "will not", "n't": " not"}

def clean_text(text: str) -> str:
    text = BeautifulSoup(text, "html.parser").get_text()  # strip HTML tags
    text = text.lower()  # lower-case before matching the lowercase contraction keys
    for contraction, expansion in CONTRACTION_MAP.items():
        text = text.replace(contraction, expansion)
    # Aggressive: this also removes emoticons and emoji, so whitelist any
    # sentiment-bearing ones before this step if they matter in your data
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text
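To make the subword point concrete, here is a toy WordPiece-style greedy longest-match segmenter over a small hypothetical vocabulary. Real tokenizers learn their vocabulary from data; this sketch only illustrates why a rare word like "tokenizers" survives as meaningful pieces while a whitespace split would treat it as a single unseen token.

```python
def wordpiece_split(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword segmentation (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are marked with "##"
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: fall back to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hypothetical learned vocabulary, for illustration only
VOCAB = {"token", "##ize", "##r", "##s", "un", "##happy"}
print(wordpiece_split("tokenizers", VOCAB))  # ['token', '##ize', '##r', '##s']
```

Byte-Pair Encoding works bottom-up by merging frequent character pairs rather than top-down longest match, but the payoff is the same: rare words decompose into known subwords instead of becoming out-of-vocabulary tokens.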
Fine-Tuning a Transformer for Domain-Specific Sentiment
Pre-trained models like distilbert-base-uncased give a strong baseline out of the box, but domain adaptation dramatically improves accuracy. Financial text, for instance, uses “volatile” as a negative signal whereas medical text does not. Fine-tuning on even a few thousand labeled examples in your domain typically closes the gap between a generic model and a specialist one. Use Hugging Face’s Trainer API with a cosine learning rate schedule, a small weight decay for regularization, and early stopping on validation F1. Keep your batch size small if you’re running on a consumer GPU — gradient accumulation lets you simulate larger batches without OOM errors.
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, EarlyStoppingCallback)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

def compute_metrics(eval_pred):
    # metric_for_best_model="f1" only works if compute_metrics reports an "f1" key
    logits, labels = eval_pred
    return {"f1": f1_score(labels, np.argmax(logits, axis=-1), average="macro")}

# texts: list of strings, labels: list of ints — your labeled domain examples
dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(tokenize, batched=True).train_test_split(test_size=0.15)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
args = TrainingArguments(output_dir="./ckpt", num_train_epochs=4, per_device_train_batch_size=16,
                         gradient_accumulation_steps=2,  # effective batch size 32 on a small GPU
                         lr_scheduler_type="cosine", weight_decay=0.01,
                         evaluation_strategy="epoch",
                         save_strategy="epoch",  # must match evaluation_strategy for load_best_model_at_end
                         load_best_model_at_end=True, metric_for_best_model="f1")
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"],
                  compute_metrics=compute_metrics,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
trainer.train()
Serving Predictions with FastAPI
A model that lives in a Jupyter notebook delivers no business value. Wrap your fine-tuned model in a FastAPI endpoint, load it once at startup using a module-level singleton, and return structured JSON with the predicted label and confidence score. Add a simple caching layer with functools.lru_cache for repeated inputs, and instrument the endpoint with Prometheus metrics to track latency and throughput. Deploy as a Docker container behind an Nginx reverse proxy and you have a production sentiment service that can handle hundreds of requests per second on a single CPU instance.