01. Text Sentiment Analysis
1. Overview and Scenario
Situation: Thousands of reviews pour in every day. If the rating is 5 stars but the review says "I'm angry because shipping was so late, but the product is good", should we consider this positive? Let's quantify the customer sentiment hidden in text for analysis.
2. Data Preparation
We'll analyze the review text (review_body) from the raw_reviews_relabeled table.
BigQuery (SQL)
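from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()

# Load the columns used in this chapter into the `reviews` DataFrame
# (assumes relabeled_category is stored in the same table as review_body)
sql = "SELECT review_id, review_body, relabeled_category FROM `your-project-id.retail_analytics_us.raw_reviews_relabeled`"
reviews = client.query(sql).to_dataframe()

3. Keyword-Based Sentiment Analysis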
The simplest method is to check for "good words" and "bad words".
Problem 1: Find Positive/Negative Keywords
Q. If the review text contains bad, poor, terrible, late, etc., classify it as "Negative". If it contains good, great, excellent, etc., classify it as "Positive". Otherwise, classify it as "Neutral".
BigQuery (SQL)
Hint: Using REGEXP_CONTAINS lets you find multiple patterns at once.
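Solution
SELECT
  review_id,
  review_body,
  CASE
    -- Negative is tested first, so a review containing both kinds of words is 'Negative'.
    -- These patterns match substrings: 'late' also matches 'chocolate'.
    -- Wrap a pattern in \b(...)\b to match whole words only.
    WHEN REGEXP_CONTAINS(LOWER(review_body), r'bad|poor|terrible|late|slow|worst') THEN 'Negative'
    WHEN REGEXP_CONTAINS(LOWER(review_body), r'good|great|excellent|love|best') THEN 'Positive'
    ELSE 'Neutral'
  END AS sentiment_category
FROM `your-project-id.retail_analytics_us.raw_reviews_relabeled`
LIMIT 10;

4. Using NLP Libraries (TextBlob)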
Simple keyword matching may incorrectly classify "Not good" as "Positive". Specialized NLP libraries take context into account to some degree and return a numeric score (TextBlob's polarity ranges from -1.0 for most negative to 1.0 for most positive).
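To see the keyword failure concretely, here is a minimal Python sketch of the rule from Problem 1. It reuses the same keyword lists and the same negative-first order as the SQL solution. The function name keyword_sentiment and the sample sentences are made up for illustration.

```python
import re

# Same keyword lists as the SQL solution in Problem 1 (substring matching).
NEGATIVE = re.compile(r"bad|poor|terrible|late|slow|worst")
POSITIVE = re.compile(r"good|great|excellent|love|best")


def keyword_sentiment(text: str) -> str:
    """Classify text by keyword presence; negative keywords are checked first."""
    lowered = text.lower()
    if NEGATIVE.search(lowered):
        return "Negative"
    if POSITIVE.search(lowered):
        return "Positive"
    return "Neutral"


# Hypothetical sample sentences, not taken from the dataset.
examples = ["Not good at all", "Not bad, actually", "Arrived on time"]
for sentence in examples:
    print(f"{sentence!r} -> {keyword_sentiment(sentence)}")
# 'Not good at all' -> Positive
# 'Not bad, actually' -> Negative
# 'Arrived on time' -> Neutral
```

Both negated sentences get the opposite of the label a human would give, because the rule only sees that a keyword is present and ignores the "not" in front of it.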
Problem 2: Calculate Sentiment Score (Polarity Score)
Q. Use Python's TextBlob (or a BigQuery Remote Function) to calculate the sentiment score of each review.
Pandas (TextBlob)
Hint: Use TextBlob(text).sentiment.polarity.
Solution
from textblob import TextBlob

# Sentiment score calculation function
def get_sentiment(text):
    return TextBlob(str(text)).sentiment.polarity

reviews['sentiment_score'] = reviews['review_body'].apply(get_sentiment)

# Check results
print(reviews[['review_body', 'sentiment_score']].head())

# Average score
print(f"Average sentiment score: {reviews['sentiment_score'].mean():.4f}")

5. Sentiment Analysis and Visualization by Category
Let's find out which product category has the most customer complaints.
Problem 3: Find the Worst Categories
Q. Using the sentiment scores calculated above, compute the average sentiment score and the negative review ratio (share of reviews with score < 0) for each category, then visualize the result.
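Before opening the solution, it may help to see the two metrics computed by hand. The sketch below uses plain Python on made-up scores. The category names and values are hypothetical and serve only to show how the average and the negative ratio are defined.

```python
from collections import defaultdict

# Made-up (category, sentiment_score) pairs, for illustration only.
scored = [
    ("Toys", 0.6), ("Toys", -0.3), ("Toys", 0.0), ("Toys", -0.1),
    ("Books", 0.4), ("Books", 0.2),
]

# Group the scores by category.
by_category = defaultdict(list)
for category, score in scored:
    by_category[category].append(score)

summary = {}
for category, scores in by_category.items():
    avg_score = sum(scores) / len(scores)
    # Only strictly negative scores count; a neutral 0.0 does not.
    neg_ratio = 100 * sum(s < 0 for s in scores) / len(scores)
    summary[category] = (round(avg_score, 3), neg_ratio)

for category, (avg_score, neg_ratio) in summary.items():
    print(f"{category}: average={avg_score}, negative ratio={neg_ratio}%")
```

Here "Toys" ends up with a slightly positive average even though half of its reviews are negative, which is why the problem asks for both numbers rather than the average alone.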
Pandas (Visualization)
Solution
import matplotlib.pyplot as plt

# Aggregation: average score and review count per category
cat_summary = reviews.groupby('relabeled_category').agg({
    'sentiment_score': 'mean',
    'review_id': 'count'
}).reset_index()

# Calculate negative review ratio (%), matched to each row by category name
neg_reviews = reviews[reviews['sentiment_score'] < 0].groupby('relabeled_category').size()
total_reviews = reviews.groupby('relabeled_category').size()
neg_ratio = neg_reviews / total_reviews * 100
# Categories with no negative reviews come out as NaN, so fill them with 0
cat_summary['neg_ratio'] = cat_summary['relabeled_category'].map(neg_ratio).fillna(0)

# Visualization (scatter plot; bubble size reflects review count)
plt.figure(figsize=(10, 6))
plt.scatter(cat_summary['neg_ratio'], cat_summary['sentiment_score'],
            s=cat_summary['review_id'] / 10, alpha=0.5)

for i, txt in enumerate(cat_summary['relabeled_category']):
    plt.annotate(txt, (cat_summary['neg_ratio'][i], cat_summary['sentiment_score'][i]))

plt.xlabel('Negative Review Ratio (%)')
plt.ylabel('Average Sentiment Score')
plt.title('Sentiment Analysis by Category')
plt.axvline(x=30, color='r', linestyle='--')  # Danger if over 30%
plt.show()

Summary
- Keyword Matching: Fast and free but less accurate. It can't handle negation such as "Not bad".
- Rule-based NLP (TextBlob): Takes some grammar, such as negation, into account for slightly better accuracy.
- LLM (Next Chapter): The highest level of performance; it can even understand sarcasm.
In the next chapter, we'll use Gemini to understand and classify text like a real human.