Analyze star ratings with simple average, Bayesian average, Wilson confidence, distribution visualization, polarity detection, and entropy-based consensus metrics.
Five-star rating systems power decisions on Amazon, Yelp, Google, the App Store, and countless other platforms — but a simple average can be deeply misleading. An item with one 5-star review isn't better than one with 1,000 reviews averaging 4.7 stars, even though its simple average is higher. This calculator goes far beyond the crude average to provide statistically rigorous rating analysis.
Three ranking methods are computed: the simple weighted average, the Bayesian average (IMDB-style, which pulls ratings toward a prior when review counts are low), and the Wilson lower bound (which gives a confidence-adjusted "worst reasonable case" score for ranking). Beyond numerical scores, the calculator measures rating consensus through standard deviation and entropy, detects polarized distributions, and computes a net sentiment score.
Whether you're evaluating products, ranking search results, comparing restaurants, or designing your own rating system, this calculator shows you what the star distribution actually reveals — and what a simple "4.2 out of 5" hides.
Every ecommerce platform, review site, and marketplace needs to rank items by ratings — and simple averages fail in predictable ways. This calculator demonstrates three industry-standard solutions (simple, Bayesian, Wilson) side by side, so platform designers can choose the right method and users can understand why ratings feel "off" sometimes.
The distribution visualization, polarity detection, and entropy metrics provide insights that no single number can capture. A "3.5-star" product could be mediocre (most ratings 3-4), controversial (split between 1 and 5), or barely-reviewed (one 3 and one 4). This calculator tells you which.
Simple Average: Σ(star × count) / Σ(count)

Bayesian Average: (m × C + Σ(star × count)) / (m + Σ(count)), where m = prior review count and C = prior mean (typically 3.0)

Wilson Lower Bound (on the proportion positive): (p̂ + z²/2n − z√(p̂(1−p̂)/n + z²/4n²)) / (1 + z²/n), where p̂ = proportion of 4-5★ ratings, n = total ratings, and z = 1.96 for a 95% CI
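As a sketch, the three formulas above translate directly into code. The distribution used below is a hypothetical 100-rating example chosen for illustration, not necessarily the one behind the worked result that follows:

```python
import math

def simple_average(counts):
    """counts maps star level (1-5) to number of ratings."""
    n = sum(counts.values())
    return sum(star * c for star, c in counts.items()) / n

def bayesian_average(counts, prior_mean=3.0, prior_count=100):
    """Blend observed ratings with a prior of prior_count reviews at prior_mean."""
    n = sum(counts.values())
    stars = sum(star * c for star, c in counts.items())
    return (prior_count * prior_mean + stars) / (prior_count + n)

def wilson_lower_bound(positive, n, z=1.96):
    """95% lower confidence bound on the true positive (4-5 star) proportion."""
    if n == 0:
        return 0.0
    p = positive / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

counts = {5: 45, 4: 30, 3: 15, 2: 5, 1: 5}  # hypothetical distribution
simple = simple_average(counts)              # 4.05
bayes = bayesian_average(counts)             # 3.525
wilson = wilson_lower_bound(counts[5] + counts[4], sum(counts.values()))
```

Note how the same 4.05 simple average yields a noticeably lower Bayesian average and Wilson bound once sample size is taken into account.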
Result: Simple: 4.05/5, Bayesian: 3.57/5, Wilson: 67.7%, SD: 1.10, Net: +40%
With 100 total ratings weighted toward 5 and 4 stars, the simple average is 4.05. The Bayesian average (with a 100-review prior at 3.0) pulls this down to 3.57, reflecting that 100 reviews provide only moderate evidence against the prior. The Wilson lower bound of 67.7% means we can be 95% confident that the true proportion of positive (4-5★) ratings is at least 67.7%. An SD of 1.10 indicates moderate consensus.
IMDB uses a Bayesian average ("weighted rating") for its Top 250 list: WR = (v/(v+m)) × R + (m/(v+m)) × C, where v = votes, m ≈ 25,000, R = mean rating, C = mean across all films (~7.0). Amazon uses a proprietary system that factors in recency, verified purchases, and helpfulness votes alongside star counts. Reddit's "Best" comment sort uses Wilson confidence intervals, as described in Evan Miller's influential blog post "How Not to Sort by Average Rating."
Online ratings typically follow a J-shaped distribution: many 5-star ratings, gradually fewer 4, 3, 2, and then a bump at 1 star. This happens because satisfied customers leave reviews voluntarily (5★), dissatisfied customers complain (1★), but average-experience customers rarely bother. Any rating system must account for this selection bias.
When designing a rating system, consider: (1) Bayesian averaging to handle cold starts, (2) recency weighting to reflect improving or declining quality, (3) credibility signals that weight verified purchasers more heavily, (4) distribution bars displayed alongside the headline number, and (5) a minimum review volume before ratings are shown publicly. Each design choice affects how users interpret and trust the system.
The Bayesian average blends your data with a prior assumption (default: 3.0 stars with 100 reviews' weight). For items with few reviews, the result is pulled toward the prior. As reviews accumulate, the data overwhelms the prior and the Bayesian average converges to the simple average. This prevents a single 5-star review from ranking above a well-reviewed 4.5-star item.
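A quick sketch of that convergence, using the default prior (3.0 stars with 100 reviews' weight) and hypothetical item numbers:

```python
def bayesian_average(total_stars, n, prior_mean=3.0, prior_count=100):
    # Blend the observed star total with prior_count phantom reviews at prior_mean.
    return (prior_count * prior_mean + total_stars) / (prior_count + n)

one_five_star = bayesian_average(5, 1)            # (300 + 5) / 101, roughly 3.02
well_reviewed = bayesian_average(4.5 * 500, 500)  # (300 + 2250) / 600 = 4.25
# The single 5-star review barely moves the prior, so the well-reviewed
# 4.5-star item ranks far above it.
```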
Wilson lower bound is ideal for ranking items by approval rate. It answers: "Given this sample size, what's the lowest percentage of positive ratings we can be 95% confident about?" A product with 10/10 positive ratings gets a lower Wilson score than one with 95/100, because the second has more evidence. Reddit uses a variant of this for comment ranking.
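The 10/10 versus 95/100 comparison can be checked directly with the Wilson formula given earlier (z = 1.96 for a 95% confidence level):

```python
import math

def wilson_lower_bound(positive, n, z=1.96):
    """95% lower confidence bound on the true positive proportion."""
    if n == 0:
        return 0.0
    p = positive / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

perfect_but_small = wilson_lower_bound(10, 10)   # ~0.72: too little evidence
strong_and_large = wilson_lower_bound(95, 100)   # ~0.89: more evidence wins
```

Even a perfect 100% approval rate over 10 ratings ranks below 95% approval over 100 ratings, which is exactly the behavior you want when sorting by "worst reasonable case."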
Entropy measures the spread of ratings across star levels. Low entropy means ratings cluster at one level (strong consensus — good or bad). High entropy means ratings are spread evenly (no consensus, controversial item). Maximum entropy occurs when each star level has exactly 20% of ratings.
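A minimal sketch of Shannon entropy over the star distribution (the distributions below are made-up examples of strong consensus and maximum spread):

```python
import math

def rating_entropy(counts):
    """Shannon entropy (in bits) of the star distribution; max is log2(5)."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

consensus = rating_entropy({5: 90, 4: 5, 3: 3, 2: 1, 1: 1})        # ~0.64 bits
no_consensus = rating_entropy({5: 20, 4: 20, 3: 20, 2: 20, 1: 20})  # log2(5) ~ 2.32
```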
Polarity measures how bimodal the distribution is — how much of the ratings are at the extremes (1★ and 5★) versus the middle (2-4★). A highly polarized product has fans who love it and critics who hate it. The average might be 3 stars, but the experience is nothing like "average" — it depends on who you are.
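The calculator's exact polarity formula isn't specified here, but a simple version is the share of ratings sitting at the extremes, as a sketch:

```python
def polarity(counts):
    """Share of ratings at the extremes (1 and 5 stars) vs the middle (2-4)."""
    n = sum(counts.values())
    return (counts.get(1, 0) + counts.get(5, 0)) / n

controversial = polarity({5: 50, 4: 0, 3: 0, 2: 0, 1: 50})  # 1.0, yet mean is 3.0
mediocre = polarity({5: 0, 4: 0, 3: 50, 2: 50, 1: 0})       # 0.0, mean is 2.5
```

Both items have a middling average, but only the first is polarized; this is the distinction the metric surfaces.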
Set the prior review count to the typical number of reviews for items in your category. If most items have ~200 reviews, use 200 as the prior. This ensures new items with few reviews aren't artificially inflated. IMDB uses approximately 25,000 as the prior for its Top 250 list.
If ratings are skewed (e.g., mostly 5-star with some 1-star), the mean is pulled down by the low ratings while the median stays at 5. The median represents the "typical" review; the mean reflects the overall balance. A large gap between the two indicates a skewed distribution.
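A small example of that gap, using a made-up skewed sample:

```python
import statistics

ratings = [5] * 80 + [1] * 20  # hypothetical: mostly 5-star with a 1-star block
mean_rating = statistics.mean(ratings)      # 4.2: pulled down by the 1-stars
median_rating = statistics.median(ratings)  # 5: the "typical" review
```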