A personal field-report plus a tiny math model.

The dilemma

Picture any familiar choice:

  • Option A: 4★, 100 ratings
  • Option B: 4.5★, 10 ratings

Most people intuitively sense that this choice is not trivial. Option B has the higher average rating, but the smaller number of ratings makes it less trustworthy. So what do we do when “more stars” collides with “fewer votes”?

Most of us will intuitively discount a high rating when it rests on few votes, and trust it more when it rests on many. I was not satisfied with leaving it at that. I wanted to make this intuition as explicit as possible, so I did some maths.

Three tiny functions are enough

We will now prepare our rating and confidence values, and then combine them while staying aware of risk aversion.

Normalise the rating

First things first: we want to handle any rating system, so we normalise each scale onto a standard one. Most rating schemes run from 1 to 5, which can be mapped linearly onto $[0,\,1]$:

★ 1 becomes 0, ★ 5 becomes 1, everything else is proportional.

\[r_{\text{norm}} = \frac{\text{rating} - 1}{4} \tag{1}\]

More generally, for any rating scale with bounds $\min_r$ and $\max_r$:

\[r_{\text{norm}}(x) = \frac{x - \min_r}{\max_r - \min_r} \tag{2}\]
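In code, the general form of equation $(2)$ is a one-liner (a sketch; the function name is mine):

```python
def normalise(rating, min_r=1.0, max_r=5.0):
    """Map a rating on [min_r, max_r] linearly onto [0, 1] (equation 2)."""
    return (rating - min_r) / (max_r - min_r)
```

With the default 1-to-5 scale this reproduces equation $(1)$: `normalise(1.0)` gives 0 and `normalise(5.0)` gives 1; a 10-point scale just needs `min_r=0, max_r=10`.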

Confidence from the vote count

Next we need a way to express how much we trust a rating based on how many people contributed to it. A restaurant with zero reviews tells us nothing - we have zero confidence. A restaurant with thousands of reviews gives us plenty of data - our confidence is high. But the jump from 0 to 10 reviews matters far more than the jump from 1 000 to 1 010. In other words, every additional review helps, but each one helps a little less than the last.

We can write this requirement as a function $c$ that takes a vote count $n$ (any number from 0 to infinity) and returns a confidence value between 0 and 1:

\[c : [0, \infty) \longrightarrow [0, 1] \tag{3}\]

In plain terms: $c(0) = 0$ (no reviews, no trust), $c$ keeps growing as $n$ increases (more reviews always help), but $c$ never quite reaches 1 (we can never be perfectly certain). Each additional review adds a little less confidence than the one before - the returns are diminishing.

Three simple formulas satisfy all of these requirements:

\[c(n) = 1 - e^{-n} \tag{4}\]

The exponential function. It rockets upward from 0 and then flattens out rapidly - a small number of reviews already pushes confidence close to 1.

\[c(n) = \frac{2}{\pi}\,\arctan(n) \tag{5}\]

The arctangent function. It rises quickly at first, then gradually bends toward 1 - a middle ground between the other two.

\[c(n) = \frac{n}{n + 1} \tag{6}\]

The rational function. The simplest of the three: at $n = 1$ review, confidence is $\frac{1}{2}$; at $n = 9$, it is $\frac{9}{10}$. It climbs the most slowly and stays furthest from 1, giving the most conservative view of how much data is “enough.”

Each of these can be further fitted to what we deem a critical number of ratings using a half-confidence point $c_h$ - the number of reviews at which we consider the rating exactly 50 % trustworthy, such that $c(c_h) = \tfrac{1}{2}$.

For $(6)$, replace the constant with $c_h$:

\[c(n) = \frac{n}{n + c_h} \tag{6'}\]

so that $c(c_h) = \tfrac{1}{2}$.

For $(4)$, divide the exponent by $c_h$ and multiply by $\ln 2$:

\[c(n) = 1 - e^{\,-n\,\ln 2\,/\,c_h} = 1 - 2^{-n/c_h} \tag{4'}\]

again giving $c(c_h) = \tfrac{1}{2}$.

For $(5)$, divide the argument by $c_h$:

\[c(n) = \frac{2}{\pi}\,\arctan\!\Bigl(\frac{n}{c_h}\Bigr) \tag{5'}\]

and once more $c(c_h) = \tfrac{1}{2}$.
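All three fitted variants are easy to implement side by side (a minimal sketch; the function names are my own):

```python
import math

def c_exp(n, c_h):
    """Fitted exponential (4'): 1 - 2^(-n / c_h)."""
    return 1.0 - 2.0 ** (-n / c_h)

def c_atan(n, c_h):
    """Fitted arctangent (5'): (2 / pi) * arctan(n / c_h)."""
    return (2.0 / math.pi) * math.atan(n / c_h)

def c_rational(n, c_h):
    """Fitted rational (6'): n / (n + c_h)."""
    return n / (n + c_h)
```

All three return 0 at $n = 0$, exactly $\tfrac{1}{2}$ at $n = c_h$, and for any $n > c_h$ the exponential is the most generous and the rational the most conservative.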

Axes: The horizontal axis is the number of ratings; the vertical axis is the resulting confidence in $[0, 1]$. The dashed grey line marks $c = 0.5$. Look for: all three curves cross the half-confidence line at the same point $n = c_h = 50$. After that, the exponential $(4')$ saturates fastest - by about 150 reviews it already treats a rating as near-certain. The rational function $(6')$ climbs the most slowly and stays furthest from 1, offering the most conservative view. The arctangent $(5')$ sits in between.

Since $(6')$ is the simplest formula and grows the slowest, there is the most spread between options even at higher review counts - meaning it does not write off all differences as “both good enough” too quickly. For the rest of this post we will use $(6')$, the rational confidence function.

Merge both via a risk-aversion parameter ρ

Now we have a normalised rating $r \in [0, 1]$ and a confidence value $c(n) \in [0, 1)$.

We could simply multiply rating by confidence, or take the average, but depending on your risk aversion, you will find the confidence value to be more or less important. In other words, we should weight the confidence higher the more risk-averse we are.

Define a parameter $\rho \in [0, \infty)$:

  • $\rho = 0$ - pure star-gazing (risk-seeking): vote count is irrelevant
  • $\rho = 1$ - stars and confidence count equally
  • $\rho \to \infty$ - maximum caution: only sample size matters

\[V = \frac{r \;+\; \rho \cdot c(n)}{1 + \rho} \tag{7}\]

Transparent, tiny, and still explainable to non-math friends. Equation $(7)$ is simply a weighted average of the normalised rating and the confidence, where $\rho$ controls how much weight falls on confidence.
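The whole model at this point fits in one line of code (a sketch; the name is mine):

```python
def unified_score(r, c, rho):
    """Weighted average of normalised rating r and confidence c (equation 7)."""
    return (r + rho * c) / (1 + rho)
```

At `rho=0` the result is just `r`; as `rho` grows, the score is pulled ever closer to `c`.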

Worked example

Let us put the model to work on a situation you have probably faced: choosing a restaurant on Google Maps.

Using the rational confidence function $(6')$ with half-confidence $c_h = 50$:

  • Restaurant A: ★ 4.0 from 200 reviews - a well-established neighbourhood favourite. Normalised rating: $r_A = \frac{4.0 - 1}{4} = 0.75$.   Confidence: $c(200) = \frac{200}{200 + 50} = 0.800$.
  • Restaurant B: ★ 4.5 from 15 reviews - a newly opened spot with glowing early praise. Normalised rating: $r_B = \frac{4.5 - 1}{4} = 0.875$.   Confidence: $c(15) = \frac{15}{15 + 50} \approx 0.231$.

Restaurant B has the higher star rating, but only 15 reviews gives it just 23 % confidence - meaning we barely trust that average. Restaurant A’s 200 reviews put it at 80 % confidence. Now let us plug both into equation $(7)$ at different risk-aversion levels:

$\rho$    Restaurant A (★ 4.0; 200 reviews)    Restaurant B (★ 4.5; 15 reviews)    Ahead
0         0.750                                0.875                               B
0.1       0.755                                0.816                               B
0.5       0.767                                0.660                               A
1         0.775                                0.553                               A
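The table can be reproduced with a short self-contained script combining equations $(1)$, $(6')$ and $(7)$ (the helper name is mine):

```python
def score(stars, n, rho, c_h=50):
    r = (stars - 1) / 4               # normalised rating, equation (1)
    c = n / (n + c_h)                 # rational confidence, equation (6')
    return (r + rho * c) / (1 + rho)  # unified score, equation (7)

for rho in (0, 0.1, 0.5, 1):
    a = score(4.0, 200, rho)
    b = score(4.5, 15, rho)
    print(f"rho={rho}: A={a:.3f}  B={b:.3f}  ahead={'A' if a > b else 'B'}")
```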

The tipping point sits at $\rho \approx 0.22$. Even mild risk aversion - weighting sample size at roughly a fifth of star quality - is enough to flip the lead to Restaurant A.

Why so low? Because with $c_h = 50$ and the rational function, 15 reviews corresponds to only 23 % confidence while 200 reviews gives 80 %. That gap of nearly 57 percentage points means the model sees a huge difference in trustworthiness. The half-star advantage of Restaurant B simply cannot overcome the confidence deficit once you care even a little about sample size.

Axes: The horizontal axis is the risk-aversion parameter $\rho$; the vertical axis is the unified score $V$. Look for: Restaurant B (red) starts higher at $\rho = 0$ thanks to its better star rating, but drops steeply because its confidence is so low. Restaurant A (blue) barely changes - its high confidence keeps its score stable. The dashed vertical line marks $\rho \approx 0.22$, the exact point where A overtakes B. Takeaway: with the rational function and $c_h = 50$, fifteen reviews is simply not enough to compete with two hundred, even when the star gap is half a star.

Sensitivity to cₕ

The half-confidence parameter $c_h$ has a strong influence on the end result. It encodes your answer to the question: “How many reviews does a place need before I trust it halfway?” Changing $c_h$ changes how much trust the model grants a given number of reviews - and therefore how easily confidence can flip the ranking.

The general tipping-point formula (setting $V_A = V_B$ and solving for $\rho$) is:

\[\rho^\ast = \frac{r_B - r_A}{\,c_A - c_B\,}\]

In words: the tipping point equals the star-rating gap divided by the confidence gap. A wider confidence gap (the denominator) means a lower tipping point — the ranking flips more easily.
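Where does this come from? Setting $V_A = V_B$ in equation $(7)$ and solving for $\rho$ makes it explicit:

\[\frac{r_A + \rho\,c_A}{1 + \rho} \;=\; \frac{r_B + \rho\,c_B}{1 + \rho}
\quad\Longrightarrow\quad
\rho\,(c_A - c_B) \;=\; r_B - r_A
\quad\Longrightarrow\quad
\rho^\ast \;=\; \frac{r_B - r_A}{c_A - c_B}\]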

Axes: The horizontal axis is the confidence gap $\Delta c = c_A - c_B$; the vertical axis is the tipping-point risk aversion $\rho^\ast$. Each curve represents a fixed normalised rating gap $\Delta r = r_B - r_A$. Look for: every curve is a hyperbola - as the confidence gap widens, less risk aversion is needed to flip the ranking. The three blue dots mark the restaurant example ($\Delta r = 0.125$, i.e. half a star) at $c_h = 10$, $50$, and $200$. Takeaway: even a modest confidence gap makes the ranking very sensitive to risk aversion when the star gap is small; conversely, a large star gap (orange, 2 stars) requires a substantial confidence gap before $\rho^\ast$ drops below 1.

Here is how different choices of $c_h$ affect our restaurant example:

Half-confidence $c_h$    $c(15)$    $c(200)$    Confidence gap    Tipping $\rho^\ast$
10                       0.600      0.952       0.352             0.35
50                       0.231      0.800       0.569             0.22
200                      0.070      0.500       0.430             0.29

At $c_h = 10$, even 15 reviews already carries 60 % confidence - the model is generous, so the confidence gap is smaller and the tipping point is a bit higher. At $c_h = 50$, 15 reviews drops to just 23 % confidence while 200 reviews sits at 80 %, creating the widest gap and the lowest tipping point. At $c_h = 200$, both options score low on confidence (even 200 reviews is only half-trusted), so the gap narrows again.
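These tipping points follow directly from the general formula (a sketch; the helper name is mine):

```python
def tipping_rho(stars_a, n_a, stars_b, n_b, c_h):
    """Risk aversion at which A and B tie: (r_B - r_A) / (c_A - c_B)."""
    r_a, r_b = (stars_a - 1) / 4, (stars_b - 1) / 4            # equation (1)
    c_a, c_b = n_a / (n_a + c_h), n_b / (n_b + c_h)            # equation (6')
    return (r_b - r_a) / (c_a - c_b)

for c_h in (10, 50, 200):
    print(f"c_h={c_h}: rho* = {tipping_rho(4.0, 200, 4.5, 15, c_h):.2f}")
```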

There is no universally “correct” $c_h$. It depends on the domain: how noisy are individual ratings? How many reviews do you expect a reasonable option to have? For a niche category where 20 reviews is a lot, $c_h = 10$ might be right. For restaurants on Google Maps, $c_h = 50$ feels realistic. For a global platform like Goodreads where popular titles have thousands of reviews, $c_h = 200$ could make sense.

The spread problem

You may have noticed a practical difficulty. Real-world vote counts span several orders of magnitude. A newly opened café might have 12 reviews; a fast-food chain might have 20 000.

With any fixed half-confidence $c_h$, the rational confidence function behaves almost like a step function: options far below $c_h$ cluster near zero, and options far above $c_h$ cluster near one. The useful middle range - where confidence is actually discriminating between options - covers roughly one order of magnitude around $c_h$.

This is not a bug in the model but a genuine outcome of our design choices. When $c_h = 50$, the model sees almost no difference between a place with 500 reviews and one with 5 000 ($c = 0.91$ vs $c = 0.99$). Once we are way beyond our critical trustworthy limit, both are “trusted enough.” But you might feel that the difference should matter.

One possible fix: apply the confidence function to the logarithm of the vote count instead of the raw count. This compresses the wide range of review numbers into a much narrower band before feeding them into the rational function:

\[c(n) = \frac{\log_{10}(n + 1)}{\log_{10}(n + 1) + \log_{10}(c_h + 1)}\]

We add 1 inside both logarithms so that $c(0) = 0$ still holds (since $\log_{10}(1) = 0$). When $n = c_h$ the two logarithms are equal, so $c(c_h) = \tfrac{1}{2}$ - exactly the same half-confidence meaning that $c_h$ has in the rational function. No new parameter is needed.
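A sketch of the log-based variant (the function name is mine):

```python
import math

def c_log(n, c_h=50):
    """Rational confidence applied to log10 of the shifted vote count."""
    num = math.log10(n + 1)
    return num / (num + math.log10(c_h + 1))
```

As required, `c_log(0)` is exactly 0, `c_log(c_h)` is exactly $\tfrac{1}{2}$, and the function keeps climbing visibly even in the thousands of reviews.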

Here is how the log-based function compares to the standard rational function across a wide range:

Reviews      Rational ($c_h = 50$)    Log-based ($c_h = 50$)
10           0.167                    0.379
100          0.667                    0.540
1 000        0.952                    0.637
10 000       0.995                    0.701
1 000 000    ≈ 1.000                  0.778

Notice how the rational function races to 1 and then stays there - it cannot tell 1 000 reviews apart from 1 000 000. The log-based version climbs more slowly and still has room to differentiate even at very high counts.

Axes: The horizontal axis is the number of reviews on a logarithmic scale (each gridline is 10× the previous one); the vertical axis is confidence. Look for: the green rational curve shoots up quickly and then hugs the top - everything above a few hundred reviews looks the same. The red log-based curve rises more gently and still has meaningful room to grow even at 100 000 reviews. Takeaway: the log-based version trades sharpness for range. Whether that trade-off is worth it depends on your data - and on how much you think the difference between 1 000 and 100 000 reviews should matter.

What’s next

This model is deliberately minimal - three functions and one dial. Possible extensions:

  • Bayesian priors. Rather than treating the star rating and confidence as separate numbers that get blended, start from a “default expectation” - the average rating across all options - and let each option’s reviews pull it away from that baseline. With few reviews, the score stays close to the overall average; with many reviews, the option’s own rating dominates. This is essentially how IMDb ranks its Top 250: they blend each film’s average with the site-wide mean, weighted by how many votes it has. Our model does something similar in spirit, but keeps the confidence and the rating as distinct, tuneable pieces rather than baking them into one formula.

  • Variance-aware confidence. Use the spread of individual ratings, not just their count. Ten reviews that all say ★ 4 are more informative than ten reviews split between ★ 1 and ★ 5 - even though the average is the same. A tight cluster of ratings gives you more reason to trust the average.

  • Multi-attribute scoring. So far we have compared options on a single dimension - one overall star rating. But many real decisions involve several dimensions at once. A hotel might have separate ratings for cleanliness, location, and value. A restaurant might be rated on food, service, and atmosphere. You could apply this model to each dimension independently (each with its own $r$, $c$, and even its own $c_h$), and then combine the per-dimension scores with another set of weights reflecting how much each dimension matters to you personally. Someone who cares most about location would weight that dimension higher; someone who cares about value would weight differently. The maths stays the same at each level - it is just nested.
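To make the first extension concrete, here is a minimal sketch of an IMDb-style Bayesian blend as described above - the prior mean of 3.5 and the weight $m = 50$ are illustrative assumptions, not values from any real platform:

```python
def bayesian_average(stars, n, prior_mean=3.5, m=50):
    """Blend an option's average rating with a global prior mean.

    m plays a role similar to c_h: it is the number of reviews at
    which the option's own average and the prior carry equal weight.
    prior_mean and m are illustrative assumptions, not real-world values.
    """
    return (n * stars + m * prior_mean) / (n + m)
```

With zero reviews the result is exactly the prior mean; with many reviews it converges to the option's own average. Unlike equation $(7)$, rating and confidence are baked into a single formula here rather than kept as separate, tuneable pieces.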

I’m keen to hear additions, critiques, or totally different angles - the more plural, the more fun.