Half measures

How I killed some time during the lockdown.

The following was inspired by this tweet:

How to count to 100 like a journalist:

A, both, several, five, half a dozen, more than half a dozen, nearly 10, nearly a dozen, a dozen, more than a dozen, nearly two dozen, a score, nearly two dozen (again), dozens, scores, 50, more than 50, more than 75, nearly a hundred, 100.
— Robinson Meyer (@yayitsrob) April 30, 2020

There is a similar journalistic shorthand for fractions, particularly those near 1/2: terms like "almost half", "nearly half", "about half", "more than half", and "over half". What follows is a statistical quantification of what those terms actually mean, as measured by a random sample of 50 articles for each term.

I found articles via Google searches for each term, and I only included articles where the actual number could be verified (which sometimes entailed clicking through to a secondary source).

For example, consider the following article from Nature:

Over half of psychology studies fail reproducibility test

If you read the article, you find that "over half" in this case means 61%.

I repeated this process 50 times for each of the 5 terms and recorded the results. You can view the raw data here.

Here are the charted results:

You can see that terms like "almost half" and "nearly half" fall in a fairly tight band, with 40% looking like a good cutoff for what constitutes "almost" or "nearly". But terms like "more than half" or "over half" have a longer tail. There is more linguistic wriggle room in "more than" versus "nearly".

You'll note a few data points fall on the "wrong" side of the 50% dividing line. These are examples where the headline writer appeared to misinterpret the data. For example, Business Insider reported:

More than half of American retailers didn't pay their rent in April and May, and it could upset the entire economy

But if you click through to the Washington Post story they were aggregating, you find they misinterpreted a chart showing that more than half of retailers did pay their rent in April and May.
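A check like this can be automated by comparing each verified figure with the side of 50% that its term implies. The following is a minimal sketch, not the method used for this post: the names `EXPECTED_SIDE` and `matches_term` are made up for illustration, and treating "about half" as allowed on either side is an assumption.

```python
# Side of 0.5 that each term implies. Treating "about half" as unconstrained
# is an assumption made for this sketch.
EXPECTED_SIDE = {
    "almost half": "below",
    "nearly half": "below",
    "about half": "either",
    "more than half": "above",
    "over half": "above",
}


def matches_term(term, value):
    """Return True if the verified fraction falls on the side the term implies."""
    side = EXPECTED_SIDE[term]
    if side == "below":
        return value < 0.5
    if side == "above":
        return value > 0.5
    return True  # "either": no constraint


# The Nature example: "over half" turned out to mean 61%.
print(matches_term("over half", 0.61))       # True
# A hypothetical "more than half" headline whose verified figure is 45%.
print(matches_term("more than half", 0.45))  # False
```

Any row that returns False is worth a second look at the underlying source.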

Here are the key stats for each term:

term             n    mean   median   stdev   skew    kurtosis
almost half      50   0.46   0.46     0.02    -0.16    0.30
nearly half      50   0.46   0.47     0.02    -0.61   -0.27
about half       50   0.50   0.50     0.04    -0.14   -0.62
more than half   50   0.56   0.55     0.05     0.06    1.49
over half        50   0.56   0.55     0.04     0.94    0.68

"Almost" and "nearly" have nearly identical means, medians, and standard deviations, as do "more than" and "over". And it's comforting to see "about half" with a mean and median of 0.50, and a fairly low skew. As a reminder, skew tells you how imbalanced a distribution is, while kurtosis tells you how "fat" the tails are compared to a normal distribution. The kurtosis values here are excess kurtosis, where a normal distribution scores 0; that is why some of them are negative.
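For anyone who wants to compute the same columns from the raw data, here is a minimal sketch using only the Python standard library. It assumes the simple moment-based definitions of skew and excess kurtosis; spreadsheet functions apply small-sample corrections, so their output can differ slightly, and I am not claiming these are the exact formulas behind the table. The function name `summarize` and the five sample values are hypothetical.

```python
import statistics


def summarize(values):
    """Return n, mean, median, stdev, skew, and excess kurtosis for one term."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments in population form (divide by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "skew": m3 / m2 ** 1.5,             # 0 for a symmetric sample
        "kurtosis": m4 / m2 ** 2 - 3.0,     # excess kurtosis: normal = 0
    }


# Hypothetical values for illustration only, not the raw data from this post.
sample = [0.44, 0.45, 0.46, 0.47, 0.48]
stats = summarize(sample)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

The sample is symmetric around 0.46, so its skew comes out at zero. Its values are spread evenly rather than clustered around the middle, which gives a negative excess kurtosis.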