How I killed some time during the lockdown.

The following was inspired by this tweet:

There is a similar journalistic shorthand for fractions, particularly those near 1/2: terms like "almost half", "nearly half", "about half", "more than half", and "over half". What follows is a statistical quantification of what those terms actually mean, as measured by a random sample of 50 articles for each term.

Articles were found via Google searches for each term, and I only included articles where the actual number could be verified (which sometimes entailed clicking through to a secondary source).

For example, consider the following article from Nature:

Over half of psychology studies fail reproducibility test

If you read the article, you find that "over half" in this case means 61%.

I repeated this process 50 times for each of the 5 terms, and recorded the results. You can view the raw data here.

Here are the charted results:

You can see that terms like "almost half" and "nearly half" fall in a fairly tight band, with 40% looking like a good cutoff for what constitutes "almost" or "nearly". But terms like "more than half" or "over half" have a longer tail. There is more linguistic wriggle room in "more than" versus "nearly".

You'll note a few data points fall on the "wrong" side of the 50% dividing line. These are examples where the headline writer appeared to misinterpret the data. For example, Business Insider reported:

More than half of American retailers didn't pay their rent in April and May, and it could upset the entire economy

But if you click through to the Washington Post story they were aggregating, you find they misinterpreted a chart showing that more than half of retailers did pay their rent in April and May.
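The "wrong side" check described above can be sketched in a few lines of Python. This is only an illustration: the function name `wrong_side` is hypothetical, and treating exactly 50% as inconsistent with both "almost" and "more than" is my assumption, not a rule taken from the data.

```python
# Terms that imply a value below one half, and terms that imply a value above it.
# "about half" has no wrong side, so it is never flagged.
BELOW_HALF = {"almost half", "nearly half"}
ABOVE_HALF = {"more than half", "over half"}


def wrong_side(term: str, value: float) -> bool:
    """Return True if the verified fraction contradicts the headline term.

    Assumption: a value of exactly 0.5 counts as a contradiction for both groups.
    """
    if term in BELOW_HALF:
        return value >= 0.5
    if term in ABOVE_HALF:
        return value <= 0.5
    return False


# 0.61 is the Nature example above; 0.45 is a made-up value for illustration.
print(wrong_side("over half", 0.61))
print(wrong_side("more than half", 0.45))
```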

Here are the key stats for each term:

term             n   mean   median   stdev    skew   kurtosis
almost half     50   0.46     0.46    0.02   -0.16       0.30
nearly half     50   0.46     0.47    0.02   -0.61      -0.27
about half      50   0.50     0.50    0.04   -0.14      -0.62
more than half  50   0.56     0.55    0.05    0.06       1.49
over half       50   0.56     0.55    0.04    0.94       0.68

"Almost" and "nearly" appear to be statistically identical in many ways, as do "more than" and "over". And it's comforting to see "about half" with a mean and median of 0.50, and a skew close to zero. As a reminder, skew tells you how asymmetric a distribution is (a positive value means a longer tail on the high side), while kurtosis tells you how "fat" the tails are compared to a normal distribution. The negative values in the table suggest these are excess kurtosis figures, on which a normal distribution scores 0.
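
For anyone who wants to compute the same kind of summary on their own sample, here is a minimal sketch using only the Python standard library. It uses the simple moment-based formulas for skew and excess kurtosis. I'm assuming that is close enough for illustration; spreadsheet functions often apply a small-sample correction, so the numbers may not match the table exactly. The function name `summarize` and the sample values are placeholders, not the raw data from this post.

```python
import statistics


def summarize(values):
    """Return n, mean, median, sample stdev, skew, and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)

    # Central moments (divide by n, no small-sample correction).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n

    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample stdev (divides by n - 1)
        "skew": m3 / m2 ** 1.5,             # 0 for a symmetric sample
        "kurtosis": m4 / m2 ** 2 - 3.0,     # excess kurtosis: 0 for a normal distribution
    }


# Made-up fractions standing in for one term's verified values.
sample = [0.51, 0.52, 0.53, 0.55, 0.70]
for name, value in summarize(sample).items():
    print(f"{name}: {value:.2f}")
```

A single large value such as the 0.70 in this placeholder sample is enough to pull the skew positive, which is the same kind of long tail seen in the "over half" row.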