Carolyn graciously lent me a copy of “How to Lie with Statistics” by Darrell Huff. Despite its 1954 publication date, this book is remarkably relevant today. Below, I explain why the book, despite its high quality, will never achieve its aim, and my suggestion for a substitute.

*How to Lie with Statistics* is a gentle introduction to deceit with numbers. It is brief, the writing is elegant and light-hearted, and every single one of the lies described in the book is still in widespread use more than fifty years later. The book includes an informal catalogue of common statistical errors, reserving special scorn for the Precision Bias.

It is a valiant effort to craft an accessible and persuasive introduction to the issues. The author seems to believe that with sufficient widespread education, we can banish misleading numbers. I disagree. The problem is hard, in that the tiny individual payoff will never justify the effort needed to detect and oppose numerical deception. We need an easy way of certifying and enforcing honest data presentation.

The core of Statistics is the comparison of expectations to results. All of the lies present accurate and precise numerical results (technical honesty) but mislead about the appropriate comparable expectation (*de facto* dishonesty). The situation is complicated by the fact that even professionals frequently have difficulty crafting the proper expectations. Malicious numerists always have plausible deniability.

To put it another way, statisticians have considerable flexibility in methods and presentation. Special interests abuse the flexibility for their own purposes.

There is an analogy to accounting. Accountants have considerable flexibility in methods and presentation of financial results. Much of accounting is about leveraging that flexibility to minimize taxes. In response to the inevitable plethora of abuses, accountants developed the Generally Accepted Accounting Principles (GAAP), a catalogue of rules to govern the business.

I propose the development of Generally Accepted Numerical Principles. We must formalize the Expectation side of the statistical Expectation-Results dichotomy so that we may call out a liar and impose consequences where necessary.

How might such a system work? I would leave the details to the expert statisticians, but one way would be to develop a formal catalogue of Expectations given specific Results. It might look something like the following (though this is not the formal proposal):

**Use of “Average”**

**A number called an “average” in isolation entails the following assumptions:**

- The number presented is an arithmetic mean.
- The sample underlying the average is an unbiased representative of the stated population.
- The population has a normal distribution in the variable.
- The median is within 0.1 standard deviations of the mean.
- p < 0.05
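To see why the "average in isolation" rules matter, here is a minimal sketch (in Python, with made-up salary figures) of how far a mean can drift from a median in a skewed distribution:

```python
# Hypothetical salaries: nine workers and one owner.
# The distribution is heavily skewed, so the unqualified mean is not
# a "sane default" summary of what a typical employee earns.
salaries = [30_000] * 9 + [1_000_000]

mean = sum(salaries) / len(salaries)
median = sorted(salaries)[len(salaries) // 2]  # middle element; the tie makes the even-count convention moot

print(mean)    # 127000.0 -- the "average" a marketer might quote
print(median)  # 30000 -- what a typical worker earns
```

A marketer can truthfully report an "average salary" of $127,000 while nine out of ten employees earn $30,000; the rules above would force a clarification before the bare word "average" could be used.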

**N out of M/Percentages**

**A statement of the form “N out of M Practitioners <statement>” or “X% of Practitioners <statement>” implies:**

- The sample is an unbiased representative of the stated population.
- The population has a normal distribution in the variable.
- p < 0.01

**Line Graphs**

**A line graph must:**

- Have both axes labelled, with units included.
- Have a y-axis with 0 at the origin and no discontinuities.
- Plot only data points collected with equal sample characteristics.
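A rule catalogue like this has the virtue of being mechanical enough to check automatically. As a sketch only (the function name and the 0.1-standard-deviation threshold come from the hypothetical "average" rules above, not from any real standard), a certification tool might include something like:

```python
import statistics

def average_in_isolation_ok(sample):
    """Check one hypothetical GANP rule: an unqualified 'average'
    requires the median to lie within 0.1 standard deviations of
    the mean. (Other rules, such as unbiased sampling, depend on
    context outside the sample and cannot be checked here.)"""
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    sd = statistics.stdev(sample)
    return abs(mean - median) <= 0.1 * sd

print(average_in_isolation_ok([9, 10, 10, 11, 10, 12, 8]))  # True: roughly symmetric
print(average_in_isolation_ok([2, 3, 3, 4, 100]))           # False: heavily skewed
```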

Appropriate uses could be given a trustworthy logo or stamp. Publications could be “GANP 2011 certified”, indicating that they obey the rules of the GANP. It would become easy for lay people to know which numbers to trust.

Obviously, the development of such a catalogue would be a monumental task. The organizing committees would be subject to perpetual corruption and interference attempts. The first several iterations of the GANP would permit rampant abuses while loopholes were found and closed. Chaos, confusion, and doubt would run amok. During the development of the rules, at least 452,235,239 people would die and more than 1.37 billion would suffer in poverty. Nevertheless, four out of five University of Toronto experts agree: this is a good idea.

November 20, 2008 at 7:27 am

I think a better set of rules would be the admittedly more vague, “be as Bayesian as possible” and “don’t be a dick”.

Demanding normal distributions to use the word average is absurd!

November 20, 2008 at 4:45 pm

I’m not saying that you must have a normal distribution if you use the word average. I’m saying that if you use the word average *in isolation*, the natural assumption will be that you’re talking about a normal distribution. Thus, if you state an average *in isolation* when you have a highly skewed distribution, you are misleading the reader until you clarify. It’s about sane defaults.

November 22, 2008 at 8:04 pm

No, the natural assumption is not a normal distribution. Educated people will not automatically assume the normal distribution when told the average of some quantity. They will assume whatever is most reasonable based on their beliefs about the quantity in question. Uneducated people probably won’t know what the normal distribution is and may implicitly assume some arbitrary unimodal distribution.

I think it is much more plausible that people will assume things like “The median is within 0.1 standard deviations of the mean.” than that they will assume normality. Saying the variable comes from a normal distribution is much stronger than that requirement.

The whole POINT of presenting an average or other summary statistic is that you don’t have to make a claim about the entire distribution. The full distribution is ALL the information you can have about a random variable. The reason people use summary statistics (http://en.wikipedia.org/wiki/Summary_statistics) is to communicate as much as possible as simply as possible.

So I have two problems with what you seem to be saying. First, people don’t assume variables come from a normal distribution automatically. For example, “the average salary in a company is X” does NOT make me assume a normal distribution for salaries and probably doesn’t make any other educated person assume that either.

Secondly, anything that requires people to specify the distribution of a variable they report the average of is silly, because reporting averages is designed to avoid this. Most of the time the true distribution is unknown. If the true distribution IS known, then by all means report it and try to justify with data that the distribution is correct.

November 24, 2008 at 3:40 pm

We’re talking about two different things.

You’re talking about benevolent communication of information to an educated, interested audience.

I’m talking about standards for quasi-malicious communication of information to an uneducated, hurried, mostly uninterested audience.

In marketing materials, the point of a summary statistic isn’t to communicate as much as possible as simply as possible. It’s to mislead the reader as much as possible in a way that favours the marketer, without actually telling a lie. What I’m talking about is the need for a usable, easy standard to push the bar higher on what counts as an effective lie.