A couple of nights ago I finished John Grishan’s the Rouge Lawyer so I started reading Jordan Ellenberg’s “How not to be wrong. The power of mathematical thinking.” The cover says “a math-world superstar unveils the hidden beauty and logic of the world and puts math’s power in our hands.”
The book was only moderately annoying until I got to page 65. There he talks about statistics on brain cancer deaths per 100,000. The top states according to his data are South Dakota, Nebraska, Alaska, Delaware, and Maine. At the bottom are Wyoming, Vermont, North Dakota, Hawaii and the District of Columbia.
He writes “Now that is strange. Why should South Dakota be brain cancer center and North Dakota nearly tumor free? Why would you be safe in Vermont but imperiled in Maine.”
“The answer: … The five states at the top have something in common, and the five states at the bottom do too. And it’s the same thing: hardly anyone lives there.” There follows a discussion of flipping coins and the fact that frequencies have more random variation when the sample size is small, but he never stops to see if this is enough to explain the observation.
My intuition told me it did not, so I went and got some brain cancer data.
https://www.statecancerprofiles.cancer.gov/incidencerates/
In the next figure the x-axis is population size, plotted on a log scale to spread out the points and the y-axis is the five year average rate per year per 100,000 people. Yes there is less variability as you move to the right, and little Hawaii is way down there, but there are also some states toward the middle that are on the top edge. The next plots shows 99% confidence intervals versus state size. I used 99% rather than 95% since there are 49 data points (nothing for Nevada for some reason).
In the next figure the horizontal line marks the average 6.6. The squares are upper end points of the confidence intervals. When they fall below the line, this suggests that the mean is significantly lower than the national average. From left to right: Hawaii, New Mexico, Louisiana and California. When the little diamond marking the lower end of the confidence interval is above the line, we suspect that the rate for that state is significantly higher than the mean. There are eight states in that category: New Hampshire, Iowa, Oregon, Kentucky, Wisconsin, Washington, New Jersey, and Pennsylvania.
So yes there are 12 significant deviations from the mean (versus 5 we would get if all 49 states had mean 6.6) but they are not the ones at the top or the bottom of the list, and the variability of the sample mean has nothing to do with the explanation. So Jordan, welcome to world of APPLIED math, where you have to look at data to test your theories. Don’t feel bad the folks in the old Chemistry building at Duke will tell you that I don’t know stat either. For aa more professional look at the problem see
http://www.stat.columbia.edu/~gelman/research/published/allmaps.pdf