Category Archives: Probability

Pooled Tests for COVID-19

When one is dealing with a disease that occurs at a low frequency in the population and one has a large number of people to test, it is natural to do group testing. A fixed number of samples, say 10, are mixed together. If the combined sample is negative, we know all the individuals in the group are negative. But if a group tests positive, then all the samples in the group have to be retested individually.

If the groups are too small then not much work is saved. If the groups are too large then there are too many positive group tests. To find the optimal group size, suppose there are a total of N individuals, the group size is k, and 1% of the population has the disease. The number of group tests that must be performed is N/k. The probability a group tests positive is approximately k/100, and if this happens we need k more tests. Thus we want to minimize

(N/k)(1 + k²/100) = N/k + Nk/100

Differentiating, we want −N/k² + N/100 = 0, or k = 10. In the concrete case N = 1000, the number of tests is 200.

Note: the probability a group test is positive is p = 1 − (1 − 1/100)^k, but this makes the optimization very messy. When k = 10, 1 + kp = 1.956, so the answer does not change by very much.
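For readers who want to check this, here is a minimal Python sketch (the function name is mine, not from any of the papers cited below) that computes the exact expected number of tests per person, (1 + kp)/k with p = 1 − (1 − 1/100)^k, and searches for the best group size.

```python
# Expected number of tests per person for one-round pooling with group size k,
# assuming prevalence q = 1% and the exact group-positive probability.
def tests_per_person(k, q=0.01):
    p_group = 1 - (1 - q) ** k        # probability the pooled sample is positive
    return (1 + k * p_group) / k      # one group test, plus k retests if positive

best_k = min(range(2, 51), key=tests_per_person)
print(best_k, tests_per_person(best_k))   # exact optimum is k = 11, essentially tied with k = 10
print(1000 * tests_per_person(10))        # about 196 tests for N = 1000, close to the 200 above
```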

Recent work reported on in Nature on July 10, 2020 shows that the number of tests needed can be reduced substantially if the individuals are divided into groups in two different ways for group testing before one has to begin testing individuals. To visualize the set-up, consider a k by k matrix with one individual in each cell. We will group test the rows and group test the columns. An individual who tests negative in either test can be eliminated. The number of k by k squares is N/k². For each square there are 2k tests that are always performed. Each of the k² individuals in the square has both of its group tests come back positive with probability (k/100)². These events are NOT independent, but that does not matter in computing the expected number of tests

(N/k²)(2k + k⁴/10,000) = 2N/k + Nk²/10,000

Differentiating, we want −2N/k² + 2Nk/10,000 = 0, or k = (10,000)^{1/3} = 21.54. In the concrete case N = 1000 the expected number of tests is 139.
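A companion sketch for the row-and-column scheme, using the same crude approximation as the text (each individual is retested with probability about (k/100)²), confirms the minimum is near k = 22 and about 139 tests for N = 1000. This only checks the arithmetic, not the exact optimization.

```python
# Approximate expected number of tests for the row-and-column scheme:
# per k-by-k square, 2k group tests plus k^2 individuals each retested
# with probability roughly (k*q)^2, where q = 1% is the prevalence.
def tests_2d(k, N=1000, q=0.01):
    return (N / k**2) * (2 * k + k**2 * (k * q) ** 2)

best_k = min(range(2, 60), key=tests_2d)
print(best_k, tests_2d(best_k))   # about k = 22 and 139 tests, versus roughly 196 for one round
```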

Practical Considerations:

One could do fewer tests by eliminating the negative rows before testing the columns, but the algorithm used here allows all the tests to be done at once, avoiding the need to wait for the first round results to come back before the second round is done.

Larger group sizes will make it harder to detect the virus if only one individual in the group is infected. In the Nature article, Sigrun Smola of the Saarland University Medical Center in Homburg is quoted as recommending against grouping more than 30 individuals in one test. Others claim that it is possible to identify the virus when there is one positive individual out of 100.

Ignoring the extra work in creating the group samples, the method described above reduces the cost of testing by 86%. The price of $9 per test quoted in the article would be reduced to $1.26, so this could save a considerable amount of money for a university that has to test 6000 undergraduates several times in one semester.

In May, officials in Wuhan used a method of this type to test 2.3 million samples in two weeks.

References

Mutesa, L., et al. (2020) A strategy for finding people infected with SARS-CoV-2: optimizing pooled testing at low prevalence. arXiv:2004.14934

Mallapaty, Smriti (2020) The mathematical strategy that could transform coronavirus testing. Nature News, July 10. https://www.nature.com/articles/d41586-020-02053-6

 

Moneyline Wagering

With the orange jackass (aka widdle Donnie) first declaring the coronavirus a hoax, then telling people to go ahead and go to work if you are sick, and only today tweeting that the people at the CDC are amazed at how much he knows about covid-19, it is time to have some fun.

Tonight (March 7) at 6 PM in Cameron Indoor Stadium the Blue Devils, with a conference record of 14-5, will take on the UNC Tar Heels, who are 6-13 and will end up in last place if they lose. If you go online to look at the odds for tonight's Duke-UNC game you find the curious looking

Duke -350

UNC 280

What this means is that you have to bet $350 on Duke to win $100, while if you bet $100 on UNC you win $280.

Let p be the probability Duke wins.

For the bet on Duke to be fair we need 100p – 350(1-p) = 0 or p = 7/9 = 0.7777

For the bet on UNC to be fair we need -100p + 280(1-p) = 0 or p = 0.7368

If 0.7368 < p < 0.7777 both bets are unfavorable.

This suggests that the a priori probability Duke wins is about 3/4.

Another way of looking at this situation is through the money. If a fraction x of the bettors put their money on Duke (each making the bets described above) then

When Duke wins the average winnings are 100x – 100(1-x)

When UNC wins the average winnings are -350 x + 280 (1-x)

Setting these equal gives 200 x + 630 x = 100 + 280 or x = 38/83 = 0.4578

If this fraction of people bet on Duke then the average payoff from either wager is −700/83 = −$8.43, and the people who are offering the wager don't care who wins.
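Here is a small Python sketch of both calculations, assuming, as above, that each Duke bettor risks $350 to win $100 and each UNC bettor risks $100 to win $280; the variable names are mine.

```python
# Break-even probabilities that Duke wins, implied by the two moneylines,
# and the fraction x of bettors on Duke that makes the book indifferent.
duke_risk, duke_win = 350, 100     # bet $350 on Duke to win $100
unc_risk, unc_win = 100, 280       # bet $100 on UNC to win $280

p_from_duke_line = duke_risk / (duke_risk + duke_win)   # 7/9 = 0.7778
p_from_unc_line = unc_win / (unc_risk + unc_win)        # 14/19 = 0.7368
print(p_from_duke_line, p_from_unc_line)

# Solve 100x - 100(1-x) = -350x + 280(1-x) for x.
x = (100 + 280) / (200 + 630)                 # 38/83 = 0.4578
print(x, 100 * x - 100 * (1 - x))             # average payoff about -$8.43 per bettor
```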

Harry Kesten 1931-2019

 

Harry Kesten at Cornell in 1970 and in his later years


On March 29, 2019 Harry Kesten lost a decade-long battle with Parkinson’s disease. His passing is a sad event, so I would like to find solace in celebrating his extraordinary career. In addition I hope you will learn a little more about his work by reading this.

Harry was born in Duisburg, Germany on November 19, 1931. His parents escaped from the Nazis in 1933 and moved to Amsterdam. After studying in Amsterdam, he was a research assistant at the Mathematical Center there until 1956, when he came to Cornell. He received his Ph.D. in 1958 at Cornell University under the supervision of Mark Kac.

In his 1958 thesis on Symmetric Random Walks, he showed that the spectral radius equals the exponential decay rate of the return probability to 0, and the latter is strictly less than 1 if and only if the group is non-amenable. This work has been cited 206 times and is his second most cited publication (according to MathSciNet). Harry was an instructor at Princeton University for one year and at the Hebrew University for two years before returning to Cornell, where he spent the rest of his career. While in Israel, he and Furstenberg wrote their classic paper on Products of Random Matrices.

In the 1960s, he wrote a number of papers that proved sharp or very general results on random walks, branching processes, etc. One of the most famous of these is the 1966 Kesten-Stigum theorem, which shows that the normalized branching process Z_n/m^n, where m is the mean of the offspring distribution, has a nontrivial limit if and only if the offspring distribution has E(X log⁺ X) < ∞. In 1966 he also proved a conjecture of Erdős and Szüsz about the discrepancy between the number of rotations of a point on the unit circle hitting an interval and its length. Foreshadowing his work in physics, he showed in 1963 that the number σ_n of self-avoiding walks of length n satisfies σ_{n+2}/σ_n → μ², where μ is the connective constant.

Harry's almost 200 papers have been cited 3781 times by 2329 authors. However, these statistics underestimate his impact. In baseball terms, Harry was a closer. When he wrote a paper about a topic, his results often eliminated the need for future work on it. One of Harry's biggest weaknesses was that he was too smart. When most of us are confronted with a problem, we need to try different approaches to find a route through the woods to a solution. Harry simply got on his bulldozer and drove over all obstacles. He needed 129 pages in the Memoirs of the AMS to answer the question: "Which processes with stationary independent increments hit points?", a topic he spoke about at the International Congress at Nice in 1970.

In 1984 Harry gave lectures on first passage percolation at the St. Flour Probability Summer School. This subject dates back to Hammersley's 1966 paper and was greatly advanced by Smythe and Wierman's 1978 book. However, Harry's paper attracted a number of people to work on the subject and it has continued to be a very active area. See 50 Years of First Passage Percolation by Auffinger, Damron, and Hanson for more details. You can buy this book from the AMS or download it from the arXiv. I find it interesting that Harry lists only six papers on his Cornell web page. Five have already been mentioned. The sixth is On the speed of convergence in first-passage percolation, Ann. Appl. Probab. 3 (1993), 296–338.

Harry worked in a large number of areas. There is not enough space for a systematic treatment so I will just tease you with a list of titles. Sums of stationary sequences cannot grow slower than linearly. Random difference equations and renewal theory for products of random matrices. Subdiffusive behavior of a random walk on a random cluster. Greedy lattice animals. How long are the arms of DLA? If you want to try to solve a problem Harry couldn't, look at his papers on Diffusion Limited Aggregation.

In the late 1990s, Maury Bramson and I organized a conference in honor of Harry's 66 2/3 birthday. (We missed 65 and didn't want to wait for 70.) A distinguished collection of researchers gave talks and many contributed to a volume of papers in his honor called Perplexing Problems in Probability. The 21 papers in the volume provide an interesting snapshot of research at the time. If you want to know more about Harry's first 150 papers, you can read my 32 page summary of his work that appears in that volume.

According to math genealogy, Harry supervised 17 Cornell Ph.D. students who received their degrees between 1962 and 2003. Maury Bramson and Steve Kalikow were part of the Cornell class of 1977 that included Larry Gray and David Griffeath, who worked with Frank Spitzer. (Fortunately, I graduated in 1976!) Yu Zhang followed in Harry's footsteps and made a number of contributions to percolation and first passage percolation. I'll let you use Google to find out about the work of Kenji Ichihara, Antal Jarai, Sungchul Lee, Henry Matzinger, and David Tandy.

Another "broader impact" of Harry's work came from his collaborations with a long list of distinguished co-authors: Vladas Sidoravicius (12 papers), Ross Maller (10), Frank Spitzer (8), Geoffrey Grimmett (7), Yu Zhang (7), Itai Benjamini (6), J.T. Runnenberg (5), Roberto Schonmann (4), Rob van den Berg (4), … I wrote 4 papers with him, all of which were catalyzed by an interaction with another person. In response to a question asked by Larry Shepp, we wrote a paper about an inhomogeneous percolation model which was a precursor to work by Bollobás, Janson, and Riordan. Making money from fair games, joint work with Harry and Greg Lawler, arose from a letter A. Spataru wrote to Frank Spitzer. I left it to Harry and Greg to sort out the necessary conditions.

Harry wrote 3 papers with two very different versions of Jennifer Chayes. With a leather-jacketed Cornell postdoc, her husband Lincoln Chayes, Geoff Grimmett and Roberto Schonmann, he studied “The correlation length for the high density phase.” With the manager of the Microsoft research group, her husband Christian Borgs, and Joel Spencer he wrote two papers, one on the birth of the infinite component in percolation and another on conditions implying hyperscaling.

As you might guess from my narrative, Kesten received a number of honors. He won the Brouwer Medal in 1981. Named after L.E.J. Brouwer, it is the Netherlands' most prestigious award in mathematics. In 1983 he was elected to the National Academy of Sciences. In 1986 he gave the IMS' Wald Lectures. In 1994 he won the Polya Prize from SIAM. In 2001 he won the AMS' Steele Prize for lifetime achievement.

Being a devout orthodox Jew, Harry never worked on the Sabbath. On Saturdays in Ithaca, I would often drive past him taking a long walk on the aptly named Freese Road, lost in thought. Sadly Harry is now gone, but his influence on the subject of probability will not be forgotten.

Jonathan Mattingly’s work on Gerrymandering

My last two posts were about a hurricane and a colonoscopy, so I thought it was time to write about some math again.

For the last five years Mattingly has worked on a problem with important political ramifications: what would a typical set of congressional districts (say the 13 districts in North Carolina) look like if they were chosen at "random" subject to the restrictions that they contain a roughly equal number of voters, are connected, and minimize the splitting of counties? The motivation for this question can be explained by looking at the current congressional districts in North Carolina. The tiny purple snake is district 12. It begins in Charlotte, goes up I-40 to Greensboro, and then wiggles around to contain other nearby cities, producing a district with a large percentage of Democrats.

To explain the key idea of gerrymandering, suppose, to keep the arithmetic simple, that a state has 2000 Democrats and 2000 Republicans. If there are four districts and we divide the voters as follows

District    Republicans    Democrats
1              600              400
2              600              400
3              600              400
4              200              800

then the Republicans will win in 3 districts out of 4. This solution extends easily to create 12 districts in which the Republicans win 9. With a little more imagination and the help of a computer one can produce the outcome of the 2016 congressional election in North Carolina, in which 10 Republicans and 3 Democrats were elected, despite the fact that the split between the parties is roughly 50-50.

The districts in the North Carolina map look odd, and the 7th district in Pennsylvania (nicknamed "Goofy kicks Donald Duck") looks ridiculous, but this is not proof of malice.

Mattingly, with a group of postdocs, graduate students, and undergraduates, has developed a statistical approach to this subject. To explain this we will consider a simple problem that can be analyzed using material taught in a basic probability or statistics class. A company has a machine that produces cans of tomatoes. On average a can contains a pound of tomatoes (16 ounces), but the machine is not very precise, so the weight has a standard deviation (a statistical measure of the "typical deviation" from the mean) of 0.2 ounces. If we assume the weight of tomatoes follows the normal distribution, then 68% of the time the weight will be between 15.8 and 16.2 ounces. To see if the machine is working properly, an employee samples 16 cans and finds an average weight of 15.7 ounces.

To see if something is wrong we ask the question: if the machine were working properly, what is the probability that the average weight would be 15.7 ounces or less? The standard deviation of one observation is 0.2, but the standard deviation of the average of 16 observations is 0.2/(16)^{1/2} = 0.05. The observed average is 0.3 below the mean, or 6 standard deviations. Consulting a table of the normal distribution or using a calculator, we see that if the machine were working properly then an average of 15.7 or less would occur with probability less than 1 in 10,000.
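A quick check of the tomato-can arithmetic in Python (assuming scipy is available for the normal tail):

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n, xbar = 16.0, 0.2, 16, 15.7
se = sigma / sqrt(n)            # standard deviation of the sample mean = 0.05
z = (xbar - mu) / se            # = -6
print(se, z, norm.cdf(z))       # tail probability is about 1e-9, far below 1 in 10,000
```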

To approach gerrymandering, we ask a similar question: if the districts were drawn without looking at party affiliation, what is the probability that 3 or fewer Democrats would be elected? This is a more complicated problem, since one must generate a random sample from the collection of districtings with the desired properties. To do this Mattingly's team has developed methods to explore the space of possibilities by making successive small changes in the maps. Using this approach one has to make a large number of changes before one has a map that is "independent." In a typical analysis they generate 24,000 maps. They found that when the votes were retallied in the randomly generated maps, 3 or fewer Democrats were elected in fewer than 1% of the scenarios. The next graphic shows results for the 2012 and 2016 maps and one drawn by judges.
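To make the outlier logic concrete, here is a toy Python sketch. It is emphatically not the Duke team's algorithm, which samples maps satisfying the equal-population, connectivity, and county constraints; here the precinct vote shares are made up, and the "maps" are just random assignments of precincts to 13 equal-size districts.

```python
# Toy ensemble: shuffle made-up precincts into 13 "districts" and count how
# often a random plan elects 3 or fewer Democrats.  Illustrates the idea only.
import random

random.seed(1)
n_precincts, n_districts = 13 * 20, 13
# hypothetical precinct-level Democratic vote shares, roughly 50-50 statewide
dem_share = [random.gauss(0.5, 0.12) for _ in range(n_precincts)]

def dem_seats(plan):
    per = len(plan) // n_districts
    return sum(sum(plan[i * per:(i + 1) * per]) / per > 0.5 for i in range(n_districts))

counts = [dem_seats(random.sample(dem_share, len(dem_share))) for _ in range(5000)]
print(sum(c <= 3 for c in counts) / len(counts))   # fraction of random plans with <= 3 Democratic seats
```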

Mattingly has also done analyses of congressional districts in Wisconsin and Pennsylvania, and has helped lawyers prepare briefs for cases challenging voting maps. His research has been cited in many decisions, including the ruling by a three-judge panel in August 2018 that the NC congressional districts were unconstitutional. For more details see the Quantifying Gerrymandering blog

https://sites.duke.edu/quantifyinggerrymandering/author/jonmduke-edu/

Articles about Mattingly’s work have appeared in

(June 26, 2018) Proceedings of the National Academy of Sciences 115 (2018), 6515–6517

(January 17, 2018)  Nature 553 (2018), 250

(October 6, 2017) New York Times  https://www.nytimes.com/2017/10/06/opinion/sunday/computers-gerrymandering-wisconsin.html

The last article is a good (or perhaps I should say bad) example of what can happen when your work is written about in the popular press. The article, written by Jordan Ellenberg, is, to stay within the confines of polite conversation, simply awful. Here I will confine my attention to its two major sins.

  1. Ellenberg refers several times to the Duke team but never mentions them by name. I guess our not-so-humble narrator does not want to share the spotlight with the people who did the hard work. The three people who wrote the paper are Jonathan Mattingly, professor and chair of the department, Greg Herschlag, a postdoc, and Robert Ravier, one of our better grad students. The paper went from nothing to fully written in two weeks in order to get ready for the court case. However, thanks to a number of late nights they were able to present clear evidence of gerrymandering. It seems to me that they deserve to be mentioned in the article, and it should have mentioned that the paper was available on the arXiv, so people could see for themselves.
  2. The last sentence of the article says “There will be many cases, maybe most of them, where it’s impossible, no matter how much math you do, to tell the difference between innocuous decision making and a scheme – like Wisconsin’s – designed to protect one party from voters who might prefer the other.” OMG. With many anti-gerrymandering lawsuits being pursued across the country, why would a “prominent” mathematician write that in most cases math cannot be used to detect gerrymandering?

Abelian Sand Pile Model

Today is January 7, 2018. I am tired of Trump bragging that he is a "very stable genius." Yes, he made a lot of money (or so he says), but he doesn't know what genius looks like. Today's column is devoted to work of Wesley Pegden (and friends) on the Abelian Sand Pile Model. Why this topic? Well, he is coming to give a talk on Thursday in the probability seminar.

This system was introduced in 1988 by Bak, Tang, and Wiesenfeld (Phys. Rev. A 38, 364). The simplest version of the model takes place on a square subset of the two-dimensional integer lattice. Grains of sand are dropped at random. When the number of grains at a point is ≥ 4, the pile topples and one grain is sent to each neighbor. This may cause other sites to topple, setting off an avalanche.
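For the curious, here is a minimal Python sketch of the toppling rule on a finite grid (grains that fall off the edge are lost); the function name is mine.

```python
import random

# Drop grains at random sites of an n-by-n grid; any site with >= 4 grains
# topples, sending one grain to each neighbor, possibly setting off an avalanche.
def drop_and_topple(grid, n_grains, seed=0):
    rng = random.Random(seed)
    n = len(grid)
    for _ in range(n_grains):
        i, j = rng.randrange(n), rng.randrange(n)
        grid[i][j] += 1
        unstable = [(i, j)]
        while unstable:
            x, y = unstable.pop()
            while grid[x][y] >= 4:
                grid[x][y] -= 4
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < n and 0 <= ny < n:
                        grid[nx][ny] += 1
                        if grid[nx][ny] >= 4:
                            unstable.append((nx, ny))
    return grid

grid = drop_and_topple([[0] * 20 for _ in range(20)], 2000)
print(max(max(row) for row in grid))   # after stabilization every site has at most 3 grains
```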

The word Abelian refers to the property that the state after n grains have landed is independent of the order in which they are dropped. The reason that physicists are interested is that the system "self-organizes itself into a critical state" in which avalanche sizes have a power law. The Abelian sand pile has been extensively studied, and there are connections to many branches of mathematics, but for that you'll have to go to the Wikipedia page or to the paper "What is … a sandpile?" written by Lionel Levine and Jim Propp, which appeared in the Notices of the AMS 57 (2010), 976-979.

In a 2013 article in the Duke Math Journal [162, 627-642] Wesley Pegden and Charles Smart studied what happens when you put n grains of sand at the origin of the infinite d-dimensional lattice and let the system run until it reaches its final state. They used PDE techniques to show that when space is scaled by n^{1/d} the configuration converges weakly to a limit, i.e., integrals against a test function converge. As Fermat once said, the proof won't fit in the margin, but in a nutshell they used viscosity solution theory to identify the continuum limit of the least action principle of Fey–Levine–Peres (J. Stat. Phys. 138 (2010), 143-159). A picture is worth several hundred words.


In a 2016 article in Geometric and Functional Analysis, Pegden teamed up with Lionel Levine (now at Cornell) to study the fractal structure of the limit. The solution is somewhat intricate, involving solutions of PDEs and Apollonian triangulations that generalize Apollonian circle packings.

Duke grads vote on Union

According to the official press release: “Of the 1,089 ballots cast, 691 voted against representation (“NO”) by SEIU and 398 for representation by SEIU (“YES”). There were, however, 502 ballots challenged based on issues of voter eligibility. Because the number of challenged ballots is greater than the spread between the “YES” and “NO” votes, the challenges could determine the outcome and will be subject to post-election procedures of the NLRB.”

The obvious question is: what is the probability this would change the outcome of the election? If the NOs lose 397 votes and the YESes lose 105 in the recount, the outcome will be 294 NO, 293 YES. A fraction 0.6345 of the votes were NO. We should treat this as an urn problem, but to get a quick answer you can suppose the number of YES votes lost is Binomial(502, 0.3655). In the old days I would have to trot out Stirling's formula and compute for an hour to get the answer, but now all I have to do is type into my vintage TI-83 calculator

Binompdf(502, 0.3655, 105) = 2.40115 × 10⁻¹⁴

i.e., this is the probability that exactly 105 YES votes are lost. Adding up the even smaller terms for 104, 103, … gives a tail probability of the same tiny order of magnitude, so the challenged ballots are very unlikely to change the outcome.
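For anyone without a TI-83 handy, here are the same numbers in Python (assuming scipy), along with the urn version in which the 502 ballots are removed at random from the 398 YES and 691 NO.

```python
from scipy.stats import binom, hypergeom

p_yes = 398 / 1089                          # 0.3655
print(binom.pmf(105, 502, p_yes))           # single term, about 2.4e-14 as above
print(binom.cdf(105, 502, p_yes))           # whole tail: 105 or fewer YES votes lost
print(hypergeom.cdf(105, 1089, 398, 502))   # urn model: at most 105 YES among the 502 removed
```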

Regular readers of this blog will remember that I made a similar calculation to show that there was a very small probability that the 62,500 provisional ballots would change the outcome of the North Carolina governor's race, since before they were counted Cooper had a 4772 vote lead over McCrory. If we flip 62,500 coins then the standard deviation of the change in the number of votes is

(62,500 × 1/4)^{1/2} = 125

So McCrory would need 33,636 votes, which is 2,386 above the mean, or 19.08 standard deviations. However, as later results showed, this reasoning was flawed: Cooper's lead grew to more than 10,000 votes. This is due to the fact that, as I learned later, provisional ballots have a greater tendency to be Democratic, while absentee ballots tend to be Republican.

Is this all just #fakeprobability? Let's turn to a court case, DeMartini versus Power. In a close election in a small town, 2,656 people voted for candidate A compared to 2,594 who voted for candidate B, a margin of victory of 62 votes. An investigation of the election found that 136 of the people who voted in the election should not have. Since this is more than the margin of victory, should the election results be thrown out even though there was no evidence of fraud on the part of the winner's supporters?

In my wonderful book Elementary Probability for Applications, this problem is analyzed from the urn point of view. Since I was much younger when I wrote the first version of its predecessor in 1993, I wrote a program to add up the probabilities and got 7.492 × 10⁻⁸. That computation supported the Court of Appeals decision to overturn a lower court ruling that voided the election in this case. If you want to read the decision you can find it at

http://law.justia.com/cases/new-york/court-of-appeals/1970/27-n-y-2d-149-0.html
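Here is a hypergeometric sketch of that urn calculation in Python (assuming scipy). Whether a tie counts as overturning the result changes the threshold by one vote, so this only checks the order of magnitude of the 7.492 × 10⁻⁸ quoted above.

```python
from scipy.stats import hypergeom

total, a_votes, removed = 2656 + 2594, 2656, 136
# The 62-vote margin disappears only if at least 99 of the 136 removed votes were for A.
print(hypergeom.sf(98, total, a_votes, removed))   # P(at least 99), on the order of 1e-7 to 1e-8
```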

Jordan Ellenberg don’t know stat

A couple of nights ago I finished John Grisham's Rogue Lawyer, so I started reading Jordan Ellenberg's "How Not to Be Wrong: The Power of Mathematical Thinking." The cover says "a math-world superstar unveils the hidden beauty and logic of the world and puts math's power in our hands."

The book was only moderately annoying until I got to page 65. There he talks about statistics on brain cancer deaths per 100,000. The top states according to his data are South Dakota, Nebraska, Alaska, Delaware, and Maine. At the bottom are Wyoming, Vermont, North Dakota, Hawaii and the District of Columbia.

He writes "Now that is strange. Why should South Dakota be the brain cancer center and North Dakota nearly tumor free? Why would you be safe in Vermont but imperiled in Maine?"

“The answer: … The five states at the top have something in common, and the five states at the bottom do too. And it’s the same thing: hardly anyone lives there.” There follows a discussion of flipping coins and the fact that frequencies have more random variation when the sample size is small, but he never stops to see if this is enough to explain the observation.

My intuition told me it did not, so I went and got some brain cancer data.

https://www.statecancerprofiles.cancer.gov/incidencerates/

In the next figure the x-axis is population size, plotted on a log scale to spread out the points, and the y-axis is the five-year average rate per year per 100,000 people. Yes, there is less variability as you move to the right, and little Hawaii is way down there, but there are also some states toward the middle that are on the top edge. The next plot shows 99% confidence intervals versus state size. I used 99% rather than 95% since there are 49 data points (nothing for Nevada for some reason).

[Figure brain_cancer_fig1: five-year average brain cancer rate versus state population, with 99% confidence intervals]

In the next figure the horizontal line marks the average 6.6. The squares are the upper end points of the confidence intervals. When they fall below the line, this suggests that the mean for that state is significantly lower than the national average. From left to right these are: Hawaii, New Mexico, Louisiana and California. When the little diamond marking the lower end of the confidence interval is above the line, we suspect that the rate for that state is significantly higher than the mean. There are eight states in that category: New Hampshire, Iowa, Oregon, Kentucky, Wisconsin, Washington, New Jersey, and Pennsylvania.

[Figure brain_cancer_fig2: confidence interval endpoints versus state size, with the national average 6.6 marked]
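The confidence intervals can be computed along the following lines; the state rates and populations below are made-up placeholders (the real numbers come from the state cancer profiles site above), so this is only a sketch of the method.

```python
from math import sqrt

national_rate = 6.6   # brain cancer cases per 100,000 per year
z99 = 2.576           # 99% normal quantile

# (state, population, annual rate per 100,000) -- hypothetical illustrative values
states = [("Hawaii", 1.4e6, 4.9), ("New Hampshire", 1.3e6, 8.3), ("Texas", 27e6, 6.5)]

for name, pop, rate in states:
    exposure = 5 * pop / 1e5                    # five years, in units of 100,000 person-years
    count = rate * exposure                     # implied number of cases
    half_width = z99 * sqrt(count) / exposure   # normal approximation to the Poisson error
    lo, hi = rate - half_width, rate + half_width
    flag = "low" if hi < national_rate else ("high" if lo > national_rate else "not significant")
    print(name, round(lo, 2), round(hi, 2), flag)
```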

So yes, there are 12 significant deviations from the mean (versus the 5 we would expect if all 49 states had mean 6.6), but they are not the ones at the top or the bottom of the list, and the variability of the sample mean has nothing to do with the explanation. So Jordan, welcome to the world of APPLIED math, where you have to look at data to test your theories. Don't feel bad: the folks in the old Chemistry building at Duke will tell you that I don't know stat either. For a more professional look at the problem see

http://www.stat.columbia.edu/~gelman/research/published/allmaps.pdf

North Carolina Gubernatorial Election

Tuesday night, after the 4.7 million votes had been counted from all 2704 precincts, Roy Cooper had a 4772 vote lead over Pat McCrory. Since there could be as many as 62,500 absentee and provisional ballots, it was decided to wait until these were counted to declare a winner. The question addressed here is: what is the probability that these votes will change the outcome?

To do the calculation we need to make an assumption: the additional votes are similar to the overall population, so they are like flipping coins. In order to change the outcome of the election Cooper would have to get fewer than 31,250 − 4772/2 = 28,864 votes. The standard deviation of the number of heads in 62,500 coin flips is (62,500 × 1/4)^{1/2} = 125, so this represents 19.09 standard deviations below the mean.

One could be brave and use the normal approximation. However, all this semester while I have been teaching Math 230 (Elementary Probability) people have been asking why we do this when we can just use our calculator:

Binomcdf(62500, 0.5, 28864) = 1.436 × 10⁻⁸¹

In contrast, if we use the normal approximation with the tail bound (which I found impossible to type using equation editor) we get 1.533 × 10⁻⁸¹.
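In Python (assuming scipy) the same computation and the normal approximation look like this; both numbers are so small that the exact value is beside the point.

```python
from math import sqrt
from scipy.stats import binom, norm

n, need = 62500, 28864
print(binom.cdf(need, n, 0.5))        # about 1.4e-81
z = (need - n / 2) / sqrt(n / 4)      # about -19.09 standard deviations
print(norm.cdf(z))                    # about 1.5e-81
```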

We can't take this number too seriously, since the probability our assumption is wrong is larger than that, but it suggests that we will likely have a new governor and House Bill 2 will soon be repealed.

Teaching Statistics using Donald Trump.

Recently, in Pennsylvania, Donald Trump said "The only way they can beat me in my opinion, and I mean this 100 percent, if in certain sections of the state they cheat." He never said how he determined that. If it is on the basis of the people he talked to as he campaigned, he had a very biased sample.

At about the time of Trump's remarks, there was a poll showing 50% voting for Clinton, 40.6% for Trump, and the rest undecided or not stating an opinion. Let's look at the poll result through the eyes of an elementary statistics class. We are not going to give a tutorial on that subject here, so if you haven't had the class, you'll have to look online or ask a friend.

Suppose we have 8.2 million marbles (representing the registered voters in PA) in a really big bowl. Think of one of those dumpsters they use to haul away construction waste. Suppose we reach in and pick out 900 marbles at random, which is the size of a typical Gallup poll. For each blue Hillary Clinton marble we add 1 to our total, for each red Donald Trump marble we subtract 1, and for each white undecided marble we add 0.

The outcomes of the 900 draws are independent. To simplify the arithmetic, we note that since our draws only take the values −1, 0, and 1, they have variance less than 1. Thus when we add up the 900 results and divide by 900, the standard deviation of the average is at most (1/900)^{1/2} = 1/30. By the normal approximation (central limit theorem), about 95% of the time the result will be within 2/30 = 0.0666 of the true mean. In the poll results above the average is 0.5 − 0.406 = 0.094, so by Statistics 101 reasoning we are 95% confident that there are more blue marbles than red marbles in the "bowl."
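The Statistics 101 arithmetic fits in a few lines of Python:

```python
from math import sqrt

n = 900
margin = 0.50 - 0.406        # Clinton lead in the poll
se_bound = 1 / sqrt(n)       # each +1/0/-1 draw has variance at most 1, so sd of the average <= 1/30
print(margin, 2 * se_bound)  # 0.094 versus a 95% margin of error of about 0.067
```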

That analysis is oversimplified in at least two ways. First of all, when you draw a marble out of the bowl you get to see what color it is. If you ask a person who they are going to vote for, they may not tell you the truth. It is for this reason that the use of exit polls has been discontinued. If you ask people how they voted when they leave the polling place, what you estimate is the fraction of blue voters among those willing to talk to you, not the fraction of people who voted blue. A second problem with our analysis is that people will change their opinions over time.

A much more sophisticated analysis of polling data can be found at FiveThirtyEight.com, specifically at http://projects.fivethirtyeight.com/2016-election-forecast/ There, if you hover your mouse over Pennsylvania (today is August 16), you find that Hillary has an 89.3% chance of winning Pennsylvania versus Donald Trump's 10.7%, which is about the same as the predictions for the overall winner of the election.

The methodology used is described in detail at

http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/

In short, they use a weighted average of the results of about 10 polls, with weights based on how well the polls have done in the past. In addition they are conservative in the early going, since surprises can occur.

Nate Silver, the founder of 538.com, burst onto the scene in 2008 when he correctly predicted the way 49 of 50 states voted in that year's presidential election. In 2012, while CNN was noting that Obama and Romney were tied at 47% of the popular vote, he correctly predicted that Obama would receive more than 310 electoral votes and easily win the election.

So Donald, based on the discussion above, I can confidently say that  no cheating is needed for you to lose Pennsylvania. Indeed, at this point in time, it would take a miracle for you to win it.

The odds of a perfect bracket are roughly a billion to 1

This time of year it is widely quoted that the odds of picking a perfect bracket are 9.2 quintillion to one. In scientific notation that is 9.2 × 10¹⁸, or if you like writing out all the digits it is 9,223,372,036,854,775,808 to 1. That number is 2⁶³, i.e., one over the probability that you succeed if you flip a coin to make every pick.

If you know a little then you can do much better than this, by say taking into account the fact that a 16 seed has never beaten a one-seed. In a story widely quoted last year “Duke math professor Jonathan Mattingly calculated the odds of picking all 32 games correctly is actually one in 2.4 trillion.” He doesn’t give any details, but I don’t know why I should trust a person who doesn’t know there are 63 games in the tournament.

Using a different approach, DePaul mathematician Jeff Bergen calculated the odds at one in 128 billion. His YouTube video from four years ago, https://www.youtube.com/watch?v=O6Smkv11Mj4, is entertaining but light on details.

Here I will argue that the odds are closer to one billion to 1. The key to my calculation of the probability of a perfect bracket is to use data from the outcomes of the first round games for 20 years of NCAA 64-team tournaments. The columns below give the matchup, the win–loss record of the higher seed, and the winning percentage of the team that won more often (the higher seed in every case except the 8–9 game).

Matchup    Record     Pct.
1-16       80-0       1
2-15       76-4       0.95
3-14       67-13      0.8375
4-13       64-16      0.8
5-12       54-26      0.675
6-11       56-24      0.7
7-10       48-32      0.6
8-9        37-43      0.5375

From this we see that if we pick the 9 seed to "upset" the 8, but in all other cases pick the higher seed, then we will pick all 8 games correctly with probability 0.09699, or about 0.1, compared to the 1/256 chance you would have by guessing.

Not having data for the other seven games, I will make the rash but simple assumption that the probability of picking these seven games correctly is also 0.1. Combining our two estimates, we see that the probability of perfectly predicting a regional tournament is 0.01. All four regional tournaments can then be done with probability 10⁻⁸. That leaves three games to pick the champion from the final four. If we simply guess at this point we have a 1 in 8 chance, and a final answer of about 1 in a billion.
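The whole back-of-the-envelope calculation fits in a few lines of Python; the only inputs are the first-round percentages above and the rash assumptions just described.

```python
# Probability of a perfect bracket under the assumptions in the text.
first_round = [1.0, 0.95, 0.8375, 0.8, 0.675, 0.7, 0.6, 0.5375]

p_round1 = 1.0
for p in first_round:
    p_round1 *= p
print(p_round1)                        # about 0.097 for a perfect first round in one region

p_region = p_round1 * 0.1              # assume the rest of a regional is also about 0.1
p_bracket = p_region ** 4 / 8          # four regions, then coin-flip the last three games
print(p_bracket, 1 / p_bracket)        # about 1.1e-9, i.e. roughly one in a billion
```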

To argue that this number is reasonable, let's take a look at what happened in the 2015 bracket challenge. 320 points are up for grabs in each round: 10 points for each of the 32 first round games (the play-in or "first four" games are ignored), 20 for each of the 16 second round games, and so on until picking the champion gives you 320 points. The top ranked bracket had

27 x 10 + 14 x 20 + 8 x 40 + 4 x 80 + 2 x 160 + 1 x 320 = 1830 points out of 1920.

This person missed 5 first round and 2 second round games. There are a number of other people with scores of 1800 or more, so it is not too far-fetched to believe that if the number of entries were increased by a factor of 2⁷ = 128 we might have a perfect bracket. The last calculation is a little dubious, but if the true odds were 4.6 trillion to one or even 128 billion to 1, it is doubtful one of 11 million entrants would get this close.

With some more work one could collect data on how often an ith seed beats a jth seed when they meet in a regional tournament or perhaps you could convince ESPN to see how many of its 11 million entrants managed to pick a regional tournament correctly. But that is too much work for a lazy person like myself on a beautiful day during Spring Break.