A fantastic and informative read for the nascent era of big data. Some excerpts are captured here for a better understanding of statistics in prediction.

The reasons to learn statistics were best summarized as follows:

· Summarize huge quantities of data

· Make better decisions

· Answer important social questions

· Recognize patterns that can refine how we do everything from selling diapers to catching criminals

· Catch cheaters and prosecute criminals

· Evaluate effectiveness of policies, programs, drugs, medical procedures and other innovations

__Descriptive Statistics__

*Mode* = the most frequently occurring value. *Median* = the central value once all numbers are arranged in ascending order (the 50th percentile). *Mean* = the average.
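As a minimal sketch of these three measures, the snippet below uses Python's standard `statistics` module on a made-up list of incomes. The `incomes` values are hypothetical and chosen only to show how a single outlier pulls the mean away from the median.

```python
from statistics import mean, median, mode

# HYPOTHETICAL annual incomes in thousands of dollars, with one large outlier.
incomes = [35, 40, 40, 45, 50, 55, 1000]

print("mode:  ", mode(incomes))    # most frequently occurring value (40)
print("median:", median(incomes))  # middle value after sorting (45)
print("mean:  ", mean(incomes))    # sum divided by count (about 180.7)
```

The outlier barely moves the median but drags the mean far above what most people in the list earn.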

A better way is to use decile values: if you’re in the top *decile* of earners in the USA, you’re earning more than 90% of the population. *Percentile* scores are more informative than absolute scores. If 43 correct answers falls into the 83rd percentile, then this student is doing better than most of his peers statewide. If he’s in the 8th percentile, then he’s really struggling.

Measures of dispersion matter. If the mean score on the SAT math test is 500 with a standard deviation of 100, the bulk of students taking the test will score within one standard deviation of the mean, that is, between 400 and 600. How many students do you think will score 720 or more? Probably not very many. The most important and common distribution in statistics is the normal distribution.
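The SAT example can be checked with a short sketch, assuming the scores follow an exact normal distribution with mean 500 and standard deviation 100 (real test scores are only approximately normal). It uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# ASSUMPTION: scores are exactly normal with mean 500 and standard deviation 100.
sat_math = NormalDist(mu=500, sigma=100)

# Share of test takers within one standard deviation of the mean (400 to 600).
within_one_sd = sat_math.cdf(600) - sat_math.cdf(400)

# Share scoring 720 or more, i.e. 2.2 standard deviations above the mean.
above_720 = 1 - sat_math.cdf(720)

print(f"between 400 and 600: {within_one_sd:.1%}")
print(f"720 or more:         {above_720:.1%}")
```

Under this assumption roughly 68% of students land between 400 and 600, and only about 1.4% score 720 or more.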

__Deceptive Description__

Statistical malfeasance has very little to do with bad math. If anything, impressive calculations can obscure nefarious motives. The fact that you’ve calculated the mean correctly will not alter the fact that the median is a more accurate indicator. Judgment and integrity turn out to be surprisingly important. A detailed knowledge of statistics does not deter wrongdoing any more than a detailed knowledge of the law averts criminal behavior. With both statistics and crime, the bad guys often know exactly what they’re doing.

__Correlation__

It measures the degree to which 2 phenomena are related to one another. There’s a correlation between summer temperatures and ice-cream sales. When one goes up, so does the other. Two variables are positively correlated if a change in one is associated with a change in the other in the same direction, such as a relationship between height and weight.

A pattern consisting of dots scattered across the page is a somewhat unwieldy tool. If Netflix tried to make film recommendations by plotting ratings for thousands of films by millions of customers, the results would bury the HQ in scatter plots. Instead, the power of correlation as a statistical tool is that we can encapsulate an association between two variables in a single descriptive statistic: the correlation coefficient. Its value ranges from -1 to 1. A value of 1 or -1 means a perfect positive or negative association, values closer to either end mean a stronger association, and 0 means no linear relationship at all. There is no unit attached to it.
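To make the coefficient concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The function name `pearson_r` and the height and weight figures are illustrative assumptions, not data from the book.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# HYPOTHETICAL heights (inches) and weights (pounds) for six adults.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 130, 140, 160, 165, 190]

# The same heights expressed in centimetres.
heights_cm = [h * 2.54 for h in heights_in]

print(f"r using inches:      {pearson_r(heights_in, weights_lb):.3f}")
print(f"r using centimetres: {pearson_r(heights_cm, weights_lb):.3f}")
```

Both calls print the same value, which illustrates why the coefficient carries no unit: rescaling a variable does not change it.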

__Basic Probability__

The Law of Large Numbers (LLN) explains why casinos always make money in the long run: the probabilities associated with all casino games favor the house. A probability tree can help navigate some problems and decisions, such as an investment decision or widespread screening for a rare disease. The Chicago police department has created an entire predictive analysis unit, in part because gang activity, the source of much of the city’s violence, follows certain patterns. In 2011, the New York Times ran the following headline: “Sending the Police before There’s a Crime”.
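The Law of Large Numbers can be sketched with a simulation. The example assumes a standard American roulette wheel (18 red pockets out of 38) and a player who bets $1 on red every spin; the function name `average_player_profit` and the fixed seed are illustrative choices.

```python
import random

# ASSUMPTION: standard American wheel with 18 red, 18 black and 2 green pockets.
RED_POCKETS, TOTAL_POCKETS = 18, 38

def average_player_profit(num_bets, seed=0):
    """Average profit per $1 bet on red over num_bets simulated spins."""
    rng = random.Random(seed)
    profit = 0
    for _ in range(num_bets):
        won = rng.randrange(TOTAL_POCKETS) < RED_POCKETS
        profit += 1 if won else -1
    return profit / num_bets

# Theoretical expectation per bet: win $1 with 18/38, lose $1 with 20/38.
expected = (RED_POCKETS - (TOTAL_POCKETS - RED_POCKETS)) / TOTAL_POCKETS

for n in (100, 10_000, 200_000):
    print(f"{n:>7} bets: average profit {average_player_profit(n):+.4f} "
          f"(expected {expected:+.4f})")
```

With few bets the player's average can land on either side of zero, but as the number of bets grows it settles near the small negative expectation, which is the casino's long-run edge.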

__Problems with Probability__

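Assuming events are independent when they’re not: The probability of flipping two heads in a row is (½)^2, i.e. ¼, because the flips are independent. The probability of both engines of a jet failing during a transatlantic flight, however, is not (1/100,000)^2, because the two failures are not independent events.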
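A short worked example shows how much the independence assumption can matter. The 1-in-100,000 figure comes from the text above; the conditional probability `p_second_given_first` is a purely hypothetical number picked for illustration.

```python
from fractions import Fraction

# Independent events: multiply the individual probabilities.
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads

# Engine failures: the naive calculation treats them as independent.
p_engine_fails = Fraction(1, 100_000)
naive_both_fail = p_engine_fails ** 2

# HYPOTHETICAL: if one engine has failed (bad fuel, storm, bird strike),
# suppose the other fails with probability 1 in 100 rather than 1 in 100,000.
p_second_given_first = Fraction(1, 100)
dependent_both_fail = p_engine_fails * p_second_given_first

print("two heads in a row:     ", p_two_heads)
print("naive double failure:   ", naive_both_fail)
print("dependent double failure:", dependent_both_fail)
print("naive estimate is too low by a factor of",
      dependent_both_fail / naive_both_fail)
```

The general rule is P(A and B) = P(A) × P(B given A); the shortcut P(A) × P(B) only holds when A tells us nothing about B.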

Not understanding when events ARE independent: If you’re in a casino, you’ll see people looking longingly at the dice or cards and declaring that they’re “due”. If the roulette ball has landed on black five times in a row, then clearly now it must turn up red. No, no, no! The probability of the ball’s landing on a red number remains unchanged: 18/38. The belief otherwise is sometimes called “the gambler’s fallacy”. In fact, if you flip a fair coin 1,000,000 times and get 1,000,000 heads in a row, the probability of getting tails on the next flip is still 1/2. Even in sports, the notion of streaks may be illusory.
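The gambler's fallacy can be tested with a rough simulation: flip a fair coin many times, find every run of five heads, and check how often the next flip is tails. The function name `tails_rate_after_streak` and the seed are illustrative assumptions.

```python
import random

def tails_rate_after_streak(num_flips, streak_len=5, seed=0):
    """Share of tails on the flip that follows streak_len heads in a row."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(num_flips)]  # True = heads
    # Collect the flip that comes right after each run of streak_len heads.
    follow_ups = [
        flips[i]
        for i in range(streak_len, num_flips)
        if all(flips[i - streak_len:i])
    ]
    tails = sum(1 for is_heads in follow_ups if not is_heads)
    return tails / len(follow_ups), len(follow_ups)

rate, count = tails_rate_after_streak(200_000)
print(f"{count} streaks of five heads; tails came next {rate:.1%} of the time")
```

The share of tails stays close to one half, so a streak does not make the opposite outcome "due".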

Clusters happen: A great classroom exercise shows that rare events do happen. Take a class of 50 or 100 students (more is better). Everyone stands up and flips a coin; anyone who flips heads must sit down. Assuming we start with 100 students, roughly 50 will sit down after the first flip. Then we do it again, after which 25 or so are still standing. And so on. More often than not, there’ll be a student standing at the end who has flipped five or six tails in a row. At that point, I ask the student questions like “How did you do it?”, “What are the best training exercises for flipping so many tails in a row?” or “Is there a special diet?” This elicits laughter because the class just watched the whole process unfold; they know that the student who flipped six tails has no special talent. When we see an anomalous event like that out of context, we assume that something besides randomness must be responsible.
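The arithmetic behind the classroom exercise can be sketched directly, assuming fair coins and independent flips. The helper name `prob_someone_still_standing` is made up for this example.

```python
def prob_someone_still_standing(num_students, num_flips):
    """Chance that at least one student flips tails num_flips times in a row."""
    p_one_student = 0.5 ** num_flips           # one student's chance
    p_nobody = (1 - p_one_student) ** num_students
    return 1 - p_nobody

for flips in (5, 6):
    p = prob_someone_still_standing(100, flips)
    print(f"100 students, {flips} tails in a row: {p:.2f}")
```

For any single student, five tails in a row is a 1-in-32 event, yet with 100 students the chance that somebody manages it is about 0.96, and about 0.79 for six tails. A rare event for an individual is a likely event for a crowd.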

Reversion to the mean: Have you heard about the Sports Illustrated jinx, whereby individual athletes or teams featured on the cover of Sports Illustrated subsequently see their performance fall off? The more statistically sound explanation is that teams and athletes appear on the cover after some anomalously good stretch (such as a twenty-game winning streak), and their subsequent performance reverts back to what is normal, or the mean. This is the phenomenon known as reversion to the mean. Probability tells us that any outlier (an observation that is particularly far from the mean in one direction or the other) is likely to be followed by outcomes that are more consistent with the long-term average.
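Reversion to the mean can be illustrated with a toy model in which each season's performance is stable skill plus random luck. The model, the function name `top_performers_two_seasons`, and all parameters are assumptions made for this sketch, not something taken from the book.

```python
import random

def top_performers_two_seasons(num_players=10_000, top_fraction=0.1, seed=0):
    """Average season-1 and season-2 performance of the season-1 top group."""
    rng = random.Random(seed)
    players = []
    for _ in range(num_players):
        skill = rng.gauss(0, 1)              # stable ability
        season1 = skill + rng.gauss(0, 1)    # ability plus this year's luck
        season2 = skill + rng.gauss(0, 1)    # same ability, fresh luck
        players.append((season1, season2))
    # The "cover athletes": the best performers of season 1.
    players.sort(key=lambda p: p[0], reverse=True)
    top = players[: int(num_players * top_fraction)]
    avg_first = sum(p[0] for p in top) / len(top)
    avg_second = sum(p[1] for p in top) / len(top)
    return avg_first, avg_second

first, second = top_performers_two_seasons()
print(f"top group, season 1 average: {first:.2f}")
print(f"top group, season 2 average: {second:.2f}")
```

The top group is still better than average in the second season (the population mean is 0), but noticeably worse than in the season that earned them the spotlight, with no jinx involved.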

__Importance of Data:__

Selection Bias: Is the data you collected drawn from a sufficiently broad range of the population, rather than a confined group? For example, a survey of consumers in an airport is going to be biased by the fact that people who fly are likely to be wealthier than the general public.

Publication Bias: Positive findings are more likely to be published than the negative findings, which can skew the results that we see.

Recall Bias: Memory is a fascinating thing, though not always a great source of good data. We have a natural impulse to understand the present as a logical consequence of things that happened in the past: cause and effect. A study of diet among breast cancer patients was done. The striking finding was that the women with breast cancer recalled a diet that was much higher in fat than what they had actually consumed; the women with no cancer did not.

Survivorship Bias: If you have a room of people with varying heights, forcing the short people to leave will raise the average height in the room, but it doesn’t make anyone taller.

__Central Limit Theorem:__

For this to apply, sample sizes need to be relatively large (over 30 as a rule of thumb).

1. If you draw large, random samples from any population, the means of those samples will be distributed approximately normally around the population mean (regardless of what the distribution of the underlying population looks like).

2. Most sample means will lie reasonably close to the population mean; the standard error is what defines “reasonably close”.

3. The CLT tells us the probability that a sample mean will lie within a certain distance of the population mean. It is relatively unlikely that a sample mean will lie more than two standard errors from the population mean, and extremely unlikely that it will lie three or more standard errors from the population mean.

4. The less likely it is that an outcome has been observed by chance, the more confident we can be in surmising that some other factor is in play.
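The points above can be sketched with a simulation. The population here is deliberately skewed (an exponential distribution with mean 1 and standard deviation 1), and the sample size of 50, the number of samples, and the name `sample_means` are all assumptions chosen for illustration.

```python
import random
from math import sqrt

# ASSUMPTION: exponential population with rate 1, so mean = 1 and sd = 1.
POP_MEAN, POP_SD, SAMPLE_SIZE = 1.0, 1.0, 50

def sample_means(num_samples=2000, sample_size=SAMPLE_SIZE, seed=0):
    """Means of repeated random samples drawn from the skewed population."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means

means = sample_means()
standard_error = POP_SD / sqrt(SAMPLE_SIZE)

def share_within(num_se):
    """Share of sample means within num_se standard errors of POP_MEAN."""
    close = sum(1 for m in means if abs(m - POP_MEAN) <= num_se * standard_error)
    return close / len(means)

print(f"average of sample means: {sum(means) / len(means):.3f}")
print(f"within 2 standard errors: {share_within(2):.1%}")
print(f"within 3 standard errors: {share_within(3):.1%}")
```

Even though individual draws are far from bell-shaped, the sample means cluster around the population mean, with the large majority inside two standard errors and nearly all inside three.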

__Inference__

Statistics cannot prove anything with certainty. Instead, the power of statistical inference derives from observing some pattern or outcome and then using probability to determine the most likely explanation for that outcome. Suppose a strange gambler arrives in town and offers you a wager: he wins $1000 if he rolls a six with a single die; you win $500 if he rolls anything else, a pretty good bet from your standpoint. He then proceeds to roll ten sixes in a row, taking $10,000 from you. One possible explanation is that he was lucky. An alternative explanation is that he cheated somehow. The probability of rolling ten sixes in a row with a fair die is roughly 1 in 60 million. You can’t prove that he cheated, but you ought at least to inspect the die. The null hypothesis and Type I and Type II errors are to be explored as well.
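The numbers in the gambler example can be verified with exact fractions, assuming a fair six-sided die. The variable names are illustrative.

```python
from fractions import Fraction

p_six = Fraction(1, 6)

# Your expected winnings per roll if the die is fair:
# you win $500 with probability 5/6 and lose $1000 with probability 1/6.
expected_per_roll = (1 - p_six) * 500 - p_six * 1000

# Probability of ten sixes in a row with a fair die.
p_ten_sixes = p_six ** 10

print("expected winnings per roll: $", expected_per_roll)
print("ten sixes in a row: 1 in", p_ten_sixes.denominator)
```

With a fair die the bet is worth $250 per roll to you, and ten straight sixes is a 1 in 60,466,176 event, which is why cheating is the more plausible explanation than luck.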

__Regression Analysis__

It allows us to analyze how one variable affects the other. Picture a large sample of weight versus height plotted on a graph as a scatter plot.

If you say the pattern is “Weight increases with height”, it may not be very insightful. One step further is to “fit a line” that best describes a linear relationship between the two variables. Regression analysis typically uses a methodology called Ordinary Least Squares, or OLS, to do this; it is best visually explained here, and further advanced techniques and concepts are here. Once we have an equation, how do we know whether the results are statistically significant or not?
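As a minimal sketch of what "fitting a line" means, the function below computes the OLS slope and intercept for one explanatory variable using the textbook closed-form formulas. The name `fit_line` and the six height and weight pairs are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary Least Squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# HYPOTHETICAL heights (inches) and weights (pounds) for six adults.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 130, 140, 160, 165, 190]

slope, intercept = fit_line(heights_in, weights_lb)
print(f"weight = {intercept:.1f} + {slope:.2f} * height")
```

The slope is the quantity of interest: in this made-up sample, each additional inch of height is associated with roughly 5.7 more pounds.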

Standard Error is a measure of the likely error in a coefficient computed for the regression equation. If we take 30 different samples of 20 people each and estimate the regression equation for each sample, then each coefficient will reflect its own group, and from the central limit theorem we can infer that these coefficients should cluster around the true association coefficient. With this assumption we can calculate the Standard Error for the regression coefficient.

One rule of thumb: a coefficient is likely to be statistically significant when it is at least twice the size of its standard error. T-statistic = observed regression coefficient / standard error. P-value = the chance of getting an outcome at least as extreme as the one observed if there is no true association between the variables. R^2 = the share of total variation explained by the regression equation, i.e. how much of the variation in weight around its mean is explained by height differences alone. When the sample size (degrees of freedom) becomes large, the t-distribution becomes similar to the normal distribution.
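These quantities can be computed by hand for a one-variable regression. The sketch below uses the standard formulas for the slope's standard error and for R^2; the function name `regression_summary` and the data are hypothetical, and the p-value is left out because the Python standard library has no t-distribution.

```python
from math import sqrt

def regression_summary(xs, ys):
    """Slope, its standard error, t-statistic and R^2 for a simple OLS fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Unexplained variation (residuals) versus total variation around the mean.
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    std_error = sqrt(ss_resid / (n - 2) / sxx)
    return {
        "slope": slope,
        "std_error": std_error,
        "t_stat": slope / std_error,
        "r_squared": 1 - ss_resid / ss_total,
    }

# HYPOTHETICAL heights (inches) and weights (pounds) for six adults.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 130, 140, 160, 165, 190]

summary = regression_summary(heights_in, weights_lb)
for name, value in summary.items():
    print(f"{name:>10}: {value:.3f}")
print("passes the 'twice the standard error' rule:", summary["t_stat"] >= 2)
```

Here the coefficient is many times its standard error, so it clears the rule of thumb easily, though with only six made-up observations the result is an illustration rather than evidence.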

__Top Seven Regression Mistakes__

1. Using regression to analyze a nonlinear relationship

2. Correlation does not equal causation

3. Reverse causality: in a statistical association between A and B where we assume that A affects B, it’s entirely plausible that B affects A.

4. Omitted variable bias: This is about omitting an important variable in the regression equation

5. Highly correlated explanatory variables (multicollinearity): Suppose we want to find the effect of illegal drug use on SAT scores and we record both heroin use and cocaine use. If those who use cocaine also tend to use heroin, the two variables are highly correlated, and including them individually may yield worse results than a combined variable: few people use one drug but not the other, so there are too few data points to separate the two effects correctly.

6. Extrapolating beyond the data: you cannot use the weight/height data of adults to predict the weight of a newborn.

7. Data-mining with too many variables
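The extrapolation mistake in the list above is easy to demonstrate. The sketch fits a line to hypothetical adult heights and weights and then applies it to a newborn, assumed here to be about 20 inches long; every number is made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary Least Squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# HYPOTHETICAL adult sample: heights in inches, weights in pounds.
adult_heights_in = [60, 62, 65, 68, 70, 72]
adult_weights_lb = [115, 130, 140, 160, 165, 190]
slope, intercept = fit_line(adult_heights_in, adult_weights_lb)

# Inside the range of the data the prediction is sensible.
print(f"predicted weight at 66 in: {intercept + slope * 66:.0f} lb")

# ASSUMPTION: a newborn is about 20 inches long, far outside the data range.
newborn_prediction = intercept + slope * 20
print(f"predicted weight at 20 in: {newborn_prediction:.0f} lb")
```

The line that describes adults well predicts a negative weight for a baby, because nothing in the sample says how height and weight relate outside the range that was observed.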

__There are two lessons in designing a proper regression model__

1. Figuring out what variables should be examined and where the data should come from is more important than the underlying statistical calculations. This process is referred to as estimating the equation, or specifying a good regression equation. The best researchers are the ones who can think logically about what variables ought to be included in a regression equation, what might be missing, and how the eventual results can and should be interpreted.

2. Regression analysis builds only a circumstantial case. An association between two variables is like a fingerprint at the scene of the crime. It points us in the right direction, but it’s rarely enough to convict (and sometimes a fingerprint at the scene of a crime may not belong to the perpetrator). Any regression analysis needs a theoretical underpinning. What are the explanatory variables in the equation? What phenomena from other disciplines can explain the observed results? For instance, why do we think that wearing purple shoes would boost performance on the math portion of the SAT or that eating popcorn can help prevent prostate cancer?