
Statistics

Statistics (or statistical analysis) is a lot like good detective work. The data yields clues and patterns that can ultimately lead to meaningful conclusions.

Why statistics

Statistics lets us answer questions that would be uneconomical to answer directly. You cannot count every homeless person in a country; that is expensive and time-consuming. Instead, you can take a small, high-quality sample (quality is key!) and use it to infer or project the probable number.
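As a rough sketch of projecting from a sample, the snippet below scales a sample proportion up to a whole population. The function name `project_total` and all the numbers are hypothetical and chosen only for illustration; a real survey would also need a properly randomized sample and a margin of error.

```python
def project_total(sample_hits: int, sample_size: int, population_size: int) -> float:
    """Scale the proportion seen in a sample up to the whole population."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    proportion = sample_hits / sample_size
    return proportion * population_size


# Hypothetical numbers: 12 of 4,000 sampled people are homeless,
# and the country has 10,000,000 inhabitants.
estimate = project_total(12, 4_000, 10_000_000)
print(round(estimate))  # 30000
```

The projection is only as good as the sample: if the 4,000 people are not representative, the estimate inherits that bias.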

Even in the best of circumstances, statistical analysis rarely unveils “the truth”. We are usually building a circumstantial case based on imperfect data. As a result, there are numerous reasons that intellectually honest people may disagree about statistical results or their implications.

Statistics can give some insight into who the best baseball player is, but that is not per se what makes a player “the best”, because there is no objective definition of what “the best baseball player” means.

Smart and honest people will often disagree about what the data are trying to tell us.

Why Learn?

Basic Experiments

In a good experiment, exactly one variable differs between your “experiment group” and your “control group”. The control group shares everything with the experiment group except for that one variable. For example, the experiment group eats one apple every day, and the control group eats one pear every day.

Statistical Significance

Statistical significance means the analysis has uncovered an association between two variables that is not likely to be the product of chance alone. Regression analysis can, for example, find a relationship between two variables and estimate how likely it is that the relationship arose by accident. If there is a relationship and it is unlikely to be accidental, it is statistically significant.

Regression Analysis

The tool to isolate the relationship between two variables, such as smoking and cancer, while holding constant (or “controlling for”) the effects of other important variables (such as diet, exercise, and weight).

Regression analysis is primarily used to assess the strength and nature of relationships between variables, but it does not directly establish causation. This is also its limitation: we can identify strong relationships, but we cannot determine causality, or in other words, “WHY?”.

Usually you start with a hypothesis, “I think these 2 variables have a relationship”, and then use regression analysis to test that hypothesis.
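To make “controlling for” concrete, here is a minimal sketch of least-squares regression with one and with two explanatory variables, written with only the standard library. The helper names (`centered_sums`, `simple_slope`, `two_predictor_slopes`) and the data are made up for illustration: `outcome` is constructed as exactly `1 + 3 * exposure + 2 * confounder`, so the correct slopes are known in advance.

```python
from statistics import fmean


def centered_sums(a, b):
    """Sum of products of deviations from the means (n times the covariance)."""
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))


def simple_slope(x, y):
    """Slope of y on x alone, ignoring every other variable."""
    return centered_sums(x, y) / centered_sums(x, x)


def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 together (2x2 normal equations)."""
    s11, s22 = centered_sums(x1, x1), centered_sums(x2, x2)
    s12 = centered_sums(x1, x2)
    s1y, s2y = centered_sums(x1, y), centered_sums(x2, y)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("x1 and x2 are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Made-up data: the confounder tends to rise together with the exposure.
exposure = [0, 1, 2, 3, 4, 5]
confounder = [1, 0, 2, 1, 3, 2]
outcome = [1 + 3 * e + 2 * c for e, c in zip(exposure, confounder)]

naive = simple_slope(exposure, outcome)
b1, b2 = two_predictor_slopes(exposure, confounder, outcome)
print(f"exposure only:              {naive:.2f}")
print(f"controlling for confounder: {b1:.2f}")
```

On this constructed data, the exposure-only slope comes out larger than 3 because it absorbs part of the confounder's effect, while the two-variable fit recovers the slope of 3 that was built in. Neither number says anything about why the variables move together.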

Example

The Conclusion

Eating a bran muffin every day will reduce your chances of getting colon cancer.

The Methodology

First, the researchers gather detailed information on thousands of people, including how frequently they eat bran muffins, and then apply regression analysis:

  1. Quantify the association observed between eating bran muffins and contracting colon cancer.
    • E.g. a hypothetical finding that people who eat bran muffins have a 9% lower incidence of colon cancer, controlling for the other factors that may affect the incidence of the disease.
  2. Quantify the likelihood that the observed association between bran muffins and a lower rate of colon cancer is merely a coincidence (a quirk in the data for this sample of people), rather than a meaningful insight about the relationship between diet and health.

Descriptive Statistics (aka Summary Statistics)

A simplification of a complex data set or array of data. We perform calculations that reduce a complex array of data into a handful of numbers that describe those data. These descriptive statistics give us a manageable and meaningful summary of the underlying phenomenon. But simplification invites abuse.

Examples of descriptive statistics:

Statistical Measures

The Mean

What most people understand as “average”. More specifically, it is the arithmetic mean that most people have in mind: the sum of all values divided by the number of values.

The Median

What is the middle value? 50% of values are below and 50% of values are above the median.

The Mode

Which value occurs most often? For example, if more people earn 5000 EUR than any other single amount, then 5000 EUR is the mode.
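A small sketch of the three measures using Python's standard `statistics` module. The salary list is invented for illustration (monthly salaries in EUR).

```python
from statistics import mean, median, mode

# Hypothetical monthly salaries in EUR, including one very high earner.
salaries = [2000, 3000, 5000, 5000, 5000, 6000, 30000]

mean_salary = mean(salaries)      # sum of values / number of values
median_salary = median(salaries)  # middle value of the sorted list
mode_salary = mode(salaries)      # most frequent value

print(mean_salary, median_salary, mode_salary)  # 8000 5000 5000
```

The single salary of 30000 pulls the mean up to 8000, while the median and the mode both stay at 5000.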

Percentiles

Tells you what percentage of a dataset falls below a certain point.

For example:

If you’re in the 90th percentile in a test, that means you scored higher than 90% of the people who took the test. Similarly, if a value is at the 25th percentile, it means that 25% of the data points are below that value.

The 50th percentile is also known as the median, meaning half the data is below it and half is above.
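As an illustration, the sketch below computes the percentage of values that fall strictly below a given score. The helper name `percentile_rank` and the test scores are hypothetical, and other percentile conventions exist (for example, counting ties differently).

```python
from statistics import median


def percentile_rank(values, score):
    """Percentage of values that are strictly below the given score."""
    if not values:
        raise ValueError("values must not be empty")
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)


# Hypothetical test scores of ten people.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

top_rank = percentile_rank(scores, 100)
median_rank = percentile_rank(scores, median(scores))
print(top_rank)     # 90.0, the score 100 beats 90% of the scores
print(median_rank)  # 50.0, half of the scores lie below the median
```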

Standard Deviation

How spread out the values in a dataset are from the mean (average). It measures the dispersion of the values.
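A short sketch comparing two made-up datasets that share the same mean but differ in spread. It uses `pstdev` from the standard `statistics` module, which treats the data as the whole population; `stdev` would be the choice if the data were a sample.

```python
from statistics import fmean, pstdev

# Two hypothetical datasets with the same mean (5) but different spread.
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]
tight = [4, 5, 5, 5, 5, 5, 5, 6]

sd_spread = pstdev(spread_out)
sd_tight = pstdev(tight)

print(fmean(spread_out), sd_spread)  # 5.0 2.0
print(fmean(tight), sd_tight)        # 5.0 0.5
```

The means are identical, so the standard deviation is what reveals that the first dataset is far more dispersed.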

P-Value

What’s a p-value?

A p-value helps you decide if something you’re testing in an experiment is just a coincidence or if it’s likely a real effect. It’s basically a number that tells you how surprising your results are if the thing you’re testing isn’t actually doing anything.

Imagine this:

Let’s say you’re flipping a coin, and you think the coin might be biased (like maybe it lands on heads more than tails). Normally, a fair coin should land on heads 50% of the time and tails 50% of the time, right?

Now, you flip the coin 10 times and get 8 heads. This seems like a lot of heads, but before saying the coin is unfair, you want to know: “Could this just happen by chance?”

Here’s where the p-value comes in:
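For the coin example, the p-value can be worked out exactly with the binomial formula. The sketch below is one way to do it, assuming a one-sided question ("how likely are 8 or more heads in 10 flips if the coin is fair?"); the helper name `prob_at_least` is made up for this illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more heads in n flips when P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# One-sided p-value: chance of 8, 9 or 10 heads from a fair coin.
p_value = prob_at_least(8, 10)
print(f"p-value = {p_value:.4f}")  # 56 / 1024, about 0.0547
```

Under these assumptions a fair coin produces 8 or more heads in roughly 5.5% of 10-flip runs. That is just above a cutoff of 0.05, so this result alone would not quite count as statistically significant evidence that the coin is biased.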

How do you use it?

Scientists pick a cutoff number (usually 0.05) to decide:

In short:


How to Lie with Statistics

The Sample With the Built-In Bias

When there is a poll or statistic, one must scrutinize the sample that was used to calculate it. Getting an unbiased sample is extremely hard, if not nearly impossible.

Example: The average Yale graduate of the class of 1924 is now earning 25000 dollars a year. Did they send a questionnaire to all graduates (and did they have everyone's address)? Maybe those whose addresses are unknown are the super successful, or maybe they have passed away, and so on. Those with known addresses are therefore already a biased set that might miss significantly higher or lower earners. How many respond? Maybe 10%? What differentiates the ones who feel like responding from the others? And when they do respond, are they being honest? People lie a lot on questionnaires, in interviews, and in polls, consciously or subconsciously.

Just going out on the street and interviewing people is biased by the place, the time, and perhaps by whom the interviewer feels drawn to approach.

A purely random sample is hard, almost impossible, to come by. Always ask yourself how a sample might be biased.

The Well Chosen Average

There are different types of averages that one can “conveniently” choose between without technically lying.

When the distribution is a perfect bell shape, these 3 types of averages (mean, median, and mode) have exactly the same value. If the distribution has another shape, they can differ widely and be misleading. How exactly the three differ from each other tells you something about the distribution.
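To illustrate how the choice of average can change the story, the sketch below computes the mean, median, and mode for a symmetric dataset and for a skewed one. Both datasets and the helper name `three_averages` are invented for this example; the incomes are in thousands per year.

```python
from statistics import mean, median, mode


def three_averages(values):
    """Return (mean, median, mode) of the values."""
    return mean(values), median(values), mode(values)


# Symmetric, roughly bell-shaped data: all three averages coincide.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Skewed data: hypothetical incomes with one very high earner.
skewed_incomes = [20, 20, 20, 30, 40, 50, 310]

symmetric_result = three_averages(symmetric)
skewed_result = three_averages(skewed_incomes)
print(symmetric_result)  # (3, 3, 3)
print(skewed_result)     # (70, 30, 20)
```

For the skewed incomes, someone could truthfully report an “average” of 70, 30, or 20 depending on which measure suits their argument.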

Distributions

Distributions describe the pattern of data points. Distributions can be of any shape, but there are a few key types.

Extra Average Types

Critically think about statistics

The following are some tips on how to critically analyze any statistic.