Probability and Distribution
Analyzing data
- The scientific method
- Further observing and experimenting
- Refining and retesting explanations
- Probabilities
- Here's an analogy:
- Rules - deduction
- Probability - induction
- Probability will be important for inferential statistics.
- Probability
- What is Probability?
- Reasoning with probabilities
- The law of large numbers
Consider these examples
- What do these statements mean?
- There is a 90% chance of rain.
- I'm 75% sure I'm right about this.
- There is a 50% chance that my parents are coming to visit this weekend.
- 10% of children born with disease develop lung cancer.
- The probability that this coin will come up heads is 50%.
- Types of Probabilities
- Some probabilities are proportions of elements drawn from a set
- Probability of side effect given disease
- Long-run probability of having this side effect
- Percentage of actual cases having this side effect
- Probability of rain
- Any place in the area covered by the forecast has a 90% chance of measurable rain.
- Forecast is correct if rain does fall on at least one place in the area.
- Single event probabilities
- Some events occur only once
- Cannot be viewed as coming from a set
- Must construct some kind of set
- There is a 35% chance I failed this test.
- Under circumstances like these, I have failed 35% of the tests I have taken in the past.
Communicative Probabilities
- Often we use the language of probability
- 50% certain means "I don't know"
- 0% certain means "No way"
- 100% certain means "Definitely"
- We don't use fine-grained probabilities
- I am 32.4% certain that this event will occur
How is probability used?
- Probability plays an important role in statistics
- How likely is it that a given event was due to chance?
- That is a question we will try to answer
- Probability distributions will be important
- Discrete probability distributions
- Events can only have certain values (e.g., Heads/Tails)
- Continuous probability distributions
- Events can take on any possible values (e.g., means)
- Random Variable
- A variable whose value is a numerical outcome of a random phenomenon
- The variable X, where X is the number of heads resulting from four coin flips
What does this mean?
- What is the relative likelihood of getting a head on a coin flip?
- What if we tried 4 coin flips 96 times?
- How many times should you get 4 heads?
- E(N) = Probability * Trials
- E(N) = 0.0625 * 96 = 6
This is a binomial distribution...
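The arithmetic above can be sketched in a few lines of Python. This is a minimal sketch that assumes a fair coin (p = 0.5); the function name binomial_pmf is illustrative and not part of the lecture.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_flips, p_heads, n_sets = 4, 0.5, 96  # values from the example above

for heads in range(n_flips + 1):
    prob = binomial_pmf(heads, n_flips, p_heads)
    # Expected count: E(N) = Probability * Trials
    print(f"{heads} heads: Pr = {prob:.4f}, expected in {n_sets} sets = {prob * n_sets:g}")
```

The row for 4 heads reproduces E(N) = 0.0625 * 96 = 6; the other rows fill in the rest of the distribution.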
Distributions
- Discrete vs. Continuous
- Distribution contains the space of all outcomes
- Probability density function
- Cumulative distribution function
Demo with coin flips
Demo with dice
How do probabilities combine?
- Disjoint events
- A and B cannot both occur at the same time
- Probability of A or B
- Pr(A or B) = Pr(A) + Pr(B)
- Probability I will go to club A = .5
- Probability I will go to club B = .27
- Pr(A or B) = .5 + .27 = .77
- What is the probability I will go to another club?
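The addition rule for the club example can be checked directly. This sketch assumes I go to exactly one club, so that "another club" is simply the complement of "A or B"; that assumption is not stated in the notes.

```python
p_a = 0.5   # Pr(go to club A), from the example
p_b = 0.27  # Pr(go to club B), from the example

# Addition rule for disjoint events: Pr(A or B) = Pr(A) + Pr(B)
p_a_or_b = p_a + p_b

# Assumption: exactly one club is visited, so "another club" is the complement
p_other = 1 - p_a_or_b

print(f"Pr(A or B) = {p_a_or_b:.2f}")
print(f"Pr(another club) = {p_other:.2f}")
```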
Independent events
- Two events are independent if the outcome of one event does not change the probability of the other.
- Pr(A | B) = Pr(A)
- iid: independent and identically distributed
- Two successive coin flips are independent
- Two spins of a roulette wheel are independent
- The election of the President and the members of Congress are not independent
- They share common causal factors
Probability of two independent events
- How likely is it that two independent events will occur?
- Probability of getting two heads on two successive flips
- Pr(1H) = .5
- Pr(2H) = .5 * .5 = .25
- The probability of a conjunction is always less than or equal to the probability of either conjunct.
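One way to see the multiplication rule is to list every outcome of two flips and count. The sketch below assumes a fair coin, so all four outcomes are equally likely; the helper name pr is illustrative.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two fair coin flips: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

def pr(event):
    """Probability of an event as the fraction of outcomes that satisfy it."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_first_h = pr(lambda o: o[0] == "H")
p_second_h = pr(lambda o: o[1] == "H")
p_both_h = pr(lambda o: o == ("H", "H"))

print(p_first_h, p_second_h, p_both_h)
# Independence: the joint probability equals the product of the two
print(p_both_h == p_first_h * p_second_h)
# A conjunction is never more probable than either conjunct
print(p_both_h <= min(p_first_h, p_second_h))
```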
The conjunction fallacy
- Linda is 31 years old, single, and bright. She majored in philosophy. As a student, she was deeply concerned with issues of social justice and participated in demonstrations.
- How likely is it:
- Linda is a banker?
- Linda is a banker and active in the feminist movement?
The mean of a random variable
- The mean of a random variable is the average of its possible values, each weighted by its probability.
- For discrete random variables, sum of (Values*Probabilities)
- (.0625*0) + (.25*1) + (.375*2) + (.25*3) + (.0625*4)=2
- For continuous random variables
- Mean is the point where the density curve would balance
- Looking at Multiple events
- Flipping a coin 4 times
- If I did this 3 times, what would happen?
- Three heads the first time
- 0 the second time
- 1 the third time
- Even though 2 heads is the most likely event, I might not witness it in 3 tries.
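The mean of the number of heads in four flips, computed by hand above, can be reproduced as a probability-weighted sum. The probabilities are the ones listed in the notes; the variable names are illustrative.

```python
# Possible values of X (number of heads in four flips) and their probabilities
values = [0, 1, 2, 3, 4]
probabilities = [0.0625, 0.25, 0.375, 0.25, 0.0625]

# A probability distribution must sum to 1
total_probability = sum(probabilities)

# Mean of a discrete random variable: sum of (Values * Probabilities)
mean_x = sum(v * p for v, p in zip(values, probabilities))

print("total probability =", total_probability)
print("mean of X =", mean_x)
```

The mean of 2 is a long-run average. It does not promise that any small number of tries will show exactly 2 heads.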
The law of large numbers
- If I made 1000 sets of 4 coin flips
- The distribution ought to start to look more like the one we saw before.
- The distribution of a random variable will look right in the long run.
- That's why we want lots of subjects!
- There is no law of small numbers
- If I see 8 heads in a row, that does not increase the probability that I will see a tail on the next flip
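A small simulation illustrates the law of large numbers for sets of four flips. This is a sketch under stated assumptions: a fair coin, Python's pseudo-random generator, and an arbitrary fixed seed so that runs are repeatable. The function name simulate_head_counts is illustrative.

```python
import random
from collections import Counter

def simulate_head_counts(n_sets, n_flips=4, seed=0):
    """Proportion of sets showing 0..n_flips heads across n_sets simulated sets."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    counts = Counter(
        sum(rng.random() < 0.5 for _ in range(n_flips))  # heads in one set
        for _ in range(n_sets)
    )
    return {k: counts[k] / n_sets for k in range(n_flips + 1)}

# Theoretical distribution for four fair flips
theoretical = {0: 0.0625, 1: 0.25, 2: 0.375, 3: 0.25, 4: 0.0625}

for n_sets in (3, 1000, 100_000):
    observed = simulate_head_counts(n_sets)
    worst_gap = max(abs(observed[k] - theoretical[k]) for k in theoretical)
    print(f"{n_sets:>7} sets: largest gap from theory = {worst_gap:.4f}")
```

With only 3 sets the observed proportions can be far from the theoretical ones; with many sets the gap shrinks.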
Probabilities will be important
- You use probability to determine how likely it is that an observation was due to chance.
- You will rely on various probability distributions.
- The Normal distribution (z)
- Student's t distribution
- The F distribution
Here's a tricky test - The Monty Hall problem demo
An Explanation
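For readers without access to the demo, the puzzle can be worked out by exact enumeration. This sketch assumes the standard rules, which the notes do not spell out: three doors, one car, the host always opens an unchosen door that hides no car, and then offers a switch. The function name win_probability is illustrative.

```python
from fractions import Fraction
from itertools import product

DOORS = (0, 1, 2)

def win_probability(switch):
    """Exact chance of winning the car, enumerating car position and first pick."""
    cases = list(product(DOORS, DOORS))  # (car door, first pick), equally likely
    wins = 0
    for car, pick in cases:
        # Host opens a door that is neither the pick nor the car.
        # When pick == car the host has two choices, but the result is the same.
        opened = next(d for d in DOORS if d != pick and d != car)
        if switch:
            pick = next(d for d in DOORS if d != pick and d != opened)
        wins += (pick == car)
    return Fraction(wins, len(cases))

print("stay:  ", win_probability(switch=False))
print("switch:", win_probability(switch=True))
```

Under these rules, staying wins only when the first pick was already the car (1 case in 3), so switching wins the remaining 2 cases in 3.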
Questions
- What is the probability of getting three heads in a row with a fair coin?
- Answer: .5*.5*.5=.125 or (1/2)*(1/2)*(1/2)=.125
- Imagine you have a trick coin that lands heads up 75% of the time and that you flip the coin 4 times.
- Write out every possible outcome.
- Hint 1: There are 16 possible outcomes.
- Hint 2: "Head, Tail, Head, Head" or simply "HTHH" is one possible outcome.
- Calculate the probability of every outcome in the previous question.
- Hint: They are not all equal.
- What do the probabilities sum to in the previous question?
- Plot the probability distribution of getting 0, 1, 2, 3, or 4 heads.
- Hint: Use your answers from the previous questions to figure out the exact numbers.
- If you flipped the coin 4 times (as discussed above) 250 times (for a total of 1000 flips), how many times would you expect to get 2 heads? How many times would you expect to get 2 or more heads?
- If the probability of student Z getting an A on a test is .3, a B is .5, a C is .1, and an F is .1, what is the probability of student Z getting an A or a C?
- There is a 20% chance the number 28 bus in Olympia, WA will run late. There is a 30% chance the number 5 bus in Austin will run late. These events are independent. What is the probability that both buses will run late?
Distributions and Variability
- What to do if you actually collect data?
- Variables in statistics
- What is a distribution?
- Visualizing distributions
What do we do with data?
- Imagine you collect a lot of data on how long it takes to press a button after a flash of light.
- Now what?
- What have we got here?
What are the entries in this table?
- Variables
- Something that can be expressed as a number.
- Value
- Numerical value taken on by that variable.
- What variable are we dealing with here?
- What are the units of measurement?
Types of Variables
- Quantitative variables
- A quantitative variable is one for which mathematical operations make sense
- Response time is a good quantitative variable
- Response on survey - rating scales
- Quantitative variables often correspond to dependent measures in experiments.
- Categorical variables
- Define groups or classes in the data
- Gender or Year in college are good categorical variables
- Demographic information
- Categorical variables often correspond to things manipulated in experiments.
Variability
- The difficult thing about analyzing data is that not all of the data are the same.
- What do we do if all of the data points are not identical?
- How do we understand the variability in the data?
Sorting the data can help
- The data may look different when sorted.
- Still, there seem to be a lot of numbers here.
Graphics in data analysis
- It is often helpful to graph the data in some way.
- Humans are visual creatures.
- Patterns become evident in graphs.
- One simple type of graph is the stemplot.
16 | 12344688899
17 | 1122233333333344455677788899
18 | 01234455
19 |
20 | 1
Stemplot
- The spread of the numbers is a distribution.
- Think in distributions always.
- It makes you smarter and wiser.
- Think about what is possible.
- Think about how likely the possibilities are.
- That is what a distribution is - the overall pattern.
Qualities of distributions to keep in mind.
- What is the center?
- What is the shape?
- What is the spread?
- Are there outliers?
Aspects of the stemplot
- The stemplot is good for looking at a small set of numbers.
- We can see whether the distribution is symmetric or skewed.
- Unimodal?
- We can find any potential outliers.
16 | 12344688899
17 | 1122233333333344455677788899
18 | 01234455
19 |
20 | 1
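A stemplot like the one above can be built with a short function. This sketch assumes the stems are tens and the leaves are units (so "16 | 1" is the value 161); the function name stemplot is illustrative.

```python
from collections import defaultdict

# The 48 observations, read back from the stemplot above
STEMS = {16: "12344688899", 17: "1122233333333344455677788899",
         18: "01234455", 20: "1"}
data = [stem * 10 + int(leaf) for stem, leaves in STEMS.items() for leaf in leaves]

def stemplot(values):
    """Return stemplot lines: stem = tens and above, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(str(v % 10))
    # Include empty stems so gaps (and outliers beyond them) stay visible
    return [f"{stem} | {''.join(leaves.get(stem, []))}".rstrip()
            for stem in range(min(leaves), max(leaves) + 1)]

print("\n".join(stemplot(data)))
```

Keeping the empty stem for 19 is what makes the value 201 stand out as a potential outlier.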
Splitting the Stems
- Don't get hung-up on the rules...
16 | 12344
16 | 688899
17 | 11222333333333444
17 | 55677788899
18 | 012344
18 | 55
19 |
19 |
20 | 1
What if you have too many data points?
- Stemplots quickly start to look crowded.
- If there are too many data points, use a histogram instead.
- A histogram is a bar-graph with frequency along the y-axis
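The counting behind a histogram can be sketched without a plotting library. The example reuses the 48 stemplot values (assuming stems are tens and leaves are units) and prints one text bar per bin; the function name histogram_counts and the bin width of 5 are illustrative choices.

```python
from collections import Counter

# The 48 observations from the stemplot
STEMS = {16: "12344688899", 17: "1122233333333344455677788899",
         18: "01234455", 20: "1"}
data = [stem * 10 + int(leaf) for stem, leaves in STEMS.items() for leaf in leaves]

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    first, last = min(counts), max(counts)
    # Keep empty bins so gaps in the data remain visible
    return {start: counts.get(start, 0)
            for start in range(first, last + bin_width, bin_width)}

width = 5
for start, count in histogram_counts(data, width).items():
    print(f"{start}-{start + width - 1}: {'#' * count}")
```

Changing width shows how the apparent shape of the distribution depends on the bin size.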
A Histogram
Here's an interactive demo that shows how bin size affects histograms.
Here's a related example.
So, what does this mean?
- Interpreting data is not a mechanical process.
- Interpretation involves thinking both about the data and about how they were obtained.
- How were the data collected?
- What are the units of measurement?
- How much information is contained in those units?
- Are there any distinctive patterns in the data?
- Is the distribution symmetric or skewed?
- Are there any outliers?
Are the data well-behaved?
- It is important to look at the shape of the data.
- Many statistics that we will see will assume that the data are symmetric with a peak in the middle.
- Why is this important?
- Statistics like the mean (average) provide information about the central tendency of the distribution.
- If the distribution is not well-behaved, statistics like the mean will provide little information.
Other Ways of Looking at Data
Other Plots
- You can look at data in any way.
- Time plots
- Plot observations by the time they were taken
- Can reveal patterns related to timing of events
- Common for showing practice effects.
Time Plot (Practice Effect)
Summary
- Data (and life in general) come in distributions
- An important first step in analyzing data is to plot and look at the data
- Patterns may become evident
- Outliers may become apparent
- There are no rules for plotting data
- Find ways to look at data that are revealing
- Good analysis takes practice
- Don't be frustrated if it takes time to develop the skill.
Central Tendency and Standard Deviation - More on Distributions
- Central Tendency
- Variability
Sample Size
- How many people should you ask?
- Suppose you wanted to know who was going to win a presidential election.
- How confident would you be if you asked one person how they would vote?
- Two people? Ten people?
- One hundred? One thousand? Ten thousand? All voters?
- Clearly, up to a point, more is better.
- At some point, there are diminishing marginal returns.
- Many (very accurate) national polls are based on only 1000-2000 respondents.
Sample size and Distributions
- How large should a sample be?
- To answer this question more specifically, we must think about distributions.
- Samples must be moderately large, because sampling gives rise to distributions.
- We'll talk more about precisely how large over the course of the semester.
- Recently, we thought about how to visualize a distribution with a graph.
- How can we summarize a distribution?
Central Tendency and Variability
- How could we describe a well-behaved distribution?
Central Tendency
- When the distribution has one peak, a measure of central tendency makes sense.
- Not a good measure for multi-peaked distributions
- Measures of central tendency
- Mean (arithmetic average)
- Median
- Mode
The mean
- The mean is familiar as the arithmetic average of a set of numbers
- Mean = ( SUM xi ) / N
- Mean = (x1 + x2 + x3 + ... + xN) / N
- In this example
- Mean = 8373 / 48 = 174.4375
A problem with the mean
- The mean is not resistant to outliers
- Without this outlier
- Mean = 8172 / 47 = 173.8723
- Removing one observation decreases the mean
- If this outlier had been 1000, the mean would be
- Mean = 9172 / 48 = 191.0833
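The effect of one outlier on the mean can be checked against the stemplot data. This sketch assumes the outlier is the largest stemplot value, 201, which is consistent with the totals 8373 and 8172 given above; the helper name mean is illustrative.

```python
# The 48 observations from the stemplot (stems are tens, leaves are units)
STEMS = {16: "12344688899", 17: "1122233333333344455677788899",
         18: "01234455", 20: "1"}
data = [stem * 10 + int(leaf) for stem, leaves in STEMS.items() for leaf in leaves]

def mean(values):
    """Arithmetic average: sum of the values divided by how many there are."""
    return sum(values) / len(values)

without_outlier = [x for x in data if x != 201]   # assumption: 201 is the outlier
outlier_as_1000 = without_outlier + [1000]

print("all 48 observations:", mean(data))
print("outlier removed:    ", round(mean(without_outlier), 4))
print("outlier set to 1000:", round(mean(outlier_as_1000), 4))
```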
The mean and skew
- The extreme values in the tail of a skewed distribution pull the mean into the tail.
The Median
- The Median is the observation at the 50th percentile
- Half of the observations are above the median
- Half are below
- Finding the median (M)
- When N is odd: M is the middle observation
- When N is even: M is the mean of the two observations around the middle.
The median is resistant to outliers
- The median is unaffected by a single outlier.
- No matter what the value of the largest observation here, the median is still 173.
- The median is also resistant to skew.
- Median values are often used for skewed distributions.
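The rule for finding the median, and its resistance to a single outlier, can be sketched as follows. The data are the 48 stemplot values (assuming stems are tens and leaves are units), and the function name median is illustrative.

```python
# The 48 observations from the stemplot
STEMS = {16: "12344688899", 17: "1122233333333344455677788899",
         18: "01234455", 20: "1"}
data = [stem * 10 + int(leaf) for stem, leaves in STEMS.items() for leaf in leaves]

def median(values):
    """Middle observation (N odd) or mean of the two middle observations (N even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Replace the largest observation (201) with a far more extreme value
extreme = [x for x in data if x != 201] + [1000]

print("median of the data:        ", median(data))
print("median with outlier = 1000:", median(extreme))
```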
The Mode
- The Mode is the most frequent value.
- The Mode is not often used in statistics
Demo on Mean, Median, Mode, and Variability
Variability
- Central tendency alone leaves out a lot
- How representative of the distribution is the central tendency?
- Ways of describing the variability
- Quartiles
- Variance and Standard Deviation
Percentiles
- The Pth percentile is the observation such that P% of the observations fall below it
- The 25th percentile is often called the first quartile (Q1)
- The first quartile is the median of the observations from the smallest to the median
- The median is the 50th percentile
- The 75th percentile is the third quartile (Q3)
- The third quartile is the median of the observations from the median to the largest
- The interquartile range is the size of the interval between Q1 and Q3.
- A measure of the variability of the distribution
The Five Number Summary
- A distribution can be summarized by five numbers
- Minimum, Q1, M, Q3, Maximum
- 161, 171, 173, 178, 201
- What does this five number summary say about this distribution?
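The five-number summary can be computed with the median-of-halves rule described above. This is one convention among several: the sketch assumes that, when N is odd, the middle observation is left out of both halves. Function names are illustrative.

```python
# The 48 observations from the stemplot (stems are tens, leaves are units)
STEMS = {16: "12344688899", 17: "1122233333333344455677788899",
         18: "01234455", 20: "1"}
data = [stem * 10 + int(leaf) for stem, leaves in STEMS.items() for leaf in leaves]

def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]         # smallest observation up to the median
    upper = ordered[(n + 1) // 2:]    # median up to the largest observation
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

minimum, q1, m, q3, maximum = five_number_summary(data)
print(minimum, q1, m, q3, maximum)
print("interquartile range =", q3 - q1)
```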
The Boxplot
- The five-number summary can be graphed as a boxplot.
- This is a modified boxplot - the outlier is shown separately.
- The boxplot gives only limited information about the shape of the distribution (for example, it cannot show whether there is more than one peak).
The standard deviation
- The most common measure of the variability
- Not the most resistant measure
- Has some nice mathematical properties.
- How do you find the standard deviation?
- The mean is the point where the sum of the deviations is zero.
- Variance = s^2 = SUM (xi - Mean)^2 / (N - 1)
- Use N-1, because this calculation requires knowing the mean
- The Variance has N-1 degrees of freedom
- The variance is in squared units
- Standard Deviation = s = sqrt(Variance)
The standard deviation
- Only use the standard deviation when the mean is used as a measure of central tendency.
- Variance = 51.44
- Standard Deviation = 7.17
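The variance and standard deviation quoted above can be reproduced from the 48 stemplot values (assuming stems are tens and leaves are units). The function name sample_variance is illustrative.

```python
from math import sqrt

# The 48 observations from the stemplot
STEMS = {16: "12344688899", 17: "1122233333333344455677788899",
         18: "01234455", 20: "1"}
data = [stem * 10 + int(leaf) for stem, leaves in STEMS.items() for leaf in leaves]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

mean = sum(data) / len(data)
deviation_sum = sum(x - mean for x in data)  # deviations from the mean sum to zero

variance = sample_variance(data)
sd = sqrt(variance)  # back in the original units, unlike the variance

print("sum of deviations =", deviation_sum)
print(f"variance = {variance:.2f}")
print(f"standard deviation = {sd:.2f}")
```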
The effects of linear transformations
- A unit can be transformed by multiplying it by some number and adding a value.
- This is a linear transform
- Measures of central tendency
- Measures of variability
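The effect of a linear transform on these measures can be checked numerically. In the sketch below the multiplier a and the added value b are hypothetical, chosen only for illustration; the data are the 48 stemplot values.

```python
from statistics import mean, median, stdev

# The 48 observations from the stemplot (stems are tens, leaves are units)
STEMS = {16: "12344688899", 17: "1122233333333344455677788899",
         18: "01234455", 20: "1"}
data = [stem * 10 + int(leaf) for stem, leaves in STEMS.items() for leaf in leaves]

a, b = 2.0, 10.0  # hypothetical multiplier and added value
transformed = [a * x + b for x in data]

# Central tendency follows the transform: new = a * old + b
print(mean(transformed), a * mean(data) + b)
print(median(transformed), a * median(data) + b)

# Variability is scaled by |a| and unaffected by b
print(stdev(transformed), abs(a) * stdev(data))
```

Adding a constant shifts the center without changing the spread; multiplying rescales both.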
Summary
- Measures of central tendency
- Mean (not resistant)
- Median (resistant)
- Mode (resistant)
- Measures of variability
- Interquartile range (resistant)
- Can be graphed as a boxplot
- Standard deviation (not resistant)
- Must be used with the mean.
Questions
- Data
- Set A: 2 3 4 5 2 3 4 5 2 3 4 5 2 3 4 5 2 3 4 5
- Set B: 1 1 1 1 1 7 7 7 7 7 2 2 3 3 4 4 5 5 6 6
- Set C: 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 5 5 5 5 20
- For each data set, calculate the mean, median, and variance. Also print out a histogram of the data.
- What is the shape of each distribution? Is it symmetric? Is it skewed?
- Are there points in the distribution that fall far away from the rest of the points (that is, are there outliers)? If yes, how does removing the outlier affect the mean, median, and variance?
- Where is the middle of each distribution? What value do you think gives a good characterization of the middle of the distribution?
- How much variation is there in each distribution? Are all of the values generally clustered closely around a single value or are the values very spread out?
The Normal Distribution and Sampling Distributions
- Ways to describe distributions
- Central Tendency
- Variability
- A theoretical distribution in statistics
- The Normal Distribution
- The Ultimate Well-Behaved Distribution
- Mathematical models
- Introduction to Sampling Distributions
Mathematical Distributions
- Whenever we have a set of points, we might want to describe them with an equation.
- This provides a formal description or model
- When the points are observations from a sample, the model is a distribution
- This model will smooth the curve from a histogram.
A Histogram with a Model
Not as good of a fit:
Why a Model?
- If the mathematical properties are known, then we can use this distribution to reason about the data.
- For example, suppose we wanted to know whether a particular observation we obtained was common or extreme.
Using the Normal Distribution
- The Normal distribution is one model distribution
- It is defined by an equation that has 2 parameters that determine its shape
- The Mean and the Standard Deviation
Different Normal Distributions
- Changing the Mean shifts the distribution.
- Changing the Standard Deviation makes the distribution wider or narrower.
Demo on mean and standard deviation of the normal distribution
Area under the curve
- Because the equation that specifies the distribution is known, we know the area under any part of the curve.
- The area under the curve between two values is the proportion of observations that fall between those values.
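The area under a normal curve between two values can be computed from the cumulative distribution function. This sketch uses the error function from Python's math module; the function names normal_cdf and area_between are illustrative.

```python
from math import erf, sqrt

def normal_cdf(x, mean=0.0, sd=1.0):
    """Proportion of a normal distribution falling at or below x."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

def area_between(lo, hi, mean=0.0, sd=1.0):
    """Proportion of observations expected between lo and hi."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

for k in (1, 2, 3):
    print(f"within {k} sd of the mean: {area_between(-k, k):.4f}")
```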
Demo on area under the normal curve
Another Demo on area under the normal curve
The standard normal distribution
- All normal distributions are the same, except for a linear transformation.
- We can change any observation into a standard score (sometimes called a z-score): z = (x - Mean) / sd
- The standard normal distribution has mean = 0 and sd = 1
What is a z-score?
- Using the z-score, you can find the proportion of observations as extreme or more extreme in the normal distribution
- Just use your normal distribution chart.
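In place of a printed chart, the same lookup can be sketched in code. The observation of 120 and the distribution with mean 100 and sd 10 are illustrative values, not data from the lecture; the function names are also illustrative.

```python
from math import erf, sqrt

def z_score(x, mean, sd):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

def upper_tail(z):
    """Proportion of a standard normal distribution at or above z."""
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

z = z_score(120, mean=100, sd=10)   # illustrative observation and parameters
as_extreme_above = upper_tail(z)

print(f"z = {z}")
print(f"proportion as high or higher: {as_extreme_above:.4f}")
# For "as extreme in either direction", double the tail (the curve is symmetric)
print(f"proportion as extreme in either direction: {2 * as_extreme_above:.4f}")
```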
Practice with z-scores
- Suppose you are scouting for potential Olympic long jumpers. You observe 5000 sixth graders in the standing broad jump.
- The distribution of the sample looks well-behaved.
- Mean = 6.53 feet
- sd = 1.14 feet
- What are the z-scores and probabilities for jumps of
- 6.21 feet
- 6.53 feet
- 4.38 feet
- 7.21 feet
- 9.77 feet
- 3.11 feet
Demo on z scores and probability
Sampling Distributions
- The normal distribution can help.
- If you survey N people, your survey will get some mean response X1.
- If you took another survey of N people from the same population, this survey would have a mean X2.
- If you took a bunch of surveys and plotted the means on a histogram, you would find something that looked like a normal distribution
- Even if the data you are sampling is not normally distributed.
- Sampling Distribution of the Mean
Lots of Means
- This distribution of survey results would approximately follow a normal distribution (the approximation improves as N grows).
- Mean = mu
- standard deviation = sigma / sqrt(N)
- Underlying distribution: mu=100, sigma=10
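A simulation can show the spread of survey means matching sigma / sqrt(N). This sketch assumes a normal underlying distribution with mu = 100 and sigma = 10, as in the example; the number of surveys (2000), the fixed seed, and the function name are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_sample_means(n, n_surveys=2000, mu=100.0, sigma=10.0, seed=0):
    """Means of n_surveys simulated surveys, each with n respondents."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    return [sum(rng.gauss(mu, sigma) for _ in range(n)) / n
            for _ in range(n_surveys)]

for n in (4, 25, 100):
    means = simulate_sample_means(n)
    print(f"N = {n:>3}: mean of means = {mean(means):.2f}, "
          f"sd of means = {stdev(means):.2f}, sigma/sqrt(N) = {10 / sqrt(n):.2f}")
```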
Increasing Sample Size
- As N (sample size) increases, the variability in this distribution decreases substantially.
- By N = 1000, the true mean is quite likely to be very close to the mean obtained in the survey
Demo on Sampling Distributions and Variance
How big a sample?
- The sampling distribution of the mean
- Mean = mu
- standard deviation = sigma / sqrt(N)
- As N gets larger, the standard deviation of the sampling distribution gets smaller.
- Diminishing returns for additional observations
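The diminishing returns can be read off a small table of sigma / sqrt(N). The value sigma = 10 is taken from the earlier example; the function name standard_error is illustrative.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / sqrt(n)

sigma = 10.0  # population standard deviation from the earlier example

for n in (1, 10, 100, 1000, 10_000):
    print(f"N = {n:>6}: sd of the sample mean = {standard_error(sigma, n):.3f}")
```

Because N sits under a square root, quadrupling the sample size only halves the standard deviation of the sampling distribution.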
Where is this train taking me?
- The normal distribution is a convenient mathematical construct, but it may not be a good model of your data.
- Poorly behaved distributions deviate from the normal.
- The normal distribution is an important concept when we talk about inferential statistics.
- Many of the statistical tests we will talk about assume that the data being tested follow a normal distribution.