Correlation and Regression
Correlational Research
- Your research questions may be too complex to be handled by experimental research.
- It may not be feasible to do experimental research to address a question.
- In these cases, correlational research can be very useful.
- Correlations don't let us assess causation.
- Still, correlations let us see patterns of behavior and predict them.
Here's a Scenario
- Market research firm
- Studying preference for a new type of computer
- What is the relationship between people's degree of computer expertise and their preference for the new brand?
- Get a measure of their preference
- Self-rating of expertise
Expertise vs Preference
Correlation
- Correlation is a measure of mathematical association.
- There are many possible correlations we might measure.
- Typically, when we talk about correlation, we are referring to Pearson's correlation coefficient (called r)
- Correlation is defined as: r = (1 / (n - 1)) * sum of [((x - x_bar) / sx) * ((y - y_bar) / sy)]
It's like calculating a z-score.
- Each observation has its mean removed
- Each observation is divided by its standard deviation
- The resulting value has no units.
- What are the properties of the correlation?
- Range
- -1: Perfect negative relationship
- 1: Perfect positive relationship
- 0: No linear relationship at all.
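The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the original slides: the function name `pearson_r` and the expertise and preference ratings are made up for the example.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the summed product of z-scores, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Remove each mean, divide by each standard deviation, multiply the pairs
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

# Hypothetical data: self-rated expertise (1-7) and preference rating (1-10)
expertise = [1, 2, 3, 4, 5, 6, 7]
preference = [3, 2, 5, 4, 6, 8, 7]
r = pearson_r(expertise, preference)
print(round(r, 2))  # 0.89
```

Because both variables are converted to z-scores before they are multiplied, the result carries no units and always lies between -1 and 1.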
A Positive Relationship (r = .54)
A Negative Relationship (r = -.62)
No Relationship (r = -.08)
Important Things to Know
- Correlation measures the linear association between the variables.
- No good for nonlinear associations
- The correlation coefficient is not affected by changes in the units of measurement of the variables
- A correlation of 1 or -1 indicates that the observed points all fall on a straight line.
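Two of these properties are easy to check numerically. The sketch below uses invented height and weight values and a hypothetical helper `pearson_r`. It shows that converting units leaves r unchanged, and that a perfect but curved relationship can still give r = 0.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up measurements in inches and pounds
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 125, 140, 155, 160, 185]

# The same measurements converted to centimeters and kilograms
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)  # same value as r_imperial

# y is completely determined by x, but the relationship is not linear
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_quadratic = pearson_r(xs, ys)  # essentially 0
```

The quadratic case is a reminder that r near 0 rules out a linear pattern only, not every kind of relationship.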
What does this value signify?
- The square of the correlation is the proportion of variance in one variable that can be accounted for by the other.
- This value is abbreviated r^2
- What does this mean?
- How much of the variability in a variable can be predicted from the regression line?
- How much comes from other factors?
- How far do the points lie from the regression line?
Proportion of Variance (r^2 = .54^2, about .29)
Resistance
- Regression and correlation coefficients are not resistant
- They are strongly influenced by outliers
- We must graph our data to check that the effects we see are not due to outliers.
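A single point can change the correlation dramatically. The following sketch uses invented data and a hypothetical helper `pearson_r`: six points lie exactly on a line, and then one outlier is added.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Six made-up points that fall exactly on the line y = 2x
clean_x = [1, 2, 3, 4, 5, 6]
clean_y = [2, 4, 6, 8, 10, 12]
r_clean = pearson_r(clean_x, clean_y)

# The same data plus one outlying observation at (7, 0)
r_with_outlier = pearson_r(clean_x + [7], clean_y + [0])

print(round(r_clean, 2), round(r_with_outlier, 2))  # 1.0 0.25
```

One observation out of seven is enough to move r from 1.0 to .25, which is why a scatterplot should accompany any reported correlation.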
Resistance (r = .54 vs. r = 1.0)
Demo on outliers in correlation/regression
Extrapolation
- A strong relationship suggests that you can predict the value of one variable from the value of another.
- Prediction is only valid within the range for which measurements were taken.
- Consider height and basketball ability
- True for correlation and regression
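The risk of extrapolating can be illustrated with a small sketch. It assumes a made-up curved relationship (y = x squared) that is only measured for x from 1 to 4; the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for predicted y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Measurements exist only for x = 1..4; the true relationship is y = x**2
xs = [1, 2, 3, 4]
ys = [x ** 2 for x in xs]
a, b = fit_line(xs, ys)  # a = -5, b = 5

# Inside the measured range the line is never off by more than 1 at a data point
errors_in_range = [abs(y - (a + b * x)) for x, y in zip(xs, ys)]

# Far outside the range the same line is badly wrong
predicted_at_10 = a + b * 10  # 45
actual_at_10 = 10 ** 2        # 100
```

Nothing in the fitted data warns about the error at x = 10; the only protection is to restrict predictions to the range that was actually measured.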
Interpreting Correlations
- A correlation coefficient is just a description of our data.
- What it means depends on how data were collected.
- All the correlation implies is a numerical relationship.
- Imagine we found a positive correlation between atmospheric pollution levels and murder rates for 100 counties in the United States.
- Why would this be?
- The third variable problem
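One way to see the third variable problem is with invented numbers in which a third variable (here called `population`, purely as an assumption for illustration) drives both measures. The sketch removes the part of each measure that population predicts and correlates what is left; the helper names are hypothetical.

```python
from statistics import mean

def pearson_r(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals_after(xs, ys):
    """What is left of ys after subtracting the least-squares line on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Made-up counties: both measures are population plus unrelated noise
population = [1, 2, 3, 4, 5, 6, 7, 8]
pollution = [p + e for p, e in zip(population, [1, -1, 1, -1, 1, -1, 1, -1])]
murders = [p + e for p, e in zip(population, [1, 1, -1, -1, 1, 1, -1, -1])]

r_raw = pearson_r(pollution, murders)  # about .79
r_adjusted = pearson_r(residuals_after(population, pollution),
                       residuals_after(population, murders))  # about -.11
```

In this constructed example the strong raw correlation nearly disappears once population is taken into account, even though neither measure influences the other.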
Summary
- Correlation measures linear association
- Units are stripped off the measurements
- Correlation ranges from -1 to +1
- Correlations must be interpreted in light of the way the data were collected.
Regression
- Modeling data in a scatterplot
- Linear regression
- Measures of goodness of fit
The Deal
- Staring at a scatterplot gives us a sense of the relationship between variables.
- How can we give a more precise description of the relationship between variables?
- Linear regression
- Draws a straight line on the data.
- The line is called a regression line.
- The question is: which line is the best line?
Line (r = .99)
Deviations-Residuals
- How far from the regression line?
- Smaller residuals imply a higher correlation
- More variance accounted for
Which line do we draw?
- Draw the line that minimizes the sum of the squared deviations from the points to the line.
- Deviation = observed y - predicted y
- Least-squares regression
- How do we do that?
- The equation for a line
- y = a + bx
- y is the variable to be predicted
- x is the predictor variable
- a is the intercept of the line
- b is the slope of the line
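The least-squares idea can be checked directly: compute the sum of squared deviations for the fitted line, then for nearby lines. This is a sketch with made-up data; `least_squares_line` and `sum_squared_residuals` are hypothetical helper names.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of (observed y - predicted y)^2 for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

a, b = least_squares_line(xs, ys)             # a = 0.6, b = 0.8
best = sum_squared_residuals(xs, ys, a, b)    # 3.6

# Nudge the intercept and slope; every alternative line should fit worse
alternatives = [(a + da, b + db)
                for da in (-0.5, 0, 0.5)
                for db in (-0.2, 0, 0.2)
                if (da, db) != (0, 0)]
all_worse = all(sum_squared_residuals(xs, ys, a2, b2) > best
                for a2, b2 in alternatives)
print(all_worse)  # True
```

Squaring the deviations keeps positive and negative misses from canceling and penalizes large misses more heavily than small ones.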
Actually Fitting a Line
- You don't have to guess.
- predicted y = a + bx
- b = r * (sy / sx)
- a = y_bar - (b * x_bar)
- line goes through (x_bar, y_bar)
- It makes sense
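The two formulas above translate directly into code. The sketch below uses invented data; `pearson_r` and `fit_line` are hypothetical helper names.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def fit_line(xs, ys):
    """Slope and intercept from r, the standard deviations, and the means."""
    r = pearson_r(xs, ys)
    b = r * (stdev(ys) / stdev(xs))   # b = r * (sy / sx)
    a = mean(ys) - b * mean(xs)       # a = y_bar - (b * x_bar)
    return a, b

xs = [2, 4, 6, 8]
ys = [3, 7, 5, 9]
a, b = fit_line(xs, ys)  # a = 2, b = 0.8

# The fitted line passes through the point (x_bar, y_bar)
predicted_at_mean = a + b * mean(xs)  # equals mean(ys), which is 6
```

The slope is the correlation rescaled into the units of y per unit of x, which is why a weak correlation produces a flat line even when y itself varies a lot.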
A Regression Line
predicted y = 64.9 + (.63*x)
Some Irregular Data (r = .05)
What do we use this line for?
- The regression line provides a model of the data.
- We can do three things with this line.
- How good a model is the line?
- Use the line as a quantitative description of the data.
- Predicting values not given.
Accounting for Variance
- If the model (the regression line) is a good one, then the line will account for most of the variance.
- Most of the points will fall around the line.
- The residuals will generally be small.
- If the model is a poor one, then the line will account for very little of the variance.
- Most of the points will fall far from the line.
- The residuals will generally be large.
- There is a measure called r^2, which is the proportion of variance that a model accounts for.
What r^2 means
- The correlation r, squared
- Calculate the variance of the predicted y's, then divide it by the variance of the observed y's
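Both routes to this value can be compared on a small invented data set. In the sketch below, `pearson_r` is a hypothetical helper, and the line is fit with b = r * (sy / sx) and a = y_bar - (b * x_bar).

```python
from statistics import mean, stdev, variance

def pearson_r(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

xs = [2, 4, 6, 8]
ys = [3, 7, 5, 9]

r = pearson_r(xs, ys)                  # 0.8
b = r * (stdev(ys) / stdev(xs))
a = mean(ys) - b * mean(xs)
predicted = [a + b * x for x in xs]

# Route 1: square the correlation
r_squared_from_r = r ** 2

# Route 2: variance of the predicted y's over variance of the observed y's
r_squared_from_variances = variance(predicted) / variance(ys)

print(round(r_squared_from_r, 2), round(r_squared_from_variances, 2))  # 0.64 0.64
```

Here the line accounts for 64% of the variance in y, and the remaining 36% is left in the residuals.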
Plot the residuals
- We also want to know whether the line fits equally well everywhere.
- We can plot the residuals.
- Residual = observed y - predicted y
Know the Warning Signs
- Any systematic pattern of residuals suggests systematic variance that is not being explained.
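A quick numerical illustration of such a warning sign: the sketch fits a straight line to made-up data that actually follow a curve (y = x squared) and lists the residuals in order of x. The helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]   # curved data, deliberately not linear

a, b = fit_line(xs, ys)     # a = -5, b = 6

# Residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print([round(e, 6) for e in residuals])
# [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle. A good linear fit would instead scatter them around zero with no visible trend across x.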

Nonlinear Relationships
- Linear regression tries to fit a line to the data.
- A line is not always the best description of the relationship between points.
Creating Linear Relationships
- Linear regression is a fast, easy way to model data.
- If the data are non-linear, there may be a way to transform one of the variables.
- If the data have an exponential relationship (y grows exponentially with x), then taking the logarithm of y will yield a linear relationship.
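The log transformation can be sketched with made-up data generated from y = 2 * e^(0.5 * x). The constants 2 and 0.5 and the helper names are assumptions chosen for illustration.

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

xs = [0, 1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]   # exponential growth
log_ys = [math.log(y) for y in ys]          # natural log of y

r_raw = pearson_r(xs, ys)       # below 1: a line fits the raw data imperfectly
r_log = pearson_r(xs, log_ys)   # essentially 1: log(y) is linear in x

# On the log scale the slope recovers 0.5 and the intercept recovers log(2)
a, b = fit_line(xs, log_ys)
```

Predictions made on the log scale have to be converted back with `math.exp` before they are compared with the original y values.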
The Regression Line as Description
- Looking at a scatterplot gives a general idea of the relationship between variables.
- The regression line allows a more precise statement of the relationship.
- This aspect of regression is particularly important in psychology.
- If a line provides a good model of the data, that will affect how we think about that process.
The Regression Line as Predictor
- Regression lines permit interpolation between the data points.
- Data are collected at particular points.
- Using the regression line, we can predict data at intermediate points.
- Our confidence in that prediction is based on the goodness of the line as a model for the data.
- Predictions should be made from good-fitting lines, but not from poorly fitting lines.
Summary
- Scatterplots display the relationship between variables.
- This relationship can be described using least squares regression.
- Minimizes the sum of squared deviations.
- Lines can be assessed for their goodness of fit.
- Look at r^2
- Look at residuals
- Good lines can be used as descriptions of data.
- Good lines can be used to predict new values.