Behavioural Genetic Interactive Modules
Correlation & Regression

Overview
This module aims to introduce the related concepts of
correlation and regression and to demonstrate their
relationship with variance and covariance.
Tutorial
As mentioned in the previous module, in order to assess the
magnitude of a covariance statistic we can use information
about the variances of the measures to standardise
it. We can also calculate so-called regression
coefficients easily from the covariance if we know the
variance of the measures.
This module allows the user to explore these related
measures by simulating a bivariate dataset with
certain known properties.
This panel is used to specify the relationship
between the two simulated variables, X and Y. It
represents three separate influences, or sources of
variation, on the two measures:
- S1 is some unmeasured, or latent, variable that
directly influences X only
- S2 is some unmeasured variable that directly
influences Y only
- C influences both X and Y (although, as we shall see,
it can influence X and Y similarly, or it can
influence them differentially)
To make this more concrete, imagine the following scenario: two
boys, Mike and Joe, both attend the same school.
If we were to look at the results they obtained on all the tests
they took at school, we could ask whether or not their
scores tended to be related.
By related in this context, we do not just mean similar. For
example, Mike could consistently score 20 points below Joe, and so they
would have quite dissimilar mean scores. What we mean here by
related is that on the tests where Mike tends to score higher than
he usually does, so does Joe. In statistical language, we ask whether
these measures are correlated.
In this scenario,
- X represents Mike's test scores
- Y represents Joe's test scores
- S1 represents all of the influences specific to Mike that impact
upon his test scores: for instance, whether or not his mother
helps him with his homework, Mike's own IQ, etc.
- S2 represents the factors that are specific to Joe
- C represents the factors that influence their test scores
that are shared by both Mike and Joe: for instance, the
teachers they share for different subjects. If some teachers
are better than others, and this influences their scores similarly,
this shared factor will tend to induce a correlation between
Mike's and Joe's scores.
The sliders determine how much of an influence these different
sources of variance have on X and Y. The scales are arbitrarily set
between 0 and 100 for the two specific measures and -50 to +50 for the
shared measure. A negative value for the shared source of variance
does not mean that the mean score of both Mike and Joe would go down
(remember: this entire module is not concerned with mean
differences). Rather, a
negative value here means precisely the opposite of sharing - this factor
tends to make Mike and Joe less similar. Be careful with the
language here - although S1 and S2 are not 'shared' with each other, they
neither make the two boys' scores similar nor dissimilar to each
other. They are said to be independent of one
another.
Moving the sliders causes the module to begin simulating normally-distributed
data. The
number of datapoints is determined by the slider in the sample size
panel, which can be between 5 and 500. (Perhaps we should stop the analogy
with Mike and Joe here - it would be rather unfair to make them do 500 tests!)
Note that if you set all the sliders to zero (which would imply no
variation for either measure) the module will 'compensate' by introducing
a small amount of shared variation. If it did not do this, an error
would occur when calculating the bivariate statistics (it would
imply division by zero). In any case, the concept of covariation is
meaningless if there is no variation in at least one of the measures.
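To make the simulation step concrete, here is a minimal sketch in Python of how
such a dataset could be generated. The function name simulate() and the exact
construction (weighting independent normal deviates by the square roots of the
slider values) are assumptions for illustration, not a description of the
module's actual code.

    import numpy as np

    def simulate(s1, s2, c, n, seed=0):
        # Draw n pairs (X, Y) given specific variances s1 and s2
        # and a shared variance c (which may be negative).
        rng = np.random.default_rng(seed)
        spec_x = rng.standard_normal(n)   # latent influence S1, specific to X
        spec_y = rng.standard_normal(n)   # latent influence S2, specific to Y
        shared = rng.standard_normal(n)   # latent influence C, common to both
        x = np.sqrt(s1) * spec_x + np.sqrt(abs(c)) * shared
        # a negative C loads on Y with the opposite sign, inducing a negative correlation
        y = np.sqrt(s2) * spec_y + np.sign(c) * np.sqrt(abs(c)) * shared
        return x, y

    x, y = simulate(s1=40, s2=40, c=20, n=500)
    print(np.var(x, ddof=1), np.var(y, ddof=1), np.cov(x, y)[0, 1])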
Unlike the previous two modules, this module does not list the
individual data values (although we will see a scatter-plot of all
the points). Instead, now that we know how they are calculated, only
the summary statistics are presented.
The univariate statistics presented are the variance and the
standard deviation for X and Y.
Note the effect of moving the sliders on these variance estimates. Firstly,
note that they will tend to 'jump around' quite a lot. This is because of
random sampling variation. When we specify the strengths of S1, S2
and C in the panel above, we are specifying the population
parameters. The computer then acts as if it had randomly selected
between 5 and 500 individuals (however many are specified) from this
population. So, on average, we should expect the sample to be
representative of the population.
But we also expect sample-to-sample variation in these estimates, and this
variation will be greater when the samples drawn are relatively small.
Confirm this for yourself by making the
sample size smaller. The sliders at the top directly represent the variance
in X and Y attributable to the three causal factors. The estimated sample
variance for X and Y should therefore be close to the sum of these values
(i.e. S1+C for X and S2+C for Y): this will be more likely to be the case
when the sample size is larger. Note that it is the absolute value of C that
contributes to the variance, however. That is, if S1 were 50 but C
were -25, we should expect the variance of X to be 75 and not 25. This is
because C still generates variation in X - if we are not considering the
relationship between X and Y, then the sign of C is meaningless.
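As a tiny numerical illustration of this point (assuming the additive construction
of X and Y from the latent sources, with made-up slider values):

    s1, c = 50, -25
    expected_var_x = s1 + abs(c)   # 75, not 25: the sign of C does not matter for the variance
    expected_cov_xy = c            # -25: the sign of C matters only for the covariance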
The sign of C will not be meaningless when considering the bivariate
statistics, however. These statistics are essentially trying to
quantify the extent of the relationship between X and Y. In this
context, they are evaluating the magnitude and sign of the shared cause, C,
relative to the magnitudes of the specific causes, S1 and S2.
The presentation in terms of the shared and nonshared latent variables is
arbitrary, however. See the Appendix for a discussion of the possible
reasons for observing associations between two variables: having a
shared cause is only one of them.
First, we see that the covariance between X and Y has
been calculated. Inside the module, this would have followed exactly the
same procedure we saw in the last module. So, again, we find ourselves
asking: what does a covariance of 31.735 actually tell us about the
relationship between X and Y?
The correlation presented in this panel attempts to
answer this, by standardising the covariance with respect to
the variances of X and Y. Specifically, it represents the
covariance divided by the square-root of the product of the
two variances (see the Appendix for a more detailed discussion
of the correlation coefficient). In this case, we see that
31.735 / ( sqrt(55.98*57.75)) = 0.558. Correlations range from
-1 to +1, where +1 represents a perfect positive association,
-1 a perfect negative association and 0 no association. So
a correlation of 0.558 implies a moderate positive association
between X and Y. The square of the correlation represents the
proportion of variance in one measure that is explained by its
covariation with the other: in this case, 0.558*0.558 = 0.31.
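The same calculation can be reproduced by hand from the summary statistics
quoted above; the short Python snippet below simply retraces that arithmetic.

    import math

    cov_xy = 31.735
    var_x, var_y = 55.98, 57.75

    r = cov_xy / math.sqrt(var_x * var_y)
    print(round(r, 3))      # 0.558  (a moderate positive correlation)
    print(round(r * r, 2))  # 0.31   (proportion of variance explained)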
As explained in the Appendix, the regression coefficients
are a function of the amount of variance in one measure explained by
variation in the other. As such, regression is asymmetrical, in
that we can ask how well X predicts Y as well as how well Y predicts
X. The standard error of the regression coefficient can be
interpreted as a measure of the uncertainty in that estimate. If the
standard error is large, it implies that the estimated regression
coefficient may not be representative of the true population value.
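The regression slopes can also be retraced from the same three summary
statistics. The sketch below uses the textbook formulas (slope = covariance
divided by the variance of the predictor, with the usual textbook expression
for the standard error of a simple regression slope); the sample size n is
an assumed value for illustration, not a figure from the module.

    import math

    cov_xy, var_x, var_y, n = 31.735, 55.98, 57.75, 200   # n assumed for illustration

    b_y_on_x = cov_xy / var_x   # slope predicting Y from X, approx. 0.567
    b_x_on_y = cov_xy / var_y   # slope predicting X from Y, approx. 0.549

    r2 = cov_xy ** 2 / (var_x * var_y)
    se_y_on_x = math.sqrt(var_y * (1 - r2) / (var_x * (n - 2)))
    se_x_on_y = math.sqrt(var_x * (1 - r2) / (var_y * (n - 2)))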
The scatter-plot represents all the points in the sample as well as the
two colour-coded regression slopes.

Exploring the module
Let's use the module to get a feel for the relationship between
these different measures. Firstly, note how moving either of
the two specific sliders, S1 or S2, changes the shape of the
bivariate distribution. Increasing S1 will stretch it out
horizontally (since X is plotted along the horizontal axis,
traditionally known as the x-axis). This
is because it is introducing variation in X that is not shared
by Y.
Conversely, moving the S2 slider will make the distribution
grow or shrink along the vertical axis. Changing the C slider
has a different effect on the shape of the distribution, and its
effect will be more noticeable when S1 and S2 are quite low. Set
S1 and S2 to about 10 each. If C is at zero, the points should be
clustered in an evenly-shaped little ball in the middle of
the scatter-plot. (Make sure that the sample size is quite high,
over 400, for these effects to be clear.)
Note that the variance of both X and Y should be near 10. The
covariance will be very small, however, and the correlation will
be even smaller. The proportion of variance explained should be
virtually zero. The regression slopes should also be virtually zero,
so that the red line will follow the vertical axis and the blue line
will follow the horizontal axis. This is because, given that neither
measure tells us anything about the other in this scenario, the
most likely estimate of Y will be the mean of Y, for all values of X.
This is represented by the blue line, the regression slope that predicts
Y given X, which will be virtually flat. Likewise, the red regression
slope predicting X given Y will remain very close to the mean of X at all
points of Y (i.e. will tend to follow the y-axis).
If you then move the C slider to the right slowly, note what happens to
all the statistics and the scatter-plot. Both variances increase, unsurprisingly,
as that is what the slider is doing: adding more variation. But the covariance
will increase also, as will the correlation and all the other statistics. This
is perhaps best represented in the scatter-plot: a clear association between
X and Y will result, in that points that tend to be higher on X also tend to
be higher on Y and vice versa. The two regression lines will begin to converge,
representing this fact.
With C set to some high value, set both S1 and S2 to zero and observe what happens.
This state of affairs implies that the only source of variation in both X
and Y is completely shared. Because there is no unique variance in either
X or Y, every individual's score on X would be identical to their score on Y:
thus the straight line on the scatter-plot. In this case, note that the covariance
equals the variance of both measures, and so the correlation is 1. Now move
the C slider to a negative value: the covariance and correlation become
negative, and the points now fall along a downward-sloping line on the scatter-plot.
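In symbols (again assuming the additive construction of the simulated scores):
with S1 = S2 = 0 and C = c, both variances equal |c| and the covariance equals c,
so the correlation is c / |c|, i.e. exactly +1 or -1.

    import math

    c = 36.0                                 # shared variance only: S1 = S2 = 0
    var_x = var_y = abs(c)
    cov_xy = c
    r = cov_xy / math.sqrt(var_x * var_y)    # c / |c| = +1 (or -1 if c were negative)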
Try reducing the sample size to get a feel for the way in which the
estimates become less consistent. With only a few points, say fewer than
10, you should be able to see the way in which the regression slopes can be
drastically influenced by just one extreme observation. Note how the
standard errors of the regression coefficients increase too - the implication
of this is that the coefficients would be unlikely to be significantly different
from zero, in a statistical sense.
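A short simulation illustrates the same point. The sketch below re-uses the
hypothetical simulate() function from earlier (an assumed construction, not
the module's own code): the spread of the slope estimates across repeated
samples shrinks as the sample size grows.

    import numpy as np

    def slope_y_on_x(x, y):
        # regression coefficient of Y on X: covariance over the variance of X
        return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

    for n in (10, 50, 500):
        slopes = [slope_y_on_x(*simulate(s1=30, s2=30, c=20, n=n, seed=rep))
                  for rep in range(200)]
        print(n, round(float(np.std(slopes)), 3))   # the spread falls as n rises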
Finally, we will consider a case that highlights the difference between
the correlation and regression coefficients. Consider the following scenario,
where X has a great deal of specific variation (S1=88) but there is no
shared variation (C=0). As mentioned above, the program will
not let a variable have zero variance (i.e. S2=0 and C=0); it will
add a very small amount to make the routine run. So think of Y as having
a very small degree of unique variation and no association with X.
We see these values reflected in the variances of the simulated sample -
X has a relatively large variance, Y has a relatively small variance.
But take a look at the regression slopes:
Although we know that X and Y should be unrelated, the regression slope of
X on Y (that is, the red line that plots the expected value of X as
a function of Y) is far from flat. If we look at the actual bivariate
statistics, we see the same pattern:
That is, although the covariance and correlation are very small, as
expected, and the regression of Y on X is similarly near zero, the
regression of X on Y has a very large coefficient. In this particular
sample there are a reasonable number of individuals, too (approximately
200), which seems to have been enough to estimate the other statistics
accurately. What is going on?
The asymmetry in regression slopes is caused by the
asymmetry in the variances of the two measures and is
unsurprising when you think about it. It is easy to predict
something that does not have very much variation - the mean
will always be a good guess. Therefore, it is easy to predict Y.
It is difficult for Y to predict X however, as X does have a
lot of variation. Because there is such a 'restriction in range' in
Y, relative to X, it is unable to accurately account for variation in X.
This is reflected in the very large standard error for the regression of
X on Y seen here - 0.922 (large relative to the magnitude of the
regression slope itself, 1.637). This is why the two regression
slopes tell us slightly different things: whether one variable
predicts the other better, or vice versa, depends on the relative
balance of unique variance in each. Note that
if both X and Y have the same amount of unique, 'residual' variance
then both regression slopes will equal the correlation coefficient.
These relations are clearly apparent in the simple equations
that describe the relationship between variance, covariance,
correlation and regression, as described in the Appendix.
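Those relations can also be written out directly. The identities below are
standard; the numbers are illustrative values chosen to mimic the asymmetric
scenario above, not the module's actual output.

    import math

    var_x, var_y, cov_xy = 88.0, 1.0, 1.3    # illustrative values only

    r = cov_xy / math.sqrt(var_x * var_y)    # approx. 0.139: a weak correlation
    b_y_on_x = cov_xy / var_x                # approx. 0.015: flat, Y is easy to predict
    b_x_on_y = cov_xy / var_y                # approx. 1.300: steep, despite the weak correlation

    # Equivalently, b_y_on_x = r * sd_y / sd_x and b_x_on_y = r * sd_x / sd_y,
    # so when the two variances are equal both slopes reduce to r itself.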
Questions
-
-
-
Answers
-
-
-

Please refer to the Appendix for further discussion of this topic.

Site created by S.Purcell, last updated 6.11.2000