Behavioural Genetic Interactive Modules
Extremes Analysis

Overview
This module aims to introduce the concept of DF group analysis.
Introduction to DeFries-Fulker Extremes Analysis
DeFries-Fulker analysis (DF analysis) is a regression-based
method for the analysis of twin data. It was designed
particularly for the case of proband-ascertained samples:
an example of a non-random sample. This approach ascertains
extreme scoring individuals (probands) along with their
co-twin. This means that in every twin pair,
at least one of the twins will be a high scorer. As a result,
standard correlational and model-fitting methods of
analysis are no longer entirely appropriate.
DF analysis is based on the principle of regression to
the mean. To illustrate this, consider the following example:
Imagine that we have 1000 individuals and test them on
some measure, such as the amount of exercise they had
done in the last week. We might expect the following type of
frequency distribution chart, where some individuals
do lots (at the right-hand-side of the plot), some do none
(at the left-hand-side), but most do an intermediate amount
(in the middle).
If we were to re-test the same individuals a month later,
what would we expect the plot to look like? We have no
reason to believe the mean or variance of the distribution
should have changed at all: on average, nothing special
should have happened to our sample in their exercise
habits. We would expect exactly the same pattern :
However, what if we focused on the group of individuals
who scored very low at the first measurement occasion:
Is there any way of predicting what will happen to
the mean of this group at the second measurement
occasion?
The mean at the second occasion of the group who were
extreme at the first occasion will depend on the
underlying correlation between the two occasions.
Imagine the correlation was +1. All individuals would have
exactly the same score at the second time as they had
for the first time. The implication of this is that the
extreme group would continue to be equally extreme at
the second occasion.
Imagine the correlation was 0. An individual's score at the
second time point would be unrelated to their score at the
first. The implication of this is that the mean of
the extreme group should regress back to the mean
on the second occasion.
And for intermediate correlations, individuals who score
low the first time are more likely to also score low at
the second time. The implication of this is that the mean of
the extreme group would be expected to partially regress
towards the mean at the second occasion.
In this manner, studying the extent of
regression to the mean over two points
can be used as a tool for estimating the
correlation between those two measures.
In this sense, we are observing means
to say something about the variance/covariance
of the measures. (cf. ANOVA, which observes
variances in order to say something about
mean differences).
DF analysis takes advantage of the ability to
extract information about the correlation between
two measure by studying the regression to
the mean. We are not concerned here with measurements
at two time points in the same individual: we are
interested in the correlation between twins. In
this context, the two measurements are the scores
of each twin. The equivalent of the extreme group
at time one are all the probands; the
second measurement occasion is equivalent to their
co-twin's score.
If a trait is genetic, we should expect differential
regression to the mean between MZ and DZ co-twins of
extreme probands. As the extent of regression to the
mean represents the correlation between twins, this
is equivalent to asking whether or not the
trait is more correlated amongst MZ than DZ twins:
i.e. the fundamental basis of quantitative genetic
analysis.
Because the regression to the mean essentially
reflects information about the correlation
between twins, if the means are appropriately
standardised so that the population mean
becomes 0 and the proband mean becomes
1, then the transformed MZ and DZ co-twin
means can be interpreted in exactly the
same manner as MZ and DZ correlations can
be.
That is, twice the difference between MZ and
DZ co-twin mean estimates the heritability,
1 minus the MZ co-twin mean represents
the effect of nonshared environment.
Represented graphically:
The standardisation procedure is very simple :
we have four means, that we should expect to
find in this order:
- Population mean
- DZ co-twin mean
- MZ co-twin mean
- Proband mean
If we a) subtract the population mean from each
of the four means, and then divide each of the results
by this new proband mean,
this will ensure the transformed population mean is zero,
the proband mean will be 1, and the MZ and DZ co-twin
means should be somewhere in between.
The Module
The module simulates a population of 20,000
individuals (5000 MZ pairs and 5000 DZ pairs) according
to the population parameters specified in the panel, as shown
to the right. As well as using the sliders to determine
the relative balance of additive genetic, shared environmental
and nonshared environmental variance for this
trait (A, C and E), we also have to specify a threshold
that will be used to select probands.
The module simulates a trait that ranges between 1 and
20: the threshold is expressed in the raw-units of the
trait. All individuals scoring at the threshold or lower
become probands. Here we see the distribution of the
all individuals' scores. That is, both twins in a
pair are plotted and so both can become probands if their
scores are both below threshold. Both twins will also be co-twins
in this case. This is called double-entry
and has to be statistically corrected for when estimating
the significance of a DF result (as some individuals are
essentially counted twice) but it does not affect the
ideas we are dealing with here in any way.
The little white and red dots on the axis represent
the population and proband means respectively.
After simulating the twin population, the module selects
the co-twins of probands and plots their scores,
separately for MZ and DZ co-twins. The co-twin means are
also calculated and plotted on the axes, also. Here is
the MZ co-twin distribution.
And here is the DZ co-twin distribution.
As can be seen from these graphs, the DZ co-twins score
higher on average than the MZ co-twins. This represents
the differential regression to the mean between MZ and
DZ co-twins, that reflects the difference in correlations
between MZ and DZ co-twins.
These means are summarised in the panel, shown to the
right. As expected, the proband mean is the lowest, the
MZ co-twin mean the next lowest, the DZ co-twin mean
next and the population mean last.
If we transform these means in the manner described
above we arrive at the DF analysis estimates for
A, C and E. This panel also tells you what percentage
of the population were selected as probands for the
specific threshold used.
The DF analysis has done a good job here. In this instance,
the data were simulated under:
We see that the estimates derived from DF analysis
are incredibly similar, as we should hope! (it is
only rounding error which makes the total variance 101% here)
We should have expected this because the sample size was
so large (20,000 individuals, 14% of which were probands).
If our sample had only contained the probands
and their co-twins, however, the standard model-fitting results
would have been much harder to derive: DF analysis would
have been uniquely useful in that context.
Exploring the module
Try experimenting with different thresholds and
different values for A, C and E in order to see
how the MZ and DZ co-twin means reflect these
components of variance. For example, notice
however both co-twin means regress back to near
the population mean when then trait is
predominantly determined by nonshared environmental
sources of variation.
Regression model
Typically DF analysis is performed within a linear regression
model. The heart of it is embodied above, however: the
comparison of co-twin means. The regression estimates the
same quantities but also allows for confidence intervals
to be put around the estimates in a very straight-forward
manner, using standard statistical packages such as SPSS.
See the Appendix and Box 8.1 of the Behavioral Genetics
text for further explanation of DF analysis.
Heterogeneous Populations
But what if DF analysis of extreme probands and their
co-twins gives significantly different results to standard
ACE analyses for the same trait in similar populations? In
this simulation we know that we have simulated a
homogeneous population, so the DF estimates should
always mirror the population parameters, give or take
sampling error. The population is homogeneous in the sense
that the population parameters describe the sample equally
well at any point along the distribution. It might be that there
is 'something special' about the individuals with extreme scores
that might represent a different aetiology compared to the
unselected population acting in these individuals, however.
In this case, we might expect DF analysis to reflect this
difference, relative to analysis of variation in the trait
throughout the normal range. Such 'special' factors might
include rare genes of major effect, certain strong
epistatic or gene by environmental effects, or admixed
populations. Such factors might imply that the causes of
disability, for example, (as indexed by proband status)
are quantitatively or qualitatively different from the
causes of normal variation (as indexed by standard analysis
of individual differences).
Site created by S.Purcell, last updated 13.11.2000
|