SNP imputation and association testing
This page describes PLINK functions to impute SNPs that are
not directly genotyped but are present on a reference panel such as
the HapMap. As well as imputing genotypes (either making the most
likely call, or outputting the posterior probabilities of each
genotype, or the dosage) some simple association tests can be framed
in this context. These methods do not necessarily need whole-genome
data, however: with dense SNP genotyping in a particular region, they
can still be applied straightforwardly. They use the proxy association
set of commands.
In the text below, an observed SNP refers to one that was genotyped
in both the reference and the WGAS sample. An imputed SNP refers
to one that only appears in the reference panel.
IMPORTANT The approach is a simple one, essentially
based around the concept of multi-marker tagging, designed to provide
a straightforward albeit quick and dirty approach to
imputation for common variants. It is unlikely to be optimal,
particularly for rarer alleles, when compared to other imputation
methods available. These features are also still in beta
meaning that they are still under development. As such, you are
advised only to use these routines in an exploratory manner, if at
all.
Basic steps for using PLINK imputation functions
The first step is to create a single fileset with the reference panel
merged in with your dataset. We assume that the HapMap CEU founders
will be used in this example.
HINT A PLINK binary fileset of the Phase 2 HapMap
data can be downloaded from here. For studies
of individuals of European ancestry, the CEU founder fileset will be
the one to download from that link.
Given the HapMap data, hapmap-ceu.*
or hapmap-ceu-all.*, for example, you merge in your WGAS data
as follows,
./plink --bfile hapmap-ceu --bmerge mydata.bed mydata.bim mydata.fam --make-bed --out merged
In imputation mode, the reference panel is identified by giving
those individuals a missing value for the phenotype. You will
therefore need to edit the .fam files so that the 6th column
(phenotype) is 0 for all HapMap individuals and 1 (control)
or 2 (case) for the individuals in your sample. If you have trio
data, make sure that no observed individuals have missing phenotypes
(i.e. in a TDT context, set parents to controls rather than leaving a
missing phenotype code).
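The phenotype recoding described above can be scripted. The following is
a minimal sketch, not part of PLINK: the function name and the idea of
passing the reference individuals as a set of (FID, IID) pairs are
assumptions made for illustration. It works on the text lines of a
six-column .fam file and refuses to continue if a sample individual has
a phenotype other than 1 or 2.

```python
def mark_reference_panel(fam_lines, reference_ids):
    """Return .fam lines with phenotype 0 for reference-panel individuals.

    fam_lines     : iterable of .fam lines (FID IID PAT MAT SEX PHENOTYPE)
    reference_ids : set of (FID, IID) tuples naming the reference panel
                    (hypothetical input; build it from the HapMap .fam file)
    """
    recoded = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) != 6:
            raise ValueError("expected 6 columns in .fam line: %r" % line)
        fid, iid, pat, mat, sex, pheno = fields
        if (fid, iid) in reference_ids:
            pheno = "0"  # missing phenotype marks the reference panel
        elif pheno not in ("1", "2"):
            # Sample individuals must be coded 1 (control) or 2 (case)
            raise ValueError("sample individual %s %s has phenotype %s"
                             % (fid, iid, pheno))
        recoded.append(" ".join((fid, iid, pat, mat, sex, pheno)))
    return recoded
```

The returned lines can be written back out as the new .fam file; keep a
copy of the original file before overwriting it.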
Strand issues
The HapMap SNPs are all given on the +ve strand, so it is your
responsibility to ensure that your data are aligned to the same
strand for the merge to work. The --flip
command can help to change strand. If there are strand problems, PLINK
will report a list of SNPs that did not match in terms of strand.
Naturally, any A/T or C/G SNPs in your dataset will potentially go
unflagged, because the complement of A/T is T/A and of C/G is G/C, so
a flipped SNP shows the same pair of allele codes.
As such, it is always a good idea to check
allele frequencies between the HapMap and the WGAS sample to identify
grossly deviant SNPs and/or undetected strand issues (i.e. create an
alternate phenotype file with the HapMap individuals coded as controls
and the rest of WGAS data as cases, and run a basic association
command). The --flip-scan
command can also help to detect some incorrectly aligned variants.
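Because A/T and C/G SNPs cannot be caught by an allele-code mismatch, it
can help to list them up front so that they get extra scrutiny in the
allele frequency comparison. The sketch below is an illustration only
(the function name is hypothetical); it assumes the standard six-column
.bim layout, with the SNP identifier in column 2 and the allele codes in
columns 5 and 6.

```python
# Allele pairs that look identical on either strand
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def strand_ambiguous_snps(bim_lines):
    """Return identifiers of A/T and C/G SNPs found in .bim lines."""
    flagged = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        snp = fields[1]
        alleles = frozenset((fields[4].upper(), fields[5].upper()))
        if alleles in AMBIGUOUS_PAIRS:
            flagged.append(snp)
    return flagged
```

The resulting list could be written to a file, one SNP per line, and
used to pull out these SNPs when inspecting the frequency comparison.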
NOTE Merging whole-genome data will create a very large dataset
and take some time. Particularly if you have a parallel computing
environment available, you might want to split the files and the
merge procedures up by chromosome, e.g. first download the archive
with the HapMap CEU founder fileset split by chromosome, then
split your own data and merge each chromosome separately:
./plink --bfile mydata --chr 1 --make-bed --out data-1
./plink --bfile mydata --chr 2 --make-bed --out data-2
etc, followed by
./plink --bfile hapmap-ceu-chr1 --bmerge data-1.bed data-1.bim data-1.fam --make-bed --out merged-1
./plink --bfile hapmap-ceu-chr2 --bmerge data-2.bed data-2.bim data-2.fam --make-bed --out merged-2
This will create 22 separate filesets
(merged-1, merged-2, etc) and all the following
routines can then be run separately on each.
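Rather than typing the split and merge commands 22 times, they can be
generated programmatically. This sketch only builds the argument lists;
it does not run PLINK. The fileset names (mydata, hapmap-ceu-chrN,
data-N, merged-N) follow the examples above, and the function name is an
assumption made for illustration.

```python
def per_chromosome_commands(chromosomes=range(1, 23), plink="./plink"):
    """Build (split, merge) PLINK argument lists for each chromosome."""
    commands = []
    for c in chromosomes:
        # Extract one chromosome from the WGAS fileset
        split = [plink, "--bfile", "mydata", "--chr", str(c),
                 "--make-bed", "--out", "data-%d" % c]
        # Merge it into the matching per-chromosome HapMap fileset
        merge = [plink, "--bfile", "hapmap-ceu-chr%d" % c,
                 "--bmerge", "data-%d.bed" % c, "data-%d.bim" % c,
                 "data-%d.fam" % c,
                 "--make-bed", "--out", "merged-%d" % c]
        commands.append((split, merge))
    return commands


if __name__ == "__main__":
    for split, merge in per_chromosome_commands():
        print(" ".join(split))
        print(" ".join(merge))
```

Each argument list can be passed to subprocess.run(cmd, check=True), or
the printed lines can be submitted as separate jobs in a parallel
environment.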
Combined imputation and association analysis of case/control data
Given the merged fileset, containing both the reference panel and the
(more sparse) WGAS samples, PLINK will attempt to perform case/control
association for every SNP (both observed and imputed) with the following command:
./plink --bfile merged-1 --proxy-assoc all
which will generate an output file
plink.assoc.proxy
with the fields
CHR Chromosome code
SNP SNP identifier
BP Physical position (base-pairs)
A1 First allele code (not necessarily minor allele)
A2 Second allele code (not necessarily major allele)
GENO Genotyping rate in entire sample and reference panel
NPRX Number of proxy SNPs selected
INFO Information content metric
F_A Allele 1 frequency in cases
F_U Allele 1 frequency in controls
OR Odds ratio
P Significance value of case/control association test
The fields INFO and NPRX indicate how well PLINK
managed, if at all, to impute the SNP. If NPRX is zero, then
the SNP could not be imputed even poorly. INFO typically ranges
between 0 and 1, although it can occasionally be greater than 1. A
higher value generally means a better imputed SNP; roughly speaking,
it is probably good practice to look only at imputed SNPs with an
INFO value greater than 0.8 or so. More specific details on these
metrics will be posted soon.
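The rule of thumb above (at least one proxy, INFO above roughly 0.8) is
easy to apply to the lines of plink.assoc.proxy. The following is a
sketch under the assumption that the file is whitespace-delimited with
the header fields listed above; the function name and the default
cut-off argument are illustrative, not PLINK options.

```python
def well_imputed(lines, min_info=0.8):
    """Keep rows with NPRX > 0 and INFO > min_info.

    lines : iterable of text lines, the first being the header row.
    Returns a list of dicts keyed by the header field names.
    """
    rows = iter(lines)
    header = next(rows, "").split()
    kept = []
    for line in rows:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or truncated lines
        record = dict(zip(header, fields))
        try:
            nprx = int(record["NPRX"])
            info = float(record["INFO"])
        except (KeyError, ValueError):
            continue  # e.g. NA where no imputation was possible
        if nprx > 0 and info > min_info:
            kept.append(record)
    return kept
```

Typical use would be to open plink.assoc.proxy, pass the file object to
well_imputed(), and then sort the kept rows by the P field.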
Modifying options for basic imputation/association testing
One of the most important modifying options for
the --proxy-assoc test is --proxy-drop, which means
that the observed SNPs are dropped, one at a time, from the
WGAS sample when they are tested as the reference SNP (i.e. they will
be re-imputed given the surrounding SNPs). That is, the command,
./plink --bfile merged-1 --proxy-assoc all --proxy-drop
would mean that no test statistic in plink.assoc.proxy involves
any observed genotype for the SNP being tested. Running this
association test with the --proxy-drop command is therefore a
good idea: it provides both a means to assess the performance of the
imputation (by comparing the results against those from the observed
genotypes) and an extra level of QC (if you still see a significant
result, it cannot be due to technical artifacts specific to that SNP,
as no observed genotypes were used in the test for that SNP).
The reason not to always use --proxy-drop
with --proxy-assoc (given that the basic --assoc
command more straightforwardly calculates association for observed
SNPs) is that, when an observed SNP has a reasonable amount of missing
genotype data, you may want to use imputation to recover it. (In this
case, though, there is perhaps less need for a separate reference
panel, and so the standard proxy association approach, without
any reference panel, can be used.)
Parameters modifying selection of proxies
Imputation in this context works simply by selecting a set of proxy
SNPs (using the reference panel information) and then phasing these
SNPs in both reference panel and WGAS sample jointly. By grouping
haplotypes, the corresponding single SNP tests of imputed
SNPs can then be straightforwardly performed.
There are a number of parameters that impact the choice of proxy
SNPs. Fine tuning of these parameters is still in progress. These
parameters will be described in more detail shortly. For now, the
default parameters should be sufficient in most cases. See
the proxy association page for a description of
the parameters, the defaults, and how they can be changed.
Imputing discrete genotype calls
The association test described above performs imputation on-the-fly
and does not save the imputed genotype calls or probabilities. To do
so, and to generate other metrics of imputation performance, use
the --proxy-impute command.
To generate summary statistics for the imputation performance of each
SNP, use the command
./plink --bfile merged-1 --proxy-impute all
which produces a file
plink.proxy.impute
which has the fields
CHR Chromosome
SNP SNP ID
NPRX Number of proxy SNPs
INFO Information metric
TOTAL_N Total number of WGAS sample genotypes (excl. reference panel)
OBSERVD Proportion of these with observed genotypes
IMPUTED Proportion of these imputed
OVERLAP Proportion of these with both an observed and an imputed genotype
CONCORD Concordance rate in the overlapping set
Here are some example lines:
CHR SNP NPRX INFO TOTAL_N OBSERVD IMPUTED OVERLAP CONCORD
18 rs7233673 5 0.993 3469 0 0.991 0 NA
18 rs7233597 5 0.998 3469 0.999 0.993 0.992 0.986
18 rs7505507 4 0.632 3469 0.999 0.332 0.332 0.891
e.g. the first line represents an unobserved SNP, for which 99% of
individuals were imputed; the second line is an observed SNP for
which, if we drop it and try to re-impute, 99.3% of genotypes are
imputed; the concordance rate between imputed and genotyped calls is
98.6% for this SNP. The final line represents a SNP that did not
perform as well: we only impute a third of genotypes and these are
less than 90% concordant (this was an observed SNP also). In this
case, the INFO score is lower (below 0.8) for this third SNP than for
the other two: at the standard 0.8 threshold this SNP would have been
ignored in any case.
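To make the proportions in these example lines more concrete, they can
be converted back into approximate genotype counts. This is a rough
sketch: the function name is hypothetical, and because the file reports
proportions rounded to three decimals, the counts are approximations
only.

```python
HEADER = "CHR SNP NPRX INFO TOTAL_N OBSERVD IMPUTED OVERLAP CONCORD"


def approximate_counts(line, header=HEADER):
    """Convert one plink.proxy.impute row into approximate counts."""
    record = dict(zip(header.split(), line.split()))
    total = int(record["TOTAL_N"])
    n_imputed = round(float(record["IMPUTED"]) * total)
    n_overlap = round(float(record["OVERLAP"]) * total)
    if record["CONCORD"] == "NA":
        n_discordant = None  # no overlapping genotypes to compare
    else:
        n_discordant = round(n_overlap * (1.0 - float(record["CONCORD"])))
    return {"SNP": record["SNP"], "imputed": n_imputed,
            "overlap": n_overlap, "discordant": n_discordant}
```

For the third example line, this gives roughly 1152 imputed genotypes
out of 3469, of which about 126 disagree with the observed call.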
The required confidence threshold for making a call can be changed with,
for example,
--proxy-impute-threshold 0.8
(it is set to 0.95 by default currently).
To give genotype-specific concordances, use the additional option:
--proxy-genotypic-concordance
then a set of extra fields are appended to the plink.proxy.impute output:
F_AA Frequency of true 'AA' genotype
I_AA Proportion imputed for true AA genotype
C_AA Concordance rate for true AA genotype
F_AB As above, for 'AB' genotype
... ...
The rationale is that, for a very rare SNP, overall concordance would
be high just by chance, even if none of the rare genotypes were
correctly called. This option is therefore useful to get a better
picture of imputation performance (when the observed genotype is also
available).
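A small calculation shows why overall concordance can mislead for rare
SNPs. The sketch below assumes Hardy-Weinberg genotype proportions and a
deliberately useless imputation that calls every individual homozygous
for the common allele; both are assumptions made purely for
illustration, as are the function and key names.

```python
def call_everyone_common(maf):
    """Concordance when every genotype is called common/common.

    maf : frequency of the rare allele (0 to 0.5).
    Returns (overall concordance, genotype frequencies,
    genotype-specific concordances).
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    common = 1.0 - maf
    # Hardy-Weinberg genotype frequencies (an assumption)
    freq = {"rare/rare": maf * maf,
            "rare/common": 2.0 * maf * common,
            "common/common": common * common}
    # Only true common/common genotypes are called correctly
    concordance = {"rare/rare": 0.0, "rare/common": 0.0,
                   "common/common": 1.0}
    overall = sum(freq[g] * concordance[g] for g in freq)
    return overall, freq, concordance
```

With a rare allele frequency of 0.01, the overall concordance is about
0.98 even though the concordance for both rare-allele genotypes is zero,
which is exactly the situation the genotype-specific fields expose.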
In addition, if
--proxy-show-proxies
is also specified, an extra PROXIES field will appear
in plink.proxy.impute showing the specific SNPs selected.
To perform imputation and save the dosages (fractional count of 0 to 2 alleles for each genotype),
add the --proxy-dosage option:
./plink --bfile merged-1 --proxy-impute all --proxy-dosage
which produces a file
plink.proxy.impute.dosage
in which each imputed SNP is represented as a row (the file has no header row), with the fields
SNP Identifier
Allele 1 code
Allele 2 code
Information content score for SNP
Allele dosage for first individual in sample
Allele dosage for second individual in sample
...
Allele dosage for final individual in sample
This file can then be analysed outside of PLINK.
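As a starting point for analysis outside PLINK, the following sketch
reads one row of the dosage file using the field order listed above and
derives an allele frequency estimate from the dosages. The function
names are hypothetical. Which allele the dosage counts, and how a
missing dosage would be written, are not stated here, so the code
assumes numeric values throughout and will raise an error otherwise.

```python
def parse_dosage_row(line):
    """Split one dosage row into (snp, a1, a2, info, dosages)."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected SNP, two alleles, INFO and dosages")
    snp, a1, a2 = fields[0], fields[1], fields[2]
    info = float(fields[3])
    # One dosage per individual, in sample order; float() raises
    # ValueError on any non-numeric entry
    dosages = [float(value) for value in fields[4:]]
    return snp, a1, a2, info, dosages


def allele_frequency(dosages):
    """Estimate the frequency of the counted allele from dosages (0 to 2)."""
    if not dosages:
        raise ValueError("no dosages supplied")
    return sum(dosages) / (2.0 * len(dosages))
```

For example, four individuals with dosages 0, 1, 2 and 0.5 give an
estimated allele frequency of 3.5 / 8 = 0.4375.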
To perform imputation and save the called (most likely) genotypes in a new fileset,
add the --make-bed option;
./plink --bfile merged-1 --proxy-impute all --make-bed --out imputed-1
By default, PLINK will only replace genotypes that were missing in the original WGAS sample;
to make PLINK re-impute all genotypes (whether they were actually observed or not), add the --proxy-replace
flag,
./plink --bfile merged-1 --proxy-impute all --proxy-replace --make-bed --out imputed-1
Note Future versions will do obvious things, like
let you generate proxy-impute and proxy-assoc output files in the
same run (you can't now).
Important Making discrete calls for the most likely
genotype will necessarily introduce error and bias for all but
perfectly imputed SNPs. As such, one should take care in the analysis
and interpretation of imputed datasets: they should not be treated
as if they were directly observed with certainty. In particular, one
should be cautious when combining multiple imputed files,
especially if different platforms were used and/or if the files also
differ by disease state. Indeed, such an analysis is currently not
recommended.
Verbose output options
To get verbose output for a single SNP in the association mode, give
the specific SNP name instead of the all keyword:
--proxy-assoc rs123235
See the web-page on proxy association methods
to interpret this output.
You can also specify verbose imputation for one or more SNPs, e.g.
--proxy-impute rs8096534 --proxy-verbose
which will add extra lines to the file plink.proxy.impute
representing the actual calls per person:
rs8096534 78-03C15376 TBI-78-03C15376-1 01 01 0 1 0
rs8096534 78-03C15377 TBI-78-03C15377-1 00 00 1 0 0
rs8096534 78-03C15378 TBI-78-03C15378-1 01 01 0 1 0
rs8096534 78-03C15398 TBI-78-03C15398-1 00 00 1 0 0
rs8096534 78-03C15448 TBI-78-03C15448-1 01 01 0 1 0
rs8096534 78-03C20292 TBI-78-03C20292-1 11 11 0 0 1
rs8096534 78-03C20300 TBI-78-03C20300-1 11 10 0 0.08199 0.918
rs8096534 78-03C20317 TBI-78-03C20317-1 01 01 0 1 0
rs8096534 78-03C20335 TBI-78-03C20335-1 01 01 0 1 0
...
where the fields are (note: currently there is no header for these fields)
SNP SNP identifier
FID Family ID
IID Individual ID
OBS Observed genotype (coded 00,01,11 = AA,AB,BB, 10 = missing)
IMP Imputed genotype (as above)
PAA Probability of 'AA' genotype
PAB Probability of 'AB' genotype
PBB Probability of 'BB' genotype (i.e. these last 3 numbers sum to 1.00)
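The relationship between the three probabilities and the imputed call
can be sketched as follows. This is an illustration of a simple
thresholding rule, not PLINK's actual code: it assumes that a call is
made only when the largest probability reaches the confidence threshold
(0.95 by default, as noted earlier), and the function names are
hypothetical.

```python
GENOTYPE_CODES = ("00", "01", "11")  # AA, AB, BB
MISSING_CODE = "10"


def parse_verbose_row(line):
    """Parse one per-person line: SNP FID IID OBS IMP PAA PAB PBB."""
    snp, fid, iid, obs, imp, paa, pab, pbb = line.split()
    return {"SNP": snp, "FID": fid, "IID": iid, "OBS": obs, "IMP": imp,
            "probs": (float(paa), float(pab), float(pbb))}


def call_genotype(probs, threshold=0.95):
    """Return the most likely genotype code, or missing if below threshold."""
    best = max(range(3), key=lambda i: probs[i])
    if probs[best] >= threshold:  # assumed rule: call only at or above threshold
        return GENOTYPE_CODES[best]
    return MISSING_CODE
```

Under this rule, the individual above with probabilities 0, 0.08199 and
0.918 is left uncalled (code 10) at the default threshold, which agrees
with the IMP column shown; lowering the threshold to 0.9 would give the
call 11.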
In addition, after these lines you will see a table of counts which
summarises the imputed calls versus the true values (if known). Ideally, you would
therefore observe high numbers down the diagonal (the columns are in the same order as the rows):
Imputation matrix (rows observed, columns imputed)
A/A 292 2 0 1
A/G 0 1389 8 55
G/G 0 5 1585 130
0/0 1 1 0 0
and this is then followed by the normal, single-line non-verbose report for that SNP
CHR SNP NPRX INFO TOTAL_N OBSERVD IMPUTED OVERLAP CONCORD
18 rs8096534 5 0.961 3469 0.999 0.946 0.946 0.995
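The summary fields can be related back to the imputation matrix with a
little arithmetic. The sketch below assumes that the last row and column
(0/0) hold missing genotypes and that the fields are defined as
described earlier; the function name is hypothetical. With those
assumptions it reproduces the values in the summary line for this SNP.

```python
def summarise_matrix(matrix):
    """Derive summary fields from a 4x4 imputation matrix.

    matrix[i][j] counts individuals with observed genotype i and imputed
    genotype j; index 3 is the missing (0/0) category.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(sum(row) for row in matrix[:3])        # non-missing rows
    imputed = sum(sum(row[:3]) for row in matrix)         # non-missing columns
    overlap = sum(sum(row[:3]) for row in matrix[:3])     # both available
    concordant = sum(matrix[i][i] for i in range(3))      # diagonal
    return {"TOTAL_N": total,
            "OBSERVD": round(observed / total, 3),
            "IMPUTED": round(imputed / total, 3),
            "OVERLAP": round(overlap / total, 3),
            "CONCORD": round(concordant / overlap, 3) if overlap else None}


# Counts copied from the imputation matrix for rs8096534
example = [[292, 2, 0, 1],
           [0, 1389, 8, 55],
           [0, 5, 1585, 130],
           [1, 1, 0, 0]]

if __name__ == "__main__":
    print(summarise_matrix(example))
```

Here 3266 of the 3281 individuals with both an observed and an imputed
genotype lie on the diagonal, giving the concordance of 0.995.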
Although you are able to specify --proxy-impute all
and --proxy-verbose together, be warned that this will
typically result in a very large output file for real data. It is
better used for single SNPs in its current format.