SNP imputation and association testing

This page describes PLINK functions to impute SNPs that are
not directly genotyped but are present on a reference panel such as
the HapMap. As well as imputing genotypes (either making the most
likely call, or outputting the posterior probabilities of each
genotype, or the dosage) some simple association tests can be framed
in this context. These methods do not necessarily need whole-genome
data to work, however: with dense SNP genotyping in a particular
region, they can still be applied straightforwardly. These
methods utilise the proxy association set of commands.
In the text below, an observed SNP refers to one that was genotyped 
in both the reference and the WGAS sample. An imputed SNP refers
to one that only appears in the reference panel.
IMPORTANT The approach is a simple one, essentially
based around the concept of multi-marker tagging, designed to provide
a straightforward albeit quick and dirty approach to
imputation for common variants. It is unlikely to be optimal,
particularly for rarer alleles, when compared to other imputation
methods available. These features are also still in beta
meaning that they are still under development. As such, you are
advised only to use these routines in an exploratory manner, if at
all.

Basic steps for using PLINK imputation functions

The first step is to create a single fileset with the reference panel 
merged in with your dataset. We assume that the HapMap CEU founders 
will be used in this example. 
HINT A PLINK binary fileset of the Phase 2 HapMap
data can be downloaded from here. For studies
of individuals of European ancestry, the CEU founder fileset will be
the one to download from that link.
Given the HapMap data, hapmap-ceu.*
or hapmap-ceu-all.*, for example, you merge in your WGAS data
as follows,
 ./plink --bfile hapmap-ceu --bmerge mydata.bed mydata.bim mydata.fam --make-bed --out merged
In imputation mode, the reference panel is denoted by making
those individuals have a missing value for the
phenotype. You will therefore need to edit the .fam
files to make the 6th column (phenotype) 0 for all HapMap
individuals and 1 (control) or 2 (case) for the
individuals in your sample.  If you have trio data, make sure that no
observed individuals have missing phenotypes (i.e. set parents to
controls in a TDT context, rather than have a missing phenotype code).

Strand issues

The HapMap SNPs are all given on the +ve strand, and so it is your
responsibility to ensure that your data are aligned also, for the
merge to work.  The --flip
command can help changing strand. If there are strand problems, PLINK
will report a list of SNPs that did not match in terms of strand.
Naturally, if there are A/T or C/G SNPs in your dataset, strand flips
at these will potentially go unflagged, because the allele codes are the
same on both strands.  As such, it is always a good idea to check
allele frequencies between the HapMap and the WGAS sample to identify
grossly deviant SNPs and/or undetected strand issues (i.e. create an
alternate phenotype file with the HapMap individuals coded as controls
and the rest of WGAS data as cases, and run a basic association
command). The --flip-scan
command can also help to detect some incorrectly aligned variants.
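
Both the phenotype recoding described in the basic steps and the frequency check suggested here depend on knowing which individuals belong to the reference panel. The following is a minimal sketch, assuming the reference individuals can be identified by the FID/IID pairs in the reference .fam file; the function names and file arguments are hypothetical and not part of PLINK.

```shell
# Hypothetical helpers (not PLINK commands). Both take the reference
# panel .fam file first and the merged .fam file second.

# mark_reference REF_FAM MERGED_FAM
# Print MERGED_FAM with the phenotype (column 6) set to 0 for every
# individual whose FID/IID pair appears in REF_FAM. Other rows are
# printed unchanged, so they must already be coded 1 or 2.
mark_reference() {
    awk 'NR == FNR { ref[$1, $2] = 1; next }
         ($1, $2) in ref { $6 = 0 }
         { print }' "$1" "$2"
}

# strand_check_pheno REF_FAM MERGED_FAM
# Print an alternate phenotype file (FID IID PHENO) that codes
# reference individuals as controls (1) and everyone else as cases (2).
strand_check_pheno() {
    awk 'NR == FNR { ref[$1, $2] = 1; next }
         { print $1, $2, ((($1, $2) in ref) ? 1 : 2) }' "$1" "$2"
}
```

For example, mark_reference hapmap-ceu.fam merged.fam > merged.fam.new writes a recoded copy that can replace merged.fam once checked. The output of strand_check_pheno can be saved to a file and supplied as an alternate phenotype (PLINK's --pheno option) for a basic association run.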
 
NOTE Merging with the whole-genome reference panel will create a very
large dataset and take some time. Particularly if you have a parallel
computing environment available, you might want to split the files and
the merge procedures up by chromosome, e.g. first download the archive
with the HapMap CEU founder fileset split by chromosome, then
extract each chromosome from your data:
./plink --bfile mydata --chr 1 --make-bed --out data-1
 
./plink --bfile mydata --chr 2 --make-bed --out data-2
etc, followed by 
./plink --bfile hapmap-ceu-chr1 --bmerge data-1.bed data-1.bim data-1.fam  --make-bed --out merged-1
 
./plink --bfile hapmap-ceu-chr2 --bmerge data-2.bed data-2.bim data-2.fam  --make-bed --out merged-2
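
Rather than typing each pair of commands, the per-chromosome steps can be generated in a loop. This is a minimal sketch that only prints the commands (a dry run), assuming the fileset names used in the examples above; the function name is hypothetical.

```shell
# print_chr_commands DATA REF_PREFIX
# Dry run: for chromosomes 1 to 22, print the split command and the
# merge command shown above. Nothing is executed here; pipe the output
# to sh (or submit each pair as a separate job) to actually run it.
print_chr_commands() {
    data=$1   # WGAS fileset, e.g. mydata
    ref=$2    # reference prefix up to the chromosome number, e.g. hapmap-ceu-chr
    chr=1
    while [ "$chr" -le 22 ]; do
        echo "./plink --bfile ${data} --chr ${chr} --make-bed --out data-${chr}"
        echo "./plink --bfile ${ref}${chr} --bmerge data-${chr}.bed data-${chr}.bim data-${chr}.fam --make-bed --out merged-${chr}"
        chr=$((chr + 1))
    done
}
```

Calling print_chr_commands mydata hapmap-ceu-chr lists 44 commands, which can be inspected before being run.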
This will create 22 separate filesets
(merged-1, merged-2, etc) and all the following
routines can then be run separately on each.

Combined imputation and association analysis of case/control data

Given the merged fileset, containing both the reference panel and the
(more sparse) WGAS samples, PLINK will attempt to perform case/control
association for every SNP (both observed and imputed) with the following command:
 ./plink --bfile merged-1 --proxy-assoc all 
which will generate an output file
     plink.assoc.proxy
with the fields
     CHR     Chromosome code
     SNP     SNP identifier
     BP      Physical position (base-pairs)
     A1      First allele code (not necessarily minor allele)
     A2      Second allele code (not necessarily major allele)
     GENO    Genotyping rate in entire sample and reference panel
     NPRX    Number of proxy SNPs selected
     INFO    Information content metric
     F_A     Allele 1 frequency in cases
     F_U     Allele 1 frequency in controls
     OR      Odds ratio 
     P       Significance value of case/control association test
The fields INFO and NPRX refer to how well PLINK
managed, if at all, to impute the SNP.  If NPRX is zero, then
the SNP could not be imputed at all, even poorly.  INFO typically ranges
between 0 and 1, although it can occasionally be greater than 1. A
higher value generally means a better imputed SNP; roughly speaking,
only looking at imputed SNPs with an INFO value greater than
0.8 or so is probably good practice. More specific details on these
metrics will be posted soon.

Modifying options for basic imputation/association testing

One of the most important modifying options for
the --proxy-assoc test is --proxy-drop, which means
that the observed SNPs are dropped, one at a time, from the
WGAS sample when they are tested as the reference SNP (i.e. they will
be re-imputed given the surrounding SNPs). That is, the command,
 ./plink --bfile merged-1  --proxy-assoc all --proxy-drop
would mean that no test statistic
in plink.assoc.proxy involves any observed
genotype for the SNP being tested. Running this association
test with the --proxy-drop command is therefore a good idea, as it
provides both a means to assess the performance of the imputation (by
comparing the results against those from the observed genotypes)
and an extra level of QC (if you still see a significant
result, it cannot be due to technical artifacts specific to that SNP,
as no observed genotypes were used in the test for that SNP).
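
One way to make the comparison against the observed genotypes is to line up the P values from a standard --assoc run with those from the --proxy-drop run. This is a minimal sketch, assuming both result files have a header row containing SNP and P columns; the function name and the output column labels are hypothetical.

```shell
# compare_pvalues ASSOC PROXY
# Join two PLINK result files on the SNP column and print, for every
# SNP present in both, the P value from each file. Columns are located
# by header name, so their position in each file does not matter.
compare_pvalues() {
    awk 'BEGIN { print "SNP P_OBS P_PROXY" }
         FNR == 1 {
             # Header row of either file: find the SNP and P columns.
             for (i = 1; i <= NF; i++) {
                 if ($i == "SNP") s = i
                 if ($i == "P") p = i
             }
             next
         }
         NR == FNR { obs[$s] = $p; next }          # first file: store P
         ($s in obs) { print $s, obs[$s], $p }     # second file: report pairs
        ' "$1" "$2"
}
```

For example, compare_pvalues plink.assoc plink.assoc.proxy gives one row per SNP found in both files. Large discrepancies for a SNP with a high INFO score would be worth a closer look.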
  The reason not to always use --proxy-drop
with --proxy-assoc (given that the basic --assoc
command more straightforwardly calculates association for observed
SNPs) is that, if there is a reasonable amount of missing genotype data for
an observed SNP, you may want to use imputation to recover
it. (Although, in this case, there is perhaps less need to use a
separate reference panel, and so the
standard proxy association approach, without
any reference panel, can be used.)

Parameters modifying selection of proxies

Imputation in this context works simply by selecting a set of proxy
SNPs (using the reference panel information) and then phasing these
SNPs in both reference panel and WGAS sample jointly. By grouping
haplotypes, the corresponding single SNP tests of imputed
SNPs can then be straightforwardly performed.
There are a number of parameters that impact the choice of proxy
SNPs. Fine tuning of these parameters is still in progress. These
parameters will be described in more detail shortly.  For now, the
default parameters should be sufficient in most cases. See
the proxy association page for a description of
the parameters, the defaults, and how they can be changed.

Imputing discrete genotype calls

The association test described above performs imputation on-the-fly
and does not save the imputed genotype calls or probabilities. To do
so, and to generate other metrics of imputation performance, use
the --proxy-impute command.
To generate summary statistics for the imputation performance of each
SNP, use the command
 ./plink --bfile merged-1  --proxy-impute all
which produces a file
     plink.proxy.impute
which has the fields
     CHR       Chromosome
     SNP       SNP ID
     NPRX      Number of proxy SNPs
     INFO      Information metric
     TOTAL_N   Total number of WGAS sample genotypes (exc. reference panel)
     OBSERVD   Proportion of these with observed genotypes
     IMPUTED   Proportion of these imputed
     OVERLAP   Proportion of these with both an observed and an imputed genotype
     CONCORD   Concordance rate in the overlapping set
Here are some example lines:
 CHR             SNP NPRX     INFO  TOTAL_N  OBSERVD  IMPUTED  OVERLAP  CONCORD 
  18       rs7233673    5    0.993     3469        0    0.991        0       NA 
  18       rs7233597    5    0.998     3469    0.999    0.993    0.992    0.986 
  18       rs7505507    4    0.632     3469    0.999    0.332    0.332    0.891 
For example, the first line represents an unobserved SNP, for which 99% of
individuals were imputed; the second line is an observed SNP, but if
we drop it and try to re-impute, we impute 99.3% of genotypes; the concordance rate
between imputed and genotyped calls is 98.6% for this SNP. The final line
represents a SNP that did not perform as well: we only impute a third
of genotypes and these are less than 90% concordant (this was an
observed SNP also).  In this case, we see the INFO score is
lower (below 0.8) for this third SNP than for the other two: at the
suggested 0.8 threshold this SNP would have been ignored in any case.
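
Applying the INFO threshold to a results file can be done with a short filter. This is a minimal sketch, assuming the file has a header row with NPRX and INFO columns (as both plink.assoc.proxy and plink.proxy.impute do); the function name is hypothetical and the 0.8 default simply follows the rule of thumb given earlier.

```shell
# filter_info FILE [MIN_INFO]
# Print the header row, then only those rows with at least one proxy
# (NPRX > 0) and an INFO score greater than MIN_INFO (default 0.8).
# Non-numeric INFO values such as NA are treated as 0 and dropped.
filter_info() {
    awk -v min="${2:-0.8}" '
        NR == 1 {
            # Locate the columns by name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "INFO") c = i
                if ($i == "NPRX") n = i
            }
            print
            next
        }
        $n + 0 > 0 && $c + 0 > min + 0' "$1"
}
```

For example, filter_info plink.proxy.impute keeps the first two of the three example SNPs above and drops rs7505507, whose INFO score is 0.632.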
  
The required confidence threshold for making a call can be changed with, 
for example, 
     --proxy-impute-threshold 0.8
(it is set to 0.95 by default currently).
  
To give genotype-specific concordances, use the additional option:
     --proxy-genotypic-concordance
A set of extra fields is then appended to the plink.proxy.impute output:
     F_AA     Frequency of true 'AA' genotype
     I_AA     Proportion imputed for true AA genotype
     C_AA     Concordance rate for true AA genotype
     F_AB     As above, for 'AB' genotype
     ...      ...
This matters because, for a very rare SNP, overall concordance would be high just
by chance, even if none of the rare genotypes were correctly
called. This option is therefore useful to get a better picture of
imputation performance (when the observed genotype is also available).
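
To put a number on this, consider the concordance obtained by simply calling every individual homozygous for the common allele. The sketch below assumes Hardy-Weinberg proportions, which is an assumption added here for illustration, so the baseline is (1 - f)^2 for a rare allele of frequency f. The function name is hypothetical.

```shell
# baseline_concordance FREQ
# Overall concordance achieved by calling everyone homozygous for the
# common allele, for a rare allele of frequency FREQ, assuming
# Hardy-Weinberg proportions: (1 - FREQ)^2.
baseline_concordance() {
    awk -v f="$1" 'BEGIN { printf "%.4f\n", (1 - f) * (1 - f) }'
}

baseline_concordance 0.01   # prints 0.9801
```

That is, with a 1% allele frequency, an overall concordance of about 98% can be reached without calling a single rare genotype correctly, which is why the genotype-specific rates are informative.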
  
In addition, if
     --proxy-show-proxies
is also specified, an extra PROXIES field will appear
in plink.proxy.impute showing the specific SNPs selected.
  
To perform imputation and save the dosages (fractional count of 0 to 2 alleles for each genotype), 
add the --proxy-dosage option;
 ./plink --bfile merged-1  --proxy-impute all  --proxy-dosage
which produces a file
     plink.proxy.impute.dosage
in which each imputed SNP is represented as a row; the fields (the file has no header row) are
     SNP Identifier
     Allele 1 code
     Allele 2 code
     Information content score for SNP
     Allele dosage for first individual in sample
     Allele dosage for second individual in sample
     ...
     Allele dosage for final individual in sample
This file can then be analysed outside of PLINK.
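
As a simple example of working with this file outside of PLINK, the sketch below summarises each SNP by its mean dosage. It assumes the layout listed above (four leading fields followed by one dosage per individual) and that every dosage field is numeric; how missing dosages are written, if at all, is not covered here. The function name is hypothetical.

```shell
# mean_dosage FILE
# For each SNP row print: SNP identifier, information score, number of
# individuals, and the mean allele dosage across individuals.
mean_dosage() {
    awk 'NF > 4 {
             sum = 0
             for (i = 5; i <= NF; i++) sum += $i   # fields 5..NF are dosages
             printf "%s %s %d %.4f\n", $1, $4, NF - 4, sum / (NF - 4)
         }' "$1"
}
```

For example, mean_dosage plink.proxy.impute.dosage gives one summary line per imputed SNP. Dividing the mean dosage by 2 gives an estimate of the frequency of the counted allele.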
  
To perform imputation and save the called (most likely) genotypes in a new fileset, 
add the --make-bed option;
 ./plink --bfile merged-1 --proxy-impute all --make-bed --out imputed-1
By default, PLINK will only replace genotypes that were missing in the original WGAS sample;
to make PLINK re-impute all genotypes (whether they were actually observed or not), add the --proxy-replace
flag,
 ./plink --bfile merged-1 --proxy-impute all --proxy-replace --make-bed --out imputed-1
Note Future versions will do obvious things, like 
let you generate proxy-impute and proxy-assoc output files in the 
same run (you can't now).
Important Making discrete calls for the most likely
genotype will necessarily introduce error and bias in all but
perfectly imputed SNPs. As such, one should take care in the analysis
and interpretation of imputed datasets -- they should not be treated
as if they were directly observed with certainty.  In particular, one
should be cautious when combining multiple imputed files,
especially if different platforms were used and/or if the files also
differ by disease state. Indeed, such an analysis is currently not
recommended.

Verbose output options

To get a verbose output for a single SNP in the association mode, use
the specific SNP name instead of the all keyword:
     --proxy-assoc rs123235
See the web-page on proxy association methods 
to interpret this output.
You can also specify verbose imputation for one or more SNPs, e.g.
     --proxy-impute rs8096534  --proxy-verbose
which will add extra lines to the file plink.proxy.impute
representing the actual calls per person:
     rs8096534       78-03C15376 TBI-78-03C15376-1   01 01 0 1 0
     rs8096534       78-03C15377 TBI-78-03C15377-1   00 00 1 0 0
     rs8096534       78-03C15378 TBI-78-03C15378-1   01 01 0 1 0
     rs8096534       78-03C15398 TBI-78-03C15398-1   00 00 1 0 0
     rs8096534       78-03C15448 TBI-78-03C15448-1   01 01 0 1 0
     rs8096534       78-03C20292 TBI-78-03C20292-1   11 11 0 0 1
     rs8096534       78-03C20300 TBI-78-03C20300-1   11 10 0 0.08199 0.918
     rs8096534       78-03C20317 TBI-78-03C20317-1   01 01 0 1 0
     rs8096534       78-03C20335 TBI-78-03C20335-1   01 01 0 1 0
     ...
where the fields are (note: currently there is no header for these fields)
     SNP     SNP identifier
     FID     Family ID
     IID     Individual ID
     OBS     Observed genotype (coded 00,01,11 = AA,AB,BB,  10 = missing)
     IMP     Imputed genotype (as above)
     PAA     Probability of 'AA' genotype
     PAB     Probability of 'AB' genotype
     PBB     Probability of 'BB' genotype (i.e. these last 3 numbers sum to 1.00)
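
The relationship between the three probabilities and the IMP code can be sketched as follows: the most likely genotype is called if its probability reaches the confidence threshold, otherwise the call is set to missing. This is an illustration consistent with the example lines above, not PLINK's actual code; in particular, whether the comparison is "greater than" or "greater than or equal to" is an assumption, and the function name is hypothetical.

```shell
# call_genotype PAA PAB PBB [THRESHOLD]
# Print 00, 01 or 11 (AA, AB, BB) for the most likely genotype if its
# probability is at least THRESHOLD (default 0.95), otherwise print 10
# (missing).
call_genotype() {
    awk -v aa="$1" -v ab="$2" -v bb="$3" -v t="${4:-0.95}" 'BEGIN {
        best = "00"; p = aa + 0
        if (ab + 0 > p) { best = "01"; p = ab + 0 }
        if (bb + 0 > p) { best = "11"; p = bb + 0 }
        call = (p >= t + 0) ? best : "10"
        print call
    }'
}

call_genotype 0 0.08199 0.918   # prints 10, as for individual 78-03C20300 above
```

Lowering the threshold, e.g. call_genotype 0 0.08199 0.918 0.8, would instead give the call 11 for the same individual.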
In addition, after these lines you will see a table of counts which
summarises the imputed calls versus the true values (if known). Ideally, you would
observe high numbers down the diagonal (the columns are in the same order as the rows):
     Imputation matrix (rows observed, columns imputed)
     A/A     292     2       0       1
     A/G     0       1389    8       55
     G/G     0       5       1585    130
     0/0     1       1       0       0
and this is then followed by the normal, single-line non-verbose report for that SNP
 CHR             SNP NPRX     INFO  TOTAL_N  OBSERVD  IMPUTED  OVERLAP  CONCORD 
  18       rs8096534    5    0.961     3469    0.999    0.946    0.946    0.995 
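
The single-line report can be related back to the imputation matrix. The sketch below treats the last row and last column as the missing category and computes the summary fields from the counts. The source does not spell out these formulas, so this reading is an assumption, although it reproduces the example values above. The function name is hypothetical.

```shell
# matrix_summary FILE
# FILE holds the four matrix rows (a label followed by four counts; rows
# are observed genotypes, columns are imputed genotypes, and the last
# row/column is the missing category). Other lines are ignored.
# Prints: TOTAL_N OBSERVD IMPUTED OVERLAP CONCORD
matrix_summary() {
    awk 'NF == 5 && $2 ~ /^[0-9]+$/ {
             r++
             for (c = 1; c <= 4; c++) {
                 n = $(c + 1)
                 total += n
                 if (r < 4) obs += n              # genotype was observed
                 if (c < 4) imp += n              # genotype was imputed
                 if (r < 4 && c < 4) both += n    # observed and imputed
                 if (r == c && r < 4) same += n   # diagonal: calls agree
             }
         }
         END {
             if (total == 0 || both == 0) exit 1  # nothing to summarise
             printf "%d %.3f %.3f %.3f %.3f\n", total, obs / total,
                    imp / total, both / total, same / both
         }' "$1"
}
```

Run on the matrix above, this gives 3469 0.999 0.946 0.946 0.995, matching the TOTAL_N, OBSERVD, IMPUTED, OVERLAP and CONCORD fields of the report for rs8096534.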
Although you are able to specify --proxy-impute all
and --proxy-verbose together, be warned that this will
typically result in a very large output file for real data. It is
better used for single SNPs in its current format. 