This page describes PLINK functions to impute SNPs that are not directly genotyped but are present on a reference panel such as the HapMap. As well as imputing genotypes (making the most likely call, outputting the posterior probabilities of each genotype, or outputting the dosage), some simple association tests can be framed in this context. These methods do not necessarily need whole-genome data, however: with dense SNP genotyping in a particular region, they can still be applied straightforwardly. They use the proxy association set of commands.

In the text below, an observed SNP refers to one that was genotyped in both the reference and the WGAS sample. An imputed SNP refers to one that only appears in the reference panel.

IMPORTANT The approach is a simple one, essentially based around the concept of multi-marker tagging, designed to provide a straightforward, albeit quick and dirty, approach to imputation for common variants. It is unlikely to be optimal, particularly for rarer alleles, when compared to other imputation methods available. These features are also still in beta, meaning that they are still under development. As such, you are advised to use these routines only in an exploratory manner, if at all.

Basic steps for using PLINK imputation functions

The first step is to create a single fileset with the reference panel merged in with your dataset. We assume that the HapMap CEU founders will be used in this example.

HINT A PLINK binary fileset of the Phase 2 HapMap data can be downloaded from here. For studies of individuals of European ancestry, the CEU founder fileset will be the one to download from that link.

Given the HapMap data, hapmap-ceu.* or hapmap-ceu-all.*, for example, you merge in your WGAS data as follows,

./plink --bfile hapmap-ceu --bmerge mydata.bed mydata.bim mydata.fam --make-bed --out merged

In imputation mode, the reference panel is denoted by making those individuals have a missing value for the phenotype. You will therefore need to edit the .fam files to make the 6th column (phenotype) 0 for all HapMap individuals and 1 (control) or 2 (case) for the individuals in your sample. If you have trio data, make sure that no observed individuals have missing phenotypes (i.e. set parents to controls in a TDT context, rather than have a missing phenotype code).
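The phenotype recoding described above can be scripted. The following is a minimal sketch, not part of PLINK: it assumes the standard six-column .fam layout (FID, IID, PAT, MAT, SEX, PHENO), and the function name, the set of reference IDs and the case/control lookup are hypothetical inputs you would supply yourself.

```python
def mark_reference_panel(fam_lines, reference_ids, sample_pheno):
    """Return .fam lines with the 6th column recoded for imputation mode.

    fam_lines     -- iterable of .fam lines (FID IID PAT MAT SEX PHENO)
    reference_ids -- set of (FID, IID) pairs in the reference panel (assumed input)
    sample_pheno  -- dict mapping (FID, IID) to "1" (control) or "2" (case)
    """
    out = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) != 6:
            raise ValueError("expected 6 columns in .fam line: %r" % line)
        key = (fields[0], fields[1])
        if key in reference_ids:
            # Reference panel individuals get a missing phenotype (0)
            fields[5] = "0"
        else:
            # Observed individuals must not have a missing phenotype
            pheno = sample_pheno.get(key)
            if pheno not in ("1", "2"):
                raise ValueError("no case/control status for %s %s" % key)
            fields[5] = pheno
        out.append(" ".join(fields))
    return out
```

To use it, read the lines of merged.fam, pass them through the function, and write the result back to merged.fam (keeping a backup of the original). The order of individuals must not change, since the .bed file depends on it.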

Strand issues

The HapMap SNPs are all given on the +ve strand, so it is your responsibility to ensure that your data are aligned to the same strand for the merge to work. The --flip command can help to change strand. If there are strand problems, PLINK will report a list of SNPs that did not match in terms of strand. Naturally, A/T or C/G SNPs in your dataset can go unflagged, because flipping strand leaves the same pair of alleles. As such, it is always a good idea to compare allele frequencies between the HapMap and the WGAS sample to identify grossly deviant SNPs and/or undetected strand issues (for example, create an alternate phenotype file with the HapMap individuals coded as controls and the rest of the WGAS data as cases, and run a basic association command). The --flip-scan command can also help to detect some incorrectly aligned variants.
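Because A/T and C/G SNPs cannot be caught by a strand mismatch report, it can help to list them before merging. The sketch below is one way to do that outside PLINK; it assumes the standard six-column .bim layout (chromosome, SNP ID, genetic distance, position, allele 1, allele 2), and the function name is hypothetical.

```python
# Allele pairs that look identical after a strand flip
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}

def strand_ambiguous_snps(bim_lines):
    """Return IDs of SNPs whose two alleles are A/T or C/G."""
    ids = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        a1, a2 = fields[4].upper(), fields[5].upper()
        if frozenset((a1, a2)) in AMBIGUOUS:
            ids.append(fields[1])
    return ids
```

The resulting list marks the SNPs whose allele frequencies most deserve a comparison against the HapMap sample.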

NOTE The merge will create a very large dataset and take some time. Particularly if you have a parallel computing environment available, you might want to split the files and the merge procedure by chromosome: first download the archive with the HapMap CEU founder fileset split by chromosome, then split your own data by chromosome, e.g.

./plink --bfile mydata --chr 1 --make-bed --out data-1

./plink --bfile mydata --chr 2 --make-bed --out data-2

etc, followed by

./plink --bfile hapmap-ceu-chr1 --bmerge data-1.bed data-1.bim data-1.fam --make-bed --out merged-1

./plink --bfile hapmap-ceu-chr2 --bmerge data-2.bed data-2.bim data-2.fam --make-bed --out merged-2

This will create 22 separate filesets (merged-1, merged-2, etc) and all the following routines can then be run separately on each.
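Writing out 22 pairs of commands by hand is error-prone, so the per-chromosome split and merge can be generated programmatically. This sketch only builds the command lines shown above (it does not run PLINK); the function name is hypothetical, and the fileset names mydata, data-N, hapmap-ceu-chrN and merged-N follow the examples on this page.

```python
def per_chromosome_commands(n_chr=22, plink="./plink"):
    """Build (split, merge) argument lists for chromosomes 1..n_chr."""
    commands = []
    for c in range(1, n_chr + 1):
        data = "data-%d" % c
        # Extract one chromosome from the WGAS fileset
        split = [plink, "--bfile", "mydata", "--chr", str(c),
                 "--make-bed", "--out", data]
        # Merge it into the matching per-chromosome HapMap fileset
        merge = [plink, "--bfile", "hapmap-ceu-chr%d" % c,
                 "--bmerge", data + ".bed", data + ".bim", data + ".fam",
                 "--make-bed", "--out", "merged-%d" % c]
        commands.append((split, merge))
    return commands

for split, merge in per_chromosome_commands():
    print(" ".join(split))
    print(" ".join(merge))
```

The printed lines can be saved as a shell script, or each argument list can be handed to a job scheduler so that chromosomes run in parallel.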

Combined imputation and association analysis of case/control data

Given the merged fileset, containing both the reference panel and the (more sparse) WGAS samples, PLINK will attempt to perform case/control association for every SNP (both observed and imputed) with the following command:

./plink --bfile merged-1 --proxy-assoc all

which will generate an output file

     plink.assoc.proxy

with the fields

     CHR     Chromosome code
     SNP     SNP identifier
     BP      Physical position (base-pairs)
     A1      First allele code (not necessarily minor allele)
     A2      Second allele code (not necessarily major allele)
     GENO    Genotyping rate in entire sample and reference panel
     NPRX    Number of proxy SNPs selected
     INFO    Information content metric
     F_A     Allele 1 frequency in cases
     F_U     Allele 1 frequency in controls
     OR      Odds ratio 
     P       Significance value of case/control association test

The fields INFO and NPRX refer to how well PLINK managed, if at all, to impute the SNP. If NPRX is zero, then the SNP could not be imputed, even poorly. INFO generally ranges between 0 and 1, although it can occasionally be greater than 1. A higher value generally means a better imputed SNP; roughly speaking, only looking at imputed SNPs with an INFO value greater than 0.8 or so is probably good practice. More specific details on these metrics will be posted soon.
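The rule of thumb above (at least one proxy and INFO greater than about 0.8) can be applied to the results file with a few lines of code. This is a sketch under assumptions: it treats plink.assoc.proxy as whitespace-delimited with a header row naming the fields listed above, and the function name is hypothetical.

```python
def well_imputed(lines, min_info=0.8):
    """Keep rows with NPRX > 0 and INFO > min_info.

    lines -- iterable of text lines, the first being the header row.
    Returns a list of dicts keyed by the header names.
    """
    rows = iter(lines)
    header = next(rows, "").split()
    kept = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue
        rec = dict(zip(header, fields))
        try:
            nprx = int(rec["NPRX"])
            info = float(rec["INFO"])
        except (KeyError, ValueError):
            continue  # e.g. a non-numeric value such as NA
        if nprx > 0 and info > min_info:
            kept.append(rec)
    return kept
```

For example, `well_imputed(open("plink.assoc.proxy"))` would return only the rows passing both checks; the 0.8 cutoff is the rough guideline given above, not a fixed rule.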

Modifying options for basic imputation/association testing

One of the most important modifying options for the --proxy-assoc test is --proxy-drop, which means that the observed SNPs are dropped, one at a time, from the WGAS sample when they are tested as the reference SNP (i.e. they will be re-imputed given the surrounding SNPs). That is, the command,

./plink --bfile merged-1 --proxy-assoc all --proxy-drop

would mean that no test statistic in plink.assoc.proxy involves any observed genotype for the SNP being tested. Running the association test with the --proxy-drop option is therefore a good idea, as it provides both a means to assess the performance of the imputation (by comparing the results against those from the observed genotypes) and an extra level of QC (if you still see a significant result, it cannot be due to technical artifacts specific to that SNP, as no observed genotypes were used in the test for that SNP).

Given that the basic --assoc command more straightforwardly calculates association for observed SNPs, the main reason to run --proxy-assoc without --proxy-drop is when an observed SNP has a reasonable amount of missing genotype data and you want to use imputation to recover it. (In this case, though, there is perhaps less need to use a separate reference panel, and so the standard proxy association approach, without any reference panel, can be used.)

Parameters modifying selection of proxies

Imputation in this context works simply by selecting a set of proxy SNPs (using the reference panel information) and then phasing these SNPs in both reference panel and WGAS sample jointly. By grouping haplotypes, the corresponding single SNP tests of imputed SNPs can then be straightforwardly performed.

There are a number of parameters that impact the choice of proxy SNPs. Fine tuning of these parameters is still in progress. These parameters will be described in more detail shortly. For now, the default parameters should be sufficient in most cases. See the proxy association page for a description of the parameters, the defaults, and how they can be changed.

Imputing discrete genotype calls

The association test described above performs imputation on-the-fly and does not save the imputed genotype calls or probabilities. To do so, and to generate other metrics of imputation performance, use the --proxy-impute command.

To generate summary statistics for the imputation performance of each SNP, use the command

./plink --bfile merged-1 --proxy-impute all

which produces a file

     plink.proxy.impute

which has the fields

     CHR       Chromosome
     SNP       SNP ID
     NPRX      Number of proxy SNPs
     INFO      Information metric
     TOTAL_N   Total number of WGAS sample genotypes (exc. reference panel)
     OBSERVD   Proportion of these w/ observed genotypes
     IMPUTED   Proportion of these imputed
     OVERLAP   Proportion of these with both an observed and an imputed genotype
     CONCORD   Concordance rate in the overlapping set

Here are some example lines:

 CHR             SNP NPRX     INFO  TOTAL_N  OBSERVD  IMPUTED  OVERLAP  CONCORD 
  18       rs7233673    5    0.993     3469        0    0.991        0       NA 
  18       rs7233597    5    0.998     3469    0.999    0.993    0.992    0.986 
  18       rs7505507    4    0.632     3469    0.999    0.332    0.332    0.891

For example, the first line represents an unobserved SNP, for which 99% of individuals were imputed. The second line is an observed SNP: if we drop it and try to re-impute, 99.3% of genotypes are imputed, and the concordance rate between imputed and genotyped calls is 98.6%. The final line represents a SNP that did not perform as well: only a third of genotypes are imputed and these are less than 90% concordant (this was an observed SNP also). The INFO score for this third SNP is lower than for the other two and falls below 0.8, so at the usual 0.8 threshold this SNP would have been ignored in any case.

The required confidence threshold for making a call can be changed with, for example,

     --proxy-impute-threshold 0.8

(it is set to 0.95 by default currently).

To give genotype-specific concordances, use the additional option:

     --proxy-genotypic-concordance

A set of extra fields is then appended to the plink.proxy.impute output

     F_AA     Frequency of true 'AA' genotype
     I_AA     Proportion imputed for true AA genotype
     C_AA     Concordance rate for true AA genotype
     F_AB     As above, for 'AB' genotype
     ...      ...

The reason is that, for a very rare SNP, overall concordance would be high just by chance, even if none of the rare genotypes were correctly called. This option is therefore useful to get a better picture of imputation performance (when the observed genotype is also available).
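A small worked example shows why the overall rate can mislead. The counts below are invented purely for illustration (they are not PLINK output), and the per-genotype rate computed here is a plain "correct calls divided by calls for that true genotype", which may not match the exact definition PLINK uses for C_AA and related fields.

```python
def concordance_by_genotype(matrix):
    """matrix[obs][imp] = count of individuals with both calls.

    Returns (overall concordance, {genotype: concordance or None}).
    """
    hits = sum(row.get(g, 0) for g, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    per_genotype = {}
    for g, row in matrix.items():
        n = sum(row.values())
        per_genotype[g] = row.get(g, 0) / n if n else None
    return (hits / total if total else None), per_genotype

# Hypothetical counts: the rare AA genotype is never imputed correctly
toy = {
    "AA": {"AA": 0, "AB": 10, "BB": 0},
    "AB": {"AA": 0, "AB": 85, "BB": 5},
    "BB": {"AA": 0, "AB": 5, "BB": 895},
}
overall, per_genotype = concordance_by_genotype(toy)
print(overall, per_genotype["AA"])
```

In this toy table 980 of 1000 calls agree, so the overall concordance is 0.98, yet the concordance for the true AA genotype is 0.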

In addition, if

     --proxy-show-proxies

is also specified, an extra PROXIES field will appear in plink.proxy.impute showing the specific SNPs selected.

To perform imputation and save the dosages (fractional count of 0 to 2 alleles for each genotype), add the --proxy-dosage option:

./plink --bfile merged-1 --proxy-impute all --proxy-dosage

which produces a file

     plink.proxy.impute.dosage

in which each imputed SNP is represented as a row. The file has no header row; the fields are

     SNP Identifier
     Allele 1 code
     Allele 2 code
     Information content score for SNP
     Allele dosage for first individual in sample
     Allele dosage for second individual in sample
     ...
     Allele dosage for final individual in sample

This file can then be analysed outside of PLINK.
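As a starting point for such an analysis, the sketch below reads one row of the dosage file into a simple record. It assumes the fields are whitespace-delimited in the order listed above; the function name is hypothetical, and treating any non-numeric dosage as missing (None) is an assumption, since the missing-value code is not documented here.

```python
def parse_dosage_line(line):
    """Split one row of plink.proxy.impute.dosage into named parts."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected SNP, A1, A2, INFO and at least one dosage")

    def to_float(text):
        # Assumption: anything that is not a number is treated as missing
        try:
            return float(text)
        except ValueError:
            return None

    return {
        "snp": fields[0],
        "a1": fields[1],
        "a2": fields[2],
        "info": to_float(fields[3]),
        "dosages": [to_float(x) for x in fields[4:]],
    }

def mean_dosage(record):
    """Average dosage over individuals with a non-missing value."""
    values = [d for d in record["dosages"] if d is not None]
    return sum(values) / len(values) if values else None
```

The dosages appear in the order of individuals in the sample, so they can be matched to rows of the .fam file when fitting a model in another package.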

To perform imputation and save the called (most likely) genotypes in a new fileset, add the --make-bed option:

./plink --bfile merged-1 --proxy-impute all --make-bed --out imputed-1

By default, PLINK will only replace genotypes that were missing in the original WGAS sample; to make PLINK re-impute all genotypes (whether they were actually observed or not), add the --proxy-replace flag,

./plink --bfile merged-1 --proxy-impute all --proxy-replace --make-bed --out imputed-1

Note Future versions will do obvious things, like let you generate proxy-impute and proxy-assoc output files in the same run (you can't now).

Important Making discrete calls for the most likely genotype will necessarily introduce error and bias in all but perfectly imputed SNPs. As such, one should take care in the analysis and interpretation of imputed datasets: they should not be treated as if they were directly observed with certainty. One should be especially cautious when combining multiple imputed files, particularly if different platforms were used and/or if the files also differ by disease state. Indeed, such an analysis is currently not recommended.

Verbose output options

To get verbose output for a single SNP in the association mode, give the specific SNP name in place of the all keyword:

     --proxy-assoc rs123235

See the web-page on proxy association methods to interpret this output.

You can also specify verbose imputation for one or more SNPs, e.g.

     --proxy-impute rs8096534  --proxy-verbose

which will add extra lines to the file plink.proxy.impute representing the actual calls per person:

     rs8096534       78-03C15376 TBI-78-03C15376-1   01 01 0 1 0
     rs8096534       78-03C15377 TBI-78-03C15377-1   00 00 1 0 0
     rs8096534       78-03C15378 TBI-78-03C15378-1   01 01 0 1 0
     rs8096534       78-03C15398 TBI-78-03C15398-1   00 00 1 0 0
     rs8096534       78-03C15448 TBI-78-03C15448-1   01 01 0 1 0
     rs8096534       78-03C20292 TBI-78-03C20292-1   11 11 0 0 1
     rs8096534       78-03C20300 TBI-78-03C20300-1   11 10 0 0.08199 0.918
     rs8096534       78-03C20317 TBI-78-03C20317-1   01 01 0 1 0
     rs8096534       78-03C20335 TBI-78-03C20335-1   01 01 0 1 0
     ...

where the fields are (note: currently there is no header for these fields)

     SNP     SNP identifier
     FID     Family ID
     IID     Individual ID
     OBS     Observed genotype (coded 00,01,11 = AA,AB,BB,  10 = missing)
     IMP     Imputed genotype (as above)
     PAA     Probability of 'AA' genotype
     PAB     Probability of 'AB' genotype
     PBB     Probability of 'BB' genotype (i.e. these last 3 numbers sum to 1.00)
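The per-person lines can be read back and the effect of the confidence threshold examined. This sketch assumes a call is made when the largest of the three probabilities reaches the threshold and is otherwise set to missing (10); that rule is consistent with the example lines above, but whether PLINK uses "greater than" or "greater than or equal to" is an assumption. The function names are hypothetical.

```python
GENOTYPE_CODES = ("00", "01", "11")  # AA, AB, BB
MISSING = "10"

def parse_verbose_line(line):
    """Split one per-person line into its eight fields."""
    snp, fid, iid, obs, imp, paa, pab, pbb = line.split()
    return {"snp": snp, "fid": fid, "iid": iid, "obs": obs, "imp": imp,
            "probs": (float(paa), float(pab), float(pbb))}

def call_genotype(probs, threshold=0.95):
    """Return the most likely genotype code, or missing if below threshold."""
    best = max(range(3), key=lambda i: probs[i])
    return GENOTYPE_CODES[best] if probs[best] >= threshold else MISSING

rec = parse_verbose_line(
    "rs8096534       78-03C20300 TBI-78-03C20300-1   11 10 0 0.08199 0.918")
print(call_genotype(rec["probs"]))        # default threshold of 0.95
print(call_genotype(rec["probs"], 0.8))   # as with --proxy-impute-threshold 0.8
```

For this individual the largest probability is 0.918, so at the default threshold of 0.95 no call is made (10, as in the IMP column above), whereas at a threshold of 0.8 the genotype would be called as 11.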

In addition, after these lines you will see a table of counts which summarises the actual calls versus the true values (if known). The columns are in the same order as the rows, so ideally you would observe high numbers down the diagonal:

     Imputation matrix (rows observed, columns imputed)
     A/A     292     2       0       1
     A/G     0       1389    8       55
     G/G     0       5       1585    130
     0/0     1       1       0       0

and this is then followed by the normal, single-line non-verbose report for that SNP

 CHR             SNP NPRX     INFO  TOTAL_N  OBSERVD  IMPUTED  OVERLAP  CONCORD 
  18       rs8096534    5    0.961     3469    0.999    0.946    0.946    0.995
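The summary line can be reproduced from the imputation matrix, which makes the meaning of each column concrete. The sketch below reads the last row as "observed genotype missing" and the last column as "no imputed call made", and takes CONCORD as the diagonal count divided by the count with both calls. These denominators are inferred from the figures above rather than taken from a formal definition.

```python
# Rows: observed A/A, A/G, G/G, missing; columns: imputed, in the same order
matrix = [
    [292, 2, 0, 1],
    [0, 1389, 8, 55],
    [0, 5, 1585, 130],
    [1, 1, 0, 0],
]

def summarise(matrix):
    """Derive the summary fields from a 4x4 imputation matrix."""
    total = sum(sum(row) for row in matrix)
    observed = sum(sum(row) for row in matrix[:3])       # has an observed call
    imputed = sum(sum(row[:3]) for row in matrix)        # has an imputed call
    overlap = sum(sum(row[:3]) for row in matrix[:3])    # has both
    concordant = sum(matrix[i][i] for i in range(3))     # both agree
    return {
        "TOTAL_N": total,
        "OBSERVD": observed / total,
        "IMPUTED": imputed / total,
        "OVERLAP": overlap / total,
        "CONCORD": concordant / overlap,
    }

summary = summarise(matrix)
print({k: round(v, 3) for k, v in summary.items()})
```

Rounded to three decimal places this gives TOTAL_N 3469, OBSERVD 0.999, IMPUTED 0.946, OVERLAP 0.946 and CONCORD 0.995, matching the report line for rs8096534.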

Although you are able to specify --proxy-impute all and --proxy-verbose together, be warned that this will typically result in a very large output file for real data. It is better used for single SNPs in its current format.

This document last modified Wednesday, 25-Jan-2017 11:39:28 EST