PLINK: Whole genome data analysis toolset plink...
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing

Whole genome association analysis toolset

 

Family-based association analysis

PLINK is designed primarily for population-based samples. It nevertheless offers some support for family-based analyses of both disease traits and quantitative traits, as described in this section.

Family-based association (TDT)

PLINK supports basic family-based association testing for disease traits, using the TDT and a variant of this test that also incorporates parental phenotype information, the parenTDT.

To run a basic TDT analysis for family data:
plink --file mydata --tdt

which generates the file
     plink.tdt
If permutation has been requested, then either
     plink.tdt.perm
or
     plink.tdt.mperm
will be generated also. The main output file, plink.tdt, contains the following fields:
     CHR         Chromosome number
     SNP         SNP identifier
     A1          Minor allele code
     A2          Major allele code
     T           Transmitted minor allele count
     U           Untransmitted allele count
     OR          TDT odds ratio
     CHISQ       TDT chi-square statistic
     P           TDT asymptotic p-value
     A:U_PAR     Parental discordance counts
     CHISQ_PAR   Parental discordance statistic
     P_PAR       Parental discordance asymptotic p-value
     CHISQ_COM   Combined test statistic
     P_COM       Combined test asymptotic p-value
If the --ci option has been requested, then two additional fields will appear after the OR field:
     L95         Lower 95% confidence interval for TDT odds ratio
     U95         Upper 95% confidence interval for TDT odds ratio
(naturally, if a value other than 0.95 was used as the argument for the --ci option, it will appear here instead.)

The TDT statistic is calculated simply as
     (b-c)^2 / (b+c)
where b and c are the numbers of transmitted and untransmitted alleles (the T and U fields in plink.tdt); under the null hypothesis, the statistic follows a chi-squared distribution with 1 df.
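As a worked illustration of the formula above, the following minimal sketch computes the TDT chi-squared statistic, its asymptotic p-value and the odds ratio from a pair of counts. The function name `tdt` and the example counts (60 transmitted, 40 untransmitted) are hypothetical and chosen only for illustration; the odds ratio is taken as T/U, and this is not PLINK's own code.

```python
import math

def tdt(b, c):
    """Return (chisq, p, odds_ratio) for transmitted count b and untransmitted count c."""
    if b + c == 0:
        # No informative transmissions: nothing can be computed
        return float("nan"), float("nan"), float("nan")
    chisq = (b - c) ** 2 / (b + c)
    # Upper tail of a 1 df chi-squared: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chisq / 2))
    # Odds ratio as the ratio of transmitted to untransmitted counts
    odds_ratio = b / c if c > 0 else float("nan")
    return chisq, p, odds_ratio

# Hypothetical counts, for illustration only
chisq, p, odds_ratio = tdt(60, 40)
print(f"CHISQ={chisq:.3f}  P={p:.4f}  OR={odds_ratio:.2f}")
```

With these counts the statistic is (60-40)^2 / (60+40) = 4, which corresponds to a p-value of roughly 0.046 on 1 df.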

The parental discordance test is based on counting the number of alleles in affected versus unaffected parents, treating each nuclear family parental pair as a matched pair. These counts can be combined with the T and U counts of the basic TDT to give a combined test statistic, also shown in the output. The parenTDT assumes homogeneity within families rather than between families, in terms of population stratification. If parents are measured on the phenotype, this test can add considerable power to family-based association analysis, whilst providing a strong (but not complete) degree of protection against population stratification. The increase in power depends on the proportion of parents that are discordant for the disease. This approach is described in Purcell et al AJHG (2005). PLINK uses a simpler approach to calculate the PAR and COM statistics, however: if the phenotypically discordant parental pairs are tabulated by genotype as
                   Unaffected parent
                   A/A   A/B   B/B
     Affected A/A   -     p     r
      parent  A/B   x     -     q
              B/B   z     y     -
i.e. such that the A:U_PAR field represents p+q+2r : x+y+2z, then
     PAR = ( (p+q+2r) - (x+y+2z) )^2
            / ( p+q+x+y+4(r+z) )
and
     COM =  ( ( b+p+q+2r ) - ( c+x+y+2z ) )^2 
             / ( b+p+q+c+x+y+4(r+z) )
Both statistics follow a 1 df chi-squared distribution under the null.
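The PAR and COM formulas above can be checked numerically with a short sketch. The function names and all cell counts below are hypothetical, used only to show how the table cells (p, q, r, x, y, z) and the TDT counts (b, c) enter the two statistics; this is not PLINK's own implementation.

```python
def par_stat(p, q, r, x, y, z):
    """Parental discordance statistic from the discordant-pair table cells."""
    affected = p + q + 2 * r      # first part of A:U_PAR
    unaffected = x + y + 2 * z    # second part of A:U_PAR
    return (affected - unaffected) ** 2 / (p + q + x + y + 4 * (r + z))

def com_stat(b, c, p, q, r, x, y, z):
    """Combined statistic: TDT counts (b, c) plus parental discordance counts."""
    numerator = ((b + p + q + 2 * r) - (c + x + y + 2 * z)) ** 2
    denominator = b + p + q + c + x + y + 4 * (r + z)
    return numerator / denominator

# Hypothetical cell counts, for illustration only
cells = dict(p=10, q=8, r=2, x=5, y=4, z=1)
par = par_stat(**cells)
com = com_stat(b=60, c=40, **cells)
print(f"A:U_PAR = {10 + 8 + 2 * 2}:{5 + 4 + 2 * 1}  PAR = {par:.3f}  COM = {com:.3f}")
```

Here A:U_PAR is 22:11, so PAR = 11^2 / 39 and COM = 31^2 / 139; each would be referred to a 1 df chi-squared distribution.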

When running the --tdt option, PLINK first checks for Mendel errors and sets the offending genotypes to missing.

Using the --tdt option, if permutation is requested (using either --perm or --mperm) a file entitled either
     plink.tdt.perm
or
     plink.tdt.mperm
will be generated: the empirical p-value will be based on the standard TDT test. The permutation procedure flips transmitted/untransmitted status consistently across all SNPs for a given family, thereby preserving the LD and linkage information between markers and siblings.

parenTDT

The parenTDT, described in the section above, is automatically included when using the --tdt option. Two alternative commands generate the same output as the --tdt command, except that the permutation is based not on the standard TDT but on either the parenTDT, if using the option
plink --file mydata --parentdt1

or the combined test (TDT and parenTDT), if using the option
plink --file mydata --parentdt2

Parent of origin analysis

When performing family-based TDT analysis, it is possible to separately consider transmissions from heterozygous fathers versus heterozygous mothers to affected offspring. This parent-of-origin analysis is requested by adding the --poo option:
plink --file mydata --tdt --poo

which generates the file plink.tdt.poo. If permutation is also requested, the file plink.tdt.poo.perm or plink.tdt.poo.mperm is generated as well, depending on which permutation procedure is used. The main output file has the following format:
     CHR        Chromosome number
     SNP        SNP identifier
     A1:A2      Allele 1 : allele 2 codes
     T:U_PAT    Paternal transmitted : untransmitted counts
     OR_PAT     Paternal odds ratio
     CHISQ_PAT  Paternal chi-squared test 
     T:U_MAT    Maternal, as above
     OR_MAT     Maternal, as above
     CHISQ_MAT  Maternal, as above
     Z_POO      Z score for difference in paternal versus maternal odds ratios
     P_POO      Asymptotic p-value for parent-of-origin test
If permutation is requested, the default test statistic is the absolute value of the Z score for the parent-of-origin test (i.e. making a two-sided test). The flags --pat and --mat indicate that the permutation statistic should instead be the paternal or the maternal TDT chi-squared statistic, respectively.

NOTE When both parents are heterozygous, these ambiguous transmissions are counted as 0.5 for both mother and father -- this is why the T:U counts will often not be whole numbers.
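To make the Z_POO and P_POO fields more concrete, the sketch below compares paternal and maternal odds ratios with a standard z-test on the log odds ratio scale. This is an assumption about how such a comparison is commonly formed, not a statement of PLINK's exact computation; the function name `poo_test` and the counts are hypothetical, with half-counts included to mirror the ambiguous transmissions described in the note above.

```python
import math

def poo_test(t_pat, u_pat, t_mat, u_mat):
    """Z-test for a difference between paternal and maternal TDT odds ratios.

    Assumed form: difference of log odds ratios divided by its
    approximate standard error (sum of reciprocal counts).
    """
    or_pat = t_pat / u_pat
    or_mat = t_mat / u_mat
    se = math.sqrt(1 / t_pat + 1 / u_pat + 1 / t_mat + 1 / u_mat)
    z = (math.log(or_pat) - math.log(or_mat)) / se
    # Two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return or_pat, or_mat, z, p

# Hypothetical T:U counts; the .5 values arise from ambiguous transmissions
or_pat, or_mat, z, p = poo_test(30.5, 20.5, 25.5, 24.5)
print(f"OR_PAT={or_pat:.3f}  OR_MAT={or_mat:.3f}  Z={z:.3f}  P={p:.3f}")
```

All four counts must be positive for this form of the test to be defined.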

DFAM: family-based association for disease traits

The DFAM procedure in PLINK implements the sib-TDT and also allows unrelated individuals to be included (via a clustered analysis using the Cochran-Mantel-Haenszel test). To perform this test:
plink --bfile mydata --dfam

which generates the file
     plink.dfam
which contains the fields
     CHR        Chromosome code
     SNP        SNP identifier
     A1:A2      Minor and major allele codes
     OBS        Number of observed minor alleles 
     EXP        Number of expected minor alleles
     CHISQ      Chi-squared test statistic
     P          Asymptotic p-value
This test can therefore be used to combine discordant sibship data, parent-offspring trio data and unrelated case/control data in a single analysis.

NOTE If you are analysing a sibling-only sample (i.e. no parents), also add the --nonfounders option; otherwise all SNPs will be pruned out at the filtering stage, because by default PLINK considers only founder alleles when calculating allele frequencies, Hardy-Weinberg statistics, etc.

QFAM: family-based association tests for quantitative traits

PLINK offers a somewhat ad-hoc procedure to perform family-based tests of association with quantitative phenotypes: the QFAM procedure, which uses permutation to account for the dependence between related individuals. It adopts the between/within model as used by Fulker et al (1999, AJHG) and Abecasis et al (2000, AJHG) as implemented in the QTDT package. However, rather than fitting a maximum likelihood variance components model, as QTDT does, PLINK performs a simple linear regression of phenotype on genotype, but then uses a special permutation procedure to correct for family structure.

There are several ways to run QFAM: a total association test (between and within components)
plink --bfile mydata --qfam-total --mperm 100000

or a within-family test
plink --bfile mydata --qfam --mperm 100000

or a test including parental phenotypes
plink --bfile mydata --qfam-parents --mperm 100000

(Also, --qfam-between will look only at the between-family component of association).

NOTE In all cases above, we have used --mperm to specify permutation; adaptive permutation can also be used with QFAM (--perm). Permutation is necessary for the QFAM test.

The columns in the QFAM permutation result files are:
     CHR     Chromosome code
     SNP     SNP identifier
     STAT    Test statistic (ignore)
     EMP1    Pointwise empirical p-value
     NP      Number of permutations performed
The columns in the non-permutation file (e.g. plink.qfam.total, if plink.qfam.total.mperm contains the permuted results) are as follows:
     CHR     Chromosome code
     SNP     SNP identifier
     A1      Minor allele (corresponds to beta given below; absent in earlier PLINK releases)
     TEST    Type of test: TOT, WITH or BET
     NMISS   Number of non-missing individuals in analysis
     BETA    Regression coefficient
     STAT    Test statistic (ignore; not corrected for family-structure)
     P       Asymptotic p-value (ignore; use empirical p-value)
These results are from a standard --linear type analysis, i.e. which ignores family structure. They are displayed so that the direction of effect may be determined (from the BETA) -- but otherwise, only the empirical p-value from the permuted results file should be looked at.
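Since the direction of effect comes from one file and the p-value from the other, it can be convenient to join the two on SNP. The sketch below assumes both files are whitespace-delimited with a header row and uses small inline tables in place of plink.qfam.total and plink.qfam.total.mperm; every value in these tables is invented for illustration, and the helper names are hypothetical.

```python
def read_table(text):
    """Parse a whitespace-delimited table with a header row into a list of dicts."""
    rows = [line.split() for line in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

def to_float(value):
    """Convert a field to float, returning None for non-numeric entries such as NA."""
    try:
        return float(value)
    except ValueError:
        return None

def merge_qfam(asymptotic_text, permuted_text):
    """Keep BETA from the non-permutation file and EMP1 from the permutation file."""
    empirical = {row["SNP"]: row for row in read_table(permuted_text)}
    merged = []
    for row in read_table(asymptotic_text):
        perm_row = empirical.get(row["SNP"])
        if perm_row is None:
            continue  # SNP absent from the permutation results
        merged.append({
            "SNP": row["SNP"],
            "A1": row["A1"],
            "BETA": to_float(row["BETA"]),
            "EMP1": to_float(perm_row["EMP1"]),
        })
    return merged

# Invented example rows, for illustration only
asymptotic = """
 CHR  SNP  A1  TEST  NMISS  BETA   STAT   P
 1    rs1  A   TOT   200    0.21   2.10   0.037
 1    rs2  G   TOT   198    -0.05  -0.48  0.63
"""
permuted = """
 CHR  SNP  STAT   EMP1   NP
 1    rs1  2.10   0.041  100000
 1    rs2  -0.48  0.66   100000
"""
merged = merge_qfam(asymptotic, permuted)
for row in merged:
    print(row)
```

In practice the two strings would be replaced by the contents of the actual result files, for example via `open(path).read()`.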

The B and W components are calculated using parental genotypes if they are available for both parents; otherwise siblings are used. Singletons can be included in this analysis (for them, B=G and W=0). For example, the scores for a few configurations when parents are available are shown below:
     Genotype     G
     AA           1
     Aa           0
     aa          -1

     B = ( Pat + Mat ) / 2
     W = G - B
 
     Pat     Mat     Offspring     G     B     W

     AA      AA      AA            1     1     0

     Aa      AA      AA            1     0.5   0.5
     Aa      AA      Aa            0     0.5   -0.5

     aa      AA      Aa            0     0     0

     etc
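The scoring rules in the table above can be reproduced with a few lines of code. This is a minimal sketch for the case where both parental genotypes are available; the function name `between_within` and the genotype strings are illustrative choices, not part of PLINK.

```python
# Genotype scores as defined in the table: AA = 1, Aa = 0, aa = -1
SCORE = {"AA": 1, "Aa": 0, "aa": -1}

def between_within(pat, mat, offspring):
    """Return (G, B, W) for an offspring given both parental genotypes."""
    g = SCORE[offspring]
    b = (SCORE[pat] + SCORE[mat]) / 2   # between-family component
    w = g - b                           # within-family component
    return g, b, w

# The configurations listed in the table
for pat, mat, off in [("AA", "AA", "AA"), ("Aa", "AA", "AA"),
                      ("Aa", "AA", "Aa"), ("aa", "AA", "Aa")]:
    print(pat, mat, off, between_within(pat, mat, off))
```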

The QFAM permutation procedure breaks down the genotypes into between (B) and within (W) components, permutes them independently (i.e. at the family level, either swapping the B component for one family with another family, or flipping the sign of all W's in a family with 50:50 chance) and then (for the total association test) reconstructs the individual level "genotypes" as the sum of the new B's and W's i.e:
     1) G -> B + W (individual-level)

     2a) Permute B (family-level) -> B'
     2b) Permute W (family-level) -> W'

     3) B' + W' -> G' (individual-level)
The logic is that we know how to permute both B and W separately whilst maintaining the familial structural component, and they are orthogonal components, so we should permute them separately, but then recombine them as a single individual-level genotypic score.
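Steps 1 to 3 above can be sketched for a single SNP as follows. This is an illustrative sketch only, assuming complete genotype data and a hypothetical data layout (one B value per family plus a list of W values, one per family member); it is not PLINK's implementation.

```python
import random

def permute_qfam(families, rng):
    """Produce one permuted replicate of individual-level scores.

    families maps a family ID to (B, [W for each member]).
    Returns a dict mapping each family ID to its list of G' = B' + W'.
    """
    ids = list(families)
    donors = ids[:]
    rng.shuffle(donors)                  # 2a) swap B components between families
    permuted = {}
    for fid, donor in zip(ids, donors):
        b_new = families[donor][0]
        sign = rng.choice((1, -1))       # 2b) flip all W's in a family with 50:50 chance
        # 3) recombine into individual-level scores
        permuted[fid] = [b_new + sign * w for w in families[fid][1]]
    return permuted

# Hypothetical toy data: three families with their B and W components
families = {
    "fam1": (0.5, [0.5, -0.5]),
    "fam2": (0.0, [0.0, 0.0]),
    "fam3": (1.0, [0.0]),
}
replicate = permute_qfam(families, random.Random(2017))
print(replicate)
```

Because the sign flip is applied to every member of a family at once, differences between siblings keep their magnitude in each replicate, which is how the within-family structure is retained.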

NOTE The total --qfam-total test is designed to extract all association information from a family-based sample, controlling for relatedness: it is not robust to stratification. Use --qfam for a strictly within-family test.

In many circumstances, the standard QTDT as implemented in Goncalo Abecasis' QTDT program will perhaps be more appropriate. The disadvantages of the QFAM procedure are
  • that it uses permutation and so is slower
  • that it appears to be slightly less powerful when there is a higher residual correlation
On the plus side, the advantages of the QFAM procedure are
  • that it uses permutation and so is appropriate for non-normal phenotypes; it could also be used for disease phenotypes, although it will not be appropriate for affected-only TDT style designs
  • that it can be applied to genome-wide data easily (albeit not necessarily quickly)
Technical note When permuting genotypes between families in this way, one has to be careful with missing genotype data, particularly when a family is completely missing. Because a missing B component cannot be recombined with a non-missing W component, and vice versa, this process would tend to increase the amount of missingness in the permuted data relative to the original data.

One could exclude individuals with missing genotypes first and permute separately for each SNP, but this would no longer maintain the correlation between SNPs (and would require more computation). Instead, we use the following scheme. We permute once per replicate, giving a table that pairs each original family (F) with a permuted family (F'). Suppose that family 2 is missing its B component (denoted 2*). For example:
     F  F'
     0  5
     1  2*  <- remove ?
     2* 4   <- remove ?
     3  1
     4  0
     5  3
This would knock out families 1 and 2 from the permutation. We therefore permute once to create a single permutation table for all SNPs, but then recursively edit the table on a SNP-by-SNP basis to regroup the missing families, by swapping missing F' families: in this case, swap 2* with 4 (the other partner of 2*), e.g.
     F  F'
     0  5
     1  4
     2* 2* <- remove
     3  1
     4  0
     5  3
So now we have a permuted sample in which the total level of missingness is the same as in the original data. This procedure still generates valid, completely random permutations of the non-missing genotype data and tries to maintain as much of the correlation between SNPs as possible (typically only a small % of genotypes are missing, so the table does not need much editing).
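The table-editing step can be sketched as a simple loop over the families that are missing at a given SNP. This is one possible reading of the scheme described above, written iteratively rather than recursively; the function name `regroup_missing` and the list-based representation of the F to F' table are assumptions made for illustration.

```python
def regroup_missing(perm, missing):
    """Edit a permutation table so that every missing family is paired with itself.

    perm[f] is the permuted family F' assigned to original family f;
    missing is the set of families lacking the component at this SNP.
    Returns an edited copy; the input table is left untouched so it can
    be reused for the next SNP.
    """
    perm = list(perm)
    for m in sorted(missing):
        if perm[m] == m:
            continue                 # already paired with itself
        partner = perm.index(m)      # family that was going to receive the missing component
        perm[partner] = perm[m]      # give it the family originally assigned to m
        perm[m] = m                  # pair the missing family with itself
    return perm

# The table from the example above: F = 0..5, F' = 5, 2*, 4, 1, 0, 3
table = [5, 2, 4, 1, 0, 3]
edited = regroup_missing(table, {2})
print(edited)  # [5, 4, 2, 1, 0, 3]
```

Each edit is a swap of two entries, so the result is still a permutation of the families, and only the rows touching a missing family differ from the shared table.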
 

This document last modified Wednesday, 25-Jan-2017 11:39:26 EST