PLINK: Whole genome data analysis toolset

Association analysis

The basic association test is for a disease trait and is based on comparing allele frequencies between cases and controls (asymptotic and empirical p-values are available). Also implemented are the Cochran-Armitage trend test, dominant and recessive models, and a two degree of freedom general model. Different models can be compared by likelihood ratio test, and the significance of the most significant model can be evaluated by permutation. PLINK also tests for differences in missing genotype rates between cases and controls.

PLINK is designed to perform these basic tests quickly: to calculate the chi-square statistics and odds ratios for all SNPs in a 100K dataset containing 350 individuals takes ~2 seconds on a standard Linux workstation, making a permutation-based approach to whole genome analysis feasible.
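As a rough illustration of the statistics mentioned here, the following Python sketch computes a Pearson chi-square (1 df), its asymptotic p-value and an odds ratio from a 2x2 table of allele counts. The function name and the counts are hypothetical, and this is the textbook calculation rather than PLINK's own code.

```python
import math

def allelic_test(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Textbook 2x2 allelic test: returns (chi-square, p-value, odds ratio)."""
    table = [[case_a1, case_a2], [ctrl_a1, ctrl_a2]]
    total = case_a1 + case_a2 + ctrl_a1 + ctrl_a2
    row_totals = [sum(row) for row in table]
    col_totals = [case_a1 + ctrl_a1, case_a2 + ctrl_a2]
    chi_sq = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi_sq += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    # Odds ratio for allele 1 (no guard against zero counts in this sketch).
    odds_ratio = (case_a1 * ctrl_a2) / (case_a2 * ctrl_a1)
    return chi_sq, p_value, odds_ratio

# Hypothetical allele counts: 60/40 in cases, 40/60 in controls.
chi_sq, p_value, odds_ratio = allelic_test(60, 40, 40, 60)
print(chi_sq, p_value, odds_ratio)
```

Because each SNP needs only a handful of arithmetic operations on four counts, repeating the calculation over many SNPs and many permutations stays cheap.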

Family-based data can be analyzed with the basic TDT, using either asymptotic or empirical significance values. In addition, the basic test is modified to include information on parental phenotypes to give a more powerful combined test. The permutation procedure flips transmitted/untransmitted status consistently across all SNPs for a given family, thereby preserving the LD and linkage information between markers and siblings.

Quantitative traits can also be tested for association, using either asymptotic (likelihood ratio test and Wald test) or empirical significance values. Any clustering scheme can be specified for the permutations: in this way, it is possible to control for subpopulation membership as well as family structure, if related individuals are being analyzed.

For all tests (disease, quantitative and family-based) it is possible to specify 'sets' of SNPs to calculate 'set-based' or 'gene-based' test statistics: these are based on cumulative sums of rank-ordered single SNP statistics, and are evaluated by permutation.

Basic case/control association test

To perform a standard case/control association analysis, use the option:

plink --file mydata --assoc

Results are stored in the file

plink.assoc

in the form:

	Chromosome
	SNP ID
	Allele 1
	Frequency of allele 1
	Allele 2
	Association chi-square
	Empirical p-value **
	Adjusted empirical p-value
	Odds ratio (allele 1)

The empirical p-value is currently based on a method that will be more conservative if there are missing data that are unevenly distributed between cases and controls: this may help to partially control the confounding that can arise with non-random missing data.

Currently X/Y chromosome markers are not properly handled by these analyses.

To obtain a missing chi-sq test (i.e. for each SNP, does missingness differ between cases and controls?), add the --test-missing option. This should append an extra column to the standard results in plink.assoc:

plink --file mydata --assoc --test-missing

Permutation procedure: clusters

The number of permutations (default is 1000) is set as follows:

plink --file mydata --assoc --perm 20000
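To make the idea of an empirical p-value concrete, here is a minimal Python sketch for a single SNP. It is not PLINK's implementation: the test statistic (absolute difference in allele 1 frequency), the (R + 1) / (N + 1) formula and all names and data are assumptions chosen for illustration.

```python
import random

def allele_freq_diff(genotypes, is_case):
    """Absolute difference in allele 1 frequency between cases and controls.

    genotypes holds the number of copies of allele 1 (0, 1 or 2) per person.
    """
    case = [g for g, c in zip(genotypes, is_case) if c]
    ctrl = [g for g, c in zip(genotypes, is_case) if not c]
    return abs(sum(case) / (2 * len(case)) - sum(ctrl) / (2 * len(ctrl)))

def empirical_p(observed, permuted_stats):
    """(R + 1) / (N + 1), where R counts permuted statistics >= observed."""
    r = sum(1 for s in permuted_stats if s >= observed)
    return (r + 1) / (len(permuted_stats) + 1)

def permutation_test(genotypes, is_case, n_perm=1000, seed=0):
    """Shuffle case/control labels n_perm times and return the empirical p."""
    rng = random.Random(seed)
    observed = allele_freq_diff(genotypes, is_case)
    labels = list(is_case)
    permuted = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        permuted.append(allele_freq_diff(genotypes, labels))
    return empirical_p(observed, permuted)

# Hypothetical data: six cases followed by six controls.
genotypes = [2, 2, 1, 2, 1, 2, 0, 1, 0, 0, 1, 0]
is_case = [True] * 6 + [False] * 6
p_value = permutation_test(genotypes, is_case)
print(p_value)
```

More permutations give a finer resolution for small p-values, since the smallest value this formula can return is 1 / (N + 1).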

To perform stratification clustering and then permute only within clusters, e.g. to form all nearest-neighbour case-control pairs (within a certain threshold) and permute only within pairs:

plink --file mydata --cluster --cc --mc 2 --merge 0.001 --assoc --within

To permute within groups defined by family ID (i.e. you can swap any measure in as the Family ID column, as this is not used for any other purpose):

plink --file mydata --assoc --within --family
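The effect of restricting permutation to clusters can be sketched in Python as follows. This is an illustrative assumption about what "permute only within cluster" means (labels are shuffled among members of the same cluster and never across clusters); the function name and data are hypothetical.

```python
import random
from collections import defaultdict

def permute_within_clusters(labels, clusters, rng):
    """Return a copy of labels shuffled only among members of each cluster."""
    members = defaultdict(list)
    for index, cluster in enumerate(clusters):
        members[cluster].append(index)
    permuted = list(labels)
    for indices in members.values():
        shuffled = [labels[i] for i in indices]
        rng.shuffle(shuffled)
        for i, value in zip(indices, shuffled):
            permuted[i] = value
    return permuted

# Hypothetical data: three matched case/control pairs (clusters A, B, C).
status = ["case", "control", "case", "control", "control", "case"]
clusters = ["A", "A", "B", "B", "C", "C"]
permuted = permute_within_clusters(status, clusters, random.Random(1))
print(permuted)
```

Each cluster keeps its own number of cases and controls in every permutation, so any factor that is constant within a cluster cannot drive the permuted statistics.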

TODO Describe how the permutation / clustering works, and potential applications (e.g. controlling for factors).

Alternate / full model association tests

For case/control data, this option performs a series of association tests beyond the basic allelic test:

plink --file mydata --model

     Chromosome
     SNP
     MAF
     Case(11) genotype count
     Case(12) genotype count
     Case(22) genotype count
     Control(11) genotype count
     Control(12) genotype count
     Control(22) genotype count
     Cochran-Armitage trend test chi-square
     Standard allele-based chi-square
     General model 2df chi-square
     Dominant model chi-square
     Recessive model chi-square

     Cochran-Armitage trend test p-value
     Standard allele-based p-value
     General model 2df p-value
     Dominant model p-value
     Recessive model p-value

     General versus allelic test p-value
     General versus dominant test p-value
     General versus recessive test p-value

     Flag (0/1) indicating whether genotypic test was valid
     Best model (G, M, D, R, X=invalid)
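As a hedged illustration of the chi-square columns listed above, the Python sketch below computes the five statistics from (11, 12, 22) genotype counts using standard textbook formulas. It is not PLINK's code: the function names and counts are hypothetical, the trend test uses weights 0, 1, 2, "dominant" here groups 11 and 12 against 22 (and "recessive" 11 against 12 and 22), and no validity checks for sparse cells are made.

```python
def pearson_chi_square(table):
    """Pearson chi-square for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi_sq += (observed - expected) ** 2 / expected
    return chi_sq

def trend_chi_square(case, ctrl, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) with linear weights."""
    n_cases, n_ctrls = sum(case), sum(ctrl)
    total = n_cases + n_ctrls
    geno_totals = [a + u for a, u in zip(case, ctrl)]
    sum_wr = sum(w * r for w, r in zip(weights, case))
    sum_wn = sum(w * n for w, n in zip(weights, geno_totals))
    sum_w2n = sum(w * w * n for w, n in zip(weights, geno_totals))
    numerator = total * (total * sum_wr - n_cases * sum_wn) ** 2
    denominator = n_cases * n_ctrls * (total * sum_w2n - sum_wn ** 2)
    return numerator / denominator

def model_tests(case, ctrl):
    """case and ctrl are (11, 12, 22) genotype counts."""
    c11, c12, c22 = case
    u11, u12, u22 = ctrl
    return {
        "trend": trend_chi_square(case, ctrl),
        "allelic": pearson_chi_square(
            [[2 * c11 + c12, c12 + 2 * c22], [2 * u11 + u12, u12 + 2 * u22]]),
        "genotypic": pearson_chi_square([list(case), list(ctrl)]),  # 2 df
        "dominant": pearson_chi_square([[c11 + c12, c22], [u11 + u12, u22]]),
        "recessive": pearson_chi_square([[c11, c12 + c22], [u11, u12 + u22]]),
    }

# Hypothetical genotype counts for 100 cases and 100 controls.
results = model_tests((30, 50, 20), (20, 50, 30))
for name, value in results.items():
    print(name, round(value, 4))
```

In this made-up example the heterozygote counts sit exactly between the two homozygotes, so the trend, allelic and genotypic statistics coincide while the dominant and recessive statistics are smaller.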

TODO Describe the output of this analysis

Quantitative trait association

If the phenotype (column 6 of the PED file or the phenotype as specified with the --pheno option) is quantitative (i.e. contains values other than 1, 2 or missing) then plink will automatically treat the analysis as a quantitative trait analysis.

plink --file mydata --assoc

will generate the file

plink.qassoc

with fields as follows:

     Col 1 : SNP name
     Col 2 : SNP chromosome
     Col 3 : # non-missing genotypes
     Col 4 : regression coefficient (beta)
     Col 5 : Var(beta)
     Col 6 : Regression r^2
     Col 7 : Likelihood ratio test (chi-sq, 1 df)
     Col 8 : Likelihood ratio test p-value
     Col 9 : Wald statistic (chi-sq, 1 df)
     Col 10 : Wald statistic p-value

as well as the standard

plink.assoc

which contains the same fields as for case/control association testing, except the odds ratio columns are not defined. The primary use of this second plink.assoc file is for empirical p-values (which are based on the Wald statistic).
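The quantities in plink.qassoc can be illustrated with an ordinary least-squares regression of phenotype on allele count. The Python sketch below uses standard formulas (the likelihood ratio statistic is taken as -n ln(1 - r^2) and the Wald statistic as beta^2 / Var(beta)); whether PLINK uses exactly these forms is an assumption, and the function name and data are hypothetical.

```python
import math

def chi_sq_1df_p(chi_sq):
    """Upper-tail p-value for a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(chi_sq / 2.0))

def qt_assoc(genotypes, phenotypes):
    """Simple linear regression of phenotype on allele count (0, 1, 2)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx
    residual_ss = syy - beta * sxy
    var_beta = residual_ss / (n - 2) / sxx
    r2 = sxy ** 2 / (sxx * syy)
    lrt = -n * math.log(1.0 - r2)   # assumed form of the likelihood ratio test
    wald = beta ** 2 / var_beta     # assumed form of the Wald statistic
    return {
        "n": n, "beta": beta, "var_beta": var_beta, "r2": r2,
        "lrt": lrt, "lrt_p": chi_sq_1df_p(lrt),
        "wald": wald, "wald_p": chi_sq_1df_p(wald),
    }

# Hypothetical data: allele counts and a quantitative phenotype.
genotypes = [0, 0, 1, 1, 2, 2]
phenotypes = [1.0, 1.2, 1.9, 2.1, 3.1, 2.9]
fit = qt_assoc(genotypes, phenotypes)
print(fit)
```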

Family-based association (TDT)

PLINK supports basic family-based association testing for disease traits, using the TDT and a variant that also incorporates parental phenotype information, parenTDT.

All families must be nuclear families (i.e. two generations); otherwise an error message will be given. To evaluate pedigree structure, break multi-generational families into nuclear family units, and enumerate different types of families, please use the famtypes program first.

plink --file mydata --tdt

This option will first perform a check for Mendel errors, report these errors and make the offending genotypes missing.

Output is sent to two files:

     plink.tdt.asym
     plink.tdt.perm

The first contains the basic transmitted/untransmitted allele counts for all markers, the chi-square statistics and asymptotic p-values. Also included is information about the parenTDT and a combined test.
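For reference, the standard TDT chi-square can be computed from the transmitted and untransmitted counts alone. The Python sketch below uses the usual McNemar-style formula (T - U)^2 / (T + U) with 1 df; the function name and counts are hypothetical, and the parenTDT and combined tests are not covered.

```python
import math

def tdt(transmitted, untransmitted):
    """Standard TDT: returns (chi-square with 1 df, asymptotic p-value).

    The counts are transmissions of allele 1 from heterozygous parents.
    """
    informative = transmitted + untransmitted
    chi_sq = (transmitted - untransmitted) ** 2 / informative
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return chi_sq, p_value

# Hypothetical counts: allele 1 transmitted 60 times, not transmitted 40 times.
chi_sq, p_value = tdt(60, 40)
print(chi_sq, p_value)
```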

TODO Add documentation.

parenTDT

plink --file mydata --parentdt1

plink --file mydata --parentdt2

TODO Add documentation.

Set-based tests

To perform gene-based/set-based sum-of-chi-squares tests:

plink --file mydata --assoc --set my.set

where the file my.set is in the form

     SET1
     rs1234
     rs28384
     rs29334
     END

     SET2
     rs4774
     rs662662
     rs77262
     END

     ...
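A set file in this layout is easy to generate or check programmatically. The following Python sketch parses the format shown above (set name, one SNP per line, END); the function name is hypothetical and the parser assumes well-formed input.

```python
def parse_set_file(text):
    """Parse 'NAME / SNP ... / END' blocks into {set name: [SNP IDs]}."""
    sets = {}
    current = None
    for raw_line in text.splitlines():
        token = raw_line.strip()
        if not token:
            continue                 # skip blank lines between sets
        if current is None:
            current = token          # first token of a block is the set name
            sets[current] = []
        elif token == "END":
            current = None           # close the current set
        else:
            sets[current].append(token)
    return sets

example = """
SET1
rs1234
rs28384
rs29334
END

SET2
rs4774
rs662662
rs77262
END
"""

sets = parse_set_file(example)
print(sets)
```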

The output for the set statistics is in the file

     plink.set-assoc

in the form:

	
     Set name (SET)
     Number of SNPs in set (S<1, S<2, etc) (S)
     nth most associated SNP (i.e. order of inclusion) (SNP)
     Average chi-square for set of n SNPs (T)
     Empirical p-value for average chi-square (p0)
     Empirical p-value corrected for all tests within this set (p1)
     Empirical p-value corrected for all tests in all sets (p2)

The p1 value for the S<1 statistic will be a statistic of natural interest -- the significance of the best hit in a gene, controlling for all other SNPs in that gene.
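The T column can be illustrated with a short Python sketch that ranks the single-SNP chi-squares within a set and averages the n largest, for n = 1 up to the number of SNPs. The function name and chi-square values are hypothetical, and the permutation step that yields p0, p1 and p2 is not shown.

```python
def set_statistics(chi_squares):
    """Return [(n, snp, T_n)] in order of inclusion.

    SNPs are ranked by decreasing chi-square; T_n is the average of the
    n largest single-SNP statistics in the set.
    """
    ranked = sorted(chi_squares.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running_sum = 0.0
    for n, (snp, chi_sq) in enumerate(ranked, start=1):
        running_sum += chi_sq
        rows.append((n, snp, running_sum / n))
    return rows

# Hypothetical single-SNP chi-squares for one set.
rows = set_statistics({"rs1234": 6.0, "rs28384": 2.0, "rs29334": 4.0})
for row in rows:
    print(row)
```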
