PSEQ association analysis
A number of approaches to testing phenotype-genotype associations are available in PLINK/Seq, using permutation or asymptotic statistics, with either single variants, genes or gene-sets as the unit of analysis. On this page we describe:
- Basic single site case/control tests
- Linear and logistic regression models for single variants.
- A collection of gene-based tests.
- A description of the approaches to permutation implemented, to obtain empirical significance values.
In general, we expect that methods for relating genotype to phenotype will be an area of future focus and expansion in PLINK/Seq, beyond the few methods described on this page.
Single variant association
To test each variant for association with a disease (dichotomous) outcome, use the command:
pseq proj v-assoc --phenotype my.phenotype
This assumes that my.phenotype is a dichotomous phenotype that exists in the INDDB. As for most commands, v-assoc can be combined with a mask.
In single VCF mode, rather than specifying a project file (proj) the VCF is directly specified. In this case, the --phenotype argument takes two parameters: the filename followed by the phenotype label:
pseq data.vcf.gz v-assoc --phenotype phenofile.txt my.phenotype
By default, the v-assoc command (expecting a dichotomous phenotype to be specified) uses Fisher's exact test to perform allelic, dominant and recessive single variant tests. To instead calculate empirical significance values based on a fixed number of permutations, add:
--perm 10000
To use adpative permutations:
--perm -1
The output is a tab-delimited file with one row per variant:
VAR Variant identifier REF Reference allele SAMPLES Sample tags for which site is called, space-delimited FILTER FILTERs for those sites, space-delimited VMETA Variant meta-information CONMETA Consensus sample-variant meta-information ALT Alternate allele(s), comma-delimited MAF Minor allele frequency HWE P-value from Hardy-Weinberg disequilibrium test (exact test) MINA Number of minor alleles in cases MINU Number of minor alleles in controls OBSA Number of non-null genotypes in cases OBSU Number of non-null genotypes in controls REFA Number of reference homozygotes in cases HETA Number of reference/alternate heterozygotes in cases HOMA Number of alternate/alternate homozygsotes in cases REFU As above, in controls HETU HOMU P p-value for single site association (allelic) OR Allelic odds ratio PDOM p-value, dominant model for alternate allele ORDOM Odds ratio under dominant model PREC Empirical p-valie, recessive model for alternate allele ORREC Odds ratio under recessive model
If any of the four cells are zero, the Woolf-Haldane correction is applied (adding 0.5 to all cells) in order to calculate an odds ratio.
Linear and logistic regression tests
To run a linear or logistic regression on each single variant, use the glm command:
pseq proj glm --phenotype phe1 --covar mds1 mds2
The type of test will depend on the phenotype (quantitative trait or dichotomous disease trait). Currently only asymptotic p-values are given. Naturally, for rare alleles these analyses will be under-powered and/or unreliable: a mask options such as maf=0.02 should probably be routinely applied.
VAR Variant identifier REF Reference allele ALT Alternate allele(s) N Number of individuals included in analysis F Frequency of the alternate allele(s) F_A For logistic models, case frequency F_U For logistic models, control frequency OR For logistic models, odds ratio BETA For linear models, regression coefficient SE Standard error of estimate STAT Test statistic (Z or T) P Asymptotic p-value
The coefficients for covariates are not displayed.
Gene/group-based association tests
To perform set-based association (where the set may be a gene, pathway, or any other entity), use the command:
pseq proj assoc --phenotype my.phenotype --mask loc.group=refseq
The assoc command currently assumes a dichotomous phenotype. By default, this command uses adaptive permutation. To use a fixed number per gene/region add:
--perm 10000
Other filters can be specified with the mask here, although a grouping factor must be present -- refseq in this case, which refers to a group in the LOCDB. Typically, given a focus on rare variants, the mask will filter on minor allele count or frequency:
pseq proj assoc --phenotype phe1 --mask loc.group=refseq mac=2-20
LOCUS POS ALIAS NVAR TEST P I DESC NM_000021 chr14:72742931..72742931 CCDS9812.1,PSEN1 1 BURDEN 0.538462 0.153846 3/2 NM_000043 chr10:90752918..90758660 CCDS7393.1,FAS 3 BURDEN 0.6 0.2 8/9 NM_000055 chr3:166973982..167031223 CCDS3198.1,BCHE 6 BURDEN 0.833333 0.5 4/9 NM_000077 chr9:21960916..21960916 CCDS6510.1,CDKN2A 1 BURDEN 0.714286 0.285714 2/4 NM_000112 chr5:149337637..149341438 CCDS4300.1,SLC26A2 9 BURDEN 0.0555556 0.008547 14/4 NM_000113 chr9:131620904..131620904 CCDS6930.1,TOR1A 1 BURDEN 0.714286 0.714286 0/1 ...
The NVAR field is the number of variants (meeting the mask). The default test is a simple comparison of the BURDEN of case to control minor alleles. The P field is based on permutation, the empirical significance. The I field indicates the proportion of null replicates for which the best test statistic was tied. That is, this is a measure of the discreteness of the empirical distribution of the test statistic, and effectively gives a sense of the smallest possible empirical p-value. In other words, for a gene with only a single singleton variant, in a sample of equal cases and controls, half the time you'd expect to see the variant in a case (under null permutations of the data) and half the time in a control. Thus I would be 0.5. Filtering out tests with very high I values is one way to achieve a set of tests for which the distribution of p-values will be more interpretable. This is comparable to removing SNPs with very low minor allele frequencies (e.g. under 1%) in GWAS.
There are a number of other gene-based tests available in PSEQ (and more being added). Currently available tests include, that are included with the --tests argument:
Output label | Statistic | keyword for --tests | Directionality |
---|---|---|---|
BURDEN | Excess of rare alleles in cases compared to controls | no-burden (test included by default, use to remove) | 1-sided |
UNIQ | Count of case-unique rare alleles | uniq | 1-sided |
VT | Variable threshold test, Price et al, 2010 | vt | 1-sided |
FQRWGT | Frequency-weighted test, in spirit of Madsen-Browning | fw | 1-sided |
CALPHA | C-alpha test, Neale et al, 2011 | calpha | 2-sided |
SUMSTAT | Sum of single-site statistics | sumstat | 2-sided |
Multiple tests can be added at the same time:
--tests calpha vt
For the BURDEN test, the DESC field contains the number of case/control minor alleles. For UNIQ, it contains the number of case-only non-reference alleles in cases/controls. If the C-alpha test is performed, a frequency breakdown is given in DESC. This indicates the number of variants with a particular case/control count of the minor allele, e.g.
0/1=3;1/0=2;29/25=1
means 3 control-only singletons, 2 case-only singletons and 1 common variant (29 case alleles, 25 control alleles) were observed.
Additional tests: A large number of other gene-based have and are being developed and we plan to implement a proportion of these. In collaboration with the authors of the respective packages, three tests that have been incorporated in the development version of PSEQ and will be released soon, pending a round of testing:
- Step-up test, Hoffman, Marini and Witte (2010)
- CMC test, Li & Leal (2008)
- KBAC test, Lui & Leal (2011)
Collaboration? If you are the author of an appropriate test and would like to discuss ways to implement it within the PLINK/Seq framework, please contact us.
Permutation
Most association tests are currently based on permutation to assess empirical p-values, invoked either by default (for gene-based tests) or via the --perm option. A few important points to note:
Maintaining correlation between tests: where possible, for each entity tested (variant, gene, pathway) an identical set of permutations is performed (label-swapping of phenotype across individuals). As such, this will preserve the pattern of correlation in test statistics (due to LD) across sites.
Stratified permutation: if an apprpriate factor is present in the INDDB, one can performed stratified permutation (only label-swapping phenotype within strata) as follows:
--strata batch
(assuming that batch has been uploaded to the INDDB). Individuals who are missing for the stratifying variable will not be permuted.
Currently no support for family-data: permutation (and all currently implemented tests) are focused on samples of unrelated individuals. Development of family-based analyses and procedures will likely be an active area of future development.
Adaptive permutations: the number of permutations for each test can be specified:
--perm 10000
It is also possible to use adaptive permutations, in which a test is stopped early if it appears that the ultimate empirical p-value will not be highly significant:
--perm -1
Permutation and missing data: by default, permutation is performed without respect to missing genotype data. (By default, any individual with missing phenotype data will be removed prior to initiating the association test.) To permute phenotype labels only within individuals with no missing genotype data for that variant (or group of variants, in gene-based tests) add the argument:
--fix-null
Experiment-wide multiple-test correction: experiment-wide p-values can be calculated at the end of the analysis. By default this option is disabled in the current release.