1. Introduction
2. Basic information
3. Download and general notes
|
A PLINK tutorialIn this tutorial, we will consider using PLINK to analyse example data: randomly selected genotypes (approximately 80,000 autosomal SNPs) from the 89 Asian HapMap individuals. A phenotype has been simulated based on the genotype at one SNP. In this tutorial, we will walk through using PLINK to work with the data, using a range of features: data management, summary statistics, population stratification and basic association analysis. NOTE These data do not, of course, represent a realistic study design or a realistic disease model. The point of this exercise is simply to get used to running PLINK.89 HapMap samples and 80K random SNPsThe first step is to obtain a working copy of PLINK and of the example data files.
Two phenotypes were generated: a quantitative triat and a disease trait (affection status, coded 1=unaffected, 2=affected), based on a median split of the quantitative trait. The quantitative trait was generated as a function of three simple components:
Chinese Japanese Control 34 11 Case 11 33which shows a strong relationship between these two variables. The next table shows the association between the variant rs2222162 and disease: Genotype 11 12 22 Control 17 22 6 Case 3 19 22Again, the strong association is clear. Note that the alleles have been recoded as 1 and 2 (this is not necessary for PLINK to work, however -- it can accept any coding for SNPs). In summary, we have a single causal variant that is associated with disease. Complicating factors are that this variant is one of 83534 SNPs and also that there might be some degree of confounding of the SNP-disease associations due to the subpopulation-disease association -- i.e. a possibility that population stratification effects will exist. Even though we might expect the two subpopulations to be fairly similar from an overall genetic perspective, and even though the sample size is small, this might still lead to an increase in false positive rates if not controlled for. We will use the affection status variable as the default variable for analysis (i.e. the sixth column in the PED file). The quantitative trait is in a separate alternate phenotype file, qt.phe. The file pop.phe contains a dummy phenotype that is coded 1 for Chinese individuals and 2 for Japanese individuals. We will use this in investigating between-population differences. You can view these alternate phenotype files in any text editor. In this tutorial dataset we focus on autosomal SNPs for simplicity, although PLINK does provide support for X and Y chromosome SNPs for a number of analyses. See the main documentation for further information. Using PLINK to analyse these dataThis tutorial is intended to introduce some of PLINK's features rather than provide exhaustive coverage of them. Futhermore, it is not intended as an analysis plan for whole genome data, or to represent anything close to 'best practice'. These hyperlinks show an overview of topics:
plink --file hapmap1The --file option takes a single parameter, the root of the input file names, and will look for two files: a PED file and a MAP file with this root name. In otherwords, --file hapmap1 implies hapmap1.ped and hapmap1.map should exist in the current directory. HINT! It is possible to specify files outside of the current directory, and to have the PED and MAP files have different root names, or not end in .ped and .map, by using the --ped and --map options. PED and MAP files are plain text files; PED files contain genotype information (one person per row) and MAP files contain information on the name and position of the markers in the PED file. If you are not familiar with the file formats required for PED and MAP files, please consult this page. The above command should generate something like the following output in the console window. It will also save this information to a file called plink.log.@----------------------------------------------------------@ | PLINK! | v0.99l | 27/Jul/2006 | |----------------------------------------------------------| | (C) 2006 Shaun Purcell, GNU General Public License, v2 | |----------------------------------------------------------| | http://pngu.mgh.harvard.edu/purcell/plink/ | @----------------------------------------------------------@ Web-based version check ( --noweb to skip ) Connecting to web... OK, v0.99l is current *** Pre-Release Testing Version *** Writing this text to log file [ plink.log ] Analysis started: Mon Jul 31 09:00:11 2006 Options in effect: --file hapmap1 83534 (of 83534) markers to be included from [ hapmap1.map ] 89 individuals read from [ hapmap1.ped ] 89 individuals with nonmissing phenotypes Assuming a binary trait (1=unaff, 2=aff, 0=miss) Missing phenotype value is also -9 Before frequency and genotyping pruning, there are 83534 SNPs Applying filters (SNP-major mode) 89 founders and 0 non-founders found 0 of 89 individuals removed for low genotyping ( MIND > 0.1 ) 859 SNPs failed missingness test ( GENO > 0.1 ) 16994 SNPs failed frequency test ( MAF < 0.01 ) After frequency and genotyping pruning, there are 65803 SNPs Analysis finished: Mon Jul 31 09:00:19 2006The information contained here can be summarized as follows:
plink --file hapmap1 --make-bed --out hapmap1If it runs correctly on your machine, you should see the following in your output:above as before ... Before frequency and genotyping pruning, there are 83534 SNPs Applying filters (SNP-major mode) 89 founders and 0 non-founders found 0 SNPs failed missingness test ( GENO > 1 ) 0 SNPs failed frequency test ( MAF < 0 ) After frequency and genotyping pruning, there are 83534 SNPs Writing pedigree information to [ hapmap1.fam ] Writing map (extended format) information to [ hapmap1.bim ] Writing genotype bitfile to [ hapmap1.bed ] Using (default) SNP-major mode Analysis finished: Mon Jul 31 09:10:05 2006There are several things to note:
plink --file hapmap1 --make-bed --mind 0.05 --out highgenowhich would create fileshighgeno.bed highgeno.bim highgeno.famWorking with the binary PED file To specify that the input data are in binary format, as opposed to the normal text PED/MAP format, just use the --bfile option instead of --file. To repeat the first command we ran (which just loads the data and prints some basic summary statistics): plink --bfile hapmap1Writing this text to log file [ plink.log ] Analysis started: Mon Jul 31 09:12:08 2006 Options in effect: --bfile hapmap1 Reading map (extended format) from [ hapmap1.bim ] 83534 markers to be included from [ hapmap1.bim ] Reading pedigree information from [ hapmap1.fam ] 89 individuals read from [ hapmap1.fam ] 89 individuals with nonmissing phenotypes Reading genotype bitfile from [ hapmap1.bed ] Detected that binary PED file is v1.00 SNP-major mode Before frequency and genotyping pruning, there are 83534 SNPs Applying filters (SNP-major mode) 89 founders and 0 non-founders found 0 of 89 individuals removed for low genotyping ( MIND > 0.1 ) 859 SNPs failed missingness test ( GENO > 0.1 ) 16994 SNPs failed frequency test ( MAF < 0.01 ) After frequency and genotyping pruning, there are 65803 SNPs Analysis finished: Mon Jul 31 09:12:10 2006The things to note here:
plink --bfile hapmap1 --missing --out miss_statwhich should generate the following output:... 0 of 89 individuals removed for low genotyping ( MIND > 0.1 ) Writing individual missingness information to [ miss_stat.imiss ] Writing locus missingness information to [ miss_stat.lmiss ] ...Here we see that no individuals were removed for low genotypes (MIND > 0.1 implies that we accept people with less than 10 percent missingness). The per individual and per SNP (after excluding individuals on the basis of low genotyping) rates are then output to the files miss_stat.imiss and miss_stat.lmiss respectively. If we had not specified an --out option, the root output filename would have defaulted to "plink". These output files are standard, plain text files that can be viewed in any text editor, pager, spreadsheet or statistics package (albeit one that can handle large files). Taking a look at the file miss_stat.lmiss, for example using the more command which is present on most systems: more miss_stat.lmisswe seeCHR SNP N_MISS F_MISS 1 rs6681049 0 0 1 rs4074137 0 0 1 rs7540009 0 0 1 rs1891905 0 0 1 rs9729550 0 0 1 rs3813196 0 0 1 rs6704013 2 0.0224719 1 rs307347 12 0.134831 1 rs9439440 2 0.0224719 ...That is, for each SNP, we see the number of missing individuals (N_MISS) and the proportion of individuals missing (F_MISS). Similarly: more miss_stat.imisswe seeFID IID MISS_PHENO N_MISS F_MISS HCB181 1 N 671 0.00803266 HCB182 1 N 1156 0.0138387 HCB183 1 N 498 0.00596164 HCB184 1 N 412 0.00493212 HCB185 1 N 329 0.00393852 HCB186 1 N 1233 0.0147605 HCB187 1 N 258 0.00308856 ...The final column is the actual genotyping rate for that individual -- we see the genotyping rate is very high here. HINT If you are using a spreadsheet package that can only display a limited number of rows (some popular packages can handle just over 65,000 rows) then it might be desirable to ask PLINK to analyse the data by chromosome, using the --chr option. For example, to perform the above analysis for chromosome 1: plink --bfile hapmap1 --chr 1 --out res1 --missingthen for chromosome 2:plink --bfile hapmap1 --chr 2 --out res2 --missingand so on. Summary statistics: allele frequencies Next we perform a similar analysis, except requesting allele frequencies instead of genotyping rates. The following command generates a file called freq_stat.frq which contains the minor allele frequency and allele codes for each SNP.plink --bfile hapmap1 --freq --out freq_statIt is also possible to perform this frequency analysis (and the missingness analysis) stratified by a categorical, cluster variable. In this case, we shall use the file that indicates whether the individual is from the Chinese or the Japanese sample, pop.phe. This cluster file contains three columns; each row is an individual. The format is described more fully in the main documentation. To perform a stratified analysis, use the --within option.plink --bfile hapmap1 --freq --within pop.phe --out freq_statThe output will now indicate that a file called freq_stat.frq.strat. has been generated instead of freq_stat.frq. If we view this file:more freq_stat.frq.stratwe see each row is now the allele frequency for each SNP stratifed by subpopulation:CHR SNP CLST A1 A2 MAF 1 rs6681049 1 1 2 0.233333 1 rs6681049 2 1 2 0.193182 1 rs4074137 1 1 2 0.1 1 rs4074137 2 1 2 0.0568182 1 rs7540009 1 0 2 0 1 rs7540009 2 0 2 0 1 rs1891905 1 1 2 0.411111 1 rs1891905 2 1 2 0.397727 ...Here we see that each SNP is represented twice - the CLST column indicates whether the frequency is from the Chinese or Japanese populations, coded as per the pop.phe file. If you were just interested in a specific SNP, and wanted to know what the frequency was in the two populations, you can use the --snp option to select this SNP: plink --bfile hapmap1 --snp rs1891905 --freq --within pop.phe --out snp1_frq_statwould generate a file snp1_frq_stat.frq.strat containing only the population-specific frequencies for this single SNP. You can also specify a range of SNPs by adding the --window kb option or using the options --from and --to, following each with a different SNP (they must be in the correct order and be on the same chromosome). Follow this link for more details. Basic association analysis Let's now perform a basic association analysis on the disease trait for all single SNPs. The basic command isplink --bfile hapmap1 --assoc --out as1which generates an output file as1.assoc which contains the following fieldsCHR SNP A1 F_A F_U A2 CHISQ P OR 1 rs6681049 1 0.1591 0.2667 2 3.067 0.07991 0.5203 1 rs4074137 1 0.07955 0.07778 2 0.001919 0.9651 1.025 1 rs1891905 1 0.4091 0.4 2 0.01527 0.9017 1.038 1 rs9729550 1 0.1705 0.08889 2 2.631 0.1048 2.106 1 rs3813196 1 0.03409 0.02222 2 0.2296 0.6318 1.553 1 rs12044597 1 0.5 0.4889 2 0.02198 0.8822 1.045 1 rs10907185 1 0.3068 0.2667 2 0.3509 0.5536 1.217 1 rs11260616 1 0.2326 0.2 2 0.2754 0.5998 1.212 1 rs745910 1 0.1395 0.1932 2 0.9013 0.3424 0.6773 ...where each row is a single SNP association result. The fields are:
sort --key=7 -nr as1.assoc | headwould give13 rs9585021 1 0.625 0.2841 2 20.62 5.586e-06 4.2 2 rs2222162 1 0.2841 0.6222 2 20.51 5.918e-06 0.2409 9 rs10810856 1 0.2955 0.04444 2 20.01 7.723e-06 9.016 2 rs4675607 1 0.1628 0.4778 2 19.93 8.05e-06 0.2125 2 rs4673349 1 0.1818 0.5 2 19.83 8.485e-06 0.2222 2 rs1375352 1 0.1818 0.5 2 19.83 8.485e-06 0.2222 21 rs219746 1 0.5 0.1889 2 19.12 1.228e-05 4.294 1 rs4078404 2 0.5 0.2 1 17.64 2.667e-05 4 14 rs1152431 2 0.2727 0.5795 1 16.94 3.862e-05 0.2721 14 rs4899962 2 0.3023 0.6111 1 16.88 3.983e-05 0.2758Here we see that the simulated disease variant rs2222162 is actually the second most significant SNP in the list, with a large difference in allele frequencies of 0.28 in cases versus 0.62 in controls. However, we also see that, just by chance, a second SNP on chromosome 13 shows a slightly higher test result, with coincidentally similar allele frequencies in cases and controls. (Whether this result is due to chance alone or perhaps represents some confounding due to the population structure in this sample, we will investigate below). This highlights the important point that when performing so many tests, particularly in a small sample, we often expect the distribution of true positive results to be virtually indistinguishable from the best false positive results. That our variant appears in the top ten list is reassuring however. To get a sorted list of association results, that also includes a range of significance values that are adjusted for multiple testing, use the --adjust flag: plink --bfile hapmap1 --assoc --adjust --out as2This generates the file as2.assoc.adjust in addition to the basic as2.assoc output file. Using more, one can easily look at one's most significant associations:more as2.assoc.adjustedCHR SNP UNADJ GC BONF HOLM SIDAK_SS SIDAK_SD FDR_BH FDR_BY 13 rs9585021 5.586e-06 3.076e-05 0.3676 0.3676 0.3076 0.3076 0.09306 1 2 rs2222162 5.918e-06 3.231e-05 0.3894 0.3894 0.3226 0.3226 0.09306 1 9 rs10810856 7.723e-06 4.049e-05 0.5082 0.5082 0.3984 0.3984 0.09306 1 2 rs4675607 8.05e-06 4.195e-05 0.5297 0.5297 0.4112 0.4112 0.09306 1 2 rs1375352 8.485e-06 4.386e-05 0.5584 0.5583 0.4279 0.4278 0.09306 1 2 rs4673349 8.485e-06 4.386e-05 0.5584 0.5583 0.4279 0.4278 0.09306 1 21 rs219746 1.228e-05 6.003e-05 0.8083 0.8082 0.5544 0.5543 0.1155 1 1 rs4078404 2.667e-05 0.0001159 1 1 0.8271 0.827 0.2194 1 14 rs1152431 3.862e-05 0.0001588 1 1 0.9213 0.9212 0.2621 1 14 rs4899962 3.983e-05 0.000163 1 1 0.9273 0.9272 0.2621 1 8 rs2470048 4.487e-05 0.0001804 1 1 0.9478 0.9478 0.2684 1Here we see a pre-sorted list of association results. The fields are as follows:
Genomic inflation factor (based on median chi-squared) is 1.18739 Mean chi-squared statistic is 1.14813These values would actually suggest that although no very strong stratification exists, there is perhaps a hint of an increased false positive rate, as both values are greater than 1.00. HINT The adjusted significance values that control for multiple testing are, by default, based on the unadjusted significance values. If the flag --gc is specified as well as --adjust then these adjusted values will be based on the genomic-control significance value instead. In this particular instance, where we already know about the Chinese/Japanese subpopulations, it might be of interest to directly look at the inflation factor that results from having population membership as the phenotype in a case/control analysis, just to provide extra information about the sample. That is, running the command using the alternate phenotype option (i.e. replacing the disease phenotype with the one in pop.phe, which is actually subpopulation membership): plink --bfile hapmap1 --pheno pop.phe --assoc --adjust --out as3we see that testing for frequency differences between Chinese and Japanese individuals, we do see some departure from the null distribution:Genomic inflation factor (based on median chi-squared) is 1.72519 Mean chi-squared statistic is 1.58537That is, the inflation factor of 1.7 represents the maximum possible inflation factor if the disease were perfectly correlated with subpopulation that could arise from the Chinese/Japanese split in the sample (this does not account for any possible within-subpopulation structure, of course, that might also increase SNP-disease false positive rates). We will return to this issue below, when we consider using the whole genome data to detect stratification more directly. Genotypic and other association models We can calculate association statistics based on the 2-by-3 genotype table as well as the standard allelic test; we can also calculate tests that assume dominant or recessive action of the minor allele; finally, we can perform the Cochran-Armitage trend test instead of the basic allelic test. All these tests are performed with the single command --model. Just as the --assoc command, this can be easily applied to all SNPs. In this case, let's just run it for our SNP of interest, rs2222162 plink --bfile hapmap1 --model --snp rs2222162 --out mod1This generates the file mod1.model which has more than one row per SNP, representing the different tests performed for each SNP. The format of this file is described here. The tests are the basic allelic test, the Cochran-Armitage trend test, dominant and recessive models and a genotypic test. All test statistics are distributed as chi-squared with 1 df under the null, with the exception of the genotypic test which has 2 df. But there is a problem here: in this particular case, running the basic model command will not produce values for the genotypic tests. This is because, by default, every cell in the 2-by-3 table is required to have at least 5 observations, which does not hold here. This default can be changed with the --cell option. This option is followed by the minimum number of counts in each cell of the 2-by-3 table required before these extended analyses are performed. For example, to force the genotypic tests for this particular SNP just for illustrative purposes, we need to run:plink --bfile hapmap1 --model --cell 0 --snp rs2222162 --out mod2and now the genotypic tests will also be calculated, as we set the minimum number in each cell to 0. We see that the genotype counts in affected and unaffected individuals areCHR SNP TEST AFF UNAFF CHISQ DF P 2 rs2222162 GENO 3/19/22 17/22/6 19.15 2 6.932e-05 2 rs2222162 TREND 25/63 56/34 19.15 1 1.207e-05 2 rs2222162 ALLELIC 25/63 56/34 20.51 1 5.918e-06 2 rs2222162 DOM 22/22 39/6 13.87 1 0.0001958 2 rs2222162 REC 3/41 17/28 12.24 1 0.0004679which, reassuringly, match the values presented in the table above, which were generated when the trait was simulated. Looking at the other test statistics, we see all are highly significant (as would be expected for a strong, common, allelic effect) although the allelic test has the most significant p-value. This makes sense, as the data were essentially simulated under an allelic (dosage) model. Stratification analysis The analyses so far have ignored the fact that our sample consists of two similar, but distinct subpopulations, the Chinese and Japanese samples. In this particular case, we already know that the sample consists of these two groups; we also know that the disease is more prevalent in one of the groups. More generally, we might not know anything of potential population substructure in our sample upfront. One way to address such issues is to use whole genome data to cluster individuals into homogeneous groups. There are a number of options and different ways of performing this kind of analysis in PLINK and we will not cover them all here. For illustrative purposes, we shall perform a cluster analysis that pairs up individuals on the basis of genetic identity. The command, which may take a number of minutes to run, is: plink --bfile hapmap1 --cluster --mc 2 --ppc 0.05 --out str1which requests IBS clustering (--cluster) but with the constraints that each cluster has no more than two individuals (--mc 2) and that any pair of individuals who have a significance value of less than 0.05 for the test of whether or not the two individuals belong to the same population based on the available SNP data are not merged. These options and tests are described in further detail in the relevant main documentation. We see the following output in the log file and on the console:... Clustering individuals based on genome-wide IBS Merge distance p-value constraint = 0.05 Of these, 3578 are pairable based on constraints Writing cluster progress to [ plink.cluster0 ] Cannot make clusters that satisfy constraints at step 45 Writing cluster solution (1) [ str1.cluster1 ] Writing cluster solution (2) [ str1.cluster2 ] Writing cluster solution (3) [ str1.cluster3 ] ...which indicate that IBS-clustering has been performed. These files are described in the main documentation. The file str1.cluster1 contains the results of clustering in a format that is easy to read: more str1.cluster1SOL-0 HCB181_1 JPT260_1 SOL-1 HCB182_1 HCB225_1 SOL-2 HCB183_1 HCB193_1 SOL-3 HCB184_1 HCB202_1 SOL-4 HCB185_1 HCB217_1 SOL-5 HCB186_1 HCB196_1 SOL-6 HCB187_1 HCB213_1 SOL-7 HCB188_1 HCB194_1 SOL-8 HCB189_1 HCB192_1 SOL-9 HCB190_1 HCB224_1 SOL-10 HCB191_1 HCB220_1 SOL-11 HCB195_1 HCB204_1 SOL-12 HCB197_1 HCB211_1 SOL-13 HCB198_1 HCB210_1 SOL-14 HCB199_1 HCB221_1 SOL-15 HCB200_1 HCB218_1 SOL-16 HCB201_1 HCB206_1 SOL-17 HCB203_1 HCB222_1 SOL-18 HCB205_1 HCB208_1 SOL-19 HCB207_1 HCB223_1 SOL-20 HCB209_1 HCB219_1 SOL-21 HCB212_1 HCB214_1 SOL-22 HCB215_1 HCB216_1 SOL-23 JPT226_1 JPT242_1 SOL-24 JPT227_1 JPT240_1 SOL-25 JPT228_1 JPT244_1 SOL-26 JPT229_1 JPT243_1 SOL-27 JPT230_1 JPT267_1 SOL-28 JPT231_1 JPT236_1 SOL-29 JPT232_1 JPT247_1 SOL-30 JPT233_1 JPT248_1 SOL-31 JPT234_1 JPT255_1 SOL-32 JPT235_1 JPT264_1 SOL-33 JPT237_1 JPT250_1 SOL-34 JPT238_1 JPT241_1 SOL-35 JPT239_1 JPT253_1 SOL-36 JPT245_1 JPT262_1 SOL-37 JPT246_1 JPT263_1 SOL-38 JPT249_1 JPT258_1 SOL-39 JPT251_1 JPT252_1 SOL-40 JPT254_1 JPT261_1 SOL-41 JPT256_1 JPT265_1 SOL-42 JPT257_1 SOL-43 JPT259_1 JPT269_1 SOL-44 JPT266_1 JPT268_1Here we see that all but one pair are concordant for being Chinese or Japanese (the IDs for each individual are in this case coded to represent which HapMap subpopulation they belong to: HCB and JPT. Note, these do not represent any official HapMap coding/ID schemes -- I've used them purely to make it clear which population each individual belongs to). We see that one individual was not paired with anybody else, as there is an odd numbered of subjects overall. This individual would not contribute to any subsequent association testing that conditions on this cluster solution. We also see that the Japanese individual JPT260_1 paired with a Chinese individual HCB181_1 rather than JPT257_1. Clearly, this means that HCB181_1 and JPT260_1 do not differ significantly based on the test we performed: this test will have limited power to distinguish individuals from very similar subpopulations, alternatively, one of these individuals could be of mixed ancestry. In any case, it is interesting that JPT260_1 was not paired with JPT257_1 instead. Further inspection of the data actually reveal that JPT257_1 is somewhat unusual, having very long stretches of homozygous genotypes (use the --homozyg-kb and --homozyg-snp options) are a high inbreeding coefficient, which probably explain why this individual was not considered similar to the other Japanese individuals by this algorithm. Note By using the --genome option, it is possible to examine the significance tests for all pairs of individuals, as described in the main documentation. Association analysis, accounting for clusters After having performed the above matching based on genome-wide IBS, we can now perform the association test conditional on the matching. For this, the relevant file is the str1.cluster2 file, which contains the same information as str1.cluster1 but in the format of a cluster variable file, that can be used in conjunction with the --within option. For this matched analysis, we shall use the Cochran-Mantel-Haenszel (CMH) association statistic, which tests for SNP-disease association conditional on the clustering supplied by the cluster file; we will also include the --adjust option to get a sorted list of CMH association results: plink --bfile hapmap1 --mh --within str1.cluster2 --adjust --out aac1The relevant lines from the log are:... Reading clusters from [ str1.cluster2 ] 89 of 89 individuals assigned to 45 cluster(s) ... Cochran-Mantel-Haenszel 2x2xK test, K = 45 Writing results to [ aac1.cmh ] Computing corrected significance values (FDR, Sidak, etc) Genomic inflation factor (based on median chi-squared) is 1.03878 Mean chi-squared statistic is 0.988748 Writing multiple-test corrected significance values to [ aac1.cmh.adjusted ] ...We see that PLINK has correctly assigned the individuals to 45 clusters (i.e. one of these clusters is of size 1, all others are pairs) and then performs the CMH test. The genomic control inflation factors are now reduced to essentially 1.00, which is consistent with the idea that there was some substructure inflating the distribution of test statistics in the previous analysis. Looking at the adjusted results file: more aac1.cmh.adjustedCHR SNP UNADJ GC BONF HOLM SIDAK_SS SIDAK_SD FDR_BH FDR_BY 2 rs2222162 1.906e-06 2.963e-06 0.1241 0.1241 0.1167 0.1167 0.1241 1 2 rs4673349 6.436e-06 9.577e-06 0.4192 0.4191 0.3424 0.3424 0.1261 1 2 rs1375352 6.436e-06 9.577e-06 0.4192 0.4191 0.3424 0.3424 0.1261 1 13 rs9585021 7.744e-06 1.145e-05 0.5043 0.5043 0.3961 0.3961 0.1261 1 2 rs4675607 5.699e-05 7.845e-05 1 1 0.9756 0.9756 0.7423 1 13 rs9520572 0.0002386 0.0003122 1 1 1 1 0.9017 1 11 rs992564 0.00026 0.0003392 1 1 1 1 0.9017 1 12 rs1290910 0.0002738 0.0003566 1 1 1 1 0.9017 1 4 rs1380899 0.0002747 0.0003577 1 1 1 1 0.9017 1 14 rs1190968 0.0002747 0.0003577 1 1 1 1 0.9017 1 8 rs951702 0.0003216 0.0004164 1 1 1 1 0.9017 1 ...Here we see that the "disease" variant, rs2222162 has moved from being number 2 in the list to number 1, although it is still not significant after genome-wide correction. In this last example, we specifically requested that PLINK pair up the most similar individuals. We can also perform the clustering, but with fewer, or different, constraints on the final solution. For example, here we do not impose a maximum cluster size: rather we request that each cluster contains at least 1 case and 1 control (i.e. so that it is informative for association) with the --cc option, and specify a threshold of 0.01 for --ppc: plink --bfile hapmap1 --cluster --cc --ppc 0.01 --out version2which generates the following final solution (version2.cluster1):SOL-0 HCB181_1(1) HCB189_1(1) HCB198_1(1) HCB210_1(2) HCB222_1(1) HCB203_1(1) HCB196_1(1) HCB183_1(2) HCB195_1(1) HCB185_1(1) HCB187_1(1) HCB215_1(2) HCB216_1(1) SOL-1 HCB182_1(1) HCB186_1(1) HCB207_1(2) HCB223_1(1) HCB194_1(1) HCB188_1(1) HCB199_1(1) HCB221_1(2) HCB225_1(1) HCB217_1(1) HCB190_1(1) HCB202_1(1) HCB224_1(2) HCB201_1(2) HCB206_1(1) HCB208_1(1) HCB209_1(1) HCB213_1(1) HCB212_1(1) HCB214_1(2) SOL-2 HCB184_1(1) HCB219_1(2) HCB218_1(1) HCB200_1(1) HCB191_1(2) HCB220_1(1) HCB197_1(1) HCB211_1(2) HCB192_1(1) HCB204_1(1) JPT255_1(2) HCB193_1(1) JPT245_1(2) SOL-3 HCB205_1(1) JPT264_1(2) JPT253_1(1) JPT258_1(2) JPT228_1(1) JPT244_1(2) JPT238_1(2) JPT269_1(2) JPT242_1(2) JPT234_1(2) JPT265_1(1) JPT230_1(2) JPT262_1(1) JPT267_1(2) JPT231_1(1) JPT239_1(2) JPT263_1(2) JPT260_1(2) SOL-4 JPT226_1(1) JPT251_1(2) JPT240_1(2) JPT227_1(2) JPT232_1(2) JPT235_1(2) JPT237_1(2) JPT250_1(1) JPT246_1(2) JPT229_1(2) JPT243_1(1) JPT266_1(2) JPT252_1(2) JPT249_1(2) JPT233_1(1) JPT248_1(2) JPT241_1(2) JPT254_1(1) JPT261_1(2) JPT259_1(2) JPT236_1(2) JPT256_1(1) JPT247_1(2) JPT268_1(2) JPT257_1(2)The lines have been wrapped for clarity of reading here: normally, an entire cluster file is on a single line. Also note that the phenotype has been added in parentheses after each family/individual ID (as the --cc option was used). Here we see that the resulting clusters have largely separated Chinese and Japanese individuals into different clusters. The clustering results in a five class solution based on the --ppc constraint -- i.e. clearly, to merge any of these five clusters would have involved merging two individuals who are different at the 0.01 level, and this is why the clustering stopped at this point (as opposed to merging everybody, ultimately arriving at a 1-class solution). Based on this alternate clustering scheme, we can repeat our association analysis. Note This is not necessarily how actual analysis of real data should be conducted, of course, i.e. by trying different analyses, clusters, etc, until one finds the most significant result... The point of this is just to show what options are available. plink --bfile hapmap1 --mh --within version2.cluster2 --adjust --out aac2Now the log file records that five clusters were found, and a low inflation factor:... Cochran-Mantel-Haenszel 2x2xK test, K = 5 Writing results to [ aac2.cmh ] Computing corrected significance values (FDR, Sidak, etc) ... Genomic inflation factor (based on median chi-squared) is 1.01489 Mean chi-squared statistic is 0.990643 ...Looking at aac2.cmh.adjusted, we now see that the disease SNP is genome-wide significant: CHR SNP UNADJ GC BONF HOLM SIDAK_SS SIDAK_SD FDR_BH FDR_BY 2 rs2222162 8.313e-10 1.104e-09 5.47e-05 5.47e-05 5.47e-05 5.47e-05 5.47e-05 0.0006384 13 rs9585021 2.432e-06 2.882e-06 0.16 0.16 0.1479 0.1479 0.06931 0.809 2 rs4675607 3.16e-06 3.731e-06 0.2079 0.2079 0.1877 0.1877 0.06931 0.809 2 rs4673349 5.63e-06 6.594e-06 0.3705 0.3705 0.3096 0.3096 0.0741 0.8648 2 rs1375352 5.63e-06 6.594e-06 0.3705 0.3705 0.3096 0.3096 0.0741 0.8648 ...That is, rs2222162 now has a significance value of 5.47e-05 even if we use Bonferroni adjustment for multiple comparisons. A third way to perform the stratification analysis is to specify the number of clusters one wants in the final solution. Here we will specify two clusters, using the --K option, and remove the significance test constraint by setting --ppc to 0 (by omitting that option): plink --bfile hapmap1 --cluster --K 2 --out version3This analysis results in the following two-class solution:SOL-0 HCB181_1 HCB182_1 HCB225_1 HCB189_1 HCB188_1 HCB194_1 HCB205_1 HCB208_1 HCB199_1 HCB221_1 HCB201_1 HCB206_1 HCB196_1 JPT253_1 HCB202_1 HCB203_1 HCB191_1 HCB220_1 HCB197_1 HCB211_1 HCB215_1 HCB216_1 HCB212_1 HCB213_1 HCB183_1 HCB195_1 HCB193_1 HCB186_1 HCB207_1 HCB223_1 HCB187_1 HCB209_1 HCB214_1 HCB184_1 HCB219_1 HCB218_1 HCB200_1 HCB185_1 HCB217_1 HCB198_1 HCB210_1 HCB222_1 HCB192_1 HCB190_1 HCB224_1 SOL-1 HCB204_1 JPT255_1 JPT257_1 JPT226_1 JPT242_1 JPT228_1 JPT244_1 JPT238_1 JPT269_1 JPT232_1 JPT247_1 JPT231_1 JPT239_1 JPT229_1 JPT243_1 JPT236_1 JPT256_1 JPT265_1 JPT227_1 JPT266_1 JPT268_1 JPT263_1 JPT235_1 JPT237_1 JPT250_1 JPT246_1 JPT240_1 JPT251_1 JPT259_1 JPT252_1 JPT233_1 JPT248_1 JPT241_1 JPT254_1 JPT261_1 JPT245_1 JPT264_1 JPT249_1 JPT258_1 JPT230_1 JPT267_1 JPT262_1 JPT234_1 JPT260_1Here we see that the solution has assigned all Chinese and all Japanese two separate groups except for two individuals. If we use this cluster solution in the association analysis, we obtain the following results, again obtaining genome-wide significance: CHR SNP UNADJ GC BONF HOLM SIDAK_SS SIDAK_SD FDR_BH FDR_BY 2 rs2222162 8.951e-10 1.493e-09 5.89e-05 5.89e-05 5.89e-05 5.89e-05 5.89e-05 0.0006875 2 rs4675607 9.255e-06 1.217e-05 0.609 0.609 0.4561 0.4561 0.2679 1 13 rs9585021 1.222e-05 1.594e-05 0.8038 0.8038 0.5524 0.5524 0.2679 1 2 rs1375352 2.753e-05 3.519e-05 1 1 0.8366 0.8365 0.3505 1 2 rs4673349 2.753e-05 3.519e-05 1 1 0.8366 0.8365 0.3505 1 9 rs7046471 3.196e-05 4.071e-05 1 1 0.8779 0.8779 0.3505 1 6 rs9488062 4.481e-05 5.659e-05 1 1 0.9476 0.9476 0.4213 1 ...with similarly low inflation factors: Genomic inflation factor (based on median chi-squared) is 1.02729 Mean chi-squared statistic is 0.982804Finally, given that the actual ancestry of each individual is known in this particular sample, we can always use this external clustering in the analysis: plink --bfile hapmap1 --mh --within pop.phe --adjust --out aac3Unsurprisingly, this gives very similar results to the two-class solution derived from cluster analysis. In summary,
plink --bfile hapmap1 --cluster --matrix --out ibd_viewwhich generates a file ibd_view.mdist. Then, in R, perform the following commands: (note: obviously, you need R installed for to perform these next actions -- it can be freely downloaded here)m <- as.matrix(read.table("ibd_view.mdist"))mds <- cmdscale(as.dist(1-m))k <- c( rep("green",45) , rep("blue",44) )
plot(mds,pch=20,col=k)which should generate a plot like this: (green represents Chinese individuals, blue represents Japanese individuals).plink --bfile hapmap1 --assoc --pheno qt.phe --out quant1This analysis generates a file quant1.qassoc which has the following fields:CHR SNP NMISS BETA SE R2 T P 1 rs6681049 89 -0.2266 0.3626 0.004469 -0.6249 0.5336 1 rs4074137 89 -0.2949 0.6005 0.002765 -0.4911 0.6246 1 rs1891905 89 -0.1053 0.3165 0.001272 -0.3328 0.7401 1 rs9729550 89 0.5402 0.4616 0.0155 1.17 0.2451 1 rs3813196 89 0.8053 1.025 0.00705 0.7859 0.434 1 rs12044597 89 0.01658 0.3776 2.217e-05 0.04392 0.9651 1 rs10907185 89 0.171 0.373 0.00241 0.4584 0.6478 1 rs11260616 88 0.03574 0.444 7.533e-05 0.08049 0.936 1 rs745910 87 -0.3093 0.4458 0.005632 -0.6938 0.4897 1 rs262688 89 0.411 0.4467 0.009637 0.9201 0.3601 1 rs2460000 89 -0.03558 0.3821 9.969e-05 -0.09314 0.926 1 rs260509 89 -0.551 0.438 0.01787 -1.258 0.2118 ...The fields in this file represent:
CHR SNP UNADJ GC BONF HOLM SIDAK_SS SIDAK_SD FDR_BH FDR_BY 2 rs2222162 9.083e-11 3.198e-09 5.977e-06 5.977e-06 5.977e-06 5.977e-06 5.977e-06 6.976e-05 21 rs219746 1.581e-07 1.672e-06 0.01041 0.0104 0.01035 0.01035 0.005203 0.06072 7 rs1922519 4.988e-06 3.038e-05 0.3283 0.3282 0.2798 0.2798 0.1094 1 2 rs2969348 1.008e-05 5.493e-05 0.6636 0.6636 0.485 0.485 0.1122 1 3 rs6773558 1.313e-05 6.857e-05 0.8638 0.8638 0.5785 0.5784 0.1122 1 10 rs3862003 1.374e-05 7.123e-05 0.9038 0.9038 0.595 0.595 0.1122 1 8 rs660416 1.554e-05 7.905e-05 1 1 0.6405 0.6404 0.1122 1 14 rs2526935 1.611e-05 8.146e-05 1 1 0.6536 0.6535 0.1122 1 ...Here we see that the disease variant is significant after genome-wide correction. However, these tests do not take into account the clustering in the sample in the same way we did before. The genomic control inflation factor estimate is now: Genomic inflation factor (based on median chi-squared) is 1.19824 Mean chi-squared statistic is 1.21478Instead of performing a stratified analysis or including covariates, one approach is to use permutation: specifically, it is possible to permute (i.e. label-swap phenotypes between individuals) but only within cluster. This controls for any between-cluster association, as this will be constant under all permuted datasets. We request clustered permutation as follows, using the original pairing approach to matching: plink --bfile hapmap1 --assoc --pheno qt.phe --perm --within str1.cluster2 --out quant2In this case we are using adaptive permutation. See the section of the main documentation that describes permutation testing for more details. The output will show:... 89 of 89 individuals assigned to 45 cluster(s) ... Set to permute within 45 cluster(s) Writing QT association results to [ quant2.qassoc ] Adaptive permutation: 1000000 of (max) 1000000 : 25 SNPs leftThis analysis will take some time depending on how fast your computer is, probably at least 1 hour. The last line shown above will change, counting the number of permutations performed, and the number of SNPs left in the analysis at any given stage. Here it reaches the default maximum of 1 million permutations and 25 SNPs remain still (see the link above for more details on this procedure). The adaptive permutation procedure results in a file quant2.qassoc.perm. Sorting this file by the emprical p-value (EMP1, the fourth column) we see that the disease variant rs2222162 is top of the list, with an empirical significance value of 1e-6 (essentially indicating that no permuted datasets had a statistic for rs2222162 that exceeded this). CHR SNP STAT EMP1 NP 2 rs2222162 42.01 1e-06 1000000 6 rs1606447 5.869 5.206e-05 38415 10 rs1393829 16.34 0.0001896 10549 21 rs219746 27.49 0.0001896 10549 2 rs2304287 9.021 0.0001896 10549 6 rs2326873 6.659 0.0001896 10549 2 rs1385855 7.59 0.0002227 13468 2 rs6543704 8.131 0.0002227 13468 ...IMPORTANT When using the --within option along with permutaion, the empirical significance values EMP1 will appropriately reflect that we have controlled for the clustering variable. In contrast, the standard chi-squared statistics (STAT in this file) will not reflect the within-cluster analysis. That is, the test used is the same identical test as used in standard analysis -- the only thing that changes is the way we permute the sample. The STAT values will be identical to the standard, non-clustered analysis therefore. The NP field shows how many permutations were conducted for each SNP. For the SNPs at the bottom of the list, PLINK may well have given up after only 6 permutations (i.e. these were SNPs that were clearly not going to be highly significant if after 6 permutations they were exceeded more than a couple of times). Naturally, this approach speeds up permutation analysis but does not provide a means for controlling for multiple testing (i.e. by comparing each observed test statistic against the maximum of all permuted statistics in each replicate). This can be achieved with the --mperm option: plink --bfile hapmap1 --assoc --pheno qt.phe --mperm 1000 --within str1.cluster2 --out quant3With --mperm you must also specify the number of replicates -- this number can be fairly low, as one is primarily interested in the corrected p-values being less than some reasonably high nominal value such as 0.05, rather than accurately estimating the point-wise empirical significance, which might be very small. Finally, we might want to test whether the association with the continuous phenotype differs between the two populations: for this we can use the --gxe option, along with population membership (which is currently limited to the dichotomous case) being specified as a covariate with the --covar option (same format as cluster files). Let's just perform this analysis for the main SNP of interest rather than all SNPs:plink --bfile hapmap1 --pheno qt.phe --gxe --covar pop.phe --snp rs2222162 --out quant3The output will show that a file quant3.qassoc.gxe has been created, which contains the following fields:CHR SNP NMISS1 BETA1 SE1 NMISS2 BETA2 SE2 Z_GXE P_GXE 2 rs2222162 45 -2.271 0.2245 44 -1.997 0.1722 -0.9677 0.3332which show the number of non-missing individuals in each category along with the regression coefficient and standard error, followed by a test of whether these two regression coefficients are significantly different (Z_GXE) and an asymptotic significance value (P_GXE). In this case, we see the similar effect in both populations (regression coefficients around -2) and the test for interaction of SNP x population interaction is not significant. Extracting a SNP of interest Finally, given you've identified a SNP, set of SNPs or region of interest, you might want to extract those SNPs as a separate, smaller, more manageable file. In particular, for other applications to analyse the data, you will need to convert from the binary PED file format to a standard PED format. This is done using the --recode options (fully described here). There are a few forms of this option: we will use the --recodeAD that codes the genotypes in a manner that is convenient for subsequent analysis in R or any other non-genetic statistical package. To extract only this single SNP, use: plink --bfile hapmap1 --snp rs2222162 --recodeAD --out rec_snp1(to select a region, use the --to and --from options instead, or use --window 100 with --snp to select a 100kb region surrounding that SNP, for example). This particular recode feature codes genotypes as additive (0,1,2) and dominance (0,1,0) components, in a file called rec_snp1.recode.raw. We can then load this file into our statistics package and easily perform other analyses: for example, to repeat the main analysis as a simple logistic regression using the R package (not controlling for clusters):d <- read.table("rec_snp1.recode.raw" , header=T)summary(glm(PHENOTYPE-1 ~ rs2222162_A, data=d, family="binomial"))Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.6795 0.4827 -3.479 0.000503 *** rs2222162_A 1.5047 0.3765 3.997 6.42e-05 ***which confirms the original analysis. Naturally, things such as survival analysis or other models not implemented in PLINK can now be performed. Other areas... That's all for this tutorial. We've seen how to use PLINK to analyse a dummy dataset. Hopefully things went smoothly and you are now more familiar with using PLINK and can start applying it to your own datasets. There are a large number of areas that we have not even touched here:
|