1. Introduction
2. Basic information
3. Download and general notes
4. Command reference table
5. Basic usage/data formats
6. Data management
7. Summary stats
8. Inclusion thresholds
9. Population stratification
10. IBS/IBD estimation
11. Association
12. Family-based association
13. Permutation procedures
14. LD calculations
15. Multimarker tests
16. Conditional haplotype tests
17. Proxy association
18. Imputation (beta)
19. Dosage data
20. Meta-analysis
21. Annotation
22. LD-based results clumping
23. Gene-based report
24. Epistasis
25. Rare CNVs
26. Common CNPs
27. R-plugins
28. Annotation web-lookup
29. Simulation tools
30. Profile scoring
31. ID helper
32. Resources
33. Flow-chart
34. Miscellaneous
35. FAQ & Hints
36. gPLINK
|
|
SNP simulation routine
PLINK provides an interface to a very simplistic SNP
simulation routine, designed to generate large SNP datasets for
population-based, case/control studies. This function is largely
intended as a convenience function for generating data to
prototype new methods, comparing the power of different approaches,
etc, rather than producing realistic whole genome data.
Critically, all SNPs simulated are unlinked and in linkage
equilibrium.
Basic usage
The basic command to simulate a SNP data file is the --simulate option,
./plink --simulate wgas.sim --make-bed --out sim1
which takes as a parameter the name of a file (here wgas.sim)
that describes the to-be-simulated data.
The simulation file wgas.sim is as follows:
100000 null 0.00 1.00 1.00 1.00
100 disease 0.00 1.00 2.00 mult
These files can have 1 or more rows, where each row has exactly five fields, as follows
Number of SNPs in this set
Label of this set of SNPs
Lower allele frequency range
Upper allele frequency range
Odds ratio for disease, heterozygote
Odds ratio for disease, homozyygote (or "mult")
Specifying mult implies a multiplicative risk for the
homozygote, e.g. 2*2=4 in the above example.
Given this file, PLINK would generate 100,000 SNPs with no
association with disease. Each SNP would have its own population
allele frequency, generated as a uniform number between, in this case,
0.00 and 1.00. In addition, 100 extra SNPs will be simulated that are
associated with disease (population odds ratio of 2.00).
The names of each SNP would follow from the label (which must be unqiue),
with a number appended, e.g.
null_0
null_1
null_2
...
disease_99
An exception is that if a set only contains a single SNP, nothing is
appended to the label. This is useful in generating multiple samples
from the same population, as described below.
Obviously, a uniform allele frequency range is not realistic: one could instead specify a series of bins
to enrich for rarer SNPs, if so desired, to build a more realistic spectrum of allele frequencies (not that the example
below is meant to be more realistic).
20000 nullA 0.00 0.05 1.00 1.00
10000 nullB 0.05 0.10 1.00 1.00
5000 nullC 0.10 0.20 1.00 1.00
10000 nullD 0.20 0.99 1.00 1.00
...
As well as generating the actual data, the --simulate outputs to the LOG file the following:
Reading simulation parameters from [ wgas.sim ]
Writing SNP population frequencies to [ plink.simfreq ]
Read 2 sets of SNPs, specifying 100100 SNPs in total
Simulating 100 cases and 100 controls
Assuming a disease prevalence of 0.01
The plink.simfreq file is described below. By default, 100 cases and 100 controls are generated. This
can be changed with the command-line options
--simulate-ncases 5000
and
--simulate-ncontrols 5000
for example. Likewise, the default disease prevalence is assumed to be 0.01. This can be changed with
--simulate-prevalence 0.05
for example.
In the example above, the simulated data were directly saved to a binary fileset: this need not be the case. For example,
any other analysis command could instead have been applied, e.g. --simulate acts just like --file or --bfile:
./plink --simulate wgas.sim --assoc
although the actual simulated data would be subsequently lost of course.
Hint This tool only generates individuals drawn from
a homogeneous population, but you can easily imagine using several
--simulate runs then using PLINK commands to merge
the resulting files to specify more complex scenarios,
e.g. representing population stratification, allelic heterogeneity,
etc.
The command
--simulate-label POP1
will append the text label POP1 to the ID of each individual
generated. This can be useful when generating and subsequently merging
multiple simulated samples (so unique IDs can be specified across all
samples).
Specification of LD between marker and causal variant
It is also possible to simulate data in which the observed marker SNP
is only in incomplete LD with the actual causal allele. This is
achieved with either the --simulate-tags or --simulate-haps
options, e.g.
plink --simulate ld.sim --simulate-tags
PLINK now expects 9 fields instead of 6 in the simulation file: namely,
Number of SNPs in this set
Label of this set of SNPs
Lower allele frequency range, causal variant
Upper allele frequency range, causal variant
Lower allele frequency range, marker
Upper allele frequency range, marker
Marker / causal variant LD (D-prime)
Odds ratio for disease, heterozygote causal variant
Odds ratio for disease, homozyygote causval variant, (or "mult")
For example,
5 snp 0.05 0.05 0.5 0.5 0.5 2 mult
Implies 5 CVs, each of 5% MAF and 2-fold multiplicative effect size
(each in linkage equilibrium still) but with 10 additional markers,
each of 50% MAF, each in complete LD with D'=0.5 with their respective
CV (i.e. pairs of markers are simulated). The command
plink --simulate ld.sim --simulate-tags --assoc
will therefore generate only 10 SNPs, that are the markers that tag the CVs, i.e.
note the frequency of these variants is around 50%
CHR SNP BP A1 F_A F_U A2 CHISQ P OR
1 snp_0 1 D 0.474 0.5045 d 3.723 0.05368 0.8851
1 snp_1 2 D 0.484 0.4985 d 0.8413 0.359 0.9436
1 snp_2 3 D 0.493 0.4775 d 0.9618 0.3267 1.064
1 snp_3 4 D 0.486 0.494 d 0.2561 0.6128 0.9685
1 snp_4 5 D 0.4925 0.5005 d 0.256 0.6129 0.9685
In contrast, the related command
plink --simulate ld.sim --simulate-haps --assoc
will output both the causal variant and the marker, with _M
appended to the marker name:
CHR SNP BP A1 F_A F_U A2 CHISQ P OR
1 snp_0 1 D 0.0995 0.044 d 46.25 1.042e-11 2.401
1 snp_0_M 2 B 0.4885 0.494 A 0.121 0.7279 0.9782
1 snp_1 3 D 0.0875 0.044 d 30.8 2.853e-08 2.083
1 snp_1_M 4 B 0.4815 0.5055 A 2.304 0.129 0.9084
1 snp_2 5 D 0.088 0.0575 d 13.79 0.0002044 1.582
1 snp_2_M 6 B 0.459 0.5115 A 11.03 0.0008943 0.8103
1 snp_3 7 D 0.0965 0.047 d 36.79 1.316e-09 2.166
1 snp_3_M 8 B 0.5005 0.491 A 0.361 0.5479 1.039
1 snp_4 9 D 0.093 0.0535 d 22.98 1.634e-06 1.814
1 snp_4_M 10 B 0.4895 0.496 A 0.169 0.681 0.9743
WARNING Again, please note that this procedure does
not produce anything like realistic patterns of LD as one would expect
to observe in whole genome datasets: rather, this simply simulates
pairs of markers, for which there is LD within, but not between,
pairs.
Resimulating a sample from the same population
The --simulate command also generates the file plink.simfreq. This records, for each SNP of the two sets, null and
disease from the wgas.sim example, the actual allele frequency chosen for that particular SNP when simulating the
data. For example,
1 null_0 0.1885 0.1885 1.00 1.00
1 null_1 0.424675 0.424675 1.00 1.00
1 null_2 0.12797 0.12797 1.00 1.00
1 null_3 0.544394 0.544394 1.00 1.00
1 null_4 0.938641 0.938641 1.00 1.00
....
Conveniently, this information is output in the same format as the original simulation file: note how the upper and lower allele frequency
range is converged to specify a particular value, i.e. the first row shows a range of 0.1885 to 0.1885, i.e. effectively forcing the allele
frequency for the first SNP to be 0.1885. This can be useful, as to generate a new independent dataset from the same population as the
first, you would simply use the
plink.simfreq output file, as input for a new --simulate command, see below.
Putting this together, one might imagine setting up a simple screen/replicate simulation design: first we generate the original
WGAS screening data
./plink --simulate wgas.sim --make-bed --out screen
run our association test
./plink --bfile screen --assoc
and extract a list of significant SNPs (here using the Unix gawk command, to filter on the p-value column, 9)
gawk ' NR>1 && $9 < 1e-3 { print $2 } ' plink.assoc > positives
and then generate and test these same SNPs in an independent sample
./plink --simulate screen.simfreq --extract positives --assoc --out replication
etc. By labeling true disease SNPs and null SNPs sensibly as above,
you can tell how many true positives and false positives appear at the
screening and the replication stages, e.g. using Unix bash
shell scripting to summarise results:
t=1e-3
s0=`fgrep null plink.assoc | gawk ' $9 < t ' t=$t | wc -l`
s1=`fgrep disease plink.assoc | gawk ' $9 < t ' t=$t | wc -l`
echo "Detected $s1 true positives and $s0 false positives in screening"
t=1e-2
s0=`fgrep null replication.assoc | gawk ' $9 < t ' t=$t | wc -l`
s1=`fgrep disease replication.assoc | gawk ' $9 < t ' t=$t | wc -l`
echo "Of these, $s1 true positives and $s0 false positives replicate"
Simulating a quantitative trait
To simulate a quantitative trait based on one or more unlinked SNPs,
use the command
plink --simulate-qt myfile.sim --simulate-n 1000
where myfile.sim is similar in format to the file taken by the standard --simulate option, except
the two final fields represent the additive genetic variance, and the ratio of dominance to additive effects, e.g. if
myfile.sim is
10 qtl 0.05 0.95 0.01 0
then
plink --simulate-qt myfile.sim
will generate ten unlinked QTLs, with allele frequency between 0.05
and 0.95 for the trait-increasing allele. In each case, the effect
size will be calculated to give a population variance explained of 0.01. The
final 0 indicates the effects are additive (1 means complete dominance, -1
means complete recessive). In this case, the output is
Reading QT simulation parameters from [ sim1.sim ]
Writing SNP population frequencies to [ plink.simfreq ]
Read 1 sets of SNPs, specifying 10 SNPs in total
Simulating QT for 1000 individuals
Total QTL variance is 0.1
Simulating disease variants (direct association)
This lists the total population variance explained by all QTLs -- in this case, 0.1. If the variance
explained is greater than 1, an error message will be reported.
If the --assoc command were also specified along with the above command, standard QT association
tests would be performed for each simulate SNP, e.g. plink.qassoc:
CHR SNP BP NMISS BETA SE R2 T P
1 qtl_0 1 1000 0.1988 0.05263 0.0141 3.777 0.0001678
1 qtl_1 2 1000 0.09956 0.04623 0.004626 2.154 0.03151
1 qtl_2 3 1000 -0.2101 0.0683 0.009391 -3.076 0.002156
1 qtl_3 4 1000 0.186 0.04728 0.01528 3.935 8.911e-05
1 qtl_4 5 1000 0.1489 0.04714 0.009899 3.159 0.001632
1 qtl_5 6 1000 -0.1854 0.04569 0.01623 -4.057 5.351e-05
1 qtl_6 7 1000 0.2287 0.05745 0.01563 3.981 7.355e-05
1 qtl_7 8 1000 -0.2011 0.08519 0.005551 -2.36 0.01845
1 qtl_8 9 1000 0.166 0.04337 0.01446 3.827 0.0001377
1 qtl_9 10 1000 -0.1007 0.04773 0.004437 -2.109 0.0352
showing R2 (variance explained) values around 0.01, as
expected, with the sampling variation due to sample of 1000
individuals (the default sample size, if --simulate-n is not
specified).
The additional commands such as --simulate-label and
--simulate-tags, etc, can be used with the
--simulate-qt option.
|
|