Inclusion thresholds
This section describes options that can be used to filter out
individuals or SNPs on the basis of the summary statistic measures
described on the previous summary statistics page.
Summary statistics versus inclusion criteria
The following table summarizes the relationship between the commands
that generate summary statistics (described on the previous page) and the commands that exclude
individuals and/or markers, which are described on this page.
Feature | As summary statistic | As inclusion criteria
Missingness per individual | --missing | --mind N
Missingness per marker | --missing | --geno N
Allele frequency | --freq | --maf N
Hardy-Weinberg equilibrium | --hardy | --hwe N
Mendel error rates | --mendel | --me N M
Default threshold values
By default, PLINK does not impose any filters on minor allele
frequency or genotyping rate. (Note that versions prior to 1.04 used to
have thresholds of 0.01 for frequency and 0.1 for individual and SNP
missing rate. This is no longer the case, i.e. it is as if
the --all keyword is always specified.)
To perform an analysis, or generate a new dataset, with filters
applied, add the --mind, --geno or --maf
options to the command line, for example, when
the --remove command is given.
Missing rate per person
The initial step in all data analysis is to exclude individuals with too much missing
genotype data. This option is set as follows:
plink --file mydata --mind 0.1
which means exclude individuals with more than 10% missing genotypes (this was the default
value in versions prior to 1.04; no threshold is now applied by default). A line in the terminal output will appear, indicating how many
individuals were removed due to low genotyping. If any individuals were
removed, a file called
plink.irem
will be created, listing the Family and Individual IDs of these removed individuals.
Any subsequent analysis also specified on the same command line will be
performed without these individuals.
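The rule behind --mind can be sketched in a few lines of Python. This is only an illustration of the filtering logic described above, not PLINK's implementation: the function names, the dictionary layout keyed by (Family ID, Individual ID) and the use of None for a missing genotype are all assumptions made for the example.

```python
def individual_missing_rate(genotypes):
    """Fraction of one individual's genotypes that are missing (None)."""
    if not genotypes:
        return 0.0
    return sum(g is None for g in genotypes) / len(genotypes)


def filter_by_mind(samples, mind):
    """Split samples into kept and removed, following the --mind rule.

    samples: dict mapping (family_id, individual_id) -> list of genotypes,
             with None marking a missing call (hypothetical layout).
    mind:    maximum tolerated missing rate; individuals with a rate
             strictly above it are removed.
    """
    kept, removed = {}, []
    for ids, genotypes in samples.items():
        if individual_missing_rate(genotypes) > mind:
            removed.append(ids)  # IDs of the kind PLINK lists in plink.irem
        else:
            kept[ids] = genotypes
    return kept, removed


# Toy data: 10 SNPs per individual, made up for illustration.
samples = {
    ("FAM1", "1"): ["AA", "AG", "GG", "AG", "AA", "AA", "AG", "GG", "AA", "AG"],
    ("FAM1", "2"): ["AA", None, "GG", "AG", "AA", "AA", "AG", "GG", "AA", "AG"],
    ("FAM2", "1"): [None, None, "GG", "AG", "AA", "AA", "AG", "GG", "AA", "AG"],
}
kept, removed = filter_by_mind(samples, mind=0.1)
print(removed)
```

In this toy data the individual with exactly 10% missing genotypes is retained, because the text above describes the filter as excluding those with more than 10% missing.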
To instead create a new PED file with these individuals
permanently removed, add an option to generate a new fileset: for example,
plink --file data --mind 0.1 --recode --out cleaned
will generate files
cleaned.ped
cleaned.map
with the high-missing-rate individuals removed; alternatively, to create a binary fileset
with these individuals removed:
plink --file data --mind 0.1 --make-bed --out cleaned
which results in the files
cleaned.bed
cleaned.bim
cleaned.fam
HINT You can specify that certain genotypes
were never attempted, i.e. that they are obligatory missing, and these
will be handled appropriately by these genotyping rate filters. See
the summary statistics page
for more details.
Allele frequency
Once individuals with too much missing genotype data have been excluded, subsequent
analyses can be set to automatically exclude SNPs on the
basis of MAF (minor allele frequency):
plink --file mydata --maf 0.05
means only include SNPs with MAF >= 0.05. No MAF filter is applied by default (versions prior to 1.04 used a default of 0.01). This quantity is based
only on founders (i.e. individuals for whom the paternal and maternal individual codes are
both 0).
This option appropriately counts alleles for X and Y chromosome SNPs.
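As a rough sketch of the founders-only allele count, the following Python computes the MAF at a single SNP. It assumes a biallelic autosomal SNP (the special handling of X and Y chromosome SNPs is not modelled), and the record layout with 'father', 'mother' and 'genotype' keys is hypothetical rather than anything PLINK defines.

```python
def minor_allele_frequency(individuals):
    """MAF at one biallelic autosomal SNP, counting founders only.

    individuals: list of dicts with keys 'father' and 'mother' (parental ID
                 codes, '0' meaning no parent in the dataset) and 'genotype'
                 (a two-character string such as 'AG', or None if missing).
                 This layout is an assumption made for the example.
    """
    counts = {}
    for person in individuals:
        is_founder = person["father"] == "0" and person["mother"] == "0"
        if not is_founder or person["genotype"] is None:
            continue  # non-founders and missing calls do not contribute
        for allele in person["genotype"]:
            counts[allele] = counts.get(allele, 0) + 1
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # no data, or monomorphic among founders
    return min(counts.values()) / total


# Toy pedigree plus one unrelated founder; the child is not a founder.
people = [
    {"father": "0", "mother": "0", "genotype": "AA"},
    {"father": "0", "mother": "0", "genotype": "AG"},
    {"father": "1", "mother": "2", "genotype": "AG"},
    {"father": "0", "mother": "0", "genotype": "AA"},
]
maf = minor_allele_frequency(people)
passes_filter = maf >= 0.05  # the inclusion rule stated for --maf 0.05
print(maf, passes_filter)
```

Here the three founders carry five A alleles and one G allele, so the child's genotype does not enter the estimate.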
Missing rate per SNP
Subsequent analyses can be set to automatically exclude SNPs on the
basis of missing genotype rate, with the --geno option: the default is to include all SNPs (i.e. --geno 1).
To include only SNPs with at least a 90% genotyping rate (at most 10% missing) use
plink --file mydata --geno 0.1
As with the --maf option, these counts are calculated after removing individuals with
high missing genotype rates.
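The order of operations matters: a few poorly genotyped individuals can inflate the apparent missing rate of every SNP. The sketch below, under the assumption of a simple list-of-rows genotype matrix with None for missing calls, applies the individual filter first and only then computes per-SNP missing rates. The function names are hypothetical, and treating a rate exactly equal to the threshold as passing is an assumption of the example.

```python
def snp_missing_rates(matrix):
    """Per-SNP missing rate for a list of per-individual genotype rows."""
    n_individuals = len(matrix)
    n_snps = len(matrix[0]) if matrix else 0
    return [
        sum(row[j] is None for row in matrix) / n_individuals
        for j in range(n_snps)
    ]


def apply_mind_then_geno(matrix, mind, geno):
    """Drop high-missing individuals first, then flag high-missing SNPs.

    Returns the retained rows and the column indices of retained SNPs.
    Rows are assumed to be non-empty and of equal length.
    """
    kept_rows = [
        row for row in matrix
        if sum(g is None for g in row) / len(row) <= mind
    ]
    rates = snp_missing_rates(kept_rows)
    kept_snps = [j for j, rate in enumerate(rates) if rate <= geno]
    return kept_rows, kept_snps


# Toy data: the last individual is missing 3 of 4 genotypes.
matrix = [
    ["AA", "AG", "GG", "AA"],
    ["AA", "AG", "GG", "AG"],
    ["AG", "AA", "GG", "AA"],
    [None, None, None, "AA"],
]
print(snp_missing_rates(matrix))  # rates before any individual is removed
kept_rows, kept_snps = apply_mind_then_geno(matrix, mind=0.5, geno=0.2)
print(len(kept_rows), kept_snps)
```

Before the individual filter, the first three SNPs each have a 25% missing rate and would fail a 0.2 threshold; once the poorly genotyped individual is removed, all four SNPs pass.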
Hardy-Weinberg Equilibrium
To exclude markers that fail the Hardy-Weinberg test at a specified significance
threshold, use the option:
plink --file mydata --hwe 0.001
By default this filter uses an exact test (see this section).
The standard asymptotic test (a 1 df genotypic chi-squared test) can be requested with the --hwe2
option instead of --hwe.
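To make the idea of an exact test concrete, here is a minimal Python sketch that enumerates every possible heterozygote count for the observed allele counts and sums the probabilities of outcomes no more likely than the observed one. It is a generic exact test for a biallelic SNP written for illustration; it is not taken from PLINK's source, and the function names are hypothetical.

```python
from math import exp, lgamma, log


def _log_factorial(n):
    """Natural log of n! via the gamma function."""
    return lgamma(n + 1)


def _log_prob_hets(n_het, n_total, n_a):
    """Log probability of n_het heterozygotes, given the sample size and
    the count of A alleles, under Hardy-Weinberg equilibrium."""
    n_aa = (n_a - n_het) // 2
    n_bb = n_total - n_het - n_aa
    n_b = 2 * n_total - n_a
    return (
        _log_factorial(n_total)
        - _log_factorial(n_aa) - _log_factorial(n_het) - _log_factorial(n_bb)
        + n_het * log(2.0)
        + _log_factorial(n_a) + _log_factorial(n_b)
        - _log_factorial(2 * n_total)
    )


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact HWE p-value from genotype counts at one biallelic SNP."""
    n_total = n_aa + n_ab + n_bb
    if n_total == 0:
        return 1.0
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_total - n_a
    # Feasible heterozygote counts share the parity of the A allele count.
    candidates = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = [exp(_log_prob_hets(h, n_total, n_a)) for h in candidates]
    observed = exp(_log_prob_hets(n_ab, n_total, n_a))
    # Sum outcomes at most as likely as the observed one (small tolerance
    # guards against floating-point ties).
    p = sum(pr for pr in probs if pr <= observed * (1.0 + 1e-9))
    return min(1.0, p)


p_value = hwe_exact_p(n_aa=1, n_ab=0, n_bb=1)
fails_filter = p_value <= 0.001  # a marker fails when p <= threshold
print(p_value, fails_filter)
```

With two individuals who are opposite homozygotes, only two outcomes are possible (zero or two heterozygotes, with probabilities 1/3 and 2/3), so the p-value is 1/3 and the marker is retained at a 0.001 threshold.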
The following output will appear in the console window and in plink.log,
detailing how many SNPs failed the Hardy-Weinberg test, for the sample as a whole,
and (when PLINK has detected a disease phenotype) for cases and
controls separately:
Writing Hardy-Weinberg tests (founders-only) to [ plink.hwe ]
30 markers failed HWE test ( p <= 0.05 ) and have been excluded
34 markers failed HWE test in cases
30 markers failed HWE test in controls
This test will only be based on founders (if family-based data are being
analysed) unless the --nonfounders option is also specified.
In case/control samples, this test will be based on controls only, unless the
--hwe-all option is specified, in which case the phenotype
will be ignored. This can be important if parents are coded as missing
in an affected offspring trio sample.
Please refer to the --hardy option for more details on
producing summary statistics of all HWE rates.
Mendel error rate
For family-based data only, to exclude individuals and/or markers on
the basis of Mendel error rate, use the option:
plink --file mydata --me 0.05 0.1
where the two parameters are:
- the first parameter determines that families with more than 5% Mendel errors
(considering all SNPs) will be discarded;
- the second parameter indicates that SNPs with more than 10% Mendel error rate
(i.e. based on the number of trios) will be excluded.
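The two thresholds can be sketched as follows. The example assumes each family is a single trio and that Mendel errors have already been detected and recorded as a set of SNP indices per family; that input layout and the function name are hypothetical. Both rates are computed from the same unfiltered data.

```python
def apply_me_filter(error_sets, n_snps, family_max, snp_max):
    """Return the families and SNPs that exceed the two --me thresholds.

    error_sets: dict mapping family_id -> set of SNP indices at which that
                family (treated here as one trio) shows a Mendel error.
    n_snps:     total number of SNPs examined.
    family_max: families with an error rate above this are flagged.
    snp_max:    SNPs with an error rate above this are flagged.
    """
    n_families = len(error_sets)
    if n_families == 0 or n_snps == 0:
        return [], []
    failed_families = [
        fam for fam, errors in error_sets.items()
        if len(errors) / n_snps > family_max
    ]
    failed_snps = [
        j for j in range(n_snps)
        if sum(j in errors for errors in error_sets.values()) / n_families > snp_max
    ]
    return failed_families, failed_snps


# Toy data: 4 trios typed at 10 SNPs; thresholds chosen to suit the tiny sample.
error_sets = {
    "F1": {0, 1, 2},
    "F2": {0},
    "F3": {0},
    "F4": set(),
}
failed_families, failed_snps = apply_me_filter(
    error_sets, n_snps=10, family_max=0.2, snp_max=0.5
)
print(failed_families, failed_snps)
```

Family F1 has errors at 3 of 10 SNPs and is flagged, while SNP 0 shows an error in 3 of 4 trios and is flagged; the other families and SNPs pass.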
Please refer to the summary
statistics page for more details on generating summary statistics
for Mendel error rates.
Note Currently, PLINK calculates the per
SNP Mendel error rates at the same time as the per family error
rates. In future releases, this may change such that the per family
error rate is calculated after SNPs failing this test have
been removed. Also, using this command currently removes entire
nuclear families on the basis of high Mendel error rates: it will
often be more appropriate to remove particular individuals (e.g. if a
second sibling shows no Mendel errors). For this more fine-grained
procedure, use the --mendel option to generate a complete
enumeration of error rates by family and individual and exclude
individuals as desired.
Finally, it is possible to zero out specific Mendelian inconsistencies
with the option --set-me-missing. This should be used in
conjunction with a data generation command and the --me
option. Specifically, the --me parameters should both be set to
1, in order not to exclude any particular SNP or individual/family,
but instead to zero out only specific genotypes with Mendel errors and
save the dataset as a new file. (Both parental and offspring genotypes
will be set to missing.)
plink --bfile mydata --me 1 1 --set-me-missing --make-bed --out newdata
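The effect of zeroing out Mendelian inconsistencies can be illustrated for a single trio. The sketch assumes autosomal biallelic SNPs, two-character genotype strings and None for missing calls, and it skips any SNP where a member of the trio is already missing; these are simplifications for the example and the function names are hypothetical.

```python
def is_mendel_consistent(father, mother, child):
    """True if the child's two alleles can be drawn one from each parent."""
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)


def set_me_missing(trio):
    """Return a copy of the trio with Mendel-inconsistent genotypes set to None.

    trio: dict with keys 'father', 'mother' and 'child', each a list of
          genotypes of equal length (hypothetical layout).
    """
    roles = ("father", "mother", "child")
    cleaned = {role: list(trio[role]) for role in roles}
    for j in range(len(trio["child"])):
        father, mother, child = (trio[role][j] for role in roles)
        if father is None or mother is None or child is None:
            continue  # simplification: incomplete trios are not evaluated
        if not is_mendel_consistent(father, mother, child):
            # Both parental and offspring genotypes are set to missing.
            for role in roles:
                cleaned[role][j] = None
    return cleaned


trio = {
    "father": ["AA", "AG", "GG"],
    "mother": ["AA", "GG", "GG"],
    "child":  ["AG", "AG", "GG"],
}
cleaned = set_me_missing(trio)
print(cleaned)
```

Only the first SNP is inconsistent (an AG child of two AA parents), so only that genotype is blanked for all three individuals; no SNP or family is dropped as a whole.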