This section describes options that can be used to filter out individuals or SNPs on the basis of the summary statistic measures described on the previous summary statistics page.

Summary statistics versus inclusion criteria

The following table summarizes the relationship between the commands that generate summary statistics (as described on the previous page) and the commands that exclude individuals and/or markers, which are described on this page.

Feature	As summary statistic	As inclusion criteria
Missingness per individual	`--missing`	`--mind` N
Missingness per marker	`--missing`	`--geno` N
Allele frequency	`--freq`	`--maf` N
Hardy-Weinberg equilibrium	`--hardy`	`--hwe` N
Mendel error rates	`--mendel`	`--me` N M

Default threshold values

By default, PLINK does not impose any filters on minor allele frequency or genotyping rate. (Note that versions prior to 1.04 used to have thresholds of 0.01 for frequency and 0.1 for individual and SNP missing rate; this is no longer the case, i.e. it is as if the --all keyword is always specified.)

To perform an analysis, or generate a new dataset, with filters applied, add the --mind, --geno or --maf options to the command line, for example, when the --remove command is given.

Missing rate per person

The initial step in all data analysis is to exclude individuals with too much missing genotype data. This option is set as follows:

plink --file mydata --mind 0.1

which means exclude individuals with more than 10% missing genotypes (this was the default value in versions prior to 1.04; no threshold is now applied by default). A line in the terminal output will appear, indicating how many individuals were removed due to low genotyping. If any individuals were removed, a file called

     plink.irem

will be created, listing the Family and Individual IDs of these removed individuals. Any subsequent analysis also specified on the same command line will be performed without these individuals.

To create a new PED file with these individuals permanently removed instead, simply add an option to generate a new fileset: for example,

plink --file data --mind 0.1 --recode --out cleaned

will generate files

     cleaned.ped
     cleaned.map

with the high-missing-rate individuals removed; alternatively, to create a binary fileset with these individuals removed:

plink --file data --mind 0.1 --make-bed --out cleaned

which results in the files

     cleaned.bed
     cleaned.bim
     cleaned.fam

HINT You can specify that certain genotypes were never attempted, i.e. that they are obligatory missing, and these will be handled appropriately by these genotyping rate filters. See the summary statistics page for more details.
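
To preview which individuals a given --mind threshold would remove, one option is to post-process the per-individual report produced by --missing. The following is a minimal sketch, not part of PLINK: the function name high_missing_individuals is hypothetical, and it assumes the report (plink.imiss) has a header line followed by rows whose first, second and sixth columns hold the Family ID, the Individual ID and the missing rate.

```shell
# Sketch only: list FID and IID of individuals whose missing rate exceeds a threshold.
# Assumption: the input is a --missing per-individual report with a header line and
# the columns FID IID MISS_PHENO N_MISS N_GENO F_MISS.
high_missing_individuals() {
    # $1 = threshold (e.g. 0.1); the report is read from standard input
    awk -v t="$1" 'NR > 1 && ($6 + 0) > (t + 0) { print $1, $2 }'
}

# Hypothetical usage, after running: plink --file mydata --missing
#   high_missing_individuals 0.1 < plink.imiss
```

The comparison is strict, mirroring the "more than 10%" wording above, so an individual with a missing rate of exactly 0.1 is not listed.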

Allele frequency

Once individuals with too much missing genotype data have been excluded, subsequent analyses can be set to automatically exclude SNPs on the basis of MAF (minor allele frequency):

plink --file mydata --maf 0.05

means only include SNPs with MAF >= 0.05. (Versions prior to 1.04 applied a default threshold of 0.01; no MAF filter is now applied by default.) This quantity is based only on founders (i.e. individuals for whom the paternal and maternal individual codes are both 0).

This option appropriately counts alleles for X and Y chromosome SNPs.
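
As an illustration of a founders-only allele frequency, the sketch below computes the MAF of one SNP from PED-format lines. It is not PLINK's implementation: the function name founder_maf is hypothetical, it assumes a whitespace-delimited PED file with six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per SNP with 0 coding a missing allele, and it treats every SNP as autosomal, so it does not reproduce the X and Y chromosome handling mentioned above.

```shell
# Sketch only: minor allele frequency of SNP number k (1-based) among founders.
# Assumption: PED-format lines on standard input, missing alleles coded as 0.
founder_maf() {
    awk -v k="$1" '
        $3 == "0" && $4 == "0" {              # founders: both parental codes are 0
            for (c = 5 + 2 * k; c <= 6 + 2 * k; c++)
                if ($c != "0") { n[$c]++; total++ }   # skip missing alleles
        }
        END {
            if (total == 0) { print "NA"; exit }
            min = total
            nalleles = 0
            for (a in n) { nalleles++; if (n[a] < min) min = n[a] }
            if (nalleles < 2) min = 0         # monomorphic among founders
            printf "%.4f\n", min / total
        }'
}

# Hypothetical usage: founder_maf 1 < mydata.ped
```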

Missing rate per SNP

Subsequent analyses can be set to automatically exclude SNPs on the basis of missing genotype rate, with the --geno option: the default is to include all SNPs (i.e. --geno 1). To include only SNPs with a 90% genotyping rate (10% missing) use

plink --file mydata --geno 0.1

As with the --maf option, these counts are calculated after removing individuals with high missing genotype rates.

Hardy-Weinberg Equilibrium

To exclude markers that fail the Hardy-Weinberg test at a specified significance threshold, use the option:

plink --file mydata --hwe 0.001

By default this filter uses an exact test (see this section). The standard asymptotic test (a 1 df genotypic chi-squared test) can be requested with the --hwe2 option instead of --hwe.

The following output will appear in the console window and in plink.log, detailing how many SNPs failed the Hardy-Weinberg test, for the sample as a whole, and (when PLINK has detected a disease phenotype) for cases and controls separately:

Writing Hardy-Weinberg tests (founders-only) to [ plink.hwe ]
30 markers failed HWE test ( p <= 0.05 ) and have been excluded
        34 markers failed HWE test in cases
        30 markers failed HWE test in controls

This test will only be based on founders (if family-based data are being analysed) unless the --nonfounders option is also specified. In case/control samples, this test will be based on controls only, unless the --hwe-all option is specified, in which case the phenotype will be ignored. This can be important if parents are coded as missing in an affected offspring trio sample.

Please refer to the --hardy option for more details on producing summary statistics of all HWE rates.
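
To make the idea of an exact test concrete, the sketch below computes an exact Hardy-Weinberg p-value from three genotype counts, using the recursion over possible heterozygote counts described by Wigginton, Cutler and Abecasis. It is offered as an illustration under assumptions: the function name hwe_exact_p is hypothetical, and whether the result matches PLINK's own implementation in every detail has not been verified here.

```shell
# Sketch only: exact HWE p-value from genotype counts (hom1, hets, hom2).
hwe_exact_p() {
    awk -v hom1="$1" -v hets="$2" -v hom2="$3" 'BEGIN {
        hom1 += 0; hets += 0; hom2 += 0
        homr = (hom1 < hom2) ? hom1 : hom2    # rarer homozygote count
        homc = (hom1 < hom2) ? hom2 : hom1    # commoner homozygote count
        n = homr + homc + hets                # number of genotyped individuals
        if (n == 0) { print "NA"; exit }
        rare = 2 * homr + hets                # copies of the rarer allele
        # Start near the expected heterozygote count, with the same parity as rare.
        mid = int(rare * (2 * n - rare) / (2 * n))
        if (mid % 2 != rare % 2) mid++
        p[mid] = 1; sum = 1
        # Relative probabilities for fewer heterozygotes than mid.
        r = (rare - mid) / 2; c = n - mid - r
        for (h = mid; h > 1; h -= 2) {
            p[h - 2] = p[h] * h * (h - 1) / (4 * (r + 1) * (c + 1))
            sum += p[h - 2]; r++; c++
        }
        # Relative probabilities for more heterozygotes than mid.
        r = (rare - mid) / 2; c = n - mid - r
        for (h = mid; h <= rare - 2; h += 2) {
            p[h + 2] = p[h] * 4 * r * c / ((h + 2) * (h + 1))
            sum += p[h + 2]; r--; c--
        }
        # p-value: total probability of outcomes no more likely than the observed one.
        pval = 0
        for (h in p) if (p[h] <= p[hets]) pval += p[h]
        pval /= sum
        if (pval > 1) pval = 1
        printf "%.4f\n", pval
    }'
}

# Hypothetical usage: hwe_exact_p 1 0 1
```

A marker would then be flagged when the returned value is at or below the threshold given to --hwe, in line with the "p <= 0.05" wording of the log output above.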

Mendel error rate

For family-based data only, to exclude individuals and/or markers on the basis of Mendel error rate, use the option:

plink --file mydata --me 0.05 0.1

where the two parameters are:

the first parameter specifies that families with more than 5% Mendel errors (considering all SNPs) will be discarded;
the second parameter specifies that SNPs with a Mendel error rate of more than 10% (i.e. based on the number of trios) will be excluded.

Please refer to the summary statistics page for more details on generating summary statistics for Mendel error rates.
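
For readers unfamiliar with what counts as a Mendel error, the sketch below checks a single father-mother-child trio at one SNP. It is a simplified illustration rather than PLINK's procedure: the function name mendel_consistent is hypothetical, the SNP is assumed to be autosomal, and a trio with any missing allele (coded 0) is simply treated as showing no error.

```shell
# Sketch only: check one trio at one autosomal SNP for Mendelian consistency.
# Arguments: father's two alleles, mother's two alleles, child's two alleles.
# Returns 0 if consistent (or if any allele is missing), 1 on a Mendel error.
mendel_consistent() {
    for allele in "$@"; do
        if [ "$allele" = "0" ]; then return 0; fi
    done
    # The child must carry one allele from the father and the other from the mother.
    if { [ "$5" = "$1" ] || [ "$5" = "$2" ]; } && { [ "$6" = "$3" ] || [ "$6" = "$4" ]; }; then
        return 0
    fi
    if { [ "$6" = "$1" ] || [ "$6" = "$2" ]; } && { [ "$5" = "$3" ] || [ "$5" = "$4" ]; }; then
        return 0
    fi
    return 1
}

# Hypothetical usage: mendel_consistent A A G G A G && echo consistent
```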

Note Currently, PLINK calculates the per SNP Mendel error rates at the same time as the per family error rates. In future releases, this may change such that the per family error rate is calculated after SNPs failing this test have been removed. Also, using this command currently removes entire nuclear families on the basis of high Mendel error rates: it will often be more appropriate to remove particular individuals (e.g. if a second sibling shows no Mendel errors). For this more fine-grained procedure, use the --mendel option to generate a complete enumeration of error rates by family and individual and exclude individuals as desired. Finally, it is possible to zero out specific Mendelian inconsistencies with the option --set-me-missing. This should be used in conjunction with a data generation command and the --me option. Specifically, the --me parameters should both be set to 1, in order not to exclude any particular SNP or individual/family, but instead to zero out only the specific genotypes with Mendel errors and save the dataset as a new file. (Both parental and offspring genotypes will be set to missing.)

plink --bfile mydata --me 1 1 --set-me-missing --make-bed --out newdata

This document last modified Wednesday, 25-Jan-2017 11:39:28 EST