Masks: syntax and examples
Masks are the primary means of extracting subsets of the variant database (VARDB). By default, a command such as v-view will use all available genotypes for all individuals on all variants. Masks can be used to restrict to subsets of variants, genotypes or individuals, given various criteria. Additionally, masks also are used to specify grouping factors for commands that assume groups of variants (e.g. assoc or g-stats).- Table of all masks grouped by type
- Basic syntax for masks options
- Combining multiple masks : rules for includes (ORs), requires (ANDs) and excludes (NOTs)
- @includes : including information from files in mask specifications
Available filters
Class | Attribute | Keywords |
---|---|---|
Filter variants | Variant sets |
var var.ex var.req |
Loci | loc loc.ex loc.req loc.subset loc.skip | |
Locus-sets | locset locset.ex locset.req locset.subset locset.skip | |
Variant ID |
id id.ex id.req novel novel.ex |
|
Reference variant database | ref ref.ex ref.req | |
Genomic region | reg reg.ex reg.req | |
Segments (regions in individuals) | seg seg.ex seg.req | |
Filter field (from VCF) | filter filter.ex filter.req any.filter any.filter.ex any.filter.req |
|
Variant meta-information | qual meta meta.req include exclude v-include v-exclude | |
Variant type | biallelic biallelic.ex indel indel.ex | |
Minor/Alternate allele frequency | mac maf aac aaf hwe monomorphic monomorphic.ex biallelic biallelic.ex | |
Missing data | null null.prop assume-ref | |
Case/control genotype counts | case.uniq control.uniq case.control | |
Functional annotation | annot annot.ex annot.req | |
Misc. | limit downcode collapse any-merge no-merge exact-merge fix-xy genotype-model no-vmeta no-geno no-gmeta |
|
Filter individuals | File/sample | file file.ex fail.nfile obs.file obs.file.req obs.file.ex obs.nfile alt.file alt.file.req alt.file.ex alt.nfile |
Individual ID | indiv index.ex alt.group alt.group.req alt.group.ex | |
Phenotype | phe phe.ex phe.req phe.obs phe.allow.missing | |
Set genotypes to null | Genotype meta-information | geno geno.req |
Group by | various | loc.group var.group locset.group all.group empty.group |
Append meta-inforation | various | loc.append locset.append var.append ref.append annot.append meta.attach |
Mask format
To illustrate some points of mask syntax, consider this example:
--mask loc.subset=CCDS,CCDS12345.1 mac=2-20 limit=1000 any.filter.ex
This mask translate to "variants in CCDS gene CCDS12345.1 that pass all VCF filters and have a minor allele count between 2 and 20, restricted to not more than 1000 variants". Specific points of syntax include: mask keywords...
- can be given in any order after --mask
- do not start with the double-dash (--), i.e. unlike all primary PSEQ options
- are space-delimited, i.e. a space between the loc.subset and mac options
- are case-sensitive (for both the options themselves and any arguments they take)
- often take one or more arguments, which should be in the form option=value1,value2. Importantly, here there is no whitespace between the equals sign; multiple arguments should be comma-delimited.
- exceptions to the last rule are the include family of masks, that take a quoted string that can have spaces, e.g. include="DP > 10"
If an identifier has space characters, the whole idenitifer must be enclosed in quotes. For example, the following is valid
indiv.ex=P1,P2,"ID A","ID B",P3
and indicates individuals with identifiers ID-space-A, etc. The following is not valid
indiv.ex = P1 , P2, "ID A", "ID B" , P3
as in this example there are spaces between the = and , elements of the option. Also, note that
indiv.ex="P1 , P2, ID A, ID B, P3"is valid and implies five individuals, but that the first ID is literally P1-space-, which is probably not the desired effect.
Some mask options expect a range of values to be specified. This should be entered, e.g. for mac mask option:
mac=1-5
meaning between 1 and 5 copies of the minor allele. The mask
mac=2-
means 2 or more copies. These two options can be equivalently written as specified
mac=1:5
and
mac=2:
which can be useful, as it avoids ambiguity in the term
mac=-2
(as some masks can in theory take negative numbers). In general, though, it is best to be specific when lower and upper bounds are known: i.e.
mac=0-2
for the above example. Also, when specifying a single number (i.e. singletons) give both upper and lower bounds:
mac=1-1
Combining multiple filters: includes, excludes and requires
A number of mask options come in three versions: as includes, excludes or requires. For example, the loc, loc.ex and loc.req correspond to these three types of mask.
- If one or more includes are specified, a variant that does not match at least one will be masked.
- If one or more excludes are specified, a variant that matches at least one will be masked.
- If one or more requires are specified, a variant that does not match all will be masked.
Including information from a file in a mask specification
It may often be necessary to provide long lists of identifiers (of genes, or individuals, etc) when creating a mask, which can be inconvenient in the context of a command-line. If this list exists as a separate file, it can be included as follows
--mask indiv.ex=@ex.list
will read a list of individuals from the file ex.list and, in
this example, forward them to the indiv.ex component of the
mask (which excludes individuals). The file ex.list is
gzip-compressed, it will be uncompressed automatically. By default,
the file is expected to contain one entry per row (which will be
converted into the equivalent comma-delimited string). These
statements can be included with any --mask options.
More than one @include can be given for a particular mask option. Also, @include statements can be incorporated along with other identifiers on the command line:
--mask indiv.ex=ID0001,@ex.list,ID0022
The basic @include syntax can be modified in two ways: with the # character, to specify entries that should be excluded from the generated list; with the #N suffix, to specify a particular column (in a tab-delimited file) should be used. (Avoid using characters @ or # in filenames.) If the file a.txt contains
ID1 ID2 ID3
then the command
--mask indiv.ex=@a.txt,ID4
will pass individuals ID1, ID2, ID3 and ID4 to indiv.ex. The command
--mask indiv.ex=@a.txt,#ID2,ID4
excudes ID2 from the @a.txt list, to pass ID1, ID3 and ID4 to indiv.ex.
If a second file b.txt contained the ID2 entry,
then the following would have a similar effect as above
--mask indiv.ex=@a.txt,#@b.txt,ID4
In summary:
- @file : read values from file
- @file,#value1,#value2 : read from file, but exclude values 1 and 2
- @file,#@file : read from first file, but exclude values from second file
- @file#2 : read values from 2nd column of file
The @include syntax is effectively equivalent to the Linux Bash shell syntax:
--mask indiv.ex=`cat ex.list`
Filter variants on variant-set(s)
If a variant-set called vset1 has been created (with the var-set command), then the mask option:
var=vset1
will include only those variants in that set. To include multiple sets, give a comma-delimited list of those sets:
var=vset1,vset2
To exclude variants in that set, use var.ex; to require that any variant returned is a member of at least one of the sets vset1 and vset2, use var.req=vset1,vset2.
A similar logic applies to any variant super-sets (sets of sets), via the mask options varset, varset.ex and varset.req.
Filter variants on loci
If a group called CCDS (case-sensitive) exists in the LOCDB, to include variants that fall within these loci, add the mask option loc=CCDS
Here CCDS implies a group of loci. The basic loc inclusion option can be modified in two ways, such that only a subset of loci in the group are included, via the subset and skip keywords. Specifically, to restrict the mask to a subset of loci in the CCDS group:
loc.subset=CCDS,CCDS635.1
More than one specific locus can be included as a comma-delimited list following the specification of the group:
loc.subset={group},locus1,locus2,...
Also, using the @ file inclusion option on the command line can load a list of loci/genes from a file:
loc.subset=CCDS,@my_gene_list.txt
assuming the file my_gene_list.txt is a plain-text file that contains a list of CCDS gene IDs. Entries in a .subset that do not match with the LOCDB will be ignored.
A similar syntax applies to the loc.skip command:
loc.skip=CCDS,CCDS769.1,CCDS31082.1
would include all variants in CCDS loci, ignoring CCDS769.1 and CCDS31082.1. Note that the skip option effectively ignores these loci for this particular group's inclusion, so if another locus spanned a variant that was in one of these skipped loci, it would still be included -- use the loc.ex command to ensure that variants in these loci are definitely excluded (i.e. whether or not a separate locus, or include statement would otherwise lead to their inclusion), if that is the goal.
The loc option can take more than one group, in the form of a comma-delimited list. (The loc.subset and loc.skip only apply to a single group at a time, however.)
The exclude and require forms of the loc mask option are:
loc.ex=CCDS
to exclude variants falling in loci in a CCDS gene, and
the require form (here taking two groups from LOCDB):
loc.req=CCDS,target_regions
which indicates that variants in a CCDS gene and in a target_regions group are included. In contrast,
loc=CCDS,target_regions
includes variants in a CCDS gene or in the target_regions group. Using the loc form will generally be faster than the loc.req form, and so should be preferred for single groups.
Note: the subset and skip forms do not exist for excludes and requires.
Filter variants on locus-set(s)
This set of commands applies to locus-sets, as loaded into a LOCDB by the locset-load PSEQ command. The syntax is similar to the loc mask option, but takes the form:
locset=group,set-name1,set-name2,...
Similar to loc, loc.subset, locset.skip, locset.ex and locset.req also exist. The principle syntactic difference is that the locus-set is always specified by a group-name (e.g. CCDS) and a locus-set name (e.g. KEGG).
Filter variants on reference-variant database(s)
If a REFDB is attached, and, for example, a group called g1k (i.e. 1000 Genomes data) exists in that database, the mask option
ref=g1k
will only include variants that are present in that reference variant group (present in 1000 Genomes data, in this case). If the REFDB group contains meta-information associated with that reference variant, it will be appended. For example, in this example, the reference variant group contains a flag attribute that indicates which 1000 Genomes panels the variant was seen in:
chr1:861078 rs28419423 C/G PASS 1 PASS DB=1;strand=+;P1-CHB+JPT chr1:867694 rs6672356 T/C PASS 1 PASS DB=1;strand=+;P1-CHB+JPT;P1-CEU;P1-YRI
Similarly, the ref.ex option will exclude sites that are found in the REFDB group(s), and ref.req will include only sites found in all REFDB groups, if more than one group is specified: e.g.
ref.req=g1k,dbSNP ref.ex=HGMD
means "sites in 1000 Genomes and dbSNP but not in the HGMD database" (assuming corresponding groups exist in the REFDB).
Filter variants on genomic region
To filter on a genomic interval, use the following mask option:
reg=chr1:10000000..15000000
will extract only variants on chromosome 1 between 10 and 15Mb. Multiple regions can be specified as a comma-delimited list (or with the @include syntax), and can point to individual bases or whole chromosomes, as follows:
reg=chr2,chr3:333333,chr4:444444..555555
note: we require a range to be specified by two dots start..stop -- the dash/minus character cannot be used here to indicate a range.
As with the other principle mask options, the reg keyword also comes in reg.ex and reg.req versions, to either exclude variants within the specified ranges, or to only include variants in the intersection of all specified ranges (by default, reg implies the union, if more than one range is given.
Filter genotypes on individuals' genomic interval
Filter variants on ID
The mask option
id=rs12345,rs99999
will extract only those two SNPs from the database (if they exist, and if compatible with other masks, of course). The corresponding options id.ex and id.req will exclude those variants, or require that any variant returned has an ID matching one of these two.
The convenience mask option novel can be used to return only novel variants (id=.); similarly, novel.ex implies id.ex=. (i.e. only return variants that do not an assigned rs-number (or any other non-missing entry in the ID field of the VCF).
Filter variants on FILTER field
Genotypic data in VCF files always have a FILTER field, which is used to indicate various, often technical properties, of the variant (i.e. relating to the likelihood that the variant is real and accurately annotated and genotyped). By convention, a "good" variant is given the filter value PASS. Otherwise, the FILTER field can take one or more flags. For example, the GATK has flags LowQual to represent a called variant with a below-threshold QUAL score; or SnpCluster to indicate that the variant has a large number of nearby variants also called (which can be potentially indicative of alignment problems, e.g. due to mismapping reads, or the presence of an undetected insertion/deletion polymorphism.).
The filter mask option is used to include, or exclude, variants based on this field:
filter=GATKStandard
implies that only variants with a GATKStandard flag are included (i.e. only low-quality variants that failed the default thresholds used by GATK in assigning filters). As with other mask options, this command also has filter.ex and filter.req forms, to exclude or require that multiple filters are present.
Three other forms match on any non-PASS values in the FILTER field (i.e. denoting variants that "fail" for any of one or more unspecified reasons). These commands do not take any arguments: a common form will be
--mask any.filter.ex
to remove any variants that have a non-PASS flag (equivalent to filter.req=PASS) -- i.e. typically indicating the "QC+" dataset. Commands any.filter and any.filter.req also exist.
Filter variants on variant meta-information
The meta (and related meta.req) mask options can be used to filter variants based upon the information encoded in the extensible meta-information attributes that can exist in a VCF, in the INFO field. Note that the include mask statement and the eval expressions provide a more flexible (and syntactically intuitive) way to filter on variant meta-information (although this will come at the cost of slightly slower processing, which is why the meta (and, even more so, the geno mask options are retained)
For example, a particular variant might have values as follows:
AB=0.75;AC=2;AF=0.01;AN=222;DB=0;DP=438;MQ=91.29;MQ0=0;type=Nonsense
which relate to allele balance (AB), allele count (AC), allele frequency (AF), total number of alleles/chromosomes called (AN), presence in dbSNP (DB, 0=false, 1=true), mean mapping quality (MQ), the number of mapping-quality zero reads spanning that position (MQ0), and functional class of the variant (TYPE, being Nonsense in this example). In a valid VCF, these tags will have been defined in the header of the VCF, e.g.
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
(generated by the GATK in this case), that indicates that the DP field is a single integer that conveys information about total depth (number of reads) at the variant's position.
For this VCF, the following mask:
meta=DP:ge:50,DB:is:1
will include a variant if the DP field is greater than or equal to 50 or the variant is in dbSNP (assuming the DP and DB fields have been specified in the VCF). The meta.req command is similar, except the conditions are combined as a logical AND.
The syntax of these expressions are variable:operator:value where the operator is one of:
eq equal to ne not equal to gt greather than ge greater than or equal to lt less than le less than or equal to
Filter variants based on type
Filter variants on sample polymorphism
To filter variants on the minor allele count, use mac:
mac=1-10
or
mac=1:10
which implies to select variants with between 1 and 10 minor alleles. (Note: this is a per-allele, not per-individual count, so 5 rare homozygotes are counted similarly to 10 rare heterozygotes.) If only a single integer is given, this implies the lower bound:
mac=10
means to include variants with 10 or more minor alleles.
To filter on frequencies rather than absolute counts, use maf instead of mac. This behaves in a similar manner to mac except it naturally takes the rate of missing genotypes into account when calculating frequency. Also, if only a single parameter is given, this implies an upper bound. Both mac and maf can be used in conjunction, therefore,
mac=2 maf=0.05
means to select variants with a sample MAF less than 0.05, but with at least 2 minor alleles present.
All frequencies and counts are calculated after any individuals have been removed, e.g. by indiv and phe filters, and after any genotype filters (geno) that might set genotypes to null values.
The corresponding aac and aaf filters consider the alternate allele count and alternate allele frequency, respectively (lumping together all alternate alleles at multi-allelic sites). This differs from the mac and maf filters in that the mac and maf look at the frequency of the minor allele (the less frequent allele), whether it's the reference or alternate alleles (thus maf is, by definition, never greater than 0.5, whereas aaf ranges from 0 to 1).
The keywords monomorphic (and monomorphic.ex) can be used to include (or exclude) monomorphic variants.
The keywords biallelic (and biallelic) can be used to include (or exclude) variants with exactly two alleles observed in the sample (post filtering of individuals and genotypes).
Genotyping rate
To filter on the number of null genotypes, use the mask option null, that takes a range of positive integers to specify the allowed range:
null=0-5
Ranges must always contain a dash/minus symbol, and always are assumed to represent positive integers:
null=0-5 between 0 and 5 null genotypes null=-5 as above, 5 or less null=5- 5 or more
Counts of null genotypes are calculated after removing any individuals and/or setting genotypes to null on the basis of a geno mask.
Filter variants on functional annotation
function not fully implemented yet / has known issuesTo filter on functional annotation of variants, use the mask command:
annot=Misense,Nonsense
as well adding the main flag
--annot refseq
to denote which transcript group in LOCDB should be used for annotation.
Alternate forms of this option are annot.ex and annot.req to specify exclude and require masks.
Note: although not completely implemented yet, future versions will handle annotation of variants by either by a) a VCF INFO attribute, specified in the meta-meta file, or b) annotation on-the-fly, using a specified LOCDB and SEQDB.Misc. filters
Options not covered elsewhere include:
To return only the first 100 variants (e.g. useful for testing a method, particularly when working in R, to avoid too many variants being read into memory at once) add the option:
limit=100
If a variant has genotype data from more than one sample in the VARDB: typically, filters such as:
include=" DP > 10 && DP < 300 "
apply to each sample (meaning a collection of 1 or more individuals
loaded from a single VCF) separately, as each sample/VCF will have
it's own DP field. If there are two samples in the database
(e.g. two VCFs loaded) but only one fails the above filter, by default
the remaining variant will be processed (with missing genotypes for
the individuals from the first sample and not in the second
sample). To exclude variants completely if at least one sample fails a
filter, add the mask option:
failsingle.file
Under development: If genotypes have a GL (genotype likelihood, vector of 3 floats) field and are biallelic, adding the mask option:
em=0.99
will perform an on-the-fly EM to estimate the sample allele frequency and posterior genotype probabilities (assuming HWE), and recall genotypes based on, in this example, a maximum posterior probability threshold of 0.99.
Filter by file/sample
To include (or exclude) particular files:
file=1,3
or
file.ex=2
Files can be referred to in two ways: by the number of the sample, as appears in the output of:
./pseq /path/to/project var-summary
or by the file tag that has been assigned to that file:
file.ex=batch2
Filter individuals by ID
To include or exclude individuals based on their ID:
indiv=ID001,ID002
(to exclude use indiv.ex). As shown above, this command can be combined with the @include syntax to work with lists of IDs in a file.
Filter individuals by phenotype
If phenotype data are available in INDDB, the individuals in a dataset can be filtered on those, using the phe mask:
phe=disease_a:2
which means include people with a phenotype value of 2 for the phenotype disease_a. As with other mask options such as loc, multiple phenotype filters can be specified with either the phe, phe.req or phe.ex options: e.g.
phe.req=disease_a:2,batch:B
means to include individuals with a phenotype value of 2 for disease_a (i.e. typically, that indicates a "case", or "affected") AND a value of B for the variable batch.
Note: currently, quantitative traits and ranges are not supported for filtering on phenotype.Note: : not currently implemented, but future versions will allow a more natural syntax, e.g. phe=" disease_a == 2 && batch == 'B' ", following the include mask option.
If the same phenotype is specified to match more than one value in phe.req, this is interpreted as follows:
phe.req=dis:2,batch:A,batch:B
means dis equals 2 AND ( batch equals A OR batch equals B).
Filter case/control non-reference allele counts
If a single binary phenotype from INDDB has been specified (via the --phenotype or --make-phenotype commands), the following mask option:
case.uniq
will include only variants for which the non-reference allele is only seen in cases. Similarly, the control.uniq works as expected. For more flexibility, two ranges can be directly specified
case.control=10-9999,0-1
will select variants with 10 or more alternate (non-reference) alleles in cases, but not more than 1 in controls.
Filter genotypes on genotype meta-information
To set certain genotypes to missing if they fail to meet certain criteria (based on any per-genotype meta-information):
geno=DP:ge:10
means to set the genotype to null if the person DP is not 10 or more. Multiple criteria can be combined in a comma-delimited list; geno implies a logical OR; geno.req can also be used to require that all criteria are meet.
geno.req=DP:ge:10,GQ:ge:0.99
If an individual does not possess the required tags to evaluate the specified criteria, the genotype will be set to null.
Note: in future, we will introduce the more flexible syntax following the include option, e.g.
geno=" DP >= 10 && GQ >= 0.99 "
Group variants by
A number of commands require groups of variants, e.g. that will often correspond to a unit such as a gene, set of genes, or genomic interval. The single grouping factor is specified:
loc.group=CCDS
which implies that the LOCDB contains a group called CCDS (i.e. previously loaded into that database from a GTF). This means that the variants will be grouped, in this example, as CCDS transcripts. This command can be combined with other options, e.g. the whole command may be
./pseq /path/to/project g-view --mask loc.group=CCDS loc.subset=CCDS,CCDS10123.1,CCDS10125.1
to produce per-gene summaries for two genes CCDS10123.1 and CCDS10125.1.
One can additionally group by variant super-set (sets of variant sets):
varset.group=mysets
or locus-sets (sets of loci)
locset.group=myloci
Adding the mask keyword empty.group will ensure that groups containing no variants are reported in downstream analysis (by default, they are excluded).
Instead of loc.group, varset.group or locset.group, one can specify all.group which means that all variants meeting the mask will be placed in a single group. This can be useful for defining quick groups on-the-fly (e.g. a pair of existing genes, without having to define a locus-set group; or a small region defined by a reg mask):
reg=chr22:17000000..19000000 all.group
Note that this action pulls all variant and genotype data into memory, and so if a very large number of variants are implied in the all.group mask, you are very likely to run out of memory (at least when working with a large dataset in terms of the number of individuals).
Note: Future releases will add a reg.group option that allows for sliding-windows to define a series of groups (i.e. as might be more applicable for whole-genome, rather than exome sequence data).
Append meta-information
Information from project databases can be appended as meta-information to the variant, but not used as a filter, using the append family of mask options. For example, loc.append=CCDS
will annotate each variant with two additional tags,if it falls in a CCDS gene (where CCDS is a group in the LOCDB), for example (columns omitted from v-view output)
chr1:1280751:. _LSET=CCDS24.1;_LGRP=3 chr1:1320660:. _LSET=CCDS30557.1,CCDS30558.1;_LGRP=3,3
Here, the tag _LSET indicates a field from the LOCDB has been appended; the tag _LGRP indicates which LOCDB group that value is from (i.e. in this example, the CCDS group is number 3 in the LOCDB (as evident from running the loc-summary). This can be used to keep track of the origin of each appended piece of information, if more than one group was specified (e.g. loc.append=CCDS,refseq). If a variantis in more than one gene, as in the secondrow, a vector of entries is appended.
In this example, we append from a REFDB group that corresponds to dbSNP.
ref.append=dbSNP130
which adds the following variant-level meta-information
chr1:939460:rs61766284 class=single;observed=C/T;refUCSC=C;strand=+;NAME=rs61766284;VALUE=coding-synon
These tags are appended at the Variant rather than the SampleVariant level. They can be queried via the v-include mask option:
--v-include=" VALUE == 'coding-synon' "
Note : options locset.append, var.append and annot.append are not yet implemented.
PSEQ and R mask syntax
Identical syntax is used in both the PSEQ command-line interface and the R library. Consider the following (contrived) example: to test singleton non-reference alleles in missense variants in all genes in CCDS on chromsome 22 in a certain KEGG pathway, that are also in target region, excluding genotypes with low covarage and variants that failed the standard GATK filter.
Using PSEQ: pseq /path/to/project assoc
--mask loc.group=CCDS
loc.req=targets
reg.req=chr22
locset.req=refseq,KEGG,ADHERENS_JUNCTION
meta=type:eq:Missense,type:eq:Nonsense
mac=1-1
geno=DP:ge:10
--phenotype scz
--perm 10000
Using the R library (lines are wrapped for clarity; also, this assumes a suitable assoc.test() function exists):
m <- "loc.group=CCDS loc.req=targets reg.req=chr22 locset.req=refseq,KEGG,ADHERENS_JUNCTION meta=type:eq:Missense,type:eq:Nonsense mac=1-1 geno=DP:ge:10" var.iterate( assoc.test , m )
Note that quotes, e.g. around include= statements, need to be escaped in R:
m <- " include=\" !DB || DP < 100 \" loc.group=CCDS "