Masks: syntax and examples

Masks are the primary means of extracting subsets of the variant database (VARDB). By default, a command such as v-view will use all available genotypes for all individuals on all variants. Masks can be used to restrict to subsets of variants, genotypes or individuals, given various criteria. Additionally, masks also are used to specify grouping factors for commands that assume groups of variants (e.g. assoc or g-stats).

Available filters

Class Attribute Keywords
Filter variants Variant sets
var var.ex var.req
Loci loc loc.ex loc.req loc.subset loc.skip
Locus-sets locset locset.ex locset.req locset.subset locset.skip
Variant ID
id id.ex id.req
novel novel.ex
Reference variant database ref ref.ex ref.req
Genomic region reg reg.ex reg.req
Segments (regions in individuals) seg seg.ex seg.req
Filter field (from VCF) filter filter.ex filter.req
any.filter any.filter.ex any.filter.req
Variant meta-information qual meta meta.req include exclude v-include v-exclude
Variant type biallelic biallelic.ex indel indel.ex
Minor/Alternate allele frequency mac maf aac aaf hwe monomorphic monomorphic.ex biallelic biallelic.ex
Missing data null null.prop assume-ref
Case/control genotype counts case.uniq control.uniq case.control
Functional annotation annot annot.ex annot.req
Misc. limit downcode collapse
any-merge no-merge exact-merge
fix-xy genotype-model
no-vmeta no-geno no-gmeta
Filter individuals File/sample file file.ex fail.nfile
obs.file obs.file.req obs.file.ex obs.nfile
alt.file alt.file.req alt.file.ex alt.nfile
Individual ID indiv index.ex alt.group alt.group.req alt.group.ex
Phenotype phe phe.ex phe.req phe.obs phe.allow.missing
Set genotypes to null Genotype meta-information geno geno.req
Group by various loc.group var.group locset.group
all.group empty.group
Append meta-inforation various loc.append locset.append var.append ref.append annot.append meta.attach

Mask format

To illustrate some points of mask syntax, consider this example:

  --mask loc.subset=CCDS,CCDS12345.1 mac=2-20 limit=1000 any.filter.ex 

This mask translate to "variants in CCDS gene CCDS12345.1 that pass all VCF filters and have a minor allele count between 2 and 20, restricted to not more than 1000 variants". Specific points of syntax include: mask keywords...

  • can be given in any order after --mask
  • do not start with the double-dash (--), i.e. unlike all primary PSEQ options
  • are space-delimited, i.e. a space between the loc.subset and mac options
  • are case-sensitive (for both the options themselves and any arguments they take)
  • often take one or more arguments, which should be in the form option=value1,value2. Importantly, here there is no whitespace between the equals sign; multiple arguments should be comma-delimited.
  • exceptions to the last rule are the include family of masks, that take a quoted string that can have spaces, e.g. include="DP > 10"

If an identifier has space characters, the whole idenitifer must be enclosed in quotes. For example, the following is valid

  indiv.ex=P1,P2,"ID A","ID B",P3

and indicates individuals with identifiers ID-space-A, etc. The following is not valid

  indiv.ex = P1 , P2, "ID A", "ID B" , P3

as in this example there are spaces between the = and , elements of the option. Also, note that

  indiv.ex="P1 , P2, ID A, ID B, P3"
is valid and implies five individuals, but that the first ID is literally P1-space-, which is probably not the desired effect.

Some mask options expect a range of values to be specified. This should be entered, e.g. for mac mask option:

 mac=1-5

meaning between 1 and 5 copies of the minor allele. The mask

 mac=2-

means 2 or more copies. These two options can be equivalently written as specified

 mac=1:5 

and

 mac=2: 

which can be useful, as it avoids ambiguity in the term

 mac=-2 

(as some masks can in theory take negative numbers). In general, though, it is best to be specific when lower and upper bounds are known: i.e.

 mac=0-2 

for the above example. Also, when specifying a single number (i.e. singletons) give both upper and lower bounds:

 mac=1-1 

Combining multiple filters: includes, excludes and requires

A number of mask options come in three versions: as includes, excludes or requires. For example, the loc, loc.ex and loc.req correspond to these three types of mask.

  • If one or more includes are specified, a variant that does not match at least one will be masked.
  • If one or more excludes are specified, a variant that matches at least one will be masked.
  • If one or more requires are specified, a variant that does not match all will be masked.
In other words, includes and requires are similar, except that if more than one is specified, an include acts as a logical OR whereas a require acts as a logical AND.

Including information from a file in a mask specification

It may often be necessary to provide long lists of identifiers (of genes, or individuals, etc) when creating a mask, which can be inconvenient in the context of a command-line. If this list exists as a separate file, it can be included as follows

--mask indiv.ex=@ex.list will read a list of individuals from the file ex.list and, in this example, forward them to the indiv.ex component of the mask (which excludes individuals). The file ex.list is gzip-compressed, it will be uncompressed automatically. By default, the file is expected to contain one entry per row (which will be converted into the equivalent comma-delimited string). These statements can be included with any --mask options.

More than one @include can be given for a particular mask option. Also, @include statements can be incorporated along with other identifiers on the command line:

--mask indiv.ex=ID0001,@ex.list,ID0022

The basic @include syntax can be modified in two ways: with the # character, to specify entries that should be excluded from the generated list; with the #N suffix, to specify a particular column (in a tab-delimited file) should be used. (Avoid using characters @ or # in filenames.) If the file a.txt contains

  ID1
  ID2
  ID3

then the command

--mask indiv.ex=@a.txt,ID4

will pass individuals ID1, ID2, ID3 and ID4 to indiv.ex. The command

--mask indiv.ex=@a.txt,#ID2,ID4

excudes ID2 from the @a.txt list, to pass ID1, ID3 and ID4 to indiv.ex.

If a second file b.txt contained the ID2 entry, then the following would have a similar effect as above --mask indiv.ex=@a.txt,#@b.txt,ID4

In summary:

  • @file : read values from file
  • @file,#value1,#value2 : read from file, but exclude values 1 and 2
  • @file,#@file : read from first file, but exclude values from second file
  • @file#2 : read values from 2nd column of file

The @include syntax is effectively equivalent to the Linux Bash shell syntax:

--mask indiv.ex=`cat ex.list`

Filter variants on variant-set(s)

If a variant-set called vset1 has been created (with the var-set command), then the mask option:

var=vset1

will include only those variants in that set. To include multiple sets, give a comma-delimited list of those sets:

var=vset1,vset2

To exclude variants in that set, use var.ex; to require that any variant returned is a member of at least one of the sets vset1 and vset2, use var.req=vset1,vset2.

A similar logic applies to any variant super-sets (sets of sets), via the mask options varset, varset.ex and varset.req.

Filter variants on loci

If a group called CCDS (case-sensitive) exists in the LOCDB, to include variants that fall within these loci, add the mask option loc=CCDS

Here CCDS implies a group of loci. The basic loc inclusion option can be modified in two ways, such that only a subset of loci in the group are included, via the subset and skip keywords. Specifically, to restrict the mask to a subset of loci in the CCDS group:

loc.subset=CCDS,CCDS635.1

More than one specific locus can be included as a comma-delimited list following the specification of the group:

loc.subset={group},locus1,locus2,...

Also, using the @ file inclusion option on the command line can load a list of loci/genes from a file:

loc.subset=CCDS,@my_gene_list.txt

assuming the file my_gene_list.txt is a plain-text file that contains a list of CCDS gene IDs. Entries in a .subset that do not match with the LOCDB will be ignored.

A similar syntax applies to the loc.skip command:

loc.skip=CCDS,CCDS769.1,CCDS31082.1

would include all variants in CCDS loci, ignoring CCDS769.1 and CCDS31082.1. Note that the skip option effectively ignores these loci for this particular group's inclusion, so if another locus spanned a variant that was in one of these skipped loci, it would still be included -- use the loc.ex command to ensure that variants in these loci are definitely excluded (i.e. whether or not a separate locus, or include statement would otherwise lead to their inclusion), if that is the goal.

The loc option can take more than one group, in the form of a comma-delimited list. (The loc.subset and loc.skip only apply to a single group at a time, however.)

The exclude and require forms of the loc mask option are:

loc.ex=CCDS to exclude variants falling in loci in a CCDS gene, and the require form (here taking two groups from LOCDB): loc.req=CCDS,target_regions

which indicates that variants in a CCDS gene and in a target_regions group are included. In contrast,

loc=CCDS,target_regions

includes variants in a CCDS gene or in the target_regions group. Using the loc form will generally be faster than the loc.req form, and so should be preferred for single groups.

Note: the subset and skip forms do not exist for excludes and requires.

Filter variants on locus-set(s)

This set of commands applies to locus-sets, as loaded into a LOCDB by the locset-load PSEQ command. The syntax is similar to the loc mask option, but takes the form:

locset=group,set-name1,set-name2,...

Similar to loc, loc.subset, locset.skip, locset.ex and locset.req also exist. The principle syntactic difference is that the locus-set is always specified by a group-name (e.g. CCDS) and a locus-set name (e.g. KEGG).

Filter variants on reference-variant database(s)

If a REFDB is attached, and, for example, a group called g1k (i.e. 1000 Genomes data) exists in that database, the mask option

ref=g1k

will only include variants that are present in that reference variant group (present in 1000 Genomes data, in this case). If the REFDB group contains meta-information associated with that reference variant, it will be appended. For example, in this example, the reference variant group contains a flag attribute that indicates which 1000 Genomes panels the variant was seen in:

   chr1:861078  rs28419423   C/G  PASS   1   PASS   DB=1;strand=+;P1-CHB+JPT
   chr1:867694  rs6672356    T/C  PASS   1   PASS   DB=1;strand=+;P1-CHB+JPT;P1-CEU;P1-YRI

Similarly, the ref.ex option will exclude sites that are found in the REFDB group(s), and ref.req will include only sites found in all REFDB groups, if more than one group is specified: e.g.

ref.req=g1k,dbSNP ref.ex=HGMD

means "sites in 1000 Genomes and dbSNP but not in the HGMD database" (assuming corresponding groups exist in the REFDB).

Filter variants on genomic region

To filter on a genomic interval, use the following mask option:

reg=chr1:10000000..15000000

will extract only variants on chromosome 1 between 10 and 15Mb. Multiple regions can be specified as a comma-delimited list (or with the @include syntax), and can point to individual bases or whole chromosomes, as follows:

reg=chr2,chr3:333333,chr4:444444..555555

note: we require a range to be specified by two dots start..stop -- the dash/minus character cannot be used here to indicate a range.

As with the other principle mask options, the reg keyword also comes in reg.ex and reg.req versions, to either exclude variants within the specified ranges, or to only include variants in the intersection of all specified ranges (by default, reg implies the union, if more than one range is given.

Filter genotypes on individuals' genomic interval

Filter variants on ID

The mask option

id=rs12345,rs99999

will extract only those two SNPs from the database (if they exist, and if compatible with other masks, of course). The corresponding options id.ex and id.req will exclude those variants, or require that any variant returned has an ID matching one of these two.

The convenience mask option novel can be used to return only novel variants (id=.); similarly, novel.ex implies id.ex=. (i.e. only return variants that do not an assigned rs-number (or any other non-missing entry in the ID field of the VCF).

Filter variants on FILTER field

Genotypic data in VCF files always have a FILTER field, which is used to indicate various, often technical properties, of the variant (i.e. relating to the likelihood that the variant is real and accurately annotated and genotyped). By convention, a "good" variant is given the filter value PASS. Otherwise, the FILTER field can take one or more flags. For example, the GATK has flags LowQual to represent a called variant with a below-threshold QUAL score; or SnpCluster to indicate that the variant has a large number of nearby variants also called (which can be potentially indicative of alignment problems, e.g. due to mismapping reads, or the presence of an undetected insertion/deletion polymorphism.).

The filter mask option is used to include, or exclude, variants based on this field:

filter=GATKStandard

implies that only variants with a GATKStandard flag are included (i.e. only low-quality variants that failed the default thresholds used by GATK in assigning filters). As with other mask options, this command also has filter.ex and filter.req forms, to exclude or require that multiple filters are present.

Three other forms match on any non-PASS values in the FILTER field (i.e. denoting variants that "fail" for any of one or more unspecified reasons). These commands do not take any arguments: a common form will be

--mask any.filter.ex

to remove any variants that have a non-PASS flag (equivalent to filter.req=PASS) -- i.e. typically indicating the "QC+" dataset. Commands any.filter and any.filter.req also exist.

Filter variants on variant meta-information

The meta (and related meta.req) mask options can be used to filter variants based upon the information encoded in the extensible meta-information attributes that can exist in a VCF, in the INFO field. Note that the include mask statement and the eval expressions provide a more flexible (and syntactically intuitive) way to filter on variant meta-information (although this will come at the cost of slightly slower processing, which is why the meta (and, even more so, the geno mask options are retained)

For example, a particular variant might have values as follows:

   AB=0.75;AC=2;AF=0.01;AN=222;DB=0;DP=438;MQ=91.29;MQ0=0;type=Nonsense

which relate to allele balance (AB), allele count (AC), allele frequency (AF), total number of alleles/chromosomes called (AN), presence in dbSNP (DB, 0=false, 1=true), mean mapping quality (MQ), the number of mapping-quality zero reads spanning that position (MQ0), and functional class of the variant (TYPE, being Nonsense in this example). In a valid VCF, these tags will have been defined in the header of the VCF, e.g.

 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

(generated by the GATK in this case), that indicates that the DP field is a single integer that conveys information about total depth (number of reads) at the variant's position.

For this VCF, the following mask: meta=DP:ge:50,DB:is:1

will include a variant if the DP field is greater than or equal to 50 or the variant is in dbSNP (assuming the DP and DB fields have been specified in the VCF). The meta.req command is similar, except the conditions are combined as a logical AND.

The syntax of these expressions are variable:operator:value where the operator is one of:

  eq    equal to
  ne    not equal to
  gt    greather than
  ge    greater than or equal to
  lt    less than 
  le    less than or equal to

Filter variants based on type

Filter variants on sample polymorphism

To filter variants on the minor allele count, use mac:

mac=1-10

or

mac=1:10

which implies to select variants with between 1 and 10 minor alleles. (Note: this is a per-allele, not per-individual count, so 5 rare homozygotes are counted similarly to 10 rare heterozygotes.) If only a single integer is given, this implies the lower bound:

mac=10

means to include variants with 10 or more minor alleles.

To filter on frequencies rather than absolute counts, use maf instead of mac. This behaves in a similar manner to mac except it naturally takes the rate of missing genotypes into account when calculating frequency. Also, if only a single parameter is given, this implies an upper bound. Both mac and maf can be used in conjunction, therefore,

mac=2 maf=0.05

means to select variants with a sample MAF less than 0.05, but with at least 2 minor alleles present.

All frequencies and counts are calculated after any individuals have been removed, e.g. by indiv and phe filters, and after any genotype filters (geno) that might set genotypes to null values.

The corresponding aac and aaf filters consider the alternate allele count and alternate allele frequency, respectively (lumping together all alternate alleles at multi-allelic sites). This differs from the mac and maf filters in that the mac and maf look at the frequency of the minor allele (the less frequent allele), whether it's the reference or alternate alleles (thus maf is, by definition, never greater than 0.5, whereas aaf ranges from 0 to 1).

The keywords monomorphic (and monomorphic.ex) can be used to include (or exclude) monomorphic variants.

The keywords biallelic (and biallelic) can be used to include (or exclude) variants with exactly two alleles observed in the sample (post filtering of individuals and genotypes).

Genotyping rate

To filter on the number of null genotypes, use the mask option null, that takes a range of positive integers to specify the allowed range:

null=0-5

Ranges must always contain a dash/minus symbol, and always are assumed to represent positive integers:

   null=0-5   between 0 and 5 null genotypes
   null=-5    as above, 5 or less
   null=5-    5 or more 

Counts of null genotypes are calculated after removing any individuals and/or setting genotypes to null on the basis of a geno mask.

Filter variants on functional annotation

function not fully implemented yet / has known issues

To filter on functional annotation of variants, use the mask command:

annot=Misense,Nonsense

as well adding the main flag

--annot refseq

to denote which transcript group in LOCDB should be used for annotation.

Alternate forms of this option are annot.ex and annot.req to specify exclude and require masks.

Note: although not completely implemented yet, future versions will handle annotation of variants by either by a) a VCF INFO attribute, specified in the meta-meta file, or b) annotation on-the-fly, using a specified LOCDB and SEQDB.

Misc. filters

Options not covered elsewhere include:

To return only the first 100 variants (e.g. useful for testing a method, particularly when working in R, to avoid too many variants being read into memory at once) add the option:

limit=100

If a variant has genotype data from more than one sample in the VARDB: typically, filters such as:

include=" DP > 10 && DP < 300 "

apply to each sample (meaning a collection of 1 or more individuals loaded from a single VCF) separately, as each sample/VCF will have it's own DP field. If there are two samples in the database (e.g. two VCFs loaded) but only one fails the above filter, by default the remaining variant will be processed (with missing genotypes for the individuals from the first sample and not in the second sample). To exclude variants completely if at least one sample fails a filter, add the mask option: failsingle.file

Under development: If genotypes have a GL (genotype likelihood, vector of 3 floats) field and are biallelic, adding the mask option:

em=0.99

will perform an on-the-fly EM to estimate the sample allele frequency and posterior genotype probabilities (assuming HWE), and recall genotypes based on, in this example, a maximum posterior probability threshold of 0.99.

Filter by file/sample

To include (or exclude) particular files:

file=1,3

or

file.ex=2

Files can be referred to in two ways: by the number of the sample, as appears in the output of:

./pseq /path/to/project var-summary

or by the file tag that has been assigned to that file:

file.ex=batch2

Filter individuals by ID

To include or exclude individuals based on their ID:

indiv=ID001,ID002

(to exclude use indiv.ex). As shown above, this command can be combined with the @include syntax to work with lists of IDs in a file.

Filter individuals by phenotype

If phenotype data are available in INDDB, the individuals in a dataset can be filtered on those, using the phe mask:

phe=disease_a:2

which means include people with a phenotype value of 2 for the phenotype disease_a. As with other mask options such as loc, multiple phenotype filters can be specified with either the phe, phe.req or phe.ex options: e.g.

phe.req=disease_a:2,batch:B

means to include individuals with a phenotype value of 2 for disease_a (i.e. typically, that indicates a "case", or "affected") AND a value of B for the variable batch.

Note: currently, quantitative traits and ranges are not supported for filtering on phenotype.

Note: : not currently implemented, but future versions will allow a more natural syntax, e.g. phe=" disease_a == 2 && batch == 'B' ", following the include mask option.

If the same phenotype is specified to match more than one value in phe.req, this is interpreted as follows:

phe.req=dis:2,batch:A,batch:B

means dis equals 2 AND ( batch equals A OR batch equals B).

Filter case/control non-reference allele counts

If a single binary phenotype from INDDB has been specified (via the --phenotype or --make-phenotype commands), the following mask option:

case.uniq

will include only variants for which the non-reference allele is only seen in cases. Similarly, the control.uniq works as expected. For more flexibility, two ranges can be directly specified

case.control=10-9999,0-1

will select variants with 10 or more alternate (non-reference) alleles in cases, but not more than 1 in controls.

Filter genotypes on genotype meta-information

To set certain genotypes to missing if they fail to meet certain criteria (based on any per-genotype meta-information):

geno=DP:ge:10

means to set the genotype to null if the person DP is not 10 or more. Multiple criteria can be combined in a comma-delimited list; geno implies a logical OR; geno.req can also be used to require that all criteria are meet.

geno.req=DP:ge:10,GQ:ge:0.99

If an individual does not possess the required tags to evaluate the specified criteria, the genotype will be set to null.

Note: in future, we will introduce the more flexible syntax following the include option, e.g.

geno=" DP >= 10 && GQ >= 0.99 "

Group variants by

A number of commands require groups of variants, e.g. that will often correspond to a unit such as a gene, set of genes, or genomic interval. The single grouping factor is specified:

loc.group=CCDS

which implies that the LOCDB contains a group called CCDS (i.e. previously loaded into that database from a GTF). This means that the variants will be grouped, in this example, as CCDS transcripts. This command can be combined with other options, e.g. the whole command may be

./pseq /path/to/project g-view --mask loc.group=CCDS loc.subset=CCDS,CCDS10123.1,CCDS10125.1

to produce per-gene summaries for two genes CCDS10123.1 and CCDS10125.1.

One can additionally group by variant super-set (sets of variant sets):

varset.group=mysets

or locus-sets (sets of loci)

locset.group=myloci

Adding the mask keyword empty.group will ensure that groups containing no variants are reported in downstream analysis (by default, they are excluded).

Instead of loc.group, varset.group or locset.group, one can specify all.group which means that all variants meeting the mask will be placed in a single group. This can be useful for defining quick groups on-the-fly (e.g. a pair of existing genes, without having to define a locus-set group; or a small region defined by a reg mask):

reg=chr22:17000000..19000000 all.group

Note that this action pulls all variant and genotype data into memory, and so if a very large number of variants are implied in the all.group mask, you are very likely to run out of memory (at least when working with a large dataset in terms of the number of individuals).

Note: Future releases will add a reg.group option that allows for sliding-windows to define a series of groups (i.e. as might be more applicable for whole-genome, rather than exome sequence data).

Append meta-information

Information from project databases can be appended as meta-information to the variant, but not used as a filter, using the append family of mask options. For example,

loc.append=CCDS

will annotate each variant with two additional tags,if it falls in a CCDS gene (where CCDS is a group in the LOCDB), for example (columns omitted from v-view output)

   chr1:1280751:.  _LSET=CCDS24.1;_LGRP=3
   chr1:1320660:.  _LSET=CCDS30557.1,CCDS30558.1;_LGRP=3,3

Here, the tag _LSET indicates a field from the LOCDB has been appended; the tag _LGRP indicates which LOCDB group that value is from (i.e. in this example, the CCDS group is number 3 in the LOCDB (as evident from running the loc-summary). This can be used to keep track of the origin of each appended piece of information, if more than one group was specified (e.g. loc.append=CCDS,refseq). If a variantis in more than one gene, as in the secondrow, a vector of entries is appended.

In this example, we append from a REFDB group that corresponds to dbSNP.

ref.append=dbSNP130

which adds the following variant-level meta-information

  chr1:939460:rs61766284  class=single;observed=C/T;refUCSC=C;strand=+;NAME=rs61766284;VALUE=coding-synon

These tags are appended at the Variant rather than the SampleVariant level. They can be queried via the v-include mask option:

--v-include=" VALUE == 'coding-synon' "

Note : options locset.append, var.append and annot.append are not yet implemented.

PSEQ and R mask syntax

Identical syntax is used in both the PSEQ command-line interface and the R library. Consider the following (contrived) example: to test singleton non-reference alleles in missense variants in all genes in CCDS on chromsome 22 in a certain KEGG pathway, that are also in target region, excluding genotypes with low covarage and variants that failed the standard GATK filter.

Using PSEQ: pseq /path/to/project assoc --mask loc.group=CCDS loc.req=targets reg.req=chr22 locset.req=refseq,KEGG,ADHERENS_JUNCTION meta=type:eq:Missense,type:eq:Nonsense mac=1-1 geno=DP:ge:10 --phenotype scz --perm 10000

Using the R library (lines are wrapped for clarity; also, this assumes a suitable assoc.test() function exists):

  m <- "loc.group=CCDS loc.req=targets reg.req=chr22 locset.req=refseq,KEGG,ADHERENS_JUNCTION
        meta=type:eq:Missense,type:eq:Nonsense mac=1-1 geno=DP:ge:10"
  
  var.iterate( assoc.test , m ) 

Note that quotes, e.g. around include= statements, need to be escaped in R:

m <- " include=\" !DB || DP < 100 \" loc.group=CCDS "