Evals
PLINK/SEQ provides a mechanism for on-the-fly calculation of custom variant attributes through an expressive syntax known as evals. This page describes evals in the following sections:
- Overview: basic concepts of eval syntax
- Usage: using evals on basic variant-level meta-data
- Vector operations: how vector meta-information variables are handled
- Genotype-level data: evaluating genotypes and genotypic meta-information with g-functions
- Phenotypes: incorporating phenotype information in eval expressions
- Examples: some example eval expressions
- Reference: list of operators and functions in eval expressions
Overview
Eval expressions primarily provide a more flexible way to filter variants based on meta-information than the meta and geno mask options. They also extend to:
- handle meta-data in array form
- allow arbitrary logical and arithmetic expressions to be evaluated
- allow the creation of new meta-data variables on-the-fly
- allow the modification of existing meta-data
- allow genotype and phenotype level information to be intersected with variant-level meta-data.
This increased flexibility comes at the cost of being computationally slower than the meta and geno masks.
Basic usage with variant-level meta-data
The basic syntax for an eval expression is via the include statement in the mask, which takes a quoted expression as its sole argument:
--mask include=" DP > 50 "
This is equivalent to the following meta mask, which includes the variant if the DP tag (e.g. a tag defined in a VCF to represent variant read-depth) is greater than 50 (with less intuitive syntax, but slightly faster processing):
--mask meta=DP:gt:50
Along with include, there is a mask statement exclude (that excludes variants where the expression evaluates to true) and v-include (that operates on variant-level, rather than sample-variant level data: this distinction only applies in the context of a multi-sample project).
Arithmetic expressions: eval expressions can evaluate simple
--mask include=" X/Y <= 2 "
or more complex
--mask include="(X-Y)^2/(X+Y)>3.8 && sqrt(log(A/B))<C"
arithmetic and logical expressions involving arbitrary meta-data and various mathematical functions.
Meta-data types: Eval statements have strong typing in the sense of each token being of Integer, Float, String or Bool type (as per the VCF specification, with the Bool type corresponding to Flag). This impacts whether certain operations are allowed, or how they are evaluated: for example, if I is an integer value (2), F is a floating-point value (0.5), and S is a string-value (text), then:
I + I --> 4          (integer)
I + F --> 2.5        (float)
S + S --> 'texttext' (string concatenation)
I + S --> '.'        (operation not defined, so returns a 'null' value)
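To make these typing rules concrete, here is a minimal Python sketch (not PSEQ code; the eval_add() helper is a hypothetical stand-in) of how addition behaves across the four cases above:

```python
# Minimal sketch (not PSEQ code) of the eval typing rules: addition is
# defined within numeric types and for string concatenation, but a
# mixed numeric/string operation yields the null value ('.').
NULL = '.'

def eval_add(a, b):
    # Integer/Float addition (Python promotes int + float to float,
    # mirroring PSEQ's implicit numeric type-casting)
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b
    # String concatenation
    if isinstance(a, str) and isinstance(b, str):
        return a + b
    # Operation not defined: return null
    return NULL

print(eval_add(2, 2))            # 4 (integer)
print(eval_add(2, 0.5))          # 2.5 (float)
print(eval_add('text', 'text'))  # 'texttext'
print(eval_add(2, 'text'))       # '.'
```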
Null values: missing data (i.e. if a requested tag is not defined for a particular variant) or illegal operations (e.g. trying to add a string and an integer value) result in null values. Most operations that involve a null value will return a null value too, meaning that operation is undefined. By default, if the entire expression evaluates to null, it will not be included by an include statement. (It will also not be excluded by an exclude statement, so these two forms are not completely redundant.) Handling datasets with irregular meta-data (i.e. not every variant has the same set of attributes) can often be inconvenient for this reason. Here, one can use one or more of the following:
- the set(tag-name) function, that returns true if that tag exists for that variant
- the fact that A || B is evaluated in a lazy manner, meaning that if A is true, then B (i.e. that might otherwise return a null value) is not evaluated; similarly, in the expression A && B, B is never evaluated if A is false.
- the ifelse() conditional function, described below.
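The interaction between null values, set() and lazy evaluation can be sketched in Python (this is illustrative only; the meta dictionary and the get()/gt() helpers are hypothetical stand-ins, with None playing the role of PSEQ's null):

```python
# Sketch of null propagation: operations touching a missing tag return
# null, and a set()-style guard with lazy && avoids introducing one.
NULL = None  # stands in for PSEQ's null value

meta = {'DP': 42}  # this variant has a DP tag but no AB tag

def get(tag):
    return meta.get(tag, NULL)

def gt(tag, threshold):
    v = get(tag)
    return NULL if v is NULL else v > threshold

# "set(AB) && AB > 0.75": Python's `and` is lazy like PSEQ's &&, so the
# right-hand side is never evaluated when the left is false and no null
# value is introduced.
result = (get('AB') is not NULL) and gt('AB', 0.75)
print(result)  # False: the variant is simply not matched, rather than null
```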
Creating new variables: Eval expressions can also be used to create new variables, via the assignment operator (a single equals character, =); the new variable will then be present for any subsequent operation (e.g. writing out a new VCF, or as part of an analytic procedure):
--mask include=" X = A + B "
Note that assignments always evaluate to true (that is, ( A = B ) == TRUE is always true, regardless of the value of B). Thus, in the above example, all variants would still be included. Similarly, the expression:
--mask include=" A = B < 10 "
will include all variants, not just those for which A is true.
Conditional statements: Eval expressions allow for conditional evaluation of expressions, via the ifelse() function. Here is one example of using it to provide a default value for a meta-field if it is missing in the VCF:
--mask include=" DP = ifelse( set(DP) , DP , 0 ) "
i.e. if the DP tag is not present, create it with a default value of 0; otherwise, keep its value as is.
Multi-clause expressions: Eval expressions can contain multiple, semi-colon delimited clauses:
--mask include="X=Y/Z; X>10"
In this example, we both create a new variable (i.e. that would appear in the output of a v-view --vmeta command) and restrict the output only to those variants for which X is greater than 10. That is, the expressions are evaluated sequentially left-to-right, with any modifications to variables, or newly created variables, being available for the rest of the expression. Only the Boolean value of the rightmost expression is returned as the final result of the expression (i.e. for determining inclusion or exclusion of that variant).
To illustrate some of these points concretely on a complete command line (assuming the VCF has existing AC and AN tags):
pseq proj v-view --mask include="A=2; B=A+2; C=AC/AN" --vmeta
chr1:20352    rs422363    AC=10;AN=10;A=2;B=4;C=1
chr1:59374    rs2691305   AC=147;AN=180;A=2;B=4;C=0.816667
chr1:855442   .           AC=1;AN=258;A=2;B=4;C=0.00387597
...
A current known issue, however, is that the type of a variable must be constant across the expression:
A=2; A=4
is acceptable, and A will equal 4 by the end of the expression. However,
for the expression:
A=2; A=2.5
where A implicitly changes type (it will initially be an Integer from the first assignment), will not evaluate as expected: A will remain equal to 2. Although there is implicit type-casting across operations (e.g. adding an Integer and a Float returns a Float), you cannot change the type of any one tag. This can be an issue in at least two circumstances: first, when trying to 'normalize' an Integer array, e.g. to express the AD allelic depth metric (an Integer, allele-specific read depth) as a normalized, floating-point vector that sums to unity:
AD = AD / sum(AD)
Here you should instead assign the result to a different tag:
AD2 = AD / sum(AD)
More annoyingly, you cannot redefine a tag that exists prior to the expression: e.g. if DP is the Integer read-depth, then a simple expression such as
DP = 0.1
will not evaluate. PLINK/Seq currently has a list of reserved, pre-defined tag names (from the VCF spec) that should therefore be avoided. We expect to fix this issue (so that existing types can be overridden) in the next release.
Treatment of vector (array) meta-data
VCF allows for tags (meta-information) to be in the form of vectors as well as scalars (e.g. PL=0,87,255). This section describes the various ways that you can use, or modify, vectors in evaluated expressions.
Typically arrays will be specified in the VCF, although they can be assigned within an eval expression:
X = int(1,2,3)
is equivalent to X=1,2,3 in a VCF (and specifies that X is expected to be an Integer variable). Other functions to create arrays are vec(), str() and bool() for floating-point, string and boolean arrays respectively. Note that arrays cannot consist of heterogeneous types (consistent with the VCF spec).
Most of the main arithmetic and logical operators can be applied to vectors as well as scalars, namely:
+ - * / == != < > <= >= && ||
The above operators can be used on two vectors, as long as they are the
same length. Note that these operate element-wise, so that if
A=int(1,2,3)
and
B=int(2,4,6)
then C=A*B equals [2,8,18]. These operations can also combine vector and scalar variables, so C=A/2 gives [0.5,1,1.5]. To give some more examples:
C < 10
would in this example evaluate to the Boolean vector:
[ T , T , F ]
If X is a vector, constructs in the form:
sum( X == 10 )
or
max( X == 10 )
are often useful for evaluating either how many elements of a vector (sum()), or whether at least one (max()), match that condition (in this case, being equal to 10).
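The element-wise semantics, and the sum()/max() idioms, can be mimicked with plain Python lists (an illustrative sketch, not PSEQ code):

```python
# Sketch of element-wise vector semantics using plain Python lists
A = [1, 2, 3]
B = [2, 4, 6]

C = [a * b for a, b in zip(A, B)]   # C = A * B, element-wise
print(C)                            # [2, 8, 18]

print([a / 2 for a in A])           # A / 2, vector-scalar -> [0.5, 1.0, 1.5]

print([c < 10 for c in C])          # C < 10 -> [True, True, False]

# sum(X == 10): how many elements match; max(X == 10): does any match?
X = [10, 7, 10]
print(sum(x == 10 for x in X))      # 2
print(max(x == 10 for x in X))      # True
```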
Individual elements of array can be accessed with the [] operator: e.g. PL[2] will return the second element in an array called PL (or a null value if the array doesn't exist, or has fewer than two elements). Note that array indexes are 1-based (1,2,3), not 0-based as in many programming languages including C/C++ (0,1,2). The [] operator can take a scalar, as above, or one can extract out multiple elements (in a manner somewhat similar to the R language): this expression
X[int(1,3)]
or, equivalently:
I = int(1,3) ; X[I]
will return a vector of two elements, being the 1st and 3rd element of the vector X. If M is a boolean array of the same length as X then the expression:
X[M]
will return the subset of X for which the corresponding element in M is true.
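The 1-based [] operator, including index-vector and boolean-mask forms, can be sketched in Python as follows (the index() helper is hypothetical, with None standing in for the null value):

```python
# Sketch of PSEQ's 1-based [] operator and boolean masking
def index(vec, idx):
    """Scalar: return the 1-based element (None if out of range).
    List of bools: mask. List of ints: 1-based index vector."""
    if isinstance(idx, int):
        return vec[idx - 1] if 1 <= idx <= len(vec) else None
    if all(isinstance(i, bool) for i in idx):           # boolean mask
        return [v for v, keep in zip(vec, idx) if keep]
    return [vec[i - 1] for i in idx]                    # index vector

X = [5, 6, 7]
print(index(X, 2))                     # 6: second element (1-based)
print(index([0, 87], 3))               # None: array has fewer elements
print(index(X, [1, 3]))                # [5, 7]: like X[int(1,3)]
print(index(X, [True, False, True]))   # [5, 7]: like X[M]
```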
Vector functions include:
- min(x) : return the minimum value in any array (applicable to all types)
- max(x) : return the maximum value in any array (applicable to all types)
- sum(x) : return the sum of values in an array (Integer and Float arrays only)
- mean(x) : return the mean of values in an array (Integer and Float arrays only)
- sort(x) : returns a sorted array (applicable to all types)
- c(x,y) : concatenate two arrays (applicable to arrays of similar types)
- length(x) : return the length of an array
For example, if you are expecting an array of three values (e.g. a genotype likelihood PL tag), the following would extract the median value, or otherwise return 0 by default:
ifelse( length(PL)==3 , sort(PL)[2] , 0 )
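The same median-with-default idiom, sketched in Python (illustrative only; the median_or_default() helper is hypothetical):

```python
# Sketch of ifelse( length(PL)==3 , sort(PL)[2] , 0 ): take the median
# of a three-element vector, with a default of 0 for other lengths.
def median_or_default(pl):
    # sort(PL)[2] is the second element (1-based) of the sorted array
    return sorted(pl)[1] if len(pl) == 3 else 0

print(median_or_default([255, 0, 87]))  # 87
print(median_or_default([0, 87]))       # 0 (wrong length -> default)
```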
Incorporating genotype-level meta-data in eval statements
Some special functions (g-functions) take an expression and apply it to all the genotypes, returning a summary value for that variant, which can be incorporated into the overall variant-level expression.
There are currently three functions that work at the genotype level:
- g(): evaluate the expression at the genotype level, returning the result to the variant level
- gf(): filter (set genotypes to null) based on the per-genotype evaluation of meta-data
- gs(): set genotypes to some non-null value based on per-genotype meta-data
Any expression within the g*( expression ) enclosure is evaluated with respect to all genotype-level meta-data for that variant. For example, consider the dummy dataset, comprising five individuals and three variants, with a genotype-level tag called GM:
pseq ex1.vcf v-view --show GM --gmeta
chr1:1001 rs1001 T/C . 1 PASS
  P001 1 C/C [GM=1]
  P002 1 T/T [GM=2]
  P003 1 T/C [GM=3]
  P004 1 C/C [GM=4]
  P005 1 C/C [GM=5]
chr1:1002 rs1002 G/A . 1 PASS
  P001 1 A/A [GM=6]
  P002 1 G/A [GM=7]
  P003 1 G/A [GM=8]
  P004 1 A/A [GM=9]
  P005 1 A/A [GM=10]
chr1:1003 rs1003 G/A . 1 PASS
  P001 1 G/G [GM=11]
  P002 1 G/G [GM=12]
  P003 1 G/A [GM=13]
  P004 1 A/G [GM=14]
  P005 1 G/G [GM=15]
The same command, but with the following mask added:
--mask include=" gf( GM > 3 && GM %% 2 != 0 ) "
would result in the following output:
chr1:1001 rs1001 T/C . 1 PASS
  P001 1 ./. [GM=1]
  P002 1 ./. [GM=2]
  P003 1 ./. [GM=3]
  P004 1 C/C [GM=4]
  P005 1 C/C [GM=5]
chr1:1002 rs1002 G/A . 1 PASS
  P001 1 ./. [GM=6]
  P002 1 G/A [GM=7]
  P003 1 ./. [GM=8]
  P004 1 A/A [GM=9]
  P005 1 ./. [GM=10]
chr1:1003 rs1003 G/A . 1 PASS
  P001 1 G/G [GM=11]
  P002 1 ./. [GM=12]
  P003 1 G/A [GM=13]
  P004 1 ./. [GM=14]
  P005 1 G/G [GM=15]
That is, genotypes are included (using the gf() filter) if GM is greater than 3 and an odd number (%% is the modulus operator), but otherwise set to null (./.).
To illustrate the g() function: this example returns a vector (of length N, the number of individuals in the dataset) to the variant-level expression:
pseq ex1.vcf v-view --mask include="X = g( GM > 3 ) " --vmeta --show X
chr1:1001   rs1001   T/C   .   1   PASS   .   X=0,0,0,1,1
chr1:1002   rs1001   G/A   .   1   PASS   .   X=1,1,1,1,1
chr1:1003   .        G/A   .   1   PASS   .   X=1,1,1,1,1
In this manner, one can imagine how new variant-level meta-tags can be easily calculated based on the genotype-level meta-data, e.g.:
--mask include=" X = mean( g( GM > 3 ) ) "
chr1:1001   rs1001   T/C   .   1   PASS   .   X=0.4
chr1:1002   rs1001   G/A   .   1   PASS   .   X=1
chr1:1003   .        G/A   .   1   PASS   .   X=1
In general, all the same features available in usual expressions can be applied within a g-function (with the exception that nested g-functions are not allowed). The g() function (which simply evaluates an expression on the genotype-level meta-data and returns the value to the parent variant-level expression) and the gf() function can also be used to create (or modify) genotype-level meta-data:
--mask include=" gf( GG = GM > 3 && GM %% 2 != 0 ) "
1:1001 rs1001 T/C . 1 PASS
  P001 1 C/C [GM=1;GG=0]
  P002 1 T/T [GM=2;GG=0]
  P003 1 T/C [GM=3;GG=0]
  P004 1 C/C [GM=4;GG=0]
  P005 1 C/C [GM=5;GG=1]
1:1002 rs1002 G/A . 1 PASS
  P001 1 A/A [GM=6;GG=0]
  P002 1 G/A [GM=7;GG=1]
  P003 1 G/A [GM=8;GG=0]
  P004 1 A/A [GM=9;GG=1]
  P005 1 A/A [GM=10;GG=0]
1:1003 rs1003 G/A . 1 PASS
  P001 1 G/G [GM=11;GG=1]
  P002 1 G/G [GM=12;GG=0]
  P003 1 G/A [GM=13;GG=1]
  P004 1 A/G [GM=14;GG=0]
  P005 1 G/G [GM=15;GG=1]
In this example, we create a new tag, GG rather than set the genotype to null (i.e. because an assignment is always true, using the gf() and g() functions would give identical results in this example).
The g-functions understand some special codes to access the genotype data:
include=" MAF = mean( g( GT_A ) ) / 2 "
The GT_A tag is a special keyword that returns a vector of the number of alternate allele counts for each individual. The function above therefore calculates minor allele frequency (assuming no missing data).
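The arithmetic behind that expression can be sketched in Python (illustrative only; the gt_a vector is made-up dummy data, not the GT_A keyword itself):

```python
# Sketch of MAF = mean( g( GT_A ) ) / 2: g(GT_A) yields per-individual
# alternate allele counts; the mean divided by 2 (for diploids) is the
# alternate allele frequency, assuming no missing data.
gt_a = [0, 1, 1, 2, 0]             # alternate allele counts, 5 individuals

maf = (sum(gt_a) / len(gt_a)) / 2  # mean(g(GT_A)) / 2
print(maf)                         # 0.4
```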
Note that eval expressions including (complex) g-functions will be particularly slow in large samples (i.e. 100s or 1000s of individuals) compared to the geno mask, because the expression has to be re-evaluated for every genotype. Therefore, use them in a directed, sparing manner. (They will still likely be quicker than coding your own Perl script and working directly on the VCF, for most people at least...)
Incorporating phenotypes in eval statements
If a phenotype named phe1 exists in the INDDB, this can be accessed via the p() function:
--mask include=" P = p(phe1)"
In future versions, p() will be allowed on the left-hand side of an assignment, to create phenotypes on the fly that can be a function of genotype meta-data, for example:
--mask include=" p(p1) = g(DP) "
Reference: available operators and functions
Eval expressions follow closely C/C++ style syntax for the following operations:
&&   logical AND (also &)
||   logical OR (also |)
+    addition
-    subtraction
*    multiplication
/    division
%    modulus (also %%)
==   equals
!=   not equals
!    logical (unary) not
>    greater than
>=   greater than or equal to
<    less than
<=   less than or equal to
( )  parentheses, to group expressions
=    assignment operator
;    separates different expressions
The following statistical functions are also available:
sqr(x)            square of x
sqrt(x)           square root of x
pow(x,y)          x to the power of y
log(x)            natural log of x
exp(x)            exponent of x
set(V)            returns true/false for whether variable V is present
ifelse(cond,X,Y)  return X if condition is true, otherwise Y
Examples
Here we give a few examples of how eval expressions in masks might be used:
Standardize values: To make a vector sum to unity:
" Y = X / sum(X) "
To standardize a quantitative trait:
" X = X-mean(X) ; VX=(sum(sqr(X))-sqr(sum(X)))/(n()-1); Z=X/sqrt(VX) "
Convert GL to PL: This keeps the PL tag if it exists; otherwise, substitutes -10log(GL) if the GL tag exists:
" PL = ifelse(set(PL), PL, ifelse(set(GL), -10.0*log(GL), vec(0,0,0) ) ) "
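The nested-ifelse logic above can be sketched as a Python function (illustrative only; the derive_pl() helper is hypothetical, and it mirrors the -10*log(GL) form used in the expression above rather than the log10 scaling real VCF GL fields typically use):

```python
import math

# Sketch of the nested ifelse() fallback: keep PL if present, otherwise
# derive it from GL via -10*log(GL) (as in the expression above),
# otherwise default to vec(0,0,0).
def derive_pl(meta):
    if 'PL' in meta:
        return meta['PL']
    if 'GL' in meta:
        return [-10.0 * math.log(gl) for gl in meta['GL']]
    return [0, 0, 0]

print(derive_pl({'PL': [0, 87, 255]}))  # existing PL kept
print(derive_pl({}))                    # [0, 0, 0] default
```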
Calculation of genotype-phenotype association statistics: This is more of a party-trick
than a practical suggestion, but it does serve to show a few aspects of
eval expressions in use: the calculation of a correlation coefficient between genotype and
phenotype, purely using the eval expression language. We'll first generate some dummy data for a single
SNP, using PLINK:
echo "1 snp1 0.2 0.2 2 4" > sim1.sim
plink --simulate sim1.sim
--simulate-ncases 200
--simulate-ncontrols 200
--simulate-prevalence 0.01
--make-bed
--out sim1
Next, we recode the data as a plain-text file that can be read into R:
plink --bfile sim1 --recodeA --out s1
and then use R to calculate the correlation coefficient between the phenotype and genotype (allele dosage) fields:
cor( d$PHENOTYPE , d$snp1_d )
[1] 0.1934540
To do similarly using PSEQ, we first need to create a new project:
pseq pp1 new-project
and load the binary PED file into it:
pseq pp1 load-plink --file sim1 --id 1
expecting 1 variants on 400 individuals
inserted 400 individuals, 1 variants
Now, for this single SNP, we can use the following include expression to calculate the variance of the phenotype, of the genotype (allele dosage, from GT_A) and the covariance/correlation between these two measures, using g() and p() functions, vector operations and functions in a multi-expression statement:
pseq pp1 v-view --vmeta --show VG VP COR
--mask include="PM = p(phe1) - mean( p(phe1) ) ;
GM = g(GT_A) - mean( g(GT_A) ) ;
VP = sum( sqr(PM) ) / (n() - 1) ;
VG = sum( sqr(GM) ) / (n() - 1) ;
COR = sum( PM * GM ) / sqrt( VP * VG ) ;
COR = COR / (n()-1) "
chr1:1 snp1 d/D . 1 PASS . VP=0.250627;VG=0.385739;COR=-0.193454
That is, this gives us the same estimate for the correlation (r=0.19) as R (note that the difference in the sign of the correlation is arbitrary here, simply reflecting that PLINK coded the minor/reference allele as 1 whereas PSEQ codes the alternate allele as 1, which in this dummy dataset happened to be the major allele). But, to reiterate, this is not the recommended way to implement association statistics in PSEQ...
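The multi-clause correlation expression can be replicated step-by-step in Python (the phe and dos vectors below are made-up toy data, not the simulated PLINK dataset):

```python
import math

# Step-by-step replication of the multi-clause correlation expression:
# PM/GM are mean-centered phenotype and dosage; VP/VG their variances;
# COR the Pearson correlation between them.
phe = [1, 1, 2, 2, 2, 1]      # stands in for p(phe1)
dos = [0, 1, 1, 2, 2, 0]      # stands in for g(GT_A)
n = len(phe)

pm = [p - sum(phe) / n for p in phe]       # PM = p(phe1) - mean(p(phe1))
gm = [g - sum(dos) / n for g in dos]       # GM = g(GT_A) - mean(g(GT_A))
vp = sum(x * x for x in pm) / (n - 1)      # VP
vg = sum(x * x for x in gm) / (n - 1)      # VG
cor = sum(p * g for p, g in zip(pm, gm)) / math.sqrt(vp * vg) / (n - 1)
print(round(cor, 4))                       # 0.8165
```

Dividing by (n-1) at the end turns the scaled cross-product into the sample covariance over sqrt(VP*VG), i.e. the Pearson correlation, exactly as in the two-step COR assignment above.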
Final syntax notes:
1) This table shows how null values are handled in the logical comparisons && and || (AND and OR), following the spirit of lazy evaluation in C/C++ (although the right-hand side is in fact always evaluated in the first two scenarios, partly because it might contain assignments that impact the rest of the expression):
false AND NULL --> FALSE
true  OR  NULL --> TRUE
true  AND NULL --> NULL
false OR  NULL --> NULL
NULL  AND (NULL/false/true) --> NULL
NULL  OR  (NULL/false/true) --> NULL
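This three-valued logic can be written out as a small Python sketch (illustrative only; the and3()/or3() helpers are hypothetical, with None standing in for NULL):

```python
# Sketch of the null-handling truth table, with None standing in for NULL
def and3(a, b):
    if a is False:
        return False   # false AND NULL --> FALSE
    if a is True:
        return b       # true AND x --> x (so true AND NULL --> NULL)
    return None        # NULL AND anything --> NULL

def or3(a, b):
    if a is True:
        return True    # true OR NULL --> TRUE
    if a is False:
        return b       # false OR x --> x (so false OR NULL --> NULL)
    return None        # NULL OR anything --> NULL

print(and3(False, None))  # False
print(or3(True, None))    # True
print(and3(True, None))   # None
print(or3(False, None))   # None
```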
This can be useful when testing the value of a field that is often not present: e.g. some fields, such as AB, are only defined if the site is polymorphic in the sample:
exclude=" set(AB) && AB > 0.75 "
2) The two return values of ifelse() must be of similar, or easily convertible, types:
ifelse( DB == 1 , DP , TYPE )
is not allowed if, for example, DP is an integer, and TYPE is a string.
3) PSEQ scans each statement for basic adherence to syntax (that parentheses are properly nested and matched, that functions have the correct number of arguments, etc.), although there will likely still be some edge cases that are not properly handled. Currently, if a logical, rather than syntactical, error introduces a null value, PSEQ will give a warning. It can sometimes be difficult to figure out why something did not work; in a future release we will add a verbose mode that prints each step of the evaluation of an expression, so that errors can be spotted more easily.