Attaching auxiliary data (meta-information and variant sets)

This page describes how to append external meta-information of various types to existing variant databases and how to create and use variant sets and super-sets.

Attaching additional variant meta-information

After variants have been imported into a project (from a VCF or PLINK fileset), it is possible to associate additional meta-information with those variants:

pseq /path/to/project attach-meta --file myinfo.dat --id CEU TSI --group myannot

This command will take annotation information from the file myinfo.dat and import it into the project, attaching it specifically to variants in the CEU or TSI samples; the set of imported attributes will be given a group name of myannot, which one can use subsequently to refer to that set of attributes.

The --id flag indicates which files in VARDB to apply these annotations to. These numbers, or file-tags, correspond to the numbers/tags given by a vardb-summary command.

The file myinfo.dat should have the following tab-delimited format. Initial header lines start with two # symbols (##) and list, separated by commas, the name of the field, the number of entries, the type (one of Integer, Float or String) and a description in quotes. Each subsequent line contains a variant (specified either by an ID or by a chromosomal position; note that currently only single base-pair positions are allowed), the name of the field, and the value of that field for that variant:

   ##assoc,1,Integer,"Associated in GWAS?"
   ##gwas,1,String,"GWAS platform"
   ##expr,1,Float,"dummy QT"
   chr1:1015614    assoc   1
   rs6668667       assoc   0
   rs6668667       gwas    affy6
   rs61733845      gwas    ilmn
   chr1:1110240    expr    2.22
   rs6603781       expr    1.23
   chr1:1153827    assoc   0
   chr1:1153827    gwas    affy5
   rs12751100      assoc   1
   rs12751100      gwas    affy6
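A file in this format can also be generated from a script. The following Python sketch shows one way to do this; the function name format_meta_file and the tuple layout of its arguments are illustrative assumptions and are not part of pseq. The check that every field has a header line is a sanity check of this sketch, not a documented pseq requirement.

```python
def format_meta_file(fields, records):
    """Return the text of an attach-meta input file.

    fields:  list of (name, count, type, description) tuples
    records: list of (variant, field_name, value) tuples
    (Both layouts are hypothetical, chosen for this sketch.)
    """
    allowed_types = {"Integer", "Float", "String"}
    declared = set()
    lines = []
    for name, count, ftype, desc in fields:
        if ftype not in allowed_types:
            raise ValueError(f"unknown type for {name}: {ftype}")
        declared.add(name)
        # Header line: ##name,count,type,"description"
        lines.append(f'##{name},{count},{ftype},"{desc}"')
    for variant, name, value in records:
        if name not in declared:
            # Sanity check of this sketch only
            raise ValueError(f"field {name} has no header line")
        # Data line: variant <TAB> field <TAB> value
        lines.append(f"{variant}\t{name}\t{value}")
    return "\n".join(lines) + "\n"


example = format_meta_file(
    [("assoc", 1, "Integer", "Associated in GWAS?"),
     ("expr", 1, "Float", "dummy QT")],
    [("chr1:1015614", "assoc", 1),
     ("rs6603781", "expr", 1.23)],
)
print(example, end="")
```

The returned text can be written to myinfo.dat and passed to attach-meta with the --file option shown above.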

Meta-information attached in this manner is stored in the database, but is not automatically extracted (for performance reasons). It can be retrieved subsequently by adding the meta.attach option to the mask. For example, if there are two fields called expr and gwas:

--mask meta.attach=gwas,expr

To attach all user-defined meta-fields, add:

--mask meta.attach=ALL

To create a new VARDB in which these are folded into the core meta-fields for each variant, one can use the write-vardb command:

pseq /path/to/my/project write-vardb --new-vardb vardb2 --new-project /my/new/project --mask meta.attach=ALL

When subsequently using /my/new/project, the imported meta-fields will be automatically available, so the meta.attach mask option is no longer needed. For example:

pseq /my/new/project v-view --vmeta --mask include=" gwas == 'affy5' "

To clear all attached meta-information from a VARDB, use the command

pseq /path/to/my/project delete-meta --group myannot

Specifying variant sets

Variant sets provide an easy way to specify groups of variants by a single identifier that can be used in mask statements. Variant sets are indexed, meaning that they can be retrieved quickly from a large file. For example, one could create a variant set representing all variants with a certain FILTER attribute, calling it filter1. They could subsequently be retrieved in a mask, without the need to scan through all variants again:

--mask var=filter1

Note: currently a variant set only defines sites rather than variants per se. That is, there is no way to distinguish between two different alternate alleles at a given locus. In future releases, this assumption will be relaxed.

To create a variant set, use the var-set command, which can take one of two forms: either loading the variants specified from a file:

pseq proj1 var-set --file myfile.txt

or creating the set on-the-fly, based on the existing variant meta-information:

pseq proj1 var-set --group filter1 --mask (options...)

The first format (reading the list of variants from a file) assumes the file contains either 2 or 3 tab-delimited fields on each line:

 set-name   description

Any line with only two tab-delimited fields is taken to define a new variant set and its associated free-text description (which can include spaces, but not tabs). (Note that sets do not need to be defined in advance.) Otherwise, the line is expected to start with the keyword VAR or REG:

 VAR  set-name   variant-position

or, alternatively,

 REG  set-name   variant-interval

The REG keyword indicates that a variant interval will be specified on that line. In this case, all variants from the VARDB that fall within this interval will be added to the variant-set (i.e. the behaviour of this option is not modified by any --mask arguments). The VAR keyword will instead look only for a single variant that exactly matches the specified position (i.e. an indel, if the variant-position implies more than one base).
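As an illustration of the file layout described above, the following Python sketch assembles the two-field and three-field lines. The function name format_varset_file and its argument layout are assumptions made for this example, and the position and interval strings simply reuse the chr:pos and chr:start..end notation seen elsewhere on this page; whether the file accepts other notations is not covered here.

```python
def format_varset_file(descriptions, members):
    """Return the text of a var-set input file.

    descriptions: dict mapping set-name -> free-text description
    members:      list of (keyword, set_name, location) tuples,
                  where keyword is "VAR" or "REG"
    (Both layouts are hypothetical, chosen for this sketch.)
    """
    lines = []
    for set_name, desc in descriptions.items():
        if "\t" in desc:
            raise ValueError("descriptions may contain spaces but not tabs")
        # Two-field line: defines a set and its description
        lines.append(f"{set_name}\t{desc}")
    for keyword, set_name, location in members:
        if keyword not in ("VAR", "REG"):
            raise ValueError(f"unknown keyword: {keyword}")
        # Three-field line: adds a position (VAR) or interval (REG) to a set
        lines.append(f"{keyword}\t{set_name}\t{location}")
    return "\n".join(lines) + "\n"


varset_text = format_varset_file(
    {"s1": "critical region on chr7"},
    [("REG", "s1", "chr7:50000000..52000000"),
     ("VAR", "s2", "chr1:1015614")],
)
print(varset_text, end="")
```

Here the set s2 appears only on a VAR line, which relies on the note above that sets do not need to be defined in advance.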

Alternatively, a variant-set can be generated on-the-fly, in one of two ways: i) including as members of a single set all variants that pass a given mask, or ii) creating sets that correspond to the values of a string or integer meta-tag.

To illustrate the first usage:

pseq proj1 var-set --group filter1 --mask filter=VALUE...

For example, this command might be used to specify all passing, genic variants in some "critical region":

pseq proj1 var-set --group s1 --mask reg=chr7:50000000..52000000 any.filter.ex loc.req=refseq

that could be referred to subsequently by the mask:

pseq proj1 v-view --mask var=s1

The second usage also requires the --name argument to be given, e.g.:

pseq proj1 var-set --group GENE --name mygenes

Note that this command can be combined with a mask option (unlike the option to read a list of variants from a file). Here, if a variant tag called GENE existed (i.e. was in the INFO field of the VCF), this command would create multiple variant-sets, each corresponding to a distinct value of GENE. If a given variant had GENE=A,B then that variant would be added to both variant-sets, which will be automatically named GENE[A] and GENE[B]. In addition, a super-set (meaning a set of sets, described below) will be created, named mygenes. One could then, for example, apply gene-based tests to each variant set in this super-set with a command in the form:

pseq proj1 assoc --mask varset.group=mygenes --phenotype phe1
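The naming rule for tag-based sets can be modelled in a few lines of Python. This is only a conceptual sketch of the behaviour described above, not how pseq implements it; the function name sets_from_tag and the variant labels var1 and var2 are hypothetical.

```python
from collections import defaultdict


def sets_from_tag(tag, tag_values):
    """Model of tag-based set creation.

    tag_values maps a variant label to its (possibly comma-separated)
    value for the tag; each distinct value yields a set named TAG[value].
    """
    sets = defaultdict(list)
    for variant, value in tag_values.items():
        for item in value.split(","):
            # A variant with GENE=A,B joins both GENE[A] and GENE[B]
            sets[f"{tag}[{item}]"].append(variant)
    return dict(sets)


gene_sets = sets_from_tag("GENE", {"var1": "A,B", "var2": "B"})
# The super-set (here called mygenes) is the collection of the set names
mygenes = sorted(gene_sets)
print(gene_sets)
print(mygenes)
```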

Variant super-sets

A variant super-set is a collection of variant sets. These will be automatically created when creating sets from a variant tag, as described above. Alternatively, they can be specified separately: for example, given these four illustrative variant sets:

pseq proj1 var-set --group chr1 --mask reg=chr1
pseq proj1 var-set --group chr2 --mask reg=chr2
pseq proj1 var-set --group chr3 --mask reg=chr3
pseq proj1 var-set --group chr4 --mask reg=chr4

then one could create the following super-sets:

pseq proj1 var-superset --group odd --members chr1 chr3
pseq proj1 var-superset --group even --members chr2 chr4

Subsequently, the mask:

pseq proj1 v-view --mask varset=odd

would list all chromosome 1 and 3 variants. The mask:

pseq proj1 g-stats --mask varset.group=even

would group all variants into two groups (chr2 and chr4 variants) and calculate the group-based statistics (from the g-stats command) on each of these two sets.
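The difference between the two mask forms can be pictured with a small Python model. This is a conceptual sketch only, with made-up variant positions and hypothetical function names; it does not reflect pseq internals.

```python
# Toy data: variant sets (set name -> member sites) and super-sets
varsets = {
    "chr1": ["chr1:1000", "chr1:2000"],
    "chr2": ["chr2:1500"],
    "chr3": ["chr3:3000"],
    "chr4": ["chr4:4000", "chr4:5000"],
}
supersets = {"odd": ["chr1", "chr3"], "even": ["chr2", "chr4"]}


def select_superset(name):
    """Model of varset=NAME: one flat list of all member variants."""
    return [v for s in supersets[name] for v in varsets[s]]


def group_superset(name):
    """Model of varset.group=NAME: variants kept apart, one group per set."""
    return {s: list(varsets[s]) for s in supersets[name]}


flat = select_superset("odd")
grouped = group_superset("even")
print(flat)
print(grouped)
```

In this model, a per-variant command such as v-view would iterate over the flat list, whereas a group-based command such as g-stats would be applied once per key of the grouped dictionary.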

Finally, sets can be assigned to super-sets by reading in a file:

pseq proj1 var-superset --file myfile.txt

where myfile.txt is assumed to be a simple list of tab-delimited set/super-set pairs: e.g.

 chr1  odd
 chr2  even
 chr3  odd
 chr4  even
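A pairs file like this can be produced from a mapping of super-sets to their members. The Python sketch below is one way to do so; the function name format_superset_file and the dictionary layout are assumptions made for this example.

```python
def format_superset_file(supersets):
    """Return tab-delimited set/super-set pairs, one pair per line.

    supersets: dict mapping super-set name -> list of member set names
    (a hypothetical layout, chosen for this sketch).
    """
    lines = []
    for superset, members in supersets.items():
        for set_name in members:
            # Each line: set name <TAB> super-set name
            lines.append(f"{set_name}\t{superset}")
    return "\n".join(lines) + "\n"


superset_text = format_superset_file(
    {"odd": ["chr1", "chr3"], "even": ["chr2", "chr4"]}
)
print(superset_text, end="")
```

The pairs come out grouped by super-set rather than interleaved as in the listing above; nothing on this page suggests that the line order matters.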

Note: The var-summary command will list information on the current variant sets and super-sets in the variant database.

Removing sets and super-sets

To drop a single set from a VARDB:

pseq proj1 var-drop-set --group chr1

To drop all sets (and super-sets):

pseq proj1 var-drop-all-sets

To drop a super-set from a VARDB (but still leave the sets intact):

pseq proj1 var-drop-superset --group odd

To drop all super-sets:

pseq proj1 var-drop-all-supersets

Note: these commands do not actually remove any variant or genotype information from the database, only the specified groupings of those variants into sets and super-sets.