Tutorial: working with 1000 Genomes Pilot 3 VCFs

By way of introducing some of the features and approaches of PLINK/SEQ, this page provides a tutorial that uses PSEQ and the R interface to PLINK/SEQ to work with next-generation sequence data from the 1000 Genomes Project. We will use two datasets of SNP genotype calls from the pilot 3 (exon study). As described in the manuscript, these two samples comprise data from a very small proportion of the complete 1000 Genomes pilot study: 90 CEU and 66 TSI individuals for over 8000 exons at high coverage.

If you haven't already, first take a look at this page, that gives an overview of PLINK/SEQ architecture, before embarking on this tutorial.

Downloading the data for the tutorial

We assume you have a working copy of PLINK/SEQ already installed. The data for this tutorial are in two archives you need to download:

The genotype data are in (compressed) VCF format, which is the primary input format used by PLINK/Seq. They should be identical to those available on the 1000 Genomes FTP site: namely, CEU.exon.2010_09.genotypes.vcf.gz and TSI.exon.2010_09.genotypes.vcf.gz.

Extract the contents from these archives:

tar -xzvf pseq-tut1.tar.gz tar -xzvf resources-hg18-tut1.tar.gz

This should create two folders in your current folder: data and hg18, with the following contents:

data:
total 1.1M
 625K CEU.exon.2010_09.genotypes.vcf.gz
   72 exlist.txt
  273 lookup.txt
   92 meta.meta
 1.6K pop.phe
 454K TSI.exon.2010_09.genotypes.vcf.gz

hg18:
total 2.1G
 216M locdb
 1.1G refdb
 831M seqdb

Overview

The following topics are covered in this tutorial:

  • Quick looks at the VCFs: Direct inspection of the VCFs; Using PSEQ to view variants and genotypes; Outputting (filtered) VCFs; Summary statistics on VCFs
  • Setting up a PLINK/Seq project: Project specification; Importing VCFs; Attaching phenotypic information; Reference databases
  • Viewing and filtering the data in different ways: Basic views and filters; Individual genotype data; Gene-centric queries; Filtering on meta-information; Outputting variant and genotype data in different formats
  • Summary statistics: Whole-sample variant statistics; Per-individual summary statistics; Per-gene summary statistics
  • Annotating variants : Functional annotation of coding status; Novelty (presence in dbSNP); Human Gene Mutation Database; Look-ups of variant lists
  • Examining genotype-phenotype relationships: Frequencies and HWE; Global distributions of rare alleles; Finding variants unique to particular groups; Single locus association; Gene-based rare-variant association tests;
  • Using R to work with this project: Attaching the project; Visualising variant meta-information; Defining and applying arbitrary functions

Quick looks at the VCFs

You can view the compressed VCFs with a Unix command such as zless:

zless data/CEU.exon.2010_09.genotypes.vcf.gz

For example: after the headers, we see information for a variant with ID rs111751804 on chromosome 1 at 1105366 (hg18 co-ordinates): (lines wrapped in output below for clarity)

#CHROM  POS     ID              REF     ALT     QUAL    FILTER     (line wrapped)
1       1105366 rs111751804     T       C       .       PASS       (line wrapped)

INFO                                           FORMAT  NA06984 NA06985 NA06986 ...
AA=T;AC=4;AN=114;DP=3251;GP=1:1115503;BN=132   GT:DP   ./.:0   ./.:0   0/0:107 ...

This line shows that the reference allele at this position is T and alternate allele C. There is no associated QUAL score for this variant, and it PASSed filters (note: all variants in these two particular datasets passed QC). We then see a string of tags (meta-information); the fields are described in the header of the VCF:

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele, ...">
##INFO=<ID=AC,Number=1,Type=Integer,Description="Number of alternate alleles ...">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total allele count ...">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
...

Looking at the genotype information for this variant (with format GT:DP, i.e. genotype and read-depth), we see the first two individuals have a missing genotype due to no coverage: ./.:0. The third individual is homozygous for the reference allele, based on 107 reads: 0/0:107.

Using PSEQ to view variants and genotypes in a single VCF

As described here, one can use PSEQ either with a single VCF, or with a PLINK/Seq project. To view the variants in a single VCF, use the v-view command (the second argument given to PSEQ), with the VCF filename (rather than a project name) as the first argument:

pseq data/CEU.exon.2010_09.genotypes.vcf.gz v-view
chr1:1105366	rs111751804	T/C	.	1	PASS
chr1:1105411	rs114390380	G/A	.	1	PASS
chr1:1108138	rs61733845	C/T	.	1	PASS
chr1:1110240	rs116321663	T/A	.	1	PASS
chr1:1110294	rs1320571	G/A	.	1	PASS
...

Looking at the first row, we see the variant with ID rs111751804 at position chr1:1105366. The third column lists the reference allele (T) and alternate allele(s) (in this case, C). The fourth column indicates that this row represents a consensus variant and can be ignored for now. The fifth column indicates the number of samples in which this variant was observed (necessarily 1 in this case). The last column displays the FILTER field of the VCF. In the cleaned VCFs in this tutorial, all variants PASS.

We can also output the associated meta-information with the --vmeta argument; here we will restrict output to only the first two variants in the file, by adding the mask option limit=2:

pseq data/CEU.exon.2010_09.genotypes.vcf.gz v-view --vmeta --mask limit=2
chr1:1105366	rs111751804	T/C	.	1	PASS	.	AA=T;AC=4;AN=114;DP=3251;GP=1:1115503;BN=132
chr1:1105411	rs114390380	G/A	.	1	PASS	.	AA=G;AC=1;AN=106;DP=2676;GP=1:1115548;BN=132

We can change the format in which variant meta-information is output, for example, by using the meta-matrix command, designed to generate rectangular/tabular files:

pseq data/CEU.exon.2010_09.genotypes.vcf.gz meta-matrix --mask limit=2
CHR POS     ID          AA AC AN  DP   HM2 HM3 GP        BN  NR AR OR MP PASS
1   1105366 rs111751804 T  4  114 3251 0   0   1:1115503 132 0  NA NA 0  1
1   1105411 rs114390380 G  1  106 2676 0   0   1:1115548 132 0  NA NA 0  1

Note that when using the meta-matrix command, all tags in the file/project are listed for every variant (e.g. the OR does not appear for either of these two variants, and so is listed NA). The presence or absence of flags such as HM2 is indicated by 0 and 1.

We can view individual-level genotype data (and meta-information) in a more human-readable format: pseq data/CEU.exon.2010_09.genotypes.vcf.gz v-view --geno --gmeta --mask limit=2

chr1:1105366    rs111751804   T/C    .    1    PASS
NA06984   .    ./.    [DP=0]
NA06985   .    ./.    [DP=0]
NA06986   .    T/T    [DP=107]
NA06989   .    ./.    [DP=0]
NA06994   .    ./.    [DP=0]
NA07000   .    T/T    [DP=25]
NA07037   .    T/T    [DP=30]
NA07048   .    T/T    [DP=31]
NA07051   .    T/T    [DP=57]
NA07346   .    T/T    [DP=69]
NA07347   .    T/T    [DP=53]
NA07357   .    T/T    [DP=225]
...

Other ways to view the variant and genotype data are descibed below and in the main PSEQ documentation.

Outputting filtered VCFs

It is possible to create new VCFs with the write-vcf command. By itself, this would simply reproduce the original input file. It can be combined with other options to limit the output and create simplified VCFs (removing unwanted tags, for example). Here we request only HM2 and HM3 tags to be shown; we also request genotypes only for two individuals (NA11920 and NA12761), limited to chromosome 7:

pseq data/CEU.exon.2010_09.genotypes.vcf.gz write-vcf --show HM2 HM3 --mask indiv=NA11920,NA12761 reg=chr7
##fileformat=VCFv4.0
##source=pseq
##INFO=<ID=HM2,Number=0,Type=Flag,Description="HapMap2 membership">
##INFO=<ID=HM3,Number=0,Type=Flag,Description="HapMap3 membership">
##FILTER=<ID=PASS,Description="Passed variant FILTERs">
#CHROM  POS       ID          REF ALT QUAL  FILTER  INFO     FORMAT NA11920 NA12761
chr7    2257048   rs1799832   C   T   .     PASS    HM3      GT     0/0     0/0
chr7    5993056   rs1805324   C   T   .     PASS    HM2;HM3  GT     0/0     0/0
chr7    5993133   rs1805318   T   A   .     PASS    HM2;HM3  GT     0/0     0/0
chr7    5993234   rs63750668  C   A   .     PASS    .        GT     0/0     0/0
chr7    5993301   rs2228006   T   C   .     PASS    HM2;HM3  GT     ./.     ./.
chr7    5993468   rs1805323   G   T   .     PASS    .        GT     0/0     ./.
chr7    5993514   rs1805321   G   A   .     PASS    HM2      GT     ./.     ./.
chr7    6001737   rs116788608 T   C   .     PASS    .        GT     0/0     0/0
chr7    6003506   rs1805319   G   C   .     PASS    HM2      GT     1/1     1/1
chr7    16782525  rs2290837   T   C   .     PASS    HM2;HM3  GT     0/1     0/1
...

Note: the full mask syntax is described here -- but note that not all mask options are available in single VCF mode (as opposed to working with a project, see below).

Summary statistics on VCFs

We will explore the details of the summary statistic commands (v-stats, i-stats and g-stats) later, but here will illustrate two simple commands that can be performed on a single VCF. The v-stats command yields some basic summary information on things such as call rate, allele frequency and transition/transversion ratio:

pseq data/CEU.exon.2010_09.genotypes.vcf.gz v-stats
 NVAR         3489
 RATE         0.918095
 MAC          15.7263
 MAF          0.101213
 SING         1083
 MONO         49
 TITV         3.46735
 TITV_S       3.53138
 DP           6992.37
 QUAL         0
 PASS         1
 FILTER|PASS  1
 PASS_S       1

Next we calculate per-individual summary statistics on the whole VCF, with the i-stats command, but output only a subset using the gcol utility:

pseq data/CEU.exon.2010_09.genotypes.vcf.gz i-stats | gcol ID NVAR SING TITV
ID	NVAR	SING	TITV
NA06984	3162	13	3.7193
NA06985	3144	11	3.43243
NA06986	3437	22	3.56391
NA06989	3130	11	3.2521
NA06994	3002	4	3.5625
NA07000	3388	11	3.51145
NA07037	3374	5	3.63359
NA07048	3373	15	3.27632
NA07051	3451	13	3.07843
NA07346	3419	4	3.62791
NA07347	3417	16	3.53788
...

This gives the number of non-missing calls per individual (NOBS), the number of singleton non-reference alleles they carry (SING) and the individual's Ti/Tv based only on variants for which they carry a non-reference allele. (The gcol program is a simple utility to extract columns from a rectangular whitespace-delimited text file based on names in the header; it is available for download from here).

A more detailed description of the various summary statistic commands available in PSEQ is given here.

Setting up a project

Generally, it is preferable to build a project rather than working with single VCFs as we have done above. Setting up a project allows data from multiple sources to be combined easily; it provides a wider range of filtering and grouping options via masks; it allows data to be accessed via different PLINK/Seq front-ends (such as the R interface, or the exome-browser); it will process files more quickly than parsing the VCFs for every analysis (as the data are stored in a compressed, indexed binary format inside the VARDB). Therefore, in most non-trivial scenarios, it is advisable to create a project first. This simple process is described below.

Creating a new project and linking resource databases

pseq proj new-project --metameta data/meta.meta --resources hg18

This creates a text file called proj in the current directory that henceforth will represent the project. This file is simply a list of the other files that are considered part of the project:

/full/path/to/current/proj_out/                               OUTPUT
/full/path/to/current/hg18/                                   RESOURCES
/full/path/to/current/data/meta.meta                          METAMETA
/full/path/to/current/proj_out/vardb                          VARDB
/full/path/to/current/proj_out/inddb                          INDDB
/full/path/to/current/hg18/locdb                              LOCDB
/full/path/to/current/hg18/refdb                              REFDB
/full/path/to/current/hg18/seqdb                              SEQDB

--metameta and --resources are configuration options; both are optional. --metameta points to a file with information about the meta-information tags that will be used in this project. Specifically, data/meta.meta contains information to indicate which variant tags in the VCFs should be considered as unvarying population-wide attributes of the variant (denoted as STATIC) versus sample-specific properties. This information is used when merging information from multiple sources.

The resources archive downloaded as part of this tutorial contains hg18 versions of the core PLINK/Seq reference databases: LOCDB (containing RefSeq gene transcripts), REFDB (containing dbSNPv130) and SEQDB (containing the hg18 sequence itself). By adding the --resources argument to the above command, and pointing to the hg18 folder, it will automatically attach these three databases.

Importing VCF data into a project

To load VCF files into the project, use the command:

pseq proj load-vcf --vcf data/*.vcf.gz
loading : CEU.exon.2010_09.genotypes.vcf.gz ( 90 individuals )
loading : TSI.exon.2010_09.genotypes.vcf.gz ( 66 individuals )
CEU.exon.2010_09.genotypes.vcf.gz : inserted 3489 variants
TSI.exon.2010_09.genotypes.vcf.gz : inserted 3281 variants

Each file has a number (reflecting the order in which it was loaded into the VARDB) and also a tag. The tag can be changed to provide a convenient identifier for each file in a project, using the tag-file command: pseq proj tag-file --id 1 --name CEU pseq proj tag-file --id 2 --name TSI pseq proj var-summary

---Variant DB summary---

4449 unique variants
File tag : CEU (3489 variants, 90 individuals)
File tag : TSI (3281 variants, 66 individuals)
var-summary displays the number of individuals and variants in a VARDB. These values correspond to those in the 1000 Genomes manuscript, so it appears the data have been successfully imported.

We can now perform all the commands as above, but referring to the project instead of a single VCF file:

pseq proj v-view
chr1:1105366	rs111751804	T/C	.	2	PASS PASS
chr1:1105411	rs114390380	G/A	.	2	PASS PASS
chr1:1108138	rs61733845	C/T	.	2	PASS PASS
chr1:1110240	rs116321663	T/A	.	2	PASS PASS
chr1:1110294	rs1320571	G/A	.	2	PASS PASS
chr1:3537996	rs2760321	T/C	.	2	PASS PASS
chr1:3538692	rs2760320	G/C	.	2	PASS PASS
chr1:3538715	rs114376964	T/C	.	1	PASS
chr1:3541597	rs116230480	C/T	.	1	PASS
chr1:3541599	rs115982402	C/T	.	1	PASS
...

The fifth column indicates that the variant was seen in two files in some cases. The final column lists the gene annotation of this variant, as well as the FILTER values for each sample. You can obtain sample-specific meta-information by adding the --samples flag along with --vmeta:

pseq proj v-view --samples --vmeta
chr1:1105366	rs111751804	T/C	.	2	PASS PASS	AA=T;BN=132;GP=1:1115503	.
chr1:1105366	rs111751804	T/C	CEU	PASS	AA=T;AC=4;AN=114;DP=3251;BN=132;GP=1:1115503
chr1:1105366	rs111751804	T/C	TSI	PASS	AA=T;AC=3;AN=130;DP=4820;BN=132;GP=1:1115503
...

Note that --vmeta prints three lines for the same variant. The first row displays the consensus sample, the population/whole-project representation of the variant. Only tags specified as STATIC in data/meta.meta are displayed in the consensus representation; other meta information tags are assumed to be sample specific. The next two rows show sample specific information; the fourth indicates the sample via the file tags that were set earlier.

Loading individual/phenotypic information

An arbitrary number of phenotypes can be imported into a project. Phenotypes can be integer or floating-point quantitative traits; additionally, string values can be loaded (as factors). The format for phenotype files is described here:

head data/pop.phe
##phe1,Integer,-9,"1/2 = TSI/CEU"
#ID       phe1
NA20504   1
NA20506   1
NA20508   1
NA20509   1
NA20510   1
NA20511   1
NA20512   1
NA20513   1

In this example, we have a single integer phenotype named phe1. As the header suggests, this phenotype actually indicates population membership in our example dataset, with values of 1 and 2 representing TSI and CEU respectively. Later in this tutorial we will perform dummy case/control analyses using this phenotype (i.e. contrasting TSI and CEU samples). As is standard in PED files, a value of 2 indicates a case; a value of 1 indicates a control; a value of -9 means missing.

To import phenotype information, use the load-pheno command:

pseq proj load-pheno --file data/pop.phe

Individual phenotypic data can subsequently be viewed with the i-view command:

pseq proj i-view
#phe1 (Integer) "1/2 = TSI/CEU"
#PHE     .
#STRATA  .
#ID      FID IID MISS  SEX  PAT  MAT  META
NA06984  .   .   0     0    0    0    phe1=2
NA06985  .   .   0     0    0    0    phe1=2
NA06986  .   .   0     0    0    0    phe1=2
NA06989  .   .   0     0    0    0    phe1=2
NA06994  .   .   0     0    0    0    phe1=2
NA07000  .   .   0     0    0    0    phe1=2
...

Multiple phenotypes can be attached to individuals, and will be shown in the META field. This database also tracks family-based ID schemes (FID and IID being family and individual ID); a flag to indicate missing data (MISS), sex (SEX), paternal and maternal IDs (PAT and MAT).

The PHE and STRATA fields in the header will indicate if a particular phenotype or stratifying variable has been set or created (via the --pheno and --strata command lines, see the PSEQ documentation for details).

The project's reference databases

The reference databases in this tutorial represent a cut-down (and already out-of-date) selection of some of the types of data that might routinely be of use when analysing genetic variation datasets: information on genes (RefSeq transcripts), variation (dbSNP) and sequence (hg18). More comprehensive reference datasets are (or soon will be) available from the primary resources/download page of this site. See the main PSEQ documentation for information on how to create your own reference databases.

Here we confirm the contents of the reference databases via a series of summary commands: first the LOCDB:

pseq proj loc-summary
---Locus DB summary---

Group : refseq (30072 entries) n/a

Similarly, we can request a description of the REFDB (that contains 18M variants from the v130 release of dbSNP):

pseq proj ref-summary
---Reference DB summary---

Group : dbSNP130 (18833531 entries) : dbSNP130 (default name)

and SEQDB:

pseq proj seq-summary
---Sequence DB summary---

chr1:1..247249719	MB=247
chr2:1..242951149	MB=242
chr3:1..199501827	MB=199
chr4:1..191273063	MB=191
chr5:1..180857866	MB=180
chr6:1..170899992	MB=170
chr7:1..158821424	MB=158
chr8:1..146274826	MB=146
chr9:1..140273252	MB=140
chr10:1..135374737	MB=135
chr11:1..134452384	MB=134
chr12:1..132349534	MB=132
chr13:1..114142980	MB=114
chr14:1..106368585	MB=106
chr15:1..100338915	MB=100
chr16:1..88827254	MB=88
chr17:1..78774742	MB=78
chr18:1..76117153	MB=76
chr19:1..63811651	MB=63
chr20:1..62435964	MB=62
chr21:1..46944323	MB=46
chr22:1..49691432	MB=49
chrX:1..154913754	MB=154
chrY:1..57772954	MB=57
chrM:1..16571	MB=0

SEQDB meta-information: BUILD = hg18
SEQDB meta-information: DESC = from-UCSC-6-aug-2010
SEQDB meta-information: IUPAC = 0
SEQDB meta-information: NAME = hg18
SEQDB meta-information: REPEATMODE = lower

 

We now have a fully populated PLINK/SEQ project and we can begin analysis.

Viewing the data, with filters (masks)

Basic views and filters

As described above, the v-view command can be used for viewing variant and genotype data, which can be modified by other options such as --samples, --vmeta, --geno and --gmeta, as described here. To obtain human-readable versions of the meta-information, add the --verbose argument:

pseq proj v-view --vmeta --samples --verbose --mask limit=1
chr1:1105366	rs111751804	T/C	.	2	PASS PASS
	AA = T
	BN = 132
	GP = 1:1115503


chr1:1105366	rs111751804	T/C	CEU	PASS
	AA = T
	AC = 4
	AN = 114
	DP = 3251
	BN = 132
	GP = 1:1115503

chr1:1105366	rs111751804	T/C	TSI	PASS
	AA = T
	AC = 3
	AN = 130
	DP = 4820
	BN = 132
	GP = 1:1115503

Individual-level genotype data

To view phenotype information along-side genotype calls:

pseq proj v-view --geno --phenotype phe1 --mask limit=1
chr1:1105366	rs111751804	T/C	.	2	PASS PASS
NA06984 CEU     CASE     ./.
NA06985 CEU     CASE     ./.
NA06986 CEU     CASE     T/T
NA06989 CEU     CASE     ./.
NA06994 CEU     CASE     ./.
NA07000 CEU     CASE     T/T
NA07037 CEU     CASE     T/T
...
NA20504 TSI     CONTROL  T/T
NA20506 TSI     CONTROL  T/T
NA20508 TSI     CONTROL  T/T
NA20510 TSI     CONTROL  T/T
NA20511 TSI     CONTROL  T/T
...

To also add genotype-specific meta-information, add the --gmeta flag. The specific fields that are shown or hidden can be modified with the --show and --hide commands.

There are a number of variations on the basic v-view command that are designed to look at individual level genotype data (the examples below use --mask to select two particular sites for illustration.) The rv-view only lists the rarer genotypes per site:

pseq proj rv-view --mask reg=chr13:27794979,chr13:27795044
chr13:27794979:rs56314249   .   C/T   2   PASS PASS
 NA07051   CEU  C/T
 NA12717   CEU  C/T
 NA12874   CEU  C/T
 NA20506   TSI  C/T
 NA20538   TSI  C/T
 NA20773   TSI  C/T
 NA20828   TSI  C/T
chr13:27795044:rs41291686   .   C/T   1   PASS
 NA12155   CEU  C/T

This command can be combined with --gmeta and --phenotype too. The mv-view lists multiple variants together (output is similar to the g-view command, but where the variants are not necessarily existing within a predefined group):

pseq proj mv-view --mask reg=chr13:27794979,chr13:27795044
 NAME=[]     N(V)=2	N(I)=156
 V1	     chr13:27794979	
 V2	     chr13:27795044		

 I	      V1	V2
 NA06984      C/C	C/C
 NA06985      C/C	C/C
 NA06986      C/C	C/C
 NA06989      C/C	C/C
 NA06994      C/C	C/C
 NA07000      C/C	C/C
 NA07037      C/C	C/C
 NA07048      C/C	C/C
 NA07051      C/T	C/C
 NA07346      C/C	C/C
...

Finally, mrv-view only lists the genotypes that contain a minor allele in at least one of the set of multiple variants, here combined with the --phenotype command:

pseq proj mrv-view --phenotype phe1 --mask reg=chr13:27794979,chr13:27795044
NAME=[]	N(V)=2	N(I)=156
V1	chr13:27794979	
V2	chr13:27795044	

I	PHE	V1	V2
NA07051	CASE	C/T	C/C
NA12155	CASE	C/C	C/T
NA12717	CASE	C/T	C/C
NA12874	CASE	C/T	C/C
NA20506	CONTROL	C/T	./.
NA20538	CONTROL	C/T	./.
NA20773	CONTROL	C/T	./.
NA20828	CONTROL	C/T	./.

Gene-centric views of genotype data

With the appropriate information in an attached LOCDB, we can filter (or group) variants based on their location in specific genes. The command g-view provides a summary of variants by gene. Assuming a group called refseq has been imported into the LOCDB (it has already, in the case of the LOCDB downloaded with this tutorial):

pseq proj g-view --mask loc.group=refseq
NAME=[NM_000021]	N(V)=1	N(I)=156
V1	chr14:72742931	

NAME=[NM_000043]	N(V)=4	N(I)=156
V1	chr10:90752918	
V2	chr10:90757462	
V3	chr10:90758660	
V4	chr10:90761809	

NAME=[NM_000055]	N(V)=7	N(I)=156
V1	chr3:166973974	
V2	chr3:166973982	
V3	chr3:167030263	
V4	chr3:167030667	
V5	chr3:167030704	
V6	chr3:167030881	
V7	chr3:167031223	
...

The loc.group mask element indicates that we will iterate over sets or groups of variants, rather than single variants. Here, the sets are defined by the members of the refseq group in the LOCDB, i.e. corresponding to genes. The loc.group option is heavily used in gene-based association tests. Note that the order of iteration is defined by the alpha-numeric value of the gene/set name, rather than genomic location.

LOCDBs can optionally contain tables of gene-name aliases: for example, containing the RefSeq transcript IDs, the CCDS IDs as well as the gene symbol. One convenience function is the gene mask element, that matches genes (one or more RefSeq transcripts) based on the gene symbol in (by default) a group called refseq. pseq proj g-view --geno --mask gene=GRIK3

NAME=[NM_000831]   N(V)=3   N(I)=156
V1 chr1:37098064
V2 chr1:37098185
V3 chr1:37110546

I       V1  V2  V3
NA06984 C/C C/C G/G
NA06985 A/C C/C G/G
NA06986 A/A C/C G/G
NA06989 A/C C/C G/G
NA06994 A/C C/C G/G
NA07000 A/C C/C G/A
NA07037 A/A C/C G/G
NA07048 A/A C/C G/G
NA07051 A/A C/C G/G
NA07346 A/A C/C G/G
NA07347 A/A C/C G/G
NA07357 A/A C/C G/G
NA10847 ./. ./. G/G

The above command is identical to the following, explicitly setting the loc.group and that only a subset of sets from this group should be considered, via loc.subset:

pseq proj g-view --geno --mask loc.group=refseq loc.subset=refseq,NM_000831

To determine which individuals carry rare non-reference variants in a given gene, one can use something like the following: in this example, the mask specifies the filter on variants with between 1 and 5 copies of the minor allele (mac):

pseq proj g-view --phenotype phe1 --mask gene=GRIK3 mac=1-5 --rarelist
NAME=[NM_000831]	N(V)=2	N(I)=156
I	NM_000831	NA07000	CASE	1		chr1:37110546;A/G
I	NM_000831	NA12234	CASE	1		chr1:37110546;A/G
I	NM_000831	NA12342	CASE	1		chr1:37110546;A/G
I	NM_000831	NA12842	CASE	1		chr1:37110546;A/G
I	NM_000831	NA12890	CASE	1		chr1:37098185;C/T
I	NM_000831	NA20538	CONTROL	1		chr1:37098185;C/T
I	NM_000831	NA20808	CONTROL	1		chr1:37098185;C/T
V	NM_000831	chr1:37098185	C	3	NA12890;C/T,NA20538;C/T,NA20808;C/T
V	NM_000831	chr1:37110546	G	4	NA07000;A/G,NA12234;A/G,NA12342;A/G,NA12842;A/G

The rows starting I indicate an individual who carries a rare allele; the fifth column is the number of variants they carry; the sixth column lists the variant(s) and the genotype(s) for that individual. The rows starting V convey similar information but for a variant, listing the (number of) individuals carrying at least one minor allele; the reference allele is also listed in each line of variant output.

Using masks to filter variants based on meta-information and other properties

Here we illustrate some more complex filters that can be applied to any dataset. For example, this command uses the include mask element to evaluate an arbitrary expression as true or false (that is a function of the meta-information for the variants): in this example, we require the dbSNP build number in which the variant first appeared (BN tag) to be 123 and for read-depth to be less then 2000. Three variants on chromosome 7 meet these criteria:

pseq proj v-view --vmeta --samples --show BN DP --mask reg=chr7 include="BN==123 && DP<2000"
chr7:34833707	rs17170011	C/T	.	1	PASS	BN=123	.
chr7:34833707	rs17170011	C/T	TSI	PASS	DP=1589;BN=123
chr7:120757061	rs17143291	C/A	.	1	PASS	BN=123	.
chr7:120757061	rs17143291	C/A	TSI	PASS	DP=624;BN=123
chr7:120759251	rs17143296	G/A	.	1	PASS	BN=123	.
chr7:120759251	rs17143296	G/A	TSI	PASS	DP=502;BN=123

The include element works at the level of sample-variants, not the consensus variant. In the above example, rs17170011 was seen in both the CEU and TSI samples, but DP in CEU was greater than 2000. As such, CEU was filtered out, but TSI was kept in. We can confirm this with:

pseq proj v-view --samples --vmeta --mask reg=chr7:34833707
chr7:34833707	rs17170011	C/T	.	2	PASS PASS	AA=C;BN=123;GP=7:34867182;HM2;HM3	.
chr7:34833707	rs17170011	C/T	CEU	PASS	AA=C;AC=18;AN=178;DP=6736;BN=123;GP=7:34867182;HM2;HM3
chr7:34833707	rs17170011	C/T	TSI	PASS	AA=C;AC=4;AN=120;DP=1589;BN=123;GP=7:34867182;HM2;HM3

Masks are described in more detail here. We will consider a few more examples below. As well as meta-information supplied in tags, we can filter on other properties of the variant: for example, to focus on variants with a lot of no-calls in the CEU sample:

pseq proj v-view --mask file=CEU null=80-
chr6:30018537	rs112039894	T/C	.	1	PASS
chr6:30018583	rs113699656	C/A	.	1	PASS
chr6:30018721	rs78073371	G/C	.	1	PASS
chr6:30018780	rs80246530	C/A	.	1	PASS
chr7:156435262	rs6969990	C/G	.	1	PASS
chr8:98857209	rs2449509	G/C	.	1	PASS
chr9:204706	rs540473	G/C	.	1	PASS
chr22:48958933	rs5771206	A/G	.	1	PASS

The null mask element only displays variants with the number of null (missing) genotypes in the range specified null=n,m. For convenience, null=10 implies null=0-10. Thus, the mask above shows only variants with 80 or more missing calls (of the 90 CEU individuals). Looking at some of these variants in more detail, we see that read depth is very low, or 0, for many individuals:

pseq proj v-view --gmeta --mask reg=chr6:30018780 file=CEU
chr6:30018780	rs80246530	C/A	.	1	PASS
NA06984	CEU	./.	[DP=3]
NA06985	CEU	./.	[DP=0]
NA06986	CEU	./.	[DP=4]
NA06989	CEU	./.	[DP=1]
NA06994	CEU	./.	[DP=0]
NA07000	CEU	./.	[DP=0]
NA07037	CEU	./.	[DP=0]
NA07048	CEU	./.	[DP=2]
NA07051	CEU	./.	[DP=5]
NA07346	CEU	C/C	[DP=6]
NA07347	CEU	./.	[DP=5]
NA07357	CEU	./.	[DP=0]
NA10847	CEU	./.	[DP=1]
NA10851	CEU	./.	[DP=0]
NA11829	CEU	./.	[DP=0]
NA11830	CEU	./.	[DP=0]
NA11831	CEU	./.	[DP=2]
NA11832	CEU	./.	[DP=1]
NA11840	CEU	./.	[DP=0]
... 

To output only non-null genotypes when using --geno or --gmeta, add the hide-null argument:

pseq proj v-view --mask reg=chr6:30018780 --geno --gmeta --hide-null
chr6:30018780	rs80246530	C/A	.	2	PASS PASS
NA07346   CEU  C/C  [DP=6]
NA12878   CEU  A/A  [DP=16]
NA20543   TSI  C/A  [DP=2]
NA20805   TSI  C/C  [DP=6]

Similarly, to output only non-reference calls from a --geno or --gmeta command:

pseq proj v-view --mask reg=chr6:30018780 --gmeta --only-alt
chr6:30018780	rs80246530	C/A	.	2	PASS PASS
NA12878   CEU  A/A  [DP=16]
NA20543   TSI  C/A  [DP=2]

(Note: to show only minor alleles, based on sample frequency rather than reference sequence, add the option --only-minor -- although this is equivlant to the rv-view command)

In this case, we appear to see some non-reference variants called on a relatively small number of reads (e.g. DP=2 in this last example). To scan for similar occurrences, of variants being called from very few reads, one could use a command such as:

pseq proj rv-view --gmeta --mask geno=DP:eq:1 monomorphic.ex
chr4:1378413	rs78309237	T/C	.	2	PASS PASS
NA12400	CEU	C/T	[DP=1]
chr5:178291556	rs34499286	A/G	.	2	PASS PASS
NA20506	TSI	A/G	[DP=1]
chr6:30019298	rs115073453	G/T	.	2	PASS PASS
NA20513	TSI	G/T	[DP=1]
chr7:5993422	rs116094787	G/A	.	1	PASS
NA20538	TSI	A/G	[DP=1]
chr17:41429726	rs114553892	A/G	.	2	PASS PASS
NA11881	CEU	A/G	[DP=1]
chr19:21398559	rs10414834	C/G	.	2	PASS PASS
NA12234	CEU	C/G	[DP=1]
NA12750	CEU	C/G	[DP=1]
NA20807	TSI	C/G	[DP=1]
chr19:21399063	rs115610555	C/T	.	2	PASS PASS
NA11893	CEU	C/T	[DP=1]
chr20:61667328	rs310633	A/G	.	2	PASS PASS
NA12414	CEU	A/G	[DP=1]

The above mask only considers sites that are variant (monomorphic.ex) after looking only at genotypes called on the basis of a single read (gene=DP:eq:1). (Follow this link for a description of the syntax of the geno mask element.) It is worth noting that much of the 1000 Genomes data was generated using haplotype information to effectively impute missing data across sites with little coverage to make more accurate calls. If such calls were not made using an LD-aware genotyper (i.e. considering the genotypes at nearby, better covered variant sites), one might want to investigate the accuracy of these particular calls further.

To focus on variants that are seen in both samples, CEU and TSI, you could either you the obs.file.req to specify a list of file-tags for which you required to see that position observed (whether or not a non-reference, alternate allele is at that position in a given file):

pseq proj v-view --mask obs.file.req=CEU,TSI

Equivalently, we could have used obs.nfile=2 to acieve similar results in this dataset (to list variants observed in at least 2 samples).

To count the sites for which a non-reference allele was observed only in TSI, we could use (showing only the first line of output):

pseq proj v-stats --mask alt.file=TSI alt.file.ex=CEU
NVAR    963

Similarly, the number of sites only seen in CEU:

pseq proj v-stats --mask alt.file=CEU alt.file.ex=TSI
NVAR    1167

and those for which a non-reference allele is seen in both:

pseq proj v-stats --mask alt.file.req=CEU,TSI
NVAR    2313

Note that these three numbers sum to 4443, which is less than the 4449 unique variants reported by vardb-summary above. This implies there must be some monomorphic or completely null calls included in these files:

pseq proj v-view --mask alt.file.ex=CEU,TSI
chr5:140870754:rs114669158   .   G/A   1   PASS
chr6:30019035:rs111457285    .   A/C   1   PASS
chr6:30019094:rs115176787    .   G/C   2   PASS PASS
chr11:408735:rs12792208      .   G/A   1   PASS
chr17:44939003:rs116077510   .   C/T   1   PASS
chr19:2199303:rs115008403    .   G/A   1   PASS

(Note that the total number of actual monomorphic variants is higher, due to a number of sites that are monomorphic for the alternate allele in this sample (i.e. representing an error, or a variant, in the reference sequence):

pseq proj v-stats --mask monomorphic
NVAR    44 

Outputting data in different formats

The v-matrix command is designed to produce rectangular-format files of counts of non-reference alleles, e.g. for subsequent analysis in a program such as R. In this particular example, we restrict the file to only two individuals, for a particular gene:

pseq proj v-matrix --mask gene=NLRP11 indiv=NA20808,NA20774
VAR	                        ALT     REF     NA20774	NA20808
chr19:60988831:rs11671248	G	A	0	0
chr19:60988976:rs116820715	T	G	0	0
chr19:60999365:rs114591455	G	A	NA	NA
chr19:60999395:rs2616940	A	G	1	1
chr19:61012012:rs16986626	C	T	1	0
chr19:61012383:rs114531512	A	G	0	0
chr19:61012475:rs12461110	G	A	0	0
chr19:61012625:rs111287552	T	C	NA	NA
chr19:61013107:rs12977654	G	A	0	0
chr19:61013226:rs299163		C	A	2	2
chr19:61013329:rs7249635	T	C	1	0
chr19:61021146:rs8113630	G	C	0	1
chr19:61021206:rs114117210	T	C	0	0

To output a table of variant meta-information, use the meta-matrix command; in this example, we restrict the output to a subset of the variants, samples and tags:

pseq proj meta-matrix --show AN AA DP BN --mask reg=chr22 file=TSI
CHR POS       ID            AA   BN   AN   DP
22  16042793  rs7289170     G    116  128  7403
22  16049306  rs2231495     T    98   126  4036
22  18338811  rs35219372    c    126  128  3852
22  18338829  rs5993890     g    114  126  2238
22  18340512  rs115344498   A    132  130  5179
22  18343258  rs116249498   G    132  132  5849
22  18347480  rs74544696    C    131  130  2068
22  18348971  rs2073748     G    96   70   362
22  18349043  rs80068543    C    131  108  904
...

Writing to other formats (including PLINK transposed PED files, BEAGLE format genotype likelihood files, VCF and the internal VARDB format) is described here.

Summary statistics for samples, individuals and genes

The three main functions to produce summaries for the whole sample, each individual and each gene/set are v-stats, i-stats and g-stats respectively.

Summary statistics for all variants

The v-stats command provides a range of basic information on: a) the number of variants and non-reference allele frequency distribution, b) FILTER and HWE failure rates, c) transition/transversion ratio, d) read-depth and quality score distributions, e) novelty with respect to dbSNP and/or 1000 Genomes data, f) breakdown by functional class (coding/silent) and singleton status, g) summaries of other variant-level and individual-level tags. We saw an example of the basic v-stats command on the CEU VCF above. Here we apply it to the combined CEU/TSI project, additionally requesting information about the proportion of variants that are in HapMap2 or HapMap3 (i.e. the flag is interpreted as a 0/1 variable):

pseq proj v-stats --stats mean=HM2,HM3
NVAR      4449
RATE      0.707462
MAC       21.6458
MAF       0.0818045
SING      1692
MONO      44
TITV      3.52134
TITV_S    3.75281
DP        NA
QUAL      NA
PASS      1
PASS_S    1
MEAN|HM2  0.289503
MEAN|HM3  0.229715

Note that the count option returns similar information in this case (as HM2 and HM3 are simple binary flags):

pseq proj v-stats --stats count=HM2,HM3
..
HM2|set   1288
HM3|set   1022

(i.e. as 1288 / 4449 is 0.289503). The options extend to other types of meta-information, e.g.:

pseq proj v-stats --stats count=AA
...
AA|-  22
AA|.  102
AA|A  569
AA|C  1350
AA|G  1509
AA|N  36
AA|T  583
AA|a  37
AA|c  97
AA|g  99
AA|t  45

and prespecified ranges can be added:

pseq proj v-stats --mask file=CEU --stats count=DP:0-100,DP:200-1000,DP:1000-5000,DP:5000-
DP|0:100       8
DP|200:1000    88
DP|1000:5000   918
DP|5000:*      2472

i.e. 918 variants in CEU has a DP between 1000 and 5000. Note we performed this analysis only on the CEU sample -- because DP is a sample-specific variant attribute, that is not automatically merged to create a single per-variant value, it is only accessible in this context when looking at a single file.

The ref and loc options can be used to display the proportion of sites that either in a specific REFDB (observed as a variant) or LOCDB (at least one region that spans that variant).

pseq proj v-stats --stats ref=dbSNP130 loc=refseq
REF|dbSNP130   0.570915
LOC|refseq     0.996179

Naturally, this assumes that a REFDB and LOCDB are attached and contain the relevant groups.

One can use the usual --mask options: for example, here we investigate the relationship between allele frequency and which build of dbSNP (indicated by the BN tag) that variant first appeared in: (here also illustrating how the mac option can be used to get coutns of variants in particular minor allele count bins):

pseq proj v-stats --stats mac=1,2-4,5-10 --mask include="BN < 120 "
NVAR      1287
MAF       0.195623
MAC|1     62
MAC|2:4   78
MAC|5:10  108
pseq proj v-stats --stats mac=1,2-4,5-10 --mask include="BN >= 120 "
NVAR      3162
MAF       0.035478
MAC|1	  1630
MAC|2:4	  647
MAC|5:10  328

We see many more singletons and the mean MAF is much lower for the variants not added until later dbSNP builds, which is as we'd expect (particularly as 1000 Genomes data contributed to these later builds).

The v-stats command is described more fully here.

Summary statistics per individual

The i-stats command calculates per-individual metrics such as Ti/Tv, dbSNP percentage, the number of singletons, etc.

pseq proj i-stats
ID	NALT	NMIN	NHET	NVAR	RATE	SING	TITV	PASS	PASS_S	QUAL	DP
NA06984	719	538	480	3162	0.90627	1043	3.7193	538	13	NA	7410.61
NA06985	655	492	420	3144	0.90111	1038	3.43243	492	11	NA	7424.63
NA06986	773	607	503	3437	0.98509	1072	3.56391	607	22	NA	6983.5
NA06989	699	506	469	3130	0.89710	1048	3.2521	506	11	NA	7663.52
NA06994	591	438	377	3002	0.86041	1025	3.5625	438	4	NA	7672.17
NA07000	802	591	517	3388	0.97105	1067	3.51145	591	11	NA	6810.01
NA07037	800	607	512	3374	0.96703	1072	3.63359	607	5	NA	6874.12
NA07048	817	650	607	3373	0.96675	1073	3.27632	650	15	NA	7135.38
NA07051	825	624	507	3451	0.98910	1079	3.07843	624	13	NA	6664.27
NA07346	811	597	512	3419	0.97993	1073	3.62791	597	4	NA	6887.14
NA07347	826	599	505	3417	0.97936	1070	3.53788	599	16	NA	6674.78
...

Here, NALT, NMIN and NHET are the number of non-missing genotypes that contain an alternate allele, a minor allele, or are heterozygous. NVAR is the total number of non-missing variants, RATE is the genotyping rate for that individual. The following quantities (similarly defined as for v-stats) refer only to variants for which that particular individual carried a minor allele. The QUAL fields are all NA as this field is not populated in the 1000 Genomes VCFs.

Adding the gmean option (to calculate the mean of a genotypic, per-individual, tag) adds two further columns:

pseq proj i-stats --stats gmean=DP
G|DP	NRG|DP	       
43.8538	50.0807
13.867	15.5359
137.214	130.093
47.8811	54.4649
12.1152	14.5042
33.1619	31.3329
32.6406	30.0087
62.4786	61.9021
145.679	132.879
107.561	101.276
...

Here, G|DP is the mean of the per-individual DP tag for all non-missing genotypes called for that individual; NRG|DP is similar, but considers only genotypes for which that individual carries a non-reference allele.

For illustration: we can use this command to investigate some basic properties of the callset, for example, the relationship between read depth and sensitivity to detect variants (as indexed by the number of singletons) in the CEU sample. First, we can run:

pseq proj i-stats --mask file=CEU mac=1-1 --stats gmean=DP > istats.txt

and then, in R for example:

d <- read.table( "istats.txt" , header = T ) plot( d$PASS_S / d$NVAR , d$NRG.DP )

There is a substantial correlation between these two measures (Pearson's correlation coefficient is almost 0.5, with a p-value of 3e-6) after removing the single outlier with very high coverage. This suggests, as expected, that singleton rate in a sample may be quite dependent on the mean coverage for that individual (at least within certain ranges of coverage). In this way, the i-stats command can be useful for generating datasets that can be related to phenotypic outcomes, to determine how comparable, for example, cases and controls are in technical and other ways.

The i-stats command is described more fully here.

Summary statistics per gene

Finally, similar summary metrics can be calculated on a per-gene basis with the g-stats command (which requires a grouping variable to be given, e.g. via a loc.group:

pseq proj g-stats --mask loc.group=refseq

NAME POS KB BP EXON DENS NVAR RATE SING TITV PASS PASS_S QUAL DP NM_000021 chr14:72684481..72755747 71.267 1401 10 1401 1 1 0 NA 1 0 NA 6290 NM_000043 chr10:90740614..90764184 23.571 1005 9 335 3 0.918 0 NA 3 0 NA 6591.3 NM_000055 chr3:166973867..167031515 57.649 1806 3 602 3 0.955 1 2 3 1 NA 7517 NM_000077 chr9:21958231..21964826 6.596 468 3 468 1 0.8 0 NA 1 0 NA 1130 NM_000112 chr5:149337409..149341566 4.158 2217 2 246.3 9 0.996 5 2 9 5 NA 10391.9 NM_000113 chr9:131616075..131626185 10.111 996 5 498 2 0.994 0 1 2 0 NA 9829.5 NM_000119 chr15:41276795..41300315 23.521 2163 13 721 3 0.977 3 NA 3 3 NA 8926.6 NM_000122 chr2:127731645..127768127 36.483 2346 15 1173 2 1 2 NA 2 2 NA 7471.5

As for v-stats and i-stats, additional options can be added, as described here.

Lookups and filtering the data

In this section we consider a number of ways in which external references and annotations can be intersected with a project's sequence data.

Functional annotation of variants

To annotate the coding status of variants given a LOCDB containing GTFs and a SEQDB, use the counts (or g-counts) command along with --options annotate:

pseq proj counts --phenotype phe1 --annotate refseq --mask reg=chr22
VAR	        REF/ALT	MINOR	CNTA	CNTU	TOTA	TOTU	FUNC	        GENE
chr22:16042444	C/G	G	1	0	178	0	missense	NM_017424,NM_177405
chr22:16042793	A/G	G	61	44	178	128	silent		NM_017424,NM_177405
chr22:16049306	T/C	C	43	43	148	126	missense	NM_017424,NM_177405
chr22:17729354	G/A	A	1	0	180	0	silent		NM_003325
chr22:18338811	C/T	T	3	1	144	128	silent		NM_001670
chr22:18338829	G/A	A	5	4	148	126	silent		NM_001670
chr22:18340512	A/G	G	0	1	0	130	missense	NM_001670

Filter on variants not in dbSNP(130)

To extract variants that are novel, in the sense of not being present in a particular build of dbSNP, use the following mask, i.e. assuming a group called dbSNP130 (case sensitive) exists in the REFDB:

pseq proj v-view --mask ref.ex=dbSNP130
chr1:1105366	rs111751804	T/C	.	2	PASS PASS
chr1:1105411	rs114390380	G/A	.	2	PASS PASS
chr1:1110240	rs116321663	T/A	.	2	PASS PASS
chr1:3538715	rs114376964	T/C	.	1	PASS
chr1:3541597	rs116230480	C/T	.	1	PASS
chr1:3541599	rs115982402	C/T	.	1	PASS
chr1:3542416	rs113614277	C/T	.	1	PASS
chr1:3545211	rs114697698	G/A	.	1	PASS
...

To consider only known dbSNP variants, one would replace ref.ex with ref.

Append HGMD annotations

Note: The HGMD reference dataset mentioned below is not part of the tutorial download and so the commands are shown for illustration only.

The Human Gene Mutation Database represents (in the creator's words) an attempt to collate known (published) gene lesions responsible for human inherited disease. It is only available to registered users and so cannot be included as a resource database on this site. However, for registered users it is easy to download the database, and upload it, as a flat-file into a PLINK/Seq REFDB. See instructions on how to build this database here.

Assuming, then, a separate database named (for example) hgmd.db exists in the hg18 folder, the following outputs in verbose form all HGMD SNPs in the CEU/TSI pilot 3, along with selected tags from the HGMD database: (database not included with tutorial)

pseq proj v-view --verbose --refdb hg18/hgmd.db --mask ref=hgmd
chr1:151920381:rs28730726       .       G/C     1       PASS
        AA = G
        BN = 125
        GP = 1:153653757
        hgmd flag set
        hgmd_ID = rs28730726
        hgmd_DESC = Essential hypertension, association with
        hgmd_GENE = NPR1
        hgmd_TYPE = mis/non
        hgmd_VALUE = DP

chr1:195678032:rs62636285       .       G/A     1       PASS
        AA = G
        BN = 129
        GP = 1:197411409
        hgmd flag set
        hgmd_ID = rs62636285
        hgmd_DESC = Leber congenital amaurosis
        hgmd_GENE = CRB1
        hgmd_TYPE = mis/non
        hgmd_VALUE = DM

...

We can look at the general properties of variants in the CEU and TSI samples that are in (or not in) HGMD, by using v-stats combined with a ref mask option:

pseq proj v-stats --mask ref=hgmd --refdb hg18/hgmd.db
NVAR    117
RATE    0.796515
MAC     35.5385
MAF     0.126484
SING    21
MONO    0
TITV    2.44118
TITV_S  3.2

In other words, 117 of the variants in HGMD are observed in the CEU/TSI sample; 21 are singletons but most are quite common, as the mean MAF is above 10%.

pseq proj v-stats --mask ref.ex=hgmd --refdb hg18/hgmd.db
NVAR     4332
RATE     0.705057
MAC      21.2705
MAF      0.0805977
SING     1671
MONO     44
TITV     3.56
TITV_S   3.76068

Note how in these examples we used --refdb to swap in a different REFDB (i.e. that contains a group called hgmd in this case).

Intersect a list of variants (e.g. from another study)

The lookup command is designed to take a plain text file of sites or regions of interest and return information from the project's databases. For example, if we had the following list of sites/regions (note the required format):

cat data/lookup.txt
chr22:99 
chr22:20328280 
chr22:20648538
chr22:29187373 
chr22:43000000..44000000 

then specifying this file with the --file option (and optionally setting a phenotype to obtain case/control allele counts):

pseq proj lookup --file data/lookup.txt --ref dbSNP130 --loc refseq --phenotype phe1
##nvar,1,Integer,"Number of overlapping variants"
##var,1,String,"Variant ID"
##case,1,Integer,"Case minor allele counts"
##con,1,Integer,"Control minor allele counts"
##seqdb_ref,1,String,"SEQDB reference sequence"
chr22:99	seqdb_ref	N
chr22:99	var	NA
chr22:99	case	NA
chr22:99	con	NA
chr22:99	nvar	0
chr22:20328280	seqdb_ref	G
chr22:20328280	nvar	1
chr22:20328280	var	chr22:20328280
chr22:20328280	case	4
chr22:20328280	con	0
chr22:20328448	seqdb_ref	G
chr22:20328448	nvar	1
chr22:20328448	var	chr22:20328448
chr22:20328448	case	0
chr22:20328448	con	1
...
chr22:43000000..44000000	seqdb_ref	.
chr22:43000000..44000000	nvar	6
chr22:43000000..44000000	var_1	chr22:43657667
chr22:43000000..44000000	case_1	3
chr22:43000000..44000000	con_1	1
chr22:43000000..44000000	var_2	chr22:43670607
chr22:43000000..44000000	case_2	1
chr22:43000000..44000000	con_2	1
chr22:43000000..44000000	var_3	chr22:43690908

The output indicates the variants in the CEU/TSI that overlap with each site/region in lookup.txt, along with the ID (VAR), and in this case, counts of non-reference alleles for "cases" and "controls" (i.e. TSI and CEU).

If the options ref and loc are added, then additional information is given about the entires in REFDB and LOCDB that match with variants in the lookup list. For example:

pseq proj lookup --file data/lookup.txt --ref dbSNP130 --loc refseq --phenotype phe1

This will add lines with ref_ and loc_ codes, interleaved with the original variant rows: e.g.

...
chr22:29190830	seqdb_ref	C
chr22:29190830	nvar		1
chr22:29190830	var		chr22:29190830:rs2269961
chr22:29190830	cnt		59
chr22:29190830	loc_refseq	chr22:29186011..29197945:NM_174975
chr22:29190830	ref_dbSNP130	.:22..29190830:1:29190830:.
...

Examining genotype-phenotype relationships

In ths section we consider a number of functions that look at allele and genotype frequency distributions and their relation to phenotype.

Variant frequencies: allelic, genotypic and HWE

To list variants that deviate markedly from expected Hardy-Weinberg genotype frequencies, use the hwe mask option combined with the v-freq command. This command generates a number of fields, so we use the gcol utility to extract only these two fields:

pseq proj v-freq --mask hwe=0:1e-7 | gcol VAR HWE
VAR                        HWE
chr10:46507377	9.16912e-08
chr10:46507507	3.31099e-08
chr16:54420192	4.60983e-14
chr16:54420384	5.61696e-08

The argument for hwe is in the form m:n, i.e. to only include variants with a HWE p-value within this range.

Here, we illustrate how to attach meta-information to an existing variant database, by creating a flag to indicate, in this example, HWE-failing variants. As described here, we must first create a suitably formatted file, with headers describing the fields, followed by tab-separated variant, tag key, tag value fields: e.g. the file hw.txt:

##hwd,1,Flag,"Fails HW test at p<1e-7"
chr10:46507377  hwd   1
chr10:46507507  hwd   1
chr16:54420192  hwd   1
chr16:54420384  hwd   1

These tags are imported into the VARDB with the attach-meta command:

pseq proj attach-meta --file hw.txt --id CEU TSI --group hwd

The --id command indicates that the tag should be applied to variants (if they exist) in both CEU and TSI samples in this example: that is, tags are applied to specific files in the VARDB.

To subsequently attach this meta-tag requires the meta.attach mask option:

pseq proj v-view --mask meta.attach=hwd reg=chr10:46507377 --vmeta --samples
chr10:46507377	rs1048156	G/A	.	2	PASS PASS	AA=g;BN=86;GP=10:47087371;HM2	.
chr10:46507377	rs1048156	G/A	TSI	PASS	AA=g;AC=41;AN=128;DP=8341;BN=86;GP=10:47087371;HM2;hwd
chr10:46507377	rs1048156	G/A	CEU	PASS	AA=g;AC=44;AN=176;DP=20753;BN=86;GP=10:47087371;HM2;hwd

Otherwise, these tags are just like other tags that can be used in include expressions, contain integers, floats or strings, etc.

As an alternative to adding a flag to certain variants, one could also take the following approach: if the file hw2.txt contains the text

chr10:46507377
chr10:46507507
chr16:54420192
chr16:54420384

then these sites can be included with reg or reg.req (or excluded with reg.ex) with the @ file inclusion operator, as follows:

pseq proj v-freq --mask reg=@hw2.txt | gcol VAR HWE MAF HET
VAR                         HWE          MAF       HET
chr10:46507377	9.16912e-08	0.279605	0.559211
chr10:46507507	3.31099e-08	0.289116	0.578231
chr16:54420192	4.60983e-14	0.391304	0.782609
chr16:54420384	5.61696e-08	0.3125	0.625

Finally, the g-counts command can print human-readable genotype counts, which can help interpret the deviations from HWE:

pseq proj g-counts --mask reg=@hw2.txt
VAR             ALLELES   GENOTYPES
chr10:46507377  G/A       ./.=4 A/G=85 G/G=67
chr10:46507507  C/T       ./.=9 C/C=62 C/T=85
chr16:54420192  G/A       ./.=41 A/G=90 G/G=25
chr16:54420384  C/A       ./.=36 A/C=75 C/C=45

These instances of HW deviation represent cases where the alternate allele is very common, but never seen in homozygous form and likely represents a problem with alignment and variant calling. This holds for both CEU and TSI samples:

pseq proj g-counts --mask reg=@hw2.txt file=CEU
VAR             ALLELES  GENOTYPES
chr10:46507377  G/A      ./.=2 A/G=44 G/G=44
chr10:46507507  C/T      ./.=3 C/C=31 C/T=56
chr16:54420192  G/A      ./.=33 A/G=42 G/G=15
chr16:54420384  C/A      ./.=17 A/C=43 C/C=30
pseq proj g-counts --mask reg=@hw2.txt file=TSI
VAR             ALLELES  GENOTYPES
chr10:46507377  G/A      ./.=2 A/G=41 G/G=23
chr10:46507507  C/T      ./.=6 C/C=31 C/T=29
chr16:54420192  G/A      ./.=8 A/G=48 G/G=10
chr16:54420384  C/A      ./.=19 A/C=32 C/C=15

Multilocus allelic distributions

The v-dist command can be used to assess the broad comparability in population (and technical) terms between two groups (specified by a dichotomous, case/control phenotype):

pseq proj v-dist --phenotype phe1
--------------------------- All cases & controls ( 90 cases, 66 controls ) ---------------------------

Singletons : chi-sq (1df) = 24.64 p = 6.919e-07
  A / U	   Exp	    Obs	  Ratio
  0 / 1	   713.3    814	  1.141
  1 / 0	   972.7    872	  0.8965

Doubletons : chi-sq (2df) = 45.46 p = 1.345e-10
  A / U	   Exp	    Obs	  Ratio
  0 / 2	   66.59    98	  1.472
  1 / 1	   181.6    118	  0.6498
  2 / 0	   123.8    156	  1.26

Tripletons : chi-sq (3df) = 23.58 p = 3.062e-05
  A / U	   Exp	    Obs	  Ratio
  0 / 3	   14.99    25	  1.667
  1 / 2	   61.34    46	  0.7499
  2 / 1	   83.65    69	  0.8249
  3 / 0	   38.02    58	  1.525

This command considers only sites with exactly 1, 2, 3 or (not shown) 4 minor alleles in the whole sample, and asks (within each frequency category) whether or not the minor alleles appear to be randomly distributed across the two groups, or whether we see (for example) more 2/0 and 0/2 doubletons than we'd expect.

Remembering that in this example case and control actually represent CEU and TSI group memership, we see this analysis indicates (as expected) that there are numerous frequency differences (if we assume that absence of a variant in a particular file implies high confidence that all calls are homozygous reference). In other words, looking only at doubletons (variants for which we observed exactly two copies of an alternate allele), these occur more often within either CEU or TSI, compared to between, than expected by chance. In a typical disease association study, one would hope not to observe signficant deviations from the v-dist test. Note that this approach ignores linkage disequilibrium between sites.

Genotypic similarity between all pairs of individuals

To calculate for each pair of individuals the number of genotypes for which they both carry a minor allele, use the ibs-matrix command. Here we restrict attention to lower frequency variants (only 2 to 5 copies observed) and add some options to get a list (rather than a matrix) as output:

pseq proj ibs-matrix --mask mac=2-5 --long-format --two-counts
 
NA06984  NA06985  0  600
NA06984  NA06986  0  629
NA06984  NA06989  0  616
NA06984  NA06994  0  591
NA06984  NA07000  0  626
NA06984  NA07037  0  630
NA06984  NA07048  2  630
NA06984  NA07051  0  631
NA06984  NA07346  1  627
NA06984  NA07347  1  625
...

The third column is the number of genotypes where both members of the pair carry a minor allele; the fourth column is the total number of non-missing genotypes for the pair (after the mask has been applied). If we plot the ratio of these two columns, we observe at least one outlier pair:


In terms of absolute number of genotypes shared, these two pairs have markedly higher values than most pairs (over 80% of all pairs share 0, the next highest pair shares only 5):

NA12878  NA12891  19  667
NA12878  NA12892  10  669

Finding variants that are unique to particular groups or samples

To find out which doubletons are unique to the first pair from the example above, for example, we can use the unique command:

pseq proj unique --indiv NA12878 NA12891
2	0	chr2:27331735	AA=G;BN=121;GP=2:27478231;HM2;HM3
2	0	chr3:128873047	AA=C;BN=130;GP=3:127390357
2	0	chr5:74711006	AA=T;BN=129;GP=5:74675250
2	0	chr9:130515404	AA=G;BN=130;GP=9:131475583
2	0	chr10:104889187	AA=C;BN=130;GP=10:104899197
2	0	chr14:20625321	AA=G;BN=129;GP=14:21555481
2	0	chr15:59998858	AA=G;BN=130;GP=15:62211566

The first two columns indicate that it was observed for 2 out of 2 specified individuals, and 0 of N-2 remaining individuals. Options can be added to modify the behaviour of unique (i.e. to require at least M of N in one group, but no more than S of T in the remainings samples.

Another convenient way to filter variants in the context of a case control study is to use the mask options case.control, case.uniq and control.uniq:

pseq proj counts --phenotype phe1 --mask case.control=8-10,0-1
VAR             ALLELES  ALTA  ALTU  TOTA  TOTU
chr4:178518748  G/A      8     0     180   128
chr5:76164809   T/C      8     0     174   122
chr5:121515868  A/G      8     0     180   0
chr6:135580796  A/G      8     0     180   0
chr7:96585128   C/T      9     0     178   0
chr10:101635524 A/G      8     1     180   130
chr12:132097393 A/G      9     0     180   0
chr13:23141204  A/G      10    0     176   0
chr14:62244858  T/C      10    1     180   124
chr15:23477291  G/C      9     0     158   0
chr15:76876129  T/C      9     1     156   124
chr16:54420213  T/C      10    0     180   0
chr19:45778149  C/G      8     1     176   132

Note that case.uniq implies case.control=1-,0.

Single locus association

Single locus association for dichotomous case/control traits can be calculated with the v-assoc command:

pseq proj v-assoc --phenotype phe1 | gcol MINA MINU OBSA OBSU VAR OR P

Considering only variants significant at p<1e-4:

MINA   MINU   OBSA   OBSU   VAR              OR        P
78     28     86     63     chr6:7156817     2.90426   4.75622e-05
30     2      84     50     chr14:22619220   10.6522   5.25491e-05
102    39     86     57     chr15:23476187   2.8022    3.80189e-05

By default, these tests are based on Fisher's exact test. They can be based on either a fixed number, or an adaptive number, of permutations of the phenotype by adding --perm 10000 or --perm -1, respectively.

See the glm command for single locus tests framed in terms of regression (that allow for covariates and quantitative outcomes).

Gene-based assocaition tests for rare variants

For illustration of syntax only, here we perform a series of gene-based rare-variant tests, using a very simple test: comparing the burden of non-reference genotypes in cases compared to controls within a gene, evaluating the significance of the difference by permutation.

pseq proj assoc --phenotype phe1 --mask loc.group=refseq mac=1-10
LOCUS         POS                       ALIAS               NVAR TEST     P           I       DESC
NM_001055     chr16:28524914..28527421  CCDS32420.1,SULT1A1 4    BURDEN   0.00317749  0.0052  11/0
NM_001164814  chr14:22617256..22634277  ACIN1               8    BURDEN   0.00359251  0.0010  8/3
NM_014977     chr14:22617256..22634277  CCDS9587.1,ACIN1    8    BURDEN   0.0042879   0.0012  8/3
NM_020186     chr7:96585128..96585128   CCDS5648.1,ACN9     1    BURDEN   0.0034055   0.0058  9/0
NM_024837     chr15:47939741..48153645  CCDS32238.1,ATP8B4  10   BURDEN   0.000176587 5e-05   22/2
NM_177529     chr16:28524914..28527421  CCDS32420.1,SULT1A1 4    BURDEN   0.00373433  0.0058  11/0
NM_177530     chr16:28524914..28527421  CCDS32420.1,SULT1A1 4    BURDEN   0.00306614  0.0054  11/0
NM_177534     chr16:28524914..28527421  CCDS32420.1,SULT1A1 4    BURDEN   0.00231443  0.0059  11/0
NM_177536     chr16:28524914..28527421  CCDS10637.1,SULT1A1 4    BURDEN   0.00306614  0.0054  11/0

We can see what is driving this signal, e.g. for SULT1A1, with the mrv-view command:

pseq proj mrv-view --phenotype phe1 --mask gene=SULT1A1 mac=1-10
NAME=[]    N(V)=4    N(I)=156
V1  chr16:28524914:rs28374453
V2  chr16:28525008:rs41278160
V3  chr16:28525053:rs3176926
V4  chr16:28527421:rs1126446

I        PHE   V1   V2   V3   V4
NA07048  CASE  ./.  C/C  ./.  A/G
NA07347  CASE  A/A  C/C  C/C  A/G
NA11843  CASE  A/A  C/C  C/G  A/A
NA12144  CASE  A/A  C/T  ./.  ./.
NA12249  CASE  A/A  C/C  C/G  A/A
NA12272  CASE  A/A  C/T  G/G  ./.
NA12275  CASE  A/A  C/C  C/G  ./.
NA12286  CASE  A/G  C/C  C/C  A/A
NA12891  CASE  A/A  C/C  C/G  A/G

The BURDEN test is based on the number of genotypes carrying a rare allele (i.e. assuming a dominant model); above we see 11 genotypes (12 alleles in 9 people) that carry a rare variant and all are cases. Note that the BURDEN test is one-sided, testing for a higher rate in cases: in our example where cases and controls are CEU and TSI this clearly isn't what you'd want -- we only show this example to introduce usage of some of the commands. As detailed here there are a growing number of alternative gene-based tests designed to handle a broader array of conditions.

However, in this particular case, the data are all missing for the "controls" and there is a lot of missing data for "cases" too:

pseq proj g-counts --phenotype phe1 --mask gene=SULT1A1 mac=1-10
VAR             ALLELES  GENOTYPES
chr16:28524914  A/G      ./.=30:66 A/A=59:0 A/G=1:0
chr16:28525008  C/T      ./.=27:66 C/C=61:0 C/T=2:0
chr16:28525053  C/G      ./.=45:66 C/C=40:0 C/G=4:0 G/G=1:0
chr16:28527421  A/G      ./.=41:66 A/A=46:0 A/G=3:0

All TSI individuals ("controls") are missing for these variants; unless it is true that the absence of a call means that one is very confident that all genotypes are really homozygous reference, the asymmetric structure of missing data could bias this test. Fuller consideration of such issues is beyond the scope of this introductory tutorial however.

Using R to work with this project

This section assumes you have R installed on your machine and have also downloaded the PLINK/Seq library R interface. At the R prompt, type:

library(Rplinkseq)
PLINK/Seq genetics library for R | v0.08 | 15-Mar-2012

Attaching the project

To attach the PLINK/Seq project we've generated from the two VCFs:

pseq.project( "proj" )

Because this is a relatively small project, we can load entire files into memory in R:

k <- var.fetch( mask = "file=TSI" , limit = 4000 ) length(k)
[1] 3281

The object k is a list of list objects; we can extract core fields such as chromosome with the x.core() function:

table( x.core( k , "CHR" ) )
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
303 220 153 153 148 240 135 125 107 160 190 187  55 128 163 133 170 116 194 112 
 21  22 
 23  66 

We can access the full list for a single variant: in this example, for chr10:, data for which was shown above in the Hardy-Weinberg equilibrium section of this tutorial. The output from the PSEQ g-counts for this variant (for TSI) was:

VAR             ALLELES  GENOTYPES
chr10:46507377  G/A      ./.=2 A/G=41 G/G=23

This same variant is number 1456 in the list k:

str( k[[1456]] )
List of 8
 $ CHR : int 10
 $ BP1 : int 46507377
 $ BP2 : int 46507377
 $ ID  : chr "rs1048156"
 $ NS  : int 1
 $ META:List of 4
  ..$ HM2: chr(0) 
  ..$ AA : chr "g"
  ..$ GP : chr "10:47087371"
  ..$ BN : int 86
 $ CON :List of 7
  ..$ FSET  : int 0
  ..$ REF   : chr "G"
  ..$ ALT   : chr "A"
  ..$ QUAL  : num NA
  ..$ FILTER: chr ""
  ..$ META  :List of 7
  .. ..$ HM2: chr(0) 
  .. ..$ AA : chr "g"
  .. ..$ GP : chr "10:47087371"
  .. ..$ AC : int 41
  .. ..$ AN : int 128
  .. ..$ DP : int 8341
  .. ..$ BN : int 86
  ..$ GENO  :List of 2
  .. ..$ GT: int [1:66] 0 1 0 1 0 1 0 1 1 1 ...
  .. ..$ DP: int [1:66] 48 21 40 197 417 46 15 49 38 32 ...
 $ S2  : NULL
table( x.consensus.genotype( k[1456] ) , useNA = "ifany" )
   0    1 <NA> 
  23   41    2 

This indicates 41 heterozygote carriers of one copy of the alternate A allele, 23 G/G reference homozygotes and 2 individuals with no call.

 


To be continued: in particular, examples of visualizing meta-information
and defining and applying arbitrary functions.