Tutorial: working with 1000 Genomes Pilot 3 VCFs
By way of introducing some of the features and approaches of PLINK/SEQ, this page provides a tutorial that uses PSEQ and the R interface to PLINK/SEQ to work with next-generation sequence data from the 1000 Genomes Project. We will use two datasets of SNP genotype calls from the pilot 3 (exon study). As described in the manuscript, these two samples comprise data from a very small proportion of the complete 1000 Genomes pilot study: 90 CEU and 66 TSI individuals for over 8000 exons at high coverage.
If you haven't already, first take a look at this page, that gives an overview of PLINK/SEQ architecture, before embarking on this tutorial.
Downloading the data for the tutorial
We assume you have a working copy of PLINK/SEQ already installed. The data for this tutorial are in two archives you need to download:
- pseq-tut1.tar.gz [ 1.1M ] : VCFs and a few auxiliary data files
- resources-hg18-tut1.tar.gz [ 1.4G ] : a (relatively large) bundle of resource databases (RefSeq genes, dbSNP variants, hg18 sequence)
The genotype data are in (compressed) VCF format, which is the primary input format used by PLINK/Seq. They should be identical to those available on the 1000 Genomes FTP site: namely, CEU.exon.2010_09.genotypes.vcf.gz and TSI.exon.2010_09.genotypes.vcf.gz.
Extract the contents from these archives:
tar -xzvf pseq-tut1.tar.gz
tar -xzvf resources-hg18-tut1.tar.gz
This should create two folders in your current folder: data and hg18, with the following contents:
data: total 1.1M 625K CEU.exon.2010_09.genotypes.vcf.gz 72 exlist.txt 273 lookup.txt 92 meta.meta 1.6K pop.phe 454K TSI.exon.2010_09.genotypes.vcf.gz hg18: total 2.1G 216M locdb 1.1G refdb 831M seqdb
Overview
The following topics are covered in this tutorial:
- Quick looks at the VCFs: Direct inspection of the VCFs; Using PSEQ to view variants and genotypes; Outputting (filtered) VCFs; Summary statistics on VCFs
- Setting up a PLINK/Seq project: Project specification; Importing VCFs; Attaching phenotypic information; Reference databases
- Viewing and filtering the data in different ways: Basic views and filters; Individual genotype data; Gene-centric queries; Filtering on meta-information; Outputting variant and genotype data in different formats
- Summary statistics: Whole-sample variant statistics; Per-individual summary statistics; Per-gene summary statistics
- Annotating variants : Functional annotation of coding status; Novelty (presence in dbSNP); Human Gene Mutation Database; Look-ups of variant lists
- Examining genotype-phenotype relationships: Frequencies and HWE; Global distributions of rare alleles; Finding variants unique to particular groups; Single locus association; Gene-based rare-variant association tests;
- Using R to work with this project: Attaching the project; Visualising variant meta-information; Defining and applying arbitrary functions
Quick looks at the VCFs
You can view the compressed VCFs with a Unix command such as zless:
zless data/CEU.exon.2010_09.genotypes.vcf.gz
For example: after the headers, we see information for a variant with ID rs111751804 on chromosome 1 at 1105366 (hg18 co-ordinates): (lines wrapped in output below for clarity)
#CHROM POS ID REF ALT QUAL FILTER (line wrapped) 1 1105366 rs111751804 T C . PASS (line wrapped) INFO FORMAT NA06984 NA06985 NA06986 ... AA=T;AC=4;AN=114;DP=3251;GP=1:1115503;BN=132 GT:DP ./.:0 ./.:0 0/0:107 ...
This line shows that the reference allele at this position is T and alternate allele C. There is no associated QUAL score for this variant, and it PASSed filters (note: all variants in these two particular datasets passed QC). We then see a string of tags (meta-information); the fields are described in the header of the VCF:
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele, ..."> ##INFO=<ID=AC,Number=1,Type=Integer,Description="Number of alternate alleles ..."> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total allele count ..."> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth"> ...
Looking at the genotype information for this variant (with format GT:DP, i.e. genotype and read-depth), we see the first two individuals have a missing genotype due to no coverage: ./.:0. The third individual is homozygous for the reference allele, based on 107 reads: 0/0:107.
Using PSEQ to view variants and genotypes in a single VCF
As described here, one can use PSEQ either with a single VCF, or with a PLINK/Seq project. To view the variants in a single VCF, use the v-view command (the second argument given to PSEQ), with the VCF filename (rather than a project name) as the first argument: pseq data/CEU.exon.2010_09.genotypes.vcf.gz v-view
chr1:1105366 rs111751804 T/C . 1 PASS chr1:1105411 rs114390380 G/A . 1 PASS chr1:1108138 rs61733845 C/T . 1 PASS chr1:1110240 rs116321663 T/A . 1 PASS chr1:1110294 rs1320571 G/A . 1 PASS ...
Looking at the first row, we see the variant with ID rs111751804 at position chr1:1105366. The third column lists the reference allele (T) and alternate allele(s) (in this case, C). The fourth column indicates that this row represents a consensus variant and can be ignored for now. The fifth column indicates the number of samples in which this variant was observed (necessarily 1 in this case). The last column displays the FILTER field of the VCF. In the cleaned VCFs in this tutorial, all variants PASS.
We can also output the associated meta-information with the --vmeta argument; here we will restrict output to only the first two variants in the file, by adding the mask option limit=2:
pseq data/CEU.exon.2010_09.genotypes.vcf.gz v-view --vmeta --mask limit=2
chr1:1105366 rs111751804 T/C . 1 PASS . AA=T;AC=4;AN=114;DP=3251;GP=1:1115503;BN=132 chr1:1105411 rs114390380 G/A . 1 PASS . AA=G;AC=1;AN=106;DP=2676;GP=1:1115548;BN=132
We can change the format in which variant meta-information is output, for example, by using the meta-matrix command, designed to generate rectangular/tabular files:
pseq data/CEU.exon.2010_09.genotypes.vcf.gz meta-matrix
--mask limit=2
CHR POS ID AA AC AN DP HM2 HM3 GP BN NR AR OR MP PASS 1 1105366 rs111751804 T 4 114 3251 0 0 1:1115503 132 0 NA NA 0 1 1 1105411 rs114390380 G 1 106 2676 0 0 1:1115548 132 0 NA NA 0 1
Note that when using the meta-matrix command, all tags in the file/project are listed for every variant (e.g. the OR does not appear for either of these two variants, and so is listed NA). The presence or absence of flags such as HM2 is indicated by 0 and 1.
We can view individual-level genotype data (and meta-information)
in a more human-readable format:
pseq data/CEU.exon.2010_09.genotypes.vcf.gz v-view
--geno --gmeta --mask limit=2
chr1:1105366 rs111751804 T/C . 1 PASS NA06984 . ./. [DP=0] NA06985 . ./. [DP=0] NA06986 . T/T [DP=107] NA06989 . ./. [DP=0] NA06994 . ./. [DP=0] NA07000 . T/T [DP=25] NA07037 . T/T [DP=30] NA07048 . T/T [DP=31] NA07051 . T/T [DP=57] NA07346 . T/T [DP=69] NA07347 . T/T [DP=53] NA07357 . T/T [DP=225] ...
Other ways to view the variant and genotype data are descibed below and in the main PSEQ documentation.
Outputting filtered VCFs
It is possible to create new VCFs with the write-vcf command. By itself, this would simply reproduce the original input file. It can be combined with other options to limit the output and create simplified VCFs (removing unwanted tags, for example). Here we request only HM2 and HM3 tags to be shown; we also request genotypes only for two individuals (NA11920 and NA12761), limited to chromosome 7:
pseq data/CEU.exon.2010_09.genotypes.vcf.gz write-vcf
--show HM2 HM3
--mask indiv=NA11920,NA12761 reg=chr7
##fileformat=VCFv4.0 ##source=pseq ##INFO=<ID=HM2,Number=0,Type=Flag,Description="HapMap2 membership"> ##INFO=<ID=HM3,Number=0,Type=Flag,Description="HapMap3 membership"> ##FILTER=<ID=PASS,Description="Passed variant FILTERs"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA11920 NA12761 chr7 2257048 rs1799832 C T . PASS HM3 GT 0/0 0/0 chr7 5993056 rs1805324 C T . PASS HM2;HM3 GT 0/0 0/0 chr7 5993133 rs1805318 T A . PASS HM2;HM3 GT 0/0 0/0 chr7 5993234 rs63750668 C A . PASS . GT 0/0 0/0 chr7 5993301 rs2228006 T C . PASS HM2;HM3 GT ./. ./. chr7 5993468 rs1805323 G T . PASS . GT 0/0 ./. chr7 5993514 rs1805321 G A . PASS HM2 GT ./. ./. chr7 6001737 rs116788608 T C . PASS . GT 0/0 0/0 chr7 6003506 rs1805319 G C . PASS HM2 GT 1/1 1/1 chr7 16782525 rs2290837 T C . PASS HM2;HM3 GT 0/1 0/1 ...
Note: the full mask syntax is described here -- but note that not all mask options are available in single VCF mode (as opposed to working with a project, see below).
Summary statistics on VCFs
We will explore the details of the summary statistic commands (v-stats, i-stats and g-stats) later, but here will illustrate two simple commands that can be performed on a single VCF. The v-stats command yields some basic summary information on things such as call rate, allele frequency and transition/transversion ratio:
pseq data/CEU.exon.2010_09.genotypes.vcf.gz v-stats
NVAR 3489 RATE 0.918095 MAC 15.7263 MAF 0.101213 SING 1083 MONO 49 TITV 3.46735 TITV_S 3.53138 DP 6992.37 QUAL 0 PASS 1 FILTER|PASS 1 PASS_S 1
Next we calculate per-individual summary statistics on the whole VCF, with the i-stats command, but output only a subset using the gcol utility:
pseq data/CEU.exon.2010_09.genotypes.vcf.gz i-stats
| gcol ID NVAR SING TITV
ID NVAR SING TITV NA06984 3162 13 3.7193 NA06985 3144 11 3.43243 NA06986 3437 22 3.56391 NA06989 3130 11 3.2521 NA06994 3002 4 3.5625 NA07000 3388 11 3.51145 NA07037 3374 5 3.63359 NA07048 3373 15 3.27632 NA07051 3451 13 3.07843 NA07346 3419 4 3.62791 NA07347 3417 16 3.53788 ...
This gives the number of non-missing calls per individual (NOBS), the number of singleton non-reference alleles they carry (SING) and the individual's Ti/Tv based only on variants for which they carry a non-reference allele. (The gcol program is a simple utility to extract columns from a rectangular whitespace-delimited text file based on names in the header; it is available for download from here).
A more detailed description of the various summary statistic commands available in PSEQ is given here.
Setting up a project
Generally, it is preferable to build a project rather than working with single VCFs as we have done above. Setting up a project allows data from multiple sources to be combined easily; it provides a wider range of filtering and grouping options via masks; it allows data to be accessed via different PLINK/Seq front-ends (such as the R interface, or the exome-browser); it will process files more quickly than parsing the VCFs for every analysis (as the data are stored in a compressed, indexed binary format inside the VARDB). Therefore, in most non-trivial scenarios, it is advisable to create a project first. This simple process is described below.
Creating a new project and linking resource databases
pseq proj new-project
--metameta data/meta.meta
--resources hg18
This creates a text file called proj in the current directory that henceforth will represent the project. This file is simply a list of the other files that are considered part of the project:
/full/path/to/current/proj_out/ OUTPUT /full/path/to/current/hg18/ RESOURCES /full/path/to/current/data/meta.meta METAMETA /full/path/to/current/proj_out/vardb VARDB /full/path/to/current/proj_out/inddb INDDB /full/path/to/current/hg18/locdb LOCDB /full/path/to/current/hg18/refdb REFDB /full/path/to/current/hg18/seqdb SEQDB
--metameta and --resources are configuration options; both are optional. --metameta points to a file with information about the meta-information tags that will be used in this project. Specifically, data/meta.meta contains information to indicate which variant tags in the VCFs should be considered as unvarying population-wide attributes of the variant (denoted as STATIC) versus sample-specific properties. This information is used when merging information from multiple sources.
The resources archive downloaded as part of this tutorial contains hg18 versions of the core PLINK/Seq reference databases: LOCDB (containing RefSeq gene transcripts), REFDB (containing dbSNPv130) and SEQDB (containing the hg18 sequence itself). By adding the --resources argument to the above command, and pointing to the hg18 folder, it will automatically attach these three databases.
Importing VCF data into a project
To load VCF files into the project, use the command:
pseq proj load-vcf --vcf data/*.vcf.gz
loading : CEU.exon.2010_09.genotypes.vcf.gz ( 90 individuals ) loading : TSI.exon.2010_09.genotypes.vcf.gz ( 66 individuals ) CEU.exon.2010_09.genotypes.vcf.gz : inserted 3489 variants TSI.exon.2010_09.genotypes.vcf.gz : inserted 3281 variants
Each file has a number (reflecting the order in which it was loaded
into the VARDB) and also a tag. The tag can be changed to
provide a convenient identifier for each file in a project, using the
tag-file command:
pseq proj tag-file --id 1 --name CEU
pseq proj tag-file --id 2 --name TSI
pseq proj var-summary
---Variant DB summary--- 4449 unique variants File tag : CEU (3489 variants, 90 individuals) File tag : TSI (3281 variants, 66 individuals)var-summary displays the number of individuals and variants in a VARDB. These values correspond to those in the 1000 Genomes manuscript, so it appears the data have been successfully imported.
We can now perform all the commands as above, but referring to the project instead of a single VCF file:
pseq proj v-view
chr1:1105366 rs111751804 T/C . 2 PASS PASS chr1:1105411 rs114390380 G/A . 2 PASS PASS chr1:1108138 rs61733845 C/T . 2 PASS PASS chr1:1110240 rs116321663 T/A . 2 PASS PASS chr1:1110294 rs1320571 G/A . 2 PASS PASS chr1:3537996 rs2760321 T/C . 2 PASS PASS chr1:3538692 rs2760320 G/C . 2 PASS PASS chr1:3538715 rs114376964 T/C . 1 PASS chr1:3541597 rs116230480 C/T . 1 PASS chr1:3541599 rs115982402 C/T . 1 PASS ...
The fifth column indicates that the variant was seen in two files in some cases. The final column lists the gene annotation of this variant, as well as the FILTER values for each sample. You can obtain sample-specific meta-information by adding the --samples flag along with --vmeta:
pseq proj v-view --samples --vmeta
chr1:1105366 rs111751804 T/C . 2 PASS PASS AA=T;BN=132;GP=1:1115503 . chr1:1105366 rs111751804 T/C CEU PASS AA=T;AC=4;AN=114;DP=3251;BN=132;GP=1:1115503 chr1:1105366 rs111751804 T/C TSI PASS AA=T;AC=3;AN=130;DP=4820;BN=132;GP=1:1115503 ...
Note that --vmeta prints three lines for the same variant. The first row displays the consensus sample, the population/whole-project representation of the variant. Only tags specified as STATIC in data/meta.meta are displayed in the consensus representation; other meta information tags are assumed to be sample specific. The next two rows show sample specific information; the fourth indicates the sample via the file tags that were set earlier.
Loading individual/phenotypic information
An arbitrary number of phenotypes can be imported into a project. Phenotypes can be integer or floating-point quantitative traits; additionally, string values can be loaded (as factors). The format for phenotype files is described here:
head data/pop.phe
##phe1,Integer,-9,"1/2 = TSI/CEU" #ID phe1 NA20504 1 NA20506 1 NA20508 1 NA20509 1 NA20510 1 NA20511 1 NA20512 1 NA20513 1
In this example, we have a single integer phenotype named phe1. As the header suggests, this phenotype actually indicates population membership in our example dataset, with values of 1 and 2 representing TSI and CEU respectively. Later in this tutorial we will perform dummy case/control analyses using this phenotype (i.e. contrasting TSI and CEU samples). As is standard in PED files, a value of 2 indicates a case; a value of 1 indicates a control; a value of -9 means missing.
To import phenotype information, use the load-pheno command:
pseq proj load-pheno --file data/pop.phe
Individual phenotypic data can subsequently be viewed with the i-view command:
pseq proj i-view
#phe1 (Integer) "1/2 = TSI/CEU" #PHE . #STRATA . #ID FID IID MISS SEX PAT MAT META NA06984 . . 0 0 0 0 phe1=2 NA06985 . . 0 0 0 0 phe1=2 NA06986 . . 0 0 0 0 phe1=2 NA06989 . . 0 0 0 0 phe1=2 NA06994 . . 0 0 0 0 phe1=2 NA07000 . . 0 0 0 0 phe1=2 ...
Multiple phenotypes can be attached to individuals, and will be shown in the META field. This database also tracks family-based ID schemes (FID and IID being family and individual ID); a flag to indicate missing data (MISS), sex (SEX), paternal and maternal IDs (PAT and MAT).
The PHE and STRATA fields in the header will indicate if a particular phenotype or stratifying variable has been set or created (via the --pheno and --strata command lines, see the PSEQ documentation for details).
The project's reference databases
The reference databases in this tutorial represent a cut-down (and already out-of-date) selection of some of the types of data that might routinely be of use when analysing genetic variation datasets: information on genes (RefSeq transcripts), variation (dbSNP) and sequence (hg18). More comprehensive reference datasets are (or soon will be) available from the primary resources/download page of this site. See the main PSEQ documentation for information on how to create your own reference databases.
Here we confirm the contents of the reference databases via a series of summary commands: first the LOCDB:
pseq proj loc-summary
---Locus DB summary--- Group : refseq (30072 entries) n/a
Similarly, we can request a description of the REFDB (that contains 18M variants from the v130 release of dbSNP):
pseq proj ref-summary
---Reference DB summary--- Group : dbSNP130 (18833531 entries) : dbSNP130 (default name)
and SEQDB:
pseq proj seq-summary
---Sequence DB summary--- chr1:1..247249719 MB=247 chr2:1..242951149 MB=242 chr3:1..199501827 MB=199 chr4:1..191273063 MB=191 chr5:1..180857866 MB=180 chr6:1..170899992 MB=170 chr7:1..158821424 MB=158 chr8:1..146274826 MB=146 chr9:1..140273252 MB=140 chr10:1..135374737 MB=135 chr11:1..134452384 MB=134 chr12:1..132349534 MB=132 chr13:1..114142980 MB=114 chr14:1..106368585 MB=106 chr15:1..100338915 MB=100 chr16:1..88827254 MB=88 chr17:1..78774742 MB=78 chr18:1..76117153 MB=76 chr19:1..63811651 MB=63 chr20:1..62435964 MB=62 chr21:1..46944323 MB=46 chr22:1..49691432 MB=49 chrX:1..154913754 MB=154 chrY:1..57772954 MB=57 chrM:1..16571 MB=0 SEQDB meta-information: BUILD = hg18 SEQDB meta-information: DESC = from-UCSC-6-aug-2010 SEQDB meta-information: IUPAC = 0 SEQDB meta-information: NAME = hg18 SEQDB meta-information: REPEATMODE = lower
We now have a fully populated PLINK/SEQ project and we can begin analysis.
Viewing the data, with filters (masks)
Basic views and filters
As described above, the v-view command can be used for viewing variant and genotype data, which can be modified by other options such as --samples, --vmeta, --geno and --gmeta, as described here. To obtain human-readable versions of the meta-information, add the --verbose argument:
pseq proj v-view --vmeta --samples --verbose --mask limit=1
chr1:1105366 rs111751804 T/C . 2 PASS PASS AA = T BN = 132 GP = 1:1115503 chr1:1105366 rs111751804 T/C CEU PASS AA = T AC = 4 AN = 114 DP = 3251 BN = 132 GP = 1:1115503 chr1:1105366 rs111751804 T/C TSI PASS AA = T AC = 3 AN = 130 DP = 4820 BN = 132 GP = 1:1115503
Individual-level genotype data
To view phenotype information along-side genotype calls:
pseq proj v-view --geno --phenotype phe1 --mask limit=1
chr1:1105366 rs111751804 T/C . 2 PASS PASS NA06984 CEU CASE ./. NA06985 CEU CASE ./. NA06986 CEU CASE T/T NA06989 CEU CASE ./. NA06994 CEU CASE ./. NA07000 CEU CASE T/T NA07037 CEU CASE T/T ... NA20504 TSI CONTROL T/T NA20506 TSI CONTROL T/T NA20508 TSI CONTROL T/T NA20510 TSI CONTROL T/T NA20511 TSI CONTROL T/T ...
To also add genotype-specific meta-information, add the --gmeta flag. The specific fields that are shown or hidden can be modified with the --show and --hide commands.
There are a number of variations on the basic v-view command that are designed to look at individual level genotype data (the examples below use --mask to select two particular sites for illustration.) The rv-view only lists the rarer genotypes per site:
pseq proj rv-view --mask reg=chr13:27794979,chr13:27795044
chr13:27794979:rs56314249 . C/T 2 PASS PASS NA07051 CEU C/T NA12717 CEU C/T NA12874 CEU C/T NA20506 TSI C/T NA20538 TSI C/T NA20773 TSI C/T NA20828 TSI C/T chr13:27795044:rs41291686 . C/T 1 PASS NA12155 CEU C/T
This command can be combined with --gmeta and --phenotype too. The mv-view lists multiple variants together (output is similar to the g-view command, but where the variants are not necessarily existing within a predefined group):
pseq proj mv-view --mask reg=chr13:27794979,chr13:27795044
NAME=[] N(V)=2 N(I)=156 V1 chr13:27794979 V2 chr13:27795044 I V1 V2 NA06984 C/C C/C NA06985 C/C C/C NA06986 C/C C/C NA06989 C/C C/C NA06994 C/C C/C NA07000 C/C C/C NA07037 C/C C/C NA07048 C/C C/C NA07051 C/T C/C NA07346 C/C C/C ...
Finally, mrv-view only lists the genotypes that contain a minor allele in at least one of the set of multiple variants, here combined with the --phenotype command:
pseq proj mrv-view --phenotype phe1 --mask reg=chr13:27794979,chr13:27795044
NAME=[] N(V)=2 N(I)=156 V1 chr13:27794979 V2 chr13:27795044 I PHE V1 V2 NA07051 CASE C/T C/C NA12155 CASE C/C C/T NA12717 CASE C/T C/C NA12874 CASE C/T C/C NA20506 CONTROL C/T ./. NA20538 CONTROL C/T ./. NA20773 CONTROL C/T ./. NA20828 CONTROL C/T ./.
Gene-centric views of genotype data
With the appropriate information in an attached LOCDB, we can filter (or group) variants based on their location in specific genes. The command g-view provides a summary of variants by gene. Assuming a group called refseq has been imported into the LOCDB (it has already, in the case of the LOCDB downloaded with this tutorial):
pseq proj g-view --mask loc.group=refseq
NAME=[NM_000021] N(V)=1 N(I)=156 V1 chr14:72742931 NAME=[NM_000043] N(V)=4 N(I)=156 V1 chr10:90752918 V2 chr10:90757462 V3 chr10:90758660 V4 chr10:90761809 NAME=[NM_000055] N(V)=7 N(I)=156 V1 chr3:166973974 V2 chr3:166973982 V3 chr3:167030263 V4 chr3:167030667 V5 chr3:167030704 V6 chr3:167030881 V7 chr3:167031223 ...
The loc.group mask element indicates that we will iterate over sets or groups of variants, rather than single variants. Here, the sets are defined by the members of the refseq group in the LOCDB, i.e. corresponding to genes. The loc.group option is heavily used in gene-based association tests. Note that the order of iteration is defined by the alpha-numeric value of the gene/set name, rather than genomic location.
LOCDBs can optionally contain tables of gene-name aliases: for
example, containing the RefSeq transcript IDs, the CCDS IDs as well as
the gene symbol. One convenience function is the gene mask
element, that matches genes (one or more RefSeq transcripts) based on
the gene symbol in (by default) a group called refseq.
pseq proj g-view --geno --mask gene=GRIK3
NAME=[NM_000831] N(V)=3 N(I)=156 V1 chr1:37098064 V2 chr1:37098185 V3 chr1:37110546 I V1 V2 V3 NA06984 C/C C/C G/G NA06985 A/C C/C G/G NA06986 A/A C/C G/G NA06989 A/C C/C G/G NA06994 A/C C/C G/G NA07000 A/C C/C G/A NA07037 A/A C/C G/G NA07048 A/A C/C G/G NA07051 A/A C/C G/G NA07346 A/A C/C G/G NA07347 A/A C/C G/G NA07357 A/A C/C G/G NA10847 ./. ./. G/G
The above command is identical to the following, explicitly setting the loc.group and that only a subset of sets from this group should be considered, via loc.subset:
pseq proj g-view --geno --mask loc.group=refseq loc.subset=refseq,NM_000831
To determine which individuals carry rare non-reference variants in a given gene, one can use something like the following: in this example, the mask specifies the filter on variants with between 1 and 5 copies of the minor allele (mac):
pseq proj g-view --phenotype phe1 --mask gene=GRIK3 mac=1-5 --rarelist
NAME=[NM_000831] N(V)=2 N(I)=156 I NM_000831 NA07000 CASE 1 chr1:37110546;A/G I NM_000831 NA12234 CASE 1 chr1:37110546;A/G I NM_000831 NA12342 CASE 1 chr1:37110546;A/G I NM_000831 NA12842 CASE 1 chr1:37110546;A/G I NM_000831 NA12890 CASE 1 chr1:37098185;C/T I NM_000831 NA20538 CONTROL 1 chr1:37098185;C/T I NM_000831 NA20808 CONTROL 1 chr1:37098185;C/T V NM_000831 chr1:37098185 C 3 NA12890;C/T,NA20538;C/T,NA20808;C/T V NM_000831 chr1:37110546 G 4 NA07000;A/G,NA12234;A/G,NA12342;A/G,NA12842;A/G
The rows starting I indicate an individual who carries a rare allele; the fifth column is the number of variants they carry; the sixth column lists the variant(s) and the genotype(s) for that individual. The rows starting V convey similar information but for a variant, listing the (number of) individuals carrying at least one minor allele; the reference allele is also listed in each line of variant output.
Using masks to filter variants based on meta-information and other properties
Here we illustrate some more complex filters that can be applied to any dataset. For example, this command uses the include mask element to evaluate an arbitrary expression as true or false (that is a function of the meta-information for the variants): in this example, we require the dbSNP build number in which the variant first appeared (BN tag) to be 123 and for read-depth to be less then 2000. Three variants on chromosome 7 meet these criteria:
pseq proj v-view --vmeta --samples --show BN DP
--mask reg=chr7 include="BN==123 && DP<2000"
chr7:34833707 rs17170011 C/T . 1 PASS BN=123 . chr7:34833707 rs17170011 C/T TSI PASS DP=1589;BN=123 chr7:120757061 rs17143291 C/A . 1 PASS BN=123 . chr7:120757061 rs17143291 C/A TSI PASS DP=624;BN=123 chr7:120759251 rs17143296 G/A . 1 PASS BN=123 . chr7:120759251 rs17143296 G/A TSI PASS DP=502;BN=123
The include element works at the level of sample-variants, not the consensus variant. In the above example, rs17170011 was seen in both the CEU and TSI samples, but DP in CEU was greater than 2000. As such, CEU was filtered out, but TSI was kept in. We can confirm this with:
pseq proj v-view --samples --vmeta --mask reg=chr7:34833707
chr7:34833707 rs17170011 C/T . 2 PASS PASS AA=C;BN=123;GP=7:34867182;HM2;HM3 . chr7:34833707 rs17170011 C/T CEU PASS AA=C;AC=18;AN=178;DP=6736;BN=123;GP=7:34867182;HM2;HM3 chr7:34833707 rs17170011 C/T TSI PASS AA=C;AC=4;AN=120;DP=1589;BN=123;GP=7:34867182;HM2;HM3
Masks are described in more detail here. We will consider a few more examples below. As well as meta-information supplied in tags, we can filter on other properties of the variant: for example, to focus on variants with a lot of no-calls in the CEU sample:
pseq proj v-view --mask file=CEU null=80-
chr6:30018537 rs112039894 T/C . 1 PASS chr6:30018583 rs113699656 C/A . 1 PASS chr6:30018721 rs78073371 G/C . 1 PASS chr6:30018780 rs80246530 C/A . 1 PASS chr7:156435262 rs6969990 C/G . 1 PASS chr8:98857209 rs2449509 G/C . 1 PASS chr9:204706 rs540473 G/C . 1 PASS chr22:48958933 rs5771206 A/G . 1 PASS
The null mask element only displays variants with the number of null (missing) genotypes in the range specified null=n,m. For convenience, null=10 implies null=0-10. Thus, the mask above shows only variants with 80 or more missing calls (of the 90 CEU individuals). Looking at some of these variants in more detail, we see that read depth is very low, or 0, for many individuals:
pseq proj v-view --gmeta --mask reg=chr6:30018780 file=CEU
chr6:30018780 rs80246530 C/A . 1 PASS NA06984 CEU ./. [DP=3] NA06985 CEU ./. [DP=0] NA06986 CEU ./. [DP=4] NA06989 CEU ./. [DP=1] NA06994 CEU ./. [DP=0] NA07000 CEU ./. [DP=0] NA07037 CEU ./. [DP=0] NA07048 CEU ./. [DP=2] NA07051 CEU ./. [DP=5] NA07346 CEU C/C [DP=6] NA07347 CEU ./. [DP=5] NA07357 CEU ./. [DP=0] NA10847 CEU ./. [DP=1] NA10851 CEU ./. [DP=0] NA11829 CEU ./. [DP=0] NA11830 CEU ./. [DP=0] NA11831 CEU ./. [DP=2] NA11832 CEU ./. [DP=1] NA11840 CEU ./. [DP=0] ...
To output only non-null genotypes when using --geno or --gmeta, add the hide-null argument:
pseq proj v-view --mask reg=chr6:30018780 --geno --gmeta --hide-null
chr6:30018780 rs80246530 C/A . 2 PASS PASS NA07346 CEU C/C [DP=6] NA12878 CEU A/A [DP=16] NA20543 TSI C/A [DP=2] NA20805 TSI C/C [DP=6]
Similarly, to output only non-reference calls from a --geno or --gmeta command:
pseq proj v-view --mask reg=chr6:30018780 --gmeta --only-alt
chr6:30018780 rs80246530 C/A . 2 PASS PASS NA12878 CEU A/A [DP=16] NA20543 TSI C/A [DP=2]
(Note: to show only minor alleles, based on sample frequency rather than reference sequence, add the option --only-minor -- although this is equivlant to the rv-view command)
In this case, we appear to see some non-reference variants called on a relatively small number of reads (e.g. DP=2 in this last example). To scan for similar occurrences, of variants being called from very few reads, one could use a command such as:
pseq proj rv-view --gmeta --mask geno=DP:eq:1 monomorphic.ex
chr4:1378413 rs78309237 T/C . 2 PASS PASS NA12400 CEU C/T [DP=1] chr5:178291556 rs34499286 A/G . 2 PASS PASS NA20506 TSI A/G [DP=1] chr6:30019298 rs115073453 G/T . 2 PASS PASS NA20513 TSI G/T [DP=1] chr7:5993422 rs116094787 G/A . 1 PASS NA20538 TSI A/G [DP=1] chr17:41429726 rs114553892 A/G . 2 PASS PASS NA11881 CEU A/G [DP=1] chr19:21398559 rs10414834 C/G . 2 PASS PASS NA12234 CEU C/G [DP=1] NA12750 CEU C/G [DP=1] NA20807 TSI C/G [DP=1] chr19:21399063 rs115610555 C/T . 2 PASS PASS NA11893 CEU C/T [DP=1] chr20:61667328 rs310633 A/G . 2 PASS PASS NA12414 CEU A/G [DP=1]
The above mask only considers sites that are variant (monomorphic.ex) after looking only at genotypes called on the basis of a single read (gene=DP:eq:1). (Follow this link for a description of the syntax of the geno mask element.) It is worth noting that much of the 1000 Genomes data was generated using haplotype information to effectively impute missing data across sites with little coverage to make more accurate calls. If such calls were not made using an LD-aware genotyper (i.e. considering the genotypes at nearby, better covered variant sites), one might want to investigate the accuracy of these particular calls further.
To focus on variants that are seen in both samples, CEU and TSI, you could either you the obs.file.req to specify a list of file-tags for which you required to see that position observed (whether or not a non-reference, alternate allele is at that position in a given file):
pseq proj v-view --mask obs.file.req=CEU,TSI
Equivalently, we could have used obs.nfile=2 to acieve similar results in this dataset (to list variants observed in at least 2 samples).
To count the sites for which a non-reference allele was observed only in TSI, we could use (showing only the first line of output):
pseq proj v-stats --mask alt.file=TSI alt.file.ex=CEU
NVAR 963
Similarly, the number of sites only seen in CEU:
pseq proj v-stats --mask alt.file=CEU alt.file.ex=TSI
NVAR 1167
and those for which a non-reference allele is seen in both:
pseq proj v-stats --mask alt.file.req=CEU,TSI
NVAR 2313
Note that these three numbers sum to 4443, which is less than the 4449 unique variants reported by vardb-summary above. This implies there must be some monomorphic or completely null calls included in these files:
pseq proj v-view --mask alt.file.ex=CEU,TSI
chr5:140870754:rs114669158 . G/A 1 PASS chr6:30019035:rs111457285 . A/C 1 PASS chr6:30019094:rs115176787 . G/C 2 PASS PASS chr11:408735:rs12792208 . G/A 1 PASS chr17:44939003:rs116077510 . C/T 1 PASS chr19:2199303:rs115008403 . G/A 1 PASS
(Note that the total number of actual monomorphic variants is higher, due to a number of sites that are monomorphic for the alternate allele in this sample (i.e. representing an error, or a variant, in the reference sequence):
pseq proj v-stats --mask monomorphic
NVAR 44
Outputting data in different formats
The v-matrix command is designed to produce rectangular-format files of counts of non-reference alleles, e.g. for subsequent analysis in a program such as R. In this particular example, we restrict the file to only two individuals, for a particular gene:
pseq proj v-matrix --mask gene=NLRP11 indiv=NA20808,NA20774
VAR ALT REF NA20774 NA20808 chr19:60988831:rs11671248 G A 0 0 chr19:60988976:rs116820715 T G 0 0 chr19:60999365:rs114591455 G A NA NA chr19:60999395:rs2616940 A G 1 1 chr19:61012012:rs16986626 C T 1 0 chr19:61012383:rs114531512 A G 0 0 chr19:61012475:rs12461110 G A 0 0 chr19:61012625:rs111287552 T C NA NA chr19:61013107:rs12977654 G A 0 0 chr19:61013226:rs299163 C A 2 2 chr19:61013329:rs7249635 T C 1 0 chr19:61021146:rs8113630 G C 0 1 chr19:61021206:rs114117210 T C 0 0
To output a table of variant meta-information, use the meta-matrix command; in this example, we restrict the output to a subset of the variants, samples and tags:
pseq proj meta-matrix --show AN AA DP BN --mask reg=chr22 file=TSI
CHR POS ID AA BN AN DP 22 16042793 rs7289170 G 116 128 7403 22 16049306 rs2231495 T 98 126 4036 22 18338811 rs35219372 c 126 128 3852 22 18338829 rs5993890 g 114 126 2238 22 18340512 rs115344498 A 132 130 5179 22 18343258 rs116249498 G 132 132 5849 22 18347480 rs74544696 C 131 130 2068 22 18348971 rs2073748 G 96 70 362 22 18349043 rs80068543 C 131 108 904 ...
Writing to other formats (including PLINK transposed PED files, BEAGLE format genotype likelihood files, VCF and the internal VARDB format) is described here.
Summary statistics for samples, individuals and genes
The three main functions to produce summaries for the whole sample, each individual and each gene/set are v-stats, i-stats and g-stats respectively.Summary statistics for all variants
The v-stats command provides a range of basic information on: a) the number of variants and non-reference allele frequency distribution, b) FILTER and HWE failure rates, c) transition/transversion ratio, d) read-depth and quality score distributions, e) novelty with respect to dbSNP and/or 1000 Genomes data, f) breakdown by functional class (coding/silent) and singleton status, g) summaries of other variant-level and individual-level tags. We saw an example of the basic v-stats command on the CEU VCF above. Here we apply it to the combined CEU/TSI project, additionally requesting information about the proportion of variants that are in HapMap2 or HapMap3 (i.e. the flag is interpreted as a 0/1 variable):
pseq proj v-stats --stats mean=HM2,HM3
NVAR 4449 RATE 0.707462 MAC 21.6458 MAF 0.0818045 SING 1692 MONO 44 TITV 3.52134 TITV_S 3.75281 DP NA QUAL NA PASS 1 PASS_S 1 MEAN|HM2 0.289503 MEAN|HM3 0.229715
Note that the count option returns similar information in this case (as HM2 and HM3 are simple binary flags):
pseq proj v-stats --stats count=HM2,HM3
.. HM2|set 1288 HM3|set 1022
(i.e. as 1288 / 4449 is 0.289503). The options extend to other types of meta-information, e.g.:
pseq proj v-stats --stats count=AA
... AA|- 22 AA|. 102 AA|A 569 AA|C 1350 AA|G 1509 AA|N 36 AA|T 583 AA|a 37 AA|c 97 AA|g 99 AA|t 45
and prespecified ranges can be added:
pseq proj v-stats --mask file=CEU
--stats count=DP:0-100,DP:200-1000,DP:1000-5000,DP:5000-
DP|0:100 8 DP|200:1000 88 DP|1000:5000 918 DP|5000:* 2472
i.e. 918 variants in CEU has a DP between 1000 and 5000. Note we performed this analysis only on the CEU sample -- because DP is a sample-specific variant attribute, that is not automatically merged to create a single per-variant value, it is only accessible in this context when looking at a single file.
The ref and loc options can be used to display the proportion of sites that either in a specific REFDB (observed as a variant) or LOCDB (at least one region that spans that variant).
pseq proj v-stats --stats ref=dbSNP130 loc=refseq
REF|dbSNP130 0.570915 LOC|refseq 0.996179
Naturally, this assumes that a REFDB and LOCDB are attached and contain the relevant groups.
One can use the usual --mask options: for example, here we investigate the relationship between allele frequency and which build of dbSNP (indicated by the BN tag) that variant first appeared in: (here also illustrating how the mac option can be used to get coutns of variants in particular minor allele count bins):
pseq proj v-stats --stats mac=1,2-4,5-10 --mask include="BN < 120 "
NVAR 1287 MAF 0.195623 MAC|1 62 MAC|2:4 78 MAC|5:10 108
pseq proj v-stats --stats mac=1,2-4,5-10 --mask include="BN >= 120 "
NVAR 3162 MAF 0.035478 MAC|1 1630 MAC|2:4 647 MAC|5:10 328
We see many more singletons and the mean MAF is much lower for the variants not added until later dbSNP builds, which is as we'd expect (particularly as 1000 Genomes data contributed to these later builds).
The v-stats command is described more fully here.
Summary statistics per individual
The i-stats command calculates per-individual metrics such as Ti/Tv, dbSNP percentage, the number of singletons, etc.
pseq proj i-stats
ID NALT NMIN NHET NVAR RATE SING TITV PASS PASS_S QUAL DP NA06984 719 538 480 3162 0.90627 1043 3.7193 538 13 NA 7410.61 NA06985 655 492 420 3144 0.90111 1038 3.43243 492 11 NA 7424.63 NA06986 773 607 503 3437 0.98509 1072 3.56391 607 22 NA 6983.5 NA06989 699 506 469 3130 0.89710 1048 3.2521 506 11 NA 7663.52 NA06994 591 438 377 3002 0.86041 1025 3.5625 438 4 NA 7672.17 NA07000 802 591 517 3388 0.97105 1067 3.51145 591 11 NA 6810.01 NA07037 800 607 512 3374 0.96703 1072 3.63359 607 5 NA 6874.12 NA07048 817 650 607 3373 0.96675 1073 3.27632 650 15 NA 7135.38 NA07051 825 624 507 3451 0.98910 1079 3.07843 624 13 NA 6664.27 NA07346 811 597 512 3419 0.97993 1073 3.62791 597 4 NA 6887.14 NA07347 826 599 505 3417 0.97936 1070 3.53788 599 16 NA 6674.78 ...
Here, NALT, NMIN and NHET are the number of non-missing genotypes that contain an alternate allele, a minor allele, or are heterozygous. NVAR is the total number of non-missing variants, RATE is the genotyping rate for that individual. The following quantities (similarly defined as for v-stats) refer only to variants for which that particular individual carried a minor allele. The QUAL fields are all NA as this field is not populated in the 1000 Genomes VCFs.
Adding the gmean option (to calculate the mean of a genotypic, per-individual, tag) adds two further columns:
pseq proj i-stats --stats gmean=DP
G|DP NRG|DP 43.8538 50.0807 13.867 15.5359 137.214 130.093 47.8811 54.4649 12.1152 14.5042 33.1619 31.3329 32.6406 30.0087 62.4786 61.9021 145.679 132.879 107.561 101.276 ...
Here, G|DP is the mean of the per-individual DP tag for all non-missing genotypes called for that individual; NRG|DP is similar, but considers only genotypes for which that individual carries a non-reference allele.
For illustration: we can use this command to investigate some basic properties of the callset, for example, the relationship between read depth and sensitivity to detect variants (as indexed by the number of singletons) in the CEU sample. First, we can run:
pseq proj i-stats --mask file=CEU mac=1-1 --stats gmean=DP > istats.txt
and then, in R for example:
d <- read.table( "istats.txt" , header = T )
plot( d$PASS_S / d$NVAR , d$NRG.DP )
There is a substantial correlation between these two measures (Pearson's correlation coefficient is almost 0.5, with a p-value of 3e-6) after removing the single outlier with very high coverage. This suggests, as expected, that singleton rate in a sample may be quite dependent on the mean coverage for that individual (at least within certain ranges of coverage). In this way, the i-stats command can be useful for generating datasets that can be related to phenotypic outcomes, to determine how comparable, for example, cases and controls are in technical and other ways.
The i-stats command is described more fully here.
Summary statistics per gene
Finally, similar summary metrics can be calculated on a per-gene basis with the g-stats command (which requires a grouping variable to be given, e.g. via a loc.group:
pseq proj g-stats --mask loc.group=refseq
As for v-stats and i-stats, additional options can be added, as described here.
Lookups and filtering the data
In this section we consider a number of ways in which external references and annotations can be intersected with a project's sequence data.
Functional annotation of variants
To annotate the coding status of variants given a LOCDB containing GTFs and a SEQDB, use the counts (or g-counts) command along with --options annotate:
pseq proj counts --phenotype phe1 --annotate refseq --mask reg=chr22
VAR REF/ALT MINOR CNTA CNTU TOTA TOTU FUNC GENE chr22:16042444 C/G G 1 0 178 0 missense NM_017424,NM_177405 chr22:16042793 A/G G 61 44 178 128 silent NM_017424,NM_177405 chr22:16049306 T/C C 43 43 148 126 missense NM_017424,NM_177405 chr22:17729354 G/A A 1 0 180 0 silent NM_003325 chr22:18338811 C/T T 3 1 144 128 silent NM_001670 chr22:18338829 G/A A 5 4 148 126 silent NM_001670 chr22:18340512 A/G G 0 1 0 130 missense NM_001670
Filter on variants not in dbSNP(130)
To extract variants that are novel, in the sense of not being present in a particular build of dbSNP, use the following mask, i.e. assuming a group called dbSNP130 (case sensitive) exists in the REFDB:
pseq proj v-view --mask ref.ex=dbSNP130
chr1:1105366 rs111751804 T/C . 2 PASS PASS chr1:1105411 rs114390380 G/A . 2 PASS PASS chr1:1110240 rs116321663 T/A . 2 PASS PASS chr1:3538715 rs114376964 T/C . 1 PASS chr1:3541597 rs116230480 C/T . 1 PASS chr1:3541599 rs115982402 C/T . 1 PASS chr1:3542416 rs113614277 C/T . 1 PASS chr1:3545211 rs114697698 G/A . 1 PASS ...
To consider only known dbSNP variants, one would replace ref.ex with ref.
Append HGMD annotations
Note: The HGMD reference dataset mentioned below is not part of the tutorial download and so the commands are shown for illustration only.
The Human Gene Mutation Database represents (in the creator's words) an attempt to collate known (published) gene lesions responsible for human inherited disease. It is only available to registered users and so cannot be included as a resource database on this site. However, for registered users it is easy to download the database, and upload it, as a flat-file into a PLINK/Seq REFDB. See instructions on how to build this database here.
Assuming, then, a separate database named (for example) hgmd.db exists in the hg18 folder, the following outputs in verbose form all HGMD SNPs in the CEU/TSI pilot 3, along with selected tags from the HGMD database: (database not included with tutorial)
pseq proj v-view --verbose
--refdb hg18/hgmd.db
--mask ref=hgmd
chr1:151920381:rs28730726 . G/C 1 PASS AA = G BN = 125 GP = 1:153653757 hgmd flag set hgmd_ID = rs28730726 hgmd_DESC = Essential hypertension, association with hgmd_GENE = NPR1 hgmd_TYPE = mis/non hgmd_VALUE = DP chr1:195678032:rs62636285 . G/A 1 PASS AA = G BN = 129 GP = 1:197411409 hgmd flag set hgmd_ID = rs62636285 hgmd_DESC = Leber congenital amaurosis hgmd_GENE = CRB1 hgmd_TYPE = mis/non hgmd_VALUE = DM ...
We can look at the general properties of variants in the CEU and TSI samples that are in (or not in) HGMD, by using v-stats combined with a ref mask option:
pseq proj v-stats --mask ref=hgmd --refdb hg18/hgmd.db
NVAR 117 RATE 0.796515 MAC 35.5385 MAF 0.126484 SING 21 MONO 0 TITV 2.44118 TITV_S 3.2
In other words, 117 of the variants in HGMD are observed in the CEU/TSI sample; 21 are singletons but most are quite common, as the mean MAF is above 10%.
pseq proj v-stats --mask ref.ex=hgmd --refdb hg18/hgmd.db
NVAR 4332 RATE 0.705057 MAC 21.2705 MAF 0.0805977 SING 1671 MONO 44 TITV 3.56 TITV_S 3.76068
Note how in these examples we used --refdb to swap in a different REFDB (i.e. that contains a group called hgmd in this case).
Intersect a list of variants (e.g. from another study)
The lookup command is designed to take a plain text file of sites or regions of interest and return information from the project's databases. For example, if we had the following list of sites/regions (note the required format):
cat data/lookup.txt
chr22:99 chr22:20328280 chr22:20648538 chr22:29187373 chr22:43000000..44000000
then specifying this file with the --file option (and optionally setting a phenotype to obtain case/control allele counts):
pseq proj lookup --file data/lookup.txt --ref dbSNP130 --loc refseq --phenotype phe1
##nvar,1,Integer,"Number of overlapping variants" ##var,1,String,"Variant ID" ##case,1,Integer,"Case minor allele counts" ##con,1,Integer,"Control minor allele counts" ##seqdb_ref,1,String,"SEQDB reference sequence" chr22:99 seqdb_ref N chr22:99 var NA chr22:99 case NA chr22:99 con NA chr22:99 nvar 0 chr22:20328280 seqdb_ref G chr22:20328280 nvar 1 chr22:20328280 var chr22:20328280 chr22:20328280 case 4 chr22:20328280 con 0 chr22:20328448 seqdb_ref G chr22:20328448 nvar 1 chr22:20328448 var chr22:20328448 chr22:20328448 case 0 chr22:20328448 con 1 ... chr22:43000000..44000000 seqdb_ref . chr22:43000000..44000000 nvar 6 chr22:43000000..44000000 var_1 chr22:43657667 chr22:43000000..44000000 case_1 3 chr22:43000000..44000000 con_1 1 chr22:43000000..44000000 var_2 chr22:43670607 chr22:43000000..44000000 case_2 1 chr22:43000000..44000000 con_2 1 chr22:43000000..44000000 var_3 chr22:43690908
The output indicates the variants in the CEU/TSI that overlap with each site/region in lookup.txt, along with the ID (VAR), and in this case, counts of non-reference alleles for "cases" and "controls" (i.e. TSI and CEU).
If the options ref and loc are added, then additional information is given about the entires in REFDB and LOCDB that match with variants in the lookup list. For example:
pseq proj lookup --file data/lookup.txt --ref dbSNP130 --loc refseq --phenotype phe1
This will add lines with ref_ and loc_ codes, interleaved with the original variant rows: e.g.
... chr22:29190830 seqdb_ref C chr22:29190830 nvar 1 chr22:29190830 var chr22:29190830:rs2269961 chr22:29190830 cnt 59 chr22:29190830 loc_refseq chr22:29186011..29197945:NM_174975 chr22:29190830 ref_dbSNP130 .:22..29190830:1:29190830:. ...
Examining genotype-phenotype relationships
In ths section we consider a number of functions that look at allele and genotype frequency distributions and their relation to phenotype.
Variant frequencies: allelic, genotypic and HWE
To list variants that deviate markedly from expected Hardy-Weinberg genotype frequencies, use the hwe mask option combined with the v-freq command. This command generates a number of fields, so we use the gcol utility to extract only these two fields:
pseq proj v-freq --mask hwe=0:1e-7 | gcol VAR HWE
VAR HWE chr10:46507377 9.16912e-08 chr10:46507507 3.31099e-08 chr16:54420192 4.60983e-14 chr16:54420384 5.61696e-08
The argument for hwe is in the form m:n, i.e. to only include variants with a HWE p-value within this range.
Here, we illustrate how to attach meta-information to an existing variant database, by creating a flag to indicate, in this example, HWE-failing variants. As described here, we must first create a suitably formatted file, with headers describing the fields, followed by tab-separated variant, tag key, tag value fields: e.g. the file hw.txt:
##hwd,1,Flag,"Fails HW test at p<1e-7" chr10:46507377 hwd 1 chr10:46507507 hwd 1 chr16:54420192 hwd 1 chr16:54420384 hwd 1
These tags are imported into the VARDB with the attach-meta command:
pseq proj attach-meta --file hw.txt --id CEU TSI --group hwd
The --id command indicates that the tag should be applied to variants (if they exist) in both CEU and TSI samples in this example: that is, tags are applied to specific files in the VARDB.
To subsequently attach this meta-tag requires the meta.attach mask option:
pseq proj v-view --mask meta.attach=hwd reg=chr10:46507377 --vmeta --samples
chr10:46507377 rs1048156 G/A . 2 PASS PASS AA=g;BN=86;GP=10:47087371;HM2 . chr10:46507377 rs1048156 G/A TSI PASS AA=g;AC=41;AN=128;DP=8341;BN=86;GP=10:47087371;HM2;hwd chr10:46507377 rs1048156 G/A CEU PASS AA=g;AC=44;AN=176;DP=20753;BN=86;GP=10:47087371;HM2;hwd
Otherwise, these tags are just like other tags that can be used in include expressions, contain integers, floats or strings, etc.
As an alternative to adding a flag to certain variants, one could also take the following approach: if the file hw2.txt contains the text
chr10:46507377 chr10:46507507 chr16:54420192 chr16:54420384
then these sites can be included with reg or reg.req (or excluded with reg.ex) with the @ file inclusion operator, as follows:
pseq proj v-freq --mask reg=@hw2.txt | gcol VAR HWE MAF HET
VAR HWE MAF HET chr10:46507377 9.16912e-08 0.279605 0.559211 chr10:46507507 3.31099e-08 0.289116 0.578231 chr16:54420192 4.60983e-14 0.391304 0.782609 chr16:54420384 5.61696e-08 0.3125 0.625
Finally, the g-counts command can print human-readable genotype counts, which can help interpret the deviations from HWE:
pseq proj g-counts --mask reg=@hw2.txt
VAR ALLELES GENOTYPES chr10:46507377 G/A ./.=4 A/G=85 G/G=67 chr10:46507507 C/T ./.=9 C/C=62 C/T=85 chr16:54420192 G/A ./.=41 A/G=90 G/G=25 chr16:54420384 C/A ./.=36 A/C=75 C/C=45
These instances of HW deviation represent cases where the alternate allele is very common, but never seen in homozygous form and likely represents a problem with alignment and variant calling. This holds for both CEU and TSI samples:
pseq proj g-counts --mask reg=@hw2.txt file=CEU
VAR ALLELES GENOTYPES chr10:46507377 G/A ./.=2 A/G=44 G/G=44 chr10:46507507 C/T ./.=3 C/C=31 C/T=56 chr16:54420192 G/A ./.=33 A/G=42 G/G=15 chr16:54420384 C/A ./.=17 A/C=43 C/C=30
pseq proj g-counts --mask reg=@hw2.txt file=TSI
VAR ALLELES GENOTYPES chr10:46507377 G/A ./.=2 A/G=41 G/G=23 chr10:46507507 C/T ./.=6 C/C=31 C/T=29 chr16:54420192 G/A ./.=8 A/G=48 G/G=10 chr16:54420384 C/A ./.=19 A/C=32 C/C=15
Multilocus allelic distributions
The v-dist command can be used to assess the broad comparability in population (and technical) terms between two groups (specified by a dichotomous, case/control phenotype):
pseq proj v-dist --phenotype phe1
--------------------------- All cases & controls ( 90 cases, 66 controls ) --------------------------- Singletons : chi-sq (1df) = 24.64 p = 6.919e-07 A / U Exp Obs Ratio 0 / 1 713.3 814 1.141 1 / 0 972.7 872 0.8965 Doubletons : chi-sq (2df) = 45.46 p = 1.345e-10 A / U Exp Obs Ratio 0 / 2 66.59 98 1.472 1 / 1 181.6 118 0.6498 2 / 0 123.8 156 1.26 Tripletons : chi-sq (3df) = 23.58 p = 3.062e-05 A / U Exp Obs Ratio 0 / 3 14.99 25 1.667 1 / 2 61.34 46 0.7499 2 / 1 83.65 69 0.8249 3 / 0 38.02 58 1.525
This command considers only sites with exactly 1, 2, 3 or (not shown) 4 minor alleles in the whole sample, and asks (within each frequency category) whether or not the minor alleles appear to be randomly distributed across the two groups, or whether we see (for example) more 2/0 and 0/2 doubletons than we'd expect.
Remembering that in this example case and control actually represent CEU and TSI group memership, we see this analysis indicates (as expected) that there are numerous frequency differences (if we assume that absence of a variant in a particular file implies high confidence that all calls are homozygous reference). In other words, looking only at doubletons (variants for which we observed exactly two copies of an alternate allele), these occur more often within either CEU or TSI, compared to between, than expected by chance. In a typical disease association study, one would hope not to observe signficant deviations from the v-dist test. Note that this approach ignores linkage disequilibrium between sites.
Genotypic similarity between all pairs of individuals
To calculate for each pair of individuals the number of genotypes for which they both carry a minor allele, use the ibs-matrix command. Here we restrict attention to lower frequency variants (only 2 to 5 copies observed) and add some options to get a list (rather than a matrix) as output:
pseq proj ibs-matrix --mask mac=2-5 --long-format --two-counts
NA06984 NA06985 0 600 NA06984 NA06986 0 629 NA06984 NA06989 0 616 NA06984 NA06994 0 591 NA06984 NA07000 0 626 NA06984 NA07037 0 630 NA06984 NA07048 2 630 NA06984 NA07051 0 631 NA06984 NA07346 1 627 NA06984 NA07347 1 625 ...
The third column is the number of genotypes where both members of the pair carry a minor allele; the fourth column is the total number of non-missing genotypes for the pair (after the mask has been applied). If we plot the ratio of these two columns, we observe at least one outlier pair:
In terms of absolute number of genotypes shared, these two pairs have markedly higher values than most pairs (over 80% of all pairs share 0, the next highest pair shares only 5):
NA12878 NA12891 19 667 NA12878 NA12892 10 669
Finding variants that are unique to particular groups or samples
To find out which doubletons are unique to the first pair from the example above, for example, we can use the unique command:
pseq proj unique --indiv NA12878 NA12891
2 0 chr2:27331735 AA=G;BN=121;GP=2:27478231;HM2;HM3 2 0 chr3:128873047 AA=C;BN=130;GP=3:127390357 2 0 chr5:74711006 AA=T;BN=129;GP=5:74675250 2 0 chr9:130515404 AA=G;BN=130;GP=9:131475583 2 0 chr10:104889187 AA=C;BN=130;GP=10:104899197 2 0 chr14:20625321 AA=G;BN=129;GP=14:21555481 2 0 chr15:59998858 AA=G;BN=130;GP=15:62211566
The first two columns indicate that it was observed for 2 out of 2 specified individuals, and 0 of N-2 remaining individuals. Options can be added to modify the behaviour of unique (i.e. to require at least M of N in one group, but no more than S of T in the remainings samples.
Another convenient way to filter variants in the context of a case control study is to use the mask options case.control, case.uniq and control.uniq:
pseq proj counts --phenotype phe1 --mask case.control=8-10,0-1
VAR ALLELES ALTA ALTU TOTA TOTU chr4:178518748 G/A 8 0 180 128 chr5:76164809 T/C 8 0 174 122 chr5:121515868 A/G 8 0 180 0 chr6:135580796 A/G 8 0 180 0 chr7:96585128 C/T 9 0 178 0 chr10:101635524 A/G 8 1 180 130 chr12:132097393 A/G 9 0 180 0 chr13:23141204 A/G 10 0 176 0 chr14:62244858 T/C 10 1 180 124 chr15:23477291 G/C 9 0 158 0 chr15:76876129 T/C 9 1 156 124 chr16:54420213 T/C 10 0 180 0 chr19:45778149 C/G 8 1 176 132
Note that case.uniq implies case.control=1-,0.
Single locus association
Single locus association for dichotomous case/control traits can be calculated with the v-assoc command:
pseq proj v-assoc --phenotype phe1 | gcol MINA MINU OBSA OBSU VAR OR P
Considering only variants significant at p<1e-4:
MINA MINU OBSA OBSU VAR OR P 78 28 86 63 chr6:7156817 2.90426 4.75622e-05 30 2 84 50 chr14:22619220 10.6522 5.25491e-05 102 39 86 57 chr15:23476187 2.8022 3.80189e-05
By default, these tests are based on Fisher's exact test. They can be based on either a fixed number, or an adaptive number, of permutations of the phenotype by adding --perm 10000 or --perm -1, respectively.
See the glm command for single locus tests framed in terms of regression (that allow for covariates and quantitative outcomes).
Gene-based assocaition tests for rare variants
For illustration of syntax only, here we perform a series of gene-based rare-variant tests, using a very simple test: comparing the burden of non-reference genotypes in cases compared to controls within a gene, evaluating the significance of the difference by permutation.
pseq proj assoc --phenotype phe1 --mask loc.group=refseq mac=1-10
LOCUS POS ALIAS NVAR TEST P I DESC NM_001055 chr16:28524914..28527421 CCDS32420.1,SULT1A1 4 BURDEN 0.00317749 0.0052 11/0 NM_001164814 chr14:22617256..22634277 ACIN1 8 BURDEN 0.00359251 0.0010 8/3 NM_014977 chr14:22617256..22634277 CCDS9587.1,ACIN1 8 BURDEN 0.0042879 0.0012 8/3 NM_020186 chr7:96585128..96585128 CCDS5648.1,ACN9 1 BURDEN 0.0034055 0.0058 9/0 NM_024837 chr15:47939741..48153645 CCDS32238.1,ATP8B4 10 BURDEN 0.000176587 5e-05 22/2 NM_177529 chr16:28524914..28527421 CCDS32420.1,SULT1A1 4 BURDEN 0.00373433 0.0058 11/0 NM_177530 chr16:28524914..28527421 CCDS32420.1,SULT1A1 4 BURDEN 0.00306614 0.0054 11/0 NM_177534 chr16:28524914..28527421 CCDS32420.1,SULT1A1 4 BURDEN 0.00231443 0.0059 11/0 NM_177536 chr16:28524914..28527421 CCDS10637.1,SULT1A1 4 BURDEN 0.00306614 0.0054 11/0
We can see what is driving this signal, e.g. for SULT1A1, with the mrv-view command:
pseq proj mrv-view --phenotype phe1 --mask gene=SULT1A1 mac=1-10
NAME=[] N(V)=4 N(I)=156 V1 chr16:28524914:rs28374453 V2 chr16:28525008:rs41278160 V3 chr16:28525053:rs3176926 V4 chr16:28527421:rs1126446 I PHE V1 V2 V3 V4 NA07048 CASE ./. C/C ./. A/G NA07347 CASE A/A C/C C/C A/G NA11843 CASE A/A C/C C/G A/A NA12144 CASE A/A C/T ./. ./. NA12249 CASE A/A C/C C/G A/A NA12272 CASE A/A C/T G/G ./. NA12275 CASE A/A C/C C/G ./. NA12286 CASE A/G C/C C/C A/A NA12891 CASE A/A C/C C/G A/G
The BURDEN test is based on the number of genotypes carrying a rare allele (i.e. assuming a dominant model); above we see 11 genotypes (12 alleles in 9 people) that carry a rare variant and all are cases. Note that the BURDEN test is one-sided, testing for a higher rate in cases: in our example where cases and controls are CEU and TSI this clearly isn't what you'd want -- we only show this example to introduce usage of some of the commands. As detailed here there are a growing number of alternative gene-based tests designed to handle a broader array of conditions.
However, in this particular case, the data are all missing for the "controls" and there is a lot of missing data for "cases" too:
pseq proj g-counts --phenotype phe1 --mask gene=SULT1A1 mac=1-10
VAR ALLELES GENOTYPES chr16:28524914 A/G ./.=30:66 A/A=59:0 A/G=1:0 chr16:28525008 C/T ./.=27:66 C/C=61:0 C/T=2:0 chr16:28525053 C/G ./.=45:66 C/C=40:0 C/G=4:0 G/G=1:0 chr16:28527421 A/G ./.=41:66 A/A=46:0 A/G=3:0
All TSI individuals ("controls") are missing for these variants; unless it is true that the absence of a call means that one is very confident that all genotypes are really homozygous reference, the asymmetric structure of missing data could bias this test. Fuller consideration of such issues is beyond the scope of this introductory tutorial however.
Using R to work with this project
This section assumes you have R installed on your machine and have also downloaded the PLINK/Seq library R interface. At the R prompt, type:
library(Rplinkseq)
PLINK/Seq genetics library for R | v0.08 | 15-Mar-2012
Attaching the project
To attach the PLINK/Seq project we've generated from the two VCFs:
pseq.project( "proj" )
Because this is a relatively small project, we can load entire files into memory in R:
k <- var.fetch( mask = "file=TSI" , limit = 4000 )
length(k)
[1] 3281
The object k is a list of list objects; we can extract core fields such as chromosome with the x.core() function:
table( x.core( k , "CHR" ) )
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 303 220 153 153 148 240 135 125 107 160 190 187 55 128 163 133 170 116 194 112 21 22 23 66
We can access the full list for a single variant: in this example, for chr10:, data for which was shown above in the Hardy-Weinberg equilibrium section of this tutorial. The output from the PSEQ g-counts for this variant (for TSI) was:
VAR ALLELES GENOTYPES chr10:46507377 G/A ./.=2 A/G=41 G/G=23
This same variant is number 1456 in the list k:
str( k[[1456]] )
List of 8 $ CHR : int 10 $ BP1 : int 46507377 $ BP2 : int 46507377 $ ID : chr "rs1048156" $ NS : int 1 $ META:List of 4 ..$ HM2: chr(0) ..$ AA : chr "g" ..$ GP : chr "10:47087371" ..$ BN : int 86 $ CON :List of 7 ..$ FSET : int 0 ..$ REF : chr "G" ..$ ALT : chr "A" ..$ QUAL : num NA ..$ FILTER: chr "" ..$ META :List of 7 .. ..$ HM2: chr(0) .. ..$ AA : chr "g" .. ..$ GP : chr "10:47087371" .. ..$ AC : int 41 .. ..$ AN : int 128 .. ..$ DP : int 8341 .. ..$ BN : int 86 ..$ GENO :List of 2 .. ..$ GT: int [1:66] 0 1 0 1 0 1 0 1 1 1 ... .. ..$ DP: int [1:66] 48 21 40 197 417 46 15 49 38 32 ... $ S2 : NULL
table( x.consensus.genotype( k[1456] ) , useNA = "ifany" )
0 1 <NA> 23 41 2
This indicates 41 heterozygote carriers of one copy of the alternate A allele, 23 G/G reference homozygotes and 2 individuals with no call.
To be continued: in particular, examples of visualizing meta-information
and defining and applying arbitrary functions.