Viewing data
This page describes some of the more common commands for viewing data in a PLINK/SEQ project. Additional commands can be found with pseq help views.
- View variant and genotype information: v-view is a flexible command for viewing variants in a project
- Alternate forms of v-view: filtering for (multiple) rare alleles: rv-view, mv-view, and mrv-view
- Ordering by gene/group: using g-view to view genotypes by gene, transcript, or an arbitrary grouping
- Phenotype information: per-individual listing with the i-view command
- Summary counts: allelic and genotypic summary counts with counts and g-counts
- Lookups: querying information on a list of variants with lookup
View variant and genotype information
To output a per-variant level view of a project:
pseq /path/to/my/project v-view
which will output (extracting only a subset of lines):
... chr5:178291463 rs61744382 C/T . 2 PASS PASS chr5:178291556 rs34499286 A/G . 2 PASS PASS chr5:178291742 rs961030 A/G . 2 PASS PASS chr5:178291777 rs116376577 C/A . 1 PASS chr5:178291823 rs77356525 C/A . 1 PASS ...
The format is tab-delimited, with the following (headerless) fields:
chromosome : base-position variant-ID (or '.' if novel) ref/alt alleles a sample/file identifer (or '.' if consensus variant) # of samples variant observed in filter values for those N samples (space-separated)
For a simple tab-delimited list of sites, reference and alternate alleles, add the --simple flag:
pseq /path/to/my/project v-view --simple
chr1:1105366 T C chr1:1105411 G A chr1:1108138 C T chr1:1110240 T A chr1:1110294 G A
By default, the v-view displays the consensus sample variant. As described more fully here, the consensus sample variant represents an amalgamation of as much as possible of specific samples that are present. In the simple scenario of only a single sample then this single sample-variant and the consensus sample variant are identical. When multiple samples exist, the following option can list information about specific samples as separate rows:
pseq /path/to/my/project v-view --samples
This command also lists the sample variants -- i.e. the second field indicates the sample or tag, where . is the consensus:
chr1:1105366 rs111751804 T/C . 2 PASS PASS chr1:1105366 rs111751804 T/C CEU PASS chr1:1105366 rs111751804 T/C TSI PASS chr1:1105411 rs114390380 G/A . 2 PASS PASS chr1:1105411 rs114390380 G/A CEU PASS chr1:1105411 rs114390380 G/A TSI PASS
Whether or not a given variant meta-field should be understood to apply to a specific sample (e.g. read-depth) or the variant more generally (e.g. dbSNP membership, or gene it resides within) can be specified in the METAMETA file (i.e. when making a new project, or append this to the project specification file.
/path/to/my/meta.meta METAMETA
For example, one meta-meta file contains the following:
DB STATIC TYPE STATIC GENE STATIC TRANSCRIPT STATIC STRAND STATIC GENOMECHANGE STATIC cDNACHANGE STATIC CODONCHANGE STATIC PROTEINCHANGE STATIC
as these are all properties that hold at a generalised level of the variant, not a particular combination of sample and variant. Tags with a STATIC identifier here will be automatically assumed to be common for all samples.
Note: currently, we do not enforce checks to ensure that non-missing values are indeed concordant across multiple files, i.e. if one files indicates that a SNP is in dbSNP, but a second file indicates it is not).
To dump all variant meta-information, use the --vmeta command, e.g.
pseq /path/to/my/proj v-view --samples --vmeta --show GENE DP
Note how the DP meta-variant is only attached to specific sample variants, whereas the GENE attribute is copied to the consensus variant level (spaces added between variants for clarity):
chr1:867694 rs6672356 T/C . 2 PASS PASS GENE=SAMD11 chr1:867694 rs6672356 T/C W1 PASS DP=97;GENE=SAMD11 chr1:867694 rs6672356 T/C W2 PASS DP=62;GENE=SAMD11 chr1:871490 rs2272757 G/A,T . 3 PASS PASS PASS GENE=NOC2L chr1:871490 rs2272757 G/A W1 PASS DP=136;GENE=NOC2L chr1:871490 rs2272757 G/A W3 PASS DP=232;GENE=NOC2L chr1:871490 rs2272757 G/T W4 PASS DP=188;GENE=NOC2L chr1:871781 rs35471880 G/A . 1 PASS GENE=NOC2L chr1:871781 rs35471880 G/A W3 PASS DP=226;GENE=NOC2L
Note how, for the second variant chr1:871490 the sample labeled W4 has a different alternate allele (T, not A) but this is reconciled in the consensus variant automatically (to become G/A,T).
Adding the --verbose option prints the same information, but in a more human-readable format: for the first variant (the variant/consensus sample-variant, then two sample-specific sample-variants):
chr1:867694:rs6672356 0 0:T/C:. 2 PASS PASS DB = 1 TYPE = Missense GENE = SAMD11 TRANSCRIPT = NM_152486 STRAND = + GENOMECHANGE = g.chr1 cDNACHANGE = c.1027T>C CODONCHANGE = c.(1027-1029)TGG>CGG PROTEINCHANGE = p.W343R chr1:867694:rs6672356 1 1:T/C:PASS AC = 120 AF = 1 AN = 120 DB = 1 DP = 97 Dels = 0 HRun = 2 MQ = 208.34 MQ0 = 0 QD = 23.45 SB = -1395.09 LSET = NM_152486 LGRP = 2 TYPE = Missense GENE = SAMD11 TRANSCRIPT = NM_152486 STRAND = + GENOMECHANGE = g.chr1 cDNACHANGE = c.1027T>C CODONCHANGE = c.(1027-1029)TGG>CGG PROTEINCHANGE = p.W343R chr1:867694:rs6672356 2 2:T/C:PASS AC = 120 AF = 1 AN = 120 DB = 1 DP = 97 Dels = 0 HRun = 2 MQ = 208.34 MQ0 = 0 QD = 23.45 SB = -1395.09 LSET = NM_152486 LGRP = 2 TYPE = Missense GENE = SAMD11 TRANSCRIPT = NM_152486 STRAND = + GENOMECHANGE = g.chr1 cDNACHANGE = c.1027T>C CODONCHANGE = c.(1027-1029)TGG>CGG PROTEINCHANGE = p.W343R
Other options for the v-view command: to list genotypes for each variant (one row per individual ater each variant line):
--geno
To list genotypes and meta-information:
--gmeta
To list only non-null genotypes:
--hide-null
To list only genotypes with a minor allele:
--only-minor
To list only genotypes with an alternate (non-reference) allele:
--only-alt
Alternate forms of the v-view command
Three special cases of the v-view command are:
pseq proj rv-view
which is just a shortcut for:
pseq proj v-view --geno --only-minor
For example, to list which samples carry the alternate allele for each singleton variant, one would run: (spaces between rows added for clarity in output)
pseq proj v-view --geno --only-minor --mask mac=1-1
chr1:3538715 rs114376964 T/C . 1 PASS NA20513 2 C/T chr1:3541597 rs116230480 C/T . 1 PASS NA06986 1 C/T chr1:3541599 rs115982402 C/T . 1 PASS NA20520 2 C/T chr1:3542416 rs113614277 C/T . 1 PASS NA20792 2 C/T ...
The number after each individual (i.e. 2 after individual ID NA20513 indicates which file(s) that person belongs to in the context of a project. The flags --phenotype phenotype-name and --gmeta will produce additional output on each per-individual line from the v-view family of commands.
To view the genotypes of multiple variants together:
pseq proj mv-view --mask reg=chr1:1105366,chr1:1105411
NAME=[] N(V)=2 N(I)=156 V1 chr1:1105366 V2 chr1:1105411 I V1 V2 NA06984 ./. ./. NA06985 ./. ./. NA06986 T/T G/G NA06989 ./. ./. NA06994 ./. ./. NA07000 T/T G/G NA07037 T/T G/G NA07048 T/T G/G NA07051 T/T G/A NA07346 T/T G/G NA07347 T/T G/G NA07357 T/T G/G NA10847 ./. ./. NA10851 T/T G/G NA11829 T/T G/G NA11830 ./. ./. NA11831 T/T G/G ...
Note: it is important to only feed a relatively small number of variants into a mv-view command, e.g. with a mask. (The command is designed to produce human-readable output in any case.)
For a similar view, but only listing individuals who carry at least one genotype containing a minor allele:
pseq proj mrv-view --mask reg=chr1:1105366,chr1:1105411
V1 chr1:1105366 V2 chr1:1105411 I V1 V2 NA07051 T/T G/A NA11918 T/C G/G NA12383 T/C G/G NA12413 T/C G/G NA12749 T/C G/G NA20515 T/T G/A NA20589 T/C G/G NA20786 T/C G/G NA20812 T/C G/G NA20813 T/T G/A
View genotypic information ordered by gene/group
To list variants grouped by gene, transcript (or any arbitrary grouping), use the g-view command:
pseq proj g-view --mask loc.group=refseq
Here we see, for example, output for the first two genes:
NAME=[NM_000015] N(V)=8 N(I)=132 V1 chr8:18302075 V2 chr8:18302134 V3 chr8:18302274 V4 chr8:18302372 V5 chr8:18302383 V6 chr8:18302500 V7 chr8:18302596 V8 chr8:18302650 NAME=[NM_000016] N(V)=6 N(I)=132 V1 chr1:75966700 V2 chr1:75971373 V3 chr1:75984124 V4 chr1:75987740 V5 chr1:75999434 V6 chr1:75999610
The header for each gene indicates the name of the gene/transcript, and (given any masks) the number of variants in that group, N(V), and individuals, N(I). Note that the output is sorted alphanumerically by gene/set name, not by genomic position.
The --vmeta and --verbose options work as they do for the v-view command, as well as --geno, e.g.:
pseq proj g-view --geno --mask loc.group=refseq loc.subset=refseq,NM_000022
NAME=[NM_000022] N(V)=6 N(I)=132 V1 chr20:42686312 V2 chr20:42686329 V3 chr20:42687712 V4 chr20:42688634 V5 chr20:42698280 V6 chr20:42698341 I V1 V2 V3 V4 V5 V6 00028296 C/C T/T C/C T/T A/C T/T 00028320 C/C T/T C/C T/T A/A T/T 00028328 C/C T/C C/C T/T A/A T/T 00028380 C/C T/C C/C T/C A/A T/T 00028397 C/C T/T C/C T/T A/A T/T 00028454 C/C T/T C/C T/T A/A T/T 00045279 C/C T/T C/C T/T A/A T/T 00045301 C/C T/C C/T T/T A/A T/T 00153518 C/C T/T C/C T/T A/A T/T 00153542 C/C T/T C/C T/T A/A T/T ...
Alternatively, to list individuals as columns instead of rows:
--geno --transpose
V 00028296 00028320 00028328 ... chr20:42686312 C/C C/C C/C ... chr20:42686329 T/T T/T T/C ... chr20:42687712 C/C C/C C/C ... chr20:42688634 T/T T/T T/T ... chr20:42698280 A/C A/A A/A ... chr20:42698341 T/T T/T T/T ...
View individuals' phenotype information
To list the individuals in the INDDB, use the command:
pseq proj i-view
which writes to standard output a list of the phenotypes currently in the INDDB, and the individuals (along with their phenotypic values): for example
#grp (String) "Experimental batch (A..D)" #dis (Integer) "Primary disease phenotype (2=case,1=control,0=missing)" #qta (Float) "Endophenotype-A" #qtb (Float) "Endophenotype-B" #PHE . #STRATA . #ID FID IID MISS SEX PAT MAT META 00028296 . . 0 1 0 0 grp=A;dis=2 00028320 . . 0 1 0 0 grp=A;dis=1 00028328 . . 0 1 0 0 grp=A;dis=2;qta=22.2;qtb=11.2 00028380 . . 0 1 0 0 grp=C 00028397 . . 0 1 0 0 grp=C;dis=2;qtb=8.23 ...
This implies that four phenotypes (grp, dis, qta and qtb) have been loaded into INDDB (see load-pheno). If a phenotype had been set with --pheno or --make-phenotype (see below), it would be represented in the #PHE line; similarly, if a stratifying factor had been specifed with --strata, it would be represented in the #STRATA line.
Individual level masks can be used to restrict output to specific sets of individuals from the INDDB:
--mask indiv=00028320,00028380
The mask option indiv.ex instead excludes the listed people. To append a flag for case/control status for a particular dichotomous phenotype, use the command
--phenotype dis
00028296 . . 0 0 0 0 grp=A;dis=2 CASE 00028320 . . 0 0 0 0 grp=A;dis=1 CONTROL 00028328 . . 0 0 0 0 grp=A;dis=2;qta=22.2;qtb=11.2 CASE 00028380 . . 0 0 0 0 grp=C UNKNOWN 00028397 . . 0 0 0 0 grp=C;dis=2;qtb=8.23 CASE ...
The --make-phenotype command can also be used to generate on-the-fly phenotypes (for i-view, or any command that expects a phenotype), e.g. (A or B) versus (C or D) for the grp factor:
--make-phenotype grp=A,B:C,D
00028296 . . 0 0 0 0 grp=A;dis=2 CASE 00028320 . . 0 0 0 0 grp=A;dis=1 CASE 00028328 . . 0 0 0 0 grp=A;dis=2;qta=22.2;qtb=11.2 CASE 00028380 . . 0 0 0 0 grp=C CONTROL 00028397 . . 0 0 0 0 grp=C;dis=2;qtb=8.23 CONTROL ...
To list the IDs (and samples) of people in the VARDB (e.g. individuals specified in a VCF but not necessarily with phenotype information imported into the INDDB), add the from-vardb flag:
pseq proj i-view --from-vardb
1 1 NA06984 CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz 2 2 NA06985 CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz 3 3 NA06986 CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz 4 4 NA06989 CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz 5 5 NA06994 CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz ... 89 89 NA12891 CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz 90 90 NA12892 CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz 91 1 NA20504 TSI /tmp/TSI.exon.2010_09.genotypes.vcf.gz 92 2 NA20506 TSI /tmp/TSI.exon.2010_09.genotypes.vcf.gz ...
Allelic and genotypic summary counts per variant
The counts command provides a concise summary of the
allele counts for a variant:
pseq proj counts
VAR REF/ALT MINOR CNT TOT chr1:1105366 T/C C 7 244 chr1:1105411 G/A A 3 236 chr1:1108138 C/T T 9 250 chr1:1110240 T/A A 2 310
The counts command can be combined with two other flags to add output: the --phenotype flag, to stratify counts by case/control staatus; also, the argument --annotate will add the coding status for each variant (if a LOCDB and SEQDB are attached, with an appropriate GTF/locus group):
pseq proj counts --annotate refseq --phenotype dis
VAR REF/ALT MINOR CNTA CNTU TOTA TOTU FUNC GENE chr1:69428 T/G G 77 63 940 918 missense NM_001005484 chr1:69511 A/G A 101 110 802 826 missense NM_001005484 chr1:861349 C/T T 1 1 1038 1038 missense NM_152486 chr1:865628 G/A A 3 0 1042 1042 missense NM_152486 chr1:865738 A/G G 8 5 1040 1040 intronic NM_152486 chr1:866430 A/G G 1 0 1044 1044 missense NM_152486 chr1:866487 C/T T 1 1 1046 1044 intronic NM_152486 chr1:871159 G/A A 4 2 1040 1040 missense NM_152486 chr1:871213 C/G G 2 1 1042 1040 missense NM_152486
g-counts provides a similar view of genotypes rather than alleles:
pseq proj g-counts
VAR REF/ALT GENOTYPES chr2:10498418 G/A ./.=4 A/A=2 A/G=8 G/G=142 chr2:10501295 C/T ./.=66 C/C=89 C/T=1 chr2:10501316 G/A ./.=66 A/G=1 G/G=89 chr2:10502090 G/A ./.=66 A/G=1 G/G=89
The counts command can also be used to output summary-level information count information in VCF format, as follows:
pseq proj counts --output-vcf --name mysample
This command (that can be combined with any mask) produces a sites-only VCF with genotype counts encoded in the INFO field, e.g.:
Note: in order to correctly preserve sample/file-level meta-information, you should only run this version of counts on a single file at a time (i.e. use a --mask file=file-tag).
Lookup information on a list of variants
To lookup various sources of information from the project for a list of variants, use the lookup command. This is designed to supply a list of variants of interest in a file, e.g. pos.txt below, in the tab-delimited 3-column format: position, reference allele, alternate allele. Note that pos.txt may contain headers (lines starting with '#') or contain rows with additional columns (e.g., 5 columns, with 2 additional columns of custom meta-information that will be ignored in this context). Depending on the other arguments given, this will lookup and output the following type of information (described in reference to the example below):
- a list of any gene transcripts spanning that position (from --loc refseq, assuming a group refseq exists in the LOCDB, see here)
- an alternate list of any gene names and symbols (from --alias symbol name, assuming these two aliases have been loaded in the LOCDB, see here)
- a breakdown of how many cases and how many control copies of the alternate allele were observed in an attached VARDB (from --phenotype dis, assuming a phenotype named dis has been loaded into the INDDB, see here)
- a functional annotation (silent, missense, nonsense, etc) (from --annotate refseq, assuming as above that a refseq group exists in the LOCDB, and that a SEQDB is attached also)
- an indication of the variant's presence or absence in the reference-variant database group dbsnp (from --ref dbsnp, assuming that group has been loaded into an attached REFDB, see here) A slight variation on this would be to use the --ref_allelic dbsnp option, which would only print dbSNP entries that have the same alternate allele as the one being considered here.
pseq proj1 lookup --file pos.txt
--loc refseq
--alias symbol name
--phenotype dis
--annotate refseq
--ref dbsnp
This shows example output for a single variant. The output is always 3 tab-delimited columns and contains a set of header fields, such that this file is correctly formatted to be loaded into a project as auxiliary meta-information via the attach-meta command.
... chr15:52416801 func missense,missense chr15:52416801 transcript NM_006578,NM_016194 chr15:52416801 genomic g.919T>C,g.1045T>C chr15:52416801 codon c.919ATC>GTC,c.1045ATC>GTC chr15:52416801 protein p.307I>V,p.349I>V chr15:52416801 worst missense chr15:52416801 class missense,NON=0,MIS=2,SYN=0,SPL=0,INT=0,IGR=0 chr15:52416801 seqdb_ref T chr15:52416801 nvar 1 chr15:52416801 var chr15:52416801 chr15:52416801 case 1 chr15:52416801 con 0 chr15:52416801 loc_refseq chr15:52414954..52472076:NM_006578 chr15:52416801 loc_symbol GNB5 chr15:52416801 loc_refseq chr15:52414954..52476873:NM_016194 chr15:52416801 loc_name guanine nucleotide binding protein, beta 5 chr15:52416801 loc_symbol GNB5 chr15:52416801 ref_dbsnp . ...