Viewing data

This page describes some of the more common commands for viewing data in a PLINK/SEQ project. Additional commands can be found with pseq help views.

View variant and genotype information

To output a per-variant level view of a project:

pseq /path/to/my/project v-view

which will output (extracting only a subset of lines):

  ...
  chr5:178291463   rs61744382   C/T  .  2  PASS PASS
  chr5:178291556   rs34499286   A/G  .  2  PASS PASS
  chr5:178291742   rs961030     A/G  .  2  PASS PASS
  chr5:178291777   rs116376577  C/A  .  1  PASS
  chr5:178291823   rs77356525   C/A  .  1  PASS
  ...

The format is tab-delimited, with the following (headerless) fields:

   chromosome : base-position
   variant-ID (or '.' if novel)
   ref/alt alleles 
   a sample/file identifer (or '.' if consensus variant)
   # of samples variant observed in
   filter values for those N samples (space-separated)

For a simple tab-delimited list of sites, reference and alternate alleles, add the --simple flag:

pseq /path/to/my/project v-view --simple
chr1:1105366  T   C
chr1:1105411  G   A
chr1:1108138  C   T
chr1:1110240  T   A
chr1:1110294  G   A

By default, the v-view displays the consensus sample variant. As described more fully here, the consensus sample variant represents an amalgamation of as much as possible of specific samples that are present. In the simple scenario of only a single sample then this single sample-variant and the consensus sample variant are identical. When multiple samples exist, the following option can list information about specific samples as separate rows:

pseq /path/to/my/project v-view --samples

This command also lists the sample variants -- i.e. the second field indicates the sample or tag, where . is the consensus:

 chr1:1105366  rs111751804  T/C  .   2    PASS PASS
 chr1:1105366  rs111751804  T/C  CEU PASS
 chr1:1105366  rs111751804  T/C  TSI PASS
 chr1:1105411  rs114390380  G/A  .   2    PASS PASS
 chr1:1105411  rs114390380  G/A  CEU PASS
 chr1:1105411  rs114390380  G/A  TSI PASS

Whether or not a given variant meta-field should be understood to apply to a specific sample (e.g. read-depth) or the variant more generally (e.g. dbSNP membership, or gene it resides within) can be specified in the METAMETA file (i.e. when making a new project, or append this to the project specification file.

    /path/to/my/meta.meta     METAMETA

For example, one meta-meta file contains the following:

   DB              STATIC
   TYPE            STATIC
   GENE            STATIC
   TRANSCRIPT      STATIC
   STRAND          STATIC
   GENOMECHANGE    STATIC
   cDNACHANGE      STATIC
   CODONCHANGE     STATIC
   PROTEINCHANGE   STATIC

as these are all properties that hold at a generalised level of the variant, not a particular combination of sample and variant. Tags with a STATIC identifier here will be automatically assumed to be common for all samples.

Note: currently, we do not enforce checks to ensure that non-missing values are indeed concordant across multiple files, i.e. if one files indicates that a SNP is in dbSNP, but a second file indicates it is not).

To dump all variant meta-information, use the --vmeta command, e.g.

pseq /path/to/my/proj v-view --samples --vmeta --show GENE DP

Note how the DP meta-variant is only attached to specific sample variants, whereas the GENE attribute is copied to the consensus variant level (spaces added between variants for clarity):

   chr1:867694  rs6672356    T/C    .   2     PASS PASS   GENE=SAMD11
   chr1:867694  rs6672356    T/C    W1  PASS  DP=97;GENE=SAMD11
   chr1:867694  rs6672356    T/C    W2  PASS  DP=62;GENE=SAMD11

   chr1:871490  rs2272757    G/A,T  .   3     PASS PASS PASS  GENE=NOC2L
   chr1:871490  rs2272757    G/A    W1  PASS  DP=136;GENE=NOC2L
   chr1:871490  rs2272757    G/A    W3  PASS  DP=232;GENE=NOC2L
   chr1:871490  rs2272757    G/T    W4  PASS  DP=188;GENE=NOC2L

   chr1:871781  rs35471880   G/A    .   1     PASS   GENE=NOC2L
   chr1:871781  rs35471880   G/A    W3  PASS  DP=226;GENE=NOC2L

Note how, for the second variant chr1:871490 the sample labeled W4 has a different alternate allele (T, not A) but this is reconciled in the consensus variant automatically (to become G/A,T).

Adding the --verbose option prints the same information, but in a more human-readable format: for the first variant (the variant/consensus sample-variant, then two sample-specific sample-variants):

   chr1:867694:rs6672356   0       0:T/C:. 2       PASS PASS
        DB = 1
        TYPE = Missense
        GENE = SAMD11
        TRANSCRIPT = NM_152486
        STRAND = +
        GENOMECHANGE = g.chr1
        cDNACHANGE = c.1027T>C
        CODONCHANGE = c.(1027-1029)TGG>CGG
        PROTEINCHANGE = p.W343R
   
   chr1:867694:rs6672356   1       1:T/C:PASS
        AC = 120
        AF = 1
        AN = 120
        DB = 1
        DP = 97
        Dels = 0
        HRun = 2
        MQ = 208.34
        MQ0 = 0
        QD = 23.45
        SB = -1395.09
        LSET = NM_152486
        LGRP = 2
        TYPE = Missense
        GENE = SAMD11
        TRANSCRIPT = NM_152486
        STRAND = +
        GENOMECHANGE = g.chr1
        cDNACHANGE = c.1027T>C
        CODONCHANGE = c.(1027-1029)TGG>CGG
        PROTEINCHANGE = p.W343R
   
   chr1:867694:rs6672356   2       2:T/C:PASS
        AC = 120
        AF = 1
        AN = 120
        DB = 1
        DP = 97
        Dels = 0
        HRun = 2
        MQ = 208.34
        MQ0 = 0
        QD = 23.45
        SB = -1395.09
        LSET = NM_152486
        LGRP = 2
        TYPE = Missense
        GENE = SAMD11
        TRANSCRIPT = NM_152486
        STRAND = +
        GENOMECHANGE = g.chr1
        cDNACHANGE = c.1027T>C
        CODONCHANGE = c.(1027-1029)TGG>CGG
        PROTEINCHANGE = p.W343R

Other options for the v-view command: to list genotypes for each variant (one row per individual ater each variant line):

--geno

To list genotypes and meta-information:

--gmeta

To list only non-null genotypes:

--hide-null

To list only genotypes with a minor allele:

--only-minor

To list only genotypes with an alternate (non-reference) allele:

--only-alt

Alternate forms of the v-view command

Three special cases of the v-view command are:

pseq proj rv-view

which is just a shortcut for:

pseq proj v-view --geno --only-minor

For example, to list which samples carry the alternate allele for each singleton variant, one would run: (spaces between rows added for clarity in output)

pseq proj v-view --geno --only-minor --mask mac=1-1
chr1:3538715  rs114376964  T/C  .  1  PASS
NA20513  2   C/T

chr1:3541597  rs116230480  C/T  .  1  PASS
NA06986  1   C/T

chr1:3541599  rs115982402  C/T  .  1  PASS
NA20520  2   C/T

chr1:3542416  rs113614277  C/T  .  1  PASS
NA20792  2   C/T

...

The number after each individual (i.e. 2 after individual ID NA20513 indicates which file(s) that person belongs to in the context of a project. The flags --phenotype phenotype-name and --gmeta will produce additional output on each per-individual line from the v-view family of commands.

To view the genotypes of multiple variants together:

pseq proj mv-view --mask reg=chr1:1105366,chr1:1105411
NAME=[]  N(V)=2   N(I)=156
V1  chr1:1105366
V2  chr1:1105411

I        V1   V2
NA06984  ./.  ./.
NA06985  ./.  ./.
NA06986  T/T  G/G
NA06989  ./.  ./.
NA06994  ./.  ./.
NA07000  T/T  G/G
NA07037  T/T  G/G
NA07048  T/T  G/G
NA07051  T/T  G/A
NA07346  T/T  G/G
NA07347  T/T  G/G
NA07357  T/T  G/G
NA10847  ./.  ./.
NA10851  T/T  G/G
NA11829  T/T  G/G
NA11830  ./.  ./.
NA11831  T/T  G/G
...

Note: it is important to only feed a relatively small number of variants into a mv-view command, e.g. with a mask. (The command is designed to produce human-readable output in any case.)

For a similar view, but only listing individuals who carry at least one genotype containing a minor allele:

pseq proj mrv-view --mask reg=chr1:1105366,chr1:1105411
V1  chr1:1105366
V2  chr1:1105411

I        V1   V2 
NA07051  T/T  G/A
NA11918  T/C  G/G
NA12383  T/C  G/G
NA12413  T/C  G/G
NA12749  T/C  G/G
NA20515  T/T  G/A
NA20589  T/C  G/G
NA20786  T/C  G/G
NA20812  T/C  G/G
NA20813  T/T  G/A

View genotypic information ordered by gene/group

To list variants grouped by gene, transcript (or any arbitrary grouping), use the g-view command:

pseq proj g-view --mask loc.group=refseq

Here we see, for example, output for the first two genes:

   NAME=[NM_000015]       N(V)=8  N(I)=132
   V1      chr8:18302075
   V2      chr8:18302134
   V3      chr8:18302274
   V4      chr8:18302372
   V5      chr8:18302383
   V6      chr8:18302500
   V7      chr8:18302596
   V8      chr8:18302650
   
   NAME=[NM_000016]       N(V)=6  N(I)=132
   V1      chr1:75966700
   V2      chr1:75971373
   V3      chr1:75984124
   V4      chr1:75987740
   V5      chr1:75999434
   V6      chr1:75999610

The header for each gene indicates the name of the gene/transcript, and (given any masks) the number of variants in that group, N(V), and individuals, N(I). Note that the output is sorted alphanumerically by gene/set name, not by genomic position.

The --vmeta and --verbose options work as they do for the v-view command, as well as --geno, e.g.:

pseq proj g-view --geno --mask loc.group=refseq loc.subset=refseq,NM_000022
   NAME=[NM_000022]        N(V)=6  N(I)=132
   V1      chr20:42686312
   V2      chr20:42686329
   V3      chr20:42687712
   V4      chr20:42688634
   V5      chr20:42698280
   V6      chr20:42698341
   
   I               V1      V2      V3      V4      V5      V6
   00028296        C/C     T/T     C/C     T/T     A/C     T/T
   00028320        C/C     T/T     C/C     T/T     A/A     T/T
   00028328        C/C     T/C     C/C     T/T     A/A     T/T
   00028380        C/C     T/C     C/C     T/C     A/A     T/T
   00028397        C/C     T/T     C/C     T/T     A/A     T/T
   00028454        C/C     T/T     C/C     T/T     A/A     T/T
   00045279        C/C     T/T     C/C     T/T     A/A     T/T
   00045301        C/C     T/C     C/T     T/T     A/A     T/T
   00153518        C/C     T/T     C/C     T/T     A/A     T/T
   00153542        C/C     T/T     C/C     T/T     A/A     T/T
   ...

Alternatively, to list individuals as columns instead of rows:

--geno --transpose
   V                    00028296    00028320   00028328   ...
   chr20:42686312       C/C         C/C        C/C        ...
   chr20:42686329       T/T         T/T        T/C        ...
   chr20:42687712       C/C         C/C        C/C        ...
   chr20:42688634       T/T         T/T        T/T        ...
   chr20:42698280       A/C         A/A        A/A        ...
   chr20:42698341       T/T         T/T        T/T        ...

View individuals' phenotype information

To list the individuals in the INDDB, use the command:

pseq proj i-view

which writes to standard output a list of the phenotypes currently in the INDDB, and the individuals (along with their phenotypic values): for example

 #grp (String) "Experimental batch (A..D)"
 #dis (Integer) "Primary disease phenotype (2=case,1=control,0=missing)"
 #qta (Float) "Endophenotype-A"
 #qtb (Float) "Endophenotype-B"
 #PHE	 .
 #STRATA .
 #ID       FID IID MISS SEX PAT MAT META
 00028296  .   .   0    1   0   0   grp=A;dis=2
 00028320  .   .   0    1   0   0   grp=A;dis=1
 00028328  .   .   0    1   0   0   grp=A;dis=2;qta=22.2;qtb=11.2
 00028380  .   .   0    1   0   0   grp=C
 00028397  .   .   0    1   0   0   grp=C;dis=2;qtb=8.23
 ...

This implies that four phenotypes (grp, dis, qta and qtb) have been loaded into INDDB (see load-pheno). If a phenotype had been set with --pheno or --make-phenotype (see below), it would be represented in the #PHE line; similarly, if a stratifying factor had been specifed with --strata, it would be represented in the #STRATA line.

Individual level masks can be used to restrict output to specific sets of individuals from the INDDB:

--mask indiv=00028320,00028380

The mask option indiv.ex instead excludes the listed people. To append a flag for case/control status for a particular dichotomous phenotype, use the command

--phenotype dis
 00028296  .   .  0  0  0  0   grp=A;dis=2                     CASE
 00028320  .   .  0  0  0  0   grp=A;dis=1                     CONTROL
 00028328  .   .  0  0  0  0   grp=A;dis=2;qta=22.2;qtb=11.2   CASE
 00028380  .   .  0  0  0  0   grp=C                           UNKNOWN
 00028397  .   .  0  0  0  0   grp=C;dis=2;qtb=8.23            CASE
 ...

The --make-phenotype command can also be used to generate on-the-fly phenotypes (for i-view, or any command that expects a phenotype), e.g. (A or B) versus (C or D) for the grp factor:

--make-phenotype grp=A,B:C,D
 00028296   .  .  0  0  0  0   grp=A;dis=2                     CASE
 00028320   .  .  0  0  0  0   grp=A;dis=1                     CASE
 00028328   .  .  0  0  0  0   grp=A;dis=2;qta=22.2;qtb=11.2   CASE
 00028380   .  .  0  0  0  0   grp=C                           CONTROL
 00028397   .  .  0  0  0  0   grp=C;dis=2;qtb=8.23            CONTROL
 ...

To list the IDs (and samples) of people in the VARDB (e.g. individuals specified in a VCF but not necessarily with phenotype information imported into the INDDB), add the from-vardb flag:

pseq proj i-view --from-vardb
1  1  NA06984  CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz
2  2  NA06985  CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz
3  3  NA06986  CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz
4  4  NA06989  CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz
5  5  NA06994  CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz
...
89 89 NA12891  CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz
90 90 NA12892  CEU /tmp/CEU.exon.2010_09.genotypes.vcf.gz
91 1  NA20504  TSI /tmp/TSI.exon.2010_09.genotypes.vcf.gz
92 2  NA20506  TSI /tmp/TSI.exon.2010_09.genotypes.vcf.gz
...

Allelic and genotypic summary counts per variant

The counts command provides a concise summary of the allele counts for a variant: pseq proj counts

VAR             REF/ALT MINOR   CNT     TOT
chr1:1105366    T/C     C       7       244
chr1:1105411    G/A     A       3       236
chr1:1108138    C/T     T       9       250
chr1:1110240    T/A     A       2       310

The counts command can be combined with two other flags to add output: the --phenotype flag, to stratify counts by case/control staatus; also, the argument --annotate will add the coding status for each variant (if a LOCDB and SEQDB are attached, with an appropriate GTF/locus group):

pseq proj counts --annotate refseq --phenotype dis
VAR         REF/ALT MINOR   CNTA   CNTU    TOTA   TOTU    FUNC      GENE                                
chr1:69428  T/G     G       77     63      940     918    missense  NM_001005484                        
chr1:69511  A/G     A       101    110     802     826    missense  NM_001005484                        
chr1:861349 C/T     T       1      1       1038   1038    missense  NM_152486                           
chr1:865628 G/A     A       3      0       1042   1042    missense  NM_152486                           
chr1:865738 A/G     G       8      5       1040   1040    intronic  NM_152486                           
chr1:866430 A/G     G       1      0       1044   1044    missense  NM_152486                           
chr1:866487 C/T     T       1      1       1046   1044    intronic  NM_152486                           
chr1:871159 G/A     A       4      2       1040   1040    missense  NM_152486                           
chr1:871213 C/G     G       2      1       1042   1040    missense  NM_152486                           

g-counts provides a similar view of genotypes rather than alleles:

pseq proj g-counts
VAR              REF/ALT    GENOTYPES
chr2:10498418    G/A        ./.=4 A/A=2 A/G=8 G/G=142
chr2:10501295    C/T        ./.=66 C/C=89 C/T=1
chr2:10501316    G/A        ./.=66 A/G=1 G/G=89
chr2:10502090    G/A        ./.=66 A/G=1 G/G=89

The counts command can also be used to output summary-level information count information in VCF format, as follows:

pseq proj counts --output-vcf --name mysample

This command (that can be combined with any mask) produces a sites-only VCF with genotype counts encoded in the INFO field, e.g.:

##fileformat=VCFv4.0 ##source=pseq ##_PROJ=mysample ##_N=1,90 ##INFO=<ID=SAMPLE,Number=1,Type=String,Description="Sample tag"> ##INFO=<ID=GENO,Number=.,Type=String,Description="Genotype counts"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele"> ##INFO=<ID=AC,Number=-1,Type=Integer,Description="Allele count for each alternate allele"> ... #CHROM POS ID REF ALT QUAL FILTER INFO 1 1105366 rs111751804 T C . PASS SAMPLE=1;GENO=33(./.),4(C/T),53(T/T);AA=T;AC=4;AN=114;DP=3251;BN=132;GP=1:1115503 1 1105411 rs114390380 G A . PASS SAMPLE=1;GENO=37(./.),1(A/G),52(G/G);AA=G;AC=1;AN=106;DP=2676;BN=132;GP=1:1115548 1 1108138 rs61733845 C T . PASS SAMPLE=1;GENO=29(./.),58(C/C),3(C/T);AA=c;AC=3;AN=122;DP=3477;BN=129;GP=1:1118275 ...

Note: in order to correctly preserve sample/file-level meta-information, you should only run this version of counts on a single file at a time (i.e. use a --mask file=file-tag).

Lookup information on a list of variants

To lookup various sources of information from the project for a list of variants, use the lookup command. This is designed to supply a list of variants of interest in a file, e.g. pos.txt below, in the tab-delimited 3-column format: position, reference allele, alternate allele. Note that pos.txt may contain headers (lines starting with '#') or contain rows with additional columns (e.g., 5 columns, with 2 additional columns of custom meta-information that will be ignored in this context). Depending on the other arguments given, this will lookup and output the following type of information (described in reference to the example below):

  • a list of any gene transcripts spanning that position (from --loc refseq, assuming a group refseq exists in the LOCDB, see here)
  • an alternate list of any gene names and symbols (from --alias symbol name, assuming these two aliases have been loaded in the LOCDB, see here)
  • a breakdown of how many cases and how many control copies of the alternate allele were observed in an attached VARDB (from --phenotype dis, assuming a phenotype named dis has been loaded into the INDDB, see here)
  • a functional annotation (silent, missense, nonsense, etc) (from --annotate refseq, assuming as above that a refseq group exists in the LOCDB, and that a SEQDB is attached also)
  • an indication of the variant's presence or absence in the reference-variant database group dbsnp (from --ref dbsnp, assuming that group has been loaded into an attached REFDB, see here) A slight variation on this would be to use the --ref_allelic dbsnp option, which would only print dbSNP entries that have the same alternate allele as the one being considered here.
pseq proj1 lookup --file pos.txt
--loc refseq
--alias symbol name
--phenotype dis
--annotate refseq
--ref dbsnp

This shows example output for a single variant. The output is always 3 tab-delimited columns and contains a set of header fields, such that this file is correctly formatted to be loaded into a project as auxiliary meta-information via the attach-meta command.

...
chr15:52416801  func          missense,missense
chr15:52416801  transcript    NM_006578,NM_016194
chr15:52416801  genomic       g.919T>C,g.1045T>C
chr15:52416801  codon         c.919ATC>GTC,c.1045ATC>GTC
chr15:52416801  protein       p.307I>V,p.349I>V
chr15:52416801  worst         missense
chr15:52416801  class         missense,NON=0,MIS=2,SYN=0,SPL=0,INT=0,IGR=0
chr15:52416801  seqdb_ref     T
chr15:52416801  nvar          1
chr15:52416801  var           chr15:52416801
chr15:52416801  case          1
chr15:52416801  con           0
chr15:52416801  loc_refseq    chr15:52414954..52472076:NM_006578
chr15:52416801  loc_symbol    GNB5
chr15:52416801  loc_refseq    chr15:52414954..52476873:NM_016194
chr15:52416801  loc_name      guanine nucleotide binding protein, beta 5
chr15:52416801  loc_symbol    GNB5
chr15:52416801  ref_dbsnp     . 
...