PSEQ summary statistics

Summary statistics about a project can be calculated with the following four commands:

  • v-stats: Summary statistics over all variants
  • g-stats: Summary statistics by gene/group
  • i-stats: Summary statistics by individual
  • group-comparison: Enumeration of non-reference alleles by group

Summary statistics, aggregating over all variants

To obtain summary statistics across all variants, or a subset of variants, individuals or genotypes as specified by a mask:

pseq /path/to/my/project v-stats

As an example of the output, here are the results for chromosome 22 for an exome-sequencing study (prior to full QC) with data on 132 individuals:

pseq exome-study v-stats --mask reg=chr22
NVAR                    2532             total number of variants
RATE                    0.978533         average call rate
MAC                     22.0829          mean minor allele count
MAF                     0.0858024        mean minor allele frequency
SING                    789              number of singletons (MAC==1)
MONO                    15               number of monomoprhic sites (MAC==0)
TITV                    2.92558          transition/transversion ratio
TITV_S                  4.09032          as above, for singletons
DP                      12673.3          mean variant read depth (given DP tag)
QUAL                    49162            mean QUAL score from VCF
PASS                    0.853081         proportion of variants that PASS all FILTERs 
FILTER|HomopolymerRun   0.0114534        proportion of variants with HomopolymerRun FILTER 
FILTER|Mask             0.00947867       ..
FILTER|NearIndel        0.00947867       ..
FILTER|PASS             0.853081         .. 
FILTER|QualByDepth      0.115719         ..
FILTER|SnpCluster       0.0150079        ..
FILTER|StrandBias       0.108215         ..
PASS_S                  0.931559         proportion of singletons that PASS FILTERs

Restricting the summary to only PASSing sites, for example:

pseq exome-study v-stats --mask reg=chr22 any.filter.ex
NVAR                   2160
RATE                   0.979577
MAC                    21.7306
MAF                    0.0843031
SING                   735
MONO                   15
TITV                   3.96552
TITV_S                 4.25
DP                     12795.4
QUAL                   54899
PASS                   1
FILTER|HomopolymerRun  0
FILTER|Mask            0
FILTER|NearIndel       0
FILTER|PASS            1
FILTER|QualByDepth     0
FILTER|SnpCluster      0
FILTER|StrandBias      0
PASS_S                 1

There are a number of options that can add additional summaries statistics via the stats keyword, as illustrated here:

--stats mac=2,3-5 qual=0-10000,10000- gmean=GQ gcount=DP:0-10 mean=QD count=QD:0-5,QD:5-10,QD:10-

These additional options generate the following extra lines of output:

MAC|2           216           number of doubletons, mac=2
MAC|3:5         236           number of variants with 3-5 minor alleles, mac=3-5
QUAL|0:10000    1345          number of variants with QUAL less than 10,000, qual=0-10000
QUAL|10000:*    815           number of variants with QUAL above 10,000, qual=10000-
MEAN|QD         16.7215       mean QD tag (quality-depth ratio), mean=QD 
QD|0:5          0             number of variants with QD less than 5, count=QD:0-5
QD|5:10         120           number with QD 5-10, count=QD:5-10
QD|10:*         2040          number with QD above 10, count=QD:10-
G|GQ            85.6505       mean individual GQ tag for all genotypes, gmean=GQ
NRG|GQ          89.4076       mean individual GQ tag for all non-reference genotypes, gmean=GQ
G|DP|0:10       0.087082      proportion of all genotypes within individual DP 0-10, gcount=DP:0-10 
NRG|DP|0:10     0.103311      proportion of all non-reference genotypes, individual DP 0-10, gcount=DP:0-10

Other options are available not shown above; see pseq help v-stats.

Summary statistics by gene/group

To produce a similar table of summary statistics per gene (with a tabular output format, one row per gene/group), use the command:

pseq /path/to/my/project g-stats --mask loc.group=CCDS

This command assumes that a grouping factor (such as loc.group) has been specified in the mask. It produces a table of summary statistics per gene/set in the grouping factor:

NAME POS KB BP N_EXON DENS NVAR RATE SING TITV PASS PASS_S QUAL DP NM_000015 chr8:18257514..18258383 0.87 870 1 108.75 8 0.992481 1 7 8 1 119389 14835.9 NM_000016 chr1:76190473..76228445 37.973 1263 12 252.6 5 0.993985 3 1.5 5 3 59059.8 28888 NM_000017 chr12:121163689..121177248 13.56 1236 10 206 6 0.992481 1 NA 6 1 44106.2 5580.5 NM_000018 chr17:7123304..7128413 5.11 1965 20 982.5 2 0.992481 1 NA 2 1 1093.52 4590 NM_000019 chr11:107992334..108018114 25.781 1281 12 427 3 0.934837 1 0 2 0 135615 41273.3 NM_000020 chr12:52306259..52314674 8.416 1509 9 301.8 5 0.992481 3 1.5 4 3 1328.8 9369.6 NM_000021 chr14:73614728..73685994 71.267 1401 10 350.25 4 0.994361 1 1 1 0 16631.8 17019.5 NM_000022 chr20:43248478..43280248 31.771 1089 12 155.571 7 0.992481 2 6 6 2 41031.7 10140.3 NM_000023 chr17:48243402..48252795 9.394 1161 9 290.25 4 0.992481 1 1 3 1 11181.2 22467 ...

The same --stats described for the v-stats command can also be applied here, to add additional fields to the output. This command can also be combined with the various --mask options. With small numbers of variants in some sets/genes, certain fields might not be interpretable (e.g. Ti/Tv, or POS if the set spans multiple chromosomes (e.g. from a locset.group mask).

Summary statistics by individual

To obtain a matrix of summary statistics for each individual in a project, use the command:

pseq /path/to/my/project i-stats

along with any mask settings. Also, the same --stats described for the v-stats command can also be applied here. The output is one row per individual.

ID NALT NMIN NHET NVAR RATE SING TITV PASS PASS_S QUAL DP 00028296 484 376 337 2496 0.98578 785 1.870 271 4 111170 11734.5 00028320 491 395 348 2505 0.98933 786 2.657 315 8 117262 12106.7 00028328 439 346 293 2493 0.98459 784 2.887 306 7 155044 13328.5 00028380 506 408 352 2479 0.97906 781 1.914 318 6 124900 12408.6 00028397 529 434 371 2485 0.98143 783 1.993 318 8 119099 12830.2 00028454 480 419 349 2502 0.98815 785 2.352 338 7 149798 12398.6 00045279 460 372 311 2488 0.98262 784 2.412 309 5 153917 13935.1 00045301 519 400 343 2483 0.98064 783 1.898 288 7 106857 13271 00045303 514 440 384 2494 0.98499 785 2.308 361 3 143618 12981.4 00045413 493 386 348 2490 0.98341 784 1.946 312 5 131900 13062.1 00060622 463 353 306 2471 0.97590 781 2.017 262 5 112954 12711.7 00069374 442 337 288 2496 0.98578 786 2.547 286 7 137964 13396.2 00071204 428 316 274 2483 0.98064 785 3.388 286 7 150463 13079

The default fields are:

   ID           Individual ID
   NALT		Number of non-reference genotypes
   NMIN		Number of genotypes with a minor allele
   NHET		Number of heterozygous genotypes for individual
   NVAR		Total number of called variants for individual
   RATE		Genotyping rate for individual
   SING		Number of singletons individual has
   TITV		Mean Ti/Tv for variants for which individual has a nonreference genotype
   PASS		Number of variants PASS'ing for which individual has a nonreference genotype
   PASS_S	Number of singletons PASS'ing for which individual has a (singleton) nonreference genotype
   QUAL		Mean QUAL for variants for which individual has a nonreference genotype
   DP		Mean variant DP for variants for which individual has a nonreference genotype

If other --stats are added, extra columns will be appended to the above.

Group comparisons (enumeration of non-reference alleles by group)

This command calculates the number of variants for which a non-reference allele is observed at least once in a group of individuals, for all possible groupings. This test can be useful as a simple screen to indicate, for example, if certain batches of samples are systematically differing. For a factor-level phenotype (i.e. labeled as a String) the command:

pseq proj group-comparison --phenotype my.group

tabulates the number of individuals in each group, followed by the number of variants in each combination of groups. For example:

   A  25
   B  25
   C  25
   D  25
   NA  0

This indicates that there are a total of 100 individuals, assigned to groups A to D in equal proportion (NA means missing). The subsequent output indicates the number of variants (given the mask) unique to that group or combination of groups:

 182         A       
 132	     B       
	     ..	   
 73	     A,B     
 76	     A,C     
	     ...	   
 57	     A,B,C   
	     ...	   
 89	     A,B,C,D 

That is, 182 variants seen only in A-group individuals, 73 seen on only A or B, 89 seen in all four groups, etc. This command can be combined with the --make-phenotype command.