PSEQ summary statistics
Summary statistics about a project can be calculated with the following four commands:
- v-stats: Summary statistics over all variants
- g-stats: Summary statistics by gene/group
- i-stats: Summary statistics by individual
- group-comparison: Enumeration of non-reference alleles by group
Summary statistics, aggregating over all variants
To obtain summary statistics across all variants, or a subset of variants, individuals or genotypes as specified by a mask:
pseq /path/to/my/project v-stats
As an example of the output, here are the results for chromosome 22 for an exome-sequencing study (prior to full QC) with data on 132 individuals:
pseq exome-study v-stats --mask reg=chr22
NVAR 2532 total number of variants RATE 0.978533 average call rate MAC 22.0829 mean minor allele count MAF 0.0858024 mean minor allele frequency SING 789 number of singletons (MAC==1) MONO 15 number of monomoprhic sites (MAC==0) TITV 2.92558 transition/transversion ratio TITV_S 4.09032 as above, for singletons DP 12673.3 mean variant read depth (given DP tag) QUAL 49162 mean QUAL score from VCF PASS 0.853081 proportion of variants that PASS all FILTERs FILTER|HomopolymerRun 0.0114534 proportion of variants with HomopolymerRun FILTER FILTER|Mask 0.00947867 .. FILTER|NearIndel 0.00947867 .. FILTER|PASS 0.853081 .. FILTER|QualByDepth 0.115719 .. FILTER|SnpCluster 0.0150079 .. FILTER|StrandBias 0.108215 .. PASS_S 0.931559 proportion of singletons that PASS FILTERs
Restricting the summary to only PASSing sites, for example:
pseq exome-study v-stats --mask reg=chr22 any.filter.ex
NVAR 2160 RATE 0.979577 MAC 21.7306 MAF 0.0843031 SING 735 MONO 15 TITV 3.96552 TITV_S 4.25 DP 12795.4 QUAL 54899 PASS 1 FILTER|HomopolymerRun 0 FILTER|Mask 0 FILTER|NearIndel 0 FILTER|PASS 1 FILTER|QualByDepth 0 FILTER|SnpCluster 0 FILTER|StrandBias 0 PASS_S 1
There are a number of options that can add additional summaries statistics via the stats keyword, as illustrated here:
--stats mac=2,3-5
qual=0-10000,10000-
gmean=GQ
gcount=DP:0-10
mean=QD
count=QD:0-5,QD:5-10,QD:10-
These additional options generate the following extra lines of output:
MAC|2 216 number of doubletons, mac=2 MAC|3:5 236 number of variants with 3-5 minor alleles, mac=3-5 QUAL|0:10000 1345 number of variants with QUAL less than 10,000, qual=0-10000 QUAL|10000:* 815 number of variants with QUAL above 10,000, qual=10000- MEAN|QD 16.7215 mean QD tag (quality-depth ratio), mean=QD QD|0:5 0 number of variants with QD less than 5, count=QD:0-5 QD|5:10 120 number with QD 5-10, count=QD:5-10 QD|10:* 2040 number with QD above 10, count=QD:10- G|GQ 85.6505 mean individual GQ tag for all genotypes, gmean=GQ NRG|GQ 89.4076 mean individual GQ tag for all non-reference genotypes, gmean=GQ G|DP|0:10 0.087082 proportion of all genotypes within individual DP 0-10, gcount=DP:0-10 NRG|DP|0:10 0.103311 proportion of all non-reference genotypes, individual DP 0-10, gcount=DP:0-10
Other options are available not shown above; see pseq help v-stats.
Summary statistics by gene/group
To produce a similar table of summary statistics per gene (with a tabular output format, one row per gene/group), use the command:
pseq /path/to/my/project g-stats --mask loc.group=CCDS
This command assumes that a grouping factor (such as loc.group) has been specified in the mask. It produces a table of summary statistics per gene/set in the grouping factor:
The same --stats described for the v-stats command can also be applied here, to add additional fields to the output. This command can also be combined with the various --mask options. With small numbers of variants in some sets/genes, certain fields might not be interpretable (e.g. Ti/Tv, or POS if the set spans multiple chromosomes (e.g. from a locset.group mask).
Summary statistics by individual
To obtain a matrix of summary statistics for each individual in a project, use the command:
pseq /path/to/my/project i-stats
along with any mask settings. Also, the same --stats described for the v-stats command can also be applied here. The output is one row per individual.
The default fields are:
ID Individual ID NALT Number of non-reference genotypes NMIN Number of genotypes with a minor allele NHET Number of heterozygous genotypes for individual NVAR Total number of called variants for individual RATE Genotyping rate for individual SING Number of singletons individual has TITV Mean Ti/Tv for variants for which individual has a nonreference genotype PASS Number of variants PASS'ing for which individual has a nonreference genotype PASS_S Number of singletons PASS'ing for which individual has a (singleton) nonreference genotype QUAL Mean QUAL for variants for which individual has a nonreference genotype DP Mean variant DP for variants for which individual has a nonreference genotype
If other --stats are added, extra columns will be appended to the above.
Group comparisons (enumeration of non-reference alleles by group)
This command calculates the number of variants for which a non-reference allele is observed at least once in a group of individuals, for all possible groupings. This test can be useful as a simple screen to indicate, for example, if certain batches of samples are systematically differing. For a factor-level phenotype (i.e. labeled as a String) the command:
pseq proj group-comparison --phenotype my.group
tabulates the number of individuals in each group, followed by the number of variants in each combination of groups. For example:
A 25 B 25 C 25 D 25 NA 0
This indicates that there are a total of 100 individuals, assigned to groups A to D in equal proportion (NA means missing). The subsequent output indicates the number of variants (given the mask) unique to that group or combination of groups:
182 A 132 B .. 73 A,B 76 A,C ... 57 A,B,C ... 89 A,B,C,D
That is, 182 variants seen only in A-group individuals, 73 seen on only A or B, 89 seen in all four groups, etc. This command can be combined with the --make-phenotype command.