PSEQ data output

This page describes some of the different formats that PSEQ can export to. In contrast to the various views of genetic data and meta-data that are available, these options are designed to produce machine-readable, rather than more human-readable, output. Run pseq help output for any other current options.

  • VCF: writing to Variant Call Format with write-vcf
  • VARDB: making a new VARDB based on filtered, consensus calls with write-vardb
  • PLINK: create PLINK fileset with write-ped
  • v-matrix: output genotype data as a simple rectangular matrix of alternate allele counts
  • g-matrix: similar to v-matrix, but with data grouped by gene/group
  • meta-matrix: write variant meta-information as a simple rectangular matrix
  • v-meta-matrix: similar to meta-matrix, but for genotype meta-information
  • BEAGLE: genotype likelihoods with write-lik

Output to a VCF file

To write (to stdout) project data in VCF format, use the command:

pseq /path/to/my/project write-vcf

This can be combined with a mask, for example to write a VCF that only lists singleton variants in the target region, excluding variants that fail a filter called GTAKStandard, and only showing the tags DP and AN, for example:

pseq /path/to/my/project write-vcf --mask loc=targets filter.ex=GATKStandard mac=1-1 --show DP AN > my.new.vcf

To output a file in BGZF-compressed format (such that can be indexed by the index-vcf command), add the flag:

--format BGZF --file filename

Thus, to directly convert a plain-text VCF to a BGZF-compressed one:

pseq my.vcf write-vcf --format BGZF --file my.vcf.gz

which will create a new BGZF compressed file my.vcf.gz.

Output to a new variant database (VARDB)

To create a new VARDB, use the command:

pseq proj1 write-vardb --new-project proj2 --new-vardb proj_out/vardb2

This creates a new project, with the project specification file at proj2, and the new VARDB is named vardb2 (but in the same folder as the original VARDB, in this case). This can be combined with a mask command to restrict the amount of information (individuals, variants, samples, meta-fields, etc) that is present in the new project. Typically, PLINK/Seq assumes the model of filtering on-the-fly rather than recreating new projects/files to represent different stages of filtering/QC, but there can be many instances where it is useful to create a new project. In particular, only consensus genotype and variant information is stored in the new VARDB, and multiple files are merged into a single file in the new project. For example, considering the project created in the tutorial:

pseq proj var-summary
---Variant DB summary---

4449 unique variants
File tag : CEU (3489 variants, 90 individuals)
File tag : TSI (3281 variants, 66 individuals)

If we create a new project, in this example only for chromosome 7:

pseq proj write-vardb --new-project proj2 --new-vardb vardb2 --mask reg=chr7 pseq proj2 var-summary
---Variant DB summary---

186 unique variants
File tag : 1 (186 variants, 156 individuals)

That is, the 186 variants on chromosome 7 from CEU and TSI samples are now combined in a single sample in a new VARDB, that can be accessed with commands such as:

pseq proj2 v-stats

Note: sample-specific variant meta-information is not imported into the new VARDB when more than a single sample is selected from the original project.

Output to a PED-formatted file

To create a file that can be read by PLINK, use the command:

pseq /path/to/my/project write-ped --name mydata

will create two files

   mydata.tped
   mydata.tfam

that can be read by PLINK with the command

plink --tfile mydata

In order to generate a valid PLINK fileset, multi-allelic markers are set to a missing genotype value; phase and ploidy information is lost also. This command can be combined with a mask via the --mask option. To use the family and individual IDs in the TFAM file (i.e. FID and IID instead of the standard PLINK/SEQ ID and a missing value), add the following:

--family-id

Output genotype data as a rectangular matrix

To write a file that contains the number of non-reference alleles for each genotype in a rectangular format (that can be easily loaded into R, for example), use the command:

pseq /path/to/my/project v-matrix

The format is a header row:

   VAR   REF   ALT   ID1   ID2...

followed by one line per variant. Null genotypes are represented with the NA code. This command can be combined with a mask via the --mask option.

Output genotype data aggregated by gene/group as a rectangular matrix

To create a rectangular text file that scores each individual for each gene, or set of variants:

pseq /path/to/my/project g-matrix --mask loc.group=CCDS

Some grouping option must be specified in the mask (i.e. loc.group). The format is a header row is:

   GENE NV ID1 ID2...

followed by one line per gene. Null genes are represented with the NA code. The GENE field is the gene-name; the NV field is the number of variants (post any filters) in this gene. Each individual is scored for the number of minor alleles that individual carries for that gene/group. If --collapse is added, then scores are reduced to 0s and 1s, indicating the presence of at least one minor allele. If --hide-invariant is included, then genes that are invariant (e.g. either nobody carries a one minor allele) are not reported. As for most commands, any mask options can be included too.

Output variant meta-information as a rectangular matrix

To create a matrix of per-variant meta-information, use the command:

pseq /path/to/my/project meta-matrix

This will generate a variant-by-meta-field matrix, e.g.

   CHR     POS     ID         AB      DP     CODING
   1       10001   rs0001     NA      6652   Missense
   1       20001   .          0.56    1124   Silent
   ...

The behavior of this command can be modified with the options

   --show AB DP    only show fields AB and DP
   --hide AB DP    show all but fields AB and DP

If the project contains more than one file, then the variant meta-information specific to each file is shown with the prefix F1_, F2_, etc. Any mask can be applied, as for most commands.

Output genotype meta-information as a rectangular matrix

For a single, genotypic meta-field (e.g. per-individual read-depth, encoded DP), the command

pseq /path/to/my/project v-meta-matrix --name DP

This will dump a variant-by-individual matrix of read-depth information. Missing values are encoded as NA; the file is tab-delimited e.g. for individuals with IDs P1 to P4:

   MARKER          P1  P2  P3  P4
   chr1:1000       22  65  10  NA
   chr1:1010       8   45  15  2
   chr2:999        16  30  8   NA

Any mask can be applied, as for most commands.

Output genotype likelihoods in BEAGLE format

To write a file of genotype likelihoods (that is correctly formatted for BEAGLE), use the command:

pseq /path/to/my/project write-lik

which writes to standard output. This command assumes that each genotype has a meta-information field named GL (genotype likelihood), which is a vector of 3 floating-point values, or PL tag, which is the same but Phred-scaled integers. The genotype likelihoods are scaled to sum to 1.0. The format of the header row is:

   VAR REF ALT ID1 ID1 ID1 ID2 ID2 ID2...

where ID1 and ID2 are the IDs for the first and second persons in the dataset, followed by one line per variant in the above format (i.e. each genotype is represented by 3 values). This command can be combined with a mask via the --mask option.