PSEQ reference database operations

This page describes working with the REFDB and SEQDB, two of PLINK/SEQ's reference databases. The LOCDB is described elsewhere.

  • Loading a REFDB: populate a reference variant databases from VCF or flat-file (ref-load)
  • Loading a SEQDB: creating a new reference sequence database from FASTA (seq-load)

Load a reference variant file

PSEQ can load VCF files directly into a REFDB, as follows:

pseq . ref-load --refdb /path/to/my/refdb --vcf dbsnp.vcf.gz --group dbsnp

If the REFDB does not already exist, a new one will be created. Any individual genotypes are ignored. Otherwise, the meta-information is preserved in the REFDB in the same manner it would be for a typical project.

As when loading VCFs, other options are available: to restrict to a set of sites (given a group in an attached LOCDB):

pseq . ref-load --refdb refdb --vcf dbsnp.vcf.gz --locdb /path/to/locdb --filter refseq

[ Note: The following options for the ref-load have not been checked -- this next section of the documentation may need updating.]

The meta and meta.ex options can be used to restrict the meta-fields that are imported (very useful to remove large amounts of extraneous information), and also check-reference, as described here.

Descriptions and comments can be added with the options:

--description "This is my description of the reference data"

Loading from a flat-file

To load generic, flat-file tables of reference variant information into the REFDB, use the command: for example,

pseq . ref-load --refdb /path/to/refdb --file dbsnp130.gz --group dbSNP130 --format chr=chrom bp1=chromStart bp2=chromEnd id=name skip=valid,locType,weight,refNCBI 0-start

By default, we expect the fields (upper-case text):

   CHR    chromosome (required)
   BP1    base-pair position (required)
   BP2    end base-pair position (optional)
   ID     variant ID (optional)
   VALUE  primary "value" for this variant (optional)
If the file header row names these fields differently, the --format argument can be used to specify this, as in the example below (e.g. the header contains a chrom field, not a CHR field). The skip format option means that the fields valid, locType, etc, are skipped. Otherwise, all other fields are recorded as meta-fields in the REFDB. The option 0-start means that the values read for BP1 are assumed to be 0-based co-ordinates. Similarly, the option 0-stop implies that BP2 is 0-based. The option 0-based implies both are. By default, PLINK/Seq uses 1-based coding of all genomic co-ordinates throughout (i.e. and so these values are transformed upon loading).

Load a FASTA file into a SEQDB

To load a FASTA file into SEQDB, use the seq-load command. This takes a number of options, but the core ones are bolded below:

pseq . seq-load --seqdb /path/to/seqdb --file hg18.fa.gz --name hg18 --description downloaded-from-UCSC-28-june-2010 --build hg18 --repeat-mode N

The file hg18.fa.gz is a compressed FASTA file, in this case with the entire human genome sequence. See here for an example of how to create a SEQDB. The SEQDB location must also be specified (if not implicitly part of a project specification). The other options specify some meta-information that tracks with the SEQDB, in this case specifying the name, description, build and also repeat-mode. Values for repeat-mode are N (meaning repeat-masked sequence is represented by the letter N; lower meaning repeat-masked sequence is lower-case; none meaning their is no repeat-masking.

View sequence for a region

To dump sequence data from SEQDB, use the command:

pseq proj seq-view --region chr2:25000000..25000010
   chr2    25000000        T
   chr2    25000001        A
   chr2    25000002        T
   chr2    25000003        A
   chr2    25000004        C
   chr2    25000005        A
   chr2    25000006        T
   chr2    25000007        T
   chr2    25000008        T
   chr2    25000009        T
   chr2    25000010        G

To obtain a more compact representation (i.e. simple string of bases):

pseq proj seq-view --region chr2:25000000..25000010 --compact
   TATACATTTTG