PLINK/SEQ genetics library

Getting started

Key concepts

PSEQ documentation

Other tools and resources

Tutorials

Technical reference

C/C++ library API

PSEQ reference database operations

This page describes working with the REFDB and SEQDB, two of PLINK/SEQ's reference databases. The LOCDB is described elsewhere.

Loading a REFDB: populate a reference variant databases from VCF or flat-file (ref-load)
Loading a SEQDB: creating a new reference sequence database from FASTA (seq-load)

Load a reference variant file

PSEQ can load VCF files directly into a REFDB, as follows:

pseq . ref-load --refdb /path/to/my/refdb --vcf dbsnp.vcf.gz --group dbsnp

If the REFDB does not already exist, a new one will be created. Any individual genotypes are ignored. Otherwise, the meta-information is preserved in the REFDB in the same manner it would be for a typical project.

As when loading VCFs, other options are available: to restrict to a set of sites (given a group in an attached LOCDB):

 pseq . ref-load --refdb refdb  --vcf dbsnp.vcf.gz
                 --locdb /path/to/locdb  --filter refseq

[ Note: The following options for the ref-load have not been checked -- this next section of the documentation may need updating.]

The meta and meta.ex options can be used to restrict the meta-fields that are imported (very useful to remove large amounts of extraneous information), and also check-reference, as described here.

Descriptions and comments can be added with the options:

--description "This is my description of the reference data"

Loading from a flat-file

To load generic, flat-file tables of reference variant information into the REFDB, use the command: for example,

 pseq . ref-load --refdb /path/to/refdb
                   --file dbsnp130.gz
                   --group dbSNP130
                   --format chr=chrom
                            bp1=chromStart
                            bp2=chromEnd
                            id=name
                            skip=valid,locType,weight,refNCBI
                            0-start

By default, we expect the fields (upper-case text):

   CHR    chromosome (required)
   BP1    base-pair position (required)
   BP2    end base-pair position (optional)
   ID     variant ID (optional)
   VALUE  primary "value" for this variant (optional)

If the file header row names these fields differently, the --format argument can be used to specify this, as in the example below (e.g. the header contains a chrom field, not a CHR field). The skip format option means that the fields valid, locType, etc, are skipped. Otherwise, all other fields are recorded as meta-fields in the REFDB. The option 0-start means that the values read for BP1 are assumed to be 0-based co-ordinates. Similarly, the option 0-stop implies that BP2 is 0-based. The option 0-based implies both are. By default, PLINK/Seq uses 1-based coding of all genomic co-ordinates throughout (i.e. and so these values are transformed upon loading).

Load a FASTA file into a SEQDB

To load a FASTA file into SEQDB, use the seq-load command. This takes a number of options, but the core ones are bolded below:

 pseq . seq-load --seqdb /path/to/seqdb
                   --file hg18.fa.gz                   
                   --name hg18
                   --description downloaded-from-UCSC-28-june-2010
                   --build hg18
                   --repeat-mode N

The file hg18.fa.gz is a compressed FASTA file, in this case with the entire human genome sequence. See here for an example of how to create a SEQDB. The SEQDB location must also be specified (if not implicitly part of a project specification). The other options specify some meta-information that tracks with the SEQDB, in this case specifying the name, description, build and also repeat-mode. Values for repeat-mode are N (meaning repeat-masked sequence is represented by the letter N; lower meaning repeat-masked sequence is lower-case; none meaning their is no repeat-masking.

View sequence for a region

To dump sequence data from SEQDB, use the command:

pseq proj seq-view --region chr2:25000000..25000010

   chr2    25000000        T
   chr2    25000001        A
   chr2    25000002        T
   chr2    25000003        A
   chr2    25000004        C
   chr2    25000005        A
   chr2    25000006        T
   chr2    25000007        T
   chr2    25000008        T
   chr2    25000009        T
   chr2    25000010        G

To obtain a more compact representation (i.e. simple string of bases):

 pseq proj seq-view --region chr2:25000000..25000010  --compact

   TATACATTTTG

PLINK/SEQ

A library for the analysis of genetic variation data