PSEQ reference database operations
This page describes working with the REFDB and SEQDB, two of PLINK/SEQ's reference databases. The LOCDB is described elsewhere.
- Loading a REFDB: populate a reference variant databases from VCF or flat-file (ref-load)
- Loading a SEQDB: creating a new reference sequence database from FASTA (seq-load)
Load a reference variant file
PSEQ can load VCF files directly into a REFDB, as follows:
pseq . ref-load --refdb /path/to/my/refdb --vcf dbsnp.vcf.gz --group dbsnp
If the REFDB does not already exist, a new one will be created. Any individual genotypes are ignored. Otherwise, the meta-information is preserved in the REFDB in the same manner it would be for a typical project.
As when loading VCFs, other options are available: to restrict to a set of sites (given a group in an attached LOCDB):
pseq . ref-load --refdb refdb --vcf dbsnp.vcf.gz
--locdb /path/to/locdb --filter refseq
[ Note: The following options for the ref-load have not been checked -- this next section of the documentation may need updating.]
The meta and meta.ex options can be used to restrict the meta-fields that are imported (very useful to remove large amounts of extraneous information), and also check-reference, as described here.
Descriptions and comments can be added with the options:
--description "This is my description of the reference data"
Loading from a flat-file
To load generic, flat-file tables of reference variant information into the REFDB, use the command: for example,
pseq . ref-load --refdb /path/to/refdb
--file dbsnp130.gz
--group dbSNP130
--format chr=chrom
bp1=chromStart
bp2=chromEnd
id=name
skip=valid,locType,weight,refNCBI
0-start
By default, we expect the fields (upper-case text):
CHR chromosome (required) BP1 base-pair position (required) BP2 end base-pair position (optional) ID variant ID (optional) VALUE primary "value" for this variant (optional)If the file header row names these fields differently, the --format argument can be used to specify this, as in the example below (e.g. the header contains a chrom field, not a CHR field). The skip format option means that the fields valid, locType, etc, are skipped. Otherwise, all other fields are recorded as meta-fields in the REFDB. The option 0-start means that the values read for BP1 are assumed to be 0-based co-ordinates. Similarly, the option 0-stop implies that BP2 is 0-based. The option 0-based implies both are. By default, PLINK/Seq uses 1-based coding of all genomic co-ordinates throughout (i.e. and so these values are transformed upon loading).
Load a FASTA file into a SEQDB
To load a FASTA file into SEQDB, use the seq-load command. This takes a number of options, but the core ones are bolded below:
pseq . seq-load --seqdb /path/to/seqdb
--file hg18.fa.gz
--name hg18
--description downloaded-from-UCSC-28-june-2010
--build hg18
--repeat-mode N
The file hg18.fa.gz is a compressed FASTA file, in this case with the entire human genome sequence. See here for an example of how to create a SEQDB. The SEQDB location must also be specified (if not implicitly part of a project specification). The other options specify some meta-information that tracks with the SEQDB, in this case specifying the name, description, build and also repeat-mode. Values for repeat-mode are N (meaning repeat-masked sequence is represented by the letter N; lower meaning repeat-masked sequence is lower-case; none meaning their is no repeat-masking.
View sequence for a region
To dump sequence data from SEQDB, use the command:
pseq proj seq-view --region chr2:25000000..25000010
chr2 25000000 T chr2 25000001 A chr2 25000002 T chr2 25000003 A chr2 25000004 C chr2 25000005 A chr2 25000006 T chr2 25000007 T chr2 25000008 T chr2 25000009 T chr2 25000010 G
To obtain a more compact representation (i.e. simple string of bases):
pseq proj seq-view --region chr2:25000000..25000010 --compact
TATACATTTTG