PLINK/SEQ genetics library

Getting started

Key concepts

PSEQ documentation

Other tools and resources

Tutorials

Technical reference

C/C++ library API

This page contains the following sections:

Core hg19 resources for the current release
How these were created

Note: Occasionally we may alter the internal structure of a resource DB, as between PSEQ releases 1.08 (the current) and 1.09 (the next one). In general, however, resource databases are independent of the particular version of PLINK/Seq installed -- that is, you will not need to re-download these databases with every version upgrade of PSEQ.

Also note that, you do not need all of these files to get most of the basic functions of PLINK/SEQ to work, but these will be useful files to download if you plan on using PLINK/SEQ seriously. (And, in fact, for certain procedures will be absolutely required.)

Core resource databases (hg19) for current release

Download all resource files here:

Type	Date	Link	Size	Description	Example command-line usage (resource names in database)
LOCDB(*)	30-Apr-2013	`locdb.gz`	251 MB # of transcripts: 41,880; 27,429; 94,378; 165,825	Transcripts (and genes): RefSeq; CCDS; GENCODE; ENSEMBL	"--annotate", "--group", "--loc" should be one of: "refseq", "ccds", "gencode", "ensembl"
REFDB(*)	21-Sep-2012	`refdb.dbsnp.gz`	2.2 GB 52,054,804 variants	dbSNP build 137	"--ref", "--ref-allelic" should be: "dbsnp"
REFDB	21-Sep-2012	`refdb.g1k.gz`	2.0 GB 39,706,715 variants	1000 Genomes sites (release 16-March-2012)	"--ref", "--ref-allelic" should be: "g1k"
SEQDB(*)	17-May-2012	`seqdb.hg19.gz`	831 MB	hg19 sequence	N/A
PROTDB	2-May-2013	`protdb.gz`	32 MB	InterProScan motifs and domains (e.g., Pfam), as run by SIMAP	"--protdb protdb"

We suggest you download at least the three databases marked (*). Be sure to decompress (using gunzip) any .gz compressed database files prior to use. Place all files in a folder called hg19, e.g. if you placed it here:

/genetics/data/pseq/hg19/

then you'd subsequently point to this folder with the --resources argument when creating a new project:

pseq proj new-project --resources /genetics/data/pseq/hg19/

As described in the main documentation, these databases can be named anything, and can be swapped in and out of projects with the --locdb, --refdb and --seqdb commands.

Building the resource databases

You can download the above links to (and unzip any compressed files) and directly use these resource databases in your own projects. This section describes the steps used to build the above databases. Note that this description is approximate, as a number of additional filtering steps are required in practice in order to get the well-curated databases above. These commands may be useful if you wish to create your own custom resource databases.

Gene transcripts (LOCDB)

GTFs from the UCSC table browser were downloaded for hg19, for RefSeq, CCDS and GENCODE transcripts

Each GTF file was loaded into a separate LOCDB, with the following commands (We could also have created a single LOCDB with all three groups.):

pseq . loc-load --file refseq-hg19.gtf.gz --group refseq --locdb locdb pseq . loc-load --file ccds-hg19.gtf.gz --group ccds --locdb locdb.ccds pseq . loc-load --file gencodeBasicV11-hg19.gtf.gz --group gencode --locdb locdb.gencode

In addition, we also created a mapping of RefSeq transcript ID to gene symbol (from the HGNC genenames.org . We populated the alternate name of each transcript to be the gene symbol. We also populated the alias table with the RefSeq-to-symbol mapping.

 pseq . loc-swap-names --locdb locdb 
  --file refseq2symbol.txt --group refseq --alternate-name

pseq . loc-load-alias --locdb locdb --file refseq2symbol.txt

This enables searches of the LOCDB based on gene-name (e.g. in pbrowse, or via the gene mask) and output of gene symbols (e.g. in gene-based tests ALIAS field).

Reference variants (REFDB)

We obtained a sites-only VCF of the current hg19 build of dbSNP (136) from the following URL:

wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz

The REFDB was then created with the following command. To keep the file smaller for download, we elected here to create a stripped down VCF, not including any meta-information. This was achieved with the --show flag (there is not field called XXX, and thus no meta-fields are included).

pseq . ref-load --vcf 00-All.vcf.gz --group dbsnp --refdb refdb --show XXX

If the --show argument was excluded, all the meta-fields in the VCF would also be loaded into the REFDB (and appended to any project variants downstream): i.e. such as

pseq 00-All.vcf.gz v-view --vmeta --verbose --mask limit=1

   chr1:10144..10145   rs144773400   TA/T  .  1  .

   RSPOS = 10145
   VP = 050000000005000002000200
   dbSNPBuildID = 134
   SAO = 0
   SSR = 0
   WGT = 1
   VC = DIV
   ASP flag set
   OTHERKG flag set

We also downloaded a sites-only VCF for the 16-March-2012 release of the 1000 Genomes project, and similarly stripped the very large number of variant meta-fields to include only a single allele frequency field (AF):

 wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.wgs.phase1_release_v2.20101123.snps_indels_sv.sites.vcf.gz

 pseq . ref-load --refdb refdb-g1k --group g1k --show AF 
  --vcf ALL.wgs.phase1.projectConsensus.snps.sites.vcf.gz

Human genome sequence (SEQDB)

We downloaded the hg19 sequence in FASTA format:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

and extracted and concatenated the file:

tar -xzvf chromFa.tar.gz
rm -rf hg19.fa.gz
for c in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
do
 echo "done $c"
 cat chr${c}.fa | gzip >> hg19.fa.gz
done
rm -rf chr*.fa

and then loaded this single, compressed FASTA into a SEQDB:

 pseq . seq-load --file hg19.fa.gz --seqdb seqdb 
    --format name=hg19 description=from-UCSC-20-dec-2010 build=hg19 repeat-mode=lower

This SEQDB therefore contains repeat-masked sequence; lower-case bases to represent repeat-masked sequence

pseq . seq-summary --seqdb seqdb

---Sequence DB summary---

  chr1:1..249250621  MB=249
  chr2:1..243199373  MB=243
  chr3:1..198022430  MB=198
  chr4:1..191154276  MB=191
  chr5:1..180915260  MB=180
  chr6:1..171115067  MB=171
  chr7:1..159138663  MB=159
  chr8:1..146364022  MB=146
  chr9:1..141213431  MB=141
  chr10:1..135534747 MB=135
  chr11:1..135006516 MB=135
  chr12:1..133851895 MB=133
  chr13:1..115169878 MB=115
  chr14:1..107349540 MB=107
  chr15:1..102531392 MB=102
  chr16:1..90354753  MB=90
  chr17:1..81195210  MB=81
  chr18:1..78077248  MB=78
  chr19:1..59128983  MB=59
  chr20:1..63025520  MB=63
  chr21:1..48129895  MB=48
  chr22:1..51304566  MB=51
  chrX:1..155270560  MB=155
  chrY:1..59373566   MB=59
  chrM:1..16571      MB=0

SEQDB meta-information: BUILD = hg19
SEQDB meta-information: DESC = from-UCSC-20-dec-2010
SEQDB meta-information: IUPAC = 0
SEQDB meta-information: NAME = hg19
SEQDB meta-information: REPEATMODE = lower

Genotype VCF for exome-subset of 1000 Genomes March 16 2012 phase 1 data release

We downloaded the VCF for c in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X do wget -c ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr${c}.phase1.projectConsensus.genotypes.vcf.gz done

For the purpose of an example dataset for the pbrowse utility (hosted here), we want to make a PLINK/Seq project only extracting sites that are in the exome from this large dataset. To do this, we first must specify the group in a LOCDB that defines the exome. We do this as follows, using a slight variant on the prevous loc-load command. Usually when the loc-load command is given a GTF file, it automatically reads exons as loci into one group (a so-called unmerged group), and then creates a second group in which the exons are grouped together as subregions for each transcript (whole locus/region), and the first group is removed. Adding the --keep-unmerged flag has the effect of retaining the first group in the LOCDB, that will be called refseq-unmerged in this instance. Subsequently, selecting on the refseq group will include all variants within the transcription start and stop sites of a transcript, whereas selected on the refseq-unmerged group we select only variants that fall in coding regions (i.e. not intronic variants). In the context of an example dataset for the exome-centric pbrowse tool, we therefore used the commands:

pseq . loc-load --file refseq-hg19.gtf.gz --group refseq --locdb locdb.refseq --keep-unmerged pseq exome new-project --locdb locdb.refseq pseq exome load-vcf --vcf vcf/ALL.chr*.vcf.gz --filter refseq-unmerged pseq exome var-summary

---Variant DB summary---

314695 unique variants
File tag : 1 (32716 variants, 1094 individuals)
File tag : 2 (12437 variants, 1094 individuals)
File tag : 3 (20004 variants, 1094 individuals)
File tag : 4 (15405 variants, 1094 individuals)
File tag : 5 (5287 variants, 1094 individuals)
File tag : 6 (9849 variants, 1094 individuals)
File tag : 7 (10490 variants, 1094 individuals)
File tag : 8 (13756 variants, 1094 individuals)
File tag : 9 (17598 variants, 1094 individuals)
File tag : 10 (5014 variants, 1094 individuals)
File tag : 11 (22992 variants, 1094 individuals)
File tag : 12 (22286 variants, 1094 individuals)
File tag : 13 (7810 variants, 1094 individuals)
File tag : 14 (3668 variants, 1094 individuals)
File tag : 15 (6887 variants, 1094 individuals)
File tag : 16 (17818 variants, 1094 individuals)
File tag : 17 (12671 variants, 1094 individuals)
File tag : 18 (14189 variants, 1094 individuals)
File tag : 19 (17119 variants, 1094 individuals)
File tag : 20 (14997 variants, 1094 individuals)
File tag : 21 (11129 variants, 1094 individuals)
File tag : 22 (13301 variants, 1094 individuals)
File tag : 23 (7272 variants, 1094 individuals)

This runs relatively quickly compared to loading the entire dataset (as it scans through the whole-genome files, only putting the coding variants into the project database). Otherwise, we would suggest using the index-vcf command to work with very large VCF files, in the first instance.

PLINK/SEQ

A library for the analysis of genetic variation data