This page contains the following sections:
- Core hg19 resources for the current release
- How these were created
Note: Occasionally we may alter the internal structure of a resource DB, as between PSEQ releases 1.08 (the current) and 1.09 (the next one). In general, however, resource databases are independent of the particular version of PLINK/Seq installed -- that is, you will not need to re-download these databases with every version upgrade of PSEQ.
Also note that, you do not need all of these files to get most of the basic functions of PLINK/SEQ to work, but these will be useful files to download if you plan on using PLINK/SEQ seriously. (And, in fact, for certain procedures will be absolutely required.)
Core resource databases (hg19) for current release
Download all resource files here:Type | Date | Link | Size | Description | Example command-line usage (resource names in database) |
---|---|---|---|---|---|
LOCDB(*) | 30-Apr-2013 | locdb.gz | 251 MB # of transcripts: 41,880; 27,429; 94,378; 165,825 |
Transcripts (and genes): RefSeq; CCDS; GENCODE; ENSEMBL | "--annotate", "--group", "--loc" should be one of: "refseq", "ccds", "gencode", "ensembl" |
REFDB(*) | 21-Sep-2012 | refdb.dbsnp.gz | 2.2 GB 52,054,804 variants |
dbSNP build 137 | "--ref", "--ref-allelic" should be: "dbsnp" |
REFDB | 21-Sep-2012 | refdb.g1k.gz | 2.0 GB 39,706,715 variants |
1000 Genomes sites (release 16-March-2012) | "--ref", "--ref-allelic" should be: "g1k" |
SEQDB(*) | 17-May-2012 | seqdb.hg19.gz | 831 MB | hg19 sequence | N/A |
PROTDB | 2-May-2013 | protdb.gz | 32 MB | InterProScan motifs and domains (e.g., Pfam), as run by SIMAP | "--protdb protdb" |
We suggest you download at least the three databases marked (*). Be sure to decompress (using gunzip) any .gz compressed database files prior to use. Place all files in a folder called hg19, e.g. if you placed it here:
/genetics/data/pseq/hg19/
then you'd subsequently point to this folder with the --resources argument when creating a new project:
pseq proj new-project --resources /genetics/data/pseq/hg19/
As described in the main documentation, these databases can be named anything, and can be swapped in and out of projects with the --locdb, --refdb and --seqdb commands.
Building the resource databases
You can download the above links to (and unzip any compressed files) and directly use these resource databases in your own projects. This section describes the steps used to build the above databases. Note that this description is approximate, as a number of additional filtering steps are required in practice in order to get the well-curated databases above. These commands may be useful if you wish to create your own custom resource databases.
Gene transcripts (LOCDB)
GTFs from the UCSC table browser were downloaded for hg19, for RefSeq, CCDS and GENCODE transcripts
Each GTF file was loaded into a separate LOCDB, with the following commands (We could also have created a single LOCDB with all three groups.):
pseq . loc-load --file refseq-hg19.gtf.gz --group refseq --locdb locdb
pseq . loc-load --file ccds-hg19.gtf.gz --group ccds --locdb locdb.ccds
pseq . loc-load --file gencodeBasicV11-hg19.gtf.gz --group gencode --locdb locdb.gencode
In addition, we also created a mapping of RefSeq transcript ID to gene symbol (from the HGNC genenames.org . We populated the alternate name of each transcript to be the gene symbol. We also populated the alias table with the RefSeq-to-symbol mapping.
pseq . loc-swap-names --locdb locdb
--file refseq2symbol.txt --group refseq --alternate-name
pseq . loc-load-alias --locdb locdb --file refseq2symbol.txt
This enables searches of the LOCDB based on gene-name (e.g. in pbrowse, or via the gene mask) and output of gene symbols (e.g. in gene-based tests ALIAS field).
Reference variants (REFDB)
We obtained a sites-only VCF of the current hg19 build of dbSNP (136) from the following URL:
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz
The REFDB was then created with the following command. To keep the file smaller for download, we elected here to create a stripped down VCF, not including any meta-information. This was achieved with the --show flag (there is not field called XXX, and thus no meta-fields are included).
pseq . ref-load --vcf 00-All.vcf.gz --group dbsnp --refdb refdb --show XXX
If the --show argument was excluded, all the meta-fields in the VCF would also be loaded into the REFDB (and appended to any project variants downstream): i.e. such as
pseq 00-All.vcf.gz v-view --vmeta --verbose --mask limit=1
chr1:10144..10145 rs144773400 TA/T . 1 . RSPOS = 10145 VP = 050000000005000002000200 dbSNPBuildID = 134 SAO = 0 SSR = 0 WGT = 1 VC = DIV ASP flag set OTHERKG flag set
We also downloaded a sites-only VCF for the 16-March-2012 release of the 1000 Genomes project, and similarly stripped the very large number of variant meta-fields to include only a single allele frequency field (AF):
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.wgs.phase1_release_v2.20101123.snps_indels_sv.sites.vcf.gz
pseq . ref-load --refdb refdb-g1k --group g1k --show AF
--vcf ALL.wgs.phase1.projectConsensus.snps.sites.vcf.gz
Human genome sequence (SEQDB)
We downloaded the hg19 sequence in FASTA format:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
and extracted and concatenated the file:
tar -xzvf chromFa.tar.gz
rm -rf hg19.fa.gz
for c in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
do
echo "done $c"
cat chr${c}.fa | gzip >> hg19.fa.gz
done
rm -rf chr*.fa
and then loaded this single, compressed FASTA into a SEQDB:
pseq . seq-load --file hg19.fa.gz --seqdb seqdb
--format name=hg19 description=from-UCSC-20-dec-2010 build=hg19 repeat-mode=lower
This SEQDB therefore contains repeat-masked sequence; lower-case bases to represent repeat-masked sequence
pseq . seq-summary --seqdb seqdb
---Sequence DB summary--- chr1:1..249250621 MB=249 chr2:1..243199373 MB=243 chr3:1..198022430 MB=198 chr4:1..191154276 MB=191 chr5:1..180915260 MB=180 chr6:1..171115067 MB=171 chr7:1..159138663 MB=159 chr8:1..146364022 MB=146 chr9:1..141213431 MB=141 chr10:1..135534747 MB=135 chr11:1..135006516 MB=135 chr12:1..133851895 MB=133 chr13:1..115169878 MB=115 chr14:1..107349540 MB=107 chr15:1..102531392 MB=102 chr16:1..90354753 MB=90 chr17:1..81195210 MB=81 chr18:1..78077248 MB=78 chr19:1..59128983 MB=59 chr20:1..63025520 MB=63 chr21:1..48129895 MB=48 chr22:1..51304566 MB=51 chrX:1..155270560 MB=155 chrY:1..59373566 MB=59 chrM:1..16571 MB=0 SEQDB meta-information: BUILD = hg19 SEQDB meta-information: DESC = from-UCSC-20-dec-2010 SEQDB meta-information: IUPAC = 0 SEQDB meta-information: NAME = hg19 SEQDB meta-information: REPEATMODE = lower
Genotype VCF for exome-subset of 1000 Genomes March 16 2012 phase 1 data release
We downloaded the VCF
for c in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
do
wget -c ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr${c}.phase1.projectConsensus.genotypes.vcf.gz
done
For the purpose of an example dataset for the pbrowse utility (hosted here), we want to make a PLINK/Seq project only extracting sites that are in the exome from this large dataset. To do this, we first must specify the group in a LOCDB that defines the exome. We do this as follows, using a slight variant on the prevous loc-load command. Usually when the loc-load command is given a GTF file, it automatically reads exons as loci into one group (a so-called unmerged group), and then creates a second group in which the exons are grouped together as subregions for each transcript (whole locus/region), and the first group is removed. Adding the --keep-unmerged flag has the effect of retaining the first group in the LOCDB, that will be called refseq-unmerged in this instance. Subsequently, selecting on the refseq group will include all variants within the transcription start and stop sites of a transcript, whereas selected on the refseq-unmerged group we select only variants that fall in coding regions (i.e. not intronic variants). In the context of an example dataset for the exome-centric pbrowse tool, we therefore used the commands:
pseq . loc-load --file refseq-hg19.gtf.gz --group refseq --locdb locdb.refseq --keep-unmerged
pseq exome new-project --locdb locdb.refseq
pseq exome load-vcf --vcf vcf/ALL.chr*.vcf.gz --filter refseq-unmerged
pseq exome var-summary
---Variant DB summary--- 314695 unique variants File tag : 1 (32716 variants, 1094 individuals) File tag : 2 (12437 variants, 1094 individuals) File tag : 3 (20004 variants, 1094 individuals) File tag : 4 (15405 variants, 1094 individuals) File tag : 5 (5287 variants, 1094 individuals) File tag : 6 (9849 variants, 1094 individuals) File tag : 7 (10490 variants, 1094 individuals) File tag : 8 (13756 variants, 1094 individuals) File tag : 9 (17598 variants, 1094 individuals) File tag : 10 (5014 variants, 1094 individuals) File tag : 11 (22992 variants, 1094 individuals) File tag : 12 (22286 variants, 1094 individuals) File tag : 13 (7810 variants, 1094 individuals) File tag : 14 (3668 variants, 1094 individuals) File tag : 15 (6887 variants, 1094 individuals) File tag : 16 (17818 variants, 1094 individuals) File tag : 17 (12671 variants, 1094 individuals) File tag : 18 (14189 variants, 1094 individuals) File tag : 19 (17119 variants, 1094 individuals) File tag : 20 (14997 variants, 1094 individuals) File tag : 21 (11129 variants, 1094 individuals) File tag : 22 (13301 variants, 1094 individuals) File tag : 23 (7272 variants, 1094 individuals)
This runs relatively quickly compared to loading the entire dataset (as it scans through the whole-genome files, only putting the coding variants into the project database). Otherwise, we would suggest using the index-vcf command to work with very large VCF files, in the first instance.