PSEQ data input

This page explains how to load various core data types (genotypes and phenotypes) into a PLINK/SEQ project:

Importing VCF files into a project

To load one or more VCF files (that might be compressed) into an existing project (created with the new-project command), use the following: pseq /path/to/project load-vcf

Given a project file has been created (/path/to/project) and contains 1 or more VCF files, this command loads these VCF files into the variant-database. VCF files that have already been imported into the VARDB are skipped (i.e. not reloaded). (To reload a VCF file you must first use the var-delete command.)

Alternatively, VCF files can be directly specified from the command line with the load-vcf command, using the --vcf option (they are also added to the project specification file, to track which files are in the project):

pseq /path/to/project load-vcf --vcf data/myfirst.vcf ../more/data/*vcf.gz

After importing a VCF file with load-vcf, the original file can be moved or deleted as it is no longer used by PLINK/Seq. Note that this is not the case when using index-vcf, as described below. You can always write out the data again as VCF using the write-vcf command. (Of course, for important projects keeping backups and track of your original data is obviously key, and whilst we expect PLINK/Seq to be able to recreate the original VCFs, it would not be prudent to rely exclusively on this fact...).

Performance: The process of loading VCFs into a project database can be relatively (and sometimes painfully) slow. As described below, it is far quicker to keep the original VCF as is, and only index (rather than load) it into the project. This should not impact any of the available operations on the data downstream (loaded and indexed VCFs can be mixed and matched within a single project). However, at least for some datasets, working with indexed VCFs can be marginally slower downstream. Thus, in order to take a quick look at a very large dataset, it makes sense to use index-vcf. For key projects in which you know the core data are unlikely to be changing rapidly, it probably makes sense to load-vcf instead. Specific timings, and a general sense of how PLINK/SEQ scales with very large datasets (i.e. 1000s of whole-genomes) are available here.

Additional options

To make PSEQ check that the REF allele is consistent with the base(s) in a SEQDB, if one is attached:

--check-reference

If a LOCDB is attached with a group called refseq, to load only sites that are in regions specified by that group. This can provide an quick way to extract (for example) an exome's worth of data from a whole-genome dataset:

--filter refseq

To only import certain meta fields (similarly, a meta.ex option means to import all meta fields except those specified):

--include-meta DP,AC

Indexing VCF Files

For large VCF files, load-vcf can be slow, as it must parse each line of the VCF into a structured format. For whole-exome sequencing projects containing over a thousand individuals, load-vcf will likely take on the order of one or two hours. However, for whole-genome datasets on thousands of individuals (i.e. the 1000 Genomes VCF), it can take one or two days or more. To start working with very large files more quickly, using the index-vcf command may be preferable: rather than load all the information from the VCF into the project, it remains untouched in the VCF. Instead, only an index is built in the project database that allows random-access into the VCF (and for it to be merged, filtered, or annotated, etc) with other PSEQ commands and datasets.

Note: only BGZF-compressed VCFs can be indexed in this manner. The 1000 Genomes VCF files are distributed in this format. Plain-text VCFs (or VCFs compressed only with the standard gzip tool) cannot be indexed in this manner). (Note, you can convert a plain-text VCF to a BGZF-compressed VCF with the write-vcf command.)

pseq /path/to/project index-vcf --vcf bigdata/*vcf.gz

Note that when using index-vcf the original VCF(s) cannot be deleted or moved. For a comparison of the trade-offs between load-vcf and index-vcf in terms of upfront and downstream costs in terms of speed (and file-size), see this page.

Load a PLINK binary fileset

To load a binary PLINK fileset of SNP/GWAS data (created by PLINK's --make-bed command):

pseq /path/to/my/project load-plink --file /path/to/mydata --id mydata1

This assumes three PLINK files exist in the /path/to folder:

   mydata.bed
   mydata.bim
   mydata.fam 

By default, the phenotype encoded in column 6 of the FAM file is assumed to be a dichotomous case/control phenotype, and is labelled phe1. To give a different name, add:

--phenotype t2d

By default, the REF and ALT alleles are the major and minor alleles in the PLINK dataset (PLINK does not explicitly track reference allele status with respect to a consensus sequence). If a SEQDB is attached, then the option:

--check-reference

will make PSEQ attempt to swap the alleles to make the REF allele consistent with the SEQDB. VCFs, by definition, should use allele encoding on the positive strand, although this is not always the case with GWAS datasets, so take care that the PLINK file is encoded on the positive strand. PSEQ will give a warning if the alleles are inconsistent with the base in SEQDB (although the site will still be loaded).

Load individual and pedigree information

To register individuals in the individual database (INDDB), along with some basic meta-data, use the command:

pseq /path/to/my/project load-pedigree --file myfile.info

where the file myfile.info is a tab-delimited file, with six fields per row

   ID    Main unique ID
   FID   Optional family ID
   IID   Optional individual ID
   PAT   Optional paternal ID
   MAT   Optional maternal ID
   SEX   Optional sex field (1=male, 2=female)

Use a period (.) symbol to indicate a missing value:

   0001   .   .   .   .   1
   0002   .   .   .   .   2
   0003  F1   3   1   2   2

indicates one male and two females; in this case, the second female has identifiers specified for family IDs. The family IDs are primarily for backwards compatibility with other genetics packages, including PLINK, that require an individual to be uniquely-identified by the combination of a family and an individual ID. In contrast, PLINK/SEQ uses the single unique ID for most operations (i.e. most notably, connecting individuals between the INDDB and VARDB, or specifying individuals in phenotypic datafiles).

Note: Currently, there is no support for family-based analysis in PSEQ. These will be added in future.

Load phenotypic information

To load phenotype information (on one or more phenotypes) into the INDDB, use the command:

pseq /path/to/my/project load-pheno --file myfile.phe

where the file myfile.phe is as follows:

   ##scz,Integer,0,"Primary disease phenotype"
   ##qt1,Float,-9,"Quantitative trait 1"
   ##batch,String,?,"Batch number"
   #ID           scz     qt1    batch
   00340190        1    2.23       B1
   00313746        2    0.23       B2
   00346307        1      -9        ?
   00346271        1    1.92       B1
   ...

Here the first rows starting ## are meta-data rows, which should be in the format (comma-delimited):

   name,type,missing-value,"verbose description"

One can also specify a reduced format, that implies myphe is an Integer, with 0 as the missing data code:

##myphe
#ID   myphe
P001  1
P002  2
P003  0
...

Rows starting # are header rows, which should always start #ID (uppercase) and then contain the phenotype names for the following columns (tab-delimited). Finally, each row starting without # is assumed to be a data-row. This should have the same number of entries as the header row.

If individuals do not already exist in the database, they will be automatically created. Multiple phenotypes can be included in a single file, or loaded into the INDDB database with multiple pseq calls.