PLINK/SEQ genetics library

Project Management

This page describes the key commands for creating and working with projects.

Creating new projects: description of the new-project command and options
Summary information: var-summary, loc-summary, etc.
Changes to projects: adding, removing and changing project files
File tags: referring to specific files in a project
Misc: miscellaneous activities with variant databases
Version information: code and database versions

Create a new project

To create a new PLINK/Seq project, for example called proj1, use the new-project command:

pseq proj1 new-project

This will do nothing more than create a text file proj1 in the working directory (or overwrite an existing one). A number of additional arguments can be specified to determine some of the project components: for example, VCF files, resource databases or other miscellaneous files. For example:

pseq proj1 new-project --resources /share/data/hg19 --scratch /tmp/myfolder

would set the folder /share/data/hg19/ as the location where PSEQ will expect (or create, if they do not exist) the LOCDB, REFDB and SEQDB. The --scratch flag (which can usually be omitted) overrides the system's default location for temporary files. (This can be useful in certain compute cluster environments, where some heavily used nodes might have no free local scratch space in the system's default location, which can cause I/O errors to be reported.) An additional flag is:

pseq proj1 new-project --metameta /path/to/metameta.file

This argument, --metameta, points to a file that contains meta-information about a project's meta-information and is described in more detail here.

As mentioned above, the new-project command does nothing more than create a text file that describes the project (this step could be performed by hand with identical results). Project files can be edited manually, which is sometimes necessary (e.g. if the project folder or databases have been copied to an alternate location).

For example, the command:

pseq proj1 new-project --resources /share/data/hg19 --vcf data/*vcf.gz

will generate the following project file, proj1, in the current folder, assumed here to be /full/path/to. Here we also attach all compressed VCFs in the folder data/ to the project; these will not be loaded until the load-vcf command is run, however:

   /full/path/to/proj1_out/            OUTPUT
   /share/data/hg19                    RESOURCES
   /full/path/to/data/batch1.vcf.gz    VCF
   /full/path/to/data/batch2.vcf.gz    VCF
   /full/path/to/data/batch3.vcf.gz    VCF
   /full/path/to/proj1_out/vardb       VARDB
   /full/path/to/proj1_out/inddb       INDDB
   /share/data/hg19/locdb              LOCDB
   /share/data/hg19/refdb              REFDB
   /share/data/hg19/seqdb              SEQDB
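
Because the project file is plain text with a path column and a type column, it can be inspected with standard shell tools. The following is a minimal sketch, not part of PSEQ itself: it recreates the example file above in a temporary directory, lists the attached VCFs, and reports entries whose paths do not exist on the current machine. It assumes that paths contain no whitespace.

```shell
# Recreate the example project file in a throwaway directory (illustration only).
workdir=$(mktemp -d)
projfile="$workdir/proj1"
cat > "$projfile" <<'EOF'
   /full/path/to/proj1_out/            OUTPUT
   /share/data/hg19                    RESOURCES
   /full/path/to/data/batch1.vcf.gz    VCF
   /full/path/to/data/batch2.vcf.gz    VCF
   /full/path/to/data/batch3.vcf.gz    VCF
   /full/path/to/proj1_out/vardb       VARDB
   /full/path/to/proj1_out/inddb       INDDB
   /share/data/hg19/locdb              LOCDB
   /share/data/hg19/refdb              REFDB
   /share/data/hg19/seqdb              SEQDB
EOF

# List the VCF files attached to the project: column 2 holds the entry type.
vcfs=$(awk '$2 == "VCF" { print $1 }' "$projfile")
printf '%s\n' "$vcfs"

# Report entries whose path is not present on this machine.
# A missing database is not necessarily an error: it may simply not have
# been created yet.
missing=0
while read -r path type; do
    [ -n "$path" ] || continue          # skip blank lines
    if [ ! -e "$path" ]; then
        echo "missing: $type $path"
        missing=$((missing + 1))
    fi
done < "$projfile"
echo "$missing entries not found"
```

A check like this can be handy after copying a project to another machine, before running any PSEQ command against it.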

Note that full paths are written to the project specification file, even if they were not given in full on the command line. If you move the project specification file and associated databases, be sure to update these paths. If making the project file by hand, you can specify variables as #VAR=VALUE and subsequently refer to them as ${VAR} in the project file, which can make it easier to move the location of the project subsequently:

  #ROOT=/path/to/working
  #RES=/path/to/working
  ${ROOT}/vcf/vcf1.vcf                 VCF
  ${RES}/mylocdb                       LOCDB
  ...
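
As a rough sketch of how this helps when relocating a project, the snippet below writes a small project file by hand and then changes only the #ROOT definition. The file name proj2 and the path /new/location are hypothetical, chosen purely for illustration. The quoted 'EOF' delimiter matters: it stops the shell from expanding ${ROOT} and ${RES} itself, so they are written to the file literally.

```shell
# Work in a throwaway directory so that no existing file is overwritten.
workdir=$(mktemp -d)
projfile="$workdir/proj2"            # hypothetical project file name

# Write the project file; 'EOF' (quoted) keeps ${ROOT} and ${RES} literal.
cat > "$projfile" <<'EOF'
#ROOT=/path/to/working
#RES=/path/to/working
${ROOT}/vcf/vcf1.vcf                 VCF
${RES}/mylocdb                       LOCDB
EOF

# Relocate the project: only the variable definition needs to change,
# not every line that refers to it. /new/location is a placeholder.
sed 's|^#ROOT=.*|#ROOT=/new/location|' "$projfile" > "$projfile.tmp" \
    && mv "$projfile.tmp" "$projfile"

cat "$projfile"
```

Writing to a temporary file and renaming it avoids relying on sed's in-place option, which differs between implementations.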

Summary information for project databases

To obtain a summary of an entire project:

pseq /path/to/project summary

This command lists some fairly verbose but self-explanatory information about the core project databases (primarily VARDB, INDDB, LOCDB, REFDB and SEQDB) as well as some information about the known meta-information fields and the files specified in the project specification file. Individual databases can be summarised with the commands var-summary, loc-summary, ref-summary, seq-summary, meta-summary and file-summary.
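
To keep a record of each summary, the individual commands can be run in a loop. This is only a sketch: the function name summarise_all and the output file names (one .txt file per command, written to the current directory) are assumptions made for illustration. If pseq is not on the PATH, or the project file does not exist, the function prints the commands instead of running them.

```shell
# Run every per-database summary command named above for one project.
summarise_all() {
    project=$1
    for cmd in var-summary loc-summary ref-summary seq-summary \
               meta-summary file-summary; do
        if command -v pseq >/dev/null 2>&1 && [ -f "$project" ]; then
            # Hypothetical output naming: one text file per summary command.
            pseq "$project" "$cmd" > "$cmd.txt"
        else
            # Dry run: show what would have been executed.
            echo "pseq $project $cmd"
        fi
    done
}

summarise_all /path/to/project
```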

Making changes to projects

If a file has been loaded into a project, simply removing it from the project specification file will not remove it from the variant database. To achieve this, use the command:

pseq proj1 var-delete --id myfile

where myfile is the file-tag, or the integer file number, for that file. Several files can be removed in one command:

pseq /path/to/project delete-var --id CEU TSI

Here, CEU and TSI refer to file-tags. Alternatively, file-numbers can be used to specify which files to delete (i.e. the FILE_N numbers shown by the vardb-summary command).

To remove an entire VARDB, it is easiest to simply delete the file from the filesystem. The VARDB will be recreated when needed (e.g. upon next loading a VCF file).

To clear from a VARDB all meta-information previously attached using the attach-meta command, use the command:

pseq /path/to/my/project delete-meta --group myannot

where the group is the same as was given when performing attach-meta.

Attaching file-tags (names to refer to files in a project)

By default, files loaded into a project are given sequential numbers 1, 2, 3, ... as identifiers. It is also possible to attach more meaningful labels that can be used subsequently:

pseq /path/to/project tag-file --id 1 --name CEU

This means that in masks, one can use:

--mask file=CEU

rather than

--mask file=1

The same applies to --id arguments on command lines that expect a file in the project to be referenced.
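
When several files need tags, the tag-file commands can be generated from a short list of number and name pairs. The sketch below only prints the commands so they can be checked first; the function name print_tag_commands is hypothetical, and the pairing of file 2 with the tag TSI is an assumption made for illustration.

```shell
# Read "file-number tag" pairs from standard input and print one
# tag-file command per pair. $1 is the path to the project file.
print_tag_commands() {
    while read -r id name; do
        [ -n "$id" ] || continue        # skip blank lines
        echo "pseq $1 tag-file --id $id --name $name"
    done
}

print_tag_commands /path/to/project <<'EOF'
1 CEU
2 TSI
EOF
```

Once the printed commands look right, the output can be piped to sh to run them.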

Misc variant database operations

If a large amount of data has been removed from a VARDB (e.g. with delete-var), it can sometimes be beneficial to run:

pseq /path/to/project vacuum

to reduce the size of the VARDB. Under most circumstances, it should not be necessary to run this explicitly.

Software version information

To obtain the version information for PLINK/Seq, use the command below (no project is needed here, so a period character is used in place of the project name):

pseq . version

which yields output similar to the following:

  PSEQ             0.08(10-Mar-2012)
  PLINKSeq         0.08(10-Mar-2012)
  SQLITE3_HEADER   3.7.9
  SQLITE3_LIBRARY  3.7.9
  ZLIB             1.2.6
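
One practical use of this output is to confirm that the SQLite version PSEQ was compiled against (SQLITE3_HEADER) matches the one loaded at run time (SQLITE3_LIBRARY). That reading of the two fields is an assumption based on their names, and the function name check_sqlite_versions is hypothetical. The sketch below parses the sample output shown above, so it runs without PSEQ installed.

```shell
# Compare the SQLITE3_HEADER and SQLITE3_LIBRARY lines read from stdin.
# Exit status: 0 on match, 1 on mismatch, 2 if either line is absent.
check_sqlite_versions() {
    awk '
        $1 == "SQLITE3_HEADER"  { header = $2 }
        $1 == "SQLITE3_LIBRARY" { library = $2 }
        END {
            if (header == "" || library == "") { print "unknown"; exit 2 }
            # Appending "" forces a string comparison, not a numeric one.
            if ((header "") == (library "")) { print "match " header }
            else { print "mismatch " header " " library; exit 1 }
        }'
}

# Sample output copied from the listing above.
check_sqlite_versions <<'EOF'
  PSEQ             0.08(10-Mar-2012)
  PLINKSeq         0.08(10-Mar-2012)
  SQLITE3_HEADER   3.7.9
  SQLITE3_LIBRARY  3.7.9
  ZLIB             1.2.6
EOF
```

With PSEQ installed, the same check can be applied to live output with: pseq . version | check_sqlite_versions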
