Project Management
This page describes the key commands for creating and working with projects.
- Creating new projects: description of the new-project command and options
- Summary information: var-summary, loc-summary, etc.
- Changes to projects: adding, removing and changing project files
- File tags: referring to specific files in a project
- Misc: miscellaneous activities with variant databases
- Version information: code and database versions
Create a new project
To create a new PLINK/Seq project, for example called proj1, use the new-project command:
pseq proj1 new-project
This will do nothing more than create a text file proj1 in the working directory (or overwrite an existing one). A number of additional arguments can be specified to determine some of the project components: for example, VCF files, resource databases or other miscellaneous files. For example:
pseq proj1 new-project --resources /share/data/hg19 --scratch /tmp/myfolder
would set the folder /share/data/hg19/ as the location where PSEQ will expect (or create, if they do not exist) the LOCDB, REFDB and SEQDB. The --scratch flag (which can usually be omitted) overrides the system's default location for temporary files. This can be useful in certain compute cluster environments, where heavily used nodes might have no free local scratch space in the system's default location, which can cause I/O errors to be reported. Additional flags include:
pseq proj1 new-project --metameta /path/to/metameta.file
This argument, --metameta, points to a file that contains meta-information about a project's meta-information and is described in more detail here.
As mentioned above, the new-project command does nothing more than create a text file that describes the project (this step could be performed by hand with identical results). Project files can also be edited manually, which is sometimes necessary (e.g. if the project folder or databases have been copied to an alternate location).
For example, the command:
pseq proj1 new-project --resources /share/data/hg19 --vcf data/*vcf.gz
will generate the following project file, proj1, in the current folder (assumed here to be /full/path/to). The command also attaches all compressed VCFs in the folder data/ to the project, although these will not be loaded until the load-vcf command is run:
/full/path/to/proj1_out/ OUTPUT
/share/data/hg19 RESOURCES
/full/path/to/data/batch1.vcf.gz VCF
/full/path/to/data/batch2.vcf.gz VCF
/full/path/to/data/batch3.vcf.gz VCF
/full/path/to/proj1_out/vardb VARDB
/full/path/to/proj1_out/inddb INDDB
/share/data/hg19/locdb LOCDB
/share/data/hg19/refdb REFDB
/share/data/hg19/seqdb SEQDB
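To inspect a project file like the one above from a script, a minimal Python sketch might look as follows. It assumes (this is not stated in the documentation) that each entry sits on its own line as a path followed by a whitespace-separated type keyword, and that paths contain no spaces. The helper name parse_project_file is hypothetical and not part of PLINK/Seq.

```python
from collections import defaultdict


def parse_project_file(text):
    """Group the paths in a project file by their type keyword.

    Assumes one 'PATH TYPE' entry per line. Blank lines, lines starting
    with '#', and lines without two fields are skipped (assumptions of
    this sketch, not documented PLINK/Seq behaviour).
    """
    entries = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last run of whitespace: path on the left, type on the right.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        path, kind = parts
        entries[kind].append(path)
    return dict(entries)


# A shortened copy of the example project file.
example = """\
/full/path/to/proj1_out/ OUTPUT
/share/data/hg19 RESOURCES
/full/path/to/data/batch1.vcf.gz VCF
/full/path/to/data/batch2.vcf.gz VCF
/full/path/to/proj1_out/vardb VARDB
"""

project = parse_project_file(example)
print(project["VCF"])
```

Grouping by type keeps repeated entries such as VCF together as a list, while single entries such as VARDB end up as one-element lists.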
Note that full paths are written to the project specification file even if they were not specified on the command line. If you move the project specification file and associated databases, be sure to update these paths. If making the project file by hand, you can define variables as #VAR=VALUE and subsequently refer to them as ${VAR} in the project file, which makes it easier to relocate the project later:
#ROOT=/path/to/working
#RES=/path/to/working
${ROOT}/vcf/vcf1.vcf VCF
${RES}/mylocdb LOCDB
...
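The substitution described above can be illustrated with a short Python sketch. It is only an approximation of what PSEQ does internally: it assumes that definitions appear on their own lines before they are used, and it leaves unknown ${VAR} references untouched (how PSEQ treats undefined variables is not stated here). The function name expand_project_variables is hypothetical.

```python
import re


def expand_project_variables(text):
    """Expand ${VAR} references using '#VAR=VALUE' definition lines.

    Returns the non-definition lines with variables substituted.
    """
    variables = {}
    expanded = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        definition = re.fullmatch(r"#(\w+)=(.*)", line)
        if definition:
            # Record the definition; it does not appear in the output.
            variables[definition.group(1)] = definition.group(2)
            continue
        # Replace each ${VAR}; unknown names are left as written.
        line = re.sub(
            r"\$\{(\w+)\}",
            lambda m: variables.get(m.group(1), m.group(0)),
            line,
        )
        expanded.append(line)
    return expanded


example = """\
#ROOT=/path/to/working
#RES=/path/to/working
${ROOT}/vcf/vcf1.vcf VCF
${RES}/mylocdb LOCDB
"""

resolved = expand_project_variables(example)
for entry in resolved:
    print(entry)
```

With this layout, relocating the project only requires changing the #ROOT and #RES lines rather than every path.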
Summary information for project databases
To obtain a summary of an entire project:
pseq /path/to/project summary
This command lists some fairly verbose but self-explanatory information about the core project databases (primarily VARDB, INDDB, LOCDB, REFDB and SEQDB) as well as some information about the known meta-information fields and the files specified in the project specification file. Individual databases can be summarised with the commands var-summary, loc-summary, ref-summary, seq-summary, meta-summary and file-summary.
Making changes to projects
If a file has been loaded into a project, simply removing it from the project specification file will not remove it from the variant database. To achieve this, use the command:
pseq proj1 var-delete --id myfile
where myfile is the file-tag, or the integer file number, for that file. Several files can be removed in one command:
pseq /path/to/project var-delete --id CEU TSI
Here, CEU and TSI refer to file-tags. Alternatively, file numbers can be used to specify which files to delete (i.e. the FILE_N numbers shown by the var-summary command).
To remove an entire VARDB, it is easiest to simply delete the file from the filesystem. The VARDB will be recreated when needed (e.g. upon next loading a VCF file).
To clear all meta-information previously attached to a VARDB with the attach-meta command, use:
pseq /path/to/my/project delete-meta --group myannot
where the group is the same as was given when performing attach-meta.
Attaching file-tags (names to refer to files in a project)
By default, files loaded into a project are given sequential numbers 1, 2, 3, ... as identifiers. It is also possible to attach more meaningful labels that can be used subsequently:
pseq /path/to/project tag-file --id 1 --name CEU
This means that in masks, one can use:
--mask file=CEU
rather than
--mask file=1
Similarly, file-tags can be used with any --id argument that expects a file in the project to be referenced.
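When several files need tags, the tag-file command lines can be generated from a mapping rather than typed one by one. The following Python sketch only builds the argument lists and does not run them. The mapping of file numbers to tags is hypothetical, and tag_file_commands is an illustrative helper, not part of PLINK/Seq.

```python
def tag_file_commands(project, tags):
    """Return one 'pseq PROJECT tag-file' argument list per file number.

    'tags' maps integer file numbers to the tag to attach.
    """
    return [
        ["pseq", project, "tag-file", "--id", str(file_id), "--name", name]
        for file_id, name in sorted(tags.items())
    ]


# Hypothetical mapping of file numbers to tags.
commands = tag_file_commands("/path/to/project", {1: "CEU", 2: "TSI"})
for argv in commands:
    print(" ".join(argv))
```

Each argument list could then be passed to subprocess.run(argv, check=True) on a machine where pseq is installed.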
Misc variant database operations
If a large amount of data has been removed from a VARDB (e.g. with var-delete), it can sometimes be beneficial to run:
pseq /path/to/project vacuum
to reduce the size of the VARDB. Under most circumstances, this should not be necessary to perform explicitly.
Software version information
To obtain the version information for PLINK/Seq, use the version command. Because no project is specified here, a period character is used in place of the project name:
pseq . version
which yields output similar to the following:
PSEQ 0.08(10-Mar-2012)
PLINKSeq 0.08(10-Mar-2012)
SQLITE3_HEADER 3.7.9
SQLITE3_LIBRARY 3.7.9
ZLIB 1.2.6
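If the version information needs to be recorded or checked in a script, the output can be turned into a dictionary. The Python sketch below assumes the output consists of whitespace-separated component/version pairs with no spaces inside either field, as in the sample above. The real output may differ between releases, and parse_version_output is a hypothetical helper.

```python
def parse_version_output(text):
    """Pair up alternating component names and version strings."""
    tokens = text.split()
    # Even positions are component names, odd positions their versions.
    return dict(zip(tokens[0::2], tokens[1::2]))


# Sample output copied from the documentation above.
output = (
    "PSEQ 0.08(10-Mar-2012) PLINKSeq 0.08(10-Mar-2012) "
    "SQLITE3_HEADER 3.7.9 SQLITE3_LIBRARY 3.7.9 ZLIB 1.2.6"
)

versions = parse_version_output(output)

# Flag a mismatch between the SQLite header and the linked library.
if versions["SQLITE3_HEADER"] != versions["SQLITE3_LIBRARY"]:
    print("warning: SQLite header and library versions differ")
else:
    print("SQLite header and library versions match")
```

Because the parser splits on any whitespace, it works whether the pairs appear on one line or one per line.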