PLINK/SEQ Overview

Some of the central design features of PLINK/SEQ include:

  • Flexible, extensible data representation: there is support for different types of variation data (multiallelic, phased, probabilistic genotypes as well as indels and structural variants) and extensible, typed meta-information for variants and genotypes.
  • Efficient random-access: sets of variants can be extracted based on genomic co-ordinates or other user-defined criteria, with on-the-fly filtering and annotation with a variety of criteria and datasources.
  • Large datasets: handles datasets much larger than can fit into memory (for example, whole-genome sequencing projects with tens of millions of variants and potentially genotype data on hundreds, or thousands, of individuals).
  • Key reference datasets: packaged with key genetic reference databases of gene transcripts, sequence and variation projects, including dbSNP and data from the 1000 Genomes Project, that can be appended to one's own dataset or used to filter it.
  • Decouples data-handling and analysis: a strong separation between data-handling and statistical genetic methods, such that others can use the library to support their own analytic methods development.
  • Focus on called variation: this library does not support direct access to read-level data, such as BAM files. Rather, the focus is on downstream analysis of called variants.
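
To make the idea of coordinate-based extraction with on-the-fly filtering concrete, here is a minimal Python sketch. It is not PLINK/SEQ code and does not reflect how the library stores or indexes data: the names Variant, VariantIndex and region are hypothetical, and the index is a toy in-memory one built on sorted positions and binary search.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a PLINK/SEQ type)."""
    chrom: str
    pos: int   # 1-based position
    ref: str
    alt: str


class VariantIndex:
    """Toy in-memory index: per-chromosome variant lists sorted by position."""

    def __init__(self, variants):
        by_chrom = defaultdict(list)
        for v in variants:
            by_chrom[v.chrom].append(v)
        # Sort once so that region queries can use binary search.
        self._variants = {c: sorted(vs, key=lambda v: v.pos)
                          for c, vs in by_chrom.items()}
        self._positions = {c: [v.pos for v in vs]
                           for c, vs in self._variants.items()}

    def region(self, chrom, start, end, keep=None):
        """Yield variants with start <= pos <= end that pass the `keep` filter."""
        positions = self._positions.get(chrom, [])
        lo = bisect_left(positions, start)
        hi = bisect_right(positions, end)
        for v in self._variants.get(chrom, [])[lo:hi]:
            if keep is None or keep(v):
                yield v


variants = [
    Variant("1", 1500, "A", "G"),
    Variant("1", 900, "C", "T"),
    Variant("1", 2500, "G", "GA"),   # an insertion
    Variant("2", 1200, "T", "C"),
]
index = VariantIndex(variants)

# Extract chromosome 1, positions 1000-3000, keeping single-nucleotide variants only.
snvs = list(index.region("1", 1000, 3000,
                         keep=lambda v: len(v.ref) == 1 and len(v.alt) == 1))
```

The filter is applied lazily while the region is being read, so only variants inside the requested interval are ever examined. An index for datasets larger than memory would live on disk rather than in Python lists.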

Structure

The PLINK/SEQ package consists of a number of interrelated software tools and associated databases, as illustrated by this cartoon:


At the core of PLINK/SEQ is a C++ library. This library is responsible for fundamentals such as collating genotype and phenotype data from various sources, and intersecting, filtering and annotating these data with other relevant types of data. The PLINK/SEQ library can be used through a number of user interfaces:

  • PSEQ command-line tool: pseq provides easy access to some of the most common functions of the library (e.g. loading and querying data) and also implements a number of useful statistical procedures (e.g. to summarise datasets or perform phenotype-genotype association tests).
  • R package for statistical computing: use R as an interface to the dynamically linked C/C++ extension library. This provides convenient access to the powerful statistical and visualisation tools available in R.
  • Web-browser: an exome-centric table-browser provides a simple, interactive tool for searching and reporting on a project's variant, genotypic and phenotypic data and meta-data.
  • C/C++ API: alternatively, one can use the C/C++ library API directly, to build analysis packages or other tools.

Data Sources

Human genetic variation data from large-scale sequencing studies are now most likely to be represented in the VCF file format. PLINK/SEQ supports VCF, along with a few other formats (e.g. PLINK binary PED files). PLINK/SEQ can operate on a single VCF file (which may be compressed) in single-file mode, although it is often beneficial to work in project mode, which allows a combination of multiple VCF files to be intersected with individual phenotype data, reference datasets, etc. PLINK/SEQ projects are described in more detail here.
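
As a rough illustration of what intersecting VCF genotypes with individual phenotype data involves, the following Python sketch parses VCF text and tallies carriers of a non-reference allele per phenotype group. It is a sketch under stated assumptions, not PLINK/SEQ's implementation or API: the function names open_vcf, read_vcf and carriers_by_phenotype, the sample IDs and the phenotype labels are all hypothetical, and only the GT field of each genotype is read.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF file as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_vcf(lines):
    """Yield (chrom, pos, ref, alt, {sample: GT}) for each VCF data line."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                      # skip blank lines and meta-information
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]          # sample IDs follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        fmt = fields[8].split(":") if len(fields) > 8 else []
        genotypes = {}
        if "GT" in fmt:
            gt_idx = fmt.index("GT")
            for sample, call in zip(samples, fields[9:]):
                genotypes[sample] = call.split(":")[gt_idx]
        yield chrom, int(pos), ref, alt, genotypes


def carriers_by_phenotype(genotypes, phenotypes):
    """Count samples carrying a non-reference allele, grouped by phenotype."""
    counts = {}
    for sample, gt in genotypes.items():
        if sample not in phenotypes:
            continue                      # keep only samples with phenotype data
        alleles = gt.replace("|", "/").split("/")
        if any(a not in ("0", ".") for a in alleles):
            group = phenotypes[sample]
            counts[group] = counts.get(group, 0) + 1
    return counts


# Made-up example data: one variant, three samples.
VCF_LINES = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t1500\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t0/0:9\t1|1:20",
]
PHENOTYPES = {"S1": "case", "S2": "control", "S3": "control"}

records = list(read_vcf(VCF_LINES))
counts = carriers_by_phenotype(records[0][4], PHENOTYPES)
```

To read from disk instead of the in-memory list, the file object returned by open_vcf can be passed straight to read_vcf, which processes one line at a time and so does not need the whole file in memory.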