PLINK/Seq Projects
A PLINK/Seq project comprises a collection of files and databases, the core of which are currently:
-
A single plain-text project specification file that lists the other files in the project.
-
Project-specific data, typically containing the user's own data. The two primary databases contain variant and genotype information (labeled VARDB) and individual phenotype information (labeled INDDB). Genetic variation data are assumed to come primarily from VCF (Variant Call Format) files, although other file types are supported.
-
Reference datasets, typically containing publicly available data. The three primary databases contain information on loci, reference variants, and sequence (respectively labeled LOCDB, REFDB and SEQDB). Pre-built reference databases with commonly used datasets are available from the resources page. Reference databases can be shared across multiple projects and users, but they are also designed to be easily augmented with new information (via standard file formats such as GTF, VCF or FASTA), allowing users to generate project-specific reference databases. Examples of the types of genomic reference datasets used in PLINK/Seq projects include:
-
In LOCDB, gene transcripts (e.g. RefSeq, CCDS), targeted regions in sequencing studies (e.g. whole-exome studies) and also sets of genes (e.g. based on GO categories or KEGG pathways).
-
In REFDB, lists of known variants from dbSNP and the 1000 Genomes project (along with associated meta-information on population frequency or coding status), or disease-specific databases such as the Human Gene Mutation Database (HGMD).
-
In SEQDB, the human reference sequences (typically hg18 or hg19).
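The layout described above can be summarized as a small lookup table. The sketch below is illustrative only and is not PLINK/Seq code or part of its API: the names `Database`, `CORE_DATABASES` and `labels_for` are hypothetical, and only the five database labels and their descriptions are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    """One core database of a project (hypothetical helper type)."""
    label: str     # label used in the documentation, e.g. "VARDB"
    scope: str     # "project" (user's own data) or "reference" (shared data)
    contents: str  # short description of what the database holds


# The five core databases named in the text, in the order they are introduced.
CORE_DATABASES = (
    Database("VARDB", "project", "variant and genotype information"),
    Database("INDDB", "project", "individual phenotype information"),
    Database("LOCDB", "reference", "loci: gene transcripts, targeted regions, gene sets"),
    Database("REFDB", "reference", "reference variants: dbSNP, 1000 Genomes, HGMD"),
    Database("SEQDB", "reference", "reference sequence: hg18 or hg19"),
)


def labels_for(scope):
    """Return the labels of all core databases with the given scope."""
    return [db.label for db in CORE_DATABASES if db.scope == scope]


if __name__ == "__main__":
    print("project-specific:", labels_for("project"))
    print("reference:", labels_for("reference"))
```

Separating the two scopes mirrors the point made above: reference databases can be shared across projects and users, while VARDB and INDDB hold the data specific to one project.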
-