Variants and Samples

This page gives an overview, with some simple examples, of how genetic variation data are represented within PLINK/Seq. If you want to use the R or C++ interfaces, this will be particularly important to understand.

Variants

A core concept is that of a variant (represented in the C++ library by the Variant class). A variant contains the following information:

  • A genomic co-ordinate (chromosome, and base-position; optionally, a second base-position)
  • An optional ID, e.g. a dbSNP (RefSNP) ID such as rs12345.
  • Population-level, abstracted meta-information (e.g. dbSNP membership, protein-coding status)
  • A consensus representation of the sample-specific meta-information (e.g. read-depth) and genotype information. This is called the consensus sample-variant (see below).
  • Optionally, a set of sample-variants, representing sample-specific data (see below).

Variants are created on-the-fly, as collections of sample-variants with the same (or overlapping) genomic co-ordinate(s). The variant (and specifically, the consensus sample-variant it contains) will be the primary unit for most subsequent analyses.
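As an illustration of how variants arise as collections of sample-variants, here is a minimal Python sketch. It is not the PLINK/Seq API: the SampleVariant and Variant dataclasses and the group_into_variants function are hypothetical stand-ins, and the sketch assumes the default rule noted under Misc notes (sample-variants are merged only when chromosome, start and stop all match exactly).

```python
from dataclasses import dataclass, field


@dataclass
class SampleVariant:
    """Hypothetical stand-in for one sample's record of a variant."""
    sample_id: int
    chrom: str
    start: int
    stop: int
    alleles: list                                   # reference first, then alternates
    genotypes: dict = field(default_factory=dict)   # individual -> genotype call


@dataclass
class Variant:
    """Hypothetical stand-in: sample-variants that share one co-ordinate."""
    chrom: str
    start: int
    stop: int
    sample_variants: list = field(default_factory=list)


def group_into_variants(sample_variants):
    """Collect sample-variants whose chromosome, start and stop all match."""
    by_position = {}
    for sv in sample_variants:
        key = (sv.chrom, sv.start, sv.stop)
        if key not in by_position:
            by_position[key] = Variant(sv.chrom, sv.start, sv.stop)
        by_position[key].sample_variants.append(sv)
    return list(by_position.values())


# Made-up records: two samples report position 1000, one reports position 2000.
records = [
    SampleVariant(1, "chr1", 1000, 1000, ["A", "C"], {"P1": "A/A"}),
    SampleVariant(2, "chr1", 1000, 1000, ["A", "C"], {"P6": "C/C"}),
    SampleVariant(1, "chr1", 2000, 2000, ["G", "T"], {"P1": "G/T"}),
]
variants = group_into_variants(records)
```

The consensus sample-variant is deliberately left out of this sketch; it would be built from each Variant's sample_variants list.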

Sample Variants

A sample-variant (implemented by the SampleVariant C++ class) describes the properties of a variant, along with individual genotype data, on a specific sample. One line of a VCF file will correspond to a single sample-variant. The VARDB (main variation database) stores data at the level of sample-variants. (Variants, described above, are collections of 1 or more sample-variants and created on-the-fly.)

A sample-variant contains the following information:

  • 1 or more allele codes (reference, and alternate alleles)
  • Sample-specific variant meta-information, quality scores, etc
  • Genotype calls (and meta-information) for all individuals in this sample

That is, these are all things that can differ between particular datasets for the same variant.

Consensus Sample Variant

A variant will therefore consist of one or more sample-variants. A special, additional "consensus" sample-variant is created for each variant: in most cases, this will be the main focus of subsequent analysis. The consensus sample-variant attempts to aggregate (and reconcile, if needed) variant and genotypic information across multiple samples for the same variant.

For a specific example: consider a project with three samples and a total of 11 unique individuals (labeled P1 to P11).

   Sample 1:  P1, P2, P3, P4, P5
   Sample 2:  P6, P7, P8, P9, P10
   Sample 3:  P1, P2, P8, P9, P11

If an analysis were based on the whole dataset (all three samples), the following consensus representation would be constructed:

   Consensus     [From sample(s)]
   P1            1,3
   P2            1,3
   P3            1
   P4            1
   P5            1 
   P6            2
   P7            2
   P8            2,3
   P9            2,3
   P10           2
   P11           3  
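The mapping above can be reproduced with a short Python sketch. The function name consensus_individuals and the dictionary layout are illustrative assumptions, not part of PLINK/Seq.

```python
# Sample membership from the example above.
samples = {
    1: ["P1", "P2", "P3", "P4", "P5"],
    2: ["P6", "P7", "P8", "P9", "P10"],
    3: ["P1", "P2", "P8", "P9", "P11"],
}


def consensus_individuals(samples):
    """Map each individual to the list of samples in which they appear."""
    found_in = {}
    for sample_id in sorted(samples):
        for person in samples[sample_id]:
            found_in.setdefault(person, []).append(sample_id)
    return found_in


found_in = consensus_individuals(samples)
for person, sample_ids in found_in.items():
    print(person, ",".join(str(s) for s in sample_ids))
```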

In other words, some individuals are represented more than once in the whole dataset. For any one specific variant, the calls might be as follows:

                 [From sample(s)]        Consensus

   P1            A/A   ./.          -->     A/A
   P2            A/C   A/A          -->     ./.
   P3            C/C                -->     C/C
   P4            C/C                -->     C/C
   P5            C/C                -->     C/C
   P6            C/C                -->     C/C
   P7            A/C                -->     A/C
   P8            ./.   ./.          -->     ./.
   P9            A/C   A/C          -->     A/C
   P10           C/C                -->     C/C
   P11           ./.                -->     ./.

That is, discordant non-missing genotype calls are set to missing in the consensus (P2), whereas a missing call in one sample does not override a non-missing call in another (P1). The original sample-specific calls remain available to the user of the library, i.e. in the returned R list object or C++ class.
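The reconciliation rule just described can be sketched in a few lines of Python. This is an illustration of the rule only, not PLINK/Seq code; consensus_genotype is a hypothetical name, and genotypes are compared as plain strings (so the sketch assumes alleles are written in a consistent order, e.g. always A/C rather than C/A).

```python
MISSING = "./."


def consensus_genotype(calls):
    """Reconcile one individual's genotype calls across samples.

    Missing calls are ignored. If the remaining calls all agree, that call
    is the consensus; if they disagree (or none remain), it is missing.
    """
    observed = {call for call in calls if call != MISSING}
    if len(observed) == 1:
        return observed.pop()
    return MISSING


# Per-individual calls from the table above.
calls = {
    "P1": ["A/A", "./."],
    "P2": ["A/C", "A/A"],
    "P3": ["C/C"],
    "P7": ["A/C"],
    "P8": ["./.", "./."],
    "P9": ["A/C", "A/C"],
    "P11": ["./."],
}
consensus = {person: consensus_genotype(c) for person, c in calls.items()}
```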

Meta-information (at the variant and genotype level) is not combined in the same way:

  • For variant-level meta-data, an attribute is represented in the Variant object if it is flagged as a static, population-level attribute, external to the samples at hand, e.g. dbSNP membership. This is typically done via the METAMETA file (specified by --metameta when creating a new project in PSEQ).
  • Otherwise, all variant meta-information is represented in the original sample-variant. Similarly, genotypic meta-information is propagated to the consensus object only if a single genotype call is observed for that individual. For example, below DP represents per-individual read-depth. P1 and P2 each have two calls, so no DP is set in the consensus; P3 has no associated DP field; for P4, only a single call is seen for this variant, and so the genotype meta-information for this individual is automatically set in the consensus slot.

                 [From sample(s)]        Consensus

   P1            DP=20 DP=0         -->
   P2            DP=11 DP=34        -->
   P3                               -->
   P4            DP=8               -->     DP=8
   ...
Depending on the context, one could imagine various rules that determine how meta-information across samples (at the variant and genotypic level) is combined (e.g. the sum, average, max/min, or more complex user-defined rules). This will be addressed in future releases.
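The current propagation rule for genotype meta-information (as opposed to the possible future rules) might be sketched as follows. This is a hypothetical illustration: consensus_genotype_meta is not a PLINK/Seq function, and it assumes that "a single genotype observed" means the individual appears in exactly one sample for this variant, which is how the DP table above reads.

```python
def consensus_genotype_meta(per_sample_meta):
    """Return the genotype meta-information to place in the consensus.

    per_sample_meta holds one dict per sample in which the individual
    has a call. Meta-information is copied only when there is exactly
    one such call; otherwise nothing is propagated.
    """
    if len(per_sample_meta) == 1:
        return dict(per_sample_meta[0])
    return {}


# DP values from the table above; P3 has a single call but no DP field.
meta = {
    "P1": [{"DP": 20}, {"DP": 0}],
    "P2": [{"DP": 11}, {"DP": 34}],
    "P3": [{}],
    "P4": [{"DP": 8}],
}
consensus_meta = {p: consensus_genotype_meta(m) for p, m in meta.items()}
```

The per-sample values are not lost under this rule; they simply stay with the original sample-variants rather than the consensus.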

Special cases

Although some complex projects will have multiple genotype calls from multiple data sources on the same individual, a more typical scenario will be one in which each individual has only a single set of genotype calls. In PLINK/Seq jargon, this is called a flat alignment (i.e. aligning individuals from different samples to the consensus set). For example, suppose we analysed the dataset above with the PSEQ option:

--mask file=1,2

The resulting data would then be flat in this sense:

   Consensus     [From sample(s)]
   P1            1
   P2            1
   P3            1
   P4            1
   P5            1 
   P6            2
   P7            2
   P8            2
   P9            2
   P10           2

Here there are still multiple samples (1 and 2), but each individual features in one and only one of them. In this instance, there will be no issue of inconsistent genotype calls or genotype-level meta-information needing to be resolved across samples.
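Whether a given selection of samples yields a flat alignment can be checked with a simple membership test. The sketch below is illustrative Python, not PLINK/Seq code; is_flat is a hypothetical helper, and the dictionary comprehension merely mimics the effect of restricting the analysis to samples 1 and 2.

```python
def is_flat(samples):
    """Return True if no individual appears in more than one sample."""
    seen = set()
    for members in samples.values():
        for person in members:
            if person in seen:
                return False
            seen.add(person)
    return True


all_samples = {
    1: ["P1", "P2", "P3", "P4", "P5"],
    2: ["P6", "P7", "P8", "P9", "P10"],
    3: ["P1", "P2", "P8", "P9", "P11"],
}

# Keep only samples 1 and 2, as the mask in the example above does.
masked = {sample_id: all_samples[sample_id] for sample_id in (1, 2)}

print(is_flat(all_samples))
print(is_flat(masked))
```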

In the simplest case, all data come from a single sample, in which case nothing needs to be reconciled in order to form the consensus. Here, the data (sample/variant meta-information, genotypes and genotype meta-information) are read straight into the consensus sample-variant and no further action is necessary.

Different observed alleles

In any case where more than one sample features (whether the alignment is flat or not), a further check is necessary to resolve what might be different allele encodings across samples. For example, a particular variant might be represented as an A/C variant in sample 1, but A/G in sample 2. The consensus sample-variant will represent the union of these, and individuals' genotypes will be recoded appropriately, i.e.

   Sample 1   A/C        P1 has genotype A/C (which is represented 0/1 in the VCF)
   Sample 2   A/G        P6 has genotype G/G (which is represented 1/1 in the VCF)
   Consensus  A/C,G      Internally, P1 is represented as 0/1, P6 as 2/2

If the reference allele differs between files, the reference for the consensus will be taken from the first sample (given any mask).
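The allele union and genotype recoding in the example above can be sketched as follows. This is an illustrative Python sketch rather than the library's implementation: merge_alleles and recode are hypothetical names, and the sketch assumes all samples share the same reference allele (it takes the reference from the first sample but does not attempt to recode genotypes from a sample whose reference differs).

```python
def merge_alleles(per_sample_alleles):
    """Build the consensus allele list.

    Each input list gives the reference allele first, then the alternates.
    The reference is taken from the first sample; alternate alleles are
    appended in the order in which they are first seen.
    """
    consensus = [per_sample_alleles[0][0]]
    for alleles in per_sample_alleles:
        for allele in alleles[1:]:
            if allele not in consensus:
                consensus.append(allele)
    return consensus


def recode(genotype, sample_alleles, consensus_alleles):
    """Translate a VCF-style index genotype (e.g. '1/1') to consensus indices."""
    recoded = []
    for code in genotype.split("/"):
        if code == ".":
            recoded.append(".")  # missing allele stays missing
        else:
            allele = sample_alleles[int(code)]
            recoded.append(str(consensus_alleles.index(allele)))
    return "/".join(recoded)


sample1 = ["A", "C"]   # A/C variant in sample 1
sample2 = ["A", "G"]   # A/G variant in sample 2
consensus_alleles = merge_alleles([sample1, sample2])

p1 = recode("0/1", sample1, consensus_alleles)   # P1 is A/C
p6 = recode("1/1", sample2, consensus_alleles)   # P6 is G/G
```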

Whether such differences represent the true state of nature (different alternate alleles seen in different samples) or a simple annotation error in compiling the original data is, of course, not determined by PLINK/Seq. (A list of variants with different reference and alternate alleles across files can be obtained from the clusters PSEQ command.)

Variant Groups

A variant group is a collection of Variants, for example representing all variants in a gene. (This is implemented by the VariantGroup C++ class.) For all commands that are based on sets of genes (e.g. assoc and g-view), analysis iterates over all variant groups, as specified by the mask (e.g. the loc.group component).
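Conceptually, forming variant groups amounts to assigning variants to named intervals and then iterating over the groups. The Python sketch below illustrates this idea only; build_variant_groups, the gene names and the co-ordinates are all made up for the example and do not reflect how the VariantGroup class or the mask are implemented.

```python
def build_variant_groups(variants, loci):
    """Assign each variant to every locus whose interval contains it.

    variants: list of (chromosome, position, name) tuples
    loci:     dict mapping a gene name to (chromosome, start, stop)
    """
    groups = {gene: [] for gene in loci}
    for chrom, pos, name in variants:
        for gene, (g_chrom, g_start, g_stop) in loci.items():
            if chrom == g_chrom and g_start <= pos <= g_stop:
                groups[gene].append(name)
    return groups


# Hypothetical gene intervals and variants.
loci = {
    "GENE_A": ("chr1", 1000, 2000),
    "GENE_B": ("chr2", 500, 900),
}
variants = [
    ("chr1", 1200, "var1"),
    ("chr1", 1900, "var2"),
    ("chr2", 600, "var3"),
    ("chr2", 5000, "var4"),   # falls in no gene, so it joins no group
]

groups = build_variant_groups(variants, loci)
for gene, members in groups.items():
    print(gene, members)      # a gene-based test would run once per group
```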

Misc notes

Some caveats / points under development:

  • By default, multiple sample-variants are treated as a single variant based on genomic position, if both start and stop match exactly. Other options exist to merge partially overlapping variants (to be described), although it is not always possible to unambiguously reconcile genotypes (without phase information).
  • Variant-groups are typically assumed to consist of variants on the same chromosome. This is not an intrinsic constraint, but a few convenience functions will not work (specifically, those that return the genomic span of variants in a group or display the genomic position).