This page describes some basic file formats, convenience functions and analysis options for common copy number polymorphism (CNP) data. Support for rare copy number variant (CNV) data is described on a separate page.

Common copy number variation is represented at the level of specific SNP genotypes, for example allowing A, AAB or AABB calls (copy numbers 1, 3 and 4, respectively) as well as the canonical AA, AB and BB genotypes. These formats are specified via the "generic variant" (--gfile) option.

Here we assume that some other software package, such as Birdsuite, has previously been applied to the raw data, either to make calls for specific copy-number variable genotypes or to identify particular genomic regions in individuals that are deletions or duplications. That is, PLINK only offers functions for downstream analysis of CNV data, not for identifying CNVs in the first place; this is similar to the distinction between SNP genotype calling and the subsequent analysis of those calls.

Format for common CNVs (generic variant format)

For common CNVs that might also have meaningful allelic/SNP variation, it can be desirable to represent and analyse them as genotypes rather than as segments. The rest of the page considers non-segmental specification of CNVs: that is, copy-number variable specific genotype calls, such as A or AAB.

Such data are represented with the generic variant file format, and read into PLINK with the command:

plink --gfile mydata

where three files are assumed to exist:

     mydata.fam     (describes individuals, as usual)
     mydata.map     (describes variants, as usual)
     mydata.gvar    (new file format)

The .gvar file is in long format, always with 7 fields and one row per genotype. (Note that the references to the first and second parents below do not imply that paternal or maternal origin must be known, or that it is used.)

  FID       Family ID
  IID       Individual ID  (i.e. person should appear in .fam file)
  NAME      Variant name (should appear in .map file)
  ALLELE1   Code for allele from first parent
  DOSAGE1   Copy number for first allele
  ALLELE2   Code for allele from second parent
  DOSAGE2   Copy number for second allele

Some examples of using this format to represent different genotypes are shown here:

     1 1  var1  A 1  C 1    -> normal het
     1 1  var2  A 2  C 1    -> AAC genotype
     1 1  var3  0 1  0 1    -> missing genotype
     1 1  var4  0 0  0 0    -> homozygous deletion

     1 1  var5  4 1  7 1    -> e.g. 4/7 genotype
     2 1  var5  4 1  8 1    -> e.g. 4/8 genotype

     1 1  var6  A 0.95 C 1.05  -> expected allele dosage (e.g. from imputation)
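
As a rough illustration of the seven-field layout, the following Python sketch parses individual .gvar rows and reports the total copy number as DOSAGE1 + DOSAGE2. The names GvarRecord, parse_gvar_line and copy_number are hypothetical helpers made up for this example and are not part of PLINK; the sketch also assumes that fields are whitespace-delimited and that the "->" annotations shown above are not part of the file itself.

```python
from typing import NamedTuple


class GvarRecord(NamedTuple):
    """One row of a .gvar file (hypothetical helper type, not a PLINK structure)."""
    fid: str
    iid: str
    name: str
    allele1: str
    dosage1: float
    allele2: str
    dosage2: float


def parse_gvar_line(line: str) -> GvarRecord:
    """Split one whitespace-delimited .gvar row into its 7 fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    fid, iid, name, a1, d1, a2, d2 = fields
    # Dosages are read as floats so that fractional values (e.g. 0.95) are kept.
    return GvarRecord(fid, iid, name, a1, float(d1), a2, float(d2))


def copy_number(rec: GvarRecord) -> float:
    """Total copy number of the genotype: the sum of the two dosages."""
    return rec.dosage1 + rec.dosage2


# Rows taken from the examples above, without the "->" annotations.
rows = [
    "1 1  var1  A 1  C 1",
    "1 1  var2  A 2  C 1",
    "1 1  var4  0 0  0 0",
    "1 1  var6  A 0.95 C 1.05",
]
records = [parse_gvar_line(r) for r in rows]
for rec in records:
    print(rec.name, rec.allele1, rec.allele2, copy_number(rec))
```

Reading the dosages as floating-point values is a choice made here only so that the imputation-style row can be parsed by the same function.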

As currently implemented, all the codings below are equivalent, each specifying an AA homozygote:

     1 1  var7  A 1 A 1
     2 1  var7  A 0 A 2
     3 1  var7  A 2 A 0
     4 1  var7  X 0 A 2
     5 1  var7  0 0 A 2

For a missing (null) genotype, ALLELE1 and ALLELE2 should both be set to 0 and, by convention, DOSAGE1 and DOSAGE2 should both be 1 (indicating a 0 0 genotype). If a DOSAGE value is 0, however, the value of the corresponding ALLELE column does not matter. Thus, genotypes can have DOSAGE >= 1 for one allele and DOSAGE = 0 for the other allele: A 0 B 3 means 3 copies of allele B and no copies of A; X 0 B 3 means the same thing, because the X is ignored when DOSAGE=0.
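
The equivalence rules can be sketched in a few lines of Python. This is only an illustration of the rules as stated on this page, not PLINK's internal code: the function name canonical_genotype is hypothetical, dosages are assumed to be integers, and treating any allele code 0 that has a non-zero dosage as a missing call is an assumption of this sketch.

```python
from typing import Optional, Tuple


def canonical_genotype(allele1: str, dosage1: int,
                       allele2: str, dosage2: int) -> Optional[Tuple[str, ...]]:
    """Reduce one integer-dosage call to an order-free tuple of allele copies.

    Returns None for a missing genotype and an empty tuple for a
    homozygous deletion (no copies of any allele).
    """
    # An allele code paired with dosage 0 carries no information, so drop it.
    present = [(a, d) for a, d in ((allele1, dosage1), (allele2, dosage2)) if d > 0]
    # Assumption: allele code "0" with a non-zero dosage marks a missing call.
    if any(a == "0" for a, _ in present):
        return None
    expanded = []
    for allele, dosage in present:
        expanded.extend([allele] * dosage)
    # Sorting discards which column (first or second) an allele came from.
    return tuple(sorted(expanded))


# The five codings of the AA homozygote listed above.
codings = [
    ("A", 1, "A", 1),
    ("A", 0, "A", 2),
    ("A", 2, "A", 0),
    ("X", 0, "A", 2),
    ("0", 0, "A", 2),
]
print({canonical_genotype(*c) for c in codings})
```

Sorting the expanded alleles is a deliberate simplification: it throws away which column a copy came from, in keeping with the note that parental origin is neither required nor used.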

When loading this kind of file, PLINK parses allelic and copy number variation; by default it currently looks for integer dosage calls at this stage. No functions are implemented yet for fractional counts, although the datatype exists.

Alleles and CNVs are then appropriately counted. PLINK assesses and records for each variant whether there is allelic and/or copy number variation, and this influences downstream analysis. Currently variation is defined as at least one individual varying, but in the future thresholds will be added (e.g. to treat a site as a CNV only if, say, 1% of all individuals have a non-canonical copy number).
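
One way to picture the "at least one individual varying" rule is the sketch below. It is a guess at the logic rather than a description of PLINK's code: the function variation_flags is hypothetical, the canonical copy number is assumed to be 2, copy number variation is taken to mean that at least one non-missing call has a total dosage other than 2, and allelic variation is taken to mean that more than one allele code is observed.

```python
def variation_flags(calls, canonical_cn=2):
    """Flag copy number and allelic variation for one variant.

    calls is an iterable of (allele1, dosage1, allele2, dosage2) tuples,
    one per individual, with integer dosages.
    """
    alleles = set()
    non_canonical = False
    for a1, d1, a2, d2 in calls:
        # Ignore allele codes whose dosage is 0.
        present = [(a, d) for a, d in ((a1, d1), (a2, d2)) if d > 0]
        # Assumption: allele code "0" with non-zero dosage is a missing call.
        if any(a == "0" for a, _ in present):
            continue
        alleles.update(a for a, _ in present)
        if d1 + d2 != canonical_cn:
            non_canonical = True
    return {"cnv": non_canonical, "allelic": len(alleles) > 1}


print(variation_flags([("A", 1, "C", 1), ("A", 2, "C", 1)]))
print(variation_flags([("A", 1, "A", 1), ("A", 0, "A", 2), ("0", 1, "0", 1)]))
```

A frequency threshold of the kind mentioned above could be added by counting non-canonical calls instead of stopping at the first one.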

The basic summary output is also in "long format". In the future this will be expanded and reformatted, e.g. to include specific allelic/CNV frequencies or counts, stratification by phenotype, etc. This summary file is called

     plink.gvar.summary

and always contains three columns, as illustrated here

            NAME        FIELD        VALUE
            var1          CHR            1
            var1           BP            1
            var1          CNV          yes
            var1      ALLELIC          yes
            var1       GCOUNT         1000
            var1            B       0.6031
            var1            A       0.3969
            var1          [2]         0.56
            var1          [3]        0.378
            var1          [4]        0.062
            var1          B/B        30:38
            var1         BB/B        66:60
            var1        BB/BB        42:20
            var1          A/B      142:101
            var1         A/BB       161:91
            var1          A/A       162:87

The CN states are always written as [x] to distinguish them from allele codes, which may also be numeric. In this example, 37.8% of the sample have copy number 3. There can be more than 2 CN states for a given variant.

If the trait is binary, then the counts for copy-number specific genotypes (e.g. A/BB) will be given separately for cases and controls, separated by a colon.
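
The relationship between the genotype rows and the other fields of the example can be checked with a short Python sketch. It assumes that the two numbers in each pair are the case and control counts (only their sum is used), that allele codes are single characters, and that the allele frequencies are computed over all allele copies; none of this is taken from PLINK's source.

```python
from collections import Counter

# Genotype rows copied from the example summary above.
genotype_counts = {
    "B/B": "30:38",
    "BB/B": "66:60",
    "BB/BB": "42:20",
    "A/B": "142:101",
    "A/BB": "161:91",
    "A/A": "162:87",
}

total = 0
cn_counts = Counter()      # individuals per copy number state
allele_counts = Counter()  # allele copies summed over individuals

for genotype, field in genotype_counts.items():
    first, second = (int(x) for x in field.split(":"))
    n = first + second
    copies = genotype.replace("/", "")  # e.g. "A/BB" -> "ABB"
    total += n
    cn_counts[len(copies)] += n
    for allele in copies:
        allele_counts[allele] += n

cn_freq = {cn: count / total for cn, count in sorted(cn_counts.items())}
n_copies = sum(allele_counts.values())
allele_freq = {a: count / n_copies for a, count in allele_counts.items()}

print("GCOUNT", total)
print("CN", cn_freq)
print("alleles", {a: round(f, 4) for a, f in allele_freq.items()})
```

Under these assumptions the sketch reproduces the GCOUNT, [2], [3], [4], A and B values shown in the example.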

Association models for combined SNP and common CNV data

PLINK currently implements the following regression model (logistic or linear), applicable to biallelic SNPs residing within CNPs:

  Y ~ b0 + b1.(A+B) + b2.(A-B)
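
To make the two predictors concrete, the sketch below builds them from per-individual counts of the A and B alleles and fits the linear form of the model by ordinary least squares. On a natural reading, A+B is the total copy number and A-B is the allelic contrast, but that interpretation, the helper names (design_row, solve, fit_linear) and the toy data are all assumptions of this example; it is not PLINK's implementation, and it does not cover the logistic case or p-values.

```python
def design_row(n_a, n_b):
    """Intercept, A+B and A-B for one individual with n_a copies of A and n_b of B."""
    return [1.0, float(n_a + n_b), float(n_a - n_b)]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_linear(genotypes, y):
    """Least-squares estimates (b0, b1, b2) via the normal equations X'X b = X'y."""
    X = [design_row(a, b) for a, b in genotypes]
    k = 3
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up toy data: (copies of A, copies of B) per individual, with a
# noise-free phenotype generated from known coefficients 1, 2 and 3.
genotypes = [(2, 0), (1, 1), (0, 2), (1, 2), (0, 3), (2, 2)]
y = [1.0 + 2.0 * (a + b) + 3.0 * (a - b) for a, b in genotypes]
b0, b1, b2 = fit_linear(genotypes, y)
print(b0, b1, b2)
```

Because the toy phenotype is generated without noise, the fit should recover the generating coefficients up to rounding error, which makes the roles of the two terms easy to check.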

When an association test is performed, extra lines are appended to the plink.gvar.summary file:

            var1       B(SNP)     -0.05955
            var1       P(SNP)      0.09085
            var1       B(CNP)      0.09314
            var1       P(CNP)       0.3809
            var1   B(CNP|SNP)       0.5638
            var1   P(CNP|SNP)    0.0006768
            var1   B(SNP|CNP)      -0.2042
            var1   P(SNP|CNP)    0.0002242
            var1   P(SNP&CNP)    0.0007413

Covariates can be added with --covar as with --linear or --logistic. The coefficients and p-values for the SNP and CNP will reflect this, although the specific coefficients and p-values for the covariates themselves are not shown in the output.

This section is not finished -- more details will be added online presently.

This document last modified Wednesday, 25-Jan-2017 11:39:26 EST