Common copy number polymorphism (CNP) data

This page describes some basic file formats, convenience functions and analysis options for common copy number polymorphism (CNP) data. Support for rare copy number variant (CNV) data is described here. Common copy number variation is represented for specific SNP genotypes, for example, allowing A, AAB or AABB calls (copy number 1, 3 and 4 respectively) as well as the canonical AA, AB and BB genotypes. These formats are specified via the "generic variant" (--gfile) option.

Here we assume that some other software package, such as Birdsuite, has previously been used on the raw data, either to make calls for specific copy-number variable genotypes or to identify particular genomic regions in individuals that are deletions or duplications. That is, PLINK only offers functions for downstream analysis of CNV data, not for identifying CNVs in the first place; this is similar to the distinction between SNP genotype calling and the subsequent analysis of those calls.
Format for common CNVs (generic variant format)

For common CNVs that might also have meaningful allelic/SNP variation, it can be desirable to represent and analyse them as genotypes rather than as segments. The rest of the page considers this non-segmental specification of CNVs: that is, copy-number variable specific genotype calls, such as A or AAB. Such data are represented with the generic variant file format, and read into PLINK with the command:
    plink --gfile mydata

where three files are assumed to exist:
    mydata.fam   (describes individuals, as usual)
    mydata.map   (describes variants, as usual)
    mydata.gvar  (new file format)

The .gvar file is in long format: always with 7 fields, one row per genotype (note that the reference to the first and second parents below does not imply that paternal or maternal origin should be known or is used):
    FID       Family ID
    IID       Individual ID (i.e. person should appear in .fam file)
    NAME      Variant name (should appear in .map file)
    ALLELE1   Code for allele from first parent
    DOSAGE1   Copy number for first allele
    ALLELE2   Code for allele from second parent
    DOSAGE2   Copy number for second allele

Some examples of using this format to represent different genotypes are shown here:
    1 1 var1 A 1    C 1      -> normal het
    1 1 var2 A 2    C 1      -> AAC genotype
    1 1 var3 0 1    0 1      -> missing individual
    1 1 var4 0 0    0 0      -> homozygous deletion
    1 1 var5 4 1    7 1      -> e.g. 4/7 genotype
    2 1 var5 4 1    8 1      -> e.g. 4/8 genotype
    1 1 var6 A 0.95 C 1.05   -> expected allele dosage (e.g. from imputation)

As currently implemented, all the codings below would be equivalent, i.e. specifying an AA homozygote:
    1 1 var7 A 1 A 1
    2 1 var7 A 0 A 2
    3 1 var7 A 2 A 0
    4 1 var7 X 0 A 2
    5 1 var7 0 0 A 2

That is, for a missing (null) genotype, ALLELE1 and ALLELE2 should both be set to 0, and by convention, DOSAGE1 and DOSAGE2 should be 1 (indicating a 0 0 genotype). But if a DOSAGE value is 0, then the value of the corresponding ALLELE column does not matter. Thus, genotypes can have DOSAGE >= 1 for one allele and DOSAGE = 0 for the other allele: A 0 B 3 means 3 copies of allele B and no copies of A; X 0 B 3 means the same thing because the X is ignored when DOSAGE=0.

When loading this kind of file, PLINK will parse allelic and copy number variation; currently, by default, it looks for integer dosage calls in this part of the process. No functions are implemented yet for fractional counts, but the datatype exists. Alleles and CNVs are then appropriately counted.

PLINK assesses and records for each variant whether there is allelic and/or copy number variation, and this influences downstream analysis. Currently variation is defined as at least one individual varying, but in the future thresholds will be added (e.g. to treat a site as a CNV only if, say, 1% of all individuals have a non-canonical copy number).

The basic summary output is also in "long format": in the future this will be expanded and reformatted, e.g. to include specific allelic/CNV frequencies or counts, stratification by phenotype, etc. This summary file is called
    plink.gvar.summary

and always contains three columns, as illustrated here:
    NAME  FIELD    VALUE
    var1  CHR      1
    var1  BP       1
    var1  CNV      yes
    var1  ALLELIC  yes
    var1  GCOUNT   1000
    var1  B        0.6031
    var1  A        0.3969
    var1  [2]      0.56
    var1  [3]      0.378
    var1  [4]      0.062
    var1  B/B      30:38
    var1  BB/B     66:60
    var1  BB/BB    42:20
    var1  A/B      142:101
    var1  A/BB     161:91
    var1  A/A      162:87

Copy-number (CN) states are always written in square brackets, [x], to distinguish them from allele codes, which may also be numeric; in this example, 37.8% of the sample have copy number 3. There can be more than 2 CN states for a given variant. If the trait is binary, then the counts for copy-number specific genotypes (e.g. A/BB) will be given separately for cases and controls, separated by a colon.
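To make the .gvar conventions and the summary values more concrete, here is a minimal Python sketch. It is not PLINK's code: the function names (parse_gvar_line, allele_copies, summarise) are hypothetical, only integer dosages are handled, and treating any 0 allele with a non-zero dosage as missing is an assumption that goes slightly beyond the convention stated above. The tallying rule (CN frequencies over individuals, allele frequencies over allele copies) is inferred from the var1 numbers, which it reproduces; the page does not state that PLINK computes them this way.

```python
from collections import Counter

def parse_gvar_line(line):
    """Split one .gvar row into its 7 fields; dosages are read as integers."""
    fid, iid, name, a1, d1, a2, d2 = line.split()
    return fid, iid, name, [(a1, int(d1)), (a2, int(d2))]

def allele_copies(calls):
    """Return a Counter of allele -> copy number, or None if missing."""
    copies = Counter()
    for allele, dosage in calls:
        if dosage == 0:
            continue          # the allele code is ignored when DOSAGE is 0
        if allele == "0":
            return None       # assumption: "0" with non-zero dosage = missing
        copies[allele] += dosage
    return copies             # an empty Counter is a homozygous deletion

def summarise(genotypes):
    """Return (CN frequencies, allele frequencies) over non-missing genotypes."""
    called = [g for g in genotypes if g is not None]
    cn_counts = Counter(sum(g.values()) for g in called)
    allele_counts = Counter()
    for g in called:
        allele_counts.update(g)
    total = sum(allele_counts.values())
    cn_freq = {cn: n / len(called) for cn, n in cn_counts.items()}
    allele_freq = {a: n / total for a, n in allele_counts.items()} if total else {}
    return cn_freq, allele_freq

# The five equivalent AA codings all reduce to two copies of A.
for row in ["1 1 var7 A 1 A 1", "2 1 var7 A 0 A 2", "3 1 var7 A 2 A 0",
            "4 1 var7 X 0 A 2", "5 1 var7 0 0 A 2"]:
    assert allele_copies(parse_gvar_line(row)[3]) == Counter({"A": 2})

# Genotype counts from the var1 summary (cases + controls).
var1 = ([Counter({"B": 2})] * 68 + [Counter({"B": 3})] * 126 +
        [Counter({"B": 4})] * 62 + [Counter({"A": 1, "B": 1})] * 243 +
        [Counter({"A": 1, "B": 2})] * 252 + [Counter({"A": 2})] * 249)
cn_freq, allele_freq = summarise(var1)
print({cn: round(f, 3) for cn, f in cn_freq.items()})
# {2: 0.56, 3: 0.378, 4: 0.062}
print({a: round(f, 4) for a, f in allele_freq.items()})
# {'B': 0.6031, 'A': 0.3969}
```

Counting allele copies rather than individuals is what makes B come out at 0.6031 here: the 1000 genotypes carry 2502 allele copies in total, of which 1509 are B.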
Association models for combined SNP and common CNV data

PLINK currently implements the following regression model (logistic or linear), applicable to biallelic SNPs residing within CNPs:
    Y ~ b0 + b1.(A+B) + b2.(A-B)

When an association test is performed, extra lines will be appended to the plink.gvar.summary file:
    var1  B(SNP)      -0.05955
    var1  P(SNP)      0.09085
    var1  B(CNP)      0.09314
    var1  P(CNP)      0.3809
    var1  B(CNP|SNP)  0.5638
    var1  P(CNP|SNP)  0.0006768
    var1  B(SNP|CNP)  -0.2042
    var1  P(SNP|CNP)  0.0002242
    var1  P(SNP&CNP)  0.0007413

Covariates can be added with --covar, as with --linear or --logistic. The coefficients and p-values for the SNP and CNP will reflect this, although the specific coefficients and p-values for the covariates themselves are not shown in the output.

This section is not finished -- more details will be added online presently.
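As an illustration of the model Y ~ b0 + b1.(A+B) + b2.(A-B), the following Python sketch fits the linear version by ordinary least squares. It assumes that A and B denote each individual's copy counts of the two alleles, so that A+B is the total copy number and A-B is the allelic contrast. The helper names (design_row, solve, fit_linear) are hypothetical, this is not PLINK's implementation, and it makes no attempt to reproduce the marginal and conditional tests or the p-values listed above.

```python
def design_row(a_copies, b_copies):
    """Return the predictors [intercept, A+B, A-B] for one individual."""
    return [1.0, a_copies + b_copies, a_copies - b_copies]

def solve(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_linear(genotypes, phenotypes):
    """Least-squares estimates [b0, b1, b2] via the normal equations."""
    rows = [design_row(a, b) for a, b in genotypes]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, phenotypes)) for i in range(k)]
    return solve(xtx, xty)

# Toy data: (A copies, B copies) per individual, with a noise-free phenotype
# built from known coefficients so the fit can be checked by eye.
genotypes = [(2, 0), (1, 1), (0, 2), (1, 2), (0, 3), (2, 2), (1, 0), (0, 0)]
phenotypes = [1.0 + 0.5 * (a + b) - 0.25 * (a - b) for a, b in genotypes]
print([round(b, 6) for b in fit_linear(genotypes, phenotypes)])
# [1.0, 0.5, -0.25]
```

If every individual has the canonical two copies, A+B is constant and collinear with the intercept, so b1 cannot be estimated; the sketch raises an error in that case.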