Analysis of dosage data
This page describes features to analyse "dosage" SNP datasets, for
example, from the imputation packages BEAGLE or MACH. The --dosage
command will take data in a variety of formats (but is best suited to
BEAGLE-style output, with one SNP per line), potentially compressed and
distributed across multiple files, and perform association tests
between the phenotype and the dosage data (expected allele counts), as
well as outputting merged, filtered or hard-called datasets.
Basic usage
The basic usage is
plink --dosage myfile.dat --fam mydata.fam
which will create a file
plink.assoc.dosage
which contains the fields
CHR     Chromosome code, if map file specified
SNP     SNP code
BP      Base-pair position, if map file specified
A1      Allele 1 code
A2      Allele 2 code
FRQ     Frequency of A1, from dosage data
INFO    R-squared quality metric / information content
OR      Odds ratio for association (or BETA for quantitative traits)
SE      Standard error of effect estimate
P       p-value for association tests
If a MAP file is also specified
plink --dosage myfile.dat --fam mydata.fam --map mymap.map
then a) extra CHR and BP fields will be reported in the output, b)
only SNPs that are present in the MAP file will be analysed and
reported.
The basic format of a dosage file specifies that each row of the file
corresponds to a SNP (i.e. similar to a transposed PED file, rather
than one individual per row). There are three default columns that
should appear before the dosage data:
SNP A1 A2 Dosage data ...
For example
SNP A1 A2 F1 I1 F2 I2 F3 I3
rs0001 A C 0.98 0.02 1.00 0.00 0.00 0.01
rs0002 G A 0.00 1.00 0.00 0.00 0.99 0.01
In this case, we have data for two SNPs on three individuals. Here,
each genotype is represented by two numbers (alternative
representations are described below). For example, the two numbers
for the first SNP represent the probability of an A/A, then
an A/C genotype. The probability of a C/C is
naturally 1 minus the sum of these.
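The two-probability representation maps to an expected allele count by simple arithmetic. The sketch below is illustrative only: it assumes the two numbers are P(A1/A1) and P(A1/A2) as described above, the function names are hypothetical, and it is not PLINK's own code.

```python
def expected_a1_count(p_hom_a1, p_het):
    """Expected number of A1 alleles (0..2) from two genotype probabilities.

    Assumes p_hom_a1 = P(A1/A1) and p_het = P(A1/A2); the remaining
    probability, P(A2/A2) = 1 - p_hom_a1 - p_het, contributes no A1 alleles.
    """
    return 2.0 * p_hom_a1 + p_het


def pairs(values):
    """Group a flat list of numbers into consecutive (first, second) pairs."""
    return list(zip(values[0::2], values[1::2]))


# Dosage fields of rs0001 from the example above (three individuals).
row = [0.98, 0.02, 1.00, 0.00, 0.00, 0.01]
dosages = [round(expected_a1_count(a, b), 2) for a, b in pairs(row)]
print(dosages)
```

Each individual contributes one value on the 0..2 scale; halving it would give the 0..1 scale that the dose1 option refers to.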
Individuals in the dosage data but not the FAM file are ignored
(unless the noheader option is specified, see
below). Individuals in the FAM file but not the dosage file are
removed from the dataset.
Association tests are performed within a linear or logistic regression
framework. As such, many standard options such as --covar or
--within can be specified. See the main page on association
for more details. Not all options are available however: for example,
permutation is not possible with dosage data files.
The INFO metric is calculated from the entire file, as the ratio of
the empirical to the expected variance in dosage. Values closer to 1
indicate better expected quality of imputation. Values can be above 1;
note that values much greater than 1 can indicate a strong departure
from HWE.
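As a rough illustration of a ratio of empirical to expected dosage variance, the sketch below assumes dosages on the 0..2 scale and takes the expected variance to be the binomial (Hardy-Weinberg) value 2p(1-p), with p the A1 frequency estimated from the dosages themselves. The function name, the choice of expected variance and the use of a population (divide-by-n) variance are all assumptions made for illustration; PLINK's exact computation may differ.

```python
def info_ratio(dosages):
    """Ratio of observed dosage variance to the variance expected under HWE.

    Assumes dosages are expected A1 counts on the 0..2 scale.
    Returns None when the expected variance is zero (monomorphic SNP).
    """
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2.0                      # estimated A1 allele frequency
    expected_var = 2.0 * p * (1.0 - p)  # variance of a Binomial(2, p) count
    if expected_var <= 0.0:
        return None
    empirical_var = sum((d - mean) ** 2 for d in dosages) / n
    return empirical_var / expected_var


# Certain genotypes in HWE proportions (p = 0.5).
print(info_ratio([0.0, 1.0, 1.0, 2.0]))
# Same frequency, but every dosage shrunk toward the mean (uncertain imputation).
print(info_ratio([0.5, 1.0, 1.0, 1.5]))
# An excess of homozygotes, i.e. a departure from HWE.
print(info_ratio([0.0, 0.0, 2.0, 2.0]))
```

Under these assumptions the three calls give a ratio equal to 1, below 1 and above 1 respectively, which mirrors the interpretation given in the text.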
Optionally, if extra fields exist they can be skipped, via the
skip0, skip1 and skip2 options (see below):
{skip0} SNP {skip1} A1 A2 {skip2} Dosage data ...
By default, we expect a header row for each dosage file, which has the
same header fields for the leading columns and then lists the
FID and IID codes for the individuals in that file.
If there is no header (noheader option), then PLINK assumes that
the order and number of individuals in each dosage file
correspond to the FAM file specified (after any exclusions, e.g. from
--remove, etc).
As described below, dosage data can be represented in a number of
ways. Dosage data can also be spread across multiple files, by
specifying the list option, e.g.
plink --dosage myfile.lst list --fam mydata.fam
where myfile.lst is a list of file names (full paths can be
specified if the dosage files are in different directories), e.g.
chr1.dose
chr2.dose
chr3.dose
...
Options
The options available are as follows:
list        Indicates that the file following --dosage is a list of dosage
            files (as opposed to being a dosage file itself)
sepheader   Indicates that the ID lists are in separate files (requires 'list')
noheader    Indicates that there are no headers available
skip0=N     Number of fields to skip before SNP
skip1=N     Number of fields to skip between SNP and A1
skip2=N     Number of fields to skip between A2 and genotype data
dose1       Dosage data is 0..1, not 0..2 scale
format=N    Dosage, two probabilities or three (N=1,2,3)
Z           All input (dosage) files and output files compressed
Zin         All input files are compressed
Zout        Output file will be compressed
occur       Helper function: count number of occurrences
Most of these options modify the expected format of the input
files. Examples are given in the section below.
Examples of different input format options
Based on the example data file shown above, here are some examples
of how the same data could be formatted differently. That is,
these are all equivalent and will give the same results. The purpose
of these options is to reduce the number of steps likely to be
required in preparing the data file(s) for analysis. The major fixed
specification is that the data are essentially in SNP-by-individual
(one row is one SNP) format in all cases.
Split by SNP, single dosage
Here each file contains all individuals, has a header row and
contains single dosages of the A1 allele.
a1.dose
SNP A1 A2 F1 I1 F2 I2 F3 I3
rs0001 A C 1.98 2.00 0.01
a2.dose
SNP A1 A2 F1 I1 F2 I2 F3 I3
rs0002 G A 1.00 0.00 1.99
The command would be
plink --fam d.fam --dosage a.txt list format=1
where a.txt is a text file, with 2 fields, SNP batch and dosage file name
1 a1.dose
2 a2.dose
in which the numeric codes indicate different batches of
SNPs. Obviously, in real examples a given file would likely contain a
very large number of SNPs (e.g. all SNPs for a given chromosome).
Split by individuals, with some leading nuisance fields
b1.dose
SNP A1 A2 F R2 F1 I1 F2 I2
rs0001 A C 0.02 0.98 0.98 0.02 1.00 0.00
rs0002 G A 0.5 0.23 0.00 1.00 0.00 0.00
b2.dose
SNP A1 A2 F R2 F3 I3
rs0001 A C 0.02 0.8 0.00 0.01
rs0002 G A 0.5 0.55 0.99 0.01
The command to read these data is then
plink --fam d.fam --dosage b.txt list skip2=2
where b.txt is a text file, with 1 field (file name), as
there is only a single batch of SNPs (i.e. all dosage files contain
the same set of SNPs, in the same order).
b1.dose
b2.dose
The skip2=2 option tells PLINK to ignore the two fields (F and R2)
that sit between A2 and the dosage data.
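The column layout {skip0} SNP {skip1} A1 A2 {skip2} data can be made concrete with a small parser. This is only a sketch of the layout as described on this page, applied to the first row of b1.dose; the function name `parse_dosage_line` and its return shape are hypothetical and do not correspond to anything in PLINK itself.

```python
def parse_dosage_line(line, skip0=0, skip1=0, skip2=0):
    """Split one whitespace-delimited row into (snp, a1, a2, remaining fields).

    Layout assumed: {skip0} SNP {skip1} A1 A2 {skip2} data...
    Also works for the header row, where the remaining fields are FID/IID codes.
    """
    tok = line.split()
    i = skip0                 # jump over any fields before SNP
    snp = tok[i]
    i += 1 + skip1            # jump over fields between SNP and A1
    a1, a2 = tok[i], tok[i + 1]
    i += 2 + skip2            # jump over fields between A2 and the data
    return snp, a1, a2, tok[i:]


header = "SNP A1 A2 F R2 F1 I1 F2 I2"
row = "rs0001 A C 0.02 0.98 0.98 0.02 1.00 0.00"

_, _, _, ids = parse_dosage_line(header, skip2=2)
snp, a1, a2, data = parse_dosage_line(row, skip2=2)

# Pair up FID/IID codes, and pair up the two probabilities per individual.
individuals = list(zip(ids[0::2], ids[1::2]))
probs = [(float(x), float(y)) for x, y in zip(data[0::2], data[1::2])]
print(snp, a1, a2, dict(zip(individuals, probs)))
```

Applying the same skip counts to the header and to every data row keeps the ID columns and the dosage columns aligned.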
Split by SNP and individual, without headers, different individual order and compressed
In this third example, the same dataset is spread across four
files. Note how the assignment of individuals to files, and their
order within each file, changes between different batches of
SNPs. As long as such changes are accurately represented in the
headers (whether these are in the dosage file itself, or in separate
header files, as in this example), this is allowed.
c1.dose.gz
rs0001 A C 0.98 0.02 1.00 0.00
c2.dose.gz
rs0001 A C 0.00 0.01
c3.dose.gz
rs0002 G A 0.00 1.00
c4.dose.gz
rs0002 G A 0.99 0.01 0.00 0.00
with the accompanying list of IDs in the auxiliary files
c1.lst
F1 I1
F2 I2
c2.lst
F3 I3
c3.lst
F1 I1
c4.lst
F3 I3
F2 I2
The command to read these data is then
plink --fam d.fam --dosage c.txt list sepheader Zin --write-dosage
where c.txt is a text file, with 3 fields (SNP batch, file name, separate header)
1 c1.dose.gz c1.lst
1 c2.dose.gz c2.lst
2 c3.dose.gz c3.lst
2 c4.dose.gz c4.lst
Note that in this example, the individuals are distributed differently
between files in the first versus the second batch of SNPs. It is also
not necessary that all individuals are specified; any individual who
is not listed will be set to have a missing datapoint for those SNPs.
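The bookkeeping implied here (match each file's ID list to the FAM order, and leave unlisted individuals missing) can be sketched as follows. The function `merge_batch` and its in-memory data structures are hypothetical, chosen for illustration; the values are those of the second SNP batch (c3 and c4) in the example above.

```python
def merge_batch(fam_ids, files):
    """Combine the files of one SNP batch into rows ordered as in the FAM file.

    fam_ids: list of (FID, IID) tuples in FAM order.
    files:   list of (ids, rows) pairs, where ids is that file's list of
             (FID, IID) tuples and rows maps SNP -> list of per-individual
             values, in the same order as ids.
    Individuals absent from every file are left as None (missing).
    """
    position = {ind: k for k, ind in enumerate(fam_ids)}
    merged = {}
    for ids, rows in files:
        for snp, values in rows.items():
            out = merged.setdefault(snp, [None] * len(fam_ids))
            for ind, value in zip(ids, values):
                if ind in position:   # individuals not in the FAM file are ignored
                    out[position[ind]] = value
    return merged


fam_ids = [("F1", "I1"), ("F2", "I2"), ("F3", "I3")]
c3 = ([("F1", "I1")], {"rs0002": [(0.00, 1.00)]})
c4 = ([("F3", "I3"), ("F2", "I2")], {"rs0002": [(0.99, 0.01), (0.00, 0.00)]})

merged = merge_batch(fam_ids, [c3, c4])
print(merged["rs0002"])

# If c4 were left out, F2/I2 and F3/I3 would simply be missing for this SNP.
print(merge_batch(fam_ids, [c3])["rs0002"])
```

With both files present, the merged row for rs0002 carries the same per-individual values, in FAM order, as the single-file example at the top of the page.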
The main constraint is that between files within a particular
genomic-batch, the length and SNP order must be exactly the same.
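Since the constraint applies per batch, it can help to group the entries of a list file by batch before checking them. The sketch below infers the layout from the number of fields on each line (one, two or three, as in the b.txt, a.txt and c.txt examples); that inference, the default batch label "1" and the function name `parse_list_file` are assumptions for illustration, since PLINK itself is told the layout through the list and sepheader options.

```python
def parse_list_file(lines):
    """Group list-file entries by SNP batch.

    Each non-empty line is assumed to hold one of:
      1 field : dosage file                      (single batch)
      2 fields: batch, dosage file
      3 fields: batch, dosage file, header file  (separate headers)
    Returns a dict: batch -> list of (dosage_file, header_file or None),
    with batches in order of first appearance.
    """
    batches = {}
    for line in lines:
        tok = line.split()
        if not tok:
            continue
        if len(tok) == 1:
            batch, dose, hdr = "1", tok[0], None
        elif len(tok) == 2:
            batch, dose, hdr = tok[0], tok[1], None
        elif len(tok) == 3:
            batch, dose, hdr = tok
        else:
            raise ValueError("expected 1 to 3 fields, got: " + line)
        batches.setdefault(batch, []).append((dose, hdr))
    return batches


c_txt = [
    "1 c1.dose.gz c1.lst",
    "1 c2.dose.gz c2.lst",
    "2 c3.dose.gz c3.lst",
    "2 c4.dose.gz c4.lst",
]
for batch, entries in parse_list_file(c_txt).items():
    print(batch, entries)
```

Every file listed under the same batch label would then be expected to contain the same SNPs, in the same order.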