This page describes features for analysing "dosage" SNP datasets, for example the output of the imputation packages BEAGLE or MACH. The --dosage command takes data in a variety of formats (it is best suited to BEAGLE-style output, with one SNP per line), potentially compressed and distributed across multiple files. It performs association tests between the phenotype and the dosage data (expected allele counts), and can also output merged, filtered or hard-called datasets.

Basic usage

The basic usage is

plink --dosage myfile.dat --fam mydata.fam

which will create a file

     plink.assoc.dosage

which contains the fields

     CHR   Chromosome code, if map file specified
     SNP   SNP code
      BP   Base-pair position, if map file specified
      A1   Allele 1 code
      A2   Allele 2 code
     FRQ   Frequency of A1, from dosage data
    INFO   R-squared quality metric / information content
      OR   Odds ratio for association (or BETA for quantitative traits)
      SE   Standard error of effect estimate
       P   p-value for association tests

If a MAP file is also specified

plink --dosage myfile.dat --fam mydata.fam --map mymap.map

then a) extra CHR and BP fields are reported in the output, and b) only SNPs that are present in the MAP file are analysed and reported.

In the basic dosage file format, each row of the file corresponds to a SNP (i.e. similar to a transposed PED file, rather than one individual per row). Three default columns should appear before the dosage data:

     SNP  A1  A2  Dosage data ...

For example

        SNP  A1  A2   F1 I1       F2 I2        F3 I3
     rs0001   A   C   0.98 0.02   1.00 0.00    0.00 0.01 
     rs0002   G   A   0.00 1.00   0.00 0.00    0.99 0.01

In this case, we have data for two SNPs on three individuals. Here, each genotype is represented by two numbers (alternative representations are described below). For example, the two numbers given for each individual at the first SNP are the probability of an A/A genotype, then of an A/C genotype. The probability of a C/C genotype is 1 minus the sum of these.
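As a minimal sketch of how the two-number representation relates to an expected allele count, the Python below converts each pair of probabilities into an A1 dosage on the 0..2 scale. It assumes the dosage is 2 x P(A1/A1) + P(A1/A2); the function name expected_a1_dosage is hypothetical and is not part of PLINK.

```python
def expected_a1_dosage(p_hom_a1: float, p_het: float) -> float:
    """Expected A1 allele count (0..2) from P(A1/A1) and P(A1/A2).

    Assumption: dosage = 2 * P(A1/A1) + 1 * P(A1/A2).
    """
    p_hom_a2 = 1.0 - p_hom_a1 - p_het  # implied third genotype probability
    if p_hom_a2 < -1e-6:
        raise ValueError("genotype probabilities sum to more than 1")
    return 2.0 * p_hom_a1 + p_het


# rs0001 from the example: (P(A/A), P(A/C)) for individuals I1, I2, I3
rs0001 = [(0.98, 0.02), (1.00, 0.00), (0.00, 0.01)]
dosages = [round(expected_a1_dosage(a, b), 2) for a, b in rs0001]
print(dosages)
```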

Individuals in the dosage data but not the FAM file are ignored (unless the noheader option is specified, see below). Individuals in the FAM file but not the dosage file are removed from the dataset.

Association tests are performed within a linear or logistic regression framework. As such, many standard options such as --covar or --within can be specified. See the main page on association for more details. Not all options are available, however: for example, permutation is not possible with dosage data files.

The INFO metric is calculated from the entire file, as the ratio of the empirical to the expected variance in dosage. Values closer to 1 indicate better expected quality of imputation. Values can be above 1; values much greater than 1 can indicate strong departure from HWE.
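This page does not give the exact formula, so the following Python is only an illustrative sketch of a variance-ratio metric of this kind. It assumes dosages on the 0..2 scale, an expected variance of 2p(1-p) under HWE (with p estimated as half the mean dosage), and a population (divide-by-n) empirical variance. The name info_metric is hypothetical, and PLINK's own calculation may differ in detail.

```python
from statistics import fmean


def info_metric(dosages):
    """Ratio of empirical to expected dosage variance (0..2 scale)."""
    n = len(dosages)
    mean = fmean(dosages)
    p = mean / 2.0                      # estimated A1 allele frequency
    expected_var = 2.0 * p * (1.0 - p)  # assumption: binomial variance under HWE
    if expected_var == 0.0:
        return float("nan")             # monomorphic SNP: ratio is undefined
    empirical_var = sum((d - mean) ** 2 for d in dosages) / n
    return empirical_var / expected_var


print(info_metric([0.0, 1.0, 2.0, 1.0]))  # genotypes in HWE proportions, known with certainty
print(info_metric([0.0, 0.0, 2.0, 2.0]))  # no heterozygotes: an excess of variance
```

Under these assumptions, the second call shows how a ratio above 1 can arise when genotype proportions depart from HWE.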

Optionally, if extra fields exist they can be skipped, via the skip0, skip1 and skip2 options (see below):

     {skip0}  SNP  {skip1}  A1  A2  {skip2}  Dosage data  ...
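To make the layout concrete, here is a small Python sketch that splits one whitespace-delimited dosage row according to the skip0, skip1 and skip2 counts. The helper split_dosage_row is hypothetical and only mirrors the column layout shown above; it is not PLINK's parser.

```python
def split_dosage_row(line, skip0=0, skip1=0, skip2=0):
    """Return (snp, a1, a2, values) for one whitespace-delimited row."""
    fields = line.split()
    i = skip0                    # fields to skip before SNP
    snp = fields[i]
    i += 1 + skip1               # fields to skip between SNP and A1
    a1, a2 = fields[i], fields[i + 1]
    i += 2 + skip2               # fields to skip between A2 and the dosage data
    values = [float(x) for x in fields[i:]]
    return snp, a1, a2, values


# A row with two extra fields between A2 and the dosage data (skip2=2)
row = "rs0001   A   C   0.02  0.98  0.98 0.02   1.00 0.00"
parsed = split_dosage_row(row, skip2=2)
print(parsed)
```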

By default, we expect a header row in each dosage file that has the same header fields for the leading columns, and then lists the FID and IID codes for the individuals in that file. If there is no header (noheader option), then PLINK assumes that the order and number of individuals in each dosage file correspond to the specified FAM file (after any exclusions, e.g. from --remove, etc).

As described below, dosage data can be represented in a number of ways. Dosage data can also be spread across multiple files, if the list option is specified, e.g.

plink --dosage myfile.lst list --fam mydata.fam

where myfile.lst is a list of file names (full paths can be specified if the dosage files are in different directories), e.g.

     chr1.dose
     chr2.dose
     chr3.dose
     ...

Options

The options available are as follows:

     list          Indicates that the file following --dosage is a list of dosage files
                   (as opposed to being a dosage file itself).
     sepheader     Indicates that the ID lists are in separate files (requires 'list')
     noheader      Indicates that there are no headers available

     skip0=N       Number of fields to skip before SNP
     skip1=N       Number of fields to skip between SNP and A1
     skip2=N       Number of fields to skip between A2 and genotype data

     dose1         Dosage data is 0..1, not 0..2 scale
     format=N      Dosage, two probabilities or three (N=1,2,3)

     Z             All input (dosage) files and output files compressed
     Zin           All input files are compressed
     Zout          Output file will be compressed 

     occur         Helper function: count number of occurrences

Most of these options modify the expected format of the input files. Examples are given in the section below.

Examples of different input format options

Based on the example data file shown above, here are some examples of how the same data could be formatted differently. That is, these are all equivalent and will give the same results. The purpose of these options is to reduce the number of steps likely to be required in preparing the data file(s) for analysis. The major fixed specification is that the data are in SNP-by-individual (one row is one SNP) format in all cases.

Split by SNP, single dosage

Here each file contains all individuals, has a header row and contains single dosages of the A1 allele.

a1.dose

        SNP  A1  A2   F1 I1 F2 I2 F3 I3
     rs0001   A   C   0.02 0.00 1.99

a2.dose

        SNP  A1  A2   F1 I1 F2 I2 F3 I3
     rs0002   G   A   1.00 2.00 0.01

The command would be

plink --fam d.fam --dosage a.txt list format=1

where a.txt is a text file, with 2 fields, SNP batch and dosage file name

     1 a1.dose
     2 a2.dose

in which the numeric codes indicate different batches of SNPs. Obviously, in real examples a given file would likely contain a very large number of SNPs (e.g. all SNPs for a given chromosome).

Split by individuals, with some leading nuisance fields

b1.dose

        SNP  A1  A2   F     R2    F1 I1       F2 I2    
     rs0001   A   C   0.02  0.98  0.98 0.02   1.00 0.00
     rs0002   G   A   0.5   0.23  0.00 1.00   0.00 0.00

b2.dose

        SNP  A1  A2  F     R2    F3 I3
     rs0001   A   C  0.02  0.8   0.00 0.01 
     rs0002   G   A  0.5   0.55  0.99 0.01

The command to read these data is then

plink --fam d.fam --dosage b.txt list skip2=2

where b.txt is a text file, with 1 field (file name), as there is only a single batch of SNPs (i.e. all dosage files contain the same set of SNPs, in the same order).

     b1.dose
     b2.dose

The skip2 option tells PLINK to ignore the F and R2 fields.

Split by SNP and individual, without headers, different individual order and compressed

In this third example, the same dataset is spread across four files. Note how the order of which individuals are in which file, and the order within the file, changes between different batches of SNPs. As long as such changes are accurately represented in the headers (whether these are in the dosage file itself, or in separate header files, as in this example), this is allowed.

c1.dose.gz

     rs0001   A   C   0.98 0.02   1.00 0.00

c2.dose.gz

     rs0001   A   C   0.00 0.01

c3.dose.gz

     rs0002   G   A   0.00 1.00

c4.dose.gz

     rs0002   G   A   0.99 0.01   0.00 0.00

with the accompanying list of IDs in the auxiliary files

c1.lst

     F1 I1
     F2 I2

c2.lst

     F3 I3

c3.lst

     F1 I1

c4.lst

     F3 I3
     F2 I2

The command to read these data is then

plink --fam d.fam --dosage c.txt list sepheader Zin --write-dosage

where c.txt is a text file, with 3 fields (SNP batch, file name, separate header)

     1 c1.dose.gz c1.lst
     1 c2.dose.gz c2.lst
     2 c3.dose.gz c3.lst
     2 c4.dose.gz c4.lst
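The three list-file layouts shown on this page (file name only; SNP batch and file name; SNP batch, file name and separate header) can be told apart as in the Python sketch below. The function parse_list_line and the placeholder batch label are hypothetical, and any layout not shown on this page is rejected rather than guessed at.

```python
SINGLE_BATCH = "1"  # placeholder label used when no batch column is present


def parse_list_line(line, sepheader=False):
    """Interpret one line of a dosage list file."""
    fields = line.split()
    if not sepheader and len(fields) == 1:
        # file name only: a single batch of SNPs
        return {"batch": SINGLE_BATCH, "file": fields[0], "header": None}
    if not sepheader and len(fields) == 2:
        # SNP batch, file name
        return {"batch": fields[0], "file": fields[1], "header": None}
    if sepheader and len(fields) == 3:
        # SNP batch, file name, separate header file
        return {"batch": fields[0], "file": fields[1], "header": fields[2]}
    raise ValueError(f"layout not covered by this sketch: {line!r}")


entry = parse_list_line("2 c4.dose.gz c4.lst", sepheader=True)
print(entry)
```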

Note that in this example, the individuals are distributed differently between files in the first versus the second batch of SNPs. It is also not necessary that all individuals are specified: any individual who is not specified will be set to have a missing datapoint. The main constraint is that, between files within a particular genomic batch, the length and SNP order must be exactly the same.
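Before running a multi-file analysis, it may be worth checking this constraint. The Python below is a minimal sketch that works on SNP ID lists already read from each file (reading and decompressing the files is left out). The function check_batches is hypothetical and is not a PLINK feature.

```python
def check_batches(entries):
    """entries: iterable of (batch, file_name, snp_ids) tuples.

    Returns a list of (batch, file_name, reference_file) for every file whose
    SNP list differs in length or order from the first file seen in its batch.
    """
    reference = {}
    problems = []
    for batch, name, snps in entries:
        snps = list(snps)
        if batch not in reference:
            reference[batch] = (name, snps)
        elif snps != reference[batch][1]:
            problems.append((batch, name, reference[batch][0]))
    return problems


# SNP lists corresponding to the four files in the example above
entries = [
    ("1", "c1.dose.gz", ["rs0001"]),
    ("1", "c2.dose.gz", ["rs0001"]),
    ("2", "c3.dose.gz", ["rs0002"]),
    ("2", "c4.dose.gz", ["rs0002"]),
]
problems = check_batches(entries)
print(problems)
```

An empty result means every file in a batch lists the same SNPs in the same order.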

This document last modified Wednesday, 25-Jan-2017 11:39:26 EST