Working with dosage genotype data

This page describes some new features in the (to be released) 1.09 version of PSEQ.

These features are desigend to facilitate working with soft-called genotype data, as result from most genotype imputation procedures. For most of this document, we use the term dosage generically, to refer to both posterior genotype probability vectors as well as expected allele counts.

Main input files

The primary consideration for input into a PSEQ project is that the data are ordered with rows corresponding to markers (as in a VCF file), not individuals (as in a PED file, for example). If this is not the case, you will need to transpose those files first.

Beyond the core restriction of rows corresponding to markers, not individuals, there is some flexibility in terms of the exact format of the data.

  • Dosage data can be in a single file, or spread across multiple files (by marker and/or individual)
  • A header row may be present in a dosage file, containing the individual IDs for that file. Alternatively, the individual IDs can be specified in a separate file if no header is present.
  • Variant IDs, map positions and/or (biallelic) allele codes can be specified in the first five columns of a dosage file. Alternatively, this information can be present in a separate map file.
  • Arbitrary meta-information (e.g. such as imputation quality scores) from a separate file can be appended to the project.
  • Input can be either as single floating-point values representing dosage (on either a 0..1 or 0..2 scale), or as 2 or 3 posterior genotype probabilities (such that the three implied genotype probabilities sum to 1.00).
  • Internally, imputed genotypes can be represented either as dosages or as posterior genotype probabilities, independent of the input format.
  • Data files can be tab- or space-delimited.

To take a simple example: if the following file mydata.dat represents a dosage file:

#P1  P2  P3  P4  P5
rs001  chr1  1000  A  C   0.1  1.2  0.0  1.6  2.0
rs002  chr1  1010  G  T   1.0  0.8  1.1  0.2  1.8
.      chr2  1000  A  AA  0.1  2.0  2.0  0.7  1.2

Here we have a single file containing ID, map position and allele information in the first five columns, followed by a single dosage (the expected count of the second listed allele) for each genotype/individual. The header (which must start with a # character), simply lists the individual IDs for the five people in this file. These data can be loaded into a new (or existing) project as follows. First we make a new project:

pseq proj1 new-project

and then load the dosage data with the load-dosage command:

pseq proj1 load-dosage --file mydata.dat --id S1 --name EC

The --id option is required to give a name to that sample (collection of five individuals). This is the same as the file-tag, i.e that can subsequently be used in a mask statement:

--mask file=S1

Importantly, the load-dosage command allows the file-id to already exist in the the project. In this case, the new variants will be appended to that sample, and the end result will be exactly the same as the input dosage files had first been concatenated into a large, single dosage file. As in the examples below, when a single dataset is arbitrarily split into a large number of chunks for the purpose of imputation, this allows for those separate chunks of the same dataset to be treated as a single, unitary file, as long as each chunk has exactly the same list of individuals.

Note: all dosage files can be uncompressed plain text, or can be gzip-compressed (typically ending with a .gz extension). Compressed files will be automatically decompressed on-the-fly.

The --name option refers to the meta-information tag that should be used to represent the genotype-level dosage data internally. Here EC means expected count (i.e. of the alternate allele).

This dataset can then be viewed: here we use the --gmeta option to obtain genotype-level dosage data:

pseq proj1 v-view --gmeta
chr1:1000	rs001	A/C	.	1	PASS
P1	S1	A/A	[EC=0.1]
P2	S1	./.	[EC=1.2]
P3	S1	A/A	[EC=0]
P4	S1	./.	[EC=1.6]
P5	S1	C/C	[EC=2]
chr1:1010	rs002	G/T	.	1	PASS
P1	S1	G/T	[EC=1]
P2	S1	G/G	[EC=0.08]
P3	S1	G/T	[EC=1.0999]
P4	S1	./.	[EC=0.2]
P5	S1	./.	[EC=1.8]
chr2:1000	.	A/AA	.	1	PASS
P1	S1	A/A	[EC=0.1]
P2	S1	AA/AA	[EC=2]
P3	S1	AA/AA	[EC=2]
P4	S1	./.	[EC=0.7]
P5	S1	./.	[EC=1.2]

Here we see that the dosage information has been appended using the EC tag. Also evident is that hard-calls have been made for variants based on the dosage values, if the value is sufficiently close to 0, 1 or 2 and otherwise left as missing. We describe below how to alter the way in which hard-calls are made.

We can also retrieve the basic matrix of dosages using the v-meta-matrix command:

pseq proj1 v-meta-matrix --name EC
MARKER	P1	P2	P3	P4	P5
chr1:1000	0.1	1.2	0	1.6	2	
chr1:1010	1	0.08	1.0999	0.2	1.8	
chr2:1000	0.1	2	2	0.7	1.2	

Note: Internally, these data are represented in exactly the same way as all other data in a PLINK/SEQ project. Data imported from the load-dosage command can be mixed with data from the load-vcf or load-plink commands. All commands and masks therefore work in the same way for imputed data. That is, the v-freq command will work as usual, based on the hard-calls, rather than the EC genotype-tag. From PLINK/SEQ's perspective, the EC tag is just another arbitrary genotype-level tag. As we describe below, some key commands (notably glm) have options to directly use information on soft-called genotypes, instead of looking at the hard call.

Equivalent VCF encoding

Note that the above example using the load-dosage command would produce identical results by simply loading the following VCF, that uses VCF variant- and genotype-level tags to represent the dosage and meta-information associated with each variant/genotype:

##fileformat=VCFv4.1
##source=pseq
##INFO=<ID=fail,Number=1,Type=Float,Description=".">
##INFO=<ID=pct_geno,Number=1,Type=Float,Description=".">
##INFO=<ID=r2,Number=1,Type=Float,Description=".">
##FILTER=<ID=PASS,Description="Passed variant FILTERs">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	P1	P2	P3	P4	P5
chr1	1000	rs001	A	C	.	PASS	fail=1;pct_geno=0.2;r2=0.88	GT:EC	0/0:0.1	./.:1.2	0/0:0	./.:1.6	1/1:2
chr1	1010	rs002	G	T	.	PASS	fail=1;pct_geno=1;r2=0.99	GT:EC	0/1:1	0/0:0.08	0/1:1.0999	./.:0.2	./.:1.8
chr2	1000	.	A	AA	.	PASS	fail=0;pct_geno=0.2;r2=0.34	GT:EC	0/0:0.1	1/1:2	1/1:2	./.:0.7	./.:1.2

If it is easier for your imputation pipeline to produce this type of VCF as output, there is no need, per se, to use the load-dosage command.

Allele encoding

Of the two alleles listed for each variant, the first is assumed to be the reference allele, the second to be the (single) alternate or non-reference allele. However, it is not unusual for the results of imputation to be ordered arbitrarily, or with respect to the major/minor allele status in terms of sample frequency. To automatically rectify this, if a SEQDB is attached (i.e. by using the --seqdb option either with new-project or load-dosage) then the option:

--check-reference

will modify the behaviour of load-dosage by checking the two listed alleles against the reference sequence (e.g. hg19) and taking action if necessary, as follows.

If the first list allele matches the reference base from SEQDB, then we assume everything is correctly specified and take no further action.

If the second listed allele matches the reference base from SEQDB, this is taken to be the reference (i.e. the two alleles as well as the subsequent dosage data are effectively swapped).

Othewise, if neither the first nor second alleles match the SEQDB reference allele, it will see whether flipping the strand on which both alleles are reported creates an allele that matches the reference: that allele will then be assigned as the reference allele. Obviously this is an imprecise and error-prone procedure, and users are well-advised to ensure that all input data are both strand- and reference-consistent in the first instance. PSEQ will print warnings for any variants at which neither of the strand-unflipped alleles match the SEQDB reference.

If the SEQDB returns a N at that position, the original order of the two alleles is preserved.

If one of the two alleles is a . (period) character, 0 or X, this is treated as unobserved, implying that the variant is monomorphic. Here we assume that the remaining allele is the reference (whatever that may be).

Note: currently, only biallelic variants can be handled by the load-dosage command.

Variant-level meta-information

If you have variant-level meta-information about each variant (such as population frequencies, imputation quality scores (R-squared metrics), or a flag for whether that variant was directly genotyped, etc), it needs to be in a separate, tab-delimited file. Importantly, the order of variants in this file must exactly match the order of the variants in the dosage file (and also, if separately specified, any external map file). If this is not the case (based on comparing variant IDs) an error will result and loading will halt.

Currently, all meta-data associated with variants being imported via load-dosage are assumed to be scalar floating-point values. (The attach-meta command can always be used subsequently to attach any type of meta-information for these variants.) The format is simply a first header row, starting with a # character, followed by tab-delimited labels (e.g. three in this case). Following the header, each row in the meta-information file must correspond to the row/variant in the matching dosage file, as mentioned above. The first column should be a variant ID. If this is not defined, enter a period (.) character. The following tab-delimited fields are the values for the three meta-fields. Any non-numeric value is treated as a missing data-point.

#r2	pct_geno	fail
rs001	0.88	0.2	1
rs002	0.99	1	1
.	0.34	0.2	0

To load these variant-level metadata along with the basic dosage information when performing the load-dosage command, add the option:

--meta-file mydata.meta

These attributes are then available as standard, sample-variant level tags, i.e. such as can be viewed with the --vmeta option on the v-view command:

pseq p3 v-view --vmeta
chr1:1000	rs001	A/C	.	1	PASS	.	fail=1;pct_geno=0.2;r2=0.88
chr1:1010	rs002	G/T	.	1	PASS	.	fail=1;pct_geno=1;r2=0.99
chr2:1000	.	A/AA	.	1	PASS	.	fail=0;pct_geno=0.2;r2=0.34

Using a separate individual-list file

If the header and/or map/allele information is not present in the main dosage file, it can be specified in a separate file. If there is no header in the dosage file, then the option:

--indiv-file mydata.indiv

will instruct PSEQ to read the list of individuals from the file mydata.indiv. This file should contain simply one unique individual ID per line. Naturally, the order and number of individuals in this file must match the order and number of individuals in the corresponding dosage file.

If a header row is in fact present in the dosage file, but you still want to skip it and read IDs from a separate file, add the additional option:

--indiv-file mydata.indiv --format skip-header

It is not possible to skip the header without a corresponding indiv-file being specified (obviously, there would be no way to know which genotypes belong to which individuals otherwise).

Using a separate map file

Instead of having the basic information about the position and alleles of each site in the main dosage file, it is possible to have these in an external file: if the main dosage file were instead

#P1	P2	P3	P4	P5
rs1001	0.1	1.2	0.0	1.6	2.0
rs1002	1.0	0.08	1.0999	0.2	1.8
.	0.1	2.0	2.0	0.7	1.2

and the corresponding map file:

rs1001  chr1  1000	A	C
rs1002  chr1  1010	G	T
.       chr2  1000	A	AA

then use the command:

pseq proj1 load-dosage --file mydata.dat --id S1 --name EC --map-file mydata.map

Alternatively, if the main dosage file contains the two allele codes, but not the chromosome and base-position, this information can be specified separately:

#P1	P2    P3     P4      P5
rs1001	A     C      0.1     1.2	0.0	1.6	2.0
rs1002	G     T      1.0     0.08	1.0999	0.2	1.8
.	A     AA     0.1     2.0	2.0	0.7	1.2

and the corresponding map file:

rs1001  chr1  1000
rs1002  chr1  1010
.       chr2  1000

then use the command:

pseq proj1 load-dosage --file mydata.dat --id S1 --name EC --map-file mydata.map --format position-map

Similarly, if the main dosage file contains only the chromosome and base-position fields, then then allele code information can be specified in a separate map file (ID, first allele, second allele), with the following option:

pseq proj1 load-dosage --file mydata.dat --id S1 --name EC --map-file mydata.map --format allele-map

If the main dosage file does not contain a first column of ID, add the flag:

--format no-ids

In this instance, the a separate map file (that necessarily contains IDs) must be specified. Again, as in all cases, the order of variants in the map file must match exactly the order of variants in the main dosage file.

When the fields are present, they must appear in the following order always: ID, chromosome, base-position, first allele, second allele.

Posterior genotype probabilities and dosages

By default, load-dosage assumes that the soft-called genotypes are represented as the expected number ("dosage") of alterate alleles, on a scale of 0 to 2 (that is, assuming a diploid, biallelic locus). If the input data are dosages on the scale of 0 to 1 (so that a true heterozygote would have a dosage of 0.5), then add:

--format dose1

Alternatively, if the soft-calls are in the form of posterior probabilities, use the option:

--format prob2

if each "genotype" is represented by two numbers: P(A1/A1), P(A1/A2), and so the third genotype, the homozygote for the second (alternate) allele) is P(A2/A2) = 1 - P(A1/A1) - P(A1/A2).

Alternatively, if all three genotype probabilities are specified, add the option:

--format prob3

As above, the implied ordering of genotype probabilities is reference homozygote, heterozygote, alternate homozygote. Note that the values are forced to sum to 1.00 exactly, by dividing each by the sum of the three.

When a soft-called genotype corresponds to two or three values (prob2 or prob3 format) then each value should be separated by the same delimiter used between individuals/genotypes (i.e. a tab usually).

The dose2 keyword can be specified for clarity, but is the default in any case.

By default, no matter which of the above formats genotypes are read into a project, they will be converted to dosages on a 0..2 scale. This corresponds to the command:

--format as-dosage

(it is not necessary to explicitly specify this command, as it is the default). To instead store internally the information as a vector of posterior genotype probabilities, add the option:

--format as-posteriors

If the input is in the format prob2 or prob3, naturally then no conversion needed. If the input data are genotype probabilities stored as dosages, this is calculated as 2 * P(A2/A2) + P(A1/A2). If the input data are dosages, but the above option is invoked to store the results are probabilities, then the following inexact heuristic is applied (assuming dosage D on the scale 0..2):

               D < 1        D >= 1 
 ==================================
 P(A1/A1)      1 - D        0  
 P(A1/A2)      D            2 - D
 P(A2/A2)      0            D - 1 

Naturally, a given dosage is consistent with a range of genotype probabilites (i.e. at an extreme [1,0,1] equals [0,1,0] on a dosage scale) but this should provide a reasonable way to fudge probabilities, if needed, assuming relatively high confidence calls (so that most dosages are in fact close to either 0, 1 or 2).

Using --format as-posteriors, the toy dosage file in the examples above is therefore represented as:

chr1:1000	rs001	A/C	.	1	PASS	.	fail=1;pct_geno=0.2;r2=0.88
P1	F1	A/A	[EC=0.9,0.1,0]
P2	F1	./.	[EC=0,0.8,0.2]
P3	F1	A/A	[EC=1,0,0]
P4	F1	./.	[EC=0,0.4,0.6]
P5	F1	C/C	[EC=0,0,1]
chr1:1010	rs002	G/T	.	1	PASS	.	fail=1;pct_geno=1;r2=0.99
P1	F1	G/T	[EC=0,1,0]
P2	F1	G/G	[EC=0.92,0.08,0]
P3	F1	G/T	[EC=0,0.9001,0.0999]
P4	F1	./.	[EC=0.8,0.2,0]
P5	F1	./.	[EC=0,0.2,0.8]
chr2:1000	.	A/AA	.	1	PASS	.	fail=0;pct_geno=0.2;r2=0.34
P1	F1	A/A	[EC=0.9,0.1,0]
P2	F1	AA/AA	[EC=0,0,1]
P3	F1	AA/AA	[EC=0,0,1]
P4	F1	./.	[EC=0.3,0.7,0]
P5	F1	./.	[EC=0,0.8,0.2]

Dosage data distributed across a large number of files

It is not uncommon to have large imputed datasets distributed across several (potentially hundreds) of smaller files, that may split by overall datasets by samples and/or variants. The load-dosage command has a few options to facilitate loading multiple related files, by using the --file-list option instead of --file.

The --file-list option takes a single argument that is a text file where each row corresponds to a dosage file to be loaded, and has either 4 or 5 tab-delimited fields that correspond to:

  file-tag 
  dosage-file
  map-file
  indiv-file
  [ optionally, meta-file ] 

The file-tag corresponds to S1 in the example above. As a file-tag denotes a single group of individuals, there should be a one-to-one correspondence between the file-tag and the associated indiv-file in the fourth column (otherwise, an error will be displayed). If two dosage files contain data on the same individuals (ordered similarly), they should be given the same file-tag -- this will ensure that all data for those individuals end up in the same sample/file internally. For example: if the file impfiles.txt contains the following:

S1  ../data/imputed-s1-chr1-chunk1.dose.gz  aux/chr1-chunk1.map  aux/s1.indiv  ../data/imputed-s1-chr1-chunk1.rsq
S1  ../data/imputed-s1-chr1-chunk2.dose.gz  aux/chr1-chunk2.map  aux/s1.indiv  ../data/imputed-s1-chr1-chunk2.rsq
S1  ../data/imputed-s1-chr1-chunk3.dose.gz  aux/chr1-chunk3.map  aux/s1.indiv  ../data/imputed-s1-chr1-chunk3.rsq
S2  ../data/imputed-s2-chr1-chunk1.dose.gz  aux/chr1-chunk1.map  aux/s2.indiv  ../data/imputed-s2-chr1-chunk1.rsq
S2  ../data/imputed-s2-chr1-chunk2.dose.gz  aux/chr1-chunk2.map  aux/s2.indiv  ../data/imputed-s2-chr1-chunk2.rsq
S2  ../data/imputed-s2-chr1-chunk3.dose.gz  aux/chr1-chunk3.map  aux/s2.indiv  ../data/imputed-s2-chr1-chunk3.rsq

then the command

pseq proj1 load-dosage --file-list impfiles.txt --name EC

will load the six dosage files into a single project. That project will contain two 'samples' or 'files' (in PSEQ's internal representation), S1 and S2, that can be specified in conjunction with the --mask file=S2 mask option, for example. Three different genomic regions have been given here, corresponding to three chunks (e.g. 5Mb regions) of chromosome 1. In this example, both sample groups (S1 and S2) have the similar set of three files each, although this is not a requirement (i.e. different sample sets can be "chunked" differently and still merged using the above approach). Finally, in this example, we've also included some optional meta-information for each variant, in the *.rsq files. Note that these reference six unique files, each corresponding exactly to the six unique dosage files in the second column.

Other command line options (such as --format prob2 for example) can be specified along with --file-list, although note that this will imply that all listed dosage files have a similar format (i.e. two genotype probabilities, in this instance). If you wish to combine imputed data from different pipelines that perhaps has different formats, then use multiple load-dosage commands to sequentially populate the project with the heterogeneously-formatted data.

Making hard-calls

By default, a hard-call is made if a dosage (0..2 scale) is within 0.1 of either 0, 1 or 2 otherwise a missing genotype value is set. This value can be changed by specifying along with the load-dosage command, e.g. :

--hard-call-threshold 0.05

Hint: Note that after loading, the include mask g-functions can still be used to dynamically recall hard-called genotypes (example of how to do this to be given).

Misc.

By default, we assume all files are tab-delimited. If all files are space-delimited instead, and the following option to load-dosage --format space-delimited

(Note: currently, load-dosage parses on one-or-more whitespace delimiters. This option is therefore now redundant, although future releases may (e.g. if it speeds up parsing of large files) parse on only single delimiters that are either tabs or spaces.

Association analysis of soft-called data

As noted above, once loaded into a project there is no distinction between "imputed" genotype data and any other type of genotype data, that may also happen to have arbitrary variant-level and genotype-level tags. Most commands in PSEQ operate on hard-calls, and this would happen no differently for data that happened to be loaded via the load-dosage command. (In fact, you