SNP scoring routine
PLINK provides a simple means to generate scores
or profiles for individuals based on an allelic scoring
system involving one or more SNPs. One potential use would be
to assign a single quantitative index of genetic load, perhaps to
build multi-SNP prediction models, or just as a quick way to
identify a list of individuals containing one or more of a set of
variants of interest.
Basic usage
The basic command to generate a score is the --score option, e.g.
./plink --bfile mydata --score myprofile.raw
which takes as a parameter the name of a file (here myprofile.raw)
that describes the scoring system. This file contains one or more lines,
each with exactly three fields:
SNP ID
Reference allele
Score (numeric)
for example
SNPA A 1.95
SNPB C 2.04
SNPC C -0.98
SNPD C -0.24
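As an illustration of the three-field layout described above, the following Python sketch reads such a file into a dictionary. The function name parse_score_file is hypothetical and is not part of PLINK; the sketch only assumes whitespace-separated fields, as in the example.

```python
def parse_score_file(text):
    """Parse score-file text into {snp_id: (reference_allele, weight)}."""
    weights = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        snp_id, allele, weight = fields
        weights[snp_id] = (allele, float(weight))  # score must be numeric
    return weights


# The example scoring system from the text.
example = """\
SNPA A 1.95
SNPB C 2.04
SNPC C -0.98
SNPD C -0.24
"""

weights = parse_score_file(example)
print(weights["SNPC"])
```

Checking the field count up front catches malformed lines before any scoring is attempted.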
These scores can be based on whatever you want. One choice might be the log of the odds ratio for significantly
associated SNPs, for example. Then, running the command above would generate a file
plink.profile
with one individual per row and the fields:
FID      Family ID
IID      Individual ID
PHENO    Phenotype for that individual
CNT      Number of non-missing SNPs used for scoring
CNT2     The number of named alleles
SCORE    Total score for that individual
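The sketch below shows one way the profile output might be read in Python, for instance to list individuals carrying at least one of the named alleles (CNT2 greater than zero). It assumes the file is whitespace-delimited with a header row naming the six fields above; the sample rows are invented for illustration and are not real PLINK output, and read_profile is a hypothetical helper.

```python
import io


def read_profile(handle):
    """Read a profile table into a list of dicts keyed by column name."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        # Convert the numeric columns; FID, IID and PHENO stay as strings.
        for key in ("CNT", "CNT2", "SCORE"):
            row[key] = float(row[key])
        rows.append(row)
    return rows


# Invented sample content (assumed layout, hypothetical values).
sample = io.StringIO(
    "FID IID PHENO CNT CNT2 SCORE\n"
    "F1 I1 2 8 3 0.68\n"
    "F2 I1 1 8 0 0.00\n"
)

rows = read_profile(sample)
carriers = [(r["FID"], r["IID"]) for r in rows if r["CNT2"] > 0]
print(carriers)
```

To read an actual file, an open file object can be passed in place of the io.StringIO sample.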
The score is simply a sum across SNPs of the number of reference
alleles (0,1 or 2) at that SNP multiplied by the score for that SNP.
For example,
Variant (1/2)        A/T     C/G     A/C     C/G
Freq. of allele 1    0.20    0.43    0.02    0.38
Ind 1 genotype       A/A     G/G     A/C     0/0
# ref alleles        2       0       1       2*0.38 (=expectation)

Score    ( 2*1.95 + 0*2.04 + 1*(-0.98) + 2*0.38*(-0.24) ) / 4
         = 2.74 / 4 = 0.68
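The arithmetic of this worked example can be reproduced with a short Python sketch. This is not PLINK's implementation; the data layout and the function name profile_score are assumptions made for illustration. A missing genotype is represented as None and replaced by its expected reference-allele count, twice the reference-allele frequency.

```python
# Each SNP: (reference allele, weight, frequency of the reference allele).
# For SNPC the reference allele C is allele 2, so its frequency is 1 - 0.02.
snps = {
    "SNPA": ("A", 1.95, 0.20),
    "SNPB": ("C", 2.04, 0.43),
    "SNPC": ("C", -0.98, 0.98),
    "SNPD": ("C", -0.24, 0.38),
}


def profile_score(genotype, snps):
    """Average per-SNP score, with missing genotypes mean-imputed."""
    total = 0.0
    for snp_id, (ref, weight, ref_freq) in snps.items():
        alleles = genotype[snp_id]
        if alleles is None:
            dosage = 2 * ref_freq  # expected number of reference alleles
        else:
            dosage = alleles.count(ref)  # observed count: 0, 1 or 2
        total += dosage * weight
    return total / len(snps)


# Individual 1 from the example; SNPD is missing (0/0).
ind1 = {"SNPA": "AA", "SNPB": "GG", "SNPC": "AC", "SNPD": None}

score = profile_score(ind1, snps)
print(round(score, 2))  # 0.68
```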
The score 2.74/4 (the average score per non-missing SNP) could then be
used, e.g. as a covariate, or a predictor of disease if it is scored
in a sample that is independent from the one used to generate the
original scoring weights. Obviously, a score profile based on some
effect size measure from a large number of SNPs will necessarily be
highly correlated with the phenotype in the original sample: i.e. this
in no (straightforward) way provides additional statistical evidence
for associations in that sample.
Multiple scores from SNP subsets
To calculate multiple scores from subsets of SNPs in a
single --score file, use two additional options,
each followed by a filename, e.g.
--q-score-file snpval.dat
--q-score-range q.ranges
in addition to --score, where snpval.dat is a file
that contains a number for each SNP (e.g. the p-value
from some test)
rs00001 0.234
rs00002 0.046
rs00003 0.887
...
and q.ranges is a file in which each row corresponds to a
different score, containing a label, then a lower and upper bound for
the values as given in the other file, e.g.
S1 0.00 0.01
S2 0.00 0.20
S3 0.10 0.50
would create three score files,
plink.S1.profile
plink.S2.profile
plink.S3.profile
in which the first uses only SNPs that have a value
in snpval.dat between 0.00 and 0.01; the second uses only SNPs
that have a value between 0.00 and 0.20, etc.
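The selection of SNPs for each range can be sketched in Python as follows. The sketch assumes both bounds are inclusive, which the text above does not state; the function name snps_per_range is hypothetical. Only the three SNPs shown in the example are used.

```python
# Values from the snpval.dat example and ranges from the q.ranges example.
snp_values = {"rs00001": 0.234, "rs00002": 0.046, "rs00003": 0.887}
ranges = [("S1", 0.00, 0.01), ("S2", 0.00, 0.20), ("S3", 0.10, 0.50)]


def snps_per_range(snp_values, ranges):
    """Map each range label to the SNPs whose value lies within its bounds.

    Assumption: both the lower and upper bound are treated as inclusive.
    """
    return {
        label: sorted(s for s, v in snp_values.items() if lower <= v <= upper)
        for label, lower, upper in ranges
    }


subsets = snps_per_range(snp_values, ranges)
for label, members in subsets.items():
    print(label, members)
```

With these three SNPs, S1 selects none, S2 selects rs00002 and S3 selects rs00001; because ranges may overlap, a SNP can contribute to more than one score.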
Misc. options
By default, if a genotype in the score is missing for a particular
individual, then the expected value is imputed, i.e. based on the
sample allele frequency. To change this behavior, add the flag
--score-no-mean-imputation
which means the above example would be calculated as
Score ( 2*1.95 + 0*2.04 + 1*(-0.98) ) / 3
= 2.92 / 3 = 0.97
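For comparison, a minimal Python sketch of the same example without imputation: SNPs with a missing genotype are dropped, and the sum is divided by the number of SNPs that remain. The names below are hypothetical and the code is an illustration of the arithmetic, not PLINK's implementation.

```python
# Each SNP: (reference allele, weight), as in the example score file.
weights = {
    "SNPA": ("A", 1.95),
    "SNPB": ("C", 2.04),
    "SNPC": ("C", -0.98),
    "SNPD": ("C", -0.24),
}

# Individual 1 from the example; None marks the missing genotype at SNPD.
ind1 = {"SNPA": "AA", "SNPB": "GG", "SNPC": "AC", "SNPD": None}


def score_without_imputation(genotype, weights):
    """Average score over SNPs with an observed genotype only."""
    observed = [
        genotype[snp_id].count(ref) * weight
        for snp_id, (ref, weight) in weights.items()
        if genotype[snp_id] is not None
    ]
    if not observed:
        return None  # no non-missing SNPs, so no score can be formed
    return sum(observed) / len(observed)


score = score_without_imputation(ind1, weights)
print(round(score, 2))  # 0.97
```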