Illustration of the different CNV frequency filtering commands
To illustrate both the region-based and CNV-based methods of frequency
filtering, consider this example CNV file, with 18 individuals and
18 CNVs, which contains a complex set of partially overlapping events:
FID IID CHR BP1 BP2 TYPE SCORE SITES
1 1 1 10000 20000 1 10 10
2 1 1 10000 20000 1 10 10
3 1 1 9000 21000 1 10 10
4 1 1 10000 32000 1 10 10
5 1 1 20000 31000 1 10 10
6 1 1 5000 50000 1 10 10
7 1 1 40000 51000 1 10 10
8 1 1 44000 48000 1 10 10
9 1 1 42000 46000 1 10 10
10 1 1 41000 49000 1 10 10
11 1 1 39000 48000 1 10 10
12 1 1 38000 52000 1 10 10
13 1 1 80000 85000 1 10 10
14 1 1 90000 99000 1 10 10
15 1 1 91000 99000 1 10 10
16 1 1 89000 98000 1 10 10
17 1 1 90000 99000 1 10 10
18 1 1 90000 99000 1 10 10
The files are available to download and experiment with:
test1.cnv, test1.cnv.map and test1.fam.
The command
./plink --cfile test1 --cnv-seglist
gives the following output in plink.cnv.seglist. The rightmost column
shown here is not part of that file; it is the AFF CNV count field
from plink.cnv.summary. Because all 18 individuals are coded as
cases, this number is the number of CNVs spanning that particular
MAP position:
                   AFF
p1-5000         +   1
p1-9000     +   |   2
p1-10000  ++| + |   5
p1-20000  AA|+| |   6
p1-20001    ||| |   4
p1-21000    A|| |   4
p1-21001     || |   3
p1-31000     A| |   3
p1-31001      | |   2
p1-32000      A |   2
p1-32001        |   1
p1-38000       +|   2
p1-39000     + ||   3
p1-40000     |+||   4
p1-41000    +||||   5
p1-42000  + |||||   6
p1-44000  |+|||||   7
p1-46000  A||||||   7
p1-46001   ||||||   6
p1-48000   A|A|||   6
p1-48001    | |||   4
p1-49000    A |||   4
p1-49001      |||   3
p1-50000      ||A   3
p1-50001      ||    2
p1-51000      A|    2
p1-51001       |    1
p1-52000       A    1
p1-52001            0
p1-80000  +         1
p1-85000  A         1
p1-85001            0
p1-89000   +        1
p1-90000   |+++     4
p1-91000  +||||     5
p1-98000  |A|||     5
p1-98001  | |||     4
p1-99000  A AAA     4
p1-99001            0
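As a cross-check, the AFF column can be recomputed directly from the BP1 and BP2 values in test1.cnv. The following Python sketch is an illustration only and is not part of PLINK: the names CNVS and spanning_counts are made up for this example, and positions are assumed to be inclusive base-pair coordinates on a single chromosome.

```python
# (BP1, BP2) for the 18 CNVs in test1.cnv, keyed by FID; all on chromosome 1.
CNVS = {
    1: (10000, 20000), 2: (10000, 20000), 3: (9000, 21000),
    4: (10000, 32000), 5: (20000, 31000), 6: (5000, 50000),
    7: (40000, 51000), 8: (44000, 48000), 9: (42000, 46000),
    10: (41000, 49000), 11: (39000, 48000), 12: (38000, 52000),
    13: (80000, 85000), 14: (90000, 99000), 15: (91000, 99000),
    16: (89000, 98000), 17: (90000, 99000), 18: (90000, 99000),
}


def spanning_counts(cnvs):
    """Map each listed position to the number of CNVs spanning it."""
    positions = set()
    for bp1, bp2 in cnvs.values():
        # Each CNV contributes its start, its end, and the first base after its end.
        positions.update((bp1, bp2, bp2 + 1))
    return {
        pos: sum(bp1 <= pos <= bp2 for bp1, bp2 in cnvs.values())
        for pos in sorted(positions)
    }


counts = spanning_counts(CNVS)
for pos, n in counts.items():
    print(f"p1-{pos}\t{n}")
```

Under these assumptions the sketch lists the same 39 positions as the output above, with a count of 6 at p1-20000 and a maximum of 7 between p1-44000 and p1-46000.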
Region-based, or locus-based, frequency filtering (default)
NOTE: These commands are intended to illustrate how
the filtering works, rather than to provide useful examples of how to
analyse data in practice.
For example, the command
plink --cfile test1 --cnv-seglist --cnv-freq-exclude-above 4 --cnv-overlap 1
will remove CNVs that lie entirely within regions spanned by 5 or more CNVs (here the CNVs of individuals 1, 2, 8 and 9):
p1-5000         +
p1-9000     +   |
p1-10000    | + |
p1-20000    |+| |
p1-20001    ||| |
p1-21000    A|| |
p1-21001     || |
p1-31000     A| |
p1-31001      | |
p1-32000      A |
p1-32001        |
p1-38000       +|
p1-39000     + ||
p1-40000     |+||
p1-41000    +||||
p1-42000    |||||
p1-44000    |||||
p1-46000    |||||
p1-46001    |||||
p1-48000    |A|||
p1-48001    | |||
p1-49000    A |||
p1-49001      |||
p1-50000      ||A
p1-50001      ||
p1-51000      A|
p1-51001       |
p1-52000       A
p1-52001
p1-80000  +
p1-85000  A
p1-85001
p1-89000   +
p1-90000   |+++
p1-91000  +||||
p1-98000  |A|||
p1-98001  | |||
p1-99000  A AAA
p1-99001
The command
plink --cfile test1 --cnv-seglist --cnv-freq-exclude-above 6 --cnv-overlap 0
will remove CNVs that overlap, even partially, regions spanned by 7 or more CNVs
(this removes 7 CNVs in total):
p1-5000
p1-9000     +
p1-10000  ++| +
p1-20000  AA|+|
p1-20001    |||
p1-21000    A||
p1-21001     ||
p1-31000     A|
p1-31001      |
p1-32000      A
p1-32001
p1-38000
p1-39000
p1-40000
p1-41000
p1-42000
p1-44000
p1-46000
p1-46001
p1-48000
p1-48001
p1-49000
p1-49001
p1-50000
p1-50001
p1-51000
p1-51001
p1-52000
p1-52001
p1-80000  +
p1-85000  A
p1-85001
p1-89000   +
p1-90000   |+++
p1-91000  +||||
p1-98000  |A|||
p1-98001  | |||
p1-99000  A AAA
p1-99001
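The two examples above are consistent with the following reading of the region-based rule: find the regions spanned by more than --cnv-freq-exclude-above CNVs, then drop a CNV when the fraction of its length inside those regions reaches --cnv-overlap (or is simply non-zero when --cnv-overlap is 0). The Python sketch below implements that reading. It is an assumption inferred from the example output, not PLINK's actual code; the function names are made up here, and the treatment of overlap values strictly between 0 and 1 in particular is a guess.

```python
from collections import defaultdict

# (BP1, BP2) for the 18 CNVs in test1.cnv, keyed by FID; all on chromosome 1.
CNVS = {
    1: (10000, 20000), 2: (10000, 20000), 3: (9000, 21000),
    4: (10000, 32000), 5: (20000, 31000), 6: (5000, 50000),
    7: (40000, 51000), 8: (44000, 48000), 9: (42000, 46000),
    10: (41000, 49000), 11: (39000, 48000), 12: (38000, 52000),
    13: (80000, 85000), 14: (90000, 99000), 15: (91000, 99000),
    16: (89000, 98000), 17: (90000, 99000), 18: (90000, 99000),
}


def depth_segments(cnvs):
    """Split the chromosome into runs of constant CNV depth: (start, end, depth)."""
    delta = defaultdict(int)
    for bp1, bp2 in cnvs.values():
        delta[bp1] += 1      # depth rises at the first base of a CNV
        delta[bp2 + 1] -= 1  # and falls just after its last base
    positions = sorted(delta)
    segments, depth = [], 0
    for start, nxt in zip(positions, positions[1:]):
        depth += delta[start]
        segments.append((start, nxt - 1, depth))
    return segments


def region_filter(cnvs, exclude_above, overlap):
    """Return the IDs of the CNVs kept under the assumed region-based rule."""
    segments = depth_segments(cnvs)
    kept = []
    for cnv_id, (bp1, bp2) in cnvs.items():
        # Bases of this CNV lying in regions spanned by more than exclude_above CNVs.
        # Every CNV boundary is a segment boundary, so segments never straddle a CNV edge.
        common = sum(
            end - start + 1
            for start, end, depth in segments
            if bp1 <= start and end <= bp2 and depth > exclude_above
        )
        fraction = common / (bp2 - bp1 + 1)
        removed = fraction > 0 if overlap == 0 else fraction >= overlap
        if not removed:
            kept.append(cnv_id)
    return kept


print(sorted(set(CNVS) - set(region_filter(CNVS, 4, 1))))  # [1, 2, 8, 9]
print(sorted(set(CNVS) - set(region_filter(CNVS, 6, 0))))  # [6, 7, 8, 9, 10, 11, 12]
```

With the test1 data this removes the CNVs of individuals 1, 2, 8 and 9 in the first example and those of individuals 6 to 12 in the second, matching the two listings above.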
Alternative frequency filtering approach
The standard approach to frequency filtering considers the frequency
of CNVs at each genomic location, defining regions by the number of
CNVs spanning them; CNVs are then filtered based on the extent to
which each individual CNV overlaps these regions.
An alternative approach (invoked with the --cnv-freq-method2
flag) is to define frequency as a property of a particular
CNV rather than of a region, which is perhaps more
intuitive. Here we count, for each CNV, how many CNVs overlap
it; the count includes the CNV itself, so an isolated CNV has a
count of 1. The overlap definition here is forced to be a union
overlap, and the disrupted-overlap definition (--cnv-disrupt) is
not allowed, in order to ensure symmetry (i.e. if A
overlaps B, then B must overlap A). The frequency filtering is then
based on these counts.
Below are the frequency counts for each CNV, given different values for the overlap
parameter specified in the --cnv-freq-method2 command:
           --cnv-freq-method2 0   | --cnv-freq-method2 0.5 | --cnv-freq-method2 1
          ------------------------|------------------------|------------------------
                                  |                        |
p1-5000                       12  |                     1  |                     1
p1-9000            6          12  |         3           1  |         1           1
p1-10000     6  6  6     6    12  |   3  3  3     2     1  |   2  2  1     1     1
p1-20000     6  6  6  6  6    12  |   3  3  3  2  2     1  |   2  2  1  1  1     1
p1-20001           6  6  6    12  |         3  2  2     1  |         1  1  1     1
p1-21000           6  6  6    12  |         3  2  2     1  |         1  1  1     1
p1-21001              6  6    12  |            2  2     1  |            1  1     1
p1-31000              6  6    12  |            2  2     1  |            1  1     1
p1-31001                 6    12  |               2     1  |               1     1
p1-32000                 6    12  |               2     1  |               1     1
p1-32001                      12  |                     1  |                     1
p1-38000                    7 12  |                  4  1  |                  1  1
p1-39000              7     7 12  |            4     4  1  |            1     1  1
p1-40000              7  7  7 12  |            4  4  4  1  |            1  1  1  1
p1-41000           7  7  7  7 12  |         6  4  4  4  1  |         1  1  1  1  1
p1-42000     7     7  7  7  7 12  |   2     6  4  4  4  1  |   1     1  1  1  1  1
p1-44000     7  7  7  7  7  7 12  |   2  2  6  4  4  4  1  |   1  1  1  1  1  1  1
p1-46000     7  7  7  7  7  7 12  |   2  2  6  4  4  4  1  |   1  1  1  1  1  1  1
p1-46001        7  7  7  7  7 12  |      2  6  4  4  4  1  |      1  1  1  1  1  1
p1-48000        7  7  7  7  7 12  |      2  6  4  4  4  1  |      1  1  1  1  1  1
p1-48001           7     7  7 12  |         6     4  4  1  |         1     1  1  1
p1-49000           7     7  7 12  |         6     4  4  1  |         1     1  1  1
p1-49001                 7  7 12  |               4  4  1  |               1  1  1
p1-50000                 7  7 12  |               4  4  1  |               1  1  1
p1-50001                 7  7     |               4  4     |               1  1
p1-51000                 7  7     |               4  4     |               1  1
p1-51001                    7     |                  4     |                  1
p1-52000                    7     |                  4     |                  1
p1-52001                          |                        |
p1-80000     1                    |   1                    |   1
p1-85000     1                    |   1                    |   1
p1-85001                          |                        |
p1-89000        5                 |      5                 |      1
p1-90000        5  5  5  5        |      5  5  5  5        |      1  3  3  3
p1-91000     5  5  5  5  5        |   5  5  5  5  5        |   1  1  3  3  3
p1-98000     5  5  5  5  5        |   5  5  5  5  5        |   1  1  3  3  3
p1-98001     5     5  5  5        |   5     5  5  5        |   1     3  3  3
p1-99000     5     5  5  5        |   5     5  5  5        |   1     3  3  3
p1-99001                          |                        |
          ------------------------|------------------------|------------------------
                                  |                        |
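The per-CNV counts tabulated above can be reproduced with a short Python sketch. It assumes that "union overlap" means the number of shared bases divided by the number of bases covered by either CNV, that coordinates are inclusive, that a CNV counts itself, and that a pair is counted when the ratio is non-zero and at least the --cnv-freq-method2 value. These conventions were chosen because they reproduce the table; they are not taken from PLINK's source, and the function names are made up for illustration.

```python
# (BP1, BP2) for the 18 CNVs in test1.cnv, keyed by FID; all on chromosome 1.
CNVS = {
    1: (10000, 20000), 2: (10000, 20000), 3: (9000, 21000),
    4: (10000, 32000), 5: (20000, 31000), 6: (5000, 50000),
    7: (40000, 51000), 8: (44000, 48000), 9: (42000, 46000),
    10: (41000, 49000), 11: (39000, 48000), 12: (38000, 52000),
    13: (80000, 85000), 14: (90000, 99000), 15: (91000, 99000),
    16: (89000, 98000), 17: (90000, 99000), 18: (90000, 99000),
}


def union_overlap(a, b):
    """Shared bases divided by bases covered by either CNV (0.0 if disjoint)."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if shared <= 0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return shared / union


def overlaps(a, b, threshold):
    """Symmetric test: the two CNVs share at least one base and meet the threshold."""
    ratio = union_overlap(a, b)
    return ratio > 0 and ratio >= threshold


def method2_counts(cnvs, threshold):
    """For each CNV, count the CNVs (itself included) that overlap it."""
    return {
        cnv_id: sum(overlaps(a, b, threshold) for b in cnvs.values())
        for cnv_id, a in cnvs.items()
    }


for threshold in (0, 0.5, 1):
    print(threshold, method2_counts(CNVS, threshold))
```

Because the overlap test is symmetric in its two arguments, the property described earlier holds by construction: if A is counted for B, then B is counted for A.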
Any additional commands, such as --cnv-freq-exclude-above 5,
work in a straightforward manner based on these counts. For example,
plink --cfile test1
--cnv-freq-method2 0
--cnv-freq-include-exact 5
--cnv-write
--cnv-write-freq
will include just the group of five segments that start at or after position 89000:
Filtering segments based on frequencies
Will remove 13 CNVs based on frequency (after other filters)
18 mapped to a person, of which 18 passed filters
5 of 18 mapped as valid segments
as shown in the file plink.cnv, which contains
FID IID CHR BP1 BP2 TYPE SCORE SITES FREQ
14 1 1 90000 99000 1 10 10 5
15 1 1 91000 99000 1 10 10 5
16 1 1 89000 98000 1 10 10 5
17 1 1 90000 99000 1 10 10 5
18 1 1 90000 99000 1 10 10 5
In summary
For a complex set of partially overlapping CNVs, any attempt to
collapse CNVs into discrete groups or counts will inevitably be
somewhat artificial. Nonetheless, the commands presented here provide
a range of options, to either strictly or loosely filter as
desired. This made-up example dataset is particularly complex -- in
most real cases, these frequency filters will yield sensible results.
To select CNVs below some overall frequency (e.g. 1%, which with
1000 individuals would mean 10 events), the options
--cnv-freq-exclude-above 10
--cnv-overlap 0.5
would be a good default.
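The conversion from a frequency to a count is simple arithmetic, sketched here in Python. The helper names max_cnv_count and freq_filter_flags are hypothetical and not part of PLINK; only the two flag names come from this page, and rounding down is an assumption about how a fractional count should be handled.

```python
import math
from fractions import Fraction


def max_cnv_count(frequency, n_individuals):
    """Largest whole number of events not exceeding frequency * n_individuals."""
    # Fraction(str(...)) avoids binary floating-point error, e.g. in 0.29 * 100.
    return math.floor(Fraction(str(frequency)) * n_individuals)


def freq_filter_flags(frequency, n_individuals, overlap=0.5):
    """Build the two filtering options for a given frequency and sample size."""
    count = max_cnv_count(frequency, n_individuals)
    return ["--cnv-freq-exclude-above", str(count), "--cnv-overlap", str(overlap)]


print(" ".join(freq_filter_flags(0.01, 1000)))
# --cnv-freq-exclude-above 10 --cnv-overlap 0.5
```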
To select strictly defined singleton CNVs (those seen only once in a dataset), use
--cnv-freq-exclude-above 1