PLINK: Whole genome data analysis toolset
Last original PLINK release is v1.07 (10-Oct-2009); PLINK 1.9 is now available for beta-testing


BED file format

This page describes the format of binary PED (BED) files. Consider the following example PED file, test.ped:
     1 1 0 0 1  0    G G    2 2    C C
     1 2 0 0 1  0    A A    0 0    A C
     1 3 1 2 1  2    0 0    1 2    A C
     2 1 0 0 1  0    A A    2 2    0 0
     2 2 0 0 1  2    A A    2 2    0 0
     2 3 1 2 1  2    A A    2 2    A A
and the corresponding MAP file, test.map:
     1 snp1 0 1
     1 snp2 0 2
     1 snp3 0 3
We create a binary fileset with the following command:
plink --file test --make-bed --out test

which produces output:

     @----------------------------------------------------------@
     |         PLINK!       |    v0.99l     |   27/Jul/2006     |
     |----------------------------------------------------------|
     |  (C) 2006 Shaun Purcell, GNU General Public License, v2  |
     |----------------------------------------------------------|
     |       http://pngu.mgh.harvard.edu/purcell/plink/         |
     @----------------------------------------------------------@

     Web-based version check ( --noweb to skip )
     Connecting to web...  OK, v0.99l is current

     *** Pre-Release Testing Version ***

     Writing this text to log file [ test.log ]
     Analysis started: Sat Jul 29 17:22:59 2006

     Options in effect:
             --file test
             --make-bed
             --out test

     3 (of 3) markers to be included from [ test.map ]
     6 individuals read from [ test.ped ]
     3 individuals with nonmissing phenotypes
     Assuming a binary trait (1=unaff, 2=aff, 0=miss)
     Missing phenotype value is also -9
     Before frequency and genotyping pruning, there are 3 SNPs
     Applying filters (SNP-major mode)
     4 founders and 2 non-founders found
     0 SNPs failed missingness test ( GENO > 1 )
     0 SNPs failed frequency test ( MAF < 0 )
     After frequency and genotyping pruning, there are 3 SNPs
     Writing pedigree information to [ test.fam ]
     Writing map (extended format) information to [ test.bim ]
     Writing genotype bitfile to [ test.bed ]
     Using (default) SNP-major mode
     Analysis finished: Sat Jul 29 17:37:57 2006
and generates files
     test.bed
     test.bim
     test.fam
The file test.bim is the extended map file, which also includes the names of the alleles. Its columns are chromosome, SNP, cM, base-position, allele 1 and allele 2:
     1       snp1    0       1       G       A
     1       snp2    0       2       1       2
     1       snp3    0       3       A       C
The file test.fam is simply the first six columns of test.ped (family ID, individual ID, paternal ID, maternal ID, sex, phenotype):
     1 1 0 0 1 0
     1 2 0 0 1 0
     1 3 1 2 1 2
     2 1 0 0 1 0
     2 2 0 0 1 2
     2 3 1 2 1 2
We can view the binary BED file with the Unix xxd command; the -b option prints each byte as eight bits:
xxd -b test.bed

which generates:
     0000000: 01101100 00011011 00000001 11011100 00001111 11100111  l.....
     0000006: 00001111 01101011 00000001                             .k.
The actual binary data are the nine blocks of 8 bits (one byte each) in the center. The first three bytes have a special meaning. The first two are fixed: a 'magic number' that lets PLINK confirm that a BED file really is a BED file. That is, BED files should always start 01101100 00011011. The third byte indicates whether the BED file is in SNP-major or individual-major mode: a value of 00000001 indicates SNP-major (i.e. all individuals for the first SNP, then all individuals for the second SNP, etc.), whereas a value of 00000000 indicates individual-major (i.e. all SNPs for the first individual, then all SNPs for the second individual, etc.). By default, all BED files are in SNP-major mode, as in this example.
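As a minimal sketch of the header check described above (this is not PLINK's own code, and the function name read_bed_header is hypothetical), the first three bytes can be tested like this in Python:

```python
MAGIC = bytes([0b01101100, 0b00011011])  # the two fixed bytes described above

def read_bed_header(data):
    """Return 'SNP-major' or 'individual-major' based on the first three bytes."""
    if len(data) < 3 or data[:2] != MAGIC:
        raise ValueError("not a BED file with the v1.00 magic number")
    if data[2] == 0b00000001:
        return "SNP-major"
    if data[2] == 0b00000000:
        return "individual-major"
    raise ValueError("unrecognised mode byte")

# The nine bytes of the example test.bed, copied from the xxd output.
example = bytes([0b01101100, 0b00011011, 0b00000001,
                 0b11011100, 0b00001111, 0b11100111,
                 0b00001111, 0b01101011, 0b00000001])
print(read_bed_header(example))  # SNP-major
```

For a file on disk, the same function could be given the result of open("test.bed", "rb").read().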

Here we have extracted and annotated the relevant part of the xxd output:
     |-magic number--| |-mode-| |--genotype data---------| 

     01101100 00011011 00000001 11011100 00001111 11100111

     |--genotype data-cont'd--| 

     00001111 01101011 00000001 

For the genotype data, each byte encodes up to four genotypes (2 bits per genotype). The coding is
     00  Homozygote "1"/"1"
     01  Heterozygote
     11  Homozygote "2"/"2"
     10  Missing genotype
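The only slightly confusing wrinkle is that each byte is effectively read backwards. That is, if we label each of the 8 positions A to H, we label them from right to left, and each two-bit code above is written in the order the bits are read (A then B, C then D, and so on):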
     01101100
     HGFEDCBA
and so the first four genotypes are read as follows:
     01101100
     HGFEDCBA

           AB   00  -- homozygote (first)
         CD     11  -- other homozygote (second)
       EF       01  -- heterozygote (third)
     GH         10  -- missing genotype (fourth)
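The backwards reading can be sketched in a few lines of Python. This is an illustration only: the names CODES and decode_byte are hypothetical, and the two-bit codes are written as in the text, first-read bit followed by second-read bit.

```python
# Two-bit codes written as in the text: first-read bit, then second-read bit.
CODES = {
    "00": "homozygote 1/1",
    "01": "heterozygote",
    "11": "homozygote 2/2",
    "10": "missing",
}

def decode_byte(byte):
    """Return the four two-bit codes in a byte, in the order they are read."""
    pairs = []
    for slot in range(4):
        first = (byte >> (2 * slot)) & 1       # bit A, C, E or G
        second = (byte >> (2 * slot + 1)) & 1  # bit B, D, F or H
        pairs.append(f"{first}{second}")
    return pairs

codes = decode_byte(0b01101100)
print(codes)                      # ['00', '11', '01', '10']
print([CODES[c] for c in codes])
```

Shifting right by 0, 2, 4 and 6 bits picks out the pairs AB, CD, EF and GH in turn, which is why the rightmost pair yields the first genotype.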
Finally, when we reach the end of a SNP (or, in individual-major mode, the end of an individual) we skip to the start of a new byte (i.e. skip any remaining bits in that byte).

It is important to remember that the files test.bim and test.fam will already have been read in, so PLINK knows how many SNPs and individuals to expect.
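Because each row is padded to a whole number of bytes, the counts from test.bim and test.fam fix the size of the BED file. The following Python sketch shows the arithmetic; the function names are hypothetical and it assumes the three-byte header described above.

```python
def bytes_per_row(items_per_row):
    """Bytes needed for one row at four genotypes per byte, rounded up."""
    return (items_per_row + 3) // 4

def expected_bed_size(n_snps, n_individuals, snp_major=True):
    """Expected file size in bytes, including the three header bytes."""
    if snp_major:
        rows, per_row = n_snps, n_individuals
    else:
        rows, per_row = n_individuals, n_snps
    return 3 + rows * bytes_per_row(per_row)

# Example fileset: 3 SNPs, 6 individuals, SNP-major mode.
print(expected_bed_size(3, 6))  # 9
```

With 6 individuals each SNP needs 2 bytes (the second byte is half padding), so the example file is 3 + 3 x 2 = 9 bytes, matching the nine bytes in the xxd output.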

So, considering the full test.bed file: here we consider the six bytes that contain all the genotype data. We take them one at a time, showing how up to 4 genotypes are extracted from each byte to make up the entire dataset. Some positions are marked (null), meaning that all the genotypes for that SNP have already been read, so the remaining bits are skipped and we advance to the start of a new byte for the next SNP (when in SNP-major mode):

                Genotype    Person    SNP
     11011100 

           00   G/G         1 1       snp1
         11     A/A         1 2       snp1
       10       0/0         1 3       snp1
     11         A/A         2 1       snp1


     00001111 

           11   A/A         2 2       snp1
         11     A/A         2 3       snp1
       00       (null)
     00         (null)


     11100111
           
           11   2/2         1 1       snp2
         10     0/0         1 2       snp2
       01       1/2         1 3       snp2
     11         2/2         2 1       snp2


     00001111 
  
           11   2/2         2 2       snp2
         11     2/2         2 3       snp2
       00       (null) 
     00         (null)


     01101011

           11   C/C         1 1       snp3
         01     A/C         1 2       snp3
       01       A/C         1 3       snp3
     10         0/0         2 1       snp3


     00000001

           10   0/0         2 2       snp3
         00     A/A         2 3       snp3
       00       (null)
     00         (null)
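The walk-through above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not PLINK's implementation: the helper decode_snp_major is hypothetical, the nine bytes are copied from the xxd output, and the allele names and person IDs are typed in from test.bim and test.fam rather than read from disk.

```python
# Hypothetical helper, not part of PLINK: decode the example test.bed by hand.
BED = bytes([0x6C, 0x1B, 0x01,               # magic number and SNP-major flag
             0xDC, 0x0F, 0xE7, 0x0F, 0x6B, 0x01])
SNPS = [("snp1", "G", "A"), ("snp2", "1", "2"), ("snp3", "A", "C")]  # test.bim
PEOPLE = ["1 1", "1 2", "1 3", "2 1", "2 2", "2 3"]                  # test.fam

def decode_snp_major(bed, snps, people):
    """Map (person, SNP name) to a genotype string such as 'G/G' or '0/0'."""
    per_snp = (len(people) + 3) // 4   # bytes per SNP; unused trailing bits are padding
    table = {}
    for s, (name, a1, a2) in enumerate(snps):
        start = 3 + s * per_snp        # skip the three header bytes
        for p, person in enumerate(people):
            byte = bed[start + p // 4]
            shift = 2 * (p % 4)        # genotypes fill each byte from the right
            code = f"{(byte >> shift) & 1}{(byte >> (shift + 1)) & 1}"
            table[(person, name)] = {
                "00": f"{a1}/{a1}",
                "01": f"{a1}/{a2}",
                "11": f"{a2}/{a2}",
                "10": "0/0",
            }[code]
    return table

table = decode_snp_major(BED, SNPS, PEOPLE)
for (person, snp), call in table.items():
    print(person, snp, call)
```

The (null) positions never appear in the result because the inner loop stops after the last person, leaving the padding bits of each SNP's final byte unread.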

In summary, the following define the BED file format:
  • First two bytes 01101100 00011011 for PLINK v1.00 BED file
  • Third byte is 00000001 (SNP-major) or 00000000 (individual-major)
  • Genotype data, either in SNP-major or individual-major order
  • New "row" always starts a new byte
  • Each byte encodes up to 4 genotypes
  • 10 indicates missing genotype, otherwise 0 and 1 point to allele 1 or allele 2 in the BIM file, respectively
  • Bits in each byte read in reverse order
Any changes to this format will be accompanied by a different, unique magic number, and will be backwards compatible in PLINK.
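The summary rules can also be applied in reverse to write a SNP-major BED file. The Python sketch below is illustrative only: the names CODE_FOR, encode_row and encode_bed_snp_major are hypothetical, and the labels hom1, het, hom2 and missing are a convenience invented for this example.

```python
# Codes written as in the text: first-read bit, then second-read bit.
CODE_FOR = {"hom1": "00", "het": "01", "hom2": "11", "missing": "10"}

def encode_row(calls):
    """Pack one SNP's genotype calls into bytes, four per byte, filled from the right."""
    out = bytearray((len(calls) + 3) // 4)   # unused trailing bits stay 0
    for i, call in enumerate(calls):
        first, second = (int(b) for b in CODE_FOR[call])
        out[i // 4] |= (first | (second << 1)) << (2 * (i % 4))
    return bytes(out)

def encode_bed_snp_major(rows):
    """Prefix the magic number and SNP-major flag, then append one padded row per SNP."""
    header = bytes([0b01101100, 0b00011011, 0b00000001])
    return header + b"".join(encode_row(r) for r in rows)

rows = [
    ["hom1", "hom2", "missing", "hom2", "hom2", "hom2"],   # snp1
    ["hom2", "missing", "het", "hom2", "hom2", "hom2"],    # snp2
    ["hom2", "het", "het", "missing", "missing", "hom1"],  # snp3
]
print(" ".join(f"{b:08b}" for b in encode_bed_snp_major(rows)))
```

For the example genotypes this yields the same nine bytes as the xxd output of test.bed.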

Earlier versions: v0.99 BED files do not contain the 2-byte magic number; BED files prior to 0.99 are always in individual-major mode and contain neither the magic number nor the SNP-major/individual-major identifier. PLINK will indicate if these earlier, legacy files are found.

 
This document last modified Wednesday, 25-Jan-2017 11:39:26 EST