Utilities

PLINK/SEQ provides a number of auxiliary command-line utilities that can be useful when working with variant data:

  • gcol: view selected columns in a tab-delimited file
  • tab2vcf: create a VCF file from a simple tab-delimited list of variants
  • behead: make wide, large tab-delimited text files more human readable

gcol: get columns text utility

gcol is a very simple utility for extracting certain columns from a tab-delimited text file, based on names in a header (the first row of the input). It is sometimes easier to use than a more general tool such as awk. gcol can be summarized as follows:

  • All input is read from STDIN.
  • Fields to extract are listed as arguments to gcol.
  • Arguments that do not match any field in the header are printed as literals in that position in the output.

For illustration, suppose the input file my.dat contains:

A  B  C
1  2  3
4  5  6

The following commands demonstrate use of gcol:

gcol A C < my.dat
A  C
1  3
4  6
gcol C B B < my.dat
C  B  B
3  2  2
6  5  5
cat my.dat | gcol A X B
A  X  B
1  X  2
4  X  5
gcol A ";" B "some text" C < my.dat
A  ;  B  some text  C
1  ;  2  some text  3
4  ;  5  some text  6
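
The behavior shown in these examples can be sketched in a few lines of Python. This is only an illustration of the selection rule (named columns are looked up in the header, anything else is echoed as a literal), not the actual gcol implementation; the function name gcol and the first-match rule for repeated header names are assumptions.

```python
def gcol(lines, wanted):
    """Select columns by header name; names not in the header are echoed as literals."""
    rows = [line.rstrip("\n").split("\t") for line in lines]
    if not rows:
        return []
    # Map each header name to its column index (assumption: first occurrence wins).
    index = {}
    for i, name in enumerate(rows[0]):
        index.setdefault(name, i)
    # The header row is treated like any other row, so it is reordered too.
    return ["\t".join(row[index[w]] if w in index else w for w in wanted)
            for row in rows]


data = ["A\tB\tC", "1\t2\t3", "4\t5\t6"]
print("\n".join(gcol(data, ["A", "X", "B"])))
```

Because the header is passed through the same lookup as the data rows, a literal such as X also appears in the output header, as in the third example above.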

tab2vcf

This utility is designed to take a regular, tab-delimited text file and convert it to a sites-only VCF. For example, based on the text file t.txt:

cat t.txt
CHROM   POS     ALT     REF     STAT    P       GENE
1       222     A       G       1.12    0.09    ABC1
1       333     G       T       -0.23   0.23    XYZ1
2       444     AT      A       NA      NA      .
tab2vcf < t.txt
##fileformat=VCFv4.1
##source=tab2vcf
##INFO=<ID=STAT,Number=1,Type=Float,Description="n/a">
##INFO=<ID=P,Number=1,Type=Float,Description="n/a">
##INFO=<ID=GENE,Number=1,Type=Float,Description="n/a">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       222     .       G       A       .       .       STAT=1.12;P=0.09;GENE=ABC1
1       333     .       T       G       .       .       STAT=-0.23;P=0.23;GENE=XYZ1
2       444     .       A       AT      .       .       STAT=NA;P=NA;GENE=.
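
The mapping from an input row to a VCF data line in this example can be sketched as follows. This is a minimal illustration rather than the tab2vcf source: it assumes the input header uses exactly the names CHROM and POS, that missing fixed fields are written as ".", and that all remaining columns go into INFO in their original order. The function name row_to_vcf is hypothetical.

```python
# Fixed VCF columns that follow CHROM and POS, in VCF order.
FIXED = ("ID", "REF", "ALT", "QUAL", "FILTER")


def row_to_vcf(header, fields):
    """Build one sites-only VCF data line from a header and a row of values."""
    rec = dict(zip(header, fields))
    # Fixed fields absent from the input are filled with '.' (assumption).
    fixed = [rec.get(key, ".") for key in FIXED]
    reserved = {"CHROM", "POS", *FIXED}
    # Every other column becomes a key=value pair in INFO.
    info = ";".join(f"{k}={rec[k]}" for k in header if k not in reserved)
    return "\t".join([rec["CHROM"], rec["POS"], *fixed, info or "."])


header = ["CHROM", "POS", "ALT", "REF", "STAT", "P", "GENE"]
print(row_to_vcf(header, ["1", "222", "A", "G", "1.12", "0.09", "ABC1"]))
```

Note that the input lists ALT before REF, while the output follows the VCF column order (REF, then ALT), because fields are placed by name rather than by position.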

By default, tab2vcf looks for the following keywords in the header to structure the VCF:

        ID
        REF
        ALT
        QUAL
        FILTER

The following fields in the input are recognized and used to specify the CHROM and POS fields in the VCF:

   VAR/LOC                chr1:12345

   CHR/CHROM              chr1
   BP/POS/BP1/POS1        12345

It can also accept a multi-base region (via CHR or LOC, or with a field BP2 or POS2), in which case the VCF POS field will be in the form 12345..12347, for example.

Any leading # character in the header is ignored.
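
As a rough illustration of the header rules above, the sketch below derives CHROM and POS from a row. It is an assumption-laden sketch, not the tab2vcf code: the function names locate and first are hypothetical, chromosome names are passed through unchanged, a combined VAR/LOC field takes precedence over separate fields, and a region inside a VAR/LOC value is simply kept as written after the colon.

```python
LOC_KEYS = ("VAR", "LOC")                # combined form, e.g. chr1:12345
CHROM_KEYS = ("CHR", "CHROM")
POS_KEYS = ("BP", "POS", "BP1", "POS1")
END_KEYS = ("BP2", "POS2")               # optional end of a multi-base region


def first(rec, keys):
    """Return the value of the first key present in rec, or None."""
    for key in keys:
        if key in rec:
            return rec[key]
    return None


def locate(header, fields):
    """Return (CHROM, POS) for one row, following the keyword table above."""
    # A leading '#' on the header line is ignored.
    names = [header[0].lstrip("#")] + list(header[1:])
    rec = dict(zip(names, fields))
    loc = first(rec, LOC_KEYS)
    if loc is not None:
        chrom, _, pos = loc.partition(":")
        return chrom, pos
    chrom, pos = first(rec, CHROM_KEYS), first(rec, POS_KEYS)
    end = first(rec, END_KEYS)
    if end is not None:
        pos = f"{pos}..{end}"            # region form, e.g. 12345..12347
    return chrom, pos


print(locate(["#CHR", "BP", "BP2"], ["chr1", "12345", "12347"]))
```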

A field can be skipped by using the SKIP keyword as below. Further, header fields that define the meta-data can be entered into the VCF by adding quoted terms in the form: "name:length:type:description":

tab2vcf SKIP=STAT "P:1:Float:Asymptotic p-value" "GENE:.:String:Official gene symbol" < t.txt
##fileformat=VCFv4.1
##source=tab2vcf
##INFO=<ID=P,Number=1,Type=Float,Description="Asymptotic p-value">
##INFO=<ID=GENE,Number=.,Type=String,Description="Official gene symbol">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       222     .       G       A       .       .       P=0.09;GENE=ABC1
1       333     .       T       G       .       .       P=0.23;GENE=XYZ1
2       444     .       A       AT      .       .       P=NA;GENE=.

If these are not specified, each meta-data field is assumed to hold a single, floating-point value (as in the first example, where even GENE is declared with Type=Float).
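
The relationship between a "name:length:type:description" term and the resulting ##INFO header line can be sketched as follows. The function name info_line is hypothetical, and the fallback values (Number=1, Type=Float, Description="n/a") are taken from the first example's output rather than from the tab2vcf source.

```python
def info_line(spec):
    """Turn 'name:length:type:description' (or a bare name) into a ##INFO line."""
    # Split at most three times so the description may itself contain ':'.
    parts = spec.split(":", 3)
    name = parts[0]
    if len(parts) == 4:
        number, vtype, desc = parts[1:]
    else:
        # No full specification given: fall back to a single Float value.
        number, vtype, desc = "1", "Float", "n/a"
    return f'##INFO=<ID={name},Number={number},Type={vtype},Description="{desc}">'


for term in ("P:1:Float:Asymptotic p-value", "GENE:.:String:Official gene symbol", "STAT"):
    print(info_line(term))
```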

behead

behead is a simple text processing utility to make wide tab-delimited files of regular format (same number of fields on each line) easier to read, by lining up the header against each subsequent row, and printing each field on a separate line. Based on the text file t.txt shown in the tab2vcf example above:

behead < t.txt
1       1       CHROM   1
1       2       POS     222
1       3       ALT     A
1       4       REF     G
1       5       STAT    1.12
1       6       P       0.09
1       7       GENE    ABC1

2       1       CHROM   1
2       2       POS     333
2       3       ALT     G
2       4       REF     T
2       5       STAT    -0.23
2       6       P       0.23
2       7       GENE    XYZ1

3       1       CHROM   2
3       2       POS     444
3       3       ALT     AT
3       4       REF     A
3       5       STAT    NA
3       6       P       NA
3       7       GENE    .

The first two numbers are just the row and column numbers. One could imagine using this in conjunction with other commands, e.g. to view the top associated SNPs using awk, although you have to make sure that the header row is included in the output:

awk ' NR == 1 || $6 < 1e-3 ' t.txt | behead

This can be particularly useful, for example, when viewing the output from the PSEQ v-assoc command.
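
For readers who want to reproduce the behead layout in a script, the transformation can be sketched in Python as below. This is an illustrative reimplementation under simple assumptions (tab-delimited input, the same number of fields on every line, a blank line after each record), not the behead source; the function name is reused only for clarity.

```python
def behead(lines):
    """Print each data row as one 'row, column, header name, value' line per field."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    out = []
    for r, row in enumerate(body, start=1):
        for c, (name, value) in enumerate(zip(header, row), start=1):
            out.append(f"{r}\t{c}\t{name}\t{value}")
        out.append("")  # blank line separates records
    return out


table = [
    "CHROM\tPOS\tALT\tREF\tSTAT\tP\tGENE",
    "1\t222\tA\tG\t1.12\t0.09\tABC1",
    "1\t333\tG\tT\t-0.23\t0.23\tXYZ1",
]
print("\n".join(behead(table)))
```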