Utilities
PLINK/SEQ provides a number of auxiliary command-line utilities that can be useful when working with variant data:
- gcol: view selected columns in a tab-delimited file
- tab2vcf: create a VCF file from a simple tab-delimited list of variants
- behead: make wide, large tab-delimited text files more human readable
gcol: get columns text utility
gcol is a very simple utility for extracting certain columns from a tab-delimited text file, based on names in a header (the first row of the input). It is sometimes easier to use than a more general tool such as awk. gcol can be summarized as follows:
- All input is read from STDIN
- Fields to extract are listed as arguments to gcol
- Fields not found in the input are printed as literals in that position in the output
For illustration, suppose the input file my.dat contains:
A B C
1 2 3
4 5 6
The following commands demonstrate use of gcol:
gcol A C < my.dat
A C
1 3
4 6
gcol C B B < my.dat
C B B
3 2 2
6 5 5
cat my.dat | gcol A X B
A X B
1 X 2
4 X 5
gcol A ";" B "some text" C < my.dat
A ; B some text C
1 ; 2 some text 3
4 ; 5 some text 6
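The selection rules illustrated above can be approximated in a few lines of Python. This is only a minimal sketch of the described behaviour, not the PLINK/SEQ implementation: the function name `gcol`, the `MY_DAT` constant, and the choice to skip blank lines are assumptions made for illustration.

```python
import io


def gcol(fields, stream):
    """Yield tab-joined output rows for the requested fields.

    Names found in the header select that column; any other name is
    emitted as a literal in that position, as described above.
    """
    header = stream.readline().rstrip("\n").split("\t")
    column = {name: i for i, name in enumerate(header)}
    # The output header is simply the list of requested fields.
    yield "\t".join(fields)
    for line in stream:
        if not line.strip():
            continue  # skip blank lines (an assumption of this sketch)
        row = line.rstrip("\n").split("\t")
        yield "\t".join(row[column[f]] if f in column else f for f in fields)


MY_DAT = "A\tB\tC\n1\t2\t3\n4\t5\t6\n"  # tab-delimited contents of my.dat

for out in gcol(["A", "X", "B"], io.StringIO(MY_DAT)):
    print(out)
```

To read from standard input as the real utility does, the same function could be called as `gcol(sys.argv[1:], sys.stdin)`.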
tab2vcf
This utility is designed to take a regular, tab-delimited text file and convert it to a sites-only VCF. For example, based on the text file t.txt:
cat t.txt
CHROM POS ALT REF STAT  P    GENE
1     222 A   G   1.12  0.09 ABC1
1     333 G   T   -0.23 0.23 XYZ1
2     444 AT  A   NA    NA   .
tab2vcf < t.txt
##fileformat=VCFv4.1
##source=tab2vcf
##INFO=<ID=STAT,Number=1,Type=Float,Description="n/a">
##INFO=<ID=P,Number=1,Type=Float,Description="n/a">
##INFO=<ID=GENE,Number=1,Type=Float,Description="n/a">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 222 . G A . . STAT=1.12;P=0.09;GENE=ABC1
1 333 . T G . . STAT=-0.23;P=0.23;GENE=XYZ1
2 444 . A AT . . STAT=NA;P=NA;GENE=.
By default, tab2vcf looks for the following keywords in the header to structure the VCF:
ID REF ALT QUAL FILTER
The following fields in the input are recognized and used to specify the CHROM and POS fields in the VCF:
VAR/LOC          chr1:12345
CHR/CHROM        chr1
BP/POS/BP1/POS1  12345
It can also accept a multi-base region (via CHR or LOC, or with a field BP2 or POS2), in which case the VCF POS field will be in the form 12345..12347, for example.
Any leading # character in the header is ignored.
A field can be skipped by using the SKIP keyword as below. Further, header fields that define the meta-data can be entered into the VCF by adding quoted terms in the form: "name:length:type:description":
tab2vcf SKIP=STAT \
    "P:1:Float:Asymptotic p-value" \
    "GENE:.:String:Official gene symbol" < t.txt
##fileformat=VCFv4.1
##source=tab2vcf
##INFO=<ID=P,Number=1,Type=Float,Description="Asymptotic p-value">
##INFO=<ID=GENE,Number=.,Type=String,Description="Official gene symbol">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 222 . G A . . P=0.09;GENE=ABC1
1 333 . T G . . P=0.23;GENE=XYZ1
2 444 . A AT . . P=NA;GENE=.
If these are not specified, all meta-data is assumed to be a single, floating-point value (as in the first example).
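The mapping from "name:length:type:description" terms to ##INFO header lines, including the default applied to undeclared fields, can be sketched in Python as follows. This is an illustrative approximation rather than tab2vcf's own code: the helper names `parse_spec` and `info_lines` are hypothetical, and allowing colons inside the description is an assumption of the sketch.

```python
def parse_spec(spec):
    """Split a 'name:length:type:description' term into its parts."""
    # Split at most three times so the description may itself contain
    # colons (an assumption of this sketch).
    name, number, vtype, description = spec.split(":", 3)
    return name, (number, vtype, description)


def info_lines(fields, specs=(), skip=()):
    """Return one ##INFO header line per non-skipped meta-data field."""
    declared = dict(parse_spec(s) for s in specs)
    lines = []
    for field in fields:
        if field in skip:
            continue
        # Undeclared fields default to a single floating-point value.
        number, vtype, description = declared.get(field, ("1", "Float", "n/a"))
        lines.append(
            f'##INFO=<ID={field},Number={number},Type={vtype},'
            f'Description="{description}">'
        )
    return lines


header = info_lines(
    ["STAT", "P", "GENE"],
    specs=["P:1:Float:Asymptotic p-value", "GENE:.:String:Official gene symbol"],
    skip={"STAT"},
)
print("\n".join(header))
```

Calling `info_lines(["STAT", "P", "GENE"])` with no terms gives the all-Float defaults seen in the first example.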
behead
behead is a simple text processing utility to make wide tab-delimited files of regular format (same number of fields on each line) easier to read, by lining up the header against each subsequent row, and printing each field on a separate line. Based on the text file t.txt shown in the tab2vcf example above:
behead < t.txt
1 1 CHROM 1
1 2 POS 222
1 3 ALT A
1 4 REF G
1 5 STAT 1.12
1 6 P 0.09
1 7 GENE ABC1
2 1 CHROM 1
2 2 POS 333
2 3 ALT G
2 4 REF T
2 5 STAT -0.23
2 6 P 0.23
2 7 GENE XYZ1
3 1 CHROM 2
3 2 POS 444
3 3 ALT AT
3 4 REF A
3 5 STAT NA
3 6 P NA
3 7 GENE .
The first two numbers are just the row and column numbers. One could imagine using this in conjunction with other commands, e.g. to view the top associated SNPs using awk, although you have to make sure that the header row is included in the output:
awk ' NR == 1 || $6 < 1e-3 ' t.txt | behead
This can be particularly useful, for example, when viewing the output from the PSEQ v-assoc command.
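For readers who want the same reshaping without the utility, the transformation behead performs can be sketched in Python. This is a minimal approximation under stated assumptions, not the PLINK/SEQ source: the function name `behead` and the `T_TXT` constant are illustrative, and the tab separator used in the output is a guess at the real utility's formatting.

```python
import io


def behead(stream):
    """Yield one 'row, column, name, value' line per field of each data row."""
    header = stream.readline().rstrip("\n").split("\t")
    for row_no, line in enumerate(stream, start=1):
        values = line.rstrip("\n").split("\t")
        # Pair each value with the header name in the same position.
        for col_no, (name, value) in enumerate(zip(header, values), start=1):
            yield f"{row_no}\t{col_no}\t{name}\t{value}"


# Tab-delimited contents of t.txt from the tab2vcf example.
T_TXT = (
    "CHROM\tPOS\tALT\tREF\tSTAT\tP\tGENE\n"
    "1\t222\tA\tG\t1.12\t0.09\tABC1\n"
    "1\t333\tG\tT\t-0.23\t0.23\tXYZ1\n"
    "2\t444\tAT\tA\tNA\tNA\t.\n"
)

for out in behead(io.StringIO(T_TXT)):
    print(out)
```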