Locus (and segment) database operations
The LOCDB stores reference genomic regions and groups of regions. It is most often used to represent genes, transcripts and exons, but can be extended to represent any regions of interest.
A SEGDB is similar to a LOCDB, except that it also tracks genomic intervals as the property of specific individuals. Such a database could be used to track, for example, information on large structural variants (deletions and duplications).
- Loading loci: using loc-load to load GTF or region files into a LOCDB
- Loading segments: using seg-load to load a segment list into a SEGDB
- Locus aliases and alternate names: loc-load-alias and loc-delete-alias commands
- Viewing loci and segments: using the loc-view and seg-view command
- Loading locus sets: the locset-load command to specify, for example, a pathway or set of genes
- Additional information about loci: the loc-stats to annotate loci with information from the reference and sequence databases
- Intersecting LOCDB groups: using loc-intersect to intersect a group of loci based on genomic coordinates
- Misc. LOCDB commands: various other locus operations (loc-merge, loc-delete, loc-translate).
Load a GTF or region file
To load a set of loci into the LOCDB, use the command
pseq /path/to/my/project load-loc --file myfile.gtf
--group myloci
This assumes a GTF2.2 format file.
This creates a group called myloci in the LOCDB in which each entry represents a single exon. By default, the transcript_id field of the GTF is used to assign a name to each region. To use the gene_id field instead, add the option:
pseq /path/to/my/project load-loc --file myfile.gtf
--group myloci
--options gene_id
If a locus file ends with extension is .reg ( or .reg.gz) instead of .gtf, instead of .gtf (or .gtf.gz) this implies a format other than GTF. Specifically, this assumes a rectangular, tab-delimited format:
#CHR POS1 POS2 ID chr1 20229 20366 target_1 chr1 58952 59873 target_2 chr1 357520 358462 target_3 chr1 610957 611899 target_4 ...
The first row is expected to contain a header (starting with a # symbol), that contains the keywords CHR (or CHROM), POS1 (or BP1), and POS2 (or BP2) and ID. Alternatively, POS and ID can be specified for data in the following format:
#POS ID chr1:20229..20366 target_1 chr1:58952..59873 target_2 ...
If any additional fields are present, they will be appended in the database as a meta-attribute. The default type will be as a String, unless an option is given as in the example below, which indicates three additional columns in the .reg file (i.e. CLASS, SZ and GC_PCT should feature in the header row):
pseq /path/to/my/project load-loc --file myfile.reg
--group myloci2
--options CLASS=String SZ=Integer GC_PCT=Float
Loading segment files into a SEGDB
By convention, segment files should end with a .seg file extension and have the format:
Header row starting with '#' character Required fields CHR, BP1, BP2 (or CHROM, POS1, POS2) Individual ID (ID) Optional segment-specific ID (SEGID) Meta-information field (META)
The seg-load command will then create a SEGDB if it does not already exist, and load the list of segments into a group. For example, given the segment list:
cat example.seg
#CHROM POS1 POS2 ID META chr1 1002 1003 P001 SEG=1;Q=99 chr1 1001 1001 P002 SEG=2;Q=95 chr1 1001 1002 P001 SEG=3;Q=20 chr1 1004 1006 P005 SEG=4;Q=80
pseq proj seg-load --file example.seg --group myseg --segdb segdb
Currently, the meta-information in such a file is stored in the SEGDB along with the segments, but there are not automated ways to use or access this information. Rather, this is design with future functionality in mind.
Note: currently, a SEGDB entry is not automatically appended to the project file. This is simply an omission in the current 1.08 release and will be fixed. In the mean time, either always specify the SEGDB with the --segdb command line option, or manually add this entry to the project-specification file: two tab-delimited entries:
/full/path/to/this/segdb SEGDB
Load an alias table
To load a gene alias table, use the command
pseq . loc-load-alias --locdb /path/to/locdb --file aliases.txt
The format of a gene-alias table should be tab-delimited; the first row should start with a # symbol and contain the names of the IDs; a n/a or '.' character indicates this is unknown/undefined: e.g.
#symbol refseq ccds WASH7P NR_024540 n/a FAM138F NR_026820 n/a FAM138A NR_026818 n/a FAM138C NR_026822 n/a OR4F5 NM_001005484 CCDS30547.1 PLEKHN1 NM_001160184 n/a PLEKHN1 NM_032129 CCDS4.1 ...
Subsequently, this information can be used to convert between ID schemes, etc, and give aliases in the output of various commands.
To remove all aliases, use the command
pseq . loc-delete-alias --locdb /path/to/locdb
To swap in an alternate locus name:
pseq /path/to/project loc-swap-names
--alternate-name
--file refseq2symbol.txt
--group refseq
The file refseq2symbol.txt in this example is assumed to contain two tab-delimited entries per row: the original name, and the new, to-be-swapped-in alternate name. A header line starting the '#' will be ignored; any locus ID in the first field that are not found in the specified group will also be ignored.
The alternate names will now be display by the loc-view command, for example:
pseq /path/to/project loc-view --group refseq
yields
NM_001164554 chr1:229829237..229904401 DISC1 DISC1 NM_001164552 chr1:229829237..229923260 DISC1 DISC1 NM_001164550 chr1:229829237..229924951 DISC1 DISC1 NM_001164553 chr1:229829237..229924951 DISC1 DISC1 NM_001012958 chr1:229829237..229924967 DISC1 CCDS31056.1,DISC1 ...where the third column is the alternate gene name. Note that using the loc-swap-name without the --alternate-name flag will attempt to change the primary ID of the locus instead.
In addition, the gene component of the mask is designed to take gene symbols and look up the consistent transcripts in (by default) the refseq locus group. That is,
--mask gene=DISC1is equivalent to
--mask loc.subset=refseq,NM_001164554,NM_001164552,NM_001164550,NM_001164553,NM_001012958,...
View locus information
To dump the loci from a group in the locus database (LOCDB), use the command
pseq /path/to/my/project loc-view --group refseq
This simply lists the names of the loci, and the genomic co-ordinate. The third column lists the alternate name. Also, all available aliases are also displayed in the fourth column (in this example, gene symbols and CCDS IDs have been loaded in the gene-name alias table in LOCDB).
NM_004038_dup2 chr1:104094615..104102800 NM_004038 . NM_018137 chr1:107400861..107401985 NM_018137 CCDS41360.1,PRMT6 NM_001113226 chr1:107492739..107824982 NM_001113226 NTNG1 NM_001113228 chr1:107492739..107824982 NM_001113228 NTNG1 NM_014917 chr1:107492739..107824982 NM_014917 CCDS30785.1,NTNG1 NM_001079874 chr1:107917479..108032576 NM_001079874 VAV3 NM_006113 chr1:107917479..108309014 NM_006113 CCDS785.1,VAV3 ...
One can view groups in a segment database with the seg-view command: given the example segment file used above, this would give:
pseq proj seg-view --segdb segdb --group myseg
P001 2 chr1:1001..1002 chr1:1002..1003 P002 1 chr1:1001..1001 P003 0 P004 0 P005 1 chr1:1004..1006
which lists each individual (P001, P002, etc), the number of segments they have in that group, and a list of those segments.
Load a locus-set, for example, to specify a set of genes
To add a locus-set to the LOCDB (i.e. representing a pathway, or a set of genes), use the following:
pseq proj load-locset --group refseq
--name pathway1
--file myfile.txt
The file myfile.txt is in the format locus-ID, set-ID (tab-delimited, two fields per line):
ACP1 ADHERENS_JUNCTION ACTB ADHERENS_JUNCTION ACTC1 ADHERENS_JUNCTION ACTG1 ADHERENS_JUNCTION ACTN1 ADHERENS_JUNCTION ... ACACB ADIPOCYTOKINE_SIGNALING_PATHWAY ACSL1 ADIPOCYTOKINE_SIGNALING_PATHWAY ACSL3 ADIPOCYTOKINE_SIGNALING_PATHWAY ...
In the above example, we specify that the locus IDs in the first column should be found in a previously-loaded locus group called refseq. This will create a locus-set group called pathway1, that is based on refseq genes and contains multiple sets (i.e. two of which are shown above). This locus-set is subsequently referred to by both the locus-group and the locus-set group name, i.e. in a mask:
--mask locset.group=refseq,pathway1
View locus information with additional annotation from the reference and sequence databases
The command loc-stats lists loci in a LOCDB along with some descriptive statistics from the intersection of attached REFDB and SEQDB (here we assume these databases are already specified as part of the proj project):
pseq proj loc-stats --group refseq
The output is similar to the output from loc-view, except some additional quantities are presented. Specifically, the output contains the fields:
LOCUS name of locus ALIAS any aliases, in key=value pairs NSUB number of subregions POS genomic co-ordinate L length (in base-pairs), as sum of subregions L_ALL length of entire region GC % of GC bases, for all subregions GC_ALL % of GC bases, for entire region N % of repeat-masked bases, for all subregions N_ALL % of repeat-masked bases, for entire region
If a locus represents a gene, and the subregions represented exons, then the difference between GC and GC_ALL, for example, is that the latter includes intronic bases. One line of output is shown here (line transposed for clarity):
NM_194458 ccds=CCDS16.1,symbol=UBE2J2 5 chr1:1180449..1188604 621 8156 0.52818 0.575298 0 0.280284
If the --ref option is specified:
--ref g1k
and a group exists in the REFDB named g1k (for example, corresponding to variants observed in the 1000 Genomes project), then two extra columns are appended to the output, containing the following information:
NREF Number of reference variants in subregions NREF_ALL Number of reference variants in entire regionIf the option
--show-subregions
is added, then additional lines are shown for each subregion separately. An additional column is inserted in the output, which indicates the number of the subregion (this is 0 for the main, spanning locus). For example: (some of output removed for clarity) -- here the five exons are shown; the last line is the entire region (introns and exons):
NM_194458 symbol=UBE2J2 NA 1 chr1:1180449..1180730 282 NA 0.60 NA 0 NA NM_194458 symbol=UBE2J2 NA 2 chr1:1181288..1181368 81 NA 0.35 NA 0 NA NM_194458 symbol=UBE2J2 NA 3 chr1:1182235..1182373 139 NA 0.60 NA 0 NA NM_194458 symbol=UBE2J2 NA 4 chr1:1182451..1182553 103 NA 0.41 NA 0 NA NM_194458 symbol=UBE2J2 NA 5 chr1:1188589..1188604 16 NA 0.44 NA 0 NA NM_194458 symbol=UBE2J2 5 0 chr1:1180449..1188604 621 8156 0.53 0.57 0 0.28
If the option
--no-subregions
is added, then only the entire-region columns are displayed (i.e. only GC_ALL, not GC).
Intersect a group of loci based on genomic co-ordinates
Given a group of loci in LOCDB, to obtain a listing of which loci overlap a second set (contained in a text file), use the command:
pseq proj loc-intersect --group refseq --file regions.txt
where regions.txt is in the following format, e.g.:
chr1:10000..99000 chr22:21500000..22000000 chr8:999..1000 chr10:61558165
yields the output:
chr1:10000..99000 1 OR4F5 chr22:21500000..22000000 4 BCR|GNAZ|RAB36|RTDR1 chr8:999..1000 0 . chr10:61558165..61558165 1 ANK3
indicating chromosome, base-position of target region (start and stop) and then the number and a list of any loci (genes in this case) that intersect that target region.
Misc. LOCDB operations
Here we describe less commonly used commands to merge or delete locus groups; also, to translate a gene transcript into an amino-acid sequence, given an attached SEDB.
Merge subregions in a locus group
A GTF file contains records that can be grouped according to the gene_id or transcript_id. For example, the exons in a single gene. For a gene-based test, one will often want to iterate over all groups of variants in a gene (from all exons), rather than single exons. To create a new group in the LOCDB, use the command
pseq /path/to/my/project loc-merge --group myloci myloci2
This command expects two group names: the first (myloci here) must already exist in the LOCDB. The second is the name that is given to the newly-formed group. The individual regions in the first group are stored as ''subregions'' in the second. For example, if the original group contained six records:
myloci chr1:100..200 gene1 chr1:300..400 gene1 chr1:800..1000 gene1 chr1:2000..2100 gene2 chr1:2200..2250 gene2 chr1:3000..3200 gene2
The new group will contain two records, with subregion information that will be automatically extracted and can be used as appropriate in downstream analysis:
myloci2 chr1:100..1000 gene1 [ 100..200; 300..400; 800..1000 ] chr1:2000..3200 gene2 [ 2000..2100; 2200..2250; 3000..3200 ]
Note: by default, the loc-merge command is automatically invoked when loading a GTF. Use the --keep-unmerged flag along with loc-load to retain the original unmerged group (where each locus entry corresponds to one exon, not one transcript, and will have a -unmerged suffix on the group name).
Delete a locus group
To remove a group from a LOCDB, use the command:
pseq proj loc-delete --group groupname
Translate a locus to amino-acid sequence
The command
pseq proj loc-translate --group groupname
generates the following, with one locus per line (locus name, genomic position, number of exons and number of amino-acids, followed by the amino-acid sequence formatted into groups of 10)
NM_001042478 chr1:4615346..4734416 5 411 MWIQQLLGLS SMSIRWPGRP LGSHAWILIA MFQLAVDLPA CEALGPGPEF WLLPRSPPRP PRLWSFRSGQ PARVPAPVWS PRPPRVERIH GQMQMPRARR AHRPRDQAAA LVPKAGLAKP PAAAKSSPSL ASSSSSSSSA VAGGAPEQQA LLRRGKRHLQ GDGLSSFDSR ...
Adding the option
--verbose
adds the following lines of output for each gene.
exon 1 1 4615346 4615374 0 exon 2 1 4671820 4672619 28 exon 3 1 4729773 4729860 827 exon 4 1 4732200 4732445 914 exon 5 1 4734347 4734416 1159 chr1 1/5 fwd 4615347..4615348 1..3 1 ATG M chr1 1/5 fwd 4615350..4615351 4..6 2 TGG W chr1 1/5 fwd 4615353..4615354 7..9 3 ATT I chr1 1/5 fwd 4615356..4615357 10..12 4 CAA Q chr1 1/5 fwd 4615359..4615360 13..15 5 CAG Q chr1 1/5 fwd 4615362..4615363 16..18 6 CTT L chr1 1/5 fwd 4615365..4615366 19..21 7 TTA L chr1 1/5 fwd 4615368..4615369 22..24 8 GGA G chr1 1/5 fwd 4615371..4615372 25..27 9 CTC L chr1 2&1/5 fwd 4615374..4671820 28..30 10 AGC S chr1 2/5 fwd 4671822..4671823 31..33 11 TCC S chr1 2/5 fwd 4671825..4671826 34..36 12 ATG M chr1 2/5 fwd 4671828..4671829 37..39 13 TCC S chr1 2/5 fwd 4671831..4671832 40..42 14 ATC I chr1 2/5 fwd 4671834..4671835 43..45 15 CGC R chr1 2/5 fwd 4671837..4671838 46..48 16 TGG W chr1 2/5 fwd 4671840..4671841 49..51 17 CCG P
Note: currently no option to filter / select genes -- all loci/genes in a group will be output.