Locus (and segment) database operations

The LOCDB stores reference genomic regions and groups of regions. It is most often used to represent genes, transcripts and exons, but can be extended to represent any regions of interest.

A SEGDB is similar to a LOCDB, except that it also tracks genomic intervals as the property of specific individuals. Such a database could be used to track, for example, information on large structural variants (deletions and duplications).

Load a GTF or region file

To load a set of loci into the LOCDB, use the command

pseq /path/to/my/project load-loc --file myfile.gtf --group myloci

This assumes a GTF2.2 format file.

This creates a group called myloci in the LOCDB in which each entry represents a single exon. By default, the transcript_id field of the GTF is used to assign a name to each region. To use the gene_id field instead, add the option:

pseq /path/to/my/project load-loc --file myfile.gtf --group myloci --options gene_id

If a locus file ends with extension is .reg ( or .reg.gz) instead of .gtf, instead of .gtf (or .gtf.gz) this implies a format other than GTF. Specifically, this assumes a rectangular, tab-delimited format:

 #CHR    POS1    POS2    ID
 chr1    20229   20366   target_1
 chr1    58952   59873   target_2
 chr1    357520  358462  target_3
 chr1    610957  611899  target_4
 ...

The first row is expected to contain a header (starting with a # symbol), that contains the keywords CHR (or CHROM), POS1 (or BP1), and POS2 (or BP2) and ID. Alternatively, POS and ID can be specified for data in the following format:

 #POS                ID
 chr1:20229..20366   target_1
 chr1:58952..59873   target_2
 ...

If any additional fields are present, they will be appended in the database as a meta-attribute. The default type will be as a String, unless an option is given as in the example below, which indicates three additional columns in the .reg file (i.e. CLASS, SZ and GC_PCT should feature in the header row):

pseq /path/to/my/project load-loc --file myfile.reg --group myloci2 --options CLASS=String SZ=Integer GC_PCT=Float

Loading segment files into a SEGDB

By convention, segment files should end with a .seg file extension and have the format:

  Header row starting with '#' character
  Required fields CHR, BP1, BP2 (or CHROM, POS1, POS2) 
  Individual ID (ID)
  Optional segment-specific ID (SEGID)
  Meta-information field (META)

The seg-load command will then create a SEGDB if it does not already exist, and load the list of segments into a group. For example, given the segment list:

cat example.seg
  #CHROM POS1 POS2 ID   META 
  chr1   1002 1003 P001 SEG=1;Q=99
  chr1   1001 1001 P002 SEG=2;Q=95
  chr1   1001 1002 P001 SEG=3;Q=20
  chr1   1004 1006 P005 SEG=4;Q=80
pseq proj seg-load --file example.seg --group myseg --segdb segdb

Currently, the meta-information in such a file is stored in the SEGDB along with the segments, but there are not automated ways to use or access this information. Rather, this is design with future functionality in mind.

Note: currently, a SEGDB entry is not automatically appended to the project file. This is simply an omission in the current 1.08 release and will be fixed. In the mean time, either always specify the SEGDB with the --segdb command line option, or manually add this entry to the project-specification file: two tab-delimited entries:

/full/path/to/this/segdb     SEGDB

Load an alias table

To load a gene alias table, use the command

pseq . loc-load-alias --locdb /path/to/locdb --file aliases.txt

The format of a gene-alias table should be tab-delimited; the first row should start with a # symbol and contain the names of the IDs; a n/a or '.' character indicates this is unknown/undefined: e.g.

   #symbol refseq          ccds
   WASH7P  NR_024540       n/a
   FAM138F NR_026820       n/a
   FAM138A NR_026818       n/a
   FAM138C NR_026822       n/a
   OR4F5   NM_001005484    CCDS30547.1
   PLEKHN1 NM_001160184    n/a
   PLEKHN1 NM_032129       CCDS4.1
   ...

Subsequently, this information can be used to convert between ID schemes, etc, and give aliases in the output of various commands.

To remove all aliases, use the command

pseq . loc-delete-alias --locdb /path/to/locdb

To swap in an alternate locus name:

pseq /path/to/project loc-swap-names --alternate-name --file refseq2symbol.txt --group refseq

The file refseq2symbol.txt in this example is assumed to contain two tab-delimited entries per row: the original name, and the new, to-be-swapped-in alternate name. A header line starting the '#' will be ignored; any locus ID in the first field that are not found in the specified group will also be ignored.

The alternate names will now be display by the loc-view command, for example:

pseq /path/to/project loc-view --group refseq

yields

 NM_001164554    chr1:229829237..229904401       DISC1   DISC1
 NM_001164552    chr1:229829237..229923260       DISC1   DISC1
 NM_001164550    chr1:229829237..229924951       DISC1   DISC1
 NM_001164553    chr1:229829237..229924951       DISC1   DISC1
 NM_001012958    chr1:229829237..229924967       DISC1   CCDS31056.1,DISC1
 ...
where the third column is the alternate gene name. Note that using the loc-swap-name without the --alternate-name flag will attempt to change the primary ID of the locus instead.

In addition, the gene component of the mask is designed to take gene symbols and look up the consistent transcripts in (by default) the refseq locus group. That is,

  --mask gene=DISC1
is equivalent to
  --mask loc.subset=refseq,NM_001164554,NM_001164552,NM_001164550,NM_001164553,NM_001012958,...

View locus information

To dump the loci from a group in the locus database (LOCDB), use the command

pseq /path/to/my/project loc-view --group refseq

This simply lists the names of the loci, and the genomic co-ordinate. The third column lists the alternate name. Also, all available aliases are also displayed in the fourth column (in this example, gene symbols and CCDS IDs have been loaded in the gene-name alias table in LOCDB).

   NM_004038_dup2  chr1:104094615..104102800       NM_004038       .
   NM_018137       chr1:107400861..107401985       NM_018137       CCDS41360.1,PRMT6
   NM_001113226    chr1:107492739..107824982       NM_001113226    NTNG1
   NM_001113228    chr1:107492739..107824982       NM_001113228    NTNG1
   NM_014917       chr1:107492739..107824982       NM_014917       CCDS30785.1,NTNG1
   NM_001079874    chr1:107917479..108032576       NM_001079874    VAV3
   NM_006113       chr1:107917479..108309014       NM_006113       CCDS785.1,VAV3
   ...

One can view groups in a segment database with the seg-view command: given the example segment file used above, this would give:

pseq proj seg-view --segdb segdb --group myseg
P001  2 chr1:1001..1002 chr1:1002..1003
P002  1 chr1:1001..1001
P003  0
P004  0
P005  1 chr1:1004..1006

which lists each individual (P001, P002, etc), the number of segments they have in that group, and a list of those segments.

Load a locus-set, for example, to specify a set of genes

To add a locus-set to the LOCDB (i.e. representing a pathway, or a set of genes), use the following:

pseq proj load-locset --group refseq --name pathway1 --file myfile.txt

The file myfile.txt is in the format locus-ID, set-ID (tab-delimited, two fields per line):

   ACP1    ADHERENS_JUNCTION
   ACTB    ADHERENS_JUNCTION
   ACTC1   ADHERENS_JUNCTION
   ACTG1   ADHERENS_JUNCTION
   ACTN1   ADHERENS_JUNCTION
   ...
   ACACB   ADIPOCYTOKINE_SIGNALING_PATHWAY
   ACSL1   ADIPOCYTOKINE_SIGNALING_PATHWAY
   ACSL3   ADIPOCYTOKINE_SIGNALING_PATHWAY
   ...

In the above example, we specify that the locus IDs in the first column should be found in a previously-loaded locus group called refseq. This will create a locus-set group called pathway1, that is based on refseq genes and contains multiple sets (i.e. two of which are shown above). This locus-set is subsequently referred to by both the locus-group and the locus-set group name, i.e. in a mask:

--mask locset.group=refseq,pathway1

View locus information with additional annotation from the reference and sequence databases

The command loc-stats lists loci in a LOCDB along with some descriptive statistics from the intersection of attached REFDB and SEQDB (here we assume these databases are already specified as part of the proj project):

pseq proj loc-stats --group refseq

The output is similar to the output from loc-view, except some additional quantities are presented. Specifically, the output contains the fields:

   LOCUS    name of locus
   ALIAS    any aliases, in key=value pairs
   NSUB     number of subregions
   POS      genomic co-ordinate
   L        length (in base-pairs), as sum of subregions
   L_ALL    length of entire region 
   GC       % of GC bases, for all subregions 
   GC_ALL   % of GC bases, for entire region  
   N        % of repeat-masked bases, for all subregions
   N_ALL    % of repeat-masked bases, for entire region

If a locus represents a gene, and the subregions represented exons, then the difference between GC and GC_ALL, for example, is that the latter includes intronic bases. One line of output is shown here (line transposed for clarity):

   NM_194458
   ccds=CCDS16.1,symbol=UBE2J2     
   5       
   chr1:1180449..1188604   
   621     
   8156    
   0.52818 
   0.575298        
   0       
   0.280284

If the --ref option is specified:

--ref g1k

and a group exists in the REFDB named g1k (for example, corresponding to variants observed in the 1000 Genomes project), then two extra columns are appended to the output, containing the following information:

   NREF        Number of reference variants in subregions
   NREF_ALL    Number of reference variants in entire region
If the option --show-subregions

is added, then additional lines are shown for each subregion separately. An additional column is inserted in the output, which indicates the number of the subregion (this is 0 for the main, spanning locus). For example: (some of output removed for clarity) -- here the five exons are shown; the last line is the entire region (introns and exons):

  NM_194458 symbol=UBE2J2 NA 1 chr1:1180449..1180730 282 NA   0.60 NA   0 NA
  NM_194458 symbol=UBE2J2 NA 2 chr1:1181288..1181368 81  NA   0.35 NA   0 NA
  NM_194458 symbol=UBE2J2 NA 3 chr1:1182235..1182373 139 NA   0.60 NA   0 NA
  NM_194458 symbol=UBE2J2 NA 4 chr1:1182451..1182553 103 NA   0.41 NA   0 NA
  NM_194458 symbol=UBE2J2 NA 5 chr1:1188589..1188604 16  NA   0.44 NA   0 NA
  NM_194458 symbol=UBE2J2 5  0 chr1:1180449..1188604 621 8156 0.53 0.57 0 0.28

If the option

--no-subregions

is added, then only the entire-region columns are displayed (i.e. only GC_ALL, not GC).

Intersect a group of loci based on genomic co-ordinates

Given a group of loci in LOCDB, to obtain a listing of which loci overlap a second set (contained in a text file), use the command:

pseq proj loc-intersect --group refseq --file regions.txt

where regions.txt is in the following format, e.g.:

   chr1:10000..99000
   chr22:21500000..22000000
   chr8:999..1000
   chr10:61558165

yields the output:

   chr1:10000..99000          1   OR4F5
   chr22:21500000..22000000   4   BCR|GNAZ|RAB36|RTDR1
   chr8:999..1000             0   .
   chr10:61558165..61558165   1   ANK3

indicating chromosome, base-position of target region (start and stop) and then the number and a list of any loci (genes in this case) that intersect that target region.

Misc. LOCDB operations

Here we describe less commonly used commands to merge or delete locus groups; also, to translate a gene transcript into an amino-acid sequence, given an attached SEDB.

Merge subregions in a locus group

A GTF file contains records that can be grouped according to the gene_id or transcript_id. For example, the exons in a single gene. For a gene-based test, one will often want to iterate over all groups of variants in a gene (from all exons), rather than single exons. To create a new group in the LOCDB, use the command

pseq /path/to/my/project loc-merge --group myloci myloci2

This command expects two group names: the first (myloci here) must already exist in the LOCDB. The second is the name that is given to the newly-formed group. The individual regions in the first group are stored as ''subregions'' in the second. For example, if the original group contained six records:

   myloci 
      chr1:100..200     gene1
      chr1:300..400     gene1
      chr1:800..1000    gene1
      chr1:2000..2100   gene2
      chr1:2200..2250   gene2
      chr1:3000..3200   gene2

The new group will contain two records, with subregion information that will be automatically extracted and can be used as appropriate in downstream analysis:

   myloci2 
      chr1:100..1000    gene1    [ 100..200; 300..400; 800..1000 ]
      chr1:2000..3200   gene2    [ 2000..2100; 2200..2250; 3000..3200 ]

Note: by default, the loc-merge command is automatically invoked when loading a GTF. Use the --keep-unmerged flag along with loc-load to retain the original unmerged group (where each locus entry corresponds to one exon, not one transcript, and will have a -unmerged suffix on the group name).

Delete a locus group

To remove a group from a LOCDB, use the command:

pseq proj loc-delete --group groupname

Translate a locus to amino-acid sequence

The command

pseq proj loc-translate --group groupname

generates the following, with one locus per line (locus name, genomic position, number of exons and number of amino-acids, followed by the amino-acid sequence formatted into groups of 10)

 NM_001042478    chr1:4615346..4734416   5   411     MWIQQLLGLS SMSIRWPGRP 
         LGSHAWILIA MFQLAVDLPA CEALGPGPEF WLLPRSPPRP PRLWSFRSGQ PARVPAPVWS 
         PRPPRVERIH GQMQMPRARR AHRPRDQAAA LVPKAGLAKP PAAAKSSPSL ASSSSSSSSA 
         VAGGAPEQQA LLRRGKRHLQ GDGLSSFDSR ...

Adding the option

--verbose

adds the following lines of output for each gene.

 exon 1  1       4615346 4615374 0
 exon 2  1       4671820 4672619 28
 exon 3  1       4729773 4729860 827
 exon 4  1       4732200 4732445 914
 exon 5  1       4734347 4734416 1159
 chr1    1/5     fwd     4615347..4615348        1..3    1       ATG     M
 chr1    1/5     fwd     4615350..4615351        4..6    2       TGG     W
 chr1    1/5     fwd     4615353..4615354        7..9    3       ATT     I
 chr1    1/5     fwd     4615356..4615357        10..12  4       CAA     Q
 chr1    1/5     fwd     4615359..4615360        13..15  5       CAG     Q
 chr1    1/5     fwd     4615362..4615363        16..18  6       CTT     L
 chr1    1/5     fwd     4615365..4615366        19..21  7       TTA     L
 chr1    1/5     fwd     4615368..4615369        22..24  8       GGA     G
 chr1    1/5     fwd     4615371..4615372        25..27  9       CTC     L
 chr1    2&1/5   fwd     4615374..4671820        28..30  10      AGC     S
 chr1    2/5     fwd     4671822..4671823        31..33  11      TCC     S
 chr1    2/5     fwd     4671825..4671826        34..36  12      ATG     M
 chr1    2/5     fwd     4671828..4671829        37..39  13      TCC     S
 chr1    2/5     fwd     4671831..4671832        40..42  14      ATC     I
 chr1    2/5     fwd     4671834..4671835        43..45  15      CGC     R
 chr1    2/5     fwd     4671837..4671838        46..48  16      TGG     W
 chr1    2/5     fwd     4671840..4671841        49..51  17      CCG     P

Note: currently no option to filter / select genes -- all loci/genes in a group will be output.