Genebook

Purcell Lab | http://atgu.mgh.harvard.edu/genebook/ | version 0.01 (15-Aug-2018)

Overview

genebook is designed to create simple web-based databases that provide results from genetic studies. A genebook is not intended to be a comprehensive representation of all information in a given study. Rather, think of a genebook as an interactive version of a supplementary table in a typical journal article.

The primary focus is gene-centric and is most applicable to exome sequencing studies, although other study types could conceivably fit, e.g. CNV, GWAS and whole-genome sequencing studies.

A typical genebook contains precalculated summary statistics and/or results from analyses at the level of variants, genes or sets of genes. Other types of entity can also be represented (e.g. publications, gene networks, individual phenotypes, etc).

No one component is necessary: one genebook might contain only summary variant-level data (e.g. case/control counts and variant annotations) whereas another might contain only lists of individual genotypes, linked to phenotypic data on each individual. Some example genebooks are illustrated below.

Genebook runs in two modes: command line (currently primarily for data input) and as a web-client (primarily for viewing data).


Download

Obtain the latest C/C++ source code from this link:
 Version 0.01 (15-Aug-2018) : download/genebook-0.01.tar.gz

Extract and compile the contents of the archive:

tar -xzvf genebook-0.01.tar.gz
cd genebook-0.01
make

It is convenient to make a link genebook.cgi to the newly-created genebook binary.

ln -s genebook genebook.cgi

Note: on some systems (e.g. Mac) you may need to remove the -std=gnu++0x part of the CXXFLAGS line in the Makefile if compilation fails the first time.

Running locally

genebook is designed to run through a web-server as a CGI program. Alternatively, it can be run locally using Mongoose (a simple and useful http server) or similar. If using Mongoose:

  1. Download mongoose from https://code.google.com/p/mongoose/
  2. Compile and place mongoose in your system path
  3. 'cd' to the folder containing the genebook (i.e. the .genebook.db file; see below on how to create this) and run mongoose to host that folder as the HTML document root
  4. In the web-browser, go to http://localhost:8080/genebook.cgi?

Hosting a genebook on a webserver

Assume two machines: L, the local machine that contains the summary data to be compiled into a genebook, and S, a webserver (e.g. a public-facing machine running Apache or similar). Genebook is installed on both. An architecture-independent file (.genebook.db) is created on L using genebook as a command-line tool; it is then copied to S, where it can be served to others via genebook running in CGI mode.

(Of course, L and S might in practice be the same machine; as we assume that most data-analytic work is not carried out on a web-server, this distinction is made for clarity.)

  1. On local machine L, compile the appropriate text (*.txt) files (see below)
  2. Create the genebook DB (.genebook.db) with the add command (see below).
    genebook joe add < myfiles.txt
    genebook joe add-intro < myintro.txt
    
  3. On the webserver, make a folder to contain the genebook, for instance (depending on server configuration):
    • mkdir /var/www/gb1 to specify the URL http://myserver.org/gb1
  4. Copy the .genebook.db file to the gb1 folder on the server
  5. Copy the binary genebook.cgi and auxiliary folders (css, img and js, all bundled in genebook-*.tar.gz) to the gb1 folder
  6. Create a web-link to the genebook as follows:
    • Either href="gb1/genebook.cgi?user=guest&cmd=overview" to land on an overview page,
    • or href="gb1/genebook.cgi?" to point to a user login page
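The two link forms in step 6 can also be generated programmatically. The following Python sketch builds them with the standard library. The helper name genebook_link is hypothetical (it is not part of Genebook), and only the user and cmd parameters shown above are assumed to exist.

```python
from urllib.parse import urlencode


def genebook_link(folder, user=None, cmd=None):
    """Return a relative href to genebook.cgi inside `folder`.

    Hypothetical helper: with no user/cmd the link ends in a bare '?',
    which (per the steps above) points to the user login page.
    """
    params = {}
    if user is not None:
        params["user"] = user
    if cmd is not None:
        params["cmd"] = cmd
    return f"{folder}/genebook.cgi?{urlencode(params)}"


overview_link = genebook_link("gb1", user="guest", cmd="overview")
login_link = genebook_link("gb1")
print(overview_link)
print(login_link)
```

The returned strings can be placed directly in the href attribute of a link on the hosting page.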

Creating a Genebook

A genebook is created primarily by the add command, which takes from standard input a text-based description of the data to be represented. This creates and populates a single-file database (.genebook.db). Subsequently, if genebook is invoked as a CGI program, these data can be accessed via a web-browser.

Basics

The basic add command

genebook joe add < myfile.txt

will load the data in myfile.txt, creating a new genebook if one doesn't already exist, and add the user joe. If the genebook already exists, this information will be added (or will modify existing entries). Genebook expects the information in myfile.txt to comprise one or more blocks that define a certain class of information. The exact formats for these are described below and examples are given here.

To illustrate with a study type:

@study{study1} My first study
%pub{author2014}
%pheno{scz}

This specifies a study with the ID study1 and descriptive text My first study. In general, new entities are specified in this way:

@type{identifier} Single-line textual description
%linked-type{identity of linked entity}

Here, the @ operator specifies a new study called study1. The % operator links two entities together: in this case, the publication author2014 and phenotype scz are both linked to this study. Links can be generic between any types of entities. Links can also be added on-the-fly via the web-interface, as described below.

At this point, the publication and phenotype do not yet exist in the database; the IDs will be added to the database by this command (and linked to study1). More information on those entities can be attached at a later point, for example, by the following:

@pheno{scz} Schizophrenia case/control disease status (coded 2/1)
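When input files are generated by a script rather than written by hand, the @type{identifier} and %linked-type{identifier} lines can be produced with a small helper. This is a minimal Python sketch of the syntax described above; the function name entity_block is an assumption made for illustration and is not part of Genebook.

```python
def entity_block(etype, ident, description="", links=()):
    """Format one '@type{id} description' line plus '%type{id}' link lines.

    Hypothetical helper that only produces text; it does not talk to Genebook.
    """
    header = f"@{etype}{{{ident}}} {description}".rstrip()
    lines = [header]
    for link_type, link_id in links:
        lines.append(f"%{link_type}{{{link_id}}}")
    return "\n".join(lines) + "\n"


study_text = entity_block(
    "study", "study1", "My first study",
    links=[("pub", "author2014"), ("pheno", "scz")],
)
print(study_text, end="")
```

The printed text can be saved to a file (e.g. myfile.txt) and loaded with the add command shown earlier.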

Attributes

Most entities can have arbitrary attributes (key/value pairs) attached, using the following %{key} value syntax:

@study{study1}
%{Sample size} 2500
%{Design} Exome sequencing study
%{Contact} person@institution.edu
%{If no explicit 'value', then this is free text}
%{that will be attached to the study and displayed}
%{in the order in which it is given here}

Tabular data

Simple entities such as a study or phenotype definition will be specified as a single item and typically have only a line of description and some links or attributes. For more complex types, such as genes, gene sets or generic analysis results, multiple rows of tabular data will usually follow the initial type declaration. For example:

@gene{refseq} RefSeq hg19 genes
#GENE   CHR   BP1   BP2   CONSERVED  ALTNAME
ABC1    1     1000  2000  Y          G:0001
DEF2    2     2000  3000  N          G:0011
XYZ1    3     3000  4000  Y          G:0112

Here, the second row is a header row; this always starts with the # character and the columns should be tab-delimited. Each row will be entered as a distinct gene entity. The exact format expected will vary for the different types, as described in the types section below.
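Because the header and data rows must be tab-delimited, a quick check before running the add command can catch files in which tabs were replaced by spaces. The Python sketch below covers only the simple layout shown above (declaration line, then the # header, then data rows). The function check_table is hypothetical, and the checks reflect this page's description rather than Genebook's own parser.

```python
def check_table(lines):
    """Return a list of problems found in a simple tabular block.

    Assumes the layout shown above: line 1 is the @type{id} declaration,
    line 2 is the '#' header, and all remaining lines are data rows.
    """
    if len(lines) < 2 or not lines[0].startswith("@"):
        return ["first line must be an @type{id} declaration"]
    if not lines[1].startswith("#"):
        return ["second line must be a header row starting with '#'"]
    n_cols = len(lines[1].split("\t"))
    problems = []
    for number, line in enumerate(lines[2:], start=3):
        found = len(line.split("\t"))
        if found != n_cols:
            problems.append(
                f"line {number}: expected {n_cols} columns, found {found}"
            )
    return problems


good_block = (
    "@gene{refseq} RefSeq hg19 genes\n"
    "#GENE\tCHR\tBP1\tBP2\n"
    "ABC1\t1\t1000\t2000\n"
).splitlines()
bad_block = good_block[:2] + ["DEF2 2 2000 3000"]  # spaces instead of tabs

print(check_table(good_block))
print(check_table(bad_block))
```

Blocks that place % link or attribute lines between the declaration and the header would need a more permissive check than this one.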

Users and anonymous/read-only/read-write genebooks

One type of entity is that of a user. Each genebook has a list of known users. When running on the command-line, the second argument is always a user-name (this user will be created if it does not exist).

genebook joe add < mydata.txt

When accessing a genebook via the web-interface, a user-name is always specified. This can be the reserved word guest, which means that the genebook is not being accessed by somebody with an explicit user-name. Guests cannot write to a genebook (described below).

Genebooks can be denoted as read-only (with respect to the web-interface) with the following command:

echo "true" | ./genebook joe set-locked 

or as a read/write database (the default):

echo "false" | ./genebook joe set-locked 

If a database is read-only (or if the user is guest) this means that notes, links and themes cannot be added via the web-interface.

Security

Although genebooks can be used with named user accounts, this is designed solely to track the provenance of notes (i.e. links and comments) that are added to a database. That is, this is in no way intended to provide secure access to a genebook: there is no password system. User names are transmitted as plain-text (as part of the URL).

If you wish to have a private genebook, either do not place it in a web-accessible location (e.g. access it locally via Mongoose, as above), or use the standard folder-level authentication that your web-server supports (i.e. so that all users must enter a user/password combination to reach the genebook in the first place).

We imagine that read/write databases will work best within relatively small groups of collaborators working on a common project. Completely open genebooks are best set to be read-only. (Note that there are currently no limits on the size or number of notes that a user could in theory attach, meaning that malicious users could cause problems if given write-access. In future versions we will consider imposing constraints on what can be added to a database via the web interface.)

All CGI-mediated access to a genebook is written to a log file (.genebook.log). Each access adds a line similar to the following, which records the time, the client's IP address, the user and the specific request, so that usage can be tracked. In this case, the request is a search on the gene CDH13:

Wed Feb 26 16:38:05 2014   IP=76.234.120.51   user=guest   cmd=verb-gene   arg1=CDH13
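Log lines of this shape are easy to summarize with a short script. The Python sketch below parses one line into a dictionary. It assumes that every field after the timestamp has the key=value form seen in the example and that the timestamp uses English day and month names; parse_log_line is a hypothetical helper, not something shipped with Genebook.

```python
import re
from datetime import datetime

# One key=value field, e.g. "IP=76.234.120.51" or "cmd=verb-gene".
LOG_FIELD = re.compile(r"(\w+)=(\S+)")


def parse_log_line(line):
    """Split a .genebook.log line into its key=value fields plus a 'time' entry."""
    fields = dict(LOG_FIELD.findall(line))
    first = LOG_FIELD.search(line)
    # Everything before the first key=value pair is taken to be the timestamp.
    stamp = line[: first.start()].strip() if first else line.strip()
    fields["time"] = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return fields


example_line = (
    "Wed Feb 26 16:38:05 2014   IP=76.234.120.51   user=guest   "
    "cmd=verb-gene   arg1=CDH13"
)
record = parse_log_line(example_line)
print(record["user"], record["cmd"], record["arg1"], record["time"].isoformat())
```

Applying this to every line of the log and counting by cmd or arg1 gives a simple usage summary.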

Types

Genebook currently recognizes the following types of entities:

  • Core components
  • More detailed phenotype/genotype information
  • Collections of genes
  • Genomic mappings
  • Higher-level organization and annotation
  • Display options

All are optional in a given genebook. The format for each type is given below. Some general points to note:


Genes

@gene{refseq} RefSeq hg19 genes
#GENE CHR    BP1      BP2       NAME                                        LOCUS     ALTNAME
SCO2  chr22  50961996 50964034  SCO2 cytochrome c oxidase assembly protein  22q13.33  9997
TYMP  chr22  50964181 50968514  thymidine phosphorylase                     22q13     1890

The header fields GENE, CHR, BP1, BP2, NAME and ALTNAME are reserved keywords with special meanings. Other fields, e.g. LOCUS here, will be added as a generic attribute of that gene (i.e. displayed only on the page for that gene).

Genes can be indirectly referenced (i.e. as part of a geneset). However, only genes listed as part of a @gene group will be displayed in the list of known genes.


Studies

@study{study1} My First Study
%pheno{dis1} 
%pheno{qt1}
%pheno{qt2}
%study{other_study}
%pub{author2014}

The primary @study tag takes only an ID and a descriptive field. In the example above, there are also links to other entities -- including in this example a second study, i.e. using the %study{} linking tag.


Analyses

@analysis{study1}{main-dis1} Primary analysis of the <b><em>dis1</em></b> phenotype
%col{blue}
#GENE           STAT    P       A/U
%gene{SCO2}     12      0.04    2/7
%gene{TYMP}     10      0.76    6/5
%gene{KLHDC7B}  5       0.43    3/6
%gene{CHKB}     31      N/A     0/0
...

Analysis tables are perhaps the core component of a Genebook. These are simple tables, based on input from tab-delimited text files. Every analysis must belong to precisely one study, as specified by the double-valued @analysis{}{} tag above -- in this case, the analysis main-dis1 belongs to study study1. An analysis is uniquely referenced by the combination of study and analysis IDs.

The optional %col{} directive paints a stripe of the given color (blue here) along the edge of the table. The stripe appears for that analysis in the summary page of all analyses, and wherever rows from the analysis table are shown on a gene or geneset page. This is a simple visual device that can be used to group results by, for example, the type of mutation or study design. (See the schizophrenia genebook for an application.)

In an analysis table, none of the header fields have special meanings. However, one can indicate specific entities in the data rows of an analysis table, using the % linking directives. For example, in the above example, the %gene{} tags indicate that these entries are for genes. This means that when the table is displayed, only the gene-name is shown, hyper-linked to the summary page for that gene.

Furthermore, when the summary page for a given gene is displayed, rows that mention that gene will be displayed. In this way, all relevant information across multiple studies/analyses can be shown on a single page for a gene. A similar behavior holds for genesets.
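Analysis tables are usually exported from a statistics package with plain gene names in the first column, so the %gene{} wrapping has to be added before loading. The following Python sketch writes an analysis block in the layout shown above. The function analysis_block and its argument names are illustrative assumptions, and the values are the ones from the example table.

```python
def analysis_block(study, analysis, description, header, rows, colour=None):
    """Format an @analysis{study}{analysis} block with %gene{} links.

    Hypothetical helper: the first element of each row is taken to be a
    gene name and is wrapped in %gene{...}; all cells are joined by tabs.
    """
    lines = [f"@analysis{{{study}}}{{{analysis}}} {description}"]
    if colour:
        lines.append(f"%col{{{colour}}}")
    lines.append("#" + "\t".join(header))
    for gene, *values in rows:
        cells = [f"%gene{{{gene}}}"] + [str(value) for value in values]
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


analysis_text = analysis_block(
    "study1", "main-dis1", "Primary analysis of the dis1 phenotype",
    header=["GENE", "STAT", "P", "A/U"],
    rows=[("SCO2", 12, 0.04, "2/7"), ("TYMP", 10, 0.76, "6/5")],
    colour="blue",
)
print(analysis_text, end="")
```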

For large tables (e.g. if giving statistics for all genes or variants), one might want to store the information such that it can be cross-referenced and the appropriate rows shown on a gene summary page, but not allow the user to view the entire table in one page (viewing a very large table will cause the page to load very slowly). In this case, use the alternate form of the @analysis tag:

@analysis-only-rows{study}{analysis1} Results that can only be viewed one gene/geneset at a time

There is a second, complementary form that has the opposite effect: data from this table will not be shown in a gene/set summary page.

@analysis-only-full{study}{analysis2} Results that do not appear in gene/geneset summaries

This latter form can be useful if there are multiple, partially redundant tables (e.g. results for both disruptive and all nonsynonymous variants) and you do not want the same variant to appear multiple times in the gene summary. In this instance one might mark the 'disruptive mutation list' table as only-full, because these mutations already appear in the superset table of nonsynonymous mutations.


Individuals

@indiv{study1}
#ID	PHE	qt1	qt2	Sex
ID0001	CASE	1.22	22.3	Male
ID0002	CONTROL	0.892	18.9	Female

An @indiv{} tag specifies a group of individuals, listed after the tag, one individual per line. A single study must always be referenced, and all individuals will primarily belong to that study.

The second row must be a header row, always starting with # and tab-delimited. The tags ID and PHE are reserved keywords: the latter is the "primary" phenotype for that individual, as listed on the summary page of all individuals.

An individual is linked to by the double-valued tag (study/ID):

%indiv{study1}{ID0001}

Phenotypes

@phenotype{dis1} Disease phenotype
%study{study1}

A phenotype is only an identifier and a textual description (with optional links). By linking different studies and analyses to a given phenotype as appropriate, this provides a convenient way to select out all relevant analyses (i.e. as these will all be linked together from the phenotype page). See the schizophrenia genebook for an application.


Variants

@variants{study1}
#VAR            Counts  GENE            FUNC                  AA change   Annotation       REF ALT  RefSeq transcript
chr22:51021197  1/1/2   CHKB,CHKB-CPT1B missense,npcRNA (A,A) p.5A>V      missense,npcRNA  G   A,A  NM_005198,NR_027928 
chr22:51133382  1/0/0   SHANK3          missense (T)          p.404R>W    missense         C   T    NM_033517           
...

Here VAR, GENE, FUNC, REF and ALT are reserved key words, used to specify the basic properties of each variant as above. All other fields are stored as generic attributes of that variant (i.e. can be viewed on the summary page for that individual variant).

Although this example breaks the convention, typically one would want to store only generic (not study-specific) information about a particular variant here. Things such as case/control counts, which are specific to a given study/analysis, are probably better represented as an analysis table (that can be linked back to variants listed here, as in the example file anal.secondary-dis1.txt).


Genotypes

@genotypes{study1}
#VAR             INDIV  GENO META                                ANNOT     GENE
chr22:50962329  ID0001  C/T  [AD=10,8;DP=18;GQ=99;PL=216,0,283]  missense  SCO2
chr22:50962329  ID0002  T/T  [AD=1,22;DP=23;GQ=99;PL=234,112,0]  missense  SCO2
chr22:51183474  ID0001  C/A  [AD=1,4;DP=5;GQ=19;PL=116,0,19]     missense  ACR

The header names VAR, INDIV, GENO, META and GENE are all reserved words, used to define the genotype. The META field can contain any type of meta-information about the call. The additional field ANNOT is not a reserved keyword: this field is appended to the meta-information value in the output, i.e. as ANNOT = missense for example.

Genotypes are listed on individual, variant and gene summary pages.

It is not recommended to dump all genotypes from a large study into a genebook, as a very large amount of information is likely to adversely impact the performance of the database. In addition, that extent of information would likely not be very useful in any case, and tools other than genebook may be better suited. The expected use case here is for a small number of "interesting" or flagged mutations -- e.g. rare gene-disruptive mutations or de novo mutations. (Note: this has not been extensively tested yet, but we've found that genebooks with around 100,000 rare-variant genotypes work fine. A database could likely handle many more, although representing common variants in large samples in this manner will be awkward, as a very large number of genotypes will be printed for every gene.)
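One way to keep a genotype file small is to filter it before loading. The Python sketch below reads a tab-delimited genotype table and keeps only rows whose ANNOT value is in a chosen set. The function names, the annotation labels used for filtering and the second sample row are assumptions made up for illustration; they are not Genebook conventions.

```python
import csv
import io


def read_genotype_table(handle):
    """Read a tab-delimited genotype table whose header row starts with '#'."""
    header = handle.readline().rstrip("\n").lstrip("#").split("\t")
    return list(csv.DictReader(handle, fieldnames=header, delimiter="\t"))


def flagged_only(rows, keep_annotations):
    """Keep only rows whose ANNOT value is in keep_annotations."""
    return [row for row in rows if row["ANNOT"] in keep_annotations]


# Illustrative data only: the 'nonsense' row is invented for this sketch.
sample = io.StringIO(
    "#VAR\tINDIV\tGENO\tMETA\tANNOT\tGENE\n"
    "chr22:50962329\tID0001\tC/T\t[DP=18;GQ=99]\tmissense\tSCO2\n"
    "chr22:51183474\tID0001\tC/A\t[DP=5;GQ=19]\tnonsense\tACR\n"
)
genotype_rows = read_genotype_table(sample)
kept_rows = flagged_only(genotype_rows, {"nonsense", "frameshift"})
print(len(genotype_rows), "rows read;", len(kept_rows), "kept")
```

The kept rows can then be written back out under a @genotypes{} declaration with the original header.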


Publications

@pub{author2014} Author A.N. et al. (2014) Manuscript describing the findings of an analysis of a study. Journal of Study Findings.
%study{study1}
%pmid{123456789}
%gene{AAA1}

Publications are simply entities with a name, descriptive text and links. The special %pmid{} directive can be used to specify a PubMed ID. In the publication summary table, this will automatically create a link to the publication's abstract at http://www.ncbi.nlm.nih.gov/pubmed.

Note that the reference (which can be in any format) must all be on the same line as the @pub{} tag.


Genesets

@geneset{group1} Generic genesets 
%subset{group1}{S::001} First generic geneset
%subset{group1}{S::002} Second generic geneset
AAA1  S::001
SCO2  S::001
TYMP  S::002
ACR   S::002
XYZ   S::002

A geneset is a higher-level grouping that contains one or more subsets; each subset contains one or more gene entries. For example, a geneset might be GO (Gene Ontology), with each subset containing the genes mapped to one GO term. In the above example we define two subsets (S::001 and S::002) with two and three genes respectively.

As mentioned above, genes can be listed when defining a geneset even if they do not appear in a @gene block. Such genes can be searched, and linked to by other entities, but they will not feature in the primary summary page that lists all "known" genes. They will also lack any additional information (i.e. genomic location or full gene name).

A geneset is then linked to from other entities by the double-valued (group/subgroup) tag, e.g.:

%geneset{group1}{S::002}
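Gene-set collections are often held as a mapping from subset ID to gene list, which can be converted to the @geneset layout shown above. This Python sketch does that conversion. The function geneset_block is hypothetical, and it assumes that a tab separates the gene and subset columns (the example above shows only whitespace).

```python
def geneset_block(group, description, subsets):
    """Format a @geneset block.

    `subsets` maps subset ID -> (description, list of gene names).
    Assumption: gene and subset ID are separated by a tab.
    """
    lines = [f"@geneset{{{group}}} {description}"]
    for subset_id, (subset_desc, _genes) in subsets.items():
        lines.append(f"%subset{{{group}}}{{{subset_id}}} {subset_desc}")
    for subset_id, (_subset_desc, genes) in subsets.items():
        for gene in genes:
            lines.append(f"{gene}\t{subset_id}")
    return "\n".join(lines) + "\n"


geneset_text = geneset_block(
    "group1", "Generic genesets",
    {
        "S::001": ("First generic geneset", ["AAA1", "SCO2"]),
        "S::002": ("Second generic geneset", ["TYMP", "ACR", "XYZ"]),
    },
)
print(geneset_text, end="")
```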

Networks

@net{InWeb} Subset of InWeb (v3, thresholded at > 0.154 quality score)
PLD1     CPT1B   1
NCK1     SHANK3  1
C3orf34  RABL2B  1
HNRNPR   ARSA    1
...

A network is a collection of gene-gene pairs (tab-delimited). A third column contains an arbitrary weight (1 in the above example), but it is currently not used. When visiting a gene summary page, all "interactors" (genes connected in the network) will be listed for that gene.
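To preview which interactors a gene would have, the pair list can be turned into a lookup table. The Python sketch below is not Genebook's implementation; it assumes that pairs are undirected (each gene in a pair is an interactor of the other) and ignores the weight column, in line with the description above.

```python
from collections import defaultdict


def interactors(pairs):
    """Map each gene to the set of genes it is paired with.

    Assumption: pairs are undirected; any trailing weight is ignored.
    """
    neighbours = defaultdict(set)
    for gene_a, gene_b, *_weight in pairs:
        neighbours[gene_a].add(gene_b)
        neighbours[gene_b].add(gene_a)
    return neighbours


network = interactors([
    ("PLD1", "CPT1B", 1),
    ("NCK1", "SHANK3", 1),
    ("C3orf34", "RABL2B", 1),
    ("HNRNPR", "ARSA", 1),
])
print(sorted(network["SHANK3"]))
```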


bigBed

When viewing a locus, it is possible to display an embedded UCSC browser window containing a user-defined bigBed track (e.g. containing positions of variants detected in a given study, etc).

A particular locus can be viewed by entering a value in the left search box (e.g. chr1:12000000..13000000) or by clicking on the locus information for a particular gene from that gene's summary page.

All genes and variants within that locus will be listed on the locus page.

Making and hosting a bigBed file

The following steps are a brief guide to making a bigBed file:

First, download UCSC utilities, either as source or compiled binaries, for bedToBigBed and fetchChromSizes.

To make a bigBed file from a BED file (see here for definitions of BED files):

  1. Remove all headers (in this case, the first line) and ensure the file is sorted by chromosome and start position:
    awk ' NR>1 ' unsorted.bed | sort -k1,1 -k2,2n > sorted.bed
    
  2. Obtain the chromosome sizes from UCSC
    fetchChromSizes hg19 > chrom.sizes
    
  3. Use the bedToBigBed utility to make a bigBed file.

    bedToBigBed sorted.bed chrom.sizes variants.bb
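The header removal and sorting in step 1 can also be done in Python where awk and sort are not available. This sketch mirrors that command: it drops the first line, then sorts by chromosome name (as text) and start position (as a number). The function name sort_bed_lines and the sample records are illustrative assumptions.

```python
def sort_bed_lines(lines, skip_header=1):
    """Drop `skip_header` leading lines and sort the rest by (chrom, start).

    Chromosome names are compared as plain strings and start positions
    as integers, analogous to `sort -k1,1 -k2,2n`.
    """
    records = [
        line.rstrip("\n").split("\t")
        for line in lines[skip_header:]
        if line.strip()
    ]
    records.sort(key=lambda fields: (fields[0], int(fields[1])))
    return ["\t".join(fields) for fields in records]


# Made-up BED content, used only to exercise the function.
unsorted_bed = [
    "track name=example\n",
    "chr2\t100\t200\n",
    "chr1\t500\t600\n",
    "chr1\t50\t60\n",
]
sorted_bed = sort_bed_lines(unsorted_bed)
print("\n".join(sorted_bed))
```

The resulting lines can be written to sorted.bed and passed to bedToBigBed as in step 3.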
    

Then copy the bigBed file to a web-accessible location (in this example, http://atgu.mgh.harvard.edu/genebook/example/variants.bb)

You can check the contents of the bigBed file directly via the UCSC browser: e.g. enter the following text in the Add Custom Track component:

track type=bigBed name="My Big Bed1" description="my data" bigDataUrl=http://atgu.mgh.harvard.edu/genebook/example/variants.bb

To attach the bigBed file to a Genebook, create a text file (e.g. bigbed.txt) as follows:

@bigbed{study1_variants} http://atgu.mgh.harvard.edu/genebook/example/variants.bb
My Study's Variants

and insert into the Genebook using the standard add command:

genebook guest add < bigbed.txt

Currently, you can only have a single bigBed attached to a genebook.


Analysis groups

@group{dis} Disease-based analyses
%analysis{study1}{main-dis1}
%analysis{study1}{secondary-dis1}

@group{primary} Primary analyses
%analysis{study1}{main-dis1}
%analysis{study1}{main-qt1}
%pub{author2014}

An analysis group is simply a set of related analyses. If there are a large number of analyses in a genebook, it is useful to group them into logical (and possibly overlapping) sets. This is done by specifying a group name and then linking the analyses with the %analysis{} linking directive. Other links can also be added, as above.

As illustrated in the example groups.txt file, a group can have one level of sub-groups also, for a hierarchical group structure.


Themes

@theme{theme1} My first theme
%gene{ACR}
%geneset{group1}{S::002}
%indiv{study1}{ID0001}

A theme is a little like an analysis group except that it is more generic: any collection of entities, of any type, can be mapped to a theme.

Themes can also be added via the web-interface on the fly, by writing a note containing a #mytheme phrase, to define a theme called mytheme. Subsequently, other entities can be linked to that theme, as described below.


Notes

Notes can only be added via the online interface when viewing an existing Genebook (see below).


Users

For a writable genebook, users can be added and defined via the default log-in page, i.e. the page linked to by:

 <a href="genebook.cgi?">

Otherwise, users can be added with the following syntax; note the two reserved tags used to specify full name and e-mail address:

@user{joe}
%fullname{Joe Bloggs}
%email{jb@email.address.com}

Front page (introductory text)

My First Genebook

<p>This example Genebook shows some of the main features of a <em>Genebook</em>.</p>

This is the text that is displayed on the opening front-page of a genebook. Ideally it will describe the contents of the genebook. The first line is the title of the genebook. All subsequent lines are displayed on the front page, rendered as HTML (and so can contain links, etc).

Rather than the generic add command, this is added with the special command:

genebook joe add-intro < intro.txt

If the above command is run again with a different file, the new text overwrites the old information.


Navigation bar menu

Users can add their own links on the top of any genebook, using the add-navbar command. This takes a file with two tab-delimited columns per row, indicating the title/name to be displayed for the link, followed by the URL. For example, if the file navbar.txt contains the following:

Purcell Lab	http://research.mssm.edu/statgen/
UCSC	http://genome.ucsc.edu/
PubMed	http://www.ncbi.nlm.nih.gov/pubmed

These are added with the command:

genebook joe add-navbar < navbar.txt

The first Genebook item is fixed and always links back to this page.


Viewing a Genebook

(to be added...)


Example Genebooks

I. Schizophrenia Exome Sequencing Study examples

This genebook contains real data from various neuropsychiatric disease exome sequencing studies.


II. A toy example

The files and commands listed here can be used to make the following toy genebook (containing only a handful of genes/variants/individuals) hosted here.