INRICH: interval-based enrichment analysis

Tutorial

Getting Started

INRICH takes a set of independent genomic intervals (typically, but not necessarily, representing highly associated regions from a GWAS) and asks whether there is enrichment in these intervals of particular sets of target features (typically, but not necessarily, representing genes grouped by some functional annotation).

It works by randomly reshuffling the intervals to obtain the null distribution of chance overlap between the test intervals and targets. To do this, INRICH usually requires four plain-text, tab-delimited input files.
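The reshuffling idea can be illustrated with a deliberately simplified sketch. It is not INRICH's actual procedure: it assumes a single linear "genome" of hypothetical length `genome_length`, places each interval uniformly at random while keeping its size, and ignores the matching on SNP density and gene count described later. All function names are illustrative.

```python
import random

def overlaps(interval, feature):
    """True if two (start, stop) ranges share at least one position."""
    return interval[0] <= feature[1] and feature[0] <= interval[1]

def count_hits(intervals, targets):
    """Number of intervals that overlap at least one target feature."""
    return sum(1 for iv in intervals if any(overlaps(iv, t) for t in targets))

def empirical_p(intervals, targets, genome_length, n_perm=1000, seed=0):
    """Toy empirical p-value: how often random placement does as well as observed."""
    rng = random.Random(seed)
    observed = count_hits(intervals, targets)
    as_extreme = 0
    for _ in range(n_perm):
        shuffled = []
        for start, stop in intervals:
            size = stop - start
            # Assumption: uniform placement, interval size preserved.
            new_start = rng.randint(1, genome_length - size)
            shuffled.append((new_start, new_start + size))
        if count_hits(shuffled, targets) >= observed:
            as_extreme += 1
    # Assumption: the common "+1" convention, so the p-value is never zero.
    return (as_extreme + 1) / (n_perm + 1)
```

The fraction of reshuffled replicates that reach the observed overlap is what an empirical significance value of this kind estimates.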

Requisite Input Files
1. Test interval file : chromosome, start, stop : -a
2. SNP map file, corresponding to data used to generate (1) : chromosome, position : -m
3. Target set file : gene ID, gene-set ID, gene-set name : -t
4. Reference gene file, giving the genomic co-ordinates of the gene features: chromosome, start, stop, gene ID, free-text description : -g
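As a minimal sketch of how the four flags listed above fit together, the following helper assembles an argument list. The executable name `inrich` and all file names are assumptions made for illustration; only the flags `-a`, `-m`, `-t` and `-g` come from the list above.

```python
def build_inrich_command(interval_file, map_file, target_file, gene_file,
                         executable="inrich"):
    """Return an argument list pairing each input file with its flag."""
    return [
        executable,          # assumed name of the program on your PATH
        "-a", interval_file,  # (1) test interval file
        "-m", map_file,       # (2) SNP map file
        "-t", target_file,    # (3) target set file
        "-g", gene_file,      # (4) reference gene file
    ]

# Hypothetical file names, for illustration only.
cmd = build_inrich_command("test.intervals", "snps.map", "sets.txt", "genes.map")
print(" ".join(cmd))
```

A list built this way could be passed to `subprocess.run(cmd, check=True)` once the executable and the files actually exist.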

Typically, input files (1) and (2) will represent your own data. In the context of GWAS, these will likely represent independent regions of association in the first file, where each interval represents the spanning region of associated SNPs in linkage disequilibrium with each other. PLINK's clump or show-tags commands can be used to create such LD-based intervals from a list of SNPs and a dataset (either HapMap, or the dataset used to generate the associations), as described here.

So that the reshuffling procedure can generate reasonably realistic null interval lists, it is preferable to give INRICH the list of SNP positions where associations actually could, in principle, have been observed. In this manner, INRICH is able to control approximately the SNP density within intervals, as well as their size and total number of genes hit. This list will typically correspond to all SNPs that were retained in the final analysis, or perhaps the set of HapMap SNPs, if dealing with summary statistics from imputed data (in which the HapMap was used to generate the LD-intervals mentioned above).

If a marker map file is specified, all start and stop positions in the interval file (1) must correspond exactly to a listed map position in this file. In some cases, it might not be possible to generate an appropriate map file -- for example, if the main interval file represents large structural variants that have breakpoints estimated statistically, and so might not correspond to a precise marker position. In this case, you can either (a) adjust the interval file so the breakpoints are mapped to the nearest marker, or (b) drop the marker map file altogether, instead specifying a "genome range" file with the "-x" option, as described here. By default, INRICH will still match the total number of targets hit under each random reshuffling of the intervals.
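Before running INRICH, it can be useful to check the exact-match requirement yourself. The sketch below is not part of INRICH: it assumes intervals and map entries have already been parsed into tuples, and both function names are hypothetical. The second helper illustrates the option of mapping a breakpoint to the nearest marker.

```python
import bisect

def unmatched_endpoints(intervals, map_positions):
    """Return (chrom, pos) interval endpoints that are absent from the map.

    intervals: iterable of (chrom, start, stop)
    map_positions: iterable of (chrom, pos)
    """
    known = set(map_positions)
    missing = []
    for chrom, start, stop in intervals:
        for pos in (start, stop):
            if (chrom, pos) not in known:
                missing.append((chrom, pos))
    return missing

def snap_to_nearest(pos, sorted_positions):
    """Return the marker position closest to pos.

    sorted_positions must be a non-empty, ascending list of positions on
    one chromosome. On a tie, the lower position is returned.
    """
    i = bisect.bisect_left(sorted_positions, pos)
    candidates = sorted_positions[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda p: abs(p - pos))
```

An empty list from `unmatched_endpoints` means every interval boundary sits on a listed marker.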

You can create the target set definition and map files yourself, or you can use the files available here. (Note: when creating your own target set and map files, you can use any gene ID naming scheme you wish (RefSeq, Entrez, gene symbols, etc.) as long as the same ID scheme is used in both the target set file (3) and the reference gene file (4). In fact, the target features may not be genes at all (e.g. they could represent prior linkage regions), in which case you can use any arbitrary ID naming scheme.) By default, all sets will be tested, although it is often desirable to exclude sets that are either too small or too large, to increase computational speed and reduce the multiple testing burden. For example, if looking at genes in GO categories, a test of only sets with between 5 and 200 genes is achieved with the flags "-i 5 -j 200".
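The two points above (consistent IDs and set-size limits) can be checked on your own files before a run. This is a minimal sketch assuming the target set file has been parsed into (gene ID, set ID, set name) rows; the function names are hypothetical, and treating both size bounds as inclusive is an assumption about how "-i" and "-j" behave.

```python
from collections import defaultdict

def group_sets(rows):
    """Group (gene_id, set_id, set_name) rows into {set_id: set of gene IDs}."""
    sets = defaultdict(set)
    for gene_id, set_id, _name in rows:
        sets[set_id].add(gene_id)
    return dict(sets)

def filter_by_size(sets, min_size=5, max_size=200):
    """Keep sets whose size lies in [min_size, max_size] (assumed inclusive)."""
    return {sid: genes for sid, genes in sets.items()
            if min_size <= len(genes) <= max_size}

def ids_missing_from_reference(sets, reference_ids):
    """Gene IDs used in the target sets but absent from the reference gene file."""
    used = set().union(*sets.values()) if sets else set()
    return used - set(reference_ids)
```

A non-empty result from `ids_missing_from_reference` usually points to mixed naming schemes between files (3) and (4).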

By default, INRICH evaluates each target set by counting the number of intervals that contain at least one target feature. To give more weight to intervals that contain multiple targets, one can count targets (instead of intervals) by adding the flag "-2".

INRICH will generate nominal empirical significance values and also significance values that correct for all sets tested. The corrected significance values are based on the distribution of the minimum nominal significance value over all sets that we'd expect to obtain by chance -- hence the two rounds of permutation that you'll see INRICH uses. This also means that if not enough permutations are used in the first round (controlled by the "-r" flag), the corrected significance values may be conservative. In this case, they will be flagged with an asterisk (*), and re-running with more permutations for the first round is necessary to obtain a precise (and usually more significant) empirical p-value that is corrected for multiple testing.

Finally, INRICH will also report three global significance values that indicate whether more targets than expected by chance are a) within one of the user's intervals and b) within at least one set that is nominally significant at p < 0.05, 0.01 or 0.001, respectively. These metrics can act as a rough guide to whether or not there is some degree of correspondence between the test intervals and the target sets. It would not necessarily be a cause for concern, however, if, for example, a single target set was highly significant after correction, but the global tests were not significant. The opposite scenario (significant global tests, but no individually significant single set after correction) can also occur. This could indicate a highly complex genetic architecture, in which a large number of sets are only weakly enriched.
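The difference between the two counting modes, and the idea behind the minimum-p correction, can be sketched as follows. This is an illustration under stated assumptions, not INRICH's implementation: each interval is represented by the set of target IDs (from one target set) that it contains, targets are counted as distinct IDs, and the "+1" in the corrected p-value is a common convention assumed here. All names are hypothetical.

```python
def interval_count(hits_per_interval):
    """Default-style statistic: intervals containing at least one target."""
    return sum(1 for hits in hits_per_interval if hits)

def target_count(hits_per_interval):
    """Target-based statistic (in the spirit of "-2"): distinct targets hit."""
    return len(set().union(*hits_per_interval))

def corrected_p(nominal_p, min_p_null):
    """Multiple-testing-corrected p-value from a null of minimum p-values.

    min_p_null holds, for each null replicate, the smallest nominal
    p-value seen across all tested sets in that replicate.
    """
    as_small = sum(1 for m in min_p_null if m <= nominal_p)
    return (as_small + 1) / (len(min_p_null) + 1)
```

Because each null replicate contributes only its single best p-value, the corrected value accounts for having tested every set at once.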

Overall, INRICH will work best when there is a small to moderate number of test intervals (i.e. not thousands of intervals that might cover a very large proportion of the entire genome). Even though overlapping intervals (and intervals that overlap the same gene) will be merged automatically by INRICH, the critical assumption is that all intervals are statistically independent of each other (particularly for the positional clustering test). Therefore, simply supplying a list of all associated SNPs as 1-base "intervals" in the primary input file is *not a good idea*.
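To see what merging overlapping intervals amounts to, here is a minimal sketch. It is an assumption about the general technique, not INRICH's own code, and it handles only positional overlap on the same chromosome (not the merging of intervals that overlap the same gene). The function name is hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (chrom, start, stop) intervals on each chromosome."""
    merged = []
    for chrom, start, stop in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval: extend it if needed.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], stop))
        else:
            merged.append((chrom, start, stop))
    return merged
```

Merging of this kind removes positional redundancy only; it does not make LD-correlated signals independent, which is why a list of single-SNP "intervals" remains unsuitable as input.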