Enrichment analysis

ncPRO-seq includes a dedicated engine, called enrichment analysis, for analysing reads that cannot be annotated as known genomic features. Users choose which subset of reads to analyse by setting SIG_READ_OPTIONS (section 2). Reads that align to annotation regions listed in EXCLUDE_ANN_GFF are excluded from enrichment analysis. The following steps then identify regions significantly enriched in the reads that remain after filtering.

  1. Slide a window of fixed size (SIG_WIN_SIZE, e.g. 10,000) along the whole genome at a fixed step (SIG_STEP_SIZE, e.g. 5,000).
  2. For each window, summarize the read mapping information (number of mapped reads...).
  3. Fit the number of mapped reads in all windows to the selected model to estimate the expected distribution of read numbers.
  4. Compare the observed and expected read number distributions to determine a P value for each window, and thereby identify regions with a significant number of mapped reads (PVAL_CUTOFF, e.g. 0.001).
  5. Generate tables containing read information and additional gene annotation for these regions. Track files in BED format (http://genome.ucsc.edu/FAQ/FAQformat.html#format1) are also created.
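
Steps 1 and 2 can be illustrated with a minimal Python sketch. This is not the ncPRO-seq implementation: the function name count_reads_in_windows, the 0-based half-open coordinates, and the rule of assigning a read to a window by its start position are all assumptions made for illustration. The defaults mirror the example values of SIG_WIN_SIZE and SIG_STEP_SIZE given above.

```python
from bisect import bisect_left


def count_reads_in_windows(read_starts, chrom_length, win_size=10_000, step_size=5_000):
    """Return (window_start, window_end, read_count) for each sliding window.

    Hypothetical helper: coordinates are 0-based and half-open, and a read is
    counted in every window that contains its start position.
    """
    positions = sorted(read_starts)
    windows = []
    for start in range(0, chrom_length, step_size):
        end = min(start + win_size, chrom_length)  # last windows are truncated
        # Number of sorted positions p with start <= p < end.
        count = bisect_left(positions, end) - bisect_left(positions, start)
        windows.append((start, end, count))
    return windows


# Toy example: five read start positions on a 20,000 nt sequence.
windows = count_reads_in_windows([100, 4_999, 5_000, 12_345, 12_400], chrom_length=20_000)
for start, end, count in windows:
    print(start, end, count)
```

Because the step is half the window size in this example, each read falls into up to two overlapping windows, which is why the counts sum to more than the number of reads.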

Three models are available for simulating and fitting the sliding-window results:

  1. NB.ML: negative binomial distribution inferred using the maximum likelihood method
  2. NB.012: negative binomial distribution inferred using windows with only 0, 1, or 2 aligned reads
  3. Poisson: Poisson distribution inferred using windows with only 0, 1, or 2 aligned reads
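
As a rough illustration of the Poisson option, the sketch below estimates a background rate from low-count windows only and then computes an upper-tail P value for every window. It is not the girafe code: the estimator (the ratio of windows with one read to windows with zero reads, which equals the rate for a Poisson distribution) is one simple choice assumed here, and the function names are hypothetical. The default cutoff mirrors the example value of PVAL_CUTOFF.

```python
import math


def estimate_poisson_rate(window_counts):
    """Estimate the Poisson rate from windows with 0 or 1 reads.

    Assumption for illustration: for a Poisson distribution P(1)/P(0) = lambda,
    so the ratio of 1-read windows to 0-read windows estimates the rate.
    """
    n0 = sum(1 for c in window_counts if c == 0)
    n1 = sum(1 for c in window_counts if c == 1)
    if n0 == 0 or n1 == 0:
        raise ValueError("need windows with 0 reads and with 1 read to estimate the rate")
    return n1 / n0


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):  # accumulate P(X = 0) ... P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    # Very small tails may round to 0.0 in floating point.
    return max(0.0, 1.0 - cdf)


def significant_windows(window_counts, pval_cutoff=0.001):
    """Return (window_index, read_count, p_value) for windows at or below the cutoff."""
    lam = estimate_poisson_rate(window_counts)
    hits = []
    for index, count in enumerate(window_counts):
        p_value = poisson_upper_tail(count, lam)
        if p_value <= pval_cutoff:
            hits.append((index, count, p_value))
    return hits


# Toy data: mostly empty windows plus one window with 25 reads.
counts = [0] * 80 + [1] * 16 + [2] * 3 + [25]
lam = estimate_poisson_rate(counts)
hits = significant_windows(counts)
print(lam, [(index, count) for index, count, _ in hits])
```

Restricting the estimate to low-count windows keeps genuinely enriched windows from inflating the background rate. The negative binomial variants additionally model overdispersion, which this sketch does not attempt.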

For more details about these three models, see the addNBSignificance function in the girafe R package (http://www.bioconductor.org/packages/2.6/bioc/html/girafe.html) [16].

This step generates three types of results: figures showing the distribution of sliding windows by number of mapped reads, together with the model simulation results; table files containing the location, read mapping and annotation information of the regions identified as significantly enriched in reads; and track files in BED format (http://genome.ucsc.edu/FAQ/FAQformat.html#format1).

Chongjian Chen 2012-01-26