Solving the differential peak calling problem in ChIP-seq data

Allhoff, Manuel; Jarke, Matthias; Zenke, Martin; Berlage, Thomas
doi:HT019027213
% IMPORTANT: The following is UTF-8 encoded.  This means that in the presence
% of non-ASCII characters, it will not work with BibTeX 0.99 or older.
% Instead, you should use an up-to-date BibTeX implementation like “bibtex8” or
% “biber”.

@PHDTHESIS{Allhoff:660061,
      author       = {Allhoff, Manuel},
      othercontributors = {Berlage, Thomas and Zenke, Martin and Jarke, Matthias},
      title        = {{S}olving the differential peak calling problem in
                      {C}h{IP}-seq data},
      school       = {RWTH Aachen University},
      type         = {Dissertation},
      address      = {Aachen},
      reportid     = {RWTH-2016-05102},
      pages        = {1 online resource (xiii, 126 pages) : illustrations,
                      diagrams},
      year         = {2016},
      note         = {Published on the publication server of RWTH Aachen
                      University; Dissertation, RWTH Aachen University, 2016},
      abstract     = {Gene expression is the process of selectively reading
                      genetic information and it describes a life-essential
                      mechanism in all known living organisms. Key players in the
                      regulation of gene expression are proteins that interact
                      with DNA. DNA-protein interaction sites are nowadays
                      analyzed in a genome-wide manner with chromatin
                      immunoprecipitation followed by sequencing (ChIP-seq). With
                      ChIP-seq it becomes possible to assign a discrete value to
                      each genomic location. The value corresponds to the strength
                      of the protein binding event. Peaks, that is, regions with a
                      signal higher than expected by chance, correspond to the
                      protein-DNA interaction sites. Detecting such peaks is the
                      fundamental computational challenge in ChIP-seq
                      analysis. As in every complex wet-lab protocol, ChIP-seq
                      contains a wide range of potential biases. To reduce the
                      effect of unwanted biases, ChIP-seq experiments are often
                      replicated, which helps to distinguish between biological
                      and random events and to verify the reliability of all
                      experimental steps. Complex ChIP-seq-based studies emphasize
                      the demand for methods to compare replicated ChIP-seq signals
                      that are associated with distinct biological conditions.
                      These studies investigate the differential peak calling
                      problem, which is the subject of current biological and
                      medical research. Solving this problem leads to a deeper
                      understanding of gene expression regulation. Several
                      computational challenges arise when detecting differential
                      peaks (DPs). First, the shape of ChIP-seq peaks depends on
                      the underlying protein of interest. For ChIP-seq data of
                      histone modifications, the DNA-protein interactions occur in
                      mid-size to large domains. Here, domains can span several
                      hundred base pairs and may have intricate patterns of
                      gains and losses of ChIP-seq signals within the same domain.
                      In contrast, ChIP-seq signals from transcription factors
                      mostly occur in small, isolated peaks. Second, artefacts, which
                      arise due to the complexity of the ChIP-seq protocol,
                      produce signals with distinct signal-to-noise ratios, even
                      when they are produced in the same lab and follow the same
                      protocols. Furthermore, different sequencing depths between
                      samples complicate the comparison of their ChIP-seq signals.
                      Hence, a robust normalization method for the ChIP-seq
                      signals is required. Finally, clinical samples, where
                      patients have a distinct genetic background, introduce
                      further variation to the distinct ChIP-seq signals.
                      Moreover, replicated ChIP-seq experiments introduce further
                      complexity which has to be reflected by the use of
                      sophisticated statistical models. Current differential peak
                      calling methods fail to cover all listed challenges. They
                      apply heuristic signal segmentation strategies, such as
                      window-based approaches, to identify DPs. There are only a
                      few attempts to normalize ChIP-seq data. Furthermore, most
                      methods do not support replicates. Hence, there is a clear
                      need for computational methods that address all challenges.
                      In this thesis, we propose ODIN and THOR, algorithms to
                      determine changes of protein-DNA complexes for distinct
                      cellular conditions in ChIP-seq experiments without and with
                      replicates. We apply a statistical model (hidden Markov
                      model) to call DPs and to handle replicates. We also
                      introduce a novel normalization strategy which is based on
                      control regions. These features lead to comprehensive
                      algorithms that accurately call DPs in ChIP-seq signals.
                      Moreover, the evaluation of differential peak calling
                      algorithms is an open problem. The research community lacks
                      both a direct metric to rate the algorithms and data sets
                      with a genome-wide map of DNA-protein interaction sites
                      which can serve as gold standards. We propose two
                      alternative approaches for the evaluation. First, we present
                      indirect metrics to quantify DPs by taking advantage of gene
                      expression data and second, we use simulation to customize
                      artificial gold standards.},
      cin          = {080003 / 122620 / 120000},
      ddc          = {004},
      cid          = {$I:(DE-82)080003_20140620$ / $I:(DE-82)122620_20140620$ /
                      $I:(DE-82)120000_20140620$},
      typ          = {PUB:(DE-HGF)11},
      urn          = {urn:nbn:de:hbz:82-rwth-2016-051022},
      url          = {https://publications.rwth-aachen.de/record/660061},
}