Published in

International Press, Statistics and Its Interface, 8(4), p. 419-436

DOI: 10.4310/sii.2015.v8.n4.a2


Statistical issues in binding site identification through CLIP-seq

Journal article published in 2015 by Xiaowei Chen, Dongjun Chung, Giovanni Stefani, Frank J. Slack, Hongyu Zhao
This paper is available in a repository.


Preprint: archiving allowed
Postprint: archiving allowed
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

With the advent and development of CLIP-seq technologies, a growing number of CLIP-seq experiments are being performed to identify the targets of RNA-binding proteins and to understand the regulatory mechanisms of these proteins. Although broad similarities exist between CLIP-seq and ChIP-seq, statistical methods developed to identify binding sites from ChIP-seq data are not directly applicable to CLIP-seq data because of differences between the two technologies. First, transcript abundance has a large impact on CLIP-seq results and needs to be accounted for when analyzing CLIP-seq data. Second, mutations near the binding sites in CLIP-seq data offer valuable information that can be incorporated into the analysis. Other differences arise from the ability of RNA to form complex secondary structures and from many other technical aspects of the two purification protocols. To date, no systematic studies have been conducted to investigate the general statistical properties of CLIP-seq data, the merits of including RNA-seq as a matching control, and the performance of different binding site identification methods for CLIP-seq data. In this study, we performed a comprehensive evaluation of various statistical issues in using CLIP-seq data to identify RNA-protein binding sites. We demonstrate the value of RNA-seq data in background estimation and peak calling. We show that the large dispersion in CLIP-seq data compared to ChIP-seq data is the main reason for the difficulty in peak calling in the former. Using both real and simulated data, we also show the importance of biological/technical replicates and of combining mutation and peak analysis to accurately identify binding sites from CLIP-seq data.