RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty
Feb 15, 2010
Dr. Bo Li
V. Ruotti
R. M. Stewart
J. A. Thomson
C. N. Dewey
Abstract
Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.
Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20–25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed.
Availability: An initial C++ implementation of our method that was used for the results presented in this article is available at http://deweylab.biostat.wisc.edu/rsem.
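To make the idea in the abstract concrete, here is a minimal sketch of how reads that map to several isoforms can be allocated probabilistically instead of being discarded, and how gene-level estimates follow as sums of isoform-level estimates. It is not the RSEM implementation: it is a simplified expectation-maximization loop that assumes every alignment of a read is equally likely a priori, and it ignores transcript length, sequencing error, and non-uniform read distributions. The function names `em_abundances` and `gene_level` and the toy data are hypothetical.

```python
from collections import defaultdict


def em_abundances(read_alignments, n_isoforms, n_iter=200):
    """Estimate the fraction of reads derived from each isoform by EM.

    read_alignments: one list per read, holding the indices of the
    isoforms that read aligns to (simplifying assumption: all of a
    read's alignments are equally plausible apart from abundance).
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read across its candidate isoforms in
        # proportion to the current abundance estimates.
        for isoforms in read_alignments:
            total = sum(theta[i] for i in isoforms)
            for i in isoforms:
                counts[i] += theta[i] / total
        # M-step: re-estimate abundances from the fractional counts.
        n_reads = sum(counts)
        theta = [c / n_reads for c in counts]
    return theta


def gene_level(theta, isoform_to_gene):
    """Gene-level estimate = sum of the estimates of the gene's isoforms."""
    genes = defaultdict(float)
    for isoform, fraction in enumerate(theta):
        genes[isoform_to_gene[isoform]] += fraction
    return dict(genes)


# Toy data (hypothetical): isoforms 0 and 1 belong to gene "A",
# isoform 2 belongs to gene "B". The third read maps to two isoforms.
reads = [[0], [0], [0, 1], [2]]
theta = em_abundances(reads, n_isoforms=3)
genes = gene_level(theta, {0: "A", 1: "A", 2: "B"})
print(theta)
print(genes)
```

In this toy case the ambiguous read maps to two isoforms of the same gene, so the gene-level totals do not depend on how that read is split between them.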
Type
Publication
Bioinformatics

Authors
Principal Scientist II
Dr. Bo Li is a Principal Scientist at Genentech, Inc. His research focuses on large-scale single-cell genomics data analysis.
Before joining Genentech, he was an Assistant Professor of Medicine at Harvard Medical School and the Director of Bioinformatics and Computational Biology at the Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital.
He received his Ph.D. in computer science from UW-Madison and completed postdoctoral training with Dr. Lior Pachter at UC Berkeley and with Dr. Aviv Regev at the Broad Institute.
He is best known for developing RSEM, an impactful RNA-seq transcript quantification software package. RSEM has been cited 22,602 times (Google Scholar) and has been adopted by several large consortia, such as TCGA, ENCODE, GTEx, and TOPMed.