Zero-inflated models for RNA-Seq count data
Motivation: Next Generation Sequencing (NGS) methods for RNA-Seq produce millions of short sequences, called reads, that provide fundamental information in the fields of genomics, epigenetics and transcriptomics. One of the main objectives of many biological studies is gene expression profiling between samples. Gene expression profiling studies involve mapping short reads to a reference genome, if available, then summarizing, normalizing, and finally performing downstream analysis such as compiling a list of differentially expressed (DE) genes. A common assumption for RNA-Seq data is that all gene counts follow an overdispersed Poisson or Negative Binomial (NB) distribution. This assumption is sometimes misleading, because some genes may have stable transcription levels with no overdispersion and some may have an excessive number of zero counts. A more realistic assumption for RNA-Seq data is therefore to consider four sets of genes: overdispersed genes with a limited number of zeros, overdispersed genes with an excessive number of zeros, non-overdispersed genes with a limited number of zeros, and non-overdispersed genes with an excessive number of zeros. Our objective is to apply zero-inflated models to data with an excessive number of zero counts and to evaluate their performance.
Method: Available methods can handle read count data with a limited number of zero counts, for both overdispersed and non-overdispersed data. For data with an excessive number of zeros, we adopt a new approach and apply it to real RNA-Seq data obtained from Gilad et al. to detect DE genes. Our approach is to use a Zero-Inflated Poisson (ZIP) mixed model for non-overdispersed genes and a Zero-Inflated Negative Binomial (ZINB) mixed model for overdispersed genes. This is an integrated approach, because it can be combined with any other Poisson-based or NB-based method for detecting DE genes. We also evaluate the performance of the models in a simulation study.
Results: Heat maps of the DE genes obtained with the ZIP and ZINB mixed models demonstrate that the models perform notably well on the real data. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) show that the models also perform well on simulated data. However, ZIP performs better in identifying DE genes from both real and simulated data with excessive zeros.
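
As a rough illustration of the two count distributions named in the Method, the sketch below writes out the ZIP and ZINB probability mass functions in plain Python. It covers only the marginal distributions: the random effects of the mixed models are not shown, and the parameter names (mu for the mean, size for the NB dispersion, pi for the probability of an excess zero) are assumptions made for this example rather than notation taken from the paper.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability of count k with mean mu (mu > 0)."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def nb_pmf(k, mu, size):
    """Negative Binomial probability of count k.

    Mean/size parameterization: variance = mu + mu**2 / size,
    so a smaller size means stronger overdispersion.
    """
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zero_inflated_pmf(k, pi, base_pmf):
    """Mix a point mass at zero (weight pi) with a base count distribution."""
    p = (1.0 - pi) * base_pmf(k)
    return pi + p if k == 0 else p


def zip_pmf(k, mu, pi):
    """Zero-Inflated Poisson: excess zeros plus Poisson counts."""
    return zero_inflated_pmf(k, pi, lambda j: poisson_pmf(j, mu))


def zinb_pmf(k, mu, size, pi):
    """Zero-Inflated Negative Binomial: excess zeros plus NB counts."""
    return zero_inflated_pmf(k, pi, lambda j: nb_pmf(j, mu, size))
```

For example, zip_pmf(0, mu=2.0, pi=0.3) gives a noticeably larger zero probability than poisson_pmf(0, 2.0), which is the behaviour that lets these models accommodate genes with many zero counts.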