Statistical and Geometric Methods for the Analysis of Sequence Count Data


Sequence counting via high-throughput DNA sequencing has become omnipresent in biomedical research and underlies techniques such as 16S rRNA sequencing for profiling microbial communities and RNA-seq for measuring gene expression. However, the process by which sequence count data is generated strips away information about total sequence abundances and retains only information about the relative abundances of sequences. Such relative data, often termed compositional, is known to cause problems in many statistical analyses. Sequence count data is further complicated by stochastic count variation as well as by random experimental errors and biases. In this talk, I will introduce statistical and geometric tools for modeling sequence count data based on hierarchical multinomial logistic-normal models. To examine these models, I will discuss: (1) their theoretical and empirical justification; (2) their connections to currently available methods, including negative-binomial models and purely compositional data analysis approaches; and (3) their connection to standard log-linear models for categorical data analysis. In particular, I will demonstrate and discuss the analysis of longitudinal microbiome time-series data using Bayesian multinomial logistic-normal generalized dynamic linear models.

F45 Jon M. Huntsman Hall