Statistical Methods for the Analysis of Sequence Count Data


Sequence counting via high-throughput DNA sequencing underlies many studies, including 16S rRNA sequencing and single-cell or bulk RNA sequencing. However, due to the measurement process, sequence count data contain information only about the relative abundances of sequences. Such relative data are commonly analyzed using tools from compositional data analysis (CoDA). Yet the CoDA approach fails to account for other features of the data, such as count variability and technical variation. In this talk, I will introduce an alternative formulation of sequence count data as count-compositional, along with tools in line with this formulation. Based on the compound multinomial logistic-normal distribution, I will present a class of linear and dynamic linear models for the analysis of sequence count data. While such models are typically difficult to fit, I will describe the collapse-uncollapse sampler as a means of efficiently inferring them. Finally, I will discuss my future research plans, including the idea of total augmentation as a means of fundamentally overcoming compositional limitations.
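
As a rough illustration of the count-compositional view, the sketch below simulates counts from a multinomial logistic-normal model. It is not the speaker's implementation: the function names (`inverse_alr`, `simulate_mln`), the choice of the additive log-ratio transform with the last part as reference, and all parameter values are assumptions made for this example.

```python
import numpy as np

def inverse_alr(eta):
    """Map a (D-1)-vector of additive log-ratios to a D-part composition.

    Assumption for this sketch: the last part is the reference category.
    """
    z = np.append(eta, 0.0)   # reference part has log-ratio 0
    z = z - z.max()           # stabilise the exponentials
    p = np.exp(z)
    return p / p.sum()

def simulate_mln(mu, cov, depths, seed=0):
    """Draw latent compositions and counts for one sample per sequencing depth."""
    rng = np.random.default_rng(seed)
    # Latent log-ratios: one multivariate normal draw per sample.
    eta = rng.multivariate_normal(mu, cov, size=len(depths))
    # Transform each draw to relative abundances that sum to 1.
    props = np.apply_along_axis(inverse_alr, 1, eta)
    # Observed counts: multinomial sampling at the given depth.
    counts = np.stack([rng.multinomial(n, p) for n, p in zip(depths, props)])
    return props, counts

# Hypothetical parameters: 4 taxa (3 log-ratios), 3 samples of varying depth.
mu = np.array([1.0, 0.0, -1.0])
cov = 0.5 * np.eye(3)
depths = [500, 5000, 50000]
props, counts = simulate_mln(mu, cov, depths)
print(counts)
print(counts.sum(axis=1))  # row totals equal the chosen depths
```

Each row total is fixed by the sequencing depth rather than by the underlying biology, which is why the counts inform only relative abundances; the multinomial layer is what adds the count variability that a purely compositional treatment leaves out.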

IST Research Talk at Penn State
E202 Westgate Building