Robust statistical methods for significance evaluation and applications in cancer driver detection and biomarker discovery

Research output: Research - Ph.D. thesis

In the present thesis I develop, implement and apply statistical methods for detecting genomic elements implicated in cancer development and progression.
This is done in two separate bodies of work.
The first uses the somatic mutation burden to distinguish cancer driver mutations from passenger mutations.
The second uses gene expression and DNA methylation data to find genes differentially expressed and/or methylated between two sets of samples.
Additionally, I have developed efficient methods and algorithms for evaluating the significance of observations from a graphical model.
These methods are used to scale the aforementioned driver detection methods to a dataset consisting of more than 2,000 cancer genomes.

The sizes and dimensionalities of genomic data sets, be it a large number of genes or multiple heterogeneous data sources, pose both great statistical opportunities and challenges.
These challenges include model selection and multiple testing problems.
On the other hand, in large datasets we can often exploit hierarchical structures to improve inference. For example, in differential expression studies with multiple genes, it is natural to define a distribution for the variability of each gene.
This distribution can be learned across the entire set of genes and then be used to improve inference on the level of the individual gene.
A practical way to implement this insight is using empirical Bayes.
This idea is one of the main statistical underpinnings of the present work.
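
To make the hierarchical idea concrete, here is a minimal sketch in Python. It is an illustration only, not the implementation used in the thesis. It assumes a scaled inverse chi-square prior on the per-gene variances and fits the two prior parameters by matching the mean and variance of the log sample variances, in the spirit of the moderated variance estimators common in differential expression analysis. All function names and the simulated data are hypothetical.

```python
import math
import random
import statistics


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def trigamma(x):
    """Trigamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))


def fit_variance_prior(sample_vars, df):
    """Estimate prior degrees of freedom d0 and prior scale s0^2 from all genes.

    Assumed model: sigma_g^2 ~ d0 * s0^2 / chi2(d0) and
    s_g^2 | sigma_g^2 ~ sigma_g^2 * chi2(df) / df.
    """
    logs = [math.log(v) for v in sample_vars]
    mean_log = statistics.fmean(logs)
    # Spread of log s^2 beyond what sampling noise alone explains.
    excess = statistics.variance(logs) - trigamma(df / 2)
    base = mean_log - digamma(df / 2) + math.log(df / 2)
    if excess <= 0:
        # No detectable between-gene spread: use one common variance.
        return math.inf, math.exp(base)
    # Solve trigamma(d0 / 2) = excess by bisection on a log scale
    # (trigamma is strictly decreasing on the positive axis).
    lo, hi = 1e-8, 1e8
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if trigamma(mid) > excess:
            lo = mid
        else:
            hi = mid
    half_d0 = math.sqrt(lo * hi)
    s0_sq = math.exp(base + digamma(half_d0) - math.log(half_d0))
    return 2.0 * half_d0, s0_sq


def shrink_variances(sample_vars, df):
    """Posterior-mean-style shrinkage of each gene's variance toward s0^2."""
    d0, s0_sq = fit_variance_prior(sample_vars, df)
    if math.isinf(d0):
        return [s0_sq] * len(sample_vars)
    return [(d0 * s0_sq + df * v) / (d0 + df) for v in sample_vars]


if __name__ == "__main__":
    # Simulated data: 2000 hypothetical genes, 4 samples each (df = 3).
    random.seed(1)
    df, true_d0, true_s0_sq = 3, 8.0, 1.0
    sample_vars = []
    for _ in range(2000):
        sigma_sq = true_d0 * true_s0_sq / random.gammavariate(true_d0 / 2, 2.0)
        draws = [random.gauss(0.0, math.sqrt(sigma_sq)) for _ in range(df + 1)]
        sample_vars.append(statistics.variance(draws))
    print(fit_variance_prior(sample_vars, df))
    print(shrink_variances(sample_vars, df)[:5])
```

Each shrunken variance is a weighted average of the gene's own estimate and the prior scale learned from all genes, so genes with few samples borrow the most strength from the rest of the data.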

The thesis consists of three main manuscripts as well as two supplementary manuscripts.

In the first manuscript we explore efficient significance evaluation for models defined with factor graphs.
Factor graphs are a class of graphical models encompassing both Bayesian networks and Markov models.
We specifically develop a saddle-point approximation and an importance sampling scheme that are fast to evaluate yet accurate.
We demonstrate the methods on multiple models including the Poisson-binomial model, a high-order Markov chain motif model and phylogenetic trees of sequence evolution.
The methods are implemented in a publicly available R-package called dgRaph.
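
As an illustration of the saddle-point idea on the simplest of these models, the sketch below approximates the upper-tail probability of a Poisson-binomial count (a sum of independent Bernoulli variables with different success probabilities) and compares it with the exact value from dynamic programming. It uses the textbook Lugannani-Rice formula with Daniels' first continuity correction; it does not reproduce the thesis's general factor-graph method and does not use dgRaph. The function names and example probabilities are hypothetical, and the sketch assumes all probabilities lie strictly between 0 and 1 and that the threshold lies above the mean.

```python
import math


def cgf_derivatives(probs, s):
    """Return K(s), K'(s), K''(s) for a sum of independent Bernoulli(p_i)."""
    k0 = k1 = k2 = 0.0
    es = math.exp(s)
    for p in probs:
        denom = 1.0 - p + p * es
        q = p * es / denom  # success probability under exponential tilting
        k0 += math.log(denom)
        k1 += q
        k2 += q * (1.0 - q)
    return k0, k1, k2


def saddlepoint_upper_tail(probs, k):
    """Approximate P(X >= k) for integer k with mean < k < n."""
    n = len(probs)
    if not sum(probs) < k < n:
        raise ValueError("this sketch only handles mean < k < n")
    # K'(s) is increasing in s, so bracket and bisect for K'(s) = k.
    lo, hi = 0.0, 1.0
    while cgf_derivatives(probs, hi)[1] < k:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cgf_derivatives(probs, mid)[1] < k:
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)
    k0, _, k2 = cgf_derivatives(probs, s)
    w = math.sqrt(2.0 * (s * k - k0))
    u = (1.0 - math.exp(-s)) * math.sqrt(k2)  # lattice (continuity) correction
    density = math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)
    normal_tail = 0.5 * math.erfc(w / math.sqrt(2.0))
    return normal_tail + density * (1.0 / u - 1.0 / w)


def exact_upper_tail(probs, k):
    """Exact P(X >= k) by convolving one Bernoulli variable at a time."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for j, mass in enumerate(pmf):
            nxt[j] += mass * (1.0 - p)
            nxt[j + 1] += mass * p
        pmf = nxt
    return sum(pmf[k:])


if __name__ == "__main__":
    probs = [0.01 + 0.002 * i for i in range(50)]  # hypothetical site probabilities
    for k in (6, 10, 15):
        print(k, saddlepoint_upper_tail(probs, k), exact_upper_tail(probs, k))
```

The exact recursion costs time quadratic in the number of variables, whereas the approximation needs only a one-dimensional root search, which is what makes this kind of method attractive at the scale of thousands of genomes.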

The methods mentioned above are next put to use in detection of potential cancer drivers.
Based on a predictive model of the sample- and site-specific somatic mutation rate across cancer genomes, Juul et al. (2017) developed a method to search for cancer drivers.
We improve the specificity of the previous method by modelling and accounting for overdispersion in the somatic mutation rate.
Overdispersion is learned using empirical Bayes and we evaluate the statistical significance of a genomic element using the saddle-point approximation developed previously.
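
To illustrate why unmodelled overdispersion hurts specificity, the sketch below compares the upper-tail probability of a Poisson count with that of a gamma-mixed Poisson (negative binomial) count with the same mean. The gamma mixing distribution, the parameter names `mean` and `shape`, and the numbers are assumptions made for this illustration; the abstract does not specify the overdispersion model used in the thesis.

```python
import math


def poisson_log_pmf(x, mean):
    """Log probability of x events under a Poisson with the given mean."""
    return x * math.log(mean) - mean - math.lgamma(x + 1)


def negbin_log_pmf(x, mean, shape):
    """Log pmf of a Poisson count whose rate is Gamma(shape, scale=mean/shape).

    The marginal variance is mean + mean**2 / shape, so a smaller shape
    means stronger overdispersion.
    """
    return (math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
            + shape * math.log(shape / (shape + mean))
            + x * math.log(mean / (shape + mean)))


def upper_tail(log_pmf, k, mean):
    """P(X >= k), summed term by term to avoid cancellation in small tails."""
    total, x = 0.0, k
    while True:
        term = math.exp(log_pmf(x))
        total += term
        # Past the mean the terms decay, so stop once they are negligible.
        if x > mean and term <= 1e-17 * total:
            return total
        x += 1


if __name__ == "__main__":
    mean, shape, k = 3.0, 2.0, 12  # hypothetical expected and observed counts
    p_poisson = upper_tail(lambda x: poisson_log_pmf(x, mean), k, mean)
    p_negbin = upper_tail(lambda x: negbin_log_pmf(x, mean, shape), k, mean)
    print(p_poisson, p_negbin)
```

With these illustrative numbers the Poisson tail is far smaller than the overdispersed tail, so a test that ignores overdispersion would report the same observed count as much more significant than it is.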

In the final manuscript we integrate methylation and expression data to detect genes that separate two classes.
Regularized estimates and tests have previously been employed in differential expression studies.
We generalize these methods to this multivariable setting and further include techniques from robust statistics.
We show that using these methods improves classification performance, particularly on smaller cohorts.

Original language: English
Number of pages: 203
State: Submitted - 31 Aug 2017


ID: 116547789