Population genetics is a branch of applied mathematics. It translates scientific observations into mathematical models and manipulates those models to produce quantitative predictions about evolution. Combining knowledge from genetics, statistics, and computer science, population geneticists strive to develop working solutions for extracting information from massive volumes of biological data. The steep increase in the quantity and quality of genomic data over the past decades provides a unique opportunity, but it also calls for new and improved algorithms and software to cope with the big data era.
In this PhD dissertation, I present my work on methods and tools developed for two projects, Admixture CoalHMM and Ohana, both of which have been designed to study historical admixture and its influence on population evolution. In Admixture CoalHMM, I make use of full genomic sequences from a few individuals to perform demographic inference. In Ohana, I use site-independent genomic data from many individuals to analyze individual admixture, to infer population trees, and to identify selection signals.
The development of CoalHMM at the Bioinformatics Research Centre at Aarhus University dates back to 2007. CoalHMM is a hidden Markov model built on coalescent theory, with the key approximation that the distribution of local genealogies is Markovian along the sequence alignment. Through parametrized modeling, CoalHMM attempts to recover a full demography, including population splits, effective population sizes, and gene flow. Since joining the CoalHMM development team in 2014, I have mainly contributed in two directions: 1) improving optimization through heuristic-based evolutionary algorithms and 2) modeling historical admixture events.
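As an illustration of the hidden Markov machinery mentioned above, the following is a minimal sketch of the scaled forward algorithm, which computes the log-likelihood of an observed sequence under a generic HMM. It is not CoalHMM's implementation. The two hidden states stand in for local genealogies, the two symbols stand in for alignment columns, and all probabilities are made-up toy values chosen only for illustration.

```python
import numpy as np


def forward_log_likelihood(initial, transition, emission, observations):
    """Scaled forward algorithm: log P(observations) under a generic HMM.

    initial[s]       : probability of starting in hidden state s
    transition[s, t] : probability of moving from state s to state t
    emission[s, o]   : probability of emitting symbol o from state s
    """
    # Joint probability of the first symbol and each hidden state.
    alpha = initial * emission[:, observations[0]]
    log_lik = 0.0
    for obs in observations[1:]:
        # Rescale to avoid underflow on long sequences, keeping the log of the scale.
        scale = alpha.sum()
        log_lik += np.log(scale)
        # Markov step along the sequence, then weight by the emission probability.
        alpha = (alpha / scale) @ transition * emission[:, obs]
    return log_lik + np.log(alpha.sum())


# Hypothetical toy model: two hidden states, two observable symbols.
initial = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
emission = np.array([[0.8, 0.2],
                     [0.3, 0.7]])
observations = [0, 1, 0]

print(forward_log_likelihood(initial, transition, emission, observations))
```

In a parametrized model, the transition and emission matrices would be functions of the demographic parameters, and an optimizer would search for the parameter values that maximize this log-likelihood.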
Ohana, meaning "family" in Hawaiian, is a novel project I started at the Center for Theoretical Evolutionary Genetics at the University of California, Berkeley. Ohana provides a set of methods and tools for structure analysis, population tree inference, and selection studies that take full advantage of structured genomic data. Ohana's admixture module is based on classical structure modeling but uses new optimization subroutines based on quadratic programming, which outperform the current state-of-the-art software in both speed and accuracy. Ohana also introduces a new method for phylogenetic tree inference using a Gaussian approximation. With the estimated global ancestry and population relationships, Ohana provides a flexible selection signal detection process that can incorporate prior knowledge about the covariance structure, e.g., a population bottleneck or local adaptation.
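To make the structure model behind the admixture module concrete, here is a minimal sketch of the binomial log-likelihood commonly used in structure-style admixture analysis. It assumes biallelic markers coded as 0, 1, or 2 copies of an allele. Whether Ohana uses exactly this form is an assumption, and the sketch only evaluates the objective; it does not implement the quadratic programming optimizer. The names `G`, `Q`, and `F` and all numbers are hypothetical.

```python
import numpy as np


def admixture_log_likelihood(G, Q, F):
    """Log-likelihood of genotypes under a structure-style admixture model.

    G[i, j] : allele count (0, 1, or 2) for individual i at marker j
    Q[i, k] : admixture proportion of individual i from ancestral
              population k (each row sums to 1)
    F[k, j] : allele frequency of marker j in ancestral population k

    Constant binomial coefficients are dropped because they do not depend
    on Q or F.
    """
    # Individual-specific allele frequencies: mixtures of ancestral frequencies.
    P = Q @ F
    return float(np.sum(G * np.log(P) + (2 - G) * np.log1p(-P)))


# Hypothetical toy data: two individuals, two ancestral populations, three markers.
G = np.array([[0, 1, 2],
              [2, 1, 0]])
F = np.array([[0.1, 0.5, 0.9],
              [0.9, 0.5, 0.1]])
Q_matched = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
Q_swapped = np.array([[0.0, 1.0],
                      [1.0, 0.0]])

print(admixture_log_likelihood(G, Q_matched, F))
print(admixture_log_likelihood(G, Q_swapped, F))
```

An optimizer would maximize this function over `Q` and `F`, subject to each row of `Q` summing to one and every frequency staying strictly between 0 and 1. In the toy data, the ancestry assignment that matches the genotypes scores higher than the swapped one.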
Statistical modeling and numerical optimization form the foundation of both CoalHMM and Ohana. Optimization modeling has been the main theme throughout my PhD, and it will continue to shape my work in the years to come. The algorithms and software I developed to study historical admixture and population evolution belong to the broader family of machine learning methods, and their underlying techniques have a wide range of applications beyond bioinformatics and population genetics.
Published - 31 Jul 2016
population genetics, statistical modeling, numerical optimization, software