Designs and Methods for Association Studies and Population Size Inference in Statistical Genetics

Publication: Research, Ph.D. thesis


This dissertation falls into two parts. The first part discusses statistical modelling
of association studies within the field of epidemiology. A special focus
is given to genome-wide association studies (GWAS), which are able to investigate
specific associations between positions in the genome and different
diseases. In the second part, statistical methods for inferring population history
are discussed. Knowledge of e.g. the common ancestor of the human
species, possible bottlenecks back in time, and the expected number of rare
variants in each genome may be factors in the full picture of any disease.
In epidemiology the term "odds ratio" is used for the estimator of any
case-control study, independent of the sampling of the controls. This term
is ambiguous without specification of the sampling scheme of the controls.
When controls are sampled among the non-diseased individuals at the end of
follow-up, i.e. the classical case-control study, the estimator consistently
estimates the odds ratio (OR). If controls are sampled among those at
risk when each case is diagnosed, i.e. the matched case-control study, the
estimator consistently estimates the incidence rate ratio (IRR). The OR is
interpreted as the effect of an exposure on the probability of being diseased
at the end of follow-up, while the interpretation of the IRR is the effect of
an exposure on the probability of becoming diseased.
Through a simulation study, the OR from a classical case-control study
is shown to be an inconsistent estimator of the IRR. The difference between
the OR and the IRR is reflected in the p-value of the null hypothesis of
no exposure effect. For multiple testing scenarios, e.g. in a GWAS, these
differences in estimators imply a change in comparison between the null
hypotheses for different sampling schemes of controls.
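
The gap between the OR and the IRR can be illustrated with a small calculation. The sketch below is not the simulation study from the thesis. It assumes constant hazards in the exposed and unexposed groups, no censoring, and no competing risks, and all names and numbers (`baseline_hazard`, `irr`, the follow-up lengths) are hypothetical. Under these assumptions it computes the population OR that a classical case-control study targets and shows that it moves away from the IRR as follow-up grows.

```python
import math


def risk(hazard, follow_up):
    """Probability of being diseased by the end of follow-up (constant hazard)."""
    return 1.0 - math.exp(-hazard * follow_up)


def odds_ratio_at_end(baseline_hazard, irr, follow_up):
    """Odds ratio comparing exposed with unexposed at the end of follow-up.

    This is the quantity targeted when controls are sampled among the
    non-diseased at the end of follow-up (assumed model, see lead-in).
    """
    p0 = risk(baseline_hazard, follow_up)
    p1 = risk(baseline_hazard * irr, follow_up)
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


if __name__ == "__main__":
    baseline_hazard = 0.01  # hypothetical: events per person-year, unexposed
    irr = 2.0               # hypothetical incidence rate ratio
    for follow_up in (1, 10, 50):
        value = odds_ratio_at_end(baseline_hazard, irr, follow_up)
        print(f"follow-up {follow_up:>2} years: OR = {value:.3f}, IRR = {irr}")
```

For a rare disease and short follow-up the two measures nearly coincide, while for longer follow-up the OR exceeds the IRR whenever the IRR is above one in this model.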
Population genetics
In population genetics two methods concerning the inference of the population
size back in time are described. Both methods are based on the site
frequency spectrum (SFS), and the fact that the expected SFS only depends
on the time between coalescent events back in time.
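
The dependence of the expected SFS on the inter-coalescent times can be made concrete with a standard result from coalescent theory. The sketch below is an illustration, not the thesis implementation. It assumes the infinite-sites model, a mutation rate of theta/2 per lineage per unit of coalescent time, and a user-supplied dictionary `expected_times` holding the expected time with k lineages. The function names are hypothetical.

```python
from math import comb


def expected_sfs(n, theta, expected_times):
    """Expected site frequency spectrum for a sample of n sequences.

    expected_times[k] is the expected time (in coalescent units) during
    which the genealogy has exactly k lineages, for k = 2, ..., n.
    Returns a list whose entry i - 1 is the expected number of sites
    where the derived allele appears in i of the n sequences.
    """
    sfs = []
    for i in range(1, n):
        total = 0.0
        for k in range(2, n - i + 2):
            # Probability that one given lineage, among k, is ancestral
            # to exactly i of the n sampled sequences.
            p = comb(n - i - 1, k - 2) / comb(n - 1, k - 1)
            total += k * p * expected_times[k]
        sfs.append(0.5 * theta * total)
    return sfs


def constant_size_times(n):
    """Expected inter-coalescent times under constant population size."""
    return {k: 2.0 / (k * (k - 1)) for k in range(2, n + 1)}


if __name__ == "__main__":
    n, theta = 6, 10.0  # hypothetical sample size and mutation parameter
    print(expected_sfs(n, theta, constant_size_times(n)))
```

With constant population size the function reproduces the classical expectation theta / i for the i-th entry. Any other population size history enters only through `expected_times`.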
The first method provides a simple goodness-of-fit test by comparing the
observed SFS with the expected SFS under a given model of population size
changes. By the use of Monte Carlo estimation the expected time between
coalescent events can be estimated and the expected SFS can thereby be
evaluated. Using the classical chi-square statistic we are able to infer
single-parameter models. Multiple-parameter models, e.g. multiple epochs, are
harder to identify.
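
A minimal sketch of the two ingredients named above, Monte Carlo estimation of the expected inter-coalescent times and a Pearson chi-square statistic, might look as follows. It is an assumption-laden illustration and not the procedure from the thesis. It simulates the standard coalescent and maps its event times through the inverse of the integrated coalescent intensity. The exponential growth model, its parameter `beta`, and all function names are hypothetical.

```python
import math
import random


def simulate_mean_times(n, inverse_intensity, replicates=20000, seed=1):
    """Monte Carlo estimate of the expected time with k lineages, k = 2..n.

    inverse_intensity maps cumulative time on the constant-size scale to
    real (coalescent-unit) time under the assumed population size model.
    """
    rng = random.Random(seed)
    totals = {k: 0.0 for k in range(2, n + 1)}
    for _ in range(replicates):
        scaled = 0.0    # cumulative time on the constant-size scale
        previous = 0.0  # cumulative time on the real time scale
        for k in range(n, 1, -1):
            scaled += rng.expovariate(k * (k - 1) / 2.0)
            current = inverse_intensity(scaled)
            totals[k] += current - previous
            previous = current
    return {k: total / replicates for k, total in totals.items()}


def exponential_growth_inverse(beta):
    """Inverse of Lambda(t) = (exp(beta * t) - 1) / beta (hypothetical model)."""
    return lambda x: math.log1p(beta * x) / beta


def chi_square(observed, expected):
    """Pearson chi-square statistic between observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    constant = simulate_mean_times(5, lambda x: x)
    growth = simulate_mean_times(5, exponential_growth_inverse(1.0))
    print(constant)
    print(growth)
```

The estimated times can be converted into an expected SFS and compared with the observed SFS through `chi_square`. For a one-parameter model such as `beta`, the statistic can then be minimised over a grid of parameter values.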
By formulating the inference of population size back in time as an inverse
problem, the second procedure applies the theory of smoothing splines
to infer the changes in population size. By adding a penalising term to
the goodness-of-fit statistic described above, we are able to estimate the integrated
intensity of the coalescent process by a twice continuously differentiable
piecewise cubic polynomial.
Original language: English
Publisher: Aarhus Universitet
Number of pages: 180
State: Published - 30 Sep 2016


ID: 100357118