Department of Economics and Business Economics

Designs and Methods for Association Studies and Population Size Inference in Statistical Genetics

Research output: Book/anthology/dissertation/report › Ph.D. thesis › Research

Standard

Designs and Methods for Association Studies and Population Size Inference in Statistical Genetics. / Waltoft, Berit Lindum.

Aarhus Universitet, 2016. 180 p.

Bibtex

@phdthesis{c431e6baea95486fa166e86d46dbe995,
title = "Designs and Methods for Association Studies and Population Size Inference in Statistical Genetics",
abstract = "This dissertation falls in two parts. The first part discusses statistical modelling of association studies within the field of epidemiology. A special focus is given to genome-wide association studies (GWAS), which are able to investigate specific associations between positions in the genome and different diseases. In the second part statistical methods for inferring population history is discussed. Knowledge on e.g. the common ancestor of the human species, possible bottlenecks back in time, and the expected number of rare variants in each genome, may be factors in the full picture of any disease aetiology. Epidemiology: In epidemiology the wording {"}odds ratio{"} is used for the estimator of any case-control study independent of the sampling of the controls. This phrase is ambiguous without specifications of the sampling schemes of the controls. When controls are sampled among the non-diseased individuals at the end of follow-up, i.e. the classical case-control study, the estimator is consistently measuring the odds ratio (OR). If controls are sampled among those at risk when each case is diagnosed, i.e. the matched case-control study, the estimator consistently estimates the incidence rate ratio (IRR). The OR is interpreted as the effect of an exposure on the probability of being diseased at the end of follow-up, while the interpretation of the IRR is the effect of an exposure on the probability of becoming diseased. Through a simulation study, the OR from a classical case-control study is shown to be an inconsistent estimator of the IRR. The difference between the OR and the IRR is reflected in the p-value of the null hypothesis of no exposure effect. For multiple testing scenarios, e.g. in a GWAS, these differences in estimators imply a change in comparison between the null hypotheses for different sampling schemes of controls. Population genetics: In population genetics two methods concerning the inference of the population size back in time are described. Both methods are based on the site frequency spectrum (SFS), and the fact that the expected SFS only depends on the time between coalescent events back in time. The first method provides a simple goodness of fit test by comparing the observed SFS with the expected SFS under a given model of population size changes. By the use of Monte Carlo estimation the expected time between coalescent events can be estimated and the expected SFS can thereby be evaluated. Using the classical chi-square statistics we are able to infer single parameter models. Multiple parameter models, e.g. multiple epochs, are harder to identify. By introducing the inference of population size back in time as an inverse problem, the second procedure applies the theory of smoothing splines to infer the changes in population size. By adding a penalising term to the goodness-of-fit described above, we are able to estimate the integrated intensity of the coalescent process by a two times continuous differentiable piecewise cubic polynomial.",
author = "Waltoft, {Berit Lindum}",
year = "2016",
month = "9",
day = "30",
language = "English",
school = "Aarhus Universitet",

}

RIS

TY - BOOK

T1 - Designs and Methods for Association Studies and Population Size Inference in Statistical Genetics

AU - Waltoft, Berit Lindum

PY - 2016/9/30

Y1 - 2016/9/30

N2 - This dissertation falls in two parts. The first part discusses statistical modelling of association studies within the field of epidemiology. A special focus is given to genome-wide association studies (GWAS), which are able to investigate specific associations between positions in the genome and different diseases. In the second part statistical methods for inferring population history is discussed. Knowledge on e.g. the common ancestor of the human species, possible bottlenecks back in time, and the expected number of rare variants in each genome, may be factors in the full picture of any disease aetiology. Epidemiology: In epidemiology the wording "odds ratio" is used for the estimator of any case-control study independent of the sampling of the controls. This phrase is ambiguous without specifications of the sampling schemes of the controls. When controls are sampled among the non-diseased individuals at the end of follow-up, i.e. the classical case-control study, the estimator is consistently measuring the odds ratio (OR). If controls are sampled among those at risk when each case is diagnosed, i.e. the matched case-control study, the estimator consistently estimates the incidence rate ratio (IRR). The OR is interpreted as the effect of an exposure on the probability of being diseased at the end of follow-up, while the interpretation of the IRR is the effect of an exposure on the probability of becoming diseased. Through a simulation study, the OR from a classical case-control study is shown to be an inconsistent estimator of the IRR. The difference between the OR and the IRR is reflected in the p-value of the null hypothesis of no exposure effect. For multiple testing scenarios, e.g. in a GWAS, these differences in estimators imply a change in comparison between the null hypotheses for different sampling schemes of controls. Population genetics: In population genetics two methods concerning the inference of the population size back in time are described. Both methods are based on the site frequency spectrum (SFS), and the fact that the expected SFS only depends on the time between coalescent events back in time. The first method provides a simple goodness of fit test by comparing the observed SFS with the expected SFS under a given model of population size changes. By the use of Monte Carlo estimation the expected time between coalescent events can be estimated and the expected SFS can thereby be evaluated. Using the classical chi-square statistics we are able to infer single parameter models. Multiple parameter models, e.g. multiple epochs, are harder to identify. By introducing the inference of population size back in time as an inverse problem, the second procedure applies the theory of smoothing splines to infer the changes in population size. By adding a penalising term to the goodness-of-fit described above, we are able to estimate the integrated intensity of the coalescent process by a two times continuous differentiable piecewise cubic polynomial.

AB - This dissertation falls in two parts. The first part discusses statistical modelling of association studies within the field of epidemiology. A special focus is given to genome-wide association studies (GWAS), which are able to investigate specific associations between positions in the genome and different diseases. In the second part statistical methods for inferring population history is discussed. Knowledge on e.g. the common ancestor of the human species, possible bottlenecks back in time, and the expected number of rare variants in each genome, may be factors in the full picture of any disease aetiology. Epidemiology: In epidemiology the wording "odds ratio" is used for the estimator of any case-control study independent of the sampling of the controls. This phrase is ambiguous without specifications of the sampling schemes of the controls. When controls are sampled among the non-diseased individuals at the end of follow-up, i.e. the classical case-control study, the estimator is consistently measuring the odds ratio (OR). If controls are sampled among those at risk when each case is diagnosed, i.e. the matched case-control study, the estimator consistently estimates the incidence rate ratio (IRR). The OR is interpreted as the effect of an exposure on the probability of being diseased at the end of follow-up, while the interpretation of the IRR is the effect of an exposure on the probability of becoming diseased. Through a simulation study, the OR from a classical case-control study is shown to be an inconsistent estimator of the IRR. The difference between the OR and the IRR is reflected in the p-value of the null hypothesis of no exposure effect. For multiple testing scenarios, e.g. in a GWAS, these differences in estimators imply a change in comparison between the null hypotheses for different sampling schemes of controls. Population genetics: In population genetics two methods concerning the inference of the population size back in time are described. Both methods are based on the site frequency spectrum (SFS), and the fact that the expected SFS only depends on the time between coalescent events back in time. The first method provides a simple goodness of fit test by comparing the observed SFS with the expected SFS under a given model of population size changes. By the use of Monte Carlo estimation the expected time between coalescent events can be estimated and the expected SFS can thereby be evaluated. Using the classical chi-square statistics we are able to infer single parameter models. Multiple parameter models, e.g. multiple epochs, are harder to identify. By introducing the inference of population size back in time as an inverse problem, the second procedure applies the theory of smoothing splines to infer the changes in population size. By adding a penalising term to the goodness-of-fit described above, we are able to estimate the integrated intensity of the coalescent process by a two times continuous differentiable piecewise cubic polynomial.

M3 - Ph.D. thesis

BT - Designs and Methods for Association Studies and Population Size Inference in Statistical Genetics

PB - Aarhus Universitet

ER -