TY - JOUR
T1 - Accuracy of haplotype estimation and whole genome imputation affects complex trait analyses in complex biobanks
AU - Appadurai, Vivek
AU - Bybjerg-Grauholm, Jonas
AU - Krebs, Morten Dybdahl
AU - Rosengren, Anders
AU - Buil, Alfonso
AU - Ingason, Andrés
AU - Mors, Ole
AU - Børglum, Anders D
AU - Hougaard, David M
AU - Nordentoft, Merete
AU - Mortensen, Preben B
AU - Delaneau, Olivier
AU - Werge, Thomas
AU - Schork, Andrew J
N1 - © 2023. The Author(s).
PY - 2023/12
Y1 - 2023/12
N2 - Sample recruitment for research consortia, biobanks, and personal genomics companies spans years, necessitating genotyping in batches, using different technologies. As marker content on genotyping arrays varies, integrating such datasets is non-trivial and its impact on haplotype estimation (phasing) and whole genome imputation, necessary steps for complex trait analysis, remains under-evaluated. Using the iPSYCH dataset, comprising 130,438 individuals, genotyped in two stages, on different arrays, we evaluated phasing and imputation performance across multiple phasing methods and data integration protocols. While phasing accuracy varied by choice of method and data integration protocol, imputation accuracy varied mostly between data integration protocols. We demonstrate an attenuation in imputation accuracy within samples of non-European origin, highlighting challenges to studying complex traits in diverse populations. Finally, imputation errors can bias association tests and reduce the predictive utility of polygenic scores. Carefully optimized data integration strategies enhance accuracy and replicability of complex trait analyses in complex biobanks.
AB - Sample recruitment for research consortia, biobanks, and personal genomics companies spans years, necessitating genotyping in batches, using different technologies. As marker content on genotyping arrays varies, integrating such datasets is non-trivial and its impact on haplotype estimation (phasing) and whole genome imputation, necessary steps for complex trait analysis, remains under-evaluated. Using the iPSYCH dataset, comprising 130,438 individuals, genotyped in two stages, on different arrays, we evaluated phasing and imputation performance across multiple phasing methods and data integration protocols. While phasing accuracy varied by choice of method and data integration protocol, imputation accuracy varied mostly between data integration protocols. We demonstrate an attenuation in imputation accuracy within samples of non-European origin, highlighting challenges to studying complex traits in diverse populations. Finally, imputation errors can bias association tests and reduce the predictive utility of polygenic scores. Carefully optimized data integration strategies enhance accuracy and replicability of complex trait analyses in complex biobanks.
KW - Biological Specimen Banks
KW - Genome
KW - Genotype
KW - Haplotypes
KW - Humans
KW - Multifactorial Inheritance
UR - http://www.scopus.com/inward/record.url?scp=85146757556&partnerID=8YFLogxK
U2 - 10.1038/s42003-023-04477-y
DO - 10.1038/s42003-023-04477-y
M3 - Journal article
C2 - 36697501
SN - 2399-3642
VL - 6
JO - Communications Biology
JF - Communications Biology
IS - 1
M1 - 101
ER -