Modelling count data with excessive zeros: The need for class prediction in zero-inflated models and the issue of data generation in choosing between zero-inflated and generic mixture models for dental caries data

Publication: Contribution to journal/Conference contribution in journal/Contribution to newspaper, Journal article, Research, peer-reviewed

  • Mark S Gilthorpe, Denmark
  • Morten Frydenberg, Denmark
  • Yaping Cheng, Denmark
  • Vibeke Baelum
  • Department of Biostatistics (Afdeling for Biostatistik)
  • Department of Odontology, School of Dentistry (Odontologisk Institut, Tandlægeskolen)
Count data may possess an 'excess' of zeros relative to standard distributions. Zero-inflated Poisson (ZiP) or binomial (ZiB) and generic mixture models have been proposed to deal with such data. We consider biomedical count data with an excess number of zeros and seek to address the following: (i) do zero-inflated models need covariates in the distribution part to predict class membership; (ii) what model-fit criteria have clinical relevance to predicted counts; (iii) can very different model parameterizations have near-identical fit; and (iv) how could model selection and hence model interpretation be aided by considering data generation processes? We show that covariates in the distribution part of zero-inflated models are needed to predict class membership. A range of model-fit criteria should be considered, as consensus is rarely achieved, and considering predicted outcomes may be just as valuable as likelihood-based criteria. Zero-inflated and generic mixture models may be indistinguishable according to both likelihood-based model-fit criteria and predicted outcomes, in which case model differentiation, hence, model selection and interpretation, might be guided by the consideration of a priori data generation processes. Zero-inflated models reflect whether or not there are (or have been) risk differences in disease onset and disease progression, while generic mixture models identify sub-types of individuals with similar risks of disease onset and progression. One or both modelling strategies may be used, though a priori knowledge or clinical impression of data generation might help to distinguish between two or more parameterizations that exhibit similar fit and yield near-identical predicted counts.
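As a minimal illustration of the class-prediction idea mentioned in the abstract, the sketch below uses the standard textbook form of a zero-inflated Poisson model, not the authors' own code or fitted model. The function names and the parameter values (`pi` for the structural-zero probability, `lam` for the Poisson mean) are assumptions chosen only for illustration. It computes the ZiP probability mass and, for an observed count, the posterior probability of belonging to the structural-zero class.

```python
import math


def zip_pmf(y, pi, lam):
    """P(Y = y) under a zero-inflated Poisson model.

    pi  : probability of the structural-zero class (assumed name)
    lam : Poisson mean of the count class (assumed name)
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # A zero can come from either class.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def prob_structural_zero(y, pi, lam):
    """Posterior probability of the structural-zero class given count y."""
    if y > 0:
        # A positive count cannot come from the structural-zero class.
        return 0.0
    return pi / zip_pmf(0, pi, lam)


# Hypothetical parameter values, for illustration only.
pi, lam = 0.3, 2.0
total = sum(zip_pmf(y, pi, lam) for y in range(60))
p_zero_class = prob_structural_zero(0, pi, lam)
```

In this sketch the posterior for an observed zero depends on `lam` as well as on `pi`, so anything that shifts the Poisson mean also shifts the predicted class membership.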
Journal: Statistics in Medicine
Pages (from-to): 3539-53
Number of pages: 14
Status: Published - 2009

Bibliographical note

Copyright (c) 2009 John Wiley & Sons, Ltd.


ID: 18941813