A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide

Research output: Contribution to journal/Conference contribution in journal/Contribution to newspaper; Journal article; Research; peer-review

Standard

A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide. / Chen, Jie; de Hoogh, Kees; Gulliver, John; Hoffmann, Barbara; Hertel, Ole; Ketzel, Matthias; Bauwelinck, Mariska; van Donkelaar, Aaron; Hvidtfeldt, Ulla A.; Katsouyanni, Klea; Janssen, Nicole A.H.; Martin, Randall V.; Samoli, Evangelia; Schwartz, Per E.; Stafoggia, Massimo; Bellander, Tom; Strak, Maciek; Wolf, Kathrin; Vienneau, Danielle; Vermeulen, Roel; Brunekreef, Bert; Hoek, Gerard.

In: Environment International, Vol. 130, 104934, 09.2019.

Harvard

Chen, J, de Hoogh, K, Gulliver, J, Hoffmann, B, Hertel, O, Ketzel, M, Bauwelinck, M, van Donkelaar, A, Hvidtfeldt, UA, Katsouyanni, K, Janssen, NAH, Martin, RV, Samoli, E, Schwartz, PE, Stafoggia, M, Bellander, T, Strak, M, Wolf, K, Vienneau, D, Vermeulen, R, Brunekreef, B & Hoek, G 2019, 'A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide', Environment International, vol. 130, 104934. https://doi.org/10.1016/j.envint.2019.104934

APA

Chen, J., de Hoogh, K., Gulliver, J., Hoffmann, B., Hertel, O., Ketzel, M., Bauwelinck, M., van Donkelaar, A., Hvidtfeldt, U. A., Katsouyanni, K., Janssen, N. A. H., Martin, R. V., Samoli, E., Schwartz, P. E., Stafoggia, M., Bellander, T., Strak, M., Wolf, K., Vienneau, D., ... Hoek, G. (2019). A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide. Environment International, 130, [104934]. https://doi.org/10.1016/j.envint.2019.104934

CBE

Chen J, de Hoogh K, Gulliver J, Hoffmann B, Hertel O, Ketzel M, Bauwelinck M, van Donkelaar A, Hvidtfeldt UA, Katsouyanni K, Janssen NAH, Martin RV, Samoli E, Schwartz PE, Stafoggia M, Bellander T, Strak M, Wolf K, Vienneau D, Vermeulen R, Brunekreef B, Hoek G. 2019. A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide. Environment International. 130:Article 104934. https://doi.org/10.1016/j.envint.2019.104934

Author

Chen, Jie ; de Hoogh, Kees ; Gulliver, John ; Hoffmann, Barbara ; Hertel, Ole ; Ketzel, Matthias ; Bauwelinck, Mariska ; van Donkelaar, Aaron ; Hvidtfeldt, Ulla A. ; Katsouyanni, Klea ; Janssen, Nicole A.H. ; Martin, Randall V. ; Samoli, Evangelia ; Schwartz, Per E. ; Stafoggia, Massimo ; Bellander, Tom ; Strak, Maciek ; Wolf, Kathrin ; Vienneau, Danielle ; Vermeulen, Roel ; Brunekreef, Bert ; Hoek, Gerard. / A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide. In: Environment International. 2019 ; Vol. 130.

Bibtex

@article{f0232a1efb8242f4b1e58408951a7102,
title = "A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide",
abstract = "Empirical spatial air pollution models have been applied extensively to assess exposure in epidemiological studies with increasingly sophisticated and complex statistical algorithms beyond ordinary linear regression. However, different algorithms have rarely been compared in terms of their predictive ability. This study compared 16 algorithms to predict annual average fine particle (PM2.5) and nitrogen dioxide (NO2) concentrations across Europe. The evaluated algorithms included linear stepwise regression, regularization techniques and machine learning methods. Air pollution models were developed based on the 2010 routine monitoring data from the AIRBASE dataset maintained by the European Environmental Agency (543 sites for PM2.5 and 2399 sites for NO2), using satellite observations, dispersion model estimates and land use variables as predictors. We compared the models by performing five-fold cross-validation (CV) and by external validation (EV) using annual average concentrations measured at 416 (PM2.5) and 1396 sites (NO2) from the ESCAPE study. We further assessed the correlations between predictions by each pair of algorithms at the ESCAPE sites. For PM2.5, the models performed similarly across algorithms with a mean CV R2 of 0.59 and a mean EV R2 of 0.53. Generalized boosted machine, random forest and bagging performed best (CV R2~0.63; EV R2 0.58–0.61), while backward stepwise linear regression, support vector regression and artificial neural network performed less well (CV R2 0.48–0.57; EV R2 0.39–0.46). Most of the PM2.5 model predictions at ESCAPE sites were highly correlated (R2 > 0.85, with the exception of predictions from the artificial neural network). For NO2, the models performed even more similarly across different algorithms, with CV R2s ranging from 0.57 to 0.62, and EV R2s ranging from 0.49 to 0.51. The predicted concentrations from all algorithms at ESCAPE sites were highly correlated (R2 > 0.9). 
For both pollutants, biases were low for all models except the artificial neural network. Dispersion model estimates and satellite observations were two of the most important predictors for PM2.5 models whilst dispersion model estimates and traffic variables were most important for NO2 models in all algorithms that allow assessment of the importance of variables. Different statistical algorithms performed similarly when modelling spatial variation in annual average air pollution concentrations using a large number of training sites.",
keywords = "Fine particles, Land use regression, Machine learning, Nitrogen dioxide",
author = "Jie Chen and {de Hoogh}, Kees and John Gulliver and Barbara Hoffmann and Ole Hertel and Matthias Ketzel and Mariska Bauwelinck and {van Donkelaar}, Aaron and Hvidtfeldt, {Ulla A.} and Klea Katsouyanni and Janssen, {Nicole A.H.} and Martin, {Randall V.} and Evangelia Samoli and Schwartz, {Per E.} and Massimo Stafoggia and Tom Bellander and Maciek Strak and Kathrin Wolf and Danielle Vienneau and Roel Vermeulen and Bert Brunekreef and Gerard Hoek",
year = "2019",
month = sep,
doi = "10.1016/j.envint.2019.104934",
language = "English",
volume = "130",
journal = "Environment International",
issn = "0160-4120",
publisher = "Pergamon Press",

}

RIS

TY - JOUR

T1 - A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide

AU - Chen, Jie

AU - de Hoogh, Kees

AU - Gulliver, John

AU - Hoffmann, Barbara

AU - Hertel, Ole

AU - Ketzel, Matthias

AU - Bauwelinck, Mariska

AU - van Donkelaar, Aaron

AU - Hvidtfeldt, Ulla A.

AU - Katsouyanni, Klea

AU - Janssen, Nicole A.H.

AU - Martin, Randall V.

AU - Samoli, Evangelia

AU - Schwartz, Per E.

AU - Stafoggia, Massimo

AU - Bellander, Tom

AU - Strak, Maciek

AU - Wolf, Kathrin

AU - Vienneau, Danielle

AU - Vermeulen, Roel

AU - Brunekreef, Bert

AU - Hoek, Gerard

PY - 2019/9

Y1 - 2019/9

AB - Empirical spatial air pollution models have been applied extensively to assess exposure in epidemiological studies with increasingly sophisticated and complex statistical algorithms beyond ordinary linear regression. However, different algorithms have rarely been compared in terms of their predictive ability. This study compared 16 algorithms to predict annual average fine particle (PM2.5) and nitrogen dioxide (NO2) concentrations across Europe. The evaluated algorithms included linear stepwise regression, regularization techniques and machine learning methods. Air pollution models were developed based on the 2010 routine monitoring data from the AIRBASE dataset maintained by the European Environmental Agency (543 sites for PM2.5 and 2399 sites for NO2), using satellite observations, dispersion model estimates and land use variables as predictors. We compared the models by performing five-fold cross-validation (CV) and by external validation (EV) using annual average concentrations measured at 416 (PM2.5) and 1396 sites (NO2) from the ESCAPE study. We further assessed the correlations between predictions by each pair of algorithms at the ESCAPE sites. For PM2.5, the models performed similarly across algorithms with a mean CV R2 of 0.59 and a mean EV R2 of 0.53. Generalized boosted machine, random forest and bagging performed best (CV R2~0.63; EV R2 0.58–0.61), while backward stepwise linear regression, support vector regression and artificial neural network performed less well (CV R2 0.48–0.57; EV R2 0.39–0.46). Most of the PM2.5 model predictions at ESCAPE sites were highly correlated (R2 > 0.85, with the exception of predictions from the artificial neural network). For NO2, the models performed even more similarly across different algorithms, with CV R2s ranging from 0.57 to 0.62, and EV R2s ranging from 0.49 to 0.51. The predicted concentrations from all algorithms at ESCAPE sites were highly correlated (R2 > 0.9). 
For both pollutants, biases were low for all models except the artificial neural network. Dispersion model estimates and satellite observations were two of the most important predictors for PM2.5 models whilst dispersion model estimates and traffic variables were most important for NO2 models in all algorithms that allow assessment of the importance of variables. Different statistical algorithms performed similarly when modelling spatial variation in annual average air pollution concentrations using a large number of training sites.

KW - Fine particles

KW - Land use regression

KW - Machine learning

KW - Nitrogen dioxide

UR - http://www.scopus.com/inward/record.url?scp=85067523711&partnerID=8YFLogxK

U2 - 10.1016/j.envint.2019.104934

DO - 10.1016/j.envint.2019.104934

M3 - Journal article

C2 - 31229871

AN - SCOPUS:85067523711

VL - 130

JO - Environment International

JF - Environment International

SN - 0160-4120

M1 - 104934

ER -