TY - JOUR
T1 - Critical evaluation of the effects of a cross-validation strategy and machine learning optimization on the prediction accuracy and transferability of a soybean yield prediction model using UAV-based remote sensing
AU - Habibi, Luthfan Nur
AU - Matsui, Tsutomu
AU - Tanaka, Takashi
PY - 2024/3
Y1 - 2024/3
AB - Crop yield prediction models are critical tools for evaluating growth performance and informing farm management decisions. Developing data-driven yield prediction models that are robust not only within the model's spatial domain but also at additional locations is challenging. The main objective of this study was to identify an appropriate cross-validation (CV) strategy for establishing UAV-based yield prediction models that are transferable across different spatial domains (i.e., that meet extrapolation mapping objectives). We compared three data splitting procedures for the CV protocols: random data splitting (random CV), cluster-based spatial splitting (spatial CV), and field-specific hold-out data splitting (leave-one-field-out CV). We also examined whether model optimization affects the transferability of the yield model, both by performing recursive feature elimination (RFE) and by comparing the algorithms used in the yield prediction model. Three base learner algorithms, namely random forest, XGBoost, and LASSO regression, were used, and a stacked ensemble model built from these base learners was also implemented. The established models were then tested on an independent field, used as the test dataset, to evaluate model transferability. Random CV tracked prediction error poorly when predicting yield beyond the model's spatial domain, whereas spatial CV and leave-one-field-out CV gave better expectations of prediction performance outside the model's training spatial domain. Furthermore, simpler models, such as those implementing LASSO regression and RFE, improved the model's capability in extrapolation tasks. The results of this study suggest that spatially aware CV, rather than conventional random CV, should be the standard method for validating yield models, to ensure more realistic and reliable yield models for extrapolation objectives.
KW - Spatial data
KW - Leave-one-field-out
KW - Extrapolation
KW - Vegetation indices
KW - Spatial clustering
UR - http://www.scopus.com/inward/record.url?scp=85187987750&partnerID=8YFLogxK
U2 - 10.1016/j.jafr.2024.101096
DO - 10.1016/j.jafr.2024.101096
M3 - Journal article
SN - 2666-1543
JO - Journal of Agriculture and Food Research
JF - Journal of Agriculture and Food Research
M1 - 101096
ER -