TY - JOUR
T1 - Can large language models help predict results from a complex behavioural science study?
AU - Lippert, Steffen
AU - Dreber, Anna
AU - Johannesson, Magnus
AU - Tierney, Warren
AU - Cyrus-Lai, Wilson
AU - Uhlmann, Eric Luis
AU - Emotion Expression Collaboration
AU - Elbæk, Christian T.
AU - Tønnesen, Mathilde Hedegaard
AU - Pfeiffer, Thomas
PY - 2024/9/25
Y1 - 2024/9/25
AB - We tested whether large language models (LLMs) can help predict results from a complex behavioural science experiment. In study 1, we investigated the performance of the widely used LLMs GPT-3.5 and GPT-4 in forecasting the empirical findings of a large-scale experimental study of emotions, gender, and social perceptions. We found that GPT-4, but not GPT-3.5, matched the performance of a cohort of 119 human experts, with correlations of 0.89 (GPT-4), 0.07 (GPT-3.5) and 0.87 (human experts) between aggregated forecasts and realized effect sizes. In study 2, providing participants from a university subject pool the opportunity to query a GPT-4 powered chatbot significantly increased the accuracy of their forecasts. Results indicate promise for artificial intelligence (AI) to help anticipate—at scale and minimal cost—which claims about human behaviour will find empirical support and which ones will not. Our discussion focuses on avenues for human–AI collaboration in science.
KW - forecasting
KW - large language models
KW - meta-research
UR - http://www.scopus.com/inward/record.url?scp=85205290129&partnerID=8YFLogxK
U2 - 10.1098/rsos.240682
DO - 10.1098/rsos.240682
M3 - Journal article
C2 - 39323554
SN - 2054-5703
VL - 11
JO - Royal Society Open Science
JF - Royal Society Open Science
IS - 9
M1 - 240682
ER -