Leah von der Heyde, Anna-Carolina Haensch, Alexander Wenz
Assessing bias in LLM-generated synthetic datasets: The case of German voter behavior

SocArXiv preprint
7 S.

[For the more comprehensive version of this work, please see https://osf.io/preprints/socarxiv/8je9g ] The rise of large language models (LLMs) like GPT-3 has sparked interest in their potential for creating synthetic datasets, particularly in the realm of privacy research. This study critically evaluates the use of LLMs in generating synthetic public opinion data, pointing out the biases inherent in the data generation process. While LLMs, trained on vast internet datasets, can mimic societal attitudes and behaviors, their application in synthesizing data poses significant privacy and accuracy challenges. We investigate these issues using the case of vote choice prediction in the 2017 German federal elections. Employing GPT-3, we construct synthetic personas based on the German Longitudinal Election Study, prompting the LLM to predict voting behavior. Our analysis compares these LLM-generated predictions with actual survey data, focusing on the implications of using such synthetic data and the biases it may contain. The results demonstrate GPT-3’s propensity to inaccurately predict voter choices, with biases favoring certain political groups and more predictable voter profiles. This outcome raises critical questions about the reliability and ethical use of LLMs in generating synthetic data.