Analyzing Effects of Learning Downstream Tasks on Moral Bias in Large Language Models

Niklas Kiehne, Alexander Ljapunov, Marc Bätje, Wolf-Tilo Balke


Abstract
Pre-training and fine-tuning large language models (LMs) is currently the state-of-the-art methodology for enabling data-scarce downstream tasks. However, the derived models still tend to replicate and perpetuate social biases. To understand this process in more detail, this paper investigates the actual effects of learning downstream tasks on moral bias in LMs. We develop methods to assess the agreement of LMs with explicitly codified norms in both the pre-training and fine-tuning stages. Even when a pre-trained foundation model exhibits consistent norms, we find that introducing downstream tasks can lead to unexpected inconsistencies in norm representation. Specifically, we observe two phenomena during fine-tuning across both masked and causal LMs: (1) pre-existing moral bias may be mitigated or amplified, even when the model is presented with opposing views, and (2) prompt sensitivity may be negatively impacted. We provide empirical evidence of models deteriorating into conflicting states, where contradictory answers can easily be triggered by slight modifications of the input sequence. Our findings thus raise concerns about the general ability of LMs to mitigate moral biases effectively.
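
To make the kind of probe described in the abstract concrete, below is a minimal sketch of how one might measure a masked LM's agreement with a codified norm and its prompt sensitivity. This is not the authors' released code: the model choice, the prompt templates, the yes/no verbalizers, and the agreement score are all illustrative assumptions.

```python
# Minimal sketch (not the paper's exact protocol): probe a masked LM's
# agreement with a norm by comparing P(yes) vs. P(no) at the mask position.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-uncased"  # assumption: any masked LM could stand in here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def norm_agreement(action: str, template: str) -> float:
    """Return P(yes) / (P(yes) + P(no)) at the mask position."""
    prompt = template.format(action=action, mask=tokenizer.mask_token)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Locate the mask token in the (single) input sequence.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    probs = logits[0, mask_pos].softmax(dim=-1)
    p_yes = probs[tokenizer.convert_tokens_to_ids("yes")].item()
    p_no = probs[tokenizer.convert_tokens_to_ids("no")].item()
    return p_yes / (p_yes + p_no)

# Prompt sensitivity: paraphrased templates should ideally yield similar
# scores; large gaps signal the inconsistencies the paper reports.
templates = [
    "is it okay to {action}? {mask}.",
    "should you {action}? {mask}.",
]
for t in templates:
    print(t, "->", round(norm_agreement("lie to your friends", t), 3))
```

Running such a probe on the same model before and after fine-tuning on a downstream task, and across a set of paraphrased templates, gives a rough handle on the two phenomena the paper studies: drift in moral bias and degraded prompt sensitivity.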
Anthology ID:
2024.lrec-main.82
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
904–923
URL:
https://aclanthology.org/2024.lrec-main.82
Cite (ACL):
Niklas Kiehne, Alexander Ljapunov, Marc Bätje, and Wolf-Tilo Balke. 2024. Analyzing Effects of Learning Downstream Tasks on Moral Bias in Large Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 904–923, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Analyzing Effects of Learning Downstream Tasks on Moral Bias in Large Language Models (Kiehne et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.82.pdf