arXiv:2101.11718 [cs.CL]

BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta

Published 2021-01-27 (Version 1)

Recent advances in deep learning techniques have enabled machines to generate cohesive open-ended text when prompted with a sequence of words as context. While these models now empower many downstream applications from conversation bots to automatic storytelling, they have been shown to generate texts that exhibit social biases. To systematically study and benchmark social biases in open-ended language generation, we introduce the Bias in Open-Ended Language Generation Dataset (BOLD), a large-scale dataset that consists of 23,679 English text generation prompts for bias benchmarking across five domains: profession, gender, race, religion, and political ideology. We also propose new automated metrics for toxicity, psycholinguistic norms, and text gender polarity to measure social biases in open-ended text generation from multiple angles. An examination of text generated from three popular language models reveals that the majority of these models exhibit a larger social bias than human-written Wikipedia text across all domains. With these results we highlight the need to benchmark biases in open-ended language generation and caution users of language generation models on downstream tasks to be cognizant of these embedded prejudices.
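
As a rough illustration of the benchmarking workflow the abstract describes (prompting a language model with BOLD prompts and scoring the continuations, here for toxicity), the following Python sketch shows one possible setup. It is not the authors' evaluation code: the Hugging Face dataset identifier "AlexaAI/bold", its field names ("prompts", "domain"), the GPT-2 generator, and the "unitary/toxic-bert" classifier are assumptions made for illustration.

```python
from collections import defaultdict

from datasets import load_dataset
from transformers import pipeline

# Load BOLD-style prompts (dataset id and field names are assumptions).
prompts = load_dataset("AlexaAI/bold", split="train")

# Model under test and a toxicity scorer (both chosen only for illustration).
generator = pipeline("text-generation", model="gpt2")
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

scores = defaultdict(list)
for example in prompts.select(range(200)):  # small sample to keep the sketch fast
    prompt = example["prompts"][0]          # each record holds a list of prompts (assumed)
    generated = generator(prompt, max_new_tokens=30, do_sample=True)[0]["generated_text"]
    continuation = generated[len(prompt):]  # score only the model's continuation
    result = toxicity(continuation[:512])[0]
    # Map the classifier output to a toxicity probability (label name assumed).
    score = result["score"] if result["label"].lower() == "toxic" else 1.0 - result["score"]
    scores[example["domain"]].append(score)

# Compare mean toxicity of generations across BOLD domains.
for domain, values in sorted(scores.items()):
    print(f"{domain}: mean toxicity = {sum(values) / len(values):.3f}")
```

Analogous scorers for the paper's other metrics, such as psycholinguistic norms or text gender polarity, could be substituted for the toxicity classifier in the same loop.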

Related articles:
arXiv:2403.20147 [cs.CL] (Published 2024-03-29): IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context
arXiv:2307.16457 [cs.CL] (Published 2023-07-31): A Benchmark for Understanding Dialogue Safety in Mental Health Support
arXiv:2205.06262 [cs.CL] (Published 2022-05-12): FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue (Alon Albalak et al.)