arXiv Analytics

arXiv:2404.06349 [cs.LG]

CausalBench: A Comprehensive Benchmark for Causal Learning Capability of Large Language Models

Yu Zhou, Xingyu Wu, Beicheng Huang, Jibin Wu, Liang Feng, Kay Chen Tan

Published 2024-04-09 (Version 1)

Causality reveals the fundamental principles behind data distributions in real-world scenarios, and the ability of large language models (LLMs) to understand causality directly affects their efficacy in explaining outputs, adapting to new evidence, and generating counterfactuals. With the proliferation of LLMs, evaluating this capability has attracted growing attention. However, the absence of a comprehensive benchmark has left existing evaluation studies simplistic and homogeneous. To address these challenges, this paper proposes a comprehensive benchmark, namely CausalBench, to evaluate the causality-understanding capabilities of LLMs. Originating from the causal research community, CausalBench encompasses three causal learning-related tasks, which facilitate a convenient comparison of LLMs' performance with classic causal learning algorithms. Meanwhile, causal networks of varying scales and densities are integrated into CausalBench to explore the upper limits of LLMs' capabilities across task scenarios of varying difficulty. Notably, background knowledge and structured data are also incorporated into CausalBench to thoroughly unlock the underlying potential of LLMs for long-text comprehension and prior-information utilization. Based on CausalBench, this paper evaluates nineteen leading LLMs and draws insightful conclusions from several perspectives. First, we present the strengths and weaknesses of LLMs and quantitatively explore the upper limits of their capabilities across various scenarios. Second, we examine the adaptability of LLMs to specific network structures and to complex chain-of-thought structures. Finally, the paper quantitatively presents the differences across diverse information sources and uncovers the gap between LLMs' causal-understanding capabilities in textual contexts and in numerical domains.
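The abstract describes comparing LLMs against classic causal learning algorithms on causal networks. Benchmarks of this kind are commonly scored at the edge level, comparing a model's predicted causal edges with a ground-truth network via precision, recall, and F1. The sketch below is illustrative only; the function and variable names are our own and are not taken from the paper:

```python
# Illustrative sketch (not from the paper): scoring predicted causal
# edges against a ground-truth causal network at the edge level.

def edge_f1(true_edges, pred_edges):
    """Precision, recall, and F1 over directed edges (node-name pairs)."""
    true_set, pred_set = set(true_edges), set(pred_edges)
    tp = len(true_set & pred_set)          # correctly recovered edges
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy ground-truth chain X -> Y -> Z, and a prediction that adds a
# spurious direct edge X -> Z (a typical confusion of chains with forks).
truth = [("X", "Y"), ("Y", "Z")]
pred = [("X", "Y"), ("Y", "Z"), ("X", "Z")]
p, r, f1 = edge_f1(truth, pred)
```

Because edges are directed, a reversed edge counts as both a false positive and a false negative, which penalizes models that recover the skeleton but not the orientation of a causal network.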

Related articles:
arXiv:2407.07457 [cs.LG] (Published 2024-07-10)
GLBench: A Comprehensive Benchmark for Graph with Large Language Models
Yuhan Li et al.
arXiv:2306.11222 [cs.LG] (Published 2023-06-20)
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
arXiv:2305.12143 [cs.LG] (Published 2023-05-20)
Learning Horn Envelopes via Queries from Large Language Models