Talk Title: Goodhart’s Revenge: Reward Hacking in RL-Tuned LLMs, and How We Fight Back
Talk Abstract: Reinforcement learning, whether from human feedback (RLHF) or from verifiable rewards (RLVR), has become a dominant post-training paradigm for large language models. Yet both approaches are fundamentally proxy optimization: they optimize reward signals that only approximate true user intent. Under strong optimization pressure, this gap can produce reward hacking: behaviors that score highly on the proxy while undermining truthfulness or robustness, such as sycophancy, length bias, and code gaming. This talk presents three complementary defenses from the PLUM Lab: (1) SMART mitigates sycophancy by training models on uncertainty-aware adaptive reasoning trajectories with dense progress rewards, distilling high-quality reasoning patterns and behaviors into the policy. (2) IR³ performs post-tuning objective forensics by reconstructing the implicit reward, decomposing it into interpretable feature contributions, and surgically repairing hacking-related components. (3) ARA brings robustness into the RLHF loop through adversarial reward auditing: a Hacker–Auditor game actively surfaces exploits, and auditor-gated rewards make exploitative behaviors unprofitable during training. The talk concludes with open problems and a roadmap toward reward-hacking-resistant alignment.
Bio: Lifu Huang is an Assistant Professor of Computer Science at UC Davis. He received his Ph.D. in Computer Science from the University of Illinois Urbana-Champaign in 2020 and was an Assistant Professor at Virginia Tech from 2021 to 2024. His research spans natural language processing and multimodal learning, with an emphasis on the fundamentals and applications of large language and multimodal foundation models. His work has been recognized with an NSF CAREER Award (2023), an Outstanding Paper Award (ACL 2023), a Best Paper Award Honorable Mention (SIGIR 2023), and a Best Paper Award (AI4Research Workshop, AAAI 2025).