Online · Pacific Time (PT)

Agentic AI Frontier Seminar

A seminar series on Agentic AI: models, tools, memory, multi-agent systems, online learning, and safety, featuring leading researchers and industry experts.

Join us! Register here to get seminar updates.

Recordings (with speaker consent) will be posted on our YouTube channel.

Upcoming Seminar

Online · 2026-03-06 · 09:00–10:00 PT

Talk Title: Goodhart’s Revenge: Reward Hacking in RL-Tuned LLMs, and How We Fight Back

Lifu Huang · Assistant Professor · UC Davis

Talk Abstract: Reinforcement learning methods such as RLHF and RLVR have become the dominant post-training paradigm for large language models. Yet both are fundamentally proxy optimization: they optimize reward signals that only approximate true user intent. Under strong optimization pressure, this gap can produce reward hacking: behaviors such as sycophancy, length bias, and code gaming that score highly on the proxy while undermining truthfulness or robustness. This talk presents three complementary defenses from the PLUM Lab: (1) SMART mitigates sycophancy by training models on uncertainty-aware adaptive reasoning trajectories with dense progress rewards, distilling high-quality reasoning patterns and behaviors into the policy. (2) IR³ performs post-tuning objective forensics by reconstructing the implicit reward, decomposing it into interpretable feature contributions, and surgically repairing hacking-related components. (3) ARA brings robustness into the RLHF loop through adversarial reward auditing: a Hacker–Auditor game actively surfaces exploits, and auditor-gated rewards make exploitative behaviors unprofitable during training. I conclude with open problems and a roadmap toward reward-hacking-resistant alignment.
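For attendees new to the topic, the toy sketch below (not from the talk, and assuming a hypothetical length-biased reward model) illustrates the gap the abstract describes: pushing harder on a proxy reward can keep the score rising while the true objective quietly degrades.

```python
# Toy sketch of Goodhart's law / reward hacking. This is a generic illustration,
# unrelated to the specific methods (SMART, IR3, ARA) presented in the talk.
# Assumptions: the proxy is a length-biased reward model ("longer looks more
# helpful"); true quality peaks at a moderate response length. All names
# (proxy_reward, true_quality, IDEAL_LENGTH) are hypothetical.

IDEAL_LENGTH = 200  # response length (in tokens) at which true quality peaks

def true_quality(length: int) -> float:
    # quality rises toward the ideal length, then falls as responses get padded
    return -((length - IDEAL_LENGTH) / 100.0) ** 2

def proxy_reward(length: int) -> float:
    # length-biased reward model: score grows monotonically with length
    return 0.01 * length

# Naive optimization pressure: keep lengthening the response while the proxy improves.
length = 50
for _ in range(100):
    candidate = length + 10
    if proxy_reward(candidate) > proxy_reward(length):
        length = candidate

print(f"chosen length: {length}")                    # 1050: far past the true optimum
print(f"proxy reward:  {proxy_reward(length):.2f}")  # keeps increasing with length
print(f"true quality:  {true_quality(length):.2f}")  # degrades under proxy pressure
```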

Bio: Lifu Huang is an Assistant Professor of Computer Science at UC Davis. He received his Ph.D. in Computer Science from the University of Illinois Urbana-Champaign in 2020 and was an Assistant Professor at Virginia Tech from 2021 to 2024. His research spans natural language processing and multimodal learning, with an emphasis on the fundamentals and applications of large language and multimodal foundation models. His work has been recognized with an NSF CAREER Award (2023), an Outstanding Paper Award (ACL 2023), a Best Paper Award Honorable Mention (SIGIR 2023), and a Best Paper Award (AI4Research Workshop, AAAI 2025).

Focus Areas

Foundation Models & Core Capabilities

Agent Infrastructure & Tooling

Learning, Adaptation & Feedback

Multi-Agent Systems & Social Intelligence

Evaluation, Safety & Alignment

Applications & Vertical Use Cases

Interface & Interaction Design

Governance, Ethics & Ecosystem Building

Organizing Committee

Ming Jin

Virginia Tech

He is an assistant professor in the Bradley Department of Electrical and Computer Engineering at Virginia Tech. He works on trustworthy AI, safe reinforcement learning, and foundation models, with applications to cybersecurity, power systems, recommender systems, and cyber-physical systems (CPS).

Shangding Gu

UC Berkeley

He is a postdoctoral researcher in the Department of EECS at UC Berkeley. He works on AI safety, reinforcement learning, and robot learning.

Yali Du

KCL

She is an associate professor in AI at King’s College London. She works on reinforcement learning and multi-agent cooperation, covering topics such as generalization, zero-shot coordination, evaluation of human and AI players, and social agency (e.g., human-involved learning, safety, and ethics).