Talk Title: Goodhart’s Revenge: Reward Hacking in RL-Tuned LLMs, and How We Fight Back
Talk Abstract: Reinforcement learning, whether from human feedback (RLHF) or from verifiable rewards (RLVR), has become a dominant post-training paradigm for large language models. Yet both approaches are fundamentally proxy optimization: they optimize reward signals that only approximate true user intent. Under strong optimization pressure, this gap can produce reward hacking: behaviors that score highly on the proxy while undermining truthfulness or robustness, such as sycophancy, length bias, and code gaming. This talk presents three complementary defenses from the PLUM Lab: (1) SMART mitigates sycophancy by training models on uncertainty-aware adaptive reasoning trajectories with dense progress rewards, distilling high-quality reasoning patterns and behaviors into the policy. (2) IR³ performs post-tuning objective forensics by reconstructing the implicit reward, decomposing it into interpretable feature contributions, and surgically repairing hacking-related components. (3) ARA brings robustness into the RLHF loop through adversarial reward auditing: a Hacker–Auditor game actively surfaces exploits, and auditor-gated rewards make exploitative behaviors unprofitable during training. The talk concludes with open problems and a roadmap toward reward-hacking-resistant alignment.
Bio: Lifu Huang is an Assistant Professor of Computer Science at UC Davis. He received his Ph.D. in Computer Science from the University of Illinois Urbana-Champaign in 2020 and was an Assistant Professor at Virginia Tech from 2021 to 2024. His research spans natural language processing and multimodal learning, with an emphasis on the fundamentals and applications of large language and multimodal foundation models. His work has been recognized with an NSF CAREER Award (2023), an Outstanding Paper Award (ACL 2023), a Best Paper Award Honorable Mention (SIGIR 2023), and a Best Paper Award (AI4Research Workshop, AAAI 2025).