Talk Title: Measuring all the noises of agentic LLM Evals
Talk Abstract: As LLM benchmark questions grow more complex, requiring many hours and tokens per question, evaluation sample sizes have shrunk, heightening the risk of being fooled by randomness. I present a principled approach for measuring and understanding noise in LLM evaluations by clearly distinguishing between total noise, data noise, and the prediction noise intrinsic to LLMs. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies paired analysis to all pairs of LLMs and measures all the noise components from millions of question-level predictions across many evals and settings. These measurements reveal clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means that averaging over repeated predictions can substantially increase statistical power. These findings enable practitioners to assess significance without custom testing and to detect much smaller effects in controlled experiments.
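To give a flavor of why paired analysis gains power, here is a minimal illustrative sketch (not the talk's actual method or data): two hypothetical models are scored on the same questions, so shared question difficulty correlates their scores, and analyzing per-question differences cancels that shared component. All names and numbers below are made up for illustration.

```python
import random
import statistics

random.seed(0)

n = 200  # questions in a hypothetical eval
difficulty = [random.random() for _ in range(n)]

# Each model solves a question if its (noisy) skill exceeds the difficulty;
# the shared difficulty makes the two score vectors positively correlated.
model_a = [1 if d < 0.6 + random.gauss(0, 0.1) else 0 for d in difficulty]
model_b = [1 if d < 0.5 + random.gauss(0, 0.1) else 0 for d in difficulty]

# Unpaired: treat the two samples as independent; the shared difficulty
# (data noise) inflates the variance of the estimated score gap.
var_unpaired = (statistics.variance(model_a) + statistics.variance(model_b)) / n

# Paired: take per-question score differences; the shared difficulty cancels,
# leaving a smaller variance for the same estimated gap.
diffs = [a - b for a, b in zip(model_a, model_b)]
var_paired = statistics.variance(diffs) / n

print(f"unpaired SE of the gap: {var_unpaired ** 0.5:.3f}")
print(f"paired SE of the gap:   {var_paired ** 0.5:.3f}")
```

In this toy setup the paired standard error comes out well below the unpaired one, which is the intuition behind applying the paired analysis to all pairs of LLMs.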
Bio: Sida Wang is a Research Scientist at FAIR, Meta. His recent research focuses on evals and agentic RL for Code LLMs. He is partly responsible for SWE-RL (CWM) and well-known evals such as CRUXEval and LiveCodeBench. He completed his Ph.D. in Computer Science at Stanford, co-advised by Christopher D. Manning and Percy Liang, where he worked on pre-LLM interactive learning agents, ideas that are largely realized now. Before that, he got his start in research helping Geoff Hinton invent capsules.