NeurIPS 2025 Best Papers in Comics

Last week, the NeurIPS 2025 Best Paper Awards were announced.

As I did with the ICML 2025 Outstanding Papers, I ran my automated paper review system to create a summary of all the award-winning and runner-up papers. This time I also prepared comics that explain each paper in one image.

Here are the results.

Authors: Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Yejin Choi
Paper: https://arxiv.org/abs/2510.22954, NeurIPS submission
Code: https://github.com/liweijiang/artificial-hivemind
Dataset: HF collection

What was done? The authors introduce Infinity-Chat, a dataset of 26K real-world open-ended questions, to systematically evaluate output diversity in 70+ state-of-the-art LLMs. They identify a widespread “artificial hivemind” phenomenon where models exhibit extreme mode collapse: each model repeatedly generates the same outputs (intra-model), and models from different families converge on surprisingly similar responses (inter-model).

Why it matters? This invalidates the common assumption that raising the temperature or using model ensembles guarantees diversity. The study shows that modern RLHF and instruction tuning have homogenized the “creative” latent space of models to such an extent that models from different families (for example, DeepSeek and GPT-4) behave as nearly identical clones on open-ended tasks. Furthermore, it shows that existing reward models are poorly calibrated to diverse human preferences (pluralism), failing to correctly score valid but idiosyncratic responses.
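To make the intra-model and inter-model notions concrete, here is a minimal sketch of how such homogeneity could be scored. It assumes a toy token-overlap (Jaccard) similarity and made-up responses. The paper's own similarity measure and data are not reproduced here, and the function names are hypothetical.

```python
from itertools import combinations, product


def jaccard(a: str, b: str) -> float:
    """Token-set overlap of two responses (toy stand-in for an embedding similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_similarity(responses: list[str]) -> float:
    """Intra-model score: average similarity over all pairs of samples from ONE model."""
    if len(responses) < 2:
        raise ValueError("need at least two responses")
    pairs = list(combinations(responses, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def mean_cross_similarity(responses_a: list[str], responses_b: list[str]) -> float:
    """Inter-model score: average similarity over all pairs drawn from TWO models."""
    if not responses_a or not responses_b:
        raise ValueError("need at least one response per model")
    pairs = list(product(responses_a, responses_b))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up samples for one open-ended prompt (illustrative only, not from the dataset).
model_a = ["time is a river", "time is a river that flows", "time is a flowing river"]
model_b = ["time is a river", "time is a thief"]

intra_a = mean_pairwise_similarity(model_a)
inter_ab = mean_cross_similarity(model_a, model_b)
print(f"intra-model similarity (A): {intra_a:.2f}")
print(f"inter-model similarity (A vs B): {inter_ab:.2f}")
```

High values on both scores, even with sampling at high temperature, would be the signature of the hivemind effect described above.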



Authors: Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin (Qwen Team)
Paper: https://arxiv.org/abs/2505.06708, NeurIPS submission
Code: https://github.com/qiuzh20/gated_attention
Model: https://huggingface.co/collections/Qwen/qwen3-next

What was done? The authors introduce gated attention, a mechanism that applies a learnable, input-dependent sigmoid gate immediately after the scaled dot-product attention (SDPA) output. By modulating the attention output $Y$ with a gate $\sigma(XW_\theta)$, this method introduces element-wise sparsity and non-linearity before the final output projection.

Why it matters? This simple architectural modification improves stability in large-scale training (eliminating loss spikes) and consistently improves performance on 15B MoE and 1.7B dense models. Importantly, it mechanistically eliminates the “attention sink” phenomenon and “massive activations” without the need for heuristic fixes such as “sink tokens”, thereby significantly improving long-context extrapolation.
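The gating step is small enough to sketch in a few lines. The NumPy version below is an illustration under assumptions, not the authors' implementation: it uses a single causal head, random untrained weights, and hypothetical parameter names (`Wg` plays the role of the gate weights). The only part taken from the description above is the placement of the sigmoid gate: after SDPA, before the output projection.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def gated_attention(X, Wq, Wk, Wv, Wg, Wo):
    """Single-head causal attention with an element-wise sigmoid gate on the SDPA output."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Causal mask: position t may only attend to positions <= t.
    T = X.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    Y = softmax(scores) @ V        # plain SDPA output
    gate = sigmoid(X @ Wg)         # input-dependent gate, values in (0, 1)
    return (Y * gate) @ Wo         # gate first, then the output projection


rng = np.random.default_rng(0)
T, d_model, d_head = 5, 8, 4
X = rng.standard_normal((T, d_model))
Wq, Wk, Wv, Wg = (rng.standard_normal((d_model, d_head)) * 0.1 for _ in range(4))
Wo = rng.standard_normal((d_head, d_model)) * 0.1

out = gated_attention(X, Wq, Wk, Wv, Wg, Wo)
print(out.shape)  # (5, 8)
```

Because the gate depends on the current token's input, a token can scale its own attention output toward zero instead of parking attention mass on a sink token.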



Authors: Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzciński, Benjamin Eysenbach
Paper: https://openreview.net/forum?id=s0JVsx3bx1
Project Page: https://wang-kevin3290.github.io/scaling-crl/

What was done? The authors successfully scaled reinforcement learning (RL) policies from the standard 2-5 layers to over 1,000 layers using self-supervised RL (specifically Contrastive RL) combined with modern architectural choices such as residual connections, LayerNorm, and Swish activations.

Why it matters? This challenges the prevailing dogma that RL does not benefit from depth. While standard algorithms like SAC saturate or collapse with deep networks, this work shows that Contrastive RL allows continued performance scaling (20x-50x gains), enabling agents to solve long-horizon humanoid mazes and develop emergent locomotion skills without explicit reward engineering.
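As a rough illustration of the architectural ingredients named above (residual connections, LayerNorm, Swish), here is a forward-pass sketch of a very deep residual MLP in NumPy. The block layout, width, and random weights are assumptions made for the example, and no Contrastive RL training is shown. The paper's actual networks may be arranged differently.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each row to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def swish(x: np.ndarray) -> np.ndarray:
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def residual_block(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Dense -> LayerNorm -> Swish, added back onto the input (skip connection)."""
    return x + swish(layer_norm(x @ W))


def deep_forward(x: np.ndarray, depth: int, seed: int = 0) -> np.ndarray:
    """Run x through `depth` residual blocks with random (untrained) weights."""
    rng = np.random.default_rng(seed)
    width = x.shape[-1]
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        x = residual_block(x, W)
    return x


batch = np.random.default_rng(1).standard_normal((4, 32))
deep_out = deep_forward(batch, depth=1000)
print(deep_out.shape, bool(np.all(np.isfinite(deep_out))))
```

The skip connection keeps a direct path from input to output, and LayerNorm keeps each block's contribution at a fixed scale, which is what lets the signal survive a 1,000-block stack in this toy setting.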



Authors: Tony Bonnaire, Raphaël Urfin, Giulio Biroli, Marc Mézard
Paper: https://arxiv.org/abs/2505.17638, NeurIPS submission
Code: https://github.com/tbonnair/Why-Diffusion-Models-Don-t-Memorize

What was done? The authors provide a theoretical and empirical analysis of the training dynamics of score-based diffusion models. Recognizing that models can eventually overfit, they identify two distinct timescales: $\tau_{\mathrm{gen}}$, when the model learns to generate valid samples, and $\tau_{\mathrm{mem}}$, when it starts memorizing specific training examples. This work received a Best Paper Award at NeurIPS 2025.

Why it matters? This work resolves the paradox of why over-parameterized diffusion models generalize despite having the capacity to completely memorize the training data. By proving that $\tau_{\mathrm{mem}}$ scales linearly with dataset size $n$ whereas $\tau_{\mathrm{gen}}$ remains constant, the paper establishes that early stopping is not just a heuristic but a structural requirement driven by implicit dynamical regularization. This explains why larger datasets widen the safe window for training, allowing models to generalize robustly at scale.
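The widening window is easy to see with a back-of-the-envelope calculation. The sketch below assumes a memorization time that grows linearly with the dataset size and a generalization time that stays fixed. The constants (`tau_gen=5_000`, `mem_rate=10`) and the function name are invented for illustration and do not come from the paper.

```python
def generalization_window(n: int, tau_gen: float, mem_rate: float) -> tuple[float, float]:
    """Return (start, end) of the training-time window with generalization but no memorization.

    Assumes tau_mem = mem_rate * n (linear in dataset size n) and a constant tau_gen.
    """
    tau_mem = mem_rate * n
    return tau_gen, tau_mem


widths = []
for n in (1_000, 10_000, 100_000):
    start, end = generalization_window(n, tau_gen=5_000.0, mem_rate=10.0)
    width = max(0.0, end - start)  # no safe window if memorization starts first
    widths.append(width)
    print(f"n={n:>7}: stop training between step {start:,.0f} and {end:,.0f} (width {width:,.0f})")
```

Under these assumptions, a tenfold larger dataset pushes the memorization onset ten times later while the start of the window stays put, so the range of safe stopping times keeps growing.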



Authors: Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
Paper: https://arxiv.org/abs/2504.13837, NeurIPS submission
Code: https://limit-of-rlvr.github.io

What was done? In this NeurIPS 2025 Best Paper runner-up, the authors systematically investigate the reasoning limits of large language models (LLMs) trained via reinforcement learning with verifiable rewards (RLVR). Using unbiased pass@k metrics across mathematics, coding, and visual reasoning tasks, they compare base models against their RL-tuned counterparts to determine whether RLVR generates novel reasoning patterns or merely amplifies existing ones.
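For readers unfamiliar with pass@k: it is the probability that at least one of k sampled answers is correct. A minimal sketch of the standard unbiased estimator from the code-generation literature follows. That the paper uses exactly this form is an assumption here, and the sample counts in the example are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    subset of k samples contains no correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: only 2 of 256 samples from a model are correct.
print(pass_at_k(256, 2, 1))    # small: a single sample rarely succeeds
print(pass_at_k(256, 2, 256))  # 1.0: the problem is within reach given enough samples
```

This is why comparing models at large k matters: a problem that looks unsolved at k=1 can still lie inside a model's capability range.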

Why it matters? The findings challenge the prevailing narrative that RLVR allows models to autonomously discover “superhuman” strategies, similar to AlphaGo. The study shows that while RLVR significantly improves sampling efficiency (correct answers appear more often), it does not extend the model's fundamental reasoning capability range. In fact, at large k, base models often solve more unique problems than their RL-trained versions, suggesting that current RL methods are bounded by the priors of the pre-trained models.



Authors: Zachary Chase, Steve Hanneke, Shay Moran, Jonathan Shafer
Paper: https://openreview.net/forum?id=EoebmBe9fG

What was done? The authors resolved a 30-year-old open problem in learning theory by establishing tight mistake bounds for transductive online learning. Recognized as a Best Paper Runner-up at NeurIPS 2025, they proved that for a hypothesis class with Littlestone dimension $D$, the optimal mistake bound is $\Theta(\sqrt{D})$.

Why it matters? This result shows exactly how much “looking ahead” helps. It proves that access to the unlabeled sequence of future test points allows a quadratic reduction in mistakes compared to the standard online setting (where the bound is $D$). This closes an exponential gap between the previous best known lower bound of $\Omega(\log D)$ and the upper bound of $O(D)$.
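To get a feel for the sizes involved, the three growth rates can be compared at a single value. The choice $D = 1024$ is arbitrary, and the constants hidden in the asymptotic notation are ignored, so this is only an order-of-magnitude picture.

```latex
\begin{aligned}
\text{previous lower bound:}\quad & \Omega(\log D) \\
\text{transductive online (tight):}\quad & \Theta(\sqrt{D}) \\
\text{standard online:}\quad & D \\[4pt]
\text{for } D = 1024:\quad & \log_2 D = 10, \qquad \sqrt{D} = 32, \qquad D = 1024
\end{aligned}
```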



Authors: Yizhou Liu, Ziming Liu, and Jeff Gore
Paper: https://arxiv.org/abs/2505.10465, NeurIPS submission
Code: https://github.com/liuyz0/SuperpositionScaling

What was done? The authors propose a mechanistic explanation for neural scaling laws by linking them to representation superposition. By adopting a sparse autoencoder framework and validating on open-source LLMs (OPT, Pythia, Qwen), they demonstrate that when models operate in the “strong superposition” regime (representing far more features than they have dimensions), the loss scales inversely with the width of the model ($L \propto 1/m$). This scaling is driven by the geometric interference between feature vectors rather than the statistical properties of the data tail.
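The inverse-width scaling has a simple geometric intuition that can be checked numerically. The sketch below packs many random unit vectors into an m-dimensional space and measures their mean squared overlap, which comes out close to 1/m. Using random directions is an assumption made for illustration: the paper studies learned features in trained models, so this shows only the natural scale of the interference, not the paper's derivation.

```python
import numpy as np


def mean_squared_overlap(num_features: int, width: int, seed: int = 0) -> float:
    """Average squared dot product between distinct random unit vectors in R^width."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, width))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # one unit vector per feature
    gram = W @ W.T                                   # all pairwise dot products
    off_diagonal = gram[~np.eye(num_features, dtype=bool)]
    return float(np.mean(off_diagonal ** 2))


# Far more features (512) than dimensions (m): the "strong superposition" setting.
for m in (16, 32, 64, 128):
    overlap = mean_squared_overlap(num_features=512, width=m)
    print(f"m={m:>3}  mean squared overlap={overlap:.5f}  overlap * m={overlap * m:.3f}")
```

Doubling the width roughly halves the interference between features, which mirrors the inverse-width behavior of the loss described above.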

Why it matters? This work, a NeurIPS 2025 Best Paper runner-up, provides a first-principles derivation of scaling laws that is robust to data distributions. In contrast to previous theories relying on manifold approximations, this research suggests that the “power law” behavior of LLMs is a geometric imperative of compressing sparse concepts into dense spaces. This implies that overcoming these scaling constraints requires architectural intervention to manage feature interference, as the geometric constraint cannot be bypassed by simply adding more data.



I hope you find it fun and useful 😁


