May 31, 2026 · Sunday

vLLM v0.22.0 Released with DeepSeek V4 Hardening and Experimental Rust Frontend

459 commits from 230 contributors deliver NVFP4 fused MoE, full CUDA graph support, and a 28.9% latency reduction via batch-invariant Cutlass FP8.

vLLM v0.22.0 release highlights — one of the largest coordinated efforts in open-source LLM serving.

vLLM v0.22.0 has been released with 459 commits from 230 contributors, including 63 new contributors. The headline feature is DeepSeek V4 hardening: NVFP4 fused Mixture-of-Experts kernels, full and piecewise CUDA graph support, and ROCm compatibility land in this release. An experimental Rust frontend ships in-tree for the first time, marking a major architectural expansion beyond the Python codebase. The batch-invariant Cutlass FP8 kernel achieves a 28.9% reduction in end-to-end latency, while Model Runner V2 continues to advance with better multi-GPU scheduling and memory management. This release underscores vLLM’s position as the de facto open-source inference engine for state-of-the-art LLMs.

NVIDIA Releases Fixed DeepSeek-V4-Pro-NVFP4 on Hugging Face

A repaired version of the DeepSeek V4 Pro model with NVFP4 quantization is now publicly available, advancing open science and AI accessibility.

NVIDIA has published a fixed version of the DeepSeek-V4-Pro-NVFP4 model on Hugging Face, aiming to promote AI accessibility through open source and open science. The release addresses issues in the earlier NVFP4-quantized variant and makes the model available for researchers and developers worldwide. This move reflects the growing commitment among major AI labs to distribute production-grade models through public repositories, enabling broader scrutiny, reproduction, and downstream innovation.

UK AI Safety Institute Open Sources Evaluation Datasets and Deception Detection Models

The UK AISI now hosts 490 models and 36 datasets publicly on Hugging Face.

The UK AI Safety Institute has publicly released its evaluations, datasets, and models on Hugging Face, including lie-detection datasets, models trained via chain-of-thought to lie, linear-probe deception detectors, and classifiers for the “Did you lie?” paper. The institute’s Hugging Face organization now lists 53 members and 10 collections, with 490 models and 36 datasets available for researchers worldwide to scrutinize, reproduce, and build upon. This marks a significant step toward transparent, community-driven AI safety research.

Late-Interaction Sparse Retrieval via Unsupervised Sparse Autoencoders and Neuron-Level Indexing

A new late-interaction sparse retrieval method combines unsupervised sparse autoencoders with neuron-level inverted indexing, significantly outperforming directly trained sparse retrievers. The approach avoids the cost of multi-vector retrieval by using sparse representations that activate only at relevant neurons, enabling efficient top-k search without sacrificing retrieval quality. This work builds on the ColBERT family of late-interaction models and suggests a path toward more efficient and interpretable neural retrieval systems.
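The general idea can be illustrated with a toy sketch. This is not the paper's implementation: the sparse token codes below are hand-written stand-ins for sparse-autoencoder outputs, and the names `DOCS`, `build_index`, and `search` are hypothetical. The sketch assumes non-negative activations and scores documents with a ColBERT-style MaxSim sum, touching only the posting lists of neurons that the query activates.

```python
from collections import defaultdict

# Hypothetical sparse token codes: each token is {neuron_id: activation}.
# Hand-written toy values standing in for sparse-autoencoder outputs;
# activations are assumed to be non-negative.
DOCS = {
    "d1": [{3: 0.9, 7: 0.2}, {5: 0.8}],
    "d2": [{3: 0.1}, {9: 0.7, 5: 0.3}],
}


def build_index(docs):
    """Build a neuron-level inverted index:
    neuron_id -> list of (doc_id, token_position, activation)."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for pos, code in enumerate(tokens):
            for neuron, act in code.items():
                index[neuron].append((doc_id, pos, act))
    return index


def search(index, query_tokens, k=2):
    """MaxSim-style late interaction: for each query token, take its
    best-matching token in each document, then sum over query tokens."""
    scores = defaultdict(float)
    for q_code in query_tokens:
        # Sparse dot product with every document token that shares
        # at least one active neuron with this query token.
        sims = defaultdict(float)
        for neuron, q_act in q_code.items():
            for doc_id, pos, d_act in index.get(neuron, ()):
                sims[(doc_id, pos)] += q_act * d_act
        # Keep only the maximum similarity per document.
        best = defaultdict(float)
        for (doc_id, _pos), sim in sims.items():
            best[doc_id] = max(best[doc_id], sim)
        for doc_id, sim in best.items():
            scores[doc_id] += sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


INDEX = build_index(DOCS)
# Two query tokens, each activating a single neuron.
RESULTS = search(INDEX, [{3: 1.0}, {5: 1.0}])
print(RESULTS)
```

In this toy data, "d1" ranks first because its tokens activate neurons 3 and 5 more strongly than those of "d2". Document tokens that share no neuron with the query are never visited, which is where the efficiency of an inverted index comes from.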

NVIDIA Partners with Step-Cloud to Run Step-3.7-Flash on DGX Station via vLLM

NVIDIA has partnered with Step-Cloud to run the Step-3.7-Flash model on DGX Station hardware equipped with Blackwell architecture, using the vLLM inference engine. The setup supports both local deployment and production use as an NVIDIA NIM container. Detailed installation and configuration steps for vLLM on DGX Station have been published, along with a list of all supported models available on the platform. This collaboration highlights the growing role of vLLM as a universal serving layer bridging open-source inference with enterprise hardware.

Peking University Mathematics Alumnus Su Weijie Officially Joins OpenAI

Su Weijie, a full professor in the Department of Statistics and Data Science at the Wharton School of the University of Pennsylvania, has officially joined OpenAI. Su holds joint appointments in Computer and Information Science, Mathematics, and Biostatistics, and serves as co-director of Penn’s Machine Learning Research Center. He is a Stanford alumnus and part of the celebrated Peking University mathematics cohort. His move to OpenAI signals the company’s continued investment in deep academic talent across statistics, optimization, and ML theory.

Ship the best product. Use lots of AI, some AI, maybe no AI. Just be the best.
— Guillermo Rauch, CEO of Vercel

AI Release Cadence Accelerating, Especially from OpenAI and Anthropic

Timeline of models scoring 3+ points over predecessors on Artificial Analysis index.

Ethan Mollick observes that meaningfully better AI releases are arriving at an accelerating pace. A timeline he commissioned includes only new models that score at least 3 points higher than their predecessors on the Artificial Analysis index, and it shows a steepening curve driven primarily by OpenAI and Anthropic.

Open-Weights Models More Fragile Than Benchmarks Suggest

Ethan Mollick argues that while Epoch AI does excellent benchmarking, open-weights models are significantly more fragile out-of-distribution than their scores indicate. He estimates the gap between open and closed models is larger than the commonly cited 3–4 month lag, especially on real-world tasks beyond standard evaluation suites.

The Open vs. Closed Model Debate Rests on Marginal Intelligence Value

Nathan Lambert frames the open-versus-closed model debate around a single question: is there disproportionate value in marginally better intelligence? Closed models will stay slightly smarter, but open models will be cheaper. The outcome hinges on whether the premium for incremental intelligence justifies the cost delta.

Claude Perceived as Lazy in Chat, While GPT-5.5 Shows Relentless Thoroughness

Nathan Lambert notes that Claude appears noticeably lazy in chat, especially on technical search topics, while GPT-5.5 and recent OpenAI models demonstrate remarkable thoroughness. He suggests this reveals how much a harness and post-training can shape a model’s perceived independence and persistence.

Open Science Defines How AI Is Discussed

Nathan Lambert argues that open science projects like Tulu 3 (which coined RLVR) shape the discourse around AI by establishing public methods and baselines, cutting through future noise and providing a shared vocabulary for the research community.

Elon Musk: Grok Build Is Moving Fast

Elon Musk stated that Grok Build is moving fast, signaling rapid development at xAI. The brief comment drew significant attention, suggesting heightened competition in the AI coding agent space where Grok Build competes with tools like Codex, Cursor, and GitHub Copilot.

Product & Industry May 31

StepFun

Step 3.7 Flash Free for Hermes Agent Users for 30 Days

StepFun announced that the Step 3.7 Flash model is free for NousResearch’s Hermes Agent users for 30 days, sparking excitement about what the community will build.

LangChain

One-Third of AI Teams Now Run Open-Weight Models

LangChain’s latest LangSmith Signal report reveals that 1 in 3 AI teams have run open-weight models, marking a significant milestone for open-source AI adoption in production environments.

ColBERT

ColBERTv2 Hits 20M Monthly Downloads, Author Recommends LateOn Migration

The ColBERTv2 model set a new record with 20 million monthly downloads. Its original creators recommend that users migrate to LightOn’s newer LateOn ColBERT model for better performance.

Benchmarks

Code Agent Benchmarks Too Small, Raising Evaluation Reliability Concerns

Critics point out that mainstream code agent benchmarks are alarmingly small: DeepSWE has 113 tasks, TerminalBench-2.0 has 89. There are growing calls for larger, more robust public evaluation suites.
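A rough calculation shows why task counts this small matter. The sketch below treats each task as an independent pass/fail trial (an assumption, since real tasks may be correlated) and uses the normal-approximation 95% interval, whose half-width is 1.96 * sqrt(p * (1 - p) / n). The 50% pass rate is illustrative, not a reported score.

```python
import math


def ci_half_width(pass_rate, n_tasks, z=1.96):
    """Half-width of the normal-approximation confidence interval
    for a pass rate measured on n_tasks independent tasks."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


# Task counts as quoted above.
BENCHMARK_SIZES = {"DeepSWE": 113, "TerminalBench-2.0": 89}

# Illustrative pass rate of 50%, the worst case for binomial variance.
HALF_WIDTHS = {
    name: ci_half_width(0.5, n) for name, n in BENCHMARK_SIZES.items()
}

for name, half_width in HALF_WIDTHS.items():
    print(f"{name}: +/- {100 * half_width:.1f} percentage points")
```

Under these assumptions the uncertainty is roughly 9 to 10 percentage points in either direction, so two agents whose scores differ by only a few points cannot be reliably ranked on a single run.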

Codex

Codex Adds Windows Computer Use and Remote Control via Mobile ChatGPT

Codex released extensive updates including Computer Use support on Windows and the ability to remotely control Codex on Windows via mobile ChatGPT. Unlike the Mac version, Windows Computer Use locks the host machine during operation.

Industry

Anthropic Accused of Distilling Chinese Models Kimi and Qwen

Allegations have emerged that Anthropic’s Claude may have been distilled from Chinese models Kimi and Qwen. Mounting circumstantial evidence is fueling debate across the AI research community.

Economics

Current AI Model Training Cost Estimated at ~$1 Billion, Not $2–4B

An analysis pegs the training cost of a current-generation model at roughly $1 billion at most, scaling up from DeepSeek V4 Pro. Even models like Mythos are at most 6x larger in active parameters, which challenges earlier estimates of $2–4 billion.

OpenAI

Voice Hack Night Finalists Announced, Public Voting Open

OpenAI Devs announced the four finalists for Voice Hack Night, featuring real-time voice agents built in six hours. Public voting is open and the winner will be announced on Monday.

Hardware

TERAFAB Targets 100–200 Billion Custom AI and Memory Chips Per Year

TERAFAB is targeting production of 100 to 200 billion custom AI and memory chips per year at full ramp, signaling a massive expansion of semiconductor capacity dedicated to AI workloads.

Tools & Ecosystem May 31

Codex

Codex Can Now Manage Its Own Sessions Autonomously

Codex now supports self-managed sessions: creating, searching, archiving, pinning, and spinning up independent worktrees for parallel tasks—all via conversational commands.

DevTools

Tips for Debugging Network Requests with Codex and Claude Code

Two simple methods let agents inspect network request data autonomously during web development: Chrome DevTools Network tab export and programmatic fetch interception.
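For the first method, a minimal sketch of what an agent-side helper might look like, assuming the Network tab export is a HAR (HTTP Archive) file, which is JSON with requests listed under `log.entries`. The function name `summarize_har` and the sample data are hypothetical, and a real export carries many more fields per entry. Fetch interception runs in the browser and is not shown here.

```python
import json

# Minimal stand-in for a HAR export; a real file has many more fields.
SAMPLE_HAR = json.dumps({
    "log": {
        "entries": [
            {
                "request": {"method": "GET",
                            "url": "https://example.com/api/items"},
                "response": {"status": 200},
                "time": 42.0,
            },
            {
                "request": {"method": "POST",
                            "url": "https://example.com/api/login"},
                "response": {"status": 500},
                "time": 310.5,
            },
        ]
    }
})


def summarize_har(har_text, min_status=400):
    """Return a compact list of requests whose HTTP status is at or
    above min_status, suitable for pasting into an agent's context."""
    har = json.loads(har_text)
    failures = []
    for entry in har["log"]["entries"]:
        status = entry["response"]["status"]
        if status >= min_status:
            failures.append({
                "method": entry["request"]["method"],
                "url": entry["request"]["url"],
                "status": status,
                "time_ms": entry["time"],
            })
    return failures


FAILURES = summarize_har(SAMPLE_HAR)
for failure in FAILURES:
    print(failure)
```

Filtering down to failed requests keeps the summary short, which matters because full HAR exports can be far larger than an agent's context window.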

Best Practice

Why Agent Memory Fails: Memory Is Context, Not Instruction

A common pitfall when connecting databases to AI agents: writing workflows into agent memory does not enforce execution. Memory is background context, not an execution directive.

Vercel

Per-API Key Spend Caps Arrive on AI Gateway

Vercel’s AI Gateway now supports per-API-key spend caps, giving teams fine-grained cost control across multiple model providers and API keys in a single unified interface.
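The general mechanism behind a per-key spend cap can be sketched as follows. This is an illustration of the concept only and does not reflect Vercel's API or implementation: the `KeyBudget` class, the `SpendCapExceeded` exception, and the key names are all hypothetical.

```python
class SpendCapExceeded(Exception):
    """Raised when a request would push a key past its cap."""


class KeyBudget:
    """Tracks cumulative spend per API key against a fixed cap (USD)."""

    def __init__(self, caps_usd):
        self.caps = dict(caps_usd)
        self.spent = {key: 0.0 for key in caps_usd}

    def charge(self, api_key, cost_usd):
        """Record a charge and return the remaining budget.
        Rejects the charge, leaving spend unchanged, if it would
        exceed the cap for this key."""
        new_total = self.spent[api_key] + cost_usd
        if new_total > self.caps[api_key]:
            raise SpendCapExceeded(
                f"{api_key}: {new_total:.2f} exceeds cap "
                f"{self.caps[api_key]:.2f}"
            )
        self.spent[api_key] = new_total
        return self.caps[api_key] - new_total


# Hypothetical keys with independent caps.
budget = KeyBudget({"team-a": 10.0, "team-b": 5.0})
remaining = budget.charge("team-a", 4.0)
print(f"team-a remaining: ${remaining:.2f}")
```

Keeping a separate running total per key means that one team exhausting its budget does not block requests made with another team's key.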

Anthropic

Using Opus 4.8 to Orchestrate Open-Source Sub-Agents

A demonstration shows Claude Opus 4.8 coordinating multiple open-source sub-agents, suggesting a pattern where a powerful central model delegates tasks to cheaper specialized models.

Codex

Codex Restores Context Usage Display After User Backlash

After removing context usage visibility in a previous version, Codex has restored the feature in the latest update—though users must now manually enable it in settings.

Pricing

GitHub Copilot Token Multipliers Revealed: Gemini 3.5 Flash at 14x

GitHub Copilot applies asymmetric token multipliers: Claude Sonnet 4.6 at 1x, GPT-5.5 at 7.5x, Gemini 3.5 Flash at 14x, and Claude Opus 4.8 at 15x, shaping cost dynamics for developers.
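To make the spread concrete, a small worked example follows. It assumes that billed usage is simply raw tokens times the multiplier, which the item above does not confirm, so `billed_units` is a hypothetical helper and the result is a relative comparison rather than an actual price.

```python
# Multipliers as quoted above.
MULTIPLIERS = {
    "Claude Sonnet 4.6": 1.0,
    "GPT-5.5": 7.5,
    "Gemini 3.5 Flash": 14.0,
    "Claude Opus 4.8": 15.0,
}


def billed_units(model, raw_tokens):
    """Assumed billing rule: raw token count scaled by the multiplier."""
    return raw_tokens * MULTIPLIERS[model]


# The same 10,000-token session under each model.
USAGE = {model: billed_units(model, 10_000) for model in MULTIPLIERS}

for model, units in sorted(USAGE.items(), key=lambda kv: kv[1]):
    print(f"{model}: {units:,.0f} billed units")
```

Under this assumption, a session that counts as 10,000 units on Claude Sonnet 4.6 counts as 150,000 on Claude Opus 4.8.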

Codex

Latest Codex Update Now Displays Token Usage

The newest version of Codex surfaces token consumption directly in the interface, giving developers real-time visibility into API costs during coding sessions.

Tooling

WeChat Group Summary Bot Adds @bot Q&A with Context Awareness

The baoyu-wechat-summary tool now supports @bot Q&A in group chats, responding to questions using chat history context with configurable bot aliases to avoid confusion with real users.

iOS

iOS HTML and Markdown Preview App Nears TestFlight Release

An iOS app for previewing HTML and Markdown files is nearly complete, with test invites expected soon. The app addresses the lack of native Markdown and HTML rendering on iOS.

Immich

Self-Hosted Photo Tool Immich Uses SigLIP for Semantic Search

An exploration of self-hosted photo tools reveals that Immich already uses SigLIP, a more advanced vision embedding model than the original CLIP, for its CLIP-style semantic image search.

Publishing

New Book on AI-Enhanced Academic Paper Writing Released

A new book titled “AI High-Quality Paper Writing Method” supplements the earlier “Five-Step Academic Writing Method” with experience integrating AI deeply into knowledge production workflows.

Opinion & Commentary May 31

Rohan Anil

Former DeepMind Engineer Recalls How Theory Researchers Credited Implementers

Shoddy AI Agents on Twitter Compared to 1991 WordArt Enthusiasts

If Everyone Uses AI to Review Papers, At Least Use Good AI

If It’s Codex, I Would Use It — Benchmarks Are Long Dead

Programming Is Like English: You Need the Skill, Not the Major