July 3, 2026 · Friday

NVIDIA Launches TwoToken Model for Parallel Token Generation

Splitting a 30B model in two breaks the sequential bottleneck — Nemotron-Labs-TwoToken writes tokens in parallel, attacking the inference latency ceiling directly.

Since the Transformer was introduced in 2017, autoregressive decoding has meant one token at a time, a sequential constraint that caps per-request generation speed regardless of how many GPUs are thrown at the problem. NVIDIA AI is now challenging that assumption with Nemotron-Labs-TwoToken, a research architecture that partitions a single 30-billion-parameter model into two cooperating halves that generate tokens simultaneously. By breaking the sequential dependency, TwoToken promises to cut latency and increase serving throughput without requiring larger hardware. The project signals that NVIDIA sees inference efficiency as the next frontier of competitive differentiation in the LLM stack. The model was announced via NVIDIA AI's official channels, with initial benchmarks suggesting substantial speedups at quality comparable to single-pass baselines. If TwoToken or similar parallel-token architectures prove practical at scale, they could reshape how data centers provision inference capacity, shifting the economics from raw FLOPs toward clever decomposition strategies.

DSpark Speculative Decoding Lands on vLLM

DeepSeek's DSpark speculative decoding engine now runs natively in vLLM, enabling low-latency high-interactivity inference.

DeepSeek's DSpark speculative decoding brings a semi-autoregressive drafter into the vLLM inference stack. The technique works by proposing several tokens in parallel using non-causal sliding-window attention, then having the target model verify them in a single forward pass. The result: output identical to standard autoregressive decoding, but requiring far fewer sequential steps. The native vLLM integration, led by the community with contributions across kernel optimization, scheduler refinement, and serving infrastructure, makes this capability broadly available. For developers deploying interactive AI applications, the practical implication is clear: lower latency without sacrificing generation quality. The integration also marks vLLM's continued evolution as the open-source inference layer that absorbs research advances faster than any proprietary offering. With the integration in place, vLLM now supports DSpark speculative decoding natively for both DeepSeek and third-party models, reinforcing its position as the default serving runtime for the open-weight ecosystem.
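
For readers who want to see why draft-then-verify can match ordinary decoding token for token, here is a minimal toy sketch of the greedy case. It is not DSpark or vLLM code: target_next, draft_next, and the block size k are hypothetical stand-ins, the "models" are simple arithmetic functions, the drafter proposes tokens one by one rather than in parallel, and the single verification pass is only simulated. Sampling-based decoding would need a probabilistic acceptance rule that is not shown here.

```python
from typing import List, Tuple

Token = int


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical target model: a deterministic greedy next-token rule."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except at some positions."""
    t = target_next(prefix)
    return t if len(prefix) % 4 else (t + 1) % 50


def autoregressive(prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target step per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative(prompt: List[Token], n: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify; returns (tokens, number of verification passes)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n:
        # 1) The drafter proposes k tokens.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) The target scores every draft position. A real model would do
        #    this in one batched forward pass; here it is simulated in a loop.
        passes += 1
        verified = [target_next(seq + draft[:i]) for i in range(k + 1)]
        # 3) Keep the longest prefix the target agrees with, then append the
        #    target's own token, so every round yields at least one token.
        accepted = 0
        while accepted < k and draft[accepted] == verified[accepted]:
            accepted += 1
        seq.extend(draft[:accepted])
        seq.append(verified[accepted])
    return seq[len(prompt):len(prompt) + n], passes
```

Calling speculative([1, 2, 3], 20) returns the same 20 tokens as autoregressive([1, 2, 3], 20) while using fewer verification passes than tokens. The better the drafter matches the target, the more tokens each pass accepts.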

ICML 2026 · SPOTLIGHT PAPER

Beyond Language Modeling: A New Framework for Multimodal Pretraining

A team presenting at ICML introduces an empirical framework for native multimodal pretraining, examining how representation learning, data mixture, architectural design, and scalability interact. The work moves past simple text-to-image alignment and asks what constitutes a genuinely multimodal foundation — one where vision and language are co-learned from the start rather than glued together post-hoc. Authors John Nguyen and David Fan will present the spotlight paper at ICML, with accompanying materials available at beyond-llms.github.io. The research provides empirical guidance on data construction strategies, model architecture choices, and scaling laws for multimodal systems — a timely contribution as the industry shifts from language-only models toward integrated perception and reasoning.

INFERENCE · PERFORMANCE

vLLM Cuts DeepSeek V4 Token Cost by 5x in One Month

The vLLM community achieved a fivefold reduction in token-serving cost for DeepSeek V4 through a sustained optimization campaign spanning kernel rewrites, scheduler improvements, and serving-stack tuning. Day-zero integration recipes gave way to deep performance work as every pull request chipped away at the cost curve. The trajectory demonstrates how a focused open-source inference community can compress costs faster than any single vendor's roadmap. The benchmark stands at 5x lower cost per token within a single month — a result that matters for every startup and enterprise running DeepSeek V4 in production. For the broader ecosystem, it validates the model of community-driven inference optimization as a credible alternative to vertically integrated serving stacks.

BENCHMARKS

GLM 5.2 Becomes First Open-Source Model to Lead APEX-SWE

GLM 5.2 scored 55.3% Pass@1 on the APEX-SWE integration category, making it the first open-source model to top that benchmark for software engineering evaluation. The result puts open-weight models on competitive footing with proprietary systems in code-centric reasoning tasks.

PRODUCT

Claude API Rate Limits Increased 5x, Tiers Simplified

The Claude Platform API raised rate limits by up to 5x at the highest tier and decoupled tiers from API spend. The latest Sonnet and Haiku models benefit immediately from the new structure.

Eventually, much of AI will converge towards intuition-guided symbolic world modeling — deep learning-guided program synthesis. It is inevitable. Symbolic modeling lets a system construct a compact, reusable, highly generalizable mental model of a problem space using minimal data.
François Chollet

EDUCATION

CMU Launches AI Agents Course This Fall

Carnegie Mellon University is offering a new course on AI Agents, taught by Graham Neubig. Students will learn to build scaffolds, design evaluations, and train agentic LLMs using reinforcement learning, balancing theory with hands-on practice using modern frameworks.

DATA

Meta Releases Autodata Framework for High-Quality Training Data

Meta introduced Autodata, a framework that automates the generation of high-quality training data. The system targets one of the most persistent bottlenecks in frontier AI development: the scarcity of clean, diverse supervised data at scale.

TECH

vLLM Removes PagedAttention Module

Core vLLM developers deleted the PagedAttention module from the framework, marking a significant architectural evolution. The change reflects how rapidly the inference serving landscape is advancing beyond its original design assumptions.

INDUSTRY

NVIDIA Advances AI Factory Model with Revenue-Sharing Partnerships

NVIDIA is partnering with AI cloud providers to deploy large-scale, multi-tenant AI factories through revenue-sharing and credit-support agreements. The move reflects a structural shift from one-off training runs to always-on token production — and a corresponding need for a new business model. Rather than selling hardware alone, NVIDIA is co-investing in the operational layer, sharing both risk and upside with cloud partners. The strategy aims to broaden compute access for the wider AI ecosystem while positioning NVIDIA as a long-term stakeholder in inference economics. Revenue-sharing arrangements with cloud providers represent a significant departure from the traditional chip vendor model and could accelerate the deployment of dedicated AI infrastructure globally.

AI LABS

Sakana AI Establishes Recursive Self-Improvement Lab in Tokyo

Sakana AI launched its RSI Lab, targeting autonomous optimization loops that evolve from human-driven R&D toward self-improving intelligence engines. The Tokyo-based lab is actively hiring program managers to scale its recursive self-improvement research program.

INFRASTRUCTURE

Vercel Positions AI Gateway as Token CDN

Vercel CEO Guillermo Rauch described the AI Gateway as a Content Token Delivery Network — an AI model CDN that supports dynamic routing and traffic rejection without redeployment. When Fable was suddenly retired, the gateway absorbed the impact on production workloads.

PLATFORMS

Fable 5 Returns to Replit with High-Effort Mode

Replit brought Fable 5 back online for longer, more complex coding projects. Users can toggle High-effort mode in Replit Agent for the toughest builds and sustained autonomous tasks.

MODEL RELEASES

GLM 5.2 DSpark Preview: First Non-DeepSeek Speculator

RedHatAI released the GLM-5.2-speculator.dspark-preview on Hugging Face — the first DSpark speculator built for a non-DeepSeek frontier model, extending speculative decoding to a new model family.

POLICY

OpenAI Reportedly Explores Giving 5% Stake to US Government

OpenAI is reportedly exploring a plan to transfer 5% of its equity to the US government, aiming to give ordinary citizens a share of the AI dividend. The $852 billion startup's proposal would be unprecedented in scale and structure for a private technology company.

WHITE HOUSE

Rampart PII Removal Model Tops Hugging Face Trending

The Rampart model, built by ND Studio and the White House for PII removal and token classification, reached number one on Hugging Face trending. Clement Delangue cited it as evidence that public organizations should own their model weights rather than renting from API providers.

DATASETS

CS2-10k: 600,000+ Gameplay Videos Released on Hugging Face

Reka Labs published CS2-10k, containing over 600,000 egocentric gameplay videos spanning 10,000+ hours. Every frame is paired with text captions, providing rich multimodal training material for vision-language models.

SCIENCE

80TB Astrophysics Dataset Quietly Arrives on Hugging Face

A massive 80TB dataset compiled from over 30 astrophysics sources appeared on Hugging Face, part of what Thom Wolf describes as a weekly mega-release pattern in AI-driven science. The dataset spans multiple observation modalities and research institutions.

FRAMEWORKS

Eve: A Next.js-Style Framework for Building Agents

Evedev released eve, described as "Next.js for agents" — a single-folder framework for building agents that are durable by default, with persistent state management and streamlined deployment patterns.

AWARDS

Kling AI Ad Film Wins Bronze Lion at Cannes

The AI-generated short film "Lorem Ipsum," produced by Argentine studio Purga Films using Kling AI, won a Bronze Lion at Cannes Lions in the Film B2B category — a milestone for AI-assisted creative production.

VIDEO AI

PixVerse Seedance 2.0 Converts Motion Reference to 4K Cinematic Scenes

PixVerse showcases Seedance 2.0 transforming raw motion capture and single reference images into stylized 4K animated sequences, preserving details like cape physics, landing weight, and environmental consistency across shots.

GOOGLE

Agentic Kernel Optimization: The Future of On-Device Inference

Google's Gemma team declared that agentic kernel optimization is the future of on-device inference. Xenovac used Fable 5 to author kernels that improved Gemma inference performance on edge hardware.

Short Takes · 2026.07.03

FUNDING

TogetherCompute Secures Series C

TogetherCompute completed its Series C round, with congratulations from MiniMax and industry peers.

GROK

Speech-to-Text Goes Live in Grok Build

Users can now dictate prompts directly to coding agents using Grok's new voice input feature.

ENERGY

Nuclear Startup Valar Powers NVIDIA Spark

Valar Atomics became the first nuclear startup to generate electricity, successfully powering an NVIDIA Spark.

QUALCOMM

Qualcomm Expands AI Collaboration with Hugging Face

Qualcomm and Hugging Face deepened their open-source AI developer partnership, including model onboarding.

MODELS

GLiNER2 PII Filter Hits 55k Downloads

The fastino/gliner2-privacy-filter-PII-multi model reached 55,000 downloads in six weeks on Hugging Face.

VIDU

Q3 Drama Integrated into Anishort Platform

Vidu Q3 Drama now supports consistent character identity, 1080P visuals, and native audio-video sync.

CLAUDE

Anthropic Hosts Life Sciences Hackathon

Anthropic and Gladstone Institutes launched "Built with Claude: Life Sciences," a global virtual hackathon.

POOLSIDE

Laguna XS 2.1 Lands on SGLang on Day Zero

Poolside AI's 33B MoE model for agentic code got day-zero support on the SGLang inference framework.