July 5, 2026 · Sunday

DeepSeek V4 Pro and Flash model IDs surfaced ahead of official announcement.

DeepSeek V4 Is Coming: Pro and Flash Versions Emerge, Still Per-Token

Internal sources point to imminent release of DeepSeek V4 lineup with distinct Pro and Flash tiers, maintaining per-token pricing despite industry shifts toward subscription models.

The next generation of DeepSeek models appears close to launch, with model identifiers deepseek-v4-pro-202606 and deepseek-v4-flash-202605 surfacing in community discussions. If confirmed, the Pro variant would represent DeepSeek's most capable reasoning model to date, while the Flash version targets cost-efficient inference at scale. The per-token pricing strategy, however, runs counter to a growing trend among frontier labs toward flat-rate or subscription access. DeepSeek's decision to stick with metered billing likely reflects confidence in its inference cost structure, which benefits from aggressive hardware optimizations and a vertically integrated stack. Observers note that V3's release reshaped the open-weight frontier landscape, and V4 could trigger another wave of price competition across the API inference market, particularly if the Flash variant delivers competitive quality at a fraction of current frontier pricing.

PAPER

Sakana AI Brings 11 Papers to ICML 2026, From Multi-Agent to Black-Box Optimization

Sakana AI will present a substantial body of work at ICML 2026 in Seoul, spanning multi-agent coordination, evolutionary algorithms, and foundation model techniques. Among the highlights, the paper "Bridging Spherical Black-Box Optimizers" proposes a unifying framework for connecting disparate optimization algorithms on spherical manifolds, aiming to close the gap between theory and practical deployment in high-dimensional model tuning.

INDUSTRY

Original Scaling Laws Paper Had a Bug, Leading to Years of Compute Waste

A researcher has revealed that a bug in the foundational scaling laws paper produced systematically wrong conclusions, causing labs to train models far larger than necessary for their data budgets. The bug meant optimal model sizes were overestimated, leading companies to burn enormous compute on oversized, undertrained models. Because this happened before the industry had started properly accounting for inference cost in the total cost equation, the inefficiency was compounded.

RESEARCH

Claude Fable 5 Reasoning Distilled into Qwen3-4B with 100% Self-Consistency

A team from the University of Waterloo distilled 2.3 million reasoning traces from Claude Fable 5 into the compact Qwen3-4B model, achieving perfect self-consistency across 512 samples with zero-bit output errors. The result demonstrates that frontier-level chain-of-thought reasoning can be effectively transferred to much smaller, open-weight models through careful distillation, dramatically reducing inference costs while preserving output quality.

What if the model is the router? I think people underestimate the ability of frontier models now, but especially in the near future, to delegate work on their own as needed to dumber, cheaper models.
Ethan Mollick

LongCat Weights Released: Largest Model Pretrained on Non-Western Chips

The LongCat model weights are now available on Hugging Face, representing the largest known pretraining run executed on non-Western hardware. The release offers a rare window into the capabilities of Huawei's chip stack at scale, decoupled from Meituan's data expertise. Evaluators are expected to scrutinize not just the model's benchmark scores but what the results say about the maturity of the underlying silicon ecosystem. The open-weight release invites the global research community to independently assess both the model quality and the hardware stack that produced it.

Compute Is Easier to Acquire Than Frontier Models, Two Years of Evidence Shows

A striking observation after two years of AI industry evolution: the path from having no compute to holding a solid chunk of global capacity is far shorter than the path from having abundant compute but no frontier model to having both. The implication is clear: algorithmic and architectural breakthroughs remain the true bottleneck. Hardware procurement, while capital-intensive, can be solved with money and partnerships. Building a genuinely competitive foundation model requires something far harder to buy: research talent, training infrastructure expertise, and the institutional knowledge that only comes from repeated, expensive failures.

Research & Industry · 07.05

MODEL RELEASE

Seedance 2.0 Goes Open Source with 4K Video Generation

Higgsfield AI released the full Seedance 2.0 project as open source, showcasing cinematic 4K video generation. The model supports prompt-driven scene composition and has already been used to create short films and experimental visual content.

INTEGRATION

GLM-5.2 Now Available Inside Claude Code via Hugging Face Inference

ZAI announced that the GLM-5.2 model is now selectable within Claude Code through Hugging Face Inference Providers, further bridging the gap between open-weight models and developer tooling.

TOOLING

LlamaIndex Ships Retrieval Harness for Modern Agentic Pipelines

Jerry Liu released a comprehensive Retrieval Harness providing persistent data infrastructure for agent-oriented retrieval workloads, addressing the growing need for reliable memory and search in multi-turn agent interactions.

CHIPS

Tau Law V2: Huawei LogicFolding Opens New Efficiency Tier for AI Chips

At comparable performance levels, Huawei's LogicFolding technology raises the energy efficiency ceiling for high-end AI chips while enabling a new tier of low-temperature, high-efficiency operation modes.

MEDICAL AI

AI Learns from 10,000 Tumor Transcriptomes to Improve Immunotherapy

Shared by Yann LeCun, a study used AI trained on transcriptome data from 10,000 tumor samples across 33 cancer types to predict and improve immunotherapy outcomes, marking a significant clinical application of large-scale ML.

FRAMEWORK

Diffusers Releases New Version with Ideogram4 and Video Pipelines

The latest Diffusers release adds multiple new image and video generation pipelines including Ideogram4 and MotifVideo, expanding the library's coverage of state-of-the-art generative model architectures.

We are leaving the Old Code Age, the Paleocodic, the artisanal code era, where if you needed a novel program, you would commission a local codesmith to hand-craft a work of code for you, bespoke.
Ethan Mollick

Model & Hardware Briefs · 07.05

MOONSHOT

Moonshot Lab Bets on Breakthrough Architectures Over System Integration

Moonshot's lead says the lab can barely keep up with model research and is prioritizing novel architectures over engineering integration. The lab is described as one of the few teams to have fully internalized DeepSeek's lessons.

BENCHMARK

Blackwell Bandwidth + DeepSeek MegaMoE Make 300 tps on GLM-5.2 Feasible

With NVIDIA Blackwell's increased memory and communication bandwidth combined with DeepSeek's MegaMoE operator, 300 tokens per second on GLM-5.2 is within reach, while 150 tps is expected to become the new normal.

HISTORY

Baidu, Not OpenAI, Published the First Scaling Laws Paper

A reminder that Chinese AI research has never been as far behind as commonly assumed: Baidu was the first to publish on scaling laws, challenging the narrative that algorithmic breakthroughs require proximity to frontier labs.

SPEED

V4 Translates Chinese PDF at 138 Tokens Per Second, 61s Inference

V4-flash translated a 13.5K-token Chinese PDF at 138 tokens per second, finishing in 61 seconds with minimal reasoning overhead. V4-pro delivered slightly better quality at 84 tokens per second over a run of 2 minutes 55 seconds.
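
As a rough sanity check on the figures in this item (138 tokens per second over 61 seconds for Flash, 84 tokens per second over 2 minutes 55 seconds for Pro), the implied token counts can be worked out as rate times duration. This is only a sketch: it assumes each reported rate is an average over the whole run, which the source does not state, and the function name is illustrative.

```python
def tokens_generated(tokens_per_second, seconds):
    """Total tokens implied by an average rate sustained for a duration.

    Assumption: the rate is an average over the full run (not a peak
    decode rate), which the item does not specify.
    """
    return tokens_per_second * seconds


# Figures as reported in the item.
flash_tokens = tokens_generated(138, 61)           # 61-second run
pro_tokens = tokens_generated(84, 2 * 60 + 55)     # 2 min 55 s = 175 s

print(f"Flash: {flash_tokens} tokens")
print(f"Pro:   {pro_tokens} tokens")
```

Under that assumption the Flash run works out to about 8.4K tokens and the Pro run to about 14.7K tokens. A larger total for Pro would be consistent with a longer reasoning trace, but the item does not break the counts down.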

PRICING

GLM-5.2 Is 5x Cheaper Than Opus 4.8, Tops PostTrainBench

Cost comparison data shows GLM-5.2 priced 5x below Opus 4.8 and 11x below Fable 5 while ranking first on the PostTrainBench benchmark, challenging assumptions about cost-quality trade-offs.
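
The two multiples above also imply a price ratio between the two more expensive models. A minimal sketch of that arithmetic follows, assuming both multiples are measured on the same price basis (for example, the same mix of input and output tokens), which the item does not confirm. Variable names are illustrative.

```python
from fractions import Fraction

# Multiples as reported: how many times the GLM price each model costs.
opus_vs_glm = Fraction(5)
fable_vs_glm = Fraction(11)

# Dividing the two multiples gives the implied Fable-to-Opus price ratio,
# valid only if both multiples share the same pricing basis (an assumption).
fable_vs_opus = fable_vs_glm / opus_vs_glm

print(f"Implied Fable 5 / Opus 4.8 price ratio: {float(fable_vs_opus)}")
```

On that reading, Fable 5 would cost about 2.2 times as much as Opus 4.8.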

HARDWARE

V4 Optimized for Both Blackwell and Huawei Chips in Full-Stack Hedge

DeepSeek V4 is reportedly optimized for both NVIDIA Blackwell and Huawei silicon. Although Huawei's chips were not designed for V4, the dual-hardware approach serves as a supply chain hedge against export restrictions.

RESEARCH

Paper Proposes Domain-Specific Frontier Propagation Ratios for Small Models

A new paper suggests different domains exhibit different ratios of frontier propagation to model scale, enabling intelligent use of small sub-agent LLMs for exhaustive search in select branches.

Signs of Excessive On-Policy RL Detected in Recent Training Runs

Industry observers report telltale signs of over-reliance on on-policy reinforcement learning in recent model releases, raising questions about training methodology and reproducibility.

Community & Product · 07.05

LUMA