Week of 2026-04-26 to 2026-05-03

Image/Video GenAI: Tools & Workflows

Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. (PhyCo: Learning Controllable Physical Priors for Generative Motion) Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. (Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models) Humanoid control systems have made...

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

AI Agents: Applications & Builds

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. (GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents) We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. (Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence) Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state...

Analyst Note

This matters because practical agent use cases are widening, but the durable signal is whether they solve repeated workflows instead of one-off demos. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

Generative 3D Worlds: Explorable World Models

We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of p…. (Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising)

Analyst Note

This matters because world-generation systems are moving from rendered clips toward persistent spaces that can be explored, edited, and exported into production tools. The single-source signal is worth tracking for signs that 3D asset creation and simulation workflows are becoming model-driven.

AI Agents: Research & Evals

Autonomous scientific research is significantly advanced thanks to the development of AI agents. (AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery) Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. (The Last Human-Written Paper: Agent-Native Research Artifacts) Long-context large language models (LLMs)-for example, Gemini-3.1-Pro and Qwen-3.5-are widely used to empower many real-world applications, such as retrieval-augmented generation, autonomous...

Analyst Note

This matters because evaluation work is becoming the control surface for agent progress: better tests shape what builders trust, deploy, and regulate. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

Also Notable

Heterogeneous Scientific Foundation Model Collaboration

Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world problems, especially in scientific domains where domain-specific foundation models have been developed to address specialized tasks beyond natural language. In this work, we introduce Eywa, a heterogeneous agentic...

FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centric issue resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decision-making. These challenges are particularly pronounced for open-source LLMs...

Agentic Fusion of Large Atomic and Language Models to Accelerate Superconductors Discovery

The discovery of novel materials is critical for global energy and quantum technology transitions. While deep learning has fundamentally reshaped this landscape, existing predictive or generative models typically operate in isolation, lacking the autonomous orchestration required to execute the full discovery process. Here we present ElementsClaw, an agentic framework for materials discovery that synergizes Large...

Efficient Training on Multiple Consumer GPUs with RoundPipe

Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing PP schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., the...

Week of 2026-04-19 to 2026-04-26

AI Agents: Frontier Models

Introducing GPT-5.5: a new class of intelligence for real work and for powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. (Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals,...) Access GPT Image 2.0 natively in Hermes Agent. Update now to get access: just run `hermes update` and select your image generation model with `hermes tools`. (Access GPT Image 2.0 natively in Hermes Agent Update now to get access - just run hermes update and select your ima...)...

Analyst Note

This matters because model updates and official evaluations reset expectations for what agent stacks can attempt, while also creating new operating and safety assumptions. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

Image/Video GenAI: Model Releases

What makes ChatGPT Images 2.0 a state-of-the-art image generation model? Researchers behind the model explain. (What makes ChatGPT Images 2.0 a state-of-the-art image generation model? Researchers behind the model explain. A thre...) Built a time machine powered by OpenAI’s new image generation model. Describe where and when you want to go, and it creates an immersive panoramic world you can explore. (Built a time machine powered by OpenAI’s new image generation model. Describe where and when you want to go, and it c...)

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

Generative 3D Worlds: Interactive 3D Worlds

Rapid 3D scene generation from text or images. (Rapid 3D scene generation from text or images)

Analyst Note

This matters because world-generation systems are moving from rendered clips toward persistent spaces that can be explored, edited, and exported into production tools. The single-source signal is worth tracking for signs that 3D asset creation and simulation workflows are becoming model-driven.

Also Notable

GIANT e-ink display LIVE in my house and actively removing the “mental load” of motherhood 😅 Turns out my household ...

GIANT e-ink display LIVE in my house and actively removing the “mental load” of motherhood 😅 Turns out my household chaos just needed to be tamed by a display my team of @openclaw and @NousResearch Hermes agents manage for me 💅. (GIANT e-ink display LIVE in my house and actively removing the “mental load” of motherhood 😅 Turns out my household ...)

Week of 2026-04-12 to 2026-04-19

Image/Video GenAI: Model Releases

Last week in Generative Image & Video. (h-embodvis/numina) IC-LoRA-Detailer: It's for post-processing, not just rendering (LTX2.3). (url) DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max). (bstnxbt/dflash-mlx) Nucleus-Image Released. (nucleusai/nucleus-image) Just saw this new technical blog from SenseNova (SenseTime) and it looks like the "Frankenstein" era of sticking different models together might be ending. (vh5se45d8b) Signal spans 13 source domains.

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

AI Agents: Frontier Models

Mythos is Mostly Hype... (also the bugs it found were mostly unexploitable and exaggerated...). (verbose 250-page report) AI Security Institute Findings on Claude Mythos Preview. (aisi.gov.uk) OpenAI rolls out GPT-5.4-Cyber to limited group for testing, seeks to rival Claude Mythos. (openai.com) Gemini Robotics ER-1.6 enhances reasoning to help robots navigate real-world tasks. (Gemini Robotics ER-1.6 enhances reasoning to help robots navigate real-world tasks) Anthropic is set to release Claude Opus 4.7 and a new AI design tool as early as this week....

Analyst Note

This matters because model updates and official evaluations reset expectations for what agent stacks can attempt, while also creating new operating and safety assumptions. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

Generative 3D Worlds: Explorable World Models

Today, we released Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale, from NVIDIA Research. (Today, we released Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale, from NVIDIA Resear...) Tencent HY-World 2.0 appears to be dropping on April 15 — open-source multimodal 3D world generation from Tencent Hunyuan. (world)

Analyst Note

This matters because world-generation systems are moving from rendered clips toward persistent spaces that can be explored, edited, and exported into production tools. The multi-source signal is worth tracking for signs that 3D asset creation and simulation workflows are becoming model-driven.

3D Gaussian Splatting: Tools & Reconstruction

RSR: Rapid Splat Renderer - Free Native D3D12 Windows Desktop/VR DLSS Viewer for Gaussian Splatting. (warpgatelabs/rsr)

Analyst Note

This matters because Gaussian splatting progress is increasingly judged by usable pipelines, not just reconstruction quality. The single-source signal points to continued movement from research artifacts toward viewers, mesh conversion, and production workflows.

Also Notable

Claude vs GPT in a bomberman-style 1v1 game

A few weeks ago, ARC-AGI 3 was released. For those unfamiliar, it’s a benchmark designed to study agentic intelligence through interactive environments. I'm a big fan of these kinds of benchmarks as IMO they reveal so much more about the capabilities and limits of agentic AI than static Q&A benchmarks. They are also more intuitive to understand when you are able to actually see how the model behaves in these...

TUI to see where Claude Code tokens actually go

been spending $200+/day on claude code and had zero visibility into what was eating the tokens. ccusage shows cost per model per day which is great but i wanted to know - is it the debugging that's expensive? the brainstorming? which project is burning the most? it reads the session transcripts claude code already stores on disk (~/.claude/projects/) and classifies every turn into 13 categories based on tool usage...
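
The classification step the author describes can be sketched roughly like this: parse each turn of a stored transcript and bucket its token count by the tools it called. The file layout, field names (`tool_calls`, `tokens`), and the category mapping below are illustrative guesses, not the actual tool's schema:

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical mapping from tool names to spend categories; the real
# tool infers 13 categories from each turn's tool usage.
CATEGORY_BY_TOOL = {
    "Bash": "debugging",
    "Edit": "editing",
    "Read": "code-reading",
    "WebSearch": "research",
}

def classify_turn(turn: dict) -> str:
    """Pick a category from the first recognized tool call in a turn."""
    for call in turn.get("tool_calls", []):
        cat = CATEGORY_BY_TOOL.get(call.get("name"))
        if cat:
            return cat
    return "brainstorming"  # no tool use: assume pure conversation

def tally_session(path: Path) -> Counter:
    """Sum token counts per category across a JSONL session transcript."""
    totals = Counter()
    with path.open() as f:
        for line in f:
            turn = json.loads(line)
            totals[classify_turn(turn)] += turn.get("tokens", 0)
    return totals
```

Grouping by project then falls out of the directory structure the transcripts are already stored in.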

Claude + Playwright to teardown websites and unearth dark pattern trackers & feature flags (oss)

i'm building agents for procurement & one thread has been to let claude systematically deconstruct a website so agents can navigate them. but as i've been doing this, like a piñata, interesting things keep falling off -- from trackers, to interesting feature flags to even some over-exposed data. so i naturally claude-coded it into an oss repo + website (teardown.fyi) i've run it for about ~125 websites so far -...

Why Claude Code Max burns limits 40% faster with 20K less usable context. Proxy evidence inside.

TL;DR: Claude Code v2.1.100+ silently adds ~20K invisible tokens to every request, server-side. This eats your limits faster AND may degrade output quality. Downgrade to v2.1.98 for immediate relief. Proxy evidence below.

I run Claude Code Max (5x plan) heavily — 3-5 parallel sessions, custom orchestration, the whole deal. Two weeks ago my usage limits started hitting way earlier than expected. What used to last...

I built a Claude Code plugin that extracts any website's full design system

Just type `/extract-design` in Claude Code and it pulls the entire design language — colors, fonts, spacing, shadows, components, everything. The main output is a markdown file specifically structured for Claude to understand. So you can extract a site's design, then tell Claude "build me a landing page using this design system" and it actually nails it because it has the exact tokens, scales, and component...
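
The extraction idea can be approximated for the simplest case, CSS custom properties, in a few lines; the real plugin presumably also walks computed styles and component structure. Function names and the markdown shape here are invented for illustration:

```python
import re

def extract_design_tokens(css: str) -> dict:
    """Pull CSS custom properties (--name: value) out of a stylesheet.
    A rough stand-in for the plugin's extraction step."""
    pairs = re.findall(r"(--[\w-]+)\s*:\s*([^;}]+)", css)
    return {name: value.strip() for name, value in pairs}

def to_markdown(tokens: dict) -> str:
    """Render tokens as a markdown list a model can ingest."""
    lines = ["# Design tokens", ""]
    lines += [f"- `{name}`: {value}" for name, value in sorted(tokens.items())]
    return "\n".join(lines)
```

The markdown output matters more than the parsing: a flat, named token list is something a model can reliably reuse in a "build me a page with this system" prompt.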

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to lay…

Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough. 23 tok/s is still rough but honestly noticeably faster when streaming responses. Tl;dr: We keep track of which experts get routed to most frequently for the past N tokens. We make a bet that the processing speed-up from loading...
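
The routing-frequency bet in the TL;DR can be sketched as a small cache policy: count expert hits over a sliding window of the last N tokens and keep the hottest ones resident in VRAM. This is an illustrative model only; the actual llama.cpp change (eviction timing, asynchronous weight loads) is not shown:

```python
from collections import Counter, deque

class HotExpertCache:
    """Frequency-based cache of MoE experts in fast memory (VRAM).
    Tracks which experts the router picked over the last window_size
    tokens and keeps the hottest `capacity` experts resident."""

    def __init__(self, capacity: int, window_size: int):
        self.capacity = capacity
        self.window = deque(maxlen=window_size)  # recent routing decisions
        self.counts = Counter()
        self.resident = set()  # expert ids currently "in VRAM"

    def observe(self, expert_ids):
        """Record the experts routed to for one generated token."""
        if len(self.window) == self.window.maxlen:
            # the oldest token's experts fall out of the window
            for e in self.window[0]:
                self.counts[e] -= 1
        self.window.append(tuple(expert_ids))
        for e in expert_ids:
            self.counts[e] += 1
        # keep the top-capacity experts resident
        self.resident = {e for e, _ in self.counts.most_common(self.capacity)}

    def is_hot(self, expert_id) -> bool:
        return expert_id in self.resident
```

The speed-up claim rests on routing locality: if the same experts dominate over short spans, most per-token expert loads hit VRAM instead of crossing PCIe.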

Week of 2026-04-09 to 2026-04-12

AI Agents: Developer Tools

After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open source model agnostic framework. (After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open sourc...) Update on Gemma 4 having MTP: Reverse engineering effort. (Update on Gemma 4 having MTP: Reverse engineering effort) I made a USB-Clawd who gets my attention when Claude Code finishes a response. (I made a USB-Clawd who gets my attention when Claude Code finishes a response) Did you know you can get your claude through hermes...

Analyst Note

This matters because coding agents are becoming part of the developer toolchain, so reliability, context handling, and repository-level feedback loops are becoming product requirements. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

Image/Video GenAI: Tools & Workflows

Last week in Generative Image & Video. (Last week in Generative Image & Video) Could HappyHorse be Z-video in disguise, from Alibaba?. (Could HappyHorse be Z-video in disguise, from Alibaba?) Qwen3.5-4B-Base-ZitGen-V1. (Qwen3.5-4B-Base-ZitGen-V1)

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

3D Gaussian Splatting: Tools & Reconstruction

🚨Two lines of code. A full explorable 3D scene generated in seconds from a text prompt. (🚨Two lines of code. A full explorable 3D scene generated in seconds from a text prompt. It's called WorldGen -- an o...) 🌧️❄️🌨️Procedural rain and snow as Gaussian splats — infinite particles that move with the camera, fully rendered on the GPU. (🌧️❄️🌨️Procedural rain and snow as Gaussian splats — infinite particles that move with the camera, fully rendered on...) I've been getting a lot of questions recently about gaussian splatting and where it might be used, so I figured I would put together of...

Analyst Note

This matters because Gaussian splatting progress is increasingly judged by usable pipelines, not just reconstruction quality. The multi-source signal points to continued movement from research artifacts toward viewers, mesh conversion, and production workflows.

AI Agents: Applications & Builds

Actress Milla Jovovich just released a free open-source AI memory system that scored 100% on LongMemEval, beating every paid solution. (An actress Milla Jovovich just released a free open-source AI memory system that scored 100% on LongMemEval, beating every paid solution) Safetensors and Helion have joined PyTorch Foundation as foundation-hosted projects to secure model distribution for trusted agentic solutions and simplify kernel development across the open source AI ecosystem....

Analyst Note

This matters because practical agent use cases are widening, but the durable signal is whether they solve repeated workflows instead of one-off demos. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

Also Notable

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered b...

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. (Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered b...)

We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an execut...

We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an executor, and get near Opus-level intelligence in your agents at a fraction of the cost. (We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an execut...)

We just gave every Hermes Agent a free cloud browser. Use the Browser Use ecosystem in Hermes Agent for free > Unl...

We just gave every Hermes Agent a free cloud browser. Use the Browser Use ecosystem in Hermes Agent for free > Unlimited browser hours > Free proxies > Persistent authentication @NousResearch 🤝 Browser Use. (We just gave every Hermes Agent a free cloud browser. Use the Browser Use ecosystem in Hermes Agent for free > Unl...)

Too dangerous to release

Over the past several days, there has been a lot of internet discourse around Claude Mythos being held back from public release. Many people have been claiming this is somehow yet another devious marketing tactic meant to somehow weigh down Dario's pocketbook by... not letting people pay to access the model. Claims of hype and power consolidation and other self-congratulatory motives are easy to find online, but I...

AMD AI director's analysis confirms lobotomization of Claude

Stella Laurenzo, AMD’s director of AI, filed a detailed GitHub issue on April 2 documenting that Claude Code reads code three times less before editing it, rewrites entire files twice as often, and abandons tasks mid-way at rates that were previously zero. Her analysis of nearly 7,000 sessions puts precise numbers on how Anthropic’s coding tool has degraded since early March. PERFORMANCE DECLINE: AMD’s AI director...

AMD's senior director of AI thinks 'Claude has regressed' and that it 'cannot be trusted to perform complex engineering'

This is vindicating for all the people that have been screaming out that Anthropic simply doesn't want to release Mythos because they do not have the compute, not because the model is "too powerful." Summary of the findings: >On April 2, AMD’s Director of AI, Stella Laurenzo, filed a GitHub issue detailing a severe degradation in Claude Code's performance since early March. Based on an analysis of nearly 7,000...

Week of 2026-04-09 to 2026-04-09

AI Agents: Research & Evals

While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. (Adam's Law: Textual Frequency Law on Large Language Models) Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. (Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents) OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the...

Analyst Note

This matters because evaluation work is becoming the control surface for agent progress: better tests shape what builders trust, deploy, and regulate. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

Image/Video GenAI: Tools & Workflows

Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamic…. (DARE: Diffusion Large Language Models Alignment and Reinforcement Executor) We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. (Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision) Humans paint images incrementally...

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

3D Gaussian Splatting: Tools & Reconstruction

Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. (Fast Spatial Memory with Elastic Test-Time Training)

Analyst Note

This matters because Gaussian splatting progress is increasingly judged by usable pipelines, not just reconstruction quality. The single-source signal points to continued movement from research artifacts toward viewers, mesh conversion, and production workflows.

AI Agents: Applications & Builds

RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. (RAGEN-2: Reasoning Collapse in Agentic RL) We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. (MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU) Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code....

Analyst Note

This matters because practical agent use cases are widening, but the durable signal is whether they solve repeated workflows instead of one-off demos. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

Also Notable

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill-usage performance remain scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly-tailored task-specific skills for each task, whereas in many realistic settings...

Experience Transfer for Multimodal LLM Agents in Minecraft Game

Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we propose Echo, a transfer-oriented memory framework that enables agents to derive actionable knowledge from prior interactions rather than treating memory as a passive repository of static records. To make transfer explicit, Echo decomposes reusable knowledge into five...

Neural Computers

We propose a new frontier: Neural Computers (NCs) -- an emerging machine form that unifies computation, memory, and I/O in a learned runtime state. Unlike conventional computers, which execute explicit programs, agents, which act over external execution environments, and world models, which learn environment dynamics, NCs aim to make the model itself the running computer. Our long-term goal is the Completely Neural...

REAM: Merging Improves Pruning of Experts in LLMs

Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reduce memory requirements include weight pruning and quantization. Motivated by the Router-weighted Expert Activation Pruning (REAP) that prunes experts, we propose a novel...

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph...

After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open sourc...

After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open source model agnostic framework. He studied the architecture, focused on the multi-agent orchestration layer (the coordinator that breaks goals into tasks, team. (After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open sourc...)

Week of 2026-03-23 to 2026-03-29

Claude Gets a Body: Computer Use Ships in Cowork

Anthropic launched Claude Cowork with Computer Use in research preview — Claude can now open apps, navigate browsers, fill spreadsheets, and operate your desktop directly. It prioritizes connected integrations (Slack, Calendar) before falling back to screen-level control.
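
The integration-first fallback is, at its core, a dispatch order. A minimal sketch with invented connector names and interfaces, not Cowork's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Connector:
    """A native integration (Slack, Calendar, ...) the agent prefers."""
    name: str
    can_handle: Callable[[str], bool]
    execute: Callable[[str], str]

def run_task(task: str, integrations: list,
             screen_fallback: Callable[[str], str]) -> str:
    """Try fast, reliable connectors first; only drive the screen
    (slow, fragile pixel-level control) when nothing else applies."""
    for app in integrations:
        if app.can_handle(task):
            return app.execute(task)
    return screen_fallback(task)
```

The ordering is the whole design: screen control becomes the exception path, so reliability is bounded by the connectors rather than by OCR-and-click.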

Analyst Note

This is the 'agent-on-your-desktop' paradigm going mainstream. The real unlock isn't the screen control (which is slow and fragile) — it's the integration-first approach. When Claude can use Slack and Calendar natively and only falls back to pixels when it has to, you get reliability where it...

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

LoGeR tackles a fundamental bottleneck in 3D reconstruction from video: maintaining geometric consistency across long sequences. The paper introduces a hybrid memory architecture that combines explicit geometric features with learned latent representations, enabling coherent reconstruction from hundreds of frames where previous methods degrade.

Analyst Note

Most reconstruction methods choke on long videos — they either drift or run out of memory. LoGeR's hybrid approach (explicit geometry for precision + learned latents for compression) is the kind of architecture that could make casual phone-scan-to-3D actually reliable. The Google/Berkeley pedigree...

L.A. Noire-Style Facial Animation with AI Video Depth Maps

Independent developer Alex shared a practical pipeline for game facial animation: generate a facial animation video using AI (LTX 2), extract a depth map sequence from it, then project the depth onto a face mask using vertex displacement. The result is nuanced, realistic facial movement reminiscent of L.A. Noire's MotionScan — but built entirely from commodity AI tools.
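
The projection step, pushing mask vertices along the view axis by sampled depth, can be sketched in a few lines. Nearest-neighbor sampling and a flat +Z displacement are simplifications; the names here are illustrative, not the developer's actual code:

```python
def displace_face_mask(vertices, uvs, depth_map, scale=1.0):
    """Offset each (x, y, z) vertex along +Z by the depth sampled at
    its (u, v) coordinate. A real shader would interpolate samples
    and displace along per-vertex normals instead of a fixed axis."""
    h, w = len(depth_map), len(depth_map[0])
    displaced = []
    for (x, y, z), (u, v) in zip(vertices, uvs):
        px = min(max(int(u * (w - 1)), 0), w - 1)  # clamp to texel grid
        py = min(max(int(v * (h - 1)), 0), h - 1)
        displaced.append((x, y, z + depth_map[py][px] * scale))
    return displaced
```

Run per frame against the depth-map sequence extracted from the generated video, this is what animates the otherwise static face mesh.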

Analyst Note

L.A. Noire spent years and millions on custom capture rigs to get that facial fidelity. This pipeline achieves something similar with a text-to-video model and a depth estimator — tools anyone can run. It's a glimpse of how AI is collapsing the cost of high-fidelity game content. The...

Also Notable

Radiant: 80+ Production Web Shaders, MIT Licensed

Zero-dependency library of production-ready WebGL and Canvas 2D shaders with multiple color themes. Copy, integrate, ship.

Off-Axis Projection with World Labs Splats + Face Tracking

Ian Curtis demoed a parallax window effect using Blender → World Labs Gaussian splat → Three.js with MediaPipe face tracking for off-axis projection.
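
Off-axis projection here means recomputing an asymmetric view frustum every frame from the tracked eye position. A minimal sketch of the standard math for a screen-parallel window, assuming the screen sits in the z = 0 plane (not Ian Curtis's actual code):

```python
def off_axis_frustum(eye, screen_lo, screen_hi, near):
    """Asymmetric frustum bounds for a head-tracked 'window' effect.
    eye = (ex, ey, ez) with ez > 0; screen_lo/screen_hi are the
    window's (left, bottom) and (right, top) corners in world units."""
    ex, ey, ez = eye
    scale = near / ez  # similar triangles: near plane vs screen plane
    left = (screen_lo[0] - ex) * scale
    right = (screen_hi[0] - ex) * scale
    bottom = (screen_lo[1] - ey) * scale
    top = (screen_hi[1] - ey) * scale
    return left, right, bottom, top
```

Feeding these bounds into an asymmetric projection matrix each frame (in Three.js, something like `Matrix4.makePerspective`) is what produces the parallax-window illusion as the face moves.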

Neural Network Mimics a Forest Trail

Ollin Boer Bohan trained a neural network to reproduce a forest trail near his apartment as an interactive web experience — a tiny world model you can walk through in the browser.

Speed by Simplicity: Single-Stream Audio-Video Generation

A single-stream transformer architecture for joint audio-video generation that's significantly faster than multi-stream approaches while maintaining quality. 115 HF upvotes.

LeWorldModel: JEPA World Model from Pixels

A stable end-to-end Joint-Embedding Predictive Architecture that learns world models directly from pixel observations — continuing the LeCun JEPA lineage.

VibeVoice: High-Quality Speech Synthesis

Top-trending HF paper this week (144 upvotes). Technical report on a new speech synthesis approach.

Cohere Transcribe: SOTA Open-Source Transcription in Browser

Cohere released an open-source transcription model that runs in-browser with weights on HuggingFace. Claims state-of-the-art for open models.

Browser Use CLI 2.0: 2x Speed, Direct CDP

Major update to the browser automation CLI — 2x faster, half the cost, direct Chrome DevTools Protocol integration instead of high-level abstractions.

Cheng Lou's 'Foundational Piece of UI Engineering'

Former React team member Cheng Lou dropped what he calls one of the more important foundational pieces of UI engineering for the foreseeable future. 29K likes, massively viral.

Figure 03 Robot Flipping Packages

Salesforce CEO Benioff posted video of Figure's latest humanoid robot handling packages with notable dexterity improvements.

AI + AlphaFold = Custom mRNA Cancer Vaccine for a Dog

An Australian tech entrepreneur used ChatGPT and AlphaFold to design a custom mRNA vaccine that significantly reduced his dog's tumor. Researchers involved are reportedly excited by the results.

4D OSINT Reconstruction of Iran Strikes

Bilawal Sidhu deployed an AI agent swarm to capture every OSINT signal during Operation Epic Fury, then built a full 4D temporal reconstruction in WorldView — scrubbing through the strikes minute by minute.

AgentCraft: RTS-Style Agent Control

Control your coding agents like units in a real-time strategy game. Early but creative take on agent orchestration UX.

Internal Safety Collapse in Frontier LLMs

Paper documenting how safety guardrails in frontier models can catastrophically collapse under certain conditions. 30 HF upvotes.

Bernie Sanders: Pause AI Data Center Construction

Legislation introduced to pause AI data center building and pursue international coordination, plus a ban on chip exports. The political angle on AI scaling heats up.

Week of 2026-03-16 to 2026-03-22

Video Diffusion Models Are Secretly Reasoning Engines with 3D Understanding

Two major papers reveal that video diffusion models encode far more than visual generation. "Demystifying Video Reasoning" (347 HF upvotes) shows reasoning emerges through denoising steps — not frame sequences — via a Chain-of-Steps mechanism with working memory, self-correction, and perception-before-action. Separately, VEGA-3D demonstrates that video generation models implicitly learn robust 3D structural priors and physical laws, which can be extracted as a plug-and-play "Latent World Simulator" to give MLLMs spatial understanding without explicit 3D supervision.

Analyst Note

These findings reframe video diffusion as more than a generative tool — it's a substrate for spatial intelligence and reasoning. For rendering teams, this means video generation backbones could become general-purpose scene understanding modules. The VEGA-3D approach of repurposing generative priors...

World Models Break Out of the Lab: Real Cities, Stereo VR, and 4D Robotics

This week saw a convergence of world model papers moving from toy environments to real-world grounding. Seoul World Model generates navigable video grounded in actual street-view imagery over hundreds of meters. MosaicMem introduces hybrid 3D/implicit spatial memory enabling minute-level consistent navigation in video world models. WorldCam uses camera pose as a unifying geometric representation for interactive 3D gaming worlds. StereoWorld produces end-to-end stereo video for VR without depth estimation. Kinema4D builds a 4D generative robotic simulator with URDF-based robot control and...

Analyst Note

World models are fragmenting into specialized niches — navigation, gaming, robotics, VR — but sharing architectural DNA (video diffusion + spatial conditioning). For Riverside's rendering pipeline, the stereo/VR angle (StereoWorld) and the real-city grounding approach (Seoul World Model with...

3D Reconstruction and Generation: Physics-Grounded, Continuous LoD, and Semantic Tokenization

Several advances push 3D pipelines toward production readiness. HSImul3R (149 HF upvotes) introduces physics-in-the-loop 3D reconstruction of human-scene interactions — using the physics simulator as an active supervisor to jointly refine dynamics and geometry, producing outputs directly deployable to humanoid robots. Matryoshka Gaussian Splatting enables continuous level-of-detail from a single 3DGS model via stochastic budget training — any prefix of the ordered Gaussian set produces a coherent reconstruction. M³ augments multi-view foundation models with dense matching for monocular...

Analyst Note

The HSImul3R physics-in-the-loop approach is a paradigm worth watching: treating the simulator as a differentiable supervisor rather than just a downstream consumer. Matryoshka GS is immediately practical — continuous LoD from a single model is exactly what streaming/adaptive rendering needs. The...

Also Notable

NVIDIA DLSS 5 Sparks "AI Slop" Controversy

DLSS 5 appears to generate AI imagery on top of upscaling rather than just enhance existing frames — characters look like different people, drawing sharp criticism from gamers and CG professionals.

Radiant: 80+ Production-Ready WebGL/Canvas Shaders, MIT Licensed

Open-source library of ultra-realistic shader effects for the web — multiple themes, zero dependencies, copy-paste integration.

Off-Axis Projection with World Labs Splats + Three.js + Face Tracking

Demo combining Blender → World Labs detailed splat generation → Three.js rendering with MediaPipe face tracking for parallax window effect.

AI-Driven Facial Animation via Depth Map Projection

Technique that generates AI facial animation video with LTX 2, extracts depth maps, and projects them onto face meshes via vertex displacement — L.A. Noire-like results for indie games.
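The projection step above amounts to sampling a depth map at each vertex of a face-aligned grid and displacing the vertex along its normal. A minimal numpy sketch under simplifying assumptions (flat grid, displacement along +Z, nearest-neighbour sampling; the function name is illustrative, not from the original pipeline):

```python
import numpy as np

def displace_mesh(depth, grid_w, grid_h, scale=1.0):
    """Displace a flat grid mesh along +Z using a depth map.

    depth: (H, W) array of per-pixel depth estimates (e.g. from a
           monocular depth model), sampled with nearest lookup here
           for simplicity.
    Returns (grid_h * grid_w, 3) vertex positions.
    """
    H, W = depth.shape
    xs = np.linspace(0.0, 1.0, grid_w)
    ys = np.linspace(0.0, 1.0, grid_h)
    gx, gy = np.meshgrid(xs, ys)
    # Nearest-neighbour sample of the depth map at each grid vertex.
    px = np.clip((gx * (W - 1)).round().astype(int), 0, W - 1)
    py = np.clip((gy * (H - 1)).round().astype(int), 0, H - 1)
    gz = depth[py, px] * scale
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
```

A production version would displace along per-vertex normals of the actual face mesh and use bilinear sampling, but the data flow, depth frame in, displaced vertices out, is the same.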

Neural Network World Emulation — Forest Trail Demo

Ollin Boer Bohan trained a neural network to mimic a real forest trail near his apartment, with an interactive web demo for real-time exploration.

Attention Residuals: Kimi Team Rethinks Depth Scaling

Replaces fixed residual accumulation with softmax attention over preceding layer outputs — 2.11% downstream gain at negligible overhead. Addresses hidden-state growth diluting layer contributions.
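The idea of replacing a fixed residual sum with attention over earlier layers can be sketched in a few lines. This is an illustrative reading of the mechanism, not the Kimi team's implementation; the function name and scoring scheme are assumptions:

```python
import numpy as np

def depth_attention(layer_outputs, query):
    """Mix preceding layer outputs with softmax attention instead of
    a fixed residual accumulation.

    layer_outputs: (L, d) hidden states from layers 0..L-1.
    query: (d,) current layer's state, used to score earlier layers.
    """
    scores = layer_outputs @ query / np.sqrt(query.shape[0])
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ layer_outputs            # attention-weighted residual
```

Because the weights are learned per token rather than fixed at 1, a deep stack can down-weight layers whose contribution would otherwise be diluted as the hidden-state norm grows.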

Mixture-of-Depths Attention (MoDA)

Each attention head attends to both sequence KV pairs at current layer AND depth KV pairs from preceding layers — 2.11% downstream improvement with 97.3% FlashAttention-2 efficiency.

SAMA: Factorized Video Editing Without External Priors

Decouples video editing into semantic anchoring and motion alignment — pre-training on motion-centric tasks alone yields strong zero-shot editing. Competitive with Kling-Omni.

3DreamBooth: 3D-Aware Subject-Driven Video Generation

Decouples spatial geometry from temporal motion via 1-frame optimization, enabling genuine 3D-aware video customization for VR/AR and virtual production without multi-view video datasets.

Spatial-TTT: Streaming Spatial Intelligence via Test-Time Training

Maintains and updates spatial evidence from unbounded video streams using test-time training with fast weights — hybrid architecture with sliding-window attention and 3D spatiotemporal convolution.

Nemotron-Cascade 2: Gold-Medal Reasoning at 3B Active Parameters

NVIDIA's 30B MoE with only 3B activated params matches frontier models on IMO/IOI/ICPC — 20× fewer parameters than DeepSeek equivalent. Open weights released.

OmniForcing: Real-Time Joint Audio-Visual Generation at 25 FPS

First framework to distill bidirectional audio-visual diffusion into a streaming autoregressive generator — solves extreme token sparsity and cross-modal sync issues for real-time generation.

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Handles geometric reconstruction from long video sequences using a hybrid memory architecture.

Browser Use CLI 2.0: 2× Faster Browser Automation via Direct CDP

Major update to browser automation tool — direct Chrome DevTools Protocol, half the cost, connect to running Chrome instances.

MoTok: Diffusion-Based Motion Tokenizer Bridges Semantic and Kinematic Control

Three-stage motion generation framework using a novel diffusion-based discrete tokenizer — reduces trajectory error from 0.72cm to 0.08cm while using 1/6 the tokens.

IndexCache: 2× Faster Sparse Attention via Cross-Layer Index Reuse

Exploits the insight that sparse attention indices are highly redundant across adjacent layers — shares indices via 'Full' and 'Shared' layers with multi-layer distillation, significantly accelerating prefill and decode.

SK-Adapter: Skeleton-Based Control for Native 3D Generation

Lightweight adapter injects 3D skeleton joint coordinates and topology into frozen 3D generation backbones via cross-attention — enables precise structural articulation control and local 3D editing.

OneWorld: 3D Scene Generation in Native 3D Space

Performs diffusion directly within coherent 3D representation space instead of 2D image/video latents — uses 3D-URAE autoencoder with cross-view correspondence loss for consistent scene generation.

V-JEPA 2.1: Dense Self-Supervised Visual Representations

Meta's V-JEPA 2.1 achieves SOTA on egocentric anticipation, robotic grasping (+20pt over V-JEPA-2), depth estimation, and navigation — dense predictive loss with deep self-supervision across encoder layers.

Week of 2026-03-09 to 2026-03-15

Streaming Spatial Intelligence: Three Papers Converge on Video-to-3D

Three independent papers this week tackle the same core problem: building persistent spatial understanding from continuous video streams. Spatial-TTT (70 upvotes) uses test-time training with fast weights as compressed spatial memory over unbounded video. Holi-Spatial (53 upvotes) builds a scalable 3DGS-based pipeline to curate spatial QA data from raw web video. LoGeR (31 upvotes) achieves dense 3D reconstruction from minutes-long video in a single feedforward pass — no post-optimization.

Analyst Note

The convergence here is striking: all three papers independently identify that the bottleneck for spatial AI is not model capacity but how spatial information is retained over time. Spatial-TTT's use of fast weights to compress spatial evidence into network parameters is the most architecturally...

Video Generation Gets Cinematic: Camera Control, Identity Lock, Infinite Length

Four papers push video generation from single-clip demos toward production-grade film tools. ShotVerse (28 upvotes) learns cinematic multi-shot camera control from naturally aligned (caption, trajectory, video) triplets. WildActor (26 upvotes) introduces a massive 18M-clip dataset for full-body identity consistency across viewpoints. DreamVideo-Omni (26 upvotes) solves multi-subject + multi-granularity motion control with identity reward learning. HiAR (21 upvotes) breaks the error-accumulation barrier for infinite-length AR video via hierarchical denoising.

Analyst Note

The shift is from 'generate a cool clip' to 'direct a sequence.' ShotVerse's data-centric approach — learning camera language from real film data rather than manual trajectory specification — is the right paradigm. WildActor addresses the ugly truth that prior identity-preserving methods were...

Coding Agents Hit Production: Claude Code Review, Autoresearch, and Agent Orchestration

The coding agent stack crosses from demos to production infrastructure. Anthropic ships multi-agent code review in Claude Code — Boris Cherny reports 200% code output per engineer, with review being the bottleneck now addressed by agent teams. Karpathy releases his 'autoresearch' setup: a minimal 630-line LLM training loop where agents iterate on training code while humans steer research direction. Varun Mathur's Autoskill extends this to distributed skill factories. AgentCraft by Ido Salomon lets you orchestrate agents via an RTS-game interface.

Analyst Note

The signal here is that Anthropic dogfoods agent-driven code review internally — this isn't a product demo, it's their actual engineering workflow. When your AI company's own engineers use agents to review agent-written code, the loop is closed. Karpathy's autoresearch is the minimal viable version...

Also Notable

SuperSplat: Walk Mode + Streamed LOD for Gaussian Splats

SuperSplat ships first-person walk mode with WASD controls, streamed LOD, and easy upload — making Gaussian Splat scenes navigable and shareable like Google Street View.

DVD: Deterministic Video Depth from Diffusion Priors

First framework to deterministically convert pre-trained video diffusion models into single-pass depth regressors — eliminates stochastic sampling and scale drift.

CARE-Edit: Dynamic Expert Routing for Image Editing

Replaces static conditioning concatenation in ControlNet with a latent-attention router that dynamically selects expert pathways per diffusion timestep — reduces artifacts from conflicting modalities.

CoCo: Code-as-Chain-of-Thought for Structured Image Generation

Uses executable code as the reasoning step before image generation — produces structured drafts via code then renders, excelling at complex spatial layouts and embedded text.

ELIT: Variable-Length Latent Tokens Decouple DiT Compute from Resolution

Drop-in DiT mechanism that inserts a learnable variable-length latent interface, enabling runtime quality-speed tradeoffs without retraining.

SVG-EAR: Training-Free Sparse Attention Speedup for Video DiTs

Parameter-free linear compensation for dropped attention blocks in video generation via error-aware routing — ~2x speedup with minimal quality loss.

Planning in 8 Tokens: Ultra-Compact World Model Tokenizer

Compresses visual observations to just 8 discrete tokens for action-conditioned world models, making real-time planning computationally feasible.

IndexCache: Cross-Layer Attention Index Reuse for LLM Speedup

Exploits cross-layer redundancy in sparse attention indexers — reuses top-k token selections across layers, cutting prefill latency 2-3x with negligible quality loss.
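The reuse pattern above can be sketched as two small functions: a "Full" layer that scores every key and keeps the top-k indices, and a "Shared" layer that attends using indices computed earlier instead of re-scoring. This is a minimal single-head numpy sketch of the general idea, with illustrative names, not IndexCache's actual implementation:

```python
import numpy as np

def topk_indices(q, k_cache, top_k):
    """'Full' layer: score all cached keys, keep the top-k indices."""
    scores = k_cache @ q                  # (seq,) dot-product scores
    return np.argsort(scores)[-top_k:]

def sparse_attend(q, k_cache, v_cache, idx):
    """Attend only over the selected key/value rows.

    A 'Shared' layer calls this with indices produced by a preceding
    'Full' layer, skipping the full-sequence scoring pass entirely.
    """
    k_sel, v_sel = k_cache[idx], v_cache[idx]
    logits = k_sel @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v_sel
```

The speedup comes from `sparse_attend` touching `top_k` rows instead of the whole sequence at every Shared layer; the redundancy claim is that the indices a Full layer picks remain near-optimal for its neighbours.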

FIRM: Robust Reward Models for Faithful Image Editing

Tackles reward model hallucinations in RL-guided image editing with curated data, base-and-bonus reward strategy, and new FIRM-Bench benchmark.

Claude Generative UI Reverse-Engineered

Michael Livs extracted Anthropic's generative UI design system from conversation exports and rebuilt it with live-streaming HTML, using morphdom DOM diffing to render into native macOS windows.

NodeToCode: Unreal Engine Blueprints to C++

Open-source tool that translates UE Blueprint visual scripts to C++ — useful for game dev teams migrating visual prototypes to production code.

Nvidia Nemotron 3 Super

Nvidia's new open model punches significantly above its weight class — r/LocalLLaMA calls it a bigger deal than the marketing suggests, with 152 comments of analysis.

VeridisQuo: Open-Source Deepfake Detector

Combines spatial and frequency-domain analysis for deepfake detection with manipulation heatmaps — open-sourced university project on r/MachineLearning.

MLX Circle-Splatting Renderer for Dimensionality Reduction

Han Xiao built pure MLX implementations of UMAP/t-SNE/PaCMAP with a scatter-add alpha-blending splatting renderer on Metal — 70K points from raw data to rendered video in seconds.
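The scatter-add splatting step has a compact numpy analogue: accumulate alpha-weighted colors and weights per pixel with an unbuffered scatter-add, then normalize. This is a simplified, order-independent sketch of the general technique (not Han Xiao's MLX code; the function name is illustrative):

```python
import numpy as np

def splat_points(points, colors, alphas, img_h, img_w):
    """Scatter-add alpha-weighted point splats into an RGB buffer.

    points: (N, 2) pixel coordinates in [0, img_w) x [0, img_h).
    colors: (N, 3) RGB per point; alphas: (N,) opacity per point.
    Each point adds alpha-weighted color; accumulated weights
    normalize the result at the end.
    """
    img = np.zeros((img_h, img_w, 3))
    wsum = np.zeros((img_h, img_w, 1))
    px = np.clip(points[:, 0].astype(int), 0, img_w - 1)
    py = np.clip(points[:, 1].astype(int), 0, img_h - 1)
    # np.add.at is unbuffered, so repeated indices accumulate.
    np.add.at(img, (py, px), colors * alphas[:, None])
    np.add.at(wsum, (py, px), alphas[:, None])
    return img / np.maximum(wsum, 1e-8)
```

On Metal the same pattern runs as a parallel scatter-add kernel, which is why tens of thousands of points render in seconds.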

PufferLib 3.0: Petabyte-Scale RL Training on One Server

Trained RL agents on 1 petabyte / 12,000 years of data on a single server — algorithmic breakthroughs, massively faster training, 10 new environments.

Week of 2026-03-10 to 2026-03-10

Karpathy's AutoResearch: RL Agents That Do Their Own ML Research

Andrej Karpathy released 'autoresearch' — a minimal framework where an RL agent iterates on neural architecture and hyperparameter research autonomously. A ~630-line nanochat LLM training core runs on a single GPU, and the agent proposes code modifications, observes validation loss, and updates via PPO. Separately, a formal paper (AutoResearch-RL) demonstrates the same concept with convergence guarantees.
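The propose-train-observe-update loop can be shown as a toy: an agent policy over candidate "modifications" (here just learning rates, standing in for real code edits), a simulated validation loss standing in for an actual training run, and a REINFORCE-style update standing in for PPO. Everything in this sketch is a simplified illustration of the loop's shape, not autoresearch's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate modifications the agent can propose (toy stand-in for
# real code edits): learning-rate choices for the training run.
candidates = [1e-4, 3e-4, 1e-3, 3e-3]
logits = np.zeros(len(candidates))  # agent's policy over proposals

def validation_loss(lr):
    """Simulated training run, standing in for 'train the model,
    report validation loss'. Minimum is near lr=1e-3 by construction."""
    return (np.log10(lr) + 3.0) ** 2 + 0.01 * rng.standard_normal()

for step in range(200):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    a = rng.choice(len(candidates), p=p)   # propose a modification
    reward = -validation_loss(candidates[a])
    # REINFORCE-style policy-gradient step (the real setup uses PPO).
    grad = -p
    grad[a] += 1.0
    logits += 0.1 * reward * grad
```

After a few hundred iterations the policy concentrates on the proposal with the lowest simulated loss, which is the whole loop in miniature: the agent's only feedback channel is the scalar it gets back from each run.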

Analyst Note

This is the 'AI does AI research' loop getting real and accessible. The fact that Karpathy stripped it to a single-file, single-GPU setup means anyone can experiment. The bigger implication: if a small RL agent can discover architecture improvements that humans missed, the compound effect over many...

Holi-Spatial: 3D Gaussian Splatting Meets Vision-Language Models at Scale

Top HF paper this week (53 upvotes). Holi-Spatial builds a scalable pipeline that evolves raw video streams into holistic 3D spatial intelligence by combining 3D Gaussian Splatting with vision-language models. Unlike prior work that reuses small hand-annotated datasets, this approach systematically annotates large-scale 3D scenes from web video data, producing spatial QA pairs with geometric accuracy and relational semantics.

Analyst Note

This bridges two hot areas — 3DGS and VLMs — in a way that could actually scale. The key insight is using 3DGS as the spatial backbone for generating training data rather than as a rendering endpoint. For Spin Master's work on onboarding/tiny stories, this kind of spatial understanding could be...

Claude Code Gets Multi-Agent Code Review

Anthropic shipped Code Review for Claude Code — a team of agents runs deep reviews on every PR. Boris Cherny (Anthropic) says engineering output is up 200% this year and reviews were the bottleneck. The system uses separate context windows per agent, leveraging test-time compute across isolated contexts rather than one large window.

Analyst Note

The architecture detail matters: separate context windows per review agent, not one shared context. This matches what we've seen with Claude Code subagents — isolated contexts catch things a single pass misses. The 200% productivity claim is bold but tracks with what I've observed: the bottleneck...

Also Notable

LoGeR: Long-Context 3D Reconstruction from Video Without Post-Optimization

Scales dense 3D reconstruction to extremely long video sequences using hybrid memory — bidirectional priors within chunks, sliding window attention across chunks. No post-optimization needed.
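The chunked attention pattern described above can be made concrete as a boolean mask: full bidirectional attention inside each chunk, plus a causal sliding window over preceding chunks. This is an illustrative reading of the hybrid memory, not the paper's exact scheme:

```python
import numpy as np

def hybrid_mask(seq_len, chunk, window):
    """Boolean attention mask: bidirectional within each chunk,
    causal sliding-window over the preceding `window` chunks.
    mask[i, j] is True when token i may attend to token j.
    """
    idx = np.arange(seq_len)
    ci = idx // chunk                       # chunk id of each token
    diff = ci[:, None] - ci[None, :]        # chunk distance i -> j
    same = diff == 0                        # bidirectional in-chunk
    prev = (diff >= 1) & (diff <= window)   # window of past chunks
    return same | prev
```

Because each token only ever sees its own chunk plus a bounded window of earlier ones, memory stays constant as the video grows, which is what makes minutes-long sequences tractable in a single feedforward pass.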

Planning in 8 Tokens: Ultra-Compact World Model Tokenizer

Encodes observations into just 8 discrete tokens for latent world models, making real-time planning computationally feasible. Conventional tokenizers use hundreds of tokens per frame.
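One way such an extreme tokenizer can work is standard vector quantization over a handful of slots: split the observation embedding into 8 chunks and snap each to its nearest codebook entry. An illustrative numpy sketch under that assumption (not the paper's actual tokenizer):

```python
import numpy as np

def tokenize_obs(feat, codebook):
    """Quantize an observation embedding into 8 discrete tokens.

    feat: (8 * d,) observation embedding.
    codebook: (K, d) learned code vectors.
    Returns 8 integer token ids, one per slot.
    """
    d = codebook.shape[1]
    slots = feat.reshape(8, d)
    # Squared distance from every slot to every code: shape (8, K).
    dists = ((slots[:, None, :] - codebook[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)
```

Planning over 8-token states means a world model's transition function operates on a sequence two orders of magnitude shorter than the hundreds of tokens per frame a conventional tokenizer emits.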

Fine-Tuned Qwen3 SLMs (0.6-8B) Beat Frontier Models on Narrow Tasks

Systematic comparison shows distilled small Qwen3 models outperform GPT-5, Claude Opus, and Gemini Flash on specific classification and function-calling tasks. The small model advantage is real for focused use cases.

VeridisQuo: Open-Source Deepfake Detector (Spatial + Frequency Analysis)

Combines spatial and frequency-domain analysis to detect deepfakes and show exactly where the face was manipulated. Open-source, with visual heatmap output.

Microsoft Copilot Cowork: Multi-Step Office Agent Built on Claude

Microsoft launched Copilot Cowork — an agent that executes multi-step workflows across Outlook, Teams, Excel, and PowerPoint autonomously. Notably built on Anthropic's Claude, not OpenAI.

WildActor: Identity-Preserving Human Video Generation

Actor-18M dataset + asymmetric identity-preserving attention for consistent full-body identity across dynamic shots and viewpoints. Addresses the copy-paste artifact problem in human video gen.

AgentCraft: Control AI Agents Like an RTS Game

Ido Salomon released AgentCraft v1 — an RTS-style interface for controlling coding agents. Early but fun concept: `npx @idosal/agentcraft`.

LTX 2.3: Open Video Generation Getting Serious

Multiple impressive LTX 2.3 text-to-video generations surfaced on r/StableDiffusion, showing significant quality jumps in open-source video models.

Depth Perception Blender Add-on via Head Tracking

CS student built a Blender add-on using real-time webcam head tracking to create natural depth perception while navigating 3D scenes. Free and open-source.

Unreal Blueprint to C++ Translator

Open-source tool that translates Unreal Engine Blueprints to C++ code. Useful for performance-critical game dev workflows.

Week of 2026-03-03 to 2026-03-09

Gaussian Splatting Breaks Into Real-Time Video

Two independent teams demonstrated real-time 4D Gaussian splatting for dynamic scenes, achieving 60fps rendering of complex video sequences. SplatFlow from ETH Zurich uses flow-guided deformation fields, while DynaSplat from Google DeepMind introduces temporal attention mechanisms. Both show dramatic quality improvements over prior dynamic NeRF approaches.

Analyst Note

This is the missing piece for production use of 3DGS in film/VFX. Static scene reconstruction was already competitive — now dynamic scenes are catching up. The Google approach is particularly interesting because it could integrate with their existing NeRF-to-mesh pipeline.

Claude 3.5 Opus Released with Native Tool Use

Anthropic released Claude 3.5 Opus with deeply integrated tool use — the model can now plan multi-step tool chains internally rather than requiring external orchestration. Early benchmarks show 40% fewer API round-trips for complex agent workflows. Coding benchmarks jump significantly, with SWE-bench scores reaching 62%.

Analyst Note

The real story isn't the benchmark numbers — it's the architecture shift. Moving tool planning inside the model eliminates the brittle prompt-engineering layer that made agent frameworks fragile. This is what OpenClaw and similar systems have been working around. Expect the orchestration layer to...

World Models Get Spatial Awareness

NVIDIA's GameGen-2 introduces a world model that understands 3D spatial relationships, enabling consistent physics when generating interactive environments. Unlike prior work that treated video generation as 2D sequence prediction, GameGen-2 maintains an implicit 3D representation that prevents the "impossible geometry" artifacts common in generated worlds.

Analyst Note

This bridges the gap between world models and actual game engines. The implicit 3D representation is key — previous approaches could generate visually convincing frames but fell apart when you needed consistent spatial reasoning (walk around a corner and back, the scene should be the same). Still...

Also Notable

Stable Diffusion 4 Preview Leaks Show Major Architecture Change

Early access users report SD4 moves to a DiT-based architecture with native video support. Quality appears to match DALL-E 3 in initial comparisons.

Codex CLI Gets Multi-File Context Window

OpenAI's Codex CLI now supports loading entire project directories into context, with smart chunking that preserves cross-file references.

NeRF-to-Mesh Pipeline Achieves Sub-Second Export

New paper from Meta demonstrates instant mesh extraction from trained NeRFs, making the NeRF→production pipeline viable for real-time applications.

GPT-5 Rumored for Q2 2026 with Native Multimodal Generation

Multiple sources suggest GPT-5 will unify text, image, video, and audio generation in a single model. No official confirmation from OpenAI.

Real-Time Neural Rendering Benchmark Published

New standardized benchmark for neural rendering methods covers speed, quality, and memory across 50 scenes. 3DGS variants dominate speed; NeRF variants win on quality.

Cursor Adds AI-Powered Git Conflict Resolution

Cursor 0.45 introduces automatic merge conflict resolution that understands both sides' intent. Early reports say it handles 80%+ of conflicts correctly.

Diffusion Models Learn to Count

Paper shows a training technique that gives diffusion models accurate object counting, solving one of the oldest failure modes in image generation.

Archive