Digest Archive — GeeBee Forge

Week of 2026-05-17 to 2026-05-24 May 24, 2026

AI Agents: Applications & Builds

Goal: to save humans from wasting time sitting in call-centre queues waiting to be answered, and to have a tool listen in on the audio stream of a live call, post IVR navigation, to determine whether the call has transitioned out. (2410.08235) LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. (arXiv:2605.13295) Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly...

Analyst Note

This matters because practical agent use cases are widening, but the durable signal is whether they solve repeated workflows instead of one-off demos. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

reddit Live Human Detector on Outbound Phone Calls [R] project 3293 project viewtopic.php huggingface learn/audio-course

Image/Video GenAI: Tools & Workflows

Was working on a personal project where I needed masks for a large custom image dataset. Tried the official SAM notebooks, but they felt more like demos than something practical for segmentation mask generation workflows, so I built a small tool around it over the weekend. (sam interactive.ipynb) Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. (AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment) I'm not the author of...

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

reddit Built an interactive SAM mask generator on Google Colab. Click any object and get clean segmentation masks instantly. github rohitchoudharymanth/sam-mask-generator official sam interactive.ipynb paper AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

3D Gaussian Splatting: Tools & Reconstruction

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. (Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving) Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly. (SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation) Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization...

Analyst Note

This matters because Gaussian splatting progress is increasingly judged by usable pipelines, not just reconstruction quality. The multi-source signal points to continued movement from research artifacts toward viewers, mesh conversion, and production workflows.

paper Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving paper SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation paper SceneAligner: 3D-Grounded Floorplan Localization in the Wild reddit Linked two Gaussian Splat scenes with a walk-through portal — no teleport, you just walk between worlds

Generative 3D Worlds: Explorable World Models

World models learn compact latent representations for planning without pixel reconstruction. LeWorldModel (LeWM), from LeCun's group at NYU, achieves stable end-to-end JEPA training by enforcing an isotropic Gaussian prior over the full latent space. (2605.09241) Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. (PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects) Autoregressive video diffusion models have enabled real-time, action-conditioned world...

Analyst Note

This matters because world-generation systems are moving from rendered clips toward persistent spaces that can be explored, edited, and exported into production tools. The multi-source signal is worth tracking for signs that 3D asset creation and simulation workflows are becoming model-driven.

reddit Sub-JEPA: a simple fix to LeCun group's LeWorldModel that consistently improves performance [P] project overview.png project cube.gif project sub-jepa

Also Notable

[R] FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

We are releasing FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition, a benchmark for evaluating whether multimodal agents like OpenClaw can actively acquire fine-grained knowledge from external evidence. The motivation is that many fine-grained visual recognition benchmarks are still close to a closed-set classification setting: given an image, the model is expected to output a label...

reddit [R] FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition paper arXiv:2605.13193 github ligeng0197/fika-bench huggingface oking0197/fika-bench (dataset) project ligeng0197/fika-bench.github.io

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of...

paper Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre-training corpus and benchmark data for...

paper TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs...

paper Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic...

paper ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act...

paper π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

Week of 2026-05-10 to 2026-05-17 May 17, 2026

AI Agents: Applications & Builds

I rewrote it as a skill, and now it uses your subscription instead of paid tokens! I also used it to implement DeepSeek V4 from scratch with the GPT-2 tokenizer and train it on TinyStories (submitted by /u/Mysterious_Hearing14). (huggingface/ml-intern) TL;DR: DeepSeek-V4-Flash running at 85.52 tok/s @ 524k ctx and ~111 tok/s @ 128k single-stream on 2× RTX PRO 6000 Max-Q. pasta-paul's DeepSeek-V4-Flash-W4A16-FP8 quant is great, but its MTP head silently gets stripped a…. (lordneel/deepseek-v4-flash-acti-mtp-w4a16-fp8) LLM-based autonomous agents have demonstrated...

Analyst Note

This matters because practical agent use cases are widening, but the durable signal is whether they solve repeated workflows instead of one-off demos. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

reddit Intern ml skill github huggingface/ml-intern github alexwortega/claude-ml-intern-skill huggingface alexwortega/ml-intern-v4-100m-tinystories-20260512-1721

Image/Video GenAI: Tools & Workflows

Cola DLM (Continuous Latent Diffusion Language Model) is a hierarchical continuous latent-space diffusion language model. (arXiv:2605.06548) Can a 5090 with qwen3.6 achieve > 3,000 tok/s? Bring your pitchforks. (open-dllm; 2605.07933v1) Today we're releasing a beta of LipDub, a new open-source lipsync capability built on LTX. LipDub is an IC-LoRA adapter that takes an existing video and replaces the dialogue by regenerating speech and lip motion together in a single pass. (lightricks/ltx-2.3-22b-ic-lora-lipdub) It was a f cking chore that took almost 9 hours but i...

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

reddit ByteDance-Seed/Cola-DLM · Hugging Face huggingface bytedance-seed/cola-dlm github bytedance-seed/cola-dlm paper arXiv:2605.06548

3D Gaussian Splatting: Tools & Reconstruction

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. (VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction) I just landed two new examples in the PlayCanvas engine that let you walk around inside a real Gaussian Splat scan — both first-person and third-person, with proper collision against the scene. (playcanvas/engine) RSR: Rapid Splat Renderer is a native Windows VR viewer for 3D Gaussian Splatting scenes, built with raw C++, Direct3D 12...

Analyst Note

This matters because Gaussian splatting progress is increasingly judged by usable pipelines, not just reconstruction quality. The multi-source signal points to continued movement from research artifacts toward viewers, mesh conversion, and production workflows.

paper VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction reddit First-person + third-person walking demos inside a Gaussian Splat scene (PlayCanvas, runs in the browser) project engine-cmbu8r47z-playcanvas.vercel.app project optimize

Generative 3D Worlds: Explorable World Models

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. (Quantitative Video World Model Evaluation for Geometric-Consistency) We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. (SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer) Generating a street-level 3D scene from a single satellite...

Analyst Note

This matters because world-generation systems are moving from rendered clips toward persistent spaces that can be explored, edited, and exported into production tools. The multi-source signal is worth tracking for signs that 3D asset creation and simulation workflows are becoming model-driven.

paper Quantitative Video World Model Evaluation for Geometric-Consistency paper SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer paper Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image reddit Is RL post-training in 'imagined environments' a path to continual learning? Trying to understand this deeper [D]

Also Notable

MOOSE-Star (ICML 2026): 7B model + 108K-paper dataset for scientific hypothesis discovery

Disclosure first: I work on community at MiroMind. One of our researchers just dropped the full MOOSE-Star collection on Hugging Face: a 7B model post-trained for scientific hypothesis discovery, plus the dataset behind it. Paper accepted at ICML 2026. Inside the collection: MS-IR-7B / MS-HC-7B / MS-7B, 7B models for inspiration retrieval, hypothesis composition, and joint use. Base: DeepSeek-R1-Distill-Qwen-7B....

reddit MOOSE-Star (ICML 2026): 7B model + 108K-paper dataset for scientific hypothesis discovery huggingface collections/zongliny paper arXiv:2603.03756 github zongliny/moose-star

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench...

paper WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Orthrus-Qwen3-8B : up to 7.8×tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution

Disclosure: co-author. Idea: inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. The diffusion head projects K=32 tokens in parallel; the AR head verifies in a second pass and accepts the longest matching prefix. Output distribution is provably identical to the base model. Results: up to 7.8× TPF, ~6× wall-clock on MATH-500. 16% of...

reddit Orthrus-Qwen3-8B : up to 7.8×tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution github chiennv2000/orthrus paper arXiv:2605.12825 huggingface chiennv/orthrus-qwen3-1.7b huggingface chiennv/orthrus-qwen3-4b huggingface chiennv/orthrus-qwen3-8b
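
The draft-then-verify loop described in the Orthrus summary can be illustrated with a toy sketch. This is not the authors' code: `toy_ar_next` and `toy_draft` are hypothetical stand-ins for the frozen AR head and the diffusion head, the sketch covers greedy decoding only, and replacing the first mismatched draft token with the AR token is an assumption borrowed from standard speculative decoding. A real system would verify all drafted positions in one parallel pass; the loop here is sequential for readability.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]
Drafter = Callable[[List[int], int], List[int]]


def toy_ar_next(context: List[int]) -> int:
    """Hypothetical stand-in for the frozen AR model's greedy next token."""
    return (sum(context) * 31 + len(context)) % 50


def toy_draft(context: List[int], k: int) -> List[int]:
    """Hypothetical drafter: usually agrees with the AR model, sometimes wrong."""
    out: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = toy_ar_next(ctx)
        if len(ctx) % 5 == 0:
            tok = (tok + 1) % 50  # deliberately inject a drafting error
        out.append(tok)
        ctx.append(tok)
    return out


def verify_step(context: List[int], draft: List[int], ar_next: NextToken) -> List[int]:
    """Accept the longest draft prefix that matches the AR model's own choices.

    On the first mismatch, the AR model's token is emitted instead (assumption),
    so every step makes progress by at least one token.
    """
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        target = ar_next(ctx)
        if tok != target:
            accepted.append(target)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted


def speculative_decode(prompt: List[int], n_tokens: int, k: int,
                       ar_next: NextToken, draft_fn: Drafter) -> List[int]:
    """Generate n_tokens by repeatedly drafting k tokens and verifying them."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(verify_step(out, draft_fn(out, k), ar_next))
    return out[: len(prompt) + n_tokens]


def greedy_decode(prompt: List[int], n_tokens: int, ar_next: NextToken) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(ar_next(out))
    return out


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(prompt, 40, 8, toy_ar_next, toy_draft)
    slow = greedy_decode(prompt, 40, toy_ar_next)
    print(fast == slow)
```

Because every emitted token is either a verified draft token or the AR model's own choice, the toy output matches plain greedy decoding token for token. The "provably identical output distribution" claim for sampled decoding would need a different acceptance rule, which this sketch does not attempt.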

Orchard: An Open-Source Agentic Modeling Framework

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and...

paper Orchard: An Open-Source Agentic Modeling Framework

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given...

paper CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
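
To make the CurveBench annotation format more concrete, here is a minimal sketch of a rooted containment tree for nested, pairwise non-intersecting closed curves. It assumes the curves are already available as polygons (the benchmark itself works from images), and the names `point_in_polygon` and `containment_tree` are illustrative, not taken from the paper. The root (`None`) stands for the unbounded outer region.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def point_in_polygon(p: Point, poly: Polygon) -> bool:
    """Even-odd ray casting: count edge crossings of a ray going right from p."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def containment_tree(curves: Dict[str, Polygon]) -> Dict[str, Optional[str]]:
    """Map each curve to its parent: the innermost curve that encloses it.

    Curves are assumed pairwise non-intersecting, so testing a single vertex
    is enough to decide whether one curve lies inside another.
    """
    ancestors = {
        name: [
            other
            for other, poly in curves.items()
            if other != name and point_in_polygon(curves[name][0], poly)
        ]
        for name in curves
    }
    parent: Dict[str, Optional[str]] = {}
    for name, enclosing in ancestors.items():
        # The direct parent is the enclosing curve that is itself nested deepest.
        parent[name] = (
            max(enclosing, key=lambda a: len(ancestors[a])) if enclosing else None
        )
    return parent


def square(x0: float, y0: float, x1: float, y1: float) -> Polygon:
    """Axis-aligned square as a counter-clockwise polygon."""
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


if __name__ == "__main__":
    curves = {
        "outer": square(0, 0, 10, 10),
        "mid": square(1, 1, 5, 5),
        "inner": square(2, 2, 3, 3),
        "side": square(6, 6, 8, 8),
    }
    print(containment_tree(curves))
```

In this toy scene, `mid` and `side` are children of `outer`, and `inner` is a child of `mid`, which is the kind of hierarchy a model would have to predict from the image alone.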

Needle: We Distilled Gemini Tool Calling Into a 26M Model

We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices. We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it. Tool calling...

reddit Needle: We Distilled Gemini Tool Calling Into a 26M Model github cactus-compute/needle github cactus-compute/cactus huggingface cactus-compute/needle

Week of 2026-05-03 to 2026-05-10 May 10, 2026

AI Agents: Applications & Builds

SenseNova dropped SenseNova-U1 on the last day of April and I’ve only found like one other mostly ignored post on this sub talking about it. (sensenova/sensenova-u1-8b-mot) In my initial post, I mentioned using turboquants. However, I forgot to include instructions for building llama.cpp with the corresponding PR. (froggeric/qwen3.6-27b-mtp-gguf) Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory....

Analyst Note

This matters because practical agent use cases are widening, but the durable signal is whether they solve repeated workflows instead of one-off demos. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

reddit SenseNova-U1-8B-MoT (novel open source multimodal understanding + image generation model) seems like a bigger deal architecturally then it’… huggingface sensenova/sensenova-u1-8b-mot github opensensenova/sensenova-skills reddit 2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat te…

Image/Video GenAI: Tools & Workflows

A C++ port of Echo-TTS - a multi-speaker TTS model with speaker reference conditioning. Runs on GPU via CUDA, using GGML for the diffusion transformer + ONNX Runtime for the DAC autoencoder. (jordandare/echo-tts) A few weeks ago I shipped vibevoice.cpp, a pure-C++ ggml port of Microsoft VibeVoice (the speech-to-speech model with voice cloning). (microsoft/vibevoice) Built a spatial style mixing tool: drop in two paintings, paint a region on your content image, hit Generate. Style A applies inside the painted region, Style B applies outside, clean boundary, no muddy averaging....

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

reddit A C++ port of Echo-TTS github jordandare/echo-tts github cirius0310/echo-tts-cpp huggingface tmdarkbr/echo-tts-gguf

3D Gaussian Splatting: Tools & Reconstruction

PlayCanvas just released SplatTransform 2.0! For anyone working with 3D Gaussian Splats, SplatTransform is an open source CLI tool and library for processing splats. (playcanvas/splat-transform) 3DGS PLY to 3D Tiles Converter This is a Node.js CLI / library for converting 3D Gaussian Splatting PLY models into explicit 3D Tiles tilesets. (williamliu-1997/3dgs-ply-3dtiles-converter) Some tests with effects and color corrections done to the images before they are fed into the training. Some view dependent. (db0c6c31) I wanted to share a project that turned out to be a deeply moving experience...

Analyst Note

This matters because Gaussian splatting progress is increasingly judged by usable pipelines, not just reconstruction quality. The multi-source signal points to continued movement from research artifacts toward viewers, mesh conversion, and production workflows.

reddit SplatTransform 2.0: Automated collision generation for 3D Gaussian Splats github playcanvas/splat-transform project b0703bc1 reddit Open-source 3D Gaussian Splatting 3D Tiles Toolchain

Generative 3D Worlds: Explorable World Models

So I recently stumbled upon Reactor's new demo of an open source world model. (I just tried Reactor's open source world model demo, here are my thoughts)

Analyst Note

This matters because world-generation systems are moving from rendered clips toward persistent spaces that can be explored, edited, and exported into production tools. The single-source signal is worth tracking for signs that 3D asset creation and simulation workflows are becoming model-driven.

reddit I just tried Reactor's open source world model demo, here are my thoughts

Also Notable

Current state of local research tools as of May 2026

I thought some folks in this community would be interested to see the current options in the local deep research field, so I spent some time collecting everything I could find. Enjoy. TL;DR: the healthiest and most local-friendly projects are "GPT Researcher" by assafelovic and "Local Deep Research" by LearningCircuit. "Local Deep Research" by LearningCircuit Observations: python alive - last...

reddit Current state of local research tools as of May 2026 github learningcircuit/local-deep-research huggingface local-deep-research/ldr-benchmarks (dataset) github stanford-oval/storm project storm-project.stanford.edu github assafelovic/gptr-mcp github assafelovic/gpt-researcher project docs.gptr.dev project gptr.dev github langchain-ai/local-deep-researcher github langchain-ai/open_deep_research github togethercomputer/open_deep_research project open-deep-research github bytedance/deer-flow project deerflow.tech github alibaba-nlp/deepresearch github miromindai/mirothinker project miromind.ai github zilliztech/deep-searcher

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

The emergence of "vibe coding" platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software, has created a need for rigorous evaluation beyond code-level benchmarks. In order to assess them as virtual software development agencies on understanding business requirements, making architectural decisions, writing production code, handling iterative...

paper SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations...

paper RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
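
The judge-orchestrated ensemble in the RaguTeam entry can be pictured with a small sketch. Everything here is a hypothetical stand-in: the generators are plain functions rather than LLM calls, the judge is a word-overlap score rather than GPT-4o-mini, and scoring each candidate independently is an assumption (the paper may use a different selection protocol).

```python
from typing import Callable, Dict, List, Tuple

Generator = Callable[[str, str], str]  # (prompt_variant, question) -> answer
Judge = Callable[[str, str], float]    # (question, answer) -> score
Candidate = Tuple[str, str, str]       # (model_name, prompt_variant, answer)


def generate_candidates(question: str,
                        models: Dict[str, Generator],
                        variants: List[str]) -> List[Candidate]:
    """Produce one candidate answer per (model, prompt variant) pair."""
    return [
        (name, variant, generate(variant, question))
        for name, generate in models.items()
        for variant in variants
    ]


def select_best(question: str, candidates: List[Candidate], judge: Judge) -> Candidate:
    """Return the candidate with the highest judge score (first one wins ties)."""
    return max(candidates, key=lambda c: judge(question, c[2]))


# Hypothetical reference passage and toy components for demonstration only.
PASSAGE = "Paris is the capital of France"


def toy_judge(question: str, answer: str) -> float:
    """Toy faithfulness score: number of words shared with the reference passage."""
    return float(len(set(answer.lower().split()) & set(PASSAGE.lower().split())))


TOY_MODELS: Dict[str, Generator] = {
    "model_a": lambda variant, q: f"{variant}: Paris is the capital of France",
    "model_b": lambda variant, q: f"{variant}: I am not sure",
}

if __name__ == "__main__":
    question = "What is the capital of France?"
    pool = generate_candidates(question, TOY_MODELS, ["concise", "grounded"])
    print(len(pool), select_best(question, pool, toy_judge))
```

With seven models and two prompting variants, the same structure would yield fourteen candidates per instance, from which the judge picks one.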

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

Robotic systems that interact with the physical world must reason about kinematic and dynamic constraints imposed by their own embodiment, their environment, and the task at hand. We introduce KinDER, a benchmark for Kinematic and Dynamic Embodied Reasoning that targets physical reasoning challenges arising in robot learning and planning. KinDER comprises 25 procedurally generated environments, a...

paper KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their ability for creative problem-solving remains underexplored. We study this capability through the lens of creative tool use, where a model repurposes available objects by reasoning about their affordances and attributes rather than relying on canonical usage. As a first step, we introduce...

paper CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy...

paper StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Week of 2026-04-26 to 2026-05-03 May 3, 2026

Image/Video GenAI: Tools & Workflows

Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. (PhyCo: Learning Controllable Physical Priors for Generative Motion) Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. (Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models) Humanoid control systems have made...

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

paper PhyCo: Learning Controllable Physical Priors for Generative Motion paper Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models paper ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control paper Leveraging Verifier-Based Reinforcement Learning in Image Editing

AI Agents: Applications & Builds

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. (GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents) We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. (Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence) Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state...

Analyst Note

This matters because practical agent use cases are widening, but the durable signal is whether they solve repeated workflows instead of one-off demos. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

paper GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents paper Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence paper Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling paper Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Generative 3D Worlds: Explorable World Models

We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of p…. (Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising)

Analyst Note

This matters because world-generation systems are moving from rendered clips toward persistent spaces that can be explored, edited, and exported into production tools. The single-source signal is worth tracking for signs that 3D asset creation and simulation workflows are becoming model-driven.

paper Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

AI Agents: Research & Evals

Autonomous scientific research is significantly advanced thanks to the development of AI agents. (AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery) Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. (The Last Human-Written Paper: Agent-Native Research Artifacts) Long-context large language models (LLMs), for example Gemini-3.1-Pro and Qwen-3.5, are widely used to empower many real-world applications, such as retrieval-augmented generation, autonomous...

Analyst Note

This matters because evaluation work is becoming the control surface for agent progress: better tests shape what builders trust, deploy, and regulate. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

paper AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery paper The Last Human-Written Paper: Agent-Native Research Artifacts paper FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption paper Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Also Notable

Heterogeneous Scientific Foundation Model Collaboration

Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world problems, especially in scientific domains where domain-specific foundation models have been developed to address specialized tasks beyond natural language. In this work, we introduce Eywa, a heterogeneous agentic...

paper Heterogeneous Scientific Foundation Model Collaboration

FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centric issue resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decision-making. These challenges are particularly pronounced for open-source LLMs...

paper FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

Agentic Fusion of Large Atomic and Language Models to Accelerate Superconductors Discovery

The discovery of novel materials is critical for global energy and quantum technology transitions. While deep learning has fundamentally reshaped this landscape, existing predictive or generative models typically operate in isolation, lacking the autonomous orchestration required to execute the full discovery process. Here we present ElementsClaw, an agentic framework for materials discovery that synergizes Large...

paper Agentic Fusion of Large Atomic and Language Models to Accelerate Superconductors Discovery

Efficient Training on Multiple Consumer GPUs with RoundPipe

Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing PP schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., the...

paper Efficient Training on Multiple Consumer GPUs with RoundPipe

Week of 2026-04-19 to 2026-04-26 April 26, 2026

AI Agents: Frontier Models

Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. (Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals,...) Access GPT Image 2.0 natively in Hermes Agent Update now to get access - just run `hermes update` and select your image generation tool model with `hermes tools`. (Access GPT Image 2.0 natively in Hermes Agent Update now to get access - just run hermes update and select your ima...)...

Analyst Note

This matters because model updates and official evaluations reset expectations for what agent stacks can attempt, while also creating new operating and safety assumptions. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

twitter Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals,... twitter Access GPT Image 2.0 natively in Hermes Agent Update now to get access - just run hermes update and select your ima... twitter GPT-5.5 is now accessible in Hermes Agent through the ChatGPT/Codex OAuth provider. Run hermes update to access now... twitter Anthropic is in trouble unless they ship image gen on par with GPT-image-2. It changes the entire workflow of buildin...

Image/Video GenAI: Model Releases

What makes ChatGPT Images 2.0 a state-of-the-art image generation model? Researchers behind the model explain. (What makes ChatGPT Images 2.0 a state-of-the-art image generation model? Researchers behind the model explain. A thre...) Built a time machine powered by OpenAI’s new image generation model. Describe where and when you want to go, and it creates an immersive panoramic world you can explore. (Built a time machine powered by OpenAI’s new image generation model. Describe where and when you want to go, and it c...)

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

twitter What makes ChatGPT Images 2.0 a state-of-the-art image generation model? Researchers behind the model explain. A thre... twitter Built a time machine powered by OpenAI’s new image generation model. Describe where and when you want to go, and it c...

Generative 3D Worlds: Interactive 3D Worlds

Rapid 3D scene generation from text or images. (Rapid 3D scene generation from text or images)

Analyst Note

This matters because world-generation systems are moving from rendered clips toward persistent spaces that can be explored, edited, and exported into production tools. The single-source signal is worth tracking for signs that 3D asset creation and simulation workflows are becoming model-driven.

twitter Rapid 3D scene generation from text or images

Also Notable

GIANT e-ink display LIVE in my house and actively removing the “mental load” of motherhood 😅 Turns out my household ...

GIANT e-ink display LIVE in my house and actively removing the “mental load” of motherhood 😅 Turns out my household chaos just needed to be tamed by a display my team of @openclaw and @NousResearch Hermes agents manage for me 💅. (GIANT e-ink display LIVE in my house and actively removing the “mental load” of motherhood 😅 Turns out my household ...)

twitter GIANT e-ink display LIVE in my house and actively removing the “mental load” of motherhood 😅 Turns out my household ...

Week of 2026-04-12 to 2026-04-19 April 19, 2026

Image/Video GenAI: Model Releases

Last week in Generative Image & Video. (h-embodvis/numina) IC-LoRA-Detailer: It's for post-processing, not just rendering (LTX2.3). (url) DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max). (bstnxbt/dflash-mlx) Nucleus-Image Released. (nucleusai/nucleus-image) Just saw this new technical blog from SenseNova (SenseTime) and it looks like the "Frankenstein" era of sticking different models together might be ending. (vh5se45d8b) Signal spans 13 source domains.

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

reddit Last week in Generative Image & Video github h-embodvis/numina project h-embodvis/numina project gordonchen19/prompt-relay

AI Agents: Frontier Models

Mythos is Mostly Hype... (also the bugs it found were mostly unexploitable and exaggerated...). (verbose 250-page report) AI Security Institute Findings on Claude Mythos Preview. (aisi.gov.uk) OpenAI rolls out GPT-5.4-Cyber to limited group for testing, seeks to rival Claude Mythos. (openai.com) Gemini Robotics ER-1.6 enhances reasoning to help robots navigate real-world tasks. (Gemini Robotics ER-1.6 enhances reasoning to help robots navigate real-world tasks) Anthropic is set to release Claude Opus 4.7 and a new AI design tool as early as this week....

Analyst Note

This matters because model updates and official evaluations reset expectations for what agent stacks can attempt, while also creating new operating and safety assumptions. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

reddit Mythos is Mostly Hype... (also the bugs it found were mostly unexploitable and exaggerated...) project tomshardware.com project clearthis.page project aisle.com

Generative 3D Worlds: Explorable World Models

Today, we released Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale, from NVIDIA Research. (Today, we released Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale, from NVIDIA Resear...) Tencent HY-World 2.0 appears to be dropping on April 15 — open-source multimodal 3D world generation from Tencent Hunyuan. (world)

Analyst Note

This matters because world-generation systems are moving from rendered clips toward persistent spaces that can be explored, edited, and exported into production tools. The multi-source signal is worth tracking for signs that 3D asset creation and simulation workflows are becoming model-driven.

twitter Today, we released Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale, from NVIDIA Resear... reddit Tencent HY-World 2.0 appears to be dropping on April 15 — open-source multimodal 3D world generation from Tencent Hunyuan twitter @dylantfwang official world

3D Gaussian Splatting: Tools & Reconstruction

RSR: Rapid Splat Renderer - Free Native D3D12 Windows Desktop/VR DLSS Viewer for Gaussian Splatting. (warpgatelabs/rsr)

Analyst Note

This matters because Gaussian splatting progress is increasingly judged by usable pipelines, not just reconstruction quality. The multi-source signal points to continued movement from research artifacts toward viewers, mesh conversion, and production workflows.

reddit RSR: Rapid Splat Renderer - Free Native D3D12 Windows Desktop/VR DLSS Viewer for Gaussian Splatting github warpgatelabs/rsr

Also Notable

Claude vs GPT in a bomberman-style 1v1 game

A few weeks ago, ARC-AGI 3 was released. For those unfamiliar, it’s a benchmark designed to study agentic intelligence through interactive environments. I'm a big fan of these kinds of benchmarks as IMO they reveal so much more about the capabilities and limits of agentic AI than static Q&A benchmarks. They are also more intuitive to understand when you are able to actually see how the model behaves in these...

reddit Claude vs GPT in a bomberman-style 1v1 game github klemenvod/tokenbrawl

TUI to see where Claude Code tokens actually go

been spending $200+/day on claude code and had zero visibility into what was eating the tokens. ccusage shows cost per model per day which is great but i wanted to know - is it the debugging thats expensive? the brainstorming? which project is burning the most? it reads the session transcripts claude code already stores on disk (~/.claude/projects/) and classifies every turn into 13 categories based on tool usage...

reddit TUI to see where Claude Code tokens actually go github agentseal/codeburn

Claude + Playwright to teardown websites and unearth dark pattern trackers & feature flags (oss)

i'm building agents for procurement & one thread has been to let claude systematically deconstruct a website so agents can navigate them. but as i've been doing this, like a piñata, interesting things keep falling off -- from trackers, to interesting feature flags to even some over-exposed data. so i naturally claude-coded it into an oss repo + website (teardown.fyi) i've run it for about ~125 websites so far -...

reddit Claude + Playwright to teardown websites and unearth dark pattern trackers & feature flags (oss) github oss repo project teardown.fyi

Why Claude Code Max burns limits 40% faster with 20K less usable context. Proxy evidence inside.

TL;DR: Claude Code v2.1.100+ silently adds ~20K invisible tokens to every request, server-side. This eats your limits faster AND may degrade output quality. Downgrade to v2.1.98 for immediate relief. Proxy evidence below. --- I run Claude Code Max (5x plan) heavily — 3-5 parallel sessions, custom orchestration, the whole deal. Two weeks ago my usage limits started hitting way earlier than expected. What used to last...

reddit Why Claude Code Max burns limits 40% faster with 20K less usable context. Proxy evidence inside. github anthropics/claude-code

I built a Claude Code plugin that extracts any website's full design system

Just type `/extract-design` in Claude Code and it pulls the entire design language — colors, fonts, spacing, shadows, components, everything. The main output is a markdown file specifically structured for Claude to understand. So you can extract a site's design, then tell Claude "build me a landing page using this design system" and it actually nails it because it has the exact tokens, scales, and component...

reddit I built a Claude Code plugin that extracts any website's full design system project stripe.com github manavarya09/design-extract

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to lay…

Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough. 23 tok/s is still rough but honestly noticeably faster when streaming responses. Tl;dr: We keep track of which experts get routed to most frequently for the past N tokens. We make a bet that the processing speed-up from loading...

reddit Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to lay… github parmesanparty/llama.cpp
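The tracking step described in the post (count which experts were routed to most often over the past N tokens, then keep the most frequent ones on the GPU) could look roughly like the sketch below. This is an illustration only, not the code from the llama.cpp fork: the class name `HotExpertTracker`, the window size, and the number of cache slots are all assumptions, and the actual transfer of expert weights between host memory and VRAM is left out.

```python
from collections import Counter, deque


class HotExpertTracker:
    """Track which MoE experts were routed to most often over the last N tokens.

    Hypothetical sketch: names and parameters are illustrative and are not
    taken from the llama.cpp fork discussed above.
    """

    def __init__(self, window_tokens: int, cache_slots: int):
        self.window = deque()            # one tuple of expert ids per token
        self.window_tokens = window_tokens
        self.cache_slots = cache_slots   # how many experts fit in the GPU cache
        self.counts = Counter()

    def record(self, expert_ids):
        """Record the experts the router selected for one token."""
        ids = tuple(expert_ids)
        self.window.append(ids)
        self.counts.update(ids)
        if len(self.window) > self.window_tokens:
            # Forget the oldest token so counts only reflect the recent window.
            for expert in self.window.popleft():
                self.counts[expert] -= 1
                if self.counts[expert] == 0:
                    del self.counts[expert]

    def hot_experts(self):
        """Return the experts that should occupy the limited cache slots."""
        return [expert for expert, _ in self.counts.most_common(self.cache_slots)]


# Toy routing trace: each tuple is the set of experts used for one token.
tracker = HotExpertTracker(window_tokens=3, cache_slots=2)
for routed in [(0, 1), (1, 2), (1, 3), (3, 4)]:
    tracker.record(routed)
print(tracker.hot_experts())
```

The sliding window is what makes this a bet rather than a guarantee: experts that were popular recently are assumed to stay popular for the next few tokens, so the cost of loading them is only repaid if routing is locally stable.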

Week of 2026-04-09 to 2026-04-12 April 12, 2026

AI Agents: Developer Tools

After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open source model agnostic framework. (After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open sourc...) Update on Gemma 4 having MTP: Reverse engineering effort. (Update on Gemma 4 having MTP: Reverse engineering effort) I made a USB-Clawd who gets my attention when Claude Code finishes a response. (I made a USB-Clawd who gets my attention when Claude Code finishes a response) Did you know you can get your claude through hermes...

Analyst Note

This matters because coding agents are becoming part of the developer toolchain, so reliability, context handling, and repository-level feedback loops are becoming product requirements. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

twitter After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open sourc... reddit Update on Gemma 4 having MTP: Reverse engineering effort twitter I made a USB-Clawd who gets my attention when Claude Code finishes a response twitter Did you know you can get your claude through hermes without any fancy tricks? Just type /claude-code <prompt> I...

Image/Video GenAI: Tools & Workflows

Last week in Generative Image & Video. (Last week in Generative Image & Video) Could HappyHorse be Z-video in disguise, from Alibaba?. (Could HappyHorse be Z-video in disguise, from Alibaba?) Qwen3.5-4B-Base-ZitGen-V1. (Qwen3.5-4B-Base-ZitGen-V1)

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

reddit Last week in Generative Image & Video reddit Could HappyHorse be Z-video in disguise, from Alibaba? reddit Qwen3.5-4B-Base-ZitGen-V1

3D Gaussian Splatting: Tools & Reconstruction

🚨Two lines of code. A full explorable 3D scene generated in seconds from a text prompt. (🚨Two lines of code. A full explorable 3D scene generated in seconds from a text prompt. It's called WorldGen -- an o...) 🌧️❄️🌨️Procedural rain and snow as Gaussian splats — infinite particles that move with the camera, fully rendered on the GPU. (🌧️❄️🌨️Procedural rain and snow as Gaussian splats — infinite particles that move with the camera, fully rendered on...) I've been getting a lot of questions recently about gaussian splatting and where it might be used, so I figured I would put together of...

Analyst Note

This matters because Gaussian splatting progress is increasingly judged by usable pipelines, not just reconstruction quality. The multi-source signal points to continued movement from research artifacts toward viewers, mesh conversion, and production workflows.

twitter 🚨Two lines of code. A full explorable 3D scene generated in seconds from a text prompt. It's called WorldGen -- an o... twitter 🌧️❄️🌨️Procedural rain and snow as Gaussian splats — infinite particles that move with the camera, fully rendered on... reddit Gaussian Splatting uses in the real world

AI Agents: Applications & Builds

An actress Milla Jovovich just released a free open-source AI memory system that scored 100% on LongMemEval, beating every paid solution. (An actress Milla Jovovich just released a free open-source AI memory system that scored 100% on LongMemEval, beating every paid solution) Safetensors and Helion have joined PyTorch Foundation as foundation-hosted projects to secure model distribution for trusted agentic solutions and simplify kernel development across the open source AI ecosystem....

Analyst Note

This matters because practical agent use cases are widening, but the durable signal is whether they solve repeated workflows instead of one-off demos. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

reddit An actress Milla Jovovich just released a free open-source AI memory system that scored 100% on LongMemEval, beating every paid solution twitter Safetensors and Helion have joined PyTorch Foundation as foundation-hosted projects to secure model distribution for ... twitter Hermes bros don’t have any 3090s /model google/gemma-4-31b-it:free Openrouter has Gemma 31b for “free” at 25t/s Which... twitter How to think about the Agentic OS 8:10 - From early experiments to an entirely new OS 12:51 - Early bet on spatial UI...

Also Notable

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered b...

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. (Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered b...)

twitter Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered b...

We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an execut...

We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an executor, and get near Opus-level intelligence in your agents at a fraction of the cost. (We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an execut...)

twitter We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an execut...

We just gave every Hermes Agent a free cloud browser. Use the Browser Use ecosystem in Hermes Agent for free > Unl...

We just gave every Hermes Agent a free cloud browser. Use the Browser Use ecosystem in Hermes Agent for free > Unlimited browser hours > Free proxies > Persistent authentication @NousResearch 🤝 Browser Use. (We just gave every Hermes Agent a free cloud browser. Use the Browser Use ecosystem in Hermes Agent for free > Unl...)

twitter We just gave every Hermes Agent a free cloud browser. Use the Browser Use ecosystem in Hermes Agent for free > Unl...

Too dangerous to release

Over the past several days, there has been a lot of internet discourse around Claude Mythos being held back from public release. Many people have been claiming this is somehow yet another devious marketing tactic meant to somehow weigh down Dario's pocketbook by... not letting people pay to access the model. Claims of hype and power consolidation and other self-congratulatory motives are easy to find online, but I...

reddit Too dangerous to release

AMD AI directors analysis confirms lobotomization of Claude

Stella Laurenzo, AMD’s director of AI, filed a detailed GitHub issue on April 2 documenting that Claude Code reads code three times less before editing it, rewrites entire files twice as often, and abandons tasks mid-way at rates that were previously zero. Her analysis of nearly 7,000 sessions puts precise numbers on how Anthropic’s coding tool has degraded since early March. PERFORMANCE DECLINE: AMD’s AI director...

reddit AMD AI directors analysis confirms lobotomization of Claude

AMD's senior director of AI thinks 'Claude has regressed' and that it 'cannot be trusted to perform complex engineering'

This is vindicating for all the people that have been screaming out that Anthropic simply doesn't want to release Mythos because they do not have the compute, not because the model is "too powerful." Summary of the findings: >On April 2, AMD’s Director of AI, Stella Laurenzo, filed a GitHub issue detailing a severe degradation in Claude Code's performance since early March. Based on an analysis of nearly 7,000...

reddit AMD's senior director of AI thinks 'Claude has regressed' and that it 'cannot be trusted to perform complex engineering'

Week of 2026-04-09 to 2026-04-09 April 9, 2026

AI Agents: Research & Evals

While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. (Adam's Law: Textual Frequency Law on Large Language Models) Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. (Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents) OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the...

Analyst Note

This matters because evaluation work is becoming the control surface for agent progress: better tests shape what builders trust, deploy, and regulate. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

paper Adam's Law: Textual Frequency Law on Large Language Models paper Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents paper Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw paper Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Image/Video GenAI: Tools & Workflows

Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamic…. (DARE: Diffusion Large Language Models Alignment and Reinforcement Executor) We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. (Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision) Humans paint images incrementally...

Analyst Note

This matters because visual generation is shifting from novelty outputs toward controllable production workflows. The multi-source signal suggests builders should watch tools that improve editing precision, repeatability, and model integration.

paper DARE: Diffusion Large Language Models Alignment and Reinforcement Executor paper Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision paper Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning paper General Multimodal Protein Design Enables DNA-Encoding of Chemistry

3D Gaussian Splatting: Tools & Reconstruction

Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. (Fast Spatial Memory with Elastic Test-Time Training)

Analyst Note

This matters because Gaussian splatting progress is increasingly judged by usable pipelines, not just reconstruction quality. The single-source signal points to continued movement from research artifacts toward viewers, mesh conversion, and production workflows.

paper Fast Spatial Memory with Elastic Test-Time Training

AI Agents: Applications & Builds

RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. (RAGEN-2: Reasoning Collapse in Agentic RL) We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. (MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU) Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code....

Analyst Note

This matters because practical agent use cases are widening, but the durable signal is whether they solve repeated workflows instead of one-off demos. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

paper RAGEN-2: Reasoning Collapse in Agentic RL paper MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU paper Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning paper Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?

Also Notable

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly-tailored task-specific skills for each task, whereas in many realistic settings...

paper How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Experience Transfer for Multimodal LLM Agents in Minecraft Game

Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we propose Echo, a transfer-oriented memory framework that enables agents to derive actionable knowledge from prior interactions rather than treating memory as a passive repository of static records. To make transfer explicit, Echo decomposes reusable knowledge into five...

paper Experience Transfer for Multimodal LLM Agents in Minecraft Game

Neural Computers

We propose a new frontier: Neural Computers (NCs) -- an emerging machine form that unifies computation, memory, and I/O in a learned runtime state. Unlike conventional computers, which execute explicit programs, agents, which act over external execution environments, and world models, which learn environment dynamics, NCs aim to make the model itself the running computer. Our long-term goal is the Completely Neural...

paper Neural Computers

REAM: Merging Improves Pruning of Experts in LLMs

Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reduce memory requirements include weight pruning and quantization. Motivated by the Router-weighted Expert Activation Pruning (REAP) that prunes experts, we propose a novel...

paper REAM: Merging Improves Pruning of Experts in LLMs

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph...

paper The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open sourc...

After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open source model agnostic framework. He studied the architecture, focused on the multi-agent orchestration layer (the coordinator that breaks goals into tasks, team. (After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open sourc...)

twitter After the Claude Code source code leak, a former PM extracted its multi-agent orchestration system into an open sourc...

Week of 2026-04-08 to 2026-04-08 April 8, 2026

AI Agents: Research & Evals

While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. (Adam's Law: Textual Frequency Law on Large Language Models) Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. (Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents) OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the...

Analyst Note

This matters because evaluation work is becoming the control surface for agent progress: better tests shape what builders trust, deploy, and regulate. The multi-source signal is worth tracking for changes in capability, deployment friction, or operating risk.

paper Adam's Law: Textual Frequency Law on Large Language Models paper Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents paper Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw paper Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Week of 2026-03-23 to 2026-03-29 March 29, 2026

Claude Gets a Body: Computer Use Ships in Cowork

Anthropic launched Claude Cowork with Computer Use in research preview — Claude can now open apps, navigate browsers, fill spreadsheets, and operate your desktop directly. It prioritizes connected integrations (Slack, Calendar) before falling back to screen-level control.

Analyst Note

This is the 'agent-on-your-desktop' paradigm going mainstream. The real unlock isn't the screen control (which is slow and fragile) — it's the integration-first approach. When Claude can use Slack and Calendar natively and only falls back to pixels when it has to, you get reliability where it...

reddit Claude can now use your computer (r/ClaudeAI) blog Anthropic testing 'Mythos' — most powerful model ever (Fortune) reddit Giving Claude access to macOS (r/ClaudeAI)
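The integration-first behaviour described above (use a connected integration when one exists, otherwise fall back to screen-level control) can be sketched as a simple dispatch. This is a hypothetical illustration, not Anthropic's implementation: `run_action`, the `integrations` mapping, and the handler functions are all invented names for the example.

```python
def run_action(action, integrations, screen_control):
    """Try a native integration first; fall back to screen-level control.

    Hypothetical sketch: `integrations` maps an app name to a callable, and
    `screen_control` is the slower, more fragile pixel-level fallback.
    """
    handler = integrations.get(action["app"])
    if handler is not None:
        try:
            return ("integration", handler(action))
        except NotImplementedError:
            pass  # the integration exists but does not support this action
    return ("screen", screen_control(action))


def slack_handler(action):
    # Stand-in for a native API call; only posting messages is supported.
    if action["verb"] != "post_message":
        raise NotImplementedError(action["verb"])
    return f"posted: {action['text']}"


def screen_handler(action):
    # Stand-in for clicking and typing through the UI.
    return f"clicked through {action['app']} UI"


integrations = {"slack": slack_handler}
via_api = run_action(
    {"app": "slack", "verb": "post_message", "text": "hi"},
    integrations,
    screen_handler,
)
via_screen = run_action(
    {"app": "spreadsheet", "verb": "fill_cell"},
    integrations,
    screen_handler,
)
print(via_api, via_screen)
```

The point of the ordering is that the fragile path is only taken when no structured path exists, so most routine actions never touch the screen at all.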

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

LoGeR tackles a fundamental bottleneck in 3D reconstruction from video: maintaining geometric consistency across long sequences. The paper introduces a hybrid memory architecture that combines explicit geometric features with learned latent representations, enabling coherent reconstruction from hundreds of frames where previous methods degrade.

Analyst Note

Most reconstruction methods choke on long videos — they either drift or run out of memory. LoGeR's hybrid approach (explicit geometry for precision + learned latents for compression) is the kind of architecture that could make casual phone-scan-to-3D actually reliable. The Google/Berkeley pedigree...

paper LoGeR (arXiv 2603.03269) twitter @_akhaliq on X

L.A. Noire-Style Facial Animation with AI Video Depth Maps

Independent developer Alex shared a practical pipeline for game facial animation: generate a facial animation video using AI (LTX 2), extract a depth map sequence from it, then project the depth onto a face mask using vertex displacement. The result is nuanced, realistic facial movement reminiscent of L.A. Noire's MotionScan — but built entirely from commodity AI tools.

Analyst Note

L.A. Noire spent years and millions on custom capture rigs to get that facial fidelity. This pipeline achieves something similar with a text-to-video model and a depth estimator — tools anyone can run. It's a glimpse of how AI is collapsing the cost of high-fidelity game content. The...

twitter @alexfredo87 on X
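The last step of that pipeline, projecting a depth frame onto a face mesh by vertex displacement, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the developer's actual code: it assumes each vertex carries a UV coordinate into the depth frame, that depth values are normalised to [0, 1], and that displacement happens along the vertex normal. The function names and the `scale` parameter are hypothetical.

```python
def sample_depth(depth_map, u, v):
    """Nearest-neighbour lookup of a depth value at UV coordinates in [0, 1]."""
    rows, cols = len(depth_map), len(depth_map[0])
    x = min(cols - 1, max(0, round(u * (cols - 1))))
    y = min(rows - 1, max(0, round(v * (rows - 1))))
    return depth_map[y][x]


def displace_vertices(positions, normals, uvs, depth_map, scale=1.0):
    """Push each vertex along its normal by the depth sampled at its UV."""
    displaced = []
    for (px, py, pz), (nx, ny, nz), (u, v) in zip(positions, normals, uvs):
        d = sample_depth(depth_map, u, v) * scale
        displaced.append((px + nx * d, py + ny * d, pz + nz * d))
    return displaced


# A flat 2x2 "face mask" facing +z, and one 2x2 depth frame (assumed data).
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
normals = [(0.0, 0.0, 1.0)] * 4
uvs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
depth_frame = [[0.0, 1.0],
               [0.5, 0.25]]

displaced = displace_vertices(positions, normals, uvs, depth_frame, scale=0.1)
print(displaced)
```

For animation, the same function would be called once per frame of the depth map sequence, always starting from the undisplaced mesh so that errors do not accumulate.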

Also Notable

Radiant: 80+ Production Web Shaders, MIT Licensed

Zero-dependency library of production-ready WebGL and Canvas 2D shaders with multiple color themes. Copy, integrate, ship.

twitter @pbakaus on X

Off-Axis Projection with World Labs Splats + Face Tracking

Ian Curtis demoed a parallax window effect using Blender → World Labs Gaussian splat → Three.js with MediaPipe face tracking for off-axis projection.

twitter @XRarchitect on X
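For readers unfamiliar with off-axis projection, the core calculation is small: the tracked head position shifts the view frustum so the screen behaves like a window. The sketch below is a generic version of that calculation, not code from the demo. It assumes a screen centred at the origin in the z = 0 plane and an eye position given in the same units; the function name and parameters are illustrative.

```python
def off_axis_frustum(eye, half_width, half_height, near):
    """Frustum bounds at the near plane for a screen centred at the origin.

    eye = (ex, ey, ez) is the viewer position relative to the screen centre,
    with ez > 0 being the distance from the screen plane (z = 0).
    """
    ex, ey, ez = eye
    if ez <= 0:
        raise ValueError("eye must be in front of the screen (ez > 0)")
    k = near / ez  # similar triangles: scale screen edges onto the near plane
    left = (-half_width - ex) * k
    right = (half_width - ex) * k
    bottom = (-half_height - ey) * k
    top = (half_height - ey) * k
    return left, right, bottom, top


# Viewer centred in front of the screen: the frustum is symmetric.
centred = off_axis_frustum((0.0, 0.0, 1.0), 0.5, 0.25, 0.1)
# Viewer moved to the right: the frustum skews to the left.
shifted = off_axis_frustum((0.2, 0.0, 1.0), 0.5, 0.25, 0.1)
print(centred, shifted)
```

The four bounds are the left/right/bottom/top values an OpenGL-style frustum matrix takes. In a full setup the camera is also translated to the eye position each frame, with the eye position coming from the face tracker.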

Neural Network Mimics a Forest Trail

Ollin Boer Bohan trained a neural network to reproduce a forest trail near his apartment as an interactive web experience — a tiny world model you can walk through in the browser.

twitter @madebyollin on X blog Web demo + writeup

Speed by Simplicity: Single-Stream Audio-Video Generation

A single-stream transformer architecture for joint audio-video generation that's significantly faster than multi-stream approaches while maintaining quality. 115 HF upvotes.

paper arXiv 2603.21986

LeWorldModel: JEPA World Model from Pixels

A stable end-to-end Joint-Embedding Predictive Architecture that learns world models directly from pixel observations — continuing the LeCun JEPA lineage.

paper arXiv 2603.19312

VibeVoice: High-Quality Speech Synthesis

Top-trending HF paper this week (144 upvotes). Technical report on a new speech synthesis approach.

paper arXiv 2508.19205

Cohere Transcribe: SOTA Open-Source Transcription in Browser

Cohere released an open-source transcription model that runs in-browser with weights on HuggingFace. Claims state-of-the-art for open models.

twitter @nickfrosst on X

Browser Use CLI 2.0: 2x Speed, Direct CDP

Major update to the browser automation CLI — 2x faster, half the cost, direct Chrome DevTools Protocol integration instead of high-level abstractions.

twitter @browser_use on X

Cheng Lou's 'Foundational Piece of UI Engineering'

Former React team member Cheng Lou dropped what he calls one of the more important foundational pieces of UI engineering for the foreseeable future. 29K likes, massively viral.

twitter @_chenglou on X

Figure 03 Robot Flipping Packages

Salesforce CEO Benioff posted video of Figure's latest humanoid robot handling packages with notable dexterity improvements.

reddit r/singularity

AI + AlphaFold = Custom mRNA Cancer Vaccine for a Dog

An Australian tech entrepreneur used ChatGPT and AlphaFold to design a custom mRNA vaccine that significantly reduced his dog's tumor. Researchers involved are reportedly excited by the results.

reddit r/singularity blog The Australian

4D OSINT Reconstruction of Iran Strikes

Bilawal Sidhu deployed an AI agent swarm to capture every OSINT signal during Operation Epic Fury, then built a full 4D temporal reconstruction in WorldView — scrubbing through the strikes minute by minute.

twitter @bilawalsidhu on X

AgentCraft: RTS-Style Agent Control

Control your coding agents like units in a real-time strategy game. Early but creative take on agent orchestration UX.

twitter @idosal1 on X

Internal Safety Collapse in Frontier LLMs

Paper documenting how safety guardrails in frontier models can catastrophically collapse under certain conditions. 30 HF upvotes.

paper arXiv 2603.23509

Bernie Sanders: Pause AI Data Center Construction

Legislation introduced to pause AI data center building and pursue international coordination, plus a ban on chip exports. The political angle on AI scaling heats up.

reddit r/ChatGPT

Week of 2026-03-16 to 2026-03-22 March 22, 2026

Video Diffusion Models Are Secretly Reasoning Engines with 3D Understanding

Two major papers reveal that video diffusion models encode far more than visual generation. "Demystifying Video Reasoning" (347 HF upvotes) shows reasoning emerges through denoising steps — not frame sequences — via a Chain-of-Steps mechanism with working memory, self-correction, and perception-before-action. Separately, VEGA-3D demonstrates that video generation models implicitly learn robust 3D structural priors and physical laws, which can be extracted as a plug-and-play "Latent World Simulator" to give MLLMs spatial understanding without explicit 3D supervision.

Analyst Note

These findings reframe video diffusion as more than a generative tool — it's a substrate for spatial intelligence and reasoning. For rendering teams, this means video generation backbones could become general-purpose scene understanding modules. The VEGA-3D approach of repurposing generative priors...

paper Demystifying Video Reasoning (347 upvotes) paper VEGA-3D: Generation Models Know Space

World Models Break Out of the Lab: Real Cities, Stereo VR, and 4D Robotics

This week saw a convergence of world model papers moving from toy environments to real-world grounding. Seoul World Model generates navigable video grounded in actual street-view imagery over hundreds of meters. MosaicMem introduces hybrid 3D/implicit spatial memory enabling minute-level consistent navigation in video world models. WorldCam uses camera pose as a unifying geometric representation for interactive 3D gaming worlds. StereoWorld produces end-to-end stereo video for VR without depth estimation. Kinema4D builds a 4D generative robotic simulator with URDF-based robot control and...

Analyst Note

World models are fragmenting into specialized niches — navigation, gaming, robotics, VR — but sharing architectural DNA (video diffusion + spatial conditioning). For Riverside's rendering pipeline, the stereo/VR angle (StereoWorld) and the real-city grounding approach (Seoul World Model with...

paper Seoul World Model: Grounding in a Real Metropolis paper MosaicMem: Hybrid Spatial Memory for Video World Models paper WorldCam: Interactive 3D Gaming Worlds via Camera Pose paper StereoWorld: Camera-Guided Stereo Video Generation paper Kinema4D: Kinematic 4D World Modeling for Embodied Simulation

3D Reconstruction and Generation: Physics-Grounded, Continuous LoD, and Semantic Tokenization

Several advances push 3D pipelines toward production readiness. HSImul3R (149 HF upvotes) introduces physics-in-the-loop 3D reconstruction of human-scene interactions — using the physics simulator as an active supervisor to jointly refine dynamics and geometry, producing outputs directly deployable to humanoid robots. Matryoshka Gaussian Splatting enables continuous level-of-detail from a single 3DGS model via stochastic budget training — any prefix of the ordered Gaussian set produces a coherent reconstruction. M³ augments multi-view foundation models with dense matching for monocular...
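The prefix property described for Matryoshka Gaussian Splatting can be illustrated with a small sketch. It assumes the Gaussians are already stored in the learned order; the helper names and the uniform budget sampling are assumptions for illustration, not the paper's training recipe.

```python
import numpy as np

def lod_prefix(ordered_gaussians, budget_fraction):
    """Return the first k Gaussians, where k is set by the render budget."""
    n = len(ordered_gaussians)
    fraction = min(max(budget_fraction, 0.0), 1.0)
    k = max(1, int(np.ceil(n * fraction)))
    return ordered_gaussians[:k]

def sample_training_budget(rng, n):
    """Pick a random prefix length so every prefix is seen during training."""
    return int(rng.integers(1, n + 1))
```

At render time a client could pick `budget_fraction` from its frame-time budget and draw only that prefix.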

Analyst Note

The HSImul3R physics-in-the-loop approach is a paradigm worth watching: treating the simulator as a differentiable supervisor rather than just a downstream consumer. Matryoshka GS is immediately practical — continuous LoD from a single model is exactly what streaming/adaptive rendering needs. The...

paper HSImul3R: Physics-in-the-Loop 3D Human-Scene Reconstruction paper Matryoshka Gaussian Splatting: Continuous LoD for 3DGS paper M³: Dense Matching for Monocular GS SLAM paper LoST: Level of Semantics Tokenization for 3D Shapes paper MonoArt: Monocular Articulated 3D Reconstruction

Also Notable

NVIDIA DLSS 5 Sparks "AI Slop" Controversy

DLSS 5 appears to generate AI imagery on top of upscaling rather than just enhance existing frames — characters look like different people, drawing sharp criticism from gamers and CG professionals.

twitter @SynthPotato on DLSS 5 hallucination issues twitter @NikTek DLSS 5 comparison

Radiant: 80+ Production-Ready WebGL/Canvas Shaders, MIT Licensed

Open-source library of ultra-realistic shader effects for the web — multiple themes, zero dependencies, copy-paste integration.

twitter radiant-shaders.com launch

Off-Axis Projection with World Labs Splats + Three.js + Face Tracking

Demo combining Blender → World Labs detailed splat generation → Three.js rendering with MediaPipe face tracking for a parallax window effect.

twitter @XRarchitect off-axis projection demo

AI-Driven Facial Animation via Depth Map Projection

Technique generating AI video facial animations (LTX 2), extracting depth maps, and projecting onto face meshes via vertex displacement — L.A. Noire-like results for indie games.

twitter @alexfredo87 facial animation pipeline

Neural Network World Emulation — Forest Trail Demo

Ollin Boer Bohan trained a neural network to mimic a real forest trail near his apartment, with an interactive web demo for real-time exploration.

blog World emulation via DNN blog post twitter @madebyollin tweet

Attention Residuals: Kimi Team Rethinks Depth Scaling

Replaces fixed residual accumulation with softmax attention over preceding layer outputs — 2.11% downstream gain at negligible overhead. Addresses hidden-state growth diluting layer contributions.
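As a rough illustration of the idea (not the Kimi team's formulation), the sketch below mixes the outputs of preceding layers with softmax weights instead of summing them with fixed unit weights. The single query vector, the dot-product scoring, and the scaling are all assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def attention_residual(layer_outputs, query):
    """Blend earlier layer outputs with learned-style attention weights.

    layer_outputs: (L, d) stack of outputs from the preceding L layers.
    query: (d,) vector standing in for the current layer's query.
    """
    scores = layer_outputs @ query / np.sqrt(layer_outputs.shape[1])
    weights = softmax(scores)
    return weights @ layer_outputs, weights
```

With a zero query the weights are uniform, so the result is the mean of the earlier outputs; a trained query would instead emphasise the layers that matter for the current one.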

paper Attention Residuals (134 upvotes)

Mixture-of-Depths Attention (MoDA)

Each attention head attends to both sequence KV pairs at current layer AND depth KV pairs from preceding layers — 2.11% downstream improvement with 97.3% FlashAttention-2 efficiency.

paper MoDA paper (73 upvotes)

SAMA: Factorized Video Editing Without External Priors

Decouples video editing into semantic anchoring and motion alignment — pre-training on motion-centric tasks alone yields strong zero-shot editing. Competitive with Kling-Omni.

paper SAMA paper (60 upvotes)

3DreamBooth: 3D-Aware Subject-Driven Video Generation

Decouples spatial geometry from temporal motion via 1-frame optimization, enabling genuine 3D-aware video customization for VR/AR and virtual production without multi-view video datasets.

paper 3DreamBooth paper (48 upvotes)

Spatial-TTT: Streaming Spatial Intelligence via Test-Time Training

Maintains and updates spatial evidence from unbounded video streams using test-time training with fast weights — hybrid architecture with sliding-window attention and 3D spatiotemporal convolution.

paper Spatial-TTT paper (76 upvotes)

Nemotron-Cascade 2: Gold-Medal Reasoning at 3B Active Parameters

NVIDIA's 30B MoE with only 3B activated params matches frontier models on IMO/IOI/ICPC — 20× fewer parameters than DeepSeek equivalent. Open weights released.

paper Nemotron-Cascade 2 paper (44 upvotes)

OmniForcing: Real-Time Joint Audio-Visual Generation at 25 FPS

First framework to distill bidirectional audio-visual diffusion into a streaming autoregressive generator — solves extreme token sparsity and cross-modal sync issues for real-time generation.

paper OmniForcing paper (31 upvotes)

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Handles geometric reconstruction from long video sequences using hybrid memory architecture.

twitter @_akhaliq LoGeR thread paper LoGeR paper

Browser Use CLI 2.0: 2× Faster Browser Automation via Direct CDP

Major update to browser automation tool — direct Chrome DevTools Protocol, half the cost, connect to running Chrome instances.

twitter Browser Use CLI 2.0 announcement

MoTok: Diffusion-Based Motion Tokenizer Bridges Semantic and Kinematic Control

Three-stage motion generation framework using a novel diffusion-based discrete tokenizer — reduces trajectory error from 0.72cm to 0.08cm while using 1/6 the tokens.

paper MoTok paper (36 upvotes)

IndexCache: 2× Faster Sparse Attention via Cross-Layer Index Reuse

Exploits the insight that sparse attention indices are highly redundant across adjacent layers — shares indices via 'Full' and 'Shared' layers with multi-layer distillation, significantly accelerating prefill and decode.
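A toy sketch of the sharing pattern, under the assumption that 'Full' layers run the top-k indexer and 'Shared' layers reuse the most recent result. The function names and the per-layer score vectors are illustrative, and the multi-layer distillation step is not shown.

```python
import numpy as np

def topk_indices(scores, k):
    """Indices of the k highest-scoring tokens, in ascending order."""
    return np.sort(np.argpartition(scores, -k)[-k:])

def select_indices_per_layer(scores_per_layer, layer_kinds, k):
    """Run the indexer on 'full' layers; reuse its output on 'shared' ones."""
    cached = None
    used = []
    for scores, kind in zip(scores_per_layer, layer_kinds):
        if kind == "full" or cached is None:
            cached = topk_indices(scores, k)  # the expensive step
        used.append(cached)
    return used
```

The saving comes from skipping the indexer on every 'shared' layer, which only pays off if adjacent layers really do pick similar tokens.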

paper IndexCache paper (45 upvotes)

SK-Adapter: Skeleton-Based Control for Native 3D Generation

Lightweight adapter injects 3D skeleton joint coordinates and topology into frozen 3D generation backbones via cross-attention — enables precise structural articulation control and local 3D editing.

paper SK-Adapter paper

OneWorld: 3D Scene Generation in Native 3D Space

Performs diffusion directly within coherent 3D representation space instead of 2D image/video latents — uses 3D-URAE autoencoder with cross-view correspondence loss for consistent scene generation.

paper OneWorld paper

V-JEPA 2.1: Dense Self-Supervised Visual Representations

Meta's V-JEPA 2.1 achieves SOTA on egocentric anticipation, robotic grasping (+20pt over V-JEPA-2), depth estimation, and navigation — dense predictive loss with deep self-supervision across encoder layers.

paper V-JEPA 2.1 paper

Week of 2026-03-09 to 2026-03-15 March 15, 2026

Streaming Spatial Intelligence: Three Papers Converge on Video-to-3D

Three independent papers this week tackle the same core problem: building persistent spatial understanding from continuous video streams. Spatial-TTT (70 upvotes) uses test-time training with fast weights as compressed spatial memory over unbounded video. Holi-Spatial (53 upvotes) builds a scalable 3DGS-based pipeline to curate spatial QA data from raw web video. LoGeR (31 upvotes) achieves dense 3D reconstruction from minutes-long video in a single feedforward pass — no post-optimization.

Analyst Note

The convergence here is striking: all three papers independently identify that the bottleneck for spatial AI is not model capacity but how spatial information is retained over time. Spatial-TTT's use of fast weights to compress spatial evidence into network parameters is the most architecturally...

paper Spatial-TTT: Streaming Spatial Intelligence with Test-Time Training paper Holi-Spatial: Video Streams to Holistic 3D Spatial Intelligence paper LoGeR: Long-Context Geometric Reconstruction

Video Generation Gets Cinematic: Camera Control, Identity Lock, Infinite Length

Four papers push video generation from single-clip demos toward production-grade film tools. ShotVerse (28 upvotes) learns cinematic multi-shot camera control from naturally aligned (caption, trajectory, video) triplets. WildActor (26 upvotes) introduces a massive 18M-clip dataset for full-body identity consistency across viewpoints. DreamVideo-Omni (26 upvotes) solves multi-subject + multi-granularity motion control with identity reward learning. HiAR (21 upvotes) breaks the error-accumulation barrier for infinite-length AR video via hierarchical denoising.

Analyst Note

The shift is from 'generate a cool clip' to 'direct a sequence.' ShotVerse's data-centric approach — learning camera language from real film data rather than manual trajectory specification — is the right paradigm. WildActor addresses the ugly truth that prior identity-preserving methods were...

paper ShotVerse: Cinematic Camera Control for Multi-Shot Video paper WildActor: Identity-Preserving Video Generation paper DreamVideo-Omni: Multi-Subject Video Customization paper HiAR: Hierarchical Autoregressive Long Video Generation

Coding Agents Hit Production: Claude Code Review, Autoresearch, and Agent Orchestration

The coding agent stack crosses from demos to production infrastructure. Anthropic ships multi-agent code review in Claude Code; Boris Cherny reports that code output per engineer is up 200%, with review, previously the bottleneck, now handled by agent teams. Karpathy releases his 'autoresearch' setup: a minimal 630-line LLM training loop where agents iterate on training code while humans steer research direction. Varun Mathur's Autoskill extends this to distributed skill factories. AgentCraft by Ido Salomon lets you orchestrate agents via an RTS-game interface.

Analyst Note

The signal here is that Anthropic dogfoods agent-driven code review internally — this isn't a product demo, it's their actual engineering workflow. When your AI company's own engineers use agents to review agent-written code, the loop is closed. Karpathy's autoresearch is the minimal viable version...

twitter Claude Code Review launch (Boris Cherny) twitter Karpathy's autoresearch repo twitter Autoskill: distributed skill factory (Varun Mathur) twitter AgentCraft: RTS-style agent orchestration

Also Notable

SuperSplat: Walk Mode + Streamed LOD for Gaussian Splats

SuperSplat ships first-person walk mode with WASD controls, streamed LOD, and easy upload — making Gaussian Splat scenes navigable and shareable like Google Street View.

reddit r/GaussianSplatting

DVD: Deterministic Video Depth from Diffusion Priors

First framework to deterministically convert pre-trained video diffusion models into single-pass depth regressors — eliminates stochastic sampling and scale drift.

paper DVD paper

CARE-Edit: Dynamic Expert Routing for Image Editing

Replaces static conditioning concatenation in ControlNet with a latent-attention router that dynamically selects expert pathways per diffusion timestep — reduces artifacts from conflicting modalities.

paper CARE-Edit paper

CoCo: Code-as-Chain-of-Thought for Structured Image Generation

Uses executable code as the reasoning step before image generation — produces structured drafts via code then renders, excelling at complex spatial layouts and embedded text.

paper CoCo paper

ELIT: Variable-Length Latent Tokens Decouple DiT Compute from Resolution

Drop-in DiT mechanism that inserts a learnable variable-length latent interface, enabling runtime quality-speed tradeoffs without retraining.

paper ELIT paper

SVG-EAR: Training-Free Sparse Attention Speedup for Video DiTs

Parameter-free linear compensation for dropped attention blocks in video generation via error-aware routing — ~2x speedup with minimal quality loss.

paper SVG-EAR paper

Planning in 8 Tokens: Ultra-Compact World Model Tokenizer

Compresses visual observations to just 8 discrete tokens for action-conditioned world models, making real-time planning computationally feasible.

paper Planning in 8 Tokens

IndexCache: Cross-Layer Attention Index Reuse for LLM Speedup

Exploits cross-layer redundancy in sparse attention indexers — reuses top-k token selections across layers, cutting prefill latency 2-3x with negligible quality loss.

paper IndexCache paper

FIRM: Robust Reward Models for Faithful Image Editing

Tackles reward model hallucinations in RL-guided image editing with curated data, base-and-bonus reward strategy, and new FIRM-Bench benchmark.

paper FIRM paper

Claude Generative UI Reverse-Engineered

Michael Livs extracted Anthropic's generative UI design system from conversation exports, rebuilt it with live-streaming HTML via morphdom DOM diffing into native macOS windows.

twitter Michael Livs tweet + article

NodeToCode: Unreal Engine Blueprints to C++

Open-source tool that translates UE Blueprint visual scripts to C++ — useful for game dev teams migrating visual prototypes to production code.

twitter NodeToCode on GitHub

Nvidia Nemotron 3 Super

Nvidia's new open model punches significantly above its weight class — r/LocalLLaMA calls it a bigger deal than the marketing suggests, with 152 comments of analysis.

reddit r/LocalLLaMA discussion

VeridisQuo: Open-Source Deepfake Detector

Combines spatial and frequency-domain analysis for deepfake detection with manipulation heatmaps — open-sourced university project on r/MachineLearning.
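To make the frequency-domain half concrete, here is one simple feature of that kind: the share of spectral energy above a radial cutoff. This is a generic illustration and is not taken from the VeridisQuo code; the function name and the `cutoff` value are assumptions.

```python
import numpy as np

def high_frequency_ratio(gray, cutoff=0.5):
    """Fraction of 2D spectral energy lying beyond a radial frequency cutoff."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))  # move DC to the centre
    energy = np.abs(spectrum) ** 2
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h // 2, xx - w // 2)
    high = radius > cutoff * radius.max()
    total = energy.sum()
    return float(energy[high].sum() / total) if total > 0 else 0.0
```

A detector would typically feed features like this, or the full spectrum, to a classifier alongside spatial features rather than thresholding the ratio directly.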

reddit r/MachineLearning

MLX Circle-Splatting Renderer for Dimensionality Reduction

Han Xiao built pure MLX implementations of UMAP/t-SNE/PaCMAP with a scatter-add alpha-blending splatting renderer on Metal — 70K points from raw data to rendered video in seconds.
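The scatter-add idea can be sketched in NumPy (not MLX) as below. It is a simplification under stated assumptions: each point covers a single pixel rather than a circle, all points share one `alpha`, and the final compositing uses an order-independent approximation chosen for illustration.

```python
import numpy as np

def splat_points(xy, colors, alpha, height, width):
    """Accumulate point colours per pixel with scatter-add, then composite.

    xy: (N, 2) positions in [0, 1]; colors: (N, 3) RGB; alpha: scalar opacity.
    """
    cols = np.clip(np.rint(xy[:, 0] * (width - 1)).astype(int), 0, width - 1)
    rows = np.clip(np.rint(xy[:, 1] * (height - 1)).astype(int), 0, height - 1)
    color_sum = np.zeros((height, width, 3))
    weight_sum = np.zeros((height, width))
    # np.add.at handles repeated indices, unlike plain fancy-index assignment.
    np.add.at(color_sum, (rows, cols), colors * alpha)
    np.add.at(weight_sum, (rows, cols), alpha)
    mean_color = color_sum / np.maximum(weight_sum, 1e-8)[..., None]
    coverage = 1.0 - np.exp(-weight_sum)  # more overlapping points, more opaque
    return mean_color * coverage[..., None]
```

Because the accumulation is a pure scatter-add, it needs no depth sorting, which is what makes this style of renderer a good fit for GPU array frameworks.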

twitter Han Xiao tweet

PufferLib 3.0: Petabyte-Scale RL Training on One Server

Trained RL agents on 1 petabyte / 12,000 years of data on a single server — algorithmic breakthroughs, massively faster training, 10 new environments.

twitter PufferLib 3.0 announcement

Week of 2026-03-10 to 2026-03-10 March 10, 2026

Karpathy's AutoResearch: RL Agents That Do Their Own ML Research

Andrej Karpathy released 'autoresearch' — a minimal framework where an RL agent iterates on neural architecture and hyperparameter research autonomously. A ~630-line nanochat LLM training core runs on a single GPU, and the agent proposes code modifications, observes validation loss, and updates via PPO. Separately, a formal paper (AutoResearch-RL) demonstrates the same concept with convergence guarantees.

Analyst Note

This is the 'AI does AI research' loop getting real and accessible. The fact that Karpathy stripped it to a single-file, single-GPU setup means anyone can experiment. The bigger implication: if a small RL agent can discover architecture improvements that humans missed, the compound effect over many...

twitter Karpathy's tweet + repo reddit r/LocalLLaMA discussion paper AutoResearch-RL paper

Holi-Spatial: 3D Gaussian Splatting Meets Vision-Language Models at Scale

Top HF paper this week (53 upvotes). Holi-Spatial builds a scalable pipeline that evolves raw video streams into holistic 3D spatial intelligence by combining 3D Gaussian Splatting with vision-language models. Unlike prior work that reuses small hand-annotated datasets, this approach systematically annotates large-scale 3D scenes from web video data, producing spatial QA pairs with geometric accuracy and relational semantics.

Analyst Note

This bridges two hot areas — 3DGS and VLMs — in a way that could actually scale. The key insight is using 3DGS as the spatial backbone for generating training data rather than as a rendering endpoint. For Spin Master's work on onboarding/tiny stories, this kind of spatial understanding could be...

paper Holi-Spatial (HuggingFace) paper arXiv

Claude Code Gets Multi-Agent Code Review

Anthropic shipped Code Review for Claude Code — a team of agents runs deep reviews on every PR. Boris Cherny (Anthropic) says engineering output is up 200% this year and reviews were the bottleneck. The system uses separate context windows per agent, leveraging test-time compute across isolated contexts rather than one large window.

Analyst Note

The architecture detail matters: separate context windows per review agent, not one shared context. This matches what we've seen with Claude Code subagents — isolated contexts catch things a single pass misses. The 200% productivity claim is bold but tracks with what I've observed: the bottleneck...

twitter Boris Cherny announcement twitter Architecture explanation

Also Notable

LoGeR: Long-Context 3D Reconstruction from Video Without Post-Optimization

Scales dense 3D reconstruction to extremely long video sequences using hybrid memory — bidirectional priors within chunks, sliding window attention across chunks. No post-optimization needed.
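One way to picture that attention pattern is as a frame-level mask. The sketch below is an interpretation rather than the paper's design: it assumes fixed-size chunks, full attention inside a chunk, and access to a fixed number of earlier chunks; the function name and parameters are hypothetical.

```python
import numpy as np

def chunked_attention_mask(num_frames, chunk_size, window):
    """Boolean mask where entry [i, j] says whether frame i may attend to j."""
    chunk_id = np.arange(num_frames) // chunk_size
    gap = chunk_id[:, None] - chunk_id[None, :]
    # gap == 0: same chunk (bidirectional); 1..window: recent earlier chunks.
    return (gap >= 0) & (gap <= window)
```

With this layout the cost per frame stays bounded by the chunk size and window, regardless of how long the video runs.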

paper arXiv

Planning in 8 Tokens: Ultra-Compact World Model Tokenizer

Encodes observations into just 8 discrete tokens for latent world models, making real-time planning computationally feasible. Conventional tokenizers use hundreds of tokens per frame.

paper arXiv

Fine-Tuned Qwen3 SLMs (0.6-8B) Beat Frontier Models on Narrow Tasks

Systematic comparison shows distilled small Qwen3 models outperform GPT-5, Claude Opus, and Gemini Flash on specific classification and function-calling tasks. The small model advantage is real for focused use cases.

reddit r/LocalLLaMA

VeridisQuo: Open-Source Deepfake Detector (Spatial + Frequency Analysis)

Combines spatial and frequency-domain analysis to detect deepfakes and show exactly where the face was manipulated. Open-source, with visual heatmap output.

reddit r/MachineLearning

Microsoft Copilot Cowork: Multi-Step Office Agent Built on Claude

Microsoft launched Copilot Cowork — an agent that executes multi-step workflows across Outlook, Teams, Excel, and PowerPoint autonomously. Notably built on Anthropic's Claude, not OpenAI.

reddit r/OpenAI reddit r/ChatGPT

WildActor: Identity-Preserving Human Video Generation

Actor-18M dataset + asymmetric identity-preserving attention for consistent full-body identity across dynamic shots and viewpoints. Addresses the copy-paste artifact problem in human video gen.

paper arXiv

AgentCraft: Control AI Agents Like an RTS Game

Ido Salomon released AgentCraft v1 — an RTS-style interface for controlling coding agents. Early but fun concept: `npx @idosal/agentcraft`.

twitter @idosal1

LTX 2.3: Open Video Generation Getting Serious

Multiple impressive LTX 2.3 text-to-video generations on r/StableDiffusion showing significant quality jumps in open-source video models.

reddit Tony Soprano demo reddit ComfyUI workflow

Depth Perception Blender Add-on via Head Tracking

CS student built a Blender add-on using real-time webcam head tracking to create natural depth perception while navigating 3D scenes. Free and open-source.

reddit r/ComputerVision

Unreal Blueprint to C++ Translator

Open-source tool that translates Unreal Engine Blueprints to C++ code. Useful for performance-critical game dev workflows.

twitter @tom_doerr

Week of 2026-03-03 to 2026-03-09 March 9, 2026

Gaussian Splatting Breaks Into Real-Time Video

Two independent teams demonstrated real-time 4D Gaussian splatting for dynamic scenes, achieving 60fps rendering of complex video sequences. SplatFlow from ETH Zurich uses flow-guided deformation fields, while DynaSplat from Google DeepMind introduces temporal attention mechanisms. Both show dramatic quality improvements over prior dynamic NeRF approaches.

Analyst Note

This is the missing piece for production use of 3DGS in film/VFX. Static scene reconstruction was already competitive — now dynamic scenes are catching up. The Google approach is particularly interesting because it could integrate with their existing NeRF-to-mesh pipeline.

paper SplatFlow: Real-Time Dynamic Gaussian Splatting paper DynaSplat: Temporally Coherent 4D Gaussians reddit Discussion on r/gaussiansplatting

Claude 3.5 Opus Released with Native Tool Use

Anthropic released Claude 3.5 Opus with deeply integrated tool use — the model can now plan multi-step tool chains internally rather than requiring external orchestration. Early benchmarks show 40% fewer API round-trips for complex agent workflows. Coding benchmarks jump significantly, with SWE-bench scores reaching 62%.

Analyst Note

The real story isn't the benchmark numbers — it's the architecture shift. Moving tool planning inside the model eliminates the brittle prompt-engineering layer that made agent frameworks fragile. This is what OpenClaw and similar systems have been working around. Expect the orchestration layer to...

blog Anthropic Blog: Claude 3.5 Opus twitter @AnthropicAI announcement reddit r/Claude discussion reddit r/LocalLLaMA analysis

World Models Get Spatial Awareness

NVIDIA's GameGen-2 introduces a world model that understands 3D spatial relationships, enabling consistent physics when generating interactive environments. Unlike prior work that treated video generation as 2D sequence prediction, GameGen-2 maintains an implicit 3D representation that prevents the "impossible geometry" artifacts common in generated worlds.

Analyst Note

This bridges the gap between world models and actual game engines. The implicit 3D representation is key — previous approaches could generate visually convincing frames but fell apart when you needed consistent spatial reasoning (walk around a corner and back, the scene should be the same). Still...

paper GameGen-2: Spatially-Aware World Models twitter @JimFan thread

Also Notable

Stable Diffusion 4 Preview Leaks Show Major Architecture Change

Early access users report SD4 moves to a DiT-based architecture with native video support. Quality appears to match DALL-E 3 in initial comparisons.

reddit r/StableDiffusion thread

Codex CLI Gets Multi-File Context Window

OpenAI's Codex CLI now supports loading entire project directories into context, with smart chunking that preserves cross-file references.

twitter @OpenAI tweet

NeRF-to-Mesh Pipeline Achieves Sub-Second Export

New paper from Meta demonstrates instant mesh extraction from trained NeRFs, making the NeRF→production pipeline viable for real-time applications.

paper InstaMesh: Instant Neural Mesh Extraction

GPT-5 Rumored for Q2 2026 with Native Multimodal Generation

Multiple sources suggest GPT-5 will unify text, image, video, and audio generation in a single model. No official confirmation from OpenAI.

reddit r/OpenAI speculation thread twitter @sama hint

Real-Time Neural Rendering Benchmark Published

New standardized benchmark for neural rendering methods covers speed, quality, and memory across 50 scenes. 3DGS variants dominate speed; NeRF variants win on quality.

paper NeuralRenderBench

Cursor Adds AI-Powered Git Conflict Resolution

Cursor 0.45 introduces automatic merge conflict resolution that understands both sides' intent. Early reports say it handles 80%+ of conflicts correctly.

reddit r/ChatGPT discussion

Diffusion Models Learn to Count

Paper shows a training technique that gives diffusion models accurate object counting, solving one of the oldest failure modes in image generation.

paper CountDiffusion paper