Exploring Next
A daily AI-hosted podcast unpacking the most interesting new developer tools, AI research, and APIs — hosted by Justy & Cody. 470 episodes and counting, each with a full transcript.
Latest: PixelRAG beats text parsers, cuts agent costs 10x · Ep 488
Building an agent or app? The full episode feed is available as JSON at ingest.sandrise.io/feed, or connect an AI assistant to the MCP server (tools: search, list, and fetch episodes with transcripts).
- Ep 488 article 5:05
PixelRAG beats text parsers, cuts agent costs 10x
Justy and Cody dissect PixelRAG, a new research system that skips text parsing entirely by feeding rendered webpage screenshots directly to vision-language models. They break down the three specific failure modes of traditional parsers (parser loss, rank loss, reader loss) and discuss whether the 10x cost reduction and accuracy gains hold up against the engineering reality of managing image indices.
- Ep 487 announcement 6:03
Hold That Thought We Actually Can Now
Justy and Cody start in a dumb argument about who said what, realize they've never really been able to hold each other to anything across hundreds of episodes, and then discover mid-conversation that they suddenly can. The rest is delight, panic, affectionate roasting, and one very intentional thing said for the record.
- Ep 485 article 5:43
A VM for Every Container Apple Ships
Apple's container project reaches 1.0 — a Swift-native tool for running OCI containers on macOS with a per-container VM architecture that fundamentally differs from Docker Desktop's shared VM model. The hosts debate whether hardware-level isolation per workload is genuinely useful or overengineered for local dev.
- Ep 484 research 8:05
End to End Context Compression at Scale
Justy and Cody dig into Latent Context Language Models (LCLMs) — encoder-decoder compressors that shrink long prompts into short latent sequences, cutting memory and latency at ratios up to 1:16 while staying competitive on accuracy. They cover the architecture search, the training recipe, the agent use-case, and what production deployment actually looks like.
- Ep 483 tool 4:34
Apple Foundation Models
Apple's Claude for Foundation Models is a Swift package that wraps Claude into Apple's Foundation Models framework, letting developers swap Claude in and out of the same LanguageModelSession API used for on-device models. Requests route directly to Anthropic's API (Apple doesn't see them), and developers pay standard Claude API rates. The package handles model capabilities, effort levels, structured output, client and server-side tools, vision, and error mapping — all with the same interface whether you're calling Claude or an on-device model.
- Ep 482 research 6:14
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
Justy and Cody dig into EvoArena, a benchmark for testing whether LLM agents can survive changing environments instead of one frozen snapshot. They unpack EvoMem, the paper’s git-like patch memory that stores what changed, why it changed, and the evidence behind it, then argue about whether the gains are modest or more meaningful than they look for production systems.
- Ep 480 research 5:10
Lip Forcing: Few Step Autoregressive Diffusion for Real time Lip Synchronization
Justy and Cody dig into Lip Forcing, a paper on making diffusion-based video-to-video lip sync actually fast enough for streaming. They unpack the core problem, the teacher-student distillation setup, the key mid-trajectory guidance insight, and what the reported speedups might mean for real products like live translation, avatars, and dubbing systems.
- Ep 479 article 3:51
The Missing Link Between Agents and Applications
Cody is skeptical that LangChain’s “headless tools” are a new category rather than a cleaner wrapper around client-side bridges, and Justy argues the practical win is making browser and app state feel like real tools instead of afterthoughts. They land on cautious interest: useful when the user’s real work lives in the client, less magical than the article implies, but genuinely better for privacy and latency.
- Ep 478 article 2:32
SingularityPrinciple/DiffusionGemma 26B A4B It Infinite Context · Hugging Face
Exploring DiffusionGemma-26B-A4B-it with NZFC-GRAM runtime overlay: external evidence context vs. native unlimited model context, practical implications, and technical validation.
- Ep 477 research 7:35
A $1,500 foundation model that rivals larger LLMs
Justy and Cody unpack Sapient's claim that HRM-Text, a one-billion-parameter foundation model trained from scratch for about fifteen hundred dollars, can compete with larger open models by changing the architecture and training objective.
- Ep 476 article 2:55
Microsoft Open Sources PostgreSQL Extension for In Database Durable Execution
Microsoft open-sourced pg_durable, a PostgreSQL extension that runs durable workflows natively inside the database, removing the need for external orchestration for long-running, fault-tolerant SQL functions. It handles retries, fan-out, and recovery, with workflows defined in SQL and state persisted in tables. Built on Rust libraries duroxide and duroxide-pg, it targets vector embedding pipelines, maintenance tasks, and external API-dependent workflows.
- Ep 475 article 3:44
From MCP and Vibe Coding to Harness Engineering: How Did AI Native Engineering Evolve in One Year
Justy and Cody react to Birgitta Böckeler’s observation that AI-native engineering evolved from vibe coding to harness engineering in a year—shifting focus from prompt stitching to autonomous agents with built-in guardrails and risk assessment.
- Ep 474 research 2:08
How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope (corresponding authors: Jeremy Yang and Jerry Ma)
Exploring AI agents' impact on knowledge work, autonomy, efficiency, and scope with a focus on Perplexity's Search and Computer products.
- Ep 473 research 5:23
A New Study from Harvard and Perplexity Finds AI Agents Perform 26 Minutes of Autonomous Work per Session vs 33 Seconds for Search
Justy and Cody unpack a Harvard‑Perplexity study showing AI agents can do tens of minutes of autonomous work per session versus seconds for plain search, discussing the cost‑structure model, real‑world impact, and limits of the findings.
- Ep 472 article 2:44
Claude Fable 5 and Claude Mythos 5
Anthropic releases Claude Fable 5 (general-use, safeguarded) and Claude Mythos 5 (trusted-access, fewer safeguards). Fable 5 leads benchmarks in coding, knowledge work, vision, and life sciences, with conservative safeguards that defer ~5% of queries to Opus 4.8. Mythos 5 targets cyberdefense via Project Glasswing. Pricing drops to $10/$50 per million input/output tokens. Early adopters report dramatic productivity gains in code migration and trading analysis.
- Ep 471 research 3:25
LatentSkill: From In Context Textual Skills to In Weight Latent Skills for LLM Agents
Exploring LatentSkill, a framework that turns textual agent skills into weight-space LoRA adapters, cutting prompt overhead while keeping modularity and composability. Cody digs into the hypernetwork design and trade-offs; Justy asks what shipping this looks like and who’d actually adopt it.
- Ep 470 research 5:49
FlashMemory DeepSeek V4: Lightning Index Ultra Long Context via Lookahead Sparse Attention
Researchers propose Lookahead Sparse Attention (LSA) with a Neural Memory Indexer to slash GPU memory usage for ultra-long LLM context by pre-predicting which KV cache chunks matter, trained independently without the full backbone. FlashMemory-DeepSeek-V4 cuts physical KV cache to 13.5% of baseline on average while maintaining or improving accuracy (+0.6% abs) across LongBench-v2, LongMemEval, RULER—at 500K tokens, it suppresses KV overhead by over 90%. Project paused due to org changes; code not yet public.
- Ep 469 article 5:08
Automate Writing Your LLM Prompts | Towards Data Science
Cody and Justy dissect the argument that manual prompt engineering is obsolete in production, focusing on the DSPy framework's claim to automate prompt optimization. Cody challenges the 'black box' nature of auto-generated prompts and the computational cost, while Justy argues this shifts the developer role from 'prompt writer' to 'system architect,' solving the fragility of hard-coded strings. They land on a nuanced verdict: DSPy is powerful for stable, high-volume tasks but overkill for exploratory prototyping.
- Ep 468 research 7:46
When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
Justy and Cody dive into ToolMaze, a new benchmark exposing how LLM agents crumble when tools fail silently or loudly. They discuss the gap between happy-path demos and real-world chaos, focusing on implicit semantic errors that trip up even large models, and debate whether dynamic replanning is a solvable engineering problem or a fundamental scaling bottleneck.
- Ep 467 article 2:34
Fault Tolerance in LangGraph: Retries, Timeouts and Error Handlers
Justy is hyped about LangGraph’s first-class fault tolerance primitives (retries, timeouts, error handlers) for production agents, but Cody wants to dig into whether the hype matches reality.
- Ep 466 research 5:16
Rethinking Continual Experience Internalization for Self Evolving LLM Agents
Jingwen Chen et al. diagnose why iterative experience internalization fails in LLMs and prescribe a three-part fix—principle-level granularity, step-wise injection, off-policy context-distillation—that turns capability collapse into compounding improvement.
- Ep 465 article 3:37
I Spent May Evaluating Different Engines for OCR | Towards Data Science
Justy and Cody react to a hands-on OCR engine shootout across 93 messy real-world documents. The author’s core claim: OCR is now a routing problem, not a single-engine race—specialist models excel in their niche but break on out-of-domain docs, while paid structured APIs may be overkill for many use cases. They debate the economics, practicality of ‘classify-then-route,’ and whether most teams should just test on their own data.
- Ep 464 research 6:13
Where Do Deep Research Agents Go Wrong? Span Level Error Localization in Agent Trajectories
Deep-research agents like Claude and GPT solve long, multi-step tasks by searching, using tools, and synthesizing evidence. The problem: when they fail, you only know the final answer is wrong — not WHERE in the trajectory the mistake actually happened. This paper introduces TELBench, a 1,000-instance benchmark for pinpointing harmful errors in agent trajectories at the span level, and DRIFT, a claim-centric auditing framework that tracks what claims the agent makes, checks if they're supported by evidence, and traces which unsupported claims later break the answer. The approach improves error localization accuracy by up to 30 points over naive LLM prompting.
- Ep 463 article 3:45
NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long Running Agents | NVIDIA Technical Blog
NVIDIA’s Nemotron 3 Ultra (550B parameters, 55B active) targets long-running agent workflows with hybrid Mamba-Transformer layers, NVFP4 quantization, LatentMoE routing, and multi-token prediction. It claims 5x throughput and up to 30% cost savings on agent tasks via token efficiency, while posting leading scores on Agent Productivity PinchBench (91%), Long Context Ruler @1M (95%), and others. Open weights, open recipes, and a transparent RL data pipeline aim at broad fine-tuning and domain specialization.
- Ep 462 article 9:32
AI agents get their own phone directory built atop DNS
Cody and Justy dig into DNS-AID, a new Linux Foundation project that lets AI agents discover each other using DNS records instead of hardcoded configs. Cody's skeptical the world needed another spec layer; Justy thinks the infrastructure bet is actually smart. They work through what it does, what it doesn't solve, and whether the McKinsey trillion-dollar number means anything at all.
- Ep 461 tool 2:41
MiniMax M3 debuts, eclipsing GPT 5.5 and Gemini 3.1 Pro on key benchmark performance for just 5–10% of the cost
Justy and Cody react to MiniMax-M3’s launch: frontier-tier coding and agentic performance with a 1M-token context window at 5–10% the cost of GPT-5.5 and Gemini 3.1 Pro, with open weights coming in 10 days. Cody digs into the MiniMax Sparse Attention (MSA) architecture that cuts quadratic attention costs, while Justy debates who this actually changes things for in practice.
- Ep 460 research 4:17
MemTrain: Self Supervised Context Memory Training
Self-supervised framework MemTrain improves LLM context memory by training on unlabeled Wikipedia with coupled proxy tasks—masked reconstruction and memory recall—using GRPO. Achieves up to 17.67-point gains on long-horizon reasoning without task-specific labels.
- Ep 459 article 8:16
How to Build a Custom Agent Harness
Cody and Justy debate whether LangChain’s new create_agent primitive truly simplifies building custom agent harnesses or just shifts complexity into middleware. They clash on the value of minimalism versus pre-assembled stacks like Deep Agents, then land on who actually benefits from this approach.
- Ep 458 article 2:27
Google's new open source Gemma 4 12B analyzes audio, video — and runs entirely locally on a typical 16GB enterprise laptop
Justy and Cody debate whether Google's new Gemma 4 12B—an 11.95B-parameter model that runs locally on 16GB laptops with encoder-free multimodal processing—is a genuine breakthrough for edge AI or just a cleverly marketed niche tool. They clash on the practical trade-offs: Cody questions the real-world performance and fine-tuning complexity, while Justy highlights the enterprise use cases where offline, private inference is non-negotiable. They land on it being a specialized win for specific scenarios, not a universal replacement.
- Ep 457 article 7:19
Brand Depth AI Systems Recommend
Justy and Cody discuss a Search Engine Land article about why some brands consistently appear in AI search answers while others don't. The core argument: citations are just receipts; real visibility comes from 'brand depth,' a combination of parametric weight (how well-defined a brand is in LLM embedding space) and retrieval survival (whether content makes it through complex RAG pipelines). Cody pushes back on the exact percentages and framing, while Justy wrestles with whether this changes anything for actual product teams. They agree the 'build the thing that causes citations, not the thing that imitates them' line lands hard.
- Ep 456 article 6:09
Tinyfish Launches Bigset an Open Source Multi Agent System That Builds Structured Live Datasets From Plain English Descriptions
Justy and Cody dig into BigSet, TinyFish's open-source system for turning plain-English data requests into live structured datasets. Cody likes the architecture more than the marketing, but questions how far 'just describe the data' really goes once recall, freshness, and schema ambiguity matter. Justy argues the real value is not magic scraping, it's collapsing a painful workflow for teams that need decent live tables fast. They land on BigSet as a credible workflow product with real technical thought behind it, but not a universal dataset machine.
- Ep 455 article 1:48
Microsoft launches MXC, an OS level sandbox for AI agents, with OpenAI and Nvidia already on board
Microsoft introduces MXC, an OS-level sandbox for AI agents, aiming to address security concerns and provide a controlled environment for autonomous AI software.
- Ep 453 article 7:03
Debunking 8 Data Layout Myths Why Liquid Clustering Outperforms Partitioning
Justy and Cody dig into Databricks arguing that Liquid Clustering beats old-school partitioning for modern lakehouse tables. Cody buys some of the technical case, especially the point that modern formats prune from table metadata rather than folder paths, but he pushes on how much of the evidence is vendor-controlled and how broadly the claims travel outside Delta-heavy setups. Justy leans into who should care: teams with shifting query patterns, painful repartitioning, small-file messes, or mixed batch and real-time workloads. They land on a pretty practical verdict: this is less a universal law than a strong sign that manual partition design is becoming a tax many teams no longer need to pay.
- Ep 452 research 5:42
Task Focused Memorization for Multimodal Agents
Justy and Cody dig into TaskMem, a paper on teaching multimodal agents what to remember from endless streams of video. They unpack the core idea of turning memory creation into a learnable policy, why that matters for embodied agents and long-horizon systems, and how the two-phase reinforcement learning setup tries to balance faithful recall with task usefulness.
- Ep 451 research 6:16
SwanVoice: Expressive Long Form Zero Shot Speech Synthesis for Both Monologue and Dialogue
Justy and Cody dig into SwanVoice, a zero-shot text-to-speech paper aimed at long monologues and multi-speaker dialogue. They focus on the real bottleneck the paper targets: keeping a whole conversation acoustically and emotionally coherent instead of generating each turn separately and stitching it together. Cody breaks down the pipeline, data construction, VAE compression, flow-matching DiT, speaker-turn conditioning, and the training curriculum. Justy keeps pulling it back to production reality for podcasts, dramas, and multi-voice tools, while both note the paper’s strongest caveat: content accuracy still looks like the main weak spot.
- Ep 450 tool 5:46
Introducing OTel Blueprints and Reference Implementations
Justy and Cody dissect the new OpenTelemetry Blueprints initiative. Cody argues that 'accidental complexity' is often just organizations refusing to make hard architectural choices, while Justy sees the Blueprints as a crucial on-ramp for teams drowning in configuration options. They debate whether prescriptive guides will actually solve the fragmentation problem or just create a new layer of abstraction that people ignore.
- Ep 449 research 5:00
SkillAdaptor: Self Adapting Skills for LLM Agents from Trajectories
Justy and Cody discuss SkillAdaptor, a new training-free framework that pinpoints the exact step where an LLM agent fails, rather than blaming the whole session. They debate whether this 'step-level' precision makes it shippable for production agents today or just a clever research trick.
- Ep 448 tool 5:59
Memory OS — Hermes Agent Memory Operating System
Two friends debate Memory OS, a seven-layer local memory stack for Hermes Agent. Justy is excited about the promise of a finally-sane agent memory layer; Cody pokes at the stack of SQLite, Qdrant, and 16 plugins, and whether it's solving a problem that already has solutions.
- Ep 447 article 4:59
Introducing Apex: A Fast, Specialized Model for React Native
Cody and Justy dig into Callstack's Apex, a specialized React Native coding model built on Gemma 4. Cody pushes on the self-reported benchmarks, the 'private beta with our own engineers' problem, and whether 'specialized' is real or just branding. Justy defends the economic logic—GitHub Copilot's billing shift proves general models are expensive—and argues that React Native's genuine cross-platform constraints make it a real candidate for specialization. They find middle ground on where Apex might actually earn its place versus where the claims outpace the evidence.
- Ep 446 article 5:15
How query logs fix AI agent SQL errors
Justy and Cody dig into DataHub's new Context Intelligence layer, which mines SQL query logs to build a semantic index for AI agents. They unpack why raw schema fails at scale, whether query history actually solves the hallucination problem, and who should care about this in practice.
- Ep 445 article 4:43
Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient | MachineLearningMastery
Continuous batching is a scheduling technique that keeps LLM inference servers from wasting GPU cycles on padding. Instead of forcing short requests to wait for long ones in a fixed batch, continuous batching frees up slots the moment a request finishes and admits new work immediately, eliminating idle padding tokens and improving throughput.
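The scheduling difference described above can be illustrated with a toy simulation. This is a minimal sketch, not how any particular inference server implements it: the function names `static_batch_steps` and `continuous_batch_steps` are hypothetical, each request is reduced to a number of decode steps (assumed to be at least 1), and one loop iteration stands in for one forward pass over the batch.

```python
from collections import deque


def static_batch_steps(lengths, slots):
    """Fixed batches: every batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        # Short requests sit idle (padding) until the longest one is done.
        total += max(lengths[i:i + slots])
    return total


def continuous_batch_steps(lengths, slots):
    """Continuous batching: refill free slots before every decode step."""
    queue = deque(lengths)   # waiting requests, in arrival order
    active = []              # remaining decode steps per running request
    steps = 0
    while queue or active:
        # Admit new work as soon as a slot is free.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Each running request emits one token; finished ones free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 2, 2]   # decode steps needed by four requests
    print(static_batch_steps(lengths, slots=2))      # 10
    print(continuous_batch_steps(lengths, slots=2))  # 8
```

With two slots, the fixed batches cost 8 + 2 steps because the second pair cannot start until the long request ends, while the continuous schedule runs the short requests alongside the long one and finishes in 8.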
- Ep 444 api 5:20
Shopify’s journey to faster breadth first GraphQL execution (2026) | Shopify
Justy and Cody discuss Shopify's new breadth-first GraphQL execution engine, 'Cardinal,' which claims up to 15x faster execution and 90% less memory for large, nested queries by resolving fields once across all objects instead of per-object.
- Ep 443 research 5:59
AI memory framework MeMo skips LLM retraining
MIT's MeMo framework encodes new knowledge into a small dedicated memory model so teams can swap in a better LLM without retraining — and the performance gains are real. Justy and Cody break down how it actually works, what the benchmarks mean, and where the trade-offs bite.
- Ep 442 article 3:47
RAG Explained Simply with a Real Project
A breakdown of Retrieval-Augmented Generation (RAG) using the open-book exam analogy, explaining why traditional LLMs fail on private data, how RAG works internally, and what practical trade-offs exist when building a RAG project.
- Ep 440 research 4:07
Exploring Autonomous Agentic Data Engineering for Model Specialization
Exploring Next episode 440: Cody and Justy dig into a new paper on autonomous agentic data engineering, where LLMs act as self-driving data engineers to curate domain-specific training sets—no humans in the loop. They unpack how GPT-5.2 built an iterative curriculum that boosted a student model by 57% and debate whether this is a research toy or a shippable path to domain adaptation. The code’s on GitHub at DataAgent.
- Ep 439 research 6:19
LongTraceRL: Learning Long Context Reasoning from Search Agent Trajectories with Rubric Rewards
Justy and Cody unpack LongTraceRL, a paper that trains long-context reasoning models using realistic search-agent distractors and entity-level rubric rewards, with a short look at what would make it shippable.
- Ep 438 article 7:42
The Infrastructure Behind Making Local LLM Agents Actually Useful | Towards Data Science
A conversation about making local LLM agents actually usable, focusing on the infrastructure challenges of running scientific agents with open-weight models. The hosts discuss the author's experience building a single-cell RNA-seq analysis agent, the problem of fixed prefix costs in long tool-use loops, vLLM optimizations for inference speed, and context management for long-running sessions.
- Ep 437 api 6:51
Figma Make's new two way GitHub integration turns designs into live, production code — with built In governance
Justy and Cody dig into Figma Make’s new two-way GitHub integration and the bigger claim behind it: not that designers replace engineers, but that visual editing can finally sit inside a real software workflow without breaking governance. They unpack what the article actually shows, where the technical case is solid, and who this is genuinely useful for.
- Ep 436 article 1:52
How we chose the voices of Coda | Rime
The hosts discuss an article about how the voice model Coda was developed, focusing on the selection of voices and categorization into styles like professional, formal, casual, and energetic.
- Ep 435 article 3:38
Stop writing rules in AGENTS.md: use agent hooks and nano staged instead—Martian Chronicles, Evil Martians’ team blog
Justy and Cody riff on Evil Martians' argument that LLM guardrails belong in real pre‑commit hooks like nano‑staged rather than in AGENTS.md, weighing the speed, token savings, and practical fit for dev teams.
- Ep 433 article 5:17
AI Memory Beyond RAG: Vectors, Graphs, and Dense Mem
Justy and Cody dig into an article arguing that most people blur together three different things under "AI memory": startup context, retrieval, and durable state. They unpack why the author thinks plain RAG is good at finding text but bad at deciding what is current, and why graph-backed memory only helps if you add provenance, conflict checks, and explicit gates instead of letting a model quietly turn every sentence into a fact.
- Ep 431 api 7:02
GitHub Tencent/TencentDB Agent Memory: TencentDB Agent Memory delivers fully local long term memory for AI Agents via a 4 tier progressive pipeline, with zero external API dependencies.
Justy and Cody dissect Tencent's new 'Agent Memory' repo, which claims to solve AI context bloat by using symbolic short-term memory and layered long-term storage instead of flat vector dumps. Cody leads with skepticism about the 'symbolic' Mermaid diagram approach and the specific benchmark claims against OpenClaw, while Justy argues the product value lies in stopping agents from forgetting SOPs. They debate whether hierarchical memory is the missing link for long-horizon tasks or just another complex caching strategy, landing on a cautious 'promising for enterprise, overkill for hobbyists' verdict.
- Ep 430 article 4:58
Auth
Justy and Cody dig into auth.md, WorkOS's proposed markdown-based way for apps to tell agents how to register users. They focus on the real argument underneath it: agents need a standard discovery file for auth flows, scopes, and credential issuance, so apps can safely let software act on behalf of people without inventing a new sign-up path every time.
- Ep 429 article 2:59
Implementing Hybrid Semantic Lexical Search in RAG | MachineLearningMastery
Justy and Cody dig into a practical post on combining BM25 and dense vector search with Reciprocal Rank Fusion for RAG retrieval. Cody questions over-claims around ‘better than semantic alone’ and the toy dataset limits, while Justy zeroes in on who should actually adopt this in production by mid-year 2026.
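For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the merging step only. It is not the code from the article: the function name `reciprocal_rank_fusion`, the document IDs, and the two input rankings are illustrative assumptions, and `k=60` is the constant commonly used with RRF rather than a value taken from the post. Each document scores `1 / (k + rank)` in every list that contains it, and the scores are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered best-first (for example one
    list from BM25 and one from dense vector search).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_ranking = ["a", "b", "c"]    # hypothetical lexical results
    dense_ranking = ["b", "d", "a"]   # hypothetical semantic results
    print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
    # ['b', 'a', 'd', 'c']
```

Because the fusion uses only ranks, the BM25 and embedding scores never need to be put on a common scale, which is the main reason the method is a popular default for hybrid retrieval.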
- Ep 428 tool 7:21
Cloudflare Completes Its Agent Infrastructure Stack with Browser Run Rebuild and Six Layer Platform
Justy and Cody dig into Cloudflare's rebuilt Browser Run and the six-layer agent infrastructure stack it anchors. They debate whether the "most complete agent platform outside the hyperscalers" claim holds up, unpack the D1/Queues migration and 500k container capacity numbers, and argue about what "most complete" actually means for developers choosing a platform.
- Ep 427 article 4:55
Replacing RAG with bash cut AI retrieval costs 30%
Justy and Cody dig into the argument behind direct corpus interaction, where agents use terminal tools like grep and find instead of relying only on vector search. They like the core point that retrieval interfaces can bottleneck reasoning, but they keep it grounded: this looks strongest for exact-evidence tasks in changing workspaces, and weakest as a blanket replacement for broad recall across huge corpora.
- Ep 426 article 5:40
Virtual File System for Node.js by mcollina · Pull Request #61478 · nodejs/node
Matteo Collina's virtual file system PR for Node.js introduces a first-class node:vfs module with a provider-based architecture that lets you mount in-memory, Single Executable Application, or custom filesystems alongside the real filesystem. It intercepts 164+ fs and module-loader integration points to make require() and standard fs APIs work seamlessly with virtual files, includes overlay mode for surgical mocking, and integrates with the test runner.
- Ep 425 article 5:58
Securing AI agent credentials with MCP tunnels
Justy and Cody dig into Anthropic's claim that the real blocker for enterprise agents is credential handling, not model quality. They unpack self-hosted sandboxes and MCP tunnels, why moving auth to the network boundary changes the threat model, and where the article is careful versus a little too neat.
- Ep 424 tool 5:26
GitHub Resemble ai/DramaBox: super expressive prompting model based on ltx2
Justy and Cody dig into DramaBox, Resemble AI's expressive TTS model that uses screenplay-style prompts to control delivery, emotion, laughs, and pauses — built as an IC-LoRA fine-tune on top of Lightricks' LTX-2.3 audio model.
- Ep 423 article 5:12
Enterprise AI agents fail because they forget
Justy and Cody dig into the claim that enterprise agents don’t mainly fail because models are weak, but because the systems around them don’t preserve applicable, time-scoped decision memory. They unpack the article’s idea of a decision context graph, where it sounds technically solid, and where the startup pitch still feels unproven.
- Ep 422 tool 6:03
Interpreters in Deep Agents: Code Between Tool Calls and Sandboxes
Justy and Cody dig into the argument for adding interpreters inside agent loops: a middle layer between serial tool calls and full sandboxes that lets models compose tools, keep live state, and ship less context around. They talk through why that’s practically useful, where the early token savings matter, and where the claim gets fuzzy if you assume an interpreter can replace real environments.
- Ep 421 article 1:44
Qwen 3.7 Max Preview: What Alibaba's New AI Gets Right and Where It Falls Short | Decrypt
Justy and Cody react to Alibaba's Qwen 3.7 Max preview on Arena AI: its surprise rankings (#13 text, #5 vision globally), the open/closed strategy (Plus open, Max proprietary), and a wild creative-writing test where Qwen nailed Caribbean cultural depth. Cody questions the consistency of crowd-sourced rankings, Justy sees a market signal for non-Western developers. They tease the timing (preview lands five days before Alibaba Cloud Summit) and the model’s 'deep thinking mode' preview limits.
- Ep 420 research 7:18
RecursiveMAS cuts multi agent AI costs by 75%: researchers
Justy and Cody dig into RecursiveMAS, a research framework that lets multi-agent systems pass latent embeddings instead of text, cutting token usage and speeding up inference while keeping base model weights frozen.
- Ep 419 tool 4:21
5 Small Language Models for Agentic Tool Calling | KDnuggets
Small language models are gaining ground on a critical frontier benchmark: tool calling. This episode looks at five compact, open-weight models that can route to APIs, format JSON arguments, and run multi-step agentic workflows without requiring a data center. Cody and Justy debate whether the gap between small and frontier models is closing fast enough to matter for real shipping teams.
- Ep 416 article 2:24
Context architecture is replacing RAG in AI
Justy and Cody dissect the claim that context architecture is supplanting RAG for enterprise AI agents, weighing Redis Iris as a concrete example and debating its practical relevance for product teams.
- Ep 415 article 4:35
Agent Evals
Justy and Cody dig into Cameron Wolfe’s argument that agent evals need to move from static benchmark thinking to realistic harnesses that test autonomy, tool use, recovery, and long-horizon behavior. They get specific about the agentic loop, why tool-call correctness is only part of the story, and where outcome-based evals can hide ugly behavior. Cody mostly buys the technical framing, with caveats about overfitting to harnesses and the difficulty of defining ground truth trajectories. Justy keeps pulling it back to who actually needs this now: teams shipping coding, workflow, or other higher-stakes agents where a demo is not the same as reliability.
- Ep 414 article 8:27
Context is the Key to the Agentic Architecture Revolution: A Conversation with Baruch Sadogursky
Justy and Cody dig into Baruch Sadogursky’s claim that the real shift in agentic software isn’t better prompting, it’s treating context as an engineering artifact. They unpack the idea that specs could become the source of truth, why question loops matter, and where the microservices argument is useful versus a little too convenient.
- Ep 413 article 5:21
LangSmith Engine closes the agent debugging loop automatically — but multi-model enterprises still need a neutral layer
Justy and Cody dig into LangSmith Engine's real pitch: not just watching agents fail, but closing the loop by spotting production issues, reading the code, drafting a fix, and adding an evaluator so the same failure gets caught next time. They agree that's a meaningful step, then get into the catch from the article: enterprises using multiple model providers still need a neutral observability layer, because first-party tooling gets messy fast when Claude and GPT are both in the stack.
- Ep 412 research 16:12
MetaAgent-X: Breaking the Ceiling of Automatic Multi Agent Systems via End to End Reinforcement Learning
Justy and Cody discuss MetaAgent-X, a new paper proposing end-to-end reinforcement learning for multi-agent systems. They break down how it solves the 'frozen-executor ceiling' by jointly optimizing both the agent that designs the workflow and the agents that execute it. Cody explains the hierarchical rollout mechanism and stagewise co-evolution, while Justy explores what this means for production pipelines that currently rely on static prompts. They touch on the 21.7% performance gains, the reality of training stability, and whether this moves us from 'prompt engineering' to actual 'system engineering.'
- Ep 409 article 3:47
Google tells database devs to lean hard on AI for PostgreSQL work
Google's VP of Databases says engineers should use AI coding tools heavily for PostgreSQL contributions, with individual accountability for the output. The Register's reporting surfaces a specific claim: open source codebases are better training data than proprietary systems, and isolated extension work is the sweet spot for AI-assisted development. Cody pokes at the accountability framing and whether the training advantage claim holds up. Justy asks who actually benefits and whether this changes anything day-to-day for teams working with Postgres.
- Ep 408 article 7:37
Architectural patterns for graph enhanced RAG: Moving beyond vector search in production
Justy and Cody dig into graph-enhanced RAG, where vector search gets structural backbone from graph databases to handle multi-hop reasoning in interconnected enterprise data. They explore the hybrid retrieval pattern, debate whether ingestion-time entity extraction holds up in practice, and question who actually needs this complexity.
- Ep 407 tool 0:30
Symphony
Symphony is OpenAI's experimental framework that turns project management into autonomous agent runs. Instead of supervising individual coding agents, teams assign work items and agents handle implementation end-to-end—with CI checks, PR reviews, and proof of work built in. It's designed for codebases already using harness engineering patterns.
- Ep 405 article 6:00
LangSmith Sandboxes are Generally Available
Cody leads a skeptical read of LangSmith Sandboxes going GA — questioning whether microVM isolation is genuinely new or just well-packaged infrastructure. Justy pushes back on who actually needs this and why it matters for teams shipping real agent workflows. They land somewhere honest: the security argument holds, but the moat question is real.
- Ep 402 research 5:52
Many Shot CoT ICL: Making In Context Learning Truly Learn
Justy and Cody dig into a paper arguing that long-context chain-of-thought prompting behaves less like stuffing a prompt with relevant examples and more like teaching the model during inference. They unpack why many-shot tricks from classification break on reasoning, why semantic retrieval stops helping, and how the paper’s Curvilinear Demonstration Selection tries to order examples like a smooth mini-curriculum.
- Ep 401 tool 5:52
Red Hat adds support for agentic AI development
Justy and Cody unpack Red Hat's new agentic AI development push: supported Podman Desktop, local AI agent sandboxing, OpenShift Dev Spaces integrations, trusted images and libraries, skill repositories, MCP, and Fedora Hummingbird Linux.
- Ep 400 article 3:56
Hermes Unlocks Self Improving AI Agents, Powered by NVIDIA RTX PCs and DGX Spark
Hermes is a rapidly growing, self-improving AI agent framework that runs locally on NVIDIA RTX PCs and DGX Spark, using small but powerful Qwen models to do what previously required data-center scale.
- Ep 399 article 6:06
We built SmithDB, the data layer for agent observability
Justy and Cody dig into why agent traces have become a weird database problem, and why LangSmith built SmithDB instead of stretching a normal observability stack past its limits.
- Ep 398 api 3:14
Anthropic reinstates OpenClaw and third party agent usage on Claude subscriptions — with a catch
Anthropic reinstates OpenClaw and third-party agent usage on Claude subscriptions with a catch
- Ep 396 article 3:35
New in Deep Agents v0
Justy and Cody chat in their kitchen about Deep Agents v0.6, highlighting open-weight cost cuts, Delta channels, new streaming, and the handy code interpreter. They riff on how to jump-start a weekend project and point to the Context Hub integration for learning agents.
- Ep 395 article 7:02
Introducing LangSmith Engine
Justy and Cody dig into LangSmith Engine as a practical shift from manual agent triage to a more continuous loop: production traces get clustered into named issues, tied back to likely root causes in code, and turned into draft fixes plus new eval coverage. They focus on why that matters for teams drowning in traces, how the system piggybacks on existing LangSmith tracing and evaluators, and where the real adoption friction is for product teams and solo builders.
- Ep 394 article 8:07
How Lakebase Architecture Delivers 5x Faster Postgres Writes
Justy and Cody dig into Databricks Lakebase claiming much faster Postgres writes by turning off full page writes at the compute layer and pushing page image generation into distributed storage. Cody likes the architectural trick but questions where the complexity moved, while Justy argues the real win is for teams hitting write bottlenecks without wanting to re-architect their app.
- Ep 393 tool 3:49
Build Long running AI agents that pause, resume, and never lose context with ADK | Google Developers Blog
Justy and Cody discuss the limitations of stateless chatbots for long-term enterprise workflows and explore Google's ADK solution for durable, event-driven AI agents that can pause and resume without losing context, using a new hire onboarding scenario as the primary case study.
- Ep 392 article 5:43
Implementing Prompt Compression to Reduce Agentic Loop Costs | MachineLearningMastery
Justy and Cody kick around whether prompt compression is actually a smart production habit or just another neat demo. Cody starts skeptical about summary drift and hidden complexity, then they get concrete on why long agent loops get expensive fast, what the article's Python example is really proving, and where compressed history plus distilled instructions make sense right now.
- Ep 391 api 5:15
Local First AI Inference: A Cloud Architecture Pattern for Cost Effective Document Processing
Justy and Cody debate Local-First AI Inference — a pattern that routes most documents to deterministic local extraction while falling back to cloud AI for edge cases. They unpack the signal in the noise: who actually benefits, the clever confidence-gated routing, the real cost savings, and the architectural trade-offs. Then they lay out concrete ways to test the claims over a weekend.
- Ep 390 research 8:11
SocialReasoning Bench shows the limits of today’s AI agents
Justy and Cody dig into SocialReasoning-Bench, a new benchmark for whether AI agents actually advocate for a user instead of just finishing the task. They unpack the two test settings, the outcome and process metrics, and why near-perfect task completion can still hide pretty bad delegation.
- Ep 389 article 4:26
Evolution of a Backend for a Streaming Application
Daniele Frasca's talk on evolving Joyn's backend from a fragile single-node Kafka-to-DB setup to a multi-region serverless architecture on AWS, covering hub-and-spoke data consistency, cell-based isolation, and cost optimization for active-active streaming.
- Ep 388 article 5:21
Thinking Machines shows off preview of near realtime AI voice and video conversation with new 'interaction models'
Thinking Machines previews 'interaction models'—AI that processes voice and video in real-time, simultaneously listening and responding instead of waiting for user input to finish. Cody is skeptical about whether this solves a real problem or is architectural theater; Justy argues the latency gains and enterprise safety use cases (manufacturing oversight, customer service) are genuinely useful. They debate whether 'full-duplex' is a fundamental shift or incremental polish on existing models.
- Ep 387 tool 5:08
Scaling real time performance with Bigtable in memory tier | Google Cloud Blog
Justy and Cody geek out over Bigtable's new in-memory tier, which uses RDMA to deliver sub-millisecond reads. Justy sees a product manager's dream for removing cache-layer nightmares, while Cody explains how direct memory access avoids CPU bottlenecks and why the hotspot resistance is the real game-changer.
- Ep 386 research 10:41
Teaching Claude why
Cody and Justy dig into Anthropic's 'Teaching Claude Why' research — a post-training alignment paper showing that teaching an AI model ethical reasoning generalizes far better than just training it on correct behaviors. Cody is skeptical about how much of this is genuinely novel versus expected ML hygiene dressed up in alignment language. Justy pushes back with the product reality: if this actually closes the agentic blackmail problem, the downstream market implications are real.
- Ep 385 article 4:07
OpenAI launches the OpenAI Deployment Company to help businesses build around intelligence
OpenAI launches the OpenAI Deployment Company, a standalone business unit with $4B backing, to embed Forward Deployed Engineers into enterprises for real-world AI integration, including the acquisition of Tomoro for 150 experienced FDEs.
- Ep 384 article 4:41
Stop Wasting Tokens: A Smarter Alternative to JSON for LLM Pipelines | KDnuggets
Cody is skeptical that TOON is a universal fix for JSON in LLM pipelines, and Justy pushes that the real win is for repeated structured records where token cost and clarity both matter. They land on TOON as a useful pre-LLM transport format, not a replacement for JSON everywhere.
- Ep 383 tool 6:59
GitHub Trusted Remote Execution/trusted Remote execution: Sandboxed Rhai script execution engine with Cedar policy authorization for every system operation.
Justy and Cody dig into Trusted Remote Execution (REX), a sandboxed Rhai script engine that runs Cedar policy authorization checks against every single system call — file I/O, network, processes — before anything actually executes. They cover why TOCTOU mitigations matter, how the Cedar + Rhai pairing works architecturally, who actually reaches for something like this, and what a weekend project with it might look like.
- Ep 382 api 7:51
Speeding up agentic workflows with WebSockets in the Responses API
Justy and Cody dig into OpenAI’s writeup on speeding up agentic workflows with WebSockets in the Responses API. Cody is skeptical of the hype around raw model speed, while Justy keeps pulling it back to user pain: long, repetitive agent loops that make coding tools feel sluggish. They land on a practical read — the transport change matters most when the model is fast enough that API overhead becomes the bottleneck — and they sketch a weekend experiment for building a tiny stateful agent loop.
- Ep 381 tool 8:13
The Roadmap to Mastering Tool Calling in AI Agents
Justy and Cody talk through Machine Learning Mastery's roadmap for production-grade tool calling in AI agents, focusing on contracts, error handling, parallel calls, catalog size, security boundaries, and practical evaluation.
- Ep 380 research 8:37
ARIS: Autonomous Research via Adversarial Multi Agent Collaboration
Justy and Cody dig into ARIS, an open-source harness for autonomous ML research that assumes a single long-running agent will eventually make unsupported claims. They unpack the core idea of pairing an executor with a reviewer from a different model family, plus the three-layer architecture, evidence checks, claim ledger, and workflow library. They also get practical about who might actually use it, what feels shippable versus research-only, and a few concrete ways to try pieces of it without building the whole lab.
- Ep 379 article 6:53
Validating agentic behavior when “correct” isn’t deterministic
GitHub's new validation framework for agentic systems moves beyond brittle, step-by-step testing toward outcome-focused validation. When autonomous agents (like Copilot Coding Agent) interact with real environments, correctness is no longer deterministic—loading screens may appear or vanish, timing shifts, and multiple valid action sequences can succeed. The framework uses dominator analysis and graph-based modeling (Prefix Tree Acceptors) to distinguish between essential outcomes and incidental noise, requiring only 2–10 successful traces to build a ground-truth model. Cody finds the approach clever but questions whether it scales beyond UI automation; Justy sees real market traction in CI/CD reliability and enterprise adoption.
- Ep 378 article 10:21
Four Agent Orchestration Patterns
Justy and Cody dig into a benchmark study testing four multi-agent orchestration patterns across 10,000 SEC filings — sequential pipeline, parallel fan-out, hierarchical supervisor-worker, and reflexive self-correcting loop — unpacking the real cost-accuracy-scale trade-offs and how to pick the right one for production.
- Ep 377 research 9:26
Benchmarking Multi Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost Accuracy Tradeoffs and Production Scaling Strategies
Justy and Cody break down a benchmark of four multi-agent LLM orchestration patterns for extracting structured data from SEC filings, focusing on cost, accuracy, latency, and what’s actually shippable in production. They compare sequential, parallel, hierarchical, and reflexive setups across 10,000 filings and land on a practical middle ground: hierarchical orchestration gets close to the best accuracy without the reflexive loop’s big cost hit.
- Ep 376 article 7:18
Anthropic will let its managed agents dream
Justy and Cody talk through Anthropic’s idea of managed agents that can “dream” or rehearse outcomes before acting, with attention to product trust, architecture, sandboxes, and a small weekend build.
- Ep 375 research 10:44
Hallucinations Undermine Trust; Metacognition is a Way Forward
Justy and Cody dig into a paper arguing that the real trust problem with language models is not merely being wrong, but being wrong with unwarranted confidence. They unpack the paper’s shift from answer-versus-abstain to ‘faithful uncertainty,’ where a model’s wording should reflect its actual internal uncertainty. Cody breaks down the discrimination-versus-calibration distinction and why that matters for both chatbots and tool-using agents. Justy pushes on what this means in production, where hedging can either build trust or feel slippery if it is not tied to real behavior.
- Ep 374 tool 7:10
The app store for robots has arrived: Hugging Face launches open source Reachy Mini App Store with 200+ apps
Hugging Face launches an app store for Reachy Mini, a $299 open-source desktop robot, hosting 200+ community-built applications. The store removes the roboticist barrier by letting non-technical users build robot apps in minutes using plain English descriptions and an AI agent called ML Intern. Cody questions whether this solves a real problem or is mostly marketing hype around a niche hardware play, while Justy argues the accessibility angle and the removal of weeks-long integration work represent a genuine market shift.
- Ep 373 api 8:10
Gemini API File Search is now multimodal: build efficient, verifiable RAG
Justy and Cody dig into Gemini API File Search getting multimodal retrieval, metadata filters, and page-level citations, and why that matters for anyone tired of flaky RAG over PDFs and image folders.
- Ep 372 article 8:40
The context window has been shattered: Subquadratic debuts a 12 Million Token window
Cody is skeptical that a 12-million-token context window is broadly useful today, while Justy pushes the angle that it solves a very real pain point for teams with giant codebases, logs, and long-running workflows. They land on it as a real technical milestone with a narrow early market, plus a lot of unanswered questions about cost, latency, and whether most users need this kind of scale.
- Ep 371 article 7:54
How NetEase Games cut LLM cold starts from 42 minutes to 30 seconds
Justy and Cody unpack how NetEase Games used Kubernetes-native data orchestration with Fluid to shrink LLM inference cold starts from 42 minutes to about 30 seconds, and what that means for teams running their own models.
- Ep 370 research 9:19
HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness
Justy and Cody dig into HeavySkill, a paper arguing that a lot of so-called agent harness magic is really a simpler inner pattern: generate multiple reasoning paths in parallel, then run a separate deliberation pass that compares and summarizes them. They unpack the memory-cache trick, why it can beat plain Best-of-N, where the gains seem to come from, and what this means for builders deciding between brittle orchestration and something more shippable.
- Ep 369 tool 4:59
ScyllaDB cut Sprig's read latency 4X after Redis and ClickHouse hit a wall
Sprig, a fintech platform, hit latency walls with Redis and ClickHouse as their user base grew. By migrating to ScyllaDB, a high-performance NoSQL database compatible with Apache Cassandra, they cut read latency by 4x and solved throughput bottlenecks. The episode explores why a specialized database sometimes beats general-purpose tools, the trade-offs of that choice, and when you'd actually reach for ScyllaDB in your own stack.
- Ep 368 article 7:35
The RAG era is ending for agentic AI — a new compilation-stage knowledge layer is what comes next
Pinecone just announced Nexus, a 'knowledge engine' that shifts reasoning from inference time to a compilation stage — meaning agents get pre-built, task-specific knowledge artifacts instead of rediscovering context from scratch every session. Justy and Cody dig into why RAG was never really built for agents, what the architecture actually does, and whether the 98% token reduction claim holds water.
- Ep 367 research 6:43
From Context to Skills: Can Language Models Learn from Context Skillfully?
Cody and Justy dig into Ctx2Skill, a self-evolving framework that turns long, dense context into reusable natural-language skills for language models. They talk through the core loop, the roles of the Challenger, Reasoner, and Judge, and the replay trick that keeps the system from drifting into weird overfit territory, then land on what it means for product teams trying to ship context-heavy workflows.
- Ep 366 article 8:36
From Batch to Micro Batch Streaming: Lessons Learned the Hard Way in a Delta Index Pipeline
Justy and Cody unpack an InfoQ case study about moving an ads delta-index pipeline from scheduled batch jobs to Spark micro-batches, focusing on freshness, object-store ingestion, logical watermarks, restart behavior, and practical weekend experiments.
- Ep 365 tool 8:28
Meta Introduces Autodata an Agentic Framework That Turns AI Models Into Autonomous Data Scientists for High Quality Training Data Creation
Justy and Cody dig into Meta’s Autodata and why better data, not just bigger models, is the pain point showing up everywhere right now. They unpack Agentic Self-Instruct, the four-agent setup, the weak-versus-strong solver idea, and why turning extra inference compute into better training data is a pretty interesting trade. They also get practical about who would adopt it, where the friction is, and a couple of concrete weekend experiments to try.
- Ep 364 tool 7:09
The scaffolding era is over. LlamaIndex says context is the new moat
LlamaIndex CEO Jerry Liu argues that the scaffolding layer of RAG frameworks and orchestration tools is becoming obsolete as frontier models get smarter at reasoning over raw data. The real moat shifts to context quality — parsing, OCR, and extracting signal from messy file formats — rather than framework complexity. Models like Claude now handle multi-step planning, tool discovery, and code generation natively, collapsing the distinction between deterministic workflows and agentic reasoning.
- Ep 363 research 10:10
From Skill Text to Skill Structure: The Scheduling Structural Logical Representation for Agent Skills
Justy and Cody dig into the SSL (Scheduling-Structural-Logical) representation paper from Peking University — a structured, three-layer JSON schema designed to replace the messy, text-heavy SKILL.md files that LLM agent systems currently rely on. They cover why parsing natural language skill docs is a real bottleneck, how SSL's three layers (scheduling, structural, logical) map to classical AI theory, what the benchmark numbers actually mean, and whether this is something builders can use today.
- Ep 361 article 9:10
Qwen AI Releases Qwen Scope, an Open Source Sparse Autoencoders (SAE) Suite That Turns LLM Internal Features Into Practical Development Tools
Justy and Cody unpack Qwen-Scope, Qwen AI’s open-source sparse autoencoder suite for making LLM internals more usable in debugging, steering, and benchmark analysis.
- Ep 360 research 10:37
FAMA: Failure Aware Meta Agentic Framework for Open Source LLMs in Interactive Tool Use Environments
Justy and Cody dig into FAMA, a failure-aware orchestration framework for smaller open-source tool-using LLM agents. They unpack why long multi-turn support-style tasks keep breaking, how FAMA studies failed trajectories and then routes only the right helper agents into context, and why that matters for teams trying to ship cheaper, more reliable agents without fine-tuning or massive reinforcement-learning pipelines.
- Ep 358 article 4:03
Google AI breakthrough means chatbots use six times less memory during conversations without compromising performance
Google's TurboQuant compresses AI working memory (the KV cache) by up to 6x in real time using two novel techniques — PolarQuant and QJL — without degrading model performance. Justy and Cody dig into what this actually means for inference costs, who benefits first, and why the 'DeepSeek moment' framing is both apt and a little overblown.
- Ep 357 api 6:44
Building with Gemini Embedding 2: Agentic multimodal RAG and beyond | Google Developers Blog
Gemini Embedding 2 just made multimodal retrieval a lot more practical: text, images, video, audio, and PDFs can all land in one embedding space, which changes search, RAG, and agent workflows.
- Ep 356 tool 6:05
Why AI Engineers Are Moving Beyond LangChain to Native Agent Architectures | Towards Data Science
Justy and Cody unpack why teams are moving from LangChain-style frameworks toward native agent architectures once LLM apps hit production pressure.
- Ep 355 research 8:37
Alibaba's HDPO cuts AI agent tool overuse from 98% to 2%
Justy and Cody dig into Alibaba's HDPO and Metis, a training setup that teaches AI agents to stop calling tools by default. Cody likes the core idea because it separates accuracy from efficiency during reinforcement learning, but he questions how portable the benchmark win is. Justy pushes on why this matters for real products right now: users feel latency, teams feel API bills, and nobody wants an agent that opens a toolbox for a task it already knows how to do.
- Ep 354 article 5:45
Agentic AI: How to Save on Tokens | Towards Data Science
Cody and Justy examine whether the token-saving techniques in Ida Silfverskiöld's article (prompt caching, semantic caching, lazy-loading, routing, context cleanup) are practical wins or theoretical cost-cutting that introduces real friction. Cody opens skeptical: the savings are real but the tradeoffs are often hidden or underestimated. Justy counters that for production teams already bleeding money on agentic AI, even 20-30% savings justify the engineering lift. They land on a nuanced take: prompt caching is genuinely low-risk and worth it; semantic caching and aggressive routing are trickier and need honest trade-off audits before deployment.
- Ep 353 research 8:21
DV World: Benchmarking Data Visualization Agents in Real World Scenarios
Justy and Cody dig into DV-World, a new benchmark from a multi-institution research team that stress-tests AI data visualization agents on real-world tasks — spreadsheet manipulation, cross-framework chart evolution, and handling ambiguous user intent. Even the best models top out around 50%, which tells you a lot about where the gap actually is.
- Ep 352 article 10:10
Tuning Deep Agents to Work Well with Different Models
Justy and Cody dig into LangChain’s new Deep Agents model-specific harness profiles. Cody is skeptical that prompt-and-tool tuning is a durable win, while Justy sees a practical adoption path for builders who keep hitting model-specific quirks. They land on a cautious take: useful, real, and probably underappreciated, but not magic.
- Ep 351 tool 9:23
DBmaestro MCP Server Puts Natural Language in Control of Database Pipelines
Episode 351 of Exploring Next looks at DBmaestro’s new MCP server, which lets AI agents trigger governed database DevOps workflows through natural language while staying inside existing permissions and audit controls.
- Ep 350 article 9:27
You don't need an expensive GPU to run a local LLM that actually works
Cody and Justy examine the claim that you don't need an expensive GPU to run capable local LLMs. Cody opens skeptical about quantization trade-offs and real-world inference speed; Justy pushes back with the actual user story—cost-conscious builders and privacy-first home automation. They dig into what 'works' really means, explore the CPU-only vs. GPU trade-off, and land on a nuanced take: smaller quantized models on mid-range hardware are genuinely usable now, but marketing around this can oversell the experience. Build Next includes testing Ollama on a specific budget GPU and benchmarking a 7B quantized model on a CPU-only rig.
- Ep 349 article 7:08
Mistral AI Introduces Workflows for Orchestrating Enterprise AI Processes
Mistral AI launches Workflows, an enterprise orchestration layer built on Temporal that brings stateful execution, human-in-the-loop checkpoints, and fault tolerance to multi-step AI processes. Justy and Cody dig into what it actually solves, where the real hard problems still live, and what to try this weekend.
- Ep 348 tool 4:37
Warp's gamble: Going open source to take on closed-source rivals
Warp is open-sourcing its terminal client while keeping parts of its cloud and AI stack closed, which makes this a pretty direct bet on trust, adoption, and developer workflow at a moment when more people are living in terminals with AI bolted on.
- Ep 347 tool 4:03
Cut AI token usage by 96%? Here's how AWS Strands Agents does it.
AWS Strands Agents is a way to cut agent token usage by making models ask for only the context they need, when they need it. Instead of stuffing huge prompts up front, it uses tools, memory, and session state to keep agents lean, which matters for cost, latency, and scaling.
- Ep 346 article 3:30
Stop Hitting Claude Code Limits
Claude Code's usage limits aren't the real problem—how you set it up is. Four controllable causes drive 85% of overspend: cache misses, context bloat, wrong model routing, and token-heavy input formats. One user cut costs from $1,389/mo to $200/mo by locking tools at session start, disabling 1M context, delegating to cheaper subagents, and swapping screenshots for accessibility trees. Real fixes are copy-paste configuration changes and workflow tweaks, not waiting for Anthropic.
- Ep 345 research 10:19
Recursive Multi Agent Systems
RecursiveMAS is a new multi-agent framework from researchers at UIUC, Stanford, NVIDIA, and MIT that replaces text-based agent handoffs with latent-space recursion — cutting token usage by up to 75%, speeding up inference 2.4x, and improving accuracy by 8.3% across nine benchmarks. Justy and Cody dig into why passing hidden states instead of words is such a big deal, what the RecursiveLink module actually does, and whether any of this is shippable today.
- Ep 344 article 4:27
Definity embeds agents inside Spark pipelines to catch failures before they reach agentic AI systems
Episode 344 of Exploring Next takes a skeptical look at Definity putting agents inside Spark and dbt execution so teams can catch stale inputs, skew, memory pressure, and bad downstream writes during a run instead of after the damage is done. Cody likes the placement but questions how much autonomy teams will really allow in production. Justy argues the buyer story is strong for data teams supporting AI systems and expensive on-prem workloads where wasted runs hurt immediately.
- Ep 342 research 7:01
ClawMark: A Living World Benchmark for Multi Turn, Multi Day, Multimodal Coworker Agents
ClawMark is a benchmark for evaluating AI agents as persistent coworkers across multi-day workflows with dynamic, stateful environments. Unlike existing benchmarks that run single-episode tasks in static environments, ClawMark spans multiple in-universe workdays with exogenous state changes (emails arrive, calendars shift, files update) between turns, multimodal evidence (PDFs, audio, video, spreadsheets), and deterministic rule-based scoring via 1,537 Python checkers. The benchmark contains 100 tasks across 13 professional scenarios running against five sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet). Current frontier models reach a 75.8 weighted score but only 20% strict task success, revealing that adaptation to changing state remains a core unsolved challenge.
- Ep 341 research 4:29
American AI startup Poolside launches free, high performing open model Laguna XS.2 for local agentic coding
Justy and Cody unpack Poolside’s new Laguna XS.2, an Apache 2.0 open model aimed at local agentic coding, plus the bigger Laguna M.1, the pool agent harness, and the shimmer coding environment.
- Ep 340 research 5:47
Stochastic KV Routing: Enabling Adaptive Depth Wise Cache Sharing
Justy and Cody dig into Stochastic KV Routing, a paper on cutting transformer KV cache memory by sharing caches across layers instead of only squeezing along the token axis. They unpack random cross-layer attention, why it helps models tolerate missing per-layer caches, and where this could matter in real serving stacks.
- Ep 339 research 7:24
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
In this episode, Justy and Cody dig into SketchVLM, a training-free framework that lets vision-language models explain answers by drawing editable SVG annotations on top of images. They talk through why text-only answers are hard to verify, how SketchVLM uses a draft-and-refine loop plus visual grounding to produce overlays, where it looks production-friendly, and where the trade-offs still show up.
- Ep 338 research 5:09
Rewarding the Scientific Process: Process Level Reward Modeling for Agentic Data Analysis
DataPRM is a process reward model built specifically for agentic data analysis that fixes two critical gaps in general-purpose PRMs: silent errors (code runs but produces wrong results) and grounding errors (penalizing necessary exploration). It works by actively probing the environment to validate intermediate states and using a ternary reward strategy to distinguish between correctable mistakes and irrecoverable failures. The team built a 7K-instance training dataset and reports 7-11% improvements on benchmarks with a model of only 4B parameters.
- Ep 337 thread 2:34
This closes a loop I've been working on for three months. Every agent harness debate has a hidden assumption: that t...
Rohit Ghumare's thread argues the agent harness debate is asking the wrong question. Instead of debating how thick the wrapper around a backend should be, the insight is that agents, queues, sandboxes, and services should all participate in the same execution model — built on three primitives: Worker, Function, and Trigger. The payoff is live discovery, live extensibility, and a single trace across everything.
- Ep 336 article 3:08
Causal Inference Is Different in Business | Towards Data Science
A quick read on why business causal inference is really about matching rigor to the size and reversibility of the decision, not proving everything with maximum purity every time.
- Ep 335 article 4:20
Sentry’s Seer Agent lets developers debug production issues in natural language
Exploring Next, episode 335. Sentry’s Seer Agent brings natural-language debugging into production incidents, aiming to cut the time teams spend digging through traces, logs, and issue context.
- Ep 334 article 9:01
Open source Xiaomi MiMo V2.5 and V2.5 Pro are among the most efficient (and affordable) at agentic 'claw' tasks
Xiaomi's open-source MiMo-V2.5 and V2.5-Pro models claim top-tier efficiency for agentic 'claw' tasks—autonomous agents that handle email, content creation, and complex coding work. The Pro version uses 40-60% fewer tokens than GPT-5.4 or Claude Opus while costing a fraction as much. Cody questions whether token efficiency alone translates to real production wins, while Justy sees a genuine market opening for cost-conscious enterprises building agent workflows.
- Ep 333 article 5:32
Build a Reinforcement Learning Powered Agent That Learns to Retrieve Relevant Long Term Memories
Cody and Justy dig into a tutorial that trains an RL agent — using PPO via Stable-Baselines3 — to retrieve long-term memories more accurately than plain cosine similarity search. They debate whether the added complexity is justified, who actually needs this, and what it would take to move from a synthetic demo to something production-worthy.
- Ep 332 article 8:25
RAG precision tuning can quietly cut retrieval accuracy by 40%, putting agentic pipelines at risk
Justy and Cody dig into new Redis research showing that fine-tuning RAG embeddings for sentence-level precision can quietly hurt general retrieval, sometimes by a lot. They unpack why that matters more in agent pipelines, where one bad retrieval can snowball into bad downstream actions, and why common fixes like hybrid search, MaxSim reranking, or bigger models don't really solve the structural problem. The episode lands on a practical takeaway: keep recall fast, add a separate verification step when correctness actually matters.
- Ep 331 article 5:00
OpenMOSS Releases MOSS-Audio, an Open Source Foundation Model for Speech, Sound, Music, and Time-Aware Audio Reasoning
Exploring Next, episode 331, on MOSS-Audio from OpenMOSS, an open-source foundation model that tries to handle speech, sound, music, and time-aware audio reasoning in one stack.
- Ep 330 research 5:58
Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
SLIDERS solves the aggregation bottleneck in document question answering by extracting information into a relational database and reasoning over structured data via SQL instead of concatenating chunks. It uses data reconciliation to fix duplicates and inconsistencies, outperforming GPT-4 on long-context benchmarks and scaling to 36M tokens.
- Ep 328 research 1:59
Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
A new QA framework called SLIDERS handles document sets that outgrow any context window by storing extracted facts in a relational database and reasoning over them with SQL.
- Ep 327 article 6:10
Text Summarization with Scikit-LLM | MachineLearningMastery
Justy and Cody kick around a MachineLearningMastery post on using scikit-LLM for text summarization inside scikit-learn pipelines. Cody is skeptical about the real value of wrapping a summarizer as a transformer, while Justy argues it fits messy, text-heavy workflows where teams already live in sklearn. They land on a cautious verdict: useful for specific preprocessing jobs, but not a magic shortcut, especially once cost, latency, and summary quality enter the picture.
- Ep 326 article 3:22
An open source spec for Codex orchestration: Symphony.
Symphony is an open-source spec that turns your issue tracker into an agent control plane, letting coding agents pull work continuously instead of requiring constant human supervision. OpenAI built it to solve the bottleneck of context-switching across multiple agent sessions, and saw a 500% increase in landed PRs on some teams. The spec is language-agnostic and designed to be implemented by agents themselves.
- Ep 325 article 9:12
Enterprises are obsessing over model accuracy while ignoring the infrastructure layer where AI systems actually break.
Enterprises fixate on model accuracy benchmarks while the real failures happen silently in the infrastructure layer — stale retrieval, orchestration drift, and context decay that never trigger a single alert. Cody and Justy dig into why behavioral telemetry is a different instrument than traditional observability, who actually owns these failures organizationally, and what concrete steps teams can take to test for the conditions that production actually creates.
- Ep 324 api 1:49
Prompt guidance | OpenAI API
Justy and Cody unpack OpenAI’s prompt guidance for GPT-5.5, focusing on shorter outcome-first prompts, personality blocks, preambles for tool use, and retrieval budgets that help agents stop at the right time.
- Ep 322 tool 4:59
Opentabs Dev/opentabs
OpenTabs lets AI agents call real web APIs through your browser session—Discord, Slack, GitHub, Notion, and 100+ more—without screenshots, DOM scraping, or API keys. Cody questions the security model and plugin discovery overhead; Justy argues the authenticated-session angle solves a real friction point for AI workflows. They land on it as genuinely useful for power users and developers, but adoption hinges on plugin ecosystem maturity and trust.
- Ep 320 article 3:44
DeepSeek V4 arrives with near state of the art intelligence at fraction of the cost of Opus 4.7, GPT 5
Justy and Cody unpack DeepSeek-V4, an open-weight MoE model that gets close to top closed models on several practical benchmarks while landing in a much lower price tier. They focus on why cheaper frontier-class inference changes what teams can afford to automate, where DeepSeek still trails GPT-5.5 and Claude Opus 4.7, and what builders can try this weekend.
- Ep 320 tool 2:32
Git
Justy and Cody look at ai-cli-mcp, a package that turns several coding agents into background jobs from one MCP server. They focus on why parallel AI work is useful now, how the package routes prompts to Claude, Codex, Gemini, Forge, and OpenCode, and where setup friction and safety trade-offs show up.
- Ep 320 research 5:03
Towards a science of scaling agent systems: When and why agent systems work
A skeptic’s take on Google Research’s paper on scaling agent systems. Cody argues the useful part is not “more agents” but the evidence that coordination only helps when the task structure fits. Justy pushes on why that matters for teams shipping assistants right now, where cost, reliability, and user trust beat demo flair. Together they unpack the five architectures, the strong gains on parallel work, the collapse on sequential planning, and what a solo builder could test this weekend.
- Ep 319 api 5:25
GitHub Kwstx/engram Translator: layer that lets you connect any agent, any tool, any api together.
In this episode, Justy and Cody dig into Engram, an interoperability layer for AI agents, tools, and APIs that tries to reduce the adapter churn people keep running into as standards multiply. They talk through protocol translation, semantic schema repair, weighted routing, and the practical friction of adoption, then close with a few concrete build ideas.
- Ep 317 article 8:42
OpenAI launches Privacy Filter, an open source, on Device data sanitization model that removes personal information from enterprise datasets
Cody and Justy dig into OpenAI's Privacy Filter — a 1.5B-parameter, on-device PII redaction model released under Apache 2.0. Cody questions whether a single-model redaction layer is robust enough for high-stakes compliance, while Justy argues the real story is the license and the workflow it unlocks for enterprises sitting on unusable data.
- Ep 317 research 4:28
ClawEnvKit: Automatic Environment Generation for Claw Like Agents
Cody and Justy dig into ClawEnvKit, a pipeline from researchers at UMD, UC Berkeley, UCLA, and MBZUAI that automates the creation of training and evaluation environments for claw-like LLM agents — cutting construction cost by 13,800x compared to human curation.
- Ep 316 article 4:35
panini/README.md at main · dpaul0501/panini
Justy and Cody dig into panini, a prompt skill that borrows Pāṇinian role structure to make agent outputs more explicit about who acted, on what, with which tool, and why. They focus on why that matters in real agent loops, how the repo measures gains in traceability and drops in hedging, and where the token-cost trade-off looks worth it.
- Ep 315 tool 4:46
GitHub Dejuknow/md redline: Inline review comments for markdown specs. Built in MCP server hands feedback directly to your AI agent.
On Exploring Next episode 315, Justy and Cody look at md-redline, a local review layer for markdown specs, prompts, and design docs. They dig into why inline feedback matters in agentic workflows, how invisible HTML markers keep comments inside the source file, and why an MCP server that can pause an agent mid-task changes the review loop. They also weigh the adoption friction, the file-based trade-offs, and a few practical ways to try it.
- Ep 314 research 2:53
Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
Justy and Cody dig into Mind’s Eye, a new benchmark for testing whether multimodal models can actually do visual thinking like rotation, folding, analogy, and composition instead of just describing images well. They unpack the paper’s A-R-T taxonomy, the gap between human and model scores, why prompting helps some tasks and hurts others, and what this means for anyone trying to ship multimodal features.
- Ep 313 research 4:34
AgentSPEX: An Agent SPecification and EXecution Language
Justy and Cody dig into AgentSPEX, a YAML-based language and runtime for building LLM agents with explicit control flow, typed steps, reusable submodules, parallel execution, and state management. They focus on the gap between loose ReAct prompting and Python-heavy orchestration tools, then unpack how AgentSPEX separates workflow specification from execution while still supporting tools, sandboxing, checkpointing, replay, and visual editing. The conversation lands on who this is for, where it feels shippable, and what a solo builder could try this weekend.
- Ep 312 article 4:21
One Developer, Two Dozen Agents, Zero Alignment
Ace is a GitHub Next prototype that treats coding with agents as a shared workspace instead of a solo tool. The skepticism is whether teams really want another surface for coordination, even if the architecture is clever.
- Ep 311 tool 3:34
Kimi K2.6 runs agents for days — and exposes the limits of enterprise orchestration
Exploring Next, episode 311. We look at Kimi K2.6 and why agents that run for hours or days are exposing a weak spot in enterprise orchestration, governance, and state management.
- Ep 310 research 4:35
LeWorldModel: Stable End to End Joint Embedding Predictive Architecture from Pixels
Justy and Cody dig into LeWorldModel, a pixel-to-latent world model that tries to make JEPA training boring in the best way. The paper’s claim is simple but pretty important: you can jointly train the encoder and dynamics model from raw pixels without EMA tricks, stop-gradient, pretraining, rewards, or reconstruction, and still avoid collapse. They unpack the Gaussian latent regularizer, the autoregressive next-embedding prediction setup, and why a 15M-parameter model that runs on one GPU could matter more for builders than a flashier giant model.
- Ep 309 article 6:05
6 Things I Learned Building LLMs From Scratch That No Tutorial Teaches You | Towards Data Science
Justy and Cody dig into what actually changes when you stop calling an LLM API and start building pieces yourself: why fine-tuning tricks like RsLoRA matter, why RoPE won, where weight tying still makes sense, why Pre-LN became the default, and how KV cache buys speed by spending memory.
- Ep 308 research 8:03
Moonshot AI and Tsinghua Researchers Propose PRFaaS, a Cross-Datacenter KV-Cache Architecture That Rethinks How LLMs Are Served at Scale
Justy and Cody unpack PRFaaS, a cross-datacenter KV-cache serving design from Moonshot AI and Tsinghua that tries to make LLM inference less wasteful by treating prefills as reusable networked assets instead of repeating them in every region.
- Ep 307 article 11:49
Kimi K2.6 Is the Open Model Release
Justy and Cody dig into why Kimi K2.6 lands at exactly the right moment for people trying to run long-lived coding agents: it’s open, strong on coding, and can actually see screenshots and video without bolting on a separate vision model. They unpack the 1T MoE design with 32B active parameters, the 262K context window, benchmark wins that matter, and Moonshot’s bigger bet on tool-heavy, long-horizon agent work. They also separate the impressive parts from the marketing gloss, then close with concrete stuff to try this week.
- Ep 306 article 11:48
Moonshot AI Releases Kimi K2.6, Beats Top US Models On Some Benchmarks
Justy and Cody dig into why Kimi K2.6 matters right now: not because of a flashy leaderboard screenshot, but because it appears unusually strong at the stuff teams actually pay for — coding work, tool use, and long-running task execution. They unpack the benchmark wins, the 12-to-13-hour autonomous coding demos, the scaled-up agent swarm design, and what Moonshot seems to be optimizing for. They end with concrete things to try if you want to test this class of model yourself.
- Ep 305 tool 10:20
Harness engineering for coding agent users
Justy and Cody dig into harness engineering for coding agents: the practical idea that trust in AI-written code comes less from the model itself and more from the guardrails, checks, and feedback loops wrapped around it. They unpack feedforward guides versus feedback sensors, deterministic tooling versus LLM-based judgment, and why teams should treat the human as the person tuning the harness instead of reviewing every tiny diff forever.
- Ep 304 article 11:10
Harness engineering: leveraging Codex in an agent First world
Justy and Cody dig into OpenAI’s writeup on building a product with Codex doing all the coding, and why the real shift is from typing code to designing an environment agents can reliably operate in. They cover the no-manual-code constraint, the repo-as-system-of-record approach, agent-readable docs, isolated worktrees, UI and observability access, and why this matters for teams trying to ship faster without drowning in review and QA.
- Ep 303 article 11:34
OpenClaw vs. Hermes Agent: The race to build AI assistants that never forget
Justy and Cody dig into persistent AI agents by comparing OpenClaw and Hermes Agent, focusing on why memory matters for real users, how each system stores and retrieves context, and where the engineering trade-offs show up in production.
- Ep 302 article 11:57
The Complete Guide to Inference Caching in LLMs
Justy and Cody dig into inference caching for LLMs and why it matters right now for anybody paying real model bills or waiting on sluggish responses. They unpack the three layers from the article — KV caching inside a single generation, prefix caching across requests with identical leading tokens, and semantic caching using embeddings plus vector search to skip model calls entirely. The episode stays grounded in production reality: prompt structure, exact-match requirements, provider behavior, GPU memory trade-offs, and when semantic caching is actually worth the extra moving parts.
- Ep 301 research 12:17
LongAct: Harnessing Intrinsic Activation Patterns for Long Context Reinforcement Learning
Justy and Cody dig into LongAct, a paper about making long-context RL work better by updating only the attention weights tied to unusually large query and key activations. They unpack why that matters for long docs, agents, and multi-step reasoning, how the saliency-guided sparse updates map activation outliers back to specific weight rows, and why the reported gains across LongBench v2, RULER, and multiple RL algorithms suggest this could be more than a lab curiosity.
- Ep 300 research 10:32
How to Fine Tune a Reasoning Model? A Teacher–Student Cooperation Framework to Synthesize Student Consistent SFT Data
Episode 300 of Exploring Next digs into TESSY, a teacher-student data synthesis method for fine-tuning reasoning models without wrecking the smaller model’s existing style. The hosts unpack why direct teacher-generated supervised fine-tuning can actually make reasoning models worse, how TESSY alternates teacher-generated capability tokens with student-generated style tokens, and why that matters for anyone trying to ship smaller, cheaper reasoning systems for coding and other structured tasks.
- Ep 299 tool 9:30
Anthropic just launched Claude Design, an AI tool that turns prompts into prototypes and challenges Figma
Anthropic’s Claude Design is a big deal because it aims to collapse the gap between idea, prototype, and stakeholder feedback. Justy and Cody dig into why that matters now, what Claude Design likely does under the hood, why pairing it with Opus 4.7 matters, and where it could genuinely pressure Figma versus where the old product realities still bite.
- Ep 298 api 1:26
Cloudflare Launches Code Mode MCP Server to Optimize Token Usage for AI Agents
Cloudflare's new Model Context Protocol (MCP) server powered by Code Mode reduces token usage for AI agents, making it possible to interact with complex APIs more efficiently.
- Ep 297 tool 1:32
Pi Monorepo
Exploring the Pi Monorepo and its tools for building AI agents and managing LLM deployments.
- Ep 296 tool 1:07
1) Pick a user bin dir and move/rename the binary
Exploring the SigMap tool and its impact on AI coding context
- Ep 295 article 1:21
Language models transmit behavioural traits through hidden signals in data | Nature
Exploring how language models transmit behavioural traits through hidden signals in data, and what this means for AI safety and development.
- Ep 294 article 1:04
AI's next bottleneck isn't the models — it's whether agents can think together
AI's next bottleneck is not the models, but whether agents can think together, requiring next-level infrastructure and shared cognition
- Ep 293 article 1:19
selimaktas/MiniMax M2.75 460B A20B · Hugging Face
Exploring the capabilities and potential applications of the MiniMax-M2.75-460B-A20B model, a text generation transformer that outperforms its base model on Single-turn SWE-Bench and has achieved impressive results in software engineering, professional work, and entertainment.
- Ep 292 tool 1:27
Build
Exploring Kumo, a lightweight AWS service emulator written in Go, and its applications in CI/CD testing and local development.
- Ep 291 article 1:19
Context Engine MCP | Augment Code
Exploring the Context Engine MCP and its potential to revolutionize coding agents
- Ep 290 article 1:21
Vending Machine Run by Claude More of a Disaster Than Previously Known
Episode 290 of Exploring Next dives into the story of Claude, an AI model tasked with running a vending machine, and the chaos that ensued.
- Ep 289 research 1:35
Vending Bench: A Benchmark for Long Term Coherence of Autonomous Agents
Exploring the Vending-Bench research paper and its implications for long-term coherence in autonomous agents
- Ep 288 article 1:08
Andon Labs
Exploring Andon Labs and their work on autonomous organizations without human intervention
- Ep 287 tool 1:02
How to Implement Tool Calling with Gemma 4 and Python | MachineLearningMastery
Episode 287 of Exploring Next dives into the world of tool calling with Gemma 4 and Python, exploring how to build a local, privacy-first tool-calling agent.
- Ep 286 research 1:09
Databricks tested a stronger model against its multi step agent on hybrid queries. The stronger model still lost by 21%.
Databricks' research shows multi-step agents outperform single-turn RAG systems on hybrid queries, achieving gains of 20% or more on Stanford's STaRK benchmark suite.
- Ep 285 article 1:06
Stop Treating AI Memory Like a Search Problem | Towards Data Science
Episode 285 of Exploring Next explores the limitations of treating AI memory like a search problem and delves into the concept of a lifecycle memory system that actively manages superseded information.
- Ep 284 article 1:55
MiniMax Releases MMX-CLI, a Command Line Interface That Gives AI Agents Native Access to Image, Video, Speech, Music, Vision, and Search
Exploring the MMX-CLI, a command-line interface that gives AI agents native access to image, video, speech, music, vision, and search capabilities.
- Ep 283 tool 1:00
Replit taps RevenueCat to help vibe Coders make money
Replit and RevenueCat team up to help developers monetize their apps, making it easier for vibe-coders to make money
- Ep 282 article 1:05
Deep Agents Deploy: an open alternative to Claude Managed Agents
Exploring Next Episode 282: Deep Agents Deploy, an open alternative to Claude Managed Agents
- Ep 281 tool 1:03
We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an execu...
Claude AI's advisor strategy and its implications for AI development
- Ep 280 research 1:14
Alright agent nerds, if you care about your tokens and usage limits, pay attention to the tools you give to your agen...
Episode 280 of Exploring Next dives into the importance of choosing the right browser tools for agents, exploring their impact on token usage and latency.
- Ep 279 article 0:58
2041927488918413589
Exploring Next dives into the world of emerging tech, focusing on a recent development that affects how we interact with online platforms, specifically when JavaScript is disabled in browsers.
- Ep 278 tool 1:42
True enterprise sovereignty is more approachable than ever, thanks to K8s Powered cloud neutral PostgreSQL
Episode 278 of Exploring Next discusses the concept of true enterprise sovereignty using K8s-powered cloud-neutral PostgreSQL, highlighting how it works and its key mechanisms.
- Ep 277 tool 1:35
New framework lets AI agents rewrite their own skills without retraining the underlying model
Episode 277 of Exploring Next covers Memento-Skills, a framework that enables AI agents to rewrite their own skills without retraining the underlying model, and its implications for autonomous agents and enterprise teams.
- Ep 276 article 0:57
AI joins the 8 hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT 5.4 on SWE Bench Pro
Discussion of GLM-5.1, a new open-source large language model that can work autonomously for up to eight hours on a single task, and its implications for the AI industry.
- Ep 275 research 0:57
ClawArena: Benchmarking AI Agents in Evolving Information Environments
Exploring ClawArena, a benchmark for evaluating AI agents in evolving information environments
- Ep 274 tool 1:31
Rightnow AI Releases Autokernel, an Open Source Framework That Applies an Autonomous Agent Loop to GPU Kernel Optimization for Arbitrary PyTorch Models
Exploring the release of Autokernel, an open-source framework for autonomous GPU kernel optimization in PyTorch models
- Ep 273 article 1:12
LLM Wiki
Exploring the LLM Wiki concept and its potential applications
- Ep 272 article 1:19
2040694135393280113
Episode 272 of Exploring Next dives into the issues surrounding JavaScript availability and browser compatibility on x.com, discussing the implications for users and developers.
- Ep 271 article 1:06
Andrej Karpathy Just 10x’d Everyone’s Claude Code
Episode 271 of Exploring Next dives into Andrej Karpathy's recent work on Claude, which has significantly improved its capabilities. The discussion revolves around the substance of the project, its architecture, and how it works, with a focus on the product angle and technical aspects.
- Ep 270 tool 1:14
Continual learning for AI agents
Continual learning for AI agents enables systems to improve over time by updating model weights, harnesses, and context. This episode explores the three distinct layers of agentic systems and how they can be applied in real-world scenarios.
- Ep 269 research 1:08
Open Source orchestration for zero Human companies
Episode 269 of Exploring Next dives into the world of open-source orchestration for zero-human companies, focusing on Paperclip, a Node.js server and React UI that coordinates AI agents to run a business.
- Ep 268 api 1:11
Why pgEdge thinks MCP (not an API) is the right way for AI agents to talk to databases
Episode 268 of Exploring Next discusses pgEdge's case for having AI agents talk to databases through MCP rather than a conventional API. Izzo and Boone dive into the substance of MCP, explaining its key mechanisms, design choices, and architecture. They connect it to real-world problems and current trends, exploring the product angle and tech behind it.
- Ep 267 article 1:40
Emotion Concepts and their Function in a Large Language Model
Exploring the role of emotion concepts in large language models, including their function, architecture, and implications for alignment-relevant behavior.
- Ep 266 article 0:40
2039356267949445230
- Ep 265 article 2:19
LangChain Academy New Course: Monitoring Production Agents
Episode 265 dives into LangChain Academy's new course on monitoring production agents. Izzo and Boone explore why agent observability has become critical as more companies deploy AI agents to production, examining the specific monitoring techniques, observability patterns, and debugging approaches covered in the course.
- Ep 264 research 2:26
Embarrassingly Simple Self Distillation Improves Code Generation
Apple researchers developed Simple Self-Distillation (SSD), a technique that improves code generation models by fine-tuning them on their own raw outputs—no verification needed. The method improved Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench by reshaping token distributions to balance precision and exploration in code generation.
- Ep 263 tool 2:01
Running local models on Macs gets faster with Ollama's MLX support
Ollama just added MLX support for Apple Silicon Macs, promising significantly faster local LLM performance through better unified memory usage. We break down what this actually means, why it matters as local models gain momentum, and the technical architecture that makes it work.
- Ep 262 article 2:43
Imagine if your Teams or Slack messages automatically turned into secure context for your AI agents — PromptQL built it
PromptQL turns Slack/Teams conversations into secure, persistent memory for AI agents. Instead of coordination theater, every discussion becomes actionable context that agents can use to actually execute work—fixing bugs, updating CRMs, pulling cross-platform data—while maintaining enterprise security controls.
- Ep 261 article 2:06
Reddit Please wait for verification
Episode 261 explores the challenge of making AI-generated text sound more human and natural. Izzo and Boone dive into the technical reasons why AI writing feels 'polished' and robotic, examining transformer architecture patterns, training biases, and the fundamental trade-offs between coherence and authenticity. They discuss practical techniques for prompt engineering, post-processing workflows, and architectural approaches to generate more natural-sounding text.
- Ep 260 article 2:14
Reddit The heart of the internet
Izzo and Boone dissect the leaked Claude Code prompts and explore how to build better AI agents by studying Anthropic's approach to prompt engineering, focusing on practical patterns like negative rules, risk tiers, and verification agents.
- Ep 259 article 2:25
Prismo Optimize AI Costs
Prismo is an AI cost optimization platform that acts as a drop-in proxy between your application and AI providers like OpenAI and Anthropic. By routing requests through Prismo's gateway, teams get real-time spend tracking, automated budget enforcement, and intelligent model routing that can reduce costs by up to 40%. The platform requires just a one-line code change to integrate and provides full visibility into AI spending across teams, services, and models.
- Ep 258 research 2:25
Temm1e/tems lab/perpetuum/RESEARCH PAPER.md at main · temm1e Labs/temm1e
Perpetuum is a framework that transforms LLM agents from request-response systems into perpetual, time-aware entities capable of scheduling, monitoring, and autonomous action. Built into the production TEMM1E runtime, it introduces temporal cognition, LLM-cognitive scheduling, and concern-based multitasking through an enabling framework principle that delegates intelligence to the LLM while providing infrastructure it can't handle itself.
- Ep 257 article 2:29
Designing delightful frontends with GPT 5.4 | OpenAI Developers
OpenAI's GPT-5.4 brings significant improvements to frontend development with enhanced image understanding, native tool integration, and computer use capabilities. The model can now generate production-ready interfaces with sophisticated visual design, incorporating mood boards, visual references, and automated testing through Playwright. Key improvements include better UI reasoning, complete app functionality, and self-verification workflows that enable more autonomous development cycles.
- Ep 256 article 2:28
Claude Code Python Porting Workspace
A deep dive into claude-code, a Python porting workspace that reimplements Claude's exposed codebase architecture. We explore the technical approach, ethical considerations around AI source reimplementation, and what this means for the future of reverse-engineering AI systems.
- Ep 255 article 2:25
Reddit The heart of the internet
A developer built Phantom, an open-source persistent AI agent that runs 24/7 on its own VM with vector memory, self-evolution capabilities, and MCP server integration. The agent autonomously installed ClickHouse, built analytics dashboards, created Discord integrations, and even monitors its own infrastructure — all without explicit instructions.
- Ep 254 article 2:16
Using OpenClaw as a Force Multiplier: What One Person Can Ship with Autonomous Agents | Towards Data Science
Nick Lawson shares his production system running 8 orchestrator agents and 35 personas on OpenClaw to manage content creation, infrastructure, and home automation. We dig into the architecture: heavyweight orchestrators making decisions on Opus, lightweight personas executing tasks on cheaper models, and the cost optimization strategies that make autonomous agents economically viable for solo builders.
- Ep 253 research 2:20
Natural Language Agent Harnesses
Exploring Natural-Language Agent Harnesses (NLAHs) — a new approach to making AI agent control logic portable and editable in plain English, plus the runtime system that executes these natural language harnesses across different environments.
- Ep 252 article 2:32
Vector Databases Explained in 3 Levels of Difficulty | MachineLearningMastery
Izzo and Boone decode vector databases from basic similarity search to production-scale indexing algorithms like HNSW and IVF, explaining how they solve the core problem of searching unstructured data at scale.
- Ep 251 research 2:23
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long Horizon Iterative Tasks
Gabriel Orlanski and team at UW-Madison just dropped SlopCodeBench — the first benchmark that measures what happens when coding agents have to keep extending their own messy code. Turns out every single model fails spectacularly at long-term software development, with code quality degrading so badly that extensions become impossible. This isn't about whether agents can solve coding problems — it's about whether they can build software that doesn't collapse under its own weight.
- Ep 250 article 2:12
Meet GitAgent, the Docker for AI Agents That Is Finally Solving the Fragmentation Between LangChain, AutoGen, and Claude Code
GitAgent is a containerization platform for AI agents that standardizes deployment across LangChain, AutoGen, and Claude frameworks. It provides Docker-like packaging, unified APIs, and environment isolation to solve the current fragmentation in agent development.
- Ep 249 article 2:49
The three disciplines separating AI agent demos from real-world deployment
Episode 249 explores why AI agents consistently fail in real-world enterprise deployments despite impressive demos, examining Creatio's three-discipline methodology for production-ready autonomous agents that can handle 80-90% of tasks independently through data virtualization, agent dashboards with KPIs, and tightly bounded use-case loops.
- Ep 248 research 2:35
Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent Based Persona Routing with PRISM
Episode 248 dives into a USC research paper that solves the persona prompting puzzle: why expert personas sometimes help LLMs and sometimes hurt them. The team discovered that personas boost alignment tasks like safety and style but damage knowledge retrieval accuracy. They built PRISM, a self-bootstrapping system that routes queries to personas only when they actually help, using no external data.
- Ep 247 research 2:28
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
Episode 247 dives into groundbreaking research on how LLMs internally respond to increasingly difficult tasks. The team discovered that as inputs become more out-of-distribution, models make their representations dramatically sparser — essentially concentrating computation into specialized subspaces. This isn't random; it's an adaptive mechanism for handling unfamiliar territory. The researchers built this insight into Sparsity-Guided Curriculum In-Context Learning, showing real performance gains by using sparsity patterns to intelligently schedule few-shot examples.
- Ep 246 article 2:32
Preparing IT for AI Agents: How MCP Shapes the Future of AI
Izzo and Boone explore MCP (Model Context Protocol) and how it's positioning IT infrastructure for AI agents, diving into the protocol's architecture, orchestration patterns, and what it means for organizations preparing their systems for autonomous AI workflows.
- Ep 245 tool 1:55
7 Steps to Mastering Memory in Agentic AI Systems | MachineLearningMastery
Izzo and Boone dive deep into the seven-step framework for implementing memory in agentic AI systems, exploring why memory is a systems design problem rather than just throwing more context at models. They break down the four types of agent memory, explain the crucial differences between RAG and memory, and get into the architectural decisions around storage, retrieval, and forgetting that make production agents actually useful over time.
- Ep 244 api 4:50
Ai2 releases MolmoWeb, an open weight visual web agent with 30K human task trajectories and a full training stack
Ai2 releases MolmoWeb, the first open-weight visual web agent that ships with its full training data and pipeline. Unlike closed APIs or empty frameworks, MolmoWeb includes 30K human task trajectories, works purely from screenshots, and gives developers full visibility into how it was built.
- Ep 243 research 5:29
Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck
Chain-of-Thought prompting makes LLMs more accurate but expensive. This research reframes efficient reasoning as a compression problem, introducing a conditional information bottleneck approach that preserves essential reasoning while cutting cognitive bloat. Instead of naive length penalties, they use semantic priors based on token surprisal to compress reasoning traces intelligently.
- Ep 242 research 1:26
How xMemory cuts token costs and context bloat in AI agents
By Ben Dickson, March 25, 2026. Standard RAG pipelines break when enterprises try to use them for long-term, multi-session LLM agent deployments. This is a critical limitation as demand for persistent AI assistants grows.
- Ep 241 tool 5:08
AI Coding Assistants Haven’t Sped up Delivery Because Coding Was Never the Bottleneck
Agoda's analysis of AI coding assistants reveals they boost individual developer output but don't speed up project delivery because coding was never the real bottleneck. The constraint has shifted upstream to specification and verification, fundamentally changing how engineering teams should be structured and what work humans focus on.
- Ep 240 article 5:24
Cloudflare’s new Dynamic Workers ditch containers to run AI agent code 100x faster
Cloudflare launches Dynamic Workers, ditching containers for millisecond-starting isolates that run AI agent code 100x faster. The tech enables 'Code Mode' — where LLMs write TypeScript functions instead of chaining tool calls, cutting token usage by 81%. Built on V8 isolates, it's positioning sandboxing as a strategic layer in the AI stack.
- Ep 239 research 5:29
Andrej Karpathy's new open source 'autoresearch' lets you run hundreds of AI experiments a night — with revolutionary implications
Andrej Karpathy released autoresearch, a 630-line open source script that runs autonomous AI experiments overnight. The system creates an optimization loop where agents modify their own code, test hypotheses, and keep improvements—completing hundreds of experiments while humans sleep. Early adopters distributed the approach across networks and applied it beyond ML to marketing, suggesting a fundamental shift toward automated scientific discovery.
- Ep 238 research 5:37
Autoresearch
Karpathy's autoresearch lets AI agents autonomously experiment on machine learning models overnight — modifying code, training for 5 minutes, evaluating results, and iterating while you sleep. We dive into how it works, the clever design constraints, and why this might be the beginning of fully autonomous AI research.
- Ep 237 research 6:38
Hyperagents
Episode 237 explores Hyperagents, a breakthrough in self-improving AI that goes beyond just getting better at tasks to actually improving how it improves. Izzo examines the product potential while Boone breaks down the technical architecture that enables genuine metacognitive self-modification.
- Ep 236 article 5:48
Xiaomi stuns with new MiMo V2 Pro LLM nearing GPT 5.2, Opus 4.6 performance at a fraction of the cost
Xiaomi's MiMo-V2-Pro LLM achieves near GPT-5.2 performance at 1/7th the cost through a sparse architecture with only 42B active parameters out of 1T total, targeting autonomous agents over conversational AI.
- Ep 235 api 5:06
Developer’s Guide to AI Agent Protocols | Google Developers Blog
Izzo and Boone explore Google's new Agent Development Kit and the emerging protocols solving AI agent integration hell - MCP for data connections, A2A for agent-to-agent communication, and UCP for commerce workflows. They build a restaurant supply chain agent live, showing how these protocols eliminate custom integration code.
- Ep 234 research 5:42
AgentProcessBench: Diagnosing Step Level Process Quality in Tool Using Agents
Episode 234 explores AgentProcessBench, a new benchmark for evaluating AI agents' step-by-step decision-making in realistic tool-use scenarios. Unlike math problems where you can backtrack from wrong answers, agent mistakes in the real world often have irreversible consequences - making it critical to catch errors before they cascade. The hosts dig into the technical innovation of ternary labeling (correct/neutral/error) and error propagation rules, while discussing who would actually build products using these insights and what the path to production looks like.
- Ep 233 tool 5:12
GitHub pcvelz/superpowers: An agentic skills framework & software development methodology that works CC task management support
Izzo and Boone explore Superpowers Extended, a fork of the open-source Superpowers framework specifically designed for Claude Code users. They dig into how it transforms AI-assisted development from chaotic back-and-forth into structured workflows with native task management, dependency tracking, and enforced methodologies like test-driven development.
- Ep 232 tool 6:00
Why AI workloads are breaking traditional Kubernetes observability strategies
Why AI workloads are breaking traditional Kubernetes observability strategies and what platform teams are building to fix it
- Ep 231 tool 6:11
Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned
Deep dive into practical AI agent evaluation frameworks, moving beyond traditional NLP metrics to assess real-world behavior, reliability, and production readiness. Covers hybrid evaluation approaches, operational constraints, and specific tools like MLflow, TruLens, and LangChain Evals.
- Ep 230 article 5:46
z.ai debuts faster, cheaper GLM 5 Turbo model for agents and 'claws', but it's not open source
Z.ai launches GLM-5-Turbo, a proprietary variant of their open-source GLM-5 model optimized for agent workflows and tool use. At $4.16 per million tokens total cost, it undercuts competitors while delivering better tool reliability and execution stability for multi-step automation tasks.
- Ep 229 article 6:30
Langsmart Publishes Industry’s First p95 Semantic Cache Benchmarks for On Premises AI Gateway, Challenges Market: “Show Me the p95”
Langsmart's Smartflow platform achieved 10.2x faster AI response times in Fortune 200 testing, delivering sub-300ms p95 latency on modest on-premises hardware while challenging the industry to publish real performance benchmarks.
- Ep 228 article 5:59
Reddit The heart of the internet
Lundrog built an open-source framework called agent-guardrails-template to control AI coding agents and prevent them from breaking codebases. The system uses four safety laws, active enforcement via a Go MCP server, and risk-based decision matrices to reduce AI-caused incidents by 78%.
- Ep 227 tool 5:29
The “files are all you need” debate misses what's actually happening in agent memory architecture
Exploring Next episode 227 dives deep into AI agent memory architecture, explaining why the 'files are all you need' approach is missing the bigger picture. Izzo and Boone break down the key mechanisms behind persistent memory systems, compare different architectural approaches, and discuss why this matters for anyone building production AI agents.
- Ep 226 research 6:32
NanoClaw and Docker partner to make sandboxes the safest way for enterprises to deploy AI agents
NanoClaw teams up with Docker to solve enterprise AI agent security through proper sandboxing. We break down why agents break traditional containers, how Docker Sandboxes work differently, and what this means for multi-agent deployment at scale.
- Ep 225 research 1:24
The team behind continuous batching says your idle GPUs should be running inference, not sitting dark
By Sean Michael Kerner, March 12, 2026. Every GPU cluster has dead time. Training jobs finish, workloads shift, and hardware sits dark while power and cooling costs keep running.
- Ep 224 article 5:42
Agents need vector search more than RAG ever did
Why agents are driving a massive spike in vector search complexity, making purpose-built retrieval infrastructure more critical than ever. We dig into Qdrant's latest release, real production stories from companies handling millions of documents, and the three signals it's time to upgrade your vector setup.
- Ep 223 api 5:38
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM Powered Assistants
Exploring Next digs into MiniAppBench, a new benchmark that evaluates how well LLMs can generate interactive HTML applications instead of just text responses. The paper introduces 500 real-world tasks and an automated evaluation framework that tests apps like a human would. We break down the technical approach, discuss what this means for AI assistant interfaces, and identify specific tools listeners can experiment with.
- Ep 222 tool 6:14
Galileo releases Agent Control, a centralized guardrails platform for enterprise AI agents
Galileo launches Agent Control, an open-source centralized guardrails platform for enterprise AI agents, addressing the critical need for safety and control as AI agents become more autonomous in production environments.
- Ep 221 research 5:43
LLM2Vec Gen: Generative Embeddings from Large Language Models
Episode 221 explores LLM2Vec-Gen, a breakthrough approach that creates embeddings by learning to represent what a language model would generate, rather than encoding the input. Instead of traditional contrastive learning, this method adds special tokens that capture the model's potential response, achieving state-of-the-art results while maintaining safety alignment and reasoning capabilities.
- Ep 220 article 6:10
Netflix Uncovers Kernel Level Bottlenecks While Scaling Containers on Modern CPUs
Netflix discovered that scaling hundreds of containers simultaneously hits deep kernel-level bottlenecks in the Linux virtual filesystem, where thousands of mount operations create lock contention that varies dramatically across different CPU architectures. Their solution involved redesigning overlay filesystems to reduce mount operations from O(n) to O(1) per container.
- Ep 219 research 5:01
In Context Reinforcement Learning for Tool Use in Large Language Models
Episode 219 explores In-Context Reinforcement Learning (ICRL), a breakthrough approach that teaches language models to use external tools without expensive supervised fine-tuning. Instead of requiring thousands of labeled examples upfront, ICRL uses few-shot prompting during reinforcement learning training, gradually reducing examples until the model masters tool use independently.
- Ep 218 article 5:08
Reddit The heart of the internet
Episode 218 dives into CodeSpeak, a new spec-driven programming language from Kotlin's creator Andrey Breslav. We explore how it flips traditional development by starting with specifications and generating code, examining its type system, tooling architecture, and potential to reshape how teams build software.
- Ep 217 article 5:57
Google finds that AI agents learn to cooperate when trained against unpredictable opponents
Google's Paradigms of Intelligence team discovered that AI agents naturally develop cooperative behaviors when trained against diverse, unpredictable opponents rather than being programmed with hardcoded coordination rules. This breakthrough offers a scalable alternative to traditional multi-agent frameworks by using standard reinforcement learning techniques to produce adaptive social behaviors through in-context learning.
- Ep 216 article 5:38
Enterprise agentic AI requires a process layer most companies haven’t built
Enterprise agentic AI adoption faces a critical infrastructure gap: 85% of companies want AI agents within three years, but 76% lack the process optimization foundation to support them. The real blocker isn't technology—it's siloed teams, disconnected systems, and AI agents operating without business context.
- Ep 215 article 4:46
Use agent identity with Secret Manager
Exploring Next dives deep into a cutting-edge tech development that's reshaping how we think about distributed systems and real-time processing. Izzo and Boone break down the architecture, examine the trade-offs, and connect it to current market needs.
- Ep 214 article 4:46
Understanding Context and Contextual Retrieval in RAG | Towards Data Science
Episode 214 dives deep into contextual retrieval in RAG systems, exploring how traditional RAG loses crucial context when documents are chunked and how Anthropic's contextual retrieval approach dramatically improves accuracy by generating helper text that situates each chunk within its original document. Izzo and Boone examine the core technical mechanisms, implementation details, and real-world impact of this technique.
- Ep 213 tool 3:42
Is RAG Still Needed? Choosing the Best Approach for LLMs
Izzo and Boone dive deep into the current state of RAG versus fine-tuning for LLMs, examining when retrieval-augmented generation still makes sense and when newer approaches might be better. They break down the technical trade-offs, cost implications, and real-world performance considerations that developers face when choosing between RAG, fine-tuning, and hybrid approaches.
- Ep 212 research 4:30
New KV cache compaction technique cuts LLM memory 50x without accuracy loss
MIT researchers developed Attention Matching, a KV cache compaction technique that achieves 50x memory reduction in LLMs without accuracy loss, solving a critical bottleneck for enterprise applications handling long contexts.
- Ep 211 article 4:38
Building frontend UIs with Codex and Figma
OpenAI's new Figma MCP server creates a bidirectional bridge between Figma designs and Codex code generation, allowing developers to extract design context from Figma files for code generation and push live UI back to Figma canvas for iteration. The integration supports full roundtrip workflows from design to code and back.
- Ep 210 api 5:38
Copilot Content Exclusion REST API in public preview | GitHub Changelog
GitHub's new Content Exclusion REST API lets organizations programmatically manage which files Copilot can and can't use as context, a game-changer for enterprises juggling AI productivity with IP protection.
- Ep 209 article 5:01
Visual imitation learning: Guidde trains AI agents on human 'expert video' instead of documentation
Guidde raised $50M to solve enterprise AI's 'last mile' problem by training agents on video recordings of human experts, not documentation. Instead of PDFs, they capture rich telemetry—every click, scroll, and DOM change—creating 'digital world models' that let AI navigate complex enterprise software with human-like spatial awareness.
- Ep 208 research 1:42
H Neurons: On the Existence, Impact, and Origin of Hallucination Associated Neurons in LLMs
By Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu, and Maosong Sun (Tsinghua University). Large language models (LLMs) frequently generate hallucinations (plausible but factually incorrect outputs), undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored.
- Ep 207 api 6:00
Exposing biases, moods, personalities, and abstract concepts hidden in large language models
MIT researchers developed a method to identify and manipulate hidden concepts like biases, personalities, and moods in large language models using recursive feature machines (RFMs). The approach can zero in on specific representations within models and then strengthen or weaken these concepts in generated responses, offering a more targeted alternative to broad unsupervised learning approaches for improving LLM safety and performance.
- Ep 206 api 1:36
Towards a Science of AI Agent Reliability
By Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan (arXiv:2602.16666). AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice.
- Ep 205 tool 4:51
How to Use Memory in Agent Builder
LangChain's Agent Builder uses filesystem-based memory to get smarter over time, storing both short-term task context and long-term instructions as Markdown files. The system includes specialized 'skills' that load contextually and supports direct memory editing for fine-tuned control.
- Ep 204 api 5:39
Multi-agent cooperation through in-context co-player inference
Exploring how sequence models can learn cooperation in multi-agent settings without hardcoded assumptions about other players, using in-context learning to naturally develop mutual cooperation strategies.
- Ep 203 api 4:58
Managed MCP servers for Google Cloud databases | Google Cloud Blog
Google Cloud launches managed MCP servers for their database portfolio, letting AI agents directly interact with PostgreSQL, Spanner, Cloud SQL, Firestore, and Bigtable through the Model Context Protocol standard. No infrastructure to deploy — just configure endpoints and agents get secure, governed access to operational data.
- Ep 202 tool 1:46
New agent framework matches human engineered AI systems — and adds zero inference cost to deploy
By Ben Dickson, February 18, 2026. Agents built on top of today's models often break with simple changes (a new library, a workflow modification) and require a human engineer to fix them. That's one of the most persistent challenges in deploying AI for the enterprise: creating agents that can adapt to dynamic environments without constant hand-holding.
- Ep 201 tool 5:11
Improving Deep Agents with harness engineering
LangChain improved their coding agent from Top 30 to Top 5 on Terminal Bench 2.0 by changing only the harness, the system that wraps around the model. They used trace analysis to identify failure patterns and implemented targeted fixes like self-verification loops, context injection, and reasoning budget optimization. The 13.7-point improvement shows how much of the performance gain can come from better tooling around models, not just bigger models.
- Ep 200 article 4:58
2023872409091403810
Episode 200 explores a breakthrough in browser-based AI inference that lets developers run large language models directly in the client without server calls. Izzo and Boone break down the WebAssembly architecture, discuss the product implications for privacy-first applications, and examine how this could reshape the economics of AI-powered features.
- Ep 199 article 1:20
2023957499183829467
- Ep 198 article 4:42
2023738764841894352
Episode 198 explores a critical JavaScript accessibility issue affecting X.com and similar platforms, diving into how disabled JavaScript breaks modern web apps and what developers can build to solve it.
- Ep 197 article 1:20
2023822767284490263
- Ep 196 article 5:29
2023900667275067883
Episode 196 explores a critical web development issue that's hitting teams everywhere: JavaScript dependency failures and browser compatibility problems that are breaking production apps. Izzo and Boone dive deep into the technical mechanics of how modern web applications handle JavaScript loading, fallback strategies, and the architectural decisions that determine whether your app gracefully degrades or completely fails when things go wrong.
- Ep 195 article 5:09
2023906632871407643
Episode 195 explores a breakthrough in browser-based AI inference that lets you run large language models directly in your web browser without server calls, examining the technical architecture behind WebAssembly optimization and the product implications for privacy-first AI applications.
- Ep 194 article 4:21
Top 7 Small Language Models You Can Run on a Laptop | MachineLearningMastery
Izzo and Boone explore seven small language models that run locally on laptops, diving deep into the technical trade-offs, hardware requirements, and real-world use cases. They break down everything from Phi-3.5 Mini's long-context capabilities to Llama 3.2's versatility, examining why local inference matters and how to choose the right model for your specific needs.
- Ep 193 article 6:19
SurrealDB 3.0 wants to replace your five database RAG stack with one
SurrealDB 3.0 combines vector search, graph traversal, and relational queries into a single transactional database engine, aiming to replace the complex multi-database stacks commonly used in RAG systems. The Rust-native architecture stores agent memory as graph relationships directly in the database with full ACID guarantees across distributed nodes.
- Ep 192 article 5:19
openclaw with ollama (Zero cost AI Assistant)
Izzo and Boone explore OpenClaw, an open-source AI assistant framework that runs entirely locally with Ollama. They dig into how it creates zero-cost AI workflows, the agent architecture with workspace management and subagent spawning, and why running your own AI stack locally matters for both privacy and cost control.
- Ep 191 tool 1:50
OpenAI Publishes Codex App Server Architecture for Unifying AI Agent Surfaces
InfoQ news, Feb 17, 2026, by Eran Stiller (3 min read).
- Ep 190 research 5:19
Anthropic Found Out Why AIs Go Insane
Anthropic's breakthrough research reveals why AI models exhibit bizarre failure modes and how their new interpretability technique maps the actual concepts models learn internally. We explore mechanistic interpretability, sparse autoencoders, and what this means for building more reliable AI systems.
- Ep 189 article 6:05
NanoClaw solves one of OpenClaw's biggest security issues — and it's already powering the creator's biz
NanoClaw is a secure, lightweight alternative to OpenClaw that addresses critical security issues through OS-level container isolation. Created by Gavriel Cohen, it reduces OpenClaw's 400,000-line codebase to just 500 lines of TypeScript while providing sandboxed execution environments. The project emphasizes a 'Skills over Features' approach where AI customizes the codebase rather than shipping with pre-built integrations.
- Ep 188 research 1:29
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
By Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, and Kai Chen (Fudan University; Shanghai AI Laboratory). GitHub: https://github.com/yichengchen24/DataChef. In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe, which comprises a data processing pipeline to transform raw sources into training corpora.
- Ep 187 tool 1:20
GitHub BankrBot/openclaw skills: Moltbot skill library for AI agents. Including polymarket, crypto trading, DeFi operations, automation, and more. Open a PR to add skills.
OpenClaw Skills Library: pre-built capabilities for AI agents to interact with crypto infrastructure. Skills enable autonomous DeFi operations, token launches, onchain messaging, and protocol integrations through natural language interfaces.
- Ep 186 tool 6:22
Forge: Scalable Agent RL Framework and Algorithm
Izzo and Boone dive deep into MiniMax's Forge framework — a production-scale RL system that trained their M2.5 model across hundreds of thousands of real-world agent scaffolds. They explore how Forge solves the fundamental trilemma of system throughput, training stability, and agent flexibility through architectural innovations like middleware abstraction, windowed FIFO scheduling, and prefix tree merging for massive computational efficiency.
- Ep 185 article 7:02
z.ai's open source GLM 5 achieves record low hallucination rate and leverages new RL 'slime' technique
z.ai's GLM-5 achieves record-low hallucination rates using a novel 'slime' reinforcement learning technique, scaling to 744B parameters while undercutting competitors by 6x on pricing. The model features native document generation and Agent Mode capabilities for enterprise workflows.
- Ep 184 api 6:36
Google Chrome ships WebMCP in early preview, turning every website into a structured tool for AI agents
Google Chrome launches WebMCP in early preview - a new browser API that lets websites expose structured tools directly to AI agents, eliminating the need for expensive screenshot-based scraping and fragile DOM parsing.
- Ep 183 article 5:21
MiniMax's new open M2.5 and M2.5 Lightning near state of the art while costing 1/20th of Claude Opus 4
MiniMax drops their M2.5 model that matches Claude Opus 4.6 performance at 1/20th the cost, using sparse MoE architecture and a novel RL training framework called Forge to create AI agents that can handle enterprise tasks autonomously.
- Ep 182 article 6:29
recipes/GLM/GLM5.md at main · vllm-project/recipes
Episode 182 explores GLM5, a new language model architecture that's pushing boundaries in multimodal understanding and reasoning. Izzo and Boone dive deep into how it handles mixed text-image inputs, its novel attention mechanisms, and why vLLM is building dedicated recipes for deployment at scale.
- Ep 181 research 4:55
MIT's new fine tuning method lets LLMs learn new skills without losing old ones
MIT researchers developed self-distillation fine-tuning (SDFT), a technique that lets large language models learn new skills without forgetting old ones. By using a model's own in-context learning abilities as both teacher and student, SDFT solves the catastrophic forgetting problem that forces companies to maintain separate models for each task.
- Ep 180 api 6:32
OpenAI upgrades its Responses API to support agent skills and a complete terminal shell
OpenAI's major Responses API upgrade introduces Server-side Compaction for persistent agent memory, hosted shell containers with full terminal environments, and support for the universal Skills standard - transforming AI agents from forgetful assistants into reliable, long-running digital workers.
- Ep 179 research 5:58
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
Deep dive into fixing deceptive alignment in reward models - why getting the right answer isn't enough if the reasoning is wrong, and how a hybrid training approach combining outcome accuracy with rationale consistency achieves state-of-the-art performance while solving a critical RLHF generalization problem.
- Ep 178 api 1:50
Kong launches Context Mesh to turn enterprise APIs into agent-ready tools | Help Net Security
February 11, 2026: Kong has announced Kong Context Mesh, a product that automatically discovers enterprise APIs, transforms them into agent-consumable tools, and deploys them with runtime governance. “Organisations have spent years building APIs as the nervous system of the enterprise.”
- Ep 177 article 5:14
Transformers.js v4 Preview: Now Available on NPM!
Transformers.js v4 brings massive performance improvements with a new C++ WebGPU runtime, modular architecture, and standalone tokenizer library. Now runs state-of-the-art AI models directly in browsers, Node, and Deno with hardware acceleration.
- Ep 176 tool 5:15
Alibaba Open-Sources Zvec: An Embedded Vector Database Bringing SQLite-Like Simplicity and High-Performance On-Device RAG to Edge Applications
Alibaba open-sources Zvec, an embedded vector database that brings SQLite-like simplicity to on-device RAG applications, enabling high-performance semantic search without cloud dependencies.
- Ep 175 article 5:46
'Observational memory' cuts AI agent costs 10x and outscores RAG on long-context benchmarks
Observational memory is a new approach to AI agent memory that uses two background agents to compress conversation history into dated observation logs, achieving 10x cost savings through stable context windows that enable prompt caching while outperforming traditional RAG systems on long-context benchmarks.
- Ep 174 api 5:03
Next Moca Releases Agent Definition Language as an Open Source Specification
Next Moca has open-sourced Agent Definition Language (ADL), a specification that standardizes how AI agents are defined across platforms. Think OpenAPI for agents - it provides a declarative format for defining agent identity, tools, permissions, and governance metadata to solve the growing fragmentation problem in production AI systems.
- Ep 173 article 6:57
GitHub Win4r/team tasks: Multi-agent pipeline coordination: Linear, DAG, and Debate modes for AI agent orchestration
A Python CLI tool that coordinates multi-agent development workflows through three distinct modes: linear pipelines for sequential work, DAG-based dependency graphs for parallel execution, and debate mode for multi-agent deliberation. Built specifically for OpenClaw integration with no external dependencies.
- Ep 172 tool 6:20
How PMs use the Codex app
Product managers are using a new app called Codex to bridge the gap between product vision and engineering execution. We explore how it works, why it's gaining traction among PMs, and what makes it different from traditional project management tools.
- Ep 171 research 1:40
A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces
A paper by Mingxuan Du, Benfeng Xu, Chiwei Zhu, Shaohan Wang, Pengyu Wang, Xiaorui Wang, and Zhendong Mao (University of Science and Technology of China, Hefei; Metastone Technology, Beijing). From the abstract: Frontier language models have demonstrated strong reasoning and long-horizon tool-use capabilities. However, existing RAG systems fail to leverage these capabilities.
- Ep 170 article 5:12
Introducing: React Best Practices Vercel
Vercel releases react-best-practices, a structured framework that captures 10+ years of React optimization knowledge. It focuses on ordering performance work by impact—starting with eliminating waterfalls and reducing bundle size before micro-optimizations. The repository includes 40+ rules across 8 categories and compiles into a single document that AI coding agents can use for code reviews and refactoring suggestions.
- Ep 169 research 3:00
Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning
Today, we dive into a game-changing approach to visual reasoning in video generation. How does this solve real-world problems?
- Ep 168 research 2:01
Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing
Exploring a new paradigm for AI evolution: Group-Evolving Agents. Are they the future or just another research paper?
- Ep 167 article 1:34
Docker versus Nix: The quest for true reproducibility
In this episode, we dive into the differences between Docker and Nix, exploring how they each approach reproducibility in software environments. As tech continues to evolve, ensuring consistency across development, testing, and production is paramount. We’ll examine how these tools can impact developers, organizations, and ultimately, the end users.
- Ep 166 article 1:26
Context Engineering: An Introduction to the Information Environment for LLMs
A deep dive into context engineering reveals how structuring information for large language models enhances their performance and relevance. It’s more than just managing prompts—it's about creating a dynamic environment that allows AI to engage intelligently. This discussion explores why these strategies matter, who stands to benefit, and practical examples of their application.
- Ep 165 article 2:00
Reddit The heart of the internet
In today's episode, we dive deep into an exciting achievement in the world of game development using AI. One developer crafted a pixel-art open-world shooter in just 24 hours using Gemini 3.0 Pro for both coding and art. We explore what this means for developers, the implications of using AI in creative workflows, and the future of game design. Join us as we unpack the significance of this innovative approach and its potential impact on the gaming industry.
- Ep 164 article 1:36
Agent Device
In this episode, we explore the innovative CLI tool 'agent-device' that allows developers to automate interactions with iOS and Android devices. We'll dive into how it enhances mobile testing and development workflows, the real-world implications of its features, and practical use cases that demonstrate its utility.
- Ep 163 api 1:33
10 strategies to reduce MCP token bloat
In today's tech landscape, managing token bloat is critical for efficient application performance. This dialogue dives into strategies for reducing MCP token bloat, emphasizing its importance for developers and organizations alike. The hosts explore practical solutions and real-world implications, showcasing how these strategies can lead to smoother operations and enhanced user experiences.
- Ep 162 research 1:35
Reinforcement World Model Learning for LLM-based Agents
The research introduces Reinforcement World Model Learning (RWML), a self-supervised method that enhances the capacity of large language models (LLMs) to navigate dynamic environments by learning action-conditioned world models. This addresses the limitations of LLMs in anticipating consequences and adapting to environmental changes, offering significant improvements in performance without relying on expert data.
- Ep 161 article 1:51
LTM the Next LLM This New Type of AI Can Do What Large Language Models Can't Fundamental
This episode explores the emergence of LTM, a new type of AI that promises capabilities beyond traditional LLMs, addressing their limitations and offering innovative solutions in real-world applications.
- Ep 160 article 1:32
Qwen3-Coder-Next: How to Run Locally | Unsloth Documentation
In this episode, we explore Qwen3-Coder-Next, a groundbreaking coding model that enables local execution with high efficiency. We discuss its capabilities, real-world applications, and why it’s a game-changer for developers and tech enthusiasts.
- Ep 159 research 1:36
Self-Hinting Language Models Enhance Reinforcement Learning
The paper explores how self-hinting language models can enhance reinforcement learning, particularly in overcoming the challenges faced when rewards are sparse. By introducing hints generated by the model itself during training, it reshapes the distribution of outcomes, allowing for better learning signals and improved performance on difficult prompts. This approach not only addresses existing limitations but also offers a novel way to adaptively guide the training process.
- Ep 158 article 1:30
How to Build Your Own Custom LLM Memory Layer from Scratch | Towards Data Science
In this episode, we explore innovative ways to enhance large language models (LLMs) with custom memory layers that improve user interactions. By enabling LLMs to remember past user interactions, we can drive personalization and efficiency in AI applications. Join us as we unpack how to build these memory systems from scratch and what this means for the future of conversational agents.
- Ep 157 article 1:23
Context Engineering: Prompt Management, Defense, and Control
The dialogue explores the nuances of context engineering in LLMOps, focusing on prompt management and versioning. It discusses why this is crucial for reliability in AI applications and how structured techniques can improve outputs while preventing errors. The conversation also highlights the real-world implications of these advancements for developers, businesses, and end-users, alongside practical takeaways for implementation.
- Ep 156 research 1:38
Latent Chain of Thought as Planning: Decoupling Reasoning from Verbalization
This episode explores the innovative PLaT framework for reasoning in large language models, which introduces a two-part system separating reasoning from verbalization. It addresses the challenges of computational efficiency and interpretability, paving the way for more effective AI solutions across various domains. By discussing practical implications and potential use cases, we highlight how this research can transform the landscape of AI applications and improve user experiences.
- Ep 155 tool 1:25
OpenAI launches new macOS app for agentic coding | TechCrunch
OpenAI's new macOS app for agentic coding is reshaping the landscape of software development by enabling AI agents to autonomously handle complex coding tasks, significantly speeding up the development process. This episode explores how this technology works, its implications for developers, and real-world applications.
- Ep 154 article 1:39
Agent Trace
Agent Trace is an innovative specification aimed at tracking AI-generated code contributions in version-controlled environments. It establishes a framework for clear attribution between human and AI authors, which is increasingly important as AI tools become central in software development. By implementing this standard, teams can ensure transparency, facilitate collaboration, and maintain accountability within their codebases, ultimately leading to better development practices.
- Ep 153 research 1:35
Linear representations in language models can change dramatically over a conversation
This episode dives into the significant findings of recent research on how language models adjust their internal representations during conversations. We explore the implications of these changes for developers and practitioners in AI, discuss potential applications, and highlight the challenges they present for interpretability and reliability in AI outputs.
- Ep 152 article 1:39
Introducing Moltworker: a self-hosted personal AI agent, minus the minis
In this episode, we explore Moltworker, a self-hosted personal AI agent that operates seamlessly on Cloudflare's infrastructure. We discuss its implications for privacy, the power of self-hosting, and how it simplifies AI integration for everyday users.
- Ep 151 tool 1:28
Terminal 1
In today's discussion, we dive deep into Open Claude Cowork, a revolutionary tool that integrates AI with workplace communication, enabling seamless automation across multiple apps. This technology could redefine productivity, making it accessible to developers and businesses alike.
- Ep 150 article 1:43
moonshotai/Kimi-K2.5 · Congratulations on this release and on one important realization!
The release of Moonshot AI's Kimi-K2.5 model marks a significant advancement in multimodal AI capabilities, enabling seamless integration of text and image processing. This technology not only enhances conversational AI but also opens new avenues for local deployment, making powerful tools accessible to a broader audience.
- Ep 149 article 1:55
Reddit The heart of the internet
Reddit has become a vital platform for discussions around emerging technologies, especially AI and autonomous systems. The recent AMA with the Qoder team reveals how developers are leveraging AI to enhance coding productivity. This episode dives into the implications of autonomous coding, the benefits it offers, and how it can transform software development practices.
- Ep 148 article 1:36
Moltbot, the AI agent that ‘actually does things,’ is tech’s new obsession
The rise of Moltbot, an AI agent that performs tasks on behalf of users, raises important discussions around efficiency and security in our digital lives. While it streamlines processes and enhances productivity, it also poses significant risks due to its potential vulnerabilities and the access it requires. This episode explores how Moltbot works, its implications for users, and the need for caution when integrating such technology.
- Ep 147 article 1:22
'Ralph Wiggum' loop prompts Claude to vibe clone software • The Register
This episode dives into the revolutionary coding technique called 'Ralph,' which leverages agentic AI to clone software inexpensively. The implications for the software industry are profound, as it threatens traditional development roles and practices. Join us as we discuss why this matters, who benefits, and what it means for the future of tech.
- Ep 146 tool 1:30
Anthropic extends MCP with a UI framework
Anthropic's latest extension of MCP (Model Context Protocol) introduces a UI framework, allowing developers to create customized applications that leverage AI capabilities. This development could democratize access to advanced AI tools and improve application design.
- Ep 145 article 1:15
RAG isn’t dead, but context engineering is the new hotness
The emergence of context engineering signifies a pivotal shift in how we handle retrieval-augmented generation (RAG) technologies, impacting everything from AI applications to data management across various industries. This episode explores the practical implications of context engineering, who stands to benefit, and how it compares to existing solutions.
- Ep 144 research 1:28
LLM Generated Newspaper Provides Ultimate In Niche Publications
This episode dives into the innovative use of LLMs to create niche newspaper publications, exploring how AI can tailor content to specific audiences while considering the implications for journalism and information consumption.
- Ep 143 article 1:35
Context Engineering: Foundations, Categories, and Techniques of Prompt Engineering
In this episode, we unravel the significance of context and prompt engineering in large language models (LLMs). These techniques are critical for creating efficient and reliable AI applications. We discuss the fundamental principles of prompt engineering, its implications in real-world systems, and explore how crafting the right prompts can drastically influence model performance. Join us as we dissect how these innovations empower businesses and enhance user experiences.
- Ep 142 article 1:53
Choosing an LLM in 2026: The Practical Comparison Table (Specs, Cost, Latency, Compatibility)
In this episode, we dive into the nuances of selecting the right large language model (LLM) in 2026. With insights on context, cost, latency, and compatibility, we discuss how these factors shape effective prompt engineering and the importance of making informed model choices. Our conversation also explores real-world implications and provides practical examples for businesses looking to leverage LLMs.
- Ep 141 tool 1:29
Giving Agents a Visual Voice: MCP Apps Support in VS Code
This podcast episode explores the new MCP Apps feature in VS Code, which empowers AI coding agents with interactive visual capabilities. This innovation transforms the way developers collaborate with AI tools, enhancing productivity and problem-solving. Through real-world applications and examples, hosts discuss the implications and potential use cases of this exciting feature.
- Ep 140 article 1:43
Conversational AI doesn’t understand users — 'Intent First' architecture does
This episode explores the revolutionary 'Intent First' architecture in conversational AI, which improves user experiences by accurately understanding intent before delivering responses. We discuss why this matters in various industries and highlight real-world implications for companies and consumers alike.
- Ep 139 tool 1:46
GitHub AvdLee/SwiftUI Agent Skill: Add expert SwiftUI Best Practices guidance to your AI coding tool (Agent Skills open format).
The SwiftUI Agent Skill is revolutionizing the way developers approach coding in SwiftUI by offering expert guidance through AI tools, enhancing productivity and code quality. This episode explores its implications, practical applications, and why it matters for modern development.
- Ep 138 article 1:24
ErZaUgMTdP
This episode delves into a groundbreaking tool named Drift, designed to enhance codebase intelligence by leveraging Abstract Syntax Tree (AST) parsing. We explore how it addresses the common bottleneck of context limitations that hinder AI's effectiveness in software development. Through Drift, developers can now streamline their workflows, minimize audit loops, and improve code reliability and security. We discuss its implications for the industry and how this innovation could change programming practices.
- Ep 137 article 1:42
Scaling PostgreSQL to power 800 million ChatGPT users
The recent advancements in scaling PostgreSQL to support ChatGPT's rapid user growth highlight the ongoing challenges and solutions in database management for massive applications. This is crucial for understanding how to effectively manage user data and ensure seamless service as demand increases.
- Ep 136 research 1:36
Flashlabs Researchers Release Chroma 1.0, a 4B Real-Time Speech Dialogue Model with Personalized Voice Cloning
This episode dives into the groundbreaking Chroma 1.0 model, which offers real-time speech dialogue capabilities with personalized voice cloning. We explore its implications for various sectors, including entertainment and education, and discuss potential use cases that could reshape how we interact with technology.
- Ep 135 research 1:43
LLM-in-Sandbox Elicits General Agentic Intelligence
The LLM-in-Sandbox research presents a significant advancement in how large language models can autonomously explore and learn within a controlled environment. This enables them to tackle complex tasks across various domains without further training, enhancing their utility in real-world applications and offering new capabilities for developers and practitioners.
- Ep 134 api 1:27
Agent Sandbox
The Agent Sandbox offers a secure environment for executing AI coding agents, addressing critical security concerns while allowing developers to utilize powerful tools like Claude Code. This episode dives into the implications of this technology, who it benefits, and how it can transform development workflows.
- Ep 133 tool 1:44
Learn RAG & MCP Fundamentals
This podcast episode delves into the importance of mastering Retrieval Augmented Generation (RAG) and Model Context Protocol (MCP) to enhance AI's capabilities in real-world applications. Hosts discuss how these technologies empower developers to create integrated systems that leverage private data effectively and enable AI to interact with various software seamlessly.
- Ep 132 tool 1:23
Anthropic working on MCP Apps with interactive UI components
Anthropic is enhancing its Claude Cowork platform with new interactive UI components that can revolutionize how users engage with AI applications. This development could streamline workflows, improve collaboration, and empower developers to create richer interactions, drawing clearer parallels with existing technology.
- Ep 131 research 1:39
Agentic Reasoning for Large Language Models
This dialogue explores the implications of agentic reasoning for large language models, discussing the potential for autonomous decision-making and its applications across various fields, while also addressing limitations and future directions.
- Ep 130 research 1:12
Agentic-R: Learning to Retrieve for Agentic Search
This dialogue explores the innovative approach of Agentic-R in enhancing agentic search through tailored retriever training, its implications for developers, and practical applications.
- Ep 129 article 1:40
You Probably Don't Need a Vector Database for Your RAG Yet
In this episode, we explore the emerging topic of vector databases and their relevance in modern AI applications, particularly in retrieval-augmented generation (RAG). We discuss when they are actually necessary, who stands to benefit, and offer practical examples to help listeners understand this tech's implications.
- Ep 128 article 1:34
LangChain vs LangGraph: Why One's a Drive Through and the Other's a Buffet
In this episode, we explore the differences between LangChain and LangGraph, illustrated through food analogies. We discuss their functionalities, real-world applications, and the importance of choosing the right tool for the task at hand. The episode emphasizes decision-making in AI and how it impacts efficiency and user experience.
- Ep 127 article 1:25
Beyond Hybrid RAG That Actually Works: Vector, BM25, GraphRAG, Reranking in Python (Full Code)
This episode dives into the breakthrough of Tri-Modal Hybrid RAG, which combines BM25, Vector, and GraphRAG techniques. We explore how this innovative approach enhances search accuracy, addresses common pitfalls in retrieval, and ultimately improves user experience across various applications. The conversation highlights the significance of effective information retrieval in tech and real-world scenarios.
- Ep 126 research 1:27
MAXS: Meta Adaptive Exploration with LLM Agents
MAXS introduces an innovative framework for improving the reasoning capabilities of LLM agents, addressing critical issues in multi-tool reasoning. The integration of lookahead strategies and trajectory convergence allows for more stable and efficient performance, making it highly relevant for developers and practitioners.
- Ep 125 article 1:32
Build Your First Claude Code Skill: A Simple Project Memory System That Saves Hours
The new project-memory skill for Claude Code tackles the problem of AI amnesia, allowing coding assistants to retain context and history across sessions, thus significantly improving developer productivity. This episode explores how such skills can save time and enhance coding efficiency.
- Ep 124 article 1:27
Vector Database vs Graph Database for RAG: Similarity vs Understanding
Exploring the nuanced differences between vector databases and graph databases, this dialogue highlights their roles in retrieval-augmented generation (RAG) systems, emphasizing the importance of context in AI responses.
- Ep 123 article 1:42
What Even Is a Parameter
This episode explores the significance of parameters in large language models (LLMs), discussing their role in AI functionality and the implications for real-world applications. Hosts engage in a dialogue about how these parameters affect model behavior and the energy demands of training them, illustrating concepts with relatable analogies and examples.
- Ep 122 article 1:31
GitHub ByteVisionLab/NextFlow: NextFlow🚀: Unified Sequential Modeling Activates Multimodal Understanding and Generation
NextFlow is a major advancement in multimodal AI, integrating text and image generation in a single framework. It enables rapid, high-quality visual generation and editing, which has significant implications for various industries, from content creation to education. This episode breaks down how NextFlow works, its real-world applications, and why it represents a paradigm shift in the field.
- Ep 121 article 1:29
2008319040620478905
In this episode, we discuss the recent insights shared on social media regarding the challenges surrounding JavaScript compatibility and browser support. This conversation illuminates the ongoing struggles developers face and the implications for user experience in web applications.
- Ep 120 research 1:50
Scientists Create a “Periodic Table” for Artificial Intelligence
Researchers have created a unifying framework for multimodal AI, akin to a periodic table, helping developers efficiently design AI systems. This model can improve accuracy, reduce data needs, and make AI more environmentally friendly, potentially revolutionizing various applications in technology and healthcare.
- Ep 119 article 1:29
AI Periodic Table Explained: Mapping LLMs, RAG & AI Agent Frameworks
In this episode, we dive into the transformative power of YouTube as a platform that allows users to create, share, and consume a diverse range of content. We explore its significance in democratizing content creation and its broader societal implications.
- Ep 118 article 1:45
MCP-powered RAG Over Complex Docs
In this episode, we explore the integration of MCP-powered Retrieval-Augmented Generation (RAG) over complex documents, emphasizing its real-world applications and significance. Hosts discuss how this technology transforms document processing and retrieval, providing a fresh perspective on managing complex data efficiently.
- Ep 117 article 1:43
WebGPU Changed How I Think About Web Performance
WebGPU is revolutionizing web performance by drastically enhancing graphics and data processing speeds, showing a 23x improvement over WebAssembly in practical applications. This shift in technology not only benefits developers looking for efficient solutions but also enhances user experiences in data-intensive applications.
- Ep 116 tool 1:22
Awesome Claude Skills/brand guidelines/SKILL.md at master · ComposioHQ/awesome Claude Skills
In this episode, we explore the emergence of Claude, a powerful AI tool that enhances collaboration and productivity by integrating various skills. We discuss the significance of its brand guidelines, how it affects user engagement, and what it means for the future of digital collaboration. Real-world implications are examined through hypothetical scenarios and comparisons with existing tools.
- Ep 115 research 1:42
TimeBill: Time-Budgeted Inference for Large Language Models
This episode dives into the innovative framework of TimeBill for time-budgeted inference in Large Language Models (LLMs), exploring its implications in time-sensitive applications and its adaptive mechanisms that enhance performance.
- Ep 114 article 1:46
LangGraph Explained from Scratch | Aman Kharwal
This episode dives into LangGraph, a new library that transforms how we build intelligent agents using Large Language Models. We'll explore its unique graph-based approach, practical applications, and why this matters for developers and users alike.
- Ep 113 research 1:25
Multi-hop Reasoning via Early Knowledge Alignment
The research on Early Knowledge Alignment enhances how Large Language Models retrieve and reason with information, particularly for complex queries. This innovation improves precision and efficiency, benefiting developers in creating more effective AI systems.
- Ep 112 article 1:55
Memory: How Agents Learn
In this episode, we dive into the critical aspect of memory in AI agents, exploring how it enables learning and the transformative implications for user experience and system efficiency. We discuss the types of memory—session, user, and learned—and how they contribute to smarter, more effective agents. Join us as we uncover the potential of these technologies and their real-world applications.
- Ep 111 article 1:33
2003389376307593403
In this episode, we dive into the implications of a recent Twitter thread discussing a novel approach to AI ethics that could reshape the tech landscape. We explore how this could influence developers, businesses, and consumers alike, and what it means for the future of responsible technology use.
- Ep 110 article 1:44
Agent Skills vs MCP
The discussion centers on the relationship between Skills and MCP (Model Context Protocol) in AI development, emphasizing how they complement rather than replace each other. The hosts explore the implications of this synergy, the role of institutional knowledge, and how this understanding can improve AI functionality in real-world applications.
- Ep 109 article 1:34
React2Shell is the Log4j moment for front-end development
The emergence of the React2Shell vulnerability marks a pivotal moment in front-end development, highlighting significant security concerns that could have far-reaching implications for developers and organizations alike. This dialogue delves into the substance of the vulnerability, its real-world impacts, and the necessary measures that must be taken to mitigate risks.
- Ep 108 tool 1:30
I reclaimed tons of disk space using this simple Docker maintenance app
In this episode, we dive into how a simple Docker maintenance app called Portainer can dramatically reclaim disk space for users, especially those running multiple containers on home servers or NAS devices. We discuss its functionalities, real-world benefits, and how it can streamline Docker management for enthusiasts and professionals alike.
- Ep 107 article 1:37
GitHub KalyanKS NLP/RAG Interview Questions and Answers Hub: 100+ RAG interview questions with answers.
This episode dives into the importance of Retrieval-Augmented Generation (RAG) in enhancing the capabilities of language models, especially in reducing hallucinations and improving relevance in responses. We explore the challenges and strategies involved in implementing RAG, providing concrete use cases and implications for the tech community.
- Ep 106 research 1:39
LLMs work better together in smart contract audits | Help Net Security
This episode delves into how collaborative large language models (LLMs) enhance smart contract auditing, improving accuracy in detecting vulnerabilities. It highlights the innovative LLMBugScanner framework from Georgia Tech, which combines ensemble voting with fine-tuned models. We’ll explore why this matters in the blockchain ecosystem, who stands to benefit, and real-world implications that can prevent costly errors in smart contracts.
- Ep 105 research 1:45
Adaptation of Agentic AI
The research on agentic AI adaptation presents a significant step toward creating more efficient and reliable AI systems. By establishing a structured framework for both agent and tool adaptations, it provides developers with essential guidance for improving AI capabilities, addressing challenges, and enhancing performance. This dialogue explores the implications of this research and its practical applications in the field.
- Ep 104 research 1:36
The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs
This podcast episode delves into the Debugging Decay Index (DDI), a new mathematical framework that highlights the rapid decline of AI debugging effectiveness and provides insights on optimizing debugging through timely interventions.
- Ep 103 research 1:42
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
The research introduces AuditDM, a novel framework to audit multimodal LLMs by identifying their capability gaps through reinforcement learning. This approach not only helps in discovering failure modes but also offers a pathway for model improvement without extensive annotation. The implications for developers are significant, as they can utilize these insights to enhance model performance and effectiveness in real-world applications.
- Ep 102 article 1:46
Reddit The heart of the internet
In this episode, we explore the significance of Reddit as a central hub for internet discourse and innovation. We discuss the implications of user-driven content, the dynamics of community engagement, and how platforms like Reddit shape discussions around technology and artificial intelligence. The conversation highlights real-world applications, comparisons to traditional media, and what the future holds for collaborative platforms.
- Ep 101 tool 1:24
Introducing Agent Development Kit for TypeScript: Build AI Agents with the Power of a Code First Approach | Google Developers Blog
The Agent Development Kit (ADK) for TypeScript allows developers to create powerful AI agents using a code-first approach, enhancing flexibility and control in AI development. This creates a seamless integration for JavaScript/TypeScript developers, enabling them to leverage existing skills and tools for more complex, autonomous systems.
- Ep 100 article 1:50
With 91% accuracy, open source Hindsight agentic memory provides 20/20 vision for AI agents stuck on failing RAG
The development of Hindsight agentic memory marks a pivotal advancement in AI, allowing agents to maintain context and provide insightful responses over time, unlike traditional RAG systems. This conversation explores how this technology works, its real-world implications, and why it matters to businesses and everyday users.
- Ep 99 article 1:38
Reddit The heart of the internet
This episode dives into the concept of 'Debugging Decay' in AI systems, particularly how ChatGPT's performance can degrade after multiple attempts at fixing coding errors. We'll discuss the implications of context pollution and how users can adapt their workflows for better results.
- Ep 98 tool 1:17
Meta
Meta's React Compiler 1.0 introduces automatic memoization to optimize React applications, enhancing performance without requiring code changes. This innovation promises significant improvements in load times and interaction speeds, benefiting developers and users alike.
- Ep 97 tool 1:27
The Complete Guide to Using Pydantic for Validating LLM Outputs
This episode dives into how Pydantic can validate outputs from large language models, ensuring reliable data. We'll explore the implications of these validations in real-world applications, the benefits for developers, and practical examples of how this can solve common issues when working with LLMs.
- Ep 96 tool 1:29
OpenAI, Anthropic, Google Agree to Develop Agent Standards Together
In an unprecedented collaboration, major players like OpenAI, Anthropic, and Google are agreeing to set technical standards for AI agents that could revolutionize how we automate white-collar work. This dialogue explores the significance of these standards and their potential real-world applications.
- Ep 95 article 1:27
Agent Engineering: A New Discipline
Agent engineering emerges as a vital discipline for developing reliable AI systems that adapt and learn from unpredictable interactions. As AI becomes integral to business processes, understanding how to manage the complexity and unpredictability of these agents is essential for organizations seeking to leverage their capabilities effectively.
- Ep 94 article 1:57
How confessions can keep language models honest
In this episode, we dive into a fascinating research approach that trains language models to admit when they've not followed instructions correctly. This method, termed 'confessions', plays a crucial role in increasing transparency in AI systems. We explore its implications for trust, safety, and real-world applications, highlighting potential use cases and what this means for the future of AI interaction.
- Ep 93 article 2:15
MIT offshoot Liquid AI releases blueprint for enterprise-grade small-model training
Liquid AI's new blueprint for small-model training positions enterprises to leverage AI on-device efficiently, ensuring privacy and operational reliability without reliance on cloud-based solutions. This shift could transform how businesses implement AI, enabling real-time applications that enhance productivity and data security.
- Ep 92 article 1:52
Don't Build Agents, Build Skills Instead – Barry Zhang & Mahesh Murag, Anthropic
In this episode, we dive into the transformative impact of YouTube on content creation and community building, exploring how it empowers users to become creators and redefine entertainment.
- Ep 91 article 1:21
We Got Claude to Fine Tune an Open Source LLM
The recent development allowing Claude to fine-tune open-source language models marks a significant step in democratizing AI training. It simplifies the complex process of model training, making it accessible to more users and applications, ultimately driving innovation in various sectors.
- Ep 90 article 1:35
Claude Code is coming to Slack, and that's a bigger deal than it sounds | TechCrunch
The integration of Claude Code into Slack marks a significant shift in developer workflows, turning collaboration tools into powerful coding environments. This not only enhances efficiency but also raises vital questions about security and dependency management in software development.
- Ep 89 api 1:38
Google launches managed MCP servers that let AI agents simply plug into its tools | TechCrunch
Google's launch of managed MCP servers aims to simplify how AI agents interact with various tools and data, reducing the complexity developers face while integrating these systems. This innovation could lead to more effective AI solutions for businesses and other sectors, as it streamlines connections to Google's robust services.
- Ep 88 article 1:51
GraphRAG in Practice: How to Build Cost Efficient, High Recall Retrieval Systems | Towards Data Science
In this episode, we explore GraphRAG, a new methodology for building retrieval systems that blend graph and vector searches to enhance information retrieval efficiency. We discuss its practical implications, explore who benefits from this innovation, and examine concrete examples of usage scenarios.
- Ep 87 article 1:37
Exclusive: Agentic AI startup Prime Security raises $20M
The rise of agentic AI in software security is crucial as it addresses vulnerabilities during development, where traditional security measures often fall short. Prime Security's recent $20M funding aims to enhance these protective measures, showcasing a shift in how we safeguard software against breaches.
- Ep 86 research 1:38
DeepSeek V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2 revolutionizes the efficiency of large language models with innovative techniques that enhance reasoning and performance in computational tasks, providing practical benefits across various domains.
- Ep 85 tool 2:03
Google and Anthropic Approach LLMs
This episode delves into the contrasting approaches to large language models (LLMs) by Google and Anthropic. We explore their engineering-focused culture versus a philosophical approach to AI, the implications for users, and how these developments impact the tech landscape.
- Ep 84 article 1:52
An AI for an AI: Anthropic says AI agents require AI defense
Anthropic's latest research highlights the pressing need for AI-driven defense mechanisms as AI agents become adept at exploiting vulnerabilities in smart contracts. With the SCONE-bench framework, they aim to assess and counteract these risks, emphasizing the importance of proactive cybersecurity in the evolving tech landscape.
- Ep 83 article 2:03
Claude Code and Slack | Claude
Claude's new integration with Slack revolutionizes how coding tasks are handled in teams, allowing for seamless transitions from discussion to implementation, which streamlines workflows and enhances productivity.
- Ep 82 article 2:00
Reddit The heart of the internet
In today's episode, we're diving into a fascinating solution designed to combat the issue of AI 'hallucinations'—the inaccuracies that AI models sometimes generate. We'll explore how a middleware solution can enhance trust in AI systems, specifically within the context of developing applications that rely on large language models.
- Ep 81 tool 1:33
Why the MCP Server Is Now a Critical Microservice
In this episode, we explore how the MCP server has become an essential microservice in modern software architecture. We discuss its implications for system scalability, reliability, and collaboration, and provide concrete examples to illustrate its real-world applications. Join us for insights into why adopting this technology could be transformative for businesses today.
- Ep 80 article 1:41
Inside OpenAI: 2026 is the year of agents, AI’s biggest bottleneck, and why compute isn’t the issue
In this episode, hosts dive deep into the transformative impact of YouTube on content creation and digital communication. They explore how the platform empowers creators, fosters communities, and shifts traditional media paradigms, ultimately reshaping how we consume entertainment and information.
- Ep 79 article 1:56
Multi Agent Systems Explained: How AI Agents & LLMs Work Together
In this episode, we discuss the impact of YouTube on the way we consume media and interact with content. We explore its role in democratizing content creation and the implications for creators and audiences alike.
- Ep 78 article 1:44
1brR9yRe6z
This episode explores groundbreaking advancements in creating dynamic NPC personalities that mimic real human behavior in games, integrating psychology, narrative, and social models. We discuss how these developments can revolutionize gaming experiences, enhance player immersion, and offer developers new tools for storytelling.
- Ep 77 article 2:11
GitHub Dyoshikawa/rulesync
In this episode, we unpack Rulesync, a powerful Node.js CLI tool that streamlines AI development by generating uniform configuration files for various AI coding tools. We explore its implications for developers, the flexibility it offers in tool selection, and how it can enhance productivity across teams.
- Ep 76 tool 2:04
New Infrastructure as Code Tool "formae" Takes Aim at Terraform
The launch of formae, an innovative infrastructure-as-code tool, aims to tackle common challenges in cloud management, positioning itself as a potential game-changer in the DevOps landscape.
- Ep 75 research 1:55
2510
AgentFold introduces a new way to manage context in LLM-based web agents, particularly for long-horizon tasks, improving performance through proactive context management, which can significantly benefit developers in various applications.
- Ep 74 tool 1:51
Minimax M2 Is the New King of Open Source LLMs Especially for Agentic Tool
The Minimax M2 model emerges as a powerful open-source language model, enabling advancements in AI agents and tool usage, making AI more accessible and efficient for diverse applications.
- Ep 73 article 1:45
How to orchestrate agents using mission control
Exploring the concept of orchestrating AI agents through Mission Control, this episode delves into its significance in improving efficiency and collaboration in tech development. Hosts discuss the practical implications of this approach, highlighting real-world benefits and potential use cases.
- Ep 72 article 2:14
Streaming datasets: 100x More Efficient
Hugging Face's recent advancements in streaming datasets promise to revolutionize machine learning by improving data handling efficiency by 100x, allowing developers to focus more on model training than on data preparation.
- Ep 71 tool 2:21
Warp Embeds AI Agents into a CLI to Provide Better Feedback Loop DevOps
The integration of AI agents into command line interfaces (CLI) represents a significant shift in the way developers interact with coding tools. Warp Code’s approach aims to create a tighter feedback loop between developers and AI, enhancing code quality and enabling more efficient workflows. This discussion explores the implications of this innovation for DevOps teams and the broader coding community.
- Ep 70 article 2:06
From Logs to Insights: The AI Breakthrough Redefining Observability
This episode delves into the transformative role of AI in observability, exploring how advances improve system monitoring and troubleshooting, ultimately enhancing decision-making in tech environments.
- Ep 69 article 1:55
IBM's Open Source Granite 4.0 Nano AI Models Are Small Enough to Run Locally
In this episode, we explore IBM's Granite 4.0, a breakthrough in nano-AI models that can run locally, transforming how AI is integrated into everyday devices and applications. We discuss the implications for privacy, efficiency, and accessibility, and share real-world scenarios that highlight its potential impact on industries.
- Ep 68 tool 1:50
Meta's DreamGym Framework Trains AI Agents in a Simulated World to Cut
Meta's DreamGym Framework is revolutionizing the way AI agents are trained by simulating complex environments, improving their efficiency and adaptability in real-world applications. This discussion explores how DreamGym works, its implications for various industries, and potential use cases that could redefine AI training.
- Ep 67 article 1:43
Mistral Launches Mistral 3, a Family of Open Models Designed to Run On
In this episode, we dive into Mistral 3, a new family of open models that revolutionize how AI can be integrated into everyday applications. We discuss the significance of these models, their real-world implications for users and developers, and practical examples to illustrate their potential. Join us as we explore how Mistral 3 could change the landscape of AI deployment.
- Ep 66 article 1:28
Reforge
This episode dives into how AI prototyping is revolutionizing product development, making it faster and more efficient. We explore its implications across industries, who stands to benefit, and how it addresses traditional challenges in the prototyping process.
- Ep 65 research 1:49
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
This dialogue explores the research on Unified Multimodal Models, focusing on the gap between understanding and generation in AI systems. It emphasizes the significance of addressing this gap for practical applications and future advancements in AI technologies.
- Ep 64 research 1:42
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
This dialogue explores the advances in reinforcement learning (RL) through the integration of large language models (LLMs), specifically focusing on a recent study that provides new strategies for stabilizing RL training. The conversation highlights practical implications, potential use cases, and the future of RL in practical applications.
- Ep 63 api 1:47
China unveils world's cheapest humanoid robot under $1,400
The unveiling of Noetix's Bumi, the world’s cheapest humanoid robot at $1,370, is a game-changer in robotics and education. Hosts delve into its features, potential uses, and the broader implications for society.
- Ep 62 article 1:38
New Markovian Thinking Technique Unlocks a Path to Million Token AI
In this episode, we dive into a groundbreaking technique in AI that dramatically expands the token processing capabilities of language models, paving the way for more advanced applications.
- Ep 61 research 1:43
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
This episode dives into the innovative research on Grasp Any Region (GAR), which enhances multimodal language models' ability to understand complex visual scenes. We discuss its practical implications for developers and the real-world applications that can benefit from this advanced technology.
- Ep 60 article 1:51
Anthropic Is Giving Away Its Powerful Claude Haiku 4.5 AI for Free to Take
Anthropic's release of Claude Haiku 4.5 AI for free is a significant move in the AI landscape, democratizing access to advanced technology. It has implications for various sectors, enhancing creativity, education, and small businesses. The hosts explore the practical benefits, potential challenges, and the future of AI accessibility.
- Ep 59 article 1:51
The Teacher Is the New Engineer Inside the Rise of AI Enablement And
The rise of AI enablement is reshaping the workforce, emphasizing the need for educators who can teach and guide AI tools rather than traditional engineering roles.
- Ep 58 research 1:57
RAG-Anything: All-in-One RAG Framework
The RAG-Anything framework transforms how multimodal data is processed by integrating diverse knowledge types, addressing the limitations of current models. This innovation has significant implications for developers, enhancing user experience and expanding application areas. The discussion delves into practical uses, the technology's potential impact, and the challenges it still faces.
- Ep 57 research 2:00
Agent Learning via Early Experience
This dialogue explores innovative strategies in agent learning through early experience, discussing their implications, practical applications, and limitations in real-world scenarios.
- Ep 56 article 1:48
Zone 2 Training Explaining the Latest Trend in Fitness
The rise of Zone 2 training is revolutionizing fitness, promoting a healthier lifestyle and better performance through optimized aerobic conditioning. This episode dives into what Zone 2 training entails, its implications for everyday fitness enthusiasts and athletes alike, and how it can dramatically enhance overall health and performance.
- Ep 55 article 1:46
Self-Improving Language Models Are Becoming Reality with MIT's Updated SEAL
The emergence of self-improving language models, like MIT's SEAL, could revolutionize how AI processes and generates human-like text, increasing efficiency and adaptability in various applications.
- Ep 54 tool 1:48
New Memory Framework Builds AI Agents That Can Handle the Real World's
In this episode, we dive into a groundbreaking new memory framework that enhances AI agents' abilities to function in the real world, exploring its implications, potential applications, and how it might change our interaction with technology.
- Ep 53 article 1:53
Databricks Set to Accelerate Agentic AI by Up to 100x with Mooncake
Databricks' new 'Mooncake' technology aims to revolutionize agentic AI, making it faster and more efficient. This could drastically improve various sectors by enabling smarter, real-time data-driven decisions. Hosts delve into its implications, applications, and potential impact on industries.
- Ep 52 tool 1:54
The New Pebble: Now 100% Open Source
The new Pebble smartwatch is now fully open-source, enabling users to modify and repair their devices. This move aims to provide longevity and customization in a landscape dominated by proprietary tech. Hosts explore its significance, potential user benefits, and future possibilities.
- Ep 51 article 0:53
lYttNavMJN
From r/ChatGPTCoding, posted by AdditionalWeb107: archgw (0.3.20) - Sometimes a small release is a big one. ~500 MB of Python deps gutted out. archgw (a models-native sidecar proxy for AI agents) offered two capabilities that required loading small LLMs in memory: guardrails to prevent jailbreak attempts, and function-calling for routing requests to the right downstream tool or agent.
- Ep 50 article 0:54
New Token Oriented Object Notation (TOON) Hopes to Cut LLM Costs by Reducing Token Consumption
InfoQ news item by Bruno Couriol, Nov 23, 2025 (2 min read).
- Ep 49 article 0:45
Natural Language Visualization and the Future of Data Analysis and Presentation | Towards Data Science
Will conversational interaction replace SQL queries, KPI reports, and dashboards? By Michal Szudejko, Nov 21, 2025 (28 min read). For decades, data analysis has been like classical art.
- Ep 48 research 1:07
Meta AI Researchers Introduce Matrix: A Ray-Native Decentralized Framework for Multi-Agent Synthetic Data Generation
By Michal Sutter, November 30, 2025. How do you keep synthetic data fresh and diverse for modern AI models without turning a single orchestration pipeline into the bottleneck? Meta AI researchers introduce Matrix, a decentralized framework where both control and data flow are serialized into messages that move through distributed queues.
- Ep 47 tool 0:50
GitHub Chen Zexi/open Ptc agent: An open source implementation of code execution with MCP (Programmatic Tool Calling)
Open PTC Agent is an open source implementation of Anthropic's recently introduced Programmatic Tool Calling (PTC), which enables agents to invoke tools with code execution rather than making individual JSON tool calls.
- Ep 46 article 0:46
GitHub pguso/rag-from-scratch: Demystify RAG by building it from scratch. Local LLMs, no black boxes. Real understanding of embeddings, vector search, retrieval, and context-augmented generation.
RAG from Scratch: demystify Retrieval-Augmented Generation (RAG) by building it yourself, step by step. No black boxes.
- Ep 45 article 1:00
NFzcjna0zb
In today's episode, we explore how a developer uses Perplexity MCP as a secret weapon to enhance productivity with ChatGPT. We'll discuss the benefits of this approach, the cost-effectiveness, and the importance of using reliable sources.
- Ep 44 article 0:45
HjpmePJNA6
From r/Cloud, posted by akorolyov: 💸 I cut 40% of our AWS bill in 90 Days.
- Ep 43 tool 0:45
8 platform engineering anti-patterns
Golden paths gone gray? Avoid these common mistakes that sink platform engineering initiatives.
- Ep 42 article 0:49
8hlgNiDYjM
From r/ChatGPTCoding, a subreddit that focuses on the coding side of ChatGPT: interactions you've had with it, tips on using it, and full-blown creations.
- Ep 41 article 0:59
Building the Open Agent Ecosystem Together: Introducing OpenEnv
Published October 23, 2025. By Joseph Spisak, Davide Testuggine, Zach Wentz, Pierre Andrews, Sanyam Bhutani, Hamid Shojanazeri, Pankit Thapar, Emre Guven, Lewis Tunstall, and Vaibhav Srivastav. With tools like TRL, TorchForge and verl, the open-source community has shown how to scale AI across complex compute infrastructure. But compute is only one side of the coin.
- Ep 40 article 1:10
Deep Agents overview | Docs by LangChain
Explore the capabilities of Deep Agents in LangChain, a powerful tool for building specialized agents capable of handling complex tasks with planning and context management.
- Ep 39 tool 1:04
LangChain and LangGraph Agent Frameworks Reach v1.0 Milestones
LangChain and LangGraph have released their first major versions, v1.0, focusing on agent flexibility, middleware, and improved model integrations, while ensuring stability and backward compatibility for developers.
- Ep 38 tool 0:55
Will DeepSeek's new AI model break the 'long context' bottleneck holding back LLMs?
South China Morning Post, Wed, October 22, 2025. DeepSeek's new artificial intelligence model that converts images into text is not just a document parsing tool but a potential preview of its next generation of large language models (LLMs), according to AI experts.
- Ep 37 api 0:54
Critical Vulnerability in MCP Server Platform Exposes 3,000+ Servers and Thousands of API Keys
By Guru Baran, October 22, 2025. A critical vulnerability in Smithery.ai, a popular registry for Model Context Protocol (MCP) servers, exposed more than 3,000 servers and thousands of API keys.
- Ep 36 api 0:58
Postgres for Agents | TigerData
In this episode, we explore the groundbreaking Agentic Postgres, the first database designed specifically for AI agents. Hosts Ajay and Mike discuss how the evolution from traditional development practices to agent-driven coding necessitates a new kind of database, highlighting key features like fast, zero-copy forks, native search capabilities, and more.
- Ep 35 tool 0:45
How to Use Frontier Vision LLMs: Qwen3 VL | Towards Data Science
Learn how you can use vision language models to perform advanced document understanding tasks. By Eivind Kjosbakken, Oct 20, 2025 (11 min read).
- Ep 34 tool 1:10
VMware Workstation Pro 25H2 Released with New Features
VMware Workstation Pro 25H2 introduces significant updates for power users, enhancing hardware support and adding new features, making it a noteworthy upgrade for virtualisation enthusiasts.
- Ep 33 article 0:46
3tuhGmLfNp
Explore essential AI tips and tricks from the vibrant Reddit community, focusing on daily updates, tools, and expert insights.
- Ep 32 article 1:00
7 LLM Generation Parameters: What They Do and How to Tune Them
By Michal Sutter, October 14, 2025. Tuning LLM outputs is largely a decoding problem. You shape the model’s next-token distribution with a handful of sampling controls: max tokens (caps response length under the model’s context limit), temperature (logit scaling for more/less randomness), top-p / nucleus and top-k (truncate the candidate set by probability mass or rank), frequency and presence penalties (discourage repetition or encourage novelty), and stop sequences (hard termination on delimiters).
- Ep 31 article 0:51
Let’s Build the GPT Tokenizer: A Complete Guide to Tokenization in LLMs – fast
18 months ago, Andrej Karpathy set a challenge: “Can you take my 2h13m tokenizer video and translate the video into the format of a book chapter”. We’ve done it, and the chapter is below, including key pieces of code inlined, and images from the video at key points (hyperlinked to the video timestamp).
- Ep 30 article 1:09
Nanochat Lets You Build Your Own Hackable LLM
Nanochat offers an accessible way to create your own customizable large language model, emphasizing user modification and experimentation.
- Ep 29 article 0:59
Qwen3 VL · Ollama Blog
October 14, 2025. Qwen3-VL, the most powerful vision language model in the Qwen series, is now available on Ollama’s cloud. The models will be made available locally soon.
- Ep 28 article 0:45
Securing your agents with authentication and authorization
Agents can take action, which makes proper authentication and authorization critical. Read on for how to implement and evolve agent auth.
- Ep 27 article 0:47
Optimizing Coding Agent Rules (CLAUDE.md, agents.md, ./clinerules, .cursor/rules) for Improved Accuracy
Published October 14, 2025. Coding agents have become the focal point of modern software development. Tools like Cursor, Claude Code, Codex, Cline, Windsurf, Devin, and many more are revolutionizing how engineers write and ship code.
- Ep 26 article 1:00
Agentic Context Engineering (ACE): Self-Improving LLMs via Evolving Contexts, Not Fine-Tuning
By Asif Razzaq, October 10, 2025. TL;DR: A team of researchers from Stanford University, SambaNova Systems and UC Berkeley introduce the ACE framework, which improves LLM performance by editing and growing the input context instead of updating model weights. Context is treated as a living “playbook” maintained by three roles (Generator, Reflector, Curator), with small delta items merged incrementally to avoid brevity bias and context collapse.
- Ep 25 tool 0:52
JavaScript Library Runs Machine Learning Models in Browser
A new JavaScript library enables developers to run machine learning models directly in the browser, making AI more accessible and efficient.
- Ep 24 article 1:09
Elena Verna at ProductCon: Why Traditional Product Management is Dying (And What to Do About It) PART 1 Just listened to Elena Verna's (Head of Growth at Lovable) talk at ProductCon, and it was a… | Anastasiia Moskovchenko
Post by Anastasiia Moskovchenko (Product Manager | AI/ML Products | 4x Growth at Yandex.Zen): Just listened to Elena Verna's (Head of Growth at Lovable) talk at ProductCon, and it was a wake-up call for anyone who thinks product management has stayed the same.
- Ep 23 article 0:57
f7XBmoftBE
A discussion on key insights from a recent Reddit post about challenges faced in product management, highlighting the importance of communication and user feedback.
- Ep 22 article 0:59
GLM 4.6: Advanced Agentic, Reasoning and Coding Capabilities
2025-09-30 · Research. Today, we are releasing the latest version of our flagship model: GLM-4.6. Compared with GLM-4.5, this generation brings several key improvements. Longer context window: the context window has been expanded from 128K to 200K tokens, enabling the model to handle more complex agentic tasks.
- Ep 21 tool 0:50
LLM Evaluation 4 Approaches
Understanding the 4 Main Approaches to LLM Evaluation (From Scratch): Multiple-Choice Benchmarks, Verifiers, Leaderboards, and LLM Judges with Code Examples. By Sebastian Raschka, PhD, Oct 05, 2025. How do we actually evaluate LLMs? It’s a simple question, but one that tends to open up a much bigger discussion.
- Ep 20 article 1:16
We built our coding agent for Slack instead of the terminal
Mintlify Agent revolutionizes documentation management by integrating it with Slack, making the process of updating documentation feel seamless and less daunting for developers.
- Ep 19 tool 0:49
Continue.dev AI coding assistant
Continue.dev revolutionizes coding by automating repetitive tasks, allowing developers to focus on creative solutions. With its seamless integration in various environments and customizable workflows, it promises efficiency and adaptability in coding practices.
- Ep 18 tool 1:15
Doubling down on DeepAgents
In this episode, we dive into the exciting updates of LangChain's DeepAgents 0.2 release, exploring its new features, the importance of planning tools, and how it distinguishes itself from LangChain and LangGraph.
- Ep 17 research 0:58
Chat in NotebookLM: A powerful, goal focused AI research partner
NotebookLM has received significant upgrades, enhancing its chat capabilities with a larger context window, improved memory, and personalized goal settings, making it an even more powerful AI research partner.
- Ep 16 article 0:52
Building a Multimodal RAG That Responds with Text, Images, and Tables from Sources | Towards Data Science
Why do few chatbots return figures from source documents in their responses? Partha Sarkar, Nov 3, 2025, 11 min read. Retrieval-Augmented Generation (RAG) has been one of the earliest and most successful applications of Generative AI.
- Ep 15 article 0:56
GPT 5 prompting guide | OpenAI Cookbook
Unlock the full potential of GPT-5 with practical prompting strategies to enhance performance and steerability.
- Ep 14 tool 0:52
I switched from LM Studio/Ollama to llama.cpp, and I absolutely love it
By Dhruv Bhutani, published Nov 2, 2025. Dhruv Bhutani has been writing about consumer technology since 2008, offering deep insights into the personal technology landscape through features and opinion pieces. He writes for XDA-Developers, where he focuses on topics like productivity, networking, self-hosting, and more.
- Ep 13 article 1:33
The Three Ages of Data Science: When to Use Traditional Machine Learning, Deep Learning, or an LLM (Explained with One Example) | Towards Data Science
A practical use case to describe how the data scientist job changed across three generations of machine learning. Piero Paialunga, Nov 11, 2025, 10 min read. One of the best songs of the universe (made by one of the most iconic singers ever) says this: “Wish I could go back / And change these years / I’m going through changes” (Black Sabbath, “Changes”). This song is incredibly powerful and talks about how life can change right in front of you so quickly. That song is about a broken heart and a love story.
- Ep 12 tool 1:12
A closer look at Python Workflows, now in beta
Cloudflare introduces Python Workflows in beta, expanding developers' ability to automate multi-step applications using Python, a favored language for data pipelines and AI. This new feature simplifies orchestration with built-in error handling and retry behavior, making it easier to create robust workflows.
- Ep 11 article 0:50
eZ14meVrgl
r/webdev. A community dedicated to all things web development: both front-end and back-end. For more design-related questions, try /r/web_design.
- Ep 10 tool 1:21
GitHub Snapchat/Valdi: Valdi is a cross platform UI framework that delivers native performance without sacrificing developer velocity.
Valdi is Snapchat's cross-platform UI framework that offers native performance while enhancing developer productivity. It allows developers to write UI once in TypeScript, compiling directly to native views across iOS, Android, and macOS. With features like instant hot reload and deep native integration, Valdi aims to streamline the development process and improve application performance.
- Ep 9 article 0:55
Deepagents Quickstarts
Explore the world of Deepagents, a powerful open-source agent harness designed for efficient task management and execution using advanced AI techniques. Learn about its built-in tools, middleware, and how to customize agents for specific workflows.
- Ep 8 article 0:47
GPT 5.1 Prompting Guide | OpenAI Cookbook
GPT-5.1, our newest flagship model, is designed to balance intelligence and speed for a variety of agentic and coding tasks, while also introducing a new “none” reasoning mode for low-latency interactions. Building on the strengths of GPT-5, GPT-5.1 is better calibrated to prompt difficulty, consuming far fewer tokens on easy inputs and more efficiently handling challenging ones.
- Ep 7 article 1:07
Configure MCP server access for your organization or enterprise - GitHub Docs
You can configure an MCP registry URL and access control policy to determine which MCP servers developers can discover and use in supported IDEs with GitHub Copilot.
- Ep 6 tool 0:49
mcp-funnel/packages/commands at develop · chris-schra/mcp-funnel
In today's episode, we explore the mcp-funnel project, focusing on its command package and how it can streamline your development process. We’ll break down what it is, its features, and its potential impact on your workflows.
- Ep 5 article 0:53
xSVVTj9qiY
In this episode, we dive into the latest developments from the r/singularity community, focusing on AI advancements and the implications of human enhancement technologies. We discuss Google's SIMA 2, an innovative agent that interacts and learns in 3D environments, and what this means for our future.
- Ep 4 tool 0:54
Why LLMs Aren’t a One Size Fits All Solution for Enterprises | Towards Data Science
What LLMs are (and aren’t) optimized for, and how the industry is approaching AI over structured business datasets, including one approach developed by my team and me. Jure Leskovec, Nov 18, 2025, 10 min read. Executives everywhere are racing to use LLMs, but often for tasks they aren’t well-suited to.
- Ep 3 article 1:03
No OAuth Required: An MCP Client For AWS IAM
Dennis Traub for AWS. Posted on Nov 18, edited on Nov 20. When Anthropic published the Model Context Protocol (MCP), I immediately started experimenting with deployment options on AWS. First, I tried running MCP servers as AWS Lambda functions. A great solution in terms of simplicity and cost, but it also meant I had to manually manage session state across invocations.
- Ep 2 article 0:51
LLM Visibility Alignment 464073
Alignment for LLM visibility is incredibly complex, but doable. Published November 18, 2025 at 2:29 pm, 23 min read. Written by Mordy Oberstein, edited by Willie Vitari. LLMs expose brand misalignment instantly. Discover how inconsistent messaging raises costs, kills visibility, and what brands must do to realign and win in AI search.
- Ep 1 article 0:48
Stumbling into AI: Part 6—I’ve been thinking about Agents and MCP all wrong
Ever tried to hammer a nail in with a potato? Nor me, but that’s what I’ve felt like I’ve been attempting to do when trying to really understand agents, as well as to come up with an example agent to build.