NVIDIA unveils major Dynamo updates targeting AI coding agents, achieving up to 97% KV cache hit rates and 4x latency improvements for enterprise deployments.

NVIDIA Dynamo Gets Agentic AI Overhaul With 97% Cache Hit Rates

2026/04/18 07:22
Reading time: 4 min


Lawrence Jengar Apr 17, 2026 23:22



NVIDIA has released a comprehensive update to its Dynamo inference framework specifically optimized for AI coding agents, addressing a critical bottleneck as enterprise adoption of automated code generation accelerates. The company reports achieving up to 97.2% cache hit rates for multi-agent workflows—a metric that directly translates to reduced compute costs and faster response times.

The timing isn't accidental. Stripe's internal agents now generate over 1,300 pull requests weekly. Ramp attributes 30% of its merged PRs to AI agents. Spotify reports 650+ agent-generated PRs monthly. Behind each of these workflows sits an inference stack under intense pressure from repeated context processing.

The Cache Problem Nobody Talks About

Here's what makes agentic AI different from chatbots: a coding agent like Claude Code or Codex makes hundreds of API calls per session, each carrying the full conversation history. After the first call writes the conversation prefix to the KV cache, every subsequent call sees an 85-97% cache hit rate when routed to the same worker. NVIDIA measured an 11.7x read-to-write ratio: the system reads from cache nearly 12 times for every token written.

Without cache-aware routing, turn 2 of a conversation has roughly a 1-in-N chance of landing on the same worker as turn 1, where N is the number of workers. Every miss forces complete prefix recomputation. For a 200K-token context window, that's expensive.
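The 1-in-N math can be sanity-checked with a toy simulation. This is a sketch of the routing probabilities only, not Dynamo's actual router:

```python
import random

def simulate_hit_rate(num_workers: int, turns: int, cache_aware: bool,
                      sessions: int = 10_000) -> float:
    """Estimate the prefix-cache hit rate for multi-turn sessions under
    random routing vs. session-sticky (cache-aware) routing."""
    hits = total = 0
    for _ in range(sessions):
        first_worker = random.randrange(num_workers)  # turn 1 writes the prefix here
        for _ in range(turns - 1):                    # turns 2..N either hit or miss
            total += 1
            worker = first_worker if cache_aware else random.randrange(num_workers)
            hits += worker == first_worker
    return hits / total

random.seed(0)
print(simulate_hit_rate(8, 5, cache_aware=False))  # close to 1/8
print(simulate_hit_rate(8, 5, cache_aware=True))   # exactly 1.0
```

With eight workers, random routing wastes roughly seven out of every eight follow-up turns on prefix recomputation; sticky, cache-aware placement eliminates those misses entirely.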

Three-Layer Architecture

Dynamo's update attacks the problem at three levels. The frontend now supports multiple API protocols—v1/responses, v1/messages, and v1/chat/completions—through a common internal representation. This matters because newer APIs use typed content blocks, letting the orchestrator see boundaries between thinking, tool calls, and text to apply different cache policies per block type.
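To make the block-boundary point concrete, here is what a typed assistant turn looks like in a v1/messages-style payload, paired with an illustrative per-block cache policy. The block shapes follow the public Messages API; the `cache_policy` mapping and its values are assumptions for illustration, not a documented Dynamo setting:

```python
# Typed content blocks let the orchestrator see where reasoning ends and
# tool calls begin, instead of receiving one opaque string.
assistant_turn = [
    {"type": "thinking", "thinking": "Plan: read the file, then patch it."},
    {"type": "tool_use", "id": "toolu_01", "name": "read_file",
     "input": {"path": "utils.py"}},
    {"type": "text", "text": "I'll start by reading utils.py."},
]

# Hypothetical policy keyed by block type: reasoning is rarely reused after
# the loop closes, so it is the first candidate for eviction.
cache_policy = {"thinking": "evict_first", "tool_use": "retain", "text": "retain"}
```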

The new "agent hints" extension allows harnesses to attach structured metadata to requests: priority levels, estimated output length, and speculative prefill flags. A harness can signal "warm this cache ahead of time" when it knows a tool call is about to return.
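A request carrying such hints might look like the following. The field names under `agent_hints` are illustrative guesses at the kinds of signals described above; the actual Dynamo extension schema may differ:

```python
# Hypothetical shape of an "agent hints" extension on a completion request.
request = {
    "model": "glm-5",
    "messages": [{"role": "user", "content": "Refactor utils.py"}],
    "agent_hints": {
        "priority": "high",              # routing priority under memory pressure
        "estimated_output_tokens": 512,  # lets the scheduler budget decode slots
        "speculative_prefill": True,     # warm the KV cache before the tool result lands
    },
}
```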

At the routing layer, NVIDIA's Flash Indexer now handles 170 million operations per second for KV-aware placement decisions. The NeMo Agent Toolkit team built a custom router using these APIs and measured 4x reduction in p50 time-to-first-token and up to 63% latency improvement for priority-tagged requests under memory pressure.

Rethinking Cache Eviction

Standard LRU eviction treats all cached data identically—a fundamental mismatch with how agents actually work. System prompts get reused every turn. Reasoning tokens inside <think> blocks? Typically zero reuse after the loop closes, yet they account for roughly 40% of generated tokens.

The update introduces selective retention with per-region control. Teams can specify that system prompt blocks evict last, conversation context survives 30-second tool call gaps, and decode tokens go first. TensorRT-LLM's new TokenRangeRetentionConfig enables this granularity within single requests.
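The eviction ordering described above can be modeled as a priority queue over cache regions. This is a toy sketch of the policy, not the TensorRT-LLM TokenRangeRetentionConfig API; the region names and priority values are assumptions:

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Region:
    priority: int                       # lower priority evicts first
    name: str = field(compare=False)
    blocks: int = field(compare=False)  # KV cache blocks held by this region

# Illustrative policy: decode tokens go first, the system prompt evicts last.
regions = [
    Region(0, "decode_tokens", 40),
    Region(5, "conversation_context", 30),
    Region(9, "system_prompt", 4),
]

def evict(regions: list[Region], blocks_needed: int) -> list[str]:
    """Free blocks starting from the lowest-priority regions."""
    heapq.heapify(regions)
    freed = []
    while blocks_needed > 0 and regions:
        r = heapq.heappop(regions)
        freed.append(r.name)
        blocks_needed -= r.blocks
    return freed

print(evict(list(regions), 50))  # ['decode_tokens', 'conversation_context']
```

Under this ordering, a 50-block shortfall is covered entirely by decode tokens and conversation context; the system prompt's blocks survive for the next turn.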

NVIDIA is also building toward a four-tier memory hierarchy—GPU, CPU, local NVMe, and remote storage—where blocks flow automatically via write-through. When one worker computes KV for a prefix, any other worker can load those blocks via RDMA instead of recomputing. Four redundant prefill computations become one compute and three loads.
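The lookup-and-promote behavior of such a hierarchy can be sketched in a few lines. This is a toy in-memory model under the assumptions above, not Dynamo's storage layer; real tiers would involve RDMA transfers and asynchronous offload:

```python
class TieredKVCache:
    """Toy model of a GPU -> CPU -> NVMe -> remote lookup chain with write-through."""
    TIERS = ["gpu", "cpu", "nvme", "remote"]

    def __init__(self):
        self.store = {tier: {} for tier in self.TIERS}

    def put(self, prefix_hash: str, blocks) -> None:
        # Write-through: blocks land in every tier, so any peer can load them
        # later instead of recomputing the prefix.
        for tier in self.TIERS:
            self.store[tier][prefix_hash] = blocks

    def get(self, prefix_hash: str):
        # Probe the fastest tier first; on a lower-tier hit, promote to GPU.
        for tier in self.TIERS:
            if prefix_hash in self.store[tier]:
                self.store["gpu"][prefix_hash] = self.store[tier][prefix_hash]
                return tier, self.store[tier][prefix_hash]
        return None, None  # full miss: the caller must recompute the prefix
```

A worker that evicted a prefix from GPU memory finds it again in CPU memory and promotes it back, which is exactly the "one compute, three loads" trade the article describes.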

What This Means for Deployment

The company has been running internal Dynamo deployments of GLM-5 and MiniMax2.5 to power Codex and Claude Code harnesses, benchmarking against closed-source inference. They're targeting parity on cache reuse performance with optimized recipes coming in the next few weeks.

For teams already running open-source models on their own GPUs, the gap with managed API providers just got smaller. The cache_control API mirrors Anthropic's prompt caching semantics, so migration paths exist for teams familiar with that interface.
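In Anthropic's prompt caching semantics, a cache breakpoint is attached to a content block via `cache_control`; a compatible request might look like this (the model name and prompt text are placeholders):

```python
# Anthropic-style prompt caching: mark the long, stable prefix as cacheable
# so subsequent turns reuse its KV blocks instead of recomputing them.
request = {
    "model": "example-model",
    "system": [
        {
            "type": "text",
            "text": "<large repo context, coding guidelines, tool schemas>",
            "cache_control": {"type": "ephemeral"},  # cache breakpoint ends here
        }
    ],
    "messages": [{"role": "user", "content": "Fix the failing test."}],
}
```

Everything up to the breakpoint is cached once and reused across turns, which is why placing it after the large, stable prefix matters.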

The agent hints specification remains v1, and NVIDIA is actively soliciting feedback from teams building agent harnesses on which signals prove most useful. Given that Dynamo 1.0 launched just last month with major cloud provider adoption, expect rapid iteration as enterprise agentic workloads scale.

Image source: Shutterstock
  • nvidia
  • ai infrastructure
  • dynamo
  • coding agents
  • enterprise ai