KV cache is the real inference bottleneck (Not GPUs)
GPUs get all the attention, but in inference, the real bottleneck is often memory, specifically the KV cache. In this episode of Pop Goes the Stack, Lori MacVittie sits down with Tim Michels to explain why inference stopped being stateless the moment long contexts, multi-turn conversations, and never-ending agents became normal. That state has to live somewhere, and too often it’s living in the most expensive place in the stack.
Tim breaks down what KV cache actually is by separating inference into its two phases: prefill, where prompts are tokenized and transformed into the internal structures the model needs, and decode, where the response is generated token by token. KV cache is the bridge between them, and keeping it available can skip expensive recomputation and drastically improve time to first token.
From there, the conversation moves into the architectural shift: building a memory hierarchy that offloads cache from GPU HBM to host DRAM, to local SSD, and even to network-attached storage. It’s slower than keeping everything on-GPU, but still faster than starting cold. They also cover semantic caching as an external shortcut, and why routing and load balancing need to become cache-aware, steering users back to the GPU or cluster that already holds their state.
The big takeaway for enterprises is practical: stop accepting “buy more GPUs” as the default plan. KV cache awareness, smarter routing, and storage/network tuning are where the next 2x to 5x efficiency gains are likely to come from, especially as agentic workloads multiply demand.
Creators and Guests
Host
Lori MacVittie
Distinguished Engineer and Chief Evangelist at F5, Lori has more than 25 years of industry experience spanning application development, IT architecture, and network and systems' operation. She co-authored the CADD profile for ANSI NCITS 320-1998 and is a prolific author with books spanning security, cloud, and enterprise architecture.
Producer
Tabitha R.R. Powell
Technical Thought Leadership Evangelist producing content that makes complex ideas clear and engaging.
Guest
Tim Michels
F5 Distinguished Engineer and Architect integrating concepts for platform design, hardware acceleration (custom and COTS), software stack functional delineation and abstraction, to create hyper-converged appliances, chassis, and COTS systems.
