Ask HN: How does the same LLM "instance" serve multiple clients?
I've been playing with running LLMs locally and only then realized I have no idea how to scale it (I don't really know how LLMs work internally).
I'm assuming context is everything, but if the same LLM process can serve multiple clients, aren't there risks of mixing contexts? Does anyone have any ideas?
Let me ChatGPT for you:
Good question. Let’s break it down carefully.
When you hear about a single LLM instance serving multiple clients at the same time, it usually works like this:

• The LLM instance is stateless: each client sends a request (prompt + settings), the model processes that one request independently, and returns the response. The LLM doesn't "remember" anything between requests unless you explicitly include the conversation history in the prompt.

• Concurrency is handled by infrastructure: even though the LLM is "one model," the backend (server) wraps it so it can handle many incoming requests, using techniques like:

  • Asynchronous request handling (e.g., async/await patterns)

  • Batching: multiple prompts are packed together into a single forward pass through the model (very common in high-traffic servers; see the sketch after this list)

  • Parallelism: the server can run multiple workers/replicas of the model (copies, or shared GPUs) side by side

  • Queueing: if too many clients arrive at once, requests are queued and processed in order

• Memory isolation: each request is kept separate in memory. No client's data leaks into another client's conversation unless you (the app developer) introduce a bug.
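Here's a minimal sketch of the batching point, assuming a Hugging Face causal LM ("gpt2" is just a stand-in model name). Two clients' prompts are packed into one padded batch and run through a single forward pass, and the attention mask keeps each row's tokens from attending to the other client's sequence:

    # Minimal batching sketch (assumes: pip install torch transformers).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
    tokenizer.padding_side = "left"             # left-pad for decoder-only generation

    # Prompts from two different "clients" arrive around the same time.
    prompts = [
        "Client A asks: what is the capital of France?",
        "Client B asks: write a haiku about rain.",
    ]

    # Pack both prompts into one padded batch -> one forward pass per decode step.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=30,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Each output row corresponds to exactly one client's prompt; the attention
    # mask prevents the two sequences from mixing.
    for prompt, out in zip(prompts, outputs):
        print(tokenizer.decode(out, skip_special_tokens=True))

Production servers do a fancier version of this (continuous batching, shared KV-cache management), but the isolation principle is the same: separate rows in the batch, separate outputs.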
So:
It’s not that one model is “locked” into serving only one person at a time. It’s more like the model is a very fast function being called many times in parallel.
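To make the "fast function called many times in parallel" picture concrete, here's a toy asyncio sketch. run_model() is a made-up placeholder for the real inference call; the point is that each request carries its own prompt and gets its own response, with no shared conversation state inside the model:

    import asyncio

    async def run_model(prompt: str) -> str:
        # Placeholder for the actual (GPU-bound) forward pass.
        await asyncio.sleep(0.1)          # simulate compute time
        return f"response to: {prompt!r}"

    async def handle_client(client_id: int, prompt: str) -> None:
        # Each request is handled independently; isolation comes from
        # passing the context in, not from the model "remembering" anything.
        reply = await run_model(prompt)
        print(f"client {client_id} -> {reply}")

    async def main() -> None:
        # Many clients "call the function" concurrently; the server interleaves them.
        await asyncio.gather(*(
            handle_client(i, f"question from client {i}") for i in range(5)
        ))

    asyncio.run(main())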