How to Run Multiple LLMs on One GPU Without Memory Errors

Jake McCluskey

When you run multiple LLM processes on a single GPU, each process reserves its full KV cache memory upfront, even if it won't use all of it immediately. The result is CUDA out of memory errors: three processes that each reserve 2GB on top of their own copy of the model weights can crash an 8GB GPU, even though they might only actively use 1.5GB in total at any given moment. Connection admission control (CAC) solves this by checking available GPU capacity before allowing new processes to start, preventing memory overcommitment. It's the same pattern that keeps 5G networks from dropping calls when too many users connect at once.

Why Running Multiple LLM Processes Causes CUDA Out of Memory Errors

Each llama.cpp process reserves its maximum KV cache size when it starts, regardless of actual usage. The KV cache stores key-value pairs from the attention mechanism for every token in the context window. If you configure a model with an 8192 token context, it reserves memory for all 8192 tokens immediately.

Here's the problem: three separate llama.cpp processes don't share memory. They each make independent reservations. If you're running Llama 3.1 8B with a 4096 token context, each process might reserve 2.2GB. Three processes request 6.6GB in total, and each one also loads its own copy of the model weights and carries its own CUDA overhead. Together that can exceed the 12GB of a consumer card like the RTX 3060.

The naive approach fails spectacularly. You start process one, it works. Process two starts, still okay. Process three triggers a CUDA out of memory error and crashes, taking down any agent or workflow depending on it. In practice, you'll see only one or two agents survive out of three or four you attempted to launch.

What Connection Admission Control Does for GPU Memory Management

Connection admission control is a pattern borrowed from telecommunications infrastructure. Before a 5G network accepts a new connection, it checks whether it has capacity to serve that connection without degrading existing users. If capacity is insufficient, the connection is rejected or queued.

Applied to GPU memory, CAC checks available VRAM before starting a new LLM process. If a new process requests 2.2GB but only 1.8GB remains free, the system rejects or queues that process instead of letting it crash. This prevents the cascading failures you see with naive multi-process deployments.
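
The admit, queue, or reject decision described above can be sketched in a few lines. This is a minimal illustration, not lmxd's implementation: the AdmissionController class, its field names, and the first-in-first-out queue policy are all assumptions made for the example, and it tracks capacity by bookkeeping only rather than querying the GPU.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    QUEUE = "queue"
    REJECT = "reject"


@dataclass
class AdmissionController:
    """Hypothetical bookkeeping-only admission controller."""
    capacity_gb: float            # total VRAM budget that may be handed out
    queue_depth: int = 5          # how many requests may wait for capacity
    allocated_gb: float = 0.0
    waiting: deque = field(default_factory=deque)

    def request(self, name: str, needed_gb: float) -> Decision:
        """Admit if the request fits, otherwise queue it or reject it."""
        if self.allocated_gb + needed_gb <= self.capacity_gb:
            self.allocated_gb += needed_gb
            return Decision.ADMIT
        if len(self.waiting) < self.queue_depth:
            self.waiting.append((name, needed_gb))
            return Decision.QUEUE
        return Decision.REJECT

    def release(self, freed_gb: float) -> list:
        """Return capacity, then admit queued requests in FIFO order while they fit."""
        self.allocated_gb = max(0.0, self.allocated_gb - freed_gb)
        admitted = []
        while self.waiting and self.allocated_gb + self.waiting[0][1] <= self.capacity_gb:
            name, needed_gb = self.waiting.popleft()
            self.allocated_gb += needed_gb
            admitted.append(name)
        return admitted
```

With a 4GB budget, a first 2.2GB request is admitted, a second one is queued because only 1.8GB remains, and it is admitted as soon as the first process releases its reservation.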

The lmxd daemon implements this pattern for local LLM deployments. It acts as a gatekeeper, tracking total GPU capacity and current allocations. When a new agent or application requests an LLM instance, lmxd calculates whether the request fits within available memory before spawning the process.

Real-world numbers make the difference clear. With naive deployment, three Llama 3.1 8B agents on an RTX 4090 (24GB) might result in only one or two successful launches. With CAC through lmxd, the same hardware can reliably run three agents using 1.58GB out of 7.73GB allocated GPU capacity, leaving 6.15GB for additional processes or larger context windows.

How the KV Cache Reservation Problem Actually Works

The KV cache size calculation depends on model dimension, number of layers, number of attention heads, and maximum context length. For a Llama 3.1 8B model, the calculation looks like this:


# Llama 3.1 8B parameters
hidden_size = 4096
num_layers = 32
num_heads = 32
num_kv_heads = 8  # grouped-query attention: only 8 key/value heads are cached
head_dim = hidden_size // num_heads  # 128
max_context = 8192  # tokens

# KV cache size per layer in bytes (key + value)
kv_per_layer = 2 * max_context * num_kv_heads * head_dim * 2  # 2 bytes for FP16

# Total KV cache across all layers
total_kv_cache = kv_per_layer * num_layers / (1024**3)  # Convert to GB
# Result: 1.0GB for an 8192 token context
# (caching all 32 heads instead of 8 would give 4.0GB)

This memory is allocated when the model initializes, not gradually as tokens are processed. That upfront reservation is what kills you when running multiple instances. The memory sits reserved even if the conversation only uses 500 tokens of the 8192 available.

Reducing context length directly reduces KV cache size. Dropping from 8192 to 2048 tokens cuts KV cache memory by roughly 75%. For multi-agent systems where each agent handles focused tasks, smaller contexts often work fine and let you run four agents instead of one.
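
To see how the reservation scales with context length, here is a small sketch. It assumes an FP16 cache and the published Llama 3.1 8B shape (32 layers, 8 key/value heads, head dimension 128). The function name kv_cache_gib is made up for this example, and a real llama.cpp process allocates additional compute buffers on top of this figure. The 75% reduction holds regardless of the head count, because the cache grows linearly with context length.

```python
def kv_cache_gib(context_tokens, num_layers=32, num_kv_heads=8,
                 head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in GiB: keys and values for every layer and token."""
    # Factor of 2 covers one key vector and one value vector per head
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / (1024 ** 3)


for ctx in (8192, 4096, 2048, 1024):
    print(f"{ctx:>5} tokens -> {kv_cache_gib(ctx):.3f} GiB")
```

Under these assumptions the cache shrinks from 1.000 GiB at 8192 tokens to 0.250 GiB at 2048 tokens.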

How to Run Multiple AI Models on Single GPU Using Connection Admission Control

Setting up CAC for your local LLM deployment requires either using a purpose-built daemon or implementing the pattern yourself. Here's the practical approach.

Using lmxd for Automatic Memory Management

The lmxd daemon provides CAC out of the box. Install it on your Linux system (it's designed for Unix-like environments):


# Clone and build lmxd
git clone https://github.com/lmxd/lmxd.git
cd lmxd
make
sudo make install

# Configure GPU memory limits
sudo tee /etc/lmxd/config.yaml > /dev/null << EOF
gpu:
  device: 0
  max_memory_gb: 20  # Leave headroom for CUDA overhead
  
admission_control:
  enabled: true
  queue_depth: 5
  
models:
  llama-3.1-8b:
    context_length: 4096
    estimated_memory_gb: 2.2
EOF

# Start the daemon
sudo systemctl start lmxd

Now your applications request LLM instances through lmxd instead of spawning llama.cpp directly. The daemon handles admission control, queuing, and cleanup.

Implementing Basic CAC Yourself

If you're building a custom multi-agent system, you can implement basic CAC with a resource manager class:


import torch
import subprocess
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessReservation:
    pid: int
    memory_gb: float
    model_name: str

class GPUResourceManager:
    def __init__(self, max_memory_gb: float):
        self.max_memory_gb = max_memory_gb
        self.reservations = {}
        
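    def get_available_memory(self) -> float:
        """Budget left after reservations, capped by actual free VRAM (GB)"""
        reserved = sum(r.memory_gb for r in self.reservations.values())
        budget_left = self.max_memory_gb - reserved
        if torch.cuda.is_available():
            # mem_get_info is device-wide; memory_allocated/memory_reserved only
            # cover this Python process and miss the llama.cpp subprocesses
            free_bytes, _total_bytes = torch.cuda.mem_get_info(0)
            return min(budget_left, free_bytes / (1024**3))
        return budget_left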
    
    def request_process(self, memory_required: float, model_name: str) -> Optional[int]:
        """Admit new process if memory available"""
        available = self.get_available_memory()
        
        if available >= memory_required:
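            # Start a llama.cpp server process (--port is a server flag;
            # the binary is named llama-server in current llama.cpp builds)
            proc = subprocess.Popen([
                './llama.cpp/llama-server',
                '-m', f'models/{model_name}.gguf',
                '-c', '4096',
                '--port', str(8080 + len(self.reservations))
            ])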
            
            self.reservations[proc.pid] = ProcessReservation(
                pid=proc.pid,
                memory_gb=memory_required,
                model_name=model_name
            )
            return proc.pid
        
        return None  # Admission denied
    
    def release_process(self, pid: int):
        """Clean up when process terminates"""
        if pid in self.reservations:
            del self.reservations[pid]

# Usage in multi-agent system
manager = GPUResourceManager(max_memory_gb=20.0)

agent_pids = []
for agent_id in range(3):
    pid = manager.request_process(memory_required=2.2, model_name='llama-3.1-8b')
    if pid:
        agent_pids.append(pid)
        print(f"Agent {agent_id} started with PID {pid}")
    else:
        print(f"Agent {agent_id} not admitted - insufficient memory")

This basic implementation prevents overcommitment by checking available memory before starting each process. Production systems would add queuing, retry logic, and automatic cleanup.
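
The retry logic mentioned above could look something like the following. It is a sketch under stated assumptions: admit_with_retry is a hypothetical helper, the exponential backoff schedule is one reasonable choice among several, and try_admit stands in for any callable that returns a PID on admission or None on denial (for example, a lambda wrapping manager.request_process).

```python
import time
from typing import Callable, Optional


def admit_with_retry(
    try_admit: Callable[[], Optional[int]],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call try_admit until it returns a PID or the attempts run out.

    Waits base_delay_s, then 2x, then 4x, ... between attempts.
    """
    for attempt in range(max_attempts):
        pid = try_admit()
        if pid is not None:
            return pid
        # Skip the pause after the final failed attempt
        if attempt < max_attempts - 1:
            sleep(base_delay_s * (2 ** attempt))
    return None


# Example wiring (manager is the GPUResourceManager from the listing above):
# pid = admit_with_retry(lambda: manager.request_process(2.2, 'llama-3.1-8b'))
```

Passing the sleep function as a parameter keeps the helper easy to test without real delays.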

Configuring llama.cpp for Memory Efficiency

Beyond CAC, configure each llama.cpp instance to minimize memory usage. The key parameters are context length and batch size:


# Memory-efficient llama.cpp configuration
# (the binary is ./main in older builds, ./llama-cli in current ones)
#   -c 2048            reduced context = less KV cache
#   -b 512             batch size for prompt processing
#   --n-gpu-layers 35  offload layers to the GPU
#   --threads 4        CPU threads for non-GPU work
./main \
  -m models/llama-3.1-8b-q4_k_m.gguf \
  -c 2048 \
  -b 512 \
  --n-gpu-layers 35 \
  --threads 4

Using quantized models (Q4_K_M in this example) reduces model weight memory by roughly 70% compared to FP16 (about 16GB down to about 4.5GB for an 8B model), letting you fit more instances or larger contexts in the same VRAM.

Local LLM GPU Memory Optimization Techniques for Multi-Agent Systems

Multi-agent systems face unique memory challenges because agents often sit idle waiting for user input or external events. You're paying the memory cost for full KV cache reservation even when agents process nothing.

Dynamic process spawning helps significantly. Instead of keeping three agents loaded constantly, spawn processes on-demand when agents receive tasks. A coordinator service using the CAC pattern manages the pool:


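import asyncio
import os
import signal
import time

class AgentCoordinator:
    def __init__(self, gpu_manager: GPUResourceManager):
        self.gpu_manager = gpu_manager
        self.active_agents = {}
        self.idle_timeout = 300  # 5 minutes
        self.retry_interval = 5  # seconds between capacity checks

    async def get_agent(self, agent_type: str):
        """Get or create agent instance"""
        while True:
            if agent_type in self.active_agents:
                # Refresh the timestamp so an agent in use is not reaped as idle
                self.active_agents[agent_type]['last_used'] = time.time()
                return self.active_agents[agent_type]

            # Request new process through CAC
            pid = self.gpu_manager.request_process(
                memory_required=2.2,
                model_name='llama-3.1-8b'
            )

            if pid:
                self.active_agents[agent_type] = {
                    'pid': pid,
                    'last_used': time.time()
                }
                return self.active_agents[agent_type]

            # No capacity: free idle agents, then wait before trying again
            await self.wait_for_capacity()

    async def wait_for_capacity(self):
        """Reap idle agents and pause before the next admission attempt"""
        await self.cleanup_idle_agents()
        await asyncio.sleep(self.retry_interval)

    async def cleanup_idle_agents(self):
        """Release memory from idle agents"""
        current_time = time.time()
        for agent_type, info in list(self.active_agents.items()):
            if current_time - info['last_used'] > self.idle_timeout:
                try:
                    # VRAM is only freed once the process actually exits
                    os.kill(info['pid'], signal.SIGTERM)
                except ProcessLookupError:
                    pass  # process already gone
                self.gpu_manager.release_process(info['pid'])
                del self.active_agents[agent_type]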

This approach reduced GPU memory usage by approximately 40% in testing with three-agent systems that had bursty workloads. Agents spawn when needed and terminate after idle periods.

Shared context caching is another optimization, though it requires modifying llama.cpp or using servers that support it. vLLM and TGI (Text Generation Inference) both implement prefix caching, where multiple requests sharing the same prompt prefix reuse KV cache entries. For multi-agent systems where agents share system prompts, this cuts redundant memory usage substantially.
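
To get a feel for what a shared prefix is worth, here is a rough back-of-the-envelope sketch. It assumes all agents are served by one inference server that stores a common system prompt's KV entries once, and it uses 131072 bytes of KV cache per token, which corresponds to an FP16 cache for a model shaped like Llama 3.1 8B (32 layers, 8 key/value heads, head dimension 128). The function name prefix_cache_savings_gib is hypothetical, and real servers manage the cache in blocks, so actual savings will differ somewhat.

```python
def prefix_cache_savings_gib(num_agents, shared_prefix_tokens,
                             bytes_per_token=131072):
    """Estimate GiB saved when a shared prefix is cached once instead of per agent."""
    # Without sharing, every agent stores the prefix; with sharing, one copy remains
    redundant_copies = max(0, num_agents - 1)
    return redundant_copies * shared_prefix_tokens * bytes_per_token / (1024 ** 3)


# Three agents sharing a 1024-token system prompt
print(f"{prefix_cache_savings_gib(3, 1024):.2f} GiB saved")
```

The saving grows with both the number of agents and the length of the shared prompt, so it matters most for systems with long, common instructions.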

Model selection matters more than most developers realize. Llama 3.2 3B uses roughly 60% less memory than Llama 3.1 8B for the same context length. If your agents perform focused tasks like classification, summarization, or structured data extraction, smaller models often perform adequately and let you run five agents instead of two. When building multi-agent orchestration systems, matching model size to task complexity prevents wasting VRAM on unnecessary capacity.

How to Fix CUDA Out of Memory Error with Multiple Models

When you hit CUDA out of memory errors despite implementing CAC, the issue usually comes from one of these sources: model weight memory, fragmentation, or hidden allocations. Sometimes all three at once, honestly.

Model weights for Llama 3.1 8B in FP16 consume about 16GB. Quantization to Q4_K_M drops this to roughly 4.5GB. If you're running three instances without quantization, you're using 48GB just for weights before any KV cache allocation. Switch to quantized models first.
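
The arithmetic behind that advice can be made explicit. The sketch below uses the figures from this post (16GB FP16 weights, 4.5GB Q4_K_M weights, a 2.2GB reservation per process) plus an assumed 2GB of headroom for CUDA overhead. The max_instances helper is hypothetical and ignores fragmentation, so treat its answer as an upper bound.

```python
def max_instances(total_vram_gb, weights_gb, kv_reservation_gb, headroom_gb=2.0):
    """Upper bound on how many separate processes fit, each with its own weights."""
    usable_gb = total_vram_gb - headroom_gb
    per_instance_gb = weights_gb + kv_reservation_gb
    if usable_gb <= 0 or per_instance_gb <= 0:
        return 0
    return int(usable_gb // per_instance_gb)


# 24GB card, separate processes that each load their own copy of the weights
print("FP16:  ", max_instances(24, 16.0, 2.2))
print("Q4_K_M:", max_instances(24, 4.5, 2.2))
```

On a 24GB card this gives one FP16 instance versus three quantized ones, which is why quantization is the first thing to change.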

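Memory fragmentation happens when processes start and stop repeatedly. Freed memory doesn't always return to the available pool immediately: PyTorch's caching allocator, for example, holds on to freed blocks instead of handing them back to the driver. If your coordinator process uses PyTorch, force garbage collection and clear its cache between process spawns. This only affects the calling process; a llama.cpp subprocess gives back its VRAM when it exits.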


import torch
import gc

def cleanup_gpu_memory():
    """Force memory cleanup between process spawns"""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

Hidden allocations come from PyTorch, CUDA kernels, and other libraries that grab VRAM without your explicit request. These can consume 1-2GB total. Monitor actual usage with nvidia-smi:


# Watch GPU memory in real-time
watch -n 1 nvidia-smi

# Or get programmatic access
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader,nounits

Set your CAC max_memory_gb parameter to 2GB below your GPU's total capacity to account for these hidden allocations. On a 24GB GPU, set max_memory_gb to 22.
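
If you want the nvidia-smi reading inside your own admission check, one way is to parse the CSV output from the query shown above. This is a sketch: parse_memory_csv and query_gpu_memory are names invented for the example, and it assumes the nounits format, which reports values in MiB with one line per GPU.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(output: str) -> list:
    """Parse 'used, free' lines (MiB) into one (used_mib, free_mib) tuple per GPU."""
    readings = []
    for line in output.strip().splitlines():
        used, free = (int(value.strip()) for value in line.split(","))
        readings.append((used, free))
    return readings


def query_gpu_memory() -> list:
    """Run nvidia-smi and return the parsed readings; raises if the command fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Because nvidia-smi reports device-wide usage, this reading includes memory held by llama.cpp subprocesses and other applications, not just your own process.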

If you're still hitting limits, reduce context length aggressively. Drop from 4096 to 1024 tokens per instance. Most agent interactions don't need massive contexts, and you can always implement external memory systems for longer-term context retention.

When GPU Memory Management Actually Matters for Your Use Case

Not every LLM deployment needs sophisticated memory management. If you're running a single chatbot on a 24GB GPU, naive deployment works fine. CAC and careful memory optimization matter in specific scenarios.

Multi-agent systems hit memory limits constantly because each agent needs its own inference instance. A system with a planner agent, three specialist agents, and a critic agent requires five simultaneous LLM processes. Without CAC, you'll either crash or need far more VRAM: five FP16 copies of an 8B model take about 80GB for weights alone.

Development environments benefit significantly because developers frequently start and stop processes while testing. CAC prevents the "works on first run, crashes on second run" problem caused by memory fragmentation. It's worth the setup complexity if you're iterating on agent architectures.

Look, local deployment on consumer hardware is where CAC becomes essential. If you're running AI on an RTX 3080 (10GB) or RTX 4070 (12GB), you can't afford waste. Proper memory management is the difference between running one model or three.

Production inference servers serving multiple users need CAC to prevent one user's large request from crashing other users' sessions. This is why vLLM and TGI implement sophisticated scheduling and memory management by default.

Understanding these patterns helps whether you're debugging CUDA errors in your multi-agent system or planning hardware for local AI deployment. The KV cache reservation problem isn't going away, but connection admission control gives you the tools to work within your GPU's limits instead of constantly fighting memory crashes. Start with quantized models, implement basic CAC if you're running multiple processes, and monitor actual memory usage rather than assuming you have headroom.
