
Routing LLM Traffic Across Providers

Abubakar Siddiq Ango, Senior Developer Advocate
Jun 16, 2026 · 4 min read · Intermediate
ai-infrastructure agentic-ai llm

Prerequisites

  • Completed ‘Installing an Agent Gateway on Kubernetes’ (part 2) — agentgateway and the agentgateway-proxy Gateway are running
  • An OpenAI-compatible model endpoint you can reach from the cluster. This tutorial uses a local llama.cpp server, but Ollama, vLLM, or a hosted provider works the same way.
  • kubectl installed and configured

Introduction

A second job for the agent gateway is sitting in front of the models your agents call. Your application talks to one OpenAI-compatible endpoint — the gateway — and the gateway decides which provider actually serves each request. Switching or adding a provider becomes a configuration change, with no application code to touch.

This tutorial routes chat completions through agentgateway to a model running on your own machine, then load-balances across two providers. Using a self-hosted model keeps the whole tutorial free of paid API calls.

This walkthrough uses a local llama.cpp server on port 8090 serving gemma-4-12b, reachable from the cluster at host.docker.internal:8090. Any OpenAI-compatible endpoint works — substitute your own host, port, and model name.

Step 1 — Point a backend at your model

An LLM backend is an AgentgatewayBackend with an ai provider. The host and port under provider override the provider's default endpoint, so the gateway sends requests to your server instead. Self-hosted endpoints usually need no API key, so this backend has no auth policy:

kubectl apply -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: local-llm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      host: host.docker.internal
      port: 8090
      openai:
        model: gemma-4-12b-it-Q5_K_M.gguf
EOF

Confirm it is accepted:

kubectl get agentgatewaybackend local-llm -n agentgateway-system
NAME        ACCEPTED   AGE
local-llm   True       4s
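
To try a different server, only the host, port, and model name change. As a minimal sketch, the helper below prints the same manifest with those values taken from its arguments, so the output can be piped to kubectl apply. The name render_llm_backend is hypothetical and is not part of agentgateway or kubectl; the manifest fields are the ones shown in Step 1.

```shell
# Hypothetical helper: print an AgentgatewayBackend manifest for an
# OpenAI-compatible endpoint. Nothing is applied to the cluster here.
# $1: backend name, $2: host, $3: port, $4: model name
render_llm_backend() {
  cat <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: $1
  namespace: agentgateway-system
spec:
  ai:
    provider:
      host: $2
      port: $3
      openai:
        model: $4
EOF
}
```

For example, render_llm_backend local-llm host.docker.internal 8090 gemma-4-12b-it-Q5_K_M.gguf | kubectl apply -f - should produce the same backend as the manifest above.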

Step 2 — Route and test a chat completion

Attach the backend to the gateway:

kubectl apply -f- <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: local-llm
  namespace: agentgateway-system
spec:
  parentRefs:
    - name: agentgateway-proxy
      namespace: agentgateway-system
  rules:
    - backendRefs:
      - name: local-llm
        namespace: agentgateway-system
        group: agentgateway.dev
        kind: AgentgatewayBackend
EOF

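Port-forward the proxy in one terminal and leave it running, then send a standard OpenAI chat-completions request from a second terminal. The request leaves model empty, and the response shows the model configured on the backend: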

kubectl port-forward deployment/agentgateway-proxy -n agentgateway-system 8080:80
curl -s localhost:8080/v1/chat/completions -H 'content-type: application/json' -d '{
  "model": "",
  "messages": [{"role": "user", "content": "Reply with exactly: agentgateway works"}]
}'
{
  "model": "gemma-4-12b-it-Q5_K_M.gguf",
  "choices": [{ "message": { "role": "assistant", "content": "agentgateway works" }, "finish_reason": "stop" }],
  "usage": { "prompt_tokens": 23, "completion_tokens": 57, "total_tokens": 80 }
}

Your application sent a normal OpenAI request to the gateway, and the gateway served it from the model on your machine.
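
In a script you usually want only the reply text rather than the full JSON. The sketch below assumes jq is installed and that the port-forward above is still running. The function names reply_text and ask and the GATEWAY_URL variable are hypothetical names chosen for this example.

```shell
# Print only the assistant's reply from a chat-completions response on stdin.
reply_text() {
  jq -r '.choices[0].message.content'
}

# Send one prompt through the gateway and print the reply.
# jq builds the request body so the prompt is JSON-escaped correctly.
# GATEWAY_URL is a hypothetical variable; it defaults to the port-forward address.
ask() {
  jq -n --arg prompt "$1" \
    '{model: "", messages: [{role: "user", content: $prompt}]}' |
    curl -s "${GATEWAY_URL:-http://localhost:8080}/v1/chat/completions" \
      -H 'content-type: application/json' -d @- |
    reply_text
}
```

With the gateway reachable, ask "Reply with exactly: agentgateway works" prints just the model's answer.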

Step 3 — Load-balance across providers

To spread traffic, or to fail over, list several providers in one backend. agentgateway balances across the providers in a group using a power-of-two-choices algorithm. To keep things free, the example below uses two entries that point at the same local server, so both entries return identical responses. In production each entry is a different provider, and a hosted one adds a policies.auth block that references a Secret:

kubectl apply -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: llm-pool
  namespace: agentgateway-system
spec:
  ai:
    groups:
      - providers:
          - name: local-a
            host: host.docker.internal
            port: 8090
            openai:
              model: gemma-4-12b-it-Q5_K_M.gguf
          - name: local-b
            host: host.docker.internal
            port: 8090
            openai:
              model: gemma-4-12b-it-Q5_K_M.gguf
EOF

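Route to the pool and test it the same way. Both routes attach to the same gateway with no match rules, so they overlap, and Gateway API conflict resolution gives precedence to the older route. Delete the Step 2 route first (kubectl delete httproute local-llm -n agentgateway-system) so that requests reach the pool: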

kubectl apply -f- <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-pool
  namespace: agentgateway-system
spec:
  parentRefs:
    - name: agentgateway-proxy
      namespace: agentgateway-system
  rules:
    - backendRefs:
      - name: llm-pool
        namespace: agentgateway-system
        group: agentgateway.dev
        kind: AgentgatewayBackend
EOF
curl -s localhost:8080/v1/chat/completions -H 'content-type: application/json' -d '{
  "model": "", "messages": [{"role": "user", "content": "Reply with exactly: pool ok"}]
}'
{ "model": "gemma-4-12b-it-Q5_K_M.gguf",
  "choices": [{ "message": { "content": "pool ok" }, "finish_reason": "stop" }],
  "usage": { "total_tokens": 105 } }

To make this a real failover, give one provider entry a different host and model (say, a hosted provider with a policies.auth.secretRef) and keep your self-hosted model as the other. The gateway then balances across the healthy providers and routes around one that is failing, while your application keeps calling the same endpoint.
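
One rough way to watch the pool from the client side is to send a handful of requests and count how many succeed. The sketch below assumes the port-forward from Step 2 is still running; probe_pool and count_ok are hypothetical helper names. It only reports HTTP status codes and does not show which provider served each request.

```shell
# Send N identical chat requests and print one HTTP status code per line.
# $1: full chat-completions URL, $2: number of requests
probe_pool() {
  i=0
  while [ "$i" -lt "$2" ]; do
    curl -s -o /dev/null -w '%{http_code}\n' "$1" \
      -H 'content-type: application/json' \
      -d '{"model": "", "messages": [{"role": "user", "content": "ping"}]}'
    i=$((i + 1))
  done
}

# Read status codes on stdin and print how many of them were 200.
count_ok() {
  awk '$1 == 200 { ok++ } END { printf "%d/%d ok\n", ok, NR }'
}
```

Run it as probe_pool http://localhost:8080/v1/chat/completions 10 | count_ok. Once the pool has two distinct providers, stopping one of them and repeating the probe is a simple check that requests still succeed.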

A note on adding a hosted provider

A hosted provider needs a key. Create a Secret and reference it from the provider’s policies.auth:

# inside a provider entry
- name: openai
  openai:
    model: gpt-4o
  policies:
    auth:
      secretRef:
        name: openai-secret
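
The Secret itself might look like the sketch below. The helper name render_provider_secret is hypothetical, and the Authorization key under stringData is an assumption about what agentgateway expects in the Secret; confirm the exact key and value format in the agentgateway documentation before relying on it.

```shell
# Hypothetical helper: print a Secret manifest holding a provider API key.
# ASSUMPTION: the key is stored under "Authorization"; verify this against
# the agentgateway docs for your version.
# $1: Secret name, $2: API key
render_provider_secret() {
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: $1
  namespace: agentgateway-system
type: Opaque
stringData:
  Authorization: $2
EOF
}
```

With the key exported as OPENAI_API_KEY, render_provider_secret openai-secret "$OPENAI_API_KEY" | kubectl apply -f - creates the Secret that the secretRef above points at. Piping straight to kubectl keeps the key out of a file on disk.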

agentgateway also tracks token usage per request, which is the basis for budgets and cost reporting (see the agentgateway LLM documentation).
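
The usage object in each response also lets a client keep its own rough tally, separate from whatever the gateway records. The sketch below assumes jq is installed and that responses have been saved one after another in a file; total_tokens is a hypothetical helper name, and this is a client-side sum, not the gateway's own accounting.

```shell
# Sum usage.total_tokens over a stream of chat-completions responses on stdin.
# Responses without a usage field count as 0; empty input prints 0.
total_tokens() {
  jq -s 'map(.usage.total_tokens // 0) | add // 0'
}
```

For example, appending each curl response to responses.json and running total_tokens < responses.json prints the combined token count for the session.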

Clean up

kubectl delete httproute local-llm llm-pool -n agentgateway-system --ignore-not-found
kubectl delete agentgatewaybackend local-llm llm-pool -n agentgateway-system --ignore-not-found

What’s next

The gateway now fronts your tools (MCP) and your models (LLM). Next, it carries traffic between agents themselves: agent-to-agent (A2A) communication with consistent security.

Next in this series: Agent-to-Agent Communication.

Summary

  • An LLM backend is an AgentgatewayBackend whose ai.provider sets host, port, and openai.model; self-hosted endpoints usually need no auth policy.
  • Applications send normal OpenAI chat-completions requests to the gateway, and the gateway serves them from the configured provider.
  • ai.groups[].providers[] lists several providers in one backend; the gateway load-balances across them and routes around an unhealthy one.
  • A hosted provider adds policies.auth.secretRef; the application endpoint never changes.