Introduction
A second job for the agent gateway is sitting in front of the models your agents call. Your application talks to one OpenAI-compatible endpoint — the gateway — and the gateway decides which provider actually serves each request. Switching or adding a provider becomes a configuration change, with no application code to touch.
This tutorial routes chat completions through agentgateway to a model running on your own machine, then load-balances across two providers. Using a self-hosted model keeps the whole tutorial free of paid API calls.
This walkthrough uses a local llama.cpp server on port 8090 serving gemma-4-12b, reachable from the cluster at host.docker.internal:8090. Any OpenAI-compatible endpoint works — substitute your own host, port, and model name.
Step 1 — Point a backend at your model
An LLM backend is an AgentgatewayBackend with an ai provider. The host and port under provider override the default endpoint, so the gateway sends requests to your server. Self-hosted endpoints usually need no key, so the auth block is omitted:
kubectl apply -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: local-llm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      host: host.docker.internal
      port: 8090
      openai:
        model: gemma-4-12b-it-Q5_K_M.gguf
EOF
Confirm it is accepted:
kubectl get agentgatewaybackend local-llm -n agentgateway-system
NAME        ACCEPTED   AGE
local-llm   True       4s
Step 2 — Route and test a chat completion
Attach the backend to the gateway:
kubectl apply -f- <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: local-llm
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - backendRefs:
    - name: local-llm
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
Port-forward the proxy in one terminal, then send a standard OpenAI chat-completions request from a second terminal (the port-forward keeps running in the foreground):
kubectl port-forward deployment/agentgateway-proxy -n agentgateway-system 8080:80
curl -s localhost:8080/v1/chat/completions -H 'content-type: application/json' -d '{
  "model": "",
  "messages": [{"role": "user", "content": "Reply with exactly: agentgateway works"}]
}'

{
  "model": "gemma-4-12b-it-Q5_K_M.gguf",
  "choices": [{ "message": { "role": "assistant", "content": "agentgateway works" }, "finish_reason": "stop" }],
  "usage": { "prompt_tokens": 23, "completion_tokens": 57, "total_tokens": 80 }
}
Your application sent a normal OpenAI request to the gateway, and the gateway served it from the model on your machine.
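From application code, the same request can be sketched with nothing but the Python standard library. This is a minimal illustration, not an official client: the `GATEWAY_URL` constant and the `build_chat_request` and `chat` helpers are hypothetical names, and the URL assumes the port-forward from this step is still running on localhost:8080.

```python
import json
import urllib.request

# Assumption: the proxy is port-forwarded to localhost:8080 as in this step.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(prompt: str, url: str = GATEWAY_URL) -> urllib.request.Request:
    """Build a standard OpenAI chat-completions POST aimed at the gateway."""
    payload = {
        # Left empty, exactly as in the curl example above.
        "model": "",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, url: str = GATEWAY_URL) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, url), timeout=60) as resp:
        body = json.load(resp)
    # The reply sits at choices[0].message.content, as in the response shown above.
    return body["choices"][0]["message"]["content"]
```

With the port-forward running, calling `chat("Reply with exactly: agentgateway works")` should return the model's reply. Because the application only knows the gateway URL, pointing it at a different provider later requires no change to this code.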
Step 3 — Load-balance across providers
To spread traffic — or fail over — list several providers in one backend. agentgateway balances across the providers in a group using a power-of-two-choices algorithm. The example below uses two entries pointing at the same local server to keep things free; in production each entry is a different provider, and a hosted one adds an auth block referencing a Secret:
kubectl apply -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: llm-pool
  namespace: agentgateway-system
spec:
  ai:
    groups:
    - providers:
      - name: local-a
        host: host.docker.internal
        port: 8090
        openai:
          model: gemma-4-12b-it-Q5_K_M.gguf
      - name: local-b
        host: host.docker.internal
        port: 8090
        openai:
          model: gemma-4-12b-it-Q5_K_M.gguf
EOF
Route to the pool and test it the same way:
kubectl apply -f- <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-pool
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - backendRefs:
    - name: llm-pool
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
curl -s localhost:8080/v1/chat/completions -H 'content-type: application/json' -d '{
  "model": "",
  "messages": [{"role": "user", "content": "Reply with exactly: pool ok"}]
}'

{
  "model": "gemma-4-12b-it-Q5_K_M.gguf",
  "choices": [{ "message": { "content": "pool ok" }, "finish_reason": "stop" }],
  "usage": { "total_tokens": 105 }
}
To make this a real failover, give one provider entry a different host and model (say, a hosted provider with a policies.auth.secretRef) and keep your self-hosted model as the other. The gateway then balances across healthy providers and routes around one that is failing, while your application keeps calling the same endpoint.
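The power-of-two-choices balancing mentioned at the start of this step can be pictured with a short sketch of the general technique: sample two candidates at random and keep the less loaded one. This is not agentgateway's implementation. The `Provider` class, the `pick_provider` function, and the use of in-flight request count as the load signal are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    in_flight: int = 0  # hypothetical load signal: requests currently outstanding


def pick_provider(providers: list[Provider], rng: random.Random) -> Provider:
    """Power of two choices: sample two distinct candidates, keep the less loaded."""
    if len(providers) == 1:
        return providers[0]
    first, second = rng.sample(providers, 2)
    return first if first.in_flight <= second.in_flight else second


# Two entries named like the pool in this step; local-a is busier.
pool = [Provider("local-a", in_flight=3), Provider("local-b", in_flight=0)]
chosen = pick_provider(pool, random.Random(0))
chosen.in_flight += 1  # the caller records the new outstanding request
print(chosen.name)
```

Comparing only two random candidates avoids scanning the whole pool on every request while still steering traffic away from the busiest provider: when loads differ, the most loaded provider always loses its comparison.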
A note on adding a hosted provider
A hosted provider needs a key. Create a Secret and reference it from the provider’s policies.auth:
# inside a provider entry
- name: openai
  openai:
    model: gpt-4o
  policies:
    auth:
      secretRef:
        name: openai-secret
agentgateway also tracks token usage per request, which is the basis for budgets and cost reporting (see the LLM docs linked below).
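The per-request numbers come from the `usage` block that appears in each chat-completions response, like the ones shown in Steps 2 and 3. As a rough client-side illustration of how those figures add up (the gateway's own tracking is covered in the LLM docs and is not reproduced here), the hypothetical `UsageTotals` helper below accumulates them across responses.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Hypothetical helper: running token totals across responses."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    def add(self, response: dict) -> None:
        """Accumulate the `usage` block of one chat-completions response."""
        usage = response.get("usage", {})
        # Fields missing from a response count as zero.
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)
        self.total_tokens += usage.get("total_tokens", 0)


totals = UsageTotals()
# Usage blocks copied from the two example responses in this tutorial.
totals.add({"usage": {"prompt_tokens": 23, "completion_tokens": 57, "total_tokens": 80}})
totals.add({"usage": {"total_tokens": 105}})
print(totals.total_tokens)  # 185
```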
Clean up
kubectl delete httproute local-llm llm-pool -n agentgateway-system
kubectl delete agentgatewaybackend local-llm llm-pool -n agentgateway-system
What’s next
The gateway now fronts your tools (MCP) and your models (LLM). Next, it carries traffic between agents themselves: agent-to-agent (A2A) communication with consistent security.
Next in this series: Agent-to-Agent Communication.
Summary
- An LLM backend is an AgentgatewayBackend whose ai.provider sets host, port, and openai.model; self-hosted endpoints need no auth block.
- Applications send normal OpenAI chat-completions requests to the gateway, and the gateway serves them from the configured provider.
- ai.groups[].providers[] lists several providers in one backend; the gateway load-balances across them and routes around an unhealthy one.
- A hosted provider adds policies.auth.secretRef; the application endpoint never changes.
