# Performance Guide
This guide focuses on performance tuning that matches the current Cloud Native MCP Server implementation.
## What You Can Tune Today
### 1. Server Timeouts
Recommendations:
- Keep `writeTimeoutSec: 0` for SSE-heavy workloads.
- Increase `readTimeoutSec` for slow clients or large request bodies.
- Increase `idleTimeoutSec` for long-lived clients behind proxies.
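As a concrete starting point, a `server` block covering these timeouts might look like the sketch below. The key names come from this guide; the nesting and values are illustrative assumptions, not documented defaults.

```yaml
server:
  readTimeoutSec: 30   # raise for slow clients or large request bodies
  writeTimeoutSec: 0   # 0 disables the write deadline; keep for SSE streams
  idleTimeoutSec: 300  # raise for long-lived clients behind proxies
```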
### 2. Kubernetes Client Throughput
Recommendations:
- Increase `qps` / `burst` for larger clusters.
- Increase `timeoutSec` for heavy list/watch or expensive queries.
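A `kubernetes` block tuned for a larger cluster might look like this; the field names are taken from this guide, while the nesting and values are illustrative assumptions.

```yaml
kubernetes:
  qps: 50        # steady-state client-side request rate to the API server
  burst: 100     # short-term allowance above qps for bursts of requests
  timeoutSec: 60 # per-request timeout; raise for heavy list/watch operations
```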
### 3. Request Rate Limiting
Recommendations:
- Enable in multi-tenant or internet-facing deployments.
- Start with conservative values and raise gradually from metrics.
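A conservative starting configuration might look like the following. Only the `ratelimit` block name appears in this guide; the field names and values below are assumptions to be adapted to the actual schema.

```yaml
ratelimit:
  enabled: true  # enable for multi-tenant or internet-facing deployments
  qps: 20        # conservative starting rate; raise gradually from metrics
  burst: 40      # short-term burst allowance
```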
### 4. Reduce Unnecessary Service Overhead
Recommendations:
- Only enable services you actually use.
- Disable unused tools for lower memory and startup overhead.
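The shape of the `enableDisable` block below is an assumption; the service names are examples drawn from backends mentioned elsewhere in this guide (Prometheus, Grafana).

```yaml
enableDisable:
  prometheus: true   # keep only the services you actually use
  grafana: false     # disabling unused tools reduces memory and startup overhead
```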
### 5. Audit Cost Control (When Audit Is Enabled)
Recommendations:
- Use sampling on high-QPS workloads.
- Use `storage: database` for large, query-heavy audit datasets.
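Putting both recommendations together, an `audit` block might look like this. `storage: database` comes from this guide; `enabled` and `samplingRate` are hypothetical field names used for illustration.

```yaml
audit:
  enabled: true
  storage: database  # preferred for large, query-heavy audit datasets
  samplingRate: 0.1  # hypothetical key: record ~10% of requests on high-QPS workloads
```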
## Built-In Optimizations (No Public YAML Key)
The server already includes internal optimizations such as:
- response truncation safeguards
- efficient JSON processing paths
- internal caching and pooling in service/tool layers
These are implementation details, not stable public config keys.
Use the observable knobs above (`server`, `kubernetes`, `ratelimit`, `enableDisable`, `audit`) first.
## Performance Metrics to Track
Key metrics to watch:
- request rate
- p95/p99 latency
- error rate
- active connections
- memory and CPU usage
Example queries:
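The queries below sketch how each metric could be expressed in PromQL, assuming standard Prometheus HTTP-server metric names; the actual metric names exported by this server may differ, so substitute the ones your deployment exposes.

```promql
# Request rate (requests/sec over 5m)
sum(rate(http_requests_total[5m]))

# p99 latency from a duration histogram
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate (fraction of 5xx responses)
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```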
## Benchmarking
### Health Endpoint
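A simple load test against the health endpoint can be run with `hey` (github.com/rakyll/hey). The host, port, and path below are assumptions; adjust them to your deployment.

```shell
# 30-second run, 10 concurrent connections
hey -z 30s -c 10 http://localhost:8080/health
```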
### Tool Call Endpoint (Legacy Message Endpoint Compatibility)
Create payload:
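A minimal JSON-RPC payload for an MCP `tools/call` request might be created as follows. The tool name and arguments are placeholders; use a tool your server actually exposes.

```shell
cat > payload.json <<'EOF'
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "example_tool",
    "arguments": {}
  }
}
EOF
```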
Run benchmark:
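With the payload file in place, a `hey` run against the legacy message endpoint could look like this; the endpoint path is an assumption based on the section title.

```shell
hey -z 30s -c 10 -m POST \
  -H "Content-Type: application/json" \
  -D payload.json \
  http://localhost:8080/message
```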
### Production Baseline Example
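A combined baseline pulling the knobs above together might look like the sketch below. The key names come from this guide; the values and nesting are illustrative assumptions to be tuned from your own metrics.

```yaml
server:
  readTimeoutSec: 60
  writeTimeoutSec: 0   # keep for SSE-heavy workloads
  idleTimeoutSec: 300
kubernetes:
  qps: 100
  burst: 200
  timeoutSec: 60
ratelimit:
  enabled: true
```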
## Troubleshooting
### High Latency
Actions:
- Increase `kubernetes.timeoutSec` if backend queries are slow.
- Increase `kubernetes.qps` / `kubernetes.burst` for API-client bottlenecks.
- Check p99 latency and correlate with backend service saturation.
### High Error Rate
Actions:
- Verify auth mode and credentials.
- Check backend service availability (Prometheus/Grafana/etc.).
- Inspect audit logs via `/api/audit/logs` if audit is enabled.
### High Memory Usage
Actions:
- Reduce enabled services/tools.
- Enable audit sampling.
- Reduce burst limits if traffic spikes cause memory pressure.
## Best Practices
- Prefer `streamable-http` in production unless a client requires SSE.
- Tune one dimension at a time (`timeout`, then `qps`/`burst`, then `ratelimit`).
- Keep load tests representative of real tool usage.
- Track p95/p99 latency, not only average latency.