Architecture at a Glance
RLAAS separates policy storage from counter storage and uses a canonical internal model so every deployment mode shares the same evaluation engine.
A policy-driven platform for enforcing limits, quotas, and traffic control across APIs and service workloads — built in Go for speed, designed for any stack.
Whether you embed, centralize, or sidecar — the same policy engine powers every decision.
Embed (library): import directly into your Go service. Sub-millisecond local decisions. No network hop.
Centralize (service): language-agnostic decision service. Centralized governance, unified telemetry, one version to manage.
Sidecar: run alongside your app in Kubernetes. Local latency, central governance. Best of both worlds.
Everything you need to protect APIs, control telemetry, manage quotas, and enforce traffic policy at scale.
Fixed window, sliding window (log & counter), token bucket, leaky bucket, concurrency limiter, and quota/budget limiter — all behind a single interface.
Go beyond allow/deny. Support delay, sample, drop, downgrade, drop-low-priority, and shadow-only actions per policy.
Match on 20+ dimensions: org, tenant, service, endpoint, method, user, API key, region, tags, and more. Advanced match_expr expressions supported.
Shadow mode for dry-run evaluation. Gradual rollout percentages. Version history with one-click rollback.
Built-in analytics summary with tag aggregation. Full audit trail and version history for every policy change.
Configure fail-open or fail-closed per policy. Graceful degradation when backends are unavailable.
Lock-sharded in-memory counters (~6 ns/op). Async invalidation with bounded workers. Burst coalescing in sidecar sync.
Weighted regional allocation primitives. Overflow detection across regions. Built for global deployments.
Processor primitives for batch log and span filtering. Worker pools with fail-open/closed. Control telemetry volume per policy.
From HTTP ingress to background jobs — one platform covers all your rate limiting needs.
Per-IP, per-API-key, per-user, per-endpoint, per-org throttling for any REST service.
Per-method, per-service, per-tenant concurrency & rate limiting via interceptors.
Control log/span/trace volume per org, service, severity, or attribute set.
Per-topic, per-consumer-group, per-event-type limits for Kafka, Pub/Sub, SQS, NATS.
Per-job-type, per-org, per-workflow-step throttling for batch and async workloads.
Login attempts, OTP generation, password resets, device registration — protect every auth flow.
Choose the right algorithm for each use case, or let the policy engine decide.
| Algorithm | Best For | Trade-off |
|---|---|---|
| Fixed Window | Simple org-wide limits, low-complexity quotas | Boundary burst possible |
| Sliding Window Log | Security-sensitive exact checks, low-volume | Higher memory cost |
| Sliding Window Counter | APIs, OTEL signals, general distributed loads | Counts are approximate, not exact |
| Token Bucket | REST/gRPC throttling, burst control | Refill math & atomicity |
| Leaky Bucket | Egress smoothing, outbound traffic shaping | Less intuitive |
| Concurrency Limiter | DB-heavy ops, file processing, dependency protection | Requires acquire/release lifecycle |
| Quota / Budget | SaaS plan enforcement, daily/monthly budgets | Not for short-burst protection alone |
Native Go plus eight HTTP client SDKs — integrate in minutes, regardless of your tech stack.
Embedded library with direct engine access. Sub-millisecond decisions, zero network hop.
Lightweight requests-based client. Full API coverage including analytics and audit.
Modern fetch-based client. Type-safe interfaces. Works in Node.js and edge runtimes.
Zero-dependency pure JS client. require() or import. Express & Fastify middleware included.
java.net.http.HttpClient with Jackson. Java 11+ compatible. Full CRUD support.
HttpClient with System.Text.Json. Async/await, CancellationToken. .NET 8 ready.
libcurl + nlohmann/json. CMake FetchContent integration. C++17, thread-safe client.
Async reqwest + tokio. Fully typed with serde. Clone-able for multi-task use.
Zero runtime dependencies — stdlib only. Rails before_action middleware included.
Simple, RESTful endpoints. One POST /rlaas/v1/check call to get a rate-limit decision.
| Method | Endpoint | Description |
|---|---|---|
| POST | /rlaas/v1/check | Evaluate a rate-limit decision |
| POST | /rlaas/v1/acquire | Acquire a concurrency lease |
| POST | /rlaas/v1/release | Release a concurrency lease |
| GET | /rlaas/v1/policies | List all policies |
| POST | /rlaas/v1/policies | Create a new policy |
| GET | /rlaas/v1/policies/{id} | Get a specific policy |
| PUT | /rlaas/v1/policies/{id} | Update a policy |
| DELETE | /rlaas/v1/policies/{id} | Delete a policy |
| GET | /rlaas/v1/policies/{id}/audit | Policy change audit trail |
| GET | /rlaas/v1/policies/{id}/versions | Policy version history |
| POST | /rlaas/v1/policies/{id}/rollout | Update rollout percentage |
| POST | /rlaas/v1/policies/{id}/rollback | Rollback to a previous version |
| POST | /rlaas/v1/policies/validate | Validate a policy definition |
| GET | /rlaas/v1/analytics/summary | Decision analytics summary |
RLAAS handles every rate limiting challenge that comes with running LLMs, GPU training, AI agents, and RAG pipelines in production.
Enforce daily or monthly token limits per user, per org, or per pricing tier — scoped to a specific model. Pass quantity: token_count and the quota algorithm deducts real token cost, not just request count.
When a premium model's RPM is exhausted, return action: "downgrade" — the caller routes to GPT-3.5 or Claude Haiku automatically. No error, no user-visible failure, no redeploy.
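On the caller side, honoring a downgrade can be a simple branch on the returned action. The `routeModel` helper below is an illustrative sketch, not an SDK function; only the action names come from the document.

```go
package main

import "fmt"

// routeModel picks a model from a decision action: premium while
// capacity allows, the fallback on "downgrade" (no error, no
// user-visible failure), and an error for anything else.
func routeModel(action, premium, fallback string) (string, error) {
	switch action {
	case "allow":
		return premium, nil
	case "downgrade":
		return fallback, nil
	default:
		return "", fmt.Errorf("request rejected: %s", action)
	}
}

func main() {
	m, _ := routeModel("downgrade", "gpt-4o", "gpt-3.5-turbo")
	fmt.Println(m) // the caller silently uses the cheaper model
}
```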
Cap concurrent GPU training jobs per org with the concurrency limiter. Acquire/release holds the slot for the full job duration — auto-released via TTL if the job crashes.
Autonomous agents can spiral into thousands of tool calls per minute. Rate-limit each tool type per session with a sliding window — give agents the retry_after hint so they can plan around the constraint.
Map API cost to RLAAS units (e.g. 1 unit = $0.00001) and enforce a hard daily spend cap per tenant. No surprise bills. Change the budget live via the policy API — zero deploy.
Two-phase pattern for streaming LLM responses: pre-check on known input tokens, then deduct actual output tokens after the stream ends. Fail-open on the second phase so a network blip never breaks UX.
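The two-phase pattern can be sketched as below. The `checker` interface stands in for the HTTP client, and `fakeChecker` makes the sketch run locally; both are assumptions for illustration. The essential part is that the second phase swallows its error, so a network blip never breaks the stream.

```go
package main

import (
	"errors"
	"fmt"
)

// checker abstracts the two calls the pattern needs; in production
// these would hit /rlaas/v1/check.
type checker interface {
	check(org string, tokens int) (bool, error)
	deduct(org string, tokens int) error
}

// streamWithQuota: hard pre-check on known input tokens, then a
// best-effort deduction of actual output tokens after the stream ends.
func streamWithQuota(c checker, org string, inputTokens int, stream func() int) error {
	allowed, err := c.check(org, inputTokens)
	if err != nil || !allowed {
		return errors.New("quota pre-check failed")
	}
	outputTokens := stream() // run the stream, count output tokens
	if err := c.deduct(org, outputTokens); err != nil {
		// Phase two fails open: log and continue, never break the UX.
		fmt.Println("deduct failed, failing open:", err)
	}
	return nil
}

// fakeChecker simulates a backend that allows up to 1000 input tokens
// and can be toggled into an outage for the deduct phase.
type fakeChecker struct {
	deducted   int
	deductDown bool
}

func (f *fakeChecker) check(org string, t int) (bool, error) { return t <= 1000, nil }
func (f *fakeChecker) deduct(org string, t int) error {
	if f.deductDown {
		return errors.New("backend unreachable")
	}
	f.deducted += t
	return nil
}

func main() {
	f := &fakeChecker{deductDown: true}
	err := streamWithQuota(f, "acme", 500, func() int { return 1200 })
	fmt.Println("stream ok:", err == nil) // true: phase two failed open
}
```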
RLAAS is ready for customer integration in controlled production environments. Start with the Quick Start guide.