Architecture at a Glance
RLAAS separates policy storage from counter storage and uses a canonical internal model so every deployment mode shares the same evaluation engine.
A policy-driven platform for enforcing limits, quotas, and traffic control across APIs and service workloads — built in Go for speed, designed for any stack.
Whether you embed, centralize, or sidecar — the same policy engine powers every decision.
Embed (library): import directly into your Go service. Sub-millisecond local decisions. No network hop.
Centralize (service): language-agnostic decision service. Centralized governance, unified telemetry, one version to manage.
Sidecar: run alongside your app in Kubernetes. Local latency, central governance. Best of both worlds.
Everything you need to protect APIs, control telemetry, manage quotas, and enforce traffic policy at scale.
Fixed window, sliding window (log & counter), token bucket, leaky bucket, concurrency limiter, and quota/budget limiter — all behind a single interface.
Go beyond allow/deny. Support delay, sample, drop, downgrade, drop-low-priority, and shadow-only actions per policy.
Match on 20+ dimensions: org, tenant, service, endpoint, method, user, API key, region, tags, and more. Advanced match_expr expressions supported.
Shadow mode for dry-run evaluation. Gradual rollout percentages. Version history with one-click rollback.
Built-in analytics summary with tag aggregation. Full audit trail and version history for every policy change.
Configure fail-open or fail-closed per policy. Graceful degradation when backends are unavailable.
Lock-sharded in-memory counters (~6 ns/op). Async invalidation with bounded workers. Burst coalescing in sidecar sync.
Weighted regional allocation primitives. Overflow detection across regions. Built for global deployments.
Processor primitives for batch log and span filtering. Worker pools with fail-open/closed. Control telemetry volume per policy.
From HTTP ingress to background jobs — one platform covers all your rate limiting needs.
Per-IP, per-API-key, per-user, per-endpoint, per-org throttling for any REST service.
Per-method, per-service, per-tenant concurrency & rate limiting via interceptors.
Control log/span/trace volume per org, service, severity, or attribute set.
Per-topic, per-consumer-group, per-event-type limits for Kafka, Pub/Sub, SQS, NATS.
Per-job-type, per-org, per-workflow-step throttling for batch and async workloads.
Login attempts, OTP generation, password resets, device registration — protect every auth flow.
Choose the right algorithm for each use case, or let the policy engine decide.
| Algorithm | Best For | Trade-off |
|---|---|---|
| Fixed Window | Simple org-wide limits, low-complexity quotas | Boundary burst possible |
| Sliding Window Log | Security-sensitive exact checks, low-volume | Higher memory cost |
| Sliding Window Counter | APIs, OTEL signals, general distributed loads | Counts are approximate, not exact |
| Token Bucket | REST/gRPC throttling, burst control | Refill math & atomicity |
| Leaky Bucket | Egress smoothing, outbound traffic shaping | Less intuitive |
| Concurrency Limiter | DB-heavy ops, file processing, dependency protection | Requires acquire/release lifecycle |
| Quota / Budget | SaaS plan enforcement, daily/monthly budgets | Not for short-burst protection alone |
Native Go plus eight HTTP client SDKs — integrate in minutes, regardless of your tech stack.
Embedded library with direct engine access. Sub-millisecond decisions, zero network hop.
Lightweight requests-based client. Full API coverage including analytics and audit.
Modern fetch-based client. Type-safe interfaces. Works in Node.js and edge runtimes.
Zero-dependency pure JS client. require() or import. Express & Fastify middleware included.
java.net.http.HttpClient with Jackson. Java 11+ compatible. Full CRUD support.
HttpClient with System.Text.Json. Async/await, CancellationToken. .NET 8 ready.
libcurl + nlohmann/json. CMake FetchContent integration. C++17, thread-safe client.
Async reqwest + tokio. Fully typed with serde. Clone-able for multi-task use.
Zero runtime dependencies — stdlib only. Rails before_action middleware included.
Simple, RESTful endpoints. One POST /rlaas/v1/check call to get a rate-limit decision.
| Method | Endpoint | Description |
|---|---|---|
| POST | /rlaas/v1/check | Evaluate a rate-limit decision |
| POST | /rlaas/v1/acquire | Acquire a concurrency lease |
| POST | /rlaas/v1/release | Release a concurrency lease |
| GET | /rlaas/v1/policies | List all policies |
| POST | /rlaas/v1/policies | Create a new policy |
| GET | /rlaas/v1/policies/{id} | Get a specific policy |
| PUT | /rlaas/v1/policies/{id} | Update a policy |
| DELETE | /rlaas/v1/policies/{id} | Delete a policy |
| GET | /rlaas/v1/policies/{id}/audit | Policy change audit trail |
| GET | /rlaas/v1/policies/{id}/versions | Policy version history |
| POST | /rlaas/v1/policies/{id}/rollout | Update rollout percentage |
| POST | /rlaas/v1/policies/{id}/rollback | Rollback to a previous version |
| POST | /rlaas/v1/policies/validate | Validate a policy definition |
| GET | /rlaas/v1/analytics/summary | Decision analytics summary |
RLAAS handles every rate limiting challenge that comes with running LLMs, GPU training, AI agents, and RAG pipelines in production.
Enforce daily or monthly token limits per user, per org, or per pricing tier — scoped to a specific model. Pass quantity: token_count and the quota algorithm deducts real token cost, not just request count.
When a premium model's RPM is exhausted, return action: "downgrade" — the caller routes to GPT-3.5 or Claude Haiku automatically. No error, no user-visible failure, no redeploy.
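On the caller side, honoring a downgrade can be a simple branch on the returned action. The `routeModel` helper below is an illustrative sketch, not an SDK function; only the action names come from the document.

```go
package main

import "fmt"

// routeModel picks a model from a decision action: premium while
// capacity allows, the fallback on "downgrade" (no error, no
// user-visible failure), and an error for anything else.
func routeModel(action, premium, fallback string) (string, error) {
	switch action {
	case "allow":
		return premium, nil
	case "downgrade":
		return fallback, nil
	default:
		return "", fmt.Errorf("request rejected: %s", action)
	}
}

func main() {
	m, _ := routeModel("downgrade", "gpt-4o", "gpt-3.5-turbo")
	fmt.Println(m) // the caller silently uses the cheaper model
}
```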
Cap concurrent GPU training jobs per org with the concurrency limiter. Acquire/release holds the slot for the full job duration — auto-released via TTL if the job crashes.
Autonomous agents can spiral into thousands of tool calls per minute. Rate-limit each tool type per session with a sliding window — give agents the retry_after hint so they can plan around the constraint.
Map API cost to RLAAS units (e.g. 1 unit = $0.00001) and enforce a hard daily spend cap per tenant. No surprise bills. Change the budget live via the policy API — zero deploy.
Two-phase pattern for streaming LLM responses: pre-check on known input tokens, then deduct actual output tokens after the stream ends. Fail-open on the second phase so a network blip never breaks UX.
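The two-phase pattern can be sketched as below. The `checker` interface stands in for the HTTP client, and `fakeChecker` makes the sketch run locally; both are assumptions for illustration. The essential part is that the second phase swallows its error, so a network blip never breaks the stream.

```go
package main

import (
	"errors"
	"fmt"
)

// checker abstracts the two calls the pattern needs; in production
// these would hit /rlaas/v1/check.
type checker interface {
	check(org string, tokens int) (bool, error)
	deduct(org string, tokens int) error
}

// streamWithQuota: hard pre-check on known input tokens, then a
// best-effort deduction of actual output tokens after the stream ends.
func streamWithQuota(c checker, org string, inputTokens int, stream func() int) error {
	allowed, err := c.check(org, inputTokens)
	if err != nil || !allowed {
		return errors.New("quota pre-check failed")
	}
	outputTokens := stream() // run the stream, count output tokens
	if err := c.deduct(org, outputTokens); err != nil {
		// Phase two fails open: log and continue, never break the UX.
		fmt.Println("deduct failed, failing open:", err)
	}
	return nil
}

// fakeChecker simulates a backend that allows up to 1000 input tokens
// and can be toggled into an outage for the deduct phase.
type fakeChecker struct {
	deducted   int
	deductDown bool
}

func (f *fakeChecker) check(org string, t int) (bool, error) { return t <= 1000, nil }
func (f *fakeChecker) deduct(org string, t int) error {
	if f.deductDown {
		return errors.New("backend unreachable")
	}
	f.deducted += t
	return nil
}

func main() {
	f := &fakeChecker{deductDown: true}
	err := streamWithQuota(f, "acme", 500, func() int { return 1200 })
	fmt.Println("stream ok:", err == nil) // true: phase two failed open
}
```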
RLAAS is ready for customer integration in controlled production environments. Start with the Quick Start guide.