LLM APIs

Per-User Token Budgets

Stop holding every user to the same flat limit. Quota policies deduct actual token cost, so a 50,000-token request drains the budget 1,000× more than a 50-token one.

Before & After

Without RLAAS

Hard-Coded Global Limits

  • Same 10,000 token limit for every user regardless of plan
  • Changing the limit requires a code change and redeploy
  • Every request counts as "1" regardless of token usage — fairness is broken
  • No per-model scoping — GPT-4 and GPT-3.5 share the same counters
# ✗ Hard-coded limit, no token awareness
TOKEN_LIMIT = 10_000  # same for free & paid users

def call_llm(user_id: str, prompt: str) -> str:
    count = int(redis.get(f"tokens:{user_id}") or 0)
    if count >= TOKEN_LIMIT:      # ← same limit for everyone
        raise HTTPException(429)  # ← redeploy to change this
    resp = openai.chat.completions.create(...)
    # ✗ counts +1, not +actual_tokens
    redis.incrby(f"tokens:{user_id}", 1)
    return resp.choices[0].message.content
With RLAAS

Per-Plan Token Budgets, Live

  • Free: 50K tokens/day · Pro: 500K · Enterprise: unlimited — changed live
  • Deducts actual token count (Quantity field) — not request count
  • Per-model scoping: GPT-4 budget is separate from GPT-3.5 budget
  • Quota algorithm rolls over daily/monthly with zero code changes
# ✓ RLAAS enforces per-plan token budgets
from rlaas_sdk import RlaasClient, CheckRequest

client = RlaasClient(base_url="http://rlaas:8080")

def call_llm(user_id: str, plan: str, prompt: str) -> str:
    # estimate tokens before the call (or use a real counter)
    est_tokens = len(prompt.split()) * 1.3
    decision = client.check(CheckRequest(
        user_id=user_id,
        resource="gpt-4",
        quantity=int(est_tokens),  # ← deducts real token cost
        metadata={"plan": plan},
    ))
    if not decision.allowed:
        raise HTTPException(429, {"retry_after": decision.retry_after})
    return openai.chat.completions.create(...).choices[0].message.content
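
The word-count heuristic above is only an approximation. For exact prompt counts you can use OpenAI's tiktoken tokenizer before the call; a minimal sketch, with count_tokens as a hypothetical helper (not part of the RLAAS SDK):

# Exact prompt token counting with tiktoken (pip install tiktoken).
# count_tokens is a hypothetical helper, not part of the RLAAS SDK.
import tiktoken

def count_tokens(prompt: str, model: str = "gpt-4") -> int:
    enc = tiktoken.encoding_for_model(model)  # tokenizer for this model
    return len(enc.encode(prompt))            # number of prompt tokens

Completion tokens are only known after the response (e.g. response.usage.total_tokens in the OpenAI API), so a common pattern is to check against a conservative estimate up front and reconcile with actual usage afterwards, if your quota service supports adjustments.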

How It Works

Policy Configuration

# POST /rlaas/v1/policies
{
  "id": "llm-token-budget-gpt4",
  "resource": "gpt-4",
  "algorithm": "quota",
  "config": {
    "quota": 100000,
    "window_seconds": 86400
  },
  "action_deny": "reject",
  "metadata": {
    "description": "100K tokens/day per user for GPT-4 (Pro plan)",
    "tier": "pro"
  }
}
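
Per-plan budgets are just additional policies. A free-tier variant matching the 50K tokens/day figure above might look like the following; the id and values are illustrative, and plan routing is assumed to work as described in step 5 of the request flow below:

# POST /rlaas/v1/policies
{
  "id": "llm-token-budget-gpt4-free",
  "resource": "gpt-4",
  "algorithm": "quota",
  "config": {
    "quota": 50000,
    "window_seconds": 86400
  },
  "action_deny": "reject",
  "metadata": {
    "description": "50K tokens/day per user for GPT-4 (Free plan)",
    "tier": "free"
  }
}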

Request Flow

  1. Client sends quantity: N — the estimated (or exact) token count for this request
  2. RLAAS deducts N from the user's daily quota — not just +1
  3. If quota exhausted — returns allowed: false with retry_after = seconds until reset
  4. Update policy live — PATCH /rlaas/v1/policies/{id} to change quota without a redeploy (see the sketches after this list)
  5. Per-plan policies — create separate policies per plan tier, route by metadata.plan
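
When the budget is exhausted (step 3), a check is assumed to come back roughly like this; the exact response shape beyond allowed and retry_after is not documented here:

{
  "allowed": false,
  "retry_after": 14400
}

And a live quota bump (step 4) might look like the following; the partial-update body shape is an assumption based on the policy document above:

# PATCH /rlaas/v1/policies/llm-token-budget-gpt4
{
  "config": {
    "quota": 1000000
  }
}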

SDK Examples

Check the token budget before every LLM call using your preferred language.

// check token budget before calling LLM
decision, err := client.Check(ctx, &rlaas.CheckRequest{
    UserID:   userID,
    Resource: "gpt-4",
    Quantity: int64(estimatedTokens), // real token cost
})
if err != nil {
    return "", err
}
if !decision.Allowed {
    return "", fmt.Errorf("token budget exhausted, retry in %ds", decision.RetryAfter)
}

// make LLM call
resp, err := openaiClient.CreateChatCompletion(ctx, req)
from rlaas_sdk import RlaasClient, CheckRequest

client = RlaasClient(base_url="http://rlaas:8080")

decision = client.check(CheckRequest(
    user_id=user_id,
    resource="gpt-4",
    quantity=estimated_tokens,
))
if not decision.allowed:
    raise RateLimitError(retry_after=decision.retry_after)

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=messages,
)
import { RlaasClient } from '@rlaas/sdk';

const rlaas = new RlaasClient({ baseUrl: 'http://rlaas:8080' });

const decision = await rlaas.check({
  userId: userId,
  resource: 'gpt-4',
  quantity: estimatedTokens,
});
if (!decision.allowed) {
  throw new RateLimitError({ retryAfter: decision.retryAfter });
}

const response = await openai.chat.completions.create({ model: 'gpt-4', messages });
// check token budget before calling LLM
import io.rlaas.sdk.RlaasClient;
import io.rlaas.sdk.model.*;

RlaasClient rlaas = new RlaasClient("http://rlaas:8080");

Decision decision = rlaas.checkLimit(new CheckRequest(
    userId, "gpt-4", estimatedTokens));
if (!decision.isAllowed()) {
    throw new RateLimitException(
        "Token budget exhausted, retry in " + decision.getRetryAfter() + "s");
}

// make LLM call
var resp = openAiClient.createChatCompletion(req);
// check token budget before calling LLM
using Rlaas.Sdk;
using Rlaas.Sdk.Models;

var rlaas = new RlaasClient("http://rlaas:8080");

var decision = await rlaas.CheckLimitAsync(
    new CheckRequest(userId, "gpt-4", estimatedTokens));
if (!decision.Allowed)
    throw new RateLimitException(
        $"Token budget exhausted, retry in {decision.RetryAfter}s");

// make LLM call
var resp = await openAi.CreateChatCompletionAsync(req);
// check token budget before calling LLM (Node.js)
const { RlaasClient } = require('@rlaas/node-sdk');

const client = new RlaasClient('http://rlaas:8080');

const decision = await client.check({
  user_id: userId,
  resource: 'gpt-4',
  quantity: estimatedTokens,
});
if (!decision.allowed) {
  res.status(429).json({
    error: 'Token budget exhausted',
    retry_after: decision.retry_after,
  });
  return;
}

const response = await openai.chat.completions.create({ model: 'gpt-4', messages });
// check token budget before calling LLM (C++)
#include "rlaas/client.h"

rlaas::Client client("http://rlaas:8080");

rlaas::CheckRequest req;
req.user_id = user_id;
req.resource = "gpt-4";
req.quantity = estimated_tokens;

auto decision = client.check(req);
if (!decision.allowed) {
    throw std::runtime_error(
        "Token budget exhausted, retry in " +
        std::to_string(decision.retry_after_ms) + "ms");
}

// make LLM call
auto resp = openai_client.create_chat_completion(chat_req);
// check token budget before calling LLM (Rust)
use rlaas_sdk::{Client, CheckRequest};

let client = Client::new("http://rlaas:8080");

let decision = client.check(&CheckRequest {
    user_id: user_id.into(),
    resource: "gpt-4".into(),
    quantity: estimated_tokens as i64,
    ..Default::default()
}).await?;
if !decision.allowed {
    return Err(anyhow!("token budget exhausted, retry in {}s", decision.retry_after));
}

let resp = openai.create_chat_completion(req).await?;
# check token budget before calling LLM (Ruby)
require 'rlaas_sdk'

client = Rlaas::Client.new('http://rlaas:8080')

decision = client.check(
  user_id: user_id,
  resource: 'gpt-4',
  quantity: estimated_tokens
)
unless decision.allowed
  raise RateLimitError, "Token budget exhausted, retry in #{decision.retry_after}s"
end

response = openai_client.chat(model: 'gpt-4', messages: messages)