Rate Limiting Configuration¶
Document version: 1.0.0
Last updated: 2025-08-28 · Git commit: c1aa5b0f
Author: Lincoln
JAiRouter provides multiple rate limiting algorithms, supporting global, service-level, and instance-level rate limiting configurations, as well as independent rate limiting based on client IP. This document details the configuration and usage of various rate limiting strategies.
Modular Configuration Overview¶
Starting from v1.0.0, JAiRouter uses a modular configuration structure, with rate-limiting configuration moved into separate files:
- Main configuration file: application.yml
- Model services base configuration: config/base/model-services-base.yml
All rate-limiting configuration lives in the config/base/model-services-base.yml file, covering global, service-type, and instance-level settings.
Rate Limiting Overview¶
Supported Algorithms¶
Algorithm | Characteristics | Use Cases | Burst Handling |
---|---|---|---|
Token Bucket | Allows burst traffic | Steady traffic + occasional bursts | Excellent |
Leaky Bucket | Smooths output traffic | Requires steady output | Moderate |
Sliding Window | Precise time window control | Precise traffic control | Moderate |
Warm Up | Cold start protection | Service startup scenarios | Gradual |
Rate Limiting Hierarchy¶
graph TB
A[Global Rate Limiting Configuration] --> B[Service-Level Rate Limiting Configuration]
B --> C[Instance-Level Rate Limiting Configuration]
A --> D[Default Algorithm and Parameters]
B --> E[Service-Specific Rate Limiting]
C --> F[Instance-Independent Rate Limiting]
G[Client IP Rate Limiting] --> B
G --> C
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#e8f5e8
style G fill:#fff3e0
Global Rate Limiting Configuration¶
Basic Configuration¶
Configure global rate limiting in the config/base/model-services-base.yml file:
# config/base/model-services-base.yml
model:
  # Global rate limiting configuration
  rate-limit:
    enabled: true
    algorithm: "token-bucket"
    capacity: 1000
    rate: 100
    scope: "service"
    client-ip-enable: true  # Enable client IP rate limiting
  services:
    # Chat service configuration
    chat:
      load-balance:
        type: least-connections
      adapter: gpustack  # Use GPUStack adapter
      # Service-level rate limiting configuration
      rate-limit:
        enabled: true
        algorithm: "token-bucket"
        capacity: 100
        rate: 10
        scope: "service"
        client-ip-enable: true
Advanced Configuration¶
model:
  rate-limit:
    enabled: true
    algorithm: "token-bucket"
    # Client IP rate limiting configuration
    client-ip:
      enabled: true
      cleanup-interval: 300s  # Cleanup interval
      max-idle-time: 1800s    # Maximum idle time
      max-clients: 10000      # Maximum number of clients
    # Rate limiter performance configuration
    performance:
      async-processing: true  # Asynchronous processing
      batch-size: 100         # Batch processing size
      cache-size: 1000        # Cache size
Detailed Rate Limiting Algorithms¶
1. Token Bucket Algorithm¶
Algorithm Principle¶
The token bucket algorithm maintains a fixed-capacity bucket and adds tokens at a fixed rate. Requests consume tokens when they arrive, and are rejected if no tokens are available.
graph LR
A[Token Generator] -->|Fixed Rate| B[Token Bucket]
C[Request] --> D{Tokens in Bucket?}
D -->|Yes| E[Consume Token, Allow Request]
D -->|No| F[Reject Request]
B --> D
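For intuition, here is a minimal token-bucket sketch in Java. It is illustrative only, not JAiRouter's internal implementation: tokens are refilled lazily from the elapsed time, and a request succeeds only if a whole token is available.

/** Minimal token-bucket sketch: capacity caps burst size, rate is tokens added per second. */
public class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double ratePerSecond) {
        this.capacity = capacity;
        this.refillPerNano = ratePerSecond / 1_000_000_000.0;
        this.tokens = capacity;             // start full, so initial bursts are allowed
        this.lastRefill = System.nanoTime();
    }

    /** Returns true if the request may proceed, consuming one token. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;                       // no token available: reject
    }
}

Because the bucket starts full, a burst of up to capacity requests passes immediately, while the long-term average is bounded by rate, which is exactly the behavior described above.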
Configuration Example¶
Configure the token bucket algorithm in the config/base/model-services-base.yml file:
model:
  services:
    chat:
      rate-limit:
        enabled: true
        algorithm: "token-bucket"
        capacity: 100           # Bucket capacity (maximum tokens)
        rate: 10                # Token refill rate (per second)
        scope: "service"        # Rate limiting scope
        client-ip-enable: true  # Enable client IP rate limiting
JSON Configuration¶
JAiRouter also supports updating rate limiting configurations via the dynamic configuration API:
{
  "services": {
    "chat": {
      "rateLimit": {
        "type": "token-bucket",
        "capacity": 100,
        "refillRate": 10,
        "clientIpEnable": true
      }
    }
  }
}
Parameter Description¶
Parameter | Type | Description | Default Value |
---|---|---|---|
capacity | Integer | Bucket capacity, maximum number of tokens | 100 |
rate/refillRate | Integer | Token refill rate (per second) | 10 |
client-ip-enable | Boolean | Whether to enable client IP rate limiting | false |
Use Cases¶
- Steady Traffic + Occasional Bursts: Daily traffic is steady with occasional burst requests
- API Gateway: Need to allow short-term bursts but control long-term average rate
- User Request Limiting: Allow users to send multiple requests in a short time
Configuration Recommendations¶
# High concurrency scenario
rate-limit:
  algorithm: "token-bucket"
  capacity: 1000  # Large capacity to support bursts
  rate: 100       # Moderate refill rate

# Backend protection scenario
rate-limit:
  algorithm: "token-bucket"
  capacity: 50  # Small capacity to limit bursts
  rate: 5       # Low rate to protect backend
2. Leaky Bucket Algorithm¶
Algorithm Principle¶
The leaky bucket algorithm maintains a fixed-capacity bucket. Incoming requests queue in the bucket and are drained from the bottom at a fixed rate.
graph TB
A[Request Inflow] --> B{Is Bucket Full?}
B -->|No| C[Request Enters Bucket]
B -->|Yes| D[Reject Request]
C --> E[Queue in Bucket]
E -->|Fixed Rate| F[Process Request]
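A minimal queue-based sketch of the same idea in Java (illustrative only; JAiRouter's implementation may differ): requests wait in a bounded queue and a scheduler drains one per tick.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Minimal leaky-bucket sketch: requests queue up to capacity and drain at a fixed rate. */
public class LeakyBucket {
    private final BlockingQueue<Runnable> bucket;
    private final ScheduledExecutorService drain = Executors.newSingleThreadScheduledExecutor();

    public LeakyBucket(int capacity, int leakRatePerSecond) {
        this.bucket = new ArrayBlockingQueue<>(capacity);
        long periodMicros = 1_000_000L / leakRatePerSecond;
        // Process one queued request per period, giving a smooth, fixed output rate.
        drain.scheduleAtFixedRate(() -> {
            Runnable next = bucket.poll();
            if (next != null) next.run();
        }, periodMicros, periodMicros, TimeUnit.MICROSECONDS);
    }

    /** Returns false (reject) when the bucket is full; otherwise the request waits its turn. */
    public boolean offer(Runnable request) {
        return bucket.offer(request);
    }
}

Unlike the token bucket, bursts are not passed through: they are absorbed by the queue and released at the leak rate, which is why the backend sees steady traffic.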
Configuration Example¶
Configure the leaky bucket algorithm in the config/base/model-services-base.yml file:
model:
  services:
    embedding:
      rate-limit:
        enabled: true
        algorithm: "leaky-bucket"
        capacity: 50  # Bucket capacity
        rate: 5       # Leak rate (per second)
        scope: "service"
        client-ip-enable: true
JSON Configuration¶
JAiRouter also supports updating rate limiting configurations via the dynamic configuration API:
{
  "services": {
    "embedding": {
      "rateLimit": {
        "type": "leaky-bucket",
        "capacity": 50,
        "leakRate": 5,
        "clientIpEnable": true
      }
    }
  }
}
Parameter Description¶
Parameter | Type | Description | Default Value |
---|---|---|---|
capacity | Integer | Bucket capacity, maximum queued requests | 50 |
rate/leakRate | Integer | Leak rate (per second) | 5 |
Use Cases¶
- Steady Output Required: Backend needs to receive requests at a steady rate
- Message Queue: Need to control message processing rate
- Database Protection: Protect database from burst traffic impact
3. Sliding Window Algorithm¶
Algorithm Principle¶
The sliding window algorithm counts requests within a time window that slides with the current time, and rejects new requests once the count reaches the threshold.
graph LR
A[Time Window] --> B[Count Requests]
C[New Request] --> D{Requests in Window < Threshold?}
D -->|Yes| E[Allow Request]
D -->|No| F[Reject Request]
B --> D
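A minimal sliding-window sketch in Java (illustrative, not the project's code): request timestamps are kept in a deque, and anything older than the window is evicted before counting.

import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal sliding-window sketch: counts requests whose timestamps fall inside the window. */
public class SlidingWindow {
    private final long windowNanos;
    private final int maxRequests;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindow(long windowSeconds, int maxRequests) {
        this.windowNanos = windowSeconds * 1_000_000_000L;
        this.maxRequests = maxRequests;
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Evict timestamps that have slid out of the window.
        while (!timestamps.isEmpty() && now - timestamps.peekFirst() > windowNanos) {
            timestamps.pollFirst();
        }
        if (timestamps.size() < maxRequests) {
            timestamps.addLast(now);
            return true;
        }
        return false;  // window is full: reject
    }
}

Storing individual timestamps gives exact counts at the cost of O(maxRequests) memory per limiter; counting-bucket variants trade a little accuracy for constant memory.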
Configuration Example¶
Configure the sliding window algorithm in the config/base/model-services-base.yml file:
model:
  services:
    tts:
      rate-limit:
        enabled: true
        algorithm: "sliding-window"
        window-size: 60    # Time window size (seconds)
        max-requests: 100  # Maximum requests in window
        scope: "service"
        client-ip-enable: true
JSON Configuration¶
JAiRouter also supports updating rate limiting configurations via the dynamic configuration API:
{
  "services": {
    "tts": {
      "rateLimit": {
        "type": "sliding-window",
        "windowSize": 60,
        "maxRequests": 100,
        "clientIpEnable": true
      }
    }
  }
}
Parameter Description¶
Parameter | Type | Description | Default Value |
---|---|---|---|
window-size/windowSize | Integer | Time window size (seconds) | 60 |
max-requests/maxRequests | Integer | Maximum requests in window | 100 |
Use Cases¶
- Precise Traffic Control: Need precise control of requests within time window
- API Quota Management: Allocate API call quotas by time period
- Interface Protection: Prevent large number of duplicate requests in short time
4. Warm Up Algorithm¶
Algorithm Principle¶
The warm-up algorithm gradually increases the allowed request rate during service startup to avoid performance issues during cold start.
graph LR
A[Service Startup] --> B[Low Rate]
B --> C[Gradually Increase]
C --> D[Reach Target Rate]
E[Warm-up Time] --> F[Current Allowed Rate]
F --> G[= Target Rate * (Current Time / Warm-up Time)]
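The diagram shows a simplified linear ramp. One plausible reading of cold-factor (the exact curve is implementation-specific, so treat this as an assumption) is that the rate starts at rate / cold-factor and climbs linearly to the full rate over the warm-up period:

/** Sketch of a warm-up rate curve: starts at targetRate / coldFactor and ramps linearly
 *  to the full target rate over the warm-up period. Assumed semantics, for illustration. */
public class WarmUpRate {
    private final double targetRate;
    private final double coldFactor;
    private final long warmUpPeriodMillis;
    private final long startMillis = System.currentTimeMillis();

    public WarmUpRate(double targetRate, double coldFactor, long warmUpPeriodSeconds) {
        this.targetRate = targetRate;
        this.coldFactor = coldFactor;
        this.warmUpPeriodMillis = warmUpPeriodSeconds * 1000L;
    }

    /** e.g. targetRate=10, coldFactor=3: starts near 3.3 req/s, reaches 10 req/s after warm-up. */
    public double currentRate() {
        double progress = Math.min(1.0,
                (System.currentTimeMillis() - startMillis) / (double) warmUpPeriodMillis);
        double coldRate = targetRate / coldFactor;
        return coldRate + (targetRate - coldRate) * progress;
    }
}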
Configuration Example¶
Configure the warm-up algorithm in the config/base/model-services-base.yml file:
model:
  services:
    chat:
      rate-limit:
        enabled: true
        algorithm: "warm-up"
        capacity: 100        # Final capacity
        rate: 10             # Final rate
        warm-up-period: 300  # Warm-up time (seconds)
        cold-factor: 3       # Cold start factor
        scope: "service"
JSON Configuration¶
JAiRouter also supports updating rate limiting configurations via the dynamic configuration API:
{
  "services": {
    "chat": {
      "rateLimit": {
        "type": "warm-up",
        "capacity": 100,
        "warmUpPeriod": 300,
        "coldFactor": 3,
        "clientIpEnable": true
      }
    }
  }
}
Parameter Description¶
Parameter | Type | Description | Default Value |
---|---|---|---|
capacity | Integer | Final capacity | 100 |
warm-up-period/warmUpPeriod | Integer | Warm-up time (seconds) | 300 |
cold-factor/coldFactor | Integer | Cold start factor | 3 |
Use Cases¶
- Service Cold Start: Service needs warm-up when starting
- Cache Warm-up: Time needed to build cache
- Connection Pool Warm-up: Database connection pool needs time to establish
Client IP Rate Limiting¶
Basic Configuration¶
Configure client IP rate limiting in the config/base/model-services-base.yml file:
model:
  rate-limit:
    client-ip-enable: true  # Enable client IP rate limiting
    # Client IP rate limiting configuration
    client-ip:
      enabled: true
      cleanup-interval: 300s  # Cleanup interval
      max-idle-time: 1800s    # Maximum idle time
      max-clients: 10000      # Maximum number of clients
      default-capacity: 100   # Default capacity
      default-rate: 10        # Default rate
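Conceptually this keeps one limiter per client IP and evicts idle entries. A minimal sketch, reusing the TokenBucket class from earlier (the registry shape and defaults are assumptions, not JAiRouter internals):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative per-client-IP limiter registry mirroring cleanup-interval,
 *  max-idle-time and max-clients above. */
public class ClientIpLimiters {
    private static final class Entry {
        final TokenBucket bucket;
        volatile long lastSeenMillis = System.currentTimeMillis();
        Entry(TokenBucket bucket) { this.bucket = bucket; }
    }

    private final Map<String, Entry> limiters = new ConcurrentHashMap<>();
    private final long maxIdleMillis;
    private final int maxClients;

    public ClientIpLimiters(long maxIdleMillis, int maxClients) {
        this.maxIdleMillis = maxIdleMillis;
        this.maxClients = maxClients;
    }

    /** One bucket per IP (default-capacity / default-rate); reject when the registry is full. */
    public boolean tryAcquire(String clientIp) {
        Entry e = limiters.computeIfAbsent(clientIp, ip ->
                limiters.size() < maxClients ? new Entry(new TokenBucket(100, 10)) : null);
        if (e == null) return false;
        e.lastSeenMillis = System.currentTimeMillis();
        return e.bucket.tryAcquire();
    }

    /** Run on a schedule (cleanup-interval) to drop limiters idle longer than max-idle-time. */
    public void cleanup() {
        long cutoff = System.currentTimeMillis() - maxIdleMillis;
        limiters.values().removeIf(e -> e.lastSeenMillis < cutoff);
    }
}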
Service-Level IP Rate Limiting¶
model:
  services:
    chat:
      rate-limit:
        enabled: true
        algorithm: "token-bucket"
        capacity: 1000          # Service-level total capacity
        rate: 100               # Service-level total rate
        client-ip-enable: true  # Enable IP rate limiting
        # Rate limiting configuration per IP
        client-ip:
          capacity: 50  # Capacity per IP
          rate: 5       # Rate per IP
Instance-Level IP Rate Limiting¶
model:
  services:
    chat:
      instances:
        - name: "model-1"
          base-url: "http://server-1:8080"
          rate-limit:
            enabled: true
            algorithm: "token-bucket"
            capacity: 100
            rate: 10
            scope: "instance"
            client-ip-enable: true
IP Rate Limiting Monitoring¶
# View client IP rate limiting statistics
curl "http://localhost:8080/actuator/metrics/jairouter.ratelimit.clients"
# View IP rate limiter cleanup statistics
curl "http://localhost:8080/actuator/metrics/jairouter.ratelimit.cleanup"
Multi-Layer Rate Limiting Configuration¶
Layered Rate Limiting Strategy¶
Configure multi-layer rate limiting in the config/base/model-services-base.yml file:
model:
  # Global rate limiting: Protect the entire system
  rate-limit:
    enabled: true
    algorithm: "token-bucket"
    capacity: 10000
    rate: 1000
    client-ip-enable: true
  services:
    chat:
      # Service-level rate limiting: Protect specific service
      rate-limit:
        enabled: true
        algorithm: "token-bucket"
        capacity: 1000
        rate: 100
        client-ip-enable: true
        client-ip:
          capacity: 50
          rate: 5
      instances:
        - name: "high-perf-model"
          base-url: "http://gpu-server:8080"
          # Instance-level rate limiting: Protect specific instance
          rate-limit:
            enabled: true
            algorithm: "token-bucket"
            capacity: 500
            rate: 50
            scope: "instance"
Rate Limiting Priority¶
- Instance-Level Rate Limiting: Highest priority, directly protects instances
- Service-Level Rate Limiting: Medium priority, protects service types
- Global Rate Limiting: Lowest priority, protects the entire system
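A request must pass every enabled layer; checking from the most specific outward matches the priority above. A minimal composition sketch, again reusing the TokenBucket class (illustrative, not JAiRouter's dispatch code):

import java.util.List;

/** Sketch of layered limiting: a request is allowed only if every layer admits it. */
public class LayeredRateLimit {
    private final List<TokenBucket> layers;  // e.g. [instanceLimiter, serviceLimiter, globalLimiter]

    public LayeredRateLimit(List<TokenBucket> layers) {
        this.layers = layers;
    }

    public boolean tryAcquire() {
        for (TokenBucket layer : layers) {
            if (!layer.tryAcquire()) {
                return false;  // any layer rejecting rejects the whole request
            }
        }
        return true;
    }
}

One subtlety of this naive composition: tokens consumed at an earlier layer are not returned if a later layer rejects, so a production implementation would typically check all layers before committing.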
Dynamic Rate Limiting Configuration¶
Dynamic Adjustment via the Configuration API¶
JAiRouter also supports updating rate limiting configurations via the dynamic configuration API:
{
  "services": {
    "chat": {
      "rateLimit": {
        "type": "token-bucket",
        "capacity": 200,
        "refillRate": 20,
        "clientIpEnable": true
      },
      "instances": [
        {
          "name": "model-1",
          "baseUrl": "http://server-1:8080",
          "rateLimit": {
            "type": "token-bucket",
            "capacity": 100,
            "refillRate": 10
          }
        }
      ]
    }
  }
}
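To apply such an update at runtime you would POST it to JAiRouter's dynamic configuration endpoint. The path below is a placeholder assumption; check the project's API reference for the actual route:

# NOTE: the endpoint path is hypothetical; substitute the real dynamic-configuration route
curl -X POST "http://localhost:8080/api/config/services/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "rateLimit": {
      "type": "token-bucket",
      "capacity": 200,
      "refillRate": 20,
      "clientIpEnable": true
    }
  }'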
Rate Limiting Parameter Tuning¶
# Monitor current rate limiting effectiveness
curl "http://localhost:8080/actuator/metrics/jairouter.ratelimit.requests"
# Adjust configuration based on monitoring results
# If rejection rate is too high, increase capacity or rate
# If backend pressure is high, reduce capacity or rate
Rate Limiting Monitoring and Alerting¶
Monitoring Metrics¶
# Total rate limiting requests
curl "http://localhost:8080/actuator/metrics/jairouter.ratelimit.requests.total"
# Rate limiting rejections
curl "http://localhost:8080/actuator/metrics/jairouter.ratelimit.rejected.total"
# Number of client IPs
curl "http://localhost:8080/actuator/metrics/jairouter.ratelimit.clients.active"
# Rate limiter cleanup statistics
curl "http://localhost:8080/actuator/metrics/jairouter.ratelimit.cleanup.total"
Prometheus Metrics¶
# Rate limiting request rate
rate(jairouter_ratelimit_requests_total[5m])
# Rate limiting rejection rate
rate(jairouter_ratelimit_rejected_total[5m]) / rate(jairouter_ratelimit_requests_total[5m])
# Active client count
jairouter_ratelimit_clients_active
# Rate limiter memory usage
jairouter_ratelimit_memory_usage_bytes
Alert Rules¶
# Prometheus alert rules
groups:
  - name: jairouter_ratelimit
    rules:
      - alert: HighRateLimitRejection
        expr: rate(jairouter_ratelimit_rejected_total[5m]) / rate(jairouter_ratelimit_requests_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate limiting rejection rate"
          description: "Service {{ $labels.service }} rejection rate exceeds 10%"
      - alert: TooManyActiveClients
        expr: jairouter_ratelimit_clients_active > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many active clients"
          description: "Active client count reached {{ $value }}, may need to adjust cleanup strategy"
Performance Optimization¶
1. Rate Limiter Performance Configuration¶
model:
  rate-limit:
    performance:
      async-processing: true  # Enable asynchronous processing
      batch-size: 100         # Batch processing size
      cache-size: 1000        # Cache size
      thread-pool-size: 4     # Thread pool size
2. Memory Optimization¶
model:
  rate-limit:
    client-ip:
      cleanup-interval: 180s  # More frequent cleanup
      max-idle-time: 900s     # Shorter idle time
      max-clients: 5000       # Limit maximum clients
    # Memory optimization configuration
    memory:
      initial-capacity: 1000  # Initial capacity
      load-factor: 0.75       # Load factor
      concurrency-level: 16   # Concurrency level
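These three keys line up with the tuning parameters of Java's ConcurrentHashMap, so a plausible (assumed, not documented) mapping for the per-IP limiter table is:

import java.util.concurrent.ConcurrentHashMap;

public class LimiterTableConfig {
    // initial-capacity, load-factor and concurrency-level map directly onto
    // ConcurrentHashMap's constructor arguments for the per-IP limiter table.
    static final ConcurrentHashMap<String, TokenBucket> LIMITERS =
            new ConcurrentHashMap<>(1000, 0.75f, 16);
}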
3. Algorithm Selection Optimization¶
# High concurrency scenario: Choose the best performing algorithm
model:
  services:
    high-traffic:
      rate-limit:
        algorithm: "token-bucket"  # Best performance

# Precise control scenario: Choose the most accurate algorithm
model:
  services:
    precise-control:
      rate-limit:
        algorithm: "sliding-window"  # Highest accuracy
Practical Use Cases¶
Case 1: API Gateway Rate Limiting¶
# Multi-layer rate limiting protection
model:
  # Global rate limiting: Protect the entire gateway
  rate-limit:
    enabled: true
    algorithm: "token-bucket"
    capacity: 10000
    rate: 1000
    client-ip-enable: true
    client-ip:
      capacity: 100
      rate: 10
  services:
    # Chat service: High frequency usage
    chat:
      rate-limit:
        algorithm: "token-bucket"
        capacity: 5000
        rate: 500
        client-ip-enable: true
        client-ip:
          capacity: 50
          rate: 5
    # Image generation: Resource intensive
    image-generation:
      rate-limit:
        algorithm: "leaky-bucket"
        capacity: 100
        rate: 10
        client-ip-enable: true
        client-ip:
          capacity: 5
          rate: 1
Case 2: Anti-Scraping Protection¶
# Prevent malicious interface scraping
model:
  services:
    sensitive-api:
      rate-limit:
        enabled: true
        algorithm: "sliding-window"
        window-size: 300  # 5-minute window
        max-requests: 50  # Maximum 50 requests
        client-ip-enable: true
        client-ip:
          window-size: 60   # 1-minute window
          max-requests: 10  # Maximum 10 requests per IP
Case 3: Service Warm-up¶
# New service online warm-up
model:
  services:
    new-service:
      rate-limit:
        enabled: true
        algorithm: "warm-up"
        capacity: 1000
        rate: 100
        warm-up-period: 600  # 10-minute warm-up
        cold-factor: 5       # Initial rate is 1/5 of target
Troubleshooting¶
Common Issues¶
- Rate Limiting Too Strict: the rejection rate is high. Increase capacity or rate, or relax the per-IP limits (see Parameter Tuning above).
- High Memory Usage: too many client IP limiters are being tracked. Shorten max-idle-time, lower max-clients, or run cleanup more frequently (see Memory Optimization above).
- Rate Limiting Not Working: confirm enabled: true is set at the intended level, then inspect the effective configuration with the debugging commands below.
Debugging Commands¶
# View rate limiting configuration
curl "http://localhost:8080/actuator/configprops" | jq '.model.rate-limit'
# View rate limiting metrics
curl "http://localhost:8080/actuator/metrics" | grep ratelimit
# Test rate limiting effect
for i in {1..20}; do
  # -s -o /dev/null keeps the output to just the status code per request
  curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:8080/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "test", "messages": [{"role": "user", "content": "test"}]}'
done
Next Steps¶
After completing rate limiting configuration, you can continue to learn about:
- Circuit Breaker Configuration - Configure fault protection mechanisms
- Monitoring Guide - Set up rate limiting monitoring and alerts
- Troubleshooting - Diagnose and resolve rate limiting issues
- Performance Tuning - Optimize rate limiting performance