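Circuit Breaker Configuration
Document version: 1.0.0
Last updated: 2025-08-28 Git commit: c1aa5b0f
Author: Lincoln
JAiRouter has a built-in circuit breaker mechanism that prevents service avalanches and cascading failures. When a backend service fails, the circuit breaker automatically cuts off requests to it and restores normal traffic once the service recovers. This document describes how to configure and use the circuit breaker.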
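Modular Configuration
Starting with v1.0.0, JAiRouter uses a modular configuration structure, and the circuit breaker settings have moved to a separate configuration file:
- Main configuration file: application.yml
- Base model service configuration: config/base/model-services-base.yml
All circuit breaker settings, including the global, per-service-type, and per-instance configuration, are in config/base/model-services-base.yml.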
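Circuit Breaker Overview
How It Works
The circuit breaker has three states:
stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN : failure count >= failure threshold
    OPEN --> HALF_OPEN : recovery timeout elapsed
    HALF_OPEN --> CLOSED : success count >= success threshold
    HALF_OPEN --> OPEN : any failure
    note right of CLOSED : Normal state, requests pass through
    note right of OPEN : Tripped state, errors are returned immediately
    note right of HALF_OPEN : Half-open state, a few test requests are allowed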
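State Reference
| State | Description | Behavior |
|---|---|---|
| CLOSED | Normal state | Requests pass through; failures are counted |
| OPEN | Tripped state | Requests are rejected immediately with a circuit breaker error |
| HALF_OPEN | Half-open state | A small number of requests are let through to test whether the service has recovered |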
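Core Parameters
| Parameter | Description | Default |
|---|---|---|
| failureThreshold | Failure threshold; the breaker opens once the failure count reaches it | 5 |
| recoveryTimeout | Recovery check interval (milliseconds) | 60000 |
| successThreshold | Success threshold; the breaker closes once the success count reaches it | 3 |
| timeout | Request timeout (milliseconds) | 30000 |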
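The state transitions and core parameters above can be sketched as a small state machine. This is an illustrative sketch only, not JAiRouter's actual implementation: the class name `CircuitBreakerSketch`, its method names, the injectable `clock`, and the choice to reset the failure count after a success in CLOSED are all assumptions made for the example.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitBreakerSketch:
    """Hypothetical three-state breaker mirroring the documented transitions."""

    def __init__(self, failure_threshold=5, recovery_timeout_ms=60000,
                 success_threshold=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_ms = recovery_timeout_ms
        self.success_threshold = success_threshold
        self._clock = clock  # injectable so the sketch can be exercised without sleeping
        self.state = State.CLOSED
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow_request(self):
        """Return True if a request may pass; moves OPEN -> HALF_OPEN when due."""
        if self.state is State.OPEN:
            elapsed_ms = (self._clock() - self._opened_at) * 1000
            if elapsed_ms < self.recovery_timeout_ms:
                return False  # still tripped: reject immediately
            self.state = State.HALF_OPEN
            self._successes = 0
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = State.CLOSED
                self._failures = 0
        elif self.state is State.CLOSED:
            self._failures = 0  # assumption: only consecutive failures are counted

    def record_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # any failure while half-open reopens the breaker
        elif self.state is State.CLOSED:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = State.OPEN
        self._opened_at = self._clock()
```

A caller would check `allow_request()` before forwarding a request and report the outcome with `record_success()` or `record_failure()` afterwards.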
Global Circuit Breaker Configuration
Basic Configuration
Configure the global circuit breaker in config/base/model-services-base.yml:
# config/base/model-services-base.yml
model:
  # Global circuit breaker configuration
  circuit-breaker:
    enabled: true
    failureThreshold: 5
    timeout: 60000
    successThreshold: 2
  # Global fallback configuration
  fallback:
    enabled: true
    strategy: default
  services:
    # Chat service configuration
    chat:
      load-balance:
        type: least-connections
      adapter: gpustack # Use the GPUStack adapter
      # Service-level circuit breaker configuration
      circuit-breaker:
        enabled: true
        failureThreshold: 5
        timeout: 60000
        successThreshold: 2
Advanced Configuration
model:
  circuit-breaker:
    enabled: true
    failure-threshold: 5
    recovery-timeout: 60000
    success-threshold: 3
    timeout: 30000
    # Statistics window configuration
    statistics:
      window-size: 100 # Statistics window size
      minimum-requests: 10 # Minimum number of requests
    # Exception configuration
    exceptions:
      include: # Included exception types
        - "java.net.ConnectException"
        - "java.net.SocketTimeoutException"
      exclude: # Excluded exception types
        - "java.lang.IllegalArgumentException"
    # Monitoring configuration
    monitoring:
      enabled: true # Enable monitoring
      metrics-interval: 10s # Metrics collection interval
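One plausible reading of the `exceptions.include` and `exceptions.exclude` lists is that they decide which exceptions count toward the failure threshold. The sketch below illustrates that reading only; the function name `counts_as_failure` and the precedence rules (exclude wins, an empty include list means everything counts) are assumptions, not documented JAiRouter behavior.

```python
def counts_as_failure(exception_class, include=(), exclude=()):
    """Decide whether an exception class name counts toward the failure threshold."""
    if exception_class in exclude:
        return False  # assumption: an excluded type never counts
    if include:
        return exception_class in include  # assumption: include acts as an allow-list
    return True  # assumption: with no include list, every exception counts


INCLUDE = ("java.net.ConnectException", "java.net.SocketTimeoutException")
EXCLUDE = ("java.lang.IllegalArgumentException",)

for name in ("java.net.ConnectException", "java.lang.IllegalArgumentException"):
    print(name, counts_as_failure(name, INCLUDE, EXCLUDE))
```

Under these assumptions a connection error trips the counter, while a bad-argument error caused by the caller does not.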
Service-Level Circuit Breaker Configuration
YAML Configuration
Configure the circuit breaker for each service type in config/base/model-services-base.yml:
model:
  services:
    chat:
      circuit-breaker:
        enabled: true
        failure-threshold: 3 # More sensitive failure threshold
        recovery-timeout: 30000 # Faster recovery check
        success-threshold: 2 # Faster recovery
        timeout: 45000 # Longer request timeout
      instances:
        - name: "unstable-model"
          base-url: "http://unstable-server:8080"
          path: "/v1/chat/completions"
          weight: 1
    embedding:
      circuit-breaker:
        enabled: true
        failure-threshold: 10 # More lenient failure threshold
        recovery-timeout: 120000 # Longer recovery time
        success-threshold: 5 # More successes required
        timeout: 60000
JSON Configuration
JAiRouter also supports updating the circuit breaker configuration through the dynamic configuration API:
{
  "services": {
    "chat": {
      "circuitBreaker": {
        "enabled": true,
        "failureThreshold": 5,
        "recoveryTimeout": 60000,
        "successThreshold": 3,
        "timeout": 30000
      },
      "instances": [
        {
          "name": "reliable-model",
          "baseUrl": "http://reliable-server:8080",
          "path": "/v1/chat/completions",
          "weight": 2
        }
      ]
    }
  }
}
Instance-Level Circuit Breaker Configuration
Per-Instance Circuit Breakers
Configure instance-level circuit breakers in config/base/model-services-base.yml:
model:
  services:
    chat:
      instances:
        - name: "high-risk-model"
          base-url: "http://high-risk-server:8080"
          circuit-breaker:
            enabled: true
            failure-threshold: 2 # Stricter threshold
            recovery-timeout: 30000
            success-threshold: 1
            timeout: 20000
        - name: "stable-model"
          base-url: "http://stable-server:8080"
          circuit-breaker:
            enabled: true
            failure-threshold: 10 # More lenient threshold
            recovery-timeout: 60000
            success-threshold: 3
            timeout: 30000
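With settings available at the global, service, and instance levels, it helps to picture how an effective configuration could be derived. The sketch below assumes that the more specific level overrides the more general one key by key; the source does not state the precedence rule, and the function name `resolve_breaker_config` is hypothetical.

```python
def resolve_breaker_config(global_cfg, service_cfg=None, instance_cfg=None):
    """Merge breaker settings; later (more specific) layers override earlier ones."""
    merged = dict(global_cfg)
    for layer in (service_cfg, instance_cfg):
        if layer:
            merged.update(layer)  # assumption: per-key override, not whole-block replacement
    return merged


global_cfg = {"enabled": True, "failure-threshold": 5,
              "recovery-timeout": 60000, "success-threshold": 3, "timeout": 30000}
service_cfg = {"failure-threshold": 3, "recovery-timeout": 30000}
instance_cfg = {"failure-threshold": 2, "timeout": 20000}

effective = resolve_breaker_config(global_cfg, service_cfg, instance_cfg)
print(effective)
```

Here the instance keeps the service's `recovery-timeout`, the global `success-threshold`, and its own `failure-threshold` and `timeout`.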
Instance-Specific Configuration
JAiRouter also supports updating instance-level circuit breaker configuration through the dynamic configuration API:
{
  "services": {
    "chat": {
      "instances": [
        {
          "name": "experimental-model",
          "baseUrl": "http://experimental-server:8080",
          "circuitBreaker": {
            "enabled": true,
            "failureThreshold": 3,
            "recoveryTimeout": 45000,
            "successThreshold": 2,
            "timeout": 25000,
            "statistics": {
              "windowSize": 50,
              "minimumRequests": 5
            }
          }
        }
      ]
    }
  }
}
Circuit Breakers and Fallback Strategies
Basic Fallback Configuration
Configure global and service-level fallback strategies in config/base/model-services-base.yml:
model:
  services:
    chat:
      circuit-breaker:
        enabled: true
        failure-threshold: 5
        recovery-timeout: 60000
      # Fallback strategy configuration
      fallback:
        enabled: true
        strategy: default # Fallback strategy: default or cache
        response:
          choices:
            - message:
                role: "assistant"
                content: "The service is temporarily unavailable. Please try again later."
              finish_reason: "stop"
          usage:
            prompt_tokens: 0
            completion_tokens: 0
            total_tokens: 0
Cache Fallback Configuration
model:
  services:
    embedding:
      circuit-breaker:
        enabled: true
        failure-threshold: 3
        recovery-timeout: 30000
      fallback:
        enabled: true
        strategy: cache
        cache:
          size: 1000 # Cache size
          ttl: 3600000 # Cache TTL (milliseconds)
          key-generator: "md5" # Cache key generation algorithm
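To make the `size`, `ttl`, and `key-generator` settings concrete, here is a minimal sketch of a cache that could back the `cache` fallback strategy. It is not JAiRouter's implementation: the class name `FallbackCacheSketch`, hashing the canonical JSON of the request body to form the key, and evicting the oldest entry when full are all assumptions.

```python
import hashlib
import json
import time
from collections import OrderedDict


class FallbackCacheSketch:
    """Hypothetical bounded cache with per-entry TTL and MD5 request keys."""

    def __init__(self, size=1000, ttl_ms=3600000, clock=time.monotonic):
        self.size = size
        self.ttl_ms = ttl_ms
        self._clock = clock  # injectable for testing
        self._entries = OrderedDict()  # key -> (stored_at_seconds, response)

    def _key(self, request):
        # Canonical JSON so that key order in the request does not change the hash.
        canonical = json.dumps(request, sort_keys=True, ensure_ascii=False)
        return hashlib.md5(canonical.encode("utf-8")).hexdigest()

    def put(self, request, response):
        key = self._key(request)
        self._entries[key] = (self._clock(), response)
        self._entries.move_to_end(key)
        while len(self._entries) > self.size:
            self._entries.popitem(last=False)  # evict the oldest entry

    def get(self, request):
        """Return the cached response, or None if missing or older than the TTL."""
        key = self._key(request)
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, response = item
        if (self._clock() - stored_at) * 1000 > self.ttl_ms:
            del self._entries[key]
            return None
        return response
```

In this sketch, successful responses would be stored with `put` while the breaker is closed, and `get` would serve them while it is open.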
JSON Fallback Configuration
JAiRouter also supports updating the fallback strategy through the dynamic configuration API:
{
  "services": {
    "chat": {
      "circuitBreaker": {
        "enabled": true,
        "failureThreshold": 5,
        "recoveryTimeout": 60000
      },
      "fallback": {
        "type": "default",
        "response": {
          "id": "fallback-response",
          "object": "chat.completion",
          "choices": [
            {
              "index": 0,
              "message": {
                "role": "assistant",
                "content": "Sorry, the service is temporarily unavailable. We are working to restore it. Please try again later."
              },
              "finish_reason": "stop"
            }
          ],
          "usage": {
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "total_tokens": 0
          }
        }
      }
    }
  }
}
Circuit Breaker Monitoring
Monitoring Metrics
# Circuit breaker state
curl "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.state"
# Circuit breaker call statistics
curl "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.calls"
# Circuit breaker failure statistics
curl "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.failures"
# Circuit breaker recovery statistics
curl "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.recovery"
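Prometheus Metrics
# Circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)
jairouter_circuitbreaker_state{service="chat",instance="model-1"}
# Total circuit breaker calls
jairouter_circuitbreaker_calls_total{service="chat",instance="model-1",result="success"}
# Circuit breaker failure ratio (both sides are aggregated by service and instance so the result label does not break vector matching)
sum by (service, instance) (rate(jairouter_circuitbreaker_calls_total{result="failure"}[5m])) / sum by (service, instance) (rate(jairouter_circuitbreaker_calls_total[5m]))
# Number of times the circuit breaker opened
jairouter_circuitbreaker_opened_total{service="chat",instance="model-1"}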
Monitoring Dashboard
# Grafana dashboard configuration
dashboard:
  panels:
    - title: "Circuit breaker state"
      type: "stat"
      targets:
        - expr: 'jairouter_circuitbreaker_state'
          legendFormat: '{{service}}-{{instance}}'
    - title: "Circuit breaker failure ratio"
      type: "graph"
      targets:
        - expr: 'sum by (service, instance) (rate(jairouter_circuitbreaker_calls_total{result="failure"}[5m])) / sum by (service, instance) (rate(jairouter_circuitbreaker_calls_total[5m]))'
          legendFormat: '{{service}}-{{instance}}'
    - title: "Circuit breaker open count"
      type: "graph"
      targets:
        - expr: 'increase(jairouter_circuitbreaker_opened_total[1h])'
          legendFormat: '{{service}}-{{instance}}'
Circuit Breaker Alerting
Prometheus Alert Rules
groups:
  - name: jairouter_circuitbreaker
    rules:
      - alert: CircuitBreakerOpen
        expr: jairouter_circuitbreaker_state == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker open"
          description: "The circuit breaker for service {{ $labels.service }} instance {{ $labels.instance }} is open"
      - alert: HighFailureRate
        expr: sum by (service, instance) (rate(jairouter_circuitbreaker_calls_total{result="failure"}[5m])) / sum by (service, instance) (rate(jairouter_circuitbreaker_calls_total[5m])) > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker failure ratio too high"
          description: "The failure ratio of service {{ $labels.service }} instance {{ $labels.instance }} exceeds 50%"
      - alert: FrequentCircuitBreakerOpening
        expr: increase(jairouter_circuitbreaker_opened_total[1h]) > 5
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker opening frequently"
          description: "The circuit breaker for service {{ $labels.service }} instance {{ $labels.instance }} opened more than 5 times in the past hour"
Alert Notification Configuration
# AlertManager configuration
route:
  group_by: ['alertname', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://your-webhook-url/alert'
        send_resolved: true
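Circuit Breaker Tuning
Parameter Tuning Guide
1. Tuning the Failure Threshold
# Sensitive service: trip quickly
sensitive-service:
  circuit-breaker:
    failure-threshold: 3 # Trips after 3 failures
# Stable service: trip leniently
stable-service:
  circuit-breaker:
    failure-threshold: 10 # Trips only after 10 failures
# Experimental service: extremely sensitive
experimental-service:
  circuit-breaker:
    failure-threshold: 1 # Trips after a single failure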
2. Tuning the Recovery Timeout
# Fast-recovering service
fast-recovery:
  circuit-breaker:
    recovery-timeout: 30000 # Try to recover after 30 seconds
# Slow-recovering service
slow-recovery:
  circuit-breaker:
    recovery-timeout: 300000 # Try to recover after 5 minutes
# Adaptive recovery
adaptive-recovery:
  circuit-breaker:
    recovery-timeout: 60000
    adaptive-recovery:
      enabled: true
      min-timeout: 30000
      max-timeout: 600000
      backoff-multiplier: 2
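The adaptive recovery settings suggest an exponential backoff on the recovery timeout. The following sketch shows one way the four values could combine; the function name, the `consecutive_opens` counter, and the exact formula (base timeout multiplied once per repeated trip, then clamped to the min/max bounds) are assumptions rather than documented behavior.

```python
def next_recovery_timeout_ms(consecutive_opens, base_ms=60000, min_ms=30000,
                             max_ms=600000, multiplier=2):
    """Return the wait before the next HALF_OPEN probe, in milliseconds.

    consecutive_opens: hypothetical count of trips without a full recovery (1 = first trip).
    """
    exponent = max(consecutive_opens - 1, 0)
    raw = base_ms * multiplier ** exponent
    return int(min(max(raw, min_ms), max_ms))  # clamp to [min_ms, max_ms]


for opens in range(1, 6):
    print(opens, next_recovery_timeout_ms(opens))
```

With the values from the configuration above, the wait doubles from 60000 ms on the first trip until it is capped at 600000 ms on the fifth.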
3. Tuning the Success Threshold
# Strict recovery: more successes required
strict-recovery:
  circuit-breaker:
    success-threshold: 5
# Quick recovery: fewer successes suffice
quick-recovery:
  circuit-breaker:
    success-threshold: 1
# Standard recovery
standard-recovery:
  circuit-breaker:
    success-threshold: 3
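Performance Optimization
1. Statistics Window Optimization
model:
  circuit-breaker:
    statistics:
      window-size: 100 # Statistics window size
      minimum-requests: 10 # Minimum number of requests
      sliding-window: true # Use a sliding window
    # Memory optimization
    memory:
      max-statistics: 1000 # Maximum number of statistics records
      cleanup-interval: 300s # Cleanup interval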
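As an illustration of how `window-size`, `minimum-requests`, and `sliding-window` might interact, the sketch below keeps only the most recent outcomes and refuses to trip until enough requests have been observed. The class name `SlidingWindowStats`, the count-based window, and the trip rule are assumptions; the source does not describe the exact semantics.

```python
from collections import deque


class SlidingWindowStats:
    """Hypothetical count-based sliding window over recent request outcomes."""

    def __init__(self, window_size=100, minimum_requests=10):
        self.minimum_requests = minimum_requests
        # deque(maxlen=...) drops the oldest outcome automatically when full
        self._outcomes = deque(maxlen=window_size)

    def record(self, failed):
        self._outcomes.append(bool(failed))

    def failure_count(self):
        return sum(self._outcomes)

    def should_trip(self, failure_threshold):
        # assumption: too few samples means the breaker never trips
        if len(self._outcomes) < self.minimum_requests:
            return False
        return self.failure_count() >= failure_threshold
```

The minimum-request guard keeps a breaker from opening on the first few calls after startup, when a single failure would otherwise dominate the statistics.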
2. Concurrency Optimization
model:
  circuit-breaker:
    concurrency:
      thread-safe: true # Thread safety
      lock-free: true # Lock-free implementation
      batch-processing: true # Batch processing
Real-World Use Cases
Case 1: Microservice Protection
# Protect critical microservices
model:
  services:
    user-service:
      circuit-breaker:
        enabled: true
        failure-threshold: 5
        recovery-timeout: 60000
        success-threshold: 3
      fallback:
        enabled: true
        strategy: cache
        cache:
          size: 1000
          ttl: 300000
    order-service:
      circuit-breaker:
        enabled: true
        failure-threshold: 3 # More sensitive
        recovery-timeout: 30000 # Faster recovery
        success-threshold: 2
      fallback:
        enabled: true
        strategy: default
Case 2: External API Protection
# Protect calls to an external API
model:
  services:
    external-ai-api:
      circuit-breaker:
        enabled: true
        failure-threshold: 3
        recovery-timeout: 120000 # External services recover more slowly
        success-threshold: 5 # More successes needed to confirm recovery
        timeout: 60000 # Longer request timeout
      fallback:
        enabled: true
        strategy: default
        response:
          error:
            code: "service_unavailable"
            message: "The external AI service is temporarily unavailable"
Case 3: Layered Circuit Breaker Protection
# Multi-layer circuit breaker protection
model:
  # Global circuit breaker
  circuit-breaker:
    enabled: true
    failure-threshold: 20
    recovery-timeout: 300000
  services:
    # Service-level circuit breaker
    critical-service:
      circuit-breaker:
        enabled: true
        failure-threshold: 10
        recovery-timeout: 120000
      instances:
        # Instance-level circuit breakers
        - name: "primary-instance"
          circuit-breaker:
            enabled: true
            failure-threshold: 5
            recovery-timeout: 60000
        - name: "backup-instance"
          circuit-breaker:
            enabled: true
            failure-threshold: 3 # The backup instance is more sensitive
            recovery-timeout: 30000
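Troubleshooting
Common Issues
1. The Circuit Breaker Is Too Sensitive
Symptom: the service trips frequently, disrupting normal use
Diagnosis:
# Check the failure rate
curl "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.calls"
# Check how often the circuit breaker opens
curl "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.opened"
Solution:
# Adjust the parameters
circuit-breaker:
  failure-threshold: 10 # Raise the failure threshold
  recovery-timeout: 30000 # Shorten the recovery timeout
  success-threshold: 2 # Lower the success threshold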
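2. The Circuit Breaker Does Not Take Effect
Symptom: the circuit breaker does not open when the service fails
Diagnosis:
# Check the circuit breaker configuration (the key contains a hyphen, so jq needs bracket syntax;
# adjust the path to match the structure your configprops endpoint actually returns)
curl "http://localhost:8080/actuator/configprops" | jq '.model["circuit-breaker"]'
# Check the circuit breaker state
curl "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.state"
Solution: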
3. The Fallback Response Is Incorrect
Symptom: the fallback response returned while the breaker is open has the wrong format
Diagnosis:
# Test the fallback response
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "unavailable-model", "messages": [{"role": "user", "content": "test"}]}'
Solution:
# Correct the fallback response format
fallback:
  strategy: default
  response:
    id: "fallback-response"
    object: "chat.completion"
    choices:
      - index: 0
        message:
          role: "assistant"
          content: "The service is temporarily unavailable"
        finish_reason: "stop"
Debugging Tools
1. Circuit Breaker State Check
#!/bin/bash
# Circuit breaker state check script
echo "=== Circuit breaker state ==="
curl -s "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.state" | jq .
echo "=== Circuit breaker call statistics ==="
curl -s "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.calls" | jq .
echo "=== Circuit breaker failure statistics ==="
curl -s "http://localhost:8080/actuator/metrics/jairouter.circuitbreaker.failures" | jq .
2. Circuit Breaker Test Tool
#!/bin/bash
# Circuit breaker test script
SERVICE_URL="http://localhost:8080/v1/chat/completions"
TEST_PAYLOAD='{"model": "test-model", "messages": [{"role": "user", "content": "test"}]}'
echo "Starting circuit breaker test..."
for i in {1..20}; do
  # Discard the response body and capture only the HTTP status code
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" -X POST "$SERVICE_URL" \
    -H "Content-Type: application/json" \
    -d "$TEST_PAYLOAD")
  echo "Request $i: HTTP $HTTP_CODE"
  if [ "$HTTP_CODE" = "503" ]; then
    echo "Circuit breaker is open"
    break
  fi
  sleep 1
done
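The same probing loop can be written so that the request function is injectable, which makes it usable in automated checks without a running router. This is a sketch: `post_status` and `probe_until_open` are hypothetical helper names, and treating HTTP 503 as "breaker open" simply follows the shell script above.

```python
import json
import urllib.error
import urllib.request


def post_status(url, payload, timeout_s=10):
    """POST a JSON payload and return the HTTP status code, including 4xx/5xx."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on error statuses; the code is what we want


def probe_until_open(send, max_requests=20):
    """Call send() up to max_requests times; return the attempt number that got 503, else None."""
    for attempt in range(1, max_requests + 1):
        if send() == 503:
            return attempt
    return None
```

Against a live router it could be invoked as `probe_until_open(lambda: post_status("http://localhost:8080/v1/chat/completions", {"model": "test-model", "messages": [{"role": "user", "content": "test"}]}))`.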
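Best Practices
1. Circuit Breaker Configuration Strategy
# Configure different circuit breaker strategies according to service characteristics
# Critical services: trip quickly, recover quickly
critical-services:
  circuit-breaker:
    failure-threshold: 3
    recovery-timeout: 30000
    success-threshold: 2
# Ordinary services: standard configuration
normal-services:
  circuit-breaker:
    failure-threshold: 5
    recovery-timeout: 60000
    success-threshold: 3
# Experimental services: extremely sensitive
experimental-services:
  circuit-breaker:
    failure-threshold: 1
    recovery-timeout: 120000
    success-threshold: 5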
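2. Monitoring and Alerting
- Monitor circuit breaker state
- Configure failure-rate alerts
- Monitor how often circuit breakers open
- Monitor fallback responses
3. Testing and Validation
- Test the circuit breaker regularly
- Verify that fallback responses are correct
- Test the recovery mechanism
- Verify monitoring and alerting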
Next Steps
After configuring the circuit breaker, you can continue with: