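Monitoring API¶
Document version: 1.0.0
Last updated: 2025-08-19
Git commit: latest
Author: Lincoln
JAiRouter provides a comprehensive monitoring API for health checks, metrics collection, and system status monitoring.
Overview¶
The monitoring API covers:
- Health checks - health status of services and instances
- Metrics - performance and usage statistics
- System status - overall system health and configuration
All monitoring endpoints are served under the /actuator/* path and give real-time insight into your JAiRouter deployment.
Health Check Endpoints¶
System Health¶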
Get the overall system health status from /actuator/health:
Response:
{
  "status": "UP",
  "components": {
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 499963174912,
        "free": 91943821312,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    },
    "modelRouter": {
      "status": "UP",
      "details": {
        "activeInstances": 5,
        "totalInstances": 8,
        "circuitBreakerStatus": "CLOSED"
      }
    }
  }
}
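A minimal sketch of how an external probe might reduce this payload to a few decisions. It assumes the response has exactly the shape shown above; the function name `summarize_health` and the returned keys are illustrative and not part of JAiRouter.

```python
def summarize_health(payload: dict) -> dict:
    """Reduce a health payload to the fields a probe usually acts on."""
    components = payload.get("components", {})
    router = components.get("modelRouter", {}).get("details", {})
    total = router.get("totalInstances", 0)
    active = router.get("activeInstances", 0)
    return {
        # Overall status as reported by the top-level field
        "up": payload.get("status") == "UP",
        # Names of components that report anything other than UP
        "down_components": sorted(
            name for name, comp in components.items() if comp.get("status") != "UP"
        ),
        # Share of model instances currently active, or None if none are configured
        "active_ratio": active / total if total else None,
    }


# Sample values copied from the response above
sample = {
    "status": "UP",
    "components": {
        "diskSpace": {"status": "UP"},
        "ping": {"status": "UP"},
        "modelRouter": {
            "status": "UP",
            "details": {
                "activeInstances": 5,
                "totalInstances": 8,
                "circuitBreakerStatus": "CLOSED",
            },
        },
    },
}

print(summarize_health(sample))
```

Note that the overall status can be UP while only 5 of 8 instances are active, so a probe may want to look at the ratio as well as the status field.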
Detailed Health Information¶
Get detailed health information covering all components:
Response:
{
  "status": "UP",
  "components": {
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 499963174912,
        "free": 91943821312,
        "threshold": 10485760,
        "exists": true
      }
    },
    "modelRouter": {
      "status": "UP",
      "details": {
        "services": {
          "chat": {
            "totalInstances": 3,
            "healthyInstances": 2,
            "instances": [
              {
                "id": "ollama-1",
                "url": "http://localhost:11434",
                "status": "UP",
                "lastCheck": "2025-08-19T10:30:00Z",
                "responseTime": 45
              },
              {
                "id": "ollama-2", 
                "url": "http://localhost:11435",
                "status": "UP",
                "lastCheck": "2025-08-19T10:30:00Z",
                "responseTime": 52
              },
              {
                "id": "ollama-3",
                "url": "http://localhost:11436",
                "status": "DOWN",
                "lastCheck": "2025-08-19T10:29:45Z",
                "error": "Connection timed out"
              }
            ]
          },
          "embedding": {
            "totalInstances": 2,
            "healthyInstances": 2,
            "instances": [
              {
                "id": "xinference-1",
                "url": "http://localhost:9997",
                "status": "UP",
                "lastCheck": "2025-08-19T10:30:00Z",
                "responseTime": 38
              },
              {
                "id": "xinference-2",
                "url": "http://localhost:9998",
                "status": "UP", 
                "lastCheck": "2025-08-19T10:30:00Z",
                "responseTime": 41
              }
            ]
          }
        }
      }
    }
  }
}
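The per-instance details above can be flattened into a list of failing instances. The sketch below assumes the nested `components.modelRouter.details.services` layout from the sample response; `unhealthy_instances` is a hypothetical helper name.

```python
def unhealthy_instances(payload: dict) -> list:
    """Return (service, instance id, error) for every instance that is not UP."""
    services = (
        payload.get("components", {})
        .get("modelRouter", {})
        .get("details", {})
        .get("services", {})
    )
    failing = []
    for service_name, service in services.items():
        for instance in service.get("instances", []):
            if instance.get("status") != "UP":
                failing.append(
                    (service_name, instance["id"], instance.get("error", ""))
                )
    return failing


# Abbreviated copy of the sample response (one healthy and one failing instance)
sample = {
    "status": "UP",
    "components": {
        "modelRouter": {
            "status": "UP",
            "details": {
                "services": {
                    "chat": {
                        "totalInstances": 3,
                        "healthyInstances": 2,
                        "instances": [
                            {"id": "ollama-1", "status": "UP", "responseTime": 45},
                            {
                                "id": "ollama-3",
                                "status": "DOWN",
                                "error": "Connection timed out",
                            },
                        ],
                    }
                }
            },
        }
    },
}

for service, instance_id, error in unhealthy_instances(sample):
    print(f"{service}/{instance_id}: {error}")
```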
Metrics Endpoints¶
Application Metrics¶
Get metrics in Prometheus format from /actuator/prometheus:
Response:
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 2.38026752E8
jvm_memory_used_bytes{area="heap",id="PS Survivor Space",} 1048576.0
jvm_memory_used_bytes{area="heap",id="PS Old Gen",} 4.2991616E7
# HELP model_router_requests_total Total number of requests
# TYPE model_router_requests_total counter
model_router_requests_total{service="chat",instance="ollama-1",status="success",} 1247.0
model_router_requests_total{service="chat",instance="ollama-1",status="error",} 23.0
model_router_requests_total{service="chat",instance="ollama-2",status="success",} 1156.0
model_router_requests_total{service="chat",instance="ollama-2",status="error",} 18.0
# HELP model_router_request_duration_seconds Request duration in seconds
# TYPE model_router_request_duration_seconds histogram
model_router_request_duration_seconds_bucket{service="chat",instance="ollama-1",le="0.1",} 234.0
model_router_request_duration_seconds_bucket{service="chat",instance="ollama-1",le="0.5",} 892.0
model_router_request_duration_seconds_bucket{service="chat",instance="ollama-1",le="1.0",} 1156.0
model_router_request_duration_seconds_bucket{service="chat",instance="ollama-1",le="+Inf",} 1270.0
# HELP model_router_circuit_breaker_state Circuit breaker state (0=closed, 1=open, 2=half-open)
# TYPE model_router_circuit_breaker_state gauge
model_router_circuit_breaker_state{service="chat",instance="ollama-1",} 0.0
model_router_circuit_breaker_state{service="chat",instance="ollama-2",} 0.0
# HELP model_router_rate_limit_remaining Remaining requests before the rate limit applies
# TYPE model_router_rate_limit_remaining gauge
model_router_rate_limit_remaining{service="chat",client_ip="192.168.1.100",} 45.0
model_router_rate_limit_remaining{service="chat",client_ip="192.168.1.101",} 38.0
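For ad-hoc checks without a Prometheus server, the text output can be parsed directly. The following is a rough sketch, not a full parser for the exposition format: it handles `name{labels} value` lines like the ones above (including the trailing comma inside the braces) and ignores optional timestamps. The helper names are illustrative.

```python
import re
from collections import defaultdict

# name, optional {labels}, then a single value
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric name, labels dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        yield name, dict(LABEL.findall(raw_labels or "")), float(value)


def error_rates(text, metric="model_router_requests_total"):
    """Return {(service, instance): errors / all requests} for one counter."""
    totals = defaultdict(float)
    errors = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name != metric:
            continue
        key = (labels.get("service"), labels.get("instance"))
        totals[key] += value
        if labels.get("status") == "error":
            errors[key] += value
    return {key: errors[key] / totals[key] for key in totals if totals[key]}


# Counter lines copied from the sample output above
SAMPLE_METRICS = """
# TYPE model_router_requests_total counter
model_router_requests_total{service="chat",instance="ollama-1",status="success",} 1247.0
model_router_requests_total{service="chat",instance="ollama-1",status="error",} 23.0
model_router_requests_total{service="chat",instance="ollama-2",status="success",} 1156.0
model_router_requests_total{service="chat",instance="ollama-2",status="error",} 18.0
"""

for (service, instance), rate in sorted(error_rates(SAMPLE_METRICS).items()):
    print(f"{service}/{instance}: {rate:.2%}")
```

These are lifetime ratios computed from raw counters; for rates over a time window, a Prometheus `rate()` query is the more usual tool.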
Metrics Summary¶
Get a human-readable summary of the available metrics:
Response:
{
  "names": [
    "jvm.memory.used",
    "jvm.memory.max", 
    "jvm.gc.pause",
    "http.server.requests",
    "model.router.requests.total",
    "model.router.request.duration",
    "model.router.circuit.breaker.state",
    "model.router.rate.limit.remaining",
    "model.router.load.balancer.weight"
  ]
}
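Specific Metric Details¶
Get detailed information about a specific metric:
Response:
{
  "name": "model.router.requests.total",
  "description": "Total number of requests handled by the model router",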
  "baseUnit": null,
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 2647.0
    }
  ],
  "availableTags": [
    {
      "tag": "service",
      "values": ["chat", "embedding", "rerank", "tts", "stt", "image"]
    },
    {
      "tag": "instance", 
      "values": ["ollama-1", "ollama-2", "xinference-1", "xinference-2"]
    },
    {
      "tag": "status",
      "values": ["success", "error", "timeout", "circuit_breaker"]
    }
  ]
}
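The `availableTags` list suggests that a metric can be narrowed by tag. The sketch below builds such a request URL under two assumptions that the text above does not confirm: that the metric is served at `/actuator/metrics/{name}`, and that filters use the Spring Boot Actuator convention of repeated `tag=KEY:VALUE` query parameters. The function name `metric_url` is illustrative.

```python
from urllib.parse import quote, urlencode


def metric_url(base_url, metric_name, tags=None):
    """Build a metric detail URL, optionally filtered by tag values."""
    # Assumed path layout: <base>/actuator/metrics/<metric name>
    path = f"{base_url.rstrip('/')}/actuator/metrics/{quote(metric_name, safe='')}"
    if not tags:
        return path
    # Assumed filter syntax: one tag=KEY:VALUE parameter per tag
    query = urlencode([("tag", f"{key}:{value}") for key, value in tags.items()], safe=":")
    return f"{path}?{query}"


print(metric_url("http://localhost:8080", "model.router.requests.total"))
print(
    metric_url(
        "http://localhost:8080",
        "model.router.requests.total",
        {"service": "chat", "status": "error"},
    )
)
```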
System Information¶
Application Information¶
Get application information:
Response:
{
  "app": {
    "name": "JAiRouter",
    "version": "1.0.0",
    "description": "AI model service router and load balancer"
  },
  "build": {
    "version": "1.0.0",
    "artifact": "model-router",
    "name": "model-router",
    "group": "org.unreal",
    "time": "2025-08-19T08:15:30.123Z"
  },
  "git": {
    "branch": "main",
    "commit": {
      "id": "3418d3f6",
      "time": "2025-08-19T08:00:00Z"
    }
  },
  "java": {
    "version": "17.0.8",
    "vendor": "Eclipse Adoptium"
  }
}
Environment Information¶
Get environment and configuration details:
Response:
{
  "activeProfiles": ["default"],
  "propertySources": [
    {
      "name": "server.ports",
      "properties": {
        "local.server.port": {
          "value": 8080
        }
      }
    },
    {
      "name": "applicationConfig: [classpath:/application.yml]",
      "properties": {
        "model-router.load-balancer.default-strategy": {
          "value": "ROUND_ROBIN"
        },
        "model-router.rate-limit.default-algorithm": {
          "value": "TOKEN_BUCKET"
        },
        "model-router.circuit-breaker.failure-threshold": {
          "value": 5
        }
      }
    }
  ]
}
Custom Monitoring Endpoints¶
Service Instance Status¶
Get the status of all service instances:
Response:
{
  "services": {
    "chat": {
      "instances": [
        {
          "id": "ollama-1",
          "url": "http://localhost:11434",
          "adapter": "OLLAMA",
          "status": "HEALTHY",
          "lastHealthCheck": "2025-08-19T10:30:00Z",
          "responseTime": 45,
          "successRate": 0.982,
          "requestCount": 1270,
          "errorCount": 23,
          "circuitBreakerState": "CLOSED",
          "weight": 1.0
        },
        {
          "id": "ollama-2", 
          "url": "http://localhost:11435",
          "adapter": "OLLAMA",
          "status": "HEALTHY",
          "lastHealthCheck": "2025-08-19T10:30:00Z", 
          "responseTime": 52,
          "successRate": 0.985,
          "requestCount": 1174,
          "errorCount": 18,
          "circuitBreakerState": "CLOSED",
          "weight": 1.0
        },
        {
          "id": "ollama-3",
          "url": "http://localhost:11436",
          "adapter": "OLLAMA",
          "status": "UNHEALTHY",
          "lastHealthCheck": "2025-08-19T10:29:45Z",
          "error": "Connection timed out",
          "circuitBreakerState": "OPEN",
          "weight": 0.0
        }
      ]
    }
  }
}
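The `successRate` values in the sample are consistent with `1 - errorCount / requestCount`, rounded to three decimals. This is an inference from the numbers above rather than a documented formula, so treat the sketch as an assumption.

```python
def success_rate(request_count, error_count):
    """Share of requests that did not fail, rounded to three decimals."""
    if request_count <= 0:
        return None  # no traffic yet, so no meaningful rate
    return round(1 - error_count / request_count, 3)


# requestCount and errorCount copied from the sample response above
print(success_rate(1270, 23))  # ollama-1
print(success_rate(1174, 18))  # ollama-2
```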
Load Balancer Statistics¶
Get load balancer performance statistics:
Response:
{
  "services": {
    "chat": {
      "strategy": "ROUND_ROBIN",
      "totalRequests": 2444,
      "distribution": {
        "ollama-1": {
          "requests": 1270,
          "percentage": 52.0,
          "avgResponseTime": 45
        },
        "ollama-2": {
          "requests": 1174,
          "percentage": 48.0,
          "avgResponseTime": 52
        }
      }
    },
    "embedding": {
      "strategy": "LEAST_CONNECTIONS", 
      "totalRequests": 856,
      "distribution": {
        "xinference-1": {
          "requests": 428,
          "percentage": 50.0,
          "avgResponseTime": 38
        },
        "xinference-2": {
          "requests": 428,
          "percentage": 50.0,
          "avgResponseTime": 41
        }
      }
    }
  }
}
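The `percentage` fields appear to be each instance's share of `totalRequests`, rounded to one decimal. A small sketch that reproduces them from the request counts, which can help when checking whether a strategy is distributing traffic as expected (the function name is illustrative):

```python
def distribution_percentages(requests_by_instance):
    """Return each instance's share of total requests as a percentage."""
    total = sum(requests_by_instance.values())
    if total == 0:
        return {name: 0.0 for name in requests_by_instance}
    return {
        name: round(100 * count / total, 1)
        for name, count in requests_by_instance.items()
    }


# Request counts copied from the sample response above
print(distribution_percentages({"ollama-1": 1270, "ollama-2": 1174}))
print(distribution_percentages({"xinference-1": 428, "xinference-2": 428}))
```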
Rate Limit Status¶
Get the current rate limit status:
Response:
{
  "services": {
    "chat": {
      "algorithm": "TOKEN_BUCKET",
      "globalLimit": {
        "capacity": 1000,
        "remaining": 847,
        "refillRate": 100,
        "nextRefill": "2025-08-19T10:30:10Z"
      },
      "clientLimits": [
        {
          "clientIp": "192.168.1.100",
          "remaining": 45,
          "capacity": 50,
          "lastRequest": "2025-08-19T10:29:58Z"
        },
        {
          "clientIp": "192.168.1.101", 
          "remaining": 38,
          "capacity": 50,
          "lastRequest": "2025-08-19T10:29:59Z"
        }
      ]
    }
  }
}
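One way to read this payload is as bucket utilization, that is, the fraction of capacity already consumed. The sketch below assumes `capacity` and `remaining` are token counts as shown above; the helper names and the 20% threshold are made up for illustration.

```python
def utilization(capacity, remaining):
    """Fraction of the bucket that has been consumed (0.0 to 1.0)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (capacity - remaining) / capacity


def busy_clients(service_status, threshold=0.2):
    """Return client IPs whose bucket utilization is at or above the threshold."""
    return [
        client["clientIp"]
        for client in service_status.get("clientLimits", [])
        if utilization(client["capacity"], client["remaining"]) >= threshold
    ]


# Values copied from the "chat" entry in the sample response above
chat = {
    "algorithm": "TOKEN_BUCKET",
    "globalLimit": {"capacity": 1000, "remaining": 847, "refillRate": 100},
    "clientLimits": [
        {"clientIp": "192.168.1.100", "remaining": 45, "capacity": 50},
        {"clientIp": "192.168.1.101", "remaining": 38, "capacity": 50},
    ],
}

global_limit = chat["globalLimit"]
print(utilization(global_limit["capacity"], global_limit["remaining"]))
print(busy_clients(chat))
```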
Circuit Breaker Status¶
Get the circuit breaker status of all instances:
Response:
{
  "instances": [
    {
      "id": "ollama-1",
      "service": "chat",
      "state": "CLOSED",
      "failureCount": 2,
      "failureThreshold": 5,
      "successThreshold": 3,
      "timeout": 60000,
      "lastFailure": "2025-08-19T10:25:30Z",
      "nextRetry": null
    },
    {
      "id": "ollama-2",
      "service": "chat", 
      "state": "CLOSED",
      "failureCount": 1,
      "failureThreshold": 5,
      "successThreshold": 3,
      "timeout": 60000,
      "lastFailure": "2025-08-19T10:20:15Z",
      "nextRetry": null
    },
    {
      "id": "ollama-3",
      "service": "chat",
      "state": "OPEN",
      "failureCount": 8,
      "failureThreshold": 5,
      "successThreshold": 3,
      "timeout": 60000,
      "lastFailure": "2025-08-19T10:29:45Z",
      "nextRetry": "2025-08-19T10:30:45Z"
    }
  ]
}
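In the sample, the open breaker's `nextRetry` equals `lastFailure` plus `timeout`, with `timeout` read as milliseconds. Assuming that relationship holds in general (the text above does not state it), the retry time can be recomputed on the client side as a consistency check:

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def next_retry(entry):
    """Return the expected retry time for an OPEN breaker, else None."""
    if entry["state"] != "OPEN":
        return None
    last_failure = datetime.strptime(entry["lastFailure"], TIMESTAMP_FORMAT).replace(
        tzinfo=timezone.utc
    )
    # Assumption: "timeout" is expressed in milliseconds
    retry_at = last_failure + timedelta(milliseconds=entry["timeout"])
    return retry_at.strftime(TIMESTAMP_FORMAT)


# Entries copied from the sample response above
open_breaker = {
    "id": "ollama-3",
    "state": "OPEN",
    "timeout": 60000,
    "lastFailure": "2025-08-19T10:29:45Z",
}
closed_breaker = {
    "id": "ollama-1",
    "state": "CLOSED",
    "timeout": 60000,
    "lastFailure": "2025-08-19T10:25:30Z",
}

print(next_retry(open_breaker))
print(next_retry(closed_breaker))
```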
Monitoring Integration¶
Prometheus Integration¶
JAiRouter exposes Prometheus-format metrics at the /actuator/prometheus endpoint. Configure Prometheus to scrape them:
# prometheus.yml
scrape_configs:
  - job_name: 'jairouter'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
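Grafana Dashboard¶
Import the JAiRouter Grafana dashboard for visualization:
- Dashboard ID: coming soon
- Metrics: request rate, response time, error rate, circuit breaker state
- Alerts: high error rate, circuit breaker open, instance down
Health Check Integration¶
Configure external monitoring tools to check JAiRouter health:
# Simple health check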
curl -f http://localhost:8080/actuator/health || exit 1
# Detailed health check for a specific component
curl -f http://localhost:8080/actuator/health/modelRouter || exit 1
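Monitoring Best Practices¶
Key Metrics to Monitor¶
- Request metrics
  - Request rate (requests per second)
  - Response time (p50, p95, p99)
  - Error rate (percentage)
- Instance health
  - Instance availability
  - Health check response time
  - Circuit breaker state
- Resource usage
  - JVM memory usage
  - CPU utilization
  - Disk space
- Rate limiting
  - Rate limit utilization
  - Rejected requests
  - Per-client limits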
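Alerting Rules¶
Set up alerts for critical conditions:
# Prometheus alerting rules
groups:
  - name: jairouter
    rules:
      - alert: HighErrorRate
        # Aggregate both sides so error requests are divided by all requests, not only by the error series itself
        expr: sum(rate(model_router_requests_total{status="error"}[5m])) / sum(rate(model_router_requests_total[5m])) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
      - alert: CircuitBreakerOpen
        # Fires for both open (1) and half-open (2) states
        expr: model_router_circuit_breaker_state > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker is open"
      - alert: InstanceDown
        expr: up{job="jairouter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "JAiRouter instance is down"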
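The high-error-rate rule compares error traffic with total traffic over a five-minute window and fires above 10%. The sketch below reproduces that arithmetic from two counter snapshots so a threshold can be sanity-checked by hand. The "before" values are invented for illustration, and counter resets are not handled.

```python
def error_ratio(before, after):
    """Ratio of error requests to all requests between two counter snapshots.

    Both arguments map a status label to the cumulative counter value.
    """
    deltas = {
        status: after.get(status, 0.0) - before.get(status, 0.0) for status in after
    }
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no traffic in the window
    return deltas.get("error", 0.0) / total


# Hypothetical snapshot taken five minutes earlier
before = {"success": 1200.0, "error": 20.0}
# Counter values for ollama-1 from the sample metrics output
after = {"success": 1247.0, "error": 23.0}

ratio = error_ratio(before, after)
print(f"error ratio: {ratio:.2%}, alert would fire: {ratio > 0.1}")
```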
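Log Monitoring¶
Monitor application logs for:
- Error patterns
- Performance issues
- Configuration changes
- Security events
Troubleshooting¶
Common Issues¶
- High response time
  - Check instance health status
  - Review load balancer distribution
  - Monitor resource usage
- Circuit breaker open
  - Check instance connectivity
  - Review error logs
  - Verify instance configuration
- Rate limit exceeded
  - Review rate limit configuration
  - Check client request patterns
  - Consider raising the limits
Debug Endpoints¶
Enable debug logging for detailed monitoring output:
Access debug information: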
Security Considerations¶
Securing Monitoring Endpoints¶
Protect monitoring endpoints in production:
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  endpoint:
    health:
      show-details: when-authorized
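  # Note: management.security.enabled is a Spring Boot 1.x property. It was removed in
  # Spring Boot 2.0, where actuator endpoints are secured through Spring Security instead.
  security:
    enabled: true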
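Sensitive Information¶
Avoid exposing sensitive data in metrics:
- API keys
- Internal URLs
- User information
- Configuration secrets
Next Steps¶
- Management API - configuration management
- Unified API - OpenAI-compatible endpoints
- OpenAPI Specification - interactive API documentation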