Monitoring Configuration Reference

Document version: 1.0.0
Last updated: 2025-08-19
Git commit: c1aa5b0f
Author: Lincoln

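This document is the complete configuration reference for the JAiRouter monitoring system. It describes every configuration option, its default value, and example usage.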

Configuration File Structure

Main Configuration File

JAiRouter's monitoring configuration is defined primarily in application.yml:

# Monitoring configuration
monitoring:
  metrics:
    # Basic settings
    enabled: true
    prefix: "jairouter"
    collection-interval: 10s

    # Metric categories
    enabled-categories:
      - system
      - business
      - infrastructure

    # Custom tags
    custom-tags:
      environment: "${spring.profiles.active:default}"
      version: "@project.version@"

    # Sampling
    sampling:
      request-metrics: 1.0
      backend-metrics: 1.0
      infrastructure-metrics: 1.0

    # Performance
    performance:
      async-processing: true
      batch-size: 500
      buffer-size: 2000

    # Memory
    memory:
      cache-size: 10000
      cache-expiry: 5m

    # Security
    security:
      data-masking: false
      mask-labels: []

# Spring Actuator configuration
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
      base-path: /actuator

  endpoint:
    health:
      show-details: always
    prometheus:
      cache:
        time-to-live: 10s

  metrics:
    export:
      prometheus:
        enabled: true
        descriptions: true
        step: 10s

Basic Settings

monitoring.metrics.enabled

Type: Boolean
Default: true
Description: Whether metric collection is enabled.

monitoring:
  metrics:
    enabled: true  # enable monitoring
    # enabled: false  # disable monitoring

Environment variable: MONITORING_METRICS_ENABLED
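
The environment variable name follows Spring Boot's relaxed binding rule for canonical property names: replace each `.` with `_`, remove each `-`, and convert the result to upper case. The sketch below applies that rule so the variable name for any other option can be derived. It assumes JAiRouter binds these options through Spring's standard mechanism; the class and method names are illustrative and not part of JAiRouter.

```java
import java.util.Locale;

// Illustrative helper, not part of JAiRouter.
final class EnvVarNames {
    private EnvVarNames() {}

    // Spring Boot's documented mapping from a canonical property name to an
    // environment variable: '.' becomes '_', '-' is removed, then upper-case.
    static String toEnvVar(String property) {
        return property.replace('.', '_')
                       .replace("-", "")
                       .toUpperCase(Locale.ROOT);
    }
}
```

Under this rule, `monitoring.metrics.collection-interval` would be set through `MONITORING_METRICS_COLLECTIONINTERVAL`.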

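monitoring.metrics.prefix

Type: String
Default: "jairouter"
Description: Prefix for metric names, used to tell the metrics of different applications apart.

monitoring:
  metrics:
    prefix: "jairouter"        # default prefix
    # prefix: "my-app"         # custom prefix
    # prefix: ""               # no prefix

monitoring.metrics.collection-interval

Type: Duration
Default: 10s
Description: Interval between metric collections.

monitoring:
  metrics:
    collection-interval: 10s   # 10 seconds
    # collection-interval: 5s  # 5 seconds (more frequent)
    # collection-interval: 30s # 30 seconds (less frequent)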
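
Duration values such as 10s and 5m use a number followed by a unit suffix. The sketch below parses a subset of that shorthand into a java.time.Duration, which can help when checking values outside the application. It assumes the suffixes ns, us, ms, s, m, h, and d, and does not handle signs or a missing suffix; the class name is illustrative and not part of JAiRouter.

```java
import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for "<number><unit>" duration strings such as "10s".
final class SimpleDurations {
    private static final Pattern SIMPLE =
            Pattern.compile("^(\\d+)(ns|us|ms|s|m|h|d)$");

    private SimpleDurations() {}

    static Duration parse(String text) {
        Matcher m = SIMPLE.matcher(text.trim().toLowerCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a simple duration: " + text);
        }
        long amount = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "ns": return Duration.ofNanos(amount);
            case "us": return Duration.of(amount, ChronoUnit.MICROS);
            case "ms": return Duration.ofMillis(amount);
            case "s":  return Duration.ofSeconds(amount);
            case "m":  return Duration.ofMinutes(amount);
            case "h":  return Duration.ofHours(amount);
            default:   return Duration.ofDays(amount); // "d"
        }
    }
}
```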

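Metric Category Configuration

monitoring.metrics.enabled-categories

Type: List
Default: ["system", "business", "infrastructure"]
Description: The metric categories to enable.

monitoring:
  metrics:
    enabled-categories:
      - system          # system metrics (JVM, HTTP, etc.)
      - business        # business metrics (model calls, user sessions, etc.)
      - infrastructure  # infrastructure metrics (load balancing, rate limiting, circuit breaking, etc.)

Allowed values:

  • system: system metrics such as JVM memory, GC, and HTTP requests
  • business: business metrics such as model calls, user sessions, and business workflows
  • infrastructure: infrastructure metrics such as load balancing, rate limiting, circuit breaking, and health checks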

Custom Tag Configuration

monitoring.metrics.custom-tags

Type: Map
Default: {}
Description: Custom tags added to every metric.

monitoring:
  metrics:
    custom-tags:
      environment: "${spring.profiles.active:default}"
      version: "@project.version@"
      region: "us-west-1"
      datacenter: "dc1"
      team: "platform"

Notes:

  • Tag values support Spring expressions and placeholders.
  • Avoid high-cardinality tags such as user IDs and IP addresses.
  • Keep the number of tags to 10 or fewer.

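Sampling Configuration

monitoring.metrics.sampling

Type: Object
Description: Sampling rates for each metric group, controlling what fraction of metric events is collected.

monitoring:
  metrics:
    sampling:
      request-metrics: 1.0        # request metric sampling rate (100%)
      backend-metrics: 1.0        # backend call metric sampling rate
      infrastructure-metrics: 1.0 # infrastructure metric sampling rate
      system-metrics: 1.0         # system metric sampling rate
      debug-metrics: 0.1          # debug metric sampling rate (10%)

Sampling rates:

  • 1.0: 100% sampling, every metric event is collected
  • 0.5: 50% sampling, a random half is collected
  • 0.1: 10% sampling, a random tenth is collected
  • 0.0: 0% sampling, nothing is collected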
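
One common way to implement such a rate is to draw a uniform random number in [0, 1) for each event and keep the event when the draw is below the rate. The sketch below shows that decision; it is an assumption about how random sampling of this kind typically works, not JAiRouter's actual implementation, and the class name is illustrative.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sampling decision, not JAiRouter's implementation.
final class SamplingSketch {
    private SamplingSketch() {}

    // Keeps the event when a draw from [0, 1) falls below the rate.
    // A rate of 1.0 keeps everything; a rate of 0.0 keeps nothing.
    static boolean shouldSample(double rate, double draw) {
        return draw < rate;
    }

    // Convenience overload that supplies the random draw.
    static boolean shouldSample(double rate) {
        return shouldSample(rate, ThreadLocalRandom.current().nextDouble());
    }
}
```

Passing the draw explicitly keeps the decision deterministic in tests, while the single-argument overload is what a collector would call per event.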

Environment-specific settings:

# Development: full sampling to ease debugging
monitoring:
  metrics:
    sampling:
      request-metrics: 1.0
      backend-metrics: 1.0

# Production: lower sampling rates to reduce overhead
monitoring:
  metrics:
    sampling:
      request-metrics: 0.1
      backend-metrics: 0.5

Performance Configuration

monitoring.metrics.performance

Type: Object
Description: Performance-related settings.

monitoring:
  metrics:
    performance:
      # Asynchronous processing
      async-processing: true
      async-thread-pool-size: 4
      async-thread-pool-max-size: 8
      async-queue-capacity: 1000

      # Batching
      batch-size: 500
      batch-timeout: 1s

      # Buffering
      buffer-size: 2000
      buffer-flush-interval: 5s

      # Processing timeout
      processing-timeout: 5s

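async-processing

Type: Boolean
Default: true
Description: Whether metrics are processed asynchronously.

monitoring:
  metrics:
    performance:
      async-processing: true   # asynchronous processing (recommended)
      # async-processing: false # synchronous processing (for debugging)

batch-size

Type: Integer
Default: 500
Description: Batch size, the number of metric events processed at a time.

monitoring:
  metrics:
    performance:
      batch-size: 500    # default batch size
      # batch-size: 100  # small batches, lower latency
      # batch-size: 1000 # large batches, higher throughput

buffer-size

Type: Integer
Default: 2000
Description: Buffer size, the capacity of the queue that holds metric events awaiting processing.

monitoring:
  metrics:
    performance:
      buffer-size: 2000   # default buffer size
      # buffer-size: 5000 # large buffer, absorbs traffic bursts
      # buffer-size: 1000 # small buffer, saves memory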
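
To make the relationship between buffer-size and batch-size concrete, the sketch below models a bounded queue that accepts events up to its capacity and hands them to a consumer in batches. It is a simplified illustration: the class name is hypothetical, and dropping events when the buffer is full is an assumption, since this reference does not state what JAiRouter does on overflow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative model of buffer-size and batch-size, not JAiRouter's code.
final class MetricBufferSketch {
    private final BlockingQueue<String> buffer;
    private final int batchSize;

    MetricBufferSketch(int bufferSize, int batchSize) {
        this.buffer = new ArrayBlockingQueue<>(bufferSize);
        this.batchSize = batchSize;
    }

    // Returns false when the buffer is full. Dropping the event in that
    // case is an assumption made for this sketch.
    boolean submit(String event) {
        return buffer.offer(event);
    }

    // Removes and returns up to batchSize events in arrival order.
    List<String> nextBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        buffer.drainTo(batch, batchSize);
        return batch;
    }
}
```

A larger buffer absorbs longer bursts at the cost of memory, while a larger batch reduces per-event overhead at the cost of latency, which matches the trade-offs noted in the comments above.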

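Memory Configuration

monitoring.metrics.memory

Type: Object
Description: Settings related to memory usage.

monitoring:
  metrics:
    memory:
      # Cache
      cache-size: 10000
      cache-expiry: 5m
      cache-cleanup-interval: 1m

      # Memory threshold
      memory-threshold: 80
      low-memory-sampling-rate: 0.1

      # Object pool
      object-pool-enabled: true
      object-pool-size: 1000

cache-size

Type: Integer
Default: 10000
Description: Size of the metric cache.

cache-expiry

Type: Duration
Default: 5m
Description: How long cache entries live before they expire.

memory-threshold

Type: Integer
Default: 80
Description: Memory usage threshold, as a percentage. Low-memory mode is enabled once usage exceeds it.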
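
The sketch below illustrates one plausible reading of how memory-threshold and low-memory-sampling-rate interact: once heap usage exceeds the threshold, the sampling rate is capped at the low-memory rate. This interaction is an assumption made for illustration; the reference does not spell out the exact behavior, and the class and method names are hypothetical.

```java
// Illustrative low-memory logic, not JAiRouter's implementation.
final class LowMemorySketch {
    private LowMemorySketch() {}

    // Returns the sampling rate to apply for the given heap usage.
    // Assumption: above the threshold, the rate is capped at lowMemoryRate.
    static double effectiveRate(double usedPercent, int thresholdPercent,
                                double normalRate, double lowMemoryRate) {
        boolean lowMemory = usedPercent > thresholdPercent;
        return lowMemory ? Math.min(normalRate, lowMemoryRate) : normalRate;
    }

    // Current heap usage as a percentage of the maximum heap size.
    static double heapUsedPercent() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return 100.0 * used / rt.maxMemory();
    }
}
```

With the values shown above (threshold 80, low-memory rate 0.1), a request sampling rate of 1.0 would drop to 0.1 while heap usage stays above 80%.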

Security Configuration

monitoring.metrics.security

Type: Object
Description: Security-related settings.

monitoring:
  metrics:
    security:
      # Data masking
      data-masking: true
      mask-labels:
        - user_id
        - client_ip
        - api_key
        - session_id

      # IP address masking
      ip-masking: true
      ip-mask-pattern: "xxx.xxx.xxx.xxx"

      # Sensitive metric filtering
      sensitive-metrics-filter: true
      filtered-metrics:
        - "*.password.*"
        - "*.secret.*"
        - "*.token.*"

data-masking

Type: Boolean
Default: false
Description: Whether data masking is enabled.

mask-labels

Type: List
Default: []
Description: Names of the labels whose values are masked.
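
As an illustration of what mask-labels and filtered-metrics express, the sketch below replaces the values of listed labels with a fixed token and matches metric names against glob patterns in which `*` stands for any run of characters. The mask token "***", the glob semantics, and all class and method names are assumptions made for this sketch, not JAiRouter's documented behavior.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative masking and filtering, not JAiRouter's implementation.
final class MaskingSketch {
    static final String MASK = "***"; // assumed replacement token

    private MaskingSketch() {}

    // Returns a copy of the labels with the values of masked labels replaced.
    static Map<String, String> maskLabels(Map<String, String> labels,
                                          Set<String> maskedNames) {
        Map<String, String> out = new LinkedHashMap<>();
        labels.forEach((name, value) ->
                out.put(name, maskedNames.contains(name) ? MASK : value));
        return out;
    }

    // True when the metric name matches any glob pattern such as "*.password.*".
    static boolean isFiltered(String metricName, List<String> globs) {
        for (String glob : globs) {
            // Quote the pattern literally, then let each '*' match any text.
            String regex = Pattern.quote(glob).replace("*", "\\E.*\\Q");
            if (Pattern.matches(regex, metricName)) {
                return true;
            }
        }
        return false;
    }
}
```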

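Spring Actuator Configuration

management.endpoints.web.exposure.include

Type: String
Default: "health,info" (Spring Boot 2.5 and later expose only health by default)
Description: The list of endpoints to expose over HTTP.

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
        # include: "*"  # expose all endpoints (development only)

management.endpoint.prometheus.cache.time-to-live

Type: Duration
Default: 10s
Description: How long responses from the Prometheus endpoint are cached.

management:
  endpoint:
    prometheus:
      cache:
        time-to-live: 10s  # cache for 10 seconds
        # time-to-live: 0s # disable caching
        # time-to-live: 60s # cache for 1 minute

management.metrics.export.prometheus

Type: Object
Description: Prometheus export settings.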

management:
  metrics:
    export:
      prometheus:
        enabled: true
        descriptions: true
        step: 10s
        pushgateway:
          enabled: false
          base-url: http://localhost:9091

Environment-Specific Configuration

Development Environment

# application-dev.yml
monitoring:
  metrics:
    enabled: true
    sampling:
      request-metrics: 1.0
      backend-metrics: 1.0
      infrastructure-metrics: 1.0
    performance:
      async-processing: false  # simplifies debugging
      batch-size: 100
    security:
      data-masking: false

management:
  endpoints:
    web:
      exposure:
        include: "*"  # expose all endpoints in development
  endpoint:
    prometheus:
      cache:
        time-to-live: 1s  # shorter cache time to ease testing

Test Environment

# application-test.yml
monitoring:
  metrics:
    enabled: true
    prefix: "test_jairouter"
    sampling:
      request-metrics: 0.1  # lower sampling rate to reduce interference with tests
      backend-metrics: 0.5
    performance:
      async-processing: true
      batch-size: 50
    memory:
      cache-size: 1000

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus

Production Environment

# application-prod.yml
monitoring:
  metrics:
    enabled: true
    sampling:
      request-metrics: 0.1
      backend-metrics: 0.5
      infrastructure-metrics: 0.1
      system-metrics: 0.5
    performance:
      async-processing: true
      batch-size: 1000
      buffer-size: 5000
    memory:
      cache-size: 20000
      memory-threshold: 85
      low-memory-sampling-rate: 0.01
    security:
      data-masking: true
      mask-labels:
        - user_id
        - client_ip
        - api_key
      ip-masking: true

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    prometheus:
      cache:
        time-to-live: 30s
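  # Note: management.security.enabled is a Spring Boot 1.x property and has no
  # effect on 2.x or later; protect the endpoints with Spring Security there.
  security:
    enabled: true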

Dynamic Configuration

Runtime Configuration Updates

JAiRouter supports updating the monitoring configuration at runtime:

# Update sampling rates
curl -X POST http://localhost:8080/actuator/monitoring/config \
  -H "Content-Type: application/json" \
  -d '{
    "sampling": {
      "request-metrics": 0.5,
      "backend-metrics": 0.8
    }
  }'

# Enable or disable metric categories
curl -X POST http://localhost:8080/actuator/monitoring/categories \
  -H "Content-Type: application/json" \
  -d '{
    "enabled-categories": ["system", "business"]
  }'

# Update performance settings
curl -X POST http://localhost:8080/actuator/monitoring/performance \
  -H "Content-Type: application/json" \
  -d '{
    "batch-size": 200,
    "buffer-size": 1000
  }'
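
The same sampling update can be issued from Java with the JDK's built-in HTTP client (Java 11 or later). The sketch below only builds the request, using the endpoint path and JSON shape from the curl example above; the class and method names are illustrative, and the response format is not covered because this reference does not describe it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Illustrative request builder for the runtime sampling update.
final class RuntimeConfigRequests {
    private RuntimeConfigRequests() {}

    // JSON body with the same shape as the curl example.
    static String samplingBody(double requestRate, double backendRate) {
        return "{\"sampling\":{\"request-metrics\":" + requestRate
                + ",\"backend-metrics\":" + backendRate + "}}";
    }

    // POST <baseUrl>/actuator/monitoring/config with a JSON body.
    static HttpRequest samplingUpdate(String baseUrl, double requestRate,
                                      double backendRate) {
        String body = samplingBody(requestRate, backendRate);
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/monitoring/config"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }
}
```

To send it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and inspect the returned status code.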

Configuration File Hot Reload

The monitoring configuration can also be updated through a configuration file:

# config/monitoring-override.yml
monitoring:
  metrics:
    sampling:
      request-metrics: 0.3
    performance:
      batch-size: 200

The system detects changes to the configuration file automatically and applies the new settings.

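Configuration Validation

Configuration Syntax Validation

# Validate the YAML syntax by starting the application with the test profile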
./mvnw spring-boot:run -Dspring-boot.run.arguments="--spring.config.location=classpath:/application.yml --spring.profiles.active=test"

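Configuration Validity Checks

# Inspect the current configuration
curl http://localhost:8080/actuator/monitoring/config

# Check metric collection status
curl http://localhost:8080/actuator/monitoring/status

# Verify that the endpoint is reachable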
curl http://localhost:8080/actuator/prometheus
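
Beyond checking that the endpoint responds, it can be useful to confirm that metrics carrying the configured prefix actually appear in the output. The sketch below counts sample lines in a Prometheus text response that start with a given prefix, skipping the `#` comment lines. It assumes the prefix is joined to metric names with an underscore (for example `jairouter_`), which this reference does not confirm; the class name is illustrative.

```java
// Illustrative check over a Prometheus text-format response body.
final class PrometheusTextCheck {
    private PrometheusTextCheck() {}

    // Counts sample lines (not "# HELP" or "# TYPE" comments) that start
    // with the given metric name prefix.
    static long countSamplesWithPrefix(String body, String prefix) {
        return body.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .filter(line -> line.startsWith(prefix))
                .count();
    }
}
```

A result of zero after the application has served traffic would suggest that collection is disabled, the prefix differs, or the relevant category is not enabled.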

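Configuration Best Practices

1. Per-Environment Configuration

  • Development: enable all metrics to ease debugging
  • Test: lower the sampling rates to reduce interference with tests
  • Production: balance performance against monitoring precision

2. Performance Tuning

# High-performance configuration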
monitoring:
  metrics:
    sampling:
      request-metrics: 0.1
    performance:
      async-processing: true
      batch-size: 1000
      buffer-size: 5000
    memory:
      cache-size: 20000

3. Security Configuration

# Security configuration
monitoring:
  metrics:
    security:
      data-masking: true
      mask-labels:
        - user_id
        - client_ip
        - api_key

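management:
  # Note: management.security.enabled is a Spring Boot 1.x property and has no
  # effect on 2.x or later; protect the endpoints with Spring Security there.
  security:
    enabled: true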
  server:
    port: 8081
    address: 127.0.0.1

4. Self-Monitoring Configuration

# Monitor the monitoring system itself
monitoring:
  metrics:
    custom-tags:
      monitoring_version: "1.0"
    enabled-categories:
      - system
      - monitoring  # metrics about the monitoring system itself

Troubleshooting Configuration

Debug Configuration

# Enable debug mode
logging:
  level:
    org.unreal.modelrouter.monitoring: DEBUG
    io.micrometer: DEBUG

monitoring:
  metrics:
    debug:
      enabled: true
      log-metrics: true
      log-interval: 30s

Diagnostic Configuration

# Diagnostic configuration
monitoring:
  metrics:
    diagnostics:
      enabled: true
      collect-jvm-metrics: true
      collect-system-metrics: true
      health-check-interval: 10s

Configuration Templates

Basic Template

# Basic monitoring configuration template
monitoring:
  metrics:
    enabled: true
    prefix: "jairouter"
    enabled-categories:
      - system
      - business
      - infrastructure
    sampling:
      request-metrics: 1.0
      backend-metrics: 1.0
    performance:
      async-processing: true
      batch-size: 500

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    prometheus:
      cache:
        time-to-live: 10s

High-Performance Template

# High-performance monitoring configuration template
monitoring:
  metrics:
    enabled: true
    sampling:
      request-metrics: 0.1
      backend-metrics: 0.5
      infrastructure-metrics: 0.1
    performance:
      async-processing: true
      batch-size: 1000
      buffer-size: 5000
    memory:
      cache-size: 20000
      memory-threshold: 85

Security Template

# Security-focused monitoring configuration template
monitoring:
  metrics:
    enabled: true
    security:
      data-masking: true
      mask-labels:
        - user_id
        - client_ip
        - api_key
      ip-masking: true

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus
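  # Note: management.security.enabled is a Spring Boot 1.x property and has no
  # effect on 2.x or later; protect the endpoints with Spring Security there.
  security:
    enabled: true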
  server:
    port: 8081
    address: 127.0.0.1

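Tip: Choose the configuration template that fits your environment and requirements, and keep tuning the parameters as you observe how the system behaves.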