Monitoring Performance Optimization Guide

Document version: 1.0.0
Last updated: 2025-08-19
Git commit: c1aa5b0f
Author: Lincoln

This document describes how to tune the performance of the JAiRouter monitoring system: reducing the overhead that monitoring adds to the application and improving the efficiency of the monitoring stack itself.

Performance Impact Assessment

Monitoring Overhead Analysis

The performance impact of the monitoring features falls mainly into the following areas:

| Component          | Affected resources | Typical overhead                  |
|--------------------|--------------------|-----------------------------------|
| Metrics collection | CPU, memory        | 2-5% CPU, 50-100 MB memory        |
| Metrics exposure   | Network, CPU       | 1-2% CPU, plus network bandwidth  |
| Async processing   | Memory, threads    | 20-50 MB memory, 2-4 threads      |
| Caching            | Memory             | 10-30 MB memory                   |

Performance Benchmarks

Test environment configuration

# Test configuration
monitoring:
  metrics:
    enabled: true
    performance:
      async-processing: true
      batch-size: 500
      buffer-size: 2000
    sampling:
      request-metrics: 1.0
      backend-metrics: 1.0

Benchmark results

# Run the performance benchmark
./mvnw test -Dtest=MonitoringPerformanceBenchmarkTest

# Typical results
- Without monitoring: 1000 RPS, 50 ms P95 latency
- With monitoring:    950 RPS, 52 ms P95 latency
- Impact: ~5% lower throughput, ~4% higher latency

Metrics Collection Optimization

Sampling Strategy

1. Layered sampling configuration

monitoring:
  metrics:
    sampling:
      # Keep full sampling for core business metrics
      request-metrics: 1.0
      model-call-metrics: 1.0

      # Infrastructure metrics can tolerate a lower sampling rate
      infrastructure-metrics: 0.1
      system-metrics: 0.5

      # Enable debug metrics only when needed
      debug-metrics: 0.01
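
The rates above still have to be consulted somewhere on the collection path. The following is a minimal sketch of such a gate, assuming the property layout shown above; the SamplingGate class itself is hypothetical and not part of JAiRouter's actual API:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

// Illustrative sketch: resolves the configured sampling rate for a metric
// category and performs a random draw before a metric is recorded.
@Component
@ConfigurationProperties(prefix = "monitoring.metrics")
public class SamplingGate {

    // Bound from monitoring.metrics.sampling.* (keys such as "request-metrics")
    private Map<String, Double> sampling = new HashMap<>();

    public Map<String, Double> getSampling() {
        return sampling;
    }

    public void setSampling(Map<String, Double> sampling) {
        this.sampling = sampling;
    }

    /** Returns true if a metric of the given category should be recorded. */
    public boolean shouldSample(String category) {
        double rate = sampling.getOrDefault(category, 1.0); // unknown categories: full sampling
        return rate >= 1.0 || ThreadLocalRandom.current().nextDouble() < rate;
    }
}

A collector can then call shouldSample("request-metrics") before recording, which keeps the rate decisions in one place instead of scattering them across collectors.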

2. Dynamic sampling-rate adjustment

monitoring:
  metrics:
    adaptive-sampling:
      enabled: true
      # Adjust dynamically based on system load
      cpu-threshold: 80
      memory-threshold: 85
      # Sampling rate to apply under high load
      high-load-sampling-rate: 0.1

3. Conditional sampling

@Component
public class ConditionalMetricsCollector {

    // Underlying collector that actually records request metrics (injected)
    private final MetricsCollector metricsCollector;

    @Value("${monitoring.metrics.sampling.request-metrics:1.0}")
    private double requestSamplingRate;

    public ConditionalMetricsCollector(MetricsCollector metricsCollector) {
        this.metricsCollector = metricsCollector;
    }

    public void recordRequest(HttpServletRequest request) {
        // Always record important paths; sample everything else
        if (isImportantPath(request.getRequestURI()) ||
            ThreadLocalRandom.current().nextDouble() < requestSamplingRate) {
            metricsCollector.recordRequest(request);
        }
    }

    private boolean isImportantPath(String path) {
        return path.startsWith("/v1/") ||
               path.contains("/chat/") ||
               path.contains("/embedding/");
    }
}

Asynchronous Processing Optimization

1. Async processing configuration

monitoring:
  metrics:
    performance:
      # Enable asynchronous processing
      async-processing: true

      # Thread pool settings
      async-thread-pool-size: 4
      async-thread-pool-max-size: 8
      async-queue-capacity: 1000

      # Batch settings
      batch-size: 500
      batch-timeout: 1s

      # Buffer settings
      buffer-size: 2000
      buffer-flush-interval: 5s
2. Async processing implementation

@Component
public class AsyncMetricsProcessor {

    // In the real component these values come from monitoring.metrics.performance.*
    private static final int BUFFER_SIZE = 2000;
    private static final int BATCH_SIZE = 500;
    private static final long BATCH_TIMEOUT_SECONDS = 1;
    private static final int THREAD_POOL_SIZE = 4;

    private final BlockingQueue<MetricEvent> eventQueue =
        new ArrayBlockingQueue<>(BUFFER_SIZE);

    private final ScheduledExecutorService executor =
        Executors.newScheduledThreadPool(THREAD_POOL_SIZE);

    @PostConstruct
    public void startProcessing() {
        executor.scheduleAtFixedRate(this::processBatch, 0, BATCH_TIMEOUT_SECONDS, TimeUnit.SECONDS);
    }

    @PreDestroy
    public void stopProcessing() {
        executor.shutdown();
    }

    public void submitEvent(MetricEvent event) {
        if (!eventQueue.offer(event)) {
            // Queue is full: apply the overflow strategy (here: drop and count)
            handleQueueFull(event);
        }
    }

    private void processBatch() {
        List<MetricEvent> batch = new ArrayList<>();
        eventQueue.drainTo(batch, BATCH_SIZE);

        if (!batch.isEmpty()) {
            processBatchEvents(batch);
        }
    }

    private void handleQueueFull(MetricEvent dropped) {
        // e.g. increment a "dropped events" counter; intentionally non-blocking
    }

    private void processBatchEvents(List<MetricEvent> batch) {
        // Hand the batch to the underlying metrics backend
    }
}

Memory Optimization

1. Memory configuration

monitoring:
  metrics:
    memory:
      # Metrics cache settings
      cache-size: 10000
      cache-expiry: 5m
      cache-cleanup-interval: 1m

      # Memory threshold settings
      memory-threshold: 80
      low-memory-sampling-rate: 0.1

      # Object pool settings
      object-pool-enabled: true
      object-pool-size: 1000
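
The object-pool switch above is not implemented anywhere in this guide. As a rough illustration of the idea, a fixed-size pool that recycles metric-event objects could look like the sketch below; the MetricEvent type is the one used by AsyncMetricsProcessor, and its reset() method is an assumption made for this example:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal fixed-size object pool that recycles MetricEvent instances to reduce
// allocation pressure on hot paths. Purely illustrative.
public class MetricEventPool {

    private final BlockingQueue<MetricEvent> pool;

    public MetricEventPool(int size) {
        this.pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.offer(new MetricEvent());
        }
    }

    /** Borrow an event; falls back to allocation if the pool is empty. */
    public MetricEvent acquire() {
        MetricEvent event = pool.poll();
        return event != null ? event : new MetricEvent();
    }

    /** Return an event after use; excess instances are simply discarded. */
    public void release(MetricEvent event) {
        event.reset();      // clear previous state before reuse (assumed method)
        pool.offer(event);  // drop silently if the pool is already full
    }
}

Pooling only pays off when events are allocated at a very high rate; it is worth measuring allocation pressure before enabling it.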

2. Memory monitoring and adaptive sampling

@Component
public class MemoryAwareMetricsCollector {

    private final MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();

    // Heap usage (percent) above which sampling is reduced
    @Value("${monitoring.metrics.memory.memory-threshold:80}")
    private double memoryThresholdPercent;

    @Value("${monitoring.metrics.memory.low-memory-sampling-rate:0.1}")
    private double lowMemorySamplingRate;

    public void recordMetric(String name, double value, String... tags) {
        if (isMemoryPressureHigh()) {
            // Under memory pressure, fall back to the low sampling rate
            if (ThreadLocalRandom.current().nextDouble() < lowMemorySamplingRate) {
                doRecordMetric(name, value, tags);
            }
        } else {
            doRecordMetric(name, value, tags);
        }
    }

    private boolean isMemoryPressureHigh() {
        MemoryUsage heapUsage = memoryBean.getHeapMemoryUsage();
        // getMax() assumes an explicit heap limit (-Xmx) is configured
        double usagePercent = 100.0 * heapUsage.getUsed() / heapUsage.getMax();
        return usagePercent > memoryThresholdPercent;
    }

    private void doRecordMetric(String name, double value, String... tags) {
        // Delegate to the underlying metrics collector / MeterRegistry
    }
}

Prometheus Optimization

Scrape Configuration Tuning

1. Scrape interval tuning

# prometheus.yml
global:
  scrape_interval: 30s  # Longer global scrape interval
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'jairouter'
    scrape_interval: 15s  # Keep a shorter interval for this key service
    scrape_timeout: 10s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['jairouter:8080']

    # Relabel metrics to reduce stored series
    metric_relabel_configs:
      # Drop metrics that are not needed
      - source_labels: [__name__]
        regex: 'jvm_classes_.*'
        action: drop

      # Drop high-cardinality labels
      - regex: 'client_ip'
        action: labeldrop

2. Storage tuning

# Prometheus startup flags (e.g. the command section in docker-compose)
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=15d'  # Shorter retention window
  - '--storage.tsdb.retention.size=5GB'   # Cap the on-disk size
  - '--storage.tsdb.min-block-duration=2h'
  - '--storage.tsdb.max-block-duration=25h'
  - '--storage.tsdb.wal-compression'      # Enable WAL compression
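
As a rough sizing check for the retention settings above, Prometheus typically needs on the order of 1-2 bytes per sample after TSDB compression. For example, 10,000 active series scraped every 15 s produce about 667 samples per second, roughly 58 million samples per day, or about 860 million samples over 15 days, which is somewhere around 1-2 GB on disk. The 5 GB cap therefore leaves headroom, but actual usage depends heavily on series churn and label cardinality.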

Recording Rules Optimization

1. Pre-aggregate frequently used queries

# rules/jairouter-recording-rules.yml
groups:
  - name: jairouter.recording
    interval: 30s
    rules:
      # Pre-aggregated base metrics
      - record: jairouter:request_rate_5m
        expr: sum(rate(jairouter_requests_total[5m]))

      - record: jairouter:error_rate_5m
        expr: sum(rate(jairouter_requests_total{status=~"4..|5.."}[5m])) / sum(rate(jairouter_requests_total[5m]))

      - record: jairouter:response_time_p95_5m
        expr: histogram_quantile(0.95, sum(rate(jairouter_request_duration_seconds_bucket[5m])) by (le))

      # Pre-aggregation grouped by service
      - record: jairouter:request_rate_by_service_5m
        expr: sum by (service) (rate(jairouter_requests_total[5m]))

      - record: jairouter:error_rate_by_service_5m
        expr: sum by (service) (rate(jairouter_requests_total{status=~"4..|5.."}[5m])) / sum by (service) (rate(jairouter_requests_total[5m]))

2. Tiered recording rules

groups:
  # High-frequency base rules
  - name: jairouter.recording.basic
    interval: 15s
    rules:
      - record: jairouter:up
        expr: up{job="jairouter"}

      - record: jairouter:request_rate_1m
        expr: sum(rate(jairouter_requests_total[1m]))

  # Medium-frequency aggregation rules
  - name: jairouter.recording.aggregate
    interval: 60s
    rules:
      - record: jairouter:request_rate_5m
        expr: sum(rate(jairouter_requests_total[5m]))

      - record: jairouter:backend_health_ratio
        expr: sum(jairouter_backend_health) / count(jairouter_backend_health)

  # Low-frequency complex rules
  - name: jairouter.recording.complex
    interval: 300s
    rules:
      - record: jairouter:sla_availability_24h
        expr: avg_over_time(jairouter:up[24h])

Grafana Optimization

Query Optimization

1. Use recording rules

# Before: complex ad-hoc query
sum(rate(jairouter_requests_total{status=~"4..|5.."}[5m])) / sum(rate(jairouter_requests_total[5m])) * 100

# After: use the pre-aggregated series
jairouter:error_rate_5m * 100

2. Query cache configuration

# grafana.ini (query caching is a Grafana Enterprise feature; option names vary by version)
[caching]
enabled = true

[query_caching]
enabled = true
ttl = 60s
max_cache_size_mb = 100

3. Panel optimization

{
  "targets": [
    {
      "expr": "jairouter:request_rate_5m",
      "interval": "30s",
      "maxDataPoints": 100,
      "refId": "A"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    }
  }
}

Dashboard Optimization

1. Limit the number of panels

  • Keep each dashboard to no more than 20 panels
  • Use collapsible rows to organize panels
  • Avoid plotting too many data series in a single panel

2. Refresh interval tuning

{
  "refresh": "30s",  // Longer refresh interval
  "time": {
    "from": "now-1h",  // Shorter default time range
    "to": "now"
  }
}

Network Optimization

Metrics Transfer Optimization

1. Compression

# prometheus.yml
scrape_configs:
  - job_name: 'jairouter'
    params:
      format: ['prometheus']
    honor_labels: true
    # Prometheus requests gzip-compressed scrape responses by default;
    # recent versions expose this via enable_compression (default: true)
    enable_compression: true

2. Batched export

@Component
public class BatchMetricsExporter {

    // Pending metrics; synchronized because producers and the scheduler run on different threads
    private final List<Metric> metricsBatch = new ArrayList<>();
    private final int batchSize = 1000;

    public synchronized void submit(Metric metric) {
        metricsBatch.add(metric);
        if (metricsBatch.size() >= batchSize) {
            flush();  // export immediately once a full batch has accumulated
        }
    }

    @Scheduled(fixedDelay = 5000)
    public synchronized void exportBatch() {
        flush();  // also export partial batches on a fixed schedule
    }

    private void flush() {
        if (!metricsBatch.isEmpty()) {
            exportMetrics(new ArrayList<>(metricsBatch));  // push to the metrics backend
            metricsBatch.clear();
        }
    }
}

Network Connection Optimization

1. Connection pool configuration

monitoring:
  metrics:
    http-client:
      connection-pool-size: 10
      connection-timeout: 5s
      read-timeout: 10s
      keep-alive: true
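
These keys are only configuration values; a client still has to be built from them. As one possible wiring, assuming a Spring WebFlux / Reactor Netty stack and the hypothetical monitoring.metrics.http-client.* properties above, the pool could be set up roughly like this:

import java.time.Duration;

import io.netty.channel.ChannelOption;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

// Illustrative wiring only; the property values are hard-coded to mirror the
// hypothetical monitoring.metrics.http-client.* settings shown above.
@Configuration
public class MetricsHttpClientConfig {

    @Bean
    public WebClient metricsWebClient() {
        ConnectionProvider provider = ConnectionProvider.builder("metrics")
                .maxConnections(10)                          // connection-pool-size
                .maxIdleTime(Duration.ofSeconds(30))         // recycle idle connections
                .build();

        HttpClient httpClient = HttpClient.create(provider)
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5_000)  // connection-timeout
                .responseTimeout(Duration.ofSeconds(10))               // read-timeout
                .keepAlive(true);                                      // keep-alive

        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}

Whether a dedicated client is worthwhile depends on how metrics leave the process; if metrics are only exposed via the Prometheus scrape endpoint, no outbound client is needed at all.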

2. Local caching

@Component
public class CachedMetricsCollector {

    // Skip re-recording values that have not changed meaningfully
    private static final double CHANGE_THRESHOLD = 0.01;

    private final Cache<String, Double> metricsCache =
        Caffeine.newBuilder()
            .maximumSize(10000)
            .expireAfterWrite(Duration.ofMinutes(5))
            .build();

    public void recordMetric(String name, double value) {
        // Consult the cache to avoid redundant work for unchanged values
        Double cachedValue = metricsCache.getIfPresent(name);
        if (cachedValue == null || Math.abs(cachedValue - value) > CHANGE_THRESHOLD) {
            metricsCache.put(name, value);
            doRecordMetric(name, value);
        }
    }

    private void doRecordMetric(String name, double value) {
        // Delegate to the underlying metrics collector
    }
}

Monitoring the Monitoring System

Monitoring Performance Metrics

1. Monitoring overhead metrics

# Recording rules for the monitoring system's own overhead
- record: monitoring:collection_duration_seconds
  expr: histogram_quantile(0.95, rate(jairouter_metrics_collection_duration_seconds_bucket[5m]))

- record: monitoring:memory_usage_bytes
  expr: jvm_memory_used_bytes{area="heap", application="jairouter"}

- record: monitoring:cpu_usage_percent
  expr: rate(process_cpu_seconds_total{application="jairouter"}[5m]) * 100

2. Performance alerts

groups:
  - name: monitoring.performance
    rules:
      - alert: MonitoringHighCPUUsage
        expr: monitoring:cpu_usage_percent > 10
        for: 5m
        annotations:
          summary: "监控功能 CPU 使用率过高"

      - alert: MonitoringHighMemoryUsage
        expr: monitoring:memory_usage_bytes > 200 * 1024 * 1024
        for: 5m
        annotations:
          summary: "监控功能内存使用过高"

Automatic Performance Tuning

1. Adaptive configuration

@Component
public class AdaptiveMonitoringConfig {

    @Scheduled(fixedDelay = 60000)
    public void adjustConfiguration() {
        double cpuUsage = getCurrentCPUUsage();       // percent, 0-100
        double memoryUsage = getCurrentMemoryUsage(); // percent, 0-100

        if (cpuUsage > 80 || memoryUsage > 85) {
            // Reduce the sampling rate
            reduceSamplingRate();
            // Increase the batch size
            increaseBatchSize();
            // Disable non-critical metric categories
            disableNonCriticalMetrics();
        } else if (cpuUsage < 50 && memoryUsage < 60) {
            // Restore the normal configuration
            restoreNormalConfiguration();
        }
    }

    // getCurrentCPUUsage(), reduceSamplingRate(), etc. read and adjust the shared
    // monitoring settings at runtime; their implementations are omitted here.
}

Environment-Specific Optimization

Development Environment

# application-dev.yml
monitoring:
  metrics:
    enabled: true
    sampling:
      request-metrics: 1.0  # Keep full sampling in development
    performance:
      async-processing: false  # Synchronous processing simplifies debugging
      batch-size: 100

Test Environment

# application-test.yml
monitoring:
  metrics:
    enabled: true
    sampling:
      request-metrics: 0.1  # Lower sampling rate to reduce interference
    performance:
      async-processing: true
      batch-size: 1000

Production Environment

# application-prod.yml
monitoring:
  metrics:
    enabled: true
    sampling:
      request-metrics: 0.1
      backend-metrics: 0.5
      infrastructure-metrics: 0.1
    performance:
      async-processing: true
      batch-size: 1000
      buffer-size: 5000
    memory:
      cache-size: 20000
      low-memory-sampling-rate: 0.01
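
For scale, a 0.1 request-metrics sampling rate at around 1,000 RPS still records roughly 100 requests per second, which is generally enough to estimate request rates and latency percentiles while reducing collection work for request metrics by about 90%. The exact rate should be validated against the traffic volume of each deployment.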

Performance Testing

Benchmarking

1. Performance test script

#!/bin/bash

echo "Starting performance benchmark..."

# Baseline run without monitoring
echo "Measuring baseline performance (monitoring disabled)..."
curl -X POST http://localhost:8080/actuator/monitoring/disable
# -p/-T make ab send POST requests; payload.json is a sample chat request body
ab -n 10000 -c 100 -p payload.json -T application/json \
  http://localhost:8080/v1/chat/completions > baseline.txt

# Run with monitoring enabled
echo "Measuring performance with monitoring enabled..."
curl -X POST http://localhost:8080/actuator/monitoring/enable
ab -n 10000 -c 100 -p payload.json -T application/json \
  http://localhost:8080/v1/chat/completions > monitoring.txt

# Analyze the results
python analyze_performance.py baseline.txt monitoring.txt

2. Performance analysis script

import re
import sys


def analyze_performance(baseline_file, monitoring_file):
    def parse_ab_output(filename):
        with open(filename, 'r') as f:
            content = f.read()

        rps = re.search(r'Requests per second:\s+(\d+\.\d+)', content)
        latency = re.search(r'Time per request:\s+(\d+\.\d+)', content)

        return float(rps.group(1)), float(latency.group(1))

    baseline_rps, baseline_latency = parse_ab_output(baseline_file)
    monitoring_rps, monitoring_latency = parse_ab_output(monitoring_file)

    rps_impact = (baseline_rps - monitoring_rps) / baseline_rps * 100
    latency_impact = (monitoring_latency - baseline_latency) / baseline_latency * 100

    print("Performance impact analysis:")
    print(f"Throughput impact: {rps_impact:.2f}%")
    print(f"Latency impact: {latency_impact:.2f}%")


if __name__ == "__main__":
    # Usage: python analyze_performance.py baseline.txt monitoring.txt
    analyze_performance(sys.argv[1], sys.argv[2])

Continuous Performance Monitoring

1. Performance regression detection

# Add a performance check to the CI/CD pipeline
# (performance-test.sh is expected to export PERFORMANCE_IMPACT as an integer percentage)
- name: Performance Regression Test
  run: |
    ./scripts/performance-test.sh
    if [ "$PERFORMANCE_IMPACT" -gt 10 ]; then
      echo "Performance impact exceeds 10%, failing the build"
      exit 1
    fi

2. Performance trend monitoring

# Average request latency over the last 24 hours, for trend tracking
increase(jairouter_request_duration_seconds_sum[24h]) / increase(jairouter_request_duration_seconds_count[24h])

Best Practices Summary

Configuration optimization checklist

  • [ ] Enable asynchronous processing
  • [ ] Configure sensible sampling rates
  • [ ] Pre-aggregate with recording rules
  • [ ] Tune Prometheus storage settings
  • [ ] Limit the number of Grafana panels
  • [ ] Enable compression and caching
  • [ ] Monitor the monitoring system itself
  • [ ] Run performance tests regularly

Performance optimization principles

  1. Optimize incrementally: start with the changes that have the largest impact
  2. Measure the effect: quantify the performance impact after every change
  3. Balance fidelity and cost: trade monitoring precision against overhead deliberately
  4. Configure per environment: use different strategies for development, test, and production
  5. Keep monitoring: maintain ongoing performance monitoring and alerting

Related Documentation


Important: performance tuning is an ongoing process; keep adjusting the configuration based on actual usage.