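Operations Guide
This document is a complete guide to operating the JAiRouter distributed tracing system in production.
Production Deployment
Environment Preparation
System Requirements
- JVM: OpenJDK 17 or later
- Memory: 4 GB minimum, 8 GB or more recommended
- CPU: 4 cores or more
- Disk: SSD storage with at least 50 GB free
Dependent Services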
# docker-compose.yml example
version: '3.8'
services:
  jairouter:
    image: jairouter:latest
    environment:
      - JAIROUTER_TRACING_ENABLED=true
      - JAIROUTER_TRACING_EXPORTER_TYPE=otlp
    depends_on:
      - otel-collector
  otel-collector:
    image: otel/opentelemetry-collector:latest
    # Point the collector at the mounted file; otherwise the image reads its default config path
    command: ["--config=/etc/config.yaml"]
    ports:
      - "4317:4317"
    volumes:
      - ./otel-config.yaml:/etc/config.yaml
Production Configuration
Base Configuration
jairouter:
  tracing:
    enabled: true
    service-name: "jairouter-prod"
    service-version: "${app.version}"
    environment: "production"
    # Sampling
    sampling:
      strategy: "adaptive"
      adaptive:
        base-sample-rate: 0.01 # 1% base sampling
        max-traces-per-second: 100
        error-sample-rate: 1.0 # sample 100% of errors
        slow-request-threshold: 3000
    # Export
    exporter:
      type: "otlp"
      batch-size: 512
      export-timeout: 10s
      max-queue-size: 2048
    # Memory management
    memory:
      max-spans: 50000
      cleanup-interval: 30s
      span-ttl: 300s
    # Security
    security:
      enabled: true
      sensitive-headers:
        - "Authorization"
        - "Cookie"
        - "X-API-Key"
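The adaptive sampling settings above can be read as a per-request decision rule. The following is a minimal sketch of one plausible reading, not JAiRouter's actual sampler: the class and function names are hypothetical, the order of the checks is an assumption, and `slow-request-threshold` is assumed to be in milliseconds.

```python
import random
from dataclasses import dataclass


@dataclass
class AdaptiveSamplingConfig:
    """Mirrors the `sampling.adaptive` keys above (hypothetical class)."""
    base_sample_rate: float = 0.01          # base-sample-rate
    max_traces_per_second: int = 100        # max-traces-per-second
    error_sample_rate: float = 1.0          # error-sample-rate
    slow_request_threshold_ms: int = 3000   # slow-request-threshold (unit assumed: ms)


def should_sample(cfg, is_error, duration_ms, sampled_this_second, rng=random.random):
    """Decide whether to keep a trace (assumed order: errors, cap, slow, base rate)."""
    # Assumption: errors are sampled at their own rate and bypass the per-second cap.
    if is_error:
        return rng() < cfg.error_sample_rate
    # Assumption: once the per-second budget is used up, nothing else is kept.
    if sampled_this_second >= cfg.max_traces_per_second:
        return False
    # Assumption: slow requests are always kept while budget remains.
    if duration_ms >= cfg.slow_request_threshold_ms:
        return True
    # Everything else is kept with the base probability.
    return rng() < cfg.base_sample_rate
```

With the defaults, an error is always kept, a 5-second request is kept while fewer than 100 traces were sampled in the current second, and an ordinary fast request is kept about 1% of the time.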
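JVM Tuning
# Production JVM options (JDK 17+)
# Size the heap below the host's physical memory; -Xmx8g needs a host with more than 8 GB of RAM
-Xms8g -Xmx8g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
# Unified GC logging; replaces the legacy -XX:+PrintGC* flags, which JDK 9+ deprecates or rejects
-Xlog:gc*:file=/var/log/jairouter/gc.log:time,uptime,level,tags:filecount=10,filesize=100m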
Monitoring and Alerting
Prometheus Metrics Configuration
Metrics Collection
# prometheus.yml
scrape_configs:
  - job_name: 'jairouter-tracing'
    static_configs:
      - targets: ['jairouter:8080']
    metrics_path: '/actuator/prometheus'
    scrape_interval: 30s
Key Metrics
# Trace export success ratio
rate(jairouter_tracing_spans_exported_total[5m]) /
rate(jairouter_tracing_spans_created_total[5m])
# Average request duration over the last 5 minutes
rate(jairouter_tracing_request_duration_seconds_sum[5m]) /
rate(jairouter_tracing_request_duration_seconds_count[5m])
# Memory utilization
jairouter_tracing_memory_used_bytes /
jairouter_tracing_memory_max_bytes
# Error rate (errors per second)
rate(jairouter_tracing_errors_total[5m])
Alert Rules
# tracing-alerts.yml
groups:
  - name: jairouter_tracing
    rules:
      - alert: TracingExportFailureHigh
        expr: rate(jairouter_tracing_export_errors_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
          service: jairouter
        annotations:
          summary: "Trace export failures are too frequent"
          description: "Trace export errors exceeded 0.05 per second over the last 5 minutes"
      - alert: TracingMemoryUsageHigh
        expr: jairouter_tracing_memory_used_bytes / jairouter_tracing_memory_max_bytes > 0.85
        for: 1m
        labels:
          severity: critical
          service: jairouter
        annotations:
          summary: "Tracing system memory utilization is too high"
      - alert: TracingSlowRequests
        expr: histogram_quantile(0.95, sum by (le) (rate(jairouter_tracing_request_duration_seconds_bucket[5m]))) > 5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "P95 request duration exceeds 5 seconds"
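The TracingSlowRequests rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea in plain Python to show how a P95 value is derived; it is a simplified illustration, the bucket boundaries and counts are made up, and it assumes 0 < q <= 1 with an `+Inf` bucket present.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must include the (math.inf, total_count) bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # The rank falls in the open-ended bucket: report the last finite bound.
            if math.isinf(bound):
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative data: 100 requests, bucket bounds in seconds.
example = [(1, 50), (2.5, 80), (5, 90), (10, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

Here the 95th request falls in the 5 to 10 second bucket, so the estimate lands between those bounds and the alert condition `> 5` would hold. The precision of the estimate depends on how finely the buckets are spaced around the threshold.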
Grafana Dashboard
Core Panel Configuration
{
  "dashboard": {
    "title": "JAiRouter Tracing Monitoring",
    "panels": [
      {
        "title": "Request Tracing Overview",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(jairouter_tracing_requests_total[5m])",
            "legendFormat": "RPS"
          }
        ]
      },
      {
        "title": "Trace Export Status",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(jairouter_tracing_spans_exported_total[5m])",
            "legendFormat": "Export succeeded"
          },
          {
            "expr": "rate(jairouter_tracing_export_errors_total[5m])",
            "legendFormat": "Export failed"
          }
        ]
      }
    ]
  }
}
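Capacity Planning
Memory Planning
Span Memory Estimate
# Average memory per span: about 2 KB
# 1000 requests per second, 10% sampling, 5-minute span TTL (assuming one span per request)
# Memory required = 1000 * 0.1 * 300 * 2 KB ≈ 60 MB
# Recommended configuration
jairouter:
  tracing:
    memory:
      max-spans: 100000 # adjust to available memory
      span-ttl: 300s # 5-minute TTL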
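The estimate above generalizes to other traffic levels. The helper below is a small sketch of the same arithmetic; the function name and the `spans_per_request` parameter are hypothetical additions, and 2 KB is treated as 2000 bytes to match the rounded figure.

```python
def estimate_span_memory(requests_per_second, sample_rate, span_ttl_seconds,
                         bytes_per_span=2000, spans_per_request=1):
    """Return (retained_spans, megabytes) held in memory at steady state."""
    # Spans alive at any moment = sampled spans per second * seconds each one lives.
    retained = round(requests_per_second * sample_rate
                     * span_ttl_seconds * spans_per_request)
    megabytes = retained * bytes_per_span / 1_000_000
    return retained, megabytes


# The scenario from the estimate: 1000 req/s, 10% sampling, 300 s TTL.
spans, mb = estimate_span_memory(1000, 0.1, 300)
```

For that scenario the helper yields 30,000 retained spans and about 60 MB, so a `max-spans` of 100000 leaves headroom for bursts or for requests that produce several spans each.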
Dynamic Adjustment Strategy
jairouter:
  tracing:
    memory:
      # Memory pressure threshold
      memory-threshold: 0.8
      # Automatic cleanup
      auto-cleanup:
        enabled: true
        trigger-threshold: 0.85
        target-threshold: 0.7
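The gap between `trigger-threshold` and `target-threshold` acts as hysteresis: cleanup starts only above the trigger and then frees enough to fall back to the target, which avoids cleaning on every small fluctuation. A minimal sketch of that behavior follows, assuming utilization is measured as stored spans relative to `max-spans` (the real system may measure heap usage instead); the function name is hypothetical.

```python
def spans_to_evict(current_spans, max_spans,
                   trigger_threshold=0.85, target_threshold=0.7):
    """Return how many spans to drop, or 0 if utilization is below the trigger."""
    usage = current_spans / max_spans
    if usage < trigger_threshold:
        return 0
    # Evict down to the target level, not just back under the trigger.
    target_spans = round(max_spans * target_threshold)
    return current_spans - target_spans
```

With `max-spans: 100000`, a store holding 90,000 spans would drop 20,000 to return to 70% utilization, while a store holding 80,000 spans is left alone.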
Storage Planning
Log Storage
<!-- logback-spring.xml -->
<configuration>
  <appender name="TRACING_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/jairouter/tracing.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
      <fileNamePattern>/var/log/jairouter/tracing.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
      <maxFileSize>100MB</maxFileSize>
      <maxHistory>30</maxHistory>
      <totalSizeCap>10GB</totalSizeCap>
    </rollingPolicy>
    <!-- A file appender needs an encoder; this pattern is only an example -->
    <encoder>
      <pattern>%d{ISO8601} [%thread] %-5level %logger - %msg%n</pattern>
    </encoder>
  </appender>
</configuration>
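Security Operations
Data Masking Check
# Periodically check that sensitive data is masked; review every match
grep -inE "password|token|secret" /var/log/jairouter/tracing.log
# Inspect the sensitive-header settings in the running configuration
# (adjust the jq path to the actual structure of your configprops response)
curl -s http://localhost:8080/actuator/configprops | \
  jq '.jairouter.tracing.security.sensitive_headers'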
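For reference when reviewing grep hits, this is roughly what masking by header name looks like. It is a minimal sketch, not JAiRouter's implementation: the function name and the `***` mask are hypothetical, and matching is case-insensitive because HTTP header names are.

```python
SENSITIVE_HEADERS = ("Authorization", "Cookie", "X-API-Key")


def redact_headers(headers, sensitive=SENSITIVE_HEADERS, mask="***"):
    """Return a copy of `headers` with sensitive values replaced by `mask`."""
    blocked = {name.lower() for name in sensitive}
    return {
        name: (mask if name.lower() in blocked else value)
        for name, value in headers.items()
    }


redacted = redact_headers({
    "authorization": "Bearer abc123",
    "X-Api-Key": "k-42",
    "Accept": "application/json",
})
```

In the result, the `authorization` and `X-Api-Key` values are replaced by the mask and `Accept` is left untouched. A log line that still shows a bearer token or API key suggests a header name is missing from `sensitive-headers`.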
Access Control Audit
# Enable security auditing
jairouter:
  tracing:
    security:
      audit:
        enabled: true
        log-access: true
        log-config-changes: true
        retention-days: 90
Encrypted Configuration Management
# Manage sensitive settings through environment variables
export JAIROUTER_TRACING_EXPORTER_OTLP_HEADERS_API_KEY="your-api-key"
# Or use a Kubernetes Secret
kubectl create secret generic tracing-config \
  --from-literal=api-key=your-api-key
Performance Tuning
Real-Time Performance Monitoring
#!/bin/bash
# Example monitoring script
while true; do
  echo "=== $(date) ==="
  # CPU usage
  echo "CPU: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)"
  # Memory usage
  echo "Memory: $(free -m | awk 'NR==2{printf "%.1f%%", $3*100/$2}')"
  # Tracing metric: active spans
  curl -s http://localhost:8080/actuator/metrics/jairouter.tracing.spans.active | \
    jq '.measurements[0].value'
  sleep 30
done
Automated Tuning
# Configure the auto-tuning policy
jairouter:
  tracing:
    auto-tuning:
      enabled: true
      # Lower the sampling rate when CPU usage exceeds 80%
      cpu-threshold: 80
      sampling-rate-adjustment: 0.5
      # Trigger cleanup when memory usage exceeds 85%
      memory-threshold: 85
      cleanup-aggressive: true
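The CPU rule in this policy can be pictured as follows. This is a sketch under stated assumptions: `sampling-rate-adjustment` is taken to be a multiplier applied to the current rate, and the `min_rate` floor is a hypothetical safeguard that does not appear in the configuration.

```python
def tune_sampling_rate(current_rate, cpu_percent,
                       cpu_threshold=80, adjustment=0.5, min_rate=0.001):
    """Return the sampling rate to use after one tuning step."""
    if cpu_percent > cpu_threshold:
        # Assumption: the adjustment is multiplicative, so 0.5 halves the rate.
        return max(min_rate, current_rate * adjustment)
    # Below the threshold the rate is left unchanged in this sketch.
    return current_rate
```

Under this reading, a rate of 0.1 drops to 0.05 when CPU usage reaches 90% and stays at 0.1 when usage is 50%. Repeated steps under sustained load keep halving the rate until the floor is reached.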
Backup and Recovery
Configuration Backup
#!/bin/bash
# Daily configuration backup script
set -euo pipefail
DATE=$(date +%Y%m%d)
BACKUP_DIR="/backup/jairouter-config"
# Back up the current configuration
mkdir -p "$BACKUP_DIR"
curl -sf http://localhost:8080/actuator/configprops > \
  "$BACKUP_DIR/config-$DATE.json"
# Keep 30 days of backups
find "$BACKUP_DIR" -name "config-*.json" -mtime +30 -delete
Tracing Data Backup
# Export tracing data to long-term storage
jairouter:
  tracing:
    exporter:
      backup:
        enabled: true
        location: "/backup/tracing-data"
        retention-days: 90
        compression: true
Upgrades and Maintenance
Rolling Upgrade Strategy
#!/bin/bash
# Rolling upgrade script
# 1. Health check
curl -f http://localhost:8080/actuator/health/tracing || exit 1
# 2. Export the current configuration
curl -s http://localhost:8080/actuator/configprops > /tmp/pre-upgrade-config.json
# 3. Perform the upgrade
docker-compose pull jairouter
docker-compose up -d jairouter
# 4. Verify after the upgrade
sleep 30
curl -f http://localhost:8080/actuator/health/tracing || {
  echo "Upgrade failed, rolling back..."
  docker-compose down
  # Rollback logic
  exit 1
}
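A fixed `sleep 30` either waits longer than necessary or not long enough when startup is slow. One alternative is to poll the health endpoint with a bounded number of retries. The sketch below shows the idea; the function names, attempt count, and delay are hypothetical choices, and only the health URL comes from the script above.

```python
import time
import urllib.error
import urllib.request


def http_health_ok(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all count as unhealthy.
        return False


def wait_until_healthy(probe, attempts=10, delay_seconds=3.0, sleep=time.sleep):
    """Call `probe` until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

A post-upgrade check could then call `wait_until_healthy(lambda: http_health_ok("http://localhost:8080/actuator/health/tracing"))` and start the rollback only when it returns False.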
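Maintenance Window Operations
#!/bin/bash
# Maintenance mode script
case "$1" in
  "enter")
    # Enter maintenance mode
    echo "Entering maintenance mode..."
    # Lower the sampling rate to reduce load
    curl -X PUT http://localhost:8080/api/admin/tracing/sampling-rate \
      -H "Content-Type: application/json" \
      -d '{"rate": 0.01}'
    # Wait for in-flight spans to finish processing
    sleep 60
    ;;
  "exit")
    # Leave maintenance mode
    echo "Leaving maintenance mode..."
    # Restore the normal sampling rate
    curl -X PUT http://localhost:8080/api/admin/tracing/sampling-rate \
      -H "Content-Type: application/json" \
      -d '{"rate": 0.1}'
    ;;
  *)
    echo "Usage: $0 {enter|exit}" >&2
    exit 1
    ;;
esac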
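Emergency Response
Common Emergency Scenarios
1. Tracing System Overload
# Lower the sampling rate immediately
curl -X PUT http://localhost:8080/api/admin/tracing/emergency-config \
  -H "Content-Type: application/json" \
  -d '{"sampling_rate": 0.001, "reason": "system_overload"}'
# Temporarily disable tracing
curl -X POST http://localhost:8080/api/admin/tracing/disable \
  -H "Content-Type: application/json" \
  -d '{"duration": "1h", "reason": "emergency"}'
2. Exporter Failure
3. Memory Leak
# Force a GC and a memory cleanup
# Note: /actuator/gc is not a built-in Spring Boot endpoint, so this assumes the application exposes one;
# `jcmd <pid> GC.run` is a JDK alternative
curl -X POST http://localhost:8080/actuator/gc
curl -X POST http://localhost:8080/api/admin/tracing/force-cleanup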
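Emergency Contacts
Establish an emergency response process:
1. Monitoring alert → automatically notify the operations team
2. Triage → determine the scope of impact and the priority
3. Emergency handling → run the predefined emergency scripts
4. Follow-up → record the incident and analyze the root cause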
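Best Practices Summary
1. Monitoring Strategy
- Set up tiered alerts (warning, critical, emergency)
- Check trace data completeness regularly
- Monitor trends in system resource usage
2. Performance Optimization
- Adjust the sampling rate to business needs
- Clean up expired data regularly
- Choose a sensible batch size
3. Security Controls
- Review the sensitive-data filtering rules regularly
- Enable audit logging for configuration changes
- Apply the principle of least privilege
4. Disaster Preparedness
- Establish backup and recovery procedures
- Prepare emergency response plans
- Run failure drills regularly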