Operations Guide¶
This document provides a complete guide for operating the JAiRouter distributed tracing system in production environments.
Production Deployment¶
Environment Preparation¶
System Requirements¶
- JVM: OpenJDK 17 or higher
- Memory: Minimum 4GB, recommended 8GB+
- CPU: 4 cores or more
- Disk: SSD storage, at least 50GB available space
Dependency Services¶
# docker-compose.yml example
version: '3.8'
services:
  jairouter:
    image: jairouter:latest
    environment:
      - JAIROUTER_TRACING_ENABLED=true
      - JAIROUTER_TRACING_EXPORTER_TYPE=otlp
    depends_on:
      - otel-collector
  otel-collector:
    image: otel/opentelemetry-collector:latest
    # Point the collector at the mounted file; its default config path is /etc/otelcol/config.yaml
    command: ["--config=/etc/config.yaml"]
    ports:
      - "4317:4317"
    volumes:
      - ./otel-config.yaml:/etc/config.yaml
Production Configuration¶
Basic Configuration¶
jairouter:
  tracing:
    enabled: true
    service-name: "jairouter-prod"
    service-version: "${app.version}"
    environment: "production"
    # Sampling configuration
    sampling:
      strategy: "adaptive"
      adaptive:
        base-sample-rate: 0.01 # 1% base sampling
        max-traces-per-second: 100
        error-sample-rate: 1.0 # 100% error sampling
        slow-request-threshold: 3000
    # Export configuration
    exporter:
      type: "otlp"
      batch-size: 512
      export-timeout: 10s
      max-queue-size: 2048
    # Memory management
    memory:
      max-spans: 50000
      cleanup-interval: 30s
      span-ttl: 300s
    # Security configuration
    security:
      enabled: true
      sensitive-headers:
        - "Authorization"
        - "Cookie"
        - "X-API-Key"
JVM Tuning¶
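# Production JVM parameters (JDK 17)
# Heap sized for hosts with more than 8GB of RAM; reduce it on smaller hosts
-Xmx8g -Xms8g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
# GC logging via unified logging (JDK 9+). The legacy -XX:+PrintGCTimeStamps flag
# was removed in JDK 9 and prevents the JVM from starting; -XX:+PrintGCDetails is deprecated.
-Xlog:gc*:file=/var/log/jairouter/gc.log:time,uptime,level,tags:filecount=5,filesize=20m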
Monitoring and Alerting¶
Prometheus Metrics Configuration¶
Metric Collection¶
# prometheus.yml
scrape_configs:
  - job_name: 'jairouter-tracing'
    static_configs:
      - targets: ['jairouter:8080']
    metrics_path: '/actuator/prometheus'
    scrape_interval: 30s
Key Metrics¶
# Tracing export success rate
rate(jairouter_tracing_spans_exported_total[5m]) /
rate(jairouter_tracing_spans_created_total[5m])
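# Average response time over the last 5 minutes
# (dividing the raw _sum by _count would give the average since process start)
rate(jairouter_tracing_request_duration_seconds_sum[5m]) /
rate(jairouter_tracing_request_duration_seconds_count[5m])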
# Memory usage ratio
jairouter_tracing_memory_used_ratio
# Error rate
rate(jairouter_tracing_errors_total[5m])
Alert Rules¶
# tracing-alerts.yml
groups:
  - name: jairouter_tracing
    rules:
      - alert: TracingExportFailureHigh
        # Ratio of failed exports to created spans; assumes the error counter counts spans
        expr: |
          rate(jairouter_tracing_export_errors_total[5m])
            / rate(jairouter_tracing_spans_created_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
          service: jairouter
        annotations:
          summary: "High tracing data export failure rate"
          description: "Tracing data export failure rate exceeded 5% in the last 5 minutes"
      - alert: TracingMemoryUsageHigh
        expr: jairouter_tracing_memory_used_ratio > 0.85
        for: 1m
        labels:
          severity: critical
          service: jairouter
        annotations:
          summary: "High tracing system memory usage"
      - alert: TracingSlowRequests
        # histogram_quantile needs per-second bucket rates aggregated by le
        expr: histogram_quantile(0.95, sum by (le) (rate(jairouter_tracing_request_duration_seconds_bucket[5m]))) > 5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile request processing time exceeds 5 seconds"
Grafana Dashboard¶
Core Panel Configuration¶
{
  "dashboard": {
    "title": "JAiRouter Tracing Monitoring",
    "panels": [
      {
        "title": "Request Tracing Overview",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(jairouter_tracing_requests_total[5m])",
            "legendFormat": "RPS"
          }
        ]
      },
      {
        "title": "Tracing Data Export Status",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(jairouter_tracing_spans_exported_total[5m])",
            "legendFormat": "Export Success"
          },
          {
            "expr": "rate(jairouter_tracing_export_errors_total[5m])",
            "legendFormat": "Export Failed"
          }
        ]
      }
    ]
  }
}
Capacity Planning¶
Memory Planning¶
Span Memory Estimation¶
# Each Span occupies approximately 2KB of memory
# 1000 requests per second, 10% sampling rate, 5-minute Span TTL
# Memory requirement = 1000 * 0.1 * 300 * 2KB ≈ 60MB
# Recommended configuration
jairouter:
  tracing:
    memory:
      max-spans: 100000 # Adjust based on memory capacity
      span-ttl: 300s # 5-minute TTL
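The estimate in the comments above can be recomputed for other traffic levels with a small helper. The following is a minimal sketch and not part of JAiRouter: the function names `estimate_span_memory_mb` and `estimate_live_spans` are hypothetical, and it assumes one span per sampled request, the 2KB-per-span figure quoted above, and 1MB = 1000KB so that the result lines up with the ≈60MB example.

```shell
# Estimate steady-state span memory in MB.
# Formula: requests/s * sampling rate * TTL (s) * span size (KB) / 1000.
# Assumption: one span per sampled request; span size defaults to 2KB.
estimate_span_memory_mb() {
  local rps="$1" sample_rate="$2" ttl_seconds="$3" span_kb="${4:-2}"
  awk -v r="$rps" -v s="$sample_rate" -v t="$ttl_seconds" -v k="$span_kb" \
    'BEGIN { printf "%.0f\n", r * s * t * k / 1000 }'
}

# Number of spans held in memory at the same time, useful when sizing max-spans.
estimate_live_spans() {
  local rps="$1" sample_rate="$2" ttl_seconds="$3"
  awk -v r="$rps" -v s="$sample_rate" -v t="$ttl_seconds" \
    'BEGIN { printf "%.0f\n", r * s * t }'
}
```

For the figures above, `estimate_span_memory_mb 1000 0.1 300` prints 60 and `estimate_live_spans 1000 0.1 300` prints 30000, which stays below the suggested max-spans of 100000.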
Dynamic Adjustment Strategy¶
jairouter:
  tracing:
    memory:
      # Memory pressure threshold
      memory-threshold: 0.8
      # Auto cleanup configuration
      auto-cleanup:
        enabled: true
        trigger-threshold: 0.85
        target-threshold: 0.7
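One plausible reading of the two thresholds is a hysteresis band: cleanup starts once usage reaches trigger-threshold and keeps running until usage falls to target-threshold, which avoids rapid on/off switching near a single limit. The sketch below illustrates that reading only; the source does not confirm that JAiRouter behaves this way, and the function name `should_cleanup` is hypothetical.

```shell
# should_cleanup <usage_ratio> <cleaning: 0|1> [trigger] [target]
# Exit status 0 means "run cleanup", 1 means "stay idle".
# Assumption: thresholds act as a hysteresis band (defaults taken from the config above).
should_cleanup() {
  local usage="$1" cleaning="$2" trigger="${3:-0.85}" target="${4:-0.7}"
  awk -v u="$usage" -v c="$cleaning" -v hi="$trigger" -v lo="$target" 'BEGIN {
    # Already cleaning: continue until usage has dropped to the target.
    if (c == 1) exit !(u > lo)
    # Idle: start only once the trigger threshold is reached.
    exit !(u >= hi)
  }'
}
```

With the defaults, a usage of 0.80 does not start a cleanup, but a cleanup that is already running at 0.80 continues until usage reaches 0.7.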
Storage Planning¶
Log Storage¶
<!-- logback-spring.xml -->
<configuration>
  <appender name="TRACING_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/var/log/jairouter/tracing.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
      <fileNamePattern>/var/log/jairouter/tracing.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
      <maxFileSize>100MB</maxFileSize>
      <maxHistory>30</maxHistory>
      <totalSizeCap>10GB</totalSizeCap>
    </rollingPolicy>
    <!-- The appender needs an encoder; this pattern is an example and can be adjusted -->
    <encoder>
      <pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
</configuration>
Security Operations¶
Data Sanitization Check¶
# Regularly check whether sensitive data is properly sanitized (case-insensitive, with line numbers)
grep -inE "password|token|secret" /var/log/jairouter/tracing.log
# Check which headers are configured as sensitive
curl -s http://localhost:8080/actuator/configprops | \
  jq '.jairouter.tracing.security.sensitive_headers'
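The grep check above matches any line that merely mentions a sensitive keyword, including lines where the value has already been masked. For a scheduled job it can help to flag only lines where a keyword is followed by an unmasked value and to signal the result through the exit status. The following is a minimal sketch under stated assumptions: the function name `scan_for_secrets`, the keyword list, and the convention that masked values start with `*` are all hypothetical and should be adapted to the actual log format.

```shell
# scan_for_secrets <log_file>
# Prints suspicious lines with line numbers.
# Exit status: 0 = nothing found, 1 = possible leak, 2 = file not readable.
scan_for_secrets() {
  local log_file="$1"
  [ -r "$log_file" ] || { echo "cannot read $log_file" >&2; return 2; }
  # Keyword, then '=' or ':', optional whitespace, then a value that does
  # not start with '*' (assumed masking character).
  local pattern='(password|token|secret|authorization|api[-_]?key)[=:][[:space:]]*[^*[:space:]]'
  if grep -inE "$pattern" "$log_file"; then
    return 1
  fi
  return 0
}
```

A cron job could call `scan_for_secrets /var/log/jairouter/tracing.log` and raise an alert on exit status 1.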
Access Control Audit¶
# Enable security audit
jairouter:
  tracing:
    security:
      audit:
        enabled: true
        log-access: true
        log-config-changes: true
        retention-days: 90
Encryption Configuration Management¶
# Manage sensitive configuration using environment variables
export JAIROUTER_TRACING_EXPORTER_OTLP_HEADERS_API_KEY="your-api-key"
# Or use Kubernetes Secret
kubectl create secret generic tracing-config \
  --from-literal=api-key=your-api-key
Performance Tuning¶
Real-time Performance Monitoring¶
#!/bin/bash
# Monitoring script example
while true; do
  echo "=== $(date) ==="
  # CPU usage
  echo "CPU: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)"
  # Memory usage
  echo "Memory: $(free -m | awk 'NR==2{printf "%.1f%%", $3*100/$2}')"
  # Tracing metrics (number of active spans)
  curl -s http://localhost:8080/actuator/metrics/jairouter.tracing.spans.active | \
    jq '.measurements[0].value'
  sleep 30
done
Automated Tuning¶
# Configure automatic tuning strategy
jairouter:
  tracing:
    auto-tuning:
      enabled: true
      # Reduce sampling rate when CPU usage exceeds 80%
      cpu-threshold: 80
      sampling-rate-adjustment: 0.5
      # Trigger cleanup when memory usage exceeds 85%
      memory-threshold: 85
      cleanup-aggressive: true
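To reason about what such a setting does to the effective sampling rate, it helps to work through the arithmetic. The sketch below assumes that sampling-rate-adjustment is a multiplier applied whenever CPU usage exceeds cpu-threshold; the source does not spell this out, so treat that as an assumption. The function name `next_sampling_rate` and the lower bound of 0.001 are hypothetical.

```shell
# next_sampling_rate <current_rate> <cpu_percent> [cpu_threshold] [adjustment] [min_rate]
# Prints the sampling rate to use for the next interval.
# Assumption: the adjustment is a multiplier applied only while CPU exceeds the threshold.
next_sampling_rate() {
  local rate="$1" cpu="$2" threshold="${3:-80}" factor="${4:-0.5}" floor="${5:-0.001}"
  awk -v r="$rate" -v c="$cpu" -v t="$threshold" -v f="$factor" -v m="$floor" 'BEGIN {
    if (c > t) r = r * f    # scale down under CPU pressure
    if (r < m) r = m        # never drop below the floor
    printf "%g\n", r
  }'
}
```

With the defaults, `next_sampling_rate 0.1 90` prints 0.05, while `next_sampling_rate 0.1 50` leaves the rate at 0.1.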
Backup and Recovery¶
Configuration Backup¶
#!/bin/bash
# Daily configuration backup script
DATE=$(date +%Y%m%d)
BACKUP_DIR="/backup/jairouter-config"
# Backup current configuration
mkdir -p "$BACKUP_DIR"
curl -s http://localhost:8080/actuator/configprops > \
  "$BACKUP_DIR/config-$DATE.json"
# Keep 30 days of backups
find "$BACKUP_DIR" -name "config-*.json" -mtime +30 -delete
Tracing Data Backup¶
# Configure tracing data export to long-term storage
jairouter:
  tracing:
    exporter:
      backup:
        enabled: true
        location: "/backup/tracing-data"
        retention-days: 90
        compression: true
Upgrade and Maintenance¶
Rolling Upgrade Strategy¶
#!/bin/bash
# Rolling upgrade script
# 1. Health check
curl -f http://localhost:8080/actuator/health/tracing || exit 1
# 2. Export current configuration
curl -s http://localhost:8080/actuator/configprops > /tmp/pre-upgrade-config.json
# 3. Perform upgrade
docker-compose pull jairouter
docker-compose up -d jairouter
# 4. Post-upgrade verification
sleep 30
curl -f http://localhost:8080/actuator/health/tracing || {
  echo "Upgrade failed, rolling back..."
  docker-compose down
  # Rollback logic
  exit 1
}
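The fixed 30-second wait in step 4 can be too short for a slow start and needlessly long for a fast one. One alternative is to poll the health endpoint with a bounded number of retries. The following is a minimal sketch; the function name `wait_until_healthy` and the retry counts are hypothetical, and the command to run is passed in by the caller.

```shell
# wait_until_healthy <max_attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
# Exit status: 0 once the command succeeds, 1 if it never does.
wait_until_healthy() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt follows.
    [ "$attempt" -le "$max_attempts" ] && sleep "$delay"
  done
  echo "still unhealthy after $max_attempts attempt(s)" >&2
  return 1
}
```

In the upgrade script this could replace the sleep and the single check, for example `wait_until_healthy 10 6 curl -f http://localhost:8080/actuator/health/tracing || { ...rollback... }`, which waits at most about one minute.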
Maintenance Window Operations¶
#!/bin/bash
# Maintenance mode script
case "$1" in
  "enter")
    # Enter maintenance mode
    echo "Entering maintenance mode..."
    # Lower the sampling rate to reduce load
    curl -X PUT http://localhost:8080/api/admin/tracing/sampling-rate \
      -H "Content-Type: application/json" \
      -d '{"rate": 0.01}'
    # Wait for current spans to be processed
    sleep 60
    ;;
  "exit")
    # Exit maintenance mode
    echo "Exiting maintenance mode..."
    # Restore normal sampling rate
    curl -X PUT http://localhost:8080/api/admin/tracing/sampling-rate \
      -H "Content-Type: application/json" \
      -d '{"rate": 0.1}'
    ;;
  *)
    echo "Usage: $0 {enter|exit}" >&2
    exit 1
    ;;
esac
Emergency Response¶
Common Emergency Scenarios¶
1. Tracing System Overload¶
# Emergency sampling rate reduction
curl -X PUT http://localhost:8080/api/admin/tracing/emergency-config \
  -H "Content-Type: application/json" \
  -d '{"sampling_rate": 0.001, "reason": "system_overload"}'
# Temporarily disable tracing
curl -X POST http://localhost:8080/api/admin/tracing/disable \
  -H "Content-Type: application/json" \
  -d '{"duration": "1h", "reason": "emergency"}'
2. Exporter Failure¶
# Switch to backup exporter
jairouter:
  tracing:
    exporter:
      fallback:
        enabled: true
        type: "logging" # Temporarily use log export
3. Memory Leak¶
# Force GC and memory cleanup
curl -X POST http://localhost:8080/actuator/gc
curl -X POST http://localhost:8080/api/admin/tracing/force-cleanup
Emergency Contact¶
Establish emergency response procedures:
1. Monitoring Alerts → Automatic notification to operations team
2. Issue Classification → Determine impact scope and priority
3. Emergency Handling → Execute predefined emergency scripts
4. Issue Follow-up → Record and analyze root causes
Best Practices Summary¶
1. Monitoring Strategy¶
- Set multi-level alerts (warning, critical, emergency)
- Regularly check tracing data integrity
- Monitor system resource usage trends
2. Performance Optimization¶
- Adjust sampling rate based on business needs
- Regularly clean up expired data
- Properly configure batch processing size
3. Security Control¶
- Regularly review sensitive data filtering rules
- Enable configuration change audit logs
- Implement principle of least privilege
4. Disaster Recovery¶
- Establish backup and recovery procedures
- Prepare emergency response plans
- Regularly conduct failure drills
Next Steps¶
- Troubleshooting - Detailed problem diagnosis and solutions
- Performance Tuning - In-depth performance optimization guide
- Developer Integration - Developer integration documentation