Monitoring Guide

Document version: 1.0.0
Last updated: 2025-08-19
Git commit: 87d3bddc
Author: Lincoln

JAiRouter exposes health checks, metrics, and logs that help you track the performance, health, and usage patterns of your AI model routing gateway.

Monitoring Overview

The monitoring documentation is organized into the following pages:

Setup - Configure the monitoring infrastructure
Dashboards - Grafana dashboards and visualizations
Alerts - Alert configuration and notifications
Troubleshooting - Monitoring-based troubleshooting
Alert Rules - Instructions for alert rules

Built-in Monitoring Features

Health Checks

JAiRouter provides multiple health check endpoints:

# Overall application health
curl http://localhost:8080/actuator/health

# Detailed health information
curl http://localhost:8080/actuator/health/detailed

# Readiness probe (for Kubernetes)
curl http://localhost:8080/actuator/health/readiness

# Liveness probe (for Kubernetes)
curl http://localhost:8080/actuator/health/liveness
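
The checks above can also be scripted. The following is a minimal sketch of a probe, assuming the standard Spring Boot Actuator response shape in which the body contains a `"status"` field and a non-`UP` aggregate status is returned with a non-200 HTTP code. The class name `HealthProbe` and the regex-based parsing are illustrative choices, not part of JAiRouter.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of JAiRouter.
final class HealthProbe {
    // Matches the first "status" field in the body; this sketch assumes it is the aggregate status.
    private static final Pattern STATUS = Pattern.compile("\"status\"\\s*:\\s*\"([A-Z_]+)\"");

    /** Returns the first status value found in the body, or "UNKNOWN" if none is present. */
    static String parseStatus(String body) {
        Matcher m = STATUS.matcher(body);
        return m.find() ? m.group(1) : "UNKNOWN";
    }

    /** Treats the service as healthy only on HTTP 200 with status UP. */
    static boolean isHealthy(int httpStatus, String body) {
        return httpStatus == 200 && "UP".equals(parseStatus(body));
    }

    public static void main(String[] args) throws InterruptedException {
        String url = args.length > 0 ? args[0] : "http://localhost:8080/actuator/health";
        HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
            .timeout(Duration.ofSeconds(5))
            .GET()
            .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            boolean healthy = isHealthy(response.statusCode(), response.body());
            System.out.println(url + " -> " + (healthy ? "healthy" : "unhealthy"));
        } catch (IOException e) {
            // Connection refused, timeout, and similar failures end up here
            System.out.println(url + " -> unreachable: " + e.getMessage());
        }
    }
}
```

Passing a different URL as the first argument, for example the readiness endpoint, reuses the same logic for that probe.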

Metrics Collection

JAiRouter exposes metrics in multiple formats:

# Prometheus metrics
curl http://localhost:8080/actuator/prometheus

# JSON metrics
curl http://localhost:8080/actuator/metrics

# Specific metric
curl http://localhost:8080/actuator/metrics/http.server.requests
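
For ad hoc checks it can help to aggregate the Prometheus output by metric name. The sketch below parses the text exposition format (comment lines starting with `#`, then `name{labels} value [timestamp]`) and sums the samples per name. It is deliberately simplified: labels are ignored, and the class name `PrometheusTextParser` and the sample metric and label names in `main` are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a simplified parser, not a full implementation of the format.
final class PrometheusTextParser {

    /** Sums sample values per metric name, ignoring labels, comments, and timestamps. */
    static Map<String, Double> sumByName(String exposition) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (String raw : exposition.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or HELP/TYPE comment
            }
            int open = line.indexOf('{');
            int close = line.lastIndexOf('}');
            int nameEnd = open >= 0 ? open : line.indexOf(' ');
            int valueStart = open >= 0 ? close + 1 : nameEnd + 1;
            if (nameEnd <= 0 || (open >= 0 && close < open)) {
                continue; // malformed line
            }
            String[] tokens = line.substring(valueStart).trim().split("\\s+");
            if (tokens[0].isEmpty()) {
                continue; // no value present
            }
            // tokens[0] is the value; an optional timestamp may follow in tokens[1]
            totals.merge(line.substring(0, nameEnd), parseValue(tokens[0]), Double::sum);
        }
        return totals;
    }

    /** Parses a sample value, accepting the +Inf/-Inf spellings used by the text format. */
    static double parseValue(String token) {
        switch (token) {
            case "+Inf":
            case "Inf":
                return Double.POSITIVE_INFINITY;
            case "-Inf":
                return Double.NEGATIVE_INFINITY;
            default:
                return Double.parseDouble(token);
        }
    }

    public static void main(String[] args) {
        // Sample input; the "service" label is an assumption for illustration
        String sample = String.join("\n",
            "# TYPE jairouter_requests_total counter",
            "jairouter_requests_total{service=\"chat\"} 120.0",
            "jairouter_requests_total{service=\"embedding\"} 30.0",
            "jvm_threads_live_threads 42.0");
        sumByName(sample).forEach((name, total) -> System.out.println(name + " = " + total));
    }
}
```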

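Application Information

Retrieve build, environment, and configuration details. Spring Boot does not expose the `env` and `configprops` endpoints over HTTP by default, and they can reveal configuration details, so enable them only where access is restricted: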

# Application info
curl http://localhost:8080/actuator/info

# Environment details
curl http://localhost:8080/actuator/env

# Configuration properties
curl http://localhost:8080/actuator/configprops

Key Metrics

Request Metrics

Metric	Description	Type
`http.server.requests`	HTTP request duration and count	Timer
`jairouter.requests.total`	Total requests by service type	Counter
`jairouter.requests.duration`	Request processing time	Timer
`jairouter.requests.errors`	Error count by type	Counter

Load Balancer Metrics

Metric	Description	Type
`jairouter.loadbalancer.requests`	Requests per instance	Counter
`jairouter.loadbalancer.active_connections`	Active connections per instance	Gauge
`jairouter.loadbalancer.instance_health`	Instance health status	Gauge

Rate Limiter Metrics

Metric	Description	Type
`jairouter.ratelimit.requests.allowed`	Allowed requests	Counter
`jairouter.ratelimit.requests.denied`	Denied requests	Counter
`jairouter.ratelimit.tokens.available`	Available tokens	Gauge

Circuit Breaker Metrics

Metric	Description	Type
`jairouter.circuitbreaker.state`	Circuit breaker state	Gauge
`jairouter.circuitbreaker.failures`	Failure count	Counter
`jairouter.circuitbreaker.successes`	Success count	Counter
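
The tables above use Micrometer-style dotted names, which is how the metrics appear under `/actuator/metrics`. On `/actuator/prometheus` the same metrics typically appear under different names. With Micrometer's default Prometheus naming convention, dots become underscores, counters gain a `_total` suffix, and timers are exported as `_seconds_count`, `_seconds_sum`, and `_seconds_max` series. The sketch below illustrates that mapping. It is a simplification that assumes JAiRouter does not customize the convention, so verify the actual names against the scrape output.

```java
import java.util.List;

// Hypothetical helper: a simplified model of Micrometer's default Prometheus naming.
final class PrometheusNames {
    enum MeterType { COUNTER, GAUGE, TIMER }

    /** Returns the Prometheus series names expected for a dotted Micrometer meter name. */
    static List<String> seriesFor(String micrometerName, MeterType type) {
        String base = micrometerName.replace('.', '_');
        switch (type) {
            case COUNTER:
                // Counters end in _total; the suffix is not duplicated if already present
                return List.of(base.endsWith("_total") ? base : base + "_total");
            case TIMER:
                // Timers are exported in seconds as count, sum, and max series
                return List.of(base + "_seconds_count", base + "_seconds_sum", base + "_seconds_max");
            default:
                // Gauges without a base unit keep the converted name as-is
                return List.of(base);
        }
    }

    public static void main(String[] args) {
        System.out.println(seriesFor("jairouter.requests.errors", MeterType.COUNTER));
        System.out.println(seriesFor("jairouter.requests.duration", MeterType.TIMER));
        System.out.println(seriesFor("jairouter.circuitbreaker.state", MeterType.GAUGE));
    }
}
```

This mapping is why a PromQL expression refers to `jairouter_requests_errors_total` rather than `jairouter.requests.errors`.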

Quick Monitoring Setup

Docker Compose with Monitoring Stack

version: '3.8'
services:
  jairouter:
    image: sodlinken/jairouter:latest
    ports:
      - "8080:8080"
    volumes:
      - ./config:/app/config:ro
    environment:
      - MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE=health,metrics,prometheus,info

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      # Default credentials for local testing only; change before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro

volumes:
  grafana-storage:

Prometheus Configuration

# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'jairouter'
    static_configs:
      - targets: ['jairouter:8080']
    metrics_path: '/actuator/prometheus'
    scrape_interval: 10s
    scrape_timeout: 5s

Grafana Data Source

# monitoring/grafana/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

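Custom Metrics

Adding Custom Metrics

You can add custom metrics by registering meters with the injected Micrometer `MeterRegistry`:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class CustomMetrics {
    private final MeterRegistry meterRegistry;
    private final Counter customRequestCounter;
    private final Timer customProcessingTimer;

    public CustomMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        // Counter: monotonically increasing count of custom requests
        this.customRequestCounter = Counter.builder("jairouter.custom.requests")
            .description("Custom request counter")
            .register(meterRegistry);
        // Timer: records both the number and the duration of processing steps
        this.customProcessingTimer = Timer.builder("jairouter.custom.processing.time")
            .description("Custom processing time")
            .register(meterRegistry);
    }

    // Counts one request and times the supplied work
    public void recordRequest(Runnable work) {
        customRequestCounter.increment();
        customProcessingTimer.record(work);
    }
}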

Business Metrics

Track business-specific metrics:

# Model usage statistics
curl http://localhost:8080/actuator/metrics/jairouter.model.usage

# Token consumption
curl http://localhost:8080/actuator/metrics/jairouter.tokens.consumed

# Cost tracking
curl http://localhost:8080/actuator/metrics/jairouter.cost.total

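Alerting

Basic Alert Rules

# monitoring/alert-rules.yml
# Prometheus loads this file only if it is listed under rule_files in prometheus.yml
# and is mounted into the Prometheus container.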
groups:
  - name: jairouter
    rules:
      - alert: JAiRouterDown
        expr: up{job="jairouter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "JAiRouter instance is down"
          description: "JAiRouter instance {{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighErrorRate
        expr: rate(jairouter_requests_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors per second."

      - alert: CircuitBreakerOpen
        expr: jairouter_circuitbreaker_state == 1
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker is open"
          description: "Circuit breaker for {{ $labels.service }} is open."
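
To make the `HighErrorRate` threshold concrete: `rate(...[5m])` yields a per-second increase, so `> 0.1` means more than 0.1 errors per second sustained over the window, which is about 30 errors in 5 minutes. It is an absolute rate, not a percentage of traffic. The sketch below reproduces the arithmetic for two counter samples. It is a simplified model: Prometheus also extrapolates to the window boundaries, which this omits, and the class name `CounterRate` is illustrative.

```java
// Hypothetical helper: a simplified model of a per-second counter rate.
final class CounterRate {

    /**
     * Per-second increase between two counter samples.
     * A drop in value is treated as a counter reset, so the later value counts as the increase.
     */
    static double perSecond(double earlierValue, double laterValue, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        double increase = laterValue >= earlierValue ? laterValue - earlierValue : laterValue;
        return increase / elapsedSeconds;
    }

    /** Mirrors the strict comparison used by the alert expression. */
    static boolean exceeds(double ratePerSecond, double threshold) {
        return ratePerSecond > threshold;
    }

    public static void main(String[] args) {
        // 60 new errors in 300 seconds: above the 0.1/s threshold
        double busy = perSecond(1200, 1260, 300);
        // 15 new errors in 300 seconds: below the threshold
        double quiet = perSecond(1200, 1215, 300);
        System.out.println("busy rate fires: " + exceeds(busy, 0.1));
        System.out.println("quiet rate fires: " + exceeds(quiet, 0.1));
    }
}
```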

Notification Channels

Configure alert notifications:

# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@jairouter.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@jairouter.com'
        # Alertmanager sets the subject through headers and the plain-text body through text
        headers:
          Subject: 'JAiRouter Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

Log Monitoring

Structured Logging

Use a consistent log pattern that includes the trace ID, so that log lines can be parsed and correlated across requests:

# application.yml
logging:
  level:
    org.unreal.modelrouter: INFO
  pattern:
    console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level [%X{traceId}] %logger{36} - %msg%n"
    file: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level [%X{traceId}] %logger{36} - %msg%n"
  file:
    name: logs/jairouter.log
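
A fixed pattern like the one above makes each log line machine-parseable. The sketch below shows one way to split such a line into fields with a regular expression written against that pattern. The class name `LogLineParser` and the sample line in `main` (including its logger name) are illustrative assumptions. Lines that do not match, such as stack trace continuations, are returned as empty.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses lines written with the console/file pattern shown above.
final class LogLineParser {
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) " // %d{yyyy-MM-dd HH:mm:ss}
        + "\\[([^\\]]*)\\] "                            // [%thread]
        + "(\\w+)\\s+"                                  // %-5level, padded to five characters
        + "\\[([^\\]]*)\\] "                            // [%X{traceId}], empty when unset
        + "(\\S+) - "                                   // %logger{36}
        + "(.*)$");                                     // %msg

    /** Returns the parsed fields, or empty if the line does not follow the pattern. */
    static Optional<Map<String, String>> parse(String line) {
        Matcher m = LINE.matcher(line.stripTrailing());
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", m.group(1));
        fields.put("thread", m.group(2));
        fields.put("level", m.group(3));
        fields.put("traceId", m.group(4));
        fields.put("logger", m.group(5));
        fields.put("message", m.group(6));
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        // Sample line; the thread, trace ID, and logger name are made up for illustration
        String sample = "2025-08-19 10:15:30 [reactor-http-nio-2] INFO  [4bf92f35] "
            + "o.u.m.ExampleRouter - Routed request";
        parse(sample).ifPresent(fields -> fields.forEach((k, v) -> System.out.println(k + " = " + v)));
    }
}
```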

Log Aggregation

Ship logs to the ELK stack or a similar system for aggregation:

# filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /app/logs/*.log
  fields:
    service: jairouter
  fields_under_root: true

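output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "jairouter-logs-%{+yyyy.MM.dd}"

# Filebeat expects a template name and pattern when the index name is customized
setup.template.name: "jairouter-logs"
setup.template.pattern: "jairouter-logs-*"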

Performance Monitoring

JVM Metrics

Monitor JVM performance:

# Heap memory usage (without the tag filter, the value also includes non-heap areas)
curl "http://localhost:8080/actuator/metrics/jvm.memory.used?tag=area:heap"

# Garbage collection
curl http://localhost:8080/actuator/metrics/jvm.gc.pause

# Thread count
curl http://localhost:8080/actuator/metrics/jvm.threads.live

Application Performance

Track application-specific performance:

# Connection pool metrics
curl http://localhost:8080/actuator/metrics/hikaricp.connections

# HTTP client metrics
curl http://localhost:8080/actuator/metrics/http.client.requests

# Cache metrics (if using cache)
curl http://localhost:8080/actuator/metrics/cache.gets

Monitoring Best Practices

1. Set Up Proper Alerting

Alert on symptoms, not causes
Use appropriate thresholds
Avoid alert fatigue

2. Monitor Key Business Metrics

Request success rate
Response time percentiles
Model usage patterns
Cost metrics
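
As an illustration of the first two items, the sketch below computes a request success rate from total and error counts, and a response time percentile using the nearest-rank method. It is a minimal sketch with made-up sample data. In a running deployment these values would normally come from Prometheus queries rather than application code, and the class name `LatencyStats` is illustrative.

```java
import java.util.Arrays;

// Hypothetical helper: illustrates the arithmetic behind success rate and percentiles.
final class LatencyStats {

    /** Fraction of requests that did not fail; reported as 1.0 when there was no traffic. */
    static double successRate(long total, long errors) {
        if (total <= 0) {
            return 1.0;
        }
        return (double) (total - errors) / total;
    }

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need at least one sample and 0 < p <= 100");
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Made-up response times in milliseconds, including one slow outlier
        double[] latenciesMs = {120, 80, 95, 300, 110, 105, 90, 100, 85, 2500};
        System.out.println("success rate = " + successRate(1000, 25));
        System.out.println("p50 = " + percentile(latenciesMs, 50));
        System.out.println("p95 = " + percentile(latenciesMs, 95));
    }
}
```

In this sample the median stays at 100 ms while the 95th percentile is dominated by the single 2500 ms outlier, which is the kind of tail behavior that averages tend to hide.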

3. Use Dashboards Effectively

Create role-specific dashboards
Include both technical and business metrics
Use appropriate time ranges

4. Regular Health Checks

Implement comprehensive health checks
Monitor dependencies
Use circuit breakers appropriately

Troubleshooting with Monitoring

High Response Times

Check load balancer metrics
Examine backend service health
Review rate limiting settings
Analyze JVM metrics

High Error Rates

Check circuit breaker status
Review backend service logs
Examine request patterns
Verify configuration

Memory Issues

Monitor JVM heap usage
Check for memory leaks
Review garbage collection metrics
Analyze thread usage

Next Steps

Setup - Set up monitoring infrastructure
Dashboards - Create monitoring dashboards
Alerts - Configure alerting rules
Troubleshooting - Use monitoring for troubleshooting