Alert Configuration Guide¶
Document version: 1.0.0
Last updated: 2025-08-19
Git commit: c1aa5b0f
Author: Lincoln
This document describes how to configure and manage the JAiRouter alerting system, including alert rule setup, notification configuration, and alert handling procedures.
Alert Architecture¶
```mermaid
graph TB
    subgraph "Metric Collection"
        A[JAiRouter Application] --> B[Prometheus]
    end

    subgraph "Alert Processing"
        B --> C[Alert Rule Evaluation]
        C --> D[AlertManager]
        D --> E[Notification Routing]
    end

    subgraph "Notification Channels"
        E --> F[Email]
        E --> G[Slack]
        E --> H[DingTalk]
        E --> I[SMS]
        E --> J[Webhook]
    end

    subgraph "Alert Management"
        K[Grafana Alerts] --> D
        L[Silence Rules] --> D
        M[Inhibition Rules] --> D
    end
```
Alert Rule Configuration¶
Basic Alert Rules¶
Create `monitoring/prometheus/rules/jairouter-alerts.yml`:
```yaml
groups:
  - name: jairouter.critical
    interval: 30s
    rules:
      # Service Unavailable
      - alert: JAiRouterDown
        expr: up{job="jairouter"} == 0
        for: 1m
        labels:
          severity: critical
          service: jairouter
        annotations:
          summary: "JAiRouter Service Unavailable"
          description: "JAiRouter service has stopped responding for more than 1 minute"
          runbook_url: "https://jairouter.com/troubleshooting/service-down"

      # High Error Rate
      - alert: HighErrorRate
        expr: sum(rate(jairouter_requests_total{status=~"5.."}[5m])) / sum(rate(jairouter_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
          service: jairouter
        annotations:
          summary: "High Error Rate Alert"
          description: "5xx error rate exceeds 5%, current value: {{ $value | humanizePercentage }}"
          runbook_url: "https://jairouter.com/troubleshooting/high-error-rate"

      # High Latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(jairouter_request_duration_seconds_bucket[5m])) by (le)) > 5
        for: 5m
        labels:
          severity: critical
          service: jairouter
        annotations:
          summary: "Response Time Too Long"
          description: "P95 response time exceeds 5 seconds, current value: {{ $value }}s"
          runbook_url: "https://jairouter.com/troubleshooting/high-latency"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.90
        for: 2m
        labels:
          severity: critical
          service: jairouter
        annotations:
          summary: "Memory Usage Too High"
          description: "JVM heap memory usage exceeds 90%, current value: {{ $value | humanizePercentage }}"
          runbook_url: "https://jairouter.com/troubleshooting/memory-issues"

      # Backend Service Unavailable
      - alert: BackendServiceDown
        expr: jairouter_backend_health == 0
        for: 1m
        labels:
          severity: critical
          service: jairouter
          adapter: "{{ $labels.adapter }}"
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "Backend Service Unavailable"
          description: "Backend service {{ $labels.adapter }}/{{ $labels.instance }} health check failed"
          runbook_url: "https://jairouter.com/troubleshooting/backend-down"

  - name: jairouter.warning
    interval: 60s
    rules:
      # Moderate Error Rate
      - alert: ModerateErrorRate
        expr: sum(rate(jairouter_requests_total{status=~"4..|5.."}[5m])) / sum(rate(jairouter_requests_total[5m])) > 0.10
        for: 5m
        labels:
          severity: warning
          service: jairouter
        annotations:
          summary: "Error Rate High"
          description: "Total error rate exceeds 10%, current value: {{ $value | humanizePercentage }}"

      # Response Time Warning
      - alert: ModerateLatency
        expr: histogram_quantile(0.95, sum(rate(jairouter_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 10m
        labels:
          severity: warning
          service: jairouter
        annotations:
          summary: "Response Time High"
          description: "P95 response time exceeds 2 seconds, current value: {{ $value }}s"

      # Memory Usage Warning
      - alert: ModerateMemoryUsage
        expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.80
        for: 5m
        labels:
          severity: warning
          service: jairouter
        annotations:
          summary: "Memory Usage High"
          description: "JVM heap memory usage exceeds 80%, current value: {{ $value | humanizePercentage }}"

      # Circuit Breaker Open
      - alert: CircuitBreakerOpen
        expr: jairouter_circuit_breaker_state == 1
        for: 1m
        labels:
          severity: warning
          service: jairouter
          circuit_breaker: "{{ $labels.circuit_breaker }}"
        annotations:
          summary: "Circuit Breaker Open"
          description: "Circuit breaker {{ $labels.circuit_breaker }} is open"

      # High Rate Limit Rejection
      - alert: HighRateLimitRejection
        expr: sum(rate(jairouter_rate_limit_events_total{result="denied"}[5m])) / sum(rate(jairouter_rate_limit_events_total[5m])) > 0.20
        for: 5m
        labels:
          severity: warning
          service: jairouter
        annotations:
          summary: "Rate Limit Rejection Rate High"
          description: "Rate limit rejection rate exceeds 20%, current value: {{ $value | humanizePercentage }}"

      # Load Imbalance
      - alert: LoadImbalance
        expr: |
          (
            max(sum by (instance) (rate(jairouter_backend_calls_total[5m]))) -
            min(sum by (instance) (rate(jairouter_backend_calls_total[5m])))
          ) / avg(sum by (instance) (rate(jairouter_backend_calls_total[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
          service: jairouter
        annotations:
          summary: "Load Imbalance"
          description: "Load difference between instances exceeds 50%"

  - name: jairouter.business
    interval: 60s
    rules:
      # High Model Call Failure Rate
      - alert: HighModelCallFailureRate
        expr: sum(rate(jairouter_model_calls_total{status!="success"}[5m])) / sum(rate(jairouter_model_calls_total[5m])) > 0.10
        for: 5m
        labels:
          severity: warning
          service: jairouter
        annotations:
          summary: "Model Call Failure Rate High"
          description: "Model call failure rate exceeds 10%, current value: {{ $value | humanizePercentage }}"

      # Unusual Active Session Count
      - alert: UnusualActiveSessionCount
        expr: |
          (
            sum(jairouter_user_sessions_active) >
            (avg_over_time(sum(jairouter_user_sessions_active)[1h:5m]) * 2)
          ) or (
            sum(jairouter_user_sessions_active) <
            (avg_over_time(sum(jairouter_user_sessions_active)[1h:5m]) * 0.5)
          )
        for: 10m
        labels:
          severity: info
          service: jairouter
        annotations:
          summary: "Unusual Active Session Count"
          description: "Current active session count: {{ $value }}, significantly different from historical average"
```
Business-Specific Alert Rules¶
```yaml
groups:
  - name: jairouter.business-specific
    interval: 60s
    rules:
      # Slow Chat Service Response
      - alert: ChatServiceSlowResponse
        expr: histogram_quantile(0.95, sum(rate(jairouter_request_duration_seconds_bucket{service="chat"}[5m])) by (le)) > 3
        for: 5m
        labels:
          severity: warning
          service: jairouter
          business_service: chat
        annotations:
          summary: "Chat Service Slow Response"
          description: "Chat service P95 response time exceeds 3 seconds"

      # Embedding Service Traffic Drop
      - alert: EmbeddingServiceLowTraffic
        expr: sum(rate(jairouter_requests_total{service="embedding"}[5m])) < (avg_over_time(sum(rate(jairouter_requests_total{service="embedding"}[5m]))[1h:5m]) * 0.3)
        for: 15m
        labels:
          severity: info
          service: jairouter
          business_service: embedding
        annotations:
          summary: "Embedding Service Traffic Drop"
          description: "Embedding service request volume is more than 70% below the historical average"

      # Specific Model Provider Down
      - alert: ModelProviderDown
        expr: sum by (provider) (jairouter_backend_health{adapter=~".*"}) == 0
        for: 2m
        labels:
          severity: critical
          service: jairouter
          provider: "{{ $labels.provider }}"
        annotations:
          summary: "Model Provider Service Down"
          description: "All instances of model provider {{ $labels.provider }} are unavailable"
```
AlertManager Configuration¶
Basic Configuration¶
Create `monitoring/alertmanager/alertmanager.yml`:
```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@jairouter.com'
  smtp_auth_username: 'alerts@jairouter.com'
  smtp_auth_password: 'your-password'

# Alert routing configuration
route:
  group_by: ['alertname', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    # Critical alerts notify immediately
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 0s
      repeat_interval: 5m

    # Warning alerts delay notification
    - match:
        severity: warning
      receiver: 'warning-alerts'
      group_wait: 30s
      repeat_interval: 30m

    # Business alerts special handling (label must be present and non-empty)
    - match_re:
        business_service: '.+'
      receiver: 'business-alerts'
      group_wait: 15s
      repeat_interval: 15m

# Inhibition rules
inhibit_rules:
  # Suppress other alerts when the service is unavailable
  - source_match:
      alertname: JAiRouterDown
    target_match:
      service: jairouter
    equal: ['service']

  # Critical alerts suppress warning alerts
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['service', 'alertname']

# Receiver configuration
receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@jairouter.com'
        headers:
          Subject: 'JAiRouter Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          {{ end }}

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@jairouter.com'
        headers:
          Subject: '🚨 Critical Alert: {{ .GroupLabels.alertname }}'
        text: |
          Critical alert triggered!
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Service: {{ .Labels.service }}
          Time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          Runbook: {{ .Annotations.runbook_url }}
          {{ end }}
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts-critical'
        title: '🚨 JAiRouter Critical Alert'
        text: |
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}

  - name: 'warning-alerts'
    email_configs:
      - to: 'team@jairouter.com'
        headers:
          Subject: '⚠️ Warning Alert: {{ .GroupLabels.alertname }}'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts-warning'
        title: '⚠️ JAiRouter Warning Alert'

  - name: 'business-alerts'
    email_configs:
      - to: 'business@jairouter.com'
        headers:
          Subject: '📊 Business Alert: {{ .GroupLabels.alertname }}'
    webhook_configs:
      - url: 'http://your-webhook-endpoint/alerts'
        send_resolved: true
```
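After editing the configuration, validate it and reload the running AlertManager without a restart. A typical sequence (the file path is an example):

```bash
# Validate configuration syntax and referenced templates
amtool check-config monitoring/alertmanager/alertmanager.yml

# Hot-reload a running AlertManager instance
curl -X POST http://localhost:9093/-/reload
```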
Advanced Routing Configuration¶
```yaml
# Complex routing example
route:
  group_by: ['alertname', 'service', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    # Different handling for business hours and after hours
    - match:
        severity: critical
      receiver: 'critical-business-hours'
      active_time_intervals:
        - business-hours
      continue: true   # keep evaluating so the after-hours route can also match
    - match:
        severity: critical
      receiver: 'critical-after-hours'
      active_time_intervals:
        - after-hours

    # Specific service alerts
    - match:
        service: jairouter
        alertname: JAiRouterDown
      receiver: 'service-down'
      group_wait: 0s
      repeat_interval: 2m

# Time interval definitions
time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '18:00'
        weekdays: ['monday:friday']
        location: 'Asia/Shanghai'
  - name: after-hours
    time_intervals:
      - times:
          - start_time: '18:00'
            end_time: '09:00'
        weekdays: ['monday:friday']
        location: 'Asia/Shanghai'
      - weekdays: ['saturday', 'sunday']
        location: 'Asia/Shanghai'
```
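To check which receiver a given alert would reach under this routing tree, amtool can evaluate the routes offline. A quick sketch, assuming the configuration file path used above:

```bash
# Print the routing tree
amtool config routes show --config.file=monitoring/alertmanager/alertmanager.yml

# Resolve the receiver for a specific label set
amtool config routes test --config.file=monitoring/alertmanager/alertmanager.yml \
  severity=critical service=jairouter alertname=JAiRouterDown
```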
Notification Channel Configuration¶
Email Notifications¶
```yaml
receivers:
  - name: 'email-alerts'
    email_configs:
      - to: 'alerts@jairouter.com'
        from: 'noreply@jairouter.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'noreply@jairouter.com'
        auth_password: 'your-password'
        headers:
          Subject: 'JAiRouter Alert: {{ .GroupLabels.alertname }}'
          Priority: 'high'
        html: |
          <!DOCTYPE html>
          <html>
          <head>
            <style>
              .alert { padding: 10px; margin: 10px 0; border-radius: 5px; }
              .critical { background-color: #ffebee; border-left: 5px solid #f44336; }
              .warning { background-color: #fff3e0; border-left: 5px solid #ff9800; }
            </style>
          </head>
          <body>
            <h2>JAiRouter Alert Notification</h2>
            {{ range .Alerts }}
            <div class="alert {{ .Labels.severity }}">
              <h3>{{ .Annotations.summary }}</h3>
              <p><strong>Description:</strong> {{ .Annotations.description }}</p>
              <p><strong>Service:</strong> {{ .Labels.service }}</p>
              <p><strong>Severity:</strong> {{ .Labels.severity }}</p>
              <p><strong>Start Time:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
              {{ if .Annotations.runbook_url }}
              <p><strong>Runbook:</strong> <a href="{{ .Annotations.runbook_url }}">View</a></p>
              {{ end }}
            </div>
            {{ end }}
          </body>
          </html>
```
Slack Notifications¶
```yaml
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#jairouter-alerts'
        username: 'AlertManager'
        icon_emoji: ':warning:'
        title: '{{ if eq .Status "firing" }}🚨{{ else }}✅{{ end }} JAiRouter Alert'
        title_link: 'http://localhost:9093'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Service:* {{ .Labels.service }}
          *Severity:* {{ .Labels.severity }}
          *Time:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          {{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
          ---
          {{ end }}
        actions:
          - type: button
            text: 'View Grafana'
            url: 'http://localhost:3000'
          - type: button
            text: 'View Prometheus'
            url: 'http://localhost:9090'
```
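Before wiring the webhook into AlertManager, the Slack incoming-webhook URL itself can be verified with a plain curl call (replace the URL with your own):

```bash
curl -X POST -H 'Content-Type: application/json' \
  --data '{"text": "JAiRouter AlertManager connectivity test"}' \
  'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
```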
DingTalk Notifications¶
```yaml
receivers:
  - name: 'dingtalk-alerts'
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true
        http_config:
          proxy_url: 'http://proxy.example.com:8080'
        body: |
          {
            "msgtype": "markdown",
            "markdown": {
              "title": "JAiRouter Alert Notification",
              "text": "## JAiRouter Alert Notification\n\n{{ range .Alerts }}**Alert:** {{ .Annotations.summary }}\n\n**Description:** {{ .Annotations.description }}\n\n**Service:** {{ .Labels.service }}\n\n**Severity:** {{ .Labels.severity }}\n\n**Time:** {{ .StartsAt.Format \"2006-01-02 15:04:05\" }}\n\n---\n\n{{ end }}"
            }
          }
```
SMS Notifications¶
```yaml
receivers:
  - name: 'sms-alerts'
    webhook_configs:
      - url: 'http://your-sms-gateway/send'
        http_config:
          basic_auth:
            username: 'your-username'
            password: 'your-password'
        body: |
          {
            "to": ["13800138000", "13900139000"],
            "message": "JAiRouter Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
          }
```
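Note that AlertManager's built-in `webhook_configs` always posts a fixed JSON payload and has no templated `body` field, so the DingTalk and SMS examples above assume a gateway in front of the target API that renders such payloads. For DingTalk, a commonly used bridge is prometheus-webhook-dingtalk; the sketch below uses example host, port, and target names and should be checked against the bridge's own documentation:

```yaml
# AlertManager side: point webhook_configs at the bridge instead of DingTalk directly
receivers:
  - name: 'dingtalk-alerts'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/webhook1/send'
        send_resolved: true

# prometheus-webhook-dingtalk side (config.yml): map the target to the robot URL
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN
```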
Alert Silencing and Inhibition¶
Silence Rules¶
```bash
# Create silence rules using amtool
amtool silence add alertname="HighMemoryUsage" --duration="2h" --comment="Memory optimization maintenance"

# Silence all alerts for a specific service
amtool silence add service="jairouter" --duration="30m" --comment="Service maintenance"

# Silence alerts for a specific instance
amtool silence add instance="jairouter-01" --duration="1h" --comment="Instance restart"
```
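Silences can also be created programmatically through the AlertManager v2 API, which is convenient for maintenance automation. A sketch (timestamps, matcher values, and the author field are examples):

```bash
# Create a two-hour maintenance silence for the jairouter service
curl -X POST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [
      {"name": "service", "value": "jairouter", "isRegex": false}
    ],
    "startsAt": "2025-08-19T10:00:00Z",
    "endsAt": "2025-08-19T12:00:00Z",
    "createdBy": "ops-team",
    "comment": "Service maintenance"
  }'

# List existing silences
curl http://localhost:9093/api/v2/silences
```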
Inhibition Rule Configuration¶
```yaml
inhibit_rules:
  # Suppress other related alerts when the service is completely unavailable
  - source_match:
      alertname: JAiRouterDown
    target_match_re:
      alertname: '(HighLatency|HighErrorRate|HighMemoryUsage)'
    equal: ['service']

  # Suppress related business alerts when a backend service is unavailable
  - source_match:
      alertname: BackendServiceDown
    target_match:
      alertname: HighModelCallFailureRate
    equal: ['service']

  # Critical level alerts suppress warning level alerts
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['service', 'alertname']
```
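To confirm that an inhibition rule is actually suppressing alerts, query AlertManager with suppressed states included; recent amtool versions expose flags for this, a minimal sketch:

```bash
# Show inhibited and silenced alerts in addition to active ones
amtool alert query --inhibited --silenced
```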
Alert Testing¶
Manual Alert Triggering¶
```bash
# Stop JAiRouter service to test the service-unavailable alert
docker stop jairouter

# Simulate high memory usage
curl -X POST http://localhost:8080/actuator/test/memory-stress

# Simulate a high error rate
for i in {1..100}; do curl http://localhost:8080/invalid-endpoint; done
```
Alert Rule Validation¶
```bash
# Validate alert rule syntax
promtool check rules monitoring/prometheus/rules/jairouter-alerts.yml

# Test alert rule expressions
promtool query instant http://localhost:9090 'up{job="jairouter"} == 0'

# View current active alerts
curl http://localhost:9090/api/v1/alerts
```
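promtool can also unit-test alert rules against synthetic series, which catches threshold and `for:` mistakes before deployment. A minimal test file for the JAiRouterDown rule is sketched below, assuming the job and instance labels shown; run it with `promtool test rules jairouter-alerts-test.yml`:

```yaml
# jairouter-alerts-test.yml
rule_files:
  - jairouter-alerts.yml
evaluation_interval: 30s

tests:
  - interval: 30s
    # up = 1 for one minute, then 0 for two minutes
    input_series:
      - series: 'up{job="jairouter", instance="jairouter:8080"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      - eval_time: 3m
        alertname: JAiRouterDown
        exp_alerts:
          - exp_labels:
              job: jairouter
              instance: jairouter:8080
              severity: critical
              service: jairouter
            exp_annotations:
              summary: "JAiRouter Service Unavailable"
              description: "JAiRouter service has stopped responding for more than 1 minute"
              runbook_url: "https://jairouter.com/troubleshooting/service-down"
```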
AlertManager Testing¶
```bash
# Check AlertManager configuration
amtool config show

# View current alerts
amtool alert query

# View silence rules
amtool silence query

# Test notifications
amtool alert add alertname="TestAlert" service="jairouter" severity="warning"
```
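Alerts can also be injected directly through the AlertManager v2 API, which exercises grouping, routing, and notification delivery end to end (label values below are arbitrary test data):

```bash
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "TestAlert", "service": "jairouter", "severity": "warning"},
    "annotations": {"summary": "Synthetic test alert"},
    "generatorURL": "http://localhost:9090"
  }]'
```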
Alert Handling Process¶
Alert Response Flow¶
```mermaid
graph TD
    A[Alert Triggered] --> B[Alert Notification]
    B --> C[Oncall Confirmation]
    C --> D{Severity}
    D -->|Critical| E[Immediate Response]
    D -->|Warning| F[Scheduled Response]
    D -->|Info| G[Tracking]
    E --> H[Impact Assessment]
    F --> H
    G --> I[Periodic Review]
    H --> J[Quick Fix]
    J --> K[Root Cause Analysis]
    K --> L[Permanent Fix]
    L --> M[Documentation Update]
    M --> N[Process Improvement]
```
Alert Handling Checklist¶
Critical Alert Handling¶
- [ ] Confirm alert authenticity
- [ ] Assess business impact scope
- [ ] Notify relevant teams
- [ ] Execute emergency response plan
- [ ] Document handling process
- [ ] Implement temporary fix
- [ ] Monitor fix effectiveness
- [ ] Conduct root cause analysis
- [ ] Implement permanent fix
- [ ] Update documentation and processes
Warning Alert Handling¶
- [ ] Confirm alert validity
- [ ] Assess potential risks
- [ ] Schedule handling time
- [ ] Implement preventive measures
- [ ] Monitor trend changes
- [ ] Document handling results
Alert Escalation Mechanism¶
```yaml
# Alert escalation configuration example
route:
  routes:
    - match:
        severity: critical
      receiver: 'level1-oncall'
      group_wait: 0s
      repeat_interval: 5m
      routes:
        # Escalate to level 2 oncall after 15 minutes
        - match:
            severity: critical
          receiver: 'level2-oncall'
          group_wait: 15m
          repeat_interval: 10m
          routes:
            # Escalate to management after 30 minutes
            - match:
                severity: critical
              receiver: 'management'
              group_wait: 30m
              repeat_interval: 15m
```
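Note that AlertManager has no built-in time-based escalation: routing selects a single receiver per alert (the deepest matching route), so in the nested example above critical alerts end up only at the innermost matching receiver rather than stepping through the levels. Timed escalation is usually delegated to an on-call tool such as PagerDuty, Opsgenie, or Grafana OnCall. If the goal is simply to notify several receivers at once, `continue: true` can fan the alert out, as in this sketch:

```yaml
route:
  routes:
    - match:
        severity: critical
      receiver: 'level1-oncall'
      continue: true            # keep evaluating sibling routes
    - match:
        severity: critical
      receiver: 'level2-oncall'  # also receives every critical alert
```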
Alert Optimization¶
Reducing Alert Noise¶
1. Set Reasonable Thresholds¶
```yaml
# Avoid overly sensitive thresholds
- alert: HighLatency
  expr: histogram_quantile(0.95, sum(rate(jairouter_request_duration_seconds_bucket[5m])) by (le)) > 2
  for: 5m  # Increase duration to avoid transient fluctuations
```
2. Use Alert Grouping¶
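Grouping collapses related alerts into a single notification. The route-level `group_*` settings control this behavior; the values below are examples and should be tuned to your alert volume:

```yaml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s        # wait for more alerts of a new group before the first notification
  group_interval: 5m     # minimum time between notifications for the same group
  repeat_interval: 4h    # how often a still-firing group is re-sent
```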
3. Implement Alert Inhibition¶
```yaml
inhibit_rules:
  - source_match:
      alertname: JAiRouterDown
    target_match_re:
      alertname: '.*'
    equal: ['service']
```
Alert Quality Monitoring¶
Alert Metric Collection¶
```yaml
# Collect alert-related metrics
- record: jairouter:alert_firing_count
  expr: sum(ALERTS{alertstate="firing"})

- record: jairouter:alert_resolution_time
  expr: time() - ALERTS_FOR_STATE{alertstate="firing"}
```
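These meta-metrics can themselves drive alerts, for example to catch an alert storm early. A sketch using the recording rule above (the threshold of 20 is arbitrary and should be tuned):

```yaml
groups:
  - name: jairouter.meta
    interval: 60s
    rules:
      - alert: TooManyFiringAlerts
        expr: jairouter:alert_firing_count > 20
        for: 5m
        labels:
          severity: warning
          service: jairouter
        annotations:
          summary: "Possible alert storm"
          description: "More than 20 alerts have been firing simultaneously for 5 minutes"
```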
Alert Effectiveness Analysis¶
- Alert accuracy: Real issues / Total alerts
- Alert coverage: Detected issues / Actual issues
- Average response time: From alert to start of handling
- Average resolution time: From alert to issue resolution
Best Practices¶
Alert Rule Design¶
1. Follow SLI/SLO Principles¶
- Set alerts based on service-level indicators (SLIs)
- Focus on user-experience metrics
- Avoid alerting on raw resource metrics alone; prefer symptom-based alerts
2. Use Layered Alerts¶
- Symptom alerts: User-perceivable issues
- Cause alerts: Root causes of symptoms
- Predictive alerts: Trends that may lead to issues
3. Alert Naming Conventions¶
```yaml
# Good alert naming
- alert: JAiRouterHighLatency
- alert: JAiRouterBackendDown
- alert: JAiRouterHighErrorRate

# Avoid these names
- alert: Alert1
- alert: Problem
- alert: Issue
```
Notification Strategy¶
1. Tiered Notifications¶
- Critical: Immediate notification, multiple channels
- Warning: Delayed notification, single channel
- Info: Only recorded, periodic summary
2. Notification Content Optimization¶
- Include sufficient context information
- Provide runbook links
- Use clear descriptive language
- Avoid excessive technical jargon
3. Notification Time Management¶
- Different strategies for business hours and after hours
- Avoid non-emergency notifications late at night
- Consider time zone differences
Troubleshooting¶
Common Issues¶
1. Alert Rules Not Triggering¶
Check Steps:
```bash
# Validate rule syntax
promtool check rules rules/jairouter-alerts.yml

# Check rule loading status
curl http://localhost:9090/api/v1/rules

# Test query expressions
curl "http://localhost:9090/api/v1/query?query=up{job=\"jairouter\"}"
```
2. Notifications Not Sent¶
Check Steps:
```bash
# Check AlertManager status
curl http://localhost:9093/api/v2/status

# View alerts currently held by AlertManager
curl http://localhost:9093/api/v2/alerts

# Check configuration
amtool config show
```
3. Alert Storm¶
Handling Methods:
```bash
# Create a temporary silence (regex matcher covering all alert names)
amtool silence add alertname=~".*" --duration="1h" --comment="Alert storm handling"

# Check inhibition rules
amtool config show | grep -A 10 inhibit_rules
```
Next Steps¶
After configuring alerts, it's recommended to:
- Learn about detailed metrics
- Perform troubleshooting
- Optimize monitoring performance
- Review testing guidelines
Important Reminder: Regularly review and optimize alert rules to ensure their effectiveness and accuracy. Avoid alert fatigue and maintain team sensitivity to alerts.