JAiRouter Alert Rules Guide

Document Version: 1.0.0 | Last Updated: 2025-08-19 | Git Commit: f47f2607 | Author: Lincoln

Overview

This document describes the Prometheus alert rules configured for the JAiRouter project: the alert types, their trigger conditions, and recommendations for handling each one.

Alert Rule Categories

1. Basic Service Alerts (jairouter.basic)

JAiRouterServiceDown

  • Description: JAiRouter service unavailable
  • Trigger Condition: up{job="jairouter"} == 0
  • Duration: 1 minute
  • Severity: Critical
  • Handling Recommendations:
  • Check JAiRouter service process status
  • Review application startup logs
  • Verify that the service port is not in use by another process
  • Check system resource availability
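For illustration, the JAiRouterServiceDown entry above could be written as a Prometheus rule along the following lines. The group name, expression, duration, and severity come from this guide; the `severity` label key and the annotation texts are assumptions based on common convention, not copied from the project's rule file.

```yaml
groups:
  - name: jairouter.basic
    rules:
      - alert: JAiRouterServiceDown
        # Fires when Prometheus cannot scrape the jairouter job.
        expr: up{job="jairouter"} == 0
        # The condition must hold for 1 minute before the alert fires.
        for: 1m
        labels:
          severity: critical
        annotations:
          # Annotation wording is illustrative only.
          summary: "JAiRouter service unavailable"
          description: "Instance {{ $labels.instance }} has been unreachable for more than 1 minute."
```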

JAiRouterHighErrorRate

  • Description: JAiRouter error rate too high
  • Trigger Condition: 4xx/5xx error rate exceeds 10%
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check application error logs
  • Verify backend service status
  • Check network connectivity
  • Analyze error type distribution

JAiRouterCriticalErrorRate

  • Description: JAiRouter critical error rate too high
  • Trigger Condition: 5xx error rate exceeds 5%
  • Duration: 1 minute
  • Severity: Critical
  • Handling Recommendations:
  • Immediately check server status
  • Review application exception logs
  • Check database connections
  • Verify dependency service availability
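The two error-rate rules above are ratios of failed requests to all requests. A minimal sketch follows, assuming JAiRouter exposes the standard Micrometer HTTP metric `http_server_requests_seconds_count` with a `status` label and that a 5-minute rate window is acceptable. Both are assumptions; the project may use a different metric name or window.

```yaml
groups:
  - name: jairouter.basic
    rules:
      - alert: JAiRouterHighErrorRate
        # Share of 4xx and 5xx responses among all responses.
        # Metric name and 5m window are assumptions.
        expr: |
          sum(rate(http_server_requests_seconds_count{job="jairouter", status=~"[45].."}[5m]))
            /
          sum(rate(http_server_requests_seconds_count{job="jairouter"}[5m]))
            > 0.10
        for: 2m
        labels:
          severity: warning
      - alert: JAiRouterCriticalErrorRate
        # Share of 5xx responses only, with a tighter threshold.
        expr: |
          sum(rate(http_server_requests_seconds_count{job="jairouter", status=~"5.."}[5m]))
            /
          sum(rate(http_server_requests_seconds_count{job="jairouter"}[5m]))
            > 0.05
        for: 1m
        labels:
          severity: critical
```

When there is no traffic at all, the division yields no usable value and neither rule fires, which is usually the desired behavior for a ratio alert.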

2. Performance Alerts (jairouter.performance)

JAiRouterHighLatency

  • Description: JAiRouter response time too high
  • Trigger Condition: 95th percentile response time exceeds 2 seconds
  • Duration: 3 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check system resource usage
  • Analyze slow queries and performance bottlenecks
  • Verify backend service response times
  • Check network latency
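A 95th-percentile condition like the one above is typically computed with `histogram_quantile` over histogram buckets. The sketch below assumes the Micrometer metric `http_server_requests_seconds_bucket` is available, which requires percentile histograms to be enabled in the application, and it assumes a 5-minute rate window. Neither is confirmed by this guide.

```yaml
groups:
  - name: jairouter.performance
    rules:
      - alert: JAiRouterHighLatency
        # P95 response time across all instances, in seconds.
        # Metric name and 5m window are assumptions.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_server_requests_seconds_bucket{job="jairouter"}[5m]))
          ) > 2
        for: 3m
        labels:
          severity: warning
```

Under the same assumptions, the critical variant would differ only in its threshold (`> 5`), duration (`for: 1m`), and severity.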

JAiRouterCriticalLatency

  • Description: JAiRouter response time critically high
  • Trigger Condition: 95th percentile response time exceeds 5 seconds
  • Duration: 1 minute
  • Severity: Critical
  • Handling Recommendations:
  • Immediately check system load
  • Analyze performance bottlenecks
  • Consider temporary rate limiting
  • Check if scaling is needed

JAiRouterLowRequestVolume

  • Description: JAiRouter request volume abnormally low
  • Trigger Condition: Request rate below 0.1 req/s
  • Duration: 5 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check client connection status
  • Verify load balancer configuration
  • Check network routing
  • Confirm whether this is a normal low-traffic period for the business

JAiRouterSlowQueriesDetected

  • Description: JAiRouter detected slow queries
  • Trigger Condition: More than 5 slow queries in 5 minutes
  • Duration: 1 minute
  • Severity: Warning
  • Handling Recommendations:
  • Check slow query logs
  • Analyze slow query causes
  • Optimize related queries or operations
  • Consider adding indexes or caching

JAiRouterHighSlowQueryRate

  • Description: JAiRouter slow query rate too high
  • Trigger Condition: Slow query rate exceeds 1/second
  • Duration: 2 minutes
  • Severity: Critical
  • Handling Recommendations:
  • Immediately analyze system performance bottlenecks
  • Check database connections and queries
  • Evaluate if resource scaling is needed
  • Consider temporary rate limiting measures

3. Backend Service Alerts (jairouter.backend)

JAiRouterBackendDown

  • Description: JAiRouter backend service unavailable
  • Trigger Condition: jairouter_backend_health == 0
  • Duration: 1 minute
  • Severity: Critical
  • Handling Recommendations:
  • Check backend service status
  • Verify network connectivity
  • Check service configuration
  • Review health check logs
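Expressed as a rule, the backend health condition above might look like this. The metric name and comparison are taken from this guide; the label key is an assumption, and the metric's own labels are not documented here, so the sketch does not reference any.

```yaml
groups:
  - name: jairouter.backend
    rules:
      - alert: JAiRouterBackendDown
        # One alert per series of jairouter_backend_health that reports 0.
        expr: jairouter_backend_health == 0
        for: 1m
        labels:
          severity: critical
```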

JAiRouterBackendHighLatency

  • Description: JAiRouter backend service responding slowly
  • Trigger Condition: Backend 95th percentile response time exceeds 3 seconds
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check backend service performance
  • Analyze network latency
  • Verify backend resource usage
  • Consider adjusting timeout configuration

JAiRouterBackendHighErrorRate

  • Description: JAiRouter backend service error rate high
  • Trigger Condition: Backend error rate exceeds 15%
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check backend service logs
  • Verify API compatibility
  • Check authentication configuration
  • Analyze error types

4. Infrastructure Alerts (jairouter.infrastructure)

JAiRouterCircuitBreakerOpen

  • Description: JAiRouter circuit breaker opened
  • Trigger Condition: jairouter_circuit_breaker_state == 2
  • Duration: 30 seconds
  • Severity: Warning
  • Handling Recommendations:
  • Check downstream service status
  • Analyze failure rate causes
  • Verify circuit breaker configuration
  • Consider manual recovery
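The circuit breaker rule uses a shorter duration than the others. In Prometheus rule syntax that is written as `for: 30s`, as in this sketch. The expression is from this guide, which pairs the value 2 with the opened state; the label key is an assumption.

```yaml
groups:
  - name: jairouter.infrastructure
    rules:
      - alert: JAiRouterCircuitBreakerOpen
        # Per this guide, a state value of 2 corresponds to an open breaker.
        expr: jairouter_circuit_breaker_state == 2
        for: 30s
        labels:
          severity: warning
```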

JAiRouterRateLimitTriggered

  • Description: JAiRouter rate limiter frequently triggered
  • Trigger Condition: Rate limit rejection rate exceeds 10 req/s
  • Duration: 1 minute
  • Severity: Warning
  • Handling Recommendations:
  • Analyze request sources
  • Check rate limit configuration
  • Evaluate whether the threshold needs adjustment
  • Consider increasing capacity

JAiRouterLoadBalancerImbalance

  • Description: JAiRouter load balancing uneven
  • Trigger Condition: Instance request volume difference exceeds 50%
  • Duration: 5 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check load balancing strategy
  • Verify instance health status
  • Analyze instance performance differences
  • Consider adjusting weight configuration

5. Resource Alerts (jairouter.resources)

JAiRouterHighMemoryUsage

  • Description: JAiRouter memory usage high
  • Trigger Condition: JVM heap memory usage exceeds 80%
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check for memory leaks
  • Analyze GC logs
  • Consider adjusting JVM parameters
  • Evaluate whether scaling is needed
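Heap usage as a percentage is the ratio of used to maximum heap bytes. The following sketch assumes the Micrometer JVM metrics `jvm_memory_used_bytes` and `jvm_memory_max_bytes` with an `area="heap"` label are exported; this guide does not name the metrics, so treat them as assumptions.

```yaml
groups:
  - name: jairouter.resources
    rules:
      - alert: JAiRouterHighMemoryUsage
        # Sum over heap pools per instance, then divide used by max.
        # Metric names are assumptions (Micrometer defaults).
        expr: |
          sum by (instance) (jvm_memory_used_bytes{job="jairouter", area="heap"})
            /
          sum by (instance) (jvm_memory_max_bytes{job="jairouter", area="heap"})
            > 0.80
        for: 2m
        labels:
          severity: warning
```

If a heap pool reports no maximum, the ratio can be misleading, so it is worth checking the raw series before relying on this form.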

JAiRouterCriticalMemoryUsage

  • Description: JAiRouter memory usage critically high
  • Trigger Condition: JVM heap memory usage exceeds 90%
  • Duration: 1 minute
  • Severity: Critical
  • Handling Recommendations:
  • Immediately check memory usage
  • Consider restarting service
  • Increase memory configuration
  • Analyze memory leak causes

JAiRouterHighGCRate

  • Description: JAiRouter GC frequency too high
  • Trigger Condition: GC frequency exceeds 0.2/second
  • Duration: 3 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Analyze GC logs
  • Optimize JVM parameters
  • Check memory allocation patterns
  • Consider adjusting heap size

JAiRouterHighThreadCount

  • Description: JAiRouter thread count too high
  • Trigger Condition: Current thread count exceeds 200
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check thread pool configuration
  • Analyze thread stacks
  • Look for thread leaks
  • Optimize concurrency handling

6. Business Metric Alerts (jairouter.business)

JAiRouterModelCallFailureRate

  • Description: JAiRouter model call failure rate high
  • Trigger Condition: Model call failure rate exceeds 20%
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check AI model service status
  • Verify API keys and configuration
  • Analyze failure causes
  • Check network connectivity

JAiRouterLargeRequestSize

  • Description: JAiRouter request size abnormal
  • Trigger Condition: 95th percentile request size exceeds 1MB
  • Duration: 3 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Analyze request content
  • Check client behavior
  • Consider adding size limits
  • Optimize data transfer

JAiRouterLargeResponseSize

  • Description: JAiRouter response size abnormal
  • Trigger Condition: 95th percentile response size exceeds 5MB
  • Duration: 3 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check response content
  • Optimize data format
  • Consider pagination handling
  • Check for data leaks

7. Security Alerts (jairouter.security)

JAiRouterSuspiciousIPActivity

  • Description: JAiRouter detected suspicious IP activity
  • Trigger Condition: Single IP request rate exceeds 100 req/s
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Analyze IP access patterns
  • Determine whether the traffic is an attack
  • Consider temporary blocking
  • Strengthen access controls

JAiRouterHighAuthFailureRate

  • Description: JAiRouter authentication failure rate high
  • Trigger Condition: 401 error rate exceeds 5%
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check authentication system status
  • Analyze failure causes
  • Verify key configuration
  • Check for brute force attempts

JAiRouterHighClientErrorRate

  • Description: JAiRouter client error rate high
  • Trigger Condition: 4xx error rate exceeds 20%
  • Duration: 3 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Analyze client requests
  • Check API documentation consistency
  • Verify parameter validation logic
  • Provide clearer error messages

8. Capacity Planning Alerts (jairouter.capacity)

JAiRouterRequestVolumeGrowth

  • Description: JAiRouter request volume grew significantly
  • Trigger Condition: Request volume is more than 50% higher than 24 hours earlier
  • Duration: 5 minutes
  • Severity: Info
  • Handling Recommendations:
  • Analyze growth causes
  • Evaluate system capacity
  • Consider scaling plans
  • Monitor resource usage
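A comparison against the same time on the previous day can be expressed with PromQL's `offset` modifier. This is a sketch under the assumption that request volume is measured with the Micrometer metric `http_server_requests_seconds_count` over a 5-minute window; the real rule may use another metric.

```yaml
groups:
  - name: jairouter.capacity
    rules:
      - alert: JAiRouterRequestVolumeGrowth
        # Current request rate divided by the rate 24 hours earlier.
        # A ratio above 1.5 means growth of more than 50%.
        expr: |
          sum(rate(http_server_requests_seconds_count{job="jairouter"}[5m]))
            /
          sum(rate(http_server_requests_seconds_count{job="jairouter"}[5m] offset 1d))
            > 1.5
        for: 5m
        labels:
          severity: info
```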

JAiRouterLowDiskSpace

  • Description: JAiRouter server disk space insufficient
  • Trigger Condition: Available disk space below 20%
  • Duration: 5 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Clean up temporary files
  • Archive historical logs
  • Check disk usage
  • Consider expanding disk capacity

JAiRouterHighCPUUsage

  • Description: JAiRouter server CPU usage high
  • Trigger Condition: CPU usage exceeds 80%
  • Duration: 3 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Analyze CPU usage
  • Check process status
  • Optimize performance bottlenecks
  • Consider scaling
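The disk and CPU rules in this category describe host-level conditions. One way to express them, assuming node_exporter runs on the JAiRouter hosts and is scraped by Prometheus, is sketched below. This guide does not say which exporter supplies these metrics, so the metric names, the filesystem filter, and the 5-minute window are all assumptions.

```yaml
groups:
  - name: jairouter.capacity
    rules:
      - alert: JAiRouterLowDiskSpace
        # Fraction of each filesystem still available; temporary and
        # container overlay filesystems are excluded (assumed filter).
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            /
          node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
            < 0.20
        for: 5m
        labels:
          severity: warning
      - alert: JAiRouterHighCPUUsage
        # CPU usage is 1 minus the average idle fraction per instance.
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80
        for: 3m
        labels:
          severity: warning
```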

9. Dependency Service Alerts (jairouter.dependencies)

JAiRouterDatabaseConnectionIssue

  • Description: JAiRouter database connection pool usage high
  • Trigger Condition: Connection pool usage exceeds 80%
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check database status
  • Analyze connection leaks
  • Optimize connection pool configuration
  • Check slow queries

JAiRouterLowCacheHitRate

  • Description: JAiRouter cache hit rate low
  • Trigger Condition: Cache hit rate below 70%
  • Duration: 5 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Analyze caching strategy
  • Check cache configuration
  • Optimize cache key design
  • Consider cache warming

JAiRouterExternalAPITimeout

  • Description: JAiRouter external API calls time out frequently
  • Trigger Condition: Timeout rate exceeds 5/second
  • Duration: 2 minutes
  • Severity: Warning
  • Handling Recommendations:
  • Check external service status
  • Analyze network latency
  • Adjust timeout configuration
  • Consider retry strategy

Alert Handling Process

1. Alert Reception

  • Receive alert notifications via email, Slack, DingTalk, etc.
  • View alert details and severity level
  • Confirm alert authenticity and urgency
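Notification delivery is usually configured in Alertmanager rather than in the rule files. The sketch below routes critical alerts to Slack and everything else to email. The receiver names, SMTP host, sender address, Slack channel, and webhook URL are placeholders invented for illustration, and it assumes the rules attach a `severity` label. DingTalk is left out because Alertmanager has no built-in DingTalk receiver; it is commonly reached through a webhook relay.

```yaml
global:
  # Placeholder SMTP settings, required for email receivers.
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  # Default receiver for alerts that match no child route.
  receiver: ops-email
  group_by: ['alertname']
  routes:
    # Critical alerts go to Slack instead of the default receiver.
    - matchers:
        - severity="critical"
      receiver: ops-slack

receivers:
  - name: ops-email
    email_configs:
      - to: 'ops-team@example.com'
  - name: ops-slack
    slack_configs:
      - channel: '#jairouter-alerts'                            # hypothetical channel
        api_url: 'https://hooks.slack.com/services/REPLACE/ME'  # placeholder webhook
```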

2. Initial Diagnosis

  • Access Grafana dashboards for detailed metrics
  • Check Prometheus alert page for related alerts
  • Review application and system logs

3. Problem Handling

  • Follow the handling steps for the alert type
  • Record the handling process and results
  • Contact relevant teams if necessary

4. Recovery Verification

  • Confirm the problem is resolved
  • Verify that related metrics return to normal
  • Wait for the alert to resolve automatically

5. Post-Incident Analysis

  • Analyze the root cause of the problem
  • Evaluate whether alert rules need adjustment
  • Improve preventive measures and handling procedures

Alert Rule Maintenance

Regular Checks

  • Review alert rule effectiveness monthly
  • Adjust thresholds based on business changes
  • Remove outdated or invalid alert rules

Threshold Tuning

  • Derive reasonable thresholds from historical data
  • Avoid excessive false positives and false negatives
  • Consider business characteristics and user experience

Documentation Updates

  • Keep alert handling documentation up to date
  • Record common issues and solutions
  • Share best practices and lessons learned

Testing and Validation

Syntax Check

# Linux/macOS
./monitoring/prometheus/test-alerts.sh

# Windows
.\monitoring\prometheus\test-alerts.ps1

Complete Validation

# Linux/macOS
./monitoring/prometheus/validate-alerts.sh

# Windows
.\monitoring\prometheus\validate-alerts.ps1

Manual Testing

  • Simulate failure scenarios to trigger alerts
  • Verify notification channels are working
  • Test the alert recovery mechanism
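Failure scenarios can also be simulated offline with a promtool rule unit test, which feeds synthetic series into the rules and checks which alerts fire. The sketch below exercises JAiRouterServiceDown. The rule file name and instance label are hypothetical, and the expected labels assume the rule attaches only `severity: critical`; they must be adjusted to match the project's actual rule file.

```yaml
rule_files:
  - jairouter-alerts.yml   # hypothetical file name; point this at the real rule file

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # The target reports down (0) for 11 consecutive samples, 30s apart.
      - series: 'up{job="jairouter", instance="jairouter:8080"}'
        values: '0+0x10'
    alert_rule_test:
      # After 30s the alert is still pending, so nothing should be firing.
      - eval_time: 30s
        alertname: JAiRouterServiceDown
      # After 2m the 1-minute duration has elapsed and the alert should fire.
      - eval_time: 2m
        alertname: JAiRouterServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: jairouter
              instance: jairouter:8080
            # If the rule defines annotations, add a matching
            # exp_annotations block here.
```

Such a file is run with `promtool test rules <test-file>`.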

Contact Information

For questions or suggestions, contact:

  • Operations Team: ops-team@example.com
  • Development Team: dev-team@example.com
  • JAiRouter Team: jairouter-team@example.com