JAiRouter Alert Rules Guide¶
Document Version: 1.0.0
Last Updated: 2025-08-19
Git Commit: f47f2607
Author: Lincoln
Overview¶
This document details the Prometheus alert rules configuration for the JAiRouter project, including alert types, trigger conditions, and handling recommendations.
Alert Rule Categories¶
1. Basic Service Alerts (jairouter.basic)¶
JAiRouterServiceDown¶
- Description: JAiRouter service unavailable
- Trigger Condition: up{job="jairouter"} == 0
- Duration: 1 minute
- Severity: Critical
- Handling Recommendations:
- Check JAiRouter service process status
- Review application startup logs
- Verify port usage
- Check system resources availability
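As a minimal sketch, the JAiRouterServiceDown entry above could be expressed as a Prometheus rule like the following. The expression, duration, severity, and group name come from this guide; the annotation text is illustrative and may differ from the project's actual rule file.

```yaml
groups:
  - name: jairouter.basic
    rules:
      - alert: JAiRouterServiceDown
        # Fires when Prometheus cannot scrape the target of the "jairouter" job.
        expr: up{job="jairouter"} == 0
        # The condition must hold continuously for 1 minute before firing.
        for: 1m
        labels:
          severity: critical
        annotations:
          # Annotation wording is an assumption, not copied from the project.
          summary: "JAiRouter service unavailable"
          description: "Instance {{ $labels.instance }} has been unreachable for more than 1 minute."
```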
JAiRouterHighErrorRate¶
- Description: JAiRouter error rate too high
- Trigger Condition: 4xx/5xx error rate exceeds 10%
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Check application error logs
- Verify backend service status
- Check network connectivity
- Analyze error type distribution
JAiRouterCriticalErrorRate¶
- Description: JAiRouter critical error rate too high
- Trigger Condition: 5xx error rate exceeds 5%
- Duration: 1 minute
- Severity: Critical
- Handling Recommendations:
- Immediately check server status
- Review application exception logs
- Check database connections
- Verify dependency service availability
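The two error-rate alerts above are ratios of failed requests to all requests. The sketch below shows one way such rules might look. It assumes the service exposes the Micrometer metric `http_server_requests_seconds_count` with a `status` label, and it uses a 5-minute rate window; both are assumptions, so substitute the metric and window that the project's rule file actually uses.

```yaml
groups:
  - name: jairouter.basic
    rules:
      - alert: JAiRouterHighErrorRate
        # Assumed metric and label names; share of 4xx/5xx responses above 10%.
        expr: |
          sum(rate(http_server_requests_seconds_count{job="jairouter", status=~"[45].."}[5m]))
            /
          sum(rate(http_server_requests_seconds_count{job="jairouter"}[5m]))
            > 0.10
        for: 2m
        labels:
          severity: warning
      - alert: JAiRouterCriticalErrorRate
        # Same ratio restricted to 5xx responses, with a 5% threshold.
        expr: |
          sum(rate(http_server_requests_seconds_count{job="jairouter", status=~"5.."}[5m]))
            /
          sum(rate(http_server_requests_seconds_count{job="jairouter"}[5m]))
            > 0.05
        for: 1m
        labels:
          severity: critical
```

When there is no traffic at all, the ratio has no value and neither rule fires, which is why a separate low-volume alert is useful alongside these.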
2. Performance Alerts (jairouter.performance)¶
JAiRouterHighLatency¶
- Description: JAiRouter response time too high
- Trigger Condition: 95th percentile response time exceeds 2 seconds
- Duration: 3 minutes
- Severity: Warning
- Handling Recommendations:
- Check system resource usage
- Analyze slow queries and performance bottlenecks
- Verify backend service response times
- Check network latency
JAiRouterCriticalLatency¶
- Description: JAiRouter response time critically high
- Trigger Condition: 95th percentile response time exceeds 5 seconds
- Duration: 1 minute
- Severity: Critical
- Handling Recommendations:
- Immediately check system load
- Analyze performance bottlenecks
- Consider temporary rate limiting
- Check if scaling is needed
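A 95th percentile threshold is typically computed with `histogram_quantile` over histogram buckets. The following sketch assumes the Micrometer histogram `http_server_requests_seconds_bucket` is enabled for the service and uses a 5-minute rate window; the metric name and window are assumptions, not taken from the project's rule file.

```yaml
groups:
  - name: jairouter.performance
    rules:
      - alert: JAiRouterHighLatency
        # Assumed histogram metric; p95 over all requests, in seconds.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_server_requests_seconds_bucket{job="jairouter"}[5m]))
          ) > 2
        for: 3m
        labels:
          severity: warning
      - alert: JAiRouterCriticalLatency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_server_requests_seconds_bucket{job="jairouter"}[5m]))
          ) > 5
        for: 1m
        labels:
          severity: critical
```

`histogram_quantile` cannot report a value above the highest finite bucket boundary, so the histogram needs at least one bucket above 5 seconds for the critical rule to be able to fire.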
JAiRouterLowRequestVolume¶
- Description: JAiRouter request volume abnormally low
- Trigger Condition: Request rate below 0.1 req/s
- Duration: 5 minutes
- Severity: Warning
- Handling Recommendations:
- Check client connection status
- Verify load balancer configuration
- Check network routing
- Confirm whether this is a normal low-traffic period for the business
JAiRouterSlowQueriesDetected¶
- Description: JAiRouter detected slow queries
- Trigger Condition: More than 5 slow queries in 5 minutes
- Duration: 1 minute
- Severity: Warning
- Handling Recommendations:
- Check slow query logs
- Analyze slow query causes
- Optimize related queries or operations
- Consider adding indexes or caching
JAiRouterHighSlowQueryRate¶
- Description: JAiRouter slow query rate too high
- Trigger Condition: Slow query rate exceeds 1/second
- Duration: 2 minutes
- Severity: Critical
- Handling Recommendations:
- Immediately analyze system performance bottlenecks
- Check database connections and queries
- Evaluate if resource scaling is needed
- Consider temporary rate limiting measures
3. Backend Service Alerts (jairouter.backend)¶
JAiRouterBackendDown¶
- Description: JAiRouter backend service unavailable
- Trigger Condition: jairouter_backend_health == 0
- Duration: 1 minute
- Severity: Critical
- Handling Recommendations:
- Check backend service status
- Verify network connectivity
- Check service configuration
- Review health check logs
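For reference, the JAiRouterBackendDown entry above maps onto a rule of roughly this shape. The metric, threshold, duration, severity, and group name are from this guide; the annotation is an illustrative assumption, and the labels carried by `jairouter_backend_health` are not documented here, so none are referenced.

```yaml
groups:
  - name: jairouter.backend
    rules:
      - alert: JAiRouterBackendDown
        # A value of 0 marks a backend that failed its health check.
        expr: jairouter_backend_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          # Illustrative wording only.
          summary: "JAiRouter backend service unavailable"
```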
JAiRouterBackendHighLatency¶
- Description: JAiRouter backend service responding slowly
- Trigger Condition: Backend 95th percentile response time exceeds 3 seconds
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Check backend service performance
- Analyze network latency
- Verify backend resource usage
- Consider adjusting timeout configuration
JAiRouterBackendHighErrorRate¶
- Description: JAiRouter backend service error rate high
- Trigger Condition: Backend error rate exceeds 15%
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Check backend service logs
- Verify API compatibility
- Check authentication configuration
- Analyze error types
4. Infrastructure Alerts (jairouter.infrastructure)¶
JAiRouterCircuitBreakerOpen¶
- Description: JAiRouter circuit breaker opened
- Trigger Condition: jairouter_circuit_breaker_state == 2
- Duration: 30 seconds
- Severity: Warning
- Handling Recommendations:
- Check downstream service status
- Analyze failure rate causes
- Verify circuit breaker configuration
- Consider manual recovery
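The circuit breaker alert above uses a short 30-second duration. A sketch of the rule follows; the expression, duration, and severity come from this guide, while the group-level `interval` of 15 seconds is an assumption added for illustration.

```yaml
groups:
  - name: jairouter.infrastructure
    # Assumed evaluation interval; the project's actual value may differ.
    interval: 15s
    rules:
      - alert: JAiRouterCircuitBreakerOpen
        # Per this guide, a state value of 2 indicates an open circuit breaker.
        expr: jairouter_circuit_breaker_state == 2
        for: 30s
        labels:
          severity: warning
```

An alert can only change state when its group is evaluated, so if the evaluation interval is longer than 30 seconds, the effective delay before firing is at least one full interval rather than 30 seconds.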
JAiRouterRateLimitTriggered¶
- Description: JAiRouter rate limiter frequently triggered
- Trigger Condition: Rate limit rejection rate exceeds 10 req/s
- Duration: 1 minute
- Severity: Warning
- Handling Recommendations:
- Analyze request sources
- Check rate limit configuration
- Evaluate whether the threshold needs adjustment
- Consider capacity increase
JAiRouterLoadBalancerImbalance¶
- Description: JAiRouter load balancing uneven
- Trigger Condition: Instance request volume difference exceeds 50%
- Duration: 5 minutes
- Severity: Warning
- Handling Recommendations:
- Check load balancing strategy
- Verify instance health status
- Analyze instance performance differences
- Consider adjusting weight configuration
5. Resource Alerts (jairouter.resources)¶
JAiRouterHighMemoryUsage¶
- Description: JAiRouter memory usage high
- Trigger Condition: JVM heap memory usage exceeds 80%
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Check for memory leaks
- Analyze GC logs
- Consider adjusting JVM parameters
- Evaluate whether scaling is needed
JAiRouterCriticalMemoryUsage¶
- Description: JAiRouter memory usage critically high
- Trigger Condition: JVM heap memory usage exceeds 90%
- Duration: 1 minute
- Severity: Critical
- Handling Recommendations:
- Immediately check memory usage
- Consider restarting service
- Increase memory configuration
- Analyze memory leak causes
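Heap usage percentages like the two above are usually derived by dividing used heap by maximum heap. The sketch below assumes the standard Micrometer JVM metrics `jvm_memory_used_bytes` and `jvm_memory_max_bytes` with an `area="heap"` label; the project's rule file may use different metrics or aggregation.

```yaml
groups:
  - name: jairouter.resources
    rules:
      - alert: JAiRouterHighMemoryUsage
        # Assumed Micrometer metrics; heap pools are summed per instance.
        expr: |
          sum by (instance) (jvm_memory_used_bytes{job="jairouter", area="heap"})
            /
          sum by (instance) (jvm_memory_max_bytes{job="jairouter", area="heap"})
            > 0.80
        for: 2m
        labels:
          severity: warning
      - alert: JAiRouterCriticalMemoryUsage
        expr: |
          sum by (instance) (jvm_memory_used_bytes{job="jairouter", area="heap"})
            /
          sum by (instance) (jvm_memory_max_bytes{job="jairouter", area="heap"})
            > 0.90
        for: 1m
        labels:
          severity: critical
```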
JAiRouterHighGCRate¶
- Description: JAiRouter GC frequency too high
- Trigger Condition: GC frequency exceeds 0.2/second
- Duration: 3 minutes
- Severity: Warning
- Handling Recommendations:
- Analyze GC logs
- Optimize JVM parameters
- Check memory allocation patterns
- Consider adjusting heap size
JAiRouterHighThreadCount¶
- Description: JAiRouter thread count too high
- Trigger Condition: Current thread count exceeds 200
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Check thread pool configuration
- Analyze thread stacks
- Look for thread leaks
- Optimize concurrent handling
6. Business Metric Alerts (jairouter.business)¶
JAiRouterModelCallFailureRate¶
- Description: JAiRouter model call failure rate high
- Trigger Condition: Model call failure rate exceeds 20%
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Check AI model service status
- Verify API keys and configuration
- Analyze failure causes
- Check network connectivity
JAiRouterLargeRequestSize¶
- Description: JAiRouter request size abnormal
- Trigger Condition: 95th percentile request size exceeds 1MB
- Duration: 3 minutes
- Severity: Warning
- Handling Recommendations:
- Analyze request content
- Check client behavior
- Consider adding size limits
- Optimize data transfer
JAiRouterLargeResponseSize¶
- Description: JAiRouter response size abnormal
- Trigger Condition: 95th percentile response size exceeds 5MB
- Duration: 3 minutes
- Severity: Warning
- Handling Recommendations:
- Check response content
- Optimize data format
- Consider pagination handling
- Check for data leaks
7. Security Alerts (jairouter.security)¶
JAiRouterSuspiciousIPActivity¶
- Description: JAiRouter detected suspicious IP activity
- Trigger Condition: Single IP request rate exceeds 100 req/s
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Analyze IP access patterns
- Check whether the traffic is an attack
- Consider temporary blocking
- Strengthen access controls
JAiRouterHighAuthFailureRate¶
- Description: JAiRouter authentication failure rate high
- Trigger Condition: 401 error rate exceeds 5%
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Check authentication system status
- Analyze failure causes
- Verify key configuration
- Check for brute force attempts
JAiRouterHighClientErrorRate¶
- Description: JAiRouter client error rate high
- Trigger Condition: 4xx error rate exceeds 20%
- Duration: 3 minutes
- Severity: Warning
- Handling Recommendations:
- Analyze client requests
- Check API documentation consistency
- Verify parameter validation logic
- Provide clearer error messages to clients
8. Capacity Planning Alerts (jairouter.capacity)¶
JAiRouterRequestVolumeGrowth¶
- Description: JAiRouter request volume has grown significantly
- Trigger Condition: Growth exceeds 50% compared to 24 hours ago
- Duration: 5 minutes
- Severity: Info
- Handling Recommendations:
- Analyze growth causes
- Evaluate system capacity
- Consider scaling plans
- Monitor resource usage
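A comparison against the same time on the previous day can be written with the PromQL `offset` modifier. The sketch below assumes the request counter `http_server_requests_seconds_count` and a 5-minute rate window, neither of which is confirmed by this guide; a ratio above 1.5 corresponds to growth of more than 50%.

```yaml
groups:
  - name: jairouter.capacity
    rules:
      - alert: JAiRouterRequestVolumeGrowth
        # Current request rate divided by the rate 24 hours earlier.
        # The metric name is an assumption.
        expr: |
          sum(rate(http_server_requests_seconds_count{job="jairouter"}[5m]))
            /
          sum(rate(http_server_requests_seconds_count{job="jairouter"}[5m] offset 24h))
            > 1.5
        for: 5m
        labels:
          severity: info
```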
JAiRouterLowDiskSpace¶
- Description: JAiRouter server disk space insufficient
- Trigger Condition: Available disk space below 20%
- Duration: 5 minutes
- Severity: Warning
- Handling Recommendations:
- Clean up temporary files
- Archive historical logs
- Check disk usage
- Consider expanding disk capacity
JAiRouterHighCPUUsage¶
- Description: JAiRouter server CPU usage high
- Trigger Condition: CPU usage exceeds 80%
- Duration: 3 minutes
- Severity: Warning
- Handling Recommendations:
- Analyze CPU usage
- Check process status
- Optimize performance bottlenecks
- Consider scaling
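Disk and CPU conditions describe the host rather than the application, so they usually rely on host-level metrics. The following sketch assumes node_exporter is deployed on the JAiRouter servers and uses its `node_filesystem_*` and `node_cpu_seconds_total` metrics; the filesystem filter and rate window are illustrative assumptions.

```yaml
groups:
  - name: jairouter.capacity
    rules:
      - alert: JAiRouterLowDiskSpace
        # Fraction of space still available, ignoring in-memory filesystems.
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            /
          node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
            < 0.20
        for: 5m
        labels:
          severity: warning
      - alert: JAiRouterHighCPUUsage
        # CPU usage is 1 minus the average idle fraction across all cores.
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80
        for: 3m
        labels:
          severity: warning
```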
9. Dependency Service Alerts (jairouter.dependencies)¶
JAiRouterDatabaseConnectionIssue¶
- Description: JAiRouter database connection pool usage high
- Trigger Condition: Connection pool usage exceeds 80%
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Check database status
- Analyze connection leaks
- Optimize connection pool configuration
- Check slow queries
JAiRouterLowCacheHitRate¶
- Description: JAiRouter cache hit rate low
- Trigger Condition: Cache hit rate below 70%
- Duration: 5 minutes
- Severity: Warning
- Handling Recommendations:
- Analyze caching strategy
- Check cache configuration
- Optimize cache key design
- Consider cache warming
JAiRouterExternalAPITimeout¶
- Description: JAiRouter external API calls time out frequently
- Trigger Condition: Timeout frequency exceeds 5/second
- Duration: 2 minutes
- Severity: Warning
- Handling Recommendations:
- Check external service status
- Analyze network latency
- Adjust timeout configuration
- Consider retry strategy
Alert Handling Process¶
1. Alert Reception¶
- Receive alert notifications via email, Slack, DingTalk, etc.
- View alert details and severity level
- Confirm alert authenticity and urgency
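Routing notifications by severity is normally configured in Alertmanager. The sketch below is one possible layout, not the project's configuration: the receiver names, Slack channel, SMTP host, sender address, and webhook URL are hypothetical placeholders. DingTalk is commonly connected through a `webhook_configs` entry that points at a bridge service, which is omitted here.

```yaml
global:
  # Placeholder SMTP and Slack settings; replace with real values.
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"
  slack_api_url: "https://hooks.slack.com/services/REPLACE/ME"

route:
  # Default receiver for anything not matched below.
  receiver: ops-email
  group_by: ["alertname"]
  routes:
    # Critical alerts additionally go to a chat channel.
    - matchers:
        - severity="critical"
      receiver: ops-slack

receivers:
  - name: ops-email
    email_configs:
      - to: "ops-team@example.com"
  - name: ops-slack
    slack_configs:
      - channel: "#jairouter-alerts"
```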
2. Initial Diagnosis¶
- Access Grafana dashboards for detailed metrics
- Check Prometheus alert page for related alerts
- Review application and system logs
3. Problem Handling¶
- Execute corresponding handling steps based on alert type
- Record handling process and results
- Contact relevant teams if necessary
4. Recovery Verification¶
- Confirm the problem is resolved
- Verify related metrics return to normal
- Wait for alert auto-resolution
5. Post-analysis¶
- Analyze problem root cause
- Evaluate whether alert rules need adjustment
- Improve preventive measures and handling procedures
Alert Rule Maintenance¶
Regular Checks¶
- Review alert rule effectiveness monthly
- Adjust thresholds based on business changes
- Remove outdated or invalid alert rules
Threshold Tuning¶
- Derive reasonable thresholds from historical data
- Avoid excessive false positives and false negatives
- Consider business characteristics and user experience
Documentation Updates¶
- Keep alert handling documentation up to date
- Record common issues and solutions
- Share best practices and lessons learned
Testing and Validation¶
Syntax Check¶
# Linux/macOS
./monitoring/prometheus/test-alerts.sh
# Windows
.\monitoring\prometheus\test-alerts.ps1
Complete Validation¶
# Linux/macOS
./monitoring/prometheus/validate-alerts.sh
# Windows
.\monitoring\prometheus\validate-alerts.ps1
Manual Testing¶
- Simulate failure scenarios to trigger alerts
- Verify notification channels are working
- Test alert recovery mechanism
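Failure scenarios can also be simulated offline with Prometheus rule unit tests, which `promtool test rules` runs against a rule file. The sketch below tests the JAiRouterServiceDown condition described earlier. The rule file name `jairouter-alerts.yml` and the instance label value are hypothetical, and the expected labels and annotations must be adjusted to match whatever the real rule defines.

```yaml
# Hypothetical rule file name; point this at the project's actual rule file.
rule_files:
  - jairouter-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate a target that is down for five consecutive scrapes.
      - series: 'up{job="jairouter", instance="localhost:8080"}'
        values: "0 0 0 0 0"
    alert_rule_test:
      # After 2 minutes the 1-minute "for" duration has elapsed.
      - eval_time: 2m
        alertname: JAiRouterServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: jairouter
              instance: localhost:8080
            # Add exp_annotations here listing every annotation the real
            # rule defines; the test fails if they do not match exactly.
```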
Related Links¶
- Prometheus Alert Rules Documentation
- AlertManager Configuration Documentation
- Grafana Dashboards
- Prometheus Web Interface
Contact Information¶
For questions or suggestions, contact:
- Operations Team: ops-team@example.com
- Development Team: dev-team@example.com
- JAiRouter Team: jairouter-team@example.com