
Troubleshooting

This document provides diagnosis and solutions for common issues with the JAiRouter distributed tracing feature.

Tracing Data Issues

1. Missing Tracing Data

Symptoms:

- No traceId and spanId in logs
- No tracing data in monitoring panels
- Exporter not receiving tracing information

Diagnosis Steps:

# 1. Check if tracing is enabled
curl http://localhost:8080/actuator/health/tracing

# 2. Check configuration
curl http://localhost:8080/actuator/configprops | jq '.jairouter.tracing'

# 3. Check sampling rate
curl http://localhost:8080/actuator/metrics/jairouter.tracing.sampling.rate

Common Causes and Solutions:

| Cause | Solution |
| --- | --- |
| Tracing not enabled | Set jairouter.tracing.enabled=true |
| Sampling rate too low | Temporarily set sampling.ratio=1.0 for testing |
| Exporter configuration error | Check the exporter endpoint and authentication configuration |
| Filter order issue | Ensure TracingWebFilter runs at the front of the filter chain (see the sketch below) |
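
For the filter order issue, the fix is to register the tracing filter at the highest precedence. A minimal sketch of what that looks like in Spring WebFlux, assuming a TracingWebFilter like the one referenced above (the class body is illustrative, not JAiRouter's actual implementation):

import org.springframework.core.Ordered;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

// Highest precedence ensures the span is opened before any other filter runs
@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class TracingWebFilter implements WebFilter {

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        // Open or restore the tracing context here, then continue the chain
        return chain.filter(exchange);
    }
}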

2. Partial Data Loss

Symptoms:

- Only some requests have tracing data
- Child spans are missing
- Async operations carry no tracing information

Solutions: first rule out sampling by temporarily raising the ratio; if child spans or async traces are still missing at a ratio of 1.0, see Context Propagation Issues below.

# Temporarily increase sampling rate for debugging
jairouter:
  tracing:
    sampling:
      strategy: "ratio"
      ratio: 1.0

    # Enable debug logs
    logging:
      level: DEBUG

Performance Issues

1. Tracing Causing Performance Degradation

Symptoms:

- Significantly increased response time
- Rising CPU usage
- High memory usage

Performance Analysis:

# List tracing-related metrics
curl -s http://localhost:8080/actuator/metrics | jq -r '.names[]' | grep tracing

# Check GC pause times
curl -s http://localhost:8080/actuator/metrics/jvm.gc.pause

# Check the live thread count
curl -s http://localhost:8080/actuator/metrics/jvm.threads.live

Optimization Measures:

jairouter:
  tracing:
    # Reduce sampling rate
    sampling:
      ratio: 0.1

    # Enable async processing
    async:
      enabled: true
      core-pool-size: 4

    # Optimize batch processing
    exporter:
      batch-size: 512
      export-timeout: 5s
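
The batch-size and export-timeout settings correspond conceptually to OpenTelemetry's batching span processor. For reference, here is a sketch of the equivalent wiring in plain OpenTelemetry SDK code, assuming an OTLP gRPC exporter; JAiRouter's actual wiring may differ:

import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import java.time.Duration;
import org.springframework.context.annotation.Bean;

@Bean
public SdkTracerProvider tracerProvider() {
    OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://localhost:4317")
        .build();

    // Batch exports off the request path; values mirror the YAML above
    return SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(exporter)
            .setMaxExportBatchSize(512)                // batch-size: 512
            .setExporterTimeout(Duration.ofSeconds(5)) // export-timeout: 5s
            .build())
        .build();
}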

2. Memory Leak

Symptoms:

- Continuously growing heap memory
- OutOfMemoryError occurs
- Frequent GC but memory is not released

Troubleshooting Steps:

# 1. Check Span count
curl http://localhost:8080/actuator/metrics/jairouter.tracing.spans.active

# 2. Check memory usage
jmap -histo <pid> | grep Span

# 3. Capture a heap dump for offline analysis
jcmd <pid> GC.heap_dump /tmp/jairouter-heap.hprof
# or: jmap -dump:live,format=b,file=/tmp/jairouter-heap.hprof <pid>

Solutions:

jairouter:
  tracing:
    memory:
      max-spans: 5000              # Limit Span count
      cleanup-interval: 15s        # More frequent cleanup
      span-ttl: 60s               # Shorter TTL

Configuration Issues

1. Configuration Not Taking Effect

Symptoms:

- No change takes effect after modifying the configuration
- Configuration validation fails
- Configuration errors at startup

Check Configuration Syntax:

# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('application.yml'))"

# Check configuration binding
curl http://localhost:8080/actuator/configprops | jq '.jairouter.tracing'

Common Configuration Errors:

# ❌ Wrong configuration
jairouter:
  tracing:
    sampling:
      ratio: 1.5                   # Out of range [0.0, 1.0]
    exporter:
      endpoint: "localhost:4317"   # Missing protocol

# ✅ Correct configuration  
jairouter:
  tracing:
    sampling:
      ratio: 1.0
    exporter:
      endpoint: "http://localhost:4317"

2. Dynamic Configuration Update Failure

Diagnostic Methods:

# Check configuration service status
curl http://localhost:8080/actuator/health/config

# View configuration history
curl http://localhost:8080/api/admin/config/history

Exporter Issues

1. Jaeger Connection Failure

Error Message:

Failed to export spans to Jaeger: Connection refused

Resolution Steps:

# 1. Check Jaeger service status
curl http://localhost:14268/api/traces

# 2. Verify network connection
telnet localhost 14268

# 3. Check that the port is open and listening
netstat -an | grep 14268

Configuration Adjustment:

jairouter:
  tracing:
    exporter:
      type: "jaeger"
      jaeger:
        endpoint: "http://jaeger:14268/api/traces"  # Use service name
        timeout: 30s                                # Increase timeout
        retry-enabled: true                         # Enable retry

2. OTLP Export Errors

Common Errors:

| Error | Cause | Solution |
| --- | --- | --- |
| UNAUTHENTICATED | Authentication failure | Check the API key configuration |
| RESOURCE_EXHAUSTED | Insufficient quota | Reduce the sampling rate or contact the service provider |
| DEADLINE_EXCEEDED | Timeout | Increase the export timeout |

Context Propagation Issues

1. Context Loss in Reactive Streams

Symptoms:

- No traceId in async operations
- Child span creation fails
- MDC information is missing

Solutions:

// ❌ Wrong usage: nothing is written to the Reactor context, so the trace is lost downstream
return Mono.just(data)
    .flatMap(this::processAsync);

// ✅ Correct: write the tracing context into the Reactor context so async operators can read it
return Mono.just(data)
    .flatMap(this::processAsync)
    .contextWrite(Context.of("tracing", TracingContext.current()));
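
For completeness, downstream code reads the value back out of the Reactor context. A minimal sketch with plain Reactor APIs, reusing the "tracing" key and the TracingContext type from the example above:

return Mono.deferContextual(ctx -> {
    // Read the tracing context that contextWrite(...) placed in the Reactor context
    TracingContext tracing = ctx.get("tracing");
    // Restore the MDC / current span from it before doing the actual work
    return processAsync(data);
});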

2. Context Loss in Thread Pools

Configure Thread Pool Context Propagation:

@Bean
public TaskExecutor tracingTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(4);
    executor.setTaskDecorator(new TracingTaskDecorator());
    return executor;
}
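
The bean above references a TracingTaskDecorator, whose implementation is not shown in this document. A minimal MDC-copying version looks roughly like this; it propagates only the logging context (traceId/spanId in the MDC), not the active OpenTelemetry span:

import java.util.Map;
import org.slf4j.MDC;
import org.springframework.core.task.TaskDecorator;

public class TracingTaskDecorator implements TaskDecorator {

    @Override
    public Runnable decorate(Runnable runnable) {
        // Capture the MDC (traceId/spanId) of the thread that submits the task
        Map<String, String> captured = MDC.getCopyOfContextMap();
        return () -> {
            Map<String, String> previous = MDC.getCopyOfContextMap();
            if (captured != null) {
                MDC.setContextMap(captured);
            }
            try {
                runnable.run();
            } finally {
                // Restore the worker thread's original MDC to avoid leaking context
                if (previous != null) {
                    MDC.setContextMap(previous);
                } else {
                    MDC.clear();
                }
            }
        };
    }
}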

Debugging Tips

1. Enable Debug Logs

logging:
  level:
    org.unreal.modelrouter.tracing: DEBUG
    io.opentelemetry: DEBUG

2. Use Debug Endpoints

# View current active Spans
curl http://localhost:8080/actuator/tracing/active-spans

# View tracing statistics
curl http://localhost:8080/actuator/tracing/stats

# Force export all Spans
curl -X POST http://localhost:8080/actuator/tracing/flush

3. Local Testing Tools

# Use curl to test tracing
curl -H "X-Trace-Debug: true" http://localhost:8080/api/v1/chat/completions

# Check tracing information in response headers
curl -I http://localhost:8080/health

Monitoring Alerts

1. Key Metrics Monitoring

# Prometheus alert rules
groups:
  - name: tracing_alerts
    rules:
      - alert: TracingExportFailure
        expr: rate(jairouter_tracing_export_errors_total[5m]) > 0.1
        labels:
          severity: warning

      - alert: TracingMemoryHigh
        expr: jairouter_tracing_memory_used_ratio > 0.8
        labels:
          severity: critical

2. Health Checks

# Query the tracing health check
curl http://localhost:8080/actuator/health/tracing

# Expected response
{
  "status": "UP",
  "details": {
    "exporter": "healthy",
    "sampling": "active",
    "memory": "normal"
  }
}
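
If /actuator/health/tracing returns 404, no health contributor is registered under that name. A minimal sketch of a custom indicator; the bean name "tracing" is what maps it to the /actuator/health/tracing path, and the exporter probe here is a placeholder:

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component("tracing")
public class TracingHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        // Placeholder probe; a real check would ping the configured exporter endpoint
        boolean exporterHealthy = true;
        return exporterHealthy
                ? Health.up().withDetail("exporter", "healthy").build()
                : Health.down().withDetail("exporter", "unreachable").build();
    }
}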

Common Error Codes

| Error Code | Description | Solution |
| --- | --- | --- |
| TRACING_001 | Tracing service not initialized | Check the configuration and restart the service |
| TRACING_002 | Sampling strategy configuration error | Validate the sampling configuration syntax |
| TRACING_003 | Exporter connection failure | Check the network and endpoint configuration |
| TRACING_004 | Insufficient memory | Increase memory or adjust the configuration |
| TRACING_005 | Context propagation failure | Check the async operation implementation |

Getting Support

If you encounter unresolved issues:

  1. View Logs: Enable DEBUG level logs for detailed information
  2. Check Configuration: Use actuator endpoints to validate configuration
  3. Performance Analysis: Use JVM tools to analyze performance issues
  4. Community Support: Submit issue reports in GitHub Issues
