Troubleshooting Guide

Common issues, debugging techniques, and solutions for OSpec projects

troubleshooting debugging solutions

Troubleshooting Guide

Quick solutions to common problems you might encounter while working with OSpec projects.

Quick Diagnostic Commands

Check OSpec Status

# Get overall project status
ospec status --project-id your-project-id

# Check agent health
ospec agents health --all

# Validate OSpec document
ospec validate project.ospec.yaml

# Check dependencies
ospec dependencies check

Debug Agent Issues

# View agent logs
ospec logs --agent-id agent-123 --tail 100

# Check agent capabilities
ospec agents describe agent-123

# Test agent connectivity
ospec agents ping agent-123

# Force agent restart
ospec agents restart agent-123

Common Error Patterns

1. “Agent Timeout” Errors

Symptoms:

Error: Agent timeout after 30 minutes
Status: Task 'implement-auth' incomplete

Quick Fixes:

# Increase timeout in OSpec
timeouts:
  task_execution_minutes: 60
  agent_response_minutes: 10

# Or break down the task
tasks:
  - id: "setup-auth-structure"
    estimated_hours: 1
  - id: "implement-user-model"  
    estimated_hours: 2

2. “Dependency Conflict” Errors

Symptoms:

Error: Cannot resolve dependency conflict
Package A requires version 1.0
Package B requires version 2.0

Quick Fixes:

# Pin specific versions
stack:
  dependencies:
    package_a: "1.5.0"  # Compatible middle version
    package_b: "1.8.0"

# Or use overrides
dependency_overrides:
  conflicting_package: "2.0.0"

3. “Secret Not Found” Errors

Symptoms:

Error: Required secret 'DATABASE_URL' not found
Provider: aws-secrets-manager

Quick Fixes:

# Check if secret exists
aws secretsmanager describe-secret --secret-id DATABASE_URL

# Create missing secret
aws secretsmanager create-secret \
  --name DATABASE_URL \
  --secret-string "postgresql://..."

# Use environment fallback
secrets:
  fallback_providers:
    - "aws-secrets-manager"
    - "environment"
    - "dotenv_file"

4. “Test Failures” in Acceptance

Symptoms:

Test failed: HTTP endpoint /api/health returned 404
Expected: 200 OK

Quick Fixes:

# Check if service is running
curl -v http://localhost:8000/api/health

# Check service logs
ospec logs --service api --tail 50

# Verify port mapping
ospec inspect --service api --ports
# Add debugging to acceptance criteria
acceptance:
  http_endpoints:
    - path: "/api/health"
      status: 200
      timeout_seconds: 30
      debug: true
      retry_attempts: 3

Environment-Specific Issues

Development Environment

Issue: Hot reload not working

# Enable development features
development:
  hot_reload: true
  auto_restart: true
  debug_mode: true
  detailed_logging: true

Issue: Resource constraints

# Reduce resource usage for development
resources:
  memory_gb: 2  # Reduce from 8GB
  cpu_cores: 1  # Reduce from 4
  optimize_for: "speed"  # vs "memory"

Production Environment

Issue: Performance degradation

# Add performance monitoring
monitoring:
  performance_metrics: true
  profiling: true
  resource_tracking: true
  
# Enable caching
caching:
  enabled: true
  levels: ["application", "database", "cdn"]

Issue: Scaling problems

# Configure auto-scaling
scaling:
  auto_scale: true
  metrics: ["cpu", "memory", "requests_per_second"]
  scale_up_threshold: 70
  scale_down_threshold: 30

Network and Connectivity Issues

Connection Timeouts

# Increase timeouts
network:
  connect_timeout_seconds: 30
  read_timeout_seconds: 120
  retry_attempts: 3
  retry_backoff: "exponential"

Firewall Issues

# Test connectivity
telnet your-service.com 443
nc -zv your-service.com 443

# Check security groups (AWS)
aws ec2 describe-security-groups --group-ids sg-123456

# Add required ports
infrastructure:
  firewall_rules:
    - port: 80
      protocol: "TCP"
      source: "0.0.0.0/0"
    - port: 443
      protocol: "TCP" 
      source: "0.0.0.0/0"

Database Issues

Connection Pool Exhaustion

# Increase connection pool
database:
  connection_pool:
    min_connections: 5
    max_connections: 50
    idle_timeout_minutes: 10
    
# Add connection monitoring
monitoring:
  database_connections: true
  slow_query_log: true

Migration Failures

# Check migration status
ospec db migrate --status

# Rollback failed migration
ospec db rollback --steps 1

# Force migration (careful!)
ospec db migrate --force

Security Issues

Certificate Errors

# Check certificate expiry
openssl x509 -in cert.crt -text -noout | grep "Not After"

# Auto-renew certificates
security:
  ssl_certificates:
    auto_renewal: true
    renewal_days_before: 30
    provider: "letsencrypt"

Access Denied Errors

# Check IAM permissions
permissions:
  check_access: true
  audit_enabled: true
  
# Use least privilege principle
iam_policies:
  - name: "S3ReadOnly"
    actions: ["s3:GetObject", "s3:ListBucket"]
    resources: ["arn:aws:s3:::my-bucket/*"]

Performance Issues

High Memory Usage

# Add memory profiling
profiling:
  memory: true
  cpu: true
  duration_minutes: 10
  
# Optimize garbage collection (Node.js)
runtime_options:
  node_options: "--max-old-space-size=4096"
  
# Add memory limits
resources:
  memory_limit_gb: 8
  memory_monitoring: true

Slow Response Times

# Add performance monitoring
monitoring:
  response_times: true
  slow_request_threshold_ms: 1000
  
# Enable caching
caching:
  redis: true
  application_cache: true
  database_cache: true
  
# Database optimization
database:
  connection_pooling: true
  query_optimization: true
  indexing_hints: true

Deployment Issues

Build Failures

# Check build logs
ospec logs --build --tail 100

# Clear cache and rebuild
ospec build --clean --verbose

# Check dependencies
ospec dependencies check --fix

Rollback Procedures

# Rollback to previous version
ospec deploy rollback --version previous

# Rollback to specific version
ospec deploy rollback --version v1.2.3

# Check rollback status
ospec deploy status --show-history

Monitoring and Alerting

Missing Metrics

# Add comprehensive monitoring
monitoring:
  metrics:
    - "request_rate"
    - "error_rate"
    - "response_time"
    - "cpu_usage"
    - "memory_usage"
    - "disk_usage"
  
  custom_metrics:
    - name: "business_conversions"
      type: "counter"
    - name: "user_sessions"
      type: "gauge"

Alert Fatigue

# Configure smart alerting
alerting:
  deduplication: true
  severity_filtering: true
  business_hours_only: false
  
  rules:
    - name: "critical_errors"
      condition: "error_rate > 5%"
      severity: "critical"
      cooldown_minutes: 15
      
    - name: "high_latency"
      condition: "p95_response_time > 2000ms"
      severity: "warning"  
      cooldown_minutes: 30

Debugging Techniques

Enable Debug Mode

# Global debug settings
debug:
  level: "verbose"
  save_intermediate_files: true
  detailed_error_messages: true
  agent_action_logging: true

Interactive Debugging

# Start interactive debug session
ospec debug --interactive

# Step through agent actions
> step
> inspect state
> view context
> continue

Log Analysis

# Search logs for errors
ospec logs search --query "ERROR" --last 24h

# Analyze patterns
ospec logs analyze --pattern "timeout" --service api

# Export logs for analysis
ospec logs export --format json --output debug.log

Recovery Procedures

Data Recovery

# List available backups
ospec backup list --service database

# Restore from backup
ospec backup restore --backup-id backup-123 --confirm

# Verify data integrity
ospec backup verify --backup-id backup-123

Service Recovery

# Restart failed services
ospec services restart --all

# Check service health
ospec services health --detailed

# Rebuild from scratch (last resort)
ospec rebuild --service api --confirm

Prevention Strategies

Health Checks

health_checks:
  enabled: true
  interval_seconds: 30
  timeout_seconds: 5
  
  endpoints:
    - path: "/health"
      expected_status: 200
    - path: "/metrics"
      expected_status: 200

Automated Testing

testing:
  continuous: true
  types: ["unit", "integration", "e2e"]
  
  smoke_tests:
    - "api_health_check"
    - "database_connectivity"
    - "external_service_availability"

Monitoring Setup

monitoring:
  proactive: true
  predictive_alerts: true
  
  sli_slo:  # Service Level Indicators/Objectives
    availability: 99.9
    latency_p95: 500  # ms
    error_rate: 0.1   # %

Getting Help

Community Resources

Support Channels

  • Community Support: Free community help
  • Professional Support: Priority support for enterprises
  • Training Programs: Workshops and certification

Emergency Contacts

Severity Levels

Critical (P0): Complete service outage

  • Response time: 15 minutes
  • Contact: PagerDuty + Phone

High (P1): Major functionality impaired

  • Response time: 1 hour
  • Contact: Slack + Email

Medium (P2): Minor issues

  • Response time: 4 hours
  • Contact: Email

Low (P3): Feature requests

  • Response time: 24 hours
  • Contact: GitHub Issues

Remember: Most issues can be prevented with proper monitoring, testing, and documentation. When in doubt, check the basics first: network connectivity, permissions, and resource availability.