ATOM Documentation


Redis Root Cause Fix - Verification Guide

**Date**: 2026-04-09

**Issue**: 2.2M Redis GET requests in Upstash dashboard

**Fix**: Eliminated 3 root causes with local caching

---

🎯 What Was Fixed

Root Cause #1: Quota Manager Circular Dependency

**Problem**: `quota_manager.check_quota()` called `cache.get_async()`, which checked quota again

  • **Impact**: 2 GETs per cache operation (quota + data)
  • **Fix**: Direct Redis client bypasses quota checks
  • **Reduction**: 50% (from 2 GETs to 1 GET)
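The shape of the fix, as a minimal sketch. All names here (`DirectRedis`, `QuotaManager`, the key format) are illustrative stand-ins, not the actual `cache.py` API; the point is that quota state is read through a plain client that does no quota bookkeeping of its own, so a cache operation can never recurse back into the quota check:

```python
import asyncio

class DirectRedis:
    """Stand-in for a plain Redis client with no quota wrapper."""
    def __init__(self):
        self._data = {}
        self.get_count = 0          # count raw GETs for the demo

    async def get(self, key):
        self.get_count += 1
        return self._data.get(key)

    async def set(self, key, value):
        self._data[key] = value

class QuotaManager:
    def __init__(self, direct_redis, daily_limit=10_000):
        self._redis = direct_redis  # bypasses the quota-tracked cache path
        self._limit = daily_limit

    async def check_quota(self, key):
        # Exactly one GET, and no recursion back into the cached path
        used = await self._redis.get(f"quota:redis:{key}")
        return int(used or 0) < self._limit

async def main():
    r = DirectRedis()
    await r.set("quota:redis:tenant-1", "42")
    qm = QuotaManager(r)
    print(await qm.check_quota("tenant-1"), r.get_count)  # True 1

asyncio.run(main())
```

The broken version would have routed `check_quota` through the quota-tracked cache, yielding two GETs per operation instead of one.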

Root Cause #2: Rate Limiter State Checks

**Problem**: Every integration API call checked Redis for rate limit state

  • **Impact**: 1 GET per API call (Slack, Salesforce, HubSpot, etc.)
  • **Fix**: Local cache with 5-second TTL
  • **Reduction**: ~95% (1 GET per 5 seconds vs 1 GET per call)

Root Cause #3: Circuit Breaker State Checks

**Problem**: Every integration API call checked Redis for circuit state

  • **Impact**: 1 GET per API call
  • **Fix**: Local cache with 10-second TTL
  • **Reduction**: ~90% (1 GET per 10 seconds vs 1 GET per call)
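Root causes #2 and #3 share one pattern: serve reads from a short-lived in-process copy and only fall through to Redis when that copy expires (5 s for the rate limiter, 10 s for the circuit breaker). A minimal sketch of the pattern; class and function names are illustrative, only `local_cache_ttl` mirrors the real code:

```python
import time

class LocallyCachedState:
    def __init__(self, fetch_from_redis, local_cache_ttl=5.0, clock=time.monotonic):
        self._fetch = fetch_from_redis   # performs the actual Redis GET
        self._ttl = local_cache_ttl
        self._clock = clock              # injectable for testing
        self._cache = {}                 # key -> (value, fetched_at)

    def get(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]                # served locally: no Redis GET
        value = self._fetch(key)         # at most one GET per TTL window
        self._cache[key] = (value, now)
        return value

# Demo: 100 reads inside one TTL window -> a single Redis GET
gets = 0
def fake_redis_get(key):
    global gets
    gets += 1
    return "closed"

state = LocallyCachedState(fake_redis_get, local_cache_ttl=5.0)
for _ in range(100):
    state.get("circuit_breaker:slack")
print(gets)  # 1
```

The trade-off is staleness: a state change in Redis takes up to one TTL to be observed locally, which is why the circuit breaker (10 s) tolerates a longer window than the rate limiter (5 s).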

**Total Expected Reduction**: 2.2M → ~110K GET requests/day (95% reduction)
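The headline numbers can be sanity-checked with quick arithmetic, assuming requests are spread evenly across the day:

```python
# Sanity-check the headline numbers (assumes an even spread over 24 h)
before_per_day = 2_200_000
after_per_day = before_per_day * 0.05           # ~95% reduction -> ~110K

minutes_per_day = 24 * 60
print(round(before_per_day / minutes_per_day))  # 1528 (doc rounds to ~1,525/min)
print(round(after_per_day / minutes_per_day))   # 76
print(int(after_per_day))                       # 110000
```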

---

šŸ” Investigating the Current 2.2M GETs

The flat line at the top of your Upstash dashboard suggests one of these:

1. Connection Leak or Tight Loop

**Symptoms**: Flat line at maximum, no variation

**Causes**:

  • Infinite retry loop without backoff
  • Missing `await` in async code
  • Unbounded loop calling Redis

**Check in your code**:

```shell
# Search for potential loops
cd backend-saas
grep -r "while.*redis" --include="*.py"
grep -r "for.*in.*range.*:" --include="*.py" | grep -i redis
```
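If the search turns up a retry loop, the standard remedy for the first cause above is bounded retries with exponential backoff. A sketch with placeholder names (`retry_with_backoff` and its limits are not existing backend-saas code):

```python
import time

def retry_with_backoff(op, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run op(), retrying on connection errors with doubling delays."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts:
                raise                 # give up instead of looping forever
            sleep(delay)
            delay *= 2                # 0.1s, 0.2s, 0.4s, ...

# Demo: succeed on the third attempt, recording the delays slept
slept = []
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("redis unavailable")
    return "ok"

print(retry_with_backoff(flaky, sleep=slept.append))  # ok
print(slept)  # [0.1, 0.2]
```

An unbounded `while` loop without the growing delay is exactly the pattern that can turn one outage into millions of GETs.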

2. Dashboard Metrics Lag

**Symptoms**: Traffic stopped but dashboard still shows high

**Cause**: Upstash dashboard aggregates over 5-10 minute windows

**Solution**: Wait 10-15 minutes after applying fix

3. Health Check Spam

**Symptoms**: Consistent pattern of requests

**Check**: Look for keys with prefixes like `health:`, `ping:`, `status:`

```shell
# In Upstash console or redis-cli (quote the patterns so the shell
# does not glob-expand them; KEYS is O(N), fine for a one-off check)
redis-cli KEYS "*health*"
redis-cli KEYS "*ping*"
```

4. Multi-Region Traffic

**Symptoms**: Dashboard shows more traffic than your CLI sees

**Cause**: Global replication aggregates all regions

**Check**: Compare regional dashboards in Upstash console

---

✅ Step-by-Step Verification

Step 1: Deploy Root Cause Fixes (DONE)

✅ **Commit**: d1e303fdb

✅ **Status**: Pushed to main

✅ **Files Changed**:

  • `backend-saas/core/cache.py` - Fixed quota manager
  • `backend-saas/core/integration_rate_limiter.py` - Added local cache
  • `backend-saas/core/integration_circuit_breaker.py` - Added local cache

Step 2: Re-enable Redis (CURRENT)

**Option A: Gradual Rollout (RECOMMENDED)**

```shell
# Remove suspension flag
fly secrets set SUSPEND_REDIS=false -a atom-saas

# Deploy with root cause fixes
fly deploy -a atom-saas

# Monitor for issues
fly logs -a atom-saas --tail 100 | grep -i "error\|exception\|redis"
```

**Option B: Test with Small Percentage**

```shell
# Keep Redis suspended but test on staging
fly secrets set SUSPEND_REDIS=false -a atom-saas-staging
```

Step 3: Monitor Upstash Dashboard (Next 1 Hour)

**What to Watch For**:

| Metric | Before Fix | After Fix (Expected) | Timeframe |
| --- | --- | --- | --- |
| **GET requests** | 2.2M/day | ~110K/day | Immediate |
| **Request rate** | ~1,525/minute | ~76/minute | Within 15 min |
| **Cost** | $X/month | $X/20 (95% off) | This billing cycle |

**Dashboard URL**: https://console.upstash.com/

**Key Indicators**:

  • ✅ **GET line drops from flat top to 5% of previous height**
  • ✅ **Request rate decreases from ~1,525/min to ~76/min**
  • ✅ **No new spikes or flat lines**

Step 4: Verify Application Health

**Check health endpoint**:

```shell
watch -n 5 'curl -s https://app.atomagentos.com/api/health | jq .'
```

**Expected output**:

```json
{
  "status": "ok",
  "services": {
    "web": "ok",
    "database": "ok",
    "redis": "ok"
  }
}
```

`redis` should read `"ok"`: the app is using Redis again, fronted by the local caches.

**Check error rates**:

```shell
fly logs -a atom-saas --json | grep -i "error\|exception" | wc -l
# Should see minimal errors (same as with SUSPEND_REDIS=true)
```

Step 5: Monitor for 24 Hours

**Hour 0-1**: Immediate reduction in GET requests

**Hour 1-6**: Stable low request rate

**Hour 6-24**: No new spikes or issues

**If you see issues**:

```shell
# Emergency: Re-suspend Redis
fly secrets set SUSPEND_REDIS=true -a atom-saas
fly deploy -a atom-saas

# Check logs
fly logs -a atom-saas --tail 500 | grep -i "redis\|error"
```

---

🛠 Troubleshooting Common Issues

Issue: GET requests still high after 15 minutes

**Diagnose**:

```shell
# Check which keys are being accessed most
# (in Upstash console or via redis-cli; note that --hotkeys requires
# the server to run an LFU maxmemory-policy)
redis-cli --hotkeys

# Or use SCAN to confirm which key families exist
redis-cli --scan --pattern "rate_limit:*" | head -20
redis-cli --scan --pattern "circuit_breaker:*" | head -20
redis-cli --scan --pattern "quota:redis:*" | head -20
```

**If `rate_limit:*` or `circuit_breaker:*` keys appear**:

  • Local caching might not be working
  • Check logs for "local cache" messages
  • Verify `local_cache_ttl` is being used

Issue: Application errors after re-enabling Redis

**Check 1**: Redis connection working

```shell
fly logs -a atom-saas --tail 100 | grep -i "redis.*connect"
```

**Check 2**: Quota not blocking operations

```shell
fly logs -a atom-saas --tail 100 | grep -i "quota.*exceeded"
```

**Check 3**: Local cache populating

```shell
fly logs -a atom-saas --tail 100 | grep -i "local.*cache"
```

Issue: No reduction in GET requests

**Possible Cause**: Old code still running

```shell
# Verify deployment
fly status -a atom-saas
# Check version number increased

# Check machines restarted
fly machines list -a atom-saas
# All should have recent "LAST UPDATED" times
```

**Possible Cause**: Local cache not being used

```python
# Add debug logging temporarily
# In integration_rate_limiter.py, line 149:
logger.info(f"🔵 Rate limiter: Using local cache for {key}")
```

---

📊 Success Criteria

The fix is successful when ALL of these are true:

  • [ ] Upstash dashboard shows ~95% reduction in GET requests
  • [ ] GET line is no longer flat at top
  • [ ] Request rate dropped from ~1,525/min to ~76/min
  • [ ] Application health checks passing
  • [ ] Error rates same as with SUSPEND_REDIS=true
  • [ ] No new spikes or flat lines in dashboard
  • [ ] Costs reduced by ~95%

---

🚀 Rollback Plan

If issues occur after re-enabling Redis:

**Immediate Rollback** (2 minutes):

```shell
fly secrets set SUSPEND_REDIS=true -a atom-saas
fly deploy -a atom-saas
```

**Investigation** (15 minutes):

```shell
# Check logs for errors
fly logs -a atom-saas --tail 500 > redis_errors.log
grep -i "error\|exception\|traceback" redis_errors.log

# Identify problematic pattern
grep -i "redis" redis_errors.log | head -50
```

**Fix and Retry** (1 hour):

  1. Identify root cause from logs
  2. Implement fix
  3. Test locally
  4. Deploy to staging
  5. Retry production rollout

---

📈 Long-Term Monitoring

Daily Checks (First Week)

  • [ ] Upstash dashboard GET request count
  • [ ] Application error rate
  • [ ] Response time (should be < 100ms)
  • [ ] Cost trending down

Weekly Review

  • [ ] Compare weekly GET request totals
  • [ ] Calculate cost savings
  • [ ] Check for any anomalies
  • [ ] Update documentation if needed

Monthly Review

  • [ ] Evaluate if local cache TTL needs adjustment
  • [ ] Consider increasing TTL if stability is good
  • [ ] Review integration volume changes
  • [ ] Update rate limits if needed

---

🎯 Expected Results Summary

| Before | After | Improvement |
| --- | --- | --- |
| **2.2M GETs/day** | **~110K GETs/day** | 95% reduction |
| **1,525 GETs/min** | **~76 GETs/min** | 95% reduction |
| **Quota: 2 GETs/op** | **Quota: 1 GET/op** | 50% reduction |
| **Rate limiter: 1 GET/call** | **Rate limiter: 1 GET/5s** | 95% reduction |
| **Circuit breaker: 1 GET/call** | **Circuit breaker: 1 GET/10s** | 90% reduction |
| **Cost: $X/month** | **Cost: $X/20/month** | 95% savings |

---

šŸ“ Next Actions

  1. **Read this guide completely** ✓
  2. **Re-enable Redis** using Step 2 above
  3. **Monitor dashboard** for 1 hour (Step 3)
  4. **Verify app health** (Step 4)
  5. **Monitor for 24 hours** (Step 5)
  6. **Report results** and adjust if needed

---

**Questions?** Check these files:

  • `REDIS_GET_SOLUTION.md` - Complete technical analysis
  • `REDIS_FIX_APPLIED.md` - Temporary fix documentation
  • `backend-saas/core/cache.py` - Quota manager fixes
  • `backend-saas/core/integration_rate_limiter.py` - Rate limiter caching
  • `backend-saas/core/integration_circuit_breaker.py` - Circuit breaker caching