# Redis Root Cause Fix - Verification Guide

**Date**: 2026-04-09
**Issue**: 2.2M Redis GET requests in Upstash dashboard
**Fix**: Eliminated 3 root causes with local caching

---
## 🎯 What Was Fixed

### Root Cause #1: Quota Manager Circular Dependency

**Problem**: `quota_manager.check_quota()` called `cache.get_async()`, which checked quota again
- **Impact**: 2 GETs per cache operation (quota + data)
- **Fix**: Direct Redis client bypasses quota checks
- **Reduction**: 50% (from 2 GETs to 1 GET)
### Root Cause #2: Rate Limiter State Checks

**Problem**: Every integration API call checked Redis for rate-limit state
- **Impact**: 1 GET per API call (Slack, Salesforce, HubSpot, etc.)
- **Fix**: Local cache with 5-second TTL
- **Reduction**: ~95% (1 GET per 5 seconds vs 1 GET per call)
### Root Cause #3: Circuit Breaker State Checks

**Problem**: Every integration API call checked Redis for circuit state
- **Impact**: 1 GET per API call
- **Fix**: Local cache with 10-second TTL
- **Reduction**: ~90% (1 GET per 10 seconds vs 1 GET per call)

**Total Expected Reduction**: 2.2M → ~110K GET requests/day (95% reduction)
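The local-cache pattern behind root causes #2 and #3 can be sketched as follows. This is a minimal illustration, not the actual code in `integration_rate_limiter.py` or `integration_circuit_breaker.py`; the class name and callables here are hypothetical.

```python
import time

class LocallyCachedState:
    """Serve reads from a process-local cache, refreshing from Redis
    only when the cached entry is older than ttl seconds."""

    def __init__(self, redis_get, ttl=5.0):
        self.redis_get = redis_get   # callable performing the real Redis GET
        self.ttl = ttl
        self._cache = {}             # key -> (value, fetched_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        hit = self._cache.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]            # fresh enough: no Redis round trip
        value = self.redis_get(key)  # stale or missing: one real GET
        self._cache[key] = (value, now)
        return value

# Hypothetical usage: count real GETs across 100 calls inside one TTL window.
calls = {"n": 0}
def fake_redis_get(key):
    calls["n"] += 1
    return "closed"

state = LocallyCachedState(fake_redis_get, ttl=5.0)
for i in range(100):
    state.get("circuit_breaker:slack", now=float(i) * 0.01)  # all within 1 s
print(calls["n"])  # 1 real GET instead of 100
```

With a 5-second TTL this collapses any burst of reads inside the window to a single Redis GET per process, which is where the ~95% reduction comes from.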
---
## 🔍 Investigating the Current 2.2M GETs
The flat line at the top of your Upstash dashboard suggests one of these:
### 1. Connection Leak or Tight Loop

**Symptoms**: Flat line at maximum, no variation
**Causes**:
- Infinite retry loop without backoff
- Missing `await` in async code
- Unbounded loop calling Redis
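For contrast with the runaway patterns above, a bounded retry with exponential backoff looks roughly like this (a sketch with hypothetical names and arbitrary delay constants):

```python
import time

def get_with_backoff(redis_get, key, retries=5, base_delay=0.1):
    """Retry a Redis GET a bounded number of times, doubling the pause
    between attempts, instead of hammering Redis in a tight loop."""
    for attempt in range(retries):
        try:
            return redis_get(key)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

# Hypothetical usage: a client that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_get(key):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("redis unavailable")
    return b"value"

print(get_with_backoff(flaky_get, "rate_limit:slack", base_delay=0.001))
print(attempts["n"])  # 3 attempts total, with growing pauses between them
```

The key properties are the bounded attempt count and the growing delay; a loop missing either can produce exactly the flat-topped GET line described above.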
**Check in your code**:

```bash
# Search for potential loops
cd backend-saas
grep -r "while.*redis" --include="*.py"
grep -r "for.*in.*range.*:" --include="*.py" | grep -i redis
```

### 2. Dashboard Metrics Lag
**Symptoms**: Traffic has stopped but the dashboard still shows high request counts
**Cause**: Upstash dashboard aggregates over 5-10 minute windows
**Solution**: Wait 10-15 minutes after applying fix
### 3. Health Check Spam

**Symptoms**: Consistent pattern of requests
**Check**: Look for keys like `health:`, `ping:`, `status:`

```bash
# In Upstash console or redis-cli
# (KEYS blocks Redis on large keyspaces; acceptable only for a quick spot check)
redis-cli KEYS "*health*"
redis-cli KEYS "*ping*"
```

### 4. Multi-Region Traffic
**Symptoms**: Dashboard shows more traffic than your CLI sees
**Cause**: Global replication aggregates all regions
**Check**: Compare regional dashboards in Upstash console
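Since `KEYS` is O(N) and blocks Redis, a `SCAN`-based count is safer for anything beyond a quick spot check. Here is a sketch of the counting logic using `fnmatch` and a stand-in key list; with a real client you would iterate redis-py's `scan_iter(match=pattern)` instead:

```python
import fnmatch

def count_matching(keys, pattern):
    """Count keys matching a glob pattern. With a real redis-py client,
    iterate redis_client.scan_iter(match=pattern) rather than a list."""
    return sum(1 for k in keys if fnmatch.fnmatch(k, pattern))

# Hypothetical sample of what a scan might return.
sample = ["health:web", "health:db", "ping:1", "rate_limit:slack"]
print(count_matching(sample, "health:*"))  # 2
```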
---
## ✅ Step-by-Step Verification
### Step 1: Deploy Root Cause Fixes (DONE)

✅ **Commit**: d1e303fdb
✅ **Status**: Pushed to main
✅ **Files Changed**:
- `backend-saas/core/cache.py` - Fixed quota manager
- `backend-saas/core/integration_rate_limiter.py` - Added local cache
- `backend-saas/core/integration_circuit_breaker.py` - Added local cache
### Step 2: Re-enable Redis (CURRENT)

**Option A: Gradual Rollout (RECOMMENDED)**

```bash
# Remove suspension flag
fly secrets set SUSPEND_REDIS=false -a atom-saas

# Deploy with root cause fixes
fly deploy -a atom-saas

# Monitor for issues
fly logs -a atom-saas --tail 100 | grep -i "error\|exception\|redis"
```

**Option B: Test with Small Percentage**

```bash
# Keep production Redis suspended but test on staging
fly secrets set SUSPEND_REDIS=false -a atom-saas-staging
```

### Step 3: Monitor Upstash Dashboard (Next 1 Hour)
**What to Watch For**:
| Metric | Before Fix | After Fix (Expected) | Timeframe |
|---|---|---|---|
| **GET requests** | 2.2M/day | ~110K/day | Immediate |
| **Request rate** | ~1,525/minute | ~76/minute | Within 15 min |
| **Cost** | $X/month | $X/20 (95% off) | This billing cycle |
**Dashboard URL**: https://console.upstash.com/
**Key Indicators**:
- ✅ GET line drops from a flat top to ~5% of its previous height
- ✅ Request rate decreases from ~1,525/min to ~76/min
- ✅ No new spikes or flat lines
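The indicator numbers are easy to sanity-check with arithmetic (the ~1,525/min figure above is the same calculation with slightly different rounding):

```python
# Convert the daily GET totals into per-minute rates.
gets_per_day_before = 2_200_000
per_minute_before = gets_per_day_before / (24 * 60)
print(round(per_minute_before))  # ~1528 GETs/min before the fix

reduction = 0.95
gets_per_day_after = gets_per_day_before * (1 - reduction)
print(round(gets_per_day_after))               # ~110000 GETs/day after
print(round(gets_per_day_after / (24 * 60)))   # ~76 GETs/min after
```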
### Step 4: Verify Application Health

**Check health endpoint**:

```bash
watch -n 5 'curl -s https://app.atomagentos.com/api/health | jq .'
```

**Expected output** (`redis` should show `"ok"`, since Redis is in use with the local cache):

```json
{
  "status": "ok",
  "services": {
    "web": "ok",
    "database": "ok",
    "redis": "ok"
  }
}
```

**Check error rates**:

```bash
fly logs -a atom-saas --json | grep -i "error\|exception" | wc -l
# Should see minimal errors (same as with SUSPEND_REDIS=true)
```

### Step 5: Monitor for 24 Hours
**Hour 0-1**: Immediate reduction in GET requests
**Hour 1-6**: Stable low request rate
**Hour 6-24**: No new spikes or issues
**If you see issues**:

```bash
# Emergency: Re-suspend Redis
fly secrets set SUSPEND_REDIS=true -a atom-saas
fly deploy -a atom-saas

# Check logs
fly logs -a atom-saas --tail 500 | grep -i "redis\|error"
```

---
## 🔧 Troubleshooting Common Issues
### Issue: GET requests still high after 15 minutes

**Diagnose**:

```bash
# Check which keys are being accessed most
# (in Upstash console or via redis-cli; --hotkeys requires an LFU maxmemory-policy)
redis-cli --hotkeys

# Or use SCAN to find frequently accessed keys
redis-cli --scan --pattern "rate_limit:*" | head -20
redis-cli --scan --pattern "circuit_breaker:*" | head -20
redis-cli --scan --pattern "quota:redis:*" | head -20
```

**If rate_limit or circuit_breaker keys appear**:
- Local caching might not be working
- Check logs for "local cache" messages
- Verify `local_cache_ttl` is being used
### Issue: Application errors after re-enabling Redis

**Check 1**: Redis connection working

```bash
fly logs -a atom-saas --tail 100 | grep -i "redis.*connect"
```

**Check 2**: Quota not blocking operations

```bash
fly logs -a atom-saas --tail 100 | grep -i "quota.*exceeded"
```

**Check 3**: Local cache populating

```bash
fly logs -a atom-saas --tail 100 | grep -i "local.*cache"
```

### Issue: No reduction in GET requests
**Possible Cause**: Old code still running

```bash
# Verify deployment
fly status -a atom-saas
# Check that the version number increased

# Check machines restarted
fly machines list -a atom-saas
# All should have recent "LAST UPDATED" times
```

**Possible Cause**: Local cache not being used

```python
# Add debug logging temporarily
# (in integration_rate_limiter.py, line 149)
logger.info(f"🔵 Rate limiter: Using local cache for {key}")
```

---
## 📊 Success Criteria
The fix is successful when ALL of these are true:
- [ ] Upstash dashboard shows ~95% reduction in GET requests
- [ ] GET line is no longer flat at top
- [ ] Request rate dropped from ~1,525/min to ~76/min
- [ ] Application health checks passing
- [ ] Error rates same as with SUSPEND_REDIS=true
- [ ] No new spikes or flat lines in dashboard
- [ ] Costs reduced by ~95%
---
## 🔄 Rollback Plan
If issues occur after re-enabling Redis:
**Immediate Rollback** (2 minutes):

```bash
fly secrets set SUSPEND_REDIS=true -a atom-saas
fly deploy -a atom-saas
```

**Investigation** (15 minutes):

```bash
# Check logs for errors
fly logs -a atom-saas --tail 500 > redis_errors.log
grep -i "error\|exception\|traceback" redis_errors.log

# Identify problematic patterns
grep -i "redis" redis_errors.log | head -50
```

**Fix and Retry** (1 hour):
- Identify root cause from logs
- Implement fix
- Test locally
- Deploy to staging
- Retry production rollout
---
## 📈 Long-Term Monitoring

### Daily Checks (First Week)
- [ ] Upstash dashboard GET request count
- [ ] Application error rate
- [ ] Response time (should be < 100ms)
- [ ] Cost trending down
### Weekly Review
- [ ] Compare weekly GET request totals
- [ ] Calculate cost savings
- [ ] Check for any anomalies
- [ ] Update documentation if needed
### Monthly Review
- [ ] Evaluate if local cache TTL needs adjustment
- [ ] Consider increasing TTL if stability is good
- [ ] Review integration volume changes
- [ ] Update rate limits if needed
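When evaluating a TTL increase, a rough worst-case model helps: each app instance refreshes each hot key once per TTL window, so the expected GET load scales inversely with the TTL. The instance and key counts below are hypothetical:

```python
def expected_gets_per_day(instances, keys, ttl_seconds):
    """Worst-case refresh load: one GET per instance, per key, per TTL
    window (assumes steady traffic keeps every local-cache entry warm)."""
    return instances * keys * 86_400 / ttl_seconds

# Hypothetical sizing: 2 app instances, 20 hot rate-limit/circuit keys.
at_5s = expected_gets_per_day(2, 20, ttl_seconds=5)
at_10s = expected_gets_per_day(2, 20, ttl_seconds=10)
print(int(at_5s), int(at_10s))  # doubling the TTL halves the refresh GETs
```

The trade-off is staleness: a longer TTL means rate-limit and circuit-breaker state can lag Redis by up to that many seconds, so increase it only if stability has been good.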
---
## 🎯 Expected Results Summary
| Before | After | Improvement |
|---|---|---|
| **2.2M GETs/day** | **~110K GETs/day** | 95% reduction |
| **1,525 GETs/min** | **~76 GETs/min** | 95% reduction |
| **Quota: 2 GETs/op** | **Quota: 1 GET/op** | 50% reduction |
| **Rate limiter: 1 GET/call** | **Rate limiter: 1 GET/5s** | 95% reduction |
| **Circuit breaker: 1 GET/call** | **Circuit breaker: 1 GET/10s** | 90% reduction |
| **Cost: $X/month** | **Cost: $X/20/month** | 95% savings |
---
## 📋 Next Actions

- **Read this guide completely** ✅
- **Re-enable Redis** using Step 2 above
- **Monitor dashboard** for 1 hour (Step 3)
- **Verify app health** (Step 4)
- **Monitor for 24 hours** (Step 5)
- **Report results** and adjust if needed
---
**Questions?** Check these files:
- `REDIS_GET_SOLUTION.md` - Complete technical analysis
- `REDIS_FIX_APPLIED.md` - Temporary fix documentation
- `backend-saas/core/cache.py` - Quota manager fixes
- `backend-saas/core/integration_rate_limiter.py` - Rate limiter caching
- `backend-saas/core/integration_circuit_breaker.py` - Circuit breaker caching