# Redis Root Cause Fix - Verification Guide

**Date**: 2026-04-09
**Issue**: 2.2M Redis GET requests in Upstash dashboard
**Fix**: Eliminated 3 root causes with local caching

---
## 🎯 What Was Fixed

### Root Cause #1: Quota Manager Circular Dependency

**Problem**: `quota_manager.check_quota()` called `cache.get_async()`, which checked quota again
- **Impact**: 2 GETs per cache operation (quota + data)
- **Fix**: Direct Redis client bypasses quota checks
- **Reduction**: 50% (from 2 GETs to 1 GET)
### Root Cause #2: Rate Limiter State Checks

**Problem**: Every integration API call checked Redis for rate-limit state
- **Impact**: 1 GET per API call (Slack, Salesforce, HubSpot, etc.)
- **Fix**: Local cache with 5-second TTL
- **Reduction**: ~95% (1 GET per 5 seconds vs 1 GET per call)
### Root Cause #3: Circuit Breaker State Checks

**Problem**: Every integration API call checked Redis for circuit state
- **Impact**: 1 GET per API call
- **Fix**: Local cache with 10-second TTL
- **Reduction**: ~90% (1 GET per 10 seconds vs 1 GET per call)

**Total Expected Reduction**: 2.2M → ~110K GET requests/day (95% reduction)
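The local-cache pattern behind root causes #2 and #3 can be sketched as follows. This is a minimal illustration, not the actual code in `integration_rate_limiter.py` or `integration_circuit_breaker.py`; the class name and callables here are hypothetical.

```python
import time

class LocallyCachedState:
    """Serve reads from a process-local cache, refreshing from Redis
    only when the cached entry is older than ttl seconds."""

    def __init__(self, redis_get, ttl=5.0):
        self.redis_get = redis_get   # callable performing the real Redis GET
        self.ttl = ttl
        self._cache = {}             # key -> (value, fetched_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        hit = self._cache.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]            # fresh enough: no Redis round trip
        value = self.redis_get(key)  # stale or missing: one real GET
        self._cache[key] = (value, now)
        return value

# Hypothetical usage: count real GETs across 100 calls inside one TTL window.
calls = {"n": 0}
def fake_redis_get(key):
    calls["n"] += 1
    return "closed"

state = LocallyCachedState(fake_redis_get, ttl=5.0)
for i in range(100):
    state.get("circuit_breaker:slack", now=float(i) * 0.01)  # all within 1 s
print(calls["n"])  # 1 real GET instead of 100
```

With a 5-second TTL this collapses any burst of reads inside the window to a single Redis GET per process, which is where the ~95% reduction comes from.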
---
## 🔍 Investigating the Current 2.2M GETs
The flat line at the top of your Upstash dashboard suggests one of these:
### 1. Connection Leak or Tight Loop

**Symptoms**: Flat line at maximum, no variation
**Causes**:
- Infinite retry loop without backoff
- Missing `await` in async code
- Unbounded loop calling Redis
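For contrast with the runaway patterns above, a bounded retry with exponential backoff looks roughly like this (a sketch with hypothetical names and arbitrary delay constants):

```python
import time

def get_with_backoff(redis_get, key, retries=5, base_delay=0.1):
    """Retry a Redis GET a bounded number of times, doubling the pause
    between attempts, instead of hammering Redis in a tight loop."""
    for attempt in range(retries):
        try:
            return redis_get(key)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

# Hypothetical usage: a client that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_get(key):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("redis unavailable")
    return b"value"

print(get_with_backoff(flaky_get, "rate_limit:slack", base_delay=0.001))
print(attempts["n"])  # 3 attempts total, with growing pauses between them
```

The key properties are the bounded attempt count and the growing delay; a loop missing either can produce exactly the flat-topped GET line described above.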
**Check in your code**:

```bash
# Search for potential loops
cd backend-saas
grep -r "while.*redis" --include="*.py"
grep -r "for.*in.*range.*:" --include="*.py" | grep -i redis
```

### 2. Dashboard Metrics Lag
**Symptoms**: Traffic has stopped but the dashboard still shows high request counts
**Cause**: Upstash dashboard aggregates over 5-10 minute windows
**Solution**: Wait 10-15 minutes after applying fix
### 3. Health Check Spam

**Symptoms**: Consistent pattern of requests
**Check**: Look for keys like `health:`, `ping:`, `status:`

```bash
# In Upstash console or redis-cli
# (KEYS blocks Redis on large keyspaces; acceptable only for a quick spot check)
redis-cli KEYS "*health*"
redis-cli KEYS "*ping*"
```

### 4. Multi-Region Traffic
**Symptoms**: Dashboard shows more traffic than your CLI sees
**Cause**: Global replication aggregates all regions
**Check**: Compare regional dashboards in Upstash console
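Since `KEYS` is O(N) and blocks Redis, a `SCAN`-based count is safer for anything beyond a quick spot check. Here is a sketch of the counting logic using `fnmatch` and a stand-in key list; with a real client you would iterate redis-py's `scan_iter(match=pattern)` instead:

```python
import fnmatch

def count_matching(keys, pattern):
    """Count keys matching a glob pattern. With a real redis-py client,
    iterate redis_client.scan_iter(match=pattern) rather than a list."""
    return sum(1 for k in keys if fnmatch.fnmatch(k, pattern))

# Hypothetical sample of what a scan might return.
sample = ["health:web", "health:db", "ping:1", "rate_limit:slack"]
print(count_matching(sample, "health:*"))  # 2
```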
---
## ✅ Step-by-Step Verification
### Step 1: Deploy Root Cause Fixes (DONE)

✅ **Commit**: d1e303fdb
✅ **Status**: Pushed to main
✅ **Files Changed**:
- `backend-saas/core/cache.py` - Fixed quota manager
- `backend-saas/core/integration_rate_limiter.py` - Added local cache
- `backend-saas/core/integration_circuit_breaker.py` - Added local cache
### Step 2: Re-enable Redis (CURRENT)

**Option A: Gradual Rollout (RECOMMENDED)**

```bash
# Remove suspension flag
fly secrets set SUSPEND_REDIS=false -a atom-saas

# Deploy with root cause fixes
fly deploy -a atom-saas

# Monitor for issues
fly logs -a atom-saas --tail 100 | grep -i "error\|exception\|redis"
```

**Option B: Test with Small Percentage**

```bash
# Keep production Redis suspended but test on staging
fly secrets set SUSPEND_REDIS=false -a atom-saas-staging
```

### Step 3: Monitor Upstash Dashboard (Next 1 Hour)
**What to Watch For**:
| Metric | Before Fix | After Fix (Expected) | Timeframe |
|---|---|---|---|
| **GET requests** | 2.2M/day | ~110K/day | Immediate |
| **Request rate** | ~1,525/minute | ~76/minute | Within 15 min |
| **Cost** | $X/month | $X/20 (95% off) | This billing cycle |
**Dashboard URL**: https://console.upstash.com/
**Key Indicators**:
- ✅ GET line drops from a flat top to ~5% of its previous height
- ✅ Request rate decreases from ~1,525/min to ~76/min
- ✅ No new spikes or flat lines
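The indicator numbers are easy to sanity-check with arithmetic (the ~1,525/min figure above is the same calculation with slightly different rounding):

```python
# Convert the daily GET totals into per-minute rates.
gets_per_day_before = 2_200_000
per_minute_before = gets_per_day_before / (24 * 60)
print(round(per_minute_before))  # ~1528 GETs/min before the fix

reduction = 0.95
gets_per_day_after = gets_per_day_before * (1 - reduction)
print(round(gets_per_day_after))               # ~110000 GETs/day after
print(round(gets_per_day_after / (24 * 60)))   # ~76 GETs/min after
```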
### Step 4: Verify Application Health

**Check health endpoint**:

```bash
watch -n 5 'curl -s https://app.atomagentos.com/api/health | jq .'
```

**Expected output** (`redis` should show `"ok"`, since Redis is in use with the local cache):

```json
{
  "status": "ok",
  "services": {
    "web": "ok",
    "database": "ok",
    "redis": "ok"
  }
}
```

**Check error rates**:

```bash
fly logs -a atom-saas --json | grep -i "error\|exception" | wc -l
# Should see minimal errors (same as with SUSPEND_REDIS=true)
```

### Step 5: Monitor for 24 Hours
**Hour 0-1**: Immediate reduction in GET requests
**Hour 1-6**: Stable low request rate
**Hour 6-24**: No new spikes or issues
**If you see issues**:

```bash
# Emergency: Re-suspend Redis
fly secrets set SUSPEND_REDIS=true -a atom-saas
fly deploy -a atom-saas

# Check logs
fly logs -a atom-saas --tail 500 | grep -i "redis\|error"
```

---
## 🔧 Troubleshooting Common Issues
### Issue: GET requests still high after 15 minutes

**Diagnose**:

```bash
# Check which keys are being accessed most
# (in Upstash console or via redis-cli; --hotkeys requires an LFU maxmemory-policy)
redis-cli --hotkeys

# Or use SCAN to find frequently accessed keys
redis-cli --scan --pattern "rate_limit:*" | head -20
redis-cli --scan --pattern "circuit_breaker:*" | head -20
redis-cli --scan --pattern "quota:redis:*" | head -20
```

**If rate_limit or circuit_breaker keys appear**:
- Local caching might not be working
- Check logs for "local cache" messages
- Verify `local_cache_ttl` is being used
### Issue: Application errors after re-enabling Redis

**Check 1**: Redis connection working

```bash
fly logs -a atom-saas --tail 100 | grep -i "redis.*connect"
```

**Check 2**: Quota not blocking operations

```bash
fly logs -a atom-saas --tail 100 | grep -i "quota.*exceeded"
```

**Check 3**: Local cache populating

```bash
fly logs -a atom-saas --tail 100 | grep -i "local.*cache"
```

### Issue: No reduction in GET requests
**Possible Cause**: Old code still running

```bash
# Verify deployment
fly status -a atom-saas
# Check that the version number increased

# Check machines restarted
fly machines list -a atom-saas
# All should have recent "LAST UPDATED" times
```

**Possible Cause**: Local cache not being used

```python
# Add debug logging temporarily
# (in integration_rate_limiter.py, line 149)
logger.info(f"🔵 Rate limiter: Using local cache for {key}")
```

---
## 📊 Success Criteria
The fix is successful when ALL of these are true:
- [ ] Upstash dashboard shows ~95% reduction in GET requests
- [ ] GET line is no longer flat at top
- [ ] Request rate dropped from ~1,525/min to ~76/min
- [ ] Application health checks passing
- [ ] Error rates same as with SUSPEND_REDIS=true
- [ ] No new spikes or flat lines in dashboard
- [ ] Costs reduced by ~95%
---
## 🔄 Rollback Plan
If issues occur after re-enabling Redis:
**Immediate Rollback** (2 minutes):

```bash
fly secrets set SUSPEND_REDIS=true -a atom-saas
fly deploy -a atom-saas
```

**Investigation** (15 minutes):

```bash
# Check logs for errors
fly logs -a atom-saas --tail 500 > redis_errors.log
grep -i "error\|exception\|traceback" redis_errors.log

# Identify problematic patterns
grep -i "redis" redis_errors.log | head -50
```

**Fix and Retry** (1 hour):
- Identify root cause from logs
- Implement fix
- Test locally
- Deploy to staging
- Retry production rollout
---
## 📈 Long-Term Monitoring

### Daily Checks (First Week)
- [ ] Upstash dashboard GET request count
- [ ] Application error rate
- [ ] Response time (should be < 100ms)
- [ ] Cost trending down
### Weekly Review
- [ ] Compare weekly GET request totals
- [ ] Calculate cost savings
- [ ] Check for any anomalies
- [ ] Update documentation if needed
### Monthly Review
- [ ] Evaluate if local cache TTL needs adjustment
- [ ] Consider increasing TTL if stability is good
- [ ] Review integration volume changes
- [ ] Update rate limits if needed
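When evaluating a TTL increase, a rough worst-case model helps: each app instance refreshes each hot key once per TTL window, so the expected GET load scales inversely with the TTL. The instance and key counts below are hypothetical:

```python
def expected_gets_per_day(instances, keys, ttl_seconds):
    """Worst-case refresh load: one GET per instance, per key, per TTL
    window (assumes steady traffic keeps every local-cache entry warm)."""
    return instances * keys * 86_400 / ttl_seconds

# Hypothetical sizing: 2 app instances, 20 hot rate-limit/circuit keys.
at_5s = expected_gets_per_day(2, 20, ttl_seconds=5)
at_10s = expected_gets_per_day(2, 20, ttl_seconds=10)
print(int(at_5s), int(at_10s))  # doubling the TTL halves the refresh GETs
```

The trade-off is staleness: a longer TTL means rate-limit and circuit-breaker state can lag Redis by up to that many seconds, so increase it only if stability has been good.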
---
## 🎯 Expected Results Summary
| Before | After | Improvement |
|---|---|---|
| **2.2M GETs/day** | **~110K GETs/day** | 95% reduction |
| **1,525 GETs/min** | **~76 GETs/min** | 95% reduction |
| **Quota: 2 GETs/op** | **Quota: 1 GET/op** | 50% reduction |
| **Rate limiter: 1 GET/call** | **Rate limiter: 1 GET/5s** | 95% reduction |
| **Circuit breaker: 1 GET/call** | **Circuit breaker: 1 GET/10s** | 90% reduction |
| **Cost: $X/month** | **Cost: $X/20/month** | 95% savings |
---
## 📋 Next Actions

- **Read this guide completely** ✅
- **Re-enable Redis** using Step 2 above
- **Monitor dashboard** for 1 hour (Step 3)
- **Verify app health** (Step 4)
- **Monitor for 24 hours** (Step 5)
- **Report results** and adjust if needed
---
**Questions?** Check these files:
- `REDIS_GET_SOLUTION.md` - Complete technical analysis
- `REDIS_FIX_APPLIED.md` - Temporary fix documentation
- `backend-saas/core/cache.py` - Quota manager fixes
- `backend-saas/core/integration_rate_limiter.py` - Rate limiter caching
- `backend-saas/core/integration_circuit_breaker.py` - Circuit breaker caching