Fix 2.2M Redis GET Requests in Upstash
Problem Statement
Upstash is reporting 2.2M GET requests, and the count is not decreasing. These requests may not be visible via redis-cli but still count against the Upstash quota.
Root Causes Identified
1. Redis Quota Manager (PRIMARY SUSPECT)
**Every cache operation = 2 Redis GETs**

```python
# core/cache.py:488-517
async def get_async(self, key: str, tenant_id: str | None = None):
    # GET #1: Check quota
    await self.quota_manager.check_quota(tenant_id, plan_type)
    # GET #2: Get actual data
    val = self.client.get(namespaced_key)
```

**Impact**: 1M cache operations = 2M Redis GETs
**Evidence**:
- Quota check at line 504: `await self.cache.get_async(quota_key)`
- Data fetch at line 512: `self.client.get(namespaced_key)`
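The two-GET pattern compounds quickly. A minimal counting sketch (with a dict-backed stand-in for the Redis client; `FakeRedis` and the key names are illustrative, not existing code) shows how every cache read doubles the billed command count:

```python
class FakeRedis:
    """Dict-backed stand-in that counts GET commands as Upstash would bill them."""
    def __init__(self):
        self.data = {}
        self.get_count = 0

    def get(self, key):
        self.get_count += 1
        return self.data.get(key)

client = FakeRedis()

def get_with_quota(quota_key, data_key):
    client.get(quota_key)        # GET #1: quota counter
    return client.get(data_key)  # GET #2: the cached value itself

for _ in range(1000):
    get_with_quota("quota:t1", "cache:t1:some-key")

print(client.get_count)  # 2000: every cache read bills two GETs
```

This is why the fixes below target the quota GET specifically: removing it halves the baseline command volume before any other optimization.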
2. Tenant Discovery Service
**Every webhook = 1 Redis GET**
```python
# core/tenant_discovery.py:44
cached_tenant_id = await self.cache.get_async(
    f"discovery:{connector_id}:{external_id}"
)
```

3. Rate Limiter + Circuit Breaker
**Every integration API call = 2 Redis GETs**
```python
# Rate limiter
state = await self.redis.get(f"rate_limit:{tenant_id}:{connector_id}")
# Circuit breaker
state = await self.redis.get(f"circuit_breaker:{tenant_id}:{connector_id}")
```

---
Immediate Solutions
Solution 1: Enable Redis Suspension (QUICKEST)
```bash
# Set on Fly.io
fly secrets set SUSPEND_REDIS=true -a atom-saas

# Verify
fly secrets list -a atom-saas | grep SUSPEND
```

**This will**:
- ✅ Stop all distributed Redis operations
- ✅ Use local memory cache only
- ✅ Reduce Upstash costs to $0
**Trade-off**:
- ❌ No distributed cache coordination
- ❌ Each machine has its own cache
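How the flag takes effect depends on how `core/cache.py` reads it; as a hedged sketch, env flags like this are typically parsed along these lines (the accepted truthy spellings here are an assumption, not the app's actual logic):

```python
import os

def redis_suspended() -> bool:
    # Assumed parsing: common truthy spellings suspend Redis; anything
    # else (including unset) keeps the distributed cache enabled.
    return os.environ.get("SUSPEND_REDIS", "").strip().lower() in {"true", "1", "yes"}

os.environ["SUSPEND_REDIS"] = "true"
print(redis_suspended())  # True
```

Worth confirming the exact spelling the app checks before relying on the secret, since `fly secrets set SUSPEND_REDIS=True` vs `true` can silently differ under a strict comparison.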
Solution 2: Reduce Quota Check Frequency
**Problem**: Quota checked on EVERY cache operation (line 504)
**Fix**: Increase quota result cache from 10s to 60s
```python
# core/cache.py:90
# OLD: self._quota_result_cache[quota_key] = (is_allowed, time.time() + 10.0)
# NEW:
self._quota_result_cache[quota_key] = (is_allowed, time.time() + 60.0)  # 60 seconds
```

**Impact**: Reduces quota GETs by ~83% (one check per 60s instead of per 10s)
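The effect of the longer TTL can be sketched with a small memoizing wrapper and an injected clock (class and variable names are illustrative; the real `_quota_result_cache` lives inside core/cache.py):

```python
import time

class QuotaResultCache:
    """Memoize quota decisions so Redis is asked at most once per TTL window."""
    def __init__(self, ttl: float = 60.0, now=time.time):
        self.ttl = ttl
        self.now = now
        self._cache = {}     # quota_key -> (is_allowed, expires_at)
        self.redis_gets = 0  # stand-in counter for real Redis GETs

    def check(self, quota_key: str) -> bool:
        entry = self._cache.get(quota_key)
        if entry is not None and entry[1] > self.now():
            return entry[0]      # served locally, no Redis GET
        self.redis_gets += 1     # the real Redis GET would happen here
        is_allowed = True        # assume quota OK for the sketch
        self._cache[quota_key] = (is_allowed, self.now() + self.ttl)
        return is_allowed

# Simulate one quota check per second for five minutes:
t = [0.0]
cache = QuotaResultCache(ttl=60.0, now=lambda: t[0])
for second in range(300):
    t[0] = float(second)
    cache.check("quota:tenant1")
print(cache.redis_gets)  # 5 (vs 30 with a 10s TTL, an ~83% reduction)
```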
Solution 3: Eliminate Double GET in Quota Check
**Problem**: The quota check itself calls `cache.get_async()`, which checks quota again (risking infinite recursion)
**Fix**: Use direct Redis for quota checks
```python
# core/cache.py:504
# OLD: current = await self.cache.get_async(quota_key)
# NEW:
if self.client:
    current = self.client.get(quota_key)  # Direct Redis, bypasses quota check
else:
    current = None
```

Solution 4: Pre-warm Tenant Discovery Cache
**Problem**: Every webhook hits Redis to resolve tenant_id
**Fix**: Cache pre-population after OAuth
```python
# After successful OAuth callback
external_id = integration.external_id
cache_key = f"discovery:{connector_id}:{external_id}"
await cache.set_async(cache_key, tenant_id, ttl=3600)  # 1 hour
```

Solution 5: Local Cache for Rate Limiter/Circuit Breaker
**Problem**: Every integration API call hits Redis for state
**Fix**: Use in-memory state with periodic sync
```python
# core/integration_rate_limiter.py
import time

class IntegrationRateLimiter:
    def __init__(self, redis):
        self.redis = redis
        self._local_state = {}  # Add local cache
        self._last_sync = time.time()

    async def check_rate_limit(self, tenant_id: str, connector_id: str):
        key = f"{tenant_id}:{connector_id}"
        # Check local cache first (5-second TTL)
        if key in self._local_state:
            if time.time() - self._local_state[key]['time'] < 5:
                return self._local_state[key]['result']
        # Fall back to Redis
        result = await self._check_redis(tenant_id, connector_id)
        self._local_state[key] = {'result': result, 'time': time.time()}
        return result
```

---
Diagnostic Steps
Step 1: Verify Current Redis Usage
```bash
# Check if SUSPEND_REDIS is set
fly secrets list -a atom-saas | grep -i redis

# Expected output:
# SUSPEND_REDIS=true
```

Step 2: Enable Redis Metrics
Add to .env:

```bash
TRACK_REDIS_METRICS=true
```

Deploy:

```bash
fly deploy -a atom-saas
```

Check logs:

```bash
fly logs -a atom-saas --json | grep "Redis Metrics"
```

Step 3: Monitor Upstash Dashboard
Watch for:
- ✅ GET requests decreasing
- ✅ Quota usage stabilizing
- ✅ No spike after deployment
---
Long-Term Architecture Fix
Option 1: Remove Distributed Redis Entirely
**When**: Multi-machine coordination not critical
**How**:
- Set `SUSPEND_REDIS=true` permanently
- Use local memory cache only
- Accept cache inconsistency across machines
**Pros**:
- ✅ Zero Upstash costs
- ✅ Simpler architecture
- ✅ Faster (no network latency)
**Cons**:
- ❌ No distributed coordination
- ❌ Each machine has own cache
- ❌ Cache warmth varies by machine
Option 2: Hybrid Approach
**When**: Need distributed for some features only
**How**:
- Use local cache for hot data (quota, rate limits, circuit state)
- Use Redis for cold data only (tenant discovery, session store)
- Implement cache warming strategy
**Pros**:
- ✅ 90% cost reduction
- ✅ Keep critical distributed features
- ✅ Best of both worlds
**Cons**:
- ❌ More complex
- ❌ Two cache layers to manage
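The hot/cold split can be sketched as a routing wrapper in front of the Redis client (prefix list, class names, and keys are illustrative, not existing code):

```python
# Route "hot" keys to process-local memory; everything else to Redis.
HOT_PREFIXES = ("quota:", "rate_limit:", "circuit_breaker:")

class RecordingRedis:
    """Stand-in that records how many commands would be billed."""
    def __init__(self):
        self.data = {}
        self.commands = 0

    def get(self, key):
        self.commands += 1
        return self.data.get(key)

    def set(self, key, value):
        self.commands += 1
        self.data[key] = value

class HybridCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.local = {}

    def get(self, key):
        if key.startswith(HOT_PREFIXES):
            return self.local.get(key)  # no billed command
        return self.redis.get(key)

    def set(self, key, value):
        if key.startswith(HOT_PREFIXES):
            self.local[key] = value
        else:
            self.redis.set(key, value)

cache = HybridCache(RecordingRedis())
cache.set("rate_limit:t1:slack", 5)      # stays local
cache.set("discovery:slack:U123", "t1")  # goes to Redis
print(cache.redis.commands)  # 1
```

The design choice is the prefix list: it should name exactly the high-frequency state identified above (quota, rate limits, circuit state), so the distributed layer only sees the low-frequency lookups.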
Option 3: Redis Alternatives
**DragonflyDB** (Self-hosted on Fly.io):
- ✅ Same protocol as Redis
- ✅ Higher performance
- ✅ No per-request pricing
- ❌ Need to manage server
**KeyDB** (Multi-threaded Redis):
- ✅ Drop-in Redis replacement
- ✅ Better CPU utilization
- ❌ Still need to host
---
Recommended Action Plan
Phase 1: Stop the Bleeding (TODAY)
```bash
fly secrets set SUSPEND_REDIS=true -a atom-saas
```

Phase 2: Optimize Quota Checks (THIS WEEK)
- Increase quota cache TTL to 60s
- Eliminate double GET in quota check
- Deploy and monitor
Phase 3: Add Local Caching (THIS MONTH)
- Add local cache for rate limiter
- Add local cache for circuit breaker
- Pre-warm tenant discovery cache
Phase 4: Evaluate Architecture (NEXT QUARTER)
- Decide: Keep Redis vs. Remove entirely vs. Self-host
- Implement chosen solution
- Monitor costs
---
Success Metrics
**Before** (Current):
- 2.2M GET requests/day
- Unknown Upstash costs
- Potential budget suspension risk
**After** (Target):
- < 100K GET requests/day (95% reduction)
- $0 Upstash costs (if suspended)
- No budget suspension risk
- Cache hit ratio > 80%
---
Rollback Plan
If issues arise after changes:
```bash
# Re-enable Redis
fly secrets set SUSPEND_REDIS=false -a atom-saas

# Redeploy previous version
fly deploy --image=atom-saas:deployment-<PREVIOUS_IMAGE_ID>
```

Monitor for 24 hours before closing the issue.