Atom AI Labs - AI-Powered Multi-Tenant Platform

Fix 2.2M Redis GET Requests in Upstash

Problem Statement

Upstash reports roughly 2.2M GET requests per day, and the count is not dropping. These requests may not show up when inspecting the instance with redis-cli, but they still consume quota.

Root Causes Identified

1. Redis Quota Manager (PRIMARY SUSPECT)

**Every cache operation = 2 Redis GETs**

# core/cache.py:488-517
async def get_async(self, key: str, tenant_id: str | None = None):
    # GET #1: Check quota
    await self.quota_manager.check_quota(tenant_id, plan_type)

    # GET #2: Get actual data
    val = self.client.get(namespaced_key)

**Impact**: 1M cache operations = 2M Redis GETs

**Evidence**:

Quota check at line 504: await self.cache.get_async(quota_key)
Data fetch at line 512: self.client.get(namespaced_key)

2. Tenant Discovery Service

**Every webhook = 1 Redis GET**

# core/tenant_discovery.py:44
cached_tenant_id = await self.cache.get_async(f"discovery:{connector_id}:{external_id}")

3. Rate Limiter + Circuit Breaker

**Every integration API call = 2 Redis GETs**

# Rate limiter
state = await self.redis.get(f"rate_limit:{tenant_id}:{connector_id}")

# Circuit breaker
state = await self.redis.get(f"circuit_breaker:{tenant_id}:{connector_id}")
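
The per-operation costs from the three root causes can be combined into a rough estimate. The sketch below is only arithmetic over those multipliers; the traffic mix passed in at the bottom is hypothetical and chosen for illustration, not measured from the application.

```python
def estimate_redis_gets(cache_ops: int, webhooks: int, integration_calls: int) -> int:
    """Estimate Redis GETs from the per-operation costs listed above.

    Worst case: assumes no local cache absorbs any of the reads.
    """
    gets_per_cache_op = 2      # quota check + data fetch
    gets_per_webhook = 1       # tenant discovery lookup
    gets_per_integration = 2   # rate limiter + circuit breaker
    return (
        cache_ops * gets_per_cache_op
        + webhooks * gets_per_webhook
        + integration_calls * gets_per_integration
    )


# Hypothetical traffic mix, for illustration only.
total = estimate_redis_gets(cache_ops=1_000_000, webhooks=100_000, integration_calls=50_000)
print(total)  # 2200000
```

Plugging in real operation counts from logs would show which of the three sources is worth fixing first.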

---

Immediate Solutions

Solution 1: Enable Redis Suspension (QUICKEST)

# Set on Fly.io
fly secrets set SUSPEND_REDIS=true -a atom-saas

# Verify the secret exists (values are not displayed, only a digest)
fly secrets list -a atom-saas | grep SUSPEND

**This will**:

✅ Stop all distributed Redis operations
✅ Use local memory cache only
✅ Reduce Upstash per-request costs to roughly $0, since no commands are sent

**Trade-off**:

❌ No distributed cache coordination
❌ Each machine has its own cache, so rate-limit and circuit-breaker state is tracked per machine rather than globally
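
How the application reads SUSPEND_REDIS is not shown in this document. The sketch below is one hedged way such a flag could be parsed; `redis_suspended`, `LocalOnlyCache` and `build_cache` are hypothetical names. The point it illustrates is that the string "false" must be parsed explicitly, because any non-empty string is truthy in Python.

```python
import os
from typing import Callable, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def redis_suspended(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when SUSPEND_REDIS is set to a recognised truthy value."""
    env = os.environ if env is None else env
    return env.get("SUSPEND_REDIS", "").strip().lower() in TRUTHY


class LocalOnlyCache:
    """Per-process dictionary cache used when Redis is suspended."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str):
        return self._data.get(key)

    def set(self, key: str, value) -> None:
        self._data[key] = value


def build_cache(redis_factory: Callable[[], object],
                env: Optional[Mapping[str, str]] = None):
    """Pick the local cache when suspended, otherwise build the Redis client."""
    if redis_suspended(env):
        return LocalOnlyCache()
    return redis_factory()
```

With this parsing, setting the secret to `false` re-enables Redis, which a naive `bool(os.getenv(...))` check would not do.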

Solution 2: Reduce Quota Check Frequency

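**Problem**: The quota check runs on every cache operation (line 504), and its result is cached locally for only 10 seconds, so each active quota key triggers a fresh Redis GET about every 10 seconds

**Fix**: Increase the quota result cache TTL from 10s to 60s. The trade-off is that quota enforcement can lag by up to 60 seconds.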

# core/cache.py:90
# OLD: self._quota_result_cache[quota_key] = (is_allowed, time.time() + 10.0)
# NEW:
self._quota_result_cache[quota_key] = (is_allowed, time.time() + 60.0)  # 60 seconds

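**Impact**: Reduces quota GETs by up to 83% (1 - 10/60) for keys that are read at least once every 10 seconds; the saving is smaller for keys that are read less often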
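
The 83% figure follows from the ratio of the two TTLs. A minimal worked version, assuming the key is requested often enough that every expiry triggers exactly one refresh GET (the function name is illustrative):

```python
def max_get_reduction(old_ttl_s: float, new_ttl_s: float) -> float:
    """Upper bound on the fraction of quota GETs removed by a longer TTL.

    Holds for a key requested at least once per old TTL window, so that
    refreshes happen at a rate of one per TTL.
    """
    if old_ttl_s <= 0 or new_ttl_s <= 0:
        raise ValueError("TTLs must be positive")
    return 1.0 - old_ttl_s / new_ttl_s


print(f"{max_get_reduction(10.0, 60.0):.1%}")  # 83.3%
```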

Solution 3: Eliminate Double GET in Quota Check

**Problem**: The quota check itself calls cache.get_async(), which runs the quota check again. This doubles the GETs and risks unbounded recursion.

**Fix**: Use direct Redis for quota checks

# core/cache.py:504
# OLD: current = await self.cache.get_async(quota_key)
# NEW:
if self.client:
    current = self.client.get(quota_key)  # Direct Redis, bypass quota check
else:
    current = None
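
Solutions 2 and 3 combine naturally: read the quota counter straight from the client, and reuse the result locally until it expires. The following is a self-contained sketch under stated assumptions, not the code in core/cache.py. `CountingClient`, `QuotaChecker` and the `quota:{tenant_id}` key format are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CountingClient:
    """Stand-in for a synchronous Redis client that counts GET calls."""

    def __init__(self, data: Optional[Dict[str, str]] = None) -> None:
        self.data = data or {}
        self.get_calls = 0

    def get(self, key: str) -> Optional[str]:
        self.get_calls += 1
        return self.data.get(key)


class QuotaChecker:
    """Quota check with a direct client read and a local result cache."""

    def __init__(self, client, limit: int, ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.client = client
        self.limit = limit
        self.ttl_s = ttl_s
        self.clock = clock
        self._result_cache: Dict[str, Tuple[bool, float]] = {}

    def is_allowed(self, tenant_id: str) -> bool:
        quota_key = f"quota:{tenant_id}"  # hypothetical key format
        cached = self._result_cache.get(quota_key)
        if cached is not None and self.clock() < cached[1]:
            return cached[0]  # still fresh: no Redis traffic
        # Direct read: does not go back through the quota-checked cache path.
        raw = self.client.get(quota_key) if self.client else None
        used = int(raw) if raw is not None else 0
        allowed = used < self.limit
        self._result_cache[quota_key] = (allowed, self.clock() + self.ttl_s)
        return allowed


client = CountingClient({"quota:t1": "5"})
checker = QuotaChecker(client, limit=10)
results = [checker.is_allowed("t1") for _ in range(1000)]
print(client.get_calls)  # 1
```

A thousand checks inside one TTL window cost a single GET, and the check can no longer re-enter itself.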

Solution 4: Pre-warm Tenant Discovery Cache

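**Problem**: Every webhook hits Redis to resolve tenant_id

**Fix**: Pre-populate the discovery cache after OAuth so that webhook lookups do not miss. A pre-warmed Redis key still costs one GET per webhook, so the GET count only drops if the lookup is also served from local memory.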

# After successful OAuth callback
external_id = integration.external_id
cache_key = f"discovery:{connector_id}:{external_id}"
await cache.set_async(cache_key, tenant_id, ttl=3600)  # 1 hour
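
One way to make pre-warming also reduce GETs is to keep a short-lived local copy in front of the shared cache. The sketch below assumes a cache object with `get_async` and `set_async` as used above. `FakeAsyncCache`, `TenantDiscovery`, the 300-second local TTL and the sample IDs are hypothetical.

```python
import asyncio
import time
from typing import Dict, Optional, Tuple


class FakeAsyncCache:
    """Stand-in for the shared cache; counts remote reads."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.remote_gets = 0

    async def get_async(self, key: str) -> Optional[str]:
        self.remote_gets += 1
        return self.store.get(key)

    async def set_async(self, key: str, value: str, ttl: int = 3600) -> None:
        self.store[key] = value  # TTL is ignored in this stand-in


class TenantDiscovery:
    def __init__(self, cache, local_ttl_s: float = 300.0) -> None:
        self.cache = cache
        self.local_ttl_s = local_ttl_s
        self._local: Dict[str, Tuple[str, float]] = {}

    def _remember(self, key: str, tenant_id: str) -> None:
        self._local[key] = (tenant_id, time.monotonic() + self.local_ttl_s)

    async def prewarm(self, connector_id: str, external_id: str, tenant_id: str) -> None:
        key = f"discovery:{connector_id}:{external_id}"
        await self.cache.set_async(key, tenant_id, ttl=3600)
        self._remember(key, tenant_id)

    async def resolve(self, connector_id: str, external_id: str) -> Optional[str]:
        key = f"discovery:{connector_id}:{external_id}"
        hit = self._local.get(key)
        if hit is not None and time.monotonic() < hit[1]:
            return hit[0]  # served locally, no remote GET
        tenant_id = await self.cache.get_async(key)
        if tenant_id is not None:
            self._remember(key, tenant_id)
        return tenant_id


async def demo() -> int:
    cache = FakeAsyncCache()
    discovery = TenantDiscovery(cache)
    await discovery.prewarm("slack", "T123", "tenant-42")
    for _ in range(100):
        assert await discovery.resolve("slack", "T123") == "tenant-42"
    return cache.remote_gets


remote_gets = asyncio.run(demo())
print(remote_gets)  # 0
```

Other machines that did not handle the OAuth callback would still pay one remote GET per key per local TTL window.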

Solution 5: Local Cache for Rate Limiter/Circuit Breaker

**Problem**: Every integration API call hits Redis for state

**Fix**: Use in-memory state with periodic sync

# core/integration_rate_limiter.py
import time

class IntegrationRateLimiter:
    LOCAL_TTL_SECONDS = 5  # how long a Redis result is reused locally

    def __init__(self, redis):
        self.redis = redis
        self._local_state = {}  # key -> {'result': ..., 'time': ...}
        self._last_sync = time.time()  # not used in this excerpt

    async def check_rate_limit(self, tenant_id: str, connector_id: str):
        key = f"{tenant_id}:{connector_id}"

        # Serve from the local cache while the entry is fresh.
        # Calls answered here are not counted in Redis during that window.
        entry = self._local_state.get(key)
        if entry is not None and time.time() - entry['time'] < self.LOCAL_TTL_SECONDS:
            return entry['result']

        # Otherwise fall back to Redis and remember the answer
        result = await self._check_redis(tenant_id, connector_id)
        self._local_state[key] = {'result': result, 'time': time.time()}
        return result
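
The excerpt above is a short-TTL result cache. The "periodic sync" half of the fix could be sketched as counting calls locally and pushing the accumulated count to Redis at most once per interval. Everything here is an assumption for illustration: `FakeAsyncRedis`, `BatchedRateCounter` and the 5-second interval are hypothetical, and window expiry is omitted. The only real client call it mirrors is `incrby`.

```python
import asyncio
import time
from typing import Callable, Dict, Optional


class FakeAsyncRedis:
    """Stand-in for an async Redis client; counts round trips."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}
        self.calls = 0

    async def incrby(self, key: str, amount: int) -> int:
        self.calls += 1
        self.counters[key] = self.counters.get(key, 0) + amount
        return self.counters[key]


class BatchedRateCounter:
    """Counts calls locally and syncs to Redis at most once per interval.

    Window expiry/reset is deliberately left out to keep the sketch short.
    """

    def __init__(self, redis, limit: int, sync_interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.redis = redis
        self.limit = limit
        self.sync_interval_s = sync_interval_s
        self.clock = clock
        self._pending: Dict[str, int] = {}       # local hits not yet in Redis
        self._shared_total: Dict[str, int] = {}  # total returned by last sync
        self._last_sync: Dict[str, float] = {}

    async def allow(self, tenant_id: str, connector_id: str) -> bool:
        key = f"rate_limit:{tenant_id}:{connector_id}"
        self._pending[key] = self._pending.get(key, 0) + 1
        now = self.clock()
        last: Optional[float] = self._last_sync.get(key)
        if last is None or now - last >= self.sync_interval_s:
            # One round trip publishes every hit since the previous sync.
            self._shared_total[key] = await self.redis.incrby(key, self._pending[key])
            self._pending[key] = 0
            self._last_sync[key] = now
        return self._shared_total.get(key, 0) + self._pending[key] <= self.limit


async def demo():
    now = [0.0]  # fake clock so the example does not sleep
    fake_redis = FakeAsyncRedis()
    counter = BatchedRateCounter(fake_redis, limit=100, clock=lambda: now[0])
    results = [await counter.allow("t1", "slack") for _ in range(50)]
    now[0] = 6.0  # past the 5-second sync interval
    results.append(await counter.allow("t1", "slack"))
    return fake_redis, results


fake_redis, results = asyncio.run(demo())
print(fake_redis.calls, fake_redis.counters["rate_limit:t1:slack"])  # 2 51
```

Fifty-one calls cost two Redis round trips here. The price is accuracy: between syncs, other machines do not see this machine's pending hits, so the global limit can be overshot.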

---

Diagnostic Steps

Step 1: Verify Current Redis Usage

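# Check if SUSPEND_REDIS is set
fly secrets list -a atom-saas | grep -i redis

# Expected: a row whose name is SUSPEND_REDIS.
# Secret values are never printed (only a digest), so this confirms that the
# secret exists, not that it is set to "true".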

Step 2: Enable Redis Metrics

Add to .env:

TRACK_REDIS_METRICS=true

Deploy:

fly deploy -a atom-saas

Check logs:

fly logs -a atom-saas --json | grep "Redis Metrics"
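
To see which of the suspected sources dominates, GETs can also be tallied by key prefix at the client boundary. This is a hedged sketch rather than what TRACK_REDIS_METRICS does. `PrefixCountingClient` and `DictClient` are hypothetical, and the prefixes are the ones appearing in the key formats quoted earlier, plus an assumed `quota:` prefix.

```python
from collections import Counter


class DictClient:
    """Stand-in for a synchronous Redis client."""

    def __init__(self) -> None:
        self.data = {}

    def get(self, key: str):
        return self.data.get(key)


class PrefixCountingClient:
    """Wraps a Redis-like client and tallies GETs by key prefix."""

    def __init__(self, client) -> None:
        self._client = client
        self.gets_by_prefix = Counter()

    def get(self, key: str):
        prefix = key.split(":", 1)[0]  # text before the first colon
        self.gets_by_prefix[prefix] += 1
        return self._client.get(key)


client = PrefixCountingClient(DictClient())
for key in [
    "quota:t1",
    "rate_limit:t1:slack",
    "circuit_breaker:t1:slack",
    "quota:t2",
    "discovery:slack:T123",
]:
    client.get(key)

for prefix, count in client.gets_by_prefix.most_common():
    print(prefix, count)
```

Logging `gets_by_prefix` periodically would show whether quota checks really are the primary source before any fix is deployed.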

Step 3: Monitor Upstash Dashboard

Watch for:

✅ GET requests decreasing
✅ Quota usage stabilizing
✅ No spike after deployment

---

Long-Term Architecture Fix

Option 1: Remove Distributed Redis Entirely

**When**: Multi-machine coordination is not critical

**How**:

Set SUSPEND_REDIS=true permanently
Use local memory cache only
Accept cache inconsistency across machines

**Pros**:

✅ Zero Upstash costs
✅ Simpler architecture
✅ Faster (no network latency)

**Cons**:

❌ No distributed coordination
❌ Each machine has its own cache
❌ Cache warmth varies by machine

Option 2: Hybrid Approach

**When**: Distributed state is needed for only some features

**How**:

Use local cache for hot data (quota, rate limits, circuit state)
Use Redis for cold data only (tenant discovery, session store)
Implement cache warming strategy
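
The hybrid split can be sketched as a thin router that chooses a tier by key prefix. This is an illustration under assumptions, not a proposed final design. `HybridCache` and `CountingRemote` are hypothetical names, and the `quota:` prefix is assumed.

```python
class CountingRemote:
    """Stand-in for Redis that counts GETs."""

    def __init__(self) -> None:
        self.data = {}
        self.gets = 0

    def get(self, key: str):
        self.gets += 1
        return self.data.get(key)

    def set(self, key: str, value) -> None:
        self.data[key] = value


class HybridCache:
    """Routes hot keys to process memory and everything else to Redis."""

    # Hot data kept per machine (prefixes are illustrative).
    LOCAL_PREFIXES = ("quota:", "rate_limit:", "circuit_breaker:")

    def __init__(self, remote) -> None:
        self.remote = remote
        self.local = {}

    def _is_local(self, key: str) -> bool:
        return key.startswith(self.LOCAL_PREFIXES)

    def get(self, key: str):
        if self._is_local(key):
            return self.local.get(key)
        return self.remote.get(key)

    def set(self, key: str, value) -> None:
        if self._is_local(key):
            self.local[key] = value
        else:
            self.remote.set(key, value)


remote = CountingRemote()
cache = HybridCache(remote)
cache.set("quota:t1", 5)
cache.set("discovery:slack:T123", "tenant-42")
for _ in range(3):
    cache.get("quota:t1")           # served from process memory
cache.get("discovery:slack:T123")   # served from Redis
print(remote.gets)  # 1
```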

**Pros**:

✅ Large cost reduction (estimated at around 90%, depending on traffic mix)
✅ Keep critical distributed features
✅ Best of both worlds

**Cons**:

❌ More complex
❌ Two cache layers to manage

Option 3: Redis Alternatives

**DragonflyDB** (Self-hosted on Fly.io):

✅ Same protocol as Redis
✅ Higher performance
✅ No per-request pricing
❌ Need to manage server

**KeyDB** (Multi-threaded Redis):

✅ Drop-in Redis replacement
✅ Better CPU utilization
❌ Still need to host

---

Recommended Action Plan

Phase 1: Stop the Bleeding (TODAY)

fly secrets set SUSPEND_REDIS=true -a atom-saas

Phase 2: Optimize Quota Checks (THIS WEEK)

Increase quota cache TTL to 60s
Eliminate double GET in quota check
Deploy and monitor

Phase 3: Add Local Caching (THIS MONTH)

Add local cache for rate limiter
Add local cache for circuit breaker
Pre-warm tenant discovery cache

Phase 4: Evaluate Architecture (NEXT QUARTER)

Decide: Keep Redis vs. Remove entirely vs. Self-host
Implement chosen solution
Monitor costs

---

Success Metrics

**Before** (Current):

2.2M GET requests/day
Unknown Upstash costs
Potential budget suspension risk

**After** (Target):

< 100K GET requests/day (95% reduction)
$0 Upstash costs (if suspended)
No budget suspension risk
Cache hit ratio > 80%
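
Both targets are simple ratios and can be checked mechanically against the dashboard numbers. A small sketch with illustrative helper names; the hit and miss counts in the second call are made up:

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


print(round(reduction_pct(2_200_000, 100_000), 1))  # 95.5
print(hit_ratio(hits=850, misses=150))              # 0.85
```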

---

Rollback Plan

If issues arise after changes:

# Re-enable Redis
fly secrets set SUSPEND_REDIS=false -a atom-saas

# Redeploy previous version
fly deploy -a atom-saas --image registry.fly.io/atom-saas:deployment-<PREVIOUS_IMAGE_ID>

Monitor for 24 hours before closing the issue.