ATOM Documentation

Fix 2.2M Redis GET Requests in Upstash

Problem Statement

Upstash reports 2.2M GET requests, and the count is not coming down. These requests may not show up when inspecting the database with redis-cli, but they still count against quota.

Root Causes Identified

1. Redis Quota Manager (PRIMARY SUSPECT)

**Every cache operation = 2 Redis GETs**

# core/cache.py:488-517 (excerpt; plan_type and namespaced_key are set in lines not shown)
async def get_async(self, key: str, tenant_id: str | None = None):
    # GET #1: Check quota
    await self.quota_manager.check_quota(tenant_id, plan_type)

    # GET #2: Get actual data
    val = self.client.get(namespaced_key)

**Impact**: 1M cache operations = 2M Redis GETs

**Evidence**:

  • Quota check at line 504: await self.cache.get_async(quota_key)
  • Data fetch at line 512: self.client.get(namespaced_key)
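To make the doubling concrete, here is a minimal sketch that counts GETs with a stand-in client. `CountingClient`, `QuotaCheckedCache`, and the `quota:` key prefix are illustrative names, not the actual classes in core/cache.py; the only point is that a quota lookup in front of every read doubles the GET count.

```python
class CountingClient:
    """Stand-in for a Redis client that records every GET it receives."""

    def __init__(self):
        self.get_calls = []

    def get(self, key):
        self.get_calls.append(key)
        return None  # the value is irrelevant for counting


class QuotaCheckedCache:
    """Hypothetical cache whose reads first look up a quota counter in Redis."""

    def __init__(self, client):
        self.client = client

    def get(self, key, tenant_id):
        self.client.get(f"quota:{tenant_id}")         # GET #1: quota lookup
        return self.client.get(f"{tenant_id}:{key}")  # GET #2: actual data


client = CountingClient()
cache = QuotaCheckedCache(client)
for i in range(1_000):
    cache.get(f"item:{i}", tenant_id="t1")

total_gets = len(client.get_calls)
quota_gets = sum(1 for k in client.get_calls if k.startswith("quota:"))
```

After 1,000 reads, `total_gets` is 2,000 and half of those are quota lookups.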

2. Tenant Discovery Service

**Every webhook = 1 Redis GET**

# core/tenant_discovery.py:44
cached_tenant_id = await self.cache.get_async(f"discovery:{connector_id}:{external_id}")

3. Rate Limiter + Circuit Breaker

**Every integration API call = 2 Redis GETs**

# Rate limiter
state = await self.redis.get(f"rate_limit:{tenant_id}:{connector_id}")

# Circuit breaker
state = await self.redis.get(f"circuit_breaker:{tenant_id}:{connector_id}")

---

Immediate Solutions

Solution 1: Enable Redis Suspension (QUICKEST)

# Set on Fly.io
fly secrets set SUSPEND_REDIS=true -a atom-saas

# Verify
fly secrets list -a atom-saas | grep SUSPEND

**This will**:

  • ✅ Stop all distributed Redis operations
  • ✅ Use local memory cache only
  • ✅ Reduce Upstash request volume, and with it per-request costs, to $0

**Trade-off**:

  • ❌ No distributed cache coordination
  • ❌ Each machine has its own cache
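How the application reads the flag is not shown here. As a rough sketch, assuming the flag is a plain environment variable and that a local fallback cache exists, the switch could look like this. `redis_suspended`, `LocalOnlyCache`, and `build_cache` are hypothetical names, and the accepted truthy strings are an assumption.

```python
import os


def redis_suspended(env=None):
    """Return True when SUSPEND_REDIS holds a truthy string (assumed convention)."""
    env = os.environ if env is None else env
    return env.get("SUSPEND_REDIS", "").strip().lower() in {"1", "true", "yes", "on"}


class LocalOnlyCache:
    """Per-process dictionary cache; nothing is shared between machines."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def build_cache(make_redis_cache, env=None):
    """Pick the cache backend once at startup, based on the flag."""
    if redis_suspended(env):
        return LocalOnlyCache()
    return make_redis_cache()
```

Passing the environment in as a dictionary keeps the switch easy to test without touching real secrets.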

Solution 2: Reduce Quota Check Frequency

**Problem**: The quota is checked on every cache operation (line 504)

**Fix**: Raise the TTL of the locally cached quota result from 10 s to 60 s

# core/cache.py:90
# OLD: self._quota_result_cache[quota_key] = (is_allowed, time.time() + 10.0)
# NEW:
self._quota_result_cache[quota_key] = (is_allowed, time.time() + 60.0)  # 60 seconds

**Impact**: Cuts quota GETs by up to 83% for keys under steady traffic (one refresh per 60 s instead of six)
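The 83% figure can be reproduced with a small simulation. This is a sketch under simplifying assumptions: one quota key, one check per second for ten minutes, and a `QuotaResultCache` class that is a stand-in for the real `_quota_result_cache` logic rather than a copy of it.

```python
class QuotaResultCache:
    """Caches a quota decision per key until its expiry time."""

    def __init__(self, ttl_seconds, fetch):
        self.ttl = ttl_seconds
        self.fetch = fetch   # called on a miss; stands in for the Redis GET
        self._entries = {}   # key -> (is_allowed, expires_at)

    def check(self, key, now):
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: no backend call
        is_allowed = self.fetch(key)
        self._entries[key] = (is_allowed, now + self.ttl)
        return is_allowed


def backend_calls(ttl_seconds, duration_seconds=600):
    """Count backend fetches for one check per second over the given duration."""
    calls = []

    def fetch(key):
        calls.append(key)
        return True

    cache = QuotaResultCache(ttl_seconds, fetch)
    for now in range(duration_seconds):
        cache.check("quota:t1", now)
    return len(calls)


calls_10s = backend_calls(10)
calls_60s = backend_calls(60)
reduction = 1 - calls_60s / calls_10s
```

Under these assumptions the 10 s TTL produces 60 backend fetches and the 60 s TTL produces 10, a reduction of 5/6. Keys that are read less often than once per TTL see a smaller benefit.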

Solution 3: Eliminate Double GET in Quota Check

**Problem**: The quota check itself calls cache.get_async(), which checks the quota again. This costs an extra GET and risks unbounded recursion.

**Fix**: Use direct Redis for quota checks

# core/cache.py:504
# OLD: current = await self.cache.get_async(quota_key)
# NEW (assumes the raw Redis client is reachable here as self.client):
if self.client:
    current = self.client.get(quota_key)  # Direct Redis, bypass quota check
else:
    current = None  # Redis unavailable or suspended
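The call path after this change can be sketched end to end. The classes below are simplified stand-ins, not the real ones: `FakeRedis` replaces the Redis client, and the `quota:` key prefix, the `limit` parameter, and the constructor signatures are assumptions. The quota manager reads its counter straight from the client, so it never re-enters `get_async`.

```python
import asyncio


class FakeRedis:
    """In-memory stand-in that records the keys it is asked for."""

    def __init__(self, data):
        self.data = dict(data)
        self.gets = []

    def get(self, key):
        self.gets.append(key)
        return self.data.get(key)


class QuotaManager:
    def __init__(self, client, limit):
        self.client = client
        self.limit = limit

    async def check_quota(self, tenant_id):
        if self.client is None:
            return True  # no Redis available: nothing to meter
        # Direct read: does not go through Cache.get_async, so no recursion
        current = self.client.get(f"quota:{tenant_id}")
        return int(current or 0) < self.limit


class Cache:
    def __init__(self, client, quota_manager):
        self.client = client
        self.quota_manager = quota_manager

    async def get_async(self, key, tenant_id):
        if not await self.quota_manager.check_quota(tenant_id):
            return None
        return self.client.get(f"{tenant_id}:{key}")


redis = FakeRedis({"quota:t1": "5", "t1:profile": "alice"})
cache = Cache(redis, QuotaManager(redis, limit=100))
value = asyncio.run(cache.get_async("profile", "t1"))
```

In this sketch one read issues exactly two GETs, the quota key followed by the data key, and the call stack stays flat.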

Solution 4: Pre-warm Tenant Discovery Cache

**Problem**: Every webhook hits Redis to resolve tenant_id

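**Fix**: Pre-populate the discovery cache after OAuth so webhook lookups hit instead of miss. A hit still costs one GET unless a local in-memory layer answers first.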

# After successful OAuth callback
external_id = integration.external_id
cache_key = f"discovery:{connector_id}:{external_id}"
await cache.set_async(cache_key, tenant_id, ttl=3600)  # 1 hour
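A self-contained version of the same idea, for experimenting outside the app. `InMemoryAsyncCache` is a hypothetical stand-in that only mimics the `set_async`/`get_async` shape used above, and `prewarm_discovery`, `resolve_tenant`, and the sample IDs are illustrative.

```python
import asyncio
import time


class InMemoryAsyncCache:
    """Minimal async cache with per-entry expiry, for demonstration only."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    async def set_async(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)

    async def get_async(self, key):
        entry = self._data.get(key)
        if entry is None or time.monotonic() >= entry[1]:
            return None
        return entry[0]


def discovery_key(connector_id, external_id):
    return f"discovery:{connector_id}:{external_id}"


async def prewarm_discovery(cache, connector_id, external_id, tenant_id, ttl=3600):
    """Store the tenant mapping right after OAuth succeeds."""
    await cache.set_async(discovery_key(connector_id, external_id), tenant_id, ttl=ttl)


async def resolve_tenant(cache, connector_id, external_id):
    """Lookup performed when a webhook arrives."""
    return await cache.get_async(discovery_key(connector_id, external_id))


async def _demo():
    cache = InMemoryAsyncCache()
    before = await resolve_tenant(cache, "slack", "T123")
    await prewarm_discovery(cache, "slack", "T123", "tenant-42")
    after = await resolve_tenant(cache, "slack", "T123")
    return before, after


before, after = asyncio.run(_demo())
```

Because the entry expires after the TTL, webhooks that arrive more than an hour after OAuth fall back to the normal miss path unless the entry is refreshed.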

Solution 5: Local Cache for Rate Limiter/Circuit Breaker

**Problem**: Every integration API call hits Redis for state

**Fix**: Use in-memory state with periodic sync

# core/integration_rate_limiter.py
import time


class IntegrationRateLimiter:
    LOCAL_TTL_SECONDS = 5  # how long a cached decision is reused

    def __init__(self, redis):
        self.redis = redis
        self._local_state = {}  # key -> {'result': ..., 'time': ...}
        self._last_sync = time.monotonic()  # reserved for periodic sync (unused here)

    async def check_rate_limit(self, tenant_id: str, connector_id: str):
        key = f"{tenant_id}:{connector_id}"

        # Serve from the local cache while the entry is younger than the TTL
        entry = self._local_state.get(key)
        if entry is not None and time.monotonic() - entry['time'] < self.LOCAL_TTL_SECONDS:
            return entry['result']

        # Fall back to Redis (_check_redis is the existing Redis-backed check, not shown)
        result = await self._check_redis(tenant_id, connector_id)
        self._local_state[key] = {'result': result, 'time': time.monotonic()}
        return result
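The circuit breaker can reuse the same pattern. The sketch below is one possible shape, not existing code: `CachedStateReader` is a hypothetical helper that wraps any state lookup, and the clock is injectable so the 5-second window can be tested without sleeping.

```python
import time


class CachedStateReader:
    """Caches the result of a state lookup per key for a short TTL."""

    def __init__(self, load_state, ttl_seconds=5.0, clock=time.monotonic):
        self._load_state = load_state  # e.g. a function that does the Redis GET
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # key -> (state, loaded_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]
        state = self._load_state(key)
        self._entries[key] = (state, now)
        return state


# Demonstration with a fake clock and a fake loader that counts calls
fake_now = [0.0]
loads = []


def load_from_redis(key):
    loads.append(key)
    return "closed"


reader = CachedStateReader(load_from_redis, ttl_seconds=5.0, clock=lambda: fake_now[0])
for t in (0.0, 1.0, 4.9, 5.0, 9.0, 10.0):
    fake_now[0] = t
    reader.get("circuit_breaker:t1:slack")
```

Six reads over ten seconds trigger three loads here (at t = 0, 5, and 10). The cost is that a breaker opened on another machine may take up to the TTL to be noticed locally.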

---

Diagnostic Steps

Step 1: Verify Current Redis Usage

# Check whether SUSPEND_REDIS is set
fly secrets list -a atom-saas | grep -i redis

# Expected output: a row naming SUSPEND_REDIS
# (fly secrets list shows secret names and digests, not their values)

Step 2: Enable Redis Metrics

Add to .env:

TRACK_REDIS_METRICS=true

Deploy:

fly deploy -a atom-saas

Check logs:

fly logs -a atom-saas --json | grep "Redis Metrics"
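What TRACK_REDIS_METRICS emits is not specified here. If the existing metrics turn out to be too coarse, a thin counting wrapper around the client is one way to see which commands dominate. Everything below is hypothetical: `MeteredClient`, `DictClient`, and the exact "Redis Metrics" line format are assumptions chosen to match the grep above.

```python
from collections import Counter


class MeteredClient:
    """Wraps a Redis-like client and counts commands by name."""

    def __init__(self, client):
        self._client = client
        self.counts = Counter()

    def get(self, key):
        self.counts["GET"] += 1
        return self._client.get(key)

    def set(self, key, value):
        self.counts["SET"] += 1
        return self._client.set(key, value)

    def summary(self):
        """One log line with per-command totals, sorted by command name."""
        parts = " ".join(f"{cmd}={n}" for cmd, n in sorted(self.counts.items()))
        return f"Redis Metrics: {parts}"


class DictClient:
    """In-memory stand-in for the real client."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        return True


metered = MeteredClient(DictClient())
metered.set("a", "1")
metered.get("a")
metered.get("b")
line = metered.summary()
```

Logging `summary()` on a timer, or per key prefix, would show whether quota lookups or data reads account for most GETs.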

Step 3: Monitor Upstash Dashboard

Watch for:

  • ✅ GET requests decreasing
  • ✅ Quota usage stabilizing
  • ✅ No spike after deployment

---

Long-Term Architecture Fix

Option 1: Remove Distributed Redis Entirely

**When**: Multi-machine coordination not critical

**How**:

  1. Set SUSPEND_REDIS=true permanently
  2. Use local memory cache only
  3. Accept cache inconsistency across machines

**Pros**:

  • ✅ Zero Upstash costs
  • ✅ Simpler architecture
  • ✅ Faster (no network latency)

**Cons**:

  • ❌ No distributed coordination
  • ❌ Each machine has own cache
  • ❌ Cache warmth varies by machine

Option 2: Hybrid Approach

**When**: Need distributed for some features only

**How**:

  1. Use local cache for hot data (quota, rate limits, circuit state)
  2. Use Redis for cold data only (tenant discovery, session store)
  3. Implement cache warming strategy

**Pros**:

  • ✅ Estimated 90% cost reduction (to be confirmed against Upstash metrics)
  • ✅ Keep critical distributed features
  • ✅ Best of both worlds

**Cons**:

  • ❌ More complex
  • ❌ Two cache layers to manage
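The hot/cold split in this option could be routed by key prefix. The sketch below assumes such a scheme: `HybridCache` and `FakeRedis` are hypothetical, the `rate_limit:`, `circuit_breaker:`, and `discovery:` prefixes come from the snippets earlier in this document, and `quota:` is an assumed prefix.

```python
# Prefixes served from process memory; everything else goes to Redis
HOT_PREFIXES = ("quota:", "rate_limit:", "circuit_breaker:")


class HybridCache:
    def __init__(self, redis_client):
        self._local = {}
        self._redis = redis_client

    def tier_for(self, key):
        return "local" if key.startswith(HOT_PREFIXES) else "redis"

    def get(self, key):
        if self.tier_for(key) == "local":
            return self._local.get(key)
        return self._redis.get(key)

    def set(self, key, value):
        if self.tier_for(key) == "local":
            self._local[key] = value
        else:
            self._redis.set(key, value)


class FakeRedis:
    """In-memory stand-in that counts GETs."""

    def __init__(self):
        self.data = {}
        self.get_count = 0

    def get(self, key):
        self.get_count += 1
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


redis = FakeRedis()
cache = HybridCache(redis)
cache.set("rate_limit:t1:slack", 3)
cache.set("discovery:slack:T123", "tenant-42")
hot = cache.get("rate_limit:t1:slack")
cold = cache.get("discovery:slack:T123")
```

Only the discovery lookup reaches Redis in this example. Hot keys never leave the process, which is also why their values can differ between machines.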

Option 3: Redis Alternatives

**DragonflyDB** (Self-hosted on Fly.io):

  • ✅ Same protocol as Redis
  • ✅ Higher performance
  • ✅ No per-request pricing
  • ❌ Need to manage server

**KeyDB** (Multi-threaded Redis):

  • ✅ Drop-in Redis replacement
  • ✅ Better CPU utilization
  • ❌ Still need to host

---

Phase 1: Stop the Bleeding (TODAY)

fly secrets set SUSPEND_REDIS=true -a atom-saas

Phase 2: Optimize Quota Checks (THIS WEEK)

  • Increase quota cache TTL to 60s
  • Eliminate double GET in quota check
  • Deploy and monitor

Phase 3: Add Local Caching (THIS MONTH)

  • Add local cache for rate limiter
  • Add local cache for circuit breaker
  • Pre-warm tenant discovery cache

Phase 4: Evaluate Architecture (NEXT QUARTER)

  • Decide: Keep Redis vs. Remove entirely vs. Self-host
  • Implement chosen solution
  • Monitor costs

---

Success Metrics

**Before** (Current):

  • 2.2M GET requests/day
  • Unknown Upstash costs
  • Potential budget suspension risk

**After** (Target):

  • < 100K GET requests/day (a reduction of more than 95%)
  • $0 Upstash costs (if suspended)
  • No budget suspension risk
  • Cache hit ratio > 80%
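The two percentage targets can be checked with small helpers once the before and after numbers are in. The function names are illustrative, and the hit and miss counts would have to come from whatever metrics the application exposes.

```python
def reduction_pct(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


def hit_ratio(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there were none."""
    total = hits + misses
    return hits / total if total else 0.0


target_reduction = reduction_pct(2_200_000, 100_000)
```

Going from 2.2M to 100K requests is a drop of about 95.5%, so any daily count under 100K meets the target.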

---

Rollback Plan

If issues arise after changes:

# Re-enable Redis
fly secrets set SUSPEND_REDIS=false -a atom-saas

# Redeploy the previous version (image reference from the Fly registry)
fly deploy -a atom-saas --image registry.fly.io/atom-saas:deployment-<PREVIOUS_IMAGE_ID>

Monitor for 24 hours before closing the issue.