Atom AI Labs - AI-Powered Multi-Tenant Platform

Redis Cache Fix - Deployment Summary

✅ Deployment Complete

**Deployed**: 2026-04-08

**Commit**: 55a4c46b7

**Status**: ✅ SUCCESS

**URL**: https://atom-saas.fly.dev/

---

What Was Fixed

Problem

**Excessive Redis Reads**: 13,480,415 reads/day (423:1 read:write ratio)
**Root Cause**: 812 API routes calling getTenantFromRequest() with NO caching
**Impact**: 2-5 PostgreSQL queries per API request, massive database load

Solution

**Created Redis Client Service** (src/lib/redis/redis-client.ts)

Upstash Redis wrapper with automatic fallback
In-memory cache fallback when Redis unavailable
TTL-based cache invalidation
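The fallback behaviour listed above can be pictured with a minimal sketch. It is not the contents of redis-client.ts: the `RemoteCache` interface and the `FallbackCache` class are hypothetical names used only for illustration, and the real wrapper may differ in how it reads, writes, and expires entries.

```typescript
// Hypothetical stand-in for the Upstash client; only get/set with a TTL are modeled.
interface RemoteCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

interface LocalEntry {
  value: string;
  expiresAt: number; // epoch milliseconds
}

class FallbackCache {
  private local = new Map<string, LocalEntry>();
  private remote: RemoteCache | null;
  private now: () => number;

  constructor(remote: RemoteCache | null, now: () => number = Date.now) {
    this.remote = remote;
    this.now = now;
  }

  async get(key: string): Promise<string | null> {
    if (this.remote) {
      try {
        return await this.remote.get(key);
      } catch {
        // Redis unavailable: fall through to the in-memory copy.
      }
    }
    const entry = this.local.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now()) {
      this.local.delete(key); // TTL elapsed, treat as a miss
      return null;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    // Always keep a local copy so reads still work if Redis goes away.
    this.local.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
    if (this.remote) {
      try {
        await this.remote.set(key, value, ttlSeconds);
      } catch {
        // Ignore write failures; the local copy serves until its TTL expires.
      }
    }
  }
}
```

Injecting the clock (`now`) is a choice made here so that TTL expiry can be exercised in tests without waiting.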

**Updated Tenant Extraction** (src/lib/tenant/tenant-extractor.ts)

Cache-first lookup strategy
Multiple cache key patterns:
tenant:id:{id} - Lookup by tenant ID
tenant:subdomain:{subdomain} - Lookup by subdomain
tenant:domain:{domain} - Lookup by custom domain
Cache TTL: 1 hour (configurable)
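A cache-first lookup over these key patterns might look roughly like the sketch below. The `Tenant` shape, the `TenantCache` interface, and `getTenantBySubdomain` are assumptions made for illustration, not the code in tenant-extractor.ts. Writing the record under every key on a miss is also a design choice of this sketch rather than something the summary confirms.

```typescript
// Hypothetical tenant shape; the real record likely has more fields.
interface Tenant {
  id: string;
  subdomain: string;
  customDomain?: string;
}

// Minimal cache contract assumed for this sketch.
interface TenantCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

const TENANT_TTL_SECONDS = 3600; // 1 hour, matching the TTL above

// Key builders for the three patterns listed above.
const tenantKeys = {
  byId: (id: string) => `tenant:id:${id}`,
  bySubdomain: (subdomain: string) => `tenant:subdomain:${subdomain}`,
  byDomain: (domain: string) => `tenant:domain:${domain}`,
};

// Cache-first lookup: return the cached tenant on a hit; on a miss, load it
// from the database and store it under every key that identifies it.
async function getTenantBySubdomain(
  subdomain: string,
  cache: TenantCache,
  loadFromDb: (subdomain: string) => Promise<Tenant | null>,
): Promise<Tenant | null> {
  const cached = await cache.get(tenantKeys.bySubdomain(subdomain));
  if (cached !== null) return JSON.parse(cached) as Tenant;

  const tenant = await loadFromDb(subdomain);
  if (tenant === null) return null; // unknown tenants are not cached in this sketch

  const payload = JSON.stringify(tenant);
  const keys = [tenantKeys.byId(tenant.id), tenantKeys.bySubdomain(tenant.subdomain)];
  if (tenant.customDomain) keys.push(tenantKeys.byDomain(tenant.customDomain));
  await Promise.all(keys.map((key) => cache.set(key, payload, TENANT_TTL_SECONDS)));
  return tenant;
}
```

With this shape, a second request for the same subdomain within the TTL is served from the cache and does not reach PostgreSQL.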

**Added Cache Management**

invalidateTenantCache() - Manual cache invalidation
Automatic expiration via TTL
Graceful degradation to local cache
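Because a tenant can be cached under up to three keys, manual invalidation has to clear all of them. The sketch below illustrates that idea only. `TenantRef`, `KeyDeleter`, `tenantCacheKeys`, and `invalidateTenantKeys` are hypothetical names, and the real invalidateTenantCache() takes a tenant ID and may resolve the keys differently.

```typescript
// Hypothetical shapes for this sketch.
interface TenantRef {
  id: string;
  subdomain: string;
  customDomain?: string;
}

interface KeyDeleter {
  // Deletes the given keys and resolves to the number of keys removed.
  del(...keys: string[]): Promise<number>;
}

// Build every cache key that may hold a copy of this tenant.
function tenantCacheKeys(tenant: TenantRef): string[] {
  const keys = [`tenant:id:${tenant.id}`, `tenant:subdomain:${tenant.subdomain}`];
  if (tenant.customDomain) keys.push(`tenant:domain:${tenant.customDomain}`);
  return keys;
}

// Delete all keys in one call so no lookup path keeps serving stale data.
async function invalidateTenantKeys(tenant: TenantRef, cache: KeyDeleter): Promise<number> {
  return cache.del(...tenantCacheKeys(tenant));
}
```

When a subdomain or custom domain changes, the keys should be built from the old values; otherwise the old key stays cached until its TTL expires.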

---

Expected Results

Performance Improvements

**Database Queries**: 80-90% reduction
**Redis Reads**: 70-80% reduction (from 13.5M to ~2.7-4M/day)
**API Response Time**: 50-100ms faster per request
**Cache Hit Rate**: Expected >85% after warmup

Cost Savings

**Upstash Commands**: 10M fewer operations/day
**Database Load**: Significantly reduced
**Infrastructure**: Better resource utilization

---

Verification Steps

1. Check Application Health

# Test API endpoint
curl -I https://atom-saas.fly.dev/api/health

# Check logs for errors (streams live; add --no-tail to print buffered logs and exit)
fly logs -a atom-saas

2. Monitor Cache Performance

# SSH into the app
fly ssh console -a atom-saas

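# Check cache stats (add this to your monitoring).
# redis-client.ts is TypeScript, so it cannot be imported from Python. This
# assumes the source tree and the tsx runner are present under /app, and that
# getCacheStats is exported from redis-client.ts. A fresh process starts with an
# empty local cache, so treat the output as a connectivity and shape check.
cd /app
npx tsx -e 'import("./src/lib/redis/redis-client").then(async (m) => console.log(await m.getCacheStats()))'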

3. Track Redis Metrics

**Upstash Dashboard**:

Navigate to your Upstash dashboard
Monitor "Commands/sec" metric
Expected: 70-80% reduction after 24-48 hours

**Key Metrics to Watch**:

keyspace_hits - Should increase significantly
keyspace_misses - Should decrease significantly
Total commands/day - Should drop from 13.5M to ~2.7-4M (a 70-80% reduction)

4. Database Load Monitoring

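-- Check query reduction in PostgreSQL (requires the pg_stat_statements extension).
-- Column names are for PostgreSQL 13+; on 12 and earlier use total_time and mean_time.
SELECT query, calls, total_exec_time, mean_exec_time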
FROM pg_stat_statements
WHERE query LIKE '%tenants%'
ORDER BY calls DESC
LIMIT 10;

5. Cache Hit Rate Calculation

# After 24 hours, check hit rate
# Formula: hits / (hits + misses)

# Expected:
# Day 1: 60-70% (warming up)
# Day 2: 80-85% (stable)
# Day 3+: 85-90% (optimal)
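The hit-rate formula above, together with the percentage reduction used elsewhere in this summary, can be written as two small helpers. The function names are illustrative and not part of the codebase.

```typescript
// Hit rate as defined above: hits / (hits + misses).
function cacheHitRate(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total; // no traffic yet: report 0 rather than NaN
}

// Percentage drop between a baseline and a current daily command count.
function reductionPercent(baseline: number, current: number): number {
  return ((baseline - current) / baseline) * 100;
}
```

For example, 850 hits and 150 misses give a hit rate of 0.85, and a drop from 13,480,415 to 3,000,000 commands/day is a reduction of roughly 78%.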

---

Cache Invalidation

When to Invalidate Cache

Call invalidateTenantCache(tenantId) when:

Tenant plan changes
Tenant settings updated
Tenant subdomain/custom domain changes
Tenant ownership transferred

Example Usage

import { invalidateTenantCache } from '@/lib/tenant/tenant-extractor'

// After updating tenant
await updateTenant(tenantId, { plan_type: 'enterprise' })
await invalidateTenantCache(tenantId)

---

Troubleshooting

High Cache Miss Rate

**Symptoms**: Cache hit rate < 60%

**Causes**:

Cache too small (increase LOCAL_CACHE_SIZE)
TTL too short (increase CACHE_TTL.TENANT)
Redis connection issues

**Solution**:

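# Check Redis connection from inside the app machine.
# The Redis client is TypeScript, so it cannot be imported from Python;
# ping the Upstash REST endpoint directly instead (assumes curl is in the image).
fly ssh console -a atom-saas -C "sh -c 'curl -s \"\$UPSTASH_REDIS_REST_URL/ping\" -H \"Authorization: Bearer \$UPSTASH_REDIS_REST_TOKEN\"'"
# A healthy connection responds with {"result":"PONG"}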

Cache Not Working

**Symptoms**: Still seeing 13M+ reads/day

**Check**:

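Verify environment variables: run fly secrets list -a atom-saas and confirm UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN are set.

Check application logs for Redis errors: run fly logs -a atom-saas and look for connection or authentication failures.

Verify the build included the new code: confirm the deployed release was built from commit 55a4c46b7.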

Memory Issues

**Symptoms**: OOM errors, high memory usage

**Solution**:

Reduce LOCAL_CACHE_SIZE env var (default: 1000)
Reduce cache TTLs
Monitor memory usage in the Fly.io metrics dashboard: fly dashboard -a atom-saas
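To show why lowering LOCAL_CACHE_SIZE caps memory, here is a minimal sketch of a size-bounded in-memory cache. It assumes least-recently-used eviction, which this summary does not confirm; the actual local cache may evict differently. `BoundedCache` is a hypothetical name.

```typescript
// Size-bounded in-memory cache with least-recently-used eviction.
// A Map iterates in insertion order, so its first key is the least recently used.
class BoundedCache<V> {
  private entries = new Map<string, V>();
  private maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry to stay within the limit.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Memory then scales with the entry limit times the average entry size, so halving the limit roughly halves the worst-case footprint of the local cache.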

---

Configuration

Environment Variables

# Redis Configuration (already set)
UPSTASH_REDIS_REST_URL=https://xxx.upstash.io
UPSTASH_REDIS_REST_TOKEN=xxx

# Cache Configuration (optional)
LOCAL_CACHE_SIZE=1000              # Max local cache entries
LOCAL_CACHE_TTL=60                 # Default local TTL (seconds)
REDIS_CIRCUIT_THRESHOLD=3          # Failures before circuit opens
REDIS_CIRCUIT_TIMEOUT=60           # Seconds before retry
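The two circuit settings suggest a conventional circuit breaker around Redis calls. The sketch below shows how a failure threshold and a retry timeout typically interact. The `CircuitBreaker` class is hypothetical, and the breaker in redis-client.ts may count failures or reset differently.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null; // epoch milliseconds when the circuit opened
  private threshold: number;
  private timeoutMs: number;
  private now: () => number;

  // Defaults mirror REDIS_CIRCUIT_THRESHOLD=3 and REDIS_CIRCUIT_TIMEOUT=60.
  constructor(threshold = 3, timeoutSeconds = 60, now: () => number = Date.now) {
    this.threshold = threshold;
    this.timeoutMs = timeoutSeconds * 1000;
    this.now = now;
  }

  // True when a call to Redis should be attempted.
  canAttempt(): boolean {
    if (this.openedAt === null) return true;
    // Once the timeout has elapsed, allow a trial call (half-open state).
    return this.now() - this.openedAt >= this.timeoutMs;
  }

  // A successful call closes the circuit and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // Count a failure; at or above the threshold, (re)open the circuit.
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

While the circuit is open, callers would skip Redis and read from the local cache, which avoids paying a network timeout on every request during an outage.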

Cache TTL Values

// All values are in seconds
const CACHE_TTL = {
  TENANT: 3600,      // 1 hour - tenant data
  SESSION: 1800,     // 30 minutes - session data
  RATE_LIMIT: 60,    // 1 minute - rate limits
  SHORT: 300,        // 5 minutes - frequently changing data
  MEDIUM: 1800,      // 30 minutes - moderate change frequency
  LONG: 3600,        // 1 hour - rarely changing data
} as const

---

Rollback Plan (If Needed)

If Issues Occur

# Revert the caching commit (use its hash so a later commit is not reverted by mistake)
git revert 55a4c46b7
git push origin main

# Redeploy
fly deploy -a atom-saas --strategy immediate

Previous Commit

**Before**: 2c029bd9e (no Redis caching)
**After**: 55a4c46b7 (with Redis caching)

---

Next Steps

Immediate (Day 1)

✅ Deploy complete
⏳ Monitor application logs for errors
⏳ Verify basic functionality (login, API calls)

Short-term (Week 1)

⏳ Track Redis metrics in Upstash dashboard
⏳ Measure cache hit rate
⏳ Monitor database query reduction
⏳ Check API response times

Long-term (Month 1)

⏳ Optimize cache TTLs based on hit rates
⏳ Add monitoring/alerting for cache failures
⏳ Document cache patterns for other services
⏳ Consider adding Redis caching to other frequently accessed data

---

Documentation

src/lib/redis/redis-client.ts - Redis client implementation
src/lib/tenant/tenant-extractor.ts - Cached tenant extraction
REDIS_READS_DIAGNOSIS.md - Original diagnosis
UPSTASH_QSTASH_CLEANUP_GUIDE.md - Cleanup procedures

Support

**Upstash Docs**: https://upstash.com/docs
**@upstash/redis**: https://github.com/upstash/upstash-redis
**Fly.io Secrets**: https://fly.io/docs/reference/secrets/

---

**Generated**: 2026-04-08

**Status**: ✅ Deployed and Active

**Next Review**: 2026-04-15 (7 days)