ATOM Documentation

Redis Cache Fix - Deployment Summary

✅ Deployment Complete

**Deployed**: 2026-04-08

**Commit**: 55a4c46b7

**Status**: ✅ SUCCESS

**URL**: https://atom-saas.fly.dev/

---

What Was Fixed

Problem

  • **Excessive Redis Reads**: 13,480,415 reads/day (423:1 read:write ratio)
  • **Root Cause**: 812 API routes calling getTenantFromRequest() with NO caching
  • **Impact**: 2-5 PostgreSQL queries per API request, massive database load

Solution

  1. **Created Redis Client Service** (src/lib/redis/redis-client.ts)
     • Upstash Redis wrapper with automatic fallback
     • In-memory cache fallback when Redis unavailable
     • TTL-based cache invalidation
  2. **Updated Tenant Extraction** (src/lib/tenant/tenant-extractor.ts)
     • Cache-first lookup strategy
     • Multiple cache key patterns:
       • tenant:id:{id} - Lookup by tenant ID
       • tenant:subdomain:{subdomain} - Lookup by subdomain
       • tenant:domain:{domain} - Lookup by custom domain
     • Cache TTL: 1 hour (configurable)
  3. **Added Cache Management**
     • invalidateTenantCache() - Manual cache invalidation
     • Automatic expiration via TTL
     • Graceful degradation to local cache
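
The cache-first lookup and key patterns above can be sketched roughly as follows. This is a minimal illustration, not the code in tenant-extractor.ts: the Tenant shape, the in-memory Map standing in for Redis, and the names cacheGet, cacheSet, tenantKeys, and getTenantBySubdomain are all assumptions made for the example.

```typescript
// Hypothetical tenant shape, for illustration only.
interface Tenant {
  id: string;
  subdomain: string;
  customDomain?: string;
}

interface CacheEntry {
  value: Tenant;
  expiresAt: number; // epoch milliseconds
}

const TENANT_TTL_SECONDS = 3600; // 1 hour, matching the TTL described above

// Stand-in for Redis: a Map with per-entry expiry.
const store = new Map<string, CacheEntry>();

function cacheGet(key: string, now: number = Date.now()): Tenant | null {
  const entry = store.get(key);
  if (!entry) return null;
  if (entry.expiresAt <= now) {
    store.delete(key); // expired: treat as a miss
    return null;
  }
  return entry.value;
}

function cacheSet(key: string, value: Tenant, ttlSeconds: number, now: number = Date.now()): void {
  store.set(key, { value, expiresAt: now + ttlSeconds * 1000 });
}

// Key builders following the three patterns listed above.
const tenantKeys = {
  byId: (id: string) => `tenant:id:${id}`,
  bySubdomain: (subdomain: string) => `tenant:subdomain:${subdomain}`,
  byDomain: (domain: string) => `tenant:domain:${domain}`,
};

// Cache-first lookup: return the cached tenant on a hit; otherwise load it
// through the supplied loader and cache it under every applicable key.
async function getTenantBySubdomain(
  subdomain: string,
  loadFromDb: (subdomain: string) => Promise<Tenant | null>,
): Promise<Tenant | null> {
  const cached = cacheGet(tenantKeys.bySubdomain(subdomain));
  if (cached) return cached;

  const tenant = await loadFromDb(subdomain);
  if (!tenant) return null;

  cacheSet(tenantKeys.byId(tenant.id), tenant, TENANT_TTL_SECONDS);
  cacheSet(tenantKeys.bySubdomain(tenant.subdomain), tenant, TENANT_TTL_SECONDS);
  if (tenant.customDomain) {
    cacheSet(tenantKeys.byDomain(tenant.customDomain), tenant, TENANT_TTL_SECONDS);
  }
  return tenant;
}
```

With this shape, a second request for the same subdomain within the TTL is served from the cache and never reaches the database loader.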

---

Expected Results

Performance Improvements

  • **Database Queries**: 80-90% reduction
  • **Redis Reads**: 70-80% reduction (from 13.5M to ~3-4M/day)
  • **API Response Time**: 50-100ms faster per request
  • **Cache Hit Rate**: Expected >85% after warmup

Cost Savings

  • **Upstash Commands**: 10M fewer operations/day
  • **Database Load**: Significantly reduced
  • **Infrastructure**: Better resource utilization

---

Verification Steps

1. Check Application Health

# Test API endpoint
curl -I https://atom-saas.fly.dev/api/health

# Check the 50 most recent log lines for errors
fly logs -a atom-saas --no-tail | tail -n 50

2. Monitor Cache Performance

# SSH into the app
fly ssh console -a atom-saas

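# Check cache stats (add this to your monitoring).
# The Redis client is a TypeScript module, so it cannot be imported from Python.
# Assumes tsx is available in the image and that the source path below exists
# in the deployed build; adjust the path to match your container layout.
cd /app
npx tsx -e "import('./src/lib/redis/redis-client.ts').then(async (m) => console.log(await m.getCacheStats()))"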

3. Track Redis Metrics

**Upstash Dashboard**:

  • Navigate to your Upstash dashboard
  • Monitor "Commands/sec" metric
  • Expected: 70-80% reduction after 24-48 hours

**Key Metrics to Watch**:

  • keyspace_hits - Should increase significantly
  • keyspace_misses - Should decrease significantly
  • Total commands/day - Should drop from 13.5M to ~3-4M

4. Database Load Monitoring

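-- Check query reduction in PostgreSQL (requires the pg_stat_statements extension)
-- Column names are for PostgreSQL 13+; on 12 and earlier use total_time and mean_time
SELECT query, calls, total_exec_time, mean_exec_time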
FROM pg_stat_statements
WHERE query LIKE '%tenants%'
ORDER BY calls DESC
LIMIT 10;

5. Cache Hit Rate Calculation

# After 24 hours, check hit rate
# Formula: hits / (hits + misses)

# Expected:
# Day 1: 60-70% (warming up)
# Day 2: 80-85% (stable)
# Day 3+: 85-90% (optimal)
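
The hit-rate formula above can be written as a small helper. This is only a sketch: the function name is an assumption, and the counter values are made-up inputs for illustration, not measurements from this deployment.

```typescript
// Hit rate = hits / (hits + misses); returns 0 when there is no traffic yet.
function cacheHitRate(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// Hypothetical counters, for illustration only (not measured values).
const rate = cacheHitRate(8500, 1500);
console.log(`${(rate * 100).toFixed(1)}%`); // prints "85.0%"
```

The zero-traffic guard avoids a NaN result right after a deploy, before any lookups have been counted.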

---

Cache Invalidation

When to Invalidate Cache

Call invalidateTenantCache(tenantId) when:

  1. Tenant plan changes
  2. Tenant settings updated
  3. Tenant subdomain/custom domain changes
  4. Tenant ownership transferred

Example Usage

import { invalidateTenantCache } from '@/lib/tenant/tenant-extractor'

// After updating tenant
await updateTenant(tenantId, { plan_type: 'enterprise' })
await invalidateTenantCache(tenantId)
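
Because a tenant may be cached under up to three keys (ID, subdomain, custom domain), invalidation presumably needs to clear all of them, including the old subdomain or domain after a rename. The sketch below shows one way that could work; it is an assumption about the approach, not the actual invalidateTenantCache() implementation, and the Tenant shape, the Map standing in for Redis, and the function names are hypothetical.

```typescript
// Hypothetical tenant shape and in-memory stand-in for the cache.
interface Tenant {
  id: string;
  subdomain: string;
  customDomain?: string;
}

const cache = new Map<string, Tenant>();

// Collect every key under which a tenant may have been cached.
function tenantCacheKeys(tenant: Tenant): string[] {
  const keys = [`tenant:id:${tenant.id}`, `tenant:subdomain:${tenant.subdomain}`];
  if (tenant.customDomain) keys.push(`tenant:domain:${tenant.customDomain}`);
  return keys;
}

// Remove all cached copies of a tenant and return how many keys were deleted.
// The cached record still holds the pre-update subdomain and domain, so the
// stale lookup keys are removed even after a rename.
function invalidateAllTenantKeys(tenantId: string): number {
  const cached = cache.get(`tenant:id:${tenantId}`);
  if (!cached) return 0;
  let removed = 0;
  for (const key of tenantCacheKeys(cached)) {
    if (cache.delete(key)) removed++;
  }
  return removed;
}
```

Deriving the secondary keys from the cached record keeps the caller's interface to a single tenant ID, at the cost of one extra read before the deletes.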

---

Troubleshooting

High Cache Miss Rate

**Symptoms**: Cache hit rate < 60%

**Causes**:

  • Cache too small (increase LOCAL_CACHE_SIZE)
  • TTL too short (increase CACHE_TTL.TENANT)
  • Redis connection issues

**Solution**:

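# Check Redis connection.
# The Redis client is a TypeScript module, so it cannot be imported from Python.
# Assumes tsx is available in the image and that the command runs from the app
# root; adjust the module path to match your container layout.
fly ssh console -a atom-saas -C "npx tsx -e \"import('./src/lib/redis/redis-client.ts').then(async (m) => console.log('Redis connected:', (await m.getRedisClient()) != null))\""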

Cache Not Working

**Symptoms**: Still seeing 13M+ reads/day

**Check**:

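  1. Verify that UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN are set (see Environment Variables under Configuration)
  2. Check application logs for Redis errors
  3. Verify that the deployed build includes commit 55a4c46b7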

Memory Issues

**Symptoms**: OOM errors, high memory usage

**Solution**:

  • Reduce LOCAL_CACHE_SIZE env var (default: 1000)
  • Reduce cache TTLs
  • Monitor memory usage in the Fly.io dashboard metrics: fly dashboard -a atom-saas
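
To show why LOCAL_CACHE_SIZE bounds memory use, here is a minimal sketch of a size-capped local cache with per-entry TTL. It is not the project's implementation: the class name and the oldest-first eviction policy are assumptions, and the real local cache may evict differently.

```typescript
// A size-bounded in-memory cache with per-entry TTL. When full, it evicts the
// oldest inserted entry (a Map iterates in insertion order).
class BoundedCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;

  constructor(maxEntries: number, ttlSeconds: number) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlSeconds * 1000;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    if (this.maxEntries <= 0) return; // caching disabled
    this.entries.delete(key); // re-inserting moves the key to the newest position
    if (this.entries.size >= this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}

// Values mirror the documented defaults: LOCAL_CACHE_SIZE=1000, LOCAL_CACHE_TTL=60.
const localCache = new BoundedCache<string>(1000, 60);
```

Under this design the number of live entries never exceeds the configured maximum, so lowering LOCAL_CACHE_SIZE directly lowers the cache's worst-case footprint.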

---

Configuration

Environment Variables

# Redis Configuration (already set)
UPSTASH_REDIS_REST_URL=https://xxx.upstash.io
UPSTASH_REDIS_REST_TOKEN=xxx

# Cache Configuration (optional)
LOCAL_CACHE_SIZE=1000              # Max local cache entries
LOCAL_CACHE_TTL=60                 # Default local TTL (seconds)
REDIS_CIRCUIT_THRESHOLD=3          # Failures before circuit opens
REDIS_CIRCUIT_TIMEOUT=60           # Seconds before retry
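
The two circuit settings above suggest a conventional circuit breaker around Redis calls. The following is a minimal sketch of that pattern under the documented defaults; the class and method names are assumptions, and the actual redis-client.ts logic may differ.

```typescript
// Minimal circuit breaker: after `threshold` consecutive failures the circuit
// opens and Redis calls are skipped until `timeoutSeconds` have elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly timeoutMs: number;

  constructor(threshold: number, timeoutSeconds: number) {
    this.threshold = threshold;
    this.timeoutMs = timeoutSeconds * 1000;
  }

  // True when Redis should be tried; false when the local cache should be used.
  canAttempt(now: number = Date.now()): boolean {
    if (this.openedAt === null) return true;
    return now - this.openedAt >= this.timeoutMs; // allow a retry after the timeout
  }

  // A successful call closes the circuit and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failed call counts toward the threshold; reaching it (re)opens the circuit.
  recordFailure(now: number = Date.now()): void {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}

// Values mirror the documented defaults:
// REDIS_CIRCUIT_THRESHOLD=3, REDIS_CIRCUIT_TIMEOUT=60.
const breaker = new CircuitBreaker(3, 60);
```

A caller would check canAttempt() before each Redis operation, fall back to the local cache when it returns false, and report the outcome with recordSuccess() or recordFailure().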

Cache TTL Values

// All values are in seconds
const CACHE_TTL = {
  TENANT: 3600,      // 1 hour - tenant data
  SESSION: 1800,     // 30 minutes - session data
  RATE_LIMIT: 60,    // 1 minute - rate limits
  SHORT: 300,        // 5 minutes - frequently changing data
  MEDIUM: 1800,      // 30 minutes - moderate change frequency
  LONG: 3600,        // 1 hour - rarely changing data
}

---

Rollback Plan (If Needed)

If Issues Occur

# Revert the caching commit (use the hash explicitly in case HEAD has moved on)
git revert 55a4c46b7
git push origin main

# Redeploy
fly deploy -a atom-saas --strategy immediate

Previous Commit

  • **Before**: 2c029bd9e (no Redis caching)
  • **After**: 55a4c46b7 (with Redis caching)

---

Next Steps

Immediate (Day 1)

  • ✅ Deploy complete
  • ⏳ Monitor application logs for errors
  • ⏳ Verify basic functionality (login, API calls)

Short-term (Week 1)

  • ⏳ Track Redis metrics in Upstash dashboard
  • ⏳ Measure cache hit rate
  • ⏳ Monitor database query reduction
  • ⏳ Check API response times

Long-term (Month 1)

  • ⏳ Optimize cache TTLs based on hit rates
  • ⏳ Add monitoring/alerting for cache failures
  • ⏳ Document cache patterns for other services
  • ⏳ Consider adding Redis caching to other frequently accessed data

---

Documentation

  • src/lib/redis/redis-client.ts - Redis client implementation
  • src/lib/tenant/tenant-extractor.ts - Cached tenant extraction
  • REDIS_READS_DIAGNOSIS.md - Original diagnosis
  • UPSTASH_QSTASH_CLEANUP_GUIDE.md - Cleanup procedures

Support

  • **Upstash Docs**: https://upstash.com/docs
  • **@upstash/redis**: https://github.com/upstash/upstash-redis
  • **Fly.io Secrets**: https://fly.io/docs/reference/secrets/

---

**Generated**: 2026-04-08

**Status**: ✅ Deployed and Active

**Next Review**: 2026-04-15 (7 days)