# Infrastructure Constraints & Scaling Guide

This document records the production infrastructure constraints for the ATUM SaaS platform, specifically following the stability event on April 9, 2026.
## 🗄️ Database: NeonDB

### Current Configuration

- **Plan:** Launch (Standard)
- **Primary Limit:** Upgraded from the Free tier's 20-connection limit to the Launch tier's ceiling of ~10,000 pooled connections.
### Scaling Formula

To ensure stability across multiple Fly.io machines, we size connection pools using the following formula:

`Total Connections = Num Machines * (DB_POOL_SIZE + DATABASE_MAX_OVERFLOW)`

**Safe Tuning (Launch Plan):**

- `DB_POOL_SIZE = 15`
- `DATABASE_MAX_OVERFLOW = 5`
- **Result:** 20 connections per machine. With 10 machines, we only use 200 connections, well within the Launch plan's headroom.
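As a quick sanity check, the formula can be expressed as a small helper (the function and parameter names here are our own, not from the codebase):

```python
def total_connections(num_machines: int,
                      pool_size: int = 15,
                      max_overflow: int = 5) -> int:
    """Worst-case DB connections opened across the whole fleet.

    Each machine may hold up to pool_size persistent connections
    plus max_overflow short-lived overflow connections under burst load.
    """
    return num_machines * (pool_size + max_overflow)
```

With the safe tuning above, `total_connections(10)` returns 200, matching the headroom estimate.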
### ⚠️ Upgrade Path

If you scale beyond 20 concurrent machines or significantly increase pool depth:

- **Monitor PgBouncer Stats:** Check the Neon dashboard for "Waiting for connection" events.
- **Upgrade Plan:** Move to the "Scale" plan if direct-connection limits (for migrations and long-lived tasks) become a bottleneck.
---
## 🏎️ Caching: Upstash Redis

### Incident Report (April 8-9, 2026)

- **Symptom:** 12M+ Redis reads in 24 hours.
- **Root Causes:**
  - **SQLAlchemy Crash Loop:** A `MapperConfigurationError` on the `skill_versions` relationship caused the backend to crash and restart repeatedly.
  - **Initialization Reads:** Every startup triggered a full initialization of background services, each performing Redis pings and metadata lookups.
  - **Missing Caching:** Tenant lookups were not cached in Redis, so every request fell back to a direct DB query.
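The "Missing Caching" failure mode can be illustrated with a minimal read-through TTL cache (a sketch with hypothetical names; the real implementation backs onto Redis rather than a process-local dict):

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds: float):
    """Read-through cache: serve from memory if fresh, else call the loader."""
    def decorator(loader):
        store = {}
        @wraps(loader)
        def wrapper(key):
            hit = store.get(key)
            if hit is not None and time.monotonic() - hit[1] < ttl_seconds:
                return hit[0]          # cache hit: skip the DB entirely
            value = loader(key)        # cache miss: one DB round-trip
            store[key] = (value, time.monotonic())
            return value
        return wrapper
    return decorator

# Hypothetical tenant loader; the real one queries Postgres.
@ttl_cache(ttl_seconds=60.0)
def lookup_tenant(tenant_id):
    return {"id": tenant_id, "active": True}
```

Without this layer, every request pays the loader's full cost, which is exactly how the crash loop amplified into millions of reads.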
### Stabilization Measures

- **Kill Switch:** The `SUSPEND_REDIS=true` environment variable bypasses all Redis traffic during outages (falls back to a local in-memory cache for performance).
- **Mapping Fix:** Resolved the `SkillVersion.tenant` relationship conflict and the duplicate class definition in `backend-saas/core/models.py`.
- **Startup Resilience:** Wrapped `lifespan` initialization in `backend-saas/main_api_app.py` with granular error handlers for `InterfaceError` and `OperationalError`.
- **Tenant Caching:** Implemented Redis/memory caching for tenant lookups.
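The kill switch can be sketched as a small factory that consults the environment before touching Redis (only `SUSPEND_REDIS` comes from the source; the other names are illustrative):

```python
import os

class MemoryCache:
    """Process-local fallback used when Redis is suspended."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value

def make_cache(redis_factory=None):
    """Return a cache backend, honoring the SUSPEND_REDIS kill switch."""
    if os.getenv("SUSPEND_REDIS", "false").lower() == "true":
        return MemoryCache()   # outage mode: never open a Redis connection
    if redis_factory is None:
        return MemoryCache()   # no Redis configured: degrade gracefully
    return redis_factory()     # normal path, e.g. a redis.Redis wrapper
```

Because the switch is read at cache-construction time, flipping the variable and restarting is enough to shed all Redis traffic during an incident.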
---
## 🕰️ Scheduler: QStash

### Cleanup Procedures

Frequent backend restarts can leave many "Dead Letter" or retry messages in QStash.

- **Utility:** `backend-saas/scripts/clean_all_upstash_and_qstash.py`
- **Action:** Run this script after any major deployment failure to clear the queue.
---
## 🛑 Recent Incident: Brennan Account Loss & SQLite Fallback (April 9, 2026)

### Symptom

- **Account "Deletion":** User `rish@brennan.ca` was unable to log in despite having a valid record in the production database.
- **Data Mismatch:** Logins resulted in empty workspaces and no tenant data.
### Root Causes

- **Silent SQLite Fallback:** The app's `DATABASE_URL` defaulted to a local `atom_dev.db` (SQLite) because the `ENVIRONMENT=production` flag was missing from the Fly.io configuration.
- **Hardcoded Maintenance Scripts:** Maintenance scripts (`direct_cleanup.py`, `purge_test_data.py`) contained hardcoded references to `brennan` as "test data", creating a high risk of permanent data loss during automated cleanups.
- **Architecture Drift:** The app was split into `fly.api.toml` and `fly.worker.toml`, deviating from the unified "single machine" architecture documented in `CLAUDE.md`.
### Remediation & Guardrails

- **Lockdown:** Added `ENVIRONMENT=production` to the Fly.io configs.
- **Hardening:** Refactored `core/database.py` to **ABORT** if SQLite is used in a production environment.
- **Unification:** Re-unified the API and Worker into a single `fly.toml` to ensure consistent environment configuration.
- **Institutional Memory:** Updated `CLAUDE.md` to prohibit hardcoded cleanup targets for future AI agents.
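The hardening step can be sketched as a startup guard (a minimal illustration, not the actual `core/database.py`; the function name is an assumption, while `DATABASE_URL`, `ENVIRONMENT`, and `atom_dev.db` come from the incident above):

```python
import os

def resolve_database_url() -> str:
    """Resolve the DB URL, refusing the SQLite fallback in production."""
    url = os.getenv("DATABASE_URL", "sqlite:///./atom_dev.db")
    if os.getenv("ENVIRONMENT") == "production" and url.startswith("sqlite"):
        # ABORT rather than silently serve an empty local database.
        raise RuntimeError(
            "Refusing to start: SQLite fallback detected in production. "
            "Set DATABASE_URL to the Postgres DSN."
        )
    return url
```

Failing loudly at startup turns the silent-fallback class of incident into an immediate, visible deploy failure.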
**Last Updated:** April 9, 2026 (12:15 PM)
**Status:** Unified Deployment Lock (Production-Only)