ATOM Documentation


Infrastructure Constraints & Scaling Guide

This document records the production infrastructure constraints for the ATOM SaaS platform, established following the stability event on April 9, 2026.

🗄️ Database: NeonDB

Current Configuration

  • **Plan:** Launch (Standard)
  • **Primary Limit:** ~10,000 pooled connections on the Launch plan (up from the Free tier's 20).

Scaling Formula

To ensure stability across multiple Fly.io machines, we use the following formula for connection pooling:

Total Connections = Num Machines * (DB_POOL_SIZE + DATABASE_MAX_OVERFLOW)

**Safe Tuning (Launch Plan):**

  • DB_POOL_SIZE = 15
  • DATABASE_MAX_OVERFLOW = 5
  • **Result:** 20 connections per machine. With 10 machines, we only use 200 connections, well within the Launch plan's headroom.
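The formula above maps directly onto SQLAlchemy's engine pool settings. A minimal sketch (the commented `create_engine` call and the DSN placeholder are illustrative, not the exact code in the repo; the env var names match those used above):

```python
import os

# Per-machine pool settings from the Safe Tuning values above.
DB_POOL_SIZE = int(os.getenv("DB_POOL_SIZE", "15"))
DATABASE_MAX_OVERFLOW = int(os.getenv("DATABASE_MAX_OVERFLOW", "5"))

def total_connections(num_machines: int) -> int:
    """Total Connections = Num Machines * (DB_POOL_SIZE + DATABASE_MAX_OVERFLOW)."""
    return num_machines * (DB_POOL_SIZE + DATABASE_MAX_OVERFLOW)

# Engine construction (sketch; the real DSN comes from DATABASE_URL):
# from sqlalchemy import create_engine
# engine = create_engine(
#     os.environ["DATABASE_URL"],
#     pool_size=DB_POOL_SIZE,              # steady-state connections per machine
#     max_overflow=DATABASE_MAX_OVERFLOW,  # burst connections beyond the pool
#     pool_pre_ping=True,                  # drop dead connections after restarts
# )

if __name__ == "__main__":
    print(total_connections(10))  # 10 machines -> 200 connections
```

Run the helper before changing machine counts to confirm the fleet stays within the plan's pooled-connection headroom.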

⚠️ Upgrade Path

If you scale beyond 20 concurrent machines or increase pool depth significantly:

  1. **Monitor PgBouncer Stats**: Check Neon dashboard for "Waiting for connection".
  2. **Upgrade Plan**: Scale to "Scale" plan if direct connection limits (for migrations/long-lived tasks) become a bottleneck.

---

🏎️ Caching: Upstash Redis

Incident Report (April 8-9, 2026)

  • **Symptom:** 12M+ Redis reads in 24 hours.
  • **Root Cause:**
    1. **SQLAlchemy Crash Loop:** A MapperConfigurationError on the skill_versions relationship caused the backend to crash and restart repeatedly.
    2. **Initialization Reads:** Every startup triggered a full initialization of background services, each performing Redis pings and metadata lookups.
    3. **Missing Caching:** Tenant lookups were not cached in Redis, falling back to direct DB queries for every request.

Stabilization Measures

  1. **Kill Switch**: SUSPEND_REDIS=true environment variable to bypass all Redis traffic during outages. (Falls back to a local in-memory cache so the app remains functional.)
  2. **Mapping Fix**: Resolved the SkillVersion.tenant relationship conflict and duplicate class definition in backend-saas/core/models.py.
  3. **Startup Resilience**: Wrapped lifespan initialization in backend-saas/main_api_app.py with granular error handlers for InterfaceError and OperationalError.
  4. **Tenant Caching**: Implemented Redis/Memory caching for tenant lookups.
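Measures 1 and 4 can be combined in a small cache facade. A hedged sketch, not the repo's actual implementation: the db_lookup callback and the tenant: key prefix are hypothetical names, and the Redis client is assumed to follow the redis-py get/set interface:

```python
import json
import os
from typing import Optional

# In-process fallback used when SUSPEND_REDIS=true or no client is available.
_memory_cache: dict = {}

def redis_suspended() -> bool:
    # Kill switch: SUSPEND_REDIS=true bypasses all Redis traffic during outages.
    return os.getenv("SUSPEND_REDIS", "false").lower() == "true"

def cache_get(key: str, redis_client=None) -> Optional[str]:
    if redis_suspended() or redis_client is None:
        return _memory_cache.get(key)
    return redis_client.get(key)

def cache_set(key: str, value: str, redis_client=None, ttl: int = 300) -> None:
    if redis_suspended() or redis_client is None:
        _memory_cache[key] = value
    else:
        redis_client.set(key, value, ex=ttl)

def get_tenant(tenant_id: str, redis_client=None, db_lookup=None) -> dict:
    """Cache-aside tenant lookup: hit the cache first, the DB only on a miss."""
    cached = cache_get(f"tenant:{tenant_id}", redis_client)
    if cached is not None:
        return json.loads(cached)
    tenant = db_lookup(tenant_id)  # direct DB query only on cache miss
    cache_set(f"tenant:{tenant_id}", json.dumps(tenant), redis_client)
    return tenant
```

The cache-aside shape is what prevents a crash loop from amplifying into millions of Redis/DB reads: repeated lookups for the same tenant are absorbed by the cache, and the kill switch removes Redis from the path entirely.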

---

🕰️ Scheduler: QStash

Cleanup Procedures

Frequent backend restarts can result in many "Dead Letter" or retry messages in QStash.

  • **Utility:** backend-saas/scripts/clean_all_upstash_and_qstash.py
  • **Action:** Run this script after any major deployment failure to clear the queue.
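The repo script is the source of truth; for reference, clearing the QStash dead-letter queue by hand looks roughly like the sketch below. This is an assumption-laden illustration: the /v2/dlq list and delete endpoints are from Upstash's public QStash REST API, QSTASH_TOKEN is assumed to be set, and pagination via cursors is omitted for brevity:

```python
import json
import os
import urllib.request

QSTASH_BASE = "https://qstash.upstash.io/v2"

def dlq_request(token: str, path: str, method: str = "GET") -> urllib.request.Request:
    """Build an authenticated request against the QStash REST API."""
    return urllib.request.Request(
        f"{QSTASH_BASE}{path}",
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )

def purge_dlq(token: str) -> int:
    """Delete every message currently in the dead-letter queue; returns the count."""
    deleted = 0
    while True:
        with urllib.request.urlopen(dlq_request(token, "/dlq")) as resp:
            messages = json.load(resp).get("messages", [])
        if not messages:
            return deleted
        for msg in messages:
            urllib.request.urlopen(
                dlq_request(token, f"/dlq/{msg['dlqId']}", method="DELETE")
            )
            deleted += 1

if __name__ == "__main__":
    print(f"Purged {purge_dlq(os.environ['QSTASH_TOKEN'])} DLQ messages")
```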

---


🛑 Recent Incident: Brennan Account Loss & SQLite Fallback (April 9, 2026)

Symptom

  • **Account "Deletion"**: User rish@brennan.ca was unable to log in despite having a valid record in the production database.
  • **Data Mismatch**: Logins resulted in empty workspaces and no tenant data.

Root Causes

  1. **Silent SQLite Fallback**: The app's DATABASE_URL defaulted to a local atom_dev.db (SQLite) because the ENVIRONMENT=production flag was missing from the Fly.io configuration.
  2. **Hardcoded Maintenance Scripts**: Maintenance scripts (direct_cleanup.py, purge_test_data.py) contained hardcoded references to brennan as "test data", creating a high risk of permanent data loss during automated cleanups.
  3. **Architecture Drift**: The app was split into fly.api.toml and fly.worker.toml, deviating from the unified "single machine" architecture documented in CLAUDE.md.

Remediation & Guardrails

  1. **Lockdown**: Added ENVIRONMENT=production to Fly.io configs.
  2. **Hardening**: Refactored core/database.py to **ABORT** if SQLite is used in a production environment.
  3. **Unification**: Re-unified the API and Worker into a single fly.toml to ensure consistent environment configuration.
  4. **Institutional Memory**: Updated CLAUDE.md to prohibit hardcoded cleanup targets for future AI agents.
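The hardening in core/database.py (guardrail 2) amounts to a fail-fast check at startup. A minimal sketch, assuming the ENVIRONMENT and DATABASE_URL variables described above; the function name and error text are illustrative:

```python
import os

def validate_database_url(env=None, url=None) -> str:
    """Refuse to start if a SQLite URL leaks into a production environment."""
    if env is None:
        env = os.getenv("ENVIRONMENT", "development")
    if url is None:
        # The silent-fallback default that caused the April 9 incident.
        url = os.getenv("DATABASE_URL", "sqlite:///atom_dev.db")
    if env == "production" and url.startswith("sqlite"):
        raise RuntimeError(
            "FATAL: SQLite DATABASE_URL in a production environment - "
            "refusing to start. Set DATABASE_URL to the Neon Postgres DSN."
        )
    return url
```

Calling this before engine creation turns the silent-fallback failure mode into a loud, immediate crash at boot, which is far easier to catch than an empty workspace at login.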

**Last Updated:** April 9, 2026 (12:15 PM)

**Status:** Unified Deployment Lock (Production-Only)