ATOM Documentation


Infrastructure Constraints & Scaling Guide

This document records the production infrastructure constraints for the ATOM SaaS platform, established following the stability event on April 9, 2026.

🗄️ Database: NeonDB

Current Configuration

  • **Plan:** Launch (Standard)
  • **Primary Limit:** ~10,000 pooled connections on the Launch plan (up from the Free tier's 20).

Scaling Formula

To ensure stability across multiple Fly.io machines, we use the following formula for connection pooling:

Total Connections = Num Machines * (DB_POOL_SIZE + DATABASE_MAX_OVERFLOW)

**Safe Tuning (Launch Plan):**

  • DB_POOL_SIZE = 15
  • DATABASE_MAX_OVERFLOW = 5
  • **Result:** 20 connections per machine. With 10 machines, we only use 200 connections, well within the Launch plan's headroom.
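The formula above maps directly onto SQLAlchemy's engine pool settings. A minimal sketch (the commented `create_engine` call and the DSN placeholder are illustrative, not the exact code in the repo; the env var names match those used above):

```python
import os

# Per-machine pool settings from the Safe Tuning values above.
DB_POOL_SIZE = int(os.getenv("DB_POOL_SIZE", "15"))
DATABASE_MAX_OVERFLOW = int(os.getenv("DATABASE_MAX_OVERFLOW", "5"))

def total_connections(num_machines: int) -> int:
    """Total Connections = Num Machines * (DB_POOL_SIZE + DATABASE_MAX_OVERFLOW)."""
    return num_machines * (DB_POOL_SIZE + DATABASE_MAX_OVERFLOW)

# Engine construction (sketch; the real DSN comes from DATABASE_URL):
# from sqlalchemy import create_engine
# engine = create_engine(
#     os.environ["DATABASE_URL"],
#     pool_size=DB_POOL_SIZE,              # steady-state connections per machine
#     max_overflow=DATABASE_MAX_OVERFLOW,  # burst connections beyond the pool
#     pool_pre_ping=True,                  # drop dead connections after restarts
# )

if __name__ == "__main__":
    print(total_connections(10))  # 10 machines -> 200 connections
```

Run the helper before changing machine counts to confirm the fleet stays within the plan's pooled-connection headroom.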

⚠️ Upgrade Path

If you scale beyond 20 concurrent machines or increase pool depth significantly:

  1. **Monitor PgBouncer Stats**: Check Neon dashboard for "Waiting for connection".
  2. **Upgrade Plan**: Scale to "Scale" plan if direct connection limits (for migrations/long-lived tasks) become a bottleneck.

---

🏎️ Caching: Upstash Redis

Incident Report (April 8-9, 2026)

  • **Symptom:** 12M+ Redis reads in 24 hours.
  • **Root Cause:**
    1. **SQLAlchemy Crash Loop:** A MapperConfigurationError on the skill_versions relationship caused the backend to crash and restart repeatedly.
    2. **Initialization Reads:** Every startup triggered a full initialization of background services, each performing Redis pings and metadata lookups.
    3. **Missing Caching:** Tenant lookups were not cached in Redis, falling back to direct DB queries for every request.

Stabilization Measures

  1. **Kill Switch**: SUSPEND_REDIS=true environment variable to bypass all Redis traffic during outages. (Falls back to a local in-memory cache so the app remains functional.)
  2. **Mapping Fix**: Resolved the SkillVersion.tenant relationship conflict and duplicate class definition in backend-saas/core/models.py.
  3. **Startup Resilience**: Wrapped lifespan initialization in backend-saas/main_api_app.py with granular error handlers for InterfaceError and OperationalError.
  4. **Tenant Caching**: Implemented Redis/Memory caching for tenant lookups.
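Measures 1 and 4 can be combined in a small cache facade. A hedged sketch, not the repo's actual implementation: the db_lookup callback and the tenant: key prefix are hypothetical names, and the Redis client is assumed to follow the redis-py get/set interface:

```python
import json
import os
from typing import Optional

# In-process fallback used when SUSPEND_REDIS=true or no client is available.
_memory_cache: dict = {}

def redis_suspended() -> bool:
    # Kill switch: SUSPEND_REDIS=true bypasses all Redis traffic during outages.
    return os.getenv("SUSPEND_REDIS", "false").lower() == "true"

def cache_get(key: str, redis_client=None) -> Optional[str]:
    if redis_suspended() or redis_client is None:
        return _memory_cache.get(key)
    return redis_client.get(key)

def cache_set(key: str, value: str, redis_client=None, ttl: int = 300) -> None:
    if redis_suspended() or redis_client is None:
        _memory_cache[key] = value
    else:
        redis_client.set(key, value, ex=ttl)

def get_tenant(tenant_id: str, redis_client=None, db_lookup=None) -> dict:
    """Cache-aside tenant lookup: hit the cache first, the DB only on a miss."""
    cached = cache_get(f"tenant:{tenant_id}", redis_client)
    if cached is not None:
        return json.loads(cached)
    tenant = db_lookup(tenant_id)  # direct DB query only on cache miss
    cache_set(f"tenant:{tenant_id}", json.dumps(tenant), redis_client)
    return tenant
```

The cache-aside shape is what prevents a crash loop from amplifying into millions of Redis/DB reads: repeated lookups for the same tenant are absorbed by the cache, and the kill switch removes Redis from the path entirely.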

---

🕰️ Scheduler: QStash

Cleanup Procedures

Frequent backend restarts can result in many "Dead Letter" or retry messages in QStash.

  • **Utility:** backend-saas/scripts/clean_all_upstash_and_qstash.py
  • **Action:** Run this script after any major deployment failure to clear the queue.
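The repo script is the source of truth; for reference, clearing the QStash dead-letter queue by hand looks roughly like the sketch below. This is an assumption-laden illustration: the /v2/dlq list and delete endpoints are from Upstash's public QStash REST API, QSTASH_TOKEN is assumed to be set, and pagination via cursors is omitted for brevity:

```python
import json
import os
import urllib.request

QSTASH_BASE = "https://qstash.upstash.io/v2"

def dlq_request(token: str, path: str, method: str = "GET") -> urllib.request.Request:
    """Build an authenticated request against the QStash REST API."""
    return urllib.request.Request(
        f"{QSTASH_BASE}{path}",
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )

def purge_dlq(token: str) -> int:
    """Delete every message currently in the dead-letter queue; returns the count."""
    deleted = 0
    while True:
        with urllib.request.urlopen(dlq_request(token, "/dlq")) as resp:
            messages = json.load(resp).get("messages", [])
        if not messages:
            return deleted
        for msg in messages:
            urllib.request.urlopen(
                dlq_request(token, f"/dlq/{msg['dlqId']}", method="DELETE")
            )
            deleted += 1

if __name__ == "__main__":
    print(f"Purged {purge_dlq(os.environ['QSTASH_TOKEN'])} DLQ messages")
```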

---


🛑 Recent Incident: Brennan Account Loss & SQLite Fallback (April 9, 2026)

Symptom

  • **Account "Deletion"**: User rish@brennan.ca was unable to log in despite having a valid record in the production database.
  • **Data Mismatch**: Logins resulted in empty workspaces and no tenant data.

Root Causes

  1. **Silent SQLite Fallback**: The app's DATABASE_URL defaulted to a local atom_dev.db (SQLite) because the ENVIRONMENT=production flag was missing from the Fly.io configuration.
  2. **Hardcoded Maintenance Scripts**: Maintenance scripts (direct_cleanup.py, purge_test_data.py) contained hardcoded references to brennan as "test data", creating a high risk of permanent data loss during automated cleanups.
  3. **Architecture Drift**: The app was split into fly.api.toml and fly.worker.toml, deviating from the unified "single machine" architecture documented in CLAUDE.md.

Remediation & Guardrails

  1. **Lockdown**: Added ENVIRONMENT=production to Fly.io configs.
  2. **Hardening**: Refactored core/database.py to **ABORT** if SQLite is used in a production environment.
  3. **Unification**: Re-unified the API and Worker into a single fly.toml to ensure consistent environment configuration.
  4. **Institutional Memory**: Updated CLAUDE.md to prohibit hardcoded cleanup targets for future AI agents.
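The hardening in core/database.py (guardrail 2) amounts to a fail-fast check at startup. A minimal sketch, assuming the ENVIRONMENT and DATABASE_URL variables described above; the function name and error text are illustrative:

```python
import os

def validate_database_url(env=None, url=None) -> str:
    """Refuse to start if a SQLite URL leaks into a production environment."""
    if env is None:
        env = os.getenv("ENVIRONMENT", "development")
    if url is None:
        # The silent-fallback default that caused the April 9 incident.
        url = os.getenv("DATABASE_URL", "sqlite:///atom_dev.db")
    if env == "production" and url.startswith("sqlite"):
        raise RuntimeError(
            "FATAL: SQLite DATABASE_URL in a production environment - "
            "refusing to start. Set DATABASE_URL to the Neon Postgres DSN."
        )
    return url
```

Calling this before engine creation turns the silent-fallback failure mode into a loud, immediate crash at boot, which is far easier to catch than an empty workspace at login.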

**Last Updated:** April 9, 2026 (12:15 PM)

**Status:** Unified Deployment Lock (Production-Only)