ATOM Documentation

Deployment Playbook: Atom SaaS on Fly.io

**Last Updated:** 2026-05-18

**Purpose:** Prevent recurring deployment failures and ensure zero-downtime deployments.

---

🔴 Critical Issues We Keep Repeating

Issue 1: `docker-entrypoint.sh: No such file or directory`

**Root Cause:**

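  1. Fly.io cannot build the Next.js frontend (2GB RAM is insufficient), so the Docker build copies a prebuilt .next/standalone instead
  2. If the local .next folder is stale or missing, the Docker build fails silently, with no obvious error at build time
  3. The build cache can contain stale layers that lack the entrypoint script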

**Symptoms:**

ERROR: failed to spawn command: [/app/docker-entrypoint.sh web]: No such file or directory
Virtual machine exited abruptly

**Fix:** ALWAYS build the frontend locally first, then deploy with --no-cache so stale cached layers are not reused.

# From project root (NOT backend-saas):
rm -rf .next && npm run build
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache
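
The failure mode above can be caught before deploying with a small guard. This is a minimal sketch, not the project's actual tooling: the function name `check_build_artifacts` is hypothetical, and the default entrypoint location (`docker-entrypoint.sh` at the project root) is an assumption, since this playbook only gives its path inside the image.

```bash
# check_build_artifacts ROOT [ENTRYPOINT_REL]
# Returns 0 only if ROOT/.next/standalone exists and is non-empty and the
# entrypoint script is present.
# ASSUMPTION: ENTRYPOINT_REL defaults to docker-entrypoint.sh at the project
# root; pass the real relative path if your repo keeps it elsewhere.
check_build_artifacts() {
  local root="${1:-.}"
  local entrypoint="${2:-docker-entrypoint.sh}"
  local status=0

  if [ ! -d "$root/.next/standalone" ] ||
     [ -z "$(ls -A "$root/.next/standalone" 2>/dev/null)" ]; then
    echo "missing or empty: $root/.next/standalone (run: rm -rf .next && npm run build)" >&2
    status=1
  fi

  if [ ! -f "$root/$entrypoint" ]; then
    echo "missing: $root/$entrypoint" >&2
    status=1
  fi

  return "$status"
}

# Example (from project root):
#   check_build_artifacts . && fly deploy -a atom-saas --strategy immediate \
#     --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache
```

The guard only proves the files exist. It cannot tell whether .next is stale, which is why the `rm -rf .next && npm run build` step still comes first.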

Issue 2: Machines in "stopped" or "replacing" state after deployment

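**Root Cause:** The immediate strategy replaces all machines at once without waiting for health checks, so when the new processes fail to start, nothing is left running.

**Fix:** Restart the affected machines explicitly, then confirm they are healthy. The restart command below skips the health-check wait, so verify the machine state yourself afterwards.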

fly machines restart <machine-id> -a atom-saas --skip-health-checks
# Wait 30-60 seconds, then verify
fly machines list -a atom-saas
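
Reading the machine list by eye is easy to get wrong when several machines are involved. The sketch below checks the states mechanically. The helper name `all_started` is hypothetical, and the usage line assumes that `fly machines list --json` emits an array of objects with a `state` field; confirm that against your flyctl version before relying on it.

```bash
# all_started: reads one machine state per line on stdin.
# Returns 0 only if at least one state was read and every state is "started".
all_started() {
  local state
  local count=0
  while IFS= read -r state; do
    if [ -z "$state" ]; then
      continue
    fi
    count=$((count + 1))
    if [ "$state" != "started" ]; then
      return 1
    fi
  done
  [ "$count" -gt 0 ]
}

# Example (ASSUMPTION: JSON output shape as described in the lead-in):
#   fly machines list -a atom-saas --json | jq -r '.[].state' | all_started \
#     && echo "all machines started"
```

An empty list counts as failure on purpose, so a failed or empty `fly` call is not mistaken for a healthy fleet.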

Issue 3: Code changed but processes still run old code

**Root Cause:** The files on disk were updated, but the running process still holds the old code in memory until it is restarted.

**Verification:** Always check the health endpoint's deployed_sha against git HEAD.

git rev-parse --short HEAD  # Local HEAD
curl -s https://app.atomagentos.com/api/health | jq -r '.deployed_sha'
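
The two values above can be compared in a script instead of by eye. This is a rough sketch with a hypothetical helper name, `sha_matches`. It treats an empty, `null`, or `unknown` deployed SHA as "cannot verify" rather than as a mismatch.

```bash
# sha_matches LOCAL DEPLOYED
# Returns 0 if both SHAs refer to the same commit, tolerating different
# abbreviation lengths (one may be a prefix of the other).
# Returns 1 on a real mismatch.
# Returns 2 if either value is unusable (empty, "null", or "unknown").
sha_matches() {
  local local_sha="$1"
  local deployed="$2"

  if [ -z "$local_sha" ]; then
    return 2
  fi
  case "$deployed" in
    ""|null|unknown) return 2 ;;
  esac

  case "$local_sha" in
    "$deployed"*) return 0 ;;
  esac
  case "$deployed" in
    "$local_sha"*) return 0 ;;
  esac
  return 1
}

# Example:
#   sha_matches "$(git rev-parse --short HEAD)" \
#     "$(curl -s https://app.atomagentos.com/api/health | jq -r '.deployed_sha')"
```

Keeping "unknown" separate from a mismatch matters here: a mismatch means old code is running, while "unknown" only means the SHA was never baked into the image.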

---

✅ Standard Deployment Procedure

1. Pre-Flight Checks

# Check current commit
git rev-parse --short HEAD

# Verify no uncommitted changes (unless intentional)
git status

# Check all tests pass
npm run test:e2e  # Or relevant test suite

2. Build Frontend (REQUIRED - Fly.io can't build it)

# From PROJECT ROOT (not backend-saas)
# Fly.io copies .next/standalone - it doesn't have RAM to build Next.js
rm -rf .next && npm run build

3. Deploy with Correct Strategy

# From project root, use immediate strategy with VERSION_SHA
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD)

# If entrypoint/Dockerfile changed, add --no-cache
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache

4. Wait for Deployment to Complete

# Watch the deployment
fly status -a atom-saas

# Machines should transition: replacing → started
# Wait until all show "started" with 1/1 checks
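
Waiting can be scripted with a small polling helper rather than re-running the status command by hand. The sketch below is generic. `wait_until` is a hypothetical name, and the attempt count and delay in the example are illustrative values, not tuned for this app.

```bash
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted, sleeping
# DELAY_SECONDS between tries. Returns 0 on success, 1 on timeout.
wait_until() {
  local attempts="$1"
  local delay="$2"
  local i=1
  shift 2

  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: poll the health endpoint for up to about a minute.
#   wait_until 12 5 curl -fsS -o /dev/null https://app.atomagentos.com/api/health
```

A bounded retry keeps a broken deployment from hanging a terminal or CI job forever; on timeout, move on to the checks in the next step.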

5. Verify Machines Actually Started

# List machines - ALL should be started with 1/1 checks
fly machines list -a atom-saas

# If any show stopped/replacing, restart them:
fly machines restart <machine-id> -a atom-saas

6. Verify Deployment in Running Code

# Method 1: Health endpoint (most reliable)
curl -s https://app.atomagentos.com/api/health | jq '.'
# Check uptime is low (<300s = recent restart)
# Check deployed_sha matches local (may show "unknown" if VERSION.txt not baked)

# Method 2: Check logs for startup
fly logs -a atom-saas | grep "STARTUP_TASKS"
# Should see recent startup messages

# Method 3: Check specific function exists in running code
fly ssh console -a atom-saas --command 'python3 -c "from core.webhook_delivery_service import WebhookDeliveryService; print(\"OK\")"'

7. Verify Key Functionality

# Health check
curl -s https://app.atomagentos.com/api/health

# Database connectivity
# (Check logs for "Database query success")

# Webhook endpoint reachable
curl -X POST https://app.atomagentos.com/api/test/webhook

---

🚨 When Things Go Wrong

Symptom: "machine exited abruptly"

**Diagnosis:**

fly logs -a atom-saas | grep "ERROR"

**Common Causes:**

  1. Missing entrypoint → Deploy with --no-cache
  2. Import error in startup → Check logs for ModuleNotFoundError
  3. Database connection failure → Check DATABASE_URL env var
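
The three causes above can be triaged automatically from captured log text. This is a sketch only: `classify_crash` is a hypothetical helper, the entrypoint and ModuleNotFoundError patterns come from this playbook, and the database error strings are illustrative guesses that should be replaced with whatever your logs actually print.

```bash
# classify_crash: reads log text on stdin and prints one line per likely
# cause. Prints "unknown" if nothing matches.
classify_crash() {
  local logs
  local found=0
  logs=$(cat)

  if printf '%s\n' "$logs" | grep -q 'docker-entrypoint.sh.*No such file or directory'; then
    echo "missing-entrypoint: rebuild frontend, redeploy with --no-cache"
    found=1
  fi
  if printf '%s\n' "$logs" | grep -q 'ModuleNotFoundError'; then
    echo "import-error: check startup imports"
    found=1
  fi
  # ASSUMPTION: these database error strings are guesses, not taken from
  # the app's real logs.
  if printf '%s\n' "$logs" | grep -Eiq 'connection refused|could not connect to server'; then
    echo "database: check DATABASE_URL env var"
    found=1
  fi

  if [ "$found" -eq 0 ]; then
    echo "unknown"
  fi
}

# Example: save some log output to a file first, then run
#   classify_crash < crash.log
```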

Symptom: Machines stuck in "replacing"

**Fix:**

# Force restart all machines
fly machines restart --all -a atom-saas

# If that fails, scale down then up
fly scale count 0 -a atom-saas
fly scale count 3 -a atom-saas

Symptom: 404 errors on webhook renewal

**This is expected behavior:** the subscription was deleted on Microsoft's side.

The new auto-recreate feature should handle this. Monitor the logs to confirm:

fly logs -a atom-saas | grep -i "recreate\|renew"

Symptom: Machines reach max restart count

**Fix:**

# This usually means the app crashes on startup
# Check the crash logs:
fly logs -a atom-saas | head -200

# Common fixes:
# 1. Build frontend + deploy with --no-cache
rm -rf .next && npm run build
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache
# 2. Check for Python syntax errors in recent changes
# 3. Check for missing dependencies in requirements.txt

---

📋 Deployment Checklist

Use this checklist for EVERY deployment:

  • [ ] Current commit is intended for production
  • [ ] Tests pass locally
  • [ ] No uncommitted breaking changes
  • [ ] **Build frontend locally: rm -rf .next && npm run build**
  • [ ] Deploy with VERSION_SHA: fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD)
  • [ ] If Dockerfile/entrypoint/frontend changed: add --no-cache
  • [ ] Wait for all machines to show "started" (1/1 checks)
  • [ ] If machines stopped: explicitly restart them
  • [ ] Verify health endpoint responds
  • [ ] Verify uptime < 5 minutes (indicates recent restart)
  • [ ] Check logs for startup success message
  • [ ] Verify critical endpoints work (health, webhooks)
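
The build and deploy items of the checklist can be wrapped in a script with a dry-run mode, so the exact commands can be reviewed before anything runs. This is a minimal sketch, not an existing project script: the function names `run` and `deploy` and the `DRY_RUN` variable are hypothetical.

```bash
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# deploy SHA [extra fly deploy flags...]
# Rebuilds the frontend, then deploys with VERSION_SHA baked in.
# Must be called from the project root. Stops at the first failing step.
deploy() {
  if [ "$#" -lt 1 ] || [ -z "$1" ]; then
    echo "usage: deploy SHA [fly deploy flags...]" >&2
    return 2
  fi
  local sha="$1"
  shift

  run rm -rf .next &&
    run npm run build &&
    run fly deploy -a atom-saas --strategy immediate \
      --build-arg "VERSION_SHA=$sha" "$@"
}

# Example:
#   DRY_RUN=1 deploy "$(git rev-parse --short HEAD)" --no-cache   # preview
#   deploy "$(git rev-parse --short HEAD)" --no-cache             # for real
```

The post-deploy items (machine state, health endpoint, logs) are left out on purpose, since they still need a human to confirm them.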

---

🔧 Useful Commands

# STANDARD DEPLOYMENT (from project root)
rm -rf .next && npm run build && \
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD)

# DEPLOY with --no-cache (if Dockerfile/entrypoint changed)
rm -rf .next && npm run build && \
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache

# Quick health check
fly status -a atom-saas && fly machines list -a atom-saas

# Stream logs in real time (fly logs streams by default; add --no-tail for a one-shot dump)
fly logs -a atom-saas

# SSH into a running machine
fly ssh console -a atom-saas

# Check specific machine logs
fly logs -a atom-saas -m <machine-id>

# Force restart all machines
fly machines restart --all -a atom-saas

# Scale to zero then back up (hard reset)
fly scale count 0 -a atom-saas && sleep 5 && fly scale count 3 -a atom-saas

# Check what code is actually running
fly ssh console -a atom-saas --command "head -5 /app/backend-saas/core/<some_file>.py"

---

Related Documentation

  • CLAUDE.md - Main project documentation
  • INFRASTRUCTURE_CONSTRAINTS.md - Production limits
  • .planning/WEBHOOK_RELIABILITY_PLAN.md - Webhook healing system
  • docs/incidents/ - Post-mortem analyses of past incidents

---

**Rule of Thumb:** If a deployment fails the same way twice, document the fix HERE.