Atom AI Labs - AI-Powered Multi-Tenant Platform

Deployment Playbook: Atom SaaS on Fly.io

**Last Updated:** 2026-05-18

**Purpose:** Prevent recurring deployment failures and ensure zero-downtime deployments.

---

🔴 Critical Issues We Keep Repeating

Issue 1: `docker-entrypoint.sh: No such file or directory`

**Root Cause:**

The Fly.io builder can't build the Next.js frontend (2 GB of RAM is insufficient), so the image copies a prebuilt .next/standalone directory instead.
If the .next folder is stale or missing, the Docker build fails silently.
The build cache can contain stale layers that lack the entrypoint script.

**Symptoms:**

ERROR: failed to spawn command: [/app/docker-entrypoint.sh web]: No such file or directory
Virtual machine exited abruptly

**Fix:** ALWAYS build frontend locally first, then deploy with --no-cache.

# From project root (NOT backend-saas):
rm -rf .next && npm run build
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache
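
Because a missing .next folder fails silently, a guard before the deploy can catch it early. The following is a minimal sketch, not part of the existing tooling: `check_standalone_build` is a hypothetical helper name, and it assumes the Next.js build uses standalone output, which emits `server.js` under .next/standalone.

```shell
# Hypothetical helper: return 0 only if the prebuilt Next.js standalone output
# exists under the given project root (default: current directory).
# Assumption: the standalone build emits .next/standalone/server.js.
check_standalone_build() {
  local root="${1:-.}"
  local server="$root/.next/standalone/server.js"
  if [ ! -f "$server" ]; then
    echo "Missing $server: run 'rm -rf .next && npm run build' first" >&2
    return 1
  fi
  return 0
}
```

Run from the project root, `check_standalone_build && fly deploy ...` would then refuse to deploy when the build output is absent. It does not detect a stale build, so the `rm -rf .next && npm run build` step is still required.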

Issue 2: Machines in "stopped" or "replacing" state after deployment

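**Root Cause:** The immediate strategy replaces all machines at once without waiting for health checks, so when the new processes fail to start, nothing healthy is left running.

**Fix:** Restart the affected machines explicitly, then confirm they pass their health checks.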

fly machines restart <machine-id> -a atom-saas --skip-health-checks
# Wait 30-60 seconds, then verify
fly machines list -a atom-saas
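
The fixed 30-60 second wait can be replaced with a bounded retry loop. This is a sketch under stated assumptions: `wait_until` is a hypothetical helper, not an existing script in this repo, and it retries whatever command it is given.

```shell
# Hypothetical helper: retry a command until it succeeds or the timeout elapses.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  while ! "$@"; do
    # Give up once the accumulated sleep time reaches the timeout.
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

For example, `wait_until 60 5 curl -fsS -o /dev/null https://app.atomagentos.com/api/health` polls the health endpoint every 5 seconds for up to a minute and returns non-zero if it never answers successfully.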

Issue 3: Code changed but processes still run old code

**Root Cause:** Files on disk were updated, but the running process still holds the old code in memory until it restarts.

**Verification:** Always check the health endpoint's deployed_sha against git HEAD.

git rev-parse --short HEAD  # Local HEAD
curl -s https://app.atomagentos.com/api/health | jq -r '.deployed_sha'
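
The two values above can be compared automatically. A minimal sketch, assuming a hypothetical helper named `sha_matches`; it treats an empty value, `null` (what `jq -r` prints for a missing key), or `unknown` as a mismatch instead of a pass.

```shell
# Hypothetical helper: compare the local short SHA with the deployed_sha
# reported by the health endpoint.
# An empty, "null", or "unknown" deployed value counts as a mismatch.
sha_matches() {
  local local_sha="$1" deployed_sha="$2"
  case "$deployed_sha" in
    ""|null|unknown) return 1 ;;
  esac
  [ "$local_sha" = "$deployed_sha" ]
}
```

A typical call would be `sha_matches "$(git rev-parse --short HEAD)" "$(curl -s https://app.atomagentos.com/api/health | jq -r '.deployed_sha')"`, which exits non-zero when the running code is not the commit you expect.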

---

✅ Standard Deployment Procedure

1. Pre-Flight Checks

# Check current commit
git rev-parse --short HEAD

# Verify no uncommitted changes (unless intentional)
git status

# Check all tests pass
npm run test:e2e  # Or relevant test suite
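
The "no uncommitted changes" check can be made to fail loudly instead of relying on reading `git status` by eye. A sketch only: `require_clean_tree` is a hypothetical name, and it treats untracked files as changes too, which may be stricter than you want.

```shell
# Hypothetical helper: return 0 when the git work tree at $1 (default: current
# directory) has no uncommitted or untracked changes, 1 when it is dirty,
# and 2 when git itself fails (for example, not a repository).
require_clean_tree() {
  local repo="${1:-.}" changes
  changes="$(git -C "$repo" status --porcelain)" || return 2
  if [ -n "$changes" ]; then
    echo "Uncommitted changes:" >&2
    echo "$changes" >&2
    return 1
  fi
  return 0
}
```

Calling `require_clean_tree || exit 1` at the top of a deploy script stops the run before a build that would not match the VERSION_SHA being baked in.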

2. Build Frontend (REQUIRED - Fly.io can't build it)

# From PROJECT ROOT (not backend-saas)
# Fly.io copies .next/standalone because it doesn't have enough RAM to build Next.js
rm -rf .next && npm run build

3. Deploy with Correct Strategy

# From project root, use immediate strategy with VERSION_SHA
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD)

# If entrypoint/Dockerfile changed, add --no-cache
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache
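
Choosing between the two commands above can be scripted from the list of changed files. This is a sketch, not the team's actual tooling: `needs_no_cache` is a hypothetical helper, and the file name `Dockerfile` is an assumption (only docker-entrypoint.sh is named in this playbook).

```shell
# Hypothetical helper: read changed file paths on stdin (one per line) and
# return 0 if any of them is a Dockerfile or the entrypoint script, meaning
# the deploy should add --no-cache.
# Assumption: the Dockerfile is literally named "Dockerfile".
needs_no_cache() {
  grep -Eq '(^|/)(Dockerfile|docker-entrypoint\.sh)$'
}
```

For example, `git diff --name-only <last-deployed-sha> HEAD | needs_no_cache && echo "add --no-cache"`, where `<last-deployed-sha>` is whatever commit is currently running.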

4. Wait for Deployment to Complete

# Watch the deployment
fly status -a atom-saas

# Machines should transition: replacing → started
# Wait until all show "started" with 1/1 checks

5. Verify Machines Actually Started

# List machines - ALL should be started with 1/1 checks
fly machines list -a atom-saas

# If any show stopped/replacing, restart them:
fly machines restart <machine-id> -a atom-saas

6. Verify Deployment in Running Code

# Method 1: Health endpoint (most reliable)
curl -s https://app.atomagentos.com/api/health | jq '.'
# Check that uptime is low (under 300s means a recent restart)
# Check that deployed_sha matches local HEAD (may show "unknown" if VERSION.txt was not baked into the image)

# Method 2: Check logs for startup
fly logs -a atom-saas | grep "STARTUP_TASKS"
# Should see recent startup messages

# Method 3: Check specific function exists in running code
fly ssh console -a atom-saas --command 'python3 -c "from core.webhook_delivery_service import WebhookDeliveryService; print(\"OK\")"'

7. Verify Key Functionality

# Health check
curl -s https://app.atomagentos.com/api/health

# Database connectivity
# (Check logs for "Database query success")

# Webhook endpoint reachable
curl -X POST https://app.atomagentos.com/api/test/webhook

---

🚨 When Things Go Wrong

Symptom: "machine exited abruptly"

**Diagnosis:**

fly logs -a atom-saas | grep "ERROR"

**Common Causes:**

Missing entrypoint → Deploy with --no-cache
Import error in startup → Check logs for ModuleNotFoundError
Database connection failure → Check DATABASE_URL env var
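
The cause-to-fix mapping above can be applied to captured log text mechanically. A rough sketch: `classify_failure` is a hypothetical helper, and the database pattern in particular is a guess, since this playbook does not give the exact error text.

```shell
# Hypothetical helper: read log text on stdin and print a suggested next step
# for each known failure signature. Returns 1 when no signature matches.
classify_failure() {
  local logs found=1
  logs="$(cat)"
  if printf '%s\n' "$logs" | grep -q 'docker-entrypoint.sh.*No such file or directory'; then
    echo "missing entrypoint: redeploy with --no-cache"
    found=0
  fi
  if printf '%s\n' "$logs" | grep -q 'ModuleNotFoundError'; then
    echo "import error: check the ModuleNotFoundError traceback"
    found=0
  fi
  # Assumption: the real database error text is not documented here; adapt
  # this pattern to what the application actually logs.
  if printf '%s\n' "$logs" | grep -Eqi 'database.*connect.*(fail|refused|error)'; then
    echo "database connection failure: check DATABASE_URL"
    found=0
  fi
  return "$found"
}
```

The function reads stdin to the end, so feed it a finite capture such as a saved log file, not a live stream.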

Symptom: Machines stuck in "replacing"

**Fix:**

# Force restart all machines
fly machines restart --all -a atom-saas

# If that fails, scale down then up
fly scale count 0 -a atom-saas
fly scale count 3 -a atom-saas

Symptom: 404 errors on webhook renewal

**This is expected behavior** when the subscription has been deleted by Microsoft.

The new auto-recreate feature should handle this. Monitor:

fly logs -a atom-saas | grep -i "recreate\|renew"

Symptom: Machines reach max restart count