Deployment Playbook: Atom SaaS on Fly.io
**Last Updated:** 2026-05-18
**Purpose:** Prevent recurring deployment failures and ensure zero-downtime deployments.
---
🔴 Critical Issues We Keep Repeating
Issue 1: `docker-entrypoint.sh: No such file or directory`
**Root Cause:**
- Fly.io can't build the Next.js frontend (2GB RAM is insufficient), so it copies `.next/standalone`
- If the `.next` folder is stale or missing, the Docker build fails silently
- The build cache contains stale layers without the entrypoint script
**Symptoms:**
ERROR: failed to spawn command: [/app/docker-entrypoint.sh web]: No such file or directory
Virtual machine exited abruptly
**Fix:** ALWAYS build the frontend locally first, then deploy with `--no-cache`.
# From project root (NOT backend-saas):
rm -rf .next && npm run build
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache
Issue 2: Machines in "stopped" or "replacing" state after deployment
**Root Cause:** The `immediate` strategy kills the old processes, but the new processes fail to start.
**Fix:** Restart the machines explicitly, then confirm that health checks pass.
fly machines restart <machine-id> -a atom-saas --skip-health-checks
# Wait 30-60 seconds, then verify
fly machines list -a atom-saas
Issue 3: Code changed but processes still run old code
**Root Cause:** Files on disk were updated, but the running processes still hold the old code in memory.
**Verification:** Always check the health endpoint's deployed_sha against git HEAD.
git rev-parse --short HEAD # Local HEAD
curl -s https://app.atomagentos.com/api/health | jq -r '.deployed_sha'
---
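The `deployed_sha` check from Issue 3 can be wrapped in a small helper so the comparison is not done by eye. This is a minimal sketch, not part of the project: `sha_matches` is a hypothetical function name, and treating `unknown` or `null` as "cannot verify" is an assumption based on the note that the endpoint may report `unknown` when the version is not baked into the image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the local HEAD with the SHA the health endpoint reports.
# Returns 0 on match, 1 on mismatch, 2 when the deployment cannot be verified.
sha_matches() {
  local local_sha="$1" deployed_sha="$2"
  # Nothing to compare against: treat as "cannot verify".
  [ -n "$local_sha" ] || return 2
  case "$deployed_sha" in
    ''|unknown|null) return 2 ;;
  esac
  # Accept a short SHA on either side (one value is a prefix of the other).
  case "$deployed_sha" in "$local_sha"*) return 0 ;; esac
  case "$local_sha" in "$deployed_sha"*) return 0 ;; esac
  return 1
}

# Usage, with the commands already used in this playbook:
#   sha_matches "$(git rev-parse --short HEAD)" \
#     "$(curl -s https://app.atomagentos.com/api/health | jq -r '.deployed_sha')" \
#     && echo "running code matches HEAD"
```

Returning a distinct status for the unverifiable case keeps an `unknown` SHA from being mistaken for either a pass or a genuine mismatch.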
✅ Standard Deployment Procedure
1. Pre-Flight Checks
# Check current commit
git rev-parse --short HEAD
# Verify no uncommitted changes (unless intentional)
git status
# Check all tests pass
npm run test:e2e # Or relevant test suite
2. Build Frontend (REQUIRED - Fly.io can't build it)
# From PROJECT ROOT (not backend-saas)
# Fly.io copies .next/standalone - it doesn't have RAM to build Next.js
rm -rf .next && npm run build
3. Deploy with Correct Strategy
# From project root, use immediate strategy with VERSION_SHA
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD)
# If entrypoint/Dockerfile changed, add --no-cache
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache
4. Wait for Deployment to Complete
# Watch the deployment
fly status -a atom-saas
# Machines should transition: replacing → started
# Wait until all show "started" with 1/1 checks
5. Verify Machines Actually Started
# List machines - ALL should be started with 1/1 checks
fly machines list -a atom-saas
# If any show stopped/replacing, restart them:
fly machines restart <machine-id> -a atom-saas
6. Verify Deployment in Running Code
# Method 1: Health endpoint (most reliable)
curl -s https://app.atomagentos.com/api/health | jq '.'
# Check uptime is low (<300s = recent restart)
# Check deployed_sha matches local (may show "unknown" if VERSION.txt not baked)
# Method 2: Check logs for startup
fly logs -a atom-saas | grep "STARTUP_TASKS"
# Should see recent startup messages
# Method 3: Check specific function exists in running code
fly ssh console -a atom-saas --command 'python3 -c "from core.webhook_delivery_service import WebhookDeliveryService; print(\"OK\")"'
7. Verify Key Functionality
# Health check
curl -s https://app.atomagentos.com/api/health
# Database connectivity
# (Check logs for "Database query success")
# Webhook endpoint reachable
curl -X POST https://app.atomagentos.com/api/test/webhook
---
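Steps 4 to 6 of the procedure above come down to two conditions: every machine reports "started", and the reported uptime is low enough to indicate a fresh restart. The sketch below expresses both as small functions. The names `all_started` and `is_recent_restart` are hypothetical, and the usage lines assume that `fly machines list --json` emits a `state` field per machine and that the health endpoint exposes uptime in seconds under `.uptime`; verify both against real output before relying on them.

```shell
#!/usr/bin/env bash
# all_started: reads one machine state per line on stdin.
# Succeeds only if at least one state was read and every state is "started".
all_started() {
  local state count=0
  while IFS= read -r state || [ -n "$state" ]; do
    [ "$state" = "started" ] || return 1
    count=$((count + 1))
  done
  [ "$count" -gt 0 ]
}

# is_recent_restart: succeeds if the uptime in seconds is below the limit
# (default 300, the threshold used in this playbook).
# Returns 2 if the uptime value is not a number.
is_recent_restart() {
  local uptime="${1%%.*}" limit="${2:-300}"   # drop any fractional part
  case "$uptime" in ''|*[!0-9]*) return 2 ;; esac
  [ "$uptime" -lt "$limit" ]
}

# Usage (field names are assumptions, see above):
#   fly machines list -a atom-saas --json | jq -r '.[].state' | all_started
#   is_recent_restart "$(curl -s https://app.atomagentos.com/api/health | jq -r '.uptime')"
```

Requiring at least one state line means an empty machine list, for example after a failed scale-up, does not count as healthy.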
🚨 When Things Go Wrong
Symptom: "machine exited abruptly"
**Diagnosis:**
fly logs -a atom-saas | grep "ERROR"
**Common Causes:**
- Missing entrypoint → Deploy with `--no-cache`
- Import error in startup → Check logs for `ModuleNotFoundError`
- Database connection failure → Check the `DATABASE_URL` env var
Symptom: Machines stuck in "replacing"
**Fix:**
# Force restart all machines
fly machines restart --all -a atom-saas
# If that fails, scale down then up
fly scale count 0 -a atom-saas
fly scale count 3 -a atom-saas
Symptom: 404 errors on webhook renewal
**This is expected behavior** - subscription was deleted by Microsoft.
The new auto-recreate feature should handle this. Monitor:
fly logs -a atom-saas | grep -i "recreate\|renew"
Symptom: Machines reach max restart count
**Fix:**
# This usually means the app crashes on startup
# Check the crash logs:
fly logs -a atom-saas | head -200
# Common fixes:
# 1. Build frontend + deploy with --no-cache
rm -rf .next && npm run build
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache
# 2. Check for Python syntax errors in recent changes
# 3. Check for missing dependencies in requirements.txt
---
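When several machines are left in "stopped" or "replacing", restarting them one at a time by hand is easy to get wrong. The following sketch restarts only the machines that are not "started", using the `fly machines restart <machine-id> -a atom-saas` command from this playbook. `restart_not_started` is a hypothetical helper, and the `id` and `state` fields in the usage line are assumptions about the JSON output of `fly machines list`.

```shell
#!/usr/bin/env bash
# restart_not_started: reads "<machine-id> <state>" lines on stdin and restarts
# every machine whose state is not "started". The app name defaults to atom-saas.
restart_not_started() {
  local app="${1:-atom-saas}" id state
  while read -r id state || [ -n "$id" ]; do
    [ -n "$id" ] || continue            # skip blank lines
    if [ "$state" != "started" ]; then
      echo "restarting $id (state: $state)"
      # Redirect stdin so fly cannot consume the remaining input lines.
      fly machines restart "$id" -a "$app" </dev/null
    fi
  done
}

# Usage (JSON field names are assumptions):
#   fly machines list -a atom-saas --json \
#     | jq -r '.[] | "\(.id) \(.state)"' \
#     | restart_not_started atom-saas
```

After it runs, re-check the machine list rather than assuming the restarts succeeded, since a machine that crashes on startup will fall back to "stopped".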
📋 Deployment Checklist
Use this checklist for EVERY deployment:
- [ ] Current commit is intended for production
- [ ] Tests pass locally
- [ ] No uncommitted breaking changes
- [ ] **Build frontend locally: `rm -rf .next && npm run build`**
- [ ] Deploy with VERSION_SHA: `fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD)`
- [ ] If Dockerfile/entrypoint/frontend changed: add `--no-cache`
- [ ] Wait for all machines to show "started" (1/1 checks)
- [ ] If machines stopped: explicitly restart them
- [ ] Verify health endpoint responds
- [ ] Verify uptime < 5 minutes (indicates recent restart)
- [ ] Check logs for startup success message
- [ ] Verify critical endpoints work (health, webhooks)
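The first few checklist items can be enforced mechanically before `fly deploy` runs. The sketch below is one possible guard, with hypothetical function names. It only confirms that `.next/standalone` exists and is non-empty and that the working tree has no uncommitted changes. It cannot tell whether the build is stale, which is why the playbook still rebuilds with `rm -rf .next && npm run build` every time.

```shell
#!/usr/bin/env bash
# standalone_build_present: succeeds if <root>/.next/standalone exists
# and contains at least one entry. <root> defaults to the current directory.
standalone_build_present() {
  local root="${1:-.}"
  local dir="$root/.next/standalone"
  # ls -A lists all entries except . and .., so empty output means an empty dir.
  [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]
}

# worktree_is_clean: succeeds if git reports no uncommitted or untracked changes.
worktree_is_clean() {
  # `git status --porcelain` prints nothing when the tree is clean.
  [ -z "$(git -C "${1:-.}" status --porcelain)" ]
}

# Usage from the project root:
#   standalone_build_present . || { echo "run: rm -rf .next && npm run build"; exit 1; }
#   worktree_is_clean . || echo "warning: uncommitted changes"
```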
---
🔧 Useful Commands
# STANDARD DEPLOYMENT (from project root)
rm -rf .next && npm run build && \
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD)
# DEPLOY with --no-cache (if Dockerfile/entrypoint changed)
rm -rf .next && npm run build && \
fly deploy -a atom-saas --strategy immediate --build-arg VERSION_SHA=$(git rev-parse --short HEAD) --no-cache
# Quick health check
fly status -a atom-saas && fly machines list -a atom-saas
# Stream logs in real-time (fly logs streams by default; add --no-tail to print buffered logs and exit)
fly logs -a atom-saas
# SSH into a running machine
fly ssh console -a atom-saas
# Check specific machine logs
fly logs -a atom-saas --machine <machine-id>
# Force restart all machines
fly machines restart --all -a atom-saas
# Scale to zero then back up (hard reset)
fly scale count 0 -a atom-saas && sleep 5 && fly scale count 3 -a atom-saas
# Check what code is actually running
fly ssh console -a atom-saas --command "head -5 /app/backend-saas/core/<some_file>.py"
---
📚 Related Documentation
- `CLAUDE.md` - Main project documentation
- `INFRASTRUCTURE_CONSTRAINTS.md` - Production limits
- `.planning/WEBHOOK_RELIABILITY_PLAN.md` - Webhook healing system
- `docs/incidents/` - Post-mortem analyses of past incidents
---
**Rule of Thumb:** If a deployment fails the same way twice, document the fix HERE.