Incident Report: Zombie Reaper Attack

**Date:** 2026-05-04

**Severity:** CRITICAL

**Incident Type:** Production Data Interference by Test Environment

**Status:** RESOLVED

Executive Summary

A test application (atom-saas-test) was configured to point to the **production database** and was running an outdated version of the maintenance reaper code. This caused active production jobs to be incorrectly cancelled after only 10 minutes of heartbeat staleness (instead of the 60-minute threshold in production).

Timeline

| Time (UTC) | Event |
| --- | --- |
| 10:13 | Job f648f968 started processing |
| 14:25 | Job sent its last heartbeat |
| 14:35 | 10 minutes elapsed (test app threshold) |
| 14:36 | **atom-saas-test** reaper cancelled the job with the old error message |
| 14:39 | v2068 deployed to production (heartbeat fix) |
| 14:40 | User discovered the job was cancelled |
| ~15:00 | User identified atom-saas-test as the culprit |
| 15:05 | **atom-saas-test stopped** - incident resolved |

Root Cause Analysis

The Problem

**Test App Configuration:**

  • **App Name:** atom-saas-test
  • **Database:** Shared production database URL (ep-little-poetry-ad98vm8v-pooler...)
  • **Reaper Code:** Old version (pre-leader-election, pre-structured-logging)
  • **Reaper Threshold:** 10 minutes (hard-coded)
  • **Redis Lock:** ❌ None (no leader election)

**Production Configuration:**

  • **App Name:** atom-saas
  • **Reaper Threshold:** 60 minutes
  • **Redis Lock:** ✅ Implemented (single leader elected; sketched below)
  • **Error Format:** Structured with machine_id suffix
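
For context, the production reaper's leader election can be pictured as a short-TTL Redis lock: only the machine that wins the lock runs a reap pass. A minimal sketch of the pattern, assuming redis-py and Fly's FLY_MACHINE_ID variable (the key name, TTL, and function names are illustrative, not the actual atom-saas code):

```python
import os

import redis

# REDIS_URL is an assumed secret name; the lock key and TTL are illustrative.
r = redis.Redis.from_url(os.environ["REDIS_URL"])
MACHINE_ID = os.environ.get("FLY_MACHINE_ID", "local")


def try_become_leader() -> bool:
    # SET with nx=True succeeds for exactly one machine at a time; the
    # TTL (ex=300) releases the lock automatically if the leader dies.
    return bool(r.set("reaper:leader", MACHINE_ID, nx=True, ex=300))


def reap_stale_jobs(threshold_minutes: int) -> None:
    """Placeholder for the actual pass that cancels stale-heartbeat jobs."""


if try_become_leader():
    reap_stale_jobs(threshold_minutes=60)  # production threshold, not 10
```

The old reaper in atom-saas-test predates this lock, so every test machine ran its own reap pass against the shared database.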

Why It Wasn't Detected Earlier

  1. **Test app was actively running** - atom-saas-test had been deployed only 10 minutes before the incident
  2. **No separation of concerns** - test and production shared the same database
  3. **Old error format** - the message "Abandoned (server restart or timeout)" lacked the new Reaper: {machine_id} suffix that would have immediately identified the culprit (illustrated below)
  4. **Silent interference** - the test app's reaper ran every 5 minutes without writing anything to production logs
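
To make the format gap concrete, a minimal sketch of the old versus new cancellation messages (the variable name is an assumption; the message wording comes from this report):

```python
import os

MACHINE_ID = os.environ.get("FLY_MACHINE_ID", "unknown")

# Old format, as written by atom-saas-test: no attribution at all.
old_error = "Abandoned (server restart or timeout)"

# New structured format: the suffix names the reaper that acted.
new_error = f"Abandoned (server restart or timeout) Reaper: {MACHINE_ID}"
```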

Impact

**Direct Impact:**

  • 1 backfill job (f648f968) cancelled prematurely after processing 100 entities
  • Roughly 11 minutes of work lost - the job had run for over 4 hours, but only 11 minutes had elapsed since its last heartbeat checkpoint

**Potential Impact (if not detected):**

  • ALL production backfill jobs at risk of premature cancellation
  • Continuous interference with production operations
  • Data corruption from repeated job interruptions
  • User trust erosion

Resolution

Immediate Actions Taken

  1. ✅ **Stopped atom-saas-test** (`fly scale count 0 -a atom-saas-test`)
  2. ✅ **Verified app is suspended** (no running machines)
  3. ✅ **Confirmed production app is only active instance**

Required Follow-up Actions

  1. **URGENT:** Reconfigure atom-saas-test to use a separate test database
  2. **Audit other apps** for similar misconfigurations
  3. **Add environment guards** in code (a sketch follows this list)
  4. **Restore job f648f968:**
     • Data is safe in R2 (checkpoint at 100 entities)
     • Can be resumed manually from the UI or via API
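
A minimal sketch of such a startup guard, assuming a Python codebase and APP_NAME/DATABASE_URL environment variables (both names are assumptions; the host fragment is the production database host from this incident):

```python
import os
import sys

APP_NAME = os.environ.get("APP_NAME", "")
DATABASE_URL = os.environ.get("DATABASE_URL", "")


def assert_safe_environment() -> None:
    """Refuse to boot if a test app is pointed at the production database."""
    if APP_NAME.endswith("-test") and "ep-little-poetry" in DATABASE_URL:
        sys.exit(f"FATAL: {APP_NAME} is configured against the production database")


assert_safe_environment()
```

Failing hard at boot is deliberate: a misconfigured test app should never reach the point where its reaper can touch production rows.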

Lessons Learned

Process Issues

  1. **No environment verification** - Test apps should be blocked from accessing production resources
  2. **No database access auditing** - No alerts when multiple apps connect to production DB
  3. **Stale test deployments** - Old code versions running indefinitely
  4. **No cross-app monitoring** - Each app operates in isolation

Technical Debt

  1. **Hard-coded thresholds** - Should be environment variables (see the sketch after this list)
  2. **No app identification in error messages** - Made debugging difficult
  3. **Shared infrastructure without guards** - Test/production shared same database
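
As an example of the first item, the reaper threshold could be read from an environment variable with a conservative default (a sketch; REAPER_THRESHOLD_MINUTES is a hypothetical name):

```python
import os
from datetime import timedelta

# Hypothetical variable name; defaulting to the production value means a
# missing setting can never reintroduce an aggressive 10-minute reaper.
REAPER_THRESHOLD = timedelta(
    minutes=int(os.environ.get("REAPER_THRESHOLD_MINUTES", "60"))
)
```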

Prevention Measures

  1. **Environment-based guards** - block test apps from production resources at startup (see the guard sketch under "Required Follow-up Actions")
  2. **Database access logging** (a sketch follows this list):
     • Log all new database connections with app name
     • Alert on unexpected app names
  3. **Secrets scanning:**
     • Prevent test apps from having production secrets
     • Use different secret keys for test/production
  4. **App naming conventions:**
     • Use a -test suffix for non-production apps
     • Block apps with -test from production resources
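
With Postgres, one low-effort way to get per-app connection visibility is to set application_name on every connection and audit pg_stat_activity. A sketch assuming psycopg2 and the APP_NAME variable from the guard above (the alerting logic is illustrative):

```python
import os

import psycopg2

APP_NAME = os.environ.get("APP_NAME", "unknown")

# Tag every connection so it is attributable in pg_stat_activity.
conn = psycopg2.connect(os.environ["DATABASE_URL"], application_name=APP_NAME)

# An audit job can then flag unexpected apps on the production database.
with conn.cursor() as cur:
    cur.execute(
        "SELECT application_name, count(*) "
        "FROM pg_stat_activity GROUP BY application_name"
    )
    for name, count in cur.fetchall():
        if name and name.endswith("-test"):
            print(f"ALERT: test app {name!r} holds {count} connection(s)")
```

Had atom-saas-test tagged its connections this way, the second reaper would have been visible in pg_stat_activity from the moment it deployed.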

Verification

Post-Incident Verification

  1. ✅ **atom-saas-test stopped** - No running machines
  2. ✅ **atom-saas production running** - Only active instance
  3. ✅ **No other test apps** pointing to production (verified)
  4. ⏳ **Database reconfiguration** - PENDING for atom-saas-test

Ongoing Monitoring

  1. Monitor atom-saas logs for any reaper activity
  2. Verify new backfill jobs complete successfully
  3. Watch for any unexpected "cancelled" job statuses

Credits

**Incident Detected By:** User (forensic analysis of log timestamps and error message format)

**Key Insight:** The mismatch between the 10-minute (test) and 60-minute (production) thresholds, combined with the old error message format, revealed the second reaper.

---

**Status:** ✅ RESOLVED - Rogue reaper eliminated

**Next Review:** After atom-saas-test database reconfiguration