# Incident Report: Zombie Reaper Attack
**Date:** 2026-05-04
**Severity:** CRITICAL
**Incident Type:** Production Data Interference by Test Environment
**Status:** RESOLVED
## Executive Summary
A test application (atom-saas-test) was configured to point to the **production database** and was running an outdated version of the maintenance reaper code. This caused active production jobs to be incorrectly cancelled after only 10 minutes of heartbeat staleness (instead of the 60-minute threshold in production).
## Timeline
| Time (UTC) | Event |
|---|---|
| 10:13 | Job f648f968 started processing |
| 14:25 | Job sent last heartbeat |
| 14:35 | 10 minutes elapsed (test app threshold) |
| 14:36 | **atom-saas-test** reaper cancelled job with old error message |
| 14:39 | v2068 deployed to production (heartbeat fix) |
| 14:40 | User discovered job was cancelled |
| ~15:00 | User identified atom-saas-test as the culprit |
| 15:05 | **atom-saas-test stopped** - incident resolved |
## Root Cause Analysis
### The Problem
**Test App Configuration:**
- **App Name:** `atom-saas-test`
- **Database:** Shared production database URL (`ep-little-poetry-ad98vm8v-pooler...`)
- **Reaper Code:** Old version (pre-leader-election, pre-structured-logging)
- **Reaper Threshold:** 10 minutes (hard-coded)
- **Redis Lock:** ❌ None (no leader election)
**Production Configuration:**
- **App Name:** `atom-saas`
- **Reaper Threshold:** 60 minutes
- **Redis Lock:** ✅ Implemented (single leader elected)
- **Error Format:** Structured with `machine_id` suffix
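The leader election that the production reaper has (and the test app lacked) can be sketched roughly as follows. This is a hedged illustration using `redis-py` semantics; the key name, TTL, and client wiring are assumptions, not the actual implementation.

```python
import socket

def try_acquire_reaper_lock(redis_client, ttl_seconds: int = 300) -> bool:
    """Attempt to become the single reaper leader.

    SET with NX + EX is atomic: only one instance can create the key,
    so a second (e.g. misconfigured test) instance's reaper would simply
    skip its sweep. Key name and TTL are illustrative assumptions.
    """
    machine_id = socket.gethostname()
    # redis-py returns True if the key was set, None if it already existed.
    return bool(
        redis_client.set("reaper:leader", machine_id, nx=True, ex=ttl_seconds)
    )
```

If the lock is already held, the call returns `False` and the instance skips its reap cycle until the TTL expires or the leader refreshes the key. Had the test app run this check against the shared Redis, its reaper would have stayed idle.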
### Why It Wasn't Detected Earlier
- **Test app was deployed** 10 minutes before the incident, meaning it was actively running
- **No separation of concerns** - test and production shared the same database
- **Old code format** - the error message "Abandoned (server restart or timeout)" didn't have the new `Reaper: {machine_id}` suffix, which would have immediately identified the culprit
- **Silent interference** - the test app's reaper ran every 5 minutes without logging to production logs
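The structured format described above can be illustrated with a small helper. The exact message layout is an assumption based on the suffix convention this report describes, not the actual production code.

```python
def reaper_error_message(machine_id: str) -> str:
    """Build a cancellation message that identifies the acting reaper.

    The old format ("Abandoned (server restart or timeout)") carried no
    origin, so a rogue reaper was indistinguishable from the real one.
    Appending the machine_id makes the acting instance traceable; the
    exact layout here is assumed for illustration.
    """
    return f"Abandoned (server restart or timeout) - Reaper: {machine_id}"
```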
## Impact
**Direct Impact:**
- 1 backfill job (`f648f968`) cancelled prematurely after processing 100 entities
- 11 minutes of work lost (the job had 4+ hours of runtime, but only 11 minutes had elapsed since its last heartbeat)
**Potential Impact (if not detected):**
- ALL production backfill jobs at risk of premature cancellation
- Continuous interference with production operations
- Data corruption from repeated job interruptions
- User trust erosion
## Resolution
### Immediate Actions Taken
- ✅ **Stopped atom-saas-test** (`fly scale count 0 -a atom-saas-test`)
- ✅ **Verified app is suspended** (no running machines)
- ✅ **Confirmed production app is only active instance**
### Required Follow-up Actions
- **URGENT:** Reconfigure `atom-saas-test` to use a separate test database
- **Audit other apps** for similar misconfigurations
- **Add environment guards** in code
- **Restore job `f648f968`:**
  - Data is safe in R2 (checkpoint at 100 entities)
  - Can be resumed manually from the UI or via API
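The environment guard called for above could look roughly like this sketch. The `-test` suffix convention comes from this report; the injected `production_host` parameter is a hypothetical stand-in for the real hostname, which would come from deployment secrets.

```python
def assert_safe_environment(app_name: str, database_url: str,
                            production_host: str) -> None:
    """Refuse to start a test app against the production database.

    `production_host` is a hypothetical config value injected at deploy
    time; this is a sketch of the guard, not the actual implementation.
    """
    if app_name.endswith("-test") and production_host in database_url:
        raise RuntimeError(
            f"{app_name} is pointed at the production database; "
            "refusing to start."
        )
```

Called at startup, this would have made `atom-saas-test` crash loudly on deploy instead of silently reaping production jobs for months of potential runtime.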
## Lessons Learned
### Process Issues
- **No environment verification** - Test apps should be blocked from accessing production resources
- **No database access auditing** - No alerts when multiple apps connect to production DB
- **Stale test deployments** - Old code versions running indefinitely
- **No cross-app monitoring** - Each app operates in isolation
### Technical Debt
- **Hard-coded thresholds** - Should be environment variables
- **No app identification in error messages** - Made debugging difficult
- **Shared infrastructure without guards** - Test/production shared same database
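Moving the hard-coded threshold into configuration might look like the sketch below. The variable name `REAPER_THRESHOLD_MINUTES` is an assumed convention; the 60-minute production value from this report serves as the default.

```python
import os

def reaper_threshold_minutes(default: int = 60) -> int:
    """Read the heartbeat-staleness threshold from the environment.

    Falls back to the production default (60 minutes) so a missing or
    stale deployment config cannot silently tighten the threshold to
    an aggressive value like the test app's hard-coded 10 minutes.
    """
    raw = os.environ.get("REAPER_THRESHOLD_MINUTES")
    return int(raw) if raw else default
```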
## Prevention Measures
- **Environment-based guards** in code
- **Database access logging:**
  - Log all new database connections with app name
  - Alert on unexpected app names
- **Secrets scanning:**
  - Prevent test apps from having production secrets
  - Use different secret keys for test/production
- **App naming conventions:**
  - Use a `-test` suffix for non-production apps
  - Block apps with a `-test` suffix from production resources
## Verification
### Post-Incident Verification
- ✅ **atom-saas-test stopped** - No running machines
- ✅ **atom-saas production running** - Only active instance
- ✅ **No other test apps** pointing to production (verified)
- ⏳ **Database reconfiguration** - PENDING for atom-saas-test
### Ongoing Monitoring
- Monitor `atom-saas` logs for any reaper activity
- Verify new backfill jobs complete successfully
- Watch for any unexpected "cancelled" job statuses
## Credits
**Incident Detected By:** User (forensic analysis of log timestamps and error message format)
**Key Insight:** The 10-minute threshold (test) vs 60-minute threshold (production) mismatch, combined with old error message format, revealed the second reaper.
---
**Status:** ✅ RESOLVED - Rogue reaper eliminated
**Next Review:** After atom-saas-test database reconfiguration