Incident Report: Zombie Reaper Attack

**Date:** 2026-05-04

**Severity:** CRITICAL

**Incident Type:** Production Data Interference by Test Environment

**Status:** RESOLVED

Executive Summary

A test application (atom-saas-test) was configured to point to the **production database** and was running an outdated version of the maintenance reaper code. This caused active production jobs to be incorrectly cancelled after only 10 minutes of heartbeat staleness (instead of the 60-minute threshold in production).

Timeline

| Time (UTC) | Event |
| --- | --- |
| 10:13 | Job f648f968 started processing |
| 14:25 | Job sent its last heartbeat |
| 14:35 | 10 minutes elapsed (test app threshold) |
| 14:36 | **atom-saas-test** reaper cancelled the job with the old error message |
| 14:39 | v2068 deployed to production (heartbeat fix) |
| 14:40 | User discovered the job was cancelled |
| ~15:00 | User identified atom-saas-test as the culprit |
| 15:05 | **atom-saas-test stopped** - incident resolved |

Root Cause Analysis

The Problem

**Test App Configuration:**

  • **App Name:** atom-saas-test
  • **Database:** Shared production database URL (ep-little-poetry-ad98vm8v-pooler...)
  • **Reaper Code:** Old version (pre-leader-election, pre-structured-logging)
  • **Reaper Threshold:** 10 minutes (hard-coded)
  • **Redis Lock:** ❌ None (no leader election)

**Production Configuration:**

  • **App Name:** atom-saas
  • **Reaper Threshold:** 60 minutes
  • **Redis Lock:** ✅ Implemented (single leader elected; sketched below)
  • **Error Format:** Structured with machine_id suffix
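
For context, the production reaper's leader election can be pictured as a short-TTL Redis lock: only the machine that wins the lock runs a reap pass. A minimal sketch of the pattern, assuming redis-py and Fly's FLY_MACHINE_ID variable (the key name, TTL, and function names are illustrative, not the actual atom-saas code):

```python
import os

import redis

# REDIS_URL is an assumed secret name; the lock key and TTL are illustrative.
r = redis.Redis.from_url(os.environ["REDIS_URL"])
MACHINE_ID = os.environ.get("FLY_MACHINE_ID", "local")


def try_become_leader() -> bool:
    # SET with nx=True succeeds for exactly one machine at a time; the
    # TTL (ex=300) releases the lock automatically if the leader dies.
    return bool(r.set("reaper:leader", MACHINE_ID, nx=True, ex=300))


def reap_stale_jobs(threshold_minutes: int) -> None:
    """Placeholder for the actual pass that cancels stale-heartbeat jobs."""


if try_become_leader():
    reap_stale_jobs(threshold_minutes=60)  # production threshold, not 10
```

The old reaper in atom-saas-test predates this lock, so every test machine ran its own reap pass against the shared database.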

Why It Wasn't Detected Earlier

  1. **Test app was actively running** - atom-saas-test had been deployed only 10 minutes before the incident
  2. **No separation of concerns** - test and production shared the same database
  3. **Old error format** - the message "Abandoned (server restart or timeout)" lacked the new Reaper: {machine_id} suffix that would have immediately identified the culprit (illustrated below)
  4. **Silent interference** - the test app's reaper ran every 5 minutes without writing anything to production logs
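
To make the format gap concrete, a minimal sketch of the old versus new cancellation messages (the variable name is an assumption; the message wording comes from this report):

```python
import os

MACHINE_ID = os.environ.get("FLY_MACHINE_ID", "unknown")

# Old format, as written by atom-saas-test: no attribution at all.
old_error = "Abandoned (server restart or timeout)"

# New structured format: the suffix names the reaper that acted.
new_error = f"Abandoned (server restart or timeout) Reaper: {MACHINE_ID}"
```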

Impact

**Direct Impact:**

  • 1 backfill job (f648f968) cancelled prematurely after processing 100 entities
  • Roughly 11 minutes of work lost - the job had run for over 4 hours, but only 11 minutes had elapsed since its last heartbeat checkpoint

**Potential Impact (if not detected):**

  • ALL production backfill jobs at risk of premature cancellation
  • Continuous interference with production operations
  • Data corruption from repeated job interruptions
  • User trust erosion

Resolution

Immediate Actions Taken

  1. ✅ **Stopped atom-saas-test** (`fly scale count 0 -a atom-saas-test`)
  2. ✅ **Verified app is suspended** (no running machines)
  3. ✅ **Confirmed production app is only active instance**

Required Follow-up Actions

  1. **URGENT:** Reconfigure atom-saas-test to use a separate test database
  2. **Audit other apps** for similar misconfigurations
  3. **Add environment guards** in code (a sketch follows this list)
  4. **Restore job f648f968:**
     • Data is safe in R2 (checkpoint at 100 entities)
     • Can be resumed manually from the UI or via API
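
A minimal sketch of such a startup guard, assuming a Python codebase and APP_NAME/DATABASE_URL environment variables (both names are assumptions; the host fragment is the production database host from this incident):

```python
import os
import sys

APP_NAME = os.environ.get("APP_NAME", "")
DATABASE_URL = os.environ.get("DATABASE_URL", "")


def assert_safe_environment() -> None:
    """Refuse to boot if a test app is pointed at the production database."""
    if APP_NAME.endswith("-test") and "ep-little-poetry" in DATABASE_URL:
        sys.exit(f"FATAL: {APP_NAME} is configured against the production database")


assert_safe_environment()
```

Failing hard at boot is deliberate: a misconfigured test app should never reach the point where its reaper can touch production rows.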

Lessons Learned

Process Issues

  1. **No environment verification** - Test apps should be blocked from accessing production resources
  2. **No database access auditing** - No alerts when multiple apps connect to production DB
  3. **Stale test deployments** - Old code versions running indefinitely
  4. **No cross-app monitoring** - Each app operates in isolation

Technical Debt

  1. **Hard-coded thresholds** - Should be environment variables (see the sketch after this list)
  2. **No app identification in error messages** - Made debugging difficult
  3. **Shared infrastructure without guards** - Test/production shared same database
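
As an example of the first item, the reaper threshold could be read from an environment variable with a conservative default (a sketch; REAPER_THRESHOLD_MINUTES is a hypothetical name):

```python
import os
from datetime import timedelta

# Hypothetical variable name; defaulting to the production value means a
# missing setting can never reintroduce an aggressive 10-minute reaper.
REAPER_THRESHOLD = timedelta(
    minutes=int(os.environ.get("REAPER_THRESHOLD_MINUTES", "60"))
)
```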

Prevention Measures

  1. **Environment-based guards** - block test apps from production resources at startup (see the guard sketch under "Required Follow-up Actions")
  2. **Database access logging** (a sketch follows this list):
     • Log all new database connections with app name
     • Alert on unexpected app names
  3. **Secrets scanning:**
     • Prevent test apps from having production secrets
     • Use different secret keys for test/production
  4. **App naming conventions:**
     • Use a -test suffix for non-production apps
     • Block apps with -test from production resources
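
With Postgres, one low-effort way to get per-app connection visibility is to set application_name on every connection and audit pg_stat_activity. A sketch assuming psycopg2 and the APP_NAME variable from the guard above (the alerting logic is illustrative):

```python
import os

import psycopg2

APP_NAME = os.environ.get("APP_NAME", "unknown")

# Tag every connection so it is attributable in pg_stat_activity.
conn = psycopg2.connect(os.environ["DATABASE_URL"], application_name=APP_NAME)

# An audit job can then flag unexpected apps on the production database.
with conn.cursor() as cur:
    cur.execute(
        "SELECT application_name, count(*) "
        "FROM pg_stat_activity GROUP BY application_name"
    )
    for name, count in cur.fetchall():
        if name and name.endswith("-test"):
            print(f"ALERT: test app {name!r} holds {count} connection(s)")
```

Had atom-saas-test tagged its connections this way, the second reaper would have been visible in pg_stat_activity from the moment it deployed.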

Verification

Post-Incident Verification

  1. ✅ **atom-saas-test stopped** - No running machines
  2. ✅ **atom-saas production running** - Only active instance
  3. ✅ **No other test apps** pointing to production (verified)
  4. ⏳ **Database reconfiguration** - PENDING for atom-saas-test

Ongoing Monitoring

  1. Monitor atom-saas logs for any reaper activity
  2. Verify new backfill jobs complete successfully
  3. Watch for any unexpected "cancelled" job statuses

Credits

**Incident Detected By:** User (forensic analysis of log timestamps and error message format)

**Key Insight:** The mismatch between the 10-minute (test) and 60-minute (production) thresholds, combined with the old error message format, revealed the second reaper.

---

**Status:** ✅ RESOLVED - Rogue reaper eliminated

**Next Review:** After atom-saas-test database reconfiguration