# Historical Sync Backfill Debugging Guide

**Date:** May 3, 2026

**Issue:** Historical sync jobs failing with premature cancellations and variable shadowing errors

## Problem Summary

Historical sync backfill jobs were failing with two critical issues:

1. **Variable shadowing bug**: 'tuple' object has no attribute 'get'

2. **Premature cancellations**: Jobs cancelled after 10-20 minutes instead of completing

## Root Causes

### 1. Variable Shadowing Bug (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:910

**Problem:**

```python
# OLD BUGGY CODE:
for result in llm_results:  # ❌ Shadows outer 'result' variable
    doc_id, llm_entities, llm_relationships = result
    # ... process ...

# Later code tries to use outer 'result':
has_more = result.get("has_more", False)  # ❌ 'result' is now a tuple!
```

**Impact:**

- Job 3cfdf70d-d1f2-4b1e-b007-19741cddf4fc failed

- Error: AttributeError: 'tuple' object has no attribute 'get'

- All backfill jobs failing after first chunk

**Fix:**

```python
# NEW FIXED CODE:
for llm_result in llm_results:  # ✅ Uses different variable name
    doc_id, llm_entities, llm_relationships = llm_result
    # ... process ...

# Outer 'result' still available:
has_more = result.get("has_more", False)  # ✅ Works correctly
```

**Test:** backend-saas/tests/test_variable_shadowing_fix_tdd.py
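A regression test for this bug only needs to drive the loop with the new variable name and confirm the outer pagination dict still responds to .get(). A minimal, self-contained sketch (hypothetical helper, not the contents of the actual test file):

```python
# Hypothetical stand-in for the chunk-processing loop; not the real service code.
def process_chunk(result: dict, llm_results: list) -> bool:
    for llm_result in llm_results:  # distinct name, no shadowing of 'result'
        doc_id, llm_entities, llm_relationships = llm_result
        _ = (doc_id, llm_entities, llm_relationships)  # ... process ...
    # The outer pagination dict is still a dict here, so .get() works.
    return result.get("has_more", False)


def test_loop_variable_does_not_shadow_result():
    result = {"has_more": True}
    llm_results = [("doc-1", ["entity"], ["relationship"])]
    assert process_chunk(result, llm_results) is True
```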

### 2. Silent Heartbeat Failures (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:960, 1168

**Problem:**

```python
# OLD CODE:
try:
    heartbeat_db = Session(bind=self.db.bind)
    _job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
    heartbeat_db.commit()
except Exception:
    pass  # ❌ Swallows all errors silently!
```

**Impact:**

- Heartbeat updates failing silently

- Reaper seeing stale heartbeats

- Jobs marked as "abandoned" even though worker is alive

**Fix:**

```python
# NEW CODE:
try:
    heartbeat_db = Session(bind=self.db.bind)
    _job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
    heartbeat_db.commit()
except Exception as heartbeat_err:
    logger.error(
        f"Failed to update heartbeat: {heartbeat_err}",
        tenant_id=self.tenant_id,
        job_id=job_id,
        action="heartbeat_update_failed",
    )  # ✅ Logs errors for debugging
```
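The "fresh sessions for heartbeat updates" item under Fixes Applied pairs with this change: the job row is loaded and committed in its own short-lived session so a failed main session cannot block the heartbeat. A minimal sketch of that pattern, assuming the model and field names used in the excerpts above (not the full service code):

```python
# Sketch only: model and field names taken from the excerpts in this guide.
import logging
from datetime import datetime, timezone
from sqlalchemy.orm import Session

from core.models import HistoricalSyncJob

logger = logging.getLogger(__name__)


def update_heartbeat(main_session: Session, job_id: str) -> None:
    # Independent session bound to the same engine as the worker's main session.
    heartbeat_db = Session(bind=main_session.bind)
    try:
        job = heartbeat_db.query(HistoricalSyncJob).filter(
            HistoricalSyncJob.id == job_id
        ).first()
        if job is not None:
            job.last_heartbeat = datetime.now(timezone.utc)
            heartbeat_db.commit()
    except Exception as heartbeat_err:
        heartbeat_db.rollback()
        logger.error(f"Failed to update heartbeat for job {job_id}: {heartbeat_err}")
    finally:
        heartbeat_db.close()
```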

### 3. LLM Infinite Hangs (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:893

**Problem:**

```python
# OLD CODE:
llm_results = await asyncio.gather(*llm_tasks, return_exceptions=True)
# ❌ No timeout - waits forever if task hangs
```

**Impact:**

- Jobs stuck indefinitely when LLM API hangs

- No way to recover from hung LLM calls

- Worker machines blocked forever

**Fix:**

```python
# NEW CODE:
try:
    llm_results = await asyncio.wait_for(
        asyncio.gather(*llm_tasks, return_exceptions=True),
        timeout=300.0  # ✅ 5 minutes max for LLM extraction
    )
except asyncio.TimeoutError:
    logger.error(f"LLM extraction timed out after 5 minutes for chunk {chunk_count + 1}")
    llm_results = []  # Continue with next chunk
```

### 4. Wrong Reaper Threshold (FIXED ✅)

**Location:** backend-saas/core/startup_tasks.py:74

**Problem:**

- Old threshold: 10-30 minutes

- Large backfill chunks legitimately take 30-45 minutes

- Jobs cancelled before they can complete

**Fix:**

```python
# NEW CODE:
threshold = datetime.now(timezone.utc) - timedelta(minutes=60)  # ✅ 60 minutes
```
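The new value only matters in combination with how the reaper filters jobs. A minimal sketch of that comparison, assuming the last_heartbeat column from the excerpts above and a "running" status value (not the actual startup_tasks code):

```python
# Sketch only: query shape and status value are assumptions, not the real reaper.
from datetime import datetime, timedelta, timezone

from core.database import SessionLocal
from core.models import HistoricalSyncJob


def find_stale_jobs():
    threshold = datetime.now(timezone.utc) - timedelta(minutes=60)
    db = SessionLocal()
    try:
        # Only jobs whose heartbeat is older than the threshold are reap candidates,
        # so legitimate 30-45 minute chunks are no longer cancelled mid-run.
        return db.query(HistoricalSyncJob).filter(
            HistoricalSyncJob.status == "running",
            HistoricalSyncJob.last_heartbeat < threshold,
        ).all()
    finally:
        db.close()
```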

### 5. Poor Error Logging (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:1400

**Problem:**

```python
# OLD CODE:
job.last_error = str(e)  # ❌ Loses stack trace
```

**Fix:**

```python
# NEW CODE:
import traceback
job.last_error = traceback.format_exc()  # ✅ Full stack trace
```
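One caveat: traceback.format_exc() only captures the stack trace while the exception is being handled; called outside an except block it returns "NoneType: None". A minimal, self-contained illustration:

```python
# Self-contained illustration: format_exc() must run inside the except block.
import traceback


class FakeJob:
    last_error = ""


job = FakeJob()
try:
    raise ValueError("simulated chunk failure")
except Exception as e:
    job.last_error = traceback.format_exc()  # full stack trace
    assert "ValueError: simulated chunk failure" in job.last_error
    assert job.last_error != str(e)  # str(e) alone would lose the trace
```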

### 6. Enhanced Reaper Logging (FIXED ✅)

**Location:** backend-saas/core/startup_tasks.py:101-121

**Problem:**

- Old reaper cancellation message: "Abandoned (server restart or timeout)"

- No diagnostic information to debug why job was cancelled

**Fix:**

```python
# NEW CODE:
logger.warning(
    f"Reaper cancelling orphaned HistoricalSyncJob",
    extra={
        "job_id": str(job.id),
        "tenant_id": str(job.tenant_id),
        "integration_id": str(job.integration_id),
        "reaper_machine_id": machine_id,
        "job_age_minutes": job_age.total_seconds() / 60,
        "heartbeat_age_minutes": heartbeat_age.total_seconds() / 60 if heartbeat_age else None,
        "job_created": job.created_at.isoformat() if job.created_at else None,
        "last_heartbeat": job.last_heartbeat.isoformat() if job.last_heartbeat else None,
        "heartbeat_is_null": job.last_heartbeat is None,
        "records_processed": job.records_processed,
        "started_at": job.started_at.isoformat() if job.started_at else None,
        "diagnosis": "DEAD WORKER: Job was running but heartbeat is stale. ..."
    }
)
```

### 7. Deployment Verification Gap (CRITICAL LESSON - May 3, 2026)

**🔴 CRITICAL DISCOVERY:**

**The Deployment Paradox:**

- ✅ Files on disk show VERSION 2039 code (verified by grep/SSH)

- ❌ Running Python process has OLD code in memory

- ❌ Verification scripts check FILES, not RUNNING PROCESSES

- Result: "Verification passes" but running code is stale

**Real Incident Timeline:**

1. Deployed VERSION 2039 with enhanced reaper logging ✅

2. Machine config updated to VERSION 2039 ✅

3. Verification script checked files via SSH - all passed ✅

4. Backfill job ran and got cancelled with OLD message format ❌

5. Last error: "Abandoned (server restart or timeout)" (old format) ❌

6. Root cause: Machine "UPDATED AT" changed but process didn't restart ❌

**Why This Happens:**

- Deployment builds new image and updates machine config

- Files on disk are replaced with new code

- BUT running Python process (uvicorn) continues with old code in memory

- Machine "UPDATED AT" timestamp updates even without process restart

- Verification scripts use SSH to grep files, which shows new code ✅

- But running process memory still has old code loaded ❌

**Two Solutions:**

**Option 1: Immediate Deployment Strategy (RECOMMENDED)**

```bash
# Use immediate strategy - kills and recreates machines
fly deploy -a atom-saas --strategy immediate

# Advantages:
# ✅ Old processes killed (not just stopped)
# ✅ New machines start fresh with new code in memory
# ✅ Guaranteed consistency
# ✅ No human error (forgetting to restart)
# ⚠️ Brief downtime (~30-60s) but worth it for consistency
```

**Option 2: Manual Restart After Deployment**

```bash
# 1. Deploy
fly deploy -a atom-saas

# 2. Check machine timestamps
fly machines list -a atom-saas

# 3. CRITICAL: UPDATED AT can change even without a process restart, so restart explicitly
fly machines restart --all -a atom-saas

# 4. Wait for health
sleep 30
fly status -a atom-saas --all
```

**Verification Script Limitation:**

The verify_deployment.py script has a **CRITICAL LIMITATION**:

- ✅ Checks files on disk via SSH

- ❌ Does NOT check running process memory

- ❌ Passing verification ≠ running code is up-to-date

**✅ Definitive Verification (Health Endpoint - May 3, 2026):**

**NEW SOLUTION:** Check health endpoint for deployed SHA:

```bash
# 1. Get current HEAD SHA locally
git rev-parse --short HEAD

# 2. Check deployed SHA from health endpoint
curl https://app.atomagentos.com/api/health | jq .deployed_sha

# 3. Verify SHA matches
# - Match found: New code is running ✅
# - SHA missing or different: Machine needs restart ❌
fly machines restart --all -a atom-saas
```
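Those three steps can be collapsed into a single check. A sketch, assuming the deployed_sha field shown above and jq available locally:

```bash
# Compare local HEAD against the SHA reported by the running app.
LOCAL_SHA=$(git rev-parse --short HEAD)
DEPLOYED_SHA=$(curl -s https://app.atomagentos.com/api/health | jq -r '.deployed_sha')

if [ "$LOCAL_SHA" = "$DEPLOYED_SHA" ]; then
  echo "OK: running code matches HEAD ($LOCAL_SHA)"
else
  echo "MISMATCH: local=$LOCAL_SHA deployed=$DEPLOYED_SHA - restarting machines"
  fly machines restart --all -a atom-saas
fi
```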

**Why This Works:**

- VERSION.txt baked into Docker image at build time

- Health endpoint reads VERSION.txt at runtime

- No git dependency needed in container

- Instant verification without SSH or log diving

**Implementation:**

1. **Dockerfile** (lines 10-12, 61-63):

```dockerfile
# Lines 10-12: accept the build arg and expose it as an env var
ARG VERSION_SHA=dev
ENV VERSION_SHA=${VERSION_SHA}

# Lines 61-63: ARG is re-declared (build args don't carry across stages),
# then written to a file the app can read at runtime
ARG VERSION_SHA=unknown
RUN echo $VERSION_SHA > /app/backend-saas/VERSION.txt
```

2. **Health Endpoint** (src/app/api/health/route.ts):

```typescript
import { existsSync } from 'fs'
import { readFile } from 'fs/promises'

// ...inside the health handler, after the base health object is built:
try {
  const versionPath = '/app/backend-saas/VERSION.txt'
  if (existsSync(versionPath)) {
    // Await the read before trimming; calling trim() on the Promise would throw.
    health.deployed_sha = (await readFile(versionPath, 'utf-8')).trim()
  }
} catch (error) {
  // deployed_sha stays undefined if the file doesn't exist or can't be read
}
```

**Deployment Usage:**

```bash
fly deploy -a atom-saas --build-arg VERSION_SHA=$(git rev-parse --short HEAD)
```

**Log-Based Verification (Secondary Method):**

If the health endpoint is unavailable, check the Fly logs:
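For example (a sketch; the exact log text depends on what the app emits at startup, and the grep pattern mirrors the one in the Commands Reference below):

```bash
# Look for deployment and reaper lines in recent logs and confirm the new format.
fly logs -a atom-saas --no-tail | grep -E "DEPLOYED|Reaper|cancelled"
```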

## Fixes Applied

All fixes deployed in VERSION 2031:

1. ✅ Variable shadowing fix (line 910)

2. ✅ Enhanced error logging with traceback.format_exc() (line 1400)

3. ✅ Fresh sessions for heartbeat updates (lines 960, 1168)

4. ✅ 5-minute LLM timeout (line 906)

5. ✅ 60-minute reaper threshold (line 74)

6. ✅ Enhanced reaper logging (line 101)

7. ✅ Deployment verification script

## Test Results

**Job 3ededa84-d8fc-4543-803c-a2930f7a5a88:**

- Runtime: 1 hour 10 minutes

- Records processed: 800

- Entities extracted: 800

- Relationships extracted: 800

- Result: Cancelled by OLD reaper (machine hadn't restarted)

**Key Success Indicators:**

- ✅ No variable shadowing errors for 800 records

- ✅ Heartbeat updates working (6 seconds old at one point)

- ✅ No premature cancellation at 10-20 minutes

- ✅ Job ran for 60+ minutes without issue

**Remaining Issue:**

- Job was cancelled because app machine hadn't restarted to pick up new reaper code

- Old reaper cancelled with old threshold (likely 30 minutes based on heartbeat age)

- Solution: Always restart machines after deployment

## Prevention Checklist

For future deployments:

1. **Code Changes:**

- [ ] Always use TDD for bug fixes

- [ ] Test with real backfill jobs

- [ ] Add regression tests

2. **Deployment Process:**

- [ ] Deploy with --strategy immediate

- [ ] Check UPDATED AT timestamps

- [ ] Manually restart if >5 minutes old

- [ ] Run verification script

- [ ] Start test backfill job

3. **Monitoring:**

- [ ] Check job error messages for diagnostic info

- [ ] Verify machine ID in reaper logs

- [ ] Confirm heartbeat ages make sense

- [ ] Look for "DEAD WORKER" diagnosis

## Related Files

- **Bug Fix Test:** backend-saas/tests/test_variable_shadowing_fix_tdd.py

- **Verification Script:** backend-saas/scripts/verify_deployment.py

- **Worker Service:** backend-saas/core/historical_sync_service.py

- **Reaper:** backend-saas/core/startup_tasks.py

## Commands Reference

```bash
# Check machine status
fly status -a atom-saas --all
fly machines list -a atom-saas

# Restart machines
fly machine restart <machine-id> -a atom-saas
fly machines restart --all -a atom-saas

# Verify deployment
python3 backend-saas/scripts/verify_deployment.py

# Check job progress (the embedded Python must start at column 0,
# or python3 -c raises an IndentationError)
fly ssh console -a atom-saas --machine <machine-id> --command \
  "python3 -c \"
import sys; sys.path.insert(0, '/app/backend-saas')
from core.database import SessionLocal
from core.models import HistoricalSyncJob
db = SessionLocal()
job = db.query(HistoricalSyncJob).filter(HistoricalSyncJob.id == '<job-id>').first()
if job:
    print(f'Status: {job.status}')
    print(f'Records: {job.records_processed}')
db.close()
\""

# Check logs
fly logs -a atom-saas -n 100 | grep -E "DEPLOYED|Reaper|cancelled"
```

## Lessons Learned

1. **Variable Naming Matters**: Never shadow outer scope variables in loops

2. **Silent Errors Are Dangerous**: Always log exceptions, never use except: pass

3. **Timeouts Are Critical**: Always add timeouts to potentially blocking operations

4. **Thresholds Must Match Reality**: Set timeouts based on actual operation times

5. **🔴 DEPLOYMENT PARADOX (May 3, 2026)**:

- Files on disk ≠ Running process memory

- Verification scripts check FILES, not PROCESSES

- "Verified" deployment can still run old code

- **Solution**: Use fly deploy --strategy immediate to guarantee consistency

6. **Verify Running Code**: Don't just check files - verify the running process

7. **Diagnostic Logging Is Essential**: Can't debug what you can't see

## Future Improvements

1. **Health Check Endpoint**: Add /api/health/version to return running code version

2. **Automated Restart**: Make deployment script automatically restart machines

3. **Monitoring Dashboards**: Grafana dashboard for job progress and heartbeat ages

4. **Alerts**: PagerDuty alerts when jobs are cancelled

5. **Automatic Recovery**: Worker should detect stale reaper and restart job