Atom AI Labs - AI-Powered Multi-Tenant Platform

Historical Sync Backfill Debugging Guide

**Date:** May 3, 2026 (updated May 2026)

**Issue:** Historical sync jobs failing with premature cancellations, variable shadowing errors, and DB pool exhaustion

Problem Summary

Historical sync backfill jobs were failing with three critical issues:

**Variable shadowing bug**: 'tuple' object has no attribute 'get'
**Premature cancellations**: Jobs cancelled after 10-20 minutes instead of completing
**DB connection pool exhaustion (May 2026)**: SSH timeouts, workers unable to serve queries — LLMService() instantiated inside parallel _extract_one tasks, each creating a BYOKHandler that queries TenantSetting, saturating the Neon pool (15+20). Fixed by creating LLMService once per chunk and sharing via _llm_extract_with_handler().

Root Causes

1. Variable Shadowing Bug (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:910

**Problem:**

# OLD BUGGY CODE:
for result in llm_results:  # ❌ Shadows outer 'result' variable
    doc_id, llm_entities, llm_relationships = result
    # ... process ...

# Later code tries to use outer 'result':
has_more = result.get("has_more", False)  # ❌ 'result' is now a tuple!

**Impact:**

Job 3cfdf70d-d1f2-4b1e-b007-19741cddf4fc failed
Error: AttributeError: 'tuple' object has no attribute 'get'
All backfill jobs failing after first chunk

**Fix:**

# NEW FIXED CODE:
for llm_result in llm_results:  # ✅ Uses different variable name
    doc_id, llm_entities, llm_relationships = llm_result
    # ... process ...

# Outer 'result' still available:
has_more = result.get("has_more", False)  # ✅ Works correctly

**Test:** backend-saas/tests/test_variable_shadowing_fix_tdd.py

2. Silent Heartbeat Failures (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:960, 1168

**Problem:**

# OLD CODE:
try:
    heartbeat_db = Session(bind=self.db.bind)
    _job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
    heartbeat_db.commit()
except Exception:
    pass  # ❌ Swallows all errors silently!

**Impact:**

Heartbeat updates failing silently
Reaper seeing stale heartbeats
Jobs marked as "abandoned" even though worker is alive

**Fix:**

# NEW CODE:
try:
    heartbeat_db = Session(bind=self.db.bind)
    _job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
    heartbeat_db.commit()
except Exception as heartbeat_err:
    logger.error(
        f"Failed to update heartbeat: {heartbeat_err}",
        tenant_id=self.tenant_id,
        job_id=job_id,
        action="heartbeat_update_failed",
    )  # ✅ Logs errors for debugging

3. LLM Infinite Hangs (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:893

**Problem:**

# OLD CODE:
llm_results = await asyncio.gather(*llm_tasks, return_exceptions=True)
# ❌ No timeout - waits forever if task hangs

**Impact:**

Jobs stuck indefinitely when LLM API hangs
No way to recover from hung LLM calls
Worker machines blocked forever

**Fix:**

# NEW CODE:
try:
    llm_results = await asyncio.wait_for(
        asyncio.gather(*llm_tasks, return_exceptions=True),
        timeout=300.0  # ✅ 5 minutes max for LLM extraction
    )
except asyncio.TimeoutError:
    logger.error(f"LLM extraction timed out after 5 minutes for chunk {chunk_count + 1}")
    llm_results = []  # Continue with next chunk

4. Wrong Reaper Threshold (FIXED ✅)

**Location:** backend-saas/core/startup_tasks.py:74

**Problem:**

Old threshold: 10-30 minutes
Large backfill chunks legitimately take 30-45 minutes
Jobs cancelled before they can complete

**Fix:**

# NEW CODE:
threshold = datetime.now(timezone.utc) - timedelta(minutes=60)  # ✅ 60 minutes

5. Poor Error Logging (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:1400

**Problem:**

# OLD CODE:
job.last_error = str(e)  # ❌ Loses stack trace

**Fix:**

# NEW CODE:
import traceback
job.last_error = traceback.format_exc()  # ✅ Full stack trace

6. Enhanced Reaper Logging (FIXED ✅)

**Location:** backend-saas/core/startup_tasks.py:101-121

**Problem:**

Old reaper cancellation message: "Abandoned (server restart or timeout)"
No diagnostic information to debug why job was cancelled

**Fix:**

# NEW CODE:
logger.warning(
    f"Reaper cancelling orphaned HistoricalSyncJob",
    extra={
        "job_id": str(job.id),
        "tenant_id": str(job.tenant_id),
        "integration_id": str(job.integration_id),
        "reaper_machine_id": machine_id,
        "job_age_minutes": job_age.total_seconds() / 60,
        "heartbeat_age_minutes": heartbeat_age.total_seconds() / 60 if heartbeat_age else None,
        "job_created": job.created_at.isoformat() if job.created_at else None,
        "last_heartbeat": job.last_heartbeat.isoformat() if job.last_heartbeat else None,
        "heartbeat_is_null": job.last_heartbeat is None,
        "records_processed": job.records_processed,
        "started_at": job.started_at.isoformat() if job.started_at else None,
        "diagnosis": "DEAD WORKER: Job was running but heartbeat is stale. ..."
    }
)

7. Deployment Verification Gap (CRITICAL LESSON - May 3, 2026)

**🔴 CRITICAL DISCOVERY:**

**The Deployment Paradox:**

✅ Files on disk show VERSION 2039 code (verified by grep/SSH)
❌ Running Python process has OLD code in memory
❌ Verification scripts check FILES, not RUNNING PROCESSES
Result: "Verification passes" but running code is stale

**Real Incident Timeline:**

Deployed VERSION 2039 with enhanced reaper logging ✅
Machine config updated to VERSION 2039 ✅
Verification script checked files via SSH - all passed ✅
Backfill job ran and got cancelled with OLD message format ❌
Last error: "Abandoned (server restart or timeout)" (old format) ❌
Root cause: Machine "UPDATED AT" changed but process didn't restart ❌

**Why This Happens:**

Deployment builds new image and updates machine config
Files on disk are replaced with new code
BUT running Python process (uvicorn) continues with old code in memory
Machine "UPDATED AT" timestamp updates even without process restart
Verification scripts use SSH to grep files, which shows new code ✅
But running process memory still has old code loaded ❌

**Two Solutions:**

**Option 1: Immediate Deployment Strategy (RECOMMENDED)**

# Use immediate strategy - kills and recreates machines
fly deploy -a atom-saas --strategy immediate

# Advantages:
# ✅ Old processes killed (not just stopped)
# ✅ New machines start fresh with new code in memory
# ✅ Guaranteed consistency
# ✅ No human error (forgetting to restart)
# ⚠️ Brief downtime (~30-60s) but worth it for consistency

**Option 2: Manual Restart After Deployment**

# 1. Deploy
fly deploy -a atom-saas

# 2. Check machine timestamps
fly machines list -a atom-saas

# 3. CRITICAL: If UPDATED AT equals deployment time, process didn't restart
fly machines restart --all -a atom-saas

# 4. Wait for health
sleep 30
fly status -a atom-saas --all

**Verification Script Limitation:**

The verify_deployment.py script has a **CRITICAL LIMITATION**:

✅ Checks files on disk via SSH
❌ Does NOT check running process memory
❌ Passing verification ≠ running code is up-to-date

**✅ Definitive Verification (Health Endpoint - May 3, 2026):**

**NEW SOLUTION:** Check health endpoint for deployed SHA:

# 1. Get current HEAD SHA locally
git rev-parse --short HEAD

# 2. Check deployed SHA from health endpoint
curl https://app.atomagentos.com/api/health | jq .deployed_sha

# 3. Verify SHA matches
# - Match found: New code is running ✅
# - SHA missing or different: Machine needs restart ❌
fly machines restart --all -a atom-saas

**Why This Works:**

VERSION.txt baked into Docker image at build time
Health endpoint reads VERSION.txt at runtime
No git dependency needed in container
Instant verification without SSH or log diving

**Implementation:**

**Dockerfile** (lines 10-12, 61-63):

ARG VERSION_SHA=unknown

RUN echo $VERSION_SHA > /app/backend-saas/VERSION.txt

```

**Health Endpoint** (src/app/api/health/route.ts):

**Deployment Usage:**

fly deploy -a atom-saas --build-arg VERSION_SHA=$(git rev-parse --short HEAD)

**Log-Based Verification (Secondary Method):**

If health endpoint unavailable, check Fly logs:

Fixes Applied

All fixes deployed in VERSION 2031:

✅ Variable shadowing fix (line 910)
✅ Enhanced error logging with traceback.format_exc() (line 1400)
✅ Fresh sessions for heartbeat updates (lines 960, 1168)
✅ 5-minute LLM timeout (line 906)
✅ 60-minute reaper threshold (line 74)
✅ Enhanced reaper logging (line 101)
✅ Deployment verification script
✅ DB pool exhaustion: LLMService created once per chunk via _llm_extract_with_handler() (May 2026)
✅ Integration-aware extraction: system_instruction_override + extra_metadata params added to _llm_extract_with_handler() (May 2026)

Test Results

**Job 3ededa84-d8fc-4543-803c-a2930f7a5a88:**

Runtime: 1 hour 10 minutes
Records processed: 800
Entities extracted: 800
Relationships extracted: 800
Result: Cancelled by OLD reaper (machine hadn't restarted)

**Key Success Indicators:**

✅ No variable shadowing errors for 800 records
✅ Heartbeat updates working (6 seconds old at one point)
✅ No premature cancellation at 10-20 minutes
✅ Job ran for 60+ minutes without issue

**Remaining Issue:**

Job was cancelled because app machine hadn't restarted to pick up new reaper code
Old reaper cancelled with old threshold (likely 30 minutes based on heartbeat age)
Solution: Always restart machines after deployment

Prevention Checklist

For future deployments:

**Code Changes:**

[ ] Always use TDD for bug fixes
[ ] Test with real backfill jobs
[ ] Add regression tests

**Deployment Process:**

[ ] Deploy with --strategy immediate
[ ] Check UPDATED AT timestamps
[ ] Manually restart if >5 minutes old
[ ] Run verification script
[ ] Start test backfill job

**Monitoring:**

[ ] Check job error messages for diagnostic info
[ ] Verify machine ID in reaper logs
[ ] Confirm heartbeat ages make sense
[ ] Look for "DEAD WORKER" diagnosis

**Bug Fix Test:** backend-saas/tests/test_variable_shadowing_fix_tdd.py
**Verification Script:** backend-saas/scripts/verify_deployment.py
**Worker Service:** backend-saas/core/historical_sync_service.py (see _llm_extract_with_handler() for shared LLM extraction)
**Reaper:** backend-saas/core/startup_tasks.py

Commands Reference

# Check machine status
fly status -a atom-saas --all
fly machines list -a atom-saas

# Restart machines
fly machine restart <machine-id> -a atom-saas
fly machines restart --all -a atom-saas

# Verify deployment
python3 backend-saas/scripts/verify_deployment.py

# Check job progress
fly ssh console -a atom-saas -c <machine-id> --command \
  "python3 -c \"
  import sys; sys.path.insert(0, '/app/backend-saas')
  from core.database import SessionLocal
  from core.models import HistoricalSyncJob
  db = SessionLocal()
  job = db.query(HistoricalSyncJob).filter(
      HistoricalSyncJob.id == '<job-id>'
  ).first()
  if job:
    print(f'Status: {job.status}')
    print(f'Records: {job.records_processed}')
  db.close()
  \""

# Check logs
fly logs -a atom-saas -n 100 | grep -E "DEPLOYED|Reaper|cancelled"

Lessons Learned

**Variable Naming Matters**: Never shadow outer scope variables in loops
**Silent Errors Are Dangerous**: Always log exceptions, never use except: pass
**Timeouts Are Critical**: Always add timeouts to potentially blocking operations
**Thresholds Must Match Reality**: Set timeouts based on actual operation times
**🔴 DEPLOYMENT PARADOX (May 3, 2026)**:

Files on disk ≠ Running process memory
Verification scripts check FILES, not PROCESSES
"Verified" deployment can still run old code
**Solution**: Use fly deploy --strategy immediate to guarantee consistency

**Verify Running Code**: Don't just check files - verify the running process
**Diagnostic Logging Is Essential**: Can't debug what you can't see

Future Improvements

**Health Check Endpoint**: Add /api/health/version to return running code version
**Automated Restart**: Make deployment script automatically restart machines
**Monitoring Dashboards**: Grafana dashboard for job progress and heartbeat ages
**Alerts**: PagerDuty alerts when jobs are cancelled
**Automatic Recovery**: Worker should detect stale reaper and restart job

Historical Sync Backfill Debugging Guide

Problem Summary

Root Causes

1. Variable Shadowing Bug (FIXED ✅)

2. Silent Heartbeat Failures (FIXED ✅)

3. LLM Infinite Hangs (FIXED ✅)

4. Wrong Reaper Threshold (FIXED ✅)

5. Poor Error Logging (FIXED ✅)

6. Enhanced Reaper Logging (FIXED ✅)

7. Deployment Verification Gap (CRITICAL LESSON - May 3, 2026)

Fixes Applied

Test Results

Prevention Checklist

Related Files

Commands Reference

Lessons Learned

Future Improvements