ATOM Documentation

← Back to App

Historical Sync Backfill Debugging Guide

**Date:** May 3, 2026 (updated May 2026)

**Issue:** Historical sync jobs failing with premature cancellations, variable shadowing errors, and DB pool exhaustion

Problem Summary

Historical sync backfill jobs were failing with three critical issues:

  1. **Variable shadowing bug**: 'tuple' object has no attribute 'get'
  2. **Premature cancellations**: Jobs cancelled after 10-20 minutes instead of completing
  3. **DB connection pool exhaustion (May 2026)**: SSH timeouts, workers unable to serve queries — LLMService() instantiated inside parallel _extract_one tasks, each creating a BYOKHandler that queries TenantSetting, saturating the Neon pool (15+20). Fixed by creating LLMService once per chunk and sharing via _llm_extract_with_handler().

Root Causes

1. Variable Shadowing Bug (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:910

**Problem:**

# OLD BUGGY CODE:
for result in llm_results:  # ❌ Shadows outer 'result' variable
    doc_id, llm_entities, llm_relationships = result
    # ... process ...

# Later code tries to use outer 'result':
has_more = result.get("has_more", False)  # ❌ 'result' is now a tuple!

**Impact:**

  • Job 3cfdf70d-d1f2-4b1e-b007-19741cddf4fc failed
  • Error: AttributeError: 'tuple' object has no attribute 'get'
  • All backfill jobs failing after first chunk

**Fix:**

# NEW FIXED CODE:
for llm_result in llm_results:  # ✅ Uses different variable name
    doc_id, llm_entities, llm_relationships = llm_result
    # ... process ...

# Outer 'result' still available:
has_more = result.get("has_more", False)  # ✅ Works correctly

**Test:** backend-saas/tests/test_variable_shadowing_fix_tdd.py

2. Silent Heartbeat Failures (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:960, 1168

**Problem:**

# OLD CODE:
try:
    heartbeat_db = Session(bind=self.db.bind)
    _job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
    heartbeat_db.commit()
except Exception:
    pass  # ❌ Swallows all errors silently!

**Impact:**

  • Heartbeat updates failing silently
  • Reaper seeing stale heartbeats
  • Jobs marked as "abandoned" even though worker is alive

**Fix:**

# NEW CODE:
try:
    heartbeat_db = Session(bind=self.db.bind)
    _job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
    heartbeat_db.commit()
except Exception as heartbeat_err:
    logger.error(
        f"Failed to update heartbeat: {heartbeat_err}",
        tenant_id=self.tenant_id,
        job_id=job_id,
        action="heartbeat_update_failed",
    )  # ✅ Logs errors for debugging

3. LLM Infinite Hangs (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:893

**Problem:**

# OLD CODE:
llm_results = await asyncio.gather(*llm_tasks, return_exceptions=True)
# ❌ No timeout - waits forever if task hangs

**Impact:**

  • Jobs stuck indefinitely when LLM API hangs
  • No way to recover from hung LLM calls
  • Worker machines blocked forever

**Fix:**

# NEW CODE:
try:
    llm_results = await asyncio.wait_for(
        asyncio.gather(*llm_tasks, return_exceptions=True),
        timeout=300.0  # ✅ 5 minutes max for LLM extraction
    )
except asyncio.TimeoutError:
    logger.error(f"LLM extraction timed out after 5 minutes for chunk {chunk_count + 1}")
    llm_results = []  # Continue with next chunk

4. Wrong Reaper Threshold (FIXED ✅)

**Location:** backend-saas/core/startup_tasks.py:74

**Problem:**

  • Old threshold: 10-30 minutes
  • Large backfill chunks legitimately take 30-45 minutes
  • Jobs cancelled before they can complete

**Fix:**

# NEW CODE:
threshold = datetime.now(timezone.utc) - timedelta(minutes=60)  # ✅ 60 minutes

5. Poor Error Logging (FIXED ✅)

**Location:** backend-saas/core/historical_sync_service.py:1400

**Problem:**

# OLD CODE:
job.last_error = str(e)  # ❌ Loses stack trace

**Fix:**

# NEW CODE:
import traceback
job.last_error = traceback.format_exc()  # ✅ Full stack trace

6. Enhanced Reaper Logging (FIXED ✅)

**Location:** backend-saas/core/startup_tasks.py:101-121

**Problem:**

  • Old reaper cancellation message: "Abandoned (server restart or timeout)"
  • No diagnostic information to debug why job was cancelled

**Fix:**

# NEW CODE:
logger.warning(
    f"Reaper cancelling orphaned HistoricalSyncJob",
    extra={
        "job_id": str(job.id),
        "tenant_id": str(job.tenant_id),
        "integration_id": str(job.integration_id),
        "reaper_machine_id": machine_id,
        "job_age_minutes": job_age.total_seconds() / 60,
        "heartbeat_age_minutes": heartbeat_age.total_seconds() / 60 if heartbeat_age else None,
        "job_created": job.created_at.isoformat() if job.created_at else None,
        "last_heartbeat": job.last_heartbeat.isoformat() if job.last_heartbeat else None,
        "heartbeat_is_null": job.last_heartbeat is None,
        "records_processed": job.records_processed,
        "started_at": job.started_at.isoformat() if job.started_at else None,
        "diagnosis": "DEAD WORKER: Job was running but heartbeat is stale. ..."
    }
)

7. Deployment Verification Gap (CRITICAL LESSON - May 3, 2026)

**🔴 CRITICAL DISCOVERY:**

**The Deployment Paradox:**

  • ✅ Files on disk show VERSION 2039 code (verified by grep/SSH)
  • ❌ Running Python process has OLD code in memory
  • ❌ Verification scripts check FILES, not RUNNING PROCESSES
  • Result: "Verification passes" but running code is stale

**Real Incident Timeline:**

  1. Deployed VERSION 2039 with enhanced reaper logging ✅
  2. Machine config updated to VERSION 2039 ✅
  3. Verification script checked files via SSH - all passed ✅
  4. Backfill job ran and got cancelled with OLD message format ❌
  5. Last error: "Abandoned (server restart or timeout)" (old format) ❌
  6. Root cause: Machine "UPDATED AT" changed but process didn't restart ❌

**Why This Happens:**

  • Deployment builds new image and updates machine config
  • Files on disk are replaced with new code
  • BUT running Python process (uvicorn) continues with old code in memory
  • Machine "UPDATED AT" timestamp updates even without process restart
  • Verification scripts use SSH to grep files, which shows new code ✅
  • But running process memory still has old code loaded ❌

**Two Solutions:**

**Option 1: Immediate Deployment Strategy (RECOMMENDED)**

# Use immediate strategy - kills and recreates machines
fly deploy -a atom-saas --strategy immediate

# Advantages:
# ✅ Old processes killed (not just stopped)
# ✅ New machines start fresh with new code in memory
# ✅ Guaranteed consistency
# ✅ No human error (forgetting to restart)
# ⚠️ Brief downtime (~30-60s) but worth it for consistency

**Option 2: Manual Restart After Deployment**

# 1. Deploy
fly deploy -a atom-saas

# 2. Check machine timestamps
fly machines list -a atom-saas

# 3. CRITICAL: If UPDATED AT equals deployment time, process didn't restart
fly machines restart --all -a atom-saas

# 4. Wait for health
sleep 30
fly status -a atom-saas --all

**Verification Script Limitation:**

The verify_deployment.py script has a **CRITICAL LIMITATION**:

  • ✅ Checks files on disk via SSH
  • ❌ Does NOT check running process memory
  • ❌ Passing verification ≠ running code is up-to-date

**✅ Definitive Verification (Health Endpoint - May 3, 2026):**

**NEW SOLUTION:** Check health endpoint for deployed SHA:

# 1. Get current HEAD SHA locally
git rev-parse --short HEAD

# 2. Check deployed SHA from health endpoint
curl https://app.atomagentos.com/api/health | jq .deployed_sha

# 3. Verify SHA matches
# - Match found: New code is running ✅
# - SHA missing or different: Machine needs restart ❌
fly machines restart --all -a atom-saas

**Why This Works:**

  • VERSION.txt baked into Docker image at build time
  • Health endpoint reads VERSION.txt at runtime
  • No git dependency needed in container
  • Instant verification without SSH or log diving

**Implementation:**

  1. **Dockerfile** (lines 10-12, 61-63):

ARG VERSION_SHA=unknown

RUN echo $VERSION_SHA > /app/backend-saas/VERSION.txt

```

  1. **Health Endpoint** (src/app/api/health/route.ts):

**Deployment Usage:**

fly deploy -a atom-saas --build-arg VERSION_SHA=$(git rev-parse --short HEAD)

**Log-Based Verification (Secondary Method):**

If health endpoint unavailable, check Fly logs:

Fixes Applied

All fixes deployed in VERSION 2031:

  1. ✅ Variable shadowing fix (line 910)
  2. ✅ Enhanced error logging with traceback.format_exc() (line 1400)
  3. ✅ Fresh sessions for heartbeat updates (lines 960, 1168)
  4. ✅ 5-minute LLM timeout (line 906)
  5. ✅ 60-minute reaper threshold (line 74)
  6. ✅ Enhanced reaper logging (line 101)
  7. ✅ Deployment verification script
  8. ✅ DB pool exhaustion: LLMService created once per chunk via _llm_extract_with_handler() (May 2026)
  9. ✅ Integration-aware extraction: system_instruction_override + extra_metadata params added to _llm_extract_with_handler() (May 2026)

Test Results

**Job 3ededa84-d8fc-4543-803c-a2930f7a5a88:**

  • Runtime: 1 hour 10 minutes
  • Records processed: 800
  • Entities extracted: 800
  • Relationships extracted: 800
  • Result: Cancelled by OLD reaper (machine hadn't restarted)

**Key Success Indicators:**

  • ✅ No variable shadowing errors for 800 records
  • ✅ Heartbeat updates working (6 seconds old at one point)
  • ✅ No premature cancellation at 10-20 minutes
  • ✅ Job ran for 60+ minutes without issue

**Remaining Issue:**

  • Job was cancelled because app machine hadn't restarted to pick up new reaper code
  • Old reaper cancelled with old threshold (likely 30 minutes based on heartbeat age)
  • Solution: Always restart machines after deployment

Prevention Checklist

For future deployments:

  1. **Code Changes:**
  • [ ] Always use TDD for bug fixes
  • [ ] Test with real backfill jobs
  • [ ] Add regression tests
  1. **Deployment Process:**
  • [ ] Deploy with --strategy immediate
  • [ ] Check UPDATED AT timestamps
  • [ ] Manually restart if >5 minutes old
  • [ ] Run verification script
  • [ ] Start test backfill job
  1. **Monitoring:**
  • [ ] Check job error messages for diagnostic info
  • [ ] Verify machine ID in reaper logs
  • [ ] Confirm heartbeat ages make sense
  • [ ] Look for "DEAD WORKER" diagnosis
  • **Bug Fix Test:** backend-saas/tests/test_variable_shadowing_fix_tdd.py
  • **Verification Script:** backend-saas/scripts/verify_deployment.py
  • **Worker Service:** backend-saas/core/historical_sync_service.py (see _llm_extract_with_handler() for shared LLM extraction)
  • **Reaper:** backend-saas/core/startup_tasks.py

Commands Reference

# Check machine status
fly status -a atom-saas --all
fly machines list -a atom-saas

# Restart machines
fly machine restart <machine-id> -a atom-saas
fly machines restart --all -a atom-saas

# Verify deployment
python3 backend-saas/scripts/verify_deployment.py

# Check job progress
fly ssh console -a atom-saas -c <machine-id> --command \
  "python3 -c \"
  import sys; sys.path.insert(0, '/app/backend-saas')
  from core.database import SessionLocal
  from core.models import HistoricalSyncJob
  db = SessionLocal()
  job = db.query(HistoricalSyncJob).filter(
      HistoricalSyncJob.id == '<job-id>'
  ).first()
  if job:
    print(f'Status: {job.status}')
    print(f'Records: {job.records_processed}')
  db.close()
  \""

# Check logs
fly logs -a atom-saas -n 100 | grep -E "DEPLOYED|Reaper|cancelled"

Lessons Learned

  1. **Variable Naming Matters**: Never shadow outer scope variables in loops
  2. **Silent Errors Are Dangerous**: Always log exceptions, never use except: pass
  3. **Timeouts Are Critical**: Always add timeouts to potentially blocking operations
  4. **Thresholds Must Match Reality**: Set timeouts based on actual operation times
  5. **🔴 DEPLOYMENT PARADOX (May 3, 2026)**:
  • Files on disk ≠ Running process memory
  • Verification scripts check FILES, not PROCESSES
  • "Verified" deployment can still run old code
  • **Solution**: Use fly deploy --strategy immediate to guarantee consistency
  1. **Verify Running Code**: Don't just check files - verify the running process
  2. **Diagnostic Logging Is Essential**: Can't debug what you can't see

Future Improvements

  1. **Health Check Endpoint**: Add /api/health/version to return running code version
  2. **Automated Restart**: Make deployment script automatically restart machines
  3. **Monitoring Dashboards**: Grafana dashboard for job progress and heartbeat ages
  4. **Alerts**: PagerDuty alerts when jobs are cancelled
  5. **Automatic Recovery**: Worker should detect stale reaper and restart job