Historical Sync Backfill Debugging Guide
**Date:** May 3, 2026 (updated May 2026)
**Issue:** Historical sync jobs failing with premature cancellations, variable shadowing errors, and DB pool exhaustion
Problem Summary
Historical sync backfill jobs were failing with three critical issues:
- **Variable shadowing bug**:
'tuple' object has no attribute 'get' - **Premature cancellations**: Jobs cancelled after 10-20 minutes instead of completing
- **DB connection pool exhaustion (May 2026)**: SSH timeouts, workers unable to serve queries —
LLMService()instantiated inside parallel_extract_onetasks, each creating aBYOKHandlerthat queriesTenantSetting, saturating the Neon pool (15+20). Fixed by creatingLLMServiceonce per chunk and sharing via_llm_extract_with_handler().
Root Causes
1. Variable Shadowing Bug (FIXED ✅)
**Location:** backend-saas/core/historical_sync_service.py:910
**Problem:**
# OLD BUGGY CODE:
for result in llm_results: # ❌ Shadows outer 'result' variable
doc_id, llm_entities, llm_relationships = result
# ... process ...
# Later code tries to use outer 'result':
has_more = result.get("has_more", False) # ❌ 'result' is now a tuple!**Impact:**
- Job
3cfdf70d-d1f2-4b1e-b007-19741cddf4fcfailed - Error:
AttributeError: 'tuple' object has no attribute 'get' - All backfill jobs failing after first chunk
**Fix:**
# NEW FIXED CODE:
for llm_result in llm_results: # ✅ Uses different variable name
doc_id, llm_entities, llm_relationships = llm_result
# ... process ...
# Outer 'result' still available:
has_more = result.get("has_more", False) # ✅ Works correctly**Test:** backend-saas/tests/test_variable_shadowing_fix_tdd.py
2. Silent Heartbeat Failures (FIXED ✅)
**Location:** backend-saas/core/historical_sync_service.py:960, 1168
**Problem:**
# OLD CODE:
try:
heartbeat_db = Session(bind=self.db.bind)
_job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
heartbeat_db.commit()
except Exception:
pass # ❌ Swallows all errors silently!**Impact:**
- Heartbeat updates failing silently
- Reaper seeing stale heartbeats
- Jobs marked as "abandoned" even though worker is alive
**Fix:**
# NEW CODE:
try:
heartbeat_db = Session(bind=self.db.bind)
_job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
heartbeat_db.commit()
except Exception as heartbeat_err:
logger.error(
f"Failed to update heartbeat: {heartbeat_err}",
tenant_id=self.tenant_id,
job_id=job_id,
action="heartbeat_update_failed",
) # ✅ Logs errors for debugging3. LLM Infinite Hangs (FIXED ✅)
**Location:** backend-saas/core/historical_sync_service.py:893
**Problem:**
# OLD CODE:
llm_results = await asyncio.gather(*llm_tasks, return_exceptions=True)
# ❌ No timeout - waits forever if task hangs**Impact:**
- Jobs stuck indefinitely when LLM API hangs
- No way to recover from hung LLM calls
- Worker machines blocked forever
**Fix:**
# NEW CODE:
try:
llm_results = await asyncio.wait_for(
asyncio.gather(*llm_tasks, return_exceptions=True),
timeout=300.0 # ✅ 5 minutes max for LLM extraction
)
except asyncio.TimeoutError:
logger.error(f"LLM extraction timed out after 5 minutes for chunk {chunk_count + 1}")
llm_results = [] # Continue with next chunk4. Wrong Reaper Threshold (FIXED ✅)
**Location:** backend-saas/core/startup_tasks.py:74
**Problem:**
- Old threshold: 10-30 minutes
- Large backfill chunks legitimately take 30-45 minutes
- Jobs cancelled before they can complete
**Fix:**
# NEW CODE:
threshold = datetime.now(timezone.utc) - timedelta(minutes=60) # ✅ 60 minutes5. Poor Error Logging (FIXED ✅)
**Location:** backend-saas/core/historical_sync_service.py:1400
**Problem:**
# OLD CODE:
job.last_error = str(e) # ❌ Loses stack trace**Fix:**
# NEW CODE:
import traceback
job.last_error = traceback.format_exc() # ✅ Full stack trace6. Enhanced Reaper Logging (FIXED ✅)
**Location:** backend-saas/core/startup_tasks.py:101-121
**Problem:**
- Old reaper cancellation message:
"Abandoned (server restart or timeout)" - No diagnostic information to debug why job was cancelled
**Fix:**
# NEW CODE:
logger.warning(
f"Reaper cancelling orphaned HistoricalSyncJob",
extra={
"job_id": str(job.id),
"tenant_id": str(job.tenant_id),
"integration_id": str(job.integration_id),
"reaper_machine_id": machine_id,
"job_age_minutes": job_age.total_seconds() / 60,
"heartbeat_age_minutes": heartbeat_age.total_seconds() / 60 if heartbeat_age else None,
"job_created": job.created_at.isoformat() if job.created_at else None,
"last_heartbeat": job.last_heartbeat.isoformat() if job.last_heartbeat else None,
"heartbeat_is_null": job.last_heartbeat is None,
"records_processed": job.records_processed,
"started_at": job.started_at.isoformat() if job.started_at else None,
"diagnosis": "DEAD WORKER: Job was running but heartbeat is stale. ..."
}
)7. Deployment Verification Gap (CRITICAL LESSON - May 3, 2026)
**🔴 CRITICAL DISCOVERY:**
**The Deployment Paradox:**
- ✅ Files on disk show VERSION 2039 code (verified by grep/SSH)
- ❌ Running Python process has OLD code in memory
- ❌ Verification scripts check FILES, not RUNNING PROCESSES
- Result: "Verification passes" but running code is stale
**Real Incident Timeline:**
- Deployed VERSION 2039 with enhanced reaper logging ✅
- Machine config updated to VERSION 2039 ✅
- Verification script checked files via SSH - all passed ✅
- Backfill job ran and got cancelled with OLD message format ❌
- Last error: "Abandoned (server restart or timeout)" (old format) ❌
- Root cause: Machine "UPDATED AT" changed but process didn't restart ❌
**Why This Happens:**
- Deployment builds new image and updates machine config
- Files on disk are replaced with new code
- BUT running Python process (uvicorn) continues with old code in memory
- Machine "UPDATED AT" timestamp updates even without process restart
- Verification scripts use SSH to
grepfiles, which shows new code ✅ - But running process memory still has old code loaded ❌
**Two Solutions:**
**Option 1: Immediate Deployment Strategy (RECOMMENDED)**
# Use immediate strategy - kills and recreates machines
fly deploy -a atom-saas --strategy immediate
# Advantages:
# ✅ Old processes killed (not just stopped)
# ✅ New machines start fresh with new code in memory
# ✅ Guaranteed consistency
# ✅ No human error (forgetting to restart)
# ⚠️ Brief downtime (~30-60s) but worth it for consistency**Option 2: Manual Restart After Deployment**
# 1. Deploy
fly deploy -a atom-saas
# 2. Check machine timestamps
fly machines list -a atom-saas
# 3. CRITICAL: If UPDATED AT equals deployment time, process didn't restart
fly machines restart --all -a atom-saas
# 4. Wait for health
sleep 30
fly status -a atom-saas --all**Verification Script Limitation:**
The verify_deployment.py script has a **CRITICAL LIMITATION**:
- ✅ Checks files on disk via SSH
- ❌ Does NOT check running process memory
- ❌ Passing verification ≠ running code is up-to-date
**✅ Definitive Verification (Health Endpoint - May 3, 2026):**
**NEW SOLUTION:** Check health endpoint for deployed SHA:
# 1. Get current HEAD SHA locally
git rev-parse --short HEAD
# 2. Check deployed SHA from health endpoint
curl https://app.atomagentos.com/api/health | jq .deployed_sha
# 3. Verify SHA matches
# - Match found: New code is running ✅
# - SHA missing or different: Machine needs restart ❌
fly machines restart --all -a atom-saas**Why This Works:**
- VERSION.txt baked into Docker image at build time
- Health endpoint reads VERSION.txt at runtime
- No git dependency needed in container
- Instant verification without SSH or log diving
**Implementation:**
- **Dockerfile** (lines 10-12, 61-63):
ARG VERSION_SHA=unknown
RUN echo $VERSION_SHA > /app/backend-saas/VERSION.txt
```
- **Health Endpoint** (
src/app/api/health/route.ts):
**Deployment Usage:**
fly deploy -a atom-saas --build-arg VERSION_SHA=$(git rev-parse --short HEAD)**Log-Based Verification (Secondary Method):**
If health endpoint unavailable, check Fly logs:
Fixes Applied
All fixes deployed in VERSION 2031:
- ✅ Variable shadowing fix (line 910)
- ✅ Enhanced error logging with traceback.format_exc() (line 1400)
- ✅ Fresh sessions for heartbeat updates (lines 960, 1168)
- ✅ 5-minute LLM timeout (line 906)
- ✅ 60-minute reaper threshold (line 74)
- ✅ Enhanced reaper logging (line 101)
- ✅ Deployment verification script
- ✅ DB pool exhaustion: LLMService created once per chunk via
_llm_extract_with_handler()(May 2026) - ✅ Integration-aware extraction:
system_instruction_override+extra_metadataparams added to_llm_extract_with_handler()(May 2026)
Test Results
**Job 3ededa84-d8fc-4543-803c-a2930f7a5a88:**
- Runtime: 1 hour 10 minutes
- Records processed: 800
- Entities extracted: 800
- Relationships extracted: 800
- Result: Cancelled by OLD reaper (machine hadn't restarted)
**Key Success Indicators:**
- ✅ No variable shadowing errors for 800 records
- ✅ Heartbeat updates working (6 seconds old at one point)
- ✅ No premature cancellation at 10-20 minutes
- ✅ Job ran for 60+ minutes without issue
**Remaining Issue:**
- Job was cancelled because app machine hadn't restarted to pick up new reaper code
- Old reaper cancelled with old threshold (likely 30 minutes based on heartbeat age)
- Solution: Always restart machines after deployment
Prevention Checklist
For future deployments:
- **Code Changes:**
- [ ] Always use TDD for bug fixes
- [ ] Test with real backfill jobs
- [ ] Add regression tests
- **Deployment Process:**
- [ ] Deploy with
--strategy immediate - [ ] Check UPDATED AT timestamps
- [ ] Manually restart if >5 minutes old
- [ ] Run verification script
- [ ] Start test backfill job
- **Monitoring:**
- [ ] Check job error messages for diagnostic info
- [ ] Verify machine ID in reaper logs
- [ ] Confirm heartbeat ages make sense
- [ ] Look for "DEAD WORKER" diagnosis
Related Files
- **Bug Fix Test:**
backend-saas/tests/test_variable_shadowing_fix_tdd.py - **Verification Script:**
backend-saas/scripts/verify_deployment.py - **Worker Service:**
backend-saas/core/historical_sync_service.py(see_llm_extract_with_handler()for shared LLM extraction) - **Reaper:**
backend-saas/core/startup_tasks.py
Commands Reference
# Check machine status
fly status -a atom-saas --all
fly machines list -a atom-saas
# Restart machines
fly machine restart <machine-id> -a atom-saas
fly machines restart --all -a atom-saas
# Verify deployment
python3 backend-saas/scripts/verify_deployment.py
# Check job progress
fly ssh console -a atom-saas -c <machine-id> --command \
"python3 -c \"
import sys; sys.path.insert(0, '/app/backend-saas')
from core.database import SessionLocal
from core.models import HistoricalSyncJob
db = SessionLocal()
job = db.query(HistoricalSyncJob).filter(
HistoricalSyncJob.id == '<job-id>'
).first()
if job:
print(f'Status: {job.status}')
print(f'Records: {job.records_processed}')
db.close()
\""
# Check logs
fly logs -a atom-saas -n 100 | grep -E "DEPLOYED|Reaper|cancelled"Lessons Learned
- **Variable Naming Matters**: Never shadow outer scope variables in loops
- **Silent Errors Are Dangerous**: Always log exceptions, never use
except: pass - **Timeouts Are Critical**: Always add timeouts to potentially blocking operations
- **Thresholds Must Match Reality**: Set timeouts based on actual operation times
- **🔴 DEPLOYMENT PARADOX (May 3, 2026)**:
- Files on disk ≠ Running process memory
- Verification scripts check FILES, not PROCESSES
- "Verified" deployment can still run old code
- **Solution**: Use
fly deploy --strategy immediateto guarantee consistency
- **Verify Running Code**: Don't just check files - verify the running process
- **Diagnostic Logging Is Essential**: Can't debug what you can't see
Future Improvements
- **Health Check Endpoint**: Add
/api/health/versionto return running code version - **Automated Restart**: Make deployment script automatically restart machines
- **Monitoring Dashboards**: Grafana dashboard for job progress and heartbeat ages
- **Alerts**: PagerDuty alerts when jobs are cancelled
- **Automatic Recovery**: Worker should detect stale reaper and restart job