# Historical Sync Backfill Debugging Guide
**Date:** May 3, 2026
**Issue:** Historical sync jobs failing with premature cancellations and variable shadowing errors
## Problem Summary
Historical sync backfill jobs were failing with two critical issues:
1. **Variable shadowing bug**: 'tuple' object has no attribute 'get'
2. **Premature cancellations**: Jobs cancelled after 10-20 minutes instead of completing
## Root Causes
### 1. Variable Shadowing Bug (FIXED ✅)
**Location:** backend-saas/core/historical_sync_service.py:910
**Problem:**
```python
# OLD BUGGY CODE:
for result in llm_results:  # ❌ Shadows outer 'result' variable
    doc_id, llm_entities, llm_relationships = result
    # ... process ...

# Later code tries to use outer 'result':
has_more = result.get("has_more", False)  # ❌ 'result' is now a tuple!
```
**Impact:**
- Job 3cfdf70d-d1f2-4b1e-b007-19741cddf4fc failed
- Error: AttributeError: 'tuple' object has no attribute 'get'
- All backfill jobs failing after first chunk
**Fix:**
```python
# NEW FIXED CODE:
for llm_result in llm_results:  # ✅ Uses a different variable name
    doc_id, llm_entities, llm_relationships = llm_result
    # ... process ...

# Outer 'result' still available:
has_more = result.get("has_more", False)  # ✅ Works correctly
```
**Test:** backend-saas/tests/test_variable_shadowing_fix_tdd.py
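For reference, a minimal sketch of the kind of regression check that guards against this class of bug; the names and data here are illustrative, and the real coverage lives in the TDD test above:

```python
def test_outer_result_survives_llm_loop():
    # Outer page/result dict from the integration API (illustrative data)
    result = {"has_more": True}
    llm_results = [("doc-1", ["entity-a"], ["rel-a"])]

    # Fixed pattern: the loop variable is no longer named `result`
    for llm_result in llm_results:
        doc_id, llm_entities, llm_relationships = llm_result
        assert doc_id == "doc-1"

    # With the old code this line raised:
    # AttributeError: 'tuple' object has no attribute 'get'
    assert result.get("has_more", False) is True
```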
### 2. Silent Heartbeat Failures (FIXED ✅)
**Location:** backend-saas/core/historical_sync_service.py:960, 1168
**Problem:**
```python
# OLD CODE:
try:
    heartbeat_db = Session(bind=self.db.bind)
    _job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
    heartbeat_db.commit()
except Exception:
    pass  # ❌ Swallows all errors silently!
```
**Impact:**
- Heartbeat updates failing silently
- Reaper seeing stale heartbeats
- Jobs marked as "abandoned" even though worker is alive
**Fix:**
```python
# NEW CODE:
try:
    heartbeat_db = Session(bind=self.db.bind)
    _job_for_heartbeat.last_heartbeat = datetime.now(timezone.utc)
    heartbeat_db.commit()
except Exception as heartbeat_err:
    logger.error(
        f"Failed to update heartbeat: {heartbeat_err}",
        tenant_id=self.tenant_id,
        job_id=job_id,
        action="heartbeat_update_failed",
    )  # ✅ Logs errors for debugging
```
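As a reference for the "fresh session" pattern, here is a rough sketch of a standalone heartbeat helper; the query shape, model import, and function name are assumptions for illustration, not the exact code at lines 960/1168:

```python
from datetime import datetime, timezone

from sqlalchemy.orm import Session

from core.models import HistoricalSyncJob  # model referenced elsewhere in this guide


def update_heartbeat(engine, job_id, logger) -> None:
    """Stamp last_heartbeat in a short-lived session and never fail silently."""
    heartbeat_db = Session(bind=engine)
    try:
        job = heartbeat_db.query(HistoricalSyncJob).filter(
            HistoricalSyncJob.id == job_id
        ).first()
        if job is not None:
            job.last_heartbeat = datetime.now(timezone.utc)
            heartbeat_db.commit()
    except Exception as heartbeat_err:
        heartbeat_db.rollback()
        # Log instead of swallowing, so the reaper's view of stale heartbeats stays debuggable
        logger.error(f"Failed to update heartbeat: {heartbeat_err}")
    finally:
        heartbeat_db.close()
```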
### 3. LLM Infinite Hangs (FIXED ✅)
**Location:** backend-saas/core/historical_sync_service.py:893
**Problem:**
```python
# OLD CODE:
llm_results = await asyncio.gather(*llm_tasks, return_exceptions=True)
# ❌ No timeout - waits forever if a task hangs
```
**Impact:**
- Jobs stuck indefinitely when LLM API hangs
- No way to recover from hung LLM calls
- Worker machines blocked forever
**Fix:**
```python
# NEW CODE:
try:
    llm_results = await asyncio.wait_for(
        asyncio.gather(*llm_tasks, return_exceptions=True),
        timeout=300.0,  # ✅ 5 minutes max for LLM extraction
    )
except asyncio.TimeoutError:
    logger.error(f"LLM extraction timed out after 5 minutes for chunk {chunk_count + 1}")
    llm_results = []  # Continue with the next chunk
```
### 4. Wrong Reaper Threshold (FIXED ✅)
**Location:** backend-saas/core/startup_tasks.py:74
**Problem:**
- Old threshold: 10-30 minutes
- Large backfill chunks legitimately take 30-45 minutes
- Jobs cancelled before they can complete
**Fix:**
```python
# NEW CODE:
threshold = datetime.now(timezone.utc) - timedelta(minutes=60)  # ✅ 60 minutes
```
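For context, a rough sketch of how the reaper might apply this threshold; the status values, query shape, and function name are assumptions based on this guide, not the exact code in startup_tasks.py:

```python
from datetime import datetime, timedelta, timezone

from sqlalchemy.orm import Session

from core.models import HistoricalSyncJob  # as referenced elsewhere in this guide


def reap_stale_jobs(db: Session) -> int:
    """Cancel running jobs whose heartbeat is missing or older than 60 minutes."""
    threshold = datetime.now(timezone.utc) - timedelta(minutes=60)  # matches the new setting
    stale_jobs = (
        db.query(HistoricalSyncJob)
        .filter(HistoricalSyncJob.status == "running")
        .filter(
            (HistoricalSyncJob.last_heartbeat.is_(None))
            | (HistoricalSyncJob.last_heartbeat < threshold)
        )
        .all()
    )
    for job in stale_jobs:
        job.status = "cancelled"
        job.last_error = "Abandoned (stale heartbeat)"
    db.commit()
    return len(stale_jobs)
```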
### 5. Poor Error Logging (FIXED ✅)
**Location:** backend-saas/core/historical_sync_service.py:1400
**Problem:**
```python
# OLD CODE:
job.last_error = str(e)  # ❌ Loses the stack trace
```
**Fix:**
```python
# NEW CODE:
import traceback
job.last_error = traceback.format_exc()  # ✅ Full stack trace
```
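In context, this change lives inside the job's failure handler; a condensed, hypothetical sketch of that handler (the method name `_process_backfill` is illustrative, not the real one):

```python
import traceback

try:
    await self._process_backfill(job)  # hypothetical name for the main work loop
except Exception as e:
    # format_exc() must be called inside the except block to capture this traceback
    job.last_error = traceback.format_exc()
    job.status = "failed"
    self.db.commit()
    logger.error(f"Historical sync job failed: {e}", job_id=str(job.id))
    raise
```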
### 6. Enhanced Reaper Logging (FIXED ✅)
**Location:** backend-saas/core/startup_tasks.py:101-121
**Problem:**
- Old reaper cancellation message: "Abandoned (server restart or timeout)"
- No diagnostic information to debug why job was cancelled
**Fix:**
```python
# NEW CODE:
logger.warning(
    f"Reaper cancelling orphaned HistoricalSyncJob",
    extra={
        "job_id": str(job.id),
        "tenant_id": str(job.tenant_id),
        "integration_id": str(job.integration_id),
        "reaper_machine_id": machine_id,
        "job_age_minutes": job_age.total_seconds() / 60,
        "heartbeat_age_minutes": heartbeat_age.total_seconds() / 60 if heartbeat_age else None,
        "job_created": job.created_at.isoformat() if job.created_at else None,
        "last_heartbeat": job.last_heartbeat.isoformat() if job.last_heartbeat else None,
        "heartbeat_is_null": job.last_heartbeat is None,
        "records_processed": job.records_processed,
        "started_at": job.started_at.isoformat() if job.started_at else None,
        "diagnosis": "DEAD WORKER: Job was running but heartbeat is stale. ...",
    },
)
```
### 7. Deployment Verification Gap (CRITICAL LESSON - May 3, 2026)
**🔴 CRITICAL DISCOVERY:**
**The Deployment Paradox:**
- ✅ Files on disk show VERSION 2039 code (verified by grep/SSH)
- ❌ Running Python process has OLD code in memory
- ❌ Verification scripts check FILES, not RUNNING PROCESSES
- Result: "Verification passes" but running code is stale
**Real Incident Timeline:**
1. Deployed VERSION 2039 with enhanced reaper logging ✅
2. Machine config updated to VERSION 2039 ✅
3. Verification script checked files via SSH - all passed ✅
4. Backfill job ran and got cancelled with OLD message format ❌
5. Last error: "Abandoned (server restart or timeout)" (old format) ❌
6. Root cause: Machine "UPDATED AT" changed but process didn't restart ❌
**Why This Happens:**
- Deployment builds new image and updates machine config
- Files on disk are replaced with new code
- BUT running Python process (uvicorn) continues with old code in memory
- Machine "UPDATED AT" timestamp updates even without process restart
- Verification scripts use SSH to grep files, which shows new code ✅
- But running process memory still has old code loaded ❌
**Two Solutions:**
**Option 1: Immediate Deployment Strategy (RECOMMENDED)**
```bash
# Use the immediate strategy - kills and recreates machines
fly deploy -a atom-saas --strategy immediate

# Advantages:
# ✅ Old processes killed (not just stopped)
# ✅ New machines start fresh with new code in memory
# ✅ Guaranteed consistency
# ✅ No human error (forgetting to restart)
# ⚠️ Brief downtime (~30-60s), but worth it for consistency
```
**Option 2: Manual Restart After Deployment**
```bash
# 1. Deploy
fly deploy -a atom-saas

# 2. Check machine timestamps
fly machines list -a atom-saas

# 3. CRITICAL: UPDATED AT can change without a process restart, so restart explicitly
fly machines restart --all -a atom-saas

# 4. Wait for health
sleep 30
fly status -a atom-saas --all
```
**Verification Script Limitation:**
The verify_deployment.py script has a **CRITICAL LIMITATION**:
- ✅ Checks files on disk via SSH
- ❌ Does NOT check running process memory
- ❌ Passing verification ≠ running code is up-to-date
**✅ Definitive Verification (Health Endpoint - May 3, 2026):**
**NEW SOLUTION:** Check health endpoint for deployed SHA:
```bash
# 1. Get the current HEAD SHA locally
git rev-parse --short HEAD

# 2. Check the deployed SHA from the health endpoint
curl https://app.atomagentos.com/api/health | jq .deployed_sha

# 3. Verify the SHAs match
#    - Match: new code is running ✅
#    - SHA missing or different: the machine needs a restart ❌
fly machines restart --all -a atom-saas
```
**Why This Works:**
- VERSION.txt baked into Docker image at build time
- Health endpoint reads VERSION.txt at runtime
- No git dependency needed in container
- Instant verification without SSH or log diving
**Implementation:**
1. **Dockerfile** (lines 10-12, 61-63):
```dockerfile
# Early in the Dockerfile (lines 10-12): accept the SHA as a build argument
ARG VERSION_SHA=dev
ENV VERSION_SHA=${VERSION_SHA}

# Later in the Dockerfile (lines 61-63): re-declare the arg and bake it into a file
ARG VERSION_SHA=unknown
RUN echo $VERSION_SHA > /app/backend-saas/VERSION.txt
```
2. **Health Endpoint** (src/app/api/health/route.ts):
```typescript
import { existsSync } from 'fs'
import { readFile } from 'fs/promises'

try {
  const versionPath = '/app/backend-saas/VERSION.txt'
  if (existsSync(versionPath)) {
    // Await the file read before trimming; calling .trim() on the promise would fail
    health.deployed_sha = (await readFile(versionPath, 'utf-8')).trim()
  }
} catch (error) {
  // deployed_sha stays undefined if the file doesn't exist or can't be read
}
```
**Deployment Usage:**
```bash
fly deploy -a atom-saas --build-arg VERSION_SHA=$(git rev-parse --short HEAD)
```
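To make the comparison a single step, a small helper script can be added; this is a sketch assuming the health endpoint returns a JSON body with the deployed_sha field described above (the script itself is hypothetical, not part of the current repo):

```python
#!/usr/bin/env python3
"""Compare the local git HEAD SHA with the SHA reported by the running deployment."""
import json
import subprocess
import sys
from urllib.request import urlopen

HEALTH_URL = "https://app.atomagentos.com/api/health"  # endpoint used earlier in this guide


def main() -> int:
    local_sha = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    with urlopen(HEALTH_URL, timeout=10) as resp:
        deployed_sha = json.load(resp).get("deployed_sha")

    if deployed_sha == local_sha:
        print(f"OK: running code matches local HEAD ({local_sha})")
        return 0
    print(f"MISMATCH: local={local_sha} deployed={deployed_sha} - restart machines")
    return 1


if __name__ == "__main__":
    sys.exit(main())
```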
**Log-Based Verification (Secondary Method):**
If the health endpoint is unavailable, fall back to checking Fly logs for the deployment and reaper messages (see the Commands Reference section below).
## Fixes Applied
All fixes deployed in VERSION 2031:
1. ✅ Variable shadowing fix (line 910)
2. ✅ Enhanced error logging with traceback.format_exc() (line 1400)
3. ✅ Fresh sessions for heartbeat updates (lines 960, 1168)
4. ✅ 5-minute LLM timeout (line 906)
5. ✅ 60-minute reaper threshold (line 74)
6. ✅ Enhanced reaper logging (line 101)
7. ✅ Deployment verification script
## Test Results
**Job 3ededa84-d8fc-4543-803c-a2930f7a5a88:**
- Runtime: 1 hour 10 minutes
- Records processed: 800
- Entities extracted: 800
- Relationships extracted: 800
- Result: Cancelled by OLD reaper (machine hadn't restarted)
**Key Success Indicators:**
- ✅ No variable shadowing errors for 800 records
- ✅ Heartbeat updates working (6 seconds old at one point)
- ✅ No premature cancellation at 10-20 minutes
- ✅ Job ran for 60+ minutes without issue
**Remaining Issue:**
- Job was cancelled because app machine hadn't restarted to pick up new reaper code
- Old reaper cancelled with old threshold (likely 30 minutes based on heartbeat age)
- Solution: Always restart machines after deployment
## Prevention Checklist
For future deployments:
1. **Code Changes:**
- [ ] Always use TDD for bug fixes
- [ ] Test with real backfill jobs
- [ ] Add regression tests
2. **Deployment Process:**
- [ ] Deploy with --strategy immediate
- [ ] Check UPDATED AT timestamps
- [ ] Manually restart machines if UPDATED AT is more than ~5 minutes older than the deployment
- [ ] Run verification script
- [ ] Start test backfill job
3. **Monitoring:**
- [ ] Check job error messages for diagnostic info
- [ ] Verify machine ID in reaper logs
- [ ] Confirm heartbeat ages make sense
- [ ] Look for "DEAD WORKER" diagnosis
## Related Files
- **Bug Fix Test:** backend-saas/tests/test_variable_shadowing_fix_tdd.py
- **Verification Script:** backend-saas/scripts/verify_deployment.py
- **Worker Service:** backend-saas/core/historical_sync_service.py
- **Reaper:** backend-saas/core/startup_tasks.py
## Commands Reference
```bash
# Check machine status
fly status -a atom-saas --all
fly machines list -a atom-saas

# Restart machines
fly machine restart <machine-id> -a atom-saas
fly machines restart --all -a atom-saas

# Verify deployment
python3 backend-saas/scripts/verify_deployment.py

# Check job progress
fly ssh console -a atom-saas --machine <machine-id> --command \
  "python3 -c \"
import sys; sys.path.insert(0, '/app/backend-saas')
from core.database import SessionLocal
from core.models import HistoricalSyncJob
db = SessionLocal()
job = db.query(HistoricalSyncJob).filter(
    HistoricalSyncJob.id == '<job-id>'
).first()
if job:
    print(f'Status: {job.status}')
    print(f'Records: {job.records_processed}')
db.close()
\""

# Check logs
fly logs -a atom-saas -n 100 | grep -E "DEPLOYED|Reaper|cancelled"
```
## Lessons Learned
1. **Variable Naming Matters**: Never shadow outer scope variables in loops
2. **Silent Errors Are Dangerous**: Always log exceptions, never use except: pass
3. **Timeouts Are Critical**: Always add timeouts to potentially blocking operations
4. **Thresholds Must Match Reality**: Set timeouts based on actual operation times
5. **🔴 DEPLOYMENT PARADOX (May 3, 2026)**:
- Files on disk ≠ Running process memory
- Verification scripts check FILES, not PROCESSES
- "Verified" deployment can still run old code
- **Solution**: Use fly deploy --strategy immediate to guarantee consistency
6. **Verify Running Code**: Don't just check files - verify the running process
7. **Diagnostic Logging Is Essential**: Can't debug what you can't see
## Future Improvements
1. **Health Check Endpoint**: Add /api/health/version to return the running code version (partially addressed by the deployed_sha field in /api/health, described in Section 7)
2. **Automated Restart**: Make deployment script automatically restart machines
3. **Monitoring Dashboards**: Grafana dashboard for job progress and heartbeat ages
4. **Alerts**: PagerDuty alerts when jobs are cancelled
5. **Automatic Recovery**: Worker should detect stale reaper and restart job