Historical Sync Backfill — Complete Fix Summary
Problem
Outlook memory ingestion backfill jobs were stuck in paused/failed/cancelled states with 0 entities and 0 neural links. Workers were being killed mid-processing, and semantic extraction was failing with 401 authentication errors.
Root Causes (9 bugs found and fixed)
| # | Bug | Symptom | Fix | Deployed |
|---|---|---|---|---|
| 1 | **aiohttp no timeout** — HTTP calls to Microsoft hung for 5 minutes | Jobs stuck running with 0/0 | 30s ClientTimeout on all Microsoft API calls | 2026-05-01 |
| 2 | **Microsoft Graph 504** — $top=1000 with 90-day range timed out Microsoft's servers | Graph API request failed | Lowered to $top=100, added 504 retry with 2s backoff | 2026-05-01 |
| 3 | **Worker self-shutdown** — worker had logic to self-stop after 5 idle minutes | Jobs pending forever, never dequeued | Removed self-shutdown logic entirely; Fly's auto_start_machines handles lifecycle | 2026-05-01 |
| 4a | **No initial heartbeat** — worker didn't send heartbeat before first API call | Reaper marked jobs as stale immediately | Send initial heartbeat before entering fetch loop | 2026-05-01 |
| 4b | **Slow chunk processing** — LanceDB + GraphRAG takes 15-30min per 100-record chunk | Reaper killed jobs as "abandoned" mid-processing | Increased reaper threshold 15→30min | 2026-05-01 |
| 5 | **SQLAlchemy session detachment** — sub-services expired job object from session | "Instance is not persistent within this Session" on chunk commit | Re-query job from DB instead of refresh() | 2026-05-01 |
| 6 | **BYOK key mismatch** — queried openai_api_key (lowercase) but stored as OPENAI_API_KEY | LLM extraction used mock-key, got 401 errors | Case-sensitive match in historical_sync_service.py + BYOKHandler fix | 2026-05-01 |
| 7 | **Email body truncated** — only 500-char preview sent to LLM | Semantic extraction missed entities in email body | Full body (up to 10KB) instead of preview | 2026-05-01 |
| 8 | **Missing asyncio import** — 504 retry logic used asyncio.sleep() | NameError: name 'asyncio' is not defined | Added import asyncio to outlook_service.py | 2026-05-01 |
| 9 | **Worker process syntax** — fly.toml used wrong command format | Worker failed to start: "No such file or directory" | Shell wrapper: worker = "sh -c '/app/docker-entrypoint.sh worker'" | 2026-05-01 |
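
The timeout and 504-retry fixes (bugs 1 and 2) can be sketched as follows. This is an illustration, not the code in outlook_service.py: `fetch_with_retry`, its parameters, and the three-attempt limit are hypothetical, and `asyncio.wait_for` stands in for the aiohttp `ClientTimeout` that the real fix uses. Only the 30-second timeout and the 2-second backoff come from the table above.

```python
import asyncio

# Status codes worth retrying; the summary only mentions 504.
RETRYABLE_STATUS = {504}


async def fetch_with_retry(fetch, *, timeout_s=30.0, max_attempts=3, backoff_s=2.0):
    """Call `fetch()` with a per-attempt timeout, retrying on HTTP 504.

    `fetch` is any async callable returning a (status, payload) tuple.
    Hypothetical helper: the name and signature are assumptions.
    """
    last_status = None
    for attempt in range(1, max_attempts + 1):
        # Raises asyncio.TimeoutError if the call hangs longer than timeout_s.
        status, payload = await asyncio.wait_for(fetch(), timeout=timeout_s)
        if status not in RETRYABLE_STATUS:
            return status, payload
        last_status = status
        if attempt < max_attempts:
            # Fixed backoff between attempts (2 seconds in the summary).
            await asyncio.sleep(backoff_s)
    raise RuntimeError(
        f"gave up after {max_attempts} attempts (last status {last_status})"
    )
```

A timeout is left to propagate rather than being retried, so a hung call fails the attempt quickly instead of leaving the job stuck in a running state.
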
Known Issues (Not Yet Fixed)
| # | Bug | Symptom | Status |
|---|---|---|---|
| 10 | **Transaction errors in GraphRAG** — database operation fails without rollback | InFailedSqlTransaction ("current transaction is aborted"), which blocks all subsequent operations | **Needs fix**: add proper error handling with rollback |
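
One possible shape for the missing rollback handling is a small context manager around each database step. This is a sketch under assumptions, not the planned fix: `rollback_on_error` is a hypothetical name, and `session` is assumed to be any object exposing `commit()` and `rollback()`, as a SQLAlchemy `Session` does.

```python
from contextlib import contextmanager


@contextmanager
def rollback_on_error(session):
    """Commit on success; roll back and re-raise on any exception.

    Rolling back clears the aborted transaction so later operations on the
    same session are not rejected. Hypothetical helper, not project code.
    """
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
```

Re-raising keeps the failure visible to the caller, which can then decide whether to skip the record or fail the chunk.
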
Additional Improvements
Performance
- **Chunk size**: 1000→100 records to avoid Microsoft Graph 504 errors
- **Progress calculation**: Uses has_more flag instead of per-page total_count (fixed 150% display bug)
- **Email body**: Full content (up to 10KB) instead of 500-char bodyPreview
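
To illustrate why a has_more flag avoids the 150% display bug, here is a minimal sketch. The exact formula used in the service is not given in this summary, so the function below, its name, and the 99% cap are all assumptions: progress reaches 100% only once the API reports no further pages, and any estimate is clamped until then.

```python
def progress_percent(records_processed, estimated_total, has_more):
    """Return a display percentage that can never exceed 100.

    Hypothetical helper: `estimated_total` may be a per-page count and
    therefore too small, so it is only trusted up to a 99% cap.
    """
    if not has_more:
        return 100.0  # the API says there are no more pages
    if estimated_total <= 0:
        return 0.0
    return min(99.0, 100.0 * records_processed / estimated_total)
```

With 150 records processed against a per-page count of 100, a naive ratio gives 150%; the sketch returns 99.0 while more pages remain and 100.0 once has_more is false.
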
Architecture
- **Dual ingestion paths**:
- **Semantic memory** (LanceDB): Vector embeddings for all tenants, no API key required
- **Structured memory** (GraphRAG): Entity extraction for tenants with BYO key
- **Worker VM**: 2GB memory (1-core) for GraphRAG ingestion headroom
- **Autoscaling**: Multi-worker (1-3 machines) via Fly Machines API
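
The dual-path routing described above can be sketched as a small selection function. `select_ingestion_paths` and the `tenant_keys` dictionary are hypothetical; only the OPENAI_API_KEY name and the rule that the semantic path needs no key come from this summary. The lookup is deliberately case-sensitive, mirroring the fix for bug 6.

```python
def select_ingestion_paths(tenant_keys):
    """Decide which ingestion paths run for a tenant.

    `tenant_keys` maps stored key names to values (hypothetical structure).
    """
    paths = ["semantic"]  # LanceDB embeddings run for every tenant
    # Exact, case-sensitive match on the stored key name.
    if tenant_keys.get("OPENAI_API_KEY"):
        paths.append("structured")  # GraphRAG entity extraction
    return paths
```

A tenant whose key was looked up as openai_api_key would get only the semantic path here, which is the behaviour the case mismatch produced before the fix.
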
Deployment Timeline
- **2026-05-01 00:00-01:00 UTC** - Initial fixes (timeout, 504 retry, worker shutdown)
- **2026-05-01 01:00-02:00 UTC** - Session fixes (re-query, asyncio, BYOK key)
- **2026-05-01 02:00-03:00 UTC** - Email body + worker process fixes
- **2026-05-01 03:00-04:00 UTC** - Reaper timeout + dual-memory architecture
Architecture
Email record (100 per chunk)
│
├─→ _extract_structured_entities() → DiscoveredEntity → Postgres
│   └─→ Rule-based: from, to, subject, content fields
│
├─→ graphrag.ingest_document() → GraphNode/GraphEdge → Postgres
│   └─→ LLM-based: people, orgs, topics from email body
│   └─→ Requires: Tenant BYOK key (OPENAI_API_KEY)
│
└─→ lancedb.add_document() × 100 → LanceDB
    └─→ Full text + subject → Vector embeddings
    └─→ Requires: None (runs for all tenants)
Deployment Commands
Deploy Latest Changes
git pull origin main
fly deploy -a atom-saas --strategy immediate
Verify Deployment
# Check machines started
fly status -a atom-saas
# Check worker logs
fly logs -a atom-saas --machine 2861d27a3414e8 | grep -E "(ROLE|Starting|Loaded service)"
Verification
Check Recent Jobs
SELECT
id,
status,
records_processed,
entities_extracted,
relationships_extracted,
created_at,
completed_at
FROM historical_sync_jobs
WHERE created_at > NOW() - INTERVAL '1 hour'
ORDER BY created_at DESC
LIMIT 5;
Monitor Worker Activity
# Real-time worker logs
fly logs -a atom-saas --machine 2861d27a3414e8 | grep -E "(dequeue|Fetched|Persisting|entities|relationships|LanceDB)"
# Check for errors
fly logs -a atom-saas --machine 2861d27a3414e8 | grep -E "(error|Error|ERROR|failed|Failed)"
Verify Semantic Extraction
-- Check LanceDB documents stored
-- (Requires access to production database)
-- Check GraphRAG entities created
SELECT COUNT(*) FROM graph_nodes
WHERE tenant_id = '31c06fc4-db22-4740-83ea-48ac14f25810'
AND created_at > NOW() - INTERVAL '1 hour';
Expected Performance
Per 100-record chunk:
- **Fetch from Outlook**: ~2-5 seconds
- **LanceDB embeddings**: ~30-60 seconds (synchronous, one-at-a-time)
- **GraphRAG entity extraction**: ~5-15 minutes (if BYOK key available)
- **Total time**: ~8-10 minutes (LanceDB only) or ~15-20 minutes (both paths)
**Note**: LanceDB calls are currently synchronous (not batched with asyncio.gather()). This is a future optimization opportunity.
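
The batching optimization mentioned in the note could look roughly like this. It is a sketch under assumptions: `add_documents_concurrently` and `max_concurrency` are hypothetical, and `add_document` is assumed to be an async callable. If the real LanceDB call is blocking, it would need to be wrapped first, for example with `asyncio.to_thread`.

```python
import asyncio


async def add_documents_concurrently(add_document, documents, max_concurrency=8):
    """Run `add_document(doc)` for every document with bounded concurrency.

    Results come back in input order; a failed document yields its exception
    object instead of aborting the whole chunk.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(doc):
        # At most max_concurrency calls are in flight at once.
        async with semaphore:
            return await add_document(doc)

    # return_exceptions=True keeps one failure from cancelling the rest.
    return await asyncio.gather(
        *(guarded(doc) for doc in documents), return_exceptions=True
    )
```

The semaphore bound matters on a single-core, 2GB worker: unbounded `gather` over 100 records could trade the latency problem for a memory or rate-limit problem.
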
Current Status
✅ Working
- Worker stays alive and processes jobs continuously
- Fetches 100 emails per chunk without 504 errors
- Full email body (10KB) available for processing
- No premature reaper kills (30-minute threshold)
- LanceDB embeddings created for all tenants
- BYOK tenants are routed to GraphRAG entity extraction (still affected by Bug #10)
⏳ Pending
- **Transaction error fix** needed for GraphRAG path
- **LanceDB batching** optimization (future)
- Production backfill completion metrics
🔧 Known Limitations
- GraphRAG path blocked by transaction errors (Bug #10)
- Single-core worker limits concurrent processing
- Synchronous LanceDB calls add ~30-60 seconds per chunk
Next Steps
- **Fix transaction errors** in GraphRAG ingestion path
- **Verify production backfill** completes successfully
- **Optimize LanceDB** with batched concurrent calls
- **Monitor reaper** to ensure 30-minute threshold is sufficient
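
For the reaper monitoring step, one way to judge whether 30 minutes is enough is to measure the largest gap between consecutive heartbeats of a job and compare it with the threshold. The helpers below are hypothetical and assume heartbeat timestamps are available as timezone-aware `datetime` values; how they are collected (logs or a database column) is not specified in this summary.

```python
from datetime import datetime, timedelta, timezone

# Reaper threshold after the fix for bug 4b.
REAPER_THRESHOLD = timedelta(minutes=30)


def max_heartbeat_gap(heartbeats):
    """Return the longest interval between consecutive heartbeats."""
    ordered = sorted(heartbeats)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=timedelta(0))


def threshold_headroom(heartbeats, threshold=REAPER_THRESHOLD):
    """Return how much slack remains before the reaper would fire.

    A negative result means the job would have been reaped mid-processing.
    """
    return threshold - max_heartbeat_gap(heartbeats)
```

Headroom that is consistently small or negative across jobs would suggest either raising the threshold again or sending heartbeats more often within a chunk.
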