Atom AI Labs - AI-Powered Multi-Tenant Platform

Historical Sync Backfill — Complete Fix Summary

Problem

Outlook memory ingestion backfill jobs were stuck in paused/failed/cancelled states with 0 entities and 0 neural links. Workers were being killed mid-processing, and semantic extraction was failing with 401 authentication errors.

Root Causes (9 bugs found and fixed)

#	Bug	Symptom	Fix	Deployed
1	aiohttp no timeout — HTTP calls to Microsoft hung for 5 minutes	Jobs stuck `running` with 0/0	30s `ClientTimeout` on all Microsoft API calls	2026-05-01
2	Microsoft Graph 504 — `$top=1000` over a 90-day range timed out on Microsoft's servers	`Graph API request failed`	Lowered to `$top=100`, added 504 retry with 2s backoff	2026-05-01
3	Worker self-shutdown — worker had logic to self-stop after 5 idle minutes	Jobs `pending` forever, never dequeued	Removed self-shutdown logic entirely; Fly's `auto_start_machines` handles lifecycle	2026-05-01
4a	No initial heartbeat — worker didn't send heartbeat before first API call	Reaper marked jobs as stale immediately	Send initial heartbeat before entering fetch loop	2026-05-01
4b	Slow chunk processing — LanceDB + GraphRAG takes 15-30min per 100-record chunk	Reaper killed jobs as "abandoned" mid-processing	Increased reaper threshold 15→30min	2026-05-01
5	SQLAlchemy session detachment — sub-services expired job object from session	`"Instance is not persistent within this Session"` on chunk commit	Re-query job from DB instead of `refresh()`	2026-05-01
6	BYOK key mismatch — queried `openai_api_key` (lowercase) but stored as `OPENAI_API_KEY`	LLM extraction used `mock-key`, got 401 errors	Exact-case lookup of `OPENAI_API_KEY` in `historical_sync_service.py` + `BYOKHandler` fix	2026-05-01
7	Email body truncated — only 500-char preview sent to LLM	Semantic extraction missed entities in email body	Full body (up to 10KB) instead of preview	2026-05-01
8	Missing asyncio import — 504 retry logic used `asyncio.sleep()`	`NameError: name 'asyncio' is not defined`	Added `import asyncio` to `outlook_service.py`	2026-05-01
9	Worker process syntax — fly.toml used wrong command format	Worker failed to start: "No such file or directory"	Shell wrapper: `worker = "sh -c '/app/docker-entrypoint.sh worker'"`	2026-05-01
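
Fixes 1, 2 and 8 combine into a single request pattern: a 30s client timeout, a smaller page size, and a retry on 504 after a 2s pause. The following is a minimal sketch of that pattern, not the code in `outlook_service.py`. The names `get_with_504_retry` and `graph_get` and the limit of 3 attempts are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

# Only the 30s timeout, $top=100 and the 2s pause on 504 come from the
# fix table; every name below is hypothetical.
RETRYABLE_STATUS = {504}


async def get_with_504_retry(
    do_request: Callable[[], Awaitable[Tuple[int, str]]],
    max_attempts: int = 3,  # assumed; the summary does not state a limit
    backoff_seconds: float = 2.0,
) -> Tuple[int, str]:
    """Call do_request, pausing and retrying while it returns 504."""
    status, body = 0, ""
    for attempt in range(1, max_attempts + 1):
        status, body = await do_request()
        if status not in RETRYABLE_STATUS:
            break
        if attempt < max_attempts:
            await asyncio.sleep(backoff_seconds)
    return status, body


async def graph_get(url: str, headers: dict) -> Tuple[int, str]:
    """Fetch one page from Microsoft Graph with a 30s total timeout."""
    import aiohttp  # third-party; imported here so the helper above stays stdlib-only

    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:

        async def do_request() -> Tuple[int, str]:
            async with session.get(url, headers=headers, params={"$top": "100"}) as resp:
                return resp.status, await resp.text()

        return await get_with_504_retry(do_request)
```

In this sketch a timeout surfaces as an exception from `session.get` rather than as a status code, so it is not retried; whether timeouts should also be retried is a separate decision.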

Known Issues (Not Yet Fixed)

#	Bug	Symptom	Status
10	Transaction errors in GraphRAG — database operation fails without rollback	`InFailedSqlTransaction: current transaction is aborted` blocks all subsequent operations	Needs fix - Add proper error handling with rollback
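
After one statement fails, PostgreSQL rejects every further statement in the same transaction until it is rolled back, which is why a single GraphRAG failure blocks all subsequent operations. One possible shape for the fix is a small context manager that rolls back on any exception. This is a sketch only: `rollback_on_error` is a hypothetical helper, and it assumes the session object exposes `commit()` and `rollback()` as a SQLAlchemy `Session` does.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def rollback_on_error(session: Any) -> Iterator[Any]:
    """Commit on success; roll back and re-raise on failure.

    Hypothetical helper. `session` is assumed to expose commit() and
    rollback(), as a SQLAlchemy Session does.
    """
    try:
        yield session
        session.commit()
    except Exception:
        # Clear the aborted transaction so later statements can run.
        session.rollback()
        raise
```

Wrapping each document's ingestion in `with rollback_on_error(db):` would confine a failure to that document instead of poisoning the rest of the chunk.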

Additional Improvements

Performance

**Chunk size**: 1000→100 records to avoid Microsoft Graph 504 errors
**Progress calculation**: Uses the `has_more` flag instead of the per-page `total_count` (fixed 150% display bug)
**Email body**: Full content (up to 10KB) instead of the 500-char `bodyPreview`
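
The progress and body-size items above can be sketched as two small helpers. Both are illustrative assumptions rather than the service's code: `truncate_body` treats "10KB" as UTF-8 bytes, and `progress_percent` uses a hypothetical `estimated_total` argument and a 99% cap while more pages remain.

```python
from typing import Optional

MAX_BODY_BYTES = 10 * 1024  # "up to 10KB"; assumed to mean UTF-8 bytes


def truncate_body(body: str, max_bytes: int = MAX_BODY_BYTES) -> str:
    """Cap an email body at max_bytes of UTF-8 without splitting a character."""
    encoded = body.encode("utf-8")
    if len(encoded) <= max_bytes:
        return body
    # errors="ignore" drops a trailing partial multi-byte sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def progress_percent(
    records_processed: int,
    has_more: bool,
    estimated_total: Optional[int] = None,
) -> float:
    """Return progress in [0, 100], reaching 100 only when has_more is False."""
    if not has_more:
        return 100.0
    if estimated_total is None or estimated_total <= 0:
        # No trustworthy total yet: report 0 rather than guessing.
        return 0.0
    # Cap below 100 while pages remain, so the display cannot overshoot.
    return min(99.0, 100.0 * records_processed / estimated_total)
```

With a per-page total of 100 and 150 records processed across pages, a naive ratio gives 150%; the capped version reports 99% until `has_more` turns false.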

Architecture

**Dual ingestion paths**:
**Semantic memory** (LanceDB): Vector embeddings for all tenants, no API key required
**Structured memory** (GraphRAG): Entity extraction for tenants with BYO key
**Worker VM**: 2GB memory (1-core) for GraphRAG ingestion headroom
**Autoscaling**: Multi-worker (1-3 machines) via Fly Machines API

Deployment Timeline

**2026-05-01 00:00-01:00 UTC** - Initial fixes (timeout, 504 retry, worker shutdown)
**2026-05-01 01:00-02:00 UTC** - Session fixes (re-query, asyncio, BYOK key)
**2026-05-01 02:00-03:00 UTC** - Email body + worker process fixes
**2026-05-01 03:00-04:00 UTC** - Reaper timeout + dual-memory architecture

Ingestion Flow

Email record (100 per chunk)
    │
    ├─→ _extract_structured_entities() → DiscoveredEntity → Postgres
    │   └─→ Rule-based: from, to, subject, content fields
    │
    ├─→ graphrag.ingest_document() → GraphNode/GraphEdge → Postgres
    │   └─→ LLM-based: people, orgs, topics from email body
    │   └─→ Requires: Tenant BYOK key (OPENAI_API_KEY)
    │
    └─→ lancedb.add_document() × 100 → LanceDB
        └─→ Full text + subject → Vector embeddings
        └─→ Requires: None (runs for all tenants)
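
The flow above can be read as a per-chunk routine in which the GraphRAG step is the only one gated on a tenant key. The sketch below is illustrative: `ingest_chunk`, the `Ingestors` container, and the `tenant_keys` mapping are hypothetical stand-ins, and the three callables represent `_extract_structured_entities()`, `graphrag.ingest_document()`, and `lancedb.add_document()` without reproducing their real signatures.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping

Email = Dict[str, str]

BYOK_KEY_NAME = "OPENAI_API_KEY"  # stored upper-case; the lookup is case-sensitive


@dataclass
class Ingestors:
    """Hypothetical bundle of the three ingestion steps."""

    extract_structured: Callable[[Email], None]
    graphrag_ingest: Callable[[Email, str], None]
    lancedb_add: Callable[[Email], None]


def ingest_chunk(
    emails: List[Email],
    tenant_keys: Mapping[str, str],
    steps: Ingestors,
) -> Dict[str, int]:
    """Run every email through the applicable paths and count the calls."""
    api_key = tenant_keys.get(BYOK_KEY_NAME)  # exact-case match, no fallback key
    counts = {"structured": 0, "graphrag": 0, "lancedb": 0}
    for email in emails:
        steps.extract_structured(email)  # rule-based, always runs
        counts["structured"] += 1
        if api_key:  # LLM path only for tenants with a BYOK key
            steps.graphrag_ingest(email, api_key)
            counts["graphrag"] += 1
        steps.lancedb_add(email)  # vector path, runs for all tenants
        counts["lancedb"] += 1
    return counts
```

Skipping the GraphRAG step when no key is found, instead of falling back to a placeholder key, avoids the 401 errors described in bug 6.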

Deployment Commands

Deploy Latest Changes

git pull origin main
fly deploy -a atom-saas --strategy immediate

Verify Deployment

# Check machines started
fly status -a atom-saas

# Check worker logs
fly logs -a atom-saas --machine 2861d27a3414e8 | grep -E "(ROLE|Starting|Loaded service)"

Verification

Check Recent Jobs

SELECT
    id,
    status,
    records_processed,
    entities_extracted,
    relationships_extracted,
    created_at,
    completed_at
FROM historical_sync_jobs
WHERE created_at > NOW() - INTERVAL '1 hour'
ORDER BY created_at DESC
LIMIT 5;

Monitor Worker Activity

# Real-time worker logs
fly logs -a atom-saas --machine 2861d27a3414e8 | grep -E "(dequeue|Fetched|Persisting|entities|relationships|LanceDB)"

# Check for errors
fly logs -a atom-saas --machine 2861d27a3414e8 | grep -iE "(error|failed)"

Verify Semantic Extraction

# Check LanceDB documents stored
# (Requires access to production database)

-- Check GraphRAG entities created
SELECT COUNT(*) FROM graph_nodes
WHERE tenant_id = '31c06fc4-db22-4740-83ea-48ac14f25810'
  AND created_at > NOW() - INTERVAL '1 hour';

Expected Performance

Per 100-record chunk:

**Fetch from Outlook**: ~2-5 seconds
**LanceDB embeddings**: ~30-60 seconds (synchronous, one-at-a-time)
**GraphRAG entity extraction**: ~5-15 minutes (if BYOK key available)
**Total time**: ~8-10 minutes (LanceDB only) or ~15-20 minutes (both paths)

**Note**: LanceDB calls currently run synchronously, one document at a time, rather than concurrently via `asyncio.gather()`. Batching them is a future optimization.
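
One way to approach that optimization is to run the blocking calls in worker threads and gather them under a concurrency cap. This is a minimal sketch, assuming `add_document` is a blocking, thread-safe callable; `add_documents_concurrently` and the limit of 8 are hypothetical, and whether the real LanceDB handler tolerates concurrent writes would need to be checked first.

```python
import asyncio
from typing import Any, Callable, List, Sequence


async def add_documents_concurrently(
    add_document: Callable[[Any], Any],
    documents: Sequence[Any],
    max_concurrency: int = 8,  # assumed limit; tune for the 1-core worker
) -> List[Any]:
    """Run a blocking add_document over documents with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def add_one(doc: Any) -> Any:
        async with semaphore:
            # to_thread keeps the event loop free while the blocking call runs.
            return await asyncio.to_thread(add_document, doc)

    # return_exceptions=True keeps one failed document from aborting the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(
        *(add_one(doc) for doc in documents), return_exceptions=True
    )
```

Results are returned in input order, so callers can pair each entry with its source document and log any entries that are exceptions.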

Current Status

✅ Working

Worker stays alive and processes jobs continuously
Fetches 100 emails per chunk without 504 errors
Full email body (up to 10KB) available for processing
No premature reaper kills (30-minute threshold)
LanceDB embeddings created for all tenants
BYOK tenants are routed to GraphRAG entity extraction (still subject to Bug #10)

⏳ Pending

**Transaction error fix** needed for GraphRAG path
**LanceDB batching** optimization (future)
Production backfill completion metrics

🔧 Known Limitations

GraphRAG path blocked by transaction errors (Bug #10)
Single-core worker limits concurrent processing
Synchronous LanceDB calls add ~30-60 seconds per chunk

Next Steps

**Fix transaction errors** in GraphRAG ingestion path
**Verify production backfill** completes successfully
**Optimize LanceDB** with batched concurrent calls
**Monitor reaper** to ensure 30-minute threshold is sufficient
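
The reaper's decision reduces to comparing a job's last heartbeat with a threshold, which is also why the initial heartbeat matters: a job with no heartbeat can look stale at once. The following is a minimal sketch under assumed names (`is_stale`, `REAPER_THRESHOLD`); the real reaper's fields and query are not shown in this summary.

```python
from datetime import datetime, timedelta
from typing import Optional

REAPER_THRESHOLD = timedelta(minutes=30)  # raised from 15 minutes in fix 4b


def is_stale(
    last_heartbeat: Optional[datetime],
    now: datetime,
    threshold: timedelta = REAPER_THRESHOLD,
) -> bool:
    """Return True when a running job should be treated as abandoned."""
    if last_heartbeat is None:
        # No heartbeat recorded yet: the case fix 4a avoids by sending a
        # heartbeat before the first API call.
        return True
    return now - last_heartbeat > threshold
```

Logging `now - last_heartbeat` for each running job would show how close real chunks come to the 30-minute limit.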