Atom AI Labs - AI-Powered Multi-Tenant Platform

Plan: Automated Entity Linking Agent + Attachment Ingestion

**Created:** May 7, 2026

**Context:** The Raw Discoveries tab shows extracted entities, but linking is manual (prompt()). Attachments on Outlook emails are not ingested; only the body text is extracted.

---

Plan 1: LLM Judge Agent for Automated Entity Linking

Overview

Create a tenant-scoped agent (entity_linking_agent) that evaluates DiscoveredEntity rows and decides: link, reject, or flag for human review. Uses the existing agent framework (AgentRegistry, governance, graduation, feedback loops).

Architecture

Trigger (after sync chunk / scheduled)
  ↓
EntityLinkingAgent.execute()
  ↓
For each pending DiscoveredEntity:
  1. Fetch entity + source record + existing graph context
  2. LLM judge evaluates: is this a real entity? does it match an existing node?
  3. Decision: link / reject / flag
  4. Record decision with confidence + reasoning
  ↓
AgentLearning records feedback for continuous improvement
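
The loop above can be sketched as follows. This is a minimal illustration, not the framework's actual API: `PendingEntity`, `Judgment`, `run_linking_batch`, and the status strings (`linked`, `rejected`, `needs_review`) are hypothetical stand-ins, and the judge is passed in as a plain callable so the control flow can be exercised without an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Judgment:
    decision: str                      # "link", "reject", or "flag"
    confidence: float
    reasoning: str
    target_node_id: Optional[str] = None


@dataclass
class PendingEntity:
    # Hypothetical stand-in for a DiscoveredEntity row.
    id: str
    name: str
    status: str = "pending"
    extraction_metadata: dict = field(default_factory=dict)


def run_linking_batch(
    entities: list[PendingEntity],
    judge: Callable[[PendingEntity], Judgment],
) -> dict[str, int]:
    """Judge every pending entity and record the outcome on the row."""
    # Assumed status names; the real model may use different values.
    status_for = {"link": "linked", "reject": "rejected", "flag": "needs_review"}
    counts = {"link": 0, "reject": 0, "flag": 0}
    for entity in entities:
        if entity.status != "pending":
            continue  # already handled by an earlier run or by a human
        judgment = judge(entity)
        # Step 4: keep confidence and reasoning next to the decision.
        entity.extraction_metadata["linking_judgment"] = {
            "decision": judgment.decision,
            "confidence": judgment.confidence,
            "reasoning": judgment.reasoning,
            "target_node_id": judgment.target_node_id,
        }
        entity.status = status_for[judgment.decision]
        counts[judgment.decision] += 1
    return counts
```

Injecting the judge keeps the batch logic testable with a fake judge; the real agent would presumably wrap the LLM call behind the same callable shape.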

Files to Create/Modify

New Files

**backend-saas/core/agents/entity_linking_agent.py**

EntityLinkingAgent class extending agent framework
evaluate_entity(entity) → { decision, confidence, reasoning, target_node_id }
execute_batch(tenant_id, workspace_id) → processes all pending entities
Uses LLM via LLMService with task_type="linking"
Records decisions in DiscoveredEntity.extraction_metadata.linking_judgment

**backend-saas/core/agents/entity_linking_judge.py**

EntityLinkingJudge — the LLM evaluation logic
Prompt template with:
Entity name, type, properties
Source record preview (email body, etc.)
Existing graph nodes of similar name/type
Instructions: determine if entity is real, if it matches existing node, confidence
Returns structured JSON: { decision, target_node_id, confidence, reasoning }
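
Because the judge returns JSON produced by an LLM, the response should probably be validated before anything acts on it. A minimal sketch, assuming the judge's raw output arrives as a string; `parse_judge_response` and the fall-back-to-flag behaviour are illustrative choices, not part of the existing codebase.

```python
import json
from typing import Optional, TypedDict


class JudgeResult(TypedDict):
    decision: str
    target_node_id: Optional[str]
    confidence: float
    reasoning: str


VALID_DECISIONS = {"link", "reject", "flag"}


def parse_judge_response(raw: str) -> JudgeResult:
    """Parse the judge's JSON; fall back to 'flag' when the output is unusable."""
    fallback: JudgeResult = {
        "decision": "flag",
        "target_node_id": None,
        "confidence": 0.0,
        "reasoning": "unparseable judge output",
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    decision = data.get("decision")
    if not isinstance(decision, str) or decision not in VALID_DECISIONS:
        return fallback
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        return fallback
    return {
        "decision": decision,
        "target_node_id": data.get("target_node_id"),
        # Clamp to [0, 1] so downstream threshold checks stay meaningful.
        "confidence": min(1.0, max(0.0, confidence)),
        "reasoning": str(data.get("reasoning", "")),
    }
```

Falling back to `flag` sends malformed output to human review rather than silently linking or rejecting, which seems the safer default while the agent is still supervised.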

**backend-saas/core/agents/__init__.py**

Package init for agents directory

**backend-saas/tests/test_entity_linking_agent_tdd.py**

TDD tests covering:
Agent correctly identifies duplicate entities
Agent links to existing graph node when name matches
Agent creates new graph node for novel entities
Agent rejects low-confidence / spam entities
Agent respects tenant isolation
Undo functionality: agent can revert a bad link

Modified Files

**backend-saas/core/models.py**

Add entity_linking_agent as default agent in AgentRegistry (like other system agents)
Add linking_judgment JSON field to DiscoveredEntity (or use extraction_metadata)

**backend-saas/core/ingestion_pipeline.py**

After _run_schema_discovery(), call entity_linking_agent.execute_batch()
Only if auto-linking is enabled (feature flag / tenant setting)

**backend-saas/core/historical_sync_service.py**

Same: after schema discovery in _process_sync_job, trigger linking agent

**backend-saas/api/routes/multi_entity_extraction_routes.py**

Add endpoint: POST /api/v1/entities/discovered/auto-link
Add endpoint: POST /api/v1/entities/discovered/{entity_id}/undo-link

**src/components/knowledge-graph/DiscoveredEntitiesPanel.tsx**

Add "Auto-Link All" button (triggers agent batch)
Add "Undo" button on linked entities
Show linking judgment (reasoning) in expanded details

Agent Training & Learning

The agent uses the existing agent framework:

**AgentRegistry**: registered as entity_linking_agent, category="System", maturity="supervised"
**AgentLearning**: tracks feedback counts, adjusts LLM temperature over time
**AgentFeedback**: users can 👍/👎 linking decisions, feeding RLHF
**Graduation**: starts at "supervised" (flags uncertain cases), can graduate to "autonomous" after X successful links with no undos
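
The graduation rule could be expressed as a small pure function. This is a sketch under one reading of the rule (a total count of successful links and zero undos); `LinkingStats` and `next_maturity` are hypothetical names, and the required count (the X above) is deliberately left as a parameter.

```python
from dataclasses import dataclass


@dataclass
class LinkingStats:
    successful_links: int
    undone_links: int


def next_maturity(current: str, stats: LinkingStats, required_successes: int) -> str:
    """Return the maturity level the agent should have given its track record."""
    if (
        current == "supervised"
        and stats.undone_links == 0
        and stats.successful_links >= required_successes
    ):
        return "autonomous"
    return current  # no change; demotion is out of scope for this sketch
```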

Feature Flags

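| Flag | Default | Description |
| --- | --- | --- |
| `AUTO_LINK_ENABLED` | `false` | Global kill switch |
| `AUTO_LINK_MIN_CONFIDENCE` | `0.7` | Minimum confidence to auto-link |
| `AUTO_LINK_FLAG_THRESHOLD` | `0.5` | At or above this (and below `AUTO_LINK_MIN_CONFIDENCE`), flag for human review; below this, reject |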
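
How the two thresholds map a confidence score to a decision can be sketched as below. The interpretation follows the TDD cases in this plan (0.9 → linked, 0.6 → flagged, 0.2 → rejected); the function name `decide` and the inclusive boundaries are assumptions.

```python
def decide(
    confidence: float,
    min_confidence: float = 0.7,   # AUTO_LINK_MIN_CONFIDENCE
    flag_threshold: float = 0.5,   # AUTO_LINK_FLAG_THRESHOLD
) -> str:
    """Map a judge confidence score to link / flag / reject."""
    if confidence >= min_confidence:
        return "link"
    if confidence >= flag_threshold:
        return "flag"
    return "reject"
```

The `AUTO_LINK_ENABLED` kill switch would be checked before this function is ever called, so it is not modelled here.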

TDD Verification

TestAutoLinking:
  test_agent_links_high_confidence_entity           # confidence 0.9 → linked
  test_agent_flags_medium_confidence_entity         # confidence 0.6 → flagged
  test_agent_rejects_low_confidence_entity          # confidence 0.2 → rejected
  test_agent_matches_existing_graph_node            # "Acme Corp" matches existing node
  test_agent_creates_new_node_for_novel_entity      # No match → creates node
  test_agent_undo_reverts_link                      # Undo restores pending status
  test_agent_tenant_isolation                       # Tenant A can't see Tenant B entities
  test_agent_learning_from_feedback                 # 👍 increases future auto-link rate
  test_agent_graduation_path                        # supervised → autonomous after N successes

---

Plan 2: Outlook Email Attachment Ingestion

Overview

Download and parse attachments from Outlook emails during historical sync. Reuses the existing Docling document parsing pipeline used by OneDrive/WorkDrive.

Architecture

Outlook sync fetches email
  ↓
Check hasAttachments flag
  ↓
If true: GET /me/messages/{id}/attachments (metadata)
  ↓
For each attachment:
  GET /me/messages/{id}/attachments/{attachmentId}/$value (binary)
  ↓
Pass through DoclingProcessor (same as OneDrive/WorkDrive)
  ↓
Append parsed text to email body for LLM extraction
  ↓
Create DiscoveredEntity with source_record_type="outlook_attachment"

Files to Create/Modify

New Files

**backend-saas/tests/test_outlook_attachment_ingestion_tdd.py**

TDD tests covering:
Attachment metadata fetched correctly
Binary content downloaded
PDF parsed via Docling
DOCX parsed via Docling
Text appended to email body for extraction
Attachment entities created with proper type
Large attachments skipped (respect MAX size)
Rate limiting for attachment API calls

Modified Files

**backend-saas/integrations/outlook_service.py**

Add download_attachment(message_id, attachment_id) method
Calls GET /me/messages/{message_id}/attachments/{attachment_id}/$value
Returns bytes
Reuses existing self.client (httpx) and access token
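
A minimal sketch of the method, written as a free function so it can be read in isolation. It assumes the client behaves like `httpx.AsyncClient` (awaitable `get()`, a response with `raise_for_status()` and `.content`) and already carries the auth headers; `GRAPH_BASE_URL` and `attachment_value_url` are illustrative names, and the real service may already hold the base URL elsewhere.

```python
import asyncio
from typing import Any

# Assumed constant; the existing service may already define the base URL.
GRAPH_BASE_URL = "https://graph.microsoft.com/v1.0"


def attachment_value_url(message_id: str, attachment_id: str) -> str:
    """Build the raw-content URL for one attachment."""
    return (
        f"{GRAPH_BASE_URL}/me/messages/{message_id}"
        f"/attachments/{attachment_id}/$value"
    )


async def download_attachment(client: Any, message_id: str, attachment_id: str) -> bytes:
    """Fetch an attachment's binary content via the injected HTTP client."""
    response = await client.get(attachment_value_url(message_id, attachment_id))
    response.raise_for_status()  # surfaces 404s for missing attachments
    return response.content
```

Inside `outlook_service.py` the same body would presumably use `self.client` instead of the `client` parameter.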

**backend-saas/core/integrations/adapters/outlook_integration_v2.py**

Add download_file(file_id) method (standardized contract for _prepare_record_text_async)
file_id format: "{message_id}:{attachment_id}"
Delegates to Outlook service's download_attachment
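
The composite file_id can be built and split with two small helpers. A sketch only: `make_file_id` and `split_file_id` are hypothetical names, and the code assumes the message ID never contains a colon, which should be verified against real IDs before relying on it.

```python
def make_file_id(message_id: str, attachment_id: str) -> str:
    """Combine both IDs into the "{message_id}:{attachment_id}" format."""
    if ":" in message_id:
        raise ValueError("message_id must not contain ':'")
    return f"{message_id}:{attachment_id}"


def split_file_id(file_id: str) -> tuple[str, str]:
    """Split a composite file_id back into (message_id, attachment_id)."""
    # Split on the first colon only; anything after it is the attachment ID.
    message_id, sep, attachment_id = file_id.partition(":")
    if not sep or not message_id or not attachment_id:
        raise ValueError(f"malformed file_id: {file_id!r}")
    return message_id, attachment_id
```

`download_file(file_id)` would then call `split_file_id` and pass both parts to `download_attachment`.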

**backend-saas/core/ingestion_pipeline.py**

In _prepare_record_text_async, add handling for integration_id == "outlook"
Check if record has hasAttachments == true
Fetch and parse each attachment
Append parsed text: record["body"] += "\n\n[Attachment: {name}]\n{parsed_text}"
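
The Outlook branch could look roughly like this. It is a sketch under stated assumptions: `add_attachment_text` is a hypothetical helper, the record is a dict with `id`, `hasAttachments`, and a string `body`, and the listing, download, and parse steps are injected callables standing in for the adapter and the Docling processor.

```python
import asyncio
from typing import Awaitable, Callable


async def add_attachment_text(
    record: dict,
    list_attachments: Callable[[str], Awaitable[list[dict]]],
    download_file: Callable[[str], Awaitable[bytes]],
    parse: Callable[[bytes, str], str],
) -> dict:
    """Append parsed attachment text to record["body"] when attachments exist."""
    if not record.get("hasAttachments"):
        return record  # nothing to fetch; leave the record untouched
    for attachment in await list_attachments(record["id"]):
        file_id = f"{record['id']}:{attachment['id']}"
        data = await download_file(file_id)
        text = parse(data, attachment["name"])
        if text.strip():  # skip attachments that yielded no text
            body = record.get("body") or ""
            record["body"] = f"{body}\n\n[Attachment: {attachment['name']}]\n{text}"
    return record
```

Size and type filtering, plus rate limiting, would sit between listing and downloading; they are omitted here to keep the flow readable.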

**backend-saas/core/historical_sync_service.py**

No changes needed — _prepare_record_text_async is already called per record

Reuse from OneDrive

The OneDrive adapter (core/integrations/adapters/onedrive.py) already implements:

download_file(file_id) — downloads binary from MS Graph
Docling integration via _prepare_record_text_async

The Outlook service (integrations/outlook_service.py) already has:

self.client (httpx.AsyncClient) with auth headers
MS Graph API base URL
Access token management

The attachment flow is identical: MS Graph API → binary → Docling → text. Only the endpoint differs: /drive/items/{id}/content vs /messages/{id}/attachments/{aid}/$value.

Rate Limiting & Limits

| Setting | Default | Description |
| --- | --- | --- |
| `OUTLOOK_MAX_ATTACHMENT_SIZE_MB` | 10 | Skip attachments larger than this (in MB) |
| `OUTLOOK_ATTACHMENT_RATE_LIMIT` | 10/min | Maximum attachment requests per minute, to stay under MS Graph throttling |
| `OUTLOOK_ATTACHMENT_TYPES` | pdf,docx,xlsx,pptx,txt,csv | Supported formats (via Docling) |
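
The size and type settings can be applied to attachment metadata before any download. A minimal sketch, assuming the settings are read from environment variables and that each metadata item carries `name` and `size` (in bytes); `load_attachment_limits` and `should_ingest` are hypothetical names, and 1 MB is taken as 1024 × 1024 bytes.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional


def load_attachment_limits(
    env: Optional[Mapping[str, str]] = None,
) -> tuple[int, set[str]]:
    """Return (max size in bytes, allowed lowercase extensions)."""
    env = os.environ if env is None else env
    max_mb = float(env.get("OUTLOOK_MAX_ATTACHMENT_SIZE_MB", "10"))
    raw_types = env.get("OUTLOOK_ATTACHMENT_TYPES", "pdf,docx,xlsx,pptx,txt,csv")
    allowed = {t.strip().lower() for t in raw_types.split(",") if t.strip()}
    return int(max_mb * 1024 * 1024), allowed


def should_ingest(attachment: dict, max_bytes: int, allowed_types: set[str]) -> bool:
    """True when the attachment has a supported extension and fits the size cap."""
    name = attachment.get("name") or ""
    extension = PurePosixPath(name).suffix.lstrip(".").lower()
    size = attachment.get("size") or 0
    return extension in allowed_types and 0 < size <= max_bytes
```

Filtering on metadata first avoids spending rate-limited download calls on files that would be skipped anyway.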

TDD Verification

TestOutlookAttachmentIngestion:
  test_fetches_attachment_metadata           # GET /attachments returns list
  test_downloads_attachment_content          # GET /$value returns bytes
  test_parses_pdf_attachment                 # Docling extracts PDF text
  test_parses_docx_attachment                # Docling extracts DOCX text
  test_appends_text_to_email_body            # Body + attachment text combined
  test_skips_large_attachments               # >10MB skipped with warning
  test_handles_missing_attachment            # 404 from Graph API
  test_creates_attachment_entities            # DiscoveredEntity with type="Document"
  test_rate_limits_attachment_requests        # Respects 10/min throttle
  test_tenant_isolation                      # Cross-tenant attachment access blocked

---

Implementation Order

Phase 1: Attachment Ingestion (1-2 sessions)

Add download_attachment to outlook_service.py
Add download_file to outlook_integration_v2.py
Modify _prepare_record_text_async for outlook integration
TDD verification
Deploy + test with Brennan's Outlook backfill

Phase 2: Automated Linking Agent (2-3 sessions)

Create EntityLinkingJudge with LLM prompt
Create EntityLinkingAgent extending agent framework
Wire into ingestion pipeline (after schema discovery)
Add API endpoints (auto-link, undo)
Add UI components (Auto-Link All button, Undo, judgment display)
TDD verification
Register as default agent for all tenants
Deploy + monitor linking decisions

Phase 3: Integration & Polish (1 session)

End-to-end TDD: sync → extract → link → view in Knowledge Graph
Agent graduation: supervised → autonomous based on feedback
Documentation update

---

Dependencies

| Dependency | Status | Notes |
| --- | --- | --- |
| Docling processor | ✅ Deployed | `get_docling_processor()` singleton |
| Agent framework | ✅ Deployed | AgentRegistry, governance, graduation |
| EntityLinkingService | ✅ Exists | `link_entities_to_graph` already written |
| SchemaDiscoveryService | ✅ Exists | Creates EntityTypeDefinition drafts |
| MS Graph API access | ✅ Working | Token refresh handled by outlook_service |
| Outlook attachment endpoints | ⚠️ Verify scope | `Mail.Read` should cover message attachments; confirm the app's granted scopes before requesting `Mail.ReadWrite` (`Files.Read` applies to OneDrive files, not mail) |