Plan: Automated Entity Linking Agent + Attachment Ingestion
**Created:** May 7, 2026
**Context:** Raw Discoveries tab shows extracted entities but linking is manual (prompt()).
No attachment ingestion from Outlook emails (only body text is extracted).
---
Plan 1: LLM Judge Agent for Automated Entity Linking
Overview
Create a tenant-scoped agent (entity_linking_agent) that evaluates DiscoveredEntity
rows and decides: link, reject, or flag for human review. Uses the existing agent
framework (AgentRegistry, governance, graduation, feedback loops).
Architecture
Trigger (after sync chunk / scheduled)
↓
EntityLinkingAgent.execute()
↓
For each pending DiscoveredEntity:
1. Fetch entity + source record + existing graph context
2. LLM judge evaluates: is this a real entity? does it match an existing node?
3. Decision: link / reject / flag
4. Record decision with confidence + reasoning
↓
AgentLearning records feedback for continuous improvement
Files to Create/Modify
New Files
**backend-saas/core/agents/entity_linking_agent.py**
- `EntityLinkingAgent` class extending agent framework
- `evaluate_entity(entity)` → `{ decision, confidence, reasoning, target_node_id }`
- `execute_batch(tenant_id, workspace_id)` → processes all pending entities
- Uses LLM via `LLMService` with `task_type="linking"`
- Records decisions in `DiscoveredEntity.extraction_metadata.linking_judgment`
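A minimal sketch of the batch loop described above, assuming entities are plain dicts and the judge is any callable that returns the `{ decision, confidence, reasoning, target_node_id }` shape. The function signature, the `linked` / `rejected` / `needs_review` status names, and the dict-based entity are illustrative assumptions, not the existing agent framework API; only the `pending` status and the `extraction_metadata.linking_judgment` location come from this plan.

```python
from typing import Any, Callable, Dict, List

Entity = Dict[str, Any]
Judge = Callable[[Entity], Dict[str, Any]]

# Hypothetical mapping from judge decision to entity status.
STATUS_FOR_DECISION = {
    "link": "linked",
    "reject": "rejected",
    "flag": "needs_review",
}


def execute_batch(entities: List[Entity], judge: Judge) -> Dict[str, int]:
    """Judge every pending entity and store the judgment on the entity."""
    counts = {"link": 0, "reject": 0, "flag": 0}
    for entity in entities:
        if entity.get("status") != "pending":
            continue  # already handled by a human or an earlier run
        judgment = judge(entity)
        decision = judgment["decision"]
        # Keep the full judgment so the UI can show the reasoning later.
        metadata = entity.setdefault("extraction_metadata", {})
        metadata["linking_judgment"] = judgment
        entity["status"] = STATUS_FOR_DECISION[decision]
        counts[decision] += 1
    return counts
```

Passing the judge in as a callable keeps the loop testable without an LLM; a real implementation would also scope the query by `tenant_id` and `workspace_id`.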
**backend-saas/core/agents/entity_linking_judge.py**
- `EntityLinkingJudge`: the LLM evaluation logic
- Prompt template with:
  - Entity name, type, properties
  - Source record preview (email body, etc.)
  - Existing graph nodes of similar name/type
  - Instructions: determine if entity is real, if it matches existing node, confidence
- Returns structured JSON: `{ decision, target_node_id, confidence, reasoning }`
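LLM output is not guaranteed to be valid JSON, so the judge probably needs a defensive parsing step. The sketch below shows one way to validate the structured reply and fall back to `flag` when it is malformed. `LinkingJudgment` and `parse_judgment` are hypothetical names, and the fall-back-to-flag policy is an assumption rather than a decision recorded in this plan.

```python
import json
from dataclasses import dataclass
from typing import Optional

VALID_DECISIONS = {"link", "reject", "flag"}


@dataclass(frozen=True)
class LinkingJudgment:
    decision: str
    target_node_id: Optional[str]
    confidence: float
    reasoning: str


def parse_judgment(raw: str) -> LinkingJudgment:
    """Parse the judge's JSON reply; anything malformed becomes a 'flag'."""
    try:
        data = json.loads(raw)
        decision = data["decision"]
        confidence = float(data["confidence"])
        if decision not in VALID_DECISIONS or not 0.0 <= confidence <= 1.0:
            raise ValueError("judgment outside the expected range")
        return LinkingJudgment(
            decision=decision,
            target_node_id=data.get("target_node_id"),
            confidence=confidence,
            reasoning=str(data.get("reasoning", "")),
        )
    except (ValueError, KeyError, TypeError):
        # Assumed policy: never auto-link on output we cannot trust.
        return LinkingJudgment("flag", None, 0.0, "unparseable judge output")
```

Routing bad output to human review keeps a prompt regression from silently linking or rejecting entities.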
**backend-saas/core/agents/__init__.py**
- Package init for agents directory
**backend-saas/tests/test_entity_linking_agent_tdd.py**
- TDD tests covering:
  - Agent correctly identifies duplicate entities
  - Agent links to existing graph node when name matches
  - Agent creates new graph node for novel entities
  - Agent rejects low-confidence / spam entities
  - Agent respects tenant isolation
  - Undo functionality: agent can revert a bad link
Modified Files
**backend-saas/core/models.py**
- Add `entity_linking_agent` as default agent in AgentRegistry (like other system agents)
- Add `linking_judgment` JSON field to `DiscoveredEntity` (or use `extraction_metadata`)
**backend-saas/core/ingestion_pipeline.py**
- After `_run_schema_discovery()`, call `entity_linking_agent.execute_batch()`
- Only if auto-linking is enabled (feature flag / tenant setting)
**backend-saas/core/historical_sync_service.py**
- Same: after schema discovery in `_process_sync_job`, trigger linking agent
**backend-saas/api/routes/multi_entity_extraction_routes.py**
- Add endpoint: `POST /api/v1/entities/discovered/auto-link`
- Add endpoint: `POST /api/v1/entities/discovered/{entity_id}/undo-link`
**src/components/knowledge-graph/DiscoveredEntitiesPanel.tsx**
- Add "Auto-Link All" button (triggers agent batch)
- Add "Undo" button on linked entities
- Show linking judgment (reasoning) in expanded details
Agent Training & Learning
The agent uses the existing agent framework:
- **AgentRegistry**: registered as `entity_linking_agent`, category="System", maturity="supervised"
- **AgentLearning**: tracks feedback counts, adjusts LLM temperature over time
- **AgentFeedback**: users can 👍/👎 linking decisions, feeding RLHF
- **Graduation**: starts at "supervised" (flags uncertain cases), can graduate to "autonomous" after X successful links with no undos
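The graduation rule can be sketched as a pure function. This is only an illustration: `next_maturity` and its parameters are hypothetical names, and the required success count (the "X" above) is left as a parameter because the plan does not fix a value.

```python
def next_maturity(
    current: str,
    successful_links: int,
    undo_count: int,
    required_successes: int,
) -> str:
    """Promote a supervised agent once it has enough links and no undos."""
    if (
        current == "supervised"
        and undo_count == 0
        and successful_links >= required_successes
    ):
        return "autonomous"
    return current  # no demotion path is modeled in this sketch
```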
Feature Flags
| Flag | Default | Description |
|---|---|---|
| `AUTO_LINK_ENABLED` | false | Global kill switch |
| `AUTO_LINK_MIN_CONFIDENCE` | 0.7 | Minimum confidence to auto-link |
| `AUTO_LINK_FLAG_THRESHOLD` | 0.5 | At or above this (and below the auto-link minimum), flag for human review; below it, reject |
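Read together with the TDD cases in this plan (0.9 linked, 0.6 flagged, 0.2 rejected), the two thresholds appear to split confidence into three bands. A minimal sketch of that routing follows; the function name `route_by_confidence` and the inclusive lower bounds are assumptions.

```python
def route_by_confidence(
    confidence: float,
    min_confidence: float = 0.7,   # AUTO_LINK_MIN_CONFIDENCE default
    flag_threshold: float = 0.5,   # AUTO_LINK_FLAG_THRESHOLD default
) -> str:
    """Map a judge confidence score to link / flag / reject."""
    if confidence >= min_confidence:
        return "link"
    if confidence >= flag_threshold:
        return "flag"  # uncertain: leave for human review
    return "reject"
```

Whether a score exactly on a boundary links or flags is a product decision; the sketch treats both lower bounds as inclusive.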
TDD Verification
TestAutoLinking:
test_agent_links_high_confidence_entity # confidence 0.9 → linked
test_agent_flags_medium_confidence_entity # confidence 0.6 → flagged
test_agent_rejects_low_confidence_entity # confidence 0.2 → rejected
test_agent_matches_existing_graph_node # "Acme Corp" matches existing node
test_agent_creates_new_node_for_novel_entity # No match → creates node
test_agent_undo_reverts_link # Undo restores pending status
test_agent_tenant_isolation # Tenant A can't see Tenant B entities
test_agent_learning_from_feedback # 👍 increases future auto-link rate
test_agent_graduation_path # supervised → autonomous after N successes
---
Plan 2: Outlook Email Attachment Ingestion
Overview
Download and parse attachments from Outlook emails during historical sync.
Reuses the existing Docling document parsing pipeline used by OneDrive/WorkDrive.
Architecture
Outlook sync fetches email
↓
Check hasAttachments flag
↓
If true: GET /me/messages/{id}/attachments (metadata)
↓
For each attachment:
GET /me/messages/{id}/attachments/{attachmentId}/$value (binary)
↓
Pass through DoclingProcessor (same as OneDrive/WorkDrive)
↓
Append parsed text to email body for LLM extraction
↓
Create DiscoveredEntity with source_record_type="outlook_attachment"
Files to Create/Modify
New Files
**backend-saas/tests/test_outlook_attachment_ingestion_tdd.py**
- TDD tests covering:
  - Attachment metadata fetched correctly
  - Binary content downloaded
  - PDF parsed via Docling
  - DOCX parsed via Docling
  - Text appended to email body for extraction
  - Attachment entities created with proper type
  - Large attachments skipped (respect `OUTLOOK_MAX_ATTACHMENT_SIZE_MB`)
  - Rate limiting for attachment API calls
Modified Files
**backend-saas/integrations/outlook_service.py**
- Add `download_attachment(message_id, attachment_id)` method
- Calls `GET /me/messages/{message_id}/attachments/{attachment_id}/$value`
- Returns bytes
- Reuses existing `self.client` (httpx) and access token
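A rough sketch of what `download_attachment` might look like. It is written as a free function taking the client as an argument so it can run here without the real service class. The `GRAPH_BASE_URL` constant and the stub classes are assumptions for illustration; the only client behavior relied on is an awaitable `get()` whose response exposes `raise_for_status()` and `content`, as `httpx.AsyncClient` does.

```python
import asyncio
from typing import Any

# Assumed base URL; the real service already holds its own.
GRAPH_BASE_URL = "https://graph.microsoft.com/v1.0"


async def download_attachment(client: Any, message_id: str, attachment_id: str) -> bytes:
    """Fetch the raw bytes of one attachment via the /$value endpoint."""
    url = (
        f"{GRAPH_BASE_URL}/me/messages/{message_id}"
        f"/attachments/{attachment_id}/$value"
    )
    response = await client.get(url)
    response.raise_for_status()  # surfaces 404s for missing attachments
    return response.content


class StubResponse:
    """Stand-in for an HTTP response so the sketch runs offline."""

    content = b"%PDF-1.7 stub"

    def raise_for_status(self) -> None:
        return None


class StubClient:
    """Records requested URLs instead of calling MS Graph."""

    def __init__(self) -> None:
        self.requested = []

    async def get(self, url: str) -> StubResponse:
        self.requested.append(url)
        return StubResponse()


stub = StubClient()
payload = asyncio.run(download_attachment(stub, "msg-1", "att-1"))
```

In the service itself this would be a method using `self.client`, so the auth headers and token refresh that already exist are reused unchanged.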
**backend-saas/core/integrations/adapters/outlook_integration_v2.py**
- Add `download_file(file_id)` method (standardized contract for `_prepare_record_text_async`)
- `file_id` format: `"{message_id}:{attachment_id}"`
- Delegates to Outlook service's `download_attachment`
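The composite `file_id` needs a small encode/decode helper. The sketch below assumes neither Graph ID contains a colon, which should be checked against real IDs before relying on it. `make_file_id` and `split_file_id` are hypothetical helper names.

```python
from typing import Tuple


def make_file_id(message_id: str, attachment_id: str) -> str:
    """Build the composite ID passed through the download_file contract."""
    return f"{message_id}:{attachment_id}"


def split_file_id(file_id: str) -> Tuple[str, str]:
    """Split a composite ID back into (message_id, attachment_id)."""
    # Assumption: the first colon is the separator, i.e. IDs hold no colons.
    message_id, separator, attachment_id = file_id.partition(":")
    if not separator or not message_id or not attachment_id:
        raise ValueError(f"expected 'message_id:attachment_id', got {file_id!r}")
    return message_id, attachment_id
```

Failing loudly on a malformed ID is preferable to requesting a nonsensical Graph URL and debugging the resulting 404.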
**backend-saas/core/ingestion_pipeline.py**
- In `_prepare_record_text_async`, add handling for `integration_id == "outlook"`
- Check if record has `hasAttachments == true`
- Fetch and parse each attachment
- Append parsed text: `record["body"] += "\n\n[Attachment: {name}]\n{parsed_text}"`
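The append step can be sketched as a small helper. It additionally guards against a missing or `None` body and skips attachments that parsed to nothing; both guards are assumptions beyond what the plan states, and `append_attachment_text` is a hypothetical name.

```python
from typing import Any, Dict


def append_attachment_text(record: Dict[str, Any], name: str, parsed_text: str) -> Dict[str, Any]:
    """Append one attachment's parsed text to the email body in place."""
    if not parsed_text.strip():
        return record  # nothing useful was extracted from this attachment
    body = record.get("body") or ""  # tolerate a missing or None body
    record["body"] = f"{body}\n\n[Attachment: {name}]\n{parsed_text}"
    return record
```

The `[Attachment: {name}]` marker lets the downstream extraction prompt tell body text apart from attachment text.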
**backend-saas/core/historical_sync_service.py**
- No changes needed: `_prepare_record_text_async` is already called per record
Reuse from OneDrive
The OneDrive adapter (core/integrations/adapters/onedrive.py) already implements:
- `download_file(file_id)`: downloads binary from MS Graph
- Docling integration via `_prepare_record_text_async`
The Outlook service (integrations/outlook_service.py) already has:
- `self.client` (`httpx.AsyncClient`) with auth headers
- MS Graph API base URL
- Access token management
The attachment flow is identical: MS Graph API → binary → Docling → text.
Only the endpoint differs: /drive/items/{id}/content vs /messages/{id}/attachments/{aid}/$value.
Rate Limiting & Limits
| Setting | Default | Description |
|---|---|---|
| `OUTLOOK_MAX_ATTACHMENT_SIZE_MB` | 10 | Skip attachments larger than this |
| `OUTLOOK_ATTACHMENT_RATE_LIMIT` | 10/min | MS Graph throttling |
| `OUTLOOK_ATTACHMENT_TYPES` | pdf,docx,xlsx,pptx,txt,csv | Supported formats (via Docling) |
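The size and type limits can be applied to the attachment metadata before any binary download. The sketch below assumes each metadata item is a dict with the `name` and `size` (bytes) fields that the Graph attachment resource exposes. `should_ingest` is a hypothetical helper, and matching on file extension rather than `contentType` is a simplification.

```python
from typing import Any, Dict

DEFAULT_TYPES = "pdf,docx,xlsx,pptx,txt,csv"  # OUTLOOK_ATTACHMENT_TYPES default


def should_ingest(
    attachment: Dict[str, Any],
    max_size_mb: int = 10,              # OUTLOOK_MAX_ATTACHMENT_SIZE_MB default
    allowed_types: str = DEFAULT_TYPES,
) -> bool:
    """Return True if the attachment is small enough and of a supported type."""
    allowed = {t.strip().lower() for t in allowed_types.split(",") if t.strip()}
    name = attachment.get("name") or ""
    # Use the text after the last dot as the extension; no dot means no type.
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    size_bytes = attachment.get("size") or 0
    return extension in allowed and size_bytes <= max_size_mb * 1024 * 1024
```

Filtering on metadata first avoids spending rate-limited download calls on files that would be skipped anyway.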
TDD Verification
TestOutlookAttachmentIngestion:
test_fetches_attachment_metadata # GET /attachments returns list
test_downloads_attachment_content # GET /$value returns bytes
test_parses_pdf_attachment # Docling extracts PDF text
test_parses_docx_attachment # Docling extracts DOCX text
test_appends_text_to_email_body # Body + attachment text combined
test_skips_large_attachments # >10MB skipped with warning
test_handles_missing_attachment # 404 from Graph API
test_creates_attachment_entities # DiscoveredEntity with type="Document"
test_rate_limits_attachment_requests # Respects 10/min throttle
test_tenant_isolation # Cross-tenant attachment access blocked
---
Implementation Order
Phase 1: Attachment Ingestion (1-2 sessions)
- Add `download_attachment` to `outlook_service.py`
- Add `download_file` to `outlook_integration_v2.py`
- Modify `_prepare_record_text_async` for outlook integration
- TDD verification
- Deploy + test with Brennan's Outlook backfill
Phase 2: Automated Linking Agent (2-3 sessions)
- Create `EntityLinkingJudge` with LLM prompt
- Create `EntityLinkingAgent` extending agent framework
- Wire into ingestion pipeline (after schema discovery)
- Add API endpoints (auto-link, undo)
- Add UI components (Auto-Link All button, Undo, judgment display)
- TDD verification
- Register as default agent for all tenants
- Deploy + monitor linking decisions
Phase 3: Integration & Polish (1 session)
- End-to-end TDD: sync → extract → link → view in Knowledge Graph
- Agent graduation: supervised → autonomous based on feedback
- Documentation update
---
Dependencies
| Dependency | Status | Notes |
|---|---|---|
| Docling processor | ✅ Deployed | get_docling_processor() singleton |
| Agent framework | ✅ Deployed | AgentRegistry, governance, graduation |
| EntityLinkingService | ✅ Exists | link_entities_to_graph already written |
| SchemaDiscoveryService | ✅ Exists | Creates EntityTypeDefinition drafts |
| MS Graph API access | ✅ Working | Token refresh handled by outlook_service |
| Outlook attachment endpoints | ⚠️ Needs scope | Reading attachments should only need Mail.Read; confirm the granted scopes (Files.Read covers drive files, not mail attachments) |