docs(design): v1.1 add tagging, Postgres+pgvector, hybrid RAG, compliance reasoning addendum

vinod0m · vinod0m · commit e55fa9e8d7f4 · 2025-10-08T00:46:14.000+02:00
diff --git a/doc/design/deepagent_document_tools_integration.md b/doc/design/deepagent_document_tools_integration.md
@@ -1,9 +1,15 @@
 # DeepAgent + DocumentAgent Integration Design
 
-**Version:** 1.0  
-**Date:** October 7, 2025  
+**Version:** 1.1  
+**Date:** October 8, 2025  
 **Author:** AI Architecture Team  
-**Status:** Design Proposal
+**Status:** Design Proposal (Enhanced Addendum Included)
+
+### Version History
+| Version | Date | Summary |
+|---------|------|---------|
+| 1.0 | 2025-10-07 | Initial integration design (tool architecture, QA, performance, security) |
+| 1.1 | 2025-10-08 | Added advanced tagging + confirmation workflow, domain-specific pipelines, Postgres + pgvector persistence API, hybrid RAG architecture, compliance & cross‑document reasoning use cases, >99% accuracy requirements pipeline, retrieval evaluation strategy |
 
 ---
 
@@ -1723,4 +1729,322 @@ ERROR_MESSAGES = {
 
 ---
 
-**End of Design Document**
+## 14. Enhanced Capabilities Addendum (v1.1)
+
+This addendum incorporates the newly requested advanced capabilities. Section 14.1 maps each request to the design element that addresses it.
+
+### 14.1 Requirement Mapping (User Requests → Design Elements)
+| # | User Requirement | Design Element(s) Added |
+|---|------------------|-------------------------|
+| 1 | Context-aware tagging + user confirmation | Section 14.2 Tagging & Confirmation Workflow |
+| 2 | Tag-driven domain pipelines | Section 14.3 Domain-Specific Processing Matrix |
+| 3 | >99% accuracy for requirements | Section 14.4 High-Accuracy Requirements Pipeline |
+| 4 | Persist structured requirements in Postgres (external repo) | Section 14.5 Persistence & External API Contracts |
+| 4b | Store requirements as embeddings in pgvector | Section 14.6 Embedding & Vector Index Strategy |
+| 5 | Embed other doc types (standards/howto/templates) | Section 14.6 (document_type expansion) |
+| 6 | Hybrid RAG across all doc types | Section 14.7 Hybrid Retrieval Architecture |
+| 7 | Compliance check (requirements vs standards/templates) | Section 14.8 Use Case Flow 1 |
+| 8 | Q&A over standards + related templates/howtos | Section 14.8 Use Case Flow 2 |
+| 9 | Standards inter-relationship exploration | Section 14.8 Use Case Flow 3 |
+
+### 14.2 Tagging & Confirmation Workflow
+Objective: Automatically classify each uploaded document into a primary semantic type, plus an optional secondary type (see step 6). Allowed types: `requirements_spec`, `standard`, `howto`, `template`, `policy`, `guideline`, `unknown`.
+
+Workflow Steps:
+1. **Initial Rapid Heuristic Pass** (deterministic):
+     - File name / path regex (e.g. `(spec|srs|requirements)` → requirements_spec; `(iso|iec|ieee|nist)` → standard).
+     - Heading density & patterns (e.g. high ratio of imperative "shall" → requirements_spec; presence of numbered normative clauses like "3.2.1" with normative modal verbs → standard).
+     - Keyword priors with TF-IDF or BM25 quick scan.
+2. **LLM Tagging Pass** (contextual): Provide top N headings + first 2 pages + any strong heuristic signals. Return JSON:
+     ```json
+     {"primary_tag": "requirements_spec", "alt_tags": ["template"], "confidence": 0.91, "rationale": "Contains 'shall' density 4.2%, structured numbered sections"}
+     ```
+3. **Conflict Resolver:** If heuristic primary ≠ LLM primary and both confidences < threshold (e.g. 0.8) → ask user.
+4. **User Confirmation Loop:** DeepAgent presents a summary:
+     > Detected: requirements_spec (91% confidence). Alternate: template (42%). Confirm? (Yes / choose correct tag / multi-select)
+5. **Correction Handling:** If user overrides, store override in `document_tag_overrides` table (Postgres) with pattern signature (hash of top headings) to auto-apply next time (active learning).
+6. **Multi-Tag Support:** Some documents may legitimately be both `standard` and `template` (rare). We allow up to 2 tags, one primary. Pipelines prioritize primary.
+7. **Persistence of Tag Decision:** Store final tag(s) + confidence + rationale for audit.
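
The heuristic pass (step 1) and the conflict rule (step 3) can be sketched roughly as follows. This is a minimal illustration, not the project's classifier: the function names, the 2% "shall" density cut-off, and the 0.7 / 0.6 confidence values are placeholders chosen for the example. Only the regex patterns and the 0.8 threshold come from the workflow above.

```python
import re

# Filename patterns copied from the examples in step 1.
FILENAME_RULES = [
    (re.compile(r"(spec|srs|requirements)", re.IGNORECASE), "requirements_spec"),
    (re.compile(r"(iso|iec|ieee|nist)", re.IGNORECASE), "standard"),
]


def shall_density(text: str) -> float:
    """Fraction of word tokens that are the modal verb 'shall'."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return words.count("shall") / len(words) if words else 0.0


def heuristic_tag(file_name: str, text: str) -> tuple[str, float]:
    """Return (tag, confidence). Confidence values are illustrative placeholders."""
    for pattern, tag in FILENAME_RULES:
        if pattern.search(file_name):
            return tag, 0.7
    if shall_density(text) > 0.02:  # hypothetical cut-off
        return "requirements_spec", 0.6
    return "unknown", 0.0


def needs_user_confirmation(
    h_tag: str, h_conf: float, llm_tag: str, llm_conf: float, threshold: float = 0.8
) -> bool:
    """Step 3: ask the user when the passes disagree and both are below threshold."""
    return h_tag != llm_tag and h_conf < threshold and llm_conf < threshold
```

Keeping the deterministic pass separate from the LLM pass makes the disagreement check a pure function of the two results, which is easy to unit test.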
+
+Data Model (Tagging Metadata):
+```sql
+CREATE TABLE document_tags (
+    document_id UUID PRIMARY KEY,
+    file_name TEXT NOT NULL,
+    primary_tag TEXT NOT NULL,
+    secondary_tag TEXT NULL,
+    heuristic_confidence REAL,
+    llm_confidence REAL,
+    final_confidence REAL,
+    rationale TEXT,
+    created_at TIMESTAMPTZ DEFAULT now()
+);
+```
+
+### 14.3 Domain-Specific Processing Matrix
+| Tag | Extraction Pipeline | Specialized Prompts | Extra Validation | Output Artifacts |
+|-----|---------------------|---------------------|------------------|------------------|
+| requirements_spec | High-accuracy multi-pass | Requirements schema, atomicity rules | Duplicate ID check, modal verb density, coverage vs TOC | Structured requirements JSON, embeddings |
+| standard | Clause segmentation, normative language parser | Normative vs informative discrimination | Clause numbering integrity, cross-reference validation | Clause graph, embeddings |
+| howto | Procedure step parser, imperative detection | Step normalization & tool references | Ordered step continuity, missing prerequisite detection | Steps list, embeddings |
+| template | Placeholder field extraction | Variable slot detection | Placeholder coverage ratio, duplicate placeholder detection | Template slots, embeddings |
+| policy/guideline | Policy statement extraction | Risk/compliance phrasing patterns | Policy classification consistency | Policy items, embeddings |
+
+### 14.4 High-Accuracy Requirements Pipeline (>99%)
+Stages (multi-pass):
+1. **Ingestion & Normalization** (Docling → Markdown → canonical whitespace, remove page artifacts).
+2. **Section Structuring Pass** (existing DocumentAgent) with chunk overlap for context.
+3. **Requirements Extraction Pass A (Baseline)** – Strict JSON schema.
+4. **Requirements Extraction Pass B (Refinement)** – Feed ambiguous or low-confidence items with clarifying meta-prompt; unify style.
+5. **Deduplication & Canonicalization:** Hash normalized body; unify IDs; if conflicting IDs with different bodies → create variant list + resolution heuristic (prefer longer, more specific, or user-confirmed).
+6. **Atomicity Validator:** Split compound statements containing multiple modal verbs (`shall`,`must`,`should`) separated by conjunctions.
+7. **Category Classifier (functional vs non-functional + subcategories):** Lightweight model + LLM tie-break.
+8. **Confidence Scoring Ensemble:** Combine: (a) extraction model self-score, (b) heuristic quality metrics (length, modality strength, ambiguity penalty), (c) duplication penalty.
+9. **Human-in-the-Loop Optional Gate:** For items < threshold (e.g. 0.85), present batch diff to user.
+10. **Persistence & Embedding:** After acceptance, store in Postgres, generate embeddings, index in pgvector.
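
Stages 5 and 6 lend themselves to a small deterministic sketch. The version below is an assumption-laden illustration: `body_hash`, `deduplicate`, and `is_compound` are hypothetical names, requirements are assumed to be dicts with a `body` key, and duplicates are resolved by keeping the first occurrence rather than by the "prefer longer, more specific, or user-confirmed" heuristic described above.

```python
import hashlib
import re

MODAL_RE = re.compile(r"\b(?:shall|must|should)\b", re.IGNORECASE)
CONJUNCTION_RE = re.compile(r"\b(?:and|or)\b", re.IGNORECASE)


def body_hash(body: str) -> str:
    """Hash of the case- and whitespace-normalized requirement body (stage 5)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(requirements: list[dict]) -> list[dict]:
    """Keep the first requirement seen for each normalized body hash."""
    seen: set[str] = set()
    unique = []
    for req in requirements:
        digest = body_hash(req["body"])
        if digest not in seen:
            seen.add(digest)
            unique.append(req)
    return unique


def is_compound(body: str) -> bool:
    """Stage 6 check: two or more modal verbs plus a conjunction suggests a split."""
    modal_count = len(MODAL_RE.findall(body))
    return modal_count >= 2 and CONJUNCTION_RE.search(body) is not None
```

A flagged statement would still need an LLM or rule-based splitter. This check only decides which items are sent to it.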
+
+Error Handling & Correction:
+- Retry with reduced chunk size on context errors.
+- Fall back to a minimal parser if the LLM output is still invalid JSON after N retries (insert a skeleton record and mark `needs_review=true`).
+
+### 14.5 Persistence & External API Contracts
+External Postgres (other repo) is exposed via REST (or gRPC). We define a client in this repo with resilient calls + exponential backoff.
+
+#### Core Tables (Proposed)
+```sql
+CREATE EXTENSION IF NOT EXISTS "vector"; -- pgvector
+
+CREATE TABLE documents (
+    document_id UUID PRIMARY KEY,
+    file_name TEXT NOT NULL,
+    primary_tag TEXT NOT NULL,
+    secondary_tag TEXT,
+    source_path TEXT,
+    version TEXT,
+    checksum TEXT,
+    size_bytes BIGINT,
+    processed_at TIMESTAMPTZ DEFAULT now()
+);
+
+CREATE TABLE requirements (
+    requirement_id UUID PRIMARY KEY,
+    document_id UUID REFERENCES documents(document_id) ON DELETE CASCADE,
+    external_req_id TEXT,              -- original numbering if present
+    body TEXT NOT NULL,
+    category TEXT,                     -- functional / non-functional
+    subcategory TEXT,                  -- performance / security etc.
+    confidence REAL,
+    needs_review BOOLEAN DEFAULT FALSE,
+    metadata JSONB,
+    created_at TIMESTAMPTZ DEFAULT now()
+);
+
+CREATE TABLE knowledge_clauses (
+    clause_id UUID PRIMARY KEY,
+    document_id UUID REFERENCES documents(document_id) ON DELETE CASCADE,
+    tag TEXT,                          -- standard / howto / template
+    clause_number TEXT,
+    title TEXT,
+    content TEXT,
+    metadata JSONB,
+    created_at TIMESTAMPTZ DEFAULT now()
+);
+
+-- Unified embedding store
+CREATE TABLE embeddings (
+    embedding_id UUID PRIMARY KEY,
+    parent_type TEXT NOT NULL CHECK (parent_type IN ('requirement','clause','template_slot')),
+    parent_id UUID NOT NULL,
+    document_id UUID NOT NULL REFERENCES documents(document_id) ON DELETE CASCADE,
+    vector vector(1536) NOT NULL,      -- dimension depends on model (e.g. text-embedding-3-small, or text-embedding-3-large shortened to 1536)
+    tag TEXT,                          -- reuse primary_tag or refined semantic tag
+    chunk_index INT,
+    text_excerpt TEXT,
+    metadata JSONB,
+    created_at TIMESTAMPTZ DEFAULT now()
+);
+
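+-- Build the ivfflat index after loading data; pgvector suggests lists ≈ rows/1000 for up to ~1M rows.
+CREATE INDEX embeddings_vector_idx ON embeddings USING ivfflat (vector vector_cosine_ops) WITH (lists = 100);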
+CREATE INDEX embeddings_tag_idx ON embeddings(tag);
+CREATE INDEX requirements_doc_idx ON requirements(document_id);
+CREATE INDEX knowledge_doc_idx ON knowledge_clauses(document_id);
+```
+
+#### External REST API (Contract)
+| Endpoint | Method | Purpose | Request | Response |
+|----------|--------|---------|---------|----------|
+| `/documents` | POST | Register processed doc | file metadata + tags | `{document_id}` |
+| `/requirements/batch` | POST | Bulk insert requirements | list of requirement objects | counts + failed IDs |
+| `/clauses/batch` | POST | Bulk insert standard/howto/template clauses | objects | counts |
+| `/embeddings/batch` | POST | Bulk insert vectors | dimension + vectors | success/fail |
+| `/retrieval/hybrid` | POST | Hybrid search (query + filters) | query JSON | ranked results |
+| `/compliance/check` | POST | Requirements vs standard sections | requirement IDs + standard ref | compliance summary |
+| `/standards/graph` | GET | Return standards relationship graph | query params | node/edge JSON |
+
+Request JSON examples and detailed schemas would be placed in `doc/api/` (future work).
+
+#### Client Pseudocode
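+```python
+from __future__ import annotations  # allows `str | None` hints on Python < 3.10
+
+
+class ExternalKnowledgeStoreClient:
+    """Thin client for the external knowledge-store API (method bodies are still stubs)."""
+
+    def __init__(self, base_url: str, api_key: str | None = None, timeout: float = 30): ...
+
+    # Each method maps onto one endpoint in the contract table above.
+    def register_document(self, meta: dict) -> str: ...  # POST /documents
+    def upsert_requirements(self, reqs: list[dict]) -> dict: ...  # POST /requirements/batch
+    def upsert_clauses(self, clauses: list[dict]) -> dict: ...  # POST /clauses/batch
+    def upsert_embeddings(self, embeddings: list[dict]) -> dict: ...  # POST /embeddings/batch
+    def hybrid_search(self, query: str, k: int = 15, filters: dict | None = None) -> list[dict]: ...  # POST /retrieval/hybrid
+    def compliance_check(self, requirement_ids: list[str], standard_ref: str) -> dict: ...  # POST /compliance/check
+```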
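
The "resilient calls + exponential backoff" behaviour mentioned in 14.5 could look roughly like this. It is a sketch under assumptions: `call_with_backoff` is a hypothetical helper, the delay values are arbitrary defaults, and a real client would retry on its HTTP library's own exception types rather than the built-in `ConnectionError` / `TimeoutError` used here.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on transient errors with capped exponential delay plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Injecting `sleep` keeps the helper testable without real waiting. A client method would wrap its request in a closure, for example `call_with_backoff(lambda: send_batch(payload))`, where `send_batch` stands in for whatever transport the client uses.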
+
+### 14.6 Embedding & Vector Index Strategy
+Embedding Model Options:
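+- Default: OpenAI `text-embedding-3-large` (3072 dims natively; shortened to 1536 via the API's `dimensions` parameter to match the `vector(1536)` column and stay under pgvector's 2,000-dimension index limit) OR local Qwen/Instructor variant if privacy constraints.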
+- Domain adaptation: Fine-tune or use contrastive re-ranking for standards.
+
+Chunking Strategy:
+| Doc Type | Unit | Avg Tokens | Overlap (tokens) | Notes |
+|----------|------|------------|------------------|-------|
+| requirements_spec | Individual requirement | 30–120 | 0 | Each requirement is atomic → embedded directly |
+| standard | Clause / subclause | 80–250 | 25 | Preserve normative boundaries |
+| howto | Step group (5–7 steps) | 60–150 | 20 | Provide local context |
+| template | Placeholder + surrounding context | 40–90 | 15 | Capture variable semantics |
+| policy/guideline | Policy statement | 50–160 | 15 | Keep actionable text intact |
+
+Embedding Ingestion Pipeline:
+1. Normalize text (unicode NFC, preserve casing, strip page numbers).
+2. Generate vector.
+3. Compute lexical signature (top 12 stemmed tokens) for hybrid BM25 fusion.
+4. Persist to Postgres `embeddings` table.
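
Steps 1 and 3 of the ingestion pipeline can be illustrated with the standard library alone. Everything here is a simplified assumption: the page-number pattern only catches lines that consist of a bare number or "Page N", the stop-word list is tiny, and `crude_stem` is a naive suffix stripper standing in for a real stemmer.

```python
import re
import unicodedata
from collections import Counter

# Matches lines that are only a page number, e.g. "12" or "Page 3".
PAGE_NUMBER_RE = re.compile(r"\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are"}


def normalize(text: str) -> str:
    """Step 1: NFC-normalize, keep casing, drop page-number lines."""
    text = unicodedata.normalize("NFC", text)
    kept = [line for line in text.splitlines() if not PAGE_NUMBER_RE.match(line)]
    return "\n".join(kept)


def crude_stem(token: str) -> str:
    """Very rough suffix stripping; a placeholder for a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lexical_signature(text: str, k: int = 12) -> list[str]:
    """Step 3: the k most frequent stemmed, non-stop-word tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return [stem for stem, _ in Counter(stems).most_common(k)]
```

The signature is lowercased for matching only. The text sent to the embedding model keeps its original casing, as step 1 requires.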
+
+### 14.7 Hybrid Retrieval Architecture
+Hybrid = Vector Similarity + Lexical (BM25) + Metadata Filters + (Optional) Reranker.
+
+Retrieval Steps:
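+1. **Lexical Candidate Generation:** Use Postgres full-text search (`tsvector` / `tsquery`, ranked with `ts_rank_cd`, which approximates but is not true BM25) or an external BM25 index; `pg_trgm` can add fuzzy matching for identifiers and typos.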
+2. **Vector Similarity Search:** ivfflat (cosine) top K.
+3. **Score Fusion:** Reciprocal Rank Fusion (RRF) or Weighted Sum:
+     `final = w_vec * norm(vector_score) + w_lex * norm(bm25_score) + w_meta * meta_boost`
+4. **Optional Cross-Encoder Re-rank:** For top 50 using a local mini LM (e.g. `bge-reranker-base`).
+5. **Diversity Filter:** Remove near-duplicates (pairwise cosine > 0.95), keeping the highest-ranked item of each group.
+6. **Return:** Structured results with provenance: `{parent_type, parent_id, document_id, score, snippet}`.
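
Steps 3 and 5 above can be sketched in a few lines. The fusion shown is Reciprocal Rank Fusion with the conventional constant `k = 60`, and the diversity filter is a greedy pass over the fused order. Function names and the input shapes (ranked lists of IDs, a dict of vectors) are assumptions for the example.

```python
import math


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) over every ranking in which an item appears."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diversity_filter(
    ranked_ids: list[str], vectors: dict[str, list[float]], max_sim: float = 0.95
) -> list[str]:
    """Walk the ranking top-down, dropping items too similar to one already kept."""
    kept: list[str] = []
    for item_id in ranked_ids:
        if all(cosine(vectors[item_id], vectors[other]) <= max_sim for other in kept):
            kept.append(item_id)
    return kept
```

RRF needs only ranks, so it sidesteps the score normalization that the weighted-sum variant requires. The weighted sum remains useful when calibrated scores are available and the weights have been tuned.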
+
+Representative Hybrid Query (Illustrative):
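+```sql
+-- :q_vec is the query embedding, bound by the caller as a pgvector literal.
+-- lexical_index (parent_id, tsv) is assumed to exist alongside the tables in 14.5.
+WITH vec AS (
+    SELECT parent_id, 1 - (vector <=> CAST(:q_vec AS vector)) AS vscore
+    FROM embeddings
+    WHERE tag = ANY(:tags)
+    ORDER BY vector <=> CAST(:q_vec AS vector)
+    LIMIT 100
+),
+lex AS (
+    SELECT parent_id, ts_rank_cd(tsv, plainto_tsquery(:q_text)) AS lscore
+    FROM lexical_index
+    WHERE tsv @@ plainto_tsquery(:q_text)
+    ORDER BY lscore DESC
+    LIMIT 100
+)
+SELECT parent_id,
+       coalesce(vscore, 0) AS vscore,
+       coalesce(lscore, 0) AS lscore,
+       -- coalesce both terms so rows found by only one branch keep a non-NULL score;
+       -- lscore is not bounded to [0, 1], so normalize it before relying on these weights
+       (0.6 * coalesce(vscore, 0) + 0.4 * coalesce(lscore, 0)) AS final_score
+FROM vec FULL OUTER JOIN lex USING (parent_id)
+ORDER BY final_score DESC
+LIMIT 25;
+```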
+
+### 14.8 Advanced Use Case Flows
+
+#### 1. Compliance / Conformance Checking (Requirements vs Standard)
+Flow:
+1. User: *"Do our login requirements comply with ISO/IEC 27001:2013 Annex A.9.2 (user access management)?"*
+2. Agent: Retrieve requirements tagged `authentication` + standard clauses referencing access control.
+3. Alignment Heuristic:
+     - Semantic similarity (embedding cos > threshold)
+     - Keyword obligation coverage (presence of MUST/SHALL vs passive wording)
+     - Gap detection (standard clause concepts missing in requirement set)
+4. Output categories:
+     - `fully_covered`, `partially_covered`, `missing`, `over_specified`.
+5. Summarize gaps + propose draft requirements (LLM generative assist flagged as `suggested_draft`).
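
One possible way to turn the alignment signals into the four output categories is sketched below. The design above does not define the decision rule, so this is purely an assumption: coverage is judged from the best requirement-to-clause similarity alone, the 0.85 / 0.60 thresholds are placeholders, and `over_specified` is interpreted as "a requirement with no sufficiently similar clause".

```python
def classify_coverage(
    similarity: dict[tuple[str, str], float],
    requirement_ids: list[str],
    clause_ids: list[str],
    full: float = 0.85,
    partial: float = 0.60,
) -> tuple[dict[str, str], list[str]]:
    """Label each clause by its best-matching requirement; list unmatched requirements.

    `similarity` maps (requirement_id, clause_id) to a score; missing pairs count as 0.0.
    """
    clause_status: dict[str, str] = {}
    for clause in clause_ids:
        best = max((similarity.get((req, clause), 0.0) for req in requirement_ids), default=0.0)
        if best >= full:
            clause_status[clause] = "fully_covered"
        elif best >= partial:
            clause_status[clause] = "partially_covered"
        else:
            clause_status[clause] = "missing"
    over_specified = [
        req
        for req in requirement_ids
        if max((similarity.get((req, clause), 0.0) for clause in clause_ids), default=0.0) < partial
    ]
    return clause_status, over_specified
```

A fuller version would also fold in the obligation-keyword and concept-gap signals from step 3 before assigning a label.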
+
+#### 2. Standards Q&A with Related Templates & HowTos
+Flow:
+1. User question → Hybrid retrieval across `standard` + `template` + `howto`.
+2. Group results by type; build answer plan:
+     - Normative definition excerpts
+     - Concrete procedural template placeholders
+     - Practical steps from howto.
+3. LLM synthesizes final answer citing sources (document_id + clause_number / step number).
+
+#### 3. Standards Relationship Exploration
+Data Prep:
+- Build a *standards graph* (background job): nodes = clauses; edges: semantic similarity > 0.88 OR explicit cross-reference anchor.
+- Store edges in `standards_graph_edges` (source_clause_id, target_clause_id, edge_type, weight).
+
+Interactive Flow:
+1. User: *"How does ISO/IEC 27001 relate to NIST SP 800-53 on incident response?"*
+2. Retrieve subgraph filtered by tags & topic embeddings (incident response cluster labels).
+3. Summarize: overlapping concepts, unique requirements, divergence notes.
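
The background job that builds the standards graph could start from something like the sketch below. It assumes clauses arrive as dicts with `clause_id`, `clause_number`, `content`, and `vector` keys, that clause numbers are unique within the batch (that is, one document at a time), and that cross-references are phrased as "clause X.Y" or "section X.Y". The edge type names are illustrative. The 0.88 threshold is the one stated above.

```python
import math
import re
from itertools import combinations

CROSS_REF_RE = re.compile(r"\b(?:clause|section)\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_edges(clauses: list[dict], threshold: float = 0.88) -> list[tuple[str, str, str, float]]:
    """Return (source_clause_id, target_clause_id, edge_type, weight) rows."""
    by_number = {c["clause_number"]: c["clause_id"] for c in clauses}
    edges: list[tuple[str, str, str, float]] = []
    # Explicit cross-references found in the clause text.
    for clause in clauses:
        for number in CROSS_REF_RE.findall(clause["content"]):
            target = by_number.get(number)
            if target and target != clause["clause_id"]:
                edges.append((clause["clause_id"], target, "cross_reference", 1.0))
    # Semantic edges between every pair above the similarity threshold.
    for a, b in combinations(clauses, 2):
        weight = cosine(a["vector"], b["vector"])
        if weight > threshold:
            edges.append((a["clause_id"], b["clause_id"], "semantic", weight))
    return edges
```

The all-pairs loop is quadratic in the number of clauses. For large corpora the same edges could instead come from a top-K nearest-neighbour query against the vector index.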
+
+### 14.9 Orchestration Pseudocode (High-Level)
+```python
+def process_document(file_path: str, session_id: str):
+    """High-level flow; every helper called here is a placeholder for a component described above."""
+    raw_meta = gather_basic_metadata(file_path)
+
+    # 14.2: heuristic pass, LLM pass, then conflict resolution.
+    heuristic_tag, h_conf = heuristic_classifier(file_path)
+    llm_tag, llm_conf, rationale = llm_classifier(file_path)
+    final_tag, final_conf = resolve_tag(heuristic_tag, h_conf, llm_tag, llm_conf)
+    if needs_user_confirmation(final_conf):
+        prompt_user_for_tag_confirmation(session_id, candidates=[heuristic_tag, llm_tag])
+        final_tag = await_user_choice(session_id)
+
+    # 14.5: register the document (payload elided in this sketch).
+    doc_id = external_client.register_document({...})
+
+    # 14.3: run the pipeline selected by the final tag.
+    pipeline = select_pipeline(final_tag)
+    structured = pipeline.run(file_path)
+    if final_tag == 'requirements_spec':
+        # 14.4: multi-pass refinement before persistence and embedding.
+        refined = high_accuracy_refinement(structured)
+        external_client.upsert_requirements(refined.requirements)
+        embed_and_store(refined.requirements, doc_id)
+    else:
+        clauses = normalize_non_requirements(structured)
+        external_client.upsert_clauses(clauses)
+        embed_and_store(clauses, doc_id)
+    return summary(structured)
+```
+
+### 14.10 Evaluation & QA Extensions
+New Metrics:
+| Aspect | Metric | Target |
+|--------|--------|--------|
+| Tagging | Primary tag accuracy (manual validation) | ≥95% |
+| Tagging | Confirmation intervention rate | <25% (improves with learning) |
+| Requirements | Extraction accuracy (precision/recall) | ≥99% / ≥98% |
+| Embeddings | Retrieval nDCG@10 (bench queries) | ≥0.82 |
+| Hybrid Search | Latency (p95) | <700ms (warm) |
+| Compliance | Gap detection F1 | ≥0.9 |
+
+Automated Evaluation Harness:
+- Golden dataset of annotated documents (requirements, standards) with expected outputs.
+- Periodic CI job runs extraction + retrieval benchmarks; publishes `TEST_EXECUTION_REPORT.md` deltas.
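
The extraction and retrieval targets in the table above could be computed by the harness along these lines. This is a minimal sketch: the function names are hypothetical, extraction output is compared as sets of IDs, and nDCG uses the linear-gain form (graded relevance divided by a log2 position discount).

```python
import math


def precision_recall(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of extracted items against the golden set."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int = 10) -> float:
    """nDCG@k with linear gain; IDs missing from `relevance` count as irrelevant."""
    dcg = sum(
        relevance.get(doc_id, 0.0) / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(position + 2) for position, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

A CI job would average `ndcg_at_k` over the benchmark queries and compare the result with the ≥0.82 target, failing the build or flagging a regression as the team prefers.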
+
+### 14.11 Risks & Mitigations (Addendum)
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| External DB downtime | Lost persistence / user blockage | Local queue + retry DLQ; show degraded-mode notice |
+| Tag misclassification | Wrong pipeline reduces accuracy | Confirmation loop + override memory + continuous learning |
+| Vector drift (model change) | Retrieval inconsistency | Versioned embeddings (store `embedding_model_version`) + background re-indexer |
+| Hybrid query latency spike | Poor UX | Adaptive K reduction + caching top lexical candidates |
+| Over-generation in compliance suggestions | False confidence | Flag AI-suggested items; require explicit user accept |
+
+### 14.12 Implementation Phasing Extension
+Add to original phases:
+- **Phase 5 (Week 9-10)**: Tagging confirmation loop + external API client + requirements persistence.
+- **Phase 6 (Week 11-12)**: pgvector embeddings + hybrid retrieval MVP.
+- **Phase 7 (Week 13-14)**: Compliance engine + standards graph builder.
+- **Phase 8 (Week 15)**: Evaluation harness automation + performance tuning.
+
+### 14.13 Summary of Addendum
+The enhanced design introduces a *closed-loop knowledge lifecycle*:
+`Document Ingestion → Context-Aware Tagging (+User Confirmation) → Domain Pipeline → High-Accuracy Structuring → Persistent Knowledge Graph (Postgres + pgvector) → Hybrid Retrieval → Cross-Document Reasoning (Compliance, Relationships, Q&A)`.
+
+This augments the original architecture without breaking existing abstractions: new functionality slots into **pre-tool (tagging)**, **mid-pipeline (high-accuracy refinement)**, and **post-processing (persistence + retrieval)** stages.
+
+---
+
+**End of Design Document (v1.1 with Addendum)**