Commit e55fa9e

docs(design): v1.1 add tagging, Postgres+pgvector, hybrid RAG, compliance reasoning addendum
1 parent f2c7fc6 commit e55fa9e

1 file changed

+328
-4
lines changed

doc/design/deepagent_document_tools_integration.md

Lines changed: 328 additions & 4 deletions
@@ -1,9 +1,15 @@
11
# DeepAgent + DocumentAgent Integration Design
22

3-
**Version:** 1.0
4-
**Date:** October 7, 2025
3+
**Version:** 1.1
4+
**Date:** October 8, 2025
55
**Author:** AI Architecture Team
6-
**Status:** Design Proposal
6+
**Status:** Design Proposal (Enhanced Addendum Included)
7+
8+
### Version History
9+
| Version | Date | Summary |
|---------|------|---------|
| 1.0 | 2025-10-07 | Initial integration design (tool architecture, QA, performance, security) |
| 1.1 | 2025-10-08 | Added advanced tagging + confirmation workflow, domain-specific pipelines, Postgres + pgvector persistence API, hybrid RAG architecture, compliance & cross-document reasoning use cases, >99% accuracy requirements pipeline, retrieval evaluation strategy |
713

814
---
915

@@ -1723,4 +1729,322 @@ ERROR_MESSAGES = {
17231729

17241730
---
17251731

1726-
**End of Design Document**
1732+
## 14. Enhanced Capabilities Addendum (v1.1)
1733+
1734+
This addendum incorporates the newly requested advanced capabilities:
1735+
1736+
### 14.1 Requirement Mapping (User Requests → Design Elements)
1737+
| # | User Requirement | Design Element(s) Added |
|---|------------------|-------------------------|
| 1 | Context-aware tagging + user confirmation | Section 14.2 Tagging & Confirmation Workflow |
| 2 | Tag-driven domain pipelines | Section 14.3 Domain-Specific Processing Matrix |
| 3 | >99% accuracy for requirements | Section 14.4 High-Accuracy Requirements Pipeline |
| 4 | Persist structured requirements in Postgres (external repo) | Section 14.5 Persistence & External API Contracts |
| 4b | Store requirements as embeddings in pgvector | Section 14.6 Embedding & Vector Index Strategy |
| 5 | Embed other doc types (standards/howto/templates) | Section 14.6 (document_type expansion) |
| 6 | Hybrid RAG across all doc types | Section 14.7 Hybrid Retrieval Architecture |
| 7 | Compliance check (requirements vs standards/templates) | Section 14.8 Use Case Flow 1 |
| 8 | Q&A over standards + related templates/howtos | Section 14.8 Use Case Flow 2 |
| 9 | Standards inter-relationship exploration | Section 14.8 Use Case Flow 3 |
1749+
1750+
### 14.2 Tagging & Confirmation Workflow
1751+
Objective: Automatically classify each uploaded document into one (or multiple) semantic types: `requirements_spec`, `standard`, `howto`, `template`, `policy`, `guideline`, `unknown`.
1752+
1753+
Workflow Steps:
1754+
1. **Initial Rapid Heuristic Pass** (deterministic):
1755+
- File name / path regex (e.g. `(spec|srs|requirements)` → requirements_spec; `(iso|iec|ieee|nist)` → standard).
1756+
- Heading density & patterns (e.g. high ratio of imperative "shall" → requirements_spec; presence of numbered normative clauses like "3.2.1" with normative modal verbs → standard).
1757+
- Keyword priors with TF-IDF or BM25 quick scan.
1758+
2. **LLM Tagging Pass** (contextual): Provide top N headings + first 2 pages + any strong heuristic signals. Return JSON:
1759+
```json
{"primary_tag": "requirements_spec", "alt_tags": ["template"], "confidence": 0.91, "rationale": "Contains 'shall' density 4.2%, structured numbered sections"}
```
1762+
3. **Conflict Resolver:** If heuristic primary ≠ LLM primary and both confidences < threshold (e.g. 0.8) → ask user.
1763+
4. **User Confirmation Loop:** DeepAgent presents a summary:
1764+
> Detected: requirements_spec (91% confidence). Alternate: template (42%). Confirm? (Yes / choose correct tag / multi-select)
1765+
5. **Correction Handling:** If user overrides, store override in `document_tag_overrides` table (Postgres) with pattern signature (hash of top headings) to auto-apply next time (active learning).
1766+
6. **Multi-Tag Support:** Some documents may legitimately be both `standard` and `template` (rare). We allow up to 2 tags, one primary. Pipelines prioritize primary.
1767+
7. **Persistence of Tag Decision:** Store final tag(s) + confidence + rationale for audit.
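
The heuristic pass (step 1) and the conflict resolver (step 3) can be sketched as follows. This is a minimal illustration, not the design's implementation: the function names `heuristic_tag` and `resolve_tag`, the placeholder confidence of 0.7, and the three-value return shape are assumptions; only the filename patterns and the 0.8 threshold come from the steps above.

```python
import re

# Filename patterns from step 1; the mapping order is an assumption.
FILENAME_RULES = [
    (re.compile(r"(spec|srs|requirements)", re.IGNORECASE), "requirements_spec"),
    (re.compile(r"(iso|iec|ieee|nist)", re.IGNORECASE), "standard"),
]


def heuristic_tag(file_name: str) -> tuple[str, float]:
    """Return (tag, confidence) from the file name alone.

    The 0.7 confidence is a placeholder, not a calibrated value.
    """
    for pattern, tag in FILENAME_RULES:
        if pattern.search(file_name):
            return tag, 0.7
    return "unknown", 0.0


def resolve_tag(h_tag, h_conf, llm_tag, llm_conf, threshold=0.8):
    """Return (tag, confidence, ask_user) following the rule in step 3."""
    if h_tag == llm_tag:
        return llm_tag, max(h_conf, llm_conf), False
    # The two passes disagree: keep the stronger signal as the candidate.
    if llm_conf >= h_conf:
        tag, conf = llm_tag, llm_conf
    else:
        tag, conf = h_tag, h_conf
    # Ask the user only when neither pass is confident.
    ask_user = h_conf < threshold and llm_conf < threshold
    return tag, conf, ask_user
```

When the two passes agree, no confirmation is requested here; a stricter variant could still ask whenever the agreed confidence is low.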
1768+
1769+
Data Model (Tagging Metadata):
1770+
```sql
CREATE TABLE document_tags (
    document_id UUID PRIMARY KEY,
    file_name TEXT NOT NULL,
    primary_tag TEXT NOT NULL,
    secondary_tag TEXT NULL,
    heuristic_confidence REAL,
    llm_confidence REAL,
    final_confidence REAL,
    rationale TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);
```
1783+
1784+
### 14.3 Domain-Specific Processing Matrix
1785+
| Tag | Extraction Pipeline | Specialized Prompts | Extra Validation | Output Artifacts |
|-----|---------------------|---------------------|------------------|------------------|
| requirements_spec | High-accuracy multi-pass | Requirements schema, atomicity rules | Duplicate ID check, modal verb density, coverage vs TOC | Structured requirements JSON, embeddings |
| standard | Clause segmentation, normative language parser | Normative vs informative discrimination | Clause numbering integrity, cross-reference validation | Clause graph, embeddings |
| howto | Procedure step parser, imperative detection | Step normalization & tool references | Ordered step continuity, missing prerequisite detection | Steps list, embeddings |
| template | Placeholder field extraction | Variable slot detection | Placeholder coverage ratio, duplicate placeholder detection | Template slots, embeddings |
| policy/guideline | Policy statement extraction | Risk/compliance phrasing patterns | Policy classification consistency | Policy items, embeddings |
1792+
1793+
### 14.4 High-Accuracy Requirements Pipeline (>99%)
1794+
Stages (multi-pass):
1795+
1. **Ingestion & Normalization** (Docling → Markdown → canonical whitespace, remove page artifacts).
1796+
2. **Section Structuring Pass** (existing DocumentAgent) with chunk overlap for context.
1797+
3. **Requirements Extraction Pass A (Baseline)** – Strict JSON schema.
1798+
4. **Requirements Extraction Pass B (Refinement)** – Feed ambiguous or low-confidence items with clarifying meta-prompt; unify style.
1799+
5. **Deduplication & Canonicalization:** Hash normalized body; unify IDs; if conflicting IDs with different bodies → create variant list + resolution heuristic (prefer longer, more specific, or user-confirmed).
1800+
6. **Atomicity Validator:** Split compound statements containing multiple modal verbs (`shall`,`must`,`should`) separated by conjunctions.
1801+
7. **Category Classifier (functional vs non-functional + subcategories):** Lightweight model + LLM tie-break.
1802+
8. **Confidence Scoring Ensemble:** Combine: (a) extraction model self-score, (b) heuristic quality metrics (length, modality strength, ambiguity penalty), (c) duplication penalty.
1803+
9. **Human-in-the-Loop Optional Gate:** For items < threshold (e.g. 0.85), present batch diff to user.
1804+
10. **Persistence & Embedding:** After acceptance, store in Postgres, generate embeddings, index in pgvector.
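
Stages 5 and 6 can be sketched as below. This is a simplified illustration under stated assumptions: the helper names (`body_hash`, `deduplicate`, `is_compound`) are hypothetical, normalization is limited to lowercasing and whitespace collapsing, and the atomicity check only counts modal verbs rather than parsing conjunctions.

```python
import hashlib
import re

# Modal verbs listed in stage 6.
MODAL_RE = re.compile(r"\b(shall|must|should)\b", re.IGNORECASE)


def body_hash(body: str) -> str:
    """Hash the normalized body (lowercased, whitespace collapsed)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(requirements: list[dict]) -> list[dict]:
    """Keep the first requirement seen for each normalized body hash."""
    seen: set[str] = set()
    unique = []
    for req in requirements:
        key = body_hash(req["body"])
        if key not in seen:
            seen.add(key)
            unique.append(req)
    return unique


def is_compound(body: str) -> bool:
    """Flag a statement with more than one modal verb as a split candidate."""
    return len(MODAL_RE.findall(body)) > 1
```

Flagged statements would still need an actual splitter (rule-based or LLM-assisted); the sketch only identifies candidates.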
1805+
1806+
Error Handling & Correction:
1807+
- Retry with reduced chunk size on context errors.
1808+
- Fallback minimal parser if LLM output invalid JSON after N retries (skeleton insertion + mark `needs_review=true`).
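
The invalid-JSON fallback could look like the sketch below. The function name `extract_with_fallback`, the `call_llm` callable, and the skeleton shape are assumptions for illustration; the chunk-size reduction for context errors is not shown.

```python
import json
from typing import Callable


def extract_with_fallback(chunk: str, call_llm: Callable[[str], str],
                          max_retries: int = 3) -> dict:
    """Parse LLM output as JSON; after max_retries failures, return a skeleton.

    `call_llm` is a hypothetical callable that takes the chunk text and
    returns the raw model response as a string.
    """
    for _ in range(max_retries):
        raw = call_llm(chunk)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON: try again
        if isinstance(parsed, dict):
            return parsed
    # Skeleton insertion: keep the raw text and mark it for human review.
    return {"body": chunk, "needs_review": True}
```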
1809+
1810+
### 14.5 Persistence & External API Contracts
1811+
The external Postgres instance (maintained in a separate repository) is exposed via REST (or gRPC). This repo defines a client that retries failed calls with exponential backoff.
1812+
1813+
#### Core Tables (Proposed)
1814+
```sql
CREATE EXTENSION IF NOT EXISTS "vector"; -- pgvector

CREATE TABLE documents (
    document_id UUID PRIMARY KEY,
    file_name TEXT NOT NULL,
    primary_tag TEXT NOT NULL,
    secondary_tag TEXT,
    source_path TEXT,
    version TEXT,
    checksum TEXT,
    size_bytes BIGINT,
    processed_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE requirements (
    requirement_id UUID PRIMARY KEY,
    document_id UUID REFERENCES documents(document_id) ON DELETE CASCADE,
    external_req_id TEXT, -- original numbering if present
    body TEXT NOT NULL,
    category TEXT, -- functional / non-functional
    subcategory TEXT, -- performance / security etc.
    confidence REAL,
    needs_review BOOLEAN DEFAULT FALSE,
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE knowledge_clauses (
    clause_id UUID PRIMARY KEY,
    document_id UUID REFERENCES documents(document_id) ON DELETE CASCADE,
    tag TEXT, -- standard / howto / template
    clause_number TEXT,
    title TEXT,
    content TEXT,
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT now()
);

-- Unified embedding store
CREATE TABLE embeddings (
    embedding_id UUID PRIMARY KEY,
    parent_type TEXT NOT NULL CHECK (parent_type IN ('requirement','clause','template_slot')),
    parent_id UUID NOT NULL,
    document_id UUID NOT NULL REFERENCES documents(document_id) ON DELETE CASCADE,
    -- Dimension must match the model output: text-embedding-3-small is 1536;
    -- text-embedding-3-large is 3072 natively and needs `dimensions=1536` to fit.
    vector vector(1536) NOT NULL,
    tag TEXT, -- reuse primary_tag or refined semantic tag
    chunk_index INT,
    text_excerpt TEXT,
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON embeddings USING ivfflat (vector vector_cosine_ops) WITH (lists = 100);
CREATE INDEX embeddings_tag_idx ON embeddings(tag);
CREATE INDEX requirements_doc_idx ON requirements(document_id);
CREATE INDEX knowledge_doc_idx ON knowledge_clauses(document_id);
```
1872+
1873+
#### External REST API (Contract)
1874+
| Endpoint | Method | Purpose | Request | Response |
|----------|--------|---------|---------|----------|
| `/documents` | POST | Register processed doc | file metadata + tags | `{document_id}` |
| `/requirements/batch` | POST | Bulk insert requirements | list of requirement objects | counts + failed IDs |
| `/clauses/batch` | POST | Bulk insert standard/howto/template clauses | objects | counts |
| `/embeddings/batch` | POST | Bulk insert vectors | dimension + vectors | success/fail |
| `/retrieval/hybrid` | POST | Hybrid search (query + filters) | query JSON | ranked results |
| `/compliance/check` | POST | Requirements vs standard sections | requirement IDs + standard ref | compliance summary |
| `/standards/graph` | GET | Return standards relationship graph | query params | node/edge JSON |
1883+
1884+
Request JSON examples and detailed schemas would be placed in `doc/api/` (future work).
1885+
1886+
#### Client Pseudocode
1887+
```python
from __future__ import annotations


class ExternalKnowledgeStoreClient:
    """Client for the external knowledge store API; method bodies are stubs."""

    def __init__(self, base_url: str, api_key: str | None = None, timeout: float = 30): ...

    # One method per endpoint in the contract table above.
    def register_document(self, meta: dict) -> str: ...
    def upsert_requirements(self, reqs: list[dict]) -> dict: ...
    def upsert_clauses(self, clauses: list[dict]) -> dict: ...
    def upsert_embeddings(self, embeddings: list[dict]) -> dict: ...
    def hybrid_search(self, query: str, k: int = 15, filters: dict | None = None) -> list[dict]: ...
    def compliance_check(self, requirement_ids: list[str], standard_ref: str) -> dict: ...
```
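
The "resilient calls + exponential backoff" behaviour mentioned for this client might be implemented along these lines. This is a sketch under assumptions: `call_with_backoff` is a hypothetical helper, the delay values are placeholders, and the built-in `ConnectionError` stands in for whatever transport errors the real HTTP or gRPC layer raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on ConnectionError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

A client method could then wrap its request, for example `call_with_backoff(lambda: send_request(payload))`, where `send_request` is whatever function performs the HTTP call.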
1898+
1899+
### 14.6 Embedding & Vector Index Strategy
1900+
Embedding Model Options:
1901+
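- Default: OpenAI `text-embedding-3-small` (1536 dims), or `text-embedding-3-large` shortened from its native 3072 dims to 1536 via the API's `dimensions` parameter so that it fits the `vector(1536)` column. Use a local Qwen/Instructor variant if privacy constraints rule out a hosted model.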
1902+
- Domain adaptation: Fine-tune or use contrastive re-ranking for standards.
1903+
1904+
Chunking Strategy:
1905+
| Doc Type | Unit | Avg Tokens | Overlap (tokens) | Notes |
|----------|------|------------|------------------|-------|
| requirements_spec | Individual requirement | 30–120 | 0 | Each requirement is atomic, so it is embedded directly |
| standard | Clause / subclause | 80–250 | 25 | Preserve normative boundaries |
| howto | Step group (5–7 steps) | 60–150 | 20 | Provide local context |
| template | Placeholder + surrounding context | 40–90 | 15 | Capture variable semantics |
| policy/guideline | Policy statement | 50–160 | 15 | Keep actionable text intact |
1912+
1913+
Embedding Ingestion Pipeline:
1914+
1. Normalize text (unicode NFC, preserve casing, strip page numbers).
1915+
2. Generate vector.
1916+
3. Compute lexical signature (top 12 stemmed tokens) for hybrid BM25 fusion.
1917+
4. Persist to Postgres `embeddings` table.
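
Steps 1 and 3 of the ingestion pipeline can be illustrated as follows. The sketch makes several assumptions: the function names are hypothetical, page numbers are detected as lines holding only digits (optionally prefixed with "page"), and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer.

```python
import re
import unicodedata
from collections import Counter

# Assumed page-number shape: a line containing only "12" or "Page 12".
PAGE_NUMBER_RE = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)
TOKEN_RE = re.compile(r"[a-z0-9]+")


def normalize(text: str) -> str:
    """Apply Unicode NFC, keep casing, and drop page-number lines."""
    text = unicodedata.normalize("NFC", text)
    lines = [ln for ln in text.splitlines() if not PAGE_NUMBER_RE.match(ln)]
    return "\n".join(lines).strip()


def crude_stem(token: str) -> str:
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lexical_signature(text: str, top_n: int = 12) -> list[str]:
    """Return the top_n most frequent stemmed tokens."""
    counts = Counter(crude_stem(t) for t in TOKEN_RE.findall(text.lower()))
    return [token for token, _ in counts.most_common(top_n)]
```

The signature is computed on a lowercased copy, so the stored text keeps its original casing as step 1 requires.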
1918+
1919+
### 14.7 Hybrid Retrieval Architecture
1920+
Hybrid = Vector Similarity + Lexical (BM25) + Metadata Filters + (Optional) Reranker.
1921+
1922+
Retrieval Steps:
1923+
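1. **Lexical Candidate Generation:** Use Postgres full-text search (a `tsvector` column ranked with `ts_rank_cd`, which approximates but is not true BM25) or an external BM25 index; `pg_trgm` can add fuzzy trigram matching.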
1924+
2. **Vector Similarity Search:** ivfflat (cosine) top K.
1925+
3. **Score Fusion:** Reciprocal Rank Fusion (RRF) or Weighted Sum:
1926+
`final = w_vec * norm(vector_score) + w_lex * norm(bm25_score) + w_meta * meta_boost`
1927+
4. **Optional Cross-Encoder Re-rank:** For top 50 using a local mini LM (e.g. `bge-reranker-base`).
1928+
5. **Diversity Filter:** Remove near-duplicate (cosine > 0.95) keeping highest rank.
1929+
6. **Return:** Structured results with provenance: `{parent_type, parent_id, document_id, score, snippet}`.
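
Of the two fusion options in step 3, Reciprocal Rank Fusion needs no score normalization because it uses only ranks. A minimal sketch follows; the function name is an assumption, and `k = 60` is the constant conventionally used with RRF rather than a value specified by this design.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Usage: fuse a vector ranking with a lexical ranking of parent IDs.
vector_hits = ["a", "b", "c"]
lexical_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
```

Items that appear in both lists (`a` and `b` here) rise above items found by only one retriever, which is the behaviour the hybrid design relies on.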
1930+
1931+
Representative Hybrid Query (Illustrative):
1932+
```sql
-- :q_vec is the query embedding, :q_text the raw query, :tags the tag filter.
-- lexical_index(parent_id, tsv) is an assumed table, not defined in this document.
WITH vec AS (
    SELECT parent_id, 1 - (vector <=> CAST(:q_vec AS vector)) AS vscore
    FROM embeddings
    WHERE tag = ANY(:tags)
    ORDER BY vector <=> CAST(:q_vec AS vector)  -- ascending cosine distance
    LIMIT 100
),
lex AS (
    SELECT parent_id, ts_rank_cd(tsv, plainto_tsquery(:q_text)) AS lscore
    FROM lexical_index
    WHERE tsv @@ plainto_tsquery(:q_text)
    ORDER BY lscore DESC  -- without this, LIMIT keeps arbitrary rows
    LIMIT 100
)
SELECT coalesce(vec.parent_id, lex.parent_id) AS parent_id,
       coalesce(vscore, 0) AS vscore,
       coalesce(lscore, 0) AS lscore,
       -- coalesce here too: a row found by only one side must not score NULL
       (0.6 * coalesce(vscore, 0) + 0.4 * coalesce(lscore, 0)) AS final_score
FROM vec FULL OUTER JOIN lex USING (parent_id)
ORDER BY final_score DESC
LIMIT 25;
```
1954+
1955+
### 14.8 Advanced Use Case Flows
1956+
1957+
#### 1. Compliance / Conformance Checking (Requirements vs Standard)
1958+
Flow:
1959+
1. User: *"Do our login requirements comply with ISO/IEC 27001:2013 Annex A.9.2 (user access management)?"*
1960+
2. Agent: Retrieve requirements tagged `authentication` + standard clauses referencing access control.
1961+
3. Alignment Heuristic:
1962+
- Semantic similarity (embedding cos > threshold)
1963+
- Keyword obligation coverage (presence of MUST/SHALL vs passive wording)
1964+
- Gap detection (standard clause concepts missing in requirement set)
1965+
4. Output categories:
1966+
- `fully_covered`, `partially_covered`, `missing`, `over_specified`.
1967+
5. Summarize gaps + propose draft requirements (LLM generative assist flagged as `suggested_draft`).
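
The alignment heuristic in step 3 could feed the categories in step 4 roughly as sketched below. Everything here is an assumption made for illustration: the function name `classify_clause`, the two similarity thresholds, and the rule that full coverage requires obligation wording. The `over_specified` category (requirements with no matching clause) is not covered.

```python
from __future__ import annotations

import re

# Obligation wording from step 3 of the flow.
OBLIGATION_RE = re.compile(r"\b(must|shall)\b", re.IGNORECASE)


def classify_clause(best_similarity: float, best_requirement_body: str | None,
                    full_threshold: float = 0.8,
                    partial_threshold: float = 0.6) -> str:
    """Classify one standard clause given its best-matching requirement."""
    if best_requirement_body is None or best_similarity < partial_threshold:
        return "missing"
    has_obligation = bool(OBLIGATION_RE.search(best_requirement_body))
    if best_similarity >= full_threshold and has_obligation:
        return "fully_covered"
    # Similar enough to be related, but weakly worded or only loosely matching.
    return "partially_covered"
```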
1968+
1969+
#### 2. Standards Q&A with Related Templates & HowTos
1970+
Flow:
1971+
1. User question → Hybrid retrieval across `standard` + `template` + `howto`.
1972+
2. Group results by type; build answer plan:
1973+
- Normative definition excerpts
1974+
- Concrete procedural template placeholders
1975+
- Practical steps from howto.
1976+
3. LLM synthesizes final answer citing sources (document_id + clause_number / step number).
1977+
1978+
#### 3. Standards Relationship Exploration
1979+
Data Prep:
1980+
- Build a *standards graph* (background job): nodes = clauses; edges: semantic similarity > 0.88 OR explicit cross-reference anchor.
1981+
- Store edges in `standards_graph_edges` (source_clause_id, target_clause_id, edge_type, weight).
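
A background job building these edges might look like the sketch below. It assumes clause embeddings are already available as plain float lists and that explicit cross-references arrive as ID pairs; the function names and the weight of 1.0 for cross-reference edges are hypothetical. The pairwise loop is quadratic in the number of clauses, so a real job would more likely use the pgvector index to find neighbours.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(clauses: dict[str, list[float]],
                cross_refs: set[tuple[str, str]],
                threshold: float = 0.88) -> list[dict]:
    """Return rows shaped like standards_graph_edges."""
    edges = []
    for (src, v1), (dst, v2) in combinations(sorted(clauses.items()), 2):
        if (src, dst) in cross_refs or (dst, src) in cross_refs:
            edges.append({"source_clause_id": src, "target_clause_id": dst,
                          "edge_type": "cross_reference", "weight": 1.0})
        else:
            weight = cosine(v1, v2)
            if weight > threshold:
                edges.append({"source_clause_id": src, "target_clause_id": dst,
                              "edge_type": "semantic", "weight": weight})
    return edges
```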
1982+
Interactive Flow:
1983+
1. User: *"How does ISO-27001 relate to NIST 800-53 on incident response?"*
1984+
2. Retrieve subgraph filtered by tags & topic embeddings (incident response cluster labels).
1985+
3. Summarize: overlapping concepts, unique requirements, divergence notes.
1986+
1987+
### 14.9 Orchestration Pseudocode (High-Level)
1988+
```python
def process_document(file_path: str, session_id: str):
    # Stage 1: tagging (heuristic pass, LLM pass, conflict resolution).
    raw_meta = gather_basic_metadata(file_path)
    heuristic_tag, h_conf = heuristic_classifier(file_path)
    llm_tag, llm_conf, rationale = llm_classifier(file_path)
    final_tag, final_conf = resolve_tag(heuristic_tag, h_conf, llm_tag, llm_conf)
    if needs_user_confirmation(final_conf):
        prompt_user_for_tag_confirmation(session_id, candidates=[heuristic_tag, llm_tag])
        final_tag = await_user_choice(session_id)
    # Stage 2: register the document, then run the tag-specific pipeline.
    doc_id = external_client.register_document({...})
    pipeline = select_pipeline(final_tag)
    structured = pipeline.run(file_path)
    # Stage 3: persist and embed, with extra refinement for requirements.
    if final_tag == 'requirements_spec':
        refined = high_accuracy_refinement(structured)
        external_client.upsert_requirements(refined.requirements)
        embed_and_store(refined.requirements, doc_id)
    else:
        clauses = normalize_non_requirements(structured)
        external_client.upsert_clauses(clauses)
        embed_and_store(clauses, doc_id)
    return summary(structured)
```
2010+
2011+
### 14.10 Evaluation & QA Extensions
2012+
New Metrics:
2013+
| Aspect | Metric | Target |
|--------|--------|--------|
| Tagging | Primary tag accuracy (manual validation) | ≥95% |
| Tagging | Confirmation intervention rate | <25% (improves with learning) |
| Requirements | Extraction accuracy (precision/recall) | ≥99% / ≥98% |
| Embeddings | Retrieval nDCG@10 (bench queries) | ≥0.82 |
| Hybrid Search | Latency (p95) | <700ms (warm) |
| Compliance | Gap detection F1 | ≥0.9 |
2021+
2022+
Automated Evaluation Harness:
2023+
- Golden dataset of annotated documents (requirements, standards) with expected outputs.
2024+
- Periodic CI job runs extraction + retrieval benchmarks; publishes `TEST_EXECUTION_REPORT.md` deltas.
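
For the retrieval benchmark, nDCG@10 can be computed as in the sketch below. It assumes graded relevance labels from the golden dataset, uses linear gain (the `2^rel - 1` variant is also common), and by default derives the ideal ordering from the retrieved list unless the full set of judged relevances is passed in. The function names are hypothetical.

```python
from __future__ import annotations

import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances: list[float], k: int = 10,
              judged_relevances: list[float] | None = None) -> float:
    """nDCG@k for one query.

    ranked_relevances: relevance grades of the results in retrieved order.
    judged_relevances: grades of all judged documents for the query; falls
    back to the retrieved list when omitted.
    """
    pool = ranked_relevances if judged_relevances is None else judged_relevances
    ideal_dcg = dcg(sorted(pool, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_relevances[:k]) / ideal_dcg
```

The benchmark score would be the mean of `ndcg_at_k` over all bench queries, compared against the ≥0.82 target.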
2025+
2026+
### 14.11 Risks & Mitigations (Addendum)
2027+
| Risk | Impact | Mitigation |
|------|--------|------------|
| External DB downtime | Lost persistence / user blockage | Local queue with retries and a dead-letter queue (DLQ); show degraded-mode notice |
| Tag misclassification | Wrong pipeline reduces accuracy | Confirmation loop + override memory + continuous learning |
| Vector drift (model change) | Retrieval inconsistency | Versioned embeddings (store `embedding_model_version`) + background re-indexer |
| Hybrid query latency spike | Poor UX | Adaptive K reduction + caching top lexical candidates |
| Over-generation in compliance suggestions | False confidence | Flag AI-suggested items; require explicit user accept |
2034+
2035+
### 14.12 Implementation Phasing Extension
2036+
Add to original phases:
2037+
- **Phase 5 (Week 9-10)**: Tagging confirmation loop + external API client + requirements persistence.
2038+
- **Phase 6 (Week 11-12)**: pgvector embeddings + hybrid retrieval MVP.
2039+
- **Phase 7 (Week 13-14)**: Compliance engine + standards graph builder.
2040+
- **Phase 8 (Week 15)**: Evaluation harness automation + performance tuning.
2041+
2042+
### 14.13 Summary of Addendum
2043+
The enhanced design introduces a *closed-loop knowledge lifecycle*:
2044+
`Document Ingestion → Context-Aware Tagging (+User Confirmation) → Domain Pipeline → High-Accuracy Structuring → Persistent Knowledge Graph (Postgres + pgvector) → Hybrid Retrieval → Cross-Document Reasoning (Compliance, Relationships, Q&A)`.
2045+
2046+
This augments the original architecture without breaking existing abstractions: new functionality slots into **pre-tool (tagging)**, **mid-pipeline (high-accuracy refinement)**, and **post-processing (persistence + retrieval)** stages.
2047+
2048+
---
2049+
2050+
**End of Design Document (v1.1 with Addendum)**
