Multica PM Agent 343a5eebe3

docs(rag): add MemoryBear RAG implementation docs v1.0

Submit the formed RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing + total index (responsibility matrix,
  review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks hit). The legacy docs/ ignore in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>

2026-05-09 10:51:48 +08:00

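RAG Stage Boundary Definitions

Goal: pin down each RAG stage's inputs, outputs, and upstream/downstream interfaces at the data-structure level, so that the Sprint-2 documents neither leave gaps between them nor overlap.

Overview diagram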

[Data Sources] ──→ [Loader] ──→ [Parser] ──→ [Chunking] ──→ [Embedding] ──→ [VDB]
                                                                      ↑
                                                                      │ (async)
                                                              [GraphRAG]

[User Query] ──→ [Query Understanding] ──→ [Retrieval] ──→ [Reranking] ──→ [Prompt] ──→ [LLM] ──→ [Post-Process] ──→ [Answer]
                                              ↑
                                              │ (GRAPH mode)
                                        [KG Search]

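1. Loader (data loading layer)

| Dimension | Definition |
| --- | --- |
| Upstream | External systems: Feishu API, Yuque API, Web URLs, user upload endpoint |
| Input | Feishu: folder_token, app_id, app_secret; Yuque: user_id, token; Web: entry_url, max_pages; upload: multipart/form-data |
| Output | Raw file content: `CrawledDocument` (dataclass) or a local file path (.docx/.pdf/.md/.html/.xlsx) |
| Output data structure | `CrawledDocument(url, title, content, content_length, crawl_timestamp, metadata)`; local file: `str` (path) |
| Downstream | Parser: receives a file path or bytes and calls the matching format-specific parser |
| Boundary contract | The Loader does no format parsing (no body-text extraction, no OCR). It is responsible only for: authenticate → fetch/download → write to disk. Format detection is done by `naive.chunk()` in the Parser layer, based on the file extension. |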
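
To make the Loader's output contract concrete, here is a minimal sketch of a `CrawledDocument` dataclass. Only the field names come from the table above; the field types, the `make_crawled_document` helper, and the example URL are assumptions for illustration, not the project's actual definitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class CrawledDocument:
    """Illustrative stand-in for the Loader output; field types are assumed."""
    url: str
    title: str
    content: str
    content_length: int
    crawl_timestamp: datetime
    metadata: Dict[str, Any] = field(default_factory=dict)


def make_crawled_document(url: str, title: str, content: str) -> CrawledDocument:
    # Hypothetical helper: the Loader records what it fetched as-is.
    # It does not extract body text or run OCR; that belongs to the Parser.
    return CrawledDocument(
        url=url,
        title=title,
        content=content,
        content_length=len(content),
        crawl_timestamp=datetime.now(timezone.utc),
    )


doc = make_crawled_document("https://example.com/a", "A", "<html>raw</html>")
```

The raw HTML is kept untouched in `content`, which mirrors the rule that format handling starts only in the Parser layer.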

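2. Parser (document parsing layer)

| Dimension | Definition |
| --- | --- |
| Upstream | Loader: receives a file path `str` or binary `bytes` |
| Input | `(filename: str, binary: bytes \| None, from_page, to_page, callback, vision_model)` |
| Output | `sections: List[Tuple[str, str]]`, each item being (text_content, layout_tag); `tables: List[Tuple[Tuple[Optional[Image.Image], Union[str, List[str]]], List[Tuple]]]` |
| Output data structure | A list of tuples, where tag is the layout type ("Title"/"Text"/"Table"/...) and text may carry a position tag `@@page\tx0\tx1\ttop\tbottom##` |
| Downstream | Chunking: receives `sections` + `tables` and performs merging and chunking |
| Boundary contract | The Parser handles format-specific extraction only and is not responsible for semantic chunking. The PDF Parser is a special case: it must output OCR results + layout information + table HTML. Parsers never call one another; `naive.chunk()` dispatches to them centrally. |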
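
The position tag embedded in section text can be separated from the text with a regular expression. The sketch below assumes the tag fields are separated by real tab characters and that page is an integer while the four coordinates are plain decimals; `split_position_tags` is a hypothetical helper, not a function from the codebase.

```python
import re
from typing import List, Tuple

# Assumed tag shape: @@<page>\t<x0>\t<x1>\t<top>\t<bottom>##
POSITION_TAG = re.compile(r"@@(\d+)\t([\d.]+)\t([\d.]+)\t([\d.]+)\t([\d.]+)##")

Position = Tuple[int, float, float, float, float]


def split_position_tags(text: str) -> Tuple[str, List[Position]]:
    """Return the text with tags removed, plus the positions the tags carried."""
    positions: List[Position] = [
        (
            int(m.group(1)),
            float(m.group(2)),
            float(m.group(3)),
            float(m.group(4)),
            float(m.group(5)),
        )
        for m in POSITION_TAG.finditer(text)
    ]
    return POSITION_TAG.sub("", text), positions


section = ("Quarterly results@@3\t72.0\t540.5\t100.0\t118.25##", "Title")
clean_text, positions = split_position_tags(section[0])
```

Keeping the tag grammar in one place like this is one way to honor the idea that the `sections` format is a contract owned by the Parser.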

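3. Chunking (text chunking layer)

| Dimension | Definition |
| --- | --- |
| Upstream | Parser: `sections` + `tables` |
| Input | `sections: List[Tuple[str, str]]`, `tables`, `chunk_token_num: int`, `delimiter: str`, `parser_config: dict` |
| Output | `res: List[Dict]`, the list of chunked document dicts |
| Output data structure (key fields) | `content_with_weight: str` (raw text), `content_ltks: str` (coarse-grained tokens), `content_sm_ltks: str` (fine-grained tokens), `image: PIL.Image` (optional), `page_num_int: int`, `position_int: List[int]`, `top_int: int`, `doc_type_kwd: str` |
| Downstream | Embedding: receives `res` and extracts `content_with_weight` for vectorization; GraphRAG: receives the text in `res` for entity and relation extraction |
| Boundary contract | Chunking neither calls Embedding nor writes to the VDB directly. It only splits long text into chunks that fit the token budget and fills in the tokenization and position metadata. Chunking results for multimodal content (images/audio) use this same data structure. |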
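
The token-budget rule can be illustrated with a greedy merge. This is a simplified sketch, not the project's `naive_merge`: it counts tokens by whitespace splitting (an assumption), ignores `delimiter` and `tables`, and fills only the `content_with_weight` field.

```python
from typing import Dict, List, Tuple


def count_tokens(text: str) -> int:
    # Stand-in tokenizer: whitespace split. The real pipeline uses its own tokenizer.
    return len(text.split())


def merge_sections(sections: List[Tuple[str, str]], chunk_token_num: int) -> List[Dict]:
    """Greedily pack section texts into chunks that stay within the token budget."""
    chunks: List[str] = []
    current, current_tokens = "", 0
    for text, _tag in sections:
        n = count_tokens(text)
        # Close the current chunk if adding this section would exceed the budget.
        if current and current_tokens + n > chunk_token_num:
            chunks.append(current)
            current, current_tokens = "", 0
        current = f"{current}\n{text}" if current else text
        current_tokens += n
    if current:
        chunks.append(current)
    return [{"content_with_weight": c} for c in chunks]


sections = [("a b c", "Text"), ("d e", "Text"), ("f g h i", "Text")]
res = merge_sections(sections, chunk_token_num=5)
```

Note that a single section longer than the budget becomes its own oversized chunk in this sketch; splitting inside a section is left out for brevity.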

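4. Embedding (vectorization layer)

| Dimension | Definition |
| --- | --- |
| Upstream | Chunking: receives `content_with_weight` from the chunk dicts |
| Input | `texts: List[str]` (batch, ≤16 items by default) |
| Output | `(np.array, total_tokens)`, where `np.array` has shape `(batch_size, vector_dimension)` |
| Output data structure | NumPy ndarray, float32; the vector dimension is set by the model (e.g. OpenAI text-embedding-3-small: 1536d) |
| Downstream | VDB: `(chunk_text, vector, metadata)` is assembled into a `DocumentChunk` and stored |
| Boundary contract | The Embedding layer is stateless and does not manage model lifecycle. Providers are instantiated through a factory pattern (matched on `Base._FACTORY_NAME`). Over-long input text is truncated automatically (OpenAI to 8000 tokens, QWen to 2048). `encode_queries()` is supported for encoding a single query. |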
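
The batching and truncation rules can be sketched without calling any provider. The limits in `MAX_TOKENS` come from the table above; the helper names and the whitespace-based token count are assumptions, and real providers count tokens with their own tokenizers.

```python
from typing import Dict, Iterator, List

# Truncation limits as stated in the boundary contract.
MAX_TOKENS: Dict[str, int] = {"OpenAI": 8000, "QWen": 2048}


def truncate(text: str, max_tokens: int) -> str:
    # Stand-in: whitespace tokens. Collapses runs of whitespace as a side effect.
    return " ".join(text.split()[:max_tokens])


def batches(texts: List[str], batch_size: int = 16) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = [f"chunk {i}" for i in range(35)]
prepared = [truncate(t, MAX_TOKENS["QWen"]) for t in texts]
batch_sizes = [len(b) for b in batches(prepared)]
```

With 35 input texts and the default batch size, the encoder would be called three times, with 16, 16, and 3 texts.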

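5. VDB (vector database layer)

| Dimension | Definition |
| --- | --- |
| Upstream | Embedding: receives `(text, vector, metadata)`; Chunking: receives the metadata in the chunk dicts |
| Input | `DocumentChunk(page_content: str, vector: List[float], metadata: dict)`; or, for retrieval: `query: str, top_k: int, indices: str, score_threshold: float` |
| Output | Indexing: ack / error; retrieval: `List[DocumentChunk]` |
| Storage schema | `page_content: text(ik_max_word)`, `metadata: object(doc_id, document_id, knowledge_id, sort_id, status)`, `vector: dense_vector(cosine, dynamic_dims)` |
| Downstream | Retrieval: fetches results through `search_by_vector` / `search_by_full_text` / `search` (hybrid) |
| Boundary contract | The VDB serves as both the document store (full-text index) and the vector store (dense-vector index). Elasticsearch (ES) is the only backend (no Milvus, Pinecone, etc.). GraphRAG entities, relations, and community reports are also stored here in the same chunk format. |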
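
The storage schema maps naturally onto an Elasticsearch index mapping. The sketch below builds such a mapping as a plain dict; the field names, the `ik_max_word` analyzer, and cosine similarity come from the table, while the types chosen for the `metadata` sub-fields and the `build_chunk_mapping` helper are assumptions.

```python
from typing import Any, Dict


def build_chunk_mapping(dims: int) -> Dict[str, Any]:
    """Return an Elasticsearch mapping body for chunk documents (illustrative)."""
    return {
        "properties": {
            # Full-text field; ik_max_word requires the IK analysis plugin.
            "page_content": {"type": "text", "analyzer": "ik_max_word"},
            "metadata": {
                "properties": {
                    # Sub-field types below are assumed, not taken from the source.
                    "doc_id": {"type": "keyword"},
                    "document_id": {"type": "keyword"},
                    "knowledge_id": {"type": "keyword"},
                    "sort_id": {"type": "long"},
                    "status": {"type": "integer"},
                }
            },
            # dims is only known once the embedding model has produced a vector.
            "vector": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
        }
    }


mapping = build_chunk_mapping(dims=1536)
```

Passing `dims` as a parameter reflects the "dynamic dims" note: the mapping cannot be finalized until the embedding dimension is known.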

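6. GraphRAG (knowledge graph layer)

| Dimension | Definition |
| --- | --- |
| Upstream | Chunking: receives `content_with_weight` from the chunk dicts; Celery: triggers `build_graphrag_for_document` asynchronously |
| Input (indexing) | `(document_id, chunk_text)` tuples; `chat_model: Base`, `embedding_model: OpenAIEmbed`, `vector_service: ElasticSearchVector` |
| Input (retrieval) | `question: str, workspace_ids: List[str], kb_ids: List[str], emb_mdl, llm` |
| Output (indexing) | `nx.Graph` (global graph) stored in ES; `entity` chunks + `relation` chunks + `community_report` chunks (General only) |
| Output (retrieval) | `Dict` with `page_content` = "Entities CSV + Relations CSV + Community Reports"; `metadata` carries citation information |
| Downstream | VDB: indexes and stores the entity, relation, and community-report chunks; Retrieval: the chunk returned by `KGSearch.retrieval()` is added to the standard retrieval results via `insert(0, ...)` |
| Boundary contract | GraphRAG is an independent asynchronous flow and is not synchronized with standard RAG indexing. Light and General share the same storage format, but General additionally produces community_report. GraphRAG does not replace the VDB; it adds a graph-semantic layer on top of it. At retrieval time the KG result has the highest priority (inserted at position 0). |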
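
The "KG result first" rule amounts to prepending one chunk to the standard result list. A minimal sketch follows, using a simplified `DocumentChunk` dataclass and a hypothetical `with_kg_first` helper; skipping an empty KG result is an assumption about the behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DocumentChunk:
    # Simplified stand-in for the project's DocumentChunk.
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def with_kg_first(results: List[DocumentChunk],
                  kg_chunk: Optional[DocumentChunk]) -> List[DocumentChunk]:
    """Return a new list with the KG chunk at position 0, if there is one."""
    merged = list(results)  # copy, so the caller's list is not mutated
    if kg_chunk is not None and kg_chunk.page_content:
        merged.insert(0, kg_chunk)
    return merged


standard = [DocumentChunk("chunk A"), DocumentChunk("chunk B")]
kg = DocumentChunk("Entities CSV + Relations CSV + Community Reports")
merged = with_kg_first(standard, kg)
```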

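7. Retrieval (retrieval layer)

| Dimension | Definition |
| --- | --- |
| Upstream | VDB: supplies candidates via `search_by_vector` / `search_by_full_text`; GraphRAG: `KGSearch.retrieval()` supplies graph-semantic results; Workflow Node: `KnowledgeRetrievalNode.execute()` initiates the call |
| Input | `query: str, config: Dict(knowledge_bases[], merge_strategy, reranker_id, reranker_top_k, use_graph)` |
| Output | `List[DocumentChunk]`, document chunks sorted by descending relevance |
| Output data structure | `DocumentChunk(page_content: str, metadata: dict)`, where metadata contains `score`, `doc_id`, `document_id`, `knowledge_id`, `highlight` |
| Downstream | Reranking: receives the candidate list and optionally reranks it; Prompt: receives the chunks to assemble the context |
| Boundary contract | The Retrieval layer supports 4 modes: PARTICIPLE (full-text), SEMANTIC (vector), HYBRID (mixed), GRAPH (graph-augmented). With multiple KBs, each KB is searched in turn and the results are merged. HYBRID defaults to weights of BM25 0.05 + Vector 0.95. When retrieval returns an empty result, it degrades automatically and retries with min_match 0.1 + similarity 0.17. |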
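
The HYBRID weighting and the empty-result fallback can be sketched as two small functions. The weights and the retry thresholds come from the table; the function names, the assumption that both scores are already normalized to a comparable range, and the shape of the `search` callable are illustrative only.

```python
from typing import Callable, Dict, List, Tuple

BM25_WEIGHT, VECTOR_WEIGHT = 0.05, 0.95  # HYBRID defaults from the contract


def fuse_scores(bm25: Dict[str, float], vector: Dict[str, float]) -> List[Tuple[str, float]]:
    """Weighted sum per chunk id; a chunk missing from one side scores 0 there."""
    ids = set(bm25) | set(vector)
    fused = {
        i: BM25_WEIGHT * bm25.get(i, 0.0) + VECTOR_WEIGHT * vector.get(i, 0.0)
        for i in ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


def search_with_fallback(search: Callable[[float, float], List[str]],
                         min_match: float, similarity: float) -> List[str]:
    """Run the search once; on an empty result, retry with relaxed thresholds."""
    hits = search(min_match, similarity)
    if not hits:
        hits = search(0.1, 0.17)
    return hits


ranked = fuse_scores({"a": 0.8, "b": 0.2}, {"a": 0.6, "b": 0.9})
```

Because the vector weight dominates, chunk `b` outranks chunk `a` here even though `a` has the much higher BM25 score.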

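8. Reranking (reranking layer)

| Dimension | Definition |
| --- | --- |
| Upstream | Retrieval: receives the candidate `List[DocumentChunk]` |
| Input | `query: str, docs: List[DocumentChunk], top_k: int`; or `reranker_id: UUID` |
| Output | `List[DocumentChunk]`, the reranked document chunks (length ≤ top_k) |
| Output data structure | Same as the Retrieval output, with `score` in metadata updated to the reranked score |
| Downstream | Prompt: receives the reranked chunks to assemble the context |
| Boundary contract | Reranking is an optional layer. When no reranker_id is configured, HYBRID results are sorted by metadata.score in descending order and truncated. When a reranker_id is configured, an external Rerank API (Jina / DashScope / Xinference) is called. If reranking fails, the flow falls back to the original results rather than blocking. |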
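
The optional-rerank and fail-open behavior can be sketched as follows. The reranker is modeled as a plain callable returning one score per document, which is an assumption standing in for the external Rerank API; truncating the fallback result to `top_k` is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class DocumentChunk:
    # Simplified stand-in for the project's DocumentChunk.
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


Reranker = Callable[[str, List[DocumentChunk]], List[float]]


def rerank_or_fallback(query: str, docs: List[DocumentChunk], top_k: int,
                       reranker: Optional[Reranker] = None) -> List[DocumentChunk]:
    def by_score(items: List[DocumentChunk]) -> List[DocumentChunk]:
        ordered = sorted(items, key=lambda d: d.metadata.get("score", 0.0), reverse=True)
        return ordered[:top_k]

    if reranker is None:
        return by_score(docs)  # no reranker configured: sort and truncate
    try:
        scores = reranker(query, docs)
    except Exception:
        return by_score(docs)  # fail open: keep the original retrieval scores
    rescored = [
        DocumentChunk(d.page_content, {**d.metadata, "score": s})
        for d, s in zip(docs, scores)
    ]
    return by_score(rescored)


docs = [DocumentChunk("a", {"score": 0.2}),
        DocumentChunk("b", {"score": 0.9}),
        DocumentChunk("c", {"score": 0.5})]
top = rerank_or_fallback("q", docs, top_k=2)
```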

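9. Prompt (prompt assembly layer)

| Dimension | Definition |
| --- | --- |
| Upstream | Reranking: receives the ranked chunks; Workflow: receives the user query |
| Input | `chunks: List[DocumentChunk], query: str, system_prompt: str` (optional) |
| Output | `system: str, history: List[Dict]`, a message format the LLM can be called with |
| Output data structure | `system: str` (retrieved context + system instructions), `history: [{"role": "user", "content": query}]` |
| Downstream | LLM: `Base.chat(system, history, gen_conf)` |
| Boundary contract | The Prompt layer does not call the LLM; it only assembles text. The assembly logic includes citation_prompt (citation markup format), keyword_extraction (used for the cache key), and content_tagging (content classification). Prompt templates are stored as `.md` files in the `prompts/` directory and loaded dynamically through `template.py`. |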
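
A minimal sketch of the assembly step, producing the `system` and `history` pair described above. The context layout (numbered chunks under a "Context:" heading) and the `build_prompt` name are assumptions; the project loads its real templates from `prompts/`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class DocumentChunk:
    # Simplified stand-in for the project's DocumentChunk.
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def build_prompt(chunks: List[DocumentChunk], query: str,
                 system_prompt: str = "") -> Tuple[str, List[Dict[str, str]]]:
    """Assemble text only; no LLM call happens in this layer."""
    # Number the chunks so the answer can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {c.page_content}" for i, c in enumerate(chunks, start=1)
    )
    system = f"{system_prompt}\n\nContext:\n{context}".strip()
    history = [{"role": "user", "content": query}]
    return system, history


system, history = build_prompt(
    [DocumentChunk("ES is the only backend.")],
    "Which backend is used?",
    "Answer from the context.",
)
```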

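10. LLM (generation layer)

| Dimension | Definition |
| --- | --- |
| Upstream | Prompt: receives `system` + `history` |
| Input | `system: str, history: List[Dict], gen_conf: dict(temperature, top_p, max_tokens)` |
| Output | `(answer: str, tokens: int)` or a streaming `Generator[str \| int]` |
| Output data structure | String (the generated answer text); in streaming mode it is returned token by token |
| Downstream | Post-Process: `insert_citations()` inserts citation markers |
| Boundary contract | The LLM layer keeps no conversational memory (stateless); every call carries the full history. 10+ Providers are supported, matched through the `_FACTORY_NAME` factory pattern. Streaming output is implemented by `chat_streamly()`, which returns a Generator. Error handling: API exceptions are raised and caught by the caller (Workflow / Celery). |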
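
The `_FACTORY_NAME` matching can be sketched with a small registry. Only `Base`, `_FACTORY_NAME`, and the `chat(system, history, gen_conf)` signature come from the text; the `EchoChat` provider, the `create` function, and discovering providers through `__subclasses__()` are assumptions made so the example runs without any external API.

```python
from typing import Dict, List, Tuple, Type


class Base:
    _FACTORY_NAME = ""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def chat(self, system: str, history: List[Dict[str, str]],
             gen_conf: dict) -> Tuple[str, int]:
        raise NotImplementedError


class EchoChat(Base):
    # Hypothetical provider used only for this illustration.
    _FACTORY_NAME = "Echo"

    def chat(self, system, history, gen_conf):
        answer = history[-1]["content"]
        return answer, len(answer.split())  # (answer, token count)


def create(factory_name: str, model_name: str) -> Base:
    """Pick the provider class whose _FACTORY_NAME matches (direct subclasses only)."""
    registry: Dict[str, Type[Base]] = {
        cls._FACTORY_NAME: cls for cls in Base.__subclasses__() if cls._FACTORY_NAME
    }
    if factory_name not in registry:
        raise ValueError(f"unknown provider: {factory_name}")
    return registry[factory_name](model_name)


llm = create("Echo", "demo-model")
answer, tokens = llm.chat("", [{"role": "user", "content": "hi there"}], {})
```

The call is stateless: the full `history` is passed in each time, matching the contract above.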

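11. Post-Process (post-processing layer)

| Dimension | Definition |
| --- | --- |
| Upstream | LLM: receives the generated `answer`; Retrieval: receives the original `chunks` + `chunk_v` (vectors) |
| Input | `answer: str, chunks: List[DocumentChunk], chunk_v: List[np.array], embd_mdl, tkweight, vtweight` |
| Output | `(answer_with_citations: str, cited_ids: Set[str])` |
| Output data structure | String (with citation markers such as `[1]`, `[2]`), `Set[str]` (the set of cited chunk ids) |
| Downstream | User: final display; Cache: written to the Redis cache |
| Boundary contract | Post-Process only inserts citation markers (`insert_citations()`) and does not modify content. The citation-placement algorithm scores candidates by `pagerank * similarity`. No citations are inserted inside code blocks (`...`). The cache key is built from `(model_name, prompt_text)`, and the TTL is set by the Redis configuration. |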
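
One way to derive a cache key from `(model_name, prompt_text)` is to hash both parts. The source does not say how the key is actually built, so everything in this sketch is an assumption: the SHA-256 hash, the length prefix, and the `llm_cache:` key prefix.

```python
import hashlib


def cache_key(model_name: str, prompt_text: str) -> str:
    """Build a deterministic cache key from the model name and the prompt text."""
    digest = hashlib.sha256()
    for part in (model_name, prompt_text):
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") and ("a", "bc") from colliding.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return "llm_cache:" + digest.hexdigest()


key = cache_key("demo-model", "What is the VDB backend?")
```

The key is stable across processes, so any worker that builds the same prompt for the same model reads the same cache entry.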

Cross-layer data flow summary

| Stage | Input data type | Output data type | Key data structures / files |
| --- | --- | --- | --- |
| Loader | URL / Token / File | `CrawledDocument` / `str` (path) | `crawler/models.py`, `integrations/*/models.py` |
| Parser | `str` (path) / `bytes` | `List[Tuple[str, str]]` + tables | `deepdoc/parser/*.py` |
| Chunking | sections + tables | `List[Dict]` | `nlp/__init__.py`, `app/naive.py` |
| Embedding | `List[str]` | `(np.array, int)` | `llm/embedding_model.py` |
| VDB | `DocumentChunk` | ack / `List[DocumentChunk]` | `vdb/field.py`, `models/chunk.py` |
| GraphRAG | chunk texts | `nx.Graph` + chunks | `graphrag/search.py`, `graphrag/general/index.py` |
| Retrieval | `query + config` | `List[DocumentChunk]` | `nlp/search.py` |
| Reranking | `query + docs` | `List[DocumentChunk]` | `models/rerank.py` |
| Prompt | `chunks + query` | `system + history` | `prompts/generator.py` |
| LLM | `system + history` | `str + int` | `llm/chat_model.py` |
| Post-Process | `answer + chunks` | `str + Set[str]` | `nlp/search.py:489` |

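Gap and overlap risk points

| Risk area | Description | Suggested owner |
| --- | --- | --- |
| Parser ↔ Chunking boundary | The `sections` format output by the Parser (including tags and position information) is consumed directly by Chunking's `naive_merge`. If the Parser changes the tag format, Chunking is affected. | Define the `sections` data contract once in the Parser document; the Chunking document only references it. |
| Embedding ↔ VDB boundary | The Embedding output dimension must match the dims of `dense_vector` in the VDB mapping. The dynamic dimension is determined by the first encode. | The Embedding document states how the dimension is obtained; the VDB document only references it. |
| GraphRAG ↔ VDB boundary | GraphRAG entities, relations, and community reports are stored in the VDB in `DocumentChunk` format and share the same ES index as standard chunks. | The VDB document defines the common storage format; the GraphRAG document only notes that it uses this format. |
| Retrieval ↔ Reranking boundary | Retrieval's HYBRID mode already deduplicates at the Node layer, but the `knowledge_retrieval()` function also makes its own independent rerank call. | The Reranking document explains how the two call paths (Node layer vs function layer) differ. |
| Prompt ↔ LLM boundary | The `history` format assembled by Prompt must be compatible with each Provider's API format. | The Prompt document states the output format specification; the LLM document explains each Provider's adaptation. |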