MemoryBear/docs/rag/overview/source-inventory.md
Multica PM Agent 343a5eebe3
docs(rag): add MemoryBear RAG implementation docs v1.0
Submit the formed RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing + total index (responsibility matrix,
  review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks hit). The legacy docs/ ignore in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
2026-05-09 10:51:48 +08:00

# [S1-T3] MemoryBear RAG Source Code Inventory and Module Dependency Graph: Deliverable
## 1. Module Inventory
> Scope: all subdirectories of `api/app/core/rag/` + `api/app/core/workflow/nodes/knowledge` + `api/app/core/rag_utils/`, totaling **~24,900+ LOC** of Python code.
| Submodule path | Main responsibility | Entry file / key class / key function | External interface (called by / calls) | Third-party dependencies | Files / LOC |
|---|---|---|---|---|---|
| `rag/app` | Orchestrator for document parsing and chunking; routes by doc_type to different parsing strategies (naive / book / paper / qa / audio / picture / manual / laws / mail / one) | `naive.py:508 chunk()`, `naive.py:257 naive.__call__()`, `naive.py:27 by_deepdoc()`, `naive.py:45 by_mineru()`, `naive.py:65 by_textln()` | Called by `tasks.py` (Celery ingestion); calls `deepdoc/parser` + `deepdoc/vision` + `rag/nlp` + `rag/llm/cv_model` + `rag/llm/sequence2txt_model` | `python-docx`, `openpyxl`, `pdfplumber`, `markdown`, `Pillow` | 12 / 2,923 |
| `rag/common` | Shared RAG constants, exceptions, decorators, and utility functions (file / float / logging / string / token counting) | `constants.py` (constant definitions), `token_utils.py` (encoder), `settings.py:13 init_settings()` (singleton initialization) | Widely imported by `rag/utils/es_conn.py`, `rag/graphrag/utils.py`, `rag/nlp/search.py`, and others | `tiktoken` (tokenizer) | 12 / 602 |
| `rag/crawler` | Web page fetching and content extraction | `web_crawler.py`, `content_extractor.py`, `http_fetcher.py` | Called by `tasks.py`; triggered by knowledge sync | `requests` | 9 / 1,237 |
| `rag/deepdoc/parser` | Document parsing for 11 formats (PDF/Word/Excel/HTML/MD/JSON/TXT/PPT) | `pdf_parser.py:34 RAGPdfParser.__call__:1124`, `docx_parser.py:9 RAGDocxParser`, `mineru_parser.py:41 MinerUParser` | Imported and called by `rag/app/naive.py` | `pdfplumber`, `pypdf`, `python-docx`, `openpyxl`, `beautifulsoup4`, `markdown`, `pandas` | 12 / 3,228 |
| `rag/deepdoc/vision` | Visual document analysis: layout recognition + OCR + table structure recognition | `ocr.py:522 OCR.__call__:694`, `layout_recognizer.py:17 LayoutRecognizer`, `table_structure_recognizer.py:15 TableStructureRecognizer` | Called by `pdf_parser.py` for layout / table / image recognition | `onnxruntime`, `huggingface_hub`, `Pillow`, `opencv-python`, `numpy` | 10 / 3,657 |
| `rag/graphrag` (top level) | Shared GraphRAG utilities, entity resolution, query-analysis prompts, knowledge graph search | `search.py:19 KGSearch(Dealer)`, `entity_resolution.py:31 EntityResolution`, `utils.py` (graph merge/persist/LLM cache) | Called by `tasks.py`, the workflow knowledge node, and `prompts/generator.py` | `networkx`, `pandas`, `trio`, `redis`, `xxhash`, `json_repair` | 6 / 1,452 |
| `rag/graphrag/general` | General / full GraphRAG pipeline: subgraph extraction → merge → entity resolution → Leiden communities → community reports | `index.py:36 run_graphrag()`, `index.py:122 run_graphrag_for_kb()`, `graph_extractor.py:34 GraphExtractor`, `community_reports_extractor.py:37` | Called by the Celery tasks in `tasks.py`; calls `ElasticSearchVector` to write graph data | `networkx`, `graspologic`, `tiktoken`, `trio` | 11 / 1,857 |
| `rag/graphrag/light` | Lightweight GraphRAG (LightRAG style): simplified entity/relation extraction, no community reports | `light/graph_extractor.py:31 GraphExtractor` | Called by `general/index.py`, switched in conditionally based on `parser_config.graphrag.method` | `networkx`, `trio` | 3 / 462 |
| `rag/integrations/feishu` | Feishu document sync client | `client.py: FeishuAPIClient` | Called by `knowledge_controller.py` + `tasks.py` | `requests` | 6 / 737 |
| `rag/integrations/yuque` | Yuque document sync client | `client.py: YuqueAPIClient` | Called by `knowledge_controller.py` + `tasks.py` | `requests` | 6 / 844 |
| `rag/llm` | Unified multi-model LLM facade (Chat / Embedding / CV / Seq2txt) | `chat_model.py:52 Base`, `embedding_model.py:14 Base`, `cv_model.py:19 Base`, `sequence2txt_model.py:15 Base` | Called by `rag/app`, `rag/nlp/search`, `rag/graphrag`, `rag/vdb`, `workflow/nodes/knowledge`, and others | `openai`, `dashscope`, `azure-openai`, `ollama`, `zhipuai`, `requests` | 5 / 1,676 |
| `rag/models` | Chunk data models | `chunk.py:17 DocumentChunk`, `chunk.py:5 ChildDocumentChunk` | Referenced by `rag/vdb`, `rag/app`, `workflow/nodes/knowledge`, `tasks.py` | `pydantic` | 2 / 72 |
| `rag/nlp` | NLP toolbox: Chinese word segmentation, BM25/hybrid search dispatch, synonym expansion, term weighting, query rewriting | `search.py:349 Dealer` (including `retrieval:674`, `search:387`, `rerank:606`), `rag_tokenizer.py:15 RagTokenizer`, `query.py:10 FulltextQueryer` | Called by `rag/app/naive.py`, `rag/graphrag`, `rag/prompts/generator.py`, `rag/common/settings.py` | `datrie`, `hanziconv`, `nltk`, `pandas`, `numpy` | 7 / 2,962 |
| `rag/prompts` | Prompt template loading and LLM prompt factory | `template.py:9 load_prompt()`, `generator.py` (20+ functions: citation / keyword / question / toc / reflect, etc.) | Called by `tasks.py`, `rag/nlp/search.py`, `rag/graphrag`; depends on `.md` prompt files | `jinja2`, `json_repair` | 3 / 769 + 31 md files |
| `rag/utils` | ES connection, Redis connection, LibreOffice conversion, file utilities | `es_conn.py: ESConnection`, `redis_conn.py`, `libre_office.py`, `file_utils.py`, `doc_store_conn.py` | Called by `rag/vdb`, `rag/common/settings.py`, `rag/app/naive.py`, `rag/nlp/search.py` | `elasticsearch`, `redis` | 6 / 1,578 |
| `rag/vdb` | Vector database abstraction + Elasticsearch implementation | `elasticsearch/elasticsearch_vector.py:29 ElasticSearchVector`, `elasticsearch/elasticsearch_vector.py:666 ElasticSearchVectorFactory`, `vector_base.py:9 BaseVector` | Called by `tasks.py`, `knowledge_controller.py`, `chunk_controller.py`, `workflow/nodes/knowledge` | `elasticsearch`, `langchain-core` | 3 / 83 + 2 / 753 |
| `rag/res` | Static resources: NER vocabulary, synonym table, mapping table | `ner.json`, `synonym.json`, `mapping.json` | Loaded by `rag/nlp/term_weight.py`, `rag/nlp/synonym.py` | none | 3 JSON |
| `workflow/nodes/knowledge` | Workflow knowledge retrieval node: multi-knowledge-base retrieval + reranking + GraphRAG enhancement | `node.py:29 KnowledgeRetrievalNode`, `node.py:303 execute()`, `node.py:195 knowledge_retrieval()` | Registered by `workflow/nodes/node_factory.py` and `workflow/nodes/__init__.py`; calls `rag/vdb`, `rag/llm`, `rag/models` | `langchain-core` | 3 / 455 |
| `rag_utils` (⚠️ distinct from `rag/utils`) | LLM analysis of chunk content: summary generation, tag extraction, insight analysis, persona profiling | `chunk_summary.py:68 generate_chunk_summary()`, `chunk_tags.py:56 extract_chunk_tags()`, `chunk_insight.py:137 generate_chunk_insight()` | Called by `services/memory_dashboard_service.py`; depends on the `app.core.memory.*` LLM factory | `pydantic` | 4 / 588 |
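
The per-row line counts can be cross-checked against the headline figures. The sketch below simply re-adds the "Files / LOC" column; it assumes the `rag/vdb` cell "3 / 83 + 2 / 753" means the base package plus the `elasticsearch/` subpackage, and the dictionary and function names are illustrative only. The `rag/` rows come to 24,892 LOC, within a few lines of the ~24,895 quoted for `api/app/core/rag/`, while adding the two modules outside `rag/` gives 25,935.

```python
# (files, python_loc) per row, copied from the "Files / LOC" column.
RAG_MODULES = {
    "rag/app": (12, 2923),
    "rag/common": (12, 602),
    "rag/crawler": (9, 1237),
    "rag/deepdoc/parser": (12, 3228),
    "rag/deepdoc/vision": (10, 3657),
    "rag/graphrag (top level)": (6, 1452),
    "rag/graphrag/general": (11, 1857),
    "rag/graphrag/light": (3, 462),
    "rag/integrations/feishu": (6, 737),
    "rag/integrations/yuque": (6, 844),
    "rag/llm": (5, 1676),
    "rag/models": (2, 72),
    "rag/nlp": (7, 2962),
    "rag/prompts": (3, 769),
    "rag/utils": (6, 1578),
    # Assumed split of the "3 / 83 + 2 / 753" cell.
    "rag/vdb (base)": (3, 83),
    "rag/vdb/elasticsearch": (2, 753),
}

# Rows that live outside api/app/core/rag/.
OTHER_MODULES = {
    "workflow/nodes/knowledge": (3, 455),
    "rag_utils": (4, 588),
}


def total_loc(modules):
    """Sum the LOC half of each (files, loc) pair."""
    return sum(loc for _files, loc in modules.values())


rag_loc = total_loc(RAG_MODULES)
all_loc = rag_loc + total_loc(OTHER_MODULES)
print(f"rag/ rows: {rag_loc:,} LOC; all rows: {all_loc:,} LOC")
```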
---
## 2. Dependency Graph (Mermaid)
```mermaid
graph TB
subgraph "Upstream callers"
A1[tasks.py<br/>Celery Workers]
A2[controllers/<br/>REST API]
A3[workflow/nodes/<br/>Knowledge retrieval node]
A4[services/memory_<br/>dashboard_service.py]
end
subgraph "RAG Core"
B1[rag/app<br/>Parsing and chunking]
B2[rag/deepdoc/parser<br/>Format parsing]
B3[rag/deepdoc/vision<br/>Layout/OCR]
B4[rag/crawler<br/>Web crawling]
B5[rag/integrations<br/>Feishu/Yuque]
B6[rag/nlp<br/>Tokenization/search dispatch]
B7[rag/llm<br/>Multi-model facade]
B8[rag/vdb<br/>ES vector store]
B9[rag/graphrag<br/>Knowledge graph]
B10[rag/prompts<br/>Prompt factory]
B11[rag/models<br/>Chunk models]
B12[rag/common<br/>Constants/utilities]
B13[rag/utils<br/>ES/Redis connections]
end
subgraph "Side-path modules"
C1[rag_utils<br/>Chunk LLM analysis]
end
A1 --> B1
A1 --> B4
A1 --> B5
A1 --> B8
A1 --> B9
A1 --> B10
A2 --> B1
A2 --> B5
A2 --> B8
A2 --> B9
A3 --> B8
A3 --> B7
A3 --> B11
A4 --> C1
B1 --> B2
B1 --> B3
B1 --> B6
B1 --> B7
B2 --> B3
B2 --> B6
B3 --> B12
B4 --> B13
B5 --> B13
B6 --> B7
B6 --> B13
B6 --> B10
B8 --> B7
B8 --> B11
B8 --> B13
B9 --> B6
B9 --> B7
B9 --> B10
B9 --> B13
B10 --> B7
B10 --> B9
C1 --> B7
B12 --> B13
B13 --> B8
```
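Because the diagram is a plain edge list, it can be checked mechanically for circular dependencies. The sketch below copies the node IDs and edges from the diagram as drawn and groups nodes that can reach each other; the helper names (`parse_edges`, `reachable`, `cyclic_groups`) are hypothetical and not part of the MemoryBear codebase. On these edges it reports two cyclic groups: B8/B13 (`rag/vdb` and `rag/utils`) and B6/B9/B10 (`rag/nlp`, `rag/graphrag`, `rag/prompts`).

```python
EDGES = """
A1 --> B1
A1 --> B4
A1 --> B5
A1 --> B8
A1 --> B9
A1 --> B10
A2 --> B1
A2 --> B5
A2 --> B8
A2 --> B9
A3 --> B8
A3 --> B7
A3 --> B11
A4 --> C1
B1 --> B2
B1 --> B3
B1 --> B6
B1 --> B7
B2 --> B3
B2 --> B6
B3 --> B12
B4 --> B13
B5 --> B13
B6 --> B7
B6 --> B13
B6 --> B10
B8 --> B7
B8 --> B11
B8 --> B13
B9 --> B6
B9 --> B7
B9 --> B10
B9 --> B13
B10 --> B7
B10 --> B9
C1 --> B7
B12 --> B13
B13 --> B8
"""


def parse_edges(text):
    """Turn 'X --> Y' lines into (source, target) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        src, dst = (part.strip() for part in line.split("-->"))
        pairs.append((src, dst))
    return pairs


def reachable(graph, start):
    """Nodes reachable from `start` by following at least one edge."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def cyclic_groups(edges):
    """Groups of nodes that can all reach each other (cycle members)."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    reach = {node: reachable(graph, node) for node in graph}
    groups = set()
    for node, targets in reach.items():
        if node in targets:  # the node sits on at least one cycle
            groups.add(frozenset(m for m in targets if node in reach[m]))
    return sorted(sorted(group) for group in groups)


groups = cyclic_groups(parse_edges(EDGES))
for group in groups:
    print("cycle members:", ", ".join(group))
```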
---
## 3. Entry-Point Call Chains
### 3.1 Document Ingestion Chain (Indexing Pipeline)
```
REST POST /document or /knowledge/{id}/sync
↓ triggers
Celery task @tasks.py:212 parse_document(file_path, document_id)
↓ calls
rag/app/naive.py:508 chunk(filename, binary, ...)
↓ routes by file extension
├─ PDF → by_deepdoc() → deepdoc/parser/pdf_parser.py:34 RAGPdfParser.__call__:1124
├─ PDF alt → by_mineru() → deepdoc/parser/mineru_parser.py:41 MinerUParser.parse_pdf()
├─ DOCX → RAGDocxParser.__call__() @ docx_parser.py:9
├─ XLSX → RAGExcelParser.__call__() @ excel_parser.py:16
├─ HTML → RAGHtmlParser.__call__() @ html_parser.py:22
├─ MD → RAGMarkdownParser.__call__() @ markdown_parser.py:6
├─ JSON → RAGJsonParser.__call__() @ json_parser.py:7
└─ TXT → RAGTxtParser.__call__() @ txt_parser.py:7
rag/app/naive.py:257 naive.__call__(): extracts sections + tables
rag/nlp/__init__.py: tokenize / naive_merge / hierarchical_merge
rag/vdb/elasticsearch/elasticsearch_vector.py:55 add_chunks()
↓ calls
rag/vdb/elasticsearch/elasticsearch_vector.py:65 create()
↓ calls
embedding_model.py: encode() → LLM API → ES bulk index
```
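The routing step in this chain can be pictured as a lookup from file extension to parser. The following is a minimal sketch under stated assumptions, not the actual body of `chunk()`: the parser class names come from the chain above, while `route_parser`, the `pdf_strategy` selector, and the exact extension strings are hypothetical, and the real function may accept more formats and select parsers differently.

```python
from pathlib import Path

# Extension -> parser class name, taken from the chain above.
PARSER_BY_EXTENSION = {
    ".docx": "RAGDocxParser",
    ".xlsx": "RAGExcelParser",
    ".html": "RAGHtmlParser",
    ".md": "RAGMarkdownParser",
    ".json": "RAGJsonParser",
    ".txt": "RAGTxtParser",
}

# PDF has two listed strategies; the selector name is an assumption.
PDF_PARSER_BY_STRATEGY = {
    "deepdoc": "RAGPdfParser",
    "mineru": "MinerUParser",
}


def route_parser(filename: str, pdf_strategy: str = "deepdoc") -> str:
    """Return the name of the parser that would handle `filename`."""
    extension = Path(filename).suffix.lower()
    if extension == ".pdf":
        if pdf_strategy not in PDF_PARSER_BY_STRATEGY:
            raise ValueError(f"unknown PDF strategy: {pdf_strategy!r}")
        return PDF_PARSER_BY_STRATEGY[pdf_strategy]
    if extension not in PARSER_BY_EXTENSION:
        raise ValueError(f"no parser registered for {extension!r}")
    return PARSER_BY_EXTENSION[extension]


for name in ("report.PDF", "notes.md", "data.xlsx"):
    print(name, "->", route_parser(name))
```

Returning names rather than classes keeps the sketch self-contained; in a real router the dictionary values would be the parser classes themselves.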
### 3.2 Online Retrieval Chain (Query Pipeline)
```
REST POST /retrieval
Workflow Node: workflow/nodes/knowledge/node.py:303 execute()
workflow/nodes/knowledge/node.py:195 knowledge_retrieval()
↓ branches on retrieve_type
├─ PARTICIPLE → ElasticSearchVector.search_by_full_text() @ elasticsearch_vector.py:468
├─ SEMANTIC → ElasticSearchVector.search_by_vector() @ elasticsearch_vector.py:374
├─ HYBRID → vector + full_text in parallel → dedupe → rerank @ node.py:236-271
└─ Graph → HYBRID results + kg_retriever.retrieval()
↓ calls
rag/common/settings.py:10 kg_retriever (singleton)
↓ calls
rag/graphrag/search.py:19 KGSearch.retrieval()
```
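The HYBRID branch (parallel vector and full-text search, then dedupe, then rerank) can be sketched as follows. Everything here is illustrative: `Hit`, `hybrid_retrieve`, and the three stand-in backends are hypothetical names, and keeping the higher raw score per chunk is a simplification, since the source does not describe how `node.py` merges duplicates.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    score: float


def hybrid_retrieve(query, search_by_vector, search_by_full_text, rerank, top_k=3):
    """Run both searches concurrently, dedupe by chunk id, then rerank."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vector_future = pool.submit(search_by_vector, query)
        text_future = pool.submit(search_by_full_text, query)
        hits = vector_future.result() + text_future.result()

    # Dedupe: keep one hit per chunk (here, the one with the higher score).
    best = {}
    for hit in hits:
        if hit.chunk_id not in best or hit.score > best[hit.chunk_id].score:
            best[hit.chunk_id] = hit

    return rerank(query, list(best.values()))[:top_k]


# Stand-in backends so the sketch runs without Elasticsearch or a rerank model.
def fake_vector_search(query):
    return [Hit("c1", 0.91), Hit("c2", 0.40)]


def fake_full_text_search(query):
    return [Hit("c2", 0.75), Hit("c3", 0.55)]


def fake_rerank(query, hits):
    return sorted(hits, key=lambda hit: hit.score, reverse=True)


results = hybrid_retrieve(
    "what is GraphRAG?", fake_vector_search, fake_full_text_search, fake_rerank
)
print([hit.chunk_id for hit in results])
```

Passing the search and rerank callables in as arguments keeps the merge logic testable on its own; a real rerank step would call a model rather than sort by the original scores.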
### 3.3 GraphRAG Build Chain
```
REST POST /knowledge/{knowledge_id}/knowledge_graph
Celery task @tasks.py:472 build_graphrag_for_kb(kb_id)
Celery task @tasks.py:557 build_graphrag_for_document(document_id, knowledge_id)
rag/graphrag/general/index.py:36 run_graphrag(row, language, with_resolution, with_community, ...)
rag/graphrag/general/index.py:122 run_graphrag_for_kb(kb_id, ...)
↓ pipeline
1. init_graphrag() → create the ES index
2. GraphExtractor.extract() → extract entities/relations chunk by chunk
├─ general/graph_extractor.py:34 GraphExtractor (Microsoft GraphRAG style)
└─ light/graph_extractor.py:31 GraphExtractor (LightRAG style, switched in conditionally)
3. graph_merge() → merge subgraphs
4. EntityResolution.resolve() → entity resolution
5. leiden.run() → community detection
6. CommunityReportsExtractor.extract() → community summaries
7. set_graph() → write back to ES
```
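The seven steps, together with the `with_resolution` and `with_community` parameters of `run_graphrag()`, suggest a pipeline in which some stages are optional. The sketch below only plans the stage order; it assumes that `with_resolution` gates step 4 and `with_community` gates steps 5 and 6, which the source does not state explicitly, and `plan_stages` is a hypothetical helper rather than a function in `index.py`.

```python
from typing import List

# Extractor choice per method, as listed under step 2.
EXTRACTORS = {
    "general": "general/graph_extractor.py GraphExtractor",
    "light": "light/graph_extractor.py GraphExtractor",
}


def plan_stages(
    method: str = "general",
    with_resolution: bool = True,
    with_community: bool = True,
) -> List[str]:
    """Return the ordered stage names for one GraphRAG build."""
    if method not in EXTRACTORS:
        raise ValueError(f"unknown graphrag method: {method!r}")

    stages = [
        "init_graphrag",
        f"extract via {EXTRACTORS[method]}",
        "graph_merge",
    ]
    if with_resolution:  # assumed to control step 4
        stages.append("EntityResolution.resolve")
    if with_community:  # assumed to control steps 5 and 6
        stages += ["leiden.run", "CommunityReportsExtractor.extract"]
    stages.append("set_graph")
    return stages


for stage in plan_stages(method="light", with_community=False):
    print(stage)
```

Only the extractor changes with `method` here, which matches the description of light as reusing the rest of the general pipeline.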
### 3.4 Workflow Knowledge Node Chain
```
workflow/nodes/knowledge/node.py:29 KnowledgeRetrievalNode
node.py:54 _extract_input(): renders the query template and reads the knowledge_bases config
node.py:303 execute()
node.py:335 get_knowledge_by_id(): verifies that the knowledge base exists
node.py:195 knowledge_retrieval()
↓ branch handling
├─ FOLDER type → recursively traverse child knowledge bases
├─ PARTICIPLE → vector_service.search_by_full_text()
├─ SEMANTIC → vector_service.search_by_vector()
├─ HYBRID → vector + full_text in parallel → dedupe → rerank
└─ Graph → HYBRID + kg_retriever.retrieval() enhancement
node.py:108 rerank(): calls the RedBearRerank model
node.py:362 returns {"chunks": [...], "citations": [...]}
```
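The FOLDER branch implies a recursive walk that flattens nested folders into the knowledge bases that can actually be searched. A minimal sketch of that idea follows; the `KnowledgeBase` dataclass, its field names, the `"General"` leaf type, and `collect_leaf_ids` are all hypothetical, and the guard against revisiting a node is a defensive assumption rather than documented behaviour of `node.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class KnowledgeBase:
    kb_id: str
    kb_type: str  # "FOLDER" or any leaf type
    children: List[str] = field(default_factory=list)


def collect_leaf_ids(
    kb_id: str,
    registry: Dict[str, KnowledgeBase],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Expand a knowledge base id into the non-folder ids beneath it."""
    seen = set() if seen is None else seen
    if kb_id in seen or kb_id not in registry:
        return []  # already visited, or the id does not exist
    seen.add(kb_id)

    kb = registry[kb_id]
    if kb.kb_type != "FOLDER":
        return [kb_id]

    leaves: List[str] = []
    for child_id in kb.children:
        leaves.extend(collect_leaf_ids(child_id, registry, seen))
    return leaves


registry = {
    "root": KnowledgeBase("root", "FOLDER", ["docs", "team"]),
    # "team" points back at "root" to exercise the revisit guard.
    "team": KnowledgeBase("team", "FOLDER", ["wiki", "root"]),
    "docs": KnowledgeBase("docs", "General"),
    "wiki": KnowledgeBase("wiki", "General"),
}
print(collect_leaf_ids("root", registry))
```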
---
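## 4. Gap Report (code vs. S1-T2 architecture expectations)
### 4.1 "Listed in the architecture but absent from the code / inconsistent naming or scope"
| # | Gap | S1-T2 architecture expectation | Actual code | Impact and recommendation |
|---|---|---|---|---|
| 1 | **No Milvus/Weaviate/Qdrant support** | The VDB stage is expected to discuss "vector database selection", implying possible multi-store support | Only `rag/vdb/elasticsearch/` is implemented; `BaseVector` has no other subclasses | The VDB chapter of the architecture doc should be explicitly scoped to Elasticsearch 8.x, or an extension interface should be planned |
| 2 | **`rag_utils` vs `rag/utils` naming conflict** | Expected directories: `api/app/core/rag/{deepdoc,crawler,integrations,llm,vdb,graphrag,prompts,app}` | In practice there are two independent directories, `rag/utils` (file utilities / ES connection) **and** `rag_utils/` (Chunk LLM analysis), whose names differ only by an underscore | Very easy to confuse; recommend renaming `rag_utils/` to `rag/chunk_analytics/` or merging it downstream of `rag/app/` |
| 3 | **`Dealer` in `nlp/search.py` is a legacy / side-path module** | `rag/nlp` is expected to be "tokenization / NLP utilities" | `rag/nlp/search.py:349 Dealer` is in fact a complete BM25/hybrid search dispatcher, so two retrieval systems exist in parallel with the ES vector search in `rag/vdb` | Two sets of retrieval code coexist (`nlp/search.py` is mainly used by GraphRAG, `vdb/elasticsearch` by Workflow). The architecture doc should explicitly mark `nlp/search` as a legacy channel dedicated to GraphRAG |
| 4 | **No standalone Reranking module** | S1-T2 expects a standalone Reranking stage | Reranking logic is scattered across several places: `workflow/nodes/knowledge/node.py:108 rerank()`, `rag/vdb/elasticsearch/elasticsearch_vector.py:560 rerank()`, and `rag/nlp/search.py:606 rerank()` | Recommend that the Sprint-2 docs give Reranking its own chapter, consolidating these three implementations and noting their differences (the Workflow node uses RedBearRerank, the VDB layer has its own independent rerank, and the NLP layer has a model-based rerank) |
| 5 | **Prompt directory holds many .md templates but has no unified version management** | Prompt engineering is a standalone stage | `rag/prompts/` has 31 `.md` template files + `template.py` (loader) + `generator.py` (factory functions), but template changes have no version control / audit mechanism | Recommend documenting the current state of prompt management: file-driven, loaded at runtime, no A/B or version rollback mechanism |
| 6 | **Deepdoc vision model loading path is hardcoded** | The architecture expects model management to be configurable | Each recognizer in `deepdoc/vision/` hardcodes a download via `huggingface_hub.snapshot_download(repo_id="InfiniFlow/deepdoc")` into `res/deepdoc/`; only the `HF_ENDPOINT` environment variable is configurable | Recommend explicitly documenting the model path constraint, to lay the groundwork for later model hot-update / private deployment |
| 7 | **GraphRAG light is a conditional branch, not a standalone module** | S1-T2 expects GraphRAG to have two independent directories, light and general | `light/` contains only `graph_extractor.py` + `graph_prompt.py` (2 logic files); everything else reuses the `Extractor` base class, `utils.py`, and `index.py` from `general/` | The Sprint-2 docs should label light as a "conditional sub-mode of general", so readers do not mistake it for two complete pipelines |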
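### 4.2 "Present in the code but not listed in the architecture"
| # | Gap | Code location | Notes |
|---|---|---|---|
| 1 | **11 parsing strategies in rag/app, routed by doc_type** | `rag/app/{naive,book,paper,qa,audio,picture,manual,laws,mail,one,textin_parser}.py` | The S1-T2 architecture mentions only "Loader / Parser" and not the MemoryBear-specific doc_type routing scheme (book/paper/qa/audio, etc.) |
| 2 | **MinerU third-party parser integration** | `rag/deepdoc/parser/mineru_parser.py` | The Parser stage in the architecture does not mention MinerU (a third-party PDF parsing service) as an alternative for PDF parsing |
| 3 | **TextIn third-party parser integration** | `rag/app/textin_parser.py` | As above: the TextIn API is not mentioned as another PDF parsing fallback |
| 4 | **rag_utils (Chunk LLM analysis)** | `api/app/core/rag_utils/` | The architecture has no place for this module; it actually produces chunk summaries / tags / insights and is coupled to the Memory system |
| 5 | **TOC (table of contents) intelligent extraction chain** | `rag/prompts/generator.py:408-717` | A large body of LLM-driven TOC detection / extraction / indexing / association code; the architecture outline has no separate "TOC processing" stage |
| 6 | **Crawler (web page scraping)** | `rag/crawler/` | The Loader stage in the architecture may include crawling, but at 1,200+ LOC the code merits a separate note |
| 7 | **res/ static resources (NER, synonym tables)** | `rag/res/{ner.json,synonym.json,mapping.json}` | The architecture does not mention the resource-file system behind term weighting / synonym expansion |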
---
## 5. Key Figures at a Glance
| Metric | Value |
|---|---|
| Total Python LOC in `api/app/core/rag/` | ~24,895 |
| Number of submodules in `api/app/core/rag/` | 15 (excluding res/) |
| Number of `.md` prompt templates | 31 |
| Number of parser implementations | 11 (including 3 PDF strategies: deepdoc/mineru/textin) |
| Number of LLM provider implementations | Chat 9 + Embed 10 + CV 7 + Seq2txt 6 = **32 provider classes** |
| Workflow Knowledge retrieval types | PARTICIPLE / SEMANTIC / HYBRID / Graph (4 types) |
| GraphRAG modes | general (Microsoft GraphRAG) / light (LightRAG style) |
| VDB implementation | Elasticsearch 8.x (the only one) |
---
The deliverable above has also been written to the local file `WS-14-deliverable.md` and can be reused directly as the baseline for Sprint-2 documentation.