MemoryBear/docs/rag/overview/source-inventory.md
Multica PM Agent 343a5eebe3
docs(rag): add MemoryBear RAG implementation docs v1.0
Submit the formed RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing + total index (responsibility matrix,
  review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks hit). The legacy docs/ ignore in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
2026-05-09 10:51:48 +08:00


[S1-T3] MemoryBear RAG Source Inventory and Module Dependency Graph: Deliverable

1. Module Inventory

Scope: all subdirectories of api/app/core/rag/ + api/app/core/workflow/nodes/knowledge + api/app/core/rag_utils/, roughly 24,900+ LOC of Python code.

| Submodule path | Main responsibility | Entry files / key classes / key functions | External interface (called by / calls) | Third-party dependencies | Files / LOC |
|---|---|---|---|---|---|
| `rag/app` | Document parsing and chunking orchestrator; routes by doc_type to different parsing strategies (naive / book / paper / qa / audio / picture / manual / laws / mail / one) | `naive.py:508 chunk()`, `naive.py:257 naive.__call__()`, `naive.py:27 by_deepdoc()`, `naive.py:45 by_mineru()`, `naive.py:65 by_textln()` | Called by `tasks.py` (Celery ingestion); calls `deepdoc/parser` + `deepdoc/vision` + `rag/nlp` + `rag/llm/cv_model` + `rag/llm/sequence2txt_model` | python-docx, openpyxl, pdfplumber, markdown, Pillow | 12 / 2,923 |
| `rag/common` | Shared RAG constants, exceptions, decorators, and utility functions (file / float / logging / string / token counting) | `constants.py` (constant definitions), `token_utils.py` (encoder), `settings.py:13 init_settings()` (singleton initialization) | Widely imported, e.g. by `rag/utils/es_conn.py`, `rag/graphrag/utils.py`, `rag/nlp/search.py` | tiktoken (tokenizer) | 12 / 602 |
| `rag/crawler` | Web page fetching and content extraction | `web_crawler.py`, `content_extractor.py`, `http_fetcher.py` | Called by `tasks.py`; triggered by knowledge sync | requests | 9 / 1,237 |
| `rag/deepdoc/parser` | Document parsing for 11 formats, including PDF / Word / Excel / HTML / MD / JSON / TXT / PPT | `pdf_parser.py:34 RAGPdfParser.__call__:1124`, `docx_parser.py:9 RAGDocxParser`, `mineru_parser.py:41 MinerUParser` | Imported and called by `rag/app/naive.py` | pdfplumber, pypdf, python-docx, openpyxl, beautifulsoup4, markdown, pandas | 12 / 3,228 |
| `rag/deepdoc/vision` | Document visual analysis: layout recognition + OCR + table structure recognition | `ocr.py:522 OCR.__call__:694`, `layout_recognizer.py:17 LayoutRecognizer`, `table_structure_recognizer.py:15 TableStructureRecognizer` | Called by `pdf_parser.py` for layout / table / image recognition | onnxruntime, huggingface_hub, Pillow, opencv-python, numpy | 10 / 3,657 |
| `rag/graphrag` (top level) | Shared GraphRAG utilities, entity resolution, query-analysis prompts, knowledge graph search | `search.py:19 KGSearch(Dealer)`, `entity_resolution.py:31 EntityResolution`, `utils.py` (graph merge / persist / LLM cache) | Called by `tasks.py`, the workflow knowledge node, and `prompts/generator.py` | networkx, pandas, trio, redis, xxhash, json_repair | 6 / 1,452 |
| `rag/graphrag/general` | General (full) GraphRAG pipeline: subgraph extraction → merge → entity resolution → Leiden communities → community reports | `index.py:36 run_graphrag()`, `index.py:122 run_graphrag_for_kb()`, `graph_extractor.py:34 GraphExtractor`, `community_reports_extractor.py:37` | Called by the Celery tasks in `tasks.py`; calls `ElasticSearchVector` to write graph data | networkx, graspologic, tiktoken, trio | 11 / 1,857 |
| `rag/graphrag/light` | Lightweight GraphRAG (LightRAG style): simplified entity/relation extraction, no community reports | `light/graph_extractor.py:31 GraphExtractor` | Called conditionally by `general/index.py`, switched on `parser_config.graphrag.method` | networkx, trio | 3 / 462 |
| `rag/integrations/feishu` | Feishu document sync client | `client.py: FeishuAPIClient` | Called by `knowledge_controller.py` + `tasks.py` | requests | 6 / 737 |
| `rag/integrations/yuque` | Yuque document sync client | `client.py: YuqueAPIClient` | Called by `knowledge_controller.py` + `tasks.py` | requests | 6 / 844 |
| `rag/llm` | Unified multi-model LLM facade (Chat / Embedding / CV / Seq2txt) | `chat_model.py:52 Base`, `embedding_model.py:14 Base`, `cv_model.py:19 Base`, `sequence2txt_model.py:15 Base` | Called by `rag/app`, `rag/nlp/search`, `rag/graphrag`, `rag/vdb`, `workflow/nodes/knowledge`, and others | openai, dashscope, azure-openai, ollama, zhipuai, requests | 5 / 1,676 |
| `rag/models` | Chunk data models | `chunk.py:17 DocumentChunk`, `chunk.py:5 ChildDocumentChunk` | Referenced by `rag/vdb`, `rag/app`, `workflow/nodes/knowledge`, `tasks.py` | pydantic | 2 / 72 |
| `rag/nlp` | NLP toolbox: Chinese word segmentation, BM25/hybrid search dispatch, synonym expansion, term weighting, query rewriting | `search.py:349 Dealer` (including `retrieval:674`, `search:387`, `rerank:606`), `rag_tokenizer.py:15 RagTokenizer`, `query.py:10 FulltextQueryer` | Called by `rag/app/naive.py`, `rag/graphrag`, `rag/prompts/generator.py`, `rag/common/settings.py` | datrie, hanziconv, nltk, pandas, numpy | 7 / 2,962 |
| `rag/prompts` | Prompt template loading and LLM prompt factory | `template.py:9 load_prompt()`, `generator.py` (20+ functions for citation / keyword / question / toc / reflect, etc.) | Called by `tasks.py`, `rag/nlp/search.py`, `rag/graphrag`; depends on the `.md` prompt files | jinja2, json_repair | 3 / 769 + 31 md files |
| `rag/utils` | ES connection, Redis connection, LibreOffice conversion, file utilities | `es_conn.py: ESConnection`, `redis_conn.py`, `libre_office.py`, `file_utils.py`, `doc_store_conn.py` | Called by `rag/vdb`, `rag/common/settings.py`, `rag/app/naive.py`, `rag/nlp/search.py` | elasticsearch, redis | 6 / 1,578 |
| `rag/vdb` | Vector database abstraction + Elasticsearch implementation | `elasticsearch/elasticsearch_vector.py:29 ElasticSearchVector`, `elasticsearch/elasticsearch_vector.py:666 ElasticSearchVectorFactory`, `vector_base.py:9 BaseVector` | Called by `tasks.py`, `knowledge_controller.py`, `chunk_controller.py`, `workflow/nodes/knowledge` | elasticsearch, langchain-core | 3 / 83 + 2 / 753 |
| `rag/res` | Static resources (NER vocabulary, synonym table, mapping table) | `ner.json`, `synonym.json`, `mapping.json` | Loaded by `rag/nlp/term_weight.py`, `rag/nlp/synonym.py` | | 3 JSON |
| `workflow/nodes/knowledge` | Workflow knowledge retrieval node: multi-knowledge-base retrieval + reranking + GraphRAG augmentation | `node.py:29 KnowledgeRetrievalNode`, `node.py:303 execute()`, `node.py:195 knowledge_retrieval()` | Registered in `workflow/nodes/node_factory.py` and `workflow/nodes/__init__.py`; calls `rag/vdb`, `rag/llm`, `rag/models` | langchain-core | 3 / 455 |
| `rag_utils` (⚠️ distinct from `rag/utils`) | LLM analysis of chunk content: summary generation, tag extraction, insight analysis, persona profiling | `chunk_summary.py:68 generate_chunk_summary()`, `chunk_tags.py:56 extract_chunk_tags()`, `chunk_insight.py:137 generate_chunk_insight()` | Called by `services/memory_dashboard_service.py`; depends on the `app.core.memory.*` LLM factory | pydantic | 4 / 588 |
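
The file and line counts in the last column can be re-derived with a short script. The sketch below is one possible way to do it, not the method used for this inventory: the helper name `count_python_loc` is hypothetical, and it assumes that every physical line counts (blank lines and comments included), since the counting convention is not stated here.

```python
from pathlib import Path


def count_python_loc(root: str) -> tuple[int, int]:
    """Return (file_count, line_count) for all .py files under root.

    Assumption: every physical line counts, blanks and comments included.
    """
    files = 0
    lines = 0
    for path in sorted(Path(root).rglob("*.py")):
        files += 1
        # errors="replace" keeps the count going on files with odd encodings.
        with path.open(encoding="utf-8", errors="replace") as handle:
            lines += sum(1 for _ in handle)
    return files, lines
```

Calling `count_python_loc("api/app/core/rag/app")` from the repository root would give a pair comparable to the `rag/app` row, provided the same counting convention applies.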

2. Dependency Graph (Mermaid)

graph TB
    subgraph "Upstream callers"
        A1[tasks.py<br/>Celery Workers]
        A2[controllers/<br/>REST API]
        A3[workflow/nodes/<br/>Knowledge retrieval node]
        A4[services/memory_<br/>dashboard_service.py]
    end

    subgraph "RAG Core"
        B1[rag/app<br/>Parsing and chunking]
        B2[rag/deepdoc/parser<br/>Format parsing]
        B3[rag/deepdoc/vision<br/>Layout/OCR]
        B4[rag/crawler<br/>Web crawling]
        B5[rag/integrations<br/>Feishu/Yuque]
        B6[rag/nlp<br/>Tokenization/search dispatch]
        B7[rag/llm<br/>Multi-model facade]
        B8[rag/vdb<br/>ES vector store]
        B9[rag/graphrag<br/>Knowledge graph]
        B10[rag/prompts<br/>Prompt factory]
        B11[rag/models<br/>Chunk models]
        B12[rag/common<br/>Constants/utilities]
        B13[rag/utils<br/>ES/Redis connections]
    end

    subgraph "Side modules"
        C1[rag_utils<br/>Chunk LLM analysis]
    end

    A1 --> B1
    A1 --> B4
    A1 --> B5
    A1 --> B8
    A1 --> B9
    A1 --> B10
    A2 --> B1
    A2 --> B5
    A2 --> B8
    A2 --> B9
    A3 --> B8
    A3 --> B7
    A3 --> B11
    A4 --> C1

    B1 --> B2
    B1 --> B3
    B1 --> B6
    B1 --> B7
    B2 --> B3
    B2 --> B6
    B3 --> B12
    B4 --> B13
    B5 --> B13
    B6 --> B7
    B6 --> B13
    B6 --> B10
    B8 --> B7
    B8 --> B11
    B8 --> B13
    B9 --> B6
    B9 --> B7
    B9 --> B10
    B9 --> B13
    B10 --> B7
    B10 --> B9

    C1 --> B7
    B12 --> B13
    B13 --> B8
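
The edge list above contains mutual dependencies, for example B8 --> B13 together with B13 --> B8. The following sketch copies the RAG Core and C1 edges from the diagram and groups the nodes that can reach each other. The helper names `reachable` and `dependency_cycles` are illustrative only, and the edges describe the diagram as drawn, not verified imports.

```python
# Edges copied from the Mermaid diagram above (RAG Core plus C1).
# The A* caller nodes are omitted: nothing points back at them.
EDGES = [
    ("B1", "B2"), ("B1", "B3"), ("B1", "B6"), ("B1", "B7"),
    ("B2", "B3"), ("B2", "B6"), ("B3", "B12"), ("B4", "B13"),
    ("B5", "B13"), ("B6", "B7"), ("B6", "B13"), ("B6", "B10"),
    ("B8", "B7"), ("B8", "B11"), ("B8", "B13"), ("B9", "B6"),
    ("B9", "B7"), ("B9", "B10"), ("B9", "B13"), ("B10", "B7"),
    ("B10", "B9"), ("C1", "B7"), ("B12", "B13"), ("B13", "B8"),
]


def reachable(start, graph):
    """Return every node reachable from start via one or more edges."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def dependency_cycles(edges):
    """Group nodes that can reach each other, i.e. sit on a common cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    reach = {node: reachable(node, graph) for node in graph}
    groups = set()
    for node in graph:
        group = frozenset(
            other for other in graph
            if other == node or (other in reach[node] and node in reach[other])
        )
        if len(group) > 1:
            groups.add(group)
    return sorted(sorted(group) for group in groups)


print(dependency_cycles(EDGES))
```

With the edges as drawn, two groups come out: B6, B9, B10 (rag/nlp, rag/graphrag, rag/prompts) and B8, B13 (rag/vdb, rag/utils).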

3. Entry-Point Call Chains

3.1 Document Ingestion Chain (Indexing Pipeline)

REST POST /document or /knowledge/{id}/sync
    ↓ triggers
Celery task @tasks.py:212 parse_document(file_path, document_id)
    ↓ calls
rag/app/naive.py:508 chunk(filename, binary, ...)
    ↓ routes by file extension
    ├─ PDF → by_deepdoc() → deepdoc/parser/pdf_parser.py:34 RAGPdfParser.__call__:1124
    ├─ PDF alt → by_mineru() → deepdoc/parser/mineru_parser.py:41 MinerUParser.parse_pdf()
    ├─ DOCX → RAGDocxParser.__call__() @ docx_parser.py:9
    ├─ XLSX → RAGExcelParser.__call__() @ excel_parser.py:16
    ├─ HTML → RAGHtmlParser.__call__() @ html_parser.py:22
    ├─ MD  → RAGMarkdownParser.__call__() @ markdown_parser.py:6
    ├─ JSON → RAGJsonParser.__call__() @ json_parser.py:7
    └─ TXT → RAGTxtParser.__call__() @ txt_parser.py:7
    ↓
rag/app/naive.py:257 naive.__call__(): extracts sections + tables
    ↓
rag/nlp/__init__.py: tokenize / naive_merge / hierarchical_merge
    ↓
rag/vdb/elasticsearch/elasticsearch_vector.py:55 add_chunks()
    ↓ calls
rag/vdb/elasticsearch/elasticsearch_vector.py:65 create()
    ↓ calls
embedding_model.py: encode() → LLM API → ES bulk index
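
The extension-routing step in the chain above can be pictured as a lookup table. This is a minimal sketch under stated assumptions, not the code in naive.py: the dict `PARSER_BY_EXTENSION`, the function `pick_parser`, and the `pdf_backend` parameter are hypothetical, and only the parser class names come from the chain.

```python
from pathlib import Path

# Extension -> parser class name, as listed in the chain above.
# The dict itself is a hypothetical illustration, not code from naive.py.
PARSER_BY_EXTENSION = {
    ".pdf": "RAGPdfParser",
    ".docx": "RAGDocxParser",
    ".xlsx": "RAGExcelParser",
    ".html": "RAGHtmlParser",
    ".md": "RAGMarkdownParser",
    ".json": "RAGJsonParser",
    ".txt": "RAGTxtParser",
}


def pick_parser(filename: str, pdf_backend: str = "deepdoc") -> str:
    """Return the parser class name for a file, based on its extension.

    pdf_backend is an assumed switch standing in for the by_deepdoc() /
    by_mineru() choice; the real selection logic may differ.
    """
    ext = Path(filename).suffix.lower()
    if ext == ".pdf" and pdf_backend == "mineru":
        return "MinerUParser"
    if ext not in PARSER_BY_EXTENSION:
        raise ValueError(f"unsupported file type: {filename!r}")
    return PARSER_BY_EXTENSION[ext]
```

For example, `pick_parser("report.PDF")` selects `RAGPdfParser`, while `pick_parser("report.pdf", pdf_backend="mineru")` selects `MinerUParser`.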

3.2 Online Retrieval Chain (Query Pipeline)

REST POST /retrieval
    or
Workflow Node: workflow/nodes/knowledge/node.py:303 execute()
    ↓
workflow/nodes/knowledge/node.py:195 knowledge_retrieval()
    ↓ branches on retrieve_type
    ├─ PARTICIPLE → ElasticSearchVector.search_by_full_text() @ elasticsearch_vector.py:468
    ├─ SEMANTIC  → ElasticSearchVector.search_by_vector() @ elasticsearch_vector.py:374
    ├─ HYBRID    → parallel vector + full_text → dedupe → rerank @ node.py:236-271
    └─ Graph     → HYBRID results + kg_retriever.retrieval()
                        ↓ calls
                        rag/common/settings.py:10 kg_retriever (singleton)
                        ↓ calls
                        rag/graphrag/search.py:19 KGSearch.retrieval()
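
The HYBRID branch (parallel vector + full_text → dedupe → rerank) can be sketched in a few lines. Everything here is an assumption made for illustration: the three callables are stand-ins for the real search and rerank methods, whose signatures may differ, and deduplicating by an `"id"` key that keeps the first hit is a guess at what node.py:236-271 does.

```python
from concurrent.futures import ThreadPoolExecutor


def hybrid_search(query, search_by_vector, search_by_full_text, rerank, top_k=5):
    """Run both searches in parallel, dedupe by chunk id, then rerank.

    All three callables are hypothetical stand-ins; the real methods in
    elasticsearch_vector.py and node.py may take different arguments.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        vector_future = pool.submit(search_by_vector, query)
        text_future = pool.submit(search_by_full_text, query)
        hits = vector_future.result() + text_future.result()

    unique = {}
    for hit in hits:
        # Keep the first occurrence of each chunk id.
        unique.setdefault(hit["id"], hit)

    return rerank(query, list(unique.values()))[:top_k]


# Stub searchers and a score-based reranker, for demonstration only.
def fake_vector(query):
    return [{"id": "c1", "score": 0.9}, {"id": "c2", "score": 0.4}]


def fake_full_text(query):
    return [{"id": "c2", "score": 0.7}, {"id": "c3", "score": 0.8}]


def by_score(query, chunks):
    return sorted(chunks, key=lambda chunk: chunk["score"], reverse=True)


result_ids = [c["id"] for c in hybrid_search("q", fake_vector, fake_full_text, by_score)]
print(result_ids)
```

Threads suit this step because both searches are I/O-bound calls to the same store; a first-hit-wins dedupe is the simplest policy, and a real implementation might instead merge scores.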

3.3 GraphRAG Build Chain

REST POST /knowledge/{knowledge_id}/knowledge_graph
    or
Celery task @tasks.py:472 build_graphrag_for_kb(kb_id)
    ↓
Celery task @tasks.py:557 build_graphrag_for_document(document_id, knowledge_id)
    ↓
rag/graphrag/general/index.py:36 run_graphrag(row, language, with_resolution, with_community, ...)
    ↓
rag/graphrag/general/index.py:122 run_graphrag_for_kb(kb_id, ...)
    ↓ pipeline
    1. init_graphrag() → create the ES index
    2. GraphExtractor.extract() → extract entities/relations chunk by chunk
       ├─ general/graph_extractor.py:34 GraphExtractor (Microsoft GraphRAG style)
       └─ light/graph_extractor.py:31 GraphExtractor (LightRAG style, switched conditionally)
    3. graph_merge() → merge subgraphs
    4. EntityResolution.resolve() → entity resolution
    5. leiden.run() → community detection
    6. CommunityReportsExtractor.extract() → community summaries
    7. set_graph() → write back to ES
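
The seven steps above, together with the `with_resolution` and `with_community` parameters of `run_graphrag()`, suggest a pipeline whose stages are switched on and off by configuration. The sketch below only lists stage names. It assumes that `with_resolution` gates step 4 and `with_community` gates steps 5 and 6, which is inferred from the parameter names and not confirmed here. `plan_graphrag_stages` is a hypothetical helper.

```python
def plan_graphrag_stages(method="general", with_resolution=True, with_community=True):
    """List the pipeline stages from section 3.3 for one configuration.

    Assumption: with_resolution gates step 4 and with_community gates
    steps 5-6; the source only shows these names as run_graphrag() parameters.
    """
    if method not in ("general", "light"):
        raise ValueError(f"unknown graphrag method: {method!r}")

    stages = [
        "init_graphrag",
        # The extractor is the only stage that differs between the two modes.
        f"{method}/graph_extractor.py GraphExtractor.extract",
        "graph_merge",
    ]
    if with_resolution:
        stages.append("EntityResolution.resolve")
    if with_community:
        stages += ["leiden.run", "CommunityReportsExtractor.extract"]
    stages.append("set_graph")
    return stages


for stage in plan_graphrag_stages(method="light", with_community=False):
    print(stage)
```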

3.4 Workflow Knowledge Node Chain

workflow/nodes/knowledge/node.py:29 KnowledgeRetrievalNode
    ↓
node.py:54 _extract_input(): renders the query template, reads the knowledge_bases config
    ↓
node.py:303 execute()
    ↓
node.py:335 get_knowledge_by_id(): checks that the knowledge base exists
    ↓
node.py:195 knowledge_retrieval()
    ↓ branch handling
    ├─ FOLDER type → recursively traverse child knowledge bases
    ├─ PARTICIPLE → vector_service.search_by_full_text()
    ├─ SEMANTIC  → vector_service.search_by_vector()
    ├─ HYBRID    → vector + full_text in parallel → dedupe → rerank
    └─ Graph     → HYBRID + kg_retriever.retrieval() augmentation
    ↓
node.py:108 rerank(): calls the RedBearRerank model
    ↓
node.py:362 returns {"chunks": [...], "citations": [...]}
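
The FOLDER branch above traverses child knowledge bases recursively. A minimal sketch of that idea follows. The dict shape (`"id"`, `"type"`), the `children_of` callback, and the function name `collect_leaf_kbs` are all hypothetical, and the sketch assumes the folder hierarchy is a tree with no cycles.

```python
def collect_leaf_kbs(kb, children_of):
    """Flatten a FOLDER knowledge base into its non-folder descendants.

    kb is a dict with "id" and "type" keys (an assumed shape), and
    children_of(kb_id) returns the direct children of a folder.
    """
    if kb["type"] != "FOLDER":
        return [kb]
    leaves = []
    for child in children_of(kb["id"]):
        leaves.extend(collect_leaf_kbs(child, children_of))
    return leaves


# Toy hierarchy: root/ contains folder a/ (with leaf c) and leaf b.
TREE = {
    "root": [{"id": "a", "type": "FOLDER"}, {"id": "b", "type": "DOCUMENT"}],
    "a": [{"id": "c", "type": "DOCUMENT"}],
}

leaf_ids = [
    kb["id"]
    for kb in collect_leaf_kbs({"id": "root", "type": "FOLDER"}, lambda kb_id: TREE.get(kb_id, []))
]
print(leaf_ids)
```

Each leaf returned this way would then go through one of the PARTICIPLE / SEMANTIC / HYBRID / Graph branches.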

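4. Gap Report (Code vs. S1-T2 Architecture Expectations)

4.1 "Listed in the architecture but absent from the code, or inconsistent in naming/scope"

| # | Gap | S1-T2 architecture expectation | Actual code | Impact and recommendation |
|---|---|---|---|---|
| 1 | No Milvus/Weaviate/Qdrant support | The VDB stage was expected to discuss "vector database selection", implying possible support for multiple stores | `rag/vdb/elasticsearch/` is implemented; `BaseVector` has no other subclasses | The VDB chapter of the architecture docs should either be scoped explicitly to Elasticsearch 8.x or plan an extension interface |
| 2 | `rag_utils` vs `rag/utils` naming clash | Expected directories: `api/app/core/rag/{deepdoc,crawler,integrations,llm,vdb,graphrag,prompts,app}` | Two separate directories exist, `rag/utils` (file utilities / ES connection) and `rag_utils/` (chunk LLM analysis), differing only by an underscore | Very easy to confuse; consider renaming `rag_utils/` to `rag/chunk_analytics/` or merging it downstream of `rag/app/` |
| 3 | `Dealer` in `nlp/search.py` is a legacy/side-channel module | The architecture expects `rag/nlp` to be "tokenization / NLP utilities" | `rag/nlp/search.py:349 Dealer` is in fact a complete BM25/hybrid search dispatcher, so two retrieval systems run side by side with the ES vector search in `rag/vdb` | Two sets of retrieval code coexist (`nlp/search.py` is used mainly by GraphRAG, `vdb/elasticsearch` by Workflow). The architecture docs should state explicitly that `nlp/search` is a legacy channel dedicated to GraphRAG |
| 4 | No standalone Reranking module | S1-T2 expects a standalone Reranking stage | Reranking logic is scattered across `workflow/nodes/knowledge/node.py:108 rerank()`, `rag/vdb/elasticsearch/elasticsearch_vector.py:560 rerank()`, and `rag/nlp/search.py:606 rerank()` | The Sprint-2 docs should give Reranking its own chapter that consolidates these three implementations and notes their differences (the Workflow node uses RedBearRerank, the VDB layer has its own rerank, the NLP layer has a model-based rerank) |
| 5 | The prompt directory holds many `.md` templates but has no unified version management | Prompt engineering is a standalone stage | `rag/prompts/` has 31 `.md` template files + `template.py` (loader) + `generator.py` (factory functions), but template changes have no version control or audit mechanism | The docs should record the current state of prompt management: file-driven, loaded at runtime, no A/B or version rollback mechanism |
| 6 | Deepdoc vision model loading path is hardcoded | The architecture expects model management to be configurable | Each recognizer in `deepdoc/vision/` hardcodes a download via `huggingface_hub.snapshot_download(repo_id="InfiniFlow/deepdoc")` into `res/deepdoc/`; only the `HF_ENDPOINT` environment variable is configurable | The docs should state the model path constraint explicitly, to prepare for later model hot-update and private deployment |
| 7 | GraphRAG light is a conditional branch, not a standalone module | S1-T2 expects GraphRAG to have two independent directories, light and general | `light/` contains only `graph_extractor.py` + `graph_prompt.py` (2 logic files); everything else reuses `general/` (the `Extractor` base class, `utils.py`, `index.py`) | The Sprint-2 docs should label light as "a conditional sub-mode of general", so readers do not assume two complete pipelines |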
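
Gap 1 mentions planning an extension interface for additional vector stores. One common shape for that is an abstract base class plus a name-keyed registry, sketched below. This is not the existing `BaseVector` or `ElasticSearchVectorFactory` code: `VectorStore`, `register_backend`, `create_vector_store`, and `InMemoryStore` are hypothetical names, and only the two search method names are borrowed from section 3.2.

```python
from abc import ABC, abstractmethod


class VectorStore(ABC):
    """Hypothetical stand-in for BaseVector; method names follow section 3.2."""

    @abstractmethod
    def search_by_vector(self, query_vector, top_k=5): ...

    @abstractmethod
    def search_by_full_text(self, query, top_k=5): ...


_BACKENDS = {}


def register_backend(name):
    """Class decorator that makes a backend available under a short name."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


def create_vector_store(name, **kwargs):
    """Instantiate a registered backend, or fail loudly on an unknown name."""
    if name not in _BACKENDS:
        raise ValueError(f"unknown vector backend: {name!r}")
    return _BACKENDS[name](**kwargs)


@register_backend("memory")
class InMemoryStore(VectorStore):
    """Toy backend used only to exercise the factory."""

    def __init__(self, chunks=None):
        # Each chunk: {"id": str, "text": str, "vector": list[float]}
        self.chunks = list(chunks or [])

    def search_by_vector(self, query_vector, top_k=5):
        def score(chunk):
            # Plain dot product; a real store would use its own similarity.
            return sum(a * b for a, b in zip(query_vector, chunk["vector"]))
        return sorted(self.chunks, key=score, reverse=True)[:top_k]

    def search_by_full_text(self, query, top_k=5):
        needle = query.lower()
        return [c for c in self.chunks if needle in c["text"].lower()][:top_k]


store = create_vector_store("memory", chunks=[
    {"id": "c1", "text": "Hello RAG", "vector": [1.0, 0.0]},
    {"id": "c2", "text": "graph", "vector": [0.0, 1.0]},
])
print(store.search_by_vector([0.0, 1.0], top_k=1)[0]["id"])
```

With a registry like this, adding another store means adding one decorated class, and callers that go through `create_vector_store` stay unchanged.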

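4.2 "Present in the code but not listed in the architecture"

| # | Gap | Code location | Notes |
|---|---|---|---|
| 1 | 11 parsing strategies in `rag/app`, routed by doc_type | `rag/app/{naive,book,paper,qa,audio,picture,manual,laws,mail,one,textin_parser}.py` | The S1-T2 architecture only mentions "Loader / Parser" and does not cover MemoryBear's own doc_type routing scheme (book / paper / qa / audio, etc.) |
| 2 | MinerU third-party parser integration | `rag/deepdoc/parser/mineru_parser.py` | The Parser stage in the architecture does not mention MinerU (a third-party PDF parsing service) as an alternative for PDF parsing |
| 3 | TextIn third-party parser integration | `rag/app/textin_parser.py` | As above; the TextIn API is not mentioned as another PDF parsing fallback |
| 4 | `rag_utils` (chunk LLM analysis) | `api/app/core/rag_utils/` | The architecture assigns no place to this module; in practice it produces chunk summaries / tags / insights and is coupled to the Memory system |
| 5 | TOC (table of contents) smart extraction chain | `rag/prompts/generator.py:408-717` | A large amount of LLM-driven TOC detection / extraction / indexing / linking code; the architecture outline has no separate "TOC processing" stage |
| 6 | Crawler (web page fetching) | `rag/crawler/` | The Loader stage in the architecture may cover crawling, but at 1,200+ LOC it deserves a separate mention |
| 7 | `res/` static resources (NER, synonym table) | `rag/res/{ner.json,synonym.json,mapping.json}` | The architecture does not mention the resource files behind term weighting / synonym expansion |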

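5. Key Figures at a Glance

| Metric | Value |
|---|---|
| Total Python LOC in `api/app/core/rag/` | ~24,895 |
| Submodules in `api/app/core/rag/` | 15 (excluding `res/`) |
| `.md` prompt templates | 31 |
| Parser implementations | 11 (including 3 PDF strategies: deepdoc / mineru / textin) |
| LLM provider implementations | Chat 9 + Embed 10 + CV 7 + Seq2txt 6 = 32 provider classes |
| Workflow Knowledge retrieval types | PARTICIPLE / SEMANTIC / HYBRID / Graph (4 types) |
| GraphRAG modes | general (Microsoft GraphRAG) / light (LightRAG style) |
| VDB implementation | Elasticsearch 8.x (the only one) |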

The deliverable above has also been written to the local file WS-14-deliverable.md and can be reused directly as the baseline for Sprint-2 documentation.