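MemoryBear RAG · Source Reverse-Lookup Index (File Index)

This index maps source modules back to the documentation sections that cover them. When a developer modifies a file, they can look up every document that references it here and assess the "knowledge ripple" of the change in advance.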

1. Overview: code directory → document mapping

| Code directory | Primary responsibility | Primary document | Secondary references |
| --- | --- | --- | --- |
| api/app/core/rag/app/ | Multi-format parsing orchestrator | pipeline/01-loader-parser-chunking.md | overview/source-inventory.md |
| api/app/core/rag/common/ | Constants, token, settings | pipeline/01-loader-parser-chunking.md, evolution/architecture-refactor-suggestions.md §0.2 #4 / #2 | overview/source-inventory.md |
| api/app/core/rag/crawler/ | Web crawler | pipeline/01-loader-parser-chunking.md §4.1 | |
| api/app/core/rag/deepdoc/parser/ | Parsers for 11 format types | pipeline/01-loader-parser-chunking.md §5 | overview/source-inventory.md |
| api/app/core/rag/deepdoc/vision/ | OCR + layout + TSR | pipeline/01-loader-parser-chunking.md §5.6 | evolution/architecture-refactor-suggestions.md §0.2 #2 (HF_ENDPOINT) |
| api/app/core/rag/graphrag/ | GraphRAG shared utilities + graph search | pipeline/04-graphrag.md (pending delivery) | overview/source-inventory.md §3.3 |
| api/app/core/rag/graphrag/general/ | Microsoft GraphRAG-style pipeline | pipeline/04-graphrag.md §general (pending delivery) | overview/04-graphrag-indexing.mmd |
| api/app/core/rag/graphrag/light/ | LightRAG-style extractor | pipeline/04-graphrag.md §light (pending delivery) | Same as above |
| api/app/core/rag/integrations/feishu/ | Feishu SDK | pipeline/01-loader-parser-chunking.md §4 | |
| api/app/core/rag/integrations/yuque/ | Yuque SDK | Same as above | |
| api/app/core/rag/llm/ | Multi-model LLM facade | pipeline/05-reranking-prompt-llm.md §3 | evolution/architecture-refactor-suggestions.md #1, #5 |
| api/app/core/rag/models/ | Chunk data model | pipeline/01-loader-parser-chunking.md §3 | overview/source-inventory.md |
| api/app/core/rag/nlp/ | Chinese word segmentation, hybrid search dispatch | pipeline/03-vdb-and-retrieval.md §6, pipeline/05-reranking-prompt-llm.md §1.2 | evolution/architecture-refactor-suggestions.md #3 |
| api/app/core/rag/prompts/ | Prompt templates and factories | pipeline/05-reranking-prompt-llm.md §2 | |
| api/app/core/rag/utils/ | ES/Redis connections, LibreOffice | pipeline/03-vdb-and-retrieval.md, pipeline/01-loader-parser-chunking.md §4.2 | |
| api/app/core/rag/vdb/elasticsearch/ | ES vector + full-text search | pipeline/03-vdb-and-retrieval.md (entire document) | pipeline/02-embedding.md §5.4 |
| api/app/core/rag/res/ | NER / synonyms / mapping | pipeline/03-vdb-and-retrieval.md §3 | |
| api/app/core/models/ | Unified wrapper layer (Embedding / Rerank / LLM) | pipeline/02-embedding.md §1.2, pipeline/05-reranking-prompt-llm.md §1.2 | evolution/architecture-refactor-suggestions.md #1 |
| api/app/core/agent/ | LangChainAgent | pipeline/05-reranking-prompt-llm.md §3.4 | |
| api/app/core/workflow/nodes/knowledge/ | Workflow Knowledge node | pipeline/05-reranking-prompt-llm.md §3.4, pipeline/03-vdb-and-retrieval.md | evolution/architecture-refactor-suggestions.md #3 |
| api/app/core/rag_utils/ (note: not the same as rag/utils) | Chunk LLM analysis (coupled to the Memory system) | overview/source-inventory.md §rag_utils | evolution/future-extensions-roadmap.md D4 |
| api/app/core/memory/ | Conversational memory system (Ebbinghaus / ACT-R / Neo4j / langgraph) | evolution/future-extensions-roadmap.md D4 (referenced as a future extension) | |
| api/app/services/ | Business service layer | pipeline/05-reranking-prompt-llm.md §3.5 | |
| api/app/tasks.py | Celery task entry point | overview/source-inventory.md §3, pipeline/01-loader-parser-chunking.md §3.1 | evolution/future-extensions-roadmap.md D3 |
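
The directory-to-document mapping above can also be queried mechanically, for example from a pre-commit hook. The following is a minimal sketch, not part of the MemoryBear codebase: `INDEX` holds a few rows copied from the table, and `docs_for` is a hypothetical helper that returns the primary documents for the longest matching directory prefix.

```python
from typing import Dict, List

# A few rows copied from the overview table; a real tool would load all of them.
INDEX: Dict[str, List[str]] = {
    "api/app/core/rag/graphrag/": ["pipeline/04-graphrag.md"],
    "api/app/core/rag/graphrag/light/": ["pipeline/04-graphrag.md §light"],
    "api/app/core/rag/utils/": [
        "pipeline/03-vdb-and-retrieval.md",
        "pipeline/01-loader-parser-chunking.md §4.2",
    ],
    "api/app/core/rag_utils/": ["overview/source-inventory.md §rag_utils"],
}


def docs_for(path: str) -> List[str]:
    """Return the primary documents for the longest directory prefix of `path`."""
    matches = [prefix for prefix in INDEX if path.startswith(prefix)]
    if not matches:
        return []
    # The most specific (longest) prefix wins, e.g. graphrag/light/ over graphrag/.
    return INDEX[max(matches, key=len)]


print(docs_for("api/app/core/rag/graphrag/light/graph_extractor.py"))
```

Matching on the full prefix, including the trailing slash, keeps `rag_utils/` and `rag/utils/` apart even though their names are easy to confuse.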

2. Key files → document sections (fine-grained)

api/app/core/rag/app/naive.py

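| Line | Function / key code | Referenced docs |
| --- | --- | --- |
| :27 | `by_deepdoc()`: DeepDoc parsing path | pipeline/01-loader-parser-chunking.md §5.1 |
| :45 | `by_mineru()`: MinerU third-party parsing | Same as above §5.2 |
| :65 | `by_textln()`: TextIn third-party parsing | Same as above §5.3 |
| :257 | `naive.__call__()`: main parsing entry point | Same as above §3 |
| :508-738 | `chunk()`: 11-way if/elif dispatch that picks a parser by file extension | Same as above §3, evolution/architecture-refactor-suggestions.md §0.2 #5 / refactor proposal #5 |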
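
To illustrate what replacing the if/elif dispatch in `chunk()` might look like, here is a minimal registry sketch. It is an assumption about the general shape only: `register`, `parse_text`, `parse_csv`, and the `ParserFn` signature are hypothetical and do not come from `naive.py`.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical parser signature: raw bytes in, extracted text out.
ParserFn = Callable[[bytes], str]

_PARSERS: Dict[str, ParserFn] = {}


def register(*extensions: str) -> Callable[[ParserFn], ParserFn]:
    """Decorator that maps one or more file extensions to a parser function."""
    def decorator(fn: ParserFn) -> ParserFn:
        for ext in extensions:
            _PARSERS[ext.lower()] = fn
        return fn
    return decorator


@register(".txt", ".md")
def parse_text(data: bytes) -> str:
    return data.decode("utf-8")


@register(".csv")
def parse_csv(data: bytes) -> str:
    # Toy transformation standing in for a real CSV parser.
    return data.decode("utf-8").replace(",", "\t")


def chunk(filename: str, data: bytes) -> str:
    """Pick the parser by extension via a table lookup instead of an if/elif chain."""
    ext = Path(filename).suffix.lower()
    try:
        parser = _PARSERS[ext]
    except KeyError:
        raise ValueError(f"no parser registered for {ext!r}") from None
    return parser(data)
```

With this shape, supporting a new format means adding one decorated function rather than another branch inside `chunk()`.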

api/app/core/rag/llm/embedding_model.py

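| Line | Class / function | Referenced docs |
| --- | --- | --- |
| :14-38 | `Base`: Embedding abstract base class (legacy) | pipeline/02-embedding.md §5.1 |
| :50-65 | `OpenAIEmbed.encode()`: OpenAI-compatible Embedding implementation | Same as above §5.2, evolution/architecture-refactor-suggestions.md #1 / #4 / #9 |
| :138-143 | `QWenEmbed`: DashScope Embedding (with 5 explicit retries) | pipeline/02-embedding.md §3.2 |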

api/app/core/models/embedding.py

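| Line | Class / function | Referenced docs |
| --- | --- | --- |
| :9-23 | `RedBearEmbeddings.__init__`: initialization of the unified LangChain wrapper | pipeline/02-embedding.md §1.2 / §5.3 |
| :65-78 | `embed_documents()`: document-side Embedding (includes the Volcengine multimodal branch) | Same as above §2.1 |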

api/app/core/rag/vdb/elasticsearch/elasticsearch_vector.py

| Line | Class / function | Referenced docs |
| --- | --- | --- |
| :29 | `ElasticSearchVector`: main ES vector implementation | pipeline/03-vdb-and-retrieval.md §1 |
| :55-63 | `add_chunks()`: vector ingestion | Same as above §4, pipeline/02-embedding.md §2.1, evolution/architecture-refactor-suggestions.md #4 |
| :374-380 | `search_by_vector()`: vector search | pipeline/03-vdb-and-retrieval.md §6, pipeline/02-embedding.md §2.2 |
| :468 | `search_by_full_text()`: BM25 search | pipeline/03-vdb-and-retrieval.md §5 |
| :560-607 | `rerank()`: ES-layer rerank | pipeline/05-reranking-prompt-llm.md §1.2 D, evolution/architecture-refactor-suggestions.md #3 |
| :653-658 | `dense_vector` mapping: the `dense_vector` dimension is determined dynamically | pipeline/02-embedding.md §3.4, pipeline/03-vdb-and-retrieval.md §3 |
| :666 | `ElasticSearchVectorFactory`: factory class | overview/source-inventory.md, pipeline/03-vdb-and-retrieval.md §1 |
| :685-707 | ES configuration environment variables: 6 ES-related env vars | evolution/architecture-refactor-suggestions.md §0.2 #2 |

api/app/core/rag/nlp/search.py

| Line | Class / function | Referenced docs |
| --- | --- | --- |
| :36-147 | `knowledge_retrieval()`: knowledge retrieval entry point (legacy channel) | pipeline/05-reranking-prompt-llm.md §1.2 |
| :284-343 | `rerank()`: module-level rerank | Same as above |
| :349 | `Dealer`: BM25/Hybrid dispatcher | pipeline/03-vdb-and-retrieval.md §6, overview/source-inventory.md §一 |
| :365-373 | `get_vector()`: calls `encode_queries` on the legacy Embedding interface | pipeline/02-embedding.md §2.4 |
| :387 | `search()`: main search | pipeline/03-vdb-and-retrieval.md §6 |
| :439 | `FusionExpr("weighted_sum")`: hard-coded 0.05/0.95 weights | pipeline/03-vdb-and-retrieval.md §6, evolution/future-extensions-roadmap.md D2 |
| :489-577 | `insert_citations()`: citation backfill (embedding-similarity matching) | pipeline/05-reranking-prompt-llm.md §4.1 |
| :579-604 | `_rank_feature_scores()`: tag TF-IDF + PageRank | pipeline/05-reranking-prompt-llm.md §1.2 A |
| :606-643 | `Dealer.rerank()`: built-in hybrid rerank (fused score) | Same as above, evolution/architecture-refactor-suggestions.md #3 |
| :645-666 | `rerank_by_model()`: external-model rerank | pipeline/05-reranking-prompt-llm.md §1.2 B |
| :674-768 | `retrieval()`: main retrieval flow | Same as above §1.3 |
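
The `weighted_sum` fusion at `:439` combines two scores with fixed weights. The sketch below shows weighted-sum fusion with the weights passed in rather than hard-coded. It assumes that 0.05 applies to the full-text score and 0.95 to the vector score, which this index does not state; `FusionWeights` and `weighted_sum` are hypothetical names, not the project's API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FusionWeights:
    # Assumed assignment of the 0.05/0.95 pair; the index does not say which is which.
    full_text: float = 0.05
    vector: float = 0.95


def weighted_sum(
    text_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    weights: FusionWeights = FusionWeights(),
) -> Dict[str, float]:
    """Fuse per-chunk scores; a chunk missing from one side contributes 0 there."""
    chunk_ids = set(text_scores) | set(vector_scores)
    return {
        cid: weights.full_text * text_scores.get(cid, 0.0)
        + weights.vector * vector_scores.get(cid, 0.0)
        for cid in chunk_ids
    }


# Default weights, then an evenly balanced override supplied by the caller.
default = weighted_sum({"a": 1.0}, {"a": 1.0, "b": 0.5})
balanced = weighted_sum({"a": 1.0}, {"a": 1.0, "b": 0.5}, FusionWeights(0.5, 0.5))
```

Passing the weights as a parameter is one way to make them configurable per knowledge base or per request without touching the fusion code.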

api/app/core/workflow/nodes/knowledge/node.py

| Line | Class / function | Referenced docs |
| --- | --- | --- |
| :12 | `import OpenAIEmbed`: hard-coded import of the legacy Embedding class | evolution/architecture-refactor-suggestions.md #1 |
| :14 | `import ElasticSearchVectorFactory`: bypasses the BaseVector abstraction | Same as above, §0.2 #1 / refactor proposal #2 |
| :29 | `KnowledgeRetrievalNode`: main Workflow node class | pipeline/05-reranking-prompt-llm.md §3.4 |
| :54 | `_extract_input()`: renders the query template | Same as above |
| :108-155 | `KnowledgeRetrievalNode.rerank()`: node-level rerank | Same as above §1.2 C, evolution/architecture-refactor-suggestions.md #3 |
| :157-193 | `get_reranker_model()`: queries the DB on every call | evolution/architecture-refactor-suggestions.md §0.2 #4 |
| :195-263 | `knowledge_retrieval()`: retrieval branches (PARTICIPLE / SEMANTIC / HYBRID / Graph) | pipeline/05-reranking-prompt-llm.md §3.4, pipeline/03-vdb-and-retrieval.md |
| :236-271 | HYBRID branch: vector + full_text in parallel → dedup → rerank | Same as above |
| :284 | `rerank()` module-level function: one of the three rerank tracks | evolution/architecture-refactor-suggestions.md #3 |
| :303-378 | `execute()`: node execution entry point | pipeline/05-reranking-prompt-llm.md §3.4 |
| :327 | `print(reranked_docs)`: ⚠️ debug leftover | evolution/architecture-refactor-suggestions.md #3 / #10 (hot-fix candidate) |
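
The HYBRID branch is summarized above as vector + full_text → dedup → rerank. The following sketch covers only the merge-and-dedup step, under stated assumptions: hits are `(chunk_id, score)` pairs, duplicates are identified by chunk id, and the higher score is kept. The final sort is a stand-in for the rerank step. None of these details are confirmed by this index, and `merge_and_dedup` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (chunk_id, score); an assumed shape, not the node's real type


def merge_and_dedup(vector_hits: Iterable[Hit], full_text_hits: Iterable[Hit]) -> List[Hit]:
    """Merge both result lists, keep the best score per chunk id, sort descending."""
    best: Dict[str, float] = {}
    for chunk_id, score in list(vector_hits) + list(full_text_hits):
        if chunk_id not in best or score > best[chunk_id]:
            best[chunk_id] = score
    # Sorting by score stands in for the rerank step that follows in the real node.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


merged = merge_and_dedup(
    [("c1", 0.9), ("c2", 0.4)],
    [("c2", 0.7), ("c3", 0.1)],
)
```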

api/app/core/rag/graphrag/

| Line | Class / function | Referenced docs |
| --- | --- | --- |
| general/index.py:36 | `run_graphrag()`: main GraphRAG entry point (doc level) | pipeline/04-graphrag.md §general (pending delivery) |
| general/index.py:122 | `run_graphrag_for_kb()`: KB level | Same as above |
| general/graph_extractor.py:34 | `GraphExtractor`: Microsoft-style extraction | Same as above |
| general/community_reports_extractor.py:37 | Community reports | Same as above |
| light/graph_extractor.py:31 | `GraphExtractor`: LightRAG-style extraction | Same as above §light |
| entity_resolution.py:31 | `EntityResolution`: entity disambiguation | Same as above |
| search.py:19 | `KGSearch`: graph search | Same as above |
| utils.py:41 | `chat_limiter`: Trio rate limiting | pipeline/02-embedding.md §3.1, evolution/architecture-refactor-suggestions.md #9 |
| utils.py:115-134 | `get/set_embed_cache`: Redis Embedding cache | pipeline/02-embedding.md §3.3, evolution/architecture-refactor-suggestions.md #4 |
| utils.py:301-327 | `graph_node_to_chunk()`: entity node → vector → ES | pipeline/02-embedding.md §2.3 |

api/app/core/rag/llm/chat_model.py

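| Line | Class / function | Referenced docs |
| --- | --- | --- |
| :52 | `Base`: LLM abstract base class | pipeline/05-reranking-prompt-llm.md §3.1 |
| :54-58 | `LLM_TIMEOUT_SECONDS` / `LLM_MAX_RETRIES`: timeout and retries | Same as above §3.3, evolution/architecture-refactor-suggestions.md §0.2 #2 |
| :122-150 | `_chat()`: non-streaming LLM call | pipeline/05-reranking-prompt-llm.md §3.2 |
| :152-185 | `_chat_streamly()`: streaming LLM call | Same as above |
| :251-303 | `chat_with_tools()`: tool calling | Same as above §3.4 |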

api/app/core/rag/prompts/

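| File | Function | Referenced docs |
| --- | --- | --- |
| template.py:9 | `load_prompt()`: loads the .md templates at startup | pipeline/05-reranking-prompt-llm.md §2.1 |
| generator.py | 20+ Prompt factory functions (citation/keyword/...) | Same as above |
| `*.md` (31 templates) | Prompt content | overview/source-inventory.md |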

api/app/core/rag/common/settings.py

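| Line | Key code | Referenced docs |
| --- | --- | --- |
| :9-10 | `retriever` / `kg_retriever`: process-level singletons | evolution/architecture-refactor-suggestions.md §0.2 #4 |
| :13 | `init_settings()`: side effect at module import | Same as above, pipeline/03-vdb-and-retrieval.md |
| :24 | Trigger site | evolution/architecture-refactor-suggestions.md #8 |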

api/app/services/draft_run_service.py

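| Line | Key code | Referenced docs |
| --- | --- | --- |
| :195-263 | `create_knowledge_retrieval_tool()`: knowledge retrieval tool | pipeline/05-reranking-prompt-llm.md §3.5 |
| :227-255 | Chunk concatenation: chunks joined with `\n\n` | Same as above §2.3 |
| :474-490 | `_filter_citations()`: citation filtering + download links | Same as above §4.2 |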

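3. Currently identified "code leftovers and fix tasks"

| # | File:line | Problem | Suggested fix | Related |
| --- | --- | --- | --- | --- |
| 1 | workflow/nodes/knowledge/node.py:327 | `print(reranked_docs)` debug leftover | File a hot-fix PR immediately to remove it | S3-T1 #10 + S3-T1 §3.1 |
| 2 | chat_model.py, provider subclasses | `base_url` and authentication headers are hard-coded | Introduce a Plugin Registry | S3-T1 #5 |
| 3 | naive.py:508-738 | `chunk()` hard-codes an 11-way if/elif | Parser Protocol | S3-T1 #5 |
| 4 | elasticsearch_vector.py:55-63 | `add_chunks` runs a synchronous loop with no concurrency | Switch to trio coroutines + a shared `chat_limiter` | S3-T1 #9 |
| 5 | nlp/search.py:439 | `weighted_sum` hard-codes 0.05/0.95 | Inject via `ctx.fusion_weights` | S3-T2 D2 |
| 6 | rag_utils/ vs rag/utils/ | Naming conflict | Rename to rag/chunk_analytics/ or merge | S1-T3 §4.1 |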
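
Row 4 proposes trio coroutines with a shared `chat_limiter`. The sketch below shows the same bounded-concurrency idea, but with the standard library's `asyncio` and `asyncio.Semaphore` swapped in for trio so that it runs without extra dependencies. `embed_batch`, `add_chunks_concurrently`, and the fake embedding (text length) are hypothetical placeholders, not the project's code.

```python
import asyncio
from typing import List


async def embed_batch(batch: List[str], limiter: asyncio.Semaphore) -> List[int]:
    """Embed one batch while holding a slot in the shared limiter."""
    async with limiter:
        await asyncio.sleep(0)  # placeholder for the remote embedding call
        return [len(text) for text in batch]  # fake "embedding": the text length


async def add_chunks_concurrently(
    texts: List[str], batch_size: int = 2, max_concurrency: int = 2
) -> List[int]:
    """Split texts into batches and process them with at most `max_concurrency` in flight."""
    limiter = asyncio.Semaphore(max_concurrency)
    batches = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    # gather() returns results in submission order, so chunk order is preserved.
    results = await asyncio.gather(*(embed_batch(b, limiter) for b in batches))
    return [vector for batch in results for vector in batch]


vectors = asyncio.run(add_chunks_concurrently(["a", "bb", "ccc"]))
```

In a trio-based version, `trio.CapacityLimiter` and a nursery would play the roles that the semaphore and `gather()` play here.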

File Index · v1.0-RC1 · 2026-05-08