Files
MemoryBear/docs/rag/_indexes/file-index.md
Multica PM Agent 343a5eebe3
Some checks failed
Sync to Gitee / sync (push) Has been cancelled
docs(rag): add MemoryBear RAG implementation docs v1.0
Submit the formed RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing + total index (responsibility matrix,
  review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks hit). The legacy docs/ ignore in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
2026-05-09 10:51:48 +08:00

167 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# MemoryBear RAG · 源码反查索引File Index
> 从源码模块反查到对应的文档章节。开发者修改某个文件时,可在此查到所有引用该文件的文档,提前评估改动的"知识涟漪"。
## 1. 总览:代码目录 → 文档映射
| 代码目录 | 主要责任 | 主导文档 | 次要引用 |
|---|---|---|---|
| `api/app/core/rag/app/` | 多格式解析 orchestrator | `pipeline/01-loader-parser-chunking.md` | `overview/source-inventory.md` |
| `api/app/core/rag/common/` | 常量、token、settings | `pipeline/01-loader-parser-chunking.md`, `evolution/architecture-refactor-suggestions.md` §0.2 #4 / #2 | `overview/source-inventory.md` |
| `api/app/core/rag/crawler/` | Web 爬虫 | `pipeline/01-loader-parser-chunking.md` §4.1 | — |
| `api/app/core/rag/deepdoc/parser/` | 11 类格式解析 | `pipeline/01-loader-parser-chunking.md` §5 | `overview/source-inventory.md` |
| `api/app/core/rag/deepdoc/vision/` | OCR + 版面 + TSR | `pipeline/01-loader-parser-chunking.md` §5.6 | `evolution/architecture-refactor-suggestions.md` §0.2 #2HF_ENDPOINT |
| `api/app/core/rag/graphrag/` | GraphRAG 共享工具 + 图搜索 | `pipeline/04-graphrag.md`(待交付) | `overview/source-inventory.md` §3.3 |
| `api/app/core/rag/graphrag/general/` | Microsoft GraphRAG 风格流水线 | `pipeline/04-graphrag.md` §general待交付 | `overview/04-graphrag-indexing.mmd` |
| `api/app/core/rag/graphrag/light/` | LightRAG 风格抽取器 | `pipeline/04-graphrag.md` §light待交付 | 同上 |
| `api/app/core/rag/integrations/feishu/` | 飞书 SDK | `pipeline/01-loader-parser-chunking.md` §4 | — |
| `api/app/core/rag/integrations/yuque/` | 语雀 SDK | 同上 | — |
| `api/app/core/rag/llm/` | LLM 多模型 facade | `pipeline/05-reranking-prompt-llm.md` §3 | `evolution/architecture-refactor-suggestions.md` #1, #5 |
| `api/app/core/rag/models/` | Chunk 数据模型 | `pipeline/01-loader-parser-chunking.md` §3 | `overview/source-inventory.md` |
| `api/app/core/rag/nlp/` | 中文分词、Hybrid 搜索调度 | `pipeline/03-vdb-and-retrieval.md` §6, `pipeline/05-reranking-prompt-llm.md` §1.2 | `evolution/architecture-refactor-suggestions.md` #3 |
| `api/app/core/rag/prompts/` | Prompt 模板与工厂 | `pipeline/05-reranking-prompt-llm.md` §2 | — |
| `api/app/core/rag/utils/` | ES/Redis 连接、LibreOffice | `pipeline/03-vdb-and-retrieval.md`, `pipeline/01-loader-parser-chunking.md` §4.2 | — |
| `api/app/core/rag/vdb/elasticsearch/` | ES 向量+全文 | `pipeline/03-vdb-and-retrieval.md` 全文 | `pipeline/02-embedding.md` §5.4 |
| `api/app/core/rag/res/` | NER / 同义词 / mapping | `pipeline/03-vdb-and-retrieval.md` §3 | — |
| `api/app/core/models/` | 统一封装层Embedding / Rerank / LLM | `pipeline/02-embedding.md` §1.2, `pipeline/05-reranking-prompt-llm.md` §1.2 | `evolution/architecture-refactor-suggestions.md` #1 |
| `api/app/core/agent/` | LangChainAgent | `pipeline/05-reranking-prompt-llm.md` §3.4 | — |
| `api/app/core/workflow/nodes/knowledge/` | Workflow Knowledge 节点 | `pipeline/05-reranking-prompt-llm.md` §3.4, `pipeline/03-vdb-and-retrieval.md` | `evolution/architecture-refactor-suggestions.md` #3 |
| `api/app/core/rag_utils/`(注意与 `rag/utils` 不同) | Chunk LLM 分析(与 Memory 系统耦合) | `overview/source-inventory.md` §rag_utils | `evolution/future-extensions-roadmap.md` D4 |
| `api/app/core/memory/` | 对话内存系统Ebbinghaus / ACT-R / Neo4j / langgraph | `evolution/future-extensions-roadmap.md` D4未来扩展引用 | — |
| `api/app/services/` | 业务服务层 | `pipeline/05-reranking-prompt-llm.md` §3.5 | — |
| `api/app/tasks.py` | Celery 任务入口 | `overview/source-inventory.md` §3, `pipeline/01-loader-parser-chunking.md` §3.1 | `evolution/future-extensions-roadmap.md` D3 |
## 2. 关键文件 → 文档章节(细粒度)
### `api/app/core/rag/app/naive.py`
| 行号 | 函数 / 关键代码 | 引用文档 |
|---|---|---|
| `:27 by_deepdoc()` | DeepDoc 解析路径 | `pipeline/01-loader-parser-chunking.md` §5.1 |
| `:45 by_mineru()` | MinerU 第三方解析 | 同上 §5.2 |
| `:65 by_textln()` | TextIn 第三方解析 | 同上 §5.3 |
| `:257 naive.__call__()` | 主解析入口 | 同上 §3 |
| `:508-738 chunk()` | 11 路 if/elif 分发,按扩展名挑 parser | 同上 §3, `evolution/architecture-refactor-suggestions.md` §0.2 #5 / #5 改造建议 |
### `api/app/core/rag/llm/embedding_model.py`
| 行号 | 类 / 函数 | 引用文档 |
|---|---|---|
| `:14-38 Base` | Embedding 抽象基类(旧) | `pipeline/02-embedding.md` §5.1 |
| `:50-65 OpenAIEmbed.encode()` | OpenAI 兼容 Embedding 实现 | 同上 §5.2, `evolution/architecture-refactor-suggestions.md` #1 / #4 / #9 |
| `:138-143 QWenEmbed` | DashScope Embedding含显式 5 次重试) | `pipeline/02-embedding.md` §3.2 |
### `api/app/core/models/embedding.py`
| 行号 | 类 / 函数 | 引用文档 |
|---|---|---|
| `:9-23 RedBearEmbeddings.__init__` | LangChain 统一封装初始化 | `pipeline/02-embedding.md` §1.2 / §5.3 |
| `:65-78 embed_documents()` | 文档侧 Embedding含火山多模态分支 | 同上 §2.1 |
### `api/app/core/rag/vdb/elasticsearch/elasticsearch_vector.py`
| 行号 | 类 / 函数 | 引用文档 |
|---|---|---|
| `:29 ElasticSearchVector` | ES 向量主实现 | `pipeline/03-vdb-and-retrieval.md` §1 |
| `:55-63 add_chunks()` | 向量入库 | 同上 §4, `pipeline/02-embedding.md` §2.1, `evolution/architecture-refactor-suggestions.md` #4 |
| `:374-380 search_by_vector()` | 向量检索 | `pipeline/03-vdb-and-retrieval.md` §6, `pipeline/02-embedding.md` §2.2 |
| `:468 search_by_full_text()` | BM25 检索 | `pipeline/03-vdb-and-retrieval.md` §5 |
| `:560-607 rerank()` | ES 层 rerank | `pipeline/05-reranking-prompt-llm.md` §1.2 D, `evolution/architecture-refactor-suggestions.md` #3 |
| `:653-658 dense_vector mapping` | dense_vector 维度动态决定 | `pipeline/02-embedding.md` §3.4, `pipeline/03-vdb-and-retrieval.md` §3 |
| `:666 ElasticSearchVectorFactory` | 工厂类 | `overview/source-inventory.md`, `pipeline/03-vdb-and-retrieval.md` §1 |
| `:685-707 ES 配置环境变量` | 6 个 ES 相关 env vars | `evolution/architecture-refactor-suggestions.md` §0.2 #2 |
### `api/app/core/rag/nlp/search.py`
| 行号 | 类 / 函数 | 引用文档 |
|---|---|---|
| `:36-147 knowledge_retrieval()` | 知识检索入口(旧通道) | `pipeline/05-reranking-prompt-llm.md` §1.2 |
| `:284-343 rerank()` | 模块级 rerank | 同上 |
| `:349 Dealer` | BM25/Hybrid 调度器 | `pipeline/03-vdb-and-retrieval.md` §6, `overview/source-inventory.md` §一 |
| `:365-373 get_vector()` | 调用旧 Embedding 接口的 `encode_queries` | `pipeline/02-embedding.md` §2.4 |
| `:387 search()` | 主 search | `pipeline/03-vdb-and-retrieval.md` §6 |
| `:439 FusionExpr("weighted_sum")` | 0.05/0.95 硬编码权重 | `pipeline/03-vdb-and-retrieval.md` §6, `evolution/future-extensions-roadmap.md` D2 |
| `:489-577 insert_citations()` | 引用回填embedding 相似度匹配) | `pipeline/05-reranking-prompt-llm.md` §4.1 |
| `:579-604 _rank_feature_scores()` | tag TF-IDF + PageRank | `pipeline/05-reranking-prompt-llm.md` §1.2 A |
| `:606-643 Dealer.rerank()` | 内置混合 rerank融合分数 | 同上, `evolution/architecture-refactor-suggestions.md` #3 |
| `:645-666 rerank_by_model()` | 外部模型 rerank | `pipeline/05-reranking-prompt-llm.md` §1.2 B |
| `:674-768 retrieval()` | 检索主流程 | 同上 §1.3 |
### `api/app/core/workflow/nodes/knowledge/node.py`
| 行号 | 类 / 函数 | 引用文档 |
|---|---|---|
| `:12 import OpenAIEmbed` | 硬编码导入旧 Embedding 类 | `evolution/architecture-refactor-suggestions.md` #1 |
| `:14 import ElasticSearchVectorFactory` | 绕过 BaseVector 抽象 | 同上 §0.2 #1 / #2 改造建议 |
| `:29 KnowledgeRetrievalNode` | Workflow 节点主类 | `pipeline/05-reranking-prompt-llm.md` §3.4 |
| `:54 _extract_input()` | 渲染 query 模板 | 同上 |
| `:108-155 KnowledgeRetrievalNode.rerank()` | 节点级 rerank | 同上 §1.2 C, `evolution/architecture-refactor-suggestions.md` #3 |
| `:157-193 get_reranker_model()` | 每次调用都查 DB | `evolution/architecture-refactor-suggestions.md` §0.2 #4 |
| `:195-263 knowledge_retrieval()` | 检索分支PARTICIPLE / SEMANTIC / HYBRID / Graph | `pipeline/05-reranking-prompt-llm.md` §3.4, `pipeline/03-vdb-and-retrieval.md` |
| `:236-271 HYBRID 分支` | vector + full_text 并行 → dedup → rerank | 同上 |
| `:284 rerank()` 模块级函数 | 三轨 rerank 之一 | `evolution/architecture-refactor-suggestions.md` #3 |
| `:303-378 execute()` | 节点执行入口 | `pipeline/05-reranking-prompt-llm.md` §3.4 |
| `:327 print(reranked_docs)` ⚠️ | 调试残留 | `evolution/architecture-refactor-suggestions.md` #3 / #10hot-fix 候选) |
### `api/app/core/rag/graphrag/`
| 行号 | 类 / 函数 | 引用文档 |
|---|---|---|
| `general/index.py:36 run_graphrag()` | GraphRAG 主入口doc 级) | `pipeline/04-graphrag.md` §general待交付 |
| `general/index.py:122 run_graphrag_for_kb()` | KB 级 | 同上 |
| `general/graph_extractor.py:34 GraphExtractor` | Microsoft 风格抽取 | 同上 |
| `general/community_reports_extractor.py:37` | 社区报告 | 同上 |
| `light/graph_extractor.py:31 GraphExtractor` | LightRAG 风格抽取 | 同上 §light |
| `entity_resolution.py:31 EntityResolution` | 实体消歧 | 同上 |
| `search.py:19 KGSearch` | 图检索 | 同上 |
| `utils.py:41 chat_limiter` | Trio 限流 | `pipeline/02-embedding.md` §3.1, `evolution/architecture-refactor-suggestions.md` #9 |
| `utils.py:115-134 get/set_embed_cache` | Redis Embedding 缓存 | `pipeline/02-embedding.md` §3.3, `evolution/architecture-refactor-suggestions.md` #4 |
| `utils.py:301-327 graph_node_to_chunk()` | 实体节点 → 向量 → ES | `pipeline/02-embedding.md` §2.3 |
### `api/app/core/rag/llm/chat_model.py`
| 行号 | 类 / 函数 | 引用文档 |
|---|---|---|
| `:52 Base` | LLM 抽象基类 | `pipeline/05-reranking-prompt-llm.md` §3.1 |
| `:54-58 LLM_TIMEOUT_SECONDS / LLM_MAX_RETRIES` | 超时与重试 | 同上 §3.3, `evolution/architecture-refactor-suggestions.md` §0.2 #2 |
| `:122-150 _chat()` | 非流式 LLM 调用 | `pipeline/05-reranking-prompt-llm.md` §3.2 |
| `:152-185 _chat_streamly()` | 流式 LLM 调用 | 同上 |
| `:251-303 chat_with_tools()` | 工具调用 | 同上 §3.4 |
### `api/app/core/rag/prompts/`
| 文件 | 功能 | 引用文档 |
|---|---|---|
| `template.py:9 load_prompt()` | 启动时加载 .md 模板 | `pipeline/05-reranking-prompt-llm.md` §2.1 |
| `generator.py` | 20+ Prompt 工厂函数citation/keyword/... | 同上 |
| `*.md`31 个模板) | Prompt 内容 | `overview/source-inventory.md` |
### `api/app/core/rag/common/settings.py`
| 行号 | 关键代码 | 引用文档 |
|---|---|---|
| `:9-10 retriever / kg_retriever` | 进程级单例 | `evolution/architecture-refactor-suggestions.md` §0.2 #4 |
| `:13 init_settings()` | 模块导入时副作用 | 同上, `pipeline/03-vdb-and-retrieval.md` |
| `:24` 触发位置 | — | `evolution/architecture-refactor-suggestions.md` #8 |
### `api/app/services/draft_run_service.py`
| 行号 | 关键代码 | 引用文档 |
|---|---|---|
| `:195-263 create_knowledge_retrieval_tool()` | 知识检索工具 | `pipeline/05-reranking-prompt-llm.md` §3.5 |
| `:227-255` chunk 拼接 | `\n\n` 分隔 chunks | 同上 §2.3 |
| `:474-490 _filter_citations()` | 引用过滤 + 下载链接 | 同上 §4.2 |
## 3. 当前已识别的"代码残留与修复任务"
| # | 文件:行 | 问题 | 修复建议 | 关联 |
|---|---|---|---|---|
| 1 | `workflow/nodes/knowledge/node.py:327` | `print(reranked_docs)` 调试残留 | 立即提 hot-fix PR 删除 | S3-T1 #10 + S3-T1 §3.1 |
| 2 | `chat_model.py` 各 provider 子类 | base_url 与认证 header 硬编码 | 引入 Plugin Registry | S3-T1 #5 |
| 3 | `naive.py:508-738 chunk()` | 11 路 if/elif 硬编码 | 抽 `Parser` Protocol | S3-T1 #5 |
| 4 | `elasticsearch_vector.py:55-63 add_chunks` | 同步循环,无并发 | 改 trio 协程 + 共享 chat_limiter | S3-T1 #9 |
| 5 | `nlp/search.py:439` | `weighted_sum` 0.05/0.95 硬编码 | 改为 ctx.fusion_weights 注入 | S3-T2 D2 |
| 6 | `rag_utils/` vs `rag/utils/` | 命名冲突 | 重命名为 `rag/chunk_analytics/` 或合并 | S1-T3 §4.1 |
**File Index · v1.0-RC1 · 2026-05-08**