Some checks failed
Sync to Gitee / sync (push) Has been cancelled
Submit the formed RAG documentation set produced across Sprint-1/2/3 (WS-12 through WS-26) under docs/rag/. Includes: - README.md / INDEX.md: landing + total index (responsibility matrix, review verdicts, dual-link to source issues) - overview/: full-pipeline architecture (4 .mmd diagrams), 11-stage boundary contracts, doc map, source-code inventory - pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding, VDB & retrieval, GraphRAG, Rerank/Prompt/LLM) - graphrag/, end-to-end/: v1.0 formal versions with full source retained as reference - evolution/: 11 architecture-refactor proposals, 6-direction roadmap, capability map - review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary - _indexes/: glossary (81 terms), source->doc reverse index, chart index - _release/: v1.0-RC1 release manifest, versioning convention, ops & freshness plan - _meta/README.md: placeholder noting WS-12 governance assets gap Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot checks hit). The legacy docs/ ignore in .gitignore is narrowed to docs/* with an explicit allowlist for docs/rag/. Refs: WS-26 Co-authored-by: multica-agent <github@multica.ai>
167 lines
12 KiB
Markdown
167 lines
12 KiB
Markdown
# MemoryBear RAG · 源码反查索引(File Index)
|
||
|
||
> 从源码模块反查到对应的文档章节。开发者修改某个文件时,可在此查到所有引用该文件的文档,提前评估改动的"知识涟漪"。
|
||
|
||
## 1. 总览:代码目录 → 文档映射
|
||
|
||
| 代码目录 | 主要责任 | 主导文档 | 次要引用 |
|
||
|---|---|---|---|
|
||
| `api/app/core/rag/app/` | 多格式解析 orchestrator | `pipeline/01-loader-parser-chunking.md` | `overview/source-inventory.md` |
|
||
| `api/app/core/rag/common/` | 常量、token、settings | `pipeline/01-loader-parser-chunking.md`, `evolution/architecture-refactor-suggestions.md` §0.2 #4 / #2 | `overview/source-inventory.md` |
|
||
| `api/app/core/rag/crawler/` | Web 爬虫 | `pipeline/01-loader-parser-chunking.md` §4.1 | — |
|
||
| `api/app/core/rag/deepdoc/parser/` | 11 类格式解析 | `pipeline/01-loader-parser-chunking.md` §5 | `overview/source-inventory.md` |
|
||
| `api/app/core/rag/deepdoc/vision/` | OCR + 版面 + TSR | `pipeline/01-loader-parser-chunking.md` §5.6 | `evolution/architecture-refactor-suggestions.md` §0.2 #2(HF_ENDPOINT) |
|
||
| `api/app/core/rag/graphrag/` | GraphRAG 共享工具 + 图搜索 | `pipeline/04-graphrag.md`(待交付) | `overview/source-inventory.md` §3.3 |
|
||
| `api/app/core/rag/graphrag/general/` | Microsoft GraphRAG 风格流水线 | `pipeline/04-graphrag.md` §general(待交付) | `overview/04-graphrag-indexing.mmd` |
|
||
| `api/app/core/rag/graphrag/light/` | LightRAG 风格抽取器 | `pipeline/04-graphrag.md` §light(待交付) | 同上 |
|
||
| `api/app/core/rag/integrations/feishu/` | 飞书 SDK | `pipeline/01-loader-parser-chunking.md` §4 | — |
|
||
| `api/app/core/rag/integrations/yuque/` | 语雀 SDK | 同上 | — |
|
||
| `api/app/core/rag/llm/` | LLM 多模型 facade | `pipeline/05-reranking-prompt-llm.md` §3 | `evolution/architecture-refactor-suggestions.md` #1, #5 |
|
||
| `api/app/core/rag/models/` | Chunk 数据模型 | `pipeline/01-loader-parser-chunking.md` §3 | `overview/source-inventory.md` |
|
||
| `api/app/core/rag/nlp/` | 中文分词、Hybrid 搜索调度 | `pipeline/03-vdb-and-retrieval.md` §6, `pipeline/05-reranking-prompt-llm.md` §1.2 | `evolution/architecture-refactor-suggestions.md` #3 |
|
||
| `api/app/core/rag/prompts/` | Prompt 模板与工厂 | `pipeline/05-reranking-prompt-llm.md` §2 | — |
|
||
| `api/app/core/rag/utils/` | ES/Redis 连接、LibreOffice | `pipeline/03-vdb-and-retrieval.md`, `pipeline/01-loader-parser-chunking.md` §4.2 | — |
|
||
| `api/app/core/rag/vdb/elasticsearch/` | ES 向量+全文 | `pipeline/03-vdb-and-retrieval.md` 全文 | `pipeline/02-embedding.md` §5.4 |
|
||
| `api/app/core/rag/res/` | NER / 同义词 / mapping | `pipeline/03-vdb-and-retrieval.md` §3 | — |
|
||
| `api/app/core/models/` | 统一封装层(Embedding / Rerank / LLM) | `pipeline/02-embedding.md` §1.2, `pipeline/05-reranking-prompt-llm.md` §1.2 | `evolution/architecture-refactor-suggestions.md` #1 |
|
||
| `api/app/core/agent/` | LangChainAgent | `pipeline/05-reranking-prompt-llm.md` §3.4 | — |
|
||
| `api/app/core/workflow/nodes/knowledge/` | Workflow Knowledge 节点 | `pipeline/05-reranking-prompt-llm.md` §3.4, `pipeline/03-vdb-and-retrieval.md` | `evolution/architecture-refactor-suggestions.md` #3 |
|
||
| `api/app/core/rag_utils/`(注意与 `rag/utils` 不同) | Chunk LLM 分析(与 Memory 系统耦合) | `overview/source-inventory.md` §rag_utils | `evolution/future-extensions-roadmap.md` D4 |
|
||
| `api/app/core/memory/` | 对话内存系统(Ebbinghaus / ACT-R / Neo4j / langgraph) | `evolution/future-extensions-roadmap.md` D4(未来扩展引用) | — |
|
||
| `api/app/services/` | 业务服务层 | `pipeline/05-reranking-prompt-llm.md` §3.5 | — |
|
||
| `api/app/tasks.py` | Celery 任务入口 | `overview/source-inventory.md` §3, `pipeline/01-loader-parser-chunking.md` §3.1 | `evolution/future-extensions-roadmap.md` D3 |
|
||
|
||
## 2. 关键文件 → 文档章节(细粒度)
|
||
|
||
### `api/app/core/rag/app/naive.py`
|
||
|
||
| 行号 | 函数 / 关键代码 | 引用文档 |
|
||
|---|---|---|
|
||
| `:27 by_deepdoc()` | DeepDoc 解析路径 | `pipeline/01-loader-parser-chunking.md` §5.1 |
|
||
| `:45 by_mineru()` | MinerU 第三方解析 | 同上 §5.2 |
|
||
| `:65 by_textln()` | TextIn 第三方解析 | 同上 §5.3 |
|
||
| `:257 naive.__call__()` | 主解析入口 | 同上 §3 |
|
||
| `:508-738 chunk()` | 11 路 if/elif 分发,按扩展名挑 parser | 同上 §3, `evolution/architecture-refactor-suggestions.md` §0.2 #5 / #5 改造建议 |
|
||
|
||
### `api/app/core/rag/llm/embedding_model.py`
|
||
|
||
| 行号 | 类 / 函数 | 引用文档 |
|
||
|---|---|---|
|
||
| `:14-38 Base` | Embedding 抽象基类(旧) | `pipeline/02-embedding.md` §5.1 |
|
||
| `:50-65 OpenAIEmbed.encode()` | OpenAI 兼容 Embedding 实现 | 同上 §5.2, `evolution/architecture-refactor-suggestions.md` #1 / #4 / #9 |
|
||
| `:138-143 QWenEmbed` | DashScope Embedding(含显式 5 次重试) | `pipeline/02-embedding.md` §3.2 |
|
||
|
||
### `api/app/core/models/embedding.py`
|
||
|
||
| 行号 | 类 / 函数 | 引用文档 |
|
||
|---|---|---|
|
||
| `:9-23 RedBearEmbeddings.__init__` | LangChain 统一封装初始化 | `pipeline/02-embedding.md` §1.2 / §5.3 |
|
||
| `:65-78 embed_documents()` | 文档侧 Embedding(含火山多模态分支) | 同上 §2.1 |
|
||
|
||
### `api/app/core/rag/vdb/elasticsearch/elasticsearch_vector.py`
|
||
|
||
| 行号 | 类 / 函数 | 引用文档 |
|
||
|---|---|---|
|
||
| `:29 ElasticSearchVector` | ES 向量主实现 | `pipeline/03-vdb-and-retrieval.md` §1 |
|
||
| `:55-63 add_chunks()` | 向量入库 | 同上 §4, `pipeline/02-embedding.md` §2.1, `evolution/architecture-refactor-suggestions.md` #4 |
|
||
| `:374-380 search_by_vector()` | 向量检索 | `pipeline/03-vdb-and-retrieval.md` §6, `pipeline/02-embedding.md` §2.2 |
|
||
| `:468 search_by_full_text()` | BM25 检索 | `pipeline/03-vdb-and-retrieval.md` §5 |
|
||
| `:560-607 rerank()` | ES 层 rerank | `pipeline/05-reranking-prompt-llm.md` §1.2 D, `evolution/architecture-refactor-suggestions.md` #3 |
|
||
| `:653-658 dense_vector mapping` | dense_vector 维度动态决定 | `pipeline/02-embedding.md` §3.4, `pipeline/03-vdb-and-retrieval.md` §3 |
|
||
| `:666 ElasticSearchVectorFactory` | 工厂类 | `overview/source-inventory.md`, `pipeline/03-vdb-and-retrieval.md` §1 |
|
||
| `:685-707 ES 配置环境变量` | 6 个 ES 相关 env vars | `evolution/architecture-refactor-suggestions.md` §0.2 #2 |
|
||
|
||
### `api/app/core/rag/nlp/search.py`
|
||
|
||
| 行号 | 类 / 函数 | 引用文档 |
|
||
|---|---|---|
|
||
| `:36-147 knowledge_retrieval()` | 知识检索入口(旧通道) | `pipeline/05-reranking-prompt-llm.md` §1.2 |
|
||
| `:284-343 rerank()` | 模块级 rerank | 同上 |
|
||
| `:349 Dealer` | BM25/Hybrid 调度器 | `pipeline/03-vdb-and-retrieval.md` §6, `overview/source-inventory.md` §一 |
|
||
| `:365-373 get_vector()` | 调用旧 Embedding 接口的 `encode_queries` | `pipeline/02-embedding.md` §2.4 |
|
||
| `:387 search()` | 主 search | `pipeline/03-vdb-and-retrieval.md` §6 |
|
||
| `:439 FusionExpr("weighted_sum")` | 0.05/0.95 硬编码权重 | `pipeline/03-vdb-and-retrieval.md` §6, `evolution/future-extensions-roadmap.md` D2 |
|
||
| `:489-577 insert_citations()` | 引用回填(embedding 相似度匹配) | `pipeline/05-reranking-prompt-llm.md` §4.1 |
|
||
| `:579-604 _rank_feature_scores()` | tag TF-IDF + PageRank | `pipeline/05-reranking-prompt-llm.md` §1.2 A |
|
||
| `:606-643 Dealer.rerank()` | 内置混合 rerank(融合分数) | 同上, `evolution/architecture-refactor-suggestions.md` #3 |
|
||
| `:645-666 rerank_by_model()` | 外部模型 rerank | `pipeline/05-reranking-prompt-llm.md` §1.2 B |
|
||
| `:674-768 retrieval()` | 检索主流程 | 同上 §1.3 |
|
||
|
||
### `api/app/core/workflow/nodes/knowledge/node.py`
|
||
|
||
| 行号 | 类 / 函数 | 引用文档 |
|
||
|---|---|---|
|
||
| `:12 import OpenAIEmbed` | 硬编码导入旧 Embedding 类 | `evolution/architecture-refactor-suggestions.md` #1 |
|
||
| `:14 import ElasticSearchVectorFactory` | 绕过 BaseVector 抽象 | 同上 §0.2 #1 / #2 改造建议 |
|
||
| `:29 KnowledgeRetrievalNode` | Workflow 节点主类 | `pipeline/05-reranking-prompt-llm.md` §3.4 |
|
||
| `:54 _extract_input()` | 渲染 query 模板 | 同上 |
|
||
| `:108-155 KnowledgeRetrievalNode.rerank()` | 节点级 rerank | 同上 §1.2 C, `evolution/architecture-refactor-suggestions.md` #3 |
|
||
| `:157-193 get_reranker_model()` | 每次调用都查 DB | `evolution/architecture-refactor-suggestions.md` §0.2 #4 |
|
||
| `:195-263 knowledge_retrieval()` | 检索分支(PARTICIPLE / SEMANTIC / HYBRID / Graph) | `pipeline/05-reranking-prompt-llm.md` §3.4, `pipeline/03-vdb-and-retrieval.md` |
|
||
| `:236-271 HYBRID 分支` | vector + full_text 并行 → dedup → rerank | 同上 |
|
||
| `:284 rerank()` 模块级函数 | 三轨 rerank 之一 | `evolution/architecture-refactor-suggestions.md` #3 |
|
||
| `:303-378 execute()` | 节点执行入口 | `pipeline/05-reranking-prompt-llm.md` §3.4 |
|
||
| `:327 print(reranked_docs)` ⚠️ | 调试残留 | `evolution/architecture-refactor-suggestions.md` #3 / #10(hot-fix 候选) |
|
||
|
||
### `api/app/core/rag/graphrag/`
|
||
|
||
| 行号 | 类 / 函数 | 引用文档 |
|
||
|---|---|---|
|
||
| `general/index.py:36 run_graphrag()` | GraphRAG 主入口(doc 级) | `pipeline/04-graphrag.md` §general(待交付) |
|
||
| `general/index.py:122 run_graphrag_for_kb()` | KB 级 | 同上 |
|
||
| `general/graph_extractor.py:34 GraphExtractor` | Microsoft 风格抽取 | 同上 |
|
||
| `general/community_reports_extractor.py:37` | 社区报告 | 同上 |
|
||
| `light/graph_extractor.py:31 GraphExtractor` | LightRAG 风格抽取 | 同上 §light |
|
||
| `entity_resolution.py:31 EntityResolution` | 实体消歧 | 同上 |
|
||
| `search.py:19 KGSearch` | 图检索 | 同上 |
|
||
| `utils.py:41 chat_limiter` | Trio 限流 | `pipeline/02-embedding.md` §3.1, `evolution/architecture-refactor-suggestions.md` #9 |
|
||
| `utils.py:115-134 get/set_embed_cache` | Redis Embedding 缓存 | `pipeline/02-embedding.md` §3.3, `evolution/architecture-refactor-suggestions.md` #4 |
|
||
| `utils.py:301-327 graph_node_to_chunk()` | 实体节点 → 向量 → ES | `pipeline/02-embedding.md` §2.3 |
|
||
|
||
### `api/app/core/rag/llm/chat_model.py`
|
||
|
||
| 行号 | 类 / 函数 | 引用文档 |
|
||
|---|---|---|
|
||
| `:52 Base` | LLM 抽象基类 | `pipeline/05-reranking-prompt-llm.md` §3.1 |
|
||
| `:54-58 LLM_TIMEOUT_SECONDS / LLM_MAX_RETRIES` | 超时与重试 | 同上 §3.3, `evolution/architecture-refactor-suggestions.md` §0.2 #2 |
|
||
| `:122-150 _chat()` | 非流式 LLM 调用 | `pipeline/05-reranking-prompt-llm.md` §3.2 |
|
||
| `:152-185 _chat_streamly()` | 流式 LLM 调用 | 同上 |
|
||
| `:251-303 chat_with_tools()` | 工具调用 | 同上 §3.4 |
|
||
|
||
### `api/app/core/rag/prompts/`
|
||
|
||
| 文件 | 功能 | 引用文档 |
|
||
|---|---|---|
|
||
| `template.py:9 load_prompt()` | 启动时加载 .md 模板 | `pipeline/05-reranking-prompt-llm.md` §2.1 |
|
||
| `generator.py` | 20+ Prompt 工厂函数(citation/keyword/...) | 同上 |
|
||
| `*.md`(31 个模板) | Prompt 内容 | `overview/source-inventory.md` |
|
||
|
||
### `api/app/core/rag/common/settings.py`
|
||
|
||
| 行号 | 关键代码 | 引用文档 |
|
||
|---|---|---|
|
||
| `:9-10 retriever / kg_retriever` | 进程级单例 | `evolution/architecture-refactor-suggestions.md` §0.2 #4 |
|
||
| `:13 init_settings()` | 模块导入时副作用 | 同上, `pipeline/03-vdb-and-retrieval.md` |
|
||
| `:24` 触发位置 | — | `evolution/architecture-refactor-suggestions.md` #8 |
|
||
|
||
### `api/app/services/draft_run_service.py`
|
||
|
||
| 行号 | 关键代码 | 引用文档 |
|
||
|---|---|---|
|
||
| `:195-263 create_knowledge_retrieval_tool()` | 知识检索工具 | `pipeline/05-reranking-prompt-llm.md` §3.5 |
|
||
| `:227-255` chunk 拼接 | `\n\n` 分隔 chunks | 同上 §2.3 |
|
||
| `:474-490 _filter_citations()` | 引用过滤 + 下载链接 | 同上 §4.2 |
|
||
|
||
## 3. 当前已识别的"代码残留与修复任务"
|
||
|
||
| # | 文件:行 | 问题 | 修复建议 | 关联 |
|
||
|---|---|---|---|---|
|
||
| 1 | `workflow/nodes/knowledge/node.py:327` | `print(reranked_docs)` 调试残留 | 立即提 hot-fix PR 删除 | S3-T1 #10 + S3-T1 §3.1 |
|
||
| 2 | `chat_model.py` 各 provider 子类 | base_url 与认证 header 硬编码 | 引入 Plugin Registry | S3-T1 #5 |
|
||
| 3 | `naive.py:508-738 chunk()` | 11 路 if/elif 硬编码 | 抽 `Parser` Protocol | S3-T1 #5 |
|
||
| 4 | `elasticsearch_vector.py:55-63 add_chunks` | 同步循环,无并发 | 改 trio 协程 + 共享 chat_limiter | S3-T1 #9 |
|
||
| 5 | `nlp/search.py:439` | `weighted_sum` 0.05/0.95 硬编码 | 改为 ctx.fusion_weights 注入 | S3-T2 D2 |
|
||
| 6 | `rag_utils/` vs `rag/utils/` | 命名冲突 | 重命名为 `rag/chunk_analytics/` 或合并 | S1-T3 §4.1 |
|
||
|
||
— **File Index · v1.0-RC1 · 2026-05-08** —
|