Submit the formed RAG documentation set produced across Sprint-1/2/3 (WS-12 through WS-26) under docs/rag/.

Includes:
- README.md / INDEX.md: landing + total index (responsibility matrix, review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams), 11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding, VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source retained as reference
- evolution/: 11 architecture-refactor proposals, 6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention, ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot checks hit). The legacy docs/ ignore in .gitignore is narrowed to docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
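MemoryBear RAG · Source Reverse-Lookup Index (File Index)
This index maps each source module back to the documentation sections that cover it. Before changing a file, a developer can look up every document that references it here and assess the "knowledge ripple" of the change in advance.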
1. Overview: code directory → document mapping

| Code directory | Main responsibility | Primary document | Secondary references |
|---|---|---|---|
| `api/app/core/rag/app/` | Multi-format parsing orchestrator | pipeline/01-loader-parser-chunking.md | overview/source-inventory.md |
| `api/app/core/rag/common/` | Constants, token, settings | pipeline/01-loader-parser-chunking.md, evolution/architecture-refactor-suggestions.md §0.2 #4 / #2 | overview/source-inventory.md |
| `api/app/core/rag/crawler/` | Web crawler | pipeline/01-loader-parser-chunking.md §4.1 | - |
| `api/app/core/rag/deepdoc/parser/` | Parsers for 11 format families | pipeline/01-loader-parser-chunking.md §5 | overview/source-inventory.md |
| `api/app/core/rag/deepdoc/vision/` | OCR + layout + TSR | pipeline/01-loader-parser-chunking.md §5.6 | evolution/architecture-refactor-suggestions.md §0.2 #2 (HF_ENDPOINT) |
| `api/app/core/rag/graphrag/` | Shared GraphRAG utilities + graph search | pipeline/04-graphrag.md (pending delivery) | overview/source-inventory.md §3.3 |
| `api/app/core/rag/graphrag/general/` | Microsoft GraphRAG-style pipeline | pipeline/04-graphrag.md §general (pending delivery) | overview/04-graphrag-indexing.mmd |
| `api/app/core/rag/graphrag/light/` | LightRAG-style extractor | pipeline/04-graphrag.md §light (pending delivery) | Same as above |
| `api/app/core/rag/integrations/feishu/` | Feishu SDK | pipeline/01-loader-parser-chunking.md §4 | - |
| `api/app/core/rag/integrations/yuque/` | Yuque SDK | Same as above | - |
| `api/app/core/rag/llm/` | Multi-model LLM facade | pipeline/05-reranking-prompt-llm.md §3 | evolution/architecture-refactor-suggestions.md #1, #5 |
| `api/app/core/rag/models/` | Chunk data model | pipeline/01-loader-parser-chunking.md §3 | overview/source-inventory.md |
| `api/app/core/rag/nlp/` | Chinese word segmentation, hybrid search dispatch | pipeline/03-vdb-and-retrieval.md §6, pipeline/05-reranking-prompt-llm.md §1.2 | evolution/architecture-refactor-suggestions.md #3 |
| `api/app/core/rag/prompts/` | Prompt templates and factories | pipeline/05-reranking-prompt-llm.md §2 | - |
| `api/app/core/rag/utils/` | ES/Redis connections, LibreOffice | pipeline/03-vdb-and-retrieval.md, pipeline/01-loader-parser-chunking.md §4.2 | - |
| `api/app/core/rag/vdb/elasticsearch/` | ES vector + full-text search | pipeline/03-vdb-and-retrieval.md (entire document) | pipeline/02-embedding.md §5.4 |
| `api/app/core/rag/res/` | NER / synonyms / mapping | pipeline/03-vdb-and-retrieval.md §3 | - |
| `api/app/core/models/` | Unified wrapper layer (Embedding / Rerank / LLM) | pipeline/02-embedding.md §1.2, pipeline/05-reranking-prompt-llm.md §1.2 | evolution/architecture-refactor-suggestions.md #1 |
| `api/app/core/agent/` | LangChainAgent | pipeline/05-reranking-prompt-llm.md §3.4 | - |
| `api/app/core/workflow/nodes/knowledge/` | Workflow Knowledge node | pipeline/05-reranking-prompt-llm.md §3.4, pipeline/03-vdb-and-retrieval.md | evolution/architecture-refactor-suggestions.md #3 |
| `api/app/core/rag_utils/` (note: not the same as `rag/utils`) | LLM analysis of chunks (coupled to the Memory system) | overview/source-inventory.md §rag_utils | evolution/future-extensions-roadmap.md D4 |
| `api/app/core/memory/` | Conversation memory system (Ebbinghaus / ACT-R / Neo4j / langgraph) | evolution/future-extensions-roadmap.md D4 (referenced as a future extension) | - |
| `api/app/services/` | Business service layer | pipeline/05-reranking-prompt-llm.md §3.5 | - |
| `api/app/tasks.py` | Celery task entry point | overview/source-inventory.md §3, pipeline/01-loader-parser-chunking.md §3.1 | evolution/future-extensions-roadmap.md D3 |
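
The directory → document mapping above lends itself to a longest-prefix lookup: given a changed file, pick the most specific directory row that contains it. The following is a minimal sketch only. `DIR_TO_DOCS` is a hand-copied, hypothetical excerpt of the table, and `docs_for` is an illustrative helper, not something that exists in the repository.

```python
from __future__ import annotations

# Hypothetical excerpt of the directory -> primary-document rows listed above.
DIR_TO_DOCS: dict[str, list[str]] = {
    "api/app/core/rag/app/": ["pipeline/01-loader-parser-chunking.md"],
    "api/app/core/rag/graphrag/": ["pipeline/04-graphrag.md"],
    "api/app/core/rag/graphrag/light/": ["pipeline/04-graphrag.md §light"],
    "api/app/core/rag/nlp/": [
        "pipeline/03-vdb-and-retrieval.md §6",
        "pipeline/05-reranking-prompt-llm.md §1.2",
    ],
}


def docs_for(changed_file: str) -> list[str]:
    """Return the docs of the most specific directory containing changed_file."""
    matches = [d for d in DIR_TO_DOCS if changed_file.startswith(d)]
    if not matches:
        return []  # the file is not covered by this excerpt
    # The longest matching prefix is the most specific directory row.
    return DIR_TO_DOCS[max(matches, key=len)]
```

A file under `graphrag/light/` resolves to the `§light` row rather than the broader `graphrag/` row, which mirrors how the table separates the two.
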
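2. Key files → document sections (fine-grained)

api/app/core/rag/app/naive.py

| Line and function | Description | Referencing docs |
|---|---|---|
| `:27` `by_deepdoc()` | DeepDoc parsing path | pipeline/01-loader-parser-chunking.md §5.1 |
| `:45` `by_mineru()` | MinerU third-party parsing | Same as above §5.2 |
| `:65` `by_textln()` | TextIn third-party parsing | Same as above §5.3 |
| `:257` `naive.__call__()` | Main parsing entry point | Same as above §3 |
| `:508-738` `chunk()` | 11-way if/elif dispatch that picks a parser by file extension | Same as above §3, evolution/architecture-refactor-suggestions.md §0.2 #5 / #5 refactoring proposal |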
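
The 11-way if/elif dispatch in `chunk()` is the kind of code a parser registry can replace. The sketch below illustrates the general idea under stated assumptions: the `Parser` protocol, the `register` decorator, the `(filename, binary)` signature and the list-of-strings return type are all hypothetical and are not taken from `naive.py`.

```python
import os
from typing import Callable, Dict, List, Protocol


class Parser(Protocol):
    """Hypothetical parser contract: raw bytes in, text chunks out."""

    def __call__(self, filename: str, binary: bytes) -> List[str]: ...


_PARSERS: Dict[str, Parser] = {}


def register(*extensions: str) -> Callable[[Parser], Parser]:
    """Register a parser for one or more file extensions (case-insensitive)."""

    def decorator(parser: Parser) -> Parser:
        for ext in extensions:
            _PARSERS[ext.lower()] = parser
        return parser

    return decorator


@register(".txt", ".md")
def parse_text(filename: str, binary: bytes) -> List[str]:
    # Illustrative parser: split on blank lines and drop empty paragraphs.
    return [p for p in binary.decode("utf-8").split("\n\n") if p.strip()]


def chunk(filename: str, binary: bytes) -> List[str]:
    """Dispatch by extension through the registry instead of an if/elif chain."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        parser = _PARSERS[ext]
    except KeyError:
        raise ValueError(f"no parser registered for {ext!r}") from None
    return parser(filename, binary)
```

With this shape, adding a format means registering one more function rather than editing the dispatch body.
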
api/app/core/rag/llm/embedding_model.py

| Line and class / function | Description | Referencing docs |
|---|---|---|
| `:14-38` `Base` | Embedding abstract base class (legacy) | pipeline/02-embedding.md §5.1 |
| `:50-65` `OpenAIEmbed.encode()` | OpenAI-compatible embedding implementation | Same as above §5.2, evolution/architecture-refactor-suggestions.md #1 / #4 / #9 |
| `:138-143` `QWenEmbed` | DashScope embedding (with 5 explicit retries) | pipeline/02-embedding.md §3.2 |
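
The "5 explicit retries" noted for `QWenEmbed` can be pictured as a bounded retry loop around the embedding call. This is a sketch under assumptions, not the repository's code: `encode_with_retry`, the exponential backoff, and reading "5 retries" as five total attempts are all illustrative choices.

```python
import time
from typing import Callable, List, Sequence


def encode_with_retry(
    call: Callable[[Sequence[str]], List[List[float]]],
    texts: Sequence[str],
    max_attempts: int = 5,  # assumption: "5 retries" read as 5 total attempts
    base_delay: float = 0.5,  # assumption: backoff is not stated in the source
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Call an embedding function, retrying on any exception with backoff."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call(texts)
        except Exception as exc:  # a real client would narrow this
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2**attempt)  # 0.5s, 1s, 2s, 4s
    raise RuntimeError(
        f"embedding failed after {max_attempts} attempts"
    ) from last_error
```

Passing `sleep` as a parameter keeps the loop testable without real delays.
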
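api/app/core/models/embedding.py

| Line and class / function | Description | Referencing docs |
|---|---|---|
| `:9-23` `RedBearEmbeddings.__init__` | Initialization of the unified LangChain wrapper | pipeline/02-embedding.md §1.2 / §5.3 |
| `:65-78` `embed_documents()` | Document-side embedding (including the Volcengine multimodal branch) | Same as above §2.1 |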
api/app/core/rag/vdb/elasticsearch/elasticsearch_vector.py

| Line and class / function | Description | Referencing docs |
|---|---|---|
| `:29` `ElasticSearchVector` | Main ES vector implementation | pipeline/03-vdb-and-retrieval.md §1 |
| `:55-63` `add_chunks()` | Vector ingestion | Same as above §4, pipeline/02-embedding.md §2.1, evolution/architecture-refactor-suggestions.md #4 |
| `:374-380` `search_by_vector()` | Vector search | pipeline/03-vdb-and-retrieval.md §6, pipeline/02-embedding.md §2.2 |
| `:468` `search_by_full_text()` | BM25 search | pipeline/03-vdb-and-retrieval.md §5 |
| `:560-607` `rerank()` | ES-level rerank | pipeline/05-reranking-prompt-llm.md §1.2 D, evolution/architecture-refactor-suggestions.md #3 |
| `:653-658` `dense_vector` mapping | `dense_vector` dimension is determined dynamically | pipeline/02-embedding.md §3.4, pipeline/03-vdb-and-retrieval.md §3 |
| `:666` `ElasticSearchVectorFactory` | Factory class | overview/source-inventory.md, pipeline/03-vdb-and-retrieval.md §1 |
| `:685-707` ES configuration environment variables | 6 ES-related env vars | evolution/architecture-refactor-suggestions.md §0.2 #2 |
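
"Dimension determined dynamically" for the `dense_vector` mapping typically means the index mapping is built from the length of an actual embedding rather than from a constant. A minimal sketch, assuming hypothetical field names (`content`, `vector`) and cosine similarity, none of which are confirmed by this index:

```python
from typing import Any, Dict, Sequence


def build_mapping(
    sample_vector: Sequence[float], vector_field: str = "vector"
) -> Dict[str, Any]:
    """Build an Elasticsearch mapping whose dense_vector dims follow the model."""
    dims = len(sample_vector)
    if dims == 0:
        raise ValueError("cannot infer dims from an empty vector")
    return {
        "properties": {
            "content": {"type": "text"},  # hypothetical full-text field
            vector_field: {
                "type": "dense_vector",
                "dims": dims,  # taken from the embedding, not hard-coded
                "index": True,
                "similarity": "cosine",  # assumption for illustration
            },
        }
    }
```

Because `dims` is fixed once an index is created, switching to an embedding model with a different output size generally requires a new index.
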
api/app/core/rag/nlp/search.py

| Line and class / function | Description | Referencing docs |
|---|---|---|
| `:36-147` `knowledge_retrieval()` | Knowledge retrieval entry point (legacy channel) | pipeline/05-reranking-prompt-llm.md §1.2 |
| `:284-343` `rerank()` | Module-level rerank | Same as above |
| `:349` `Dealer` | BM25/hybrid dispatcher | pipeline/03-vdb-and-retrieval.md §6, overview/source-inventory.md §一 |
| `:365-373` `get_vector()` | Calls `encode_queries` on the legacy Embedding interface | pipeline/02-embedding.md §2.4 |
| `:387` `search()` | Main search | pipeline/03-vdb-and-retrieval.md §6 |
| `:439` `FusionExpr("weighted_sum")` | Hard-coded 0.05/0.95 weights | pipeline/03-vdb-and-retrieval.md §6, evolution/future-extensions-roadmap.md D2 |
| `:489-577` `insert_citations()` | Citation back-fill (embedding similarity matching) | pipeline/05-reranking-prompt-llm.md §4.1 |
| `:579-604` `_rank_feature_scores()` | Tag TF-IDF + PageRank | pipeline/05-reranking-prompt-llm.md §1.2 A |
| `:606-643` `Dealer.rerank()` | Built-in hybrid rerank (fused scores) | Same as above, evolution/architecture-refactor-suggestions.md #3 |
| `:645-666` `rerank_by_model()` | External-model rerank | pipeline/05-reranking-prompt-llm.md §1.2 B |
| `:674-768` `retrieval()` | Main retrieval flow | Same as above §1.3 |
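
The `weighted_sum` fusion with 0.05/0.95 weights can be sketched as a per-document linear combination of two score sets. The index does not say which weight belongs to which channel; the sketch assumes 0.05 for full-text and 0.95 for vector scores, and `weighted_sum_fusion` is a hypothetical helper with the weights exposed as a parameter instead of being hard-coded.

```python
from typing import Dict, List, Tuple


def weighted_sum_fusion(
    text_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    weights: Tuple[float, float] = (0.05, 0.95),  # assumption: (text, vector)
) -> List[Tuple[str, float]]:
    """Fuse two score maps by weighted sum and rank documents best-first."""
    w_text, w_vec = weights
    doc_ids = set(text_scores) | set(vector_scores)
    fused = {
        doc_id: w_text * text_scores.get(doc_id, 0.0)
        + w_vec * vector_scores.get(doc_id, 0.0)  # a missing score counts as 0
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Passing `weights` in from the caller is one way to read the "inject fusion weights" suggestion recorded in section 3.
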
api/app/core/workflow/nodes/knowledge/node.py

| Line and class / function | Description | Referencing docs |
|---|---|---|
| `:12` `import OpenAIEmbed` | Hard-coded import of the legacy Embedding class | evolution/architecture-refactor-suggestions.md #1 |
| `:14` `import ElasticSearchVectorFactory` | Bypasses the BaseVector abstraction | Same as above §0.2 #1 / #2 refactoring proposal |
| `:29` `KnowledgeRetrievalNode` | Main class of the workflow node | pipeline/05-reranking-prompt-llm.md §3.4 |
| `:54` `_extract_input()` | Renders the query template | Same as above |
| `:108-155` `KnowledgeRetrievalNode.rerank()` | Node-level rerank | Same as above §1.2 C, evolution/architecture-refactor-suggestions.md #3 |
| `:157-193` `get_reranker_model()` | Queries the DB on every call | evolution/architecture-refactor-suggestions.md §0.2 #4 |
| `:195-263` `knowledge_retrieval()` | Retrieval branches (PARTICIPLE / SEMANTIC / HYBRID / Graph) | pipeline/05-reranking-prompt-llm.md §3.4, pipeline/03-vdb-and-retrieval.md |
| `:236-271` HYBRID branch | vector + full_text in parallel → dedup → rerank | Same as above |
| `:284` `rerank()` module-level function | One of the three rerank tracks | evolution/architecture-refactor-suggestions.md #3 |
| `:303-378` `execute()` | Node execution entry point | pipeline/05-reranking-prompt-llm.md §3.4 |
| `:327` `print(reranked_docs)` ⚠️ | Debug leftover | evolution/architecture-refactor-suggestions.md #3 / #10 (hot-fix candidate) |
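
The HYBRID branch is summarized above as "vector + full_text in parallel → dedup → rerank". The sketch below shows one way that flow can be arranged. It assumes thread-based parallelism, hits shaped as dicts with an `id` key, and caller-supplied search and rerank callables; none of these details are confirmed for `node.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Hit = Dict[str, Any]


def hybrid_retrieve(
    query: str,
    search_by_vector: Callable[[str], List[Hit]],
    search_by_full_text: Callable[[str], List[Hit]],
    rerank: Callable[[str, List[Hit]], List[Hit]],
    top_k: int = 3,
) -> List[Hit]:
    """Run both searches concurrently, dedupe by id, then rerank."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec_future = pool.submit(search_by_vector, query)
        text_future = pool.submit(search_by_full_text, query)
        vec_hits, text_hits = vec_future.result(), text_future.result()

    seen = set()
    merged: List[Hit] = []
    for hit in vec_hits + text_hits:  # the first occurrence of an id wins
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append(hit)
    return rerank(query, merged)[:top_k]
```

Deduplicating before the rerank step keeps the reranker from scoring the same chunk twice.
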
api/app/core/rag/graphrag/

| File, line and class / function | Description | Referencing docs |
|---|---|---|
| `general/index.py:36` `run_graphrag()` | GraphRAG main entry point (document level) | pipeline/04-graphrag.md §general (pending delivery) |
| `general/index.py:122` `run_graphrag_for_kb()` | KB level | Same as above |
| `general/graph_extractor.py:34` `GraphExtractor` | Microsoft-style extraction | Same as above |
| `general/community_reports_extractor.py:37` | Community reports | Same as above |
| `light/graph_extractor.py:31` `GraphExtractor` | LightRAG-style extraction | Same as above §light |
| `entity_resolution.py:31` `EntityResolution` | Entity resolution | Same as above |
| `search.py:19` `KGSearch` | Graph search | Same as above |
| `utils.py:41` `chat_limiter` | Trio rate limiting | pipeline/02-embedding.md §3.1, evolution/architecture-refactor-suggestions.md #9 |
| `utils.py:115-134` `get/set_embed_cache` | Redis embedding cache | pipeline/02-embedding.md §3.3, evolution/architecture-refactor-suggestions.md #4 |
| `utils.py:301-327` `graph_node_to_chunk()` | Entity node → vector → ES | pipeline/02-embedding.md §2.3 |
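
The Redis embedding cache listed above follows the usual read-through pattern: hash the model name and text into a key, return the stored vector on a hit, and compute and store it on a miss. In the sketch below only the function names `get_embed_cache` and `set_embed_cache` come from this index. Their signatures, the key format, and the in-memory `DictStore` standing in for Redis are assumptions.

```python
import hashlib
import json
from typing import Callable, Dict, List, Optional


class DictStore:
    """In-memory stand-in for a Redis client (get/set only)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def _cache_key(model_name: str, text: str) -> str:
    digest = hashlib.sha256(f"{model_name}\x00{text}".encode("utf-8")).hexdigest()
    return f"embed:{digest}"  # hypothetical key format


def get_embed_cache(store: DictStore, model_name: str, text: str) -> Optional[List[float]]:
    raw = store.get(_cache_key(model_name, text))
    return None if raw is None else json.loads(raw)


def set_embed_cache(store: DictStore, model_name: str, text: str, vector: List[float]) -> None:
    store.set(_cache_key(model_name, text), json.dumps(vector))


def embed_cached(
    store: DictStore, model_name: str, text: str, embed: Callable[[str], List[float]]
) -> List[float]:
    """Return a cached vector if present; otherwise embed and cache it."""
    cached = get_embed_cache(store, model_name, text)
    if cached is not None:
        return cached
    vector = embed(text)
    set_embed_cache(store, model_name, text, vector)
    return vector
```

Including the model name in the key prevents vectors from different embedding models from colliding on the same text.
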
api/app/core/rag/llm/chat_model.py

| Line and class / function | Description | Referencing docs |
|---|---|---|
| `:52` `Base` | LLM abstract base class | pipeline/05-reranking-prompt-llm.md §3.1 |
| `:54-58` `LLM_TIMEOUT_SECONDS` / `LLM_MAX_RETRIES` | Timeout and retries | Same as above §3.3, evolution/architecture-refactor-suggestions.md §0.2 #2 |
| `:122-150` `_chat()` | Non-streaming LLM call | pipeline/05-reranking-prompt-llm.md §3.2 |
| `:152-185` `_chat_streamly()` | Streaming LLM call | Same as above |
| `:251-303` `chat_with_tools()` | Tool calling | Same as above §3.4 |
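
`LLM_TIMEOUT_SECONDS` and `LLM_MAX_RETRIES` are the two limits the base class defines. The sketch below assumes they are read from environment variables of the same names, which this index does not confirm, and the defaults shown are placeholders rather than the repository's values. `LLMLimits` and `load_llm_limits` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMLimits:
    timeout_seconds: float
    max_retries: int


def load_llm_limits(
    env: Optional[Mapping[str, str]] = None,
    default_timeout: float = 600.0,  # placeholder default, not from the source
    default_retries: int = 5,  # placeholder default, not from the source
) -> LLMLimits:
    """Parse and validate LLM limits from an environment-like mapping."""
    env = os.environ if env is None else env
    timeout = float(env.get("LLM_TIMEOUT_SECONDS", default_timeout))
    retries = int(env.get("LLM_MAX_RETRIES", default_retries))
    if timeout <= 0:
        raise ValueError("LLM_TIMEOUT_SECONDS must be positive")
    if retries < 0:
        raise ValueError("LLM_MAX_RETRIES must not be negative")
    return LLMLimits(timeout_seconds=timeout, max_retries=retries)
```

Validating once at load time surfaces a bad value early instead of at the first LLM call.
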
api/app/core/rag/prompts/

| File | Purpose | Referencing docs |
|---|---|---|
| `template.py:9` `load_prompt()` | Loads the .md templates at startup | pipeline/05-reranking-prompt-llm.md §2.1 |
| `generator.py` | 20+ prompt factory functions (citation/keyword/...) | Same as above |
| `*.md` (31 templates) | Prompt content | overview/source-inventory.md |
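
Loading `.md` prompt templates at startup usually amounts to reading every template file in a directory into a name → text map once. A minimal sketch under that assumption. `load_prompts` is a hypothetical helper and does not reproduce the actual `load_prompt()` signature.

```python
from pathlib import Path
from typing import Dict


def load_prompts(prompt_dir: Path) -> Dict[str, str]:
    """Read every *.md file in prompt_dir into a {template_name: text} map."""
    return {
        path.stem: path.read_text(encoding="utf-8").strip()
        for path in sorted(prompt_dir.glob("*.md"))  # non-.md files are skipped
    }
```

Callers can then look up a template by file stem, for example `prompts["citation"]` for a file named `citation.md`.
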
api/app/core/rag/common/settings.py

| Line and key code | Description | Referencing docs |
|---|---|---|
| `:9-10` `retriever` / `kg_retriever` | Process-level singletons | evolution/architecture-refactor-suggestions.md §0.2 #4 |
| `:13` `init_settings()` | Side effect at module import time | Same as above, pipeline/03-vdb-and-retrieval.md |
| `:24` call site that triggers it | - | evolution/architecture-refactor-suggestions.md #8 |
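
Process-level singletons that are created as an import-time side effect can instead be built lazily on first use. The sketch below shows that general pattern with `functools.lru_cache`. The `Retriever` class and `get_retriever` function are stand-ins and do not reflect how `settings.py` is written.

```python
from functools import lru_cache


class Retriever:
    """Stand-in for an expensive object such as a search dispatcher."""

    instances = 0  # counts constructions, for demonstration only

    def __init__(self) -> None:
        Retriever.instances += 1


@lru_cache(maxsize=None)
def get_retriever() -> Retriever:
    """Create the retriever on first call and reuse it afterwards."""
    return Retriever()
```

Importing the module no longer constructs anything. The first `get_retriever()` call does, and later calls return the same object. Under concurrent first calls the factory may run more than once, so code that needs a strict single construction would add a lock.
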
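api/app/services/draft_run_service.py

| Line and key code | Description | Referencing docs |
|---|---|---|
| `:195-263` `create_knowledge_retrieval_tool()` | Knowledge retrieval tool | pipeline/05-reranking-prompt-llm.md §3.5 |
| `:227-255` chunk concatenation | Chunks are separated by `\n\n` | Same as above §2.3 |
| `:474-490` `_filter_citations()` | Citation filtering + download links | Same as above §4.2 |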
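
The chunk concatenation step joins retrieved chunks with a blank line between them. A minimal sketch: the `\n\n` separator comes from the table above, while stripping whitespace and skipping empty chunks are assumptions added for illustration.

```python
from typing import Iterable


def join_chunks(chunks: Iterable[str], separator: str = "\n\n") -> str:
    """Join non-empty chunks into one context string."""
    return separator.join(c.strip() for c in chunks if c and c.strip())
```

For example, `join_chunks([" a ", "", "b"])` yields the two chunks separated by one blank line.
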
3. Currently identified "code leftovers and remediation tasks"

| # | File:line | Issue | Suggested fix | Related |
|---|---|---|---|---|
| 1 | `workflow/nodes/knowledge/node.py:327` | `print(reranked_docs)` debug leftover | Open a hot-fix PR immediately to remove it | S3-T1 #10 + S3-T1 §3.1 |
| 2 | `chat_model.py`, provider subclasses | `base_url` and auth headers are hard-coded | Introduce a Plugin Registry | S3-T1 #5 |
| 3 | `naive.py:508-738` `chunk()` | 11-way if/elif is hard-coded | Extract a Parser Protocol | S3-T1 #5 |
| 4 | `elasticsearch_vector.py:55-63` `add_chunks` | Synchronous loop, no concurrency | Switch to trio coroutines + a shared `chat_limiter` | S3-T1 #9 |
| 5 | `nlp/search.py:439` | `weighted_sum` 0.05/0.95 is hard-coded | Inject via `ctx.fusion_weights` | S3-T2 D2 |
| 6 | `rag_utils/` vs `rag/utils/` | Naming conflict | Rename to `rag/chunk_analytics/` or merge | S1-T3 §4.1 |
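
Item 1 recommends deleting the stray `print(reranked_docs)`. If the information is still useful during debugging, one alternative is a debug-level log line that stays silent in production. This is only a sketch. The logger name and the `log_reranked` helper are hypothetical.

```python
import logging
from typing import Any, List

logger = logging.getLogger("knowledge_node")  # hypothetical logger name


def log_reranked(reranked_docs: List[Any]) -> None:
    """Record how many documents survived reranking, at DEBUG level only."""
    # Lazy %-style formatting: the message is built only if DEBUG is enabled.
    logger.debug("reranked %d docs", len(reranked_docs))
```

Logging the count rather than the full documents also avoids writing chunk contents to stdout.
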
— File Index · v1.0-RC1 · 2026-05-08 —