docs(rag): add MemoryBear RAG implementation docs v1.0

Add the finalized RAG documentation set produced across Sprint-1/2/3 (WS-12 through WS-26) under docs/rag/.

Includes:
- README.md / INDEX.md: landing page + master index (responsibility matrix, review verdicts, dual links to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams), 11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding, VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source retained as reference
- evolution/: 11 architecture-refactor proposals, 6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention, ops & freshness plan
- _meta/README.md: placeholder noting the WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot checks hit).

The legacy docs/ ignore in .gitignore is narrowed to docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
2026-05-09 10:51:48 +08:00
parent feae2f2e1e
commit 343a5eebe3
33 changed files with 8410 additions and 1 deletions
--- a/docs/rag/_indexes/chart-index.md
+++ b/docs/rag/_indexes/chart-index.md
@@ -0,0 +1,47 @@
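+# MemoryBear RAG · Chart Index
+
+> A consolidated list of every chart in the documentation set (Mermaid unless noted). Each entry records the chart's content, source task, file path, and what to look for when reading it.
+
+## 1. Overview
+
+| # | Chart | Type | Source task | File path | One-line description |
+|---|---|---|---|---|---|
+| 1 | End-to-end architecture diagram | Mermaid Flowchart | S1-T2 | `overview/01-architecture.mmd` | Panorama of the 11 RAG stages and their module mapping |
+| 2 | Document indexing sequence diagram | Mermaid Sequence | S1-T2 | `overview/02-indexing-pipeline.mmd` | Full sequence: upload → Celery → naive.chunk() → Embedding → ES write |
+| 3 | Online retrieval sequence diagram | Mermaid Sequence | S1-T2 | `overview/03-query-pipeline.mmd` | Workflow node retrieval → 4 mode branches → dedup/Rerank → Prompt → LLM |
+| 4 | GraphRAG indexing sequence diagram | Mermaid Sequence | S1-T2 | `overview/04-graphrag-indexing.mmd` | Differences between the light and general branches |
+| 5 | Module dependency graph | Mermaid Graph TB | S1-T3 | `overview/source-inventory.md` §2 | Three dependency layers: upstream callers / RAG Core / side paths |
+| 6 | Loader/Parser/Chunking data flow diagram | Mermaid Flowchart LR | S2-T1 | `pipeline/01-loader-parser-chunking.md` §3 | Multiple sources → multiple formats → Chunking → ES Doc |
+| 7 | Post-processing and generation flow | ASCII flow | S2-T5 | `pipeline/05-reranking-prompt-llm.md` §"Implementation Overview" | Rerank → Prompt → LLM → post-processing |
+| 8 | Capability map | Mermaid (three-color) | S3-T2 | `evolution/capability-map.mmd` | Existing (green) / near-term (yellow) / mid- to long-term vision (purple) |
+| 9 | Iteration roadmap Gantt chart | Mermaid Gantt | S3-T2 | `evolution/future-extensions-roadmap.md` §4 | Timeline: Sprint-3 / short term / mid term / long term |
+| 10 | Project Gantt chart (overall) | Mermaid Gantt | WS-11 (coordinator) | `_release/release-manifest-v1.0-RC1.md` §Appendix | Overall plan for the 14 subtasks |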
+
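+## 2. Quick reference: scenario → which chart to read
+
+| Scenario | Recommended charts | Notes |
+|---|---|---|
+| Introducing the RAG pipeline to business stakeholders or newcomers | #1 End-to-end architecture diagram + #8 Capability map | Together the two charts explain "what this is" in five minutes |
+| Troubleshooting "why wasn't the document indexed" | #2 Document indexing sequence diagram | Pinpoint the stage that failed |
+| Troubleshooting "why can't this chunk be found" | #3 Online retrieval sequence diagram + #5 Module dependency graph | The sequence diagram locates the call step; the dependency graph shows upstream and downstream modules |
+| Debugging GraphRAG | #4 GraphRAG indexing sequence diagram | Where light and general differ |
+| Assessing the impact of a refactor | #5 Module dependency graph + `_indexes/file-index.md` (this directory) | Trace the ripple from code to docs |
+| Presenting the evolution plan to an architecture review | #8 Capability map + #9 Iteration roadmap Gantt chart | Current state + roadmap |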
+
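+## 3. Chart rendering notes
+
+- **Mermaid files (`.mmd`)**: render directly on GitHub, in Mermaid Live Editor, or with a VS Code Mermaid extension.
+- **Charts embedded in code blocks**: open the containing document in a Markdown renderer (e.g. MkDocs Material) to view them.
+- **Future extension (suggested)**: in v1.1, generate an SVG alongside each `.mmd` file and publish it on the Wiki to avoid GitHub rendering limits (the current GitHub Mermaid node limit is 1500; split large charts later as needed).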
+
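+## 4. Charts still to be added (v1.0 → v1.1 plan)
+
+| # | Planned chart | Source | Blocked on |
+|---|---|---|---|
+| TBD-1 | E2E sequence diagram (including GraphRAG and Memory coordination) | S2-T6 (to be restarted) | Completion of all of S2-T1~T5 |
+| TBD-2 | Internal data flow diagram for GraphRAG light vs general | S2-T4 (to be restarted) | S2-T4 kickoff |
+| TBD-3 | "GraphRAG with evidence_path" sequence sketch | S3-T2 D3 rollout | Phase 1 of D3 incremental graph evolution |
+| TBD-4 | Memory ↔ RAG coordination sequence diagram | S3-T2 D4 rollout | Backfill after D4 PoC-B is implemented |
+| TBD-5 | Scatter plot: proposal # × priority × effort | S3-T1 + review feedback | S3-T1 final review is complete; the scatter plot is an optional enhancement |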
+
+— **Chart Index · v1.0-RC1 · 2026-05-08** —
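Reviewer note on chart-index.md: section 3 suggests generating an SVG for each `.mmd` file in v1.1. One way to enumerate the candidates is to read them straight out of the overview table. The sketch below is only an illustration under that assumption; `extract_mmd_paths`, `MMD_PATH`, and `SAMPLE` are hypothetical names and are not part of this commit.

```python
import re

# Hypothetical helper, not part of the commit: pull `.mmd` paths out of a
# Markdown table whose cells wrap file paths in backticks.
MMD_PATH = re.compile(r"`([^`]+\.mmd)`")


def extract_mmd_paths(markdown: str) -> list[str]:
    """Return the unique .mmd paths found in table rows, in first-seen order."""
    seen: dict[str, None] = {}
    for line in markdown.splitlines():
        if "|" not in line:
            continue  # only table rows are of interest
        for path in MMD_PATH.findall(line):
            seen.setdefault(path, None)
    return list(seen)


# Abbreviated rows modelled on the overview table (descriptions elided).
SAMPLE = """
| 1 | Architecture | Mermaid Flowchart | S1-T2 | `overview/01-architecture.mmd` | ... |
| 5 | Module dependency graph | Mermaid Graph TB | S1-T3 | `overview/source-inventory.md` | ... |
| 8 | Capability map | Mermaid | S3-T2 | `evolution/capability-map.mmd` | ... |
"""

if __name__ == "__main__":
    for path in extract_mmd_paths(SAMPLE):
        print(path)
```

Rows that point at a `.md` document (charts embedded in code blocks) are skipped on purpose, since they have no standalone `.mmd` file to convert.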
--- a/docs/rag/_indexes/file-index.md
+++ b/docs/rag/_indexes/file-index.md
@@ -0,0 +1,166 @@
+# MemoryBear RAG · Source-to-Doc Reverse Index (File Index)
+
+> Maps source modules back to the documentation sections that cover them. Before changing a file, a developer can look up every document that references it here and assess the change's "knowledge ripple" in advance.
+
+## 1. Overview: code directory → documentation mapping
+
+| Code directory | Main responsibility | Primary document | Secondary references |
+|---|---|---|---|
+| `api/app/core/rag/app/` | Multi-format parsing orchestrator | `pipeline/01-loader-parser-chunking.md` | `overview/source-inventory.md` |
+| `api/app/core/rag/common/` | Constants, token utilities, settings | `pipeline/01-loader-parser-chunking.md`, `evolution/architecture-refactor-suggestions.md` §0.2 #4 / #2 | `overview/source-inventory.md` |
+| `api/app/core/rag/crawler/` | Web crawler | `pipeline/01-loader-parser-chunking.md` §4.1 | — |
+| `api/app/core/rag/deepdoc/parser/` | Parsers for 11 format types | `pipeline/01-loader-parser-chunking.md` §5 | `overview/source-inventory.md` |
+| `api/app/core/rag/deepdoc/vision/` | OCR + layout + TSR | `pipeline/01-loader-parser-chunking.md` §5.6 | `evolution/architecture-refactor-suggestions.md` §0.2 #2 (HF_ENDPOINT) |
+| `api/app/core/rag/graphrag/` | GraphRAG shared utilities + graph search | `pipeline/04-graphrag.md` (pending delivery) | `overview/source-inventory.md` §3.3 |
+| `api/app/core/rag/graphrag/general/` | Microsoft GraphRAG-style pipeline | `pipeline/04-graphrag.md` §general (pending delivery) | `overview/04-graphrag-indexing.mmd` |
+| `api/app/core/rag/graphrag/light/` | LightRAG-style extractor | `pipeline/04-graphrag.md` §light (pending delivery) | Same as above |
+| `api/app/core/rag/integrations/feishu/` | Feishu SDK | `pipeline/01-loader-parser-chunking.md` §4 | — |
+| `api/app/core/rag/integrations/yuque/` | Yuque SDK | Same as above | — |
+| `api/app/core/rag/llm/` | Multi-model LLM facade | `pipeline/05-reranking-prompt-llm.md` §3 | `evolution/architecture-refactor-suggestions.md` #1, #5 |
+| `api/app/core/rag/models/` | Chunk data model | `pipeline/01-loader-parser-chunking.md` §3 | `overview/source-inventory.md` |
+| `api/app/core/rag/nlp/` | Chinese word segmentation, Hybrid search dispatch | `pipeline/03-vdb-and-retrieval.md` §6, `pipeline/05-reranking-prompt-llm.md` §1.2 | `evolution/architecture-refactor-suggestions.md` #3 |
+| `api/app/core/rag/prompts/` | Prompt templates and factories | `pipeline/05-reranking-prompt-llm.md` §2 | — |
+| `api/app/core/rag/utils/` | ES/Redis connections, LibreOffice | `pipeline/03-vdb-and-retrieval.md`, `pipeline/01-loader-parser-chunking.md` §4.2 | — |
+| `api/app/core/rag/vdb/elasticsearch/` | ES vector + full-text | `pipeline/03-vdb-and-retrieval.md` (entire document) | `pipeline/02-embedding.md` §5.4 |
+| `api/app/core/rag/res/` | NER / synonyms / mapping | `pipeline/03-vdb-and-retrieval.md` §3 | — |
+| `api/app/core/models/` | Unified wrapper layer (Embedding / Rerank / LLM) | `pipeline/02-embedding.md` §1.2, `pipeline/05-reranking-prompt-llm.md` §1.2 | `evolution/architecture-refactor-suggestions.md` #1 |
+| `api/app/core/agent/` | LangChainAgent | `pipeline/05-reranking-prompt-llm.md` §3.4 | — |
+| `api/app/core/workflow/nodes/knowledge/` | Workflow Knowledge node | `pipeline/05-reranking-prompt-llm.md` §3.4, `pipeline/03-vdb-and-retrieval.md` | `evolution/architecture-refactor-suggestions.md` #3 |
+| `api/app/core/rag_utils/` (note: not the same as `rag/utils`) | Chunk LLM analysis (coupled to the Memory system) | `overview/source-inventory.md` §rag_utils | `evolution/future-extensions-roadmap.md` D4 |
+| `api/app/core/memory/` | Conversational memory system (Ebbinghaus / ACT-R / Neo4j / langgraph) | `evolution/future-extensions-roadmap.md` D4 (referenced as a future extension) | — |
+| `api/app/services/` | Business service layer | `pipeline/05-reranking-prompt-llm.md` §3.5 | — |
+| `api/app/tasks.py` | Celery task entry points | `overview/source-inventory.md` §3, `pipeline/01-loader-parser-chunking.md` §3.1 | `evolution/future-extensions-roadmap.md` D3 |
+
+## 2. Key files → documentation sections (fine-grained)
+
+### `api/app/core/rag/app/naive.py`
+
+| Line | Function / key code | Referencing documents |
+|---|---|---|
+| `:27 by_deepdoc()` | DeepDoc parsing path | `pipeline/01-loader-parser-chunking.md` §5.1 |
+| `:45 by_mineru()` | MinerU third-party parsing | Same as above, §5.2 |
+| `:65 by_textln()` | TextIn third-party parsing | Same as above, §5.3 |
+| `:257 naive.__call__()` | Main parsing entry point | Same as above, §3 |
+| `:508-738 chunk()` | 11-way if/elif dispatch that picks a parser by file extension | Same as above, §3, `evolution/architecture-refactor-suggestions.md` §0.2 #5 / #5 refactoring proposal |
+
+### `api/app/core/rag/llm/embedding_model.py`
+
+| Line | Class / function | Referencing documents |
+|---|---|---|
+| `:14-38 Base` | Embedding abstract base class (legacy) | `pipeline/02-embedding.md` §5.1 |
+| `:50-65 OpenAIEmbed.encode()` | OpenAI-compatible Embedding implementation | Same as above, §5.2, `evolution/architecture-refactor-suggestions.md` #1 / #4 / #9 |
+| `:138-143 QWenEmbed` | DashScope Embedding (with 5 explicit retries) | `pipeline/02-embedding.md` §3.2 |
+
+### `api/app/core/models/embedding.py`
+
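+| Line | Class / function | Referencing documents |
+|---|---|---|
+| `:9-23 RedBearEmbeddings.__init__` | LangChain unified wrapper initialization | `pipeline/02-embedding.md` §1.2 / §5.3 |
+| `:65-78 embed_documents()` | Document-side Embedding (includes the Volcano Engine multimodal branch) | Same as above, §2.1 |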
+
+### `api/app/core/rag/vdb/elasticsearch/elasticsearch_vector.py`
+
+| Line | Class / function | Referencing documents |
+|---|---|---|
+| `:29 ElasticSearchVector` | Main ES vector implementation | `pipeline/03-vdb-and-retrieval.md` §1 |
+| `:55-63 add_chunks()` | Vector ingestion | Same as above, §4, `pipeline/02-embedding.md` §2.1, `evolution/architecture-refactor-suggestions.md` #4 |
+| `:374-380 search_by_vector()` | Vector search | `pipeline/03-vdb-and-retrieval.md` §6, `pipeline/02-embedding.md` §2.2 |
+| `:468 search_by_full_text()` | BM25 search | `pipeline/03-vdb-and-retrieval.md` §5 |
+| `:560-607 rerank()` | ES-layer rerank | `pipeline/05-reranking-prompt-llm.md` §1.2 D, `evolution/architecture-refactor-suggestions.md` #3 |
+| `:653-658 dense_vector mapping` | dense_vector dimension is determined dynamically | `pipeline/02-embedding.md` §3.4, `pipeline/03-vdb-and-retrieval.md` §3 |
+| `:666 ElasticSearchVectorFactory` | Factory class | `overview/source-inventory.md`, `pipeline/03-vdb-and-retrieval.md` §1 |
+| `:685-707 ES config env vars` | 6 ES-related environment variables | `evolution/architecture-refactor-suggestions.md` §0.2 #2 |
+
+### `api/app/core/rag/nlp/search.py`
+
+| Line | Class / function | Referencing documents |
+|---|---|---|
+| `:36-147 knowledge_retrieval()` | Knowledge retrieval entry point (legacy channel) | `pipeline/05-reranking-prompt-llm.md` §1.2 |
+| `:284-343 rerank()` | Module-level rerank | Same as above |
+| `:349 Dealer` | BM25/Hybrid dispatcher | `pipeline/03-vdb-and-retrieval.md` §6, `overview/source-inventory.md` §1 |
+| `:365-373 get_vector()` | Calls `encode_queries` on the legacy Embedding interface | `pipeline/02-embedding.md` §2.4 |
+| `:387 search()` | Main search | `pipeline/03-vdb-and-retrieval.md` §6 |
+| `:439 FusionExpr("weighted_sum")` | Hard-coded 0.05/0.95 weights | `pipeline/03-vdb-and-retrieval.md` §6, `evolution/future-extensions-roadmap.md` D2 |
+| `:489-577 insert_citations()` | Citation backfill (embedding similarity matching) | `pipeline/05-reranking-prompt-llm.md` §4.1 |
+| `:579-604 _rank_feature_scores()` | tag TF-IDF + PageRank | `pipeline/05-reranking-prompt-llm.md` §1.2 A |
+| `:606-643 Dealer.rerank()` | Built-in hybrid rerank (fused score) | Same as above, `evolution/architecture-refactor-suggestions.md` #3 |
+| `:645-666 rerank_by_model()` | External model rerank | `pipeline/05-reranking-prompt-llm.md` §1.2 B |
+| `:674-768 retrieval()` | Main retrieval flow | Same as above, §1.3 |
+
+### `api/app/core/workflow/nodes/knowledge/node.py`
+
+| Line | Class / function | Referencing documents |
+|---|---|---|
+| `:12 import OpenAIEmbed` | Hard-coded import of the legacy Embedding class | `evolution/architecture-refactor-suggestions.md` #1 |
+| `:14 import ElasticSearchVectorFactory` | Bypasses the BaseVector abstraction | Same as above, §0.2 #1 / #2 refactoring proposal |
+| `:29 KnowledgeRetrievalNode` | Main Workflow node class | `pipeline/05-reranking-prompt-llm.md` §3.4 |
+| `:54 _extract_input()` | Renders the query template | Same as above |
+| `:108-155 KnowledgeRetrievalNode.rerank()` | Node-level rerank | Same as above, §1.2 C, `evolution/architecture-refactor-suggestions.md` #3 |
+| `:157-193 get_reranker_model()` | Queries the DB on every call | `evolution/architecture-refactor-suggestions.md` §0.2 #4 |
+| `:195-263 knowledge_retrieval()` | Retrieval branches (PARTICIPLE / SEMANTIC / HYBRID / Graph) | `pipeline/05-reranking-prompt-llm.md` §3.4, `pipeline/03-vdb-and-retrieval.md` |
+| `:236-271 HYBRID branch` | vector + full_text in parallel → dedup → rerank | Same as above |
+| `:284 rerank()` (module-level function) | One of the three rerank tracks | `evolution/architecture-refactor-suggestions.md` #3 |
+| `:303-378 execute()` | Node execution entry point | `pipeline/05-reranking-prompt-llm.md` §3.4 |
+| `:327 print(reranked_docs)` ⚠️ | Leftover debug output | `evolution/architecture-refactor-suggestions.md` #3 / #10 (hot-fix candidate) |
+
+### `api/app/core/rag/graphrag/`
+
+| Line | Class / function | Referencing documents |
+|---|---|---|
+| `general/index.py:36 run_graphrag()` | GraphRAG main entry point (doc level) | `pipeline/04-graphrag.md` §general (pending delivery) |
+| `general/index.py:122 run_graphrag_for_kb()` | KB level | Same as above |
+| `general/graph_extractor.py:34 GraphExtractor` | Microsoft-style extraction | Same as above |
+| `general/community_reports_extractor.py:37` | Community reports | Same as above |
+| `light/graph_extractor.py:31 GraphExtractor` | LightRAG-style extraction | Same as above, §light |
+| `entity_resolution.py:31 EntityResolution` | Entity resolution | Same as above |
+| `search.py:19 KGSearch` | Graph search | Same as above |
+| `utils.py:41 chat_limiter` | Trio rate limiting | `pipeline/02-embedding.md` §3.1, `evolution/architecture-refactor-suggestions.md` #9 |
+| `utils.py:115-134 get/set_embed_cache` | Redis Embedding cache | `pipeline/02-embedding.md` §3.3, `evolution/architecture-refactor-suggestions.md` #4 |
+| `utils.py:301-327 graph_node_to_chunk()` | Entity node → vector → ES | `pipeline/02-embedding.md` §2.3 |
+
+### `api/app/core/rag/llm/chat_model.py`
+
+| Line | Class / function | Referencing documents |
+|---|---|---|
+| `:52 Base` | LLM abstract base class | `pipeline/05-reranking-prompt-llm.md` §3.1 |
+| `:54-58 LLM_TIMEOUT_SECONDS / LLM_MAX_RETRIES` | Timeout and retries | Same as above, §3.3, `evolution/architecture-refactor-suggestions.md` §0.2 #2 |
+| `:122-150 _chat()` | Non-streaming LLM call | `pipeline/05-reranking-prompt-llm.md` §3.2 |
+| `:152-185 _chat_streamly()` | Streaming LLM call | Same as above |
+| `:251-303 chat_with_tools()` | Tool calling | Same as above, §3.4 |
+
+### `api/app/core/rag/prompts/`
+
+| File | Purpose | Referencing documents |
+|---|---|---|
+| `template.py:9 load_prompt()` | Loads .md templates at startup | `pipeline/05-reranking-prompt-llm.md` §2.1 |
+| `generator.py` | 20+ Prompt factory functions (citation/keyword/...) | Same as above |
+| `*.md` (31 templates) | Prompt content | `overview/source-inventory.md` |
+
+### `api/app/core/rag/common/settings.py`
+
+| Line | Key code | Referencing documents |
+|---|---|---|
+| `:9-10 retriever / kg_retriever` | Process-level singletons | `evolution/architecture-refactor-suggestions.md` §0.2 #4 |
+| `:13 init_settings()` | Side effect at module import | Same as above, `pipeline/03-vdb-and-retrieval.md` |
+| `:24` (trigger site) | — | `evolution/architecture-refactor-suggestions.md` #8 |
+
+### `api/app/services/draft_run_service.py`
+
+| Line | Key code | Referencing documents |
+|---|---|---|
+| `:195-263 create_knowledge_retrieval_tool()` | Knowledge retrieval tool | `pipeline/05-reranking-prompt-llm.md` §3.5 |
+| `:227-255` (chunk concatenation) | Chunks joined with `\n\n` | Same as above, §2.3 |
+| `:474-490 _filter_citations()` | Citation filtering + download links | Same as above, §4.2 |
+
+## 3. Currently identified "code leftovers and fix tasks"
+
+| # | File:line | Problem | Suggested fix | Related |
+|---|---|---|---|---|
+| 1 | `workflow/nodes/knowledge/node.py:327` | Leftover `print(reranked_docs)` debug output | File a hot-fix PR immediately to remove it | S3-T1 #10 + S3-T1 §3.1 |
+| 2 | `chat_model.py`, provider subclasses | base_url and auth headers are hard-coded | Introduce a Plugin Registry | S3-T1 #5 |
+| 3 | `naive.py:508-738 chunk()` | Hard-coded 11-way if/elif | Extract a `Parser` Protocol | S3-T1 #5 |
+| 4 | `elasticsearch_vector.py:55-63 add_chunks` | Synchronous loop, no concurrency | Switch to trio coroutines + a shared chat_limiter | S3-T1 #9 |
+| 5 | `nlp/search.py:439` | `weighted_sum` 0.05/0.95 is hard-coded | Inject via ctx.fusion_weights | S3-T2 D2 |
+| 6 | `rag_utils/` vs `rag/utils/` | Naming collision | Rename to `rag/chunk_analytics/` or merge | S1-T3 §4.1 |
+
+— **File Index · v1.0-RC1 · 2026-05-08** —
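Reviewer note on file-index.md: fix task #3 suggests extracting a `Parser` Protocol in place of the 11-way if/elif in `chunk()`. The following is a minimal sketch of extension-keyed dispatch, assuming parsers can be modelled as plain callables from bytes to text. `register`, `parse`, `parse_text`, and `parse_csv` are hypothetical names for illustration only; they do not reflect the actual signatures in `naive.py`, which this commit does not show.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical shape: a parser takes raw file bytes and returns text.
ParserFn = Callable[[bytes], str]
_REGISTRY: Dict[str, ParserFn] = {}


def register(*extensions: str) -> Callable[[ParserFn], ParserFn]:
    """Decorator that maps one or more file extensions to a parser."""
    def decorator(fn: ParserFn) -> ParserFn:
        for ext in extensions:
            _REGISTRY[ext.lower()] = fn
        return fn
    return decorator


@register(".txt", ".md")
def parse_text(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


@register(".csv")
def parse_csv(data: bytes) -> str:
    # Toy behaviour for the sketch: turn commas into tabs.
    return data.decode("utf-8", errors="replace").replace(",", "\t")


def parse(filename: str, data: bytes) -> str:
    """Look the parser up by extension instead of walking an if/elif chain."""
    ext = Path(filename).suffix.lower()
    try:
        parser = _REGISTRY[ext]
    except KeyError:
        raise ValueError(f"no parser registered for {ext!r}") from None
    return parser(data)


if __name__ == "__main__":
    print(parse("notes.md", b"# hello"))
```

With this shape, supporting a new format means adding one decorated function rather than editing the dispatch function, which is the property the fix task is after.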
--- a/docs/rag/_indexes/glossary.md
+++ b/docs/rag/_indexes/glossary.md
@@ -0,0 +1,198 @@
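+# MemoryBear RAG · Glossary of Key Terms
+
+> Merges terminology from the Sprint-1 / Sprint-2 / Sprint-3 documents, sorted alphabetically.
+> Each term lists its meaning, its location in the MemoryBear code, and the documents where it appears.
+
+## A
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **ASR** | Automatic Speech Recognition (speech-to-text). In MemoryBear it is invoked through `seq2txt_model.transcription` (QWenSeq2txt returns timestamps; GPTSeq2txt uses Whisper) | `rag/llm/sequence2txt_model.py:1-215` | S2-T1, S2-T5 |
+| **Autopilot** | An in-workspace automation agent triggered on a schedule or by events; corresponds to the `multica autopilot` command family | — | Platform mechanism (project SOP) |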
+
+## B
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **BaseVector** | VDB abstract base class (declares abstract methods only; `ElasticSearchVector` is currently the sole implementation) | `rag/vdb/vector_base.py:9` | S1-T3, S2-T3, S3-T1 |
+| **BM25** | Best Matching 25, the classic full-text ranking function; MemoryBear implements it with ES `query_string` + the IK analyzer | `rag/nlp/query.py`, `rag/vdb/elasticsearch/elasticsearch_vector.py:468 search_by_full_text` | S2-T3, S3-T2 |
+| **Boundaries** | The input/output/interface contract document for the 11 RAG stages (one of the S1-T2 deliverables) | — | S1-T2 |
+
+## C
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **Celery** | Task queue; MemoryBear uses it to dispatch asynchronous pipelines such as document parsing and GraphRAG construction | `tasks.py:212 parse_document`, `tasks.py:472 build_graphrag_for_kb`, `tasks.py:557 build_graphrag_for_document` | S1-T3, S2-T1, S2-T3, S3-T2 |
+| **chat_limiter** | A Trio CapacityLimiter that bounds the concurrency of entity/relation Embedding in GraphRAG; default 10 | `rag/graphrag/utils.py:41` | S2-T2, S3-T1 |
+| **Chunk** | The text fragment ultimately passed to Embedding, usually ≤ `chunk_token_num` (default 128–512) | `rag/models/chunk.py:17 DocumentChunk` | S2-T1, S2-T2, S2-T3 |
+| **chunk_token_num** | Maximum number of tokens in a single chunk | Set by the calling layer of `rag/app/naive.py` | S2-T1 |
+| **citation** | The `[ID:N]` reference marker inserted into answer text | `rag/nlp/search.py:489-577 Dealer.insert_citations` | S2-T5 |
+| **CLIP / BGE-VL / Jina-Clip** | Cross-modal Embedding models that map images and text into a shared semantic space | Not currently enabled; planned in S3-T2 D1 | S3-T2 |
+| **cl100k_base** | The BPE tokenizer used by the OpenAI GPT-4 family; MemoryBear uses it for token counting | `rag/common/token_utils.py` | S2-T1, S2-T2 |
+| **Cross-Encoder** | A Reranker paradigm: concatenate (query, doc), pass the pair through a single Encoder, and output a relevance score | Not trained in-house; only invoked via external rerank services (DashScope/Jina); planned in S3-T2 D5 | S2-T5, S3-T2 |
+
+## D
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **Dealer** | The `rag/nlp/search.py:349 Dealer` class, a BM25/hybrid search dispatcher; GraphRAG mainly uses this channel | `rag/nlp/search.py:349` | S1-T3, S2-T3, S2-T5, S3-T1 |
+| **deepdoc** | MemoryBear's multi-format parsing module, comprising parser (11 formats) + vision (OCR / layout recognition / TSR) | `rag/deepdoc/{parser,vision}` | S1-T3, S2-T1 |
+| **DocumentChunk** | Chunk data model | `rag/models/chunk.py:17` | S2-T1, S2-T2, S2-T3 |
+| **dense_vector** | ES vector field type; MemoryBear uses an HNSW index + cosine similarity | `elasticsearch_vector.py:653-658`, `rag/res/mapping.json` | S2-T2, S2-T3 |
+
+## E
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **E2E (End-to-End)** | The end-to-end call chain, covering the full sequence of document indexing + online retrieval + generation | `rag/app/`, `workflow/nodes/knowledge/`, `rag/llm/` | S2-T6 (pending delivery) |
+| **Embedder** | Abstract interface for Embedding models (the unified Protocol proposed in S3-T1) | Proposed: `app/core/rag/protocols/embedder.py` | S3-T1, S3-T2 |
+| **Embedding dual-track** | MemoryBear currently has two coexisting Embedding call paths: `RedBearEmbeddings` (LangChain, new) and `OpenAIEmbed/QWenEmbed/...` (legacy) | `rag/models/embedding.py` + `rag/llm/embedding_model.py` | S2-T2, S3-T1 |
+| **embed_cache** | Redis cache for entity/relation Embeddings in GraphRAG, TTL 24h | `rag/graphrag/utils.py:115-134` | S2-T2, S3-T1 |
+| **EMBEDDING_BATCH_SIZE** | Environment variable for the Embedding batch size (mentioned in the README but currently has no effect) | — | S2-T2, S3-T1 |
+| **Entity Resolution** | Entity disambiguation; one step of the GraphRAG indexing flow | `rag/graphrag/entity_resolution.py:31` | S1-T3 |
+| **ESConnection** | ES connection singleton | `rag/utils/es_conn.py` | S1-T3, S2-T3 |
+| **ElasticSearchVector** | Main VDB implementation; stores chunks as well as GraphRAG entity/relation and community_report records | `rag/vdb/elasticsearch/elasticsearch_vector.py:29` | S1-T3, S2-T3, S3-T1 |
+
+## F
+
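+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **FOLDER-type knowledge base** | A folder-style KB that contains sub-knowledge-bases; traversed recursively at retrieval time | `workflow/nodes/knowledge/node.py` | S1-T3 |
+| **FusionExpr** | The "weighted fusion" DSL used in ES retrieval; currently fixed at `0.05/0.95` (BM25:Vector) | `rag/nlp/search.py:439` | S2-T3, S3-T2 |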
+
+## G
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **GraphRAG (general)** | Microsoft GraphRAG style: the full pipeline (subgraphs → merge → PageRank → Leiden communities → community reports) | `rag/graphrag/general/index.py:36 run_graphrag` | S1-T2, S1-T3 |
+| **GraphRAG (light)** | LightRAG style: simplified entity/relation extraction with no community reports; shares most of its code with general | `rag/graphrag/light/graph_extractor.py:31` | S1-T2, S1-T3 |
+| **GraphStore** | Graph storage abstraction (proposed in S3-T2) | Proposed | S3-T2 |
+| **GraphAugmentedRetriever** | A Retriever implementation that layers KGSearch on top of Hybrid results | Proposed | S3-T1, S3-T2 |
+
+## H
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **HNSW** | Hierarchical Navigable Small World, a vector index algorithm; built into ES 8.x | ES cluster side | S2-T3 |
+| **HYBRID retrieval** | BM25 + vector in parallel → dedup → optional Rerank | `workflow/nodes/knowledge/node.py:236-271` | S2-T3, S2-T5 |
+| **HybridRetriever** | Protocol implementation of Hybrid retrieval (S3-T1 PoC) | Proposed | S3-T1 |
+
+## I
+
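+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **IK analyzer** | Chinese word segmenter, the ES IK plugin (`ik_max_word`) | ES cluster side | S2-T3 |
+| **init_settings()** | Module-level side effect that automatically creates the ES connection + retriever singletons at startup | `rag/common/settings.py:24` | S1-T3, S3-T1 |
+| **insert_citations** | Splits the answer into sentences, then backfills `[ID:N]` citations by embedding similarity | `rag/nlp/search.py:489-577` | S2-T5 |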
+
+## K
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **KGSearch** | GraphRAG retriever | `rag/graphrag/search.py:19` | S1-T3, S3-T2 |
+| **knowledge_graph_kwd** | ES field that distinguishes the graph record type (entity / relation / community_report) | `rag/vdb/elasticsearch/elasticsearch_vector.py` | S1-T3 |
+| **KnowledgeRetrievalNode** | Knowledge retrieval node in the Workflow engine | `workflow/nodes/knowledge/node.py:29` | S1-T3, S2-T5, S3-T1 |
+
+## L
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **LangChainAgent** | ReAct Agent built on `create_agent`, running a tool-calling loop | `agent/langchain_agent.py:26-641` | S2-T5 |
+| **Late-Interaction** | A retrieval paradigm (e.g. ColBERT) that replaces document-level vectors with token-level vectors and scores with MaxSim | Not currently enabled; planned in S3-T2 D2 | S3-T2 |
+| **Leiden algorithm** | Community detection algorithm; GraphRAG uses it to partition communities | `rag/graphrag/general/index.py` calls `graspologic.partition.leiden` | S1-T2, S1-T3 |
+| **LightRAG** | Lightweight GraphRAG variant with no community reports | `rag/graphrag/light/` | S1-T2, S1-T3 |
+| **LLM** | Large Language Model; MemoryBear calls it through `chat_model.py` and `langchain_agent.py` | `rag/llm/chat_model.py:52 Base` | S2-T5 |
+| **LO (LibreOffice)** | Fallback tool for converting PPT/PPTX to PDF | `rag/utils/libre_office.py` | S2-T1 |
+
+## M
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **MatchSparseExpr / Field.SPARSE_VECTOR** | Sparse-vector expression that is declared but not enabled (scaffolding for SPLADE integration) | `rag/utils/doc_store_conn.py:75`, `vdb/field.py:11` | S3-T2 |
+| **Memory (memory system)** | MemoryBear's conversational memory system: Ebbinghaus decay + ACT-R + Neo4j + langgraph read/write graphs | `core/memory/` (currently fully independent of `core/rag/`) | S3-T2 D4 |
+| **MemoryAugmentedRetriever** | D4 proposal: a Retriever wrapper that rewrites the query with long-term memory before retrieval | Proposed | S3-T2 D4 |
+| **mind_map_extractor** | Standalone mind-map extractor, outside the main GraphRAG path | `rag/graphrag/mind_map_extractor.py` | S1-T2 |
+| **MinerU** | Third-party PDF parsing service (external API) | `rag/deepdoc/parser/mineru_parser.py:41`, `rag/app/textin_parser.py` | S1-T3, S2-T1 |
+| **Multimodal Embedding** | Multimodal Embedding; in MemoryBear only Volcano Engine supports native multimodal input | The `_is_volcano` branch in `rag/models/embedding.py:65-78` | S2-T2, S3-T2 D1 |
+
+## N
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **naive_merge / hierarchical_merge / tree_merge** | Three Chunking merge strategies | `rag/nlp/__init__.py` | S2-T1 |
+| **Neo4j** | Graph database; the README declares it as a dependency, but `core/rag` currently makes zero calls to it (planned in S3-T2 D3) | — | S3-T2 |
+
+## O
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **OCR** | Two stages: text detection + recognition | `rag/deepdoc/vision/ocr.py:522 OCR.__call__:694` | S2-T1 |
+| **OpenAIEmbed / QWenEmbed / ...** | Legacy raw Embedding implementations, used by GraphRAG and Dealer | `rag/llm/embedding_model.py:14-65` | S2-T2, S3-T1 |
+| **OpenTelemetry (OTel)** | End-to-end tracing + metrics SDK; not yet adopted by MemoryBear (planned in S3-T1 #6) | Proposed | S3-T1 |
+
+## P
+
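+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **PageRank** | Graph node importance algorithm; GraphRAG uses it to score entities | `rag/graphrag/general/index.py` | S1-T2, S1-T3 |
+| **PARTICIPLE retrieval** | Keyword retrieval over segmented words (BM25) | `workflow/nodes/knowledge/node.py:195` | S2-T3 |
+| **Plugin Registry** | Parser/LLM Provider registration mechanism proposed in S3-T1 #5 to replace the 11-way if/elif in `naive.py` | Proposed | S3-T1 |
+| **Pydantic Settings** | Centralized configuration management framework proposed in S3-T1 #7 | Proposed | S3-T1 |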
+
+## R
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **rag_utils (note: not the same as `rag/utils`)** | Module for LLM analysis of chunk content (summaries / tags / insights / persona profiles); coupled to the Memory system | `api/app/core/rag_utils/` | S1-T3 |
+| **RAGAS** | Open-source RAG evaluation framework; not currently integrated in MemoryBear | Proposed | S3-T2 D5 |
+| **rank_feature** | Auxiliary ranking score in ES combining tag TF-IDF + PageRank | `rag/nlp/search.py:579-604` | S2-T5 |
+| **RedBearEmbeddings** | LangChain unified-wrapper Embedding class (new path) | `rag/models/embedding.py:9-23` | S2-T2 |
+| **RedBearRerank** | Reranker wrapped as a LangChain `BaseDocumentCompressor` | `rag/models/rerank.py:11-84` | S2-T5, S3-T2 |
+| **Rerank three-track** | (a) module-level `node.py:284 rerank()`; (b) node method `KnowledgeRetrievalNode.rerank()`; (c) fused ranking in `Dealer.rerank()` | `node.py:108-155, 284`, `nlp/search.py:606-643` | S2-T5, S3-T1 |
+| **Reranker** | Reranking Protocol (proposed in S3-T1) | Proposed | S3-T1, S3-T2 |
+| **retrieve_type** | Retrieval mode enum: PARTICIPLE / SEMANTIC / HYBRID / Graph | `schemas/chunk_schema.py` | S2-T3, S3-T2 |
+| **Retriever** | Retriever Protocol (proposed in S3-T1) | Proposed | S3-T1, S3-T2 |
+| **RouterRetriever** | Adaptive routing Retriever (proposed in S3-T2 D6) | Proposed | S3-T2 |
+| **RRF (Reciprocal Rank Fusion)** | Rank-fusion algorithm for multi-channel retrieval results; S3-T2 PoC-A proposes adopting it | Proposed | S3-T2 |
+
+## S
+
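+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **SEMANTIC retrieval** | Pure vector retrieval | `workflow/nodes/knowledge/node.py:195` | S2-T3 |
+| **Section** | The `(text, position_or_layout)` intermediate structure emitted by parsers; the "raw material" for Chunking | `rag/app/naive.py:257` | S2-T1 |
+| **SPLADE** | Learned sparse vectors; S3-T2 D2 proposes adopting it | Proposed (scaffolding already exists: `MatchSparseExpr`) | S3-T2 |
+| **structlog** | Structured logging library; S3-T1 #10 proposes it as a replacement for the existing unstructured `logger.*` calls | Proposed | S3-T1 |
+| **System Prompt assembly** | Three-part concatenation: user-defined system_prompt + skill Prompt + document image recognition instructions | `app_chat_service.py:77-96` | S2-T5 |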
+
+## T
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **TextIn** | Third-party PDF parsing API | `rag/app/textin_parser.py` | S1-T3 |
+| **Token** | A BPE token produced by cl100k_base encoding | `rag/common/token_utils.py` | S2-T1, S2-T2 |
+| **tokenize_chunks_with_images** | Chunk processing that carries images along | `rag/nlp/__init__.py` | S2-T1 |
+| **TSR** | Table Structure Recognition: reconstructs the rows, columns, and merged cells of complex tables | `rag/deepdoc/vision/table_structure_recognizer.py:15` | S2-T1 |
+
+## V
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **VDB (Vector Database)** | Vector database; MemoryBear's only current implementation is Elasticsearch 8.x | `rag/vdb/elasticsearch/` | S2-T3 |
+| **VectorBase** | See BaseVector | `rag/vdb/vector_base.py:9` | — |
+| **VLM** | Vision-Language Model; image understanding (CV model) | `rag/llm/cv_model.py` | S2-T1 |
+
+## W
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **weighted_sum (0.05, 0.95)** | Fixed weights for ES-layer Hybrid retrieval (BM25:Vector) | `rag/nlp/search.py:439` | S2-T3, S3-T2 |
+| **Workflow Knowledge Node** | See KnowledgeRetrievalNode | `workflow/nodes/knowledge/node.py:29` | S1-T3, S2-T5 |
+
+## X
+
+| Term | Meaning | Code location | Appears in |
+|---|---|---|---|
+| **xxhash** | Fast hash function; used to generate keys for the GraphRAG embed_cache | `rag/graphrag/utils.py:115-134` | S2-T2 |
+
+— **Glossary · v1.0-RC1 · 81 terms in total · 2026-05-08** —
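Reviewer note on glossary.md: the FusionExpr / weighted_sum entries describe a fixed 0.05/0.95 (BM25:Vector) blend, and the RRF entry names the rank-based alternative proposed for PoC-A. The sketch below contrasts the two under stated assumptions: both channels are assumed to return per-document scores on comparable scales, and `k = 60` is a commonly used RRF constant rather than a value taken from this commit. `weighted_sum` and `rrf` here are hypothetical standalone functions, not the `FusionExpr` implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_sum(bm25: Dict[str, float], vector: Dict[str, float],
                 w_bm25: float = 0.05, w_vector: float = 0.95) -> Dict[str, float]:
    """Score-based fusion with injectable weights.

    A document missing from one channel contributes 0 for that channel.
    """
    doc_ids = set(bm25) | set(vector)
    return {
        d: w_bm25 * bm25.get(d, 0.0) + w_vector * vector.get(d, 0.0)
        for d in doc_ids
    }


def rrf(rankings: Sequence[List[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranked list.

    Only rank positions matter, so the channels' raw score scales never
    need to be reconciled.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


if __name__ == "__main__":
    fused = weighted_sum({"a": 1.0}, {"a": 0.0, "b": 1.0})
    print(sorted(fused.items()))
    print(rrf([["a", "b"], ["a"]]))
```

Taking the weights as parameters mirrors the "inject rather than hard-code" direction recorded for `nlp/search.py:439`; RRF avoids weights altogether at the cost of discarding score magnitudes.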