docs(rag): add MemoryBear RAG implementation docs v1.0
Submit the formed RAG documentation set produced across Sprint-1/2/3 (WS-12 through WS-26) under docs/rag/.

Includes:
- README.md / INDEX.md: landing + total index (responsibility matrix, review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams), 11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding, VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source retained as reference
- evolution/: 11 architecture-refactor proposals, 6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention, ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot checks hit).

The legacy docs/ ignore in .gitignore is narrowed to docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
200
docs/rag/graphrag/README.md
Normal file
@@ -0,0 +1,200 @@
---
title: "[S2-T4] GraphRAG (light + general) Implementation Guide: Official Release"
author: Python Development Engineer
reviewer: Knowledge Operations and Governance Expert
source-commit: feae2f2e (MemoryBear)
last-reviewed-at: 2026-05-08
scope: api/app/core/rag/graphrag/ (including the light/ and general/ subdirectories)
version: v1.0
status: Official release (placeholder removed)
---

# [S2-T4] GraphRAG (light + general) Implementation Guide: Official Release

> This document is an official component of the [WS-24](mention://issue/a07f108d-06ee-41b8-8b57-22455f60ddeb) v1.0 documentation set and replaces the placeholder version shipped in v1.0-RC1.
> For the original full document and the section-by-section review, see [WS-18](mention://issue/16bdb196-e10e-489b-b01c-9067b1f1bb23) and the §S2-T4 review report in [WS-21](mention://issue/41f2482b-6f3e-4253-95f7-3e22e790f31c).

---

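## 1. One-Sentence Positioning

GraphRAG is the **knowledge-graph-enhanced retrieval module** of the MemoryBear knowledge base system. It uses an LLM to extract entity-relation triples from documents and build a knowledge graph. At retrieval time it uses the graph structure (entity associations, community reports, multi-hop paths) to cover the semantic blind spots of conventional vector retrieval, giving a hybrid recall of "structured knowledge + semantic vectors".

---
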
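## 2. Review Results

| Dimension | Max | Score | Key notes |
|---|---:|---:|---|
| Accuracy | 25 | 24 | Spot checks 5/5 hit: `run_graphrag` / ternary extractor selection / `is_similarity` / `KGSearch.retrieval` / Leiden `run()` |
| Completeness | 25 | 24 | 12 chapters + appendix index: 11-entry glossary, Light/General dual sequence diagrams, 5 source-code walkthroughs, paragraph-by-paragraph readings of 4 core prompts |
| Timeliness | 15 | 13 | Metadata table complete; YAML frontmatter missing (known Sprint-2 carry-over) |
| Readability | 15 | 14 | Well-formed Mermaid sequence diagrams; three Light/General comparison tables that read at a glance; excellent line-by-line write-up of prompt design intent |
| Actionability | 20 | 18 | Clear parser_config entry point; three complete parameter tables; verifiable resource estimates (Light 5-15 min / General 30-60 min) |
| **Total** | **100** | **93** | **PASS (benchmark)** |

**Verdict:** Tied with [S2-T3] as one of the two Sprint-2 **benchmark documents**. No Must-Fix items; 7 Nice-to-Have items are deferred to [S3-T3] for consolidated handling during integration.

---
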
## 3. Module Structure

```
api/app/core/rag/graphrag/
├── search.py                           # KGSearch: graph retrieval entry point
├── entity_resolution.py                # Entity resolution (LLM + edit distance)
├── entity_resolution_prompt.py         # Entity resolution prompt
├── query_analyze_prompt.py             # Query analysis prompt (MiniRAG style)
├── utils.py                            # Graph utilities (merge, cache, ES read/write)
├── __init__.py
├── light/
│   ├── graph_extractor.py              # Light entity/relation extractor
│   └── graph_prompt.py                 # Light extraction prompt + RAG answer prompt
└── general/
    ├── extractor.py                    # Shared extractor base class
    ├── graph_extractor.py              # General entity/relation extractor
    ├── graph_prompt.py                 # General extraction prompt
    ├── index.py                        # Graph build controller (subgraph generation → merge → resolution → community reports)
    ├── entity_embedding.py             # Node2Vec entity embedding (standby)
    ├── leiden.py                       # Leiden community detection wrapper
    ├── community_reports_extractor.py  # Community report extractor
    ├── community_report_prompt.py      # Community report generation prompt
    ├── mind_map_extractor.py           # Mind map extractor
    └── mind_map_prompt.py              # Mind map prompt
```

---

## 4. Core Sequence Diagrams

### 4.1 Graph Construction Sequence

```mermaid
sequenceDiagram
    participant U as User/Task
    participant T as tasks.py (Celery Task)
    participant I as general/index.py run_graphrag
    participant E as light/general GraphExtractor
    participant ES as Elasticsearch
    participant ER as entity_resolution.py
    participant CR as community_reports_extractor.py

    U->>T: Upload document / trigger graph build
    T->>I: run_graphrag_for_kb(document_ids, parser_config)
    I->>I: load_doc_chunks() merges chunks up to 1024 tokens
    loop Each document in parallel (max 4)
        I->>E: generate_subgraph(extractor, chunks)
        E->>E: LLM extracts entities + relations (multi-round gleaning)
        E->>E: Parse output → nx.Graph
        E->>ES: Write subgraph (knowledge_graph_kwd="subgraph")
    end
    I->>I: merge_subgraph() merges each document's subgraph into the global graph
    I->>ES: Write global graph (knowledge_graph_kwd="graph")
    I->>ES: Write entity/relation chunks (with vector embeddings)

    alt with_resolution=true (General, optional)
        I->>ER: resolve_entities(graph, subgraph_nodes)
        ER->>ER: Pre-filter candidate pairs by edit distance
        ER->>ER: LLM batch-judges "same entity or not"
        ER->>ER: Merge the nodes in each connected component
        ER->>ER: Recompute PageRank
        ER->>ES: Update graph/entity/relation
    end

    alt with_community=true (General, optional)
        I->>CR: extract_community(graph)
        CR->>CR: Leiden community detection
        CR->>CR: LLM generates a report for each community
        CR->>ES: Write community_report chunks
    end
    I-->>T: Return {ok_documents, failed_documents, seconds}
```

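The entity-resolution branch of the diagram (edit-distance pre-filter, LLM judgment, connected-component merge) can be sketched roughly as follows. This is a minimal illustration, not the code in `entity_resolution.py`: the helper names, the `max_ratio` threshold, and the sample entity names are all hypothetical, and the LLM judgment step is replaced by accepting every candidate pair.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def candidate_pairs(names, max_ratio=0.5):
    """Pre-filter: keep pairs whose distance is small relative to the shorter name.

    max_ratio is a hypothetical threshold, not the rule used by is_similarity().
    """
    pairs = []
    for a, b in combinations(sorted(names), 2):
        if edit_distance(a.lower(), b.lower()) <= min(len(a), len(b)) * max_ratio:
            pairs.append((a, b))
    return pairs


def merge_groups(names, same_entity_pairs):
    """Union-find over confirmed pairs; each connected component is one entity."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_entity_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    # Only components with more than one member need merging.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


names = ["OpenAI", "Open AI", "Elasticsearch", "ElasticSearch", "Leiden"]
pairs = candidate_pairs(names)
# In the real pipeline an LLM confirms or rejects each pair; here all are accepted.
groups = merge_groups(names, pairs)
```
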
### 4.2 Graph Query Sequence

```mermaid
sequenceDiagram
    participant U as User Query
    participant S as search.py KGSearch.retrieval()
    participant QP as query_analyze_prompt.py minirag_query2kwd
    participant ES as Elasticsearch
    participant LLM as LLM

    U->>S: retrieval(question, workspace_ids, kb_ids, ...)
    S->>LLM: query_rewrite() PROMPTS["minirag_query2kwd"]
    LLM-->>S: {answer_type_keywords, entities_from_query}

    par Three recall paths in parallel
        S->>ES: get_relevant_ents_by_keywords() vector similarity search over entities
        ES-->>S: Candidate entity list + sim + pagerank + n_hop
        S->>ES: get_relevant_ents_by_types() filter entities by type
        ES-->>S: Type-matched entity list
        S->>ES: get_relevant_relations_by_txt() vector similarity search over relations
        ES-->>S: Candidate relation list
    end

    S->>S: Apply n-hop path weight decay sim / (2 + hop_depth)
    S->>S: Rank entities by sim × pagerank
    S->>S: Token budget truncation (decrementing max_token)

    alt Community report recall
        S->>ES: _community_retrieval_() matches community_report by entities_kwd
        ES-->>S: Community report text
    end

    S-->>U: {page_content: Entities + Relations + Community Reports, metadata, vector: None}
```

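The three scoring steps at the end of the diagram (n-hop decay, `sim × pagerank` ranking, token-budget truncation) can be sketched as below. Only the two formulas come from the diagram. The `Entity` dataclass, the whitespace word count standing in for a tokenizer, and the stop-at-first-overflow truncation rule are assumptions made for illustration, not the behavior of `KGSearch.retrieval()`.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    sim: float        # vector similarity to the query
    pagerank: float   # PageRank of the node in the graph
    description: str = ""


def decay_path_weight(sim: float, hop_depth: int) -> float:
    """n-hop decay from the diagram: sim / (2 + hop_depth)."""
    return sim / (2 + hop_depth)


def rank_entities(entities):
    """Order entities by sim * pagerank, highest first."""
    return sorted(entities, key=lambda e: e.sim * e.pagerank, reverse=True)


def count_tokens(text: str) -> int:
    """Hypothetical tokenizer stand-in: whitespace-separated words."""
    return len(text.split())


def truncate_to_budget(entities, max_token: int):
    """Keep ranked entities, decrementing max_token, until the next one no longer fits."""
    kept = []
    for e in entities:
        cost = count_tokens(f"{e.name} {e.description}")
        if cost > max_token:
            break
        kept.append(e)
        max_token -= cost
    return kept, max_token


entities = [
    Entity("Leiden", sim=0.9, pagerank=0.2, description="community detection algorithm"),
    Entity("PageRank", sim=0.6, pagerank=0.5, description="node importance score"),
    Entity("Node2Vec", sim=0.8, pagerank=0.1, description="graph embedding method"),
]
ranked = rank_entities(entities)
kept, remaining = truncate_to_budget(ranked, max_token=8)
```
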
---

## 5. Light vs General Differences

| Dimension | Light | General |
|---|---|---|
| Entity extraction prompt | LightRAG style, includes content_keywords | MS GraphRAG style, more concise |
| Gleaning termination | Natural-language yes/no | Forced single character Y (logit_bias) |
| Entity resolution | ❌ Not supported | ✅ Supported |
| Community detection | ❌ Not supported | ✅ Leiden algorithm |
| Community reports | ❌ Not supported | ✅ LLM-generated reports |
| Entity embedding | Entity-name vectors only | Node2Vec supported (standby) |
| Mind map | ❌ Not supported | ✅ Supported |
| Graph build time | ~5-15 minutes | ~30-60 minutes |
| Suitable scale | < 1K documents | > 1K documents |

**Switching condition:** General is enabled when `parser_config["graphrag"]["method"] == "general"`; otherwise Light is used by default.

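As a minimal sketch of the switching condition, the function below returns the method name for a given `parser_config`. The function name `select_graphrag_method` is hypothetical, and treating a missing or empty `graphrag` key as Light is an assumption that follows from "Light by default"; the actual selection code in `general/index.py` may be structured differently.

```python
def select_graphrag_method(parser_config: dict) -> str:
    """Return "general" only when explicitly configured; otherwise fall back to "light"."""
    # Assumption: a missing or null "graphrag" section means the default (Light).
    graphrag_cfg = parser_config.get("graphrag") or {}
    return "general" if graphrag_cfg.get("method") == "general" else "light"


general_cfg = {"graphrag": {"method": "general"}}
light_cfg = {"graphrag": {"method": "light"}}
empty_cfg = {}

methods = [select_graphrag_method(c) for c in (general_cfg, light_cfg, empty_cfg)]
```
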
---

## 6. Key Source Index Quick Reference

| Function | File | Key class/function | Lines |
|---|---|---|---|
| Graph build controller | `general/index.py` | `run_graphrag()` | 36-119 |
| KB-level batch graph build | `general/index.py` | `run_graphrag_for_kb()` | 122-330 |
| Subgraph generation | `general/index.py` | `generate_subgraph()` | 333-406 |
| Light entity extraction | `light/graph_extractor.py` | `GraphExtractor._process_single_content()` | 74-131 |
| General entity extraction | `general/graph_extractor.py` | `GraphExtractor._process_single_content()` | 100-150 |
| Entity resolution | `entity_resolution.py` | `EntityResolution.__call__()` | 53-141 |
| Similarity pre-filter | `entity_resolution.py` | `EntityResolution.is_similarity()` | 225-239 |
| Community detection | `general/leiden.py` | `run()` | 95-141 |
| Community report extraction | `general/community_reports_extractor.py` | `CommunityReportsExtractor.__call__()` | 55-158 |
| Graph retrieval | `search.py` | `KGSearch.retrieval()` | 130-280 |
| Query rewrite | `search.py` | `KGSearch.query_rewrite()` | 33-55 |
| Graph merge utility | `utils.py` | `graph_merge()` | 199-229 |
| Entity to chunk | `utils.py` | `graph_node_to_chunk()` | 301-327 |
| Relation to chunk | `utils.py` | `graph_edge_to_chunk()` | 352-378 |

For the complete source walkthrough, paragraph-by-paragraph prompt readings, ES storage design, configuration parameter tables, boundary conditions, and monitoring metrics, see the original deliverable in [WS-18](mention://issue/16bdb196-e10e-489b-b01c-9067b1f1bb23).

---

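## 7. Cross-Document Consistency

- Consistent with [S2-T2] on the GraphRAG entity embedding cache (Redis + xxhash) ✅
- Consistent with [S2-T3] on the design of multiple chunk types coexisting in ES (`knowledge_graph_kwd` distinguishes 6 types) ✅
- Consistent with [S2-T5] on merging GraphRAG retrieval results into vector recall ✅
- Aligned with the GraphRAG branch of the [S2-T6] E2E sequence diagram ✅

---

*This document is a component of the official MemoryBear RAG Docs v1.0 release. For the full review and source walkthrough, see the comment history of [WS-18](mention://issue/16bdb196-e10e-489b-b01c-9067b1f1bb23).*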