Multica PM Agent 343a5eebe3

docs(rag): add MemoryBear RAG implementation docs v1.0

Submit the formed RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing + total index (responsibility matrix,
  review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks hit). The legacy docs/ ignore in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>

2026-05-09 10:51:48 +08:00


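title	author	reviewer	source-commit	last-reviewed-at	scope	version	status
[S2-T4] GraphRAG (light + general) Implementation Deep Dive: Official Release	Python Development Engineer	Knowledge Operations and Governance Expert	`feae2f2e` (MemoryBear)	2026-05-08	api/app/core/rag/graphrag/ (including the light/ and general/ subdirectories)	v1.0	Official release (placeholder removed)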

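[S2-T4] GraphRAG (light + general) Implementation Deep Dive: Official Release

This document is an official part of the WS-24 v1.0 documentation set and replaces the placeholder version shipped in v1.0-RC1. For the original full document and the section-by-section review, see WS-18 and the WS-21 §S2-T4 review report.

1. Positioning in One Sentence

GraphRAG is the knowledge-graph-enhanced retrieval module of the MemoryBear knowledge base system. It uses an LLM to extract entity-relation triples from documents and build a knowledge graph; at retrieval time it uses the graph structure (entity links, community reports, multi-hop paths) to cover the semantic blind spots of plain vector retrieval, giving a hybrid recall of "structured knowledge + semantic vectors".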

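2. Review Results

Dimension	Max score	Score	Key notes
Accuracy	25	24	5/5 spot checks hit: `run_graphrag` / the ternary expression that selects the extractor / `is_similarity` / `KGSearch.retrieval` / Leiden `run()`
Completeness	25	24	12 sections plus an appendix index: 11 glossary entries, sequence diagrams for both Light and General, 5 source-code walkthroughs, 4 core prompts explained section by section
Timeliness	15	13	Metadata table is complete; YAML frontmatter is missing (known Sprint-2 carry-over)
Readability	15	14	Mermaid sequence diagrams are well-formed; the three Light/General comparison tables are clear at a glance; the line-by-line notes on prompt design intent are excellent
Actionability	20	18	The parser_config entry point is clear; the three parameter tables are complete; the resource estimates (Light 5-15 min / General 30-60 min) are verifiable
Total	100	93	PASS (benchmark)

Verdict: joint Sprint-2 benchmark alongside [S2-T3]. No Must-Fix items; the 7 Nice-to-Have items are deferred to [S3-T3], to be handled together during integration.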

3. Module Structure

api/app/core/rag/graphrag/
├── search.py                          # KGSearch: graph retrieval entry point
├── entity_resolution.py               # Entity resolution (LLM + edit distance)
├── entity_resolution_prompt.py        # Entity resolution prompt
├── query_analyze_prompt.py            # Query analysis prompt (MiniRAG style)
├── utils.py                           # Graph utilities (merge, cache, ES read/write)
├── __init__.py
├── light/
│   ├── graph_extractor.py             # Light entity/relation extractor
│   └── graph_prompt.py                # Light extraction prompt + RAG answer prompt
└── general/
    ├── extractor.py                   # Common extractor base class
    ├── graph_extractor.py             # General entity/relation extractor
    ├── graph_prompt.py                # General extraction prompt
    ├── index.py                       # Graph build orchestrator (subgraph generation → merge → resolution → community reports)
    ├── entity_embedding.py            # Node2Vec entity embedding (backup)
    ├── leiden.py                      # Wrapper around the Leiden community detection algorithm
    ├── community_reports_extractor.py # Community report extractor
    ├── community_report_prompt.py     # Community report generation prompt
    ├── mind_map_extractor.py          # Mind map extractor
    └── mind_map_prompt.py             # Mind map prompt

4. Core Sequence Diagrams

4.1 Graph Build Sequence Diagram

sequenceDiagram
    participant U as User/Task
    participant T as tasks.py (Celery Task)
    participant I as general/index.py run_graphrag
    participant E as light/general GraphExtractor
    participant ES as Elasticsearch
    participant ER as entity_resolution.py
    participant CR as community_reports_extractor.py

    U->>T: Upload documents / trigger graph build
    T->>I: run_graphrag_for_kb(document_ids, parser_config)
    I->>I: load_doc_chunks() merges chunks up to 1024 tokens
    loop Each document in parallel (max 4)
        I->>E: generate_subgraph(extractor, chunks)
        E->>E: LLM extracts entities + relations (multi-round gleaning)
        E->>E: Parse output → nx.Graph
        E->>ES: Write subgraph (knowledge_graph_kwd="subgraph")
    end
    I->>I: merge_subgraph() merges subgraphs into the global graph one document at a time
    I->>ES: Write global graph (knowledge_graph_kwd="graph")
    I->>ES: Write entity/relation chunks (with vector embeddings)

    alt with_resolution=true (optional, General)
        I->>ER: resolve_entities(graph, subgraph_nodes)
        ER->>ER: Prefilter candidate pairs by edit distance
        ER->>ER: LLM batch-judges "same entity or not"
        ER->>ER: Merge the nodes within each connected component
        ER->>ER: Recompute PageRank
        ER->>ES: Update graph/entity/relation
    end

    alt with_community=true (optional, General)
        I->>CR: extract_community(graph)
        CR->>CR: Leiden community detection
        CR->>CR: LLM generates a report for each community
        CR->>ES: Write community_report chunks
    end
    I-->>T: Return {ok_documents, failed_documents, seconds}
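
The `load_doc_chunks()` step in the diagram merges chunks up to a 1024-token budget before extraction. The following is a minimal sketch of one way such a greedy merge could work, not the MemoryBear implementation: the function name `merge_chunks_by_token_budget`, the whitespace-based token counter, and the newline joiner are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def merge_chunks_by_token_budget(
    chunks: Iterable[str],
    max_tokens: int = 1024,
    # Assumption: whitespace splitting stands in for a real tokenizer.
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Greedily pack consecutive chunks into groups of at most max_tokens.

    A single chunk larger than the budget is kept on its own rather than split.
    """
    merged: List[str] = []
    buffer: List[str] = []
    buffer_tokens = 0

    for chunk in chunks:
        n = count_tokens(chunk)
        # Flush the current group if adding this chunk would exceed the budget.
        if buffer and buffer_tokens + n > max_tokens:
            merged.append("\n".join(buffer))
            buffer, buffer_tokens = [], 0
        buffer.append(chunk)
        buffer_tokens += n

    if buffer:
        merged.append("\n".join(buffer))
    return merged


if __name__ == "__main__":
    groups = merge_chunks_by_token_budget(["a b c", "d e", "f g h i"], max_tokens=5)
    print(groups)
```

Larger merged inputs mean fewer LLM extraction calls per document, at the cost of longer prompts; the budget is the knob that trades one for the other.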

4.2 Graph Query Sequence Diagram

sequenceDiagram
    participant U as User Query
    participant S as search.py KGSearch.retrieval()
    participant QP as query_analyze_prompt.py minirag_query2kwd
    participant ES as Elasticsearch
    participant LLM as LLM

    U->>S: retrieval(question, workspace_ids, kb_ids, ...)
    S->>LLM: query_rewrite() PROMPTS["minirag_query2kwd"]
    LLM-->>S: {answer_type_keywords, entities_from_query}

    par Three recall paths in parallel
        S->>ES: get_relevant_ents_by_keywords() vector similarity search over entities
        ES-->>S: Candidate entities + sim + pagerank + n_hop
        S->>ES: get_relevant_ents_by_types() filters entities by type
        ES-->>S: Type-matched entities
        S->>ES: get_relevant_relations_by_txt() vector similarity search over relations
        ES-->>S: Candidate relations
    end

    S->>S: Compute n-hop path weight decay, sim / (2 + hop_depth)
    S->>S: Rank entities by sim × pagerank
    S->>S: Truncate to the token budget (max_token decremented)

    alt Community report recall
        S->>ES: _community_retrieval_() matches community_report by entities_kwd
        ES-->>S: Community report text
    end

    S-->>U: {page_content: Entities + Relations + Community Reports, metadata, vector: None}
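
The three scoring steps in the diagram (n-hop decay `sim / (2 + hop_depth)`, ranking by `sim × pagerank`, and token-budget truncation) can be sketched as below. This is an illustration under stated assumptions, not the code in `search.py`: the `EntityHit` record, the function names, the word-count token estimate, and the convention that `hop_depth` starts at 0 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EntityHit:
    """Hypothetical record for one recalled entity."""
    name: str
    sim: float        # vector similarity to the query
    pagerank: float   # PageRank of the node in the global graph
    description: str = ""


def nhop_weight(sim: float, hop_depth: int) -> float:
    """Decay a similarity score along an n-hop path: sim / (2 + hop_depth)."""
    return sim / (2 + hop_depth)


def rank_entities(hits: List[EntityHit]) -> List[EntityHit]:
    """Order entities by sim * pagerank, highest first."""
    return sorted(hits, key=lambda h: h.sim * h.pagerank, reverse=True)


def truncate_to_budget(
    hits: List[EntityHit],
    max_token: int,
    # Assumption: word count stands in for a real tokenizer.
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[EntityHit]:
    """Keep ranked hits until the remaining token budget is exhausted."""
    kept: List[EntityHit] = []
    remaining = max_token
    for hit in hits:
        cost = count_tokens(f"{hit.name} {hit.description}")
        if cost > remaining:
            break
        kept.append(hit)
        remaining -= cost
    return kept


if __name__ == "__main__":
    hits = [
        EntityHit("A", sim=0.9, pagerank=0.1, description="one two three"),
        EntityHit("B", sim=0.5, pagerank=0.5, description="one two"),
        EntityHit("C", sim=0.8, pagerank=0.2, description="one"),
    ]
    ranked = rank_entities(hits)
    print([h.name for h in ranked])
    print([h.name for h in truncate_to_budget(ranked, max_token=5)])
```

Multiplying similarity by PageRank favors entities that are both relevant to the query and central in the graph, while the decay keeps distant neighbors from outranking direct matches.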

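5. Light vs General Differences

Dimension	Light	General
Entity extraction prompt	LightRAG style, includes content_keywords	MS GraphRAG style, more concise
Gleaning termination	Natural-language yes/no	Forced single character Y (logit_bias)
Entity resolution	❌ Not supported	✅ Supported
Community detection	❌ Not supported	✅ Leiden algorithm
Community reports	❌ Not supported	✅ LLM-generated reports
Entity embedding	Entity-name vectors only	Node2Vec supported (backup)
Mind map	❌ Not supported	✅ Supported
Graph build time	~5-15 minutes	~30-60 minutes
Applicable scale	< 1K documents	> 1K documents

Switch condition: General is enabled when parser_config["graphrag"]["method"] == "general"; otherwise Light is used by default.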
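
A minimal sketch of the switch condition follows. Only the `parser_config["graphrag"]["method"]` check comes from this document; the helper names are hypothetical, and reading `with_resolution` and `with_community` from the same `graphrag` section is an assumption (the build diagram names the flags but not where they live).

```python
from typing import Any, Dict, List, Optional


def select_graphrag_method(parser_config: Optional[Dict[str, Any]]) -> str:
    """Return "general" only when explicitly configured, else default to "light"."""
    graphrag_cfg = (parser_config or {}).get("graphrag") or {}
    return "general" if graphrag_cfg.get("method") == "general" else "light"


def plan_build_stages(parser_config: Optional[Dict[str, Any]]) -> List[str]:
    """List the build stages implied by the config (stage names are illustrative)."""
    graphrag_cfg = (parser_config or {}).get("graphrag") or {}
    stages = ["extract_subgraphs", "merge_subgraphs"]
    if select_graphrag_method(parser_config) == "general":
        # Assumption: both optional flags sit next to "method" in the config.
        if graphrag_cfg.get("with_resolution"):
            stages.append("entity_resolution")
        if graphrag_cfg.get("with_community"):
            stages.append("community_reports")
    return stages


if __name__ == "__main__":
    config = {"graphrag": {"method": "general", "with_community": True}}
    print(select_graphrag_method(config))
    print(plan_build_stages(config))
```

Defaulting to Light on any missing or unrecognized value keeps the cheaper path as the fallback, which matches the "otherwise Light" wording above.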

6. Key Source Code Quick Reference

Function	File	Key class/function	Lines
Graph build orchestrator	`general/index.py`	`run_graphrag()`	36-119
KB-level batch graph build	`general/index.py`	`run_graphrag_for_kb()`	122-330
Subgraph generation	`general/index.py`	`generate_subgraph()`	333-406
Light entity extraction	`light/graph_extractor.py`	`GraphExtractor._process_single_content()`	74-131
General entity extraction	`general/graph_extractor.py`	`GraphExtractor._process_single_content()`	100-150
Entity resolution	`entity_resolution.py`	`EntityResolution.__call__()`	53-141
Similarity prefilter	`entity_resolution.py`	`EntityResolution.is_similarity()`	225-239
Community detection	`general/leiden.py`	`run()`	95-141
Community report extraction	`general/community_reports_extractor.py`	`CommunityReportsExtractor.__call__()`	55-158
Graph retrieval	`search.py`	`KGSearch.retrieval()`	130-280
Query rewrite	`search.py`	`KGSearch.query_rewrite()`	33-55
Graph merge utility	`utils.py`	`graph_merge()`	199-229
Entity to chunk	`utils.py`	`graph_node_to_chunk()`	301-327
Relation to chunk	`utils.py`	`graph_edge_to_chunk()`	352-378

For the full source-code walkthrough, section-by-section prompt explanations, ES storage design, configuration parameter tables, boundary conditions, and monitoring metrics, see the original WS-18 deliverable.
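
The similarity prefilter listed in the table narrows the candidate pairs by edit distance before any LLM call. The sketch below shows the general technique only and is not the logic of `EntityResolution.is_similarity()`: the function names and the threshold (distance at most half the shorter name's length) are assumptions chosen for illustration.

```python
from itertools import combinations
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def candidate_pairs(names: List[str], max_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Return name pairs close enough to be worth an LLM "same entity?" check.

    Assumption: a pair qualifies when its case-insensitive edit distance is at
    most max_ratio times the length of the shorter name.
    """
    pairs = []
    for a, b in combinations(names, 2):
        limit = min(len(a), len(b)) * max_ratio
        if edit_distance(a.lower(), b.lower()) <= limit:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    print(candidate_pairs(["OpenAI", "Open AI", "Elasticsearch"]))
```

A cheap string filter like this keeps the number of LLM judgments far below the quadratic number of entity pairs, which is the point of prefiltering.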

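7. Cross-Document Consistency

Consistent with [S2-T2] on the description of the GraphRAG entity embedding cache (Redis + xxhash) ✅
Consistent with [S2-T3] on the design for multiple record types coexisting in ES (knowledge_graph_kwd distinguishes 6 types) ✅
Consistent with [S2-T5] on the description of merging GraphRAG retrieval results into vector recall ✅
Aligned with the GraphRAG branch in the [S2-T6] E2E sequence diagram ✅

This document is a component of the official MemoryBear RAG Docs v1.0 release. For the full review and source-code walkthrough, see the WS-18 comment history.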
