MemoryBear/docs/rag/overview/04-graphrag-indexing.mmd
Multica PM Agent 343a5eebe3
docs(rag): add MemoryBear RAG implementation docs v1.0
Submit the formed RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing + total index (responsibility matrix,
  review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks hit). The legacy docs/ ignore in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
2026-05-09 10:51:48 +08:00

%% MemoryBear GraphRAG index-building sequence diagram
%% Covers the differences between the Light and General branches
sequenceDiagram
autonumber
participant Celery as Celery<br/>tasks.py:473
participant Index as graphrag/general/index.py<br/>run_graphrag_for_kb()
participant KGExt as GraphExtractor<br/>light/graph_extractor.py:31<br/>general/graph_extractor.py:34
participant LLM as llm/chat_model.py
participant ES as ESVector<br/>elasticsearch_vector.py
participant Merge as merge_subgraph()
participant Resolve as entity_resolution.py<br/>EntityResolution
participant Leiden as general/leiden.py<br/>run()
participant Community as general/<br/>community_reports_extractor.py:37
Note over Celery,Community: === Trigger conditions ===
Celery->>Celery: build_graphrag_for_kb(kb_id)
Celery->>Celery: Check parser_config.graphrag.use_graphrag
Celery->>Index: run_graphrag_for_kb(row, document_ids, ...)
Note over Index,LLM: === Stage 1: Subgraph generation (per chunk) ===
Index->>Index: init_graphrag(task, vector_size)
Index->>Index: generate_subgraph() per chunk
loop Each chunk
Index->>KGExt: _process_single_content(chunk_key_dp, chunk_text)
alt Light branch
KGExt->>KGExt: LightRAG-style prompt<br/>+ content_keywords extraction
KGExt->>KGExt: GLEANING loop (max 2)
else General branch
KGExt->>KGExt: MS GraphRAG-style prompt<br/>perform_variable_replacements
KGExt->>KGExt: tiktoken logit-bias Y/N loop
end
KGExt->>LLM: LLM call → entities + relations JSON
LLM-->>KGExt: extracted data
KGExt->>KGExt: _merge_nodes() + _merge_edges()
KGExt-->>Index: (entities_data, relationships_data)
end
Index->>ES: store subgraph (entities + relations chunks)
Note over Merge,ES: === Stage 2: Subgraph merge ===
Index->>Merge: merge_subgraph()
Merge->>ES: get_graph() loads the global graph
Merge->>Merge: graph_merge(old_graph, subgraph, change)
Merge->>Merge: nx.pagerank(new_graph)
Merge->>ES: set_graph() writes back the global graph + entities + relations
Note over Resolve,ES: === Stage 3: Entity resolution (optional) ===
opt with_resolution == True
Index->>Resolve: resolve_entities(graph, subgraph_nodes)
Resolve->>LLM: Pairwise entity-similarity matching via LLM
LLM-->>Resolve: merge suggestions
Resolve->>Resolve: nx.pagerank(graph)
Resolve->>ES: set_graph()
end
Note over Leiden,Community: === Stage 4: Community reports (General only) ===
opt with_community == True (General)
Index->>Leiden: leiden.run(graph)
Leiden->>Leiden: graspologic.partition.<br/>hierarchical_leiden<br/>max_cluster_size=12
Leiden-->>Index: {level: {community_id: {nodes: [...]}}}
loop Each community (nodes >= 2)
Index->>Community: __call__(graph, callback)
Community->>Community: Build entity_df + relation_df
Community->>LLM: COMMUNITY_REPORT_PROMPT
LLM-->>Community: {title, summary, findings, rating}
Community->>Community: add_community_info2graph()
end
Community->>ES: index community_report chunks
end
Note over Index,ES: === Mind Map (standalone feature, not on the main pipeline) ===
Note right of Index: mind_map_extractor.py<br/>invoked externally, not part of the indexing pipeline<br/>sections → hierarchical markdown mind map
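
The stage gating shown in the diagram can be sketched in a few lines of Python, placed here after the diagram rather than inside it. This is only an illustration under stated assumptions: the function names `planned_stages` and `reportable_communities`, the stage labels, and the `method` string are hypothetical and are not taken from the MemoryBear source. Only the flags `with_resolution` and `with_community`, the "General only" restriction, the `nodes >= 2` filter, and the `{level: {community_id: {nodes: [...]}}}` shape come from the diagram.

```python
from typing import Dict, List, Tuple

# Shape returned by the Leiden step in the diagram:
# {level: {community_id: {"nodes": [...]}}}
Hierarchy = Dict[int, Dict[str, Dict[str, List[str]]]]


def planned_stages(method: str, with_resolution: bool, with_community: bool) -> List[str]:
    """Return the stages that would run, in diagram order (hypothetical helper)."""
    # Stages 1 and 2 run for both the Light and General branches.
    stages = ["subgraph_generation", "subgraph_merge"]
    # Stage 3 is optional and gated by with_resolution.
    if with_resolution:
        stages.append("entity_resolution")
    # Stage 4 is gated by with_community and applies to the General branch only.
    if with_community and method == "general":
        stages.append("community_reports")
    return stages


def reportable_communities(hierarchy: Hierarchy, min_nodes: int = 2) -> List[Tuple[int, str]]:
    """List (level, community_id) pairs with at least min_nodes nodes (hypothetical helper)."""
    selected: List[Tuple[int, str]] = []
    for level, communities in sorted(hierarchy.items()):
        for community_id, info in communities.items():
            if len(info.get("nodes", [])) >= min_nodes:
                selected.append((level, community_id))
    return selected


if __name__ == "__main__":
    print(planned_stages("light", with_resolution=True, with_community=True))
    print(reportable_communities({0: {"a": {"nodes": ["x", "y"]}, "b": {"nodes": ["z"]}}}))
```

In this sketch, a Light-branch run with both flags enabled still skips community reports, and single-node communities are dropped before any report prompt would be issued.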