docs(rag): add MemoryBear RAG implementation docs v1.0

Add the finalized RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing page + master index (responsibility
  matrix, review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks hit). The legacy docs/ ignore in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
Multica PM Agent
2026-05-09 10:51:48 +08:00
parent feae2f2e1e
commit 343a5eebe3
33 changed files with 8410 additions and 1 deletions


@@ -0,0 +1,78 @@
%% MemoryBear GraphRAG index-build sequence diagram
%% Covers the differences between the Light and General branches
sequenceDiagram
autonumber
participant Celery as Celery<br/>tasks.py:473
participant Index as graphrag/general/index.py<br/>run_graphrag_for_kb()
participant KGExt as GraphExtractor<br/>light/graph_extractor.py:31<br/>general/graph_extractor.py:34
participant LLM as llm/chat_model.py
participant ES as ESVector<br/>elasticsearch_vector.py
participant Merge as merge_subgraph()
participant Resolve as entity_resolution.py<br/>EntityResolution
participant Leiden as general/leiden.py<br/>run()
participant Community as general/<br/>community_reports_extractor.py:37
Note over Celery,Community: === Trigger conditions ===
Celery->>Celery: build_graphrag_for_kb(kb_id)
Celery->>Celery: check parser_config.graphrag.use_graphrag
Celery->>Index: run_graphrag_for_kb(row, document_ids, ...)
Note over Index,LLM: === Stage 1 - subgraph generation (per chunk) ===
Index->>Index: init_graphrag(task, vector_size)
Index->>Index: generate_subgraph() per chunk
loop each chunk
Index->>KGExt: _process_single_content(chunk_key_dp, chunk_text)
alt Light branch
KGExt->>KGExt: LightRAG-style prompt<br/>+ content_keywords extraction
KGExt->>KGExt: GLEANING loop (max 2)
else General branch
KGExt->>KGExt: MS GraphRAG-style prompt<br/>perform_variable_replacements
KGExt->>KGExt: tiktoken logit-bias Y/N loop
end
KGExt->>LLM: LLM call → entities + relations JSON
LLM-->>KGExt: extracted data
KGExt->>KGExt: _merge_nodes() + _merge_edges()
KGExt-->>Index: (entities_data, relationships_data)
end
Index->>ES: store subgraph (entities + relations chunks)
Note over Merge,ES: === Stage 2 - subgraph merge ===
Index->>Merge: merge_subgraph()
Merge->>ES: get_graph() loads the global graph
Merge->>Merge: graph_merge(old_graph, subgraph, change)
Merge->>Merge: nx.pagerank(new_graph)
Merge->>ES: set_graph() writes back the global graph + entities + relations
Note over Resolve,ES: === Stage 3 - entity disambiguation (optional) ===
opt with_resolution == True
Index->>Resolve: resolve_entities(graph, subgraph_nodes)
Resolve->>LLM: pairwise entity-similarity matching via LLM
LLM-->>Resolve: merge suggestions
Resolve->>Resolve: nx.pagerank(graph)
Resolve->>ES: set_graph()
end
Note over Leiden,Community: === Stage 4 - community reports (General only) ===
opt with_community == True (General)
Index->>Leiden: leiden.run(graph)
Leiden->>Leiden: graspologic.partition.<br/>hierarchical_leiden<br/>max_cluster_size=12
Leiden-->>Index: {level: {community_id: {nodes: [...]}}}
loop each community (nodes >= 2)
Index->>Community: __call__(graph, callback)
Community->>Community: build entity_df + relation_df
Community->>LLM: COMMUNITY_REPORT_PROMPT
LLM-->>Community: {title, summary, findings, rating}
Community->>Community: add_community_info2graph()
end
Community->>ES: index community_report chunks
end
Note over Index,ES: === Mind Map (standalone feature, not on the main path) ===
Note right of Index: mind_map_extractor.py<br/>called externally, not part of the indexing pipeline<br/>sections → hierarchical markdown mind map
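
The stage gating shown in the diagram above can be summarized as a minimal Python sketch. It is not the MemoryBear implementation: `plan_stages`, `IndexRun`, the `mode` string flag, and the flat `communities` mapping are hypothetical names introduced here for illustration. Only the stage order, the optional flags (`with_resolution`, `with_community`), the General-only restriction on community reports, and the `nodes >= 2` filter are taken from the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class IndexRun:
    """Records which stages would run; purely illustrative."""
    stages: list = field(default_factory=list)
    reported_communities: list = field(default_factory=list)


def plan_stages(mode, with_resolution, with_community, communities):
    """Return the stage sequence described by the sequence diagram.

    mode: "light" or "general" (hypothetical flag for the two branches).
    communities: mapping of community_id -> list of node names, standing
    in for a single level of the Leiden output (an assumption).
    """
    run = IndexRun()
    run.stages.append("generate_subgraph")  # stage 1, once per chunk
    run.stages.append("merge_subgraph")     # stage 2, includes PageRank
    if with_resolution:                     # stage 3 is optional
        run.stages.append("resolve_entities")
    if with_community and mode == "general":  # stage 4 is General only
        run.stages.append("community_reports")
        # Only communities with at least two nodes get a report.
        run.reported_communities = sorted(
            cid for cid, nodes in communities.items() if len(nodes) >= 2
        )
    return run
```

Calling `plan_stages("light", False, True, {...})` in this sketch yields only the first two stages, which mirrors the diagram's note that community reports apply to the General branch alone.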