docs(rag): add MemoryBear RAG implementation docs v1.0

Add the finalized RAG documentation set produced across Sprint-1/2/3 (WS-12 through WS-26) under docs/rag/.

Includes:
- README.md / INDEX.md: landing page + master index (responsibility matrix, review verdicts, dual links to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams), 11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding, VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source retained as reference
- evolution/: 11 architecture-refactor proposals, 6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention, ops & freshness plan
- _meta/README.md: placeholder noting the gap in WS-12 governance assets

Aggregate review score: 92.6/100 (8/8 PASS, 31/31 source-code spot checks matched).

The legacy docs/ ignore rule in .gitignore is narrowed to docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
2026-05-09 10:51:48 +08:00
parent feae2f2e1e
commit 343a5eebe3
33 changed files with 8410 additions and 1 deletions
--- a/docs/rag/overview/04-graphrag-indexing.mmd
+++ b/docs/rag/overview/04-graphrag-indexing.mmd
@@ -0,0 +1,78 @@
+%% MemoryBear GraphRAG index-construction sequence diagram
+%% Covers the differences between the Light and General branches
+
+sequenceDiagram
+    autonumber
+    participant Celery as Celery<br/>tasks.py:473
+    participant Index as graphrag/general/index.py<br/>run_graphrag_for_kb()
+    participant KGExt as GraphExtractor<br/>light/graph_extractor.py:31<br/>general/graph_extractor.py:34
+    participant LLM as llm/chat_model.py
+    participant ES as ESVector<br/>elasticsearch_vector.py
+    participant Merge as merge_subgraph()
+    participant Resolve as entity_resolution.py<br/>EntityResolution
+    participant Leiden as general/leiden.py<br/>run()
+    participant Community as general/<br/>community_reports_extractor.py:37
+
+    Note over Celery,Community: === Trigger conditions ===
+    Celery->>Celery: build_graphrag_for_kb(kb_id)
+    Celery->>Celery: Check parser_config.graphrag.use_graphrag
+    Celery->>Index: run_graphrag_for_kb(row, document_ids, ...)
+
+    Note over Index,LLM: === Stage 1: subgraph generation (per chunk) ===
+    Index->>Index: init_graphrag(task, vector_size)
+    Index->>Index: generate_subgraph() per chunk
+
+    loop For each chunk
+        Index->>KGExt: _process_single_content(chunk_key_dp, chunk_text)
+
+        alt Light branch
+            KGExt->>KGExt: LightRAG-style prompt<br/>+ content_keywords extraction
+            KGExt->>KGExt: GLEANING loop (max 2)
+        else General branch
+            KGExt->>KGExt: MS GraphRAG-style prompt<br/>perform_variable_replacements
+            KGExt->>KGExt: tiktoken logit-bias Y/N loop
+        end
+
+        KGExt->>LLM: LLM call → entities + relations JSON
+        LLM-->>KGExt: extracted data
+        KGExt->>KGExt: _merge_nodes() + _merge_edges()
+        KGExt-->>Index: (entities_data, relationships_data)
+    end
+
+    Index->>ES: store subgraph (entities + relations chunks)
+
+    Note over Merge,ES: === Stage 2: subgraph merge ===
+    Index->>Merge: merge_subgraph()
+    Merge->>ES: get_graph() loads the global graph
+    Merge->>Merge: graph_merge(old_graph, subgraph, change)
+    Merge->>Merge: nx.pagerank(new_graph)
+    Merge->>ES: set_graph() writes back the global graph + entities + relations
+
+    Note over Resolve,ES: === Stage 3: entity resolution (optional) ===
+    opt with_resolution == True
+        Index->>Resolve: resolve_entities(graph, subgraph_nodes)
+        Resolve->>LLM: Pairwise entity-similarity matching via LLM
+        LLM-->>Resolve: Merge suggestions
+        Resolve->>Resolve: nx.pagerank(graph)
+        Resolve->>ES: set_graph()
+    end
+
+    Note over Leiden,Community: === Stage 4: community reports (General only) ===
+    opt with_community == True (General)
+        Index->>Leiden: leiden.run(graph)
+        Leiden->>Leiden: graspologic.partition.<br/>hierarchical_leiden<br/>max_cluster_size=12
+        Leiden-->>Index: {level: {community_id: {nodes: [...]}}}
+
+        loop For each community (nodes >= 2)
+            Index->>Community: __call__(graph, callback)
+            Community->>Community: 构建 entity_df + relation_df
+            Community->>LLM: COMMUNITY_REPORT_PROMPT
+            LLM-->>Community: {title, summary, findings, rating}
+            Community->>Community: add_community_info2graph()
+        end
+
+        Community->>ES: index community_report chunks
+    end
+
+    Note over Index,ES: === Mind Map (standalone feature, not on the main path) ===
+    Note right of Index: mind_map_extractor.py<br/>invoked externally, not part of the indexing pipeline<br/>sections → hierarchical markdown mind map
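
The stage gating shown in the diagram above can be summarized in a short Python sketch. It is illustrative only and is not taken from the MemoryBear source: `plan_stages`, `reportable_communities`, and the `branch` argument are hypothetical names. Only the flag names `with_resolution` and `with_community`, the General-only restriction on stage 4, and the `nodes >= 2` community filter come from the diagram.

```python
from typing import Dict, List


def plan_stages(branch: str, with_resolution: bool, with_community: bool) -> List[str]:
    """Return the indexing stages that would run, in diagram order.

    Hypothetical helper: it mirrors the diagram's opt blocks, not real code.
    """
    # Stages 1 and 2 (subgraph generation and merge) run for both branches.
    stages = ["subgraph_generation", "subgraph_merge"]
    if with_resolution:
        # Stage 3 is optional and gated on with_resolution.
        stages.append("entity_resolution")
    if with_community and branch == "general":
        # Stage 4 is gated on with_community and applies to General only.
        stages.append("community_reports")
    return stages


def reportable_communities(communities: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only communities with at least two nodes, as in the loop guard."""
    return {cid: nodes for cid, nodes in communities.items() if len(nodes) >= 2}


if __name__ == "__main__":
    print(plan_stages("general", with_resolution=True, with_community=True))
    print(reportable_communities({"c0": ["a", "b"], "c1": ["x"]}))
```

Under these assumptions, a Light-branch run with both flags enabled would still stop after entity resolution, since the diagram marks community reports as General only.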