docs(rag): add MemoryBear RAG implementation docs v1.0
Submit the formed RAG documentation set produced across Sprint-1/2/3 (WS-12 through WS-26) under docs/rag/.

Includes:
- README.md / INDEX.md: landing + total index (responsibility matrix, review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams), 11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding, VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source retained as reference
- evolution/: 11 architecture-refactor proposals, 6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention, ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot checks hit).

The legacy docs/ ignore in .gitignore is narrowed to docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
991
docs/rag/pipeline/04-graphrag.md
Normal file
@@ -0,0 +1,991 @@
# GraphRAG (Light + General) Implementation Details

| Metadata | Value |
|---|---|
| Stage ID | 05-graphrag |
| Source directory | `api/app/core/rag/graphrag/` |
| Related tasks | [WS-11](mention://issue/6c0b5472-a0fa-4997-925c-a67f235f82da) / [S2-T4](mention://issue/16bdb196-e10e-489b-b01c-9067b1f1bb23) |
| Upstream inputs | [S2-T2] Embedding, [S2-T3] VDB, [S1-T2] architecture diagrams |
| Downstream output | [S3-T2] knowledge-graph enhancement |

---

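## 1. One-Line Summary

GraphRAG is the **knowledge-graph-enhanced retrieval module** of the MemoryBear knowledge base system. It uses an LLM to extract entity-relationship triples from documents and build a knowledge graph; at retrieval time it uses the graph structure (entity links, community reports, multi-hop paths) to cover the semantic blind spots of plain vector retrieval, giving a hybrid recall of "structured knowledge + semantic vectors".

---
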
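## 2. Design Goals and Use Cases

### 2.1 Design Goals

1. **Structured knowledge as a supplement**: Vector retrieval is good at semantic matching but covers multi-hop reasoning, entity-relationship inference, and global summarization poorly. GraphRAG fills this gap by explicitly building an entity-relationship graph.
2. **Two precision/cost tiers**:
   - **Light mode** (default): based on the LightRAG approach; lightweight and fast, suited to latency-sensitive scenarios with a medium-sized document set.
   - **General mode** (full version): based on Microsoft GraphRAG; supports entity resolution, community detection, and community report generation, suited to scenarios that need in-depth analysis and complex reasoning.
3. **Reuse of existing infrastructure**: No standalone graph database such as Neo4j is introduced. Elasticsearch is reused as the graph store, which lowers operational complexity.

### 2.2 Use Cases

| Scenario | Recommended mode | Reason |
|---|---|---|
| Quick knowledge Q&A, fewer than 1K documents | Light | Fast graph build, low cost |
| Enterprise knowledge base, more than 10K documents | General | Entity resolution + community reports give global insight |
| Cross-document entity association analysis | General | Entity resolution merges same-name entities across documents |
| Assessing "the global influence of an entity" | General | Community reports + PageRank give a global view |
| Real-time chat / low-latency retrieval | Light | Community report generation in General is slow |

---
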
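## 3. Key Concepts and Glossary

| Term | Definition |
|---|---|
| **Entity** | A named object extracted from text, such as a person, organization, or place. Stored in code as a graph node. |
| **Relationship** | A semantic link between entities, such as "A is the CEO of B". Stored as a graph edge. |
| **Subgraph** | The local knowledge graph extracted from a single document; subgraphs are eventually merged into the global graph. |
| **Entity Resolution** | Identifying nodes that have different names but refer to the same entity, and merging them (for example "Apple Inc." vs "Apple"). |
| **Community** | A densely connected cluster of nodes in the graph, found with the Leiden algorithm. |
| **Community Report** | An LLM-generated structured summary of a single community, containing a title, summary, impact rating, and key findings. |
| **PageRank** | Measures how important an entity is in the graph; used as one of the ranking factors at retrieval time. |
| **N-hop Path** | A path of entities reachable from a query entity by walking N steps along graph edges; used to expand recall. |
| **Tuple Delimiter** | The field separator in entity/relationship extraction output; `<\|>` in code. |
| **Record Delimiter** | The separator between records in extraction output; `##` in code. |
| **knowledge_graph_kwd** | The type-marker field on ES documents; values: `entity` / `relation` / `graph` / `subgraph` / `community_report` / `ty2ents`. |

---
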
## 4. Implementation Overview

### 4.1 Module Structure

```
api/app/core/rag/graphrag/
├── search.py                    # KGSearch: graph retrieval entry point
├── entity_resolution.py         # Entity resolution (LLM + edit distance)
├── entity_resolution_prompt.py  # Entity resolution prompt
├── query_analyze_prompt.py      # Query analysis prompt (MiniRAG style)
├── utils.py                     # Graph utilities (merge, cache, ES read/write)
├── __init__.py
├── light/
│   ├── graph_extractor.py       # Light entity/relationship extractor
│   └── graph_prompt.py          # Light extraction prompt + RAG answer prompt
└── general/
    ├── extractor.py             # Shared extractor base class (LLM calls, node/edge merging)
    ├── graph_extractor.py       # General entity/relationship extractor
    ├── graph_prompt.py          # General extraction prompt
    ├── index.py                 # GraphRAG build orchestration (subgraph → merge → resolution → community reports)
    ├── entity_embedding.py      # Node2Vec entity embedding (standby)
    ├── leiden.py                # Leiden community detection wrapper
    ├── community_reports_extractor.py  # Community report extractor
    ├── community_report_prompt.py      # Community report generation prompt
    ├── mind_map_extractor.py    # Mind map extractor
    └── mind_map_prompt.py       # Mind map prompt
```

### 4.2 Graph Build Sequence Diagram

```mermaid
sequenceDiagram
    participant U as User / Task
    participant T as tasks.py<br/>(Celery Task)
    participant I as general/index.py<br/>run_graphrag/run_graphrag_for_kb
    participant E as light/general<br/>GraphExtractor
    participant ES as Elasticsearch<br/>(Doc Store)
    participant ER as entity_resolution.py<br/>EntityResolution
    participant CR as community_reports_extractor.py<br/>CommunityReportsExtractor

    U->>T: Upload document / trigger graph build
    T->>I: run_graphrag_for_kb(document_ids, parser_config)
    I->>I: load_doc_chunks()<br/>merge chunks up to 1024 tokens
    loop Documents in parallel (max 4)
        I->>E: generate_subgraph(extractor, chunks)
        E->>E: LLM extracts entities + relations<br/>(multi-round gleaning)
        E->>E: Parse output → nx.Graph
        E->>ES: Write subgraph (knowledge_graph_kwd="subgraph")
    end
    I->>I: merge_subgraph()<br/>merge subgraphs into the global graph, one document at a time
    I->>ES: Write global graph (knowledge_graph_kwd="graph")
    I->>ES: Write entity/relation chunks<br/>(with vector embeddings)

    alt with_resolution=true (optional, General)
        I->>ER: resolve_entities(graph, subgraph_nodes)
        ER->>ER: Edit-distance pre-filter of candidate pairs
        ER->>ER: LLM batch check "same entity or not"
        ER->>ER: Merge nodes in each connected component
        ER->>ER: Recompute PageRank
        ER->>ES: Update graph/entity/relation
    end

    alt with_community=true (optional, General)
        I->>CR: extract_community(graph)
        CR->>CR: Leiden community detection
        CR->>CR: LLM generates a report per community<br/>(title/summary/rating/findings)
        CR->>ES: Write community_report chunks
    end
    I-->>T: Return {ok_documents, failed_documents, seconds}
```

### 4.3 Graph Query Sequence Diagram

```mermaid
sequenceDiagram
    participant U as User Query
    participant S as search.py<br/>KGSearch.retrieval()
    participant QP as query_analyze_prompt.py<br/>minirag_query2kwd
    participant ES as Elasticsearch
    participant LLM as LLM

    U->>S: retrieval(question, workspace_ids, kb_ids, ...)
    S->>LLM: query_rewrite()<br/>PROMPTS["minirag_query2kwd"]
    LLM-->>S: {answer_type_keywords, entities_from_query}

    par Three-way recall in parallel
        S->>ES: get_relevant_ents_by_keywords()<br/>vector similarity search over entity
        ES-->>S: Candidate entities + sim + pagerank + n_hop
        S->>ES: get_relevant_ents_by_types()<br/>filter entity by type
        ES-->>S: Type-matched entities
        S->>ES: get_relevant_relations_by_txt()<br/>vector similarity search over relation
        ES-->>S: Candidate relations
    end

    S->>S: Compute n-hop path weight decay<br/>sim / (2 + hop_depth)
    S->>S: Rank entities by sim × pagerank<br/>Rank relations by sim × pagerank × boost
    S->>S: Token-budget truncation (max_token decreases)

    alt Community report recall
        S->>ES: _community_retrieval_()<br/>match community_report by entities_kwd
        ES-->>S: Community report text
    end

    S-->>U: {page_content: Entities + Relations + Community Reports,<br/>metadata, vector: None}
```

---

## 5. Key Source Code Walkthrough

### 5.1 Graph Build Pipeline

#### 5.1.1 Build Orchestration Entry Point

**File**: `api/app/core/rag/graphrag/general/index.py:36-119`

```python
async def run_graphrag(
    row: dict, language, with_resolution: bool, with_community: bool,
    chat_model, embedding_model, callback,
):
    # Choose the extractor: LightKGExt (default) or GeneralKGExt
    extractor = LightKGExt if method != "general" else GeneralKGExt
    subgraph = await generate_subgraph(extractor, workspace_id, kb_id, document_id, chunks, ...)
    new_graph = await merge_subgraph(workspace_id, kb_id, document_id, subgraph, embedding_model, callback)
    if with_resolution:
        await resolve_entities(new_graph, subgraph_nodes, ...)
    if with_community:
        await extract_community(new_graph, ...)
```

**Design points**:
- `parser_config["graphrag"]["method"]` switches between Light and General (`"general"` selects General, anything else selects Light).
- `with_resolution` and `with_community` are independent switches that are only meaningful in General mode (Light does not support them).
- `RedisDistributedLock` keeps concurrent graph builds on the same KB safe.

#### 5.1.2 Subgraph Generation

**File**: `api/app/core/rag/graphrag/general/index.py:333-406`

```python
async def generate_subgraph(extractor, workspace_id, kb_id, document_id, chunks, ...):
    # Idempotency check: skip if document_id is already in the graph
    contains = await does_graph_contains(workspace_id, kb_id, document_id)
    if contains:
        return None
    ext = extractor(llm_bdl, language=language, entity_types=entity_types)
    ents, rels = await ext(document_id, chunks, callback, task_id=task_id)
    subgraph = nx.Graph()
    for ent in ents:
        subgraph.add_node(ent["entity_name"], **ent)
    for rel in rels:
        if subgraph.has_node(rel["src_id"]) and subgraph.has_node(rel["tgt_id"]):
            subgraph.add_edge(rel["src_id"], rel["tgt_id"], **rel)
    tidy_graph(subgraph, callback, check_attribute=False)
    # Write to ES as a subgraph-type document
    await trio.to_thread.run_sync(settings.docStoreConn.insert, [chunk], ...)
    return subgraph
```

**Key design**:
- `does_graph_contains()` achieves idempotency by checking the `source_id` field of the document with `knowledge_graph_kwd="graph"`.
- `tidy_graph()` removes dirty nodes/edges that lack a description or source_id.
- Each document's subgraph is stored separately, which makes incremental updates and rebuilds easier.

#### 5.1.3 Entity/Relationship Extraction (Light vs General)

**Light extractor**

**File**: `api/app/core/rag/graphrag/light/graph_extractor.py:31-132`

```python
class GraphExtractor(Extractor):
    def __init__(self, llm_invoker, language="English", entity_types=None,
                 example_number=2, max_gleanings=None):
        # Use LightRAG-style prompts
        self._entity_extract_prompt = PROMPTS["entity_extraction"]
        self._continue_prompt = PROMPTS["entity_continue_extraction"]
        self._if_loop_prompt = PROMPTS["entity_if_loop_extraction"]
        # Reserve 60% of the tokens for the input text
        self._left_token_count = max(getattr(llm_invoker, 'max_length', 8096) * 0.6, ...)

    async def _process_single_content(self, chunk_key_dp, chunk_seq, num_chunks, out_results, task_id=""):
        hint_prompt = self._entity_extract_prompt.format(**self._context_base, input_text=content)
        # First extraction pass
        final_result = await trio.to_thread.run_sync(self._chat, "", [{"role": "user", "content": hint_prompt}], {}, task_id)
        # Multi-round gleaning: ask "did you miss anything?"
        for now_glean_index in range(self._max_gleanings):
            glean_result = await trio.to_thread.run_sync(self._chat, "", history, gen_conf, task_id)
            final_result += glean_result
            # Use if_loop_prompt to decide whether to continue
            if_loop_result = await trio.to_thread.run_sync(self._chat, "", history, gen_conf, task_id)
            if if_loop_result.strip().lower() != "yes":
                break
```

**General extractor**

**File**: `api/app/core/rag/graphrag/general/graph_extractor.py:34-151`

```python
class GraphExtractor(Extractor):
    def __init__(self, llm_invoker, language="English", entity_types=None, ...):
        self._extraction_prompt = GRAPH_EXTRACTION_PROMPT
        self._max_gleanings = max_gleanings or ENTITY_EXTRACTION_MAX_GLEANINGS
        # Build a logit_bias with tiktoken to force a YES/NO answer
        encoding = tiktoken.get_encoding("cl100k_base")
        yes = encoding.encode("YES")
        no = encoding.encode("NO")
        self._loop_args = {"logit_bias": {yes[0]: 100, no[0]: 100}, "max_tokens": 1}

    async def _process_single_content(self, chunk_key_dp, chunk_seq, num_chunks, out_results, task_id=""):
        # Similar to Light, but uses CONTINUE_PROMPT + LOOP_PROMPT
        for i in range(self._max_gleanings):
            history.append({"role": "user", "content": CONTINUE_PROMPT})
            response = await trio.to_thread.run_sync(lambda: self._chat("", history, {}))
            if i >= self._max_gleanings - 1:
                break
            history.append({"role": "assistant", "content": response})
            history.append({"role": "user", "content": LOOP_PROMPT})
            continuation = await trio.to_thread.run_sync(lambda: self._chat("", history))
            if continuation != "Y":
                break
```

**Light vs General extraction differences**:

| Dimension | Light | General |
|---|---|---|
| Prompt style | LightRAG (more detailed examples + content_keywords) | MS GraphRAG (concise, no keywords) |
| Gleaning termination | Natural-language `"yes"`/`"no"` check | Single-token answer steered by `logit_bias`, compared against `"Y"` |
| Number of examples | Default 2 (per `example_number=2` in the constructor above), adjustable via `example_number` | Fixed at 3 |
| Output format | Includes a `content_keywords` tuple | entity + relationship only |

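The gleaning control flow that both extractors share can be sketched without any LLM dependency. This is a minimal illustration, not the project's code: `glean`, the `chat` callable, and the `"CONTINUE"` / `"LOOP?"` prompt strings are hypothetical stand-ins for the real prompts and `self._chat`.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def glean(chat: Callable[[List[Message]], str], first_prompt: str,
          max_gleanings: int = 2) -> str:
    """Run one extraction pass plus up to `max_gleanings` follow-up passes.

    `chat` is a hypothetical callable: it receives the message history and
    returns the model reply as a string.
    """
    history: List[Message] = [{"role": "user", "content": first_prompt}]
    result = chat(history)
    history.append({"role": "assistant", "content": result})

    for i in range(max_gleanings):
        # Ask the model to continue with anything it missed.
        history.append({"role": "user", "content": "CONTINUE"})
        extra = chat(history)
        history.append({"role": "assistant", "content": extra})
        result += extra
        if i >= max_gleanings - 1:
            break  # budget exhausted, no point asking whether to loop again
        # Ask whether another round is worthwhile; stop unless it says yes.
        history.append({"role": "user", "content": "LOOP?"})
        answer = chat(history)
        history.append({"role": "assistant", "content": answer})
        if answer.strip().lower() not in ("y", "yes"):
            break
    return result
```

Accepting both `"y"` and `"yes"` is a simplification that covers the two termination styles in one helper; the real extractors each check only their own form.
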
#### 5.1.4 Node/Edge Merging and Summarization

**File**: `api/app/core/rag/graphrag/general/extractor.py:205-300`

```python
async def _merge_nodes(self, entity_name, entities, all_relationships_data, task_id=""):
    # Vote on the entity type (most frequent wins)
    entity_type = sorted(Counter([dp["entity_type"] for dp in entities]).items(), key=lambda x: x[1], reverse=True)[0][0]
    # Deduplicate and join all descriptions
    description = GRAPH_FIELD_SEP.join(sorted(set([dp["description"] for dp in entities])))
    # LLM summary (triggered when there are more than 12 descriptions)
    description = await self._handle_entity_relation_summary(entity_name, description, task_id=task_id)
    node_data = dict(entity_type=entity_type, description=description, source_id=already_source_ids)
    all_relationships_data.append(node_data)

async def _handle_entity_relation_summary(self, entity_or_relation_name, description, task_id=""):
    description_list = use_description.split(GRAPH_FIELD_SEP)
    if len(description_list) <= 12:
        return use_description  # Few descriptions: no summary
    # Trigger the LLM summary
    async with chat_limiter:
        summary = await trio.to_thread.run_sync(self._chat, "", [{"role": "user", "content": use_prompt}], {}, task_id)
    return summary
```

**Design points**:
- Descriptions of the same entity name from different chunks are joined with `<SEP>`; more than 12 of them triggers an LLM summary so that descriptions do not grow without bound.
- Relationships are merged the same way: weights are summed, keywords are unioned and deduplicated, and descriptions are joined and summarized.

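The join-then-threshold rule above can be tried in isolation. The sketch below is an assumption-laden simplification: `merge_descriptions` is a hypothetical helper, the separator value `"<SEP>"` is taken from the design point above, and the LLM call is replaced by a flag saying whether a summary would be requested.

```python
from typing import Iterable, List, Tuple

GRAPH_FIELD_SEP = "<SEP>"   # separator named in the design points above
SUMMARY_THRESHOLD = 12      # more than this many descriptions triggers a summary


def merge_descriptions(descriptions: Iterable[str]) -> Tuple[str, bool]:
    """Deduplicate, sort and join descriptions; report whether a summary is due."""
    parts: List[str] = []
    for d in descriptions:
        # A description may itself be an earlier "<SEP>"-joined merge result.
        parts.extend(p for p in d.split(GRAPH_FIELD_SEP) if p)
    unique = sorted(set(parts))
    needs_summary = len(unique) > SUMMARY_THRESHOLD
    return GRAPH_FIELD_SEP.join(unique), needs_summary
```

Unlike the excerpt, this sketch splits previously merged strings before deduplicating, so repeated merges do not count the same description twice.
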
#### 5.1.5 Merging a Subgraph into the Global Graph

**File**: `api/app/core/rag/graphrag/utils.py:199-229`

```python
def graph_merge(g1: nx.Graph, g2: nx.Graph, change: GraphChange):
    """Merge graph g2 into g1 in place."""
    for node_name, attr in g2.nodes(data=True):
        change.added_updated_nodes.add(node_name)
        if not g1.has_node(node_name):
            g1.add_node(node_name, **attr)
            continue
        # Already present: append description, merge source_id
        node = g1.nodes[node_name]
        node["description"] += GRAPH_FIELD_SEP + attr["description"]
        node["source_id"] += attr["source_id"]

    for source, target, attr in g2.edges(data=True):
        change.added_updated_edges.add(get_from_to(source, target))
        edge = g1.get_edge_data(source, target)
        if edge is None:
            g1.add_edge(source, target, **attr)
            continue
        # Already present: sum weights, append description
        edge["weight"] += attr.get("weight", 0)
        edge["description"] += GRAPH_FIELD_SEP + attr["description"]
        edge["keywords"] += attr["keywords"]
        edge["source_id"] += attr["source_id"]

    # Update degree centrality (rank)
    for node_degree in g1.degree:
        g1.nodes[str(node_degree[0])]["rank"] = int(node_degree[1])
```

#### 5.1.6 Entity Resolution

**File**: `api/app/core/rag/graphrag/entity_resolution.py:31-141`

```python
class EntityResolution(Extractor):
    async def __call__(self, graph, subgraph_nodes, prompt_variables=None, callback=None, task_id=""):
        # 1. Group by entity_type
        node_clusters = {entity_type: [] for entity_type in entity_types}
        for node in nodes:
            node_clusters[graph.nodes[node].get('entity_type', '-')].append(node)

        # 2. Generate candidate pairs (combination limit + edit-distance pre-filter)
        for k, v in node_clusters.items():
            candidate_resolution[k] = [(a, b) for a, b in itertools.combinations(v, 2)
                                       if (a in subgraph_nodes or b in subgraph_nodes) and self.is_similarity(a, b)]

        # 3. LLM batch check (batch=100, concurrency=5, trio tasks)
        async def limited_resolve_candidate(candidate_batch, result_set, result_lock):
            async with semaphore:
                await self._resolve_candidate(candidate_batch, result_set, result_lock, task_id)

        # 4. Merge connected components
        connect_graph = nx.Graph()
        connect_graph.add_edges_from(resolution_result)
        for sub_connect_graph in nx.connected_components(connect_graph):
            merging_nodes = list(sub_connect_graph)
            await self._merge_graph_nodes(graph, merging_nodes, change, task_id)

        # 5. Recompute PageRank
        pr = nx.pagerank(graph)
```

**Similarity pre-filter** (`is_similarity`, lines 225-239):

```python
def is_similarity(self, a, b):
    # Rule 1: the 2-gram difference must not contain digits
    # (keeps "Product 1" vs "Product 2" from being treated as similar)
    if self._has_digit_in_2gram_diff(a, b):
        return False
    # Rule 2: English uses editdistance, threshold = min(len(a), len(b)) // 2
    if is_english(a) and is_english(b):
        return editdistance.eval(a, b) <= min(len(a), len(b)) // 2
    # Rule 3: Chinese/mixed text uses the character-set overlap ratio
    # |a ∩ b| / max(|a|, |b|), threshold 0.8
    a, b = set(a), set(b)
    max_l = max(len(a), len(b))
    if max_l < 4:
        return len(a & b) > 1
    return len(a & b) * 1. / max_l >= 0.8
```

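To experiment with the three rules outside the codebase, the following standalone sketch replaces the `editdistance` package with a pure-Python Levenshtein function. The bodies of `has_digit_in_2gram_diff` and `is_english` are assumptions, since the excerpt does not show the real helpers; only the rule order and thresholds come from the excerpt.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def has_digit_in_2gram_diff(a: str, b: str) -> bool:
    """Assumed behaviour: True if a 2-gram unique to one name contains a digit."""
    def grams(s: str) -> set:
        return {s[i:i + 2] for i in range(len(s) - 1)}
    return any(ch.isdigit() for g in grams(a) ^ grams(b) for ch in g)


def is_english(s: str) -> bool:
    """Assumed behaviour: ASCII letters, digits and basic punctuation only."""
    return re.fullmatch(r"[A-Za-z0-9 .,'&_-]+", s) is not None


def is_similar(a: str, b: str) -> bool:
    if has_digit_in_2gram_diff(a, b):          # rule 1
        return False
    if is_english(a) and is_english(b):        # rule 2
        return edit_distance(a, b) <= min(len(a), len(b)) // 2
    sa, sb = set(a), set(b)                    # rule 3
    max_l = max(len(sa), len(sb))
    if max_l < 4:
        return len(sa & sb) > 1
    return len(sa & sb) / max_l >= 0.8
```

For instance, `is_similar("Product 1", "Product 2")` is rejected by rule 1 even though the edit distance is only 1, which is exactly the case the digit rule exists for.
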
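**Design intent of the resolution flow**:
1. **Pre-filter**: The similarity check discards clearly different entity pairs, reducing the number of LLM calls (the O(n²) combinations are cut down to a manageable size).
2. **Batched LLM checks**: 100 pairs per batch, 5 concurrent requests, with a 280 s timeout in the test environment and no timeout in production.
3. **Connected-component merging**: Once the LLM has judged "A=B" and "B=C", A, B, and C are merged through their connected component even if the LLM never judged "A=C" directly.
4. **Task cancellation**: Each step checks `has_canceled(task_id)`, so users can interrupt long-running tasks.
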
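The transitive effect in point 3 can be reproduced without NetworkX. The sketch below uses a small union-find instead of `nx.connected_components`; `merge_groups` is a hypothetical helper name and the input pairs stand for the "same entity" verdicts returned by the LLM.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def merge_groups(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group entity names connected by "same entity" verdicts (union-find)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # union the two components

    groups: Dict[str, Set[str]] = defaultdict(set)
    for name in list(parent):
        groups[find(name)].add(name)
    # Sort for a deterministic result.
    return sorted(groups.values(), key=lambda g: sorted(g))
```

Given the verdicts `("A", "B")` and `("B", "C")`, all three names end up in one group even though `("A", "C")` was never checked.
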
#### 5.1.7 Community Detection and Report Generation

**File**: `api/app/core/rag/graphrag/general/leiden.py:95-141`

```python
def run(graph, args):
    max_cluster_size = args.get("max_cluster_size", 12)
    use_lcc = args.get("use_lcc", True)
    # Use hierarchical_leiden from graspologic
    community_mapping = hierarchical_leiden(graph, max_cluster_size=max_cluster_size, random_seed=seed)
    # Organize communities by level and compute community weight (node rank × weight, normalized)
    for level in levels:
        for node_id, raw_community_id in node_id_to_community_map[level].items():
            community_id = str(raw_community_id)
            result[community_id]["nodes"].append(node_id)
            result[community_id]["weight"] += graph.nodes[node_id].get("rank", 0) * graph.nodes[node_id].get("weight", 1)
```

**File**: `api/app/core/rag/graphrag/general/community_reports_extractor.py:55-158`

```python
class CommunityReportsExtractor(Extractor):
    async def __call__(self, graph, callback=None, task_id=""):
        communities = leiden.run(graph, {})
        async with trio.open_nursery() as nursery:
            for level, comm in communities.items():
                for community in comm.items():
                    nursery.start_soon(extract_community_report, community)

        async def extract_community_report(community):
            cm_id, cm = community
            ents = cm["nodes"]
            if len(ents) < 2:
                return  # Ignore single-node communities
            ent_df = pd.DataFrame([{"entity": e, "description": graph.nodes[e]["description"]} for e in ents])
            rela_df = pd.DataFrame([...])  # Relations inside the community, capped at 10000
            prompt = perform_variable_replacements(COMMUNITY_REPORT_PROMPT,
                variables={"entity_df": ent_df.to_csv(), "relation_df": rela_df.to_csv()})
            response = await trio.to_thread.run_sync(self._chat, text, ...)
            # Parse the JSON and validate field types
            if not dict_has_keys_with_types(response, [("title", str), ("summary", str), ("findings", list), ("rating", float), ("rating_explanation", str)]):
                return
```

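The field validation at the end of the excerpt can be illustrated with a small stand-in. `has_keys_with_types` below is a hypothetical re-implementation written for this sketch; the project's `dict_has_keys_with_types` may differ in details such as how it treats integer ratings.

```python
import json
from typing import Any, Iterable, Tuple, Type


def has_keys_with_types(data: Any, expected: Iterable[Tuple[str, Type]]) -> bool:
    """Return True only if `data` is a dict holding every key with the given type."""
    if not isinstance(data, dict):
        return False
    return all(key in data and isinstance(data[key], typ) for key, typ in expected)


# Schema copied from the excerpt above.
REPORT_SCHEMA = [("title", str), ("summary", str), ("findings", list),
                 ("rating", float), ("rating_explanation", str)]

raw = ('{"title": "T", "summary": "S", "findings": [], '
       '"rating": 7.5, "rating_explanation": "why"}')
report = json.loads(raw)
is_valid = has_keys_with_types(report, REPORT_SCHEMA)
```

Under this strict reading, a reply with `"rating": 7` would be rejected because JSON decodes it as an `int`, which may be worth checking against the real helper.
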
### 5.2 Graph Retrieval Pipeline

#### 5.2.1 Retrieval Entry Point

**File**: `api/app/core/rag/graphrag/search.py:19-280`

```python
class KGSearch(Dealer):
    def retrieval(self, question, workspace_ids, kb_ids, emb_mdl, llm,
                  max_token=8196, ent_topn=6, rel_topn=6, comm_topn=1,
                  ent_sim_threshold=0.3, rel_sim_threshold=0.3, **kwargs):
        # Step 1: Query rewrite
        ty_kwds, ents = self.query_rewrite(llm, qst, idxnms, kb_ids)
        # Step 2: Three-way recall
        ents_from_query = self.get_relevant_ents_by_keywords(ents, filters, idxnms, kb_ids, emb_mdl, ent_sim_threshold)
        ents_from_types = self.get_relevant_ents_by_types(ty_kwds, filters, idxnms, kb_ids, 10000)
        rels_from_txt = self.get_relevant_relations_by_txt(qst, filters, idxnms, kb_ids, emb_mdl, rel_sim_threshold)
        # Step 3: n-hop path expansion
        nhop_pathes = defaultdict(dict)
        for _, ent in ents_from_query.items():
            for nbr in ent.get("n_hop_ents", []):
                for i in range(len(path) - 1):
                    nhop_pathes[(path[i], path[i+1])]["sim"] += ent["sim"] / (2 + i)
        # Step 4: Score fusion
        for ent in ents_from_types:
            if ent in ents_from_query:
                ents_from_query[ent]["sim"] *= 2  # type-match boost
        for (f, t) in rels_from_txt:
            s = nhop_pathes.get(pair, {}).get("sim", 0)
            if f in ents_from_types: s += 1
            if t in ents_from_types: s += 1
            rels_from_txt[(f, t)]["sim"] *= s + 1  # n-hop + type boost
        # Step 5: Sort and truncate
        ents_from_query = sorted(..., key=lambda x: x[1]["sim"] * x[1]["pagerank"], reverse=True)[:ent_topn]
        rels_from_txt = sorted(..., key=lambda x: x[1]["sim"] * x[1]["pagerank"], reverse=True)[:rel_topn]
        # Step 6: Community report recall
        community = self._community_retrieval_([n for n, _ in ents_from_query], filters, kb_ids, idxnms, comm_topn, max_token)
        return {"page_content": ents + relas + community, "vector": None, ...}
```

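The excerpt does not show how `max_token` is consumed; the sequence diagram in 4.3 only says the budget decreases as items are emitted. A minimal sketch of one plausible reading follows. `fit_to_budget` is a hypothetical helper, the whitespace token counter is a placeholder for a real tokenizer, and stopping at the first item that does not fit (rather than skipping it) is an assumption.

```python
from typing import Callable, List, Tuple


def fit_to_budget(items: List[str], max_token: int,
                  count_tokens: Callable[[str], int] = lambda s: len(s.split()),
                  ) -> Tuple[List[str], int]:
    """Keep items in ranked order until the next one would exceed the budget."""
    kept: List[str] = []
    remaining = max_token
    for text in items:
        cost = count_tokens(text)
        if cost > remaining:
            break  # assumed policy: stop at the first item that does not fit
        kept.append(text)
        remaining -= cost
    return kept, remaining
```

Returning the remaining budget lets a caller pass it on to the next stage, for example from entities to relations and then to community reports.
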
#### 5.2.2 Query Rewrite

**File**: `api/app/core/rag/graphrag/search.py:33-55`

```python
def query_rewrite(self, llm, question, idxnms, kb_ids):
    # Fetch the entity type pool of the current KB from ES
    ty2ents = trio.run(lambda: get_entity_type2samples(idxnms, kb_ids))
    hint_prompt = PROMPTS["minirag_query2kwd"].format(
        query=question,
        TYPE_POOL=json.dumps(ty2ents, ensure_ascii=False, indent=2))
    result = self._chat(llm, hint_prompt, [{"role": "user", "content": "Output:"}], {})
    keywords_data = json_repair.loads(result)
    type_keywords = keywords_data.get("answer_type_keywords", [])
    entities_from_query = keywords_data.get("entities_from_query", [])[:5]
    return type_keywords, entities_from_query
```

**Design intent**:
- Query rewrite turns a natural-language question into two structured signals:
  1. `answer_type_keywords`: the answer types (such as "ORGANIZATION", "PERSON"), used for type-filtered recall.
  2. `entities_from_query`: the concrete entities in the query, used for vector-similarity recall.
- The type pool `ty2ents` is sampled from the entity types of the graph already built in ES, so the suggested types match the types that actually exist in the current knowledge base.

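The post-processing of the LLM reply can be sketched with the standard library alone. `parse_query_keywords` is a hypothetical helper; it uses `json.loads` instead of `json_repair`, and the extra filtering of types against the pool is a guard added for this sketch that the excerpt does not show.

```python
import json
from typing import List, Tuple


def parse_query_keywords(raw: str, type_pool: dict,
                         max_entities: int = 5) -> Tuple[List[str], List[str]]:
    """Parse the LLM reply; keep known types and at most `max_entities` entities."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return [], []
    if not isinstance(data, dict):
        return [], []
    # Assumed extra guard: drop types that are not in the KB's type pool.
    types = [t for t in data.get("answer_type_keywords", []) if t in type_pool]
    entities = list(data.get("entities_from_query", []))[:max_entities]
    return types, entities
```

With a pool of `{"PERSON": [...], "ORGANIZATION": [...]}`, a reply that proposes `"ALIEN"` as a type would have that type dropped, while the entity list is cut to the first five entries as in the excerpt.
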
#### 5.2.3 Entity Vector Recall

**File**: `api/app/core/rag/graphrag/search.py:96-106`

```python
def get_relevant_ents_by_keywords(self, keywords, filters, idxnms, kb_ids, emb_mdl, sim_thr=0.3, N=56):
    filters["knowledge_graph_kwd"] = "entity"
    matchDense = self.get_vector(", ".join(keywords), emb_mdl, 1024, sim_thr)
    es_res = self.dataStore.search(
        ["page_content", "entity_kwd", "rank_flt"], [], filters, [matchDense],
        OrderByExpr(), 0, N, idxnms, kb_ids)
    return self._ent_info_from_(es_res, sim_thr)
```

**Design points**:
- Entities and relations are both stored in ES as standalone chunks with a dense_vector field.
- The vector dimension is determined by the embedding model; the storage field is named `q_{dim}_vec`.
- `sim_thr=0.3` is the default similarity threshold and filters out low-quality matches.

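As a small illustration of the last two points, the sketch below derives the `q_{dim}_vec` field name from a vector and applies a similarity cut-off. Both helpers are hypothetical, and treating the threshold as inclusive is an assumption.

```python
from typing import Dict, List


def vector_field_name(vector: List[float]) -> str:
    """Build the dense-vector field name `q_{dim}_vec` from the vector length."""
    return f"q_{len(vector)}_vec"


def filter_by_similarity(hits: List[Dict], sim_thr: float = 0.3) -> List[Dict]:
    """Drop hits whose `sim` is below the threshold (inclusive cut-off assumed)."""
    return [h for h in hits if h.get("sim", 0.0) >= sim_thr]
```

A 1024-dimensional embedding would therefore be stored under `q_1024_vec`.
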
#### 5.2.4 n-hop Path Expansion and Fusion Formula

**File**: `api/app/core/rag/graphrag/search.py:160-210`

```python
# n-hop paths: expand from the matched entities along precomputed neighbor paths
for _, ent in ents_from_query.items():
    nhops = ent.get("n_hop_ents", [])
    for nbr in nhops:
        path = nbr["path"]
        wts = nbr["weights"]
        for i in range(len(path) - 1):
            f, t = path[i], path[i + 1]
            if (f, t) in nhop_pathes:
                nhop_pathes[(f, t)]["sim"] += ent["sim"] / (2 + i)
            else:
                nhop_pathes[(f, t)]["sim"] = ent["sim"] / (2 + i)
                nhop_pathes[(f, t)]["pagerank"] = wts[i]

# Fusion formula: P(E|Q) ≈ P(E) * P(Q|E) → pagerank * sim
# Entity ranking: score = sim × pagerank
ents_from_query = sorted(ents_from_query.items(),
                         key=lambda x: x[1]["sim"] * x[1]["pagerank"], reverse=True)[:ent_topn]
```

**Design intent**:
- n-hop paths are precomputed when entities are indexed (via NetworkX neighbor traversal) and stored in the `n_hop_with_weight` field.
- The further along a path an edge lies, the less it contributes: the weight decays as `1/(2+i)`, where `i` is the zero-based edge index, so the first hop contributes `sim/2`, the second `sim/3`, and so on.
- The final ranking fuses two signals: vector similarity (P(Q|E), the semantic match between query and entity) and PageRank (P(E), the entity's importance in the global graph).

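The decay and fusion rules can be checked on toy data. The sketch below mirrors the excerpt's loop with plain dictionaries; `nhop_edge_scores` and `rank_entities` are hypothetical helper names and the sample numbers are made up for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def nhop_edge_scores(entities: Dict[str, dict]) -> Dict[Edge, Dict[str, float]]:
    """Accumulate decayed similarity for every edge on each precomputed path."""
    scores: Dict[Edge, Dict[str, float]] = {}
    for ent in entities.values():
        for nbr in ent.get("n_hop_ents", []):
            path, weights = nbr["path"], nbr["weights"]
            for i in range(len(path) - 1):
                edge = (path[i], path[i + 1])
                # pagerank is taken from the first path that reaches the edge
                entry = scores.setdefault(edge, {"sim": 0.0, "pagerank": weights[i]})
                entry["sim"] += ent["sim"] / (2 + i)  # edge i decays by 1/(2+i)
    return scores


def rank_entities(entities: Dict[str, dict], topn: int) -> List[str]:
    """Order entities by sim × pagerank and keep the best `topn`."""
    ranked = sorted(entities.items(),
                    key=lambda kv: kv[1]["sim"] * kv[1]["pagerank"], reverse=True)
    return [name for name, _ in ranked[:topn]]


entities = {
    "A": {"sim": 0.6, "pagerank": 0.5,
          "n_hop_ents": [{"path": ["A", "B", "C"], "weights": [0.9, 0.4]}]},
    "B": {"sim": 0.9, "pagerank": 0.1},
}
edge_scores = nhop_edge_scores(entities)
```

Here the edge `("A", "B")` receives `0.6 / 2` and `("B", "C")` receives `0.6 / 3`, and entity `A` outranks `B` because `0.6 × 0.5` exceeds `0.9 × 0.1`.
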
#### 5.2.5 Working Together with Vector Retrieval

GraphRAG retrieval **does not replace** vector retrieval; it acts as one of several **parallel recall sources**. In `settings.py`:

```python
kg_retriever = kg_search.KGSearch(docStoreConn)  # graph retriever
retriever = search.Dealer(docStoreConn)          # vector retriever
```

Upper-layer callers (such as the chat workflow) invoke both and feed the graph recall results (Entities + Relations + Community Reports) into the LLM context together with the Document Chunks recalled by vector search.

---

## 6. Light vs General in Detail

### 6.1 Feature Comparison

| Dimension | Light | General | Notes |
|---|---|---|---|
| **Entity extraction prompt** | LightRAG style, with content_keywords | MS GraphRAG style, more concise | `light/graph_prompt.py` vs `general/graph_prompt.py` |
| **Gleaning termination** | Natural-language yes/no | Forced single-character Y (logit_bias) | Light is more flexible, General more deterministic |
| **Entity resolution** | ❌ Not supported | ✅ Supported | `entity_resolution.py` is only called in the General flow |
| **Community detection** | ❌ Not supported | ✅ Leiden algorithm | `general/leiden.py` |
| **Community reports** | ❌ Not supported | ✅ LLM-generated reports | `general/community_reports_extractor.py` |
| **Entity embedding** | Entity-name vectors only | Node2Vec supported (standby) | `general/entity_embedding.py` is not currently used on the main path |
| **Mind map** | ❌ Not supported | ✅ Supported | `general/mind_map_extractor.py` |
| **Concurrency control** | Same | Same | `trio.Semaphore` + `chat_limiter` |
| **Build time** | Low (no resolution/community) | High (resolution + community reports ≈ an extra 10-30 minutes) | |
| **Token consumption** | Low | High (one LLM call per community for reports) | |
| **Suitable data scale** | < 1K documents | > 1K documents | |

### 6.2 Switching Conditions

**Configuration entry**: `parser_config["graphrag"]["method"]`

```python
# api/app/core/rag/graphrag/general/index.py:54
extractor = LightKGExt if (
    "method" not in row["parser_config"].get("graphrag", {})
    or row["parser_config"]["graphrag"]["method"] != "general"
) else GeneralKGExt
```

| Condition | Resulting mode |
|---|---|
| `parser_config.graphrag.method` unset or != `"general"` | **Light** (default) |
| `parser_config.graphrag.method == "general"` | **General** |
| `with_resolution=True` and method=general | General + entity resolution |
| `with_community=True` and method=general | General + community reports |

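The table can be expressed as two small pure functions. This is a sketch of the documented intent, not the project's code: `select_mode` and `effective_switches` are hypothetical names, and the excerpt in 5.1.1 checks the two switches without re-checking the method, so gating them on General mode here is an assumption drawn from the table.

```python
from typing import Any, Dict


def select_mode(parser_config: Dict[str, Any]) -> str:
    """Return "general" only when graphrag.method is exactly "general"."""
    method = parser_config.get("graphrag", {}).get("method")
    return "general" if method == "general" else "light"


def effective_switches(parser_config: Dict[str, Any],
                       with_resolution: bool,
                       with_community: bool) -> Dict[str, bool]:
    """Apply the table: both switches only take effect in General mode."""
    general = select_mode(parser_config) == "general"
    return {"resolution": general and with_resolution,
            "community": general and with_community}
```

An empty `parser_config`, a missing `graphrag` key, or any other `method` value all fall back to Light.
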
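### 6.3 Resource Consumption Comparison (Estimates)

Taking a knowledge base of 1,000 chunks (about 500,000 characters) as an example:

| Stage | Light | General | Reason for the difference |
|---|---|---|---|
| Entity extraction | ~100 LLM calls | ~100 LLM calls | Similar in both |
| Entity resolution | 0 | ~10-50 LLM calls | Number of candidate pairs depends on the entity duplication rate |
| Community reports | 0 | ~20-100 LLM calls | Number of communities depends on graph density |
| Total tokens | ~500K-1M | ~2M-5M | General adds multi-round summaries + community reports |
| Total time | ~5-15 minutes | ~30-60 minutes | Resolution and communities dominate |
| ES storage | ≈ entity count + relation count | plus community report count + global graph | |

---
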
## 7. Key Prompts Explained

### 7.1 Query Analysis Prompt: `minirag_query2kwd`

**File**: `api/app/core/rag/graphrag/query_analyze_prompt.py:9-155`

```
---Role---
You are a helpful assistant tasked with identifying both answer-type and low-level keywords...

---Goal---
Given the query, list both answer-type and low-level keywords.
answer_type_keywords focus on the type of the answer...
The answer_type_keywords must be selected from Answer type pool.

---Instructions---
- Output the keywords in JSON format.
- "answer_type_keywords" for the types of the answer... No more than 3.
- "entities_from_query" for specific entities or details.
```

**Design intent, fragment by fragment**:

| Prompt fragment | Design intent |
|---|---|
| `answer_type_keywords must be selected from Answer type pool` | Forces the choice to come from types that actually exist in the knowledge base, so the LLM cannot invent types. The pool is sampled from the built graph, which keeps the types valid. |
| `No more than 3` | Caps the number of types to prevent over-broad answers that add recall noise. |
| `entities_from_query must be extracted from the query` | Stresses that entities must come from the query text itself; the LLM must not expand or guess, which keeps recall precise. |
| Four examples covering different domains | The few-shot examples span time, place, organization, and abstract concepts, helping the LLM learn the type-selection logic. |
| Dynamic injection of `TYPE_POOL` | At runtime the entity type distribution of the current KB is queried from ES, so type suggestions match the knowledge base content. |

### 7.2 Entity Resolution Prompt: `ENTITY_RESOLUTION_PROMPT`

**File**: `api/app/core/rag/graphrag/entity_resolution_prompt.py:1-58`

```
-Goal-
Please answer the following Question as required

-Steps-
1. Identify each line of questioning as required
2. Return output in English as a single list of each line answer...
Use **{record_delimiter}** as the list delimiter.

-Examples-
Example 1: Product comparison (computer vs phone → No, television vs TV → No)
Example 2: Toponym comparison (Chicago vs ChiTown → Yes, Shanghai vs Zhengzhou → No)

-Real Data-
Question:{input_text}
```

**Design intent, fragment by fragment**:

| Prompt fragment | Design intent |
|---|---|
| `only focus on critical properties and overlook noisy factors` | Steers the LLM toward core semantic features and away from noise such as case, abbreviations, and articles. |
| `Use domain knowledge of {entity_type}s` | Prompts the LLM to use domain knowledge in its judgment (for example, "Peking" = "Beijing" holds in geography). |
| `answer the above N questions in the format: For Question i, Yes/No...` | Enforces a fixed output format that is easy to parse with regular expressions. |
| `##` record_delimiter + `<\|>` entity_index_delimiter + `&&` resolution_result_delimiter | A three-level delimiter scheme that lowers the chance of parsing conflicts. |
| Two examples covering products and place names | Shows how the resolution criteria differ across domains, which improves generalization. |

**Note**: In the examples, "television vs TV → No" and "Chicago vs ChiTown → Yes" appear contradictory. They may be meant to **steer the LLM on whether an abbreviation denotes the same entity**: TV is an abbreviation of television (the same thing), yet the prompt labels it No, which may be an error in the example, while Chicago vs ChiTown (a slang nickname) is labeled Yes. This example design is questionable, and its real effect depends on how the LLM interprets it.

### 7.3 Light Entity Extraction Prompt

**File**: `api/app/core/rag/graphrag/light/graph_prompt.py:20-59`

```
---Goal---
Given a text document... identify all entities... and all relationships...

---Steps---
1. Identify all entities. Format: ("entity"{tuple_delimiter}<name>{tuple_delimiter}<type>{tuple_delimiter}<description>)
2. Identify all relationships. Format: ("relationship"{tuple_delimiter}<src>{tuple_delimiter}<tgt>{tuple_delimiter}<desc>{tuple_delimiter}<keywords>{tuple_delimiter}<strength>)
3. Identify high-level key words... Format: ("content_keywords"{tuple_delimiter}<keywords>)
4. Return output as a single list...
5. When finished, output {completion_delimiter}
```

**Design intent**:
- **Tuple format**: `("entity"<\|>NAME<\|>TYPE<\|>DESC)` uses fixed delimiters, which are easy to extract with regular expressions and more tolerant of formatting errors than JSON.
- **content_keywords**: Extra document-level keywords that can be used later for retrieval enhancement or tag classification.
- **relationship_keywords**: Relationship keywords supplement text retrieval over relation chunks.
- **strength**: Relationship strength (1-10), used later as a ranking weight.
- **Multi-round gleaning**: After the first pass, the model is asked again with `"MANY entities were missed"`, for at most 2 rounds (`ENTITY_EXTRACTION_MAX_GLEANINGS=2`).

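A parser for this tuple format can be sketched in a few lines. The delimiters `<|>` and `##` come from the glossary in section 3; `parse_records`, the record dictionaries, and the `<|COMPLETE|>` completion marker used in the sample are assumptions for illustration, not the project's parser.

```python
import re
from typing import Dict, List

TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"


def parse_records(output: str) -> List[Dict[str, str]]:
    """Split raw extraction output into entity / relationship / keyword records."""
    records: List[Dict[str, str]] = []
    for chunk in output.split(RECORD_DELIMITER):
        # Take everything between the first "(" and the last ")" of the chunk.
        match = re.search(r"\((.*)\)", chunk, flags=re.DOTALL)
        if not match:
            continue
        fields = [f.strip().strip('"') for f in match.group(1).split(TUPLE_DELIMITER)]
        kind = fields[0]
        if kind == "entity" and len(fields) >= 4:
            records.append({"kind": kind, "name": fields[1],
                            "type": fields[2], "description": fields[3]})
        elif kind == "relationship" and len(fields) >= 6:
            records.append({"kind": kind, "src": fields[1], "tgt": fields[2],
                            "description": fields[3], "keywords": fields[4],
                            "strength": fields[5]})
        elif kind == "content_keywords" and len(fields) >= 2:
            records.append({"kind": kind, "keywords": fields[1]})
    return records


sample = ('("entity"<|>"ALICE"<|>"person"<|>"Engineer")##'
          '("relationship"<|>"ALICE"<|>"ACME"<|>"works at"<|>"employment"<|>8)##'
          '("content_keywords"<|>"work, career")<|COMPLETE|>')
records = parse_records(sample)
```

Malformed records (too few fields or no parentheses) are skipped rather than raising, which matches the tolerance argument made for the tuple format above.
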
### 7.4 General Entity Extraction Prompt

**File**: `api/app/core/rag/graphrag/general/graph_prompt.py:8-106`

Main differences from the Light version:
- **No content_keywords**: Only entity + relationship are extracted, which is more focused.
- **No relationship_keywords**: Relationship descriptions are more concise.
- **No numeric strength**: Relationship weight is determined by occurrence frequency, not by an LLM score.
- **LOOP_PROMPT uses logit_bias**: Forces a single-character `Y` or `N`, which is more deterministic than Light's natural-language check.

### 7.5 Community Report Prompt

**File**: `api/app/core/rag/graphrag/general/community_report_prompt.py:8-157`

```
# Goal
Write a comprehensive report of a community...

# Report Structure
- TITLE: community's name...
- SUMMARY: An executive summary...
- IMPACT SEVERITY RATING: a float score between 0-10...
- RATING EXPLANATION: single sentence...
- DETAILED FINDINGS: 5-10 key insights...

# Grounding Rules
Points supported by data should list their data references as follows:
"...supported by multiple data references [Data: <dataset name> (record ids)]"
```

**Design intent**:
- **Structured JSON output**: the five fields `title/summary/rating/rating_explanation/findings` are mandatory, which makes the result easy to parse programmatically.
- **Impact rating (0-10)**: quantifies the importance of a community. At retrieval time, results are sorted by `weight_flt` so that high-impact communities are returned first.
- **Grounding Rules**: require citing data record IDs, which improves explainability (although the current implementation does not actually use these citations).
- **Example input**: complete examples for `VERDANT OASIS PLAZA` and `HARMONY ASSEMBLY` demonstrate the output format and the citation style.

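A minimal sketch of how a five-field report could be validated and ranked by rating follows. It assumes the LLM reply is already a clean JSON string and uses the standard `json` module. The helper names `parse_report` and `top_reports` are hypothetical, and the real pipeline may repair malformed JSON instead of discarding it.

```python
import json
from typing import Dict, Iterable, List, Optional

REQUIRED_FIELDS = ("title", "summary", "rating", "rating_explanation", "findings")


def parse_report(raw: str) -> Optional[Dict]:
    """Return the report dict, or None if it is not valid five-field JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_FIELDS):
        return None
    try:
        rating = float(data["rating"])
    except (TypeError, ValueError):
        return None
    if not 0.0 <= rating <= 10.0:
        return None  # outside the 0-10 range requested by the prompt
    data["rating"] = rating
    return data


def top_reports(raw_reports: Iterable[str], topn: int = 1) -> List[Dict]:
    """Keep valid reports and return the `topn` highest-rated ones."""
    reports = [r for r in map(parse_report, raw_reports) if r is not None]
    return sorted(reports, key=lambda r: r["rating"], reverse=True)[:topn]
```
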
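
---

## 8. Graph Storage Design

### 8.1 No Neo4j

MemoryBear's GraphRAG **does not depend on** a dedicated graph database such as Neo4j. It reuses Elasticsearch as the unified store, for three reasons:
1. **Simpler operations**: there is no extra graph database cluster to maintain.
2. **Hybrid retrieval**: the vector embeddings of entities and relations live in the same index as the document chunks, so they can be searched together.
3. **Incremental updates**: the ES document model natively supports incremental writes and version management.
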
### 8.2 ES Document Types (knowledge_graph_kwd)

| Type | Stored content | Key fields |
|---|---|---|
| `graph` | Global graph (NetworkX node_link_data JSON) | `page_content` (JSON), `source_id` |
| `subgraph` | Per-document subgraph | `page_content` (JSON), `source_id` |
| `entity` | A single entity (vector-searchable) | `entity_kwd`, `entity_type_kwd`, `rank_flt`, `q_*_vec` |
| `relation` | A single relation (vector-searchable) | `from_entity_kwd`, `to_entity_kwd`, `weight_int`, `q_*_vec` |
| `community_report` | Community report | `docnm_kwd` (title), `weight_flt`, `entities_kwd` |
| `ty2ents` | Mapping from type to sample entities | `page_content` (JSON dict) |

### 8.3 Vector Embedding Strategy

**File**: `api/app/core/rag/graphrag/utils.py:301-327` (entities) and `352-378` (relations)

```python
async def graph_node_to_chunk(kb_id, embd_mdl, ent_name, meta, chunks):
    chunk = {
        "entity_kwd": ent_name,
        "knowledge_graph_kwd": "entity",
        "entity_type_kwd": meta["entity_type"],
        "page_content": json.dumps(meta, ensure_ascii=False),
        ...
    }
    # Entity vector = embedding of entity_name
    ebd, _ = embd_mdl.encode([ent_name])
    chunk["q_%d_vec" % len(ebd)] = ebd

async def graph_edge_to_chunk(kb_id, embd_mdl, from_ent_name, to_ent_name, meta, chunks):
    # Relation vector = embedding of "from->to: description"
    txt = f"{from_ent_name}->{to_ent_name}"
    ebd, _ = embd_mdl.encode([txt + f": {meta['description']}"])
    chunk["q_%d_vec" % len(ebd)] = ebd
```

**Design points**:
- Entity vectors are based on the **entity name** (`ent_name`) rather than the description text, because user queries usually contain the entity name.
- Relation vectors are based on `"from->to: description"`, which balances structural and semantic information.
- Vector cache: embedding results are cached through Redis + xxhash to avoid recomputation.

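The caching idea in the last design point can be illustrated with a small in-memory sketch. It swaps in stand-ins so that it runs without external services: a plain dict replaces Redis and `hashlib.sha256` replaces xxhash. The `CachedEmbedder` class, its key layout, and the `encode` callable are all hypothetical and not taken from `utils.py`. `relation_text` reproduces the `"from->to: description"` composition shown above.

```python
import hashlib
from typing import Callable, Dict, List


def relation_text(from_ent: str, to_ent: str, description: str) -> str:
    """Text that is embedded for a relation: structure first, then semantics."""
    return f"{from_ent}->{to_ent}: {description}"


class CachedEmbedder:
    """Memoize embeddings by (model name, text); a dict stands in for Redis."""

    def __init__(self, encode: Callable[[str], List[float]], model_name: str):
        self._encode = encode
        self._model_name = model_name
        self._cache: Dict[str, List[float]] = {}
        self.misses = 0  # number of real encode calls, useful for monitoring

    def _key(self, text: str) -> str:
        # sha256 is a stand-in for xxhash; the model name is part of the key
        # so vectors from different models never collide.
        payload = f"{self._model_name}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def embed(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._encode(text)
        return self._cache[key]
```
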

---

## 9. Configuration and Tunable Parameters

### 9.1 Environment Variables

| Environment variable | Default | Description | Source location |
|---|---|---|---|
| `MAX_CONCURRENT_CHATS` | 10 | Upper bound on concurrent LLM calls (trio CapacityLimiter) | `utils.py:41` |
| `MAX_CONCURRENT_PROCESS_AND_EXTRACT_CHUNK` | 10 | Upper bound on concurrent chunk processing | `general/extractor.py:33` |
| `ENABLE_TIMEOUT_ASSERTION` | unset | Test mode: enables short timeouts (3-280s) | multiple `trio.fail_after` call sites |

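For readers wiring these variables into their own tooling, a defensive way to read them is sketched below. The helpers `env_int` and `timeout_assertion_enabled` are hypothetical. The fallback behavior for malformed or non-positive values, and the rule that any non-empty value enables the test-mode flag, are assumptions for this example, not documented behavior of the source.

```python
import os
from typing import Mapping, Optional


def env_int(name: str, default: int,
            environ: Optional[Mapping[str, str]] = None) -> int:
    """Read a positive integer from the environment, else return `default`."""
    env = os.environ if environ is None else environ
    raw = env.get(name, "")
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default  # unset, empty, or not a number
    return value if value > 0 else default


def timeout_assertion_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Assumption: any non-empty value switches the short test timeouts on."""
    env = os.environ if environ is None else environ
    return bool(env.get("ENABLE_TIMEOUT_ASSERTION"))


# Defaults match the table above.
MAX_CONCURRENT_CHATS = env_int("MAX_CONCURRENT_CHATS", 10)
MAX_CONCURRENT_PROCESS_AND_EXTRACT_CHUNK = env_int(
    "MAX_CONCURRENT_PROCESS_AND_EXTRACT_CHUNK", 10)
```
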
### 9.2 parser_config Settings

**File**: `api/app/models/knowledge_model.py:77-82` / `document_model.py:27-32`

```python
"graphrag": {
    "use_graphrag": False,   # master switch
    "method": "light",       # "light" or "general"
    "resolution": False,     # enable entity resolution (General only)
    "community": False,      # enable community reports (General only)
    "entity_types": []       # custom entity type list; empty means use the defaults
}
```

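One way to combine a knowledge base's `parser_config` with these defaults is sketched below. The function `resolve_graphrag_config` is hypothetical. Rejecting unknown keys and forcing `resolution` and `community` off for the `light` method are assumptions derived from the "General only" comments above, not confirmed behavior of the source.

```python
import copy
from typing import Any, Dict, Optional

DEFAULT_GRAPHRAG: Dict[str, Any] = {
    "use_graphrag": False,
    "method": "light",
    "resolution": False,
    "community": False,
    "entity_types": [],
}


def resolve_graphrag_config(parser_config: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Overlay user settings on the defaults and validate the result."""
    cfg = copy.deepcopy(DEFAULT_GRAPHRAG)  # deep copy so the default list is never shared
    overrides = (parser_config or {}).get("graphrag") or {}
    unknown = set(overrides) - set(cfg)
    if unknown:
        raise ValueError(f"unknown graphrag keys: {sorted(unknown)}")
    cfg.update(copy.deepcopy(overrides))
    if cfg["method"] not in ("light", "general"):
        raise ValueError(f"unsupported method: {cfg['method']!r}")
    if cfg["method"] == "light":
        # Assumption: the General-only options are no-ops for the light method.
        cfg["resolution"] = False
        cfg["community"] = False
    return cfg
```
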
### 9.3 Retrieval Parameters

**File**: `api/app/core/rag/graphrag/search.py:130-141`

| Parameter | Default | Description |
|---|---|---|
| `max_token` | 8196 | Total token budget for the returned results |
| `ent_topn` | 6 | Maximum number of entities returned |
| `rel_topn` | 6 | Maximum number of relations returned |
| `comm_topn` | 1 | Maximum number of community reports returned |
| `ent_sim_threshold` | 0.3 | Vector similarity threshold for entities |
| `rel_sim_threshold` | 0.3 | Vector similarity threshold for relations |

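How a similarity threshold, a top-N cap, and a token budget interact can be shown with a short sketch. It is illustrative only: `select_hits` and `fit_budget` are hypothetical helpers, the hit structure (`{"name", "sim"}`) is invented for the example, and counting tokens by whitespace-separated words is a stand-in for whatever tokenizer the retrieval code actually uses.

```python
from typing import Callable, Dict, List


def select_hits(hits: List[Dict], sim_threshold: float = 0.3,
                topn: int = 6) -> List[Dict]:
    """Drop hits below the threshold, then keep the `topn` most similar."""
    kept = [h for h in hits if h["sim"] >= sim_threshold]
    kept.sort(key=lambda h: h["sim"], reverse=True)
    return kept[:topn]


def fit_budget(texts: List[str], max_token: int = 8196,
               count_tokens: Callable[[str], int] = lambda t: len(t.split()),
               ) -> List[str]:
    """Take texts in order until the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > max_token:
            break
        selected.append(text)
        used += cost
    return selected
```

Because the budget is consumed in order, whatever is serialized first (entities, relations, or community reports) gets priority when `max_token` is tight.
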
### 9.4 Entity Resolution Parameters

**File**: `api/app/core/rag/graphrag/entity_resolution.py`

| Parameter | Default | Description |
|---|---|---|
| `resolution_batch_size` | 100 | Number of entity pairs resolved per batch |
| `max_concurrent_tasks` | 5 | Concurrency of resolution LLM calls |
| Timeout | 280s (test) / unlimited (production) | `trio.move_on_after` |

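To make the batch size concrete, the sketch below generates candidate pairs and groups them into batches of `resolution_batch_size`. The helpers `candidate_pairs` and `batches` are hypothetical, and `is_similar` is a caller-supplied stand-in for the pre-filter (`is_similarity` in the source), whose exact rules are not reproduced here.

```python
from itertools import combinations
from typing import Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]


def candidate_pairs(names: Iterable[str],
                    is_similar: Callable[[str, str], bool]) -> List[Pair]:
    """All unordered pairs of distinct names that pass the pre-filter."""
    unique = sorted(set(names))
    return [(a, b) for a, b in combinations(unique, 2) if is_similar(a, b)]


def batches(pairs: List[Pair], batch_size: int = 100) -> Iterator[List[Pair]]:
    """Yield consecutive slices of at most `batch_size` pairs."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]
```

With n names and a pre-filter that accepts everything, there are n(n-1)/2 pairs, so 24 names already produce 276 pairs, which is three LLM batches at the default size.
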
### 9.5 Community Detection Parameters

**File**: `api/app/core/rag/graphrag/general/leiden.py:97`

| Parameter | Default | Description |
|---|---|---|
| `max_cluster_size` | 12 | Maximum number of nodes in a single community |
| `use_lcc` | True | Whether to use only the largest connected component |
| `seed` | 0xDEADBEEF | Random seed for the Leiden algorithm |

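
---

## 10. Edge Cases and Known Limitations

### 10.1 Known Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| Entity resolution only processes nodes in subgraph_nodes | Nodes resolved in earlier runs do not take part in a new round of resolution | Manually rebuild the graph to trigger full resolution |
| Community reports ignore communities with < 2 nodes | Isolated entities are not covered by any community report | Covered by direct entity recall |
| Relation extraction drops relations whose entities are missing | A failed entity extraction also loses the related relations | Check the logs after `tidy_graph` |
| Malformed LLM output causes parsing failures | Entities/relations of some chunks are lost | Tolerant parsing with the `json_repair` library plus an error count limit (max_errors=3) |
| Entity names are normalized to upper case | "Apple" and "apple" are treated as the same entity | By design, to avoid case-only duplicates |
| Chinese names are compared with character-set Jaccard instead of edit distance | Short entities (< 4 characters) use a different threshold | Special-cased in `is_similarity` |
| A full graph rebuild must traverse every subgraph | Rebuilding is slow on large datasets | Incremental merging avoids full rebuilds |
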
### 10.2 Idempotency and Concurrency Safety

- `generate_subgraph()` checks `does_graph_contains()` so that the graph for a document is not built twice.
- `merge_subgraph()` uses `RedisDistributedLock` to keep concurrent merges on the same KB safe.
- `run_graphrag_for_kb()` supports `max_parallel_documents=4` to cap document-level concurrency.

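### 10.3 Task Cancellation

All long-running operations (extraction, resolution, community reports) are interleaved with `has_canceled(task_id)` checks, so a user can cancel a task through a Redis key:

```python
def has_canceled(task_id):
    # A task counts as canceled once the Redis key "<task_id>-cancel" exists.
    return redis_client.get(f"{task_id}-cancel") is not None
```
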
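The cooperative pattern can be sketched as a loop that checks the flag between units of work. To stay runnable without Redis, the cancellation check is passed in as a callable. `TaskCanceled`, `process_chunks`, and `handle` are hypothetical names for this illustration and do not appear in the source.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class TaskCanceled(Exception):
    """Raised when the cancel flag is observed between two units of work."""

    def __init__(self, task_id: str, completed: int):
        super().__init__(f"Task {task_id} cancelled after {completed} chunk(s)")
        self.completed = completed


def process_chunks(task_id: str, chunks: Iterable[T],
                   handle: Callable[[T], R],
                   has_canceled: Callable[[str], bool]) -> List[R]:
    """Process chunks one by one, checking for cancellation before each."""
    results: List[R] = []
    for chunk in chunks:
        if has_canceled(task_id):
            raise TaskCanceled(task_id, completed=len(results))
        results.append(handle(chunk))
    return results
```
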
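
---

## 11. Monitoring Metrics and Troubleshooting

### 11.1 Key Logs

| Log pattern | Meaning | What to check |
|---|---|---|
| `ignored X relations due to missing entities` | The entities a relation points to were not extracted | Check the LLM output format, or relax the cleanup criteria of tidy_graph |
| `Resolved X candidate pairs, Y of them are selected to merge` | Entity resolution statistics | A very low Y/X suggests that the pre-filter lets through too many non-matching pairs or that the LLM is too conservative |
| `Graph extracted X communities in Ys` | Community detection finished | If the community count is abnormal (0 or too many), check graph connectivity |
| `Task {id} cancelled during...` | The task was canceled | Normal user behavior, no investigation needed |
| `Didn't extract any entities and relationships` | The LLM returned nothing | Check LLM availability and whether the prompt exceeds the length limit |
| `Insert chunk error` | ES write failed | Check ES cluster health and the index mapping |
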
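### 11.2 Performance Metrics

| Metric | How it is collected | Healthy threshold |
|---|---|---|
| Graph build time per document | callback logs | Light < 5 min, General < 30 min |
| Token consumption of entity extraction | `sum_token_count` | Watch for abnormally high consumption on a single chunk |
| ES query latency | elapsed time of `dataStore.search` | P99 < 500 ms |
| LLM call success rate | error log count | > 95% |
| Number of resolution candidate pairs | `num_candidates` | Grows with the square of the node count; watch for abnormal growth |
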
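
---

## 12. Optimization Suggestions and Future Extensions

### 12.1 Short-Term Optimizations (deliverable in 1-2 weeks)

1. **Better pre-filtering for entity resolution**: `is_similarity` currently uses character-set Jaccard for Chinese, which works poorly for homophones and visually similar characters (for example "阿里巴巴" vs "阿狸巴巴"). Consider adding pinyin similarity or glyph similarity as a third pre-filter layer.
2. **Fix the resolution prompt example**: the "television vs TV → No" example in `entity_resolution_prompt.py` contradicts common sense. Consider changing it to Yes so that it does not mislead the LLM.
3. **Concurrency control for community reports**: `community_reports_extractor.py` currently starts one trio task per community, which can overwhelm the LLM when there are many communities. Consider adding a community-level concurrency limit.
4. **Relation vector optimization**: relation vectors currently embed `"from->to: description"`, but the description can be very long. Consider embedding only `"from->to"` or the relation keywords to improve retrieval efficiency.
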
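The community-level limit suggested in item 3 could look roughly like the sketch below. It deliberately uses the standard library's `asyncio.Semaphore` so that it runs without extra dependencies, whereas the codebase is built on trio, where a `trio.CapacityLimiter` would play the same role. `summarize_all`, `summarize`, and the default limit of 4 are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

C = TypeVar("C")
R = TypeVar("R")


async def summarize_all(communities: Sequence[C],
                        summarize: Callable[[C], Awaitable[R]],
                        max_concurrency: int = 4) -> List[R]:
    """Summarize every community with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(community: C) -> R:
        async with semaphore:  # waits while the limit is reached
            return await summarize(community)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(c) for c in communities))
```

A call such as `asyncio.run(summarize_all(communities, summarize, max_concurrency=4))` still starts one coroutine per community, but only four LLM requests run at any moment.
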
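### 12.2 Mid-Term Extensions (1-2 months)

1. **Stronger multi-hop reasoning**: n-hop paths are currently precomputed static data. Multi-hop traversal could instead run dynamically at retrieval time to support more flexible reasoning paths.
2. **Temporal graph**: add a time dimension to relations and entities to support queries such as "how an entity's relations changed during a given period".
3. **Graph visualization API**: expose a frontend-consumable graph data endpoint based on the `nx.node_link_data` output, enabling interactive graph browsing.
4. **Incremental entity type discovery**: entity types are currently static configuration. The LLM could discover new entity types in documents automatically and extend the type pool dynamically.
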
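### 12.3 Long-Term Directions (roadmap)

1. **GraphRAG + multimodal**: bring entities found in images (for example organization logos extracted by OCR) into the graph to support cross-modal entity linking.
2. **Dynamic graph updates**: the current mode is batch processing (graph building is triggered after a document is uploaded). Streaming updates could be explored so that real-time knowledge base edits update the graph incrementally.
3. **Evaluating graph databases as an ES replacement**: once the graph reaches millions of nodes, ES graph query performance may become a bottleneck. The feasibility of integrating a dedicated graph database such as Neo4j or Dgraph can be evaluated.
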

---

## Appendix: Source Code Quick Reference

| Function | File | Key class/function | Lines |
|---|---|---|---|
| Graph build orchestration | `general/index.py` | `run_graphrag()` | 36-119 |
| KB-level batch graph build | `general/index.py` | `run_graphrag_for_kb()` | 122-330 |
| Subgraph generation | `general/index.py` | `generate_subgraph()` | 333-406 |
| Subgraph merge | `general/index.py` | `merge_subgraph()` | 409-436 |
| Light entity extraction | `light/graph_extractor.py` | `GraphExtractor._process_single_content()` | 74-131 |
| General entity extraction | `general/graph_extractor.py` | `GraphExtractor._process_single_content()` | 100-150 |
| Extractor base class | `general/extractor.py` | `Extractor.__call__()` | 97-203 |
| Node merge | `general/extractor.py` | `Extractor._merge_nodes()` | 205-225 |
| Edge merge | `general/extractor.py` | `Extractor._merge_edges()` | 227-236 |
| Graph node merge | `general/extractor.py` | `Extractor._merge_graph_nodes()` | 238-275 |
| Description summarization | `general/extractor.py` | `Extractor._handle_entity_relation_summary()` | 277-300 |
| Entity resolution | `entity_resolution.py` | `EntityResolution.__call__()` | 53-141 |
| Resolution candidate check | `entity_resolution.py` | `EntityResolution._resolve_candidate()` | 143-186 |
| Result parsing | `entity_resolution.py` | `EntityResolution._process_results()` | 188-213 |
| Similarity pre-filter | `entity_resolution.py` | `EntityResolution.is_similarity()` | 225-239 |
| Community detection | `general/leiden.py` | `run()` | 95-141 |
| Community report extraction | `general/community_reports_extractor.py` | `CommunityReportsExtractor.__call__()` | 55-158 |
| Graph retrieval | `search.py` | `KGSearch.retrieval()` | 130-280 |
| Query rewrite | `search.py` | `KGSearch.query_rewrite()` | 33-55 |
| Entity vector recall | `search.py` | `KGSearch.get_relevant_ents_by_keywords()` | 96-106 |
| Relation vector recall | `search.py` | `KGSearch.get_relevant_relations_by_txt()` | 107-117 |
| Type-filtered recall | `search.py` | `KGSearch.get_relevant_ents_by_types()` | 118-128 |
| Community report recall | `search.py` | `KGSearch._community_retrieval_()` | 282-302 |
| Graph merge utility | `utils.py` | `graph_merge()` | 199-229 |
| Write graph to ES | `utils.py` | `set_graph()` | 426-516 |
| Read graph from ES | `utils.py` | `get_graph()` | 407-423 |
| Entity to chunk | `utils.py` | `graph_node_to_chunk()` | 301-327 |
| Relation to chunk | `utils.py` | `graph_edge_to_chunk()` | 352-378 |
| LLM cache | `utils.py` | `get_llm_cache()` / `set_llm_cache()` | 97-113 |
| Task cancellation check | `utils.py` | `has_canceled()` | 628-634 |
| Query analysis prompt | `query_analyze_prompt.py` | `PROMPTS["minirag_query2kwd"]` | 9-155 |
| Resolution prompt | `entity_resolution_prompt.py` | `ENTITY_RESOLUTION_PROMPT` | 1-58 |
| Light extraction prompt | `light/graph_prompt.py` | `PROMPTS["entity_extraction"]` | 20-59 |
| General extraction prompt | `general/graph_prompt.py` | `GRAPH_EXTRACTION_PROMPT` | 8-106 |
| Community report prompt | `general/community_report_prompt.py` | `COMMUNITY_REPORT_PROMPT` | 8-157 |
| Document graph build entry point | `tasks.py` | `build_graphrag_for_document()` | 557-636 |
| KB graph build trigger | `tasks.py` | `build_graphrag_for_kb()` | 472-556 |
| Model default configuration | `models/knowledge_model.py` | `parser_config["graphrag"]` | 77-82 |