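# RAG Stage Boundary Definitions

> Goal: pin down the input, output, and upstream/downstream interfaces (at the data-structure level) of every RAG stage, so that the Sprint-2 documents neither leave gaps nor overlap.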

---

## Overview diagram

```
[Data Sources] ──→ [Loader] ──→ [Parser] ──→ [Chunking] ──→ [Embedding] ──→ [VDB]
                                                                      ↑
                                                                      │ (async)
                                                              [GraphRAG]

[User Query] ──→ [Query Understanding] ──→ [Retrieval] ──→ [Reranking] ──→ [Prompt] ──→ [LLM] ──→ [Post-Process] ──→ [Answer]
                                              ↑
                                              │ (GRAPH mode)
                                        [KG Search]
```

---

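## 1. Loader (data loading layer)

| Dimension | Definition |
|------|------|
| **Upstream** | External systems: Feishu API, Yuque API, Web URLs, user upload endpoint |
| **Input** | Feishu: folder_token, app_id, app_secret; Yuque: user_id, token; Web: entry_url, max_pages; upload: multipart/form-data |
| **Output** | **Raw file content**: a `CrawledDocument` (dataclass), or a **local file path** (.docx/.pdf/.md/.html/.xlsx) |
| **Output data structure** | `CrawledDocument(url, title, content, content_length, crawl_timestamp, metadata)`; local file: `str` (path) |
| **Downstream** | Parser: receives a file path or bytes and calls the matching format-specific parser |
| **Boundary contract** | The Loader does no format parsing (no body-text extraction, no OCR). It is responsible only for: authenticate → fetch/download → save to disk. Format detection is decided by the Parser layer's `naive.chunk()` based on the file extension. |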

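The output structure above can be written down as a dataclass. This is only a sketch: the field names come from the table, while the field types, the default for `metadata`, and the `save_raw` helper are assumptions made for illustration and may not match the real code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


@dataclass
class CrawledDocument:
    """Field names follow the table; the types are assumed."""
    url: str
    title: str
    content: str
    content_length: int
    crawl_timestamp: datetime
    metadata: Dict[str, Any] = field(default_factory=dict)


def save_raw(doc: CrawledDocument, out_dir: Path) -> str:
    """Hypothetical helper: persist the raw content and return a path for the Parser.

    Nothing is parsed here, in line with the Loader boundary contract.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{doc.title}.html"
    path.write_text(doc.content, encoding="utf-8")
    return str(path)


raw_html = "<html><body>hello</body></html>"
doc = CrawledDocument(
    url="https://example.com/page",
    title="page",
    content=raw_html,
    content_length=len(raw_html),
    crawl_timestamp=datetime.now(timezone.utc),
)
```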
---

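## 2. Parser (document parsing layer)

| Dimension | Definition |
|------|------|
| **Upstream** | Loader: receives a file path `str` or binary `bytes` |
| **Input** | `(filename: str, binary: bytes \| None, from_page, to_page, callback, vision_model)` |
| **Output** | `sections: List[Tuple[str, str]]`: (text_content, layout_tag); `tables: List[Tuple[Tuple[Optional[Image.Image], Union[str, List[str]]], List[Tuple]]]` |
| **Output data structure** | A list of tuples in which the tag is the layout type ("Title"/"Text"/"Table"/...) and the text may carry a position tag `@@page\tx0\tx1\ttop\tbottom##` |
| **Downstream** | Chunking: receives `sections` + `tables` and performs merging and chunking |
| **Boundary contract** | A parser performs format-specific **pure extraction** and is not responsible for semantic chunking. The PDF parser is a special case: it must output OCR results + layout information + table HTML. Parsers never call one another; `naive.chunk()` dispatches to them centrally. |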

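To make the position tag concrete, here is a minimal sketch that separates the tag from the section text. It assumes the tag is exactly `@@page\tx0\tx1\ttop\tbottom##` with tab separators, an integer page, and plain decimal coordinates; `Position` and `split_position_tags` are hypothetical names, not part of the parser API.

```python
import re
from typing import List, NamedTuple, Tuple


class Position(NamedTuple):
    page: int
    x0: float
    x1: float
    top: float
    bottom: float


# Assumed tag shape: "@@<page>\t<x0>\t<x1>\t<top>\t<bottom>##".
_POS_TAG = re.compile(r"@@(\d+)\t([\d.]+)\t([\d.]+)\t([\d.]+)\t([\d.]+)##")


def split_position_tags(text: str) -> Tuple[str, List[Position]]:
    """Return the text with position tags removed, plus the positions found."""
    positions = [
        Position(int(m.group(1)), *(float(m.group(i)) for i in range(2, 6)))
        for m in _POS_TAG.finditer(text)
    ]
    return _POS_TAG.sub("", text), positions


# One (text_content, layout_tag) pair in the shape of `sections`.
sections = [("Quarterly report@@1\t72.0\t300.5\t90.0\t110.0##", "Title")]
clean_text, found = split_position_tags(sections[0][0])
```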
---

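## 3. Chunking (text chunking layer)

| Dimension | Definition |
|------|------|
| **Upstream** | Parser: `sections` + `tables` |
| **Input** | `sections: List[Tuple[str, str]]`, `tables`, `chunk_token_num: int`, `delimiter: str`, `parser_config: dict` |
| **Output** | `res: List[Dict]`: the list of chunked document dicts |
| **Output data structure (key fields)** | `content_with_weight: str` (original text), `content_ltks: str` (coarse-grained tokens), `content_sm_ltks: str` (fine-grained tokens), `image: PIL.Image` (optional), `page_num_int: int`, `position_int: List[int]`, `top_int: int`, `doc_type_kwd: str` |
| **Downstream** | Embedding: receives `res` and vectorizes `content_with_weight`; GraphRAG: receives the text in `res` for entity and relation extraction |
| **Boundary contract** | Chunking neither calls Embedding nor writes to the VDB directly. It only splits long text into chunks that fit the token budget and fills in the tokenization and position metadata. Multimodal (image/audio) chunking results use this same data structure. |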

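The token-budget idea can be illustrated with a greedy merge. This is a generic sketch, not the project's `naive_merge`: the whitespace tokenizer is a stand-in, `merge_sections` is a hypothetical name, and only `content_with_weight` is filled in.

```python
from typing import Dict, List, Tuple


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def merge_sections(sections: List[Tuple[str, str]], chunk_token_num: int) -> List[Dict]:
    """Greedily pack section texts into chunks of at most chunk_token_num tokens.

    A single section longer than the budget becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for text, _tag in sections:
        n = count_tokens(text)
        # Flush the running chunk when adding this section would exceed the budget.
        if current and used + n > chunk_token_num:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        chunks.append("\n".join(current))
    return [{"content_with_weight": c} for c in chunks]


res = merge_sections(
    [("Intro", "Title"), ("one two three", "Text"), ("four five six seven", "Text")],
    chunk_token_num=4,
)
```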
---

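## 4. Embedding (vectorization layer)

| Dimension | Definition |
|------|------|
| **Upstream** | Chunking: receives `content_with_weight` from the chunk dicts |
| **Input** | `texts: List[str]` (batch, ≤16 items by default) |
| **Output** | `(np.array, total_tokens)`: `np.array` of shape `(batch_size, vector_dimension)` |
| **Output data structure** | NumPy ndarray, float32; the vector dimension is set by the model (e.g. OpenAI text-embedding-3-small: 1536 dimensions) |
| **Downstream** | VDB: receives `(chunk_text, vector, metadata)`, assembled into a `DocumentChunk` before being stored |
| **Boundary contract** | The Embedding layer is stateless and does not manage the model lifecycle. Providers are instantiated through a factory pattern (matched on `Base._FACTORY_NAME`). Over-long input text is truncated automatically (OpenAI to 8000 tokens, QWen to 2048). A single query can be encoded with `encode_queries()`. |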

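The batch limit of 16 can be sketched as a small loop around a provider call. Everything here is illustrative: `fake_encode` is a hypothetical stand-in for a real provider, and plain Python lists take the place of the NumPy array so the example needs no third-party packages.

```python
from typing import Callable, List, Tuple

Vector = List[float]
EncodeFn = Callable[[List[str]], Tuple[List[Vector], int]]


def fake_encode(batch: List[str]) -> Tuple[List[Vector], int]:
    """Hypothetical provider call: returns 3-d vectors and a token count."""
    vectors = [[float(len(t)), 0.0, 1.0] for t in batch]
    tokens = sum(len(t.split()) for t in batch)
    return vectors, tokens


def encode_in_batches(
    texts: List[str], encode: EncodeFn, batch_size: int = 16
) -> Tuple[List[Vector], int]:
    """Encode texts in slices of at most batch_size and sum the token usage."""
    vectors: List[Vector] = []
    total_tokens = 0
    for start in range(0, len(texts), batch_size):
        batch_vectors, tokens = encode(texts[start:start + batch_size])
        vectors.extend(batch_vectors)
        total_tokens += tokens
    return vectors, total_tokens


texts = [f"chunk number {i}" for i in range(40)]
vectors, total_tokens = encode_in_batches(texts, fake_encode)
```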
---

## 5. VDB (vector database layer)

| Dimension | Definition |
|------|------|
| **Upstream** | Embedding: receives `(text, vector, metadata)`; Chunking: receives the metadata in the chunk dicts |
| **Input** | `DocumentChunk(page_content: str, vector: List[float], metadata: dict)`; or, for retrieval: `query: str, top_k: int, indices: str, score_threshold: float` |
| **Output (ingest)** | ack / error; **Output (retrieval)**: `List[DocumentChunk]` |
| **Storage schema** | `page_content: text(ik_max_word)`, `metadata: object(doc_id, document_id, knowledge_id, sort_id, status)`, `vector: dense_vector(cosine, dynamic_dims)` |
| **Downstream** | Retrieval: fetches results via `search_by_vector` / `search_by_full_text` / `search` (hybrid) |
| **Boundary contract** | The VDB serves a dual role: **document store** (full-text index) and **vector store** (dense-vector index). ES is the only backend (no Milvus, Pinecone, etc.). GraphRAG entities, relations, and community reports are also stored here in the same chunk format. |

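As a rough picture of the storage schema, the function below builds an Elasticsearch mapping body as a plain dict. The top-level field names and the `ik_max_word` analyzer (which requires the IK analysis plugin) come from the table; the metadata sub-field types and the `build_mapping` helper are assumptions, and the real mapping may differ.

```python
from typing import Any, Dict


def build_mapping(dims: int) -> Dict[str, Any]:
    """Build a mapping body for the schema in the table.

    `dims` would come from the embedding model; the metadata
    sub-field types below are guesses for illustration only.
    """
    return {
        "properties": {
            "page_content": {"type": "text", "analyzer": "ik_max_word"},
            "metadata": {
                "type": "object",
                "properties": {
                    "doc_id": {"type": "keyword"},
                    "document_id": {"type": "keyword"},
                    "knowledge_id": {"type": "keyword"},
                    "sort_id": {"type": "integer"},
                    "status": {"type": "keyword"},
                },
            },
            "vector": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
        }
    }


mapping = build_mapping(dims=1536)
```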
---

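## 6. GraphRAG (knowledge graph layer)

| Dimension | Definition |
|------|------|
| **Upstream** | Chunking: receives `content_with_weight` from the chunk dicts; Celery: triggers `build_graphrag_for_document` asynchronously |
| **Input (indexing)** | `(document_id, chunk_text)` tuples; `chat_model: Base`, `embedding_model: OpenAIEmbed`, `vector_service: ElasticSearchVector` |
| **Input (retrieval)** | `question: str, workspace_ids: List[str], kb_ids: List[str], emb_mdl, llm` |
| **Output (indexing)** | `nx.Graph` (global graph) stored in ES; `entity` chunks + `relation` chunks + `community_report` chunks (General only) |
| **Output (retrieval)** | A `Dict` whose `page_content` is "Entities CSV + Relations CSV + Community Reports" and whose `metadata` carries citation information |
| **Downstream** | VDB: indexes and stores the entity, relation, and community report chunks; Retrieval: the chunk returned by `KGSearch.retrieval()` is placed at the front of the standard retrieval results with `insert(0, ...)` |
| **Boundary contract** | GraphRAG is an **independent asynchronous pipeline** and is not synchronized with standard RAG indexing. Light and General share the same storage format, but General additionally produces community_report. GraphRAG does not replace the VDB; it **adds a graph-semantic layer on top of the VDB**. At retrieval time KG results have the highest priority (insert at position 0). |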

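The "insert at position 0" rule amounts to a one-line list operation. In this sketch plain dicts stand in for `DocumentChunk`, and `merge_kg_result` is a hypothetical helper; skipping an empty KG result is an assumption about how the caller behaves.

```python
from typing import Dict, List, Optional


def merge_kg_result(chunks: List[Dict], kg_chunk: Optional[Dict]) -> List[Dict]:
    """Return a new list with the KG chunk (if any) ahead of the standard results."""
    merged = list(chunks)  # copy so the caller's list is left untouched
    if kg_chunk and kg_chunk.get("page_content"):
        merged.insert(0, kg_chunk)
    return merged


standard = [{"page_content": "chunk A"}, {"page_content": "chunk B"}]
kg = {"page_content": "Entities CSV + Relations CSV", "metadata": {}}
merged = merge_kg_result(standard, kg)
```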
---

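## 7. Retrieval (retrieval layer)

| Dimension | Definition |
|------|------|
| **Upstream** | VDB: obtains candidates via `search_by_vector` / `search_by_full_text`; GraphRAG: obtains graph-semantic results via `KGSearch.retrieval()`; Workflow Node: `KnowledgeRetrievalNode.execute()` initiates the call |
| **Input** | `query: str, config: Dict(knowledge_bases[], merge_strategy, reranker_id, reranker_top_k, use_graph)` |
| **Output** | `List[DocumentChunk]`: document chunks in descending order of relevance |
| **Output data structure** | `DocumentChunk(page_content: str, metadata: dict)`, where metadata contains `score`, `doc_id`, `document_id`, `knowledge_id`, `highlight` |
| **Downstream** | Reranking: receives the candidate list and optionally reranks it; Prompt: receives the chunks to assemble the context |
| **Boundary contract** | The Retrieval layer supports four modes: PARTICIPLE (full-text), SEMANTIC (vector), HYBRID (mixed), and GRAPH (graph-augmented). With multiple KBs, each KB is searched in turn and the results are merged. HYBRID defaults to weights of 0.05 for BM25 and 0.95 for vector. When retrieval fails (empty result), it automatically falls back and retries with min_match 0.1 + similarity 0.17. |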

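The HYBRID weighting and the empty-result fallback can be sketched as follows. The weights and the retry thresholds come from the table; the assumption that both scores are already normalized to a comparable range, the strict first-pass thresholds used in the demo, and the `fake_search` backend are all illustrative.

```python
from typing import Callable, Dict, List

SearchFn = Callable[..., List[Dict]]


def hybrid_score(bm25: float, vector: float,
                 bm25_weight: float = 0.05, vector_weight: float = 0.95) -> float:
    """Weighted sum of a full-text score and a vector score (both assumed normalized)."""
    return bm25_weight * bm25 + vector_weight * vector


def search_with_fallback(search: SearchFn, query: str,
                         min_match: float, similarity: float) -> List[Dict]:
    """Run one search; on an empty result retry once with the relaxed thresholds."""
    results = search(query, min_match=min_match, similarity=similarity)
    if not results:
        results = search(query, min_match=0.1, similarity=0.17)
    return results


def fake_search(query: str, min_match: float, similarity: float) -> List[Dict]:
    """Hypothetical backend: keeps chunks whose score reaches the similarity threshold."""
    corpus = [{"page_content": "vector databases", "metadata": {"score": 0.18}}]
    return [c for c in corpus if c["metadata"]["score"] >= similarity]


# The first pass (similarity 0.2, illustrative) finds nothing; the retry at 0.17 does.
results = search_with_fallback(fake_search, "vector db", min_match=0.3, similarity=0.2)
```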
---

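## 8. Reranking (reranking layer)

| Dimension | Definition |
|------|------|
| **Upstream** | Retrieval: receives the candidate `List[DocumentChunk]` |
| **Input** | `query: str, docs: List[DocumentChunk], top_k: int`; or `reranker_id: UUID` |
| **Output** | `List[DocumentChunk]`: the reranked document chunks (length ≤ top_k) |
| **Output data structure** | Same as the Retrieval output, with `score` in metadata updated to the post-rerank score |
| **Downstream** | Prompt: receives the reranked chunks to assemble the context |
| **Boundary contract** | Reranking is an **optional layer**. Without a configured reranker_id, HYBRID results are sorted by metadata.score in descending order and truncated. With a reranker_id, an external Rerank API is called (Jina / DashScope / Xinference). If reranking fails, the layer **falls back** to the original results (the pipeline is not blocked). |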

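The optional-layer behaviour can be sketched as a wrapper that never raises. Plain dicts stand in for `DocumentChunk`, and `rerank_or_fallback` and `failing_rerank` are hypothetical names. Treating a failed rerank the same way as the unconfigured path (sort by score, then truncate) is an assumption; the source only says the original results are kept.

```python
from typing import Callable, Dict, List, Optional

RerankFn = Callable[[str, List[Dict]], List[Dict]]


def rerank_or_fallback(query: str, docs: List[Dict], top_k: int,
                       rerank: Optional[RerankFn] = None) -> List[Dict]:
    """Use the reranker when one is configured; otherwise, or on failure, sort by score."""
    if rerank is not None:
        try:
            return rerank(query, docs)[:top_k]
        except Exception:
            pass  # degrade instead of blocking the pipeline
    by_score = sorted(docs, key=lambda d: d["metadata"]["score"], reverse=True)
    return by_score[:top_k]


def failing_rerank(query: str, docs: List[Dict]) -> List[Dict]:
    """Simulates an unavailable external Rerank API."""
    raise RuntimeError("rerank service unavailable")


docs = [
    {"page_content": "a", "metadata": {"score": 0.2}},
    {"page_content": "b", "metadata": {"score": 0.9}},
    {"page_content": "c", "metadata": {"score": 0.5}},
]
top = rerank_or_fallback("query", docs, top_k=2, rerank=failing_rerank)
```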
---

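## 9. Prompt (prompt assembly layer)

| Dimension | Definition |
|------|------|
| **Upstream** | Reranking: receives the ranked chunks; Workflow: receives the user query |
| **Input** | `chunks: List[DocumentChunk], query: str, system_prompt: str` (optional) |
| **Output** | `system: str, history: List[Dict]`: a message format the LLM can be called with |
| **Output data structure** | `system: str` (retrieved context + system instructions), `history: [{"role": "user", "content": query}]` |
| **Downstream** | LLM: `Base.chat(system, history, gen_conf)` |
| **Boundary contract** | The Prompt layer does **not** call the LLM; it is responsible only for **text assembly**. Assembly logic includes: citation_prompt (citation format), keyword_extraction (used for cache keys), content_tagging (content classification). Prompt templates are stored as `.md` files under `prompts/` and loaded dynamically via `template.py`. |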

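A minimal sketch of the assembly step, producing the `system` and `history` shapes from the table. The numbered context layout and the `build_messages` helper are assumptions; the real templates live in `prompts/` and are not reproduced here. Plain dicts stand in for `DocumentChunk`.

```python
from typing import Dict, List, Tuple


def build_messages(chunks: List[Dict], query: str,
                   system_prompt: str = "") -> Tuple[str, List[Dict]]:
    """Assemble (system, history) without calling any model."""
    # Number the chunks so the answer can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {chunk['page_content']}" for i, chunk in enumerate(chunks, start=1)
    )
    system = f"{system_prompt}\n\nContext:\n{context}".strip()
    history = [{"role": "user", "content": query}]
    return system, history


system, history = build_messages(
    [{"page_content": "ES is the only backend."}, {"page_content": "Reranking is optional."}],
    query="Which vector store is used?",
    system_prompt="Answer using only the context.",
)
```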
---

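## 10. LLM (generation layer)

| Dimension | Definition |
|------|------|
| **Upstream** | Prompt: receives `system` + `history` |
| **Input** | `system: str, history: List[Dict], gen_conf: dict(temperature, top_p, max_tokens)` |
| **Output** | `(answer: str, tokens: int)`, or a streaming `Generator[str \| int]` |
| **Output data structure** | A string (the generated answer text); in streaming mode it is returned token by token |
| **Downstream** | Post-Process: `insert_citations()` inserts citation markers |
| **Boundary contract** | The LLM layer **keeps no conversational memory** (stateless); every call carries the full history. It supports 10+ providers, matched through the `_FACTORY_NAME` factory pattern. Streaming output is implemented by `chat_streamly()`, which returns a Generator. Error handling: API exceptions are raised and caught by the layer above (Workflow / Celery). |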

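One way a caller might consume the streaming output is shown below. It assumes the generator yields incremental text pieces as `str` and reports the token count as a final `int`; the source only gives the `Generator[str | int]` type, so this protocol, `fake_stream`, and `collect` are all hypothetical.

```python
from typing import Iterable, Iterator, Tuple, Union


def fake_stream() -> Iterator[Union[str, int]]:
    """Hypothetical stand-in for a streaming chat call."""
    yield "Hello"
    yield ", world"
    yield 7  # assumed: total token count arrives last


def collect(stream: Iterable[Union[str, int]]) -> Tuple[str, int]:
    """Fold a mixed stream back into the non-streaming (answer, tokens) shape."""
    parts = []
    tokens = 0
    for item in stream:
        if isinstance(item, int):
            tokens = item
        else:
            parts.append(item)
    return "".join(parts), tokens


answer, tokens = collect(fake_stream())
```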
---

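## 11. Post-Process (post-processing layer)

| Dimension | Definition |
|------|------|
| **Upstream** | LLM: receives the generated `answer`; Retrieval: receives the original `chunks` + `chunk_v` (vectors) |
| **Input** | `answer: str, chunks: List[DocumentChunk], chunk_v: List[np.array], embd_mdl, tkweight, vtweight` |
| **Output** | `(answer_with_citations: str, cited_ids: Set[str])` |
| **Output data structure** | A string (containing citation markers such as `[1]`, `[2]`) and a `Set[str]` (the set of cited chunk ids) |
| **Downstream** | User: final display; Cache: written to the Redis cache |
| **Boundary contract** | Post-Process performs only **citation insertion** (`insert_citations()`) and does not modify content. The citation-placement algorithm scores by `pagerank * similarity`. Citations are **not** inserted inside fenced code blocks (delimited by triple backticks). The cache key is built from `(model_name, prompt_text)`, and the TTL is determined by the Redis configuration. |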

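The cache key built from `(model_name, prompt_text)` could look like the sketch below. The SHA-256 hash, the NUL separator, and the `llm_cache:` prefix are assumptions chosen for illustration; the source does not say how the two parts are combined.

```python
import hashlib


def cache_key(model_name: str, prompt_text: str) -> str:
    """Derive a fixed-length, deterministic key from the model name and the prompt."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(prompt_text.encode("utf-8"))
    return "llm_cache:" + digest.hexdigest()


key = cache_key("example-model", "What is the VDB backend?")
```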
---

## Cross-layer data flow summary

| Stage | Input data type | Output data type | Key data structures / files |
|------|-------------|-------------|---------------------|
| Loader | URL / Token / File | `CrawledDocument` / `str` (path) | `crawler/models.py`, `integrations/*/models.py` |
| Parser | `str` (path) / `bytes` | `List[Tuple[str, str]]` + tables | `deepdoc/parser/*.py` |
| Chunking | sections + tables | `List[Dict]` | `nlp/__init__.py`, `app/naive.py` |
| Embedding | `List[str]` | `(np.array, int)` | `llm/embedding_model.py` |
| VDB | `DocumentChunk` | ack / `List[DocumentChunk]` | `vdb/field.py`, `models/chunk.py` |
| GraphRAG | chunk texts | `nx.Graph` + chunks | `graphrag/search.py`, `graphrag/general/index.py` |
| Retrieval | `query + config` | `List[DocumentChunk]` | `nlp/search.py` |
| Reranking | `query + docs` | `List[DocumentChunk]` | `models/rerank.py` |
| Prompt | `chunks + query` | `system + history` | `prompts/generator.py` |
| LLM | `system + history` | `str + int` | `llm/chat_model.py` |
| Post-Process | `answer + chunks` | `str + Set[str]` | `nlp/search.py:489` |

---

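## Gap and overlap risks

| Risk area | Description | Suggested owner |
|----------|------|----------|
| **Parser ↔ Chunking boundary** | The `sections` format output by the Parser (with tags and position information) is consumed directly by Chunking's `naive_merge`. If the Parser changes the tag format, Chunking is affected. | **Define the `sections` data contract once, in the Parser document**; the Chunking document only references that contract. |
| **Embedding ↔ VDB boundary** | The Embedding output dimension must match the dims of the `dense_vector` field in the VDB mapping. The dynamic dimension is fixed by the first encode call. | **The Embedding document states how the dimension is obtained**; the VDB document only references it. |
| **GraphRAG ↔ VDB boundary** | GraphRAG entities, relations, and community reports are stored in the VDB in `DocumentChunk` format and share one ES index with standard chunks. | **The VDB document defines the common storage format**; the GraphRAG document only states that it uses that format. |
| **Retrieval ↔ Reranking boundary** | Retrieval's HYBRID mode already deduplicates at the Node layer, but the `knowledge_retrieval()` function also has its own independent rerank call. | The **Reranking document** explains the difference between the two call paths (Node layer vs. function layer). |
| **Prompt ↔ LLM boundary** | The `history` format assembled by Prompt must be compatible with each provider's API format. | The **Prompt document** declares the output format specification; the LLM document describes each provider's adaptation. |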