Multica PM Agent 343a5eebe3

docs(rag): add MemoryBear RAG implementation docs v1.0

Submit the formed RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing + total index (responsibility matrix,
  review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks hit). The legacy docs/ ignore in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>

2026-05-09 10:51:48 +08:00

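MemoryBear RAG · Key Terms Glossary

Merges the terms from the Sprint-1 / Sprint-2 / Sprint-3 documents, grouped alphabetically by initial letter. Each term lists its meaning, the corresponding location in the MemoryBear code, and the documents in which it appears.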

A

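Term	Meaning	Code location	Appears in
ASR	Automatic Speech Recognition (speech-to-text). In MemoryBear it is invoked through `seq2txt_model.transcription` (QWenSeq2txt returns timestamps; GPTSeq2txt uses Whisper)	`rag/llm/sequence2txt_model.py:1-215`	S2-T1, S2-T5
Autopilot	An automation agent inside a workspace that fires on a schedule or on an event; corresponds to the `multica autopilot` command family	—	Platform mechanism (project SOP)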

B

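Term	Meaning	Code location	Appears in
BaseVector	Abstract base class for the VDB layer (defines abstract methods only; `ElasticSearchVector` is currently the sole implementation)	`rag/vdb/vector_base.py:9`	S1-T3, S2-T3, S3-T1
BM25	Best Match 25, the classic ranking function for full-text retrieval; MemoryBear implements it through ES `query_string` plus the IK analyzer	`rag/nlp/query.py`, `rag/vdb/elasticsearch/elasticsearch_vector.py:468 search_by_full_text`	S2-T3, S3-T2
Boundaries	The document of input/output/interface contracts for the 11 RAG stages (one of the S1-T2 deliverables)	—	S1-T2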
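
For readers unfamiliar with the BM25 entry above, the following is a minimal, self-contained sketch of the textbook Okapi BM25 formula. It is not MemoryBear code: in MemoryBear the scoring happens inside Elasticsearch, and the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2`, `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Hypothetical helper for illustration; `docs` is a list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["vector", "search", "with", "hnsw"],
    ["full", "text", "search", "with", "bm25"],
    ["graph", "community", "report"],
]
scores = bm25_scores(["bm25", "search"], docs)
```

The second document matches both query terms and should score highest; the third matches none and scores zero.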

C

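Term	Meaning	Code location	Appears in
Celery	Task queue; MemoryBear uses it to dispatch asynchronous pipelines such as document parsing and GraphRAG construction	`tasks.py:212 parse_document`, `tasks.py:472 build_graphrag_for_kb`, `tasks.py:557 build_graphrag_for_document`	S1-T3, S2-T1, S2-T3, S3-T2
chat_limiter	A Trio CapacityLimiter that bounds the concurrency of entity/relation Embedding in GraphRAG; defaults to 10	`rag/graphrag/utils.py:41`	S2-T2, S3-T1
Chunk	The text fragment finally handed to Embedding, generally ≤ `chunk_token_num` tokens (default 128–512)	`rag/models/chunk.py:17 DocumentChunk`	S2-T1, S2-T2, S2-T3
chunk_token_num	Maximum number of tokens in a single chunk	`rag/app/naive.py`, specified by the calling layer	S2-T1
citation	The `[ID:N]` reference marker inserted into the answer text	`rag/nlp/search.py:489-577 Dealer.insert_citations`	S2-T5
CLIP / BGE-VL / Jina-Clip	Cross-modal Embedding models that map images and text into a shared semantic space	Not currently enabled; planned in S3-T2 D1	S3-T2
cl100k_base	The BPE tokenizer used by the OpenAI GPT-4 family; MemoryBear uses it for token counting	`rag/common/token_utils.py`	S2-T1, S2-T2
Cross-Encoder	A Reranker paradigm: the (query, doc) pair is concatenated and passed through a single Encoder, which outputs a relevance score	No in-house model is trained at present; only invoked through external rerank services (DashScope/Jina); planned in S3-T2 D5	S2-T5, S3-T2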
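
The chat_limiter entry above describes a capacity limiter that caps concurrent Embedding calls at 10. The sketch below illustrates the same idea using only the standard library: `asyncio.Semaphore` stands in for Trio's `CapacityLimiter`, and `embed_all` and `fake_embed` are hypothetical names that do not appear in MemoryBear.

```python
import asyncio


async def embed_all(texts, limit=10):
    """Run one fake embedding call per text, at most `limit` at a time."""
    # Stand-in for trio.CapacityLimiter(10); an assumption for illustration.
    limiter = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_embed(text):
        nonlocal in_flight, peak
        async with limiter:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # simulate a network round trip
            in_flight -= 1
            return [float(len(text))]  # placeholder "vector"

    vectors = await asyncio.gather(*(fake_embed(t) for t in texts))
    return vectors, peak


vectors, peak = asyncio.run(embed_all([f"entity-{i}" for i in range(25)]))
```

Although 25 tasks are scheduled at once, `peak` never exceeds the limit, which is the property the limiter exists to guarantee.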

D

Term	Meaning	Code location	Appears in
Dealer	The `Dealer` class at `rag/nlp/search.py:349`, the BM25/hybrid search dispatcher; GraphRAG mainly uses this channel	`rag/nlp/search.py:349`	S1-T3, S2-T3, S2-T5, S3-T1
deepdoc	MemoryBear's multi-format parsing module, comprising parser (11 formats) and vision (OCR / layout recognition / TSR)	`rag/deepdoc/{parser,vision}`	S1-T3, S2-T1
DocumentChunk	Chunk data model	`rag/models/chunk.py:17`	S2-T1, S2-T2, S2-T3
dense_vector	ES vector field type; MemoryBear uses an HNSW index with cosine similarity	`elasticsearch_vector.py:653-658`, `rag/res/mapping.json`	S2-T2, S2-T3

E

Term	Meaning	Code location	Appears in
E2E (End-to-End)	The end-to-end call chain, covering the full sequence of document ingestion, online retrieval, and generation	`rag/app/`, `workflow/nodes/knowledge/`, `rag/llm/`	S2-T6 (pending delivery)
Embedder	Abstract interface for Embedding models (the unified Protocol proposed in S3-T1)	Proposed: `app/core/rag/protocols/embedder.py`	S3-T1, S3-T2
Embedding dual track	MemoryBear currently has two coexisting Embedding call paths: `RedBearEmbeddings` (LangChain, new) and `OpenAIEmbed/QWenEmbed/...` (legacy)	`rag/models/embedding.py` + `rag/llm/embedding_model.py`	S2-T2, S3-T1
embed_cache	Redis cache of entity/relation Embeddings in GraphRAG, TTL 24h	`rag/graphrag/utils.py:115-134`	S2-T2, S3-T1
EMBEDDING_BATCH_SIZE	Environment variable for the Embedding batch size (mentioned in the README but currently has no effect)	—	S2-T2, S3-T1
Entity Resolution	Entity disambiguation; one step of the GraphRAG indexing flow	`rag/graphrag/entity_resolution.py:31`	S1-T3
ESConnection	ES connection singleton	`rag/utils/es_conn.py`	S1-T3, S2-T3
ElasticSearchVector	Primary VDB implementation; holds chunks, GraphRAG entities/relations, and community_report together	`rag/vdb/elasticsearch/elasticsearch_vector.py:29`	S1-T3, S2-T3, S3-T1
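
The embed_cache entry above combines a hashed key with a 24-hour TTL. The following sketch shows that pattern in isolation, under several assumptions: an in-memory dict stands in for Redis, `hashlib.blake2b` stands in for xxhash, and the class name `EmbedCache` and the choice to hash the model name together with the text are hypothetical, not taken from `rag/graphrag/utils.py`.

```python
import hashlib
import time

TTL_SECONDS = 24 * 60 * 60  # 24h, the TTL stated in the glossary entry


class EmbedCache:
    """Toy TTL cache; a dict replaces Redis for illustration only."""

    def __init__(self, ttl=TTL_SECONDS, clock=time.time):
        self._store = {}
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested

    @staticmethod
    def make_key(model_name, text):
        # blake2b is a stdlib stand-in for xxhash (assumption).
        payload = f"{model_name}\x00{text}".encode("utf-8")
        return hashlib.blake2b(payload, digest_size=16).hexdigest()

    def get(self, model_name, text):
        key = self.make_key(model_name, text)
        entry = self._store.get(key)
        if entry is None:
            return None
        vector, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return vector

    def set(self, model_name, text, vector):
        key = self.make_key(model_name, text)
        self._store[key] = (vector, self._clock() + self._ttl)


now = [0.0]
cache = EmbedCache(clock=lambda: now[0])
cache.set("demo-model", "MemoryBear", [0.1, 0.2])
hit = cache.get("demo-model", "MemoryBear")
now[0] = TTL_SECONDS  # jump forward exactly 24 hours
miss = cache.get("demo-model", "MemoryBear")
```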

F

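Term	Meaning	Code location	Appears in
FOLDER-type knowledge base	A folder-style KB that contains child knowledge bases; traversed recursively at retrieval time	`workflow/nodes/knowledge/node.py`	S1-T3
FusionExpr	The "weighted fusion" DSL in ES retrieval; currently fixed at `0.05/0.95` (BM25:Vector)	`rag/nlp/search.py:439`	S2-T3, S3-T2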
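
To make the `0.05/0.95` weighting in the FusionExpr entry concrete, here is a minimal sketch of a weighted-sum fusion. It assumes both score sets are already on a comparable 0 to 1 scale and that a document missing from one channel contributes 0 there; neither assumption is confirmed by the source, and `weighted_fusion` is a hypothetical name.

```python
def weighted_fusion(bm25_scores, vector_scores, w_text=0.05, w_vec=0.95):
    """Combine two {doc_id: score} maps with fixed weights.

    Illustrative only; the real fusion runs inside the ES query.
    """
    doc_ids = set(bm25_scores) | set(vector_scores)
    fused = {
        doc_id: w_text * bm25_scores.get(doc_id, 0.0)
        + w_vec * vector_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest fused score first.
    ranking = sorted(fused, key=lambda d: fused[d], reverse=True)
    return ranking, fused


ranking, fused = weighted_fusion(
    {"a": 0.9, "b": 0.2},
    {"b": 0.8, "c": 0.7},
)
```

With these weights, document `a` ranks last despite its strong BM25 score, which shows how heavily the fixed ratio favors the vector channel.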

G

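Term	Meaning	Code location	Appears in
GraphRAG (general)	Microsoft GraphRAG style: the full pipeline (subgraph → merge → PageRank → Leiden communities → community reports)	`rag/graphrag/general/index.py:36 run_graphrag`	S1-T2, S1-T3
GraphRAG (light)	LightRAG style: simplified entity/relation extraction with no community reports; shares most of its code with general	`rag/graphrag/light/graph_extractor.py:31`	S1-T2, S1-T3
GraphStore	Graph storage abstraction (proposed in S3-T2)	Proposed	S3-T2
GraphAugmentedRetriever	A Retriever implementation that layers KGSearch on top of Hybrid results	Proposed	S3-T1, S3-T2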

H

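Term	Meaning	Code location	Appears in
HNSW	Hierarchical Navigable Small World, a vector index algorithm; built into ES 8.x	ES cluster side	S2-T3
HYBRID retrieval	BM25 and vector search in parallel → deduplication → optional Rerank	`workflow/nodes/knowledge/node.py:236-271`	S2-T3, S2-T5
HybridRetriever	Protocol implementation of Hybrid retrieval (S3-T1 PoC)	Proposed	S3-T1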
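
The HYBRID retrieval flow above (two searches in parallel, deduplication, optional rerank) can be sketched as follows. This is an illustration under assumptions, not the code in `node.py`: `hybrid_retrieve`, the stub search functions, the hit shape `{"id", "score"}`, and the rule of keeping the higher-scoring duplicate are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def hybrid_retrieve(query, keyword_search, vector_search, rerank=None, top_k=3):
    """Run both searches concurrently, dedupe by chunk id, optionally rerank."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw_future = pool.submit(keyword_search, query)
        vec_future = pool.submit(vector_search, query)
        candidates = kw_future.result() + vec_future.result()

    # Deduplicate: keep the best-scoring hit per chunk id (assumed rule).
    best = {}
    for hit in candidates:
        kept = best.get(hit["id"])
        if kept is None or hit["score"] > kept["score"]:
            best[hit["id"]] = hit

    merged = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    if rerank is not None:
        merged = rerank(query, merged)
    return merged[:top_k]


def stub_keyword_search(query):
    return [{"id": "c1", "score": 0.4}, {"id": "c2", "score": 0.3}]


def stub_vector_search(query):
    return [{"id": "c2", "score": 0.9}, {"id": "c3", "score": 0.5}]


hits = hybrid_retrieve("what is hnsw", stub_keyword_search, stub_vector_search)
hit_ids = [h["id"] for h in hits]
```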

I

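Term	Meaning	Code location	Appears in
IK analyzer	Chinese tokenizer, the ES IK plugin (`ik_max_word`)	ES cluster side	S2-T3
init_settings()	Module-level side effect that automatically creates the ES connection and the retriever singleton at startup	`rag/common/settings.py:24`	S1-T3, S3-T1
insert_citations	Splits the answer into sentences, then backfills `[ID:N]` citations by embedding similarity	`rag/nlp/search.py:489-577`	S2-T5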
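
The insert_citations entry describes sentence splitting followed by similarity-based backfilling of `[ID:N]` markers. The sketch below mimics that idea with a deliberately simple stand-in: bag-of-words cosine similarity replaces real embeddings, and the function name `insert_citations_sketch`, the `0.5` threshold, and the placement of the marker after the sentence are assumptions rather than the behavior of `Dealer.insert_citations`.

```python
import math
import re
from collections import Counter


def _bow(text):
    # Bag-of-words "embedding"; a toy replacement for a real model.
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def insert_citations_sketch(answer, chunks, threshold=0.5):
    """Append [ID:N] to each sentence whose best-matching chunk is similar enough."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    chunk_vecs = [_bow(c) for c in chunks]
    out = []
    for sentence in sentences:
        vec = _bow(sentence)
        sims = [_cosine(vec, cv) for cv in chunk_vecs]
        if sims:
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= threshold:
                sentence = f"{sentence} [ID:{best}]"
        out.append(sentence)
    return " ".join(out)


chunks = [
    "Elasticsearch stores dense vectors in an HNSW index.",
    "Celery dispatches document parsing tasks.",
]
answer = "Elasticsearch stores dense vectors in an HNSW index. The weather is nice today."
cited = insert_citations_sketch(answer, chunks)
```

Only the first sentence overlaps with a chunk, so only it receives a marker.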

K

Term	Meaning	Code location	Appears in
KGSearch	GraphRAG retriever	`rag/graphrag/search.py:19`	S1-T3, S3-T2
knowledge_graph_kwd	The ES field that distinguishes graph record types (entity / relation / community_report)	`rag/vdb/elasticsearch/elasticsearch_vector.py`	S1-T3
KnowledgeRetrievalNode	Knowledge retrieval node in the Workflow engine	`workflow/nodes/knowledge/node.py:29`	S1-T3, S2-T5, S3-T1

L

Term	Meaning	Code location	Appears in
LangChainAgent	A ReAct Agent based on `create_agent`, running a tool-calling loop	`agent/langchain_agent.py:26-641`	S2-T5
Late-Interaction	A retrieval paradigm (e.g. ColBERT) that replaces document-level vectors with token-level vectors and scores retrieval with MaxSim	Not currently enabled; planned in S3-T2 D2	S3-T2
Leiden algorithm	Community detection algorithm; GraphRAG uses it to partition communities	`rag/graphrag/general/index.py` calls `graspologic.partition.leiden`	S1-T2, S1-T3
LightRAG	Lightweight GraphRAG variant with no community reports	`rag/graphrag/light/`	S1-T2, S1-T3
LLM	Large Language Model; MemoryBear calls it through `chat_model.py` and `langchain_agent.py`	`rag/llm/chat_model.py:52 Base`	S2-T5
LO (LibreOffice)	Fallback tool for converting PPT/PPTX to PDF	`rag/utils/libre_office.py`	S2-T1

M

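Term	Meaning	Code location	Appears in
MatchSparseExpr / Field.SPARSE_VECTOR	A sparse-vector expression that is declared but not enabled (groundwork for SPLADE integration)	`rag/utils/doc_store_conn.py:75`, `vdb/field.py:11`	S3-T2
Memory (memory system)	MemoryBear's conversational memory system: Ebbinghaus decay + ACT-R + Neo4j + langgraph read/write graphs	`core/memory/` (currently fully independent of `core/rag/`)	S3-T2 D4
MemoryAugmentedRetriever	D4 proposal: a Retriever wrapper that rewrites the query with long-term memory before retrieval	Proposed	S3-T2 D4
mind_map_extractor	Standalone mind-map extractor, not on the GraphRAG main path	`rag/graphrag/mind_map_extractor.py`	S1-T2
MinerU	Third-party PDF parsing service (external API)	`rag/deepdoc/parser/mineru_parser.py:41`, `rag/app/textin_parser.py`	S1-T3, S2-T1
Multimodal Embedding	Multimodal Embedding; in MemoryBear only Volcano Engine supports native multimodal input	The `_is_volcano` branch in `rag/models/embedding.py:65-78`	S2-T2, S3-T2 D1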

N

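Term	Meaning	Code location	Appears in
naive_merge / hierarchical_merge / tree_merge	Three Chunking merge strategies	`rag/nlp/__init__.py`	S2-T1
Neo4j	Graph database; the README declares it as a dependency, but `core/rag` currently makes zero calls to it (planned in S3-T2 D3)	—	S3-T2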
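
As a rough intuition for merge-style Chunking such as naive_merge, the sketch below greedily packs parser sections into chunks that stay within a token budget. It is a guess at the general technique, not the logic in `rag/nlp/__init__.py`: the name `greedy_merge` is hypothetical, and whitespace splitting stands in for the cl100k_base token count.

```python
def greedy_merge(sections, chunk_token_num=128, count_tokens=None):
    """Pack consecutive sections into chunks of at most `chunk_token_num` tokens.

    A single section larger than the budget is kept whole (assumed behavior).
    """
    if count_tokens is None:
        # Whitespace split is a stand-in for a real BPE tokenizer.
        count_tokens = lambda text: len(text.split())

    chunks = []
    current, current_tokens = [], 0
    for text in sections:
        n = count_tokens(text)
        # Flush the running chunk when the next section would overflow it.
        if current and current_tokens + n > chunk_token_num:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


merged = greedy_merge(["a b c", "d e", "f g h i", "j"], chunk_token_num=5)
```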

O

Term	Meaning	Code location	Appears in
OCR	Two stages: text detection, then text recognition	`rag/deepdoc/vision/ocr.py:522 OCR.__call__:694`	S2-T1
OpenAIEmbed / QWenEmbed / ...	Legacy raw Embedding implementations, used by GraphRAG and Dealer	`rag/llm/embedding_model.py:14-65`	S2-T2, S3-T1
OpenTelemetry (OTel)	SDK for end-to-end tracing and metrics; not yet adopted by MemoryBear (planned in S3-T1 #6)	Proposed	S3-T1

P

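Term	Meaning	Code location	Appears in
PageRank	Graph node importance algorithm; GraphRAG uses it to score entities	`rag/graphrag/general/index.py`	S1-T2, S1-T3
PARTICIPLE retrieval	Keyword tokenization retrieval (BM25)	`workflow/nodes/knowledge/node.py:195`	S2-T3
Plugin Registry	The Parser/LLM Provider registration mechanism proposed in S3-T1 #5, replacing the 11-way if/elif in `naive.py`	Proposed	S3-T1
Pydantic Settings	The centralized configuration management framework proposed in S3-T1 #7	Proposed	S3-T1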
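
For the PageRank entry above, a compact power-iteration sketch may help show how node importance emerges from link structure. It is a generic textbook version on a directed toy graph, not the call used in `rag/graphrag/general/index.py`; the function name `pagerank`, the damping factor `0.85`, and the iteration count are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank by power iteration over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node gets the teleport share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # Dangling nodes distribute their rank evenly to all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


ranks = pagerank([("A", "C"), ("B", "C"), ("C", "A")])
top_entity = max(ranks, key=ranks.get)
```

Node `C` receives links from both other nodes and ends up with the highest rank, while `B`, which nothing links to, keeps only the teleport share.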

R

Term	Meaning	Code location	Appears in
rag_utils (note: not the same as `rag/utils`)	LLM analysis module for chunk content (summaries / tags / insights / persona profiles); coupled to the Memory system	`api/app/core/rag_utils/`	S1-T3
RAGAS	Open-source RAG evaluation framework; not currently integrated in MemoryBear	Proposed	S3-T2 D5
rank_feature	Auxiliary ranking score in ES combining tag TF-IDF and PageRank	`rag/nlp/search.py:579-604`	S2-T5
RedBearEmbeddings	Embedding class with a unified LangChain wrapper (new path)	`rag/models/embedding.py:9-23`	S2-T2
RedBearRerank	Reranker wrapped as a LangChain `BaseDocumentCompressor`	`rag/models/rerank.py:11-84`	S2-T5, S3-T2
Rerank three tracks	(a) `node.py:284 rerank()`, module-level; (b) `KnowledgeRetrievalNode.rerank()`, node method; (c) `Dealer.rerank()`, fusion ranking	`node.py:108-155, 284`, `nlp/search.py:606-643`	S2-T5, S3-T1
Reranker	Reranking Protocol (proposed in S3-T1)	Proposed	S3-T1, S3-T2
retrieve_type	Retrieval mode enum: PARTICIPLE / SEMANTIC / HYBRID / Graph	`schemas/chunk_schema.py`	S2-T3, S3-T2
Retriever	Retriever Protocol (proposed in S3-T1)	Proposed	S3-T1, S3-T2
RouterRetriever	Adaptive routing Retriever (proposed in S3-T2 D6)	Proposed	S3-T2
RRF (Reciprocal Rank Fusion)	Rank-fusion algorithm for multi-channel retrieval results; S3-T2 PoC-A proposes adopting it	Proposed	S3-T2
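
Since RRF is listed above only as a proposal, the following minimal sketch shows what the algorithm does: each document earns `1 / (k + rank)` from every ranking it appears in, and the sums decide the fused order. The function name `reciprocal_rank_fusion` and the constant `k=60` (a commonly used value) are assumptions, not details from the S3-T2 PoC.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list via RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Earlier positions contribute more; k dampens the top ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


bm25_ranking = ["c1", "c2", "c3"]
vector_ranking = ["c2", "c4", "c1"]
fused_ranking = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Unlike a weighted sum, RRF only needs rank positions, so BM25 scores and cosine similarities do not have to be normalized to a common scale first.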

S

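Term	Meaning	Code location	Appears in
SEMANTIC retrieval	Pure vector retrieval	`workflow/nodes/knowledge/node.py:195`	S2-T3
Section	The `(text, position_or_layout)` intermediate structure emitted by parsers; the "raw material" for Chunking	`rag/app/naive.py:257`	S2-T1
SPLADE	Learned sparse vectors; S3-T2 D2 proposes adopting it	Proposed (scaffolding already present: `MatchSparseExpr`)	S3-T2
structlog	Structured logging library; S3-T1 #10 proposes replacing the existing unstructured `logger.*` calls with it	Proposed	S3-T1
System Prompt assembly	Three-part concatenation: user-defined system_prompt + skill Prompt + document image recognition instructions	`app_chat_service.py:77-96`	S2-T5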

T

Term	Meaning	Code location	Appears in
TextIn	Third-party PDF parsing API	`rag/app/textin_parser.py`	S1-T3
Token	A BPE token produced by encoding with cl100k_base	`rag/common/token_utils.py`	S2-T1, S2-T2
tokenize_chunks_with_images	Chunking for content that carries images	`rag/nlp/__init__.py`	S2-T1
TSR	Table Structure Recognition; reconstructs the rows, columns, and merged cells of complex tables	`rag/deepdoc/vision/table_structure_recognizer.py:15`	S2-T1

V

Term	Meaning	Code location	Appears in
VDB (Vector Database)	Vector database; MemoryBear's only current implementation is Elasticsearch 8.x	`rag/vdb/elasticsearch/`	S2-T3
VectorBase	See BaseVector	`rag/vdb/vector_base.py:9`	—
VLM	Vision-Language Model; image understanding (CV model)	`rag/llm/cv_model.py`	S2-T1

W

Term	Meaning	Code location	Appears in
weighted_sum (0.05, 0.95)	Fixed weights for Hybrid retrieval at the ES layer (BM25:Vector)	`rag/nlp/search.py:439`	S2-T3, S3-T2
Workflow Knowledge Node	See KnowledgeRetrievalNode	`workflow/nodes/knowledge/node.py:29`	S1-T3, S2-T5

X

Term	Meaning	Code location	Appears in
xxhash	Fast hash function; used to generate keys for the GraphRAG embed_cache	`rag/graphrag/utils.py:115-134`	S2-T2

Glossary · v1.0-RC1 · 81 terms in total · 2026-05-08
