docs(rag): add MemoryBear RAG implementation docs v1.0

Add the finalized RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing page + master index (responsibility matrix,
  review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting the gap in WS-12 governance assets

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks passed). The legacy docs/ ignore rule in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.
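
A minimal sketch of the narrowed rules, assuming no other docs/ patterns
exist in .gitignore (the actual file contents are not shown in this message):

```gitignore
# Ignore everything directly under docs/ ...
docs/*
# ... except the RAG documentation tree.
!docs/rag/
```

The pattern is docs/* rather than docs/ because Git cannot re-include a
path whose parent directory is itself excluded.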

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
Multica PM Agent
2026-05-09 10:51:48 +08:00
parent feae2f2e1e
commit 343a5eebe3
33 changed files with 8410 additions and 1 deletions

@@ -0,0 +1,98 @@
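%% MemoryBear RAG Capability Map
%% Horizontal axis: capability domain; vertical axis: maturity (available / near-term / mid-to-long-term vision)
%% Aligned with the Retriever / Reranker / Generator / Embedder abstract interfaces proposed in [S3-T1]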
graph LR
classDef have fill:#10b981,stroke:#065f46,color:#fff,stroke-width:1px
classDef near fill:#f59e0b,stroke:#92400e,color:#fff,stroke-width:1px
classDef vision fill:#6366f1,stroke:#3730a3,color:#fff,stroke-width:1px
classDef domain fill:#e5e7eb,stroke:#374151,color:#111,stroke-width:1px
subgraph DLOAD[Data Ingestion]
L1[Web crawler]:::have
L2[Feishu / Yuque / file upload]:::have
L3[Enterprise IM / email / Notion / S3 incremental sync]:::near
L4[Streaming data / Kafka / CDC]:::vision
end
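subgraph DPARSE[Parsing and Multimodal Ingestion]
P1[deepdoc PDF/OCR/Layout/Table]:::have
P2[Image OCR + VLM describe]:::have
P3[Audio ASR]:::have
P4[Video whole-clip VLM description]:::have
P5[Timestamped audio/video frame sampling + keyframe captions]:::near
P6[Native CLIP/BGE-VL cross-modal embedding]:::vision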
end
subgraph DCHUNK[Chunking and Representation]
C1[naive_merge / typed chunkers]:::have
C2[RagTokenizer Chinese-English tokenization]:::have
C3[Late-Interaction / ColBERT subword representation]:::near
C4[Semantic chunking + adaptive granularity]:::vision
end
subgraph DEMB[Embedding]
E1[10+ Provider factory]:::have
E2[Question augmentation question_proposal]:::have
E3[Sparse vectors / SPLADE learned sparse]:::near
E4[Multi-Vector / unified multilingual encoding]:::vision
end
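subgraph DVDB[Vector Store and Retrieval]
V1[ES dense_vector + BM25]:::have
V2[FusionExpr 0.05/0.95 weighted fusion]:::have
V3[KGSearch N-hop + Community]:::have
V4[HNSW quantization / sparse index rollout]:::near
V5[Semantic routing / adaptive multi-retriever composition]:::near
V6[Federated retrieval / cross-tenant private retrieval]:::vision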
end
subgraph DRANK[Reranking]
R1[Built-in token+vector fusion ranking]:::have
R2[Jina / DashScope / Xinference external Reranker]:::have
R3[Cross-Encoder distillation + online PairWise learning]:::near
R4[Feedback-driven automatic Reranker fine-tuning]:::vision
end
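subgraph DKG[Knowledge Graph]
K1[GraphRAG light + general]:::have
K2[entity_resolution + Leiden communities]:::have
K3[Incremental graph evolution + timestamps]:::near
K4[Path explainability + Neo4j dual engine]:::near
K5[Multi-source graph fusion / automatic ontology evolution]:::vision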
end
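subgraph DMEM[Conversational Memory]
M1[memory.forgetting_engine Ebbinghaus]:::have
M2[memory.reflection_engine periodic reflection]:::have
M3[langgraph graph-reading Agent]:::have
M4[Short-term ↔ long-term ↔ retrieval recall three-stage bridging]:::near
M5[Persona-based memory strategy + user preference learning]:::vision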
end
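subgraph DEVAL[Evaluation and Feedback Loop]
EV1[README F1/BLEU/J paper-grade evaluation]:::have
EV2[RAGAS / TruLens integration + online A/B]:::near
EV3[👍/👎 feedback → Rerank fine-tuning loop]:::near
EV4[Self-evolving routing strategy / RLHF long-term memory]:::vision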
end
subgraph DOPS[Platform and Observability]
O1[Celery task chain + Redis cache]:::have
O2[FastAPI / Swagger]:::have
O3[OpenTelemetry Trace + retrieval metrics dashboard]:::near
O4[Prompt repository + Eval CI / canary release]:::vision
end
%% Cross-domain dependencies (key edges only, to avoid clutter)
DLOAD --> DPARSE
DPARSE --> DCHUNK
DCHUNK --> DEMB
DEMB --> DVDB
DVDB --> DRANK
DRANK -. citations .-> DOPS
DCHUNK -. async .-> DKG
DKG --> DVDB
DEVAL -. metrics .-> DRANK
DEVAL -. metrics .-> DVDB
DMEM -. memory-augmented retrieval .-> DVDB
DMEM -. summary into prompt .-> DRANK