MemoryBear/docs/rag/overview/02-indexing-pipeline.mmd
Multica PM Agent 343a5eebe3
docs(rag): add MemoryBear RAG implementation docs v1.0
Submit the formed RAG documentation set produced across Sprint-1/2/3
(WS-12 through WS-26) under docs/rag/. Includes:

- README.md / INDEX.md: landing + total index (responsibility matrix,
  review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams),
  11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding,
  VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source
  retained as reference
- evolution/: 11 architecture-refactor proposals,
  6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention,
  ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot
checks hit). The legacy docs/ ignore in .gitignore is narrowed to
docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
2026-05-09 10:51:48 +08:00

%% MemoryBear document ingestion sequence diagram (Indexing Pipeline)
%% Start: user upload / API call. End: vectors stored and GraphRAG index built
sequenceDiagram
autonumber
participant User as User/API
participant API as document_controller.py:275<br/>parse_documents()
participant Celery as Celery Worker<br/>tasks.py
participant DB as PostgreSQL<br/>(Document / Knowledge)
participant Chunker as app/naive.py:508<br/>chunk()
participant Parser as deepdoc/parser/<br/>(PDF/DOCX/HTML/...)
participant tokenizer as nlp/__init__.py<br/>tokenize / naive_merge
participant Embed as llm/embedding_model.py<br/>Base.encode()
participant VDB as ESVectorFactory<br/>elasticsearch_vector.py
participant Graph as graphrag/general/index.py<br/>run_graphrag_for_kb()
Note over User,VDB: === Stage 1: File upload and trigger ===
User->>API: POST /documents (file / URL)
API->>DB: INSERT Document (status=pending)
API->>Celery: delay parse_document(file_path, document_id)
Note over Celery,VDB: === Stage 2: Document parsing and chunking ===
Celery->>DB: SELECT Document, Knowledge
Celery->>Celery: _build_vision_model()
Celery->>Chunker: chunk(filename, binary, vision_model)
alt PDF format
Chunker->>Parser: RAGPdfParser.__call__()
Parser->>Parser: __images__() → OCR → _layouts_rec()
Parser->>Parser: _table_transformer_job()
Parser->>Parser: _text_merge() + _concat_downward()
Parser-->>Chunker: sections: List[(text, tag)]<br/>tables: List[(image, html)]
else DOCX format
Chunker->>Parser: RAGDocxParser.parse()
Parser-->>Chunker: sections, tables
else HTML/MD/TXT/Excel
Chunker->>Parser: matching Parser
Parser-->>Chunker: sections
end
alt Route by document type
Chunker->>Chunker: book.py / paper.py / laws.py / ...
Chunker->>tokenizer: hierarchical_merge() / tree_merge()
else Default naive
Chunker->>tokenizer: naive_merge(sections, chunk_token_num)
end
tokenizer->>tokenizer: tokenize(d) → content_ltks / content_sm_ltks
tokenizer->>tokenizer: tokenize_chunks() → attach page_num / position / image
tokenizer-->>Celery: res: List[Dict] (chunk dicts)
Note over Celery,VDB: === Stage 3: Embedding and storage ===
Celery->>DB: progress=0.8
Celery->>VDB: delete_by_metadata_field(document_id)
alt auto_questions enabled
Celery->>Celery: ThreadPool generates questions concurrently
Celery->>Embed: question_proposal(chat_mdl, content)
end
Celery->>Embed: encode(chunk_texts) → np.array
Embed-->>Celery: vectors + token_count
loop Each batch
Celery->>Celery: Assemble DocumentChunk(page_content, vector, metadata)
Celery->>VDB: insert_documents(chunks)
VDB->>VDB: cosineSimilarity index + BM25
VDB-->>Celery: ack
end
Celery->>DB: UPDATE Document (progress=1.0, chunk_num=N)
Note over Celery,Graph: === Stage 4: GraphRAG asynchronous build ===
Celery->>Celery: build_graphrag_for_document.delay()
Celery->>Graph: run_graphrag_for_kb(document_ids)
Graph->>Graph: generate_subgraph() per chunk
Graph->>Graph: LLM extracts entities + relations
Graph->>Graph: merge_subgraph() → nx.pagerank
opt entity_resolution
Graph->>Graph: resolve_entities() (LLM matching)
end
opt community_reports (general only)
Graph->>Graph: leiden.run() hierarchical clustering
Graph->>Graph: CommunityReportsExtractor → LLM reports
end
Graph->>VDB: store graph entities / relations / reports
Graph-->>Celery: done
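
Stage 3 of the diagram (delete the document's old vectors, encode the chunk texts, then insert them batch by batch) can be sketched in a few lines of Python. This is a minimal illustration and not the MemoryBear implementation: InMemoryVectorStore, index_document, batched, and toy_encode are hypothetical names, the (key, value) signature of delete_by_metadata_field is an assumption, and the real code goes through Celery, Base.encode(), and Elasticsearch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


@dataclass
class DocumentChunk:
    """Illustrative stand-in for the DocumentChunk named in the diagram."""
    page_content: str
    vector: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)


class InMemoryVectorStore:
    """Hypothetical store exposing the two calls shown in stage 3."""

    def __init__(self) -> None:
        self.rows: List[DocumentChunk] = []

    def delete_by_metadata_field(self, key: str, value: str) -> None:
        # Drop every stored chunk whose metadata[key] equals value.
        self.rows = [r for r in self.rows if r.metadata.get(key) != value]

    def insert_documents(self, chunks: Sequence[DocumentChunk]) -> None:
        self.rows.extend(chunks)


def batched(items: Sequence[Tuple[str, List[float]]],
            size: int) -> Iterator[Sequence[Tuple[str, List[float]]]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_document(document_id: str,
                   texts: Sequence[str],
                   encode: Callable[[Sequence[str]], List[List[float]]],
                   store: InMemoryVectorStore,
                   batch_size: int = 2) -> int:
    """Re-index one document: delete old chunks, encode, insert in batches."""
    # Remove stale vectors first so a re-parse does not leave duplicates.
    store.delete_by_metadata_field("document_id", document_id)
    # Encode all chunk texts once, as the diagram does before the batch loop.
    vectors = encode(texts)
    inserted = 0
    for batch in batched(list(zip(texts, vectors)), batch_size):
        chunks = [DocumentChunk(text, vec, {"document_id": document_id})
                  for text, vec in batch]
        store.insert_documents(chunks)
        inserted += len(chunks)
    return inserted


def toy_encode(texts: Sequence[str]) -> List[List[float]]:
    """Toy 2-dimensional embedding (assumption): [length, number of spaces]."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]
```

Calling index_document twice with the same document_id leaves only the chunks from the second call in the store, which is the purpose of the delete step that precedes the insert loop.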