Submit the formed RAG documentation set produced across Sprint-1/2/3 (WS-12 through WS-26) under docs/rag/.

Includes:
- README.md / INDEX.md: landing + total index (responsibility matrix, review verdicts, dual-link to source issues)
- overview/: full-pipeline architecture (4 .mmd diagrams), 11-stage boundary contracts, doc map, source-code inventory
- pipeline/: 5 deep-dives (Loader/Parser/Chunking, Embedding, VDB & retrieval, GraphRAG, Rerank/Prompt/LLM)
- graphrag/, end-to-end/: v1.0 formal versions with full source retained as reference
- evolution/: 11 architecture-refactor proposals, 6-direction roadmap, capability map
- review/: S3-T1 / S3-T2 final reviews, S2-T7 final summary
- _indexes/: glossary (81 terms), source->doc reverse index, chart index
- _release/: v1.0-RC1 release manifest, versioning convention, ops & freshness plan
- _meta/README.md: placeholder noting WS-12 governance assets gap

Aggregate review score 92.6/100 (8/8 PASS, 31/31 source-code spot checks hit).

The legacy docs/ ignore in .gitignore is narrowed to docs/* with an explicit allowlist for docs/rag/.

Refs: WS-26
Co-authored-by: multica-agent <github@multica.ai>
%% MemoryBear document ingestion sequence diagram (Indexing Pipeline)
%% Start: user upload / API call; end: vectors stored + GraphRAG index built

sequenceDiagram
    autonumber
    participant User as User / API
    participant API as document_controller.py:275<br/>parse_documents()
    participant Celery as Celery Worker<br/>tasks.py
    participant DB as PostgreSQL<br/>(Document / Knowledge)
    participant Chunker as app/naive.py:508<br/>chunk()
    participant Parser as deepdoc/parser/<br/>(PDF/DOCX/HTML/...)
    participant tokenizer as nlp/__init__.py<br/>tokenize / naive_merge
    participant Embed as llm/embedding_model.py<br/>Base.encode()
    participant VDB as ESVectorFactory<br/>elasticsearch_vector.py
    participant Graph as graphrag/general/index.py<br/>run_graphrag_for_kb()
    Note over User,VDB: === Phase 1: File upload and trigger ===
    User->>API: POST /documents (file / URL)
    API->>DB: INSERT Document (status=pending)
    API->>Celery: delay parse_document(file_path, document_id)
    Note over Celery,VDB: === Phase 2: Document parsing and chunking ===
    Celery->>DB: SELECT Document, Knowledge
    Celery->>Celery: _build_vision_model()
    Celery->>Chunker: chunk(filename, binary, vision_model)

    alt PDF format
        Chunker->>Parser: RAGPdfParser.__call__()
        Parser->>Parser: __images__() → OCR → _layouts_rec()
        Parser->>Parser: _table_transformer_job()
        Parser->>Parser: _text_merge() + _concat_downward()
        Parser-->>Chunker: sections: List[(text, tag)]<br/>tables: List[(image, html)]
    else DOCX format
        Chunker->>Parser: RAGDocxParser.parse()
        Parser-->>Chunker: sections, tables
    else HTML/MD/TXT/Excel
        Chunker->>Parser: matching parser for the format
        Parser-->>Chunker: sections
    end

    alt Routed by document type
        Chunker->>Chunker: book.py / paper.py / laws.py / ...
        Chunker->>tokenizer: hierarchical_merge() / tree_merge()
    else Default (naive)
        Chunker->>tokenizer: naive_merge(sections, chunk_token_num)
    end

    tokenizer->>tokenizer: tokenize(d) → content_ltks / content_sm_ltks
    tokenizer->>tokenizer: tokenize_chunks() → attach page_num / position / image
    tokenizer-->>Celery: res: List[Dict] (chunk dicts)
    Note over Celery,VDB: === Phase 3: Embedding and storage ===
    Celery->>DB: progress=0.8
    Celery->>VDB: delete_by_metadata_field(document_id)

    alt auto_questions enabled
        Celery->>Celery: generate questions concurrently (ThreadPool)
        Celery->>Embed: question_proposal(chat_mdl, content)
    end

    Celery->>Embed: encode(chunk_texts) → np.array
    Embed-->>Celery: vectors + token_count

    loop For each batch
        Celery->>Celery: assemble DocumentChunk(page_content, vector, metadata)
        Celery->>VDB: insert_documents(chunks)
        VDB->>VDB: cosineSimilarity index + BM25
        VDB-->>Celery: ack
    end

    Celery->>DB: UPDATE Document (progress=1.0, chunk_num=N)
    Note over Celery,Graph: === Phase 4: Asynchronous GraphRAG build ===
    Celery->>Celery: build_graphrag_for_document.delay()
    Celery->>Graph: run_graphrag_for_kb(document_ids)
    Graph->>Graph: generate_subgraph() per chunk
    Graph->>Graph: LLM extracts entities + relations
    Graph->>Graph: merge_subgraph() → nx.pagerank
    opt entity_resolution
        Graph->>Graph: resolve_entities() (LLM matching)
    end
    opt community_reports (general only)
        Graph->>Graph: leiden.run() hierarchical clustering
        Graph->>Graph: CommunityReportsExtractor → LLM reports
    end
    Graph->>VDB: store graph entities / relations / reports
    Graph-->>Celery: done