%% MemoryBear document ingestion sequence diagram (Indexing Pipeline)
%% Start: user upload / API call. End: vectors stored + GraphRAG index complete
sequenceDiagram
    autonumber
    participant User as User/API
    participant API as document_controller.py:275<br/>parse_documents()
    participant Celery as Celery Worker<br/>tasks.py
    participant DB as PostgreSQL<br/>(Document / Knowledge)
    participant Chunker as app/naive.py:508<br/>chunk()
    participant Parser as deepdoc/parser/<br/>(PDF/DOCX/HTML/...)
    participant tokenizer as nlp/__init__.py<br/>tokenize / naive_merge
    participant Embed as llm/embedding_model.py<br/>Base.encode()
    participant VDB as ESVectorFactory<br/>elasticsearch_vector.py
    participant Graph as graphrag/general/index.py<br/>run_graphrag_for_kb()

    Note over User,VDB: === Stage 1: File upload and trigger ===
    User->>API: POST /documents (file / URL)
    API->>DB: INSERT Document (status=pending)
    API->>Celery: delay parse_document(file_path, document_id)

    Note over Celery,VDB: === Stage 2: Document parsing and chunking ===
    Celery->>DB: SELECT Document, Knowledge
    Celery->>Celery: _build_vision_model()
    Celery->>Chunker: chunk(filename, binary, vision_model)

    alt PDF format
        Chunker->>Parser: RAGPdfParser.__call__()
        Parser->>Parser: __images__() → OCR → _layouts_rec()
        Parser->>Parser: _table_transformer_job()
        Parser->>Parser: _text_merge() + _concat_downward()
        Parser-->>Chunker: sections: List[(text, tag)]<br/>tables: List[(image, html)]
    else DOCX format
        Chunker->>Parser: RAGDocxParser.parse()
        Parser-->>Chunker: sections, tables
    else HTML/MD/TXT/Excel
        Chunker->>Parser: matching parser
        Parser-->>Chunker: sections
    end

    alt Route by document type
        Chunker->>Chunker: book.py / paper.py / laws.py / ...
        Chunker->>tokenizer: hierarchical_merge() / tree_merge()
    else Default naive
        Chunker->>tokenizer: naive_merge(sections, chunk_token_num)
    end

    tokenizer->>tokenizer: tokenize(d) → content_ltks / content_sm_ltks
    tokenizer->>tokenizer: tokenize_chunks() → attach page_num / position / image
    tokenizer-->>Celery: res: List[Dict] (chunk dicts)

    Note over Celery,VDB: === Stage 3: Vectorization and storage ===
    Celery->>DB: progress=0.8
    Celery->>VDB: delete_by_metadata_field(document_id)

    alt auto_questions enabled
        Celery->>Celery: ThreadPool generates questions concurrently
        Celery->>Embed: question_proposal(chat_mdl, content)
    end

    Celery->>Embed: encode(chunk_texts) → np.array
    Embed-->>Celery: vectors + token_count

    loop Each batch
        Celery->>Celery: assemble DocumentChunk(page_content, vector, metadata)
        Celery->>VDB: insert_documents(chunks)
        VDB->>VDB: cosineSimilarity index + BM25
        VDB-->>Celery: ack
    end

    Celery->>DB: UPDATE Document (progress=1.0, chunk_num=N)

    Note over Celery,Graph: === Stage 4: Asynchronous GraphRAG build ===
    Celery->>Celery: build_graphrag_for_document.delay()
    Celery->>Graph: run_graphrag_for_kb(document_ids)
    Graph->>Graph: generate_subgraph() per chunk
    Graph->>Graph: LLM extracts entities + relations
    Graph->>Graph: merge_subgraph() → nx.pagerank

    opt entity_resolution
        Graph->>Graph: resolve_entities() (LLM matching)
    end

    opt community_reports (general only)
        Graph->>Graph: leiden.run() hierarchical clustering
        Graph->>Graph: CommunityReportsExtractor → LLM reports
    end

    Graph->>VDB: store graph entities / relations / reports
    Graph-->>Celery: done