%% MemoryBear/docs/rag/overview/02-indexing-pipeline.mmd

%% MemoryBear document ingestion sequence diagram (Indexing Pipeline)
%% Start: user upload / API call. End: vectors stored + GraphRAG index built

sequenceDiagram
    autonumber
    participant User as User/API
    participant API as document_controller.py:275<br/>parse_documents()
    participant Celery as Celery Worker<br/>tasks.py
    participant DB as PostgreSQL<br/>(Document / Knowledge)
    participant Chunker as app/naive.py:508<br/>chunk()
    participant Parser as deepdoc/parser/<br/>(PDF/DOCX/HTML/...)
    participant tokenizer as nlp/__init__.py<br/>tokenize / naive_merge
    participant Embed as llm/embedding_model.py<br/>Base.encode()
    participant VDB as ESVectorFactory<br/>elasticsearch_vector.py
    participant Graph as graphrag/general/index.py<br/>run_graphrag_for_kb()

    Note over User,VDB: === Stage 1: File upload and trigger ===
    User->>API: POST /documents (file / URL)
    API->>DB: INSERT Document (status=pending)
    API->>Celery: parse_document.delay(file_path, document_id)

    Note over Celery,VDB: === Stage 2: Document parsing and chunking ===
    Celery->>DB: SELECT Document, Knowledge
    Celery->>Celery: _build_vision_model()
    Celery->>Chunker: chunk(filename, binary, vision_model)

    alt PDF format
        Chunker->>Parser: RAGPdfParser.__call__()
        Parser->>Parser: __images__() → OCR → _layouts_rec()
        Parser->>Parser: _table_transformer_job()
        Parser->>Parser: _text_merge() + _concat_downward()
        Parser-->>Chunker: sections: List[(text, tag)]<br/>tables: List[(image, html)]
    else DOCX format
        Chunker->>Parser: RAGDocxParser.parse()
        Parser-->>Chunker: sections, tables
    else HTML/MD/TXT/Excel
        Chunker->>Parser: matching parser for the format
        Parser-->>Chunker: sections
    end

    alt Route by document type
        Chunker->>Chunker: book.py / paper.py / laws.py / ...
        Chunker->>tokenizer: hierarchical_merge() / tree_merge()
    else Default (naive)
        Chunker->>tokenizer: naive_merge(sections, chunk_token_num)
    end

    tokenizer->>tokenizer: tokenize(d) → content_ltks / content_sm_ltks
    tokenizer->>tokenizer: tokenize_chunks() → attach page_num / position / image
    tokenizer-->>Celery: res: List[Dict] (chunk dicts)

    Note over Celery,VDB: === Stage 3: Embedding and storage ===
    Celery->>DB: progress=0.8
    Celery->>VDB: delete_by_metadata_field(document_id)

    opt auto_questions enabled
        Celery->>Celery: ThreadPool generates questions concurrently
        Celery->>Embed: question_proposal(chat_mdl, content)
    end

    Celery->>Embed: encode(chunk_texts) → np.array
    Embed-->>Celery: vectors + token_count

    loop Each batch
        Celery->>Celery: assemble DocumentChunk(page_content, vector, metadata)
        Celery->>VDB: insert_documents(chunks)
        VDB->>VDB: cosineSimilarity index + BM25
        VDB-->>Celery: ack
    end

    Celery->>DB: UPDATE Document (progress=1.0, chunk_num=N)

    Note over Celery,Graph: === Stage 4: Asynchronous GraphRAG build ===
    Celery->>Celery: build_graphrag_for_document.delay()
    Celery->>Graph: run_graphrag_for_kb(document_ids)
    Graph->>Graph: generate_subgraph() per chunk
    Graph->>Graph: LLM extracts entities + relations
    Graph->>Graph: merge_subgraph() → nx.pagerank
    opt entity_resolution
        Graph->>Graph: resolve_entities() (LLM matching)
    end
    opt community_reports (general only)
        Graph->>Graph: leiden.run() hierarchical clustering
        Graph->>Graph: CommunityReportsExtractor → LLM reports
    end
    Graph->>VDB: store graph entities / relations / reports
    Graph-->>Celery: done