- Make MetadataExtractor language param optional (default None) to
support auto-detection fallback when no language is explicitly set
- Refactor clean_metadata from walrus-operator dict comprehension to
explicit loop for correctness and readability
- Remove _replace_first_person_with_user from StatementExtractor to preserve
original user text for downstream metadata/alias extraction
- Delete metadata_utils.py module, inline clean_metadata into Celery task
- Remove unused imports and commented-out collect_user_raw_messages method
- Apply formatting cleanup across metadata models and extraction orchestrator
- Merge alias add/remove into MetadataExtractionResponse and Celery metadata task,
removing the separate sync step from extraction_orchestrator
- Replace first-person pronouns ("我") with "用户" in statement extraction to
preserve identity semantics for downstream metadata/alias extraction
- Update extract_statement.jinja2 prompt to enforce "用户" as subject for user
statements instead of resolving to real names
- Add alias change instructions (aliases_to_add/aliases_to_remove) to
extract_user_metadata.jinja2 with incremental merge logic
- Deduplicate special entities ("用户", "AI助手") in graph_saver by reusing
existing Neo4j node IDs per end_user_id
- Sync final aliases from PgSQL to Neo4j user entity nodes after metadata write
The knowledge graph retrieval logic in `search.py` was updated to consistently return `DocumentChunk` instances instead of raw dictionaries, improving type safety and alignment with the RAG pipeline's expected data structure. Additionally, debug logging was enhanced in `draft_run_service.py` to log the full `retrieve_chunks_result` before extracting page content, aiding troubleshooting.
The pdfplumber parser now uses a global lock to prevent concurrent access issues during PDF image rendering. Additionally, added a warning log to trace knowledge retrieval results for debugging purposes. The syntax fix in knowledge node's match case ensures correct pattern matching behavior.
BREAKING CHANGE: The pdfplumber parser now requires LOCK_KEY_pdfplumber to be defined in sys.modules for thread safety.
Closes#841
- Remove merge_metadata and its helper functions from metadata_utils.py
- Pass existing_metadata to MetadataExtractor.extract_metadata() as LLM context
- Add merge instructions to extract_user_metadata.jinja2 prompt (zh/en)
- Update Celery task to read existing metadata before extraction and overwrite
- Simplify field descriptions in UserMetadataProfile model
- Add _update_timestamps helper to track changed fields
- Replace version-based optimistic locking and retry loop with apoc.atomic.add/insert for concurrent safety
- Merge duplicate accesses within a batch before updating (access_count_delta)
- Simplify _calculate_update to only compute on new timestamps instead of full history rebuild
- Remove max_retries instance variable (kept as param for backward compat)
- Trim verbose docstrings and inline comments
- Increase max_retries from 3 to 5 for concurrent conflict recovery
- Add randomized exponential backoff between retries to reduce contention
- Merge duplicate node accesses in batch operations to avoid self-conflicts
- Support access_times parameter for merged batch access counting
- Add Community node label support in atomic update content field map