
feat(parse): preserve document image attachments #2243

Open
qin-ctx wants to merge 1 commit into main from feat/document-image-attachments


qin-ctx (Collaborator) commented May 26, 2026

Description

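This PR preserves the image attachments that are produced or referenced during document parsing, so that images from Markdown, PDF, Word, and legacy .doc files are written to disk with the resource, migrate with it, and take part in downstream semantic processing.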

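Example:

Upload paper.md with the following content:

This article introduces CXL memory pools.

![Architecture diagram](diagram.png)

After parsing, the local image is copied into the resource tree and the Markdown is rewritten to use a stable resource URI:

This article introduces CXL memory pools.

![Architecture diagram](viking://resources/papers/paper/media/images/image-1.png)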
<!-- image-summary: A CXL memory pool diagram - with GPU allocation details. -->

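As a result, the image no longer lives in a temporary directory or behind a local relative path, and later retrieval can use the image summary to understand how the figure relates to the surrounding text.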
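The rewrite in the example can be sketched roughly as follows. This is an illustration, not the code in this PR: `rewrite_local_images`, its regex, and the `resource_root` parameter are hypothetical names, and only the `media/images/image-N.ext` layout comes from the description. Copying the actual bytes is left out; the function just returns the mapping a caller would need to do so.

```python
import re
from pathlib import PurePosixPath

# Matches inline Markdown images of the form ![alt](target), without titles.
IMAGE_REF = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)\)")


def rewrite_local_images(markdown: str, resource_root: str) -> tuple[str, dict[str, str]]:
    """Rewrite local image targets to stable URIs under resource_root.

    Returns the rewritten Markdown and a mapping from each original local
    path to its relative location inside the resource tree.
    """
    copies: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        target = match.group("target")
        # Simplification: anything with a scheme is treated as non-local.
        if "://" in target:
            return match.group(0)
        if target not in copies:
            ext = PurePosixPath(target).suffix.lower()
            copies[target] = f"media/images/image-{len(copies) + 1}{ext}"
        return f"![{match.group('alt')}]({resource_root}/{copies[target]})"

    return IMAGE_REF.sub(replace, markdown), copies


source = "Intro.\n\n![Architecture diagram](diagram.png)\n"
rewritten, copies = rewrite_local_images(source, "viking://resources/papers/paper")
print(rewritten)
print(copies)
```

Repeated references to the same local file reuse one numbered path, so an image referenced twice would be copied only once.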

Related Issue

None

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

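  • Add a shared document image attachment utility that standardizes image extension detection, the media/images/image-N.ext path layout, and Markdown image reference generation.
  • The Markdown parser now copies locally referenced images and safely writes binary attachments produced by parsers; the PDF and Word parsers hand extracted images and table-region images to the Markdown parser as attachments to be written to disk.
  • Before a resource is persisted, replace the temporary RESOURCE_ROOT_PLACEHOLDER with the final viking://resources/... URI, and have the Semantic DAG write each image summary back next to its Markdown image reference.
  • For legacy .doc files, first try converting to PDF with LibreOffice and reuse the PDF parser, which preserves images and the visual content of tables; the original text extraction remains as a fallback.
  • Add unit tests for Markdown, PDF, Word, legacy .doc, resource URI rewriting, and image summary write-back.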
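The image summary write-back could look something like the sketch below. It assumes the `<!-- image-summary: ... -->` comment format from the example in the description; the function name, the URI-keyed `summaries` dictionary, and the sanitizing rules are assumptions made for illustration, not the behavior of `_embed_image_summaries_in_markdown`.

```python
import re

# An image reference that occupies a whole line.
IMAGE_LINE = re.compile(r"^!\[[^\]]*\]\((?P<uri>[^)\s]+)\)\s*$")
SUMMARY_PREFIX = "<!-- image-summary:"


def embed_image_summaries(markdown: str, summaries: dict[str, str]) -> str:
    """Insert an HTML comment with the summary after each known image line."""
    lines = markdown.split("\n")
    out: list[str] = []
    for index, line in enumerate(lines):
        out.append(line)
        match = IMAGE_LINE.match(line)
        if not match:
            continue
        summary = summaries.get(match.group("uri"))
        if not summary:
            continue
        next_line = lines[index + 1] if index + 1 < len(lines) else ""
        if next_line.startswith(SUMMARY_PREFIX):
            continue  # already annotated, keep the operation idempotent
        # Collapse whitespace and runs of hyphens so the text cannot
        # terminate the HTML comment early.
        safe = " ".join(re.sub(r"-{2,}", "-", summary).split())
        out.append(f"{SUMMARY_PREFIX} {safe} -->")
    return "\n".join(out)


uri = "viking://resources/papers/paper/media/images/image-1.png"
doc = f"Intro.\n\n![Architecture diagram]({uri})\n"
annotated = embed_image_summaries(doc, {uri: "A CXL memory pool diagram."})
print(annotated)
```

Running the function a second time on its own output leaves the document unchanged, which matters if the semantic pass can be retried.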

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Test command:

.venv/bin/python -m pytest tests/parse/test_markdown_attachments.py tests/parse/test_document_image_attachments.py tests/parse/test_pdf_bookmark_extraction.py tests/misc/test_resource_processor_mv.py tests/storage/test_semantic_dag_stats.py tests/parse/test_document_parser_threading.py

Result: 37 passed. The run still shows the project's pre-existing requests dependency version notice and Pydantic deprecation warnings.

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

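This change deliberately has the parsing stage use a placeholder URI first. ResourceProcessor then rewrites it to the target resource URI in one place, just before the resource is finally written to disk, so parsers never have to guess the final path ahead of time.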
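A minimal sketch of this two-phase idea, under stated assumptions: the PR names RESOURCE_ROOT_PLACEHOLDER but does not show its value, so the string below is invented, and both helper functions are hypothetical stand-ins for the parser side and the ResourceProcessor side.

```python
# Assumed value for illustration only; the real constant is not shown in the PR.
RESOURCE_ROOT_PLACEHOLDER = "viking://resources/__RESOURCE_ROOT__"


def placeholder_image_ref(alt: str, index: int, ext: str) -> str:
    """Parser side: emit a reference relative to a not-yet-known resource root."""
    return f"![{alt}]({RESOURCE_ROOT_PLACEHOLDER}/media/images/image-{index}{ext})"


def finalize_resource_uris(markdown: str, final_root_uri: str) -> str:
    """Persistence side: swap the placeholder for the final resource root."""
    return markdown.replace(RESOURCE_ROOT_PLACEHOLDER, final_root_uri.rstrip("/"))


parsed = placeholder_image_ref("Architecture diagram", 1, ".png")
final = finalize_resource_uris(parsed, "viking://resources/papers/paper/")
print(final)
```

Keeping the substitution in a single function means a resource that is renamed or moved before persistence needs no parser changes.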

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 85
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add Markdown image attachment support

Relevant files:

  • openviking/parse/parsers/image_attachments.py
  • openviking/parse/parsers/markdown.py
  • tests/parse/test_markdown_attachments.py

Sub-PR theme: Add PDF image/table attachment support

Relevant files:

  • openviking/parse/parsers/pdf.py
  • tests/parse/test_pdf_bookmark_extraction.py

Sub-PR theme: Add Word/legacy .doc image support

Relevant files:

  • openviking/parse/parsers/word.py
  • openviking/parse/parsers/legacy_doc.py
  • tests/parse/test_document_image_attachments.py
  • tests/parse/test_document_parser_threading.py

Sub-PR theme: Add placeholder rewriting and image summary embedding

Relevant files:

  • openviking/storage/queuefs/semantic_dag.py
  • openviking/utils/resource_processor.py
  • tests/misc/test_resource_processor_mv.py
  • tests/storage/test_semantic_dag_stats.py

⚡ Recommended focus areas for review

Possible issue with zip strict=False

In _embed_image_summaries_in_markdown, zip(..., strict=False) stops at the shorter input, so if node.file_paths and node.file_summaries differ in length the extra entries are silently dropped, which can leave image summaries missing or files mismatched.

for file_path, summary_dict in zip(node.file_paths, node.file_summaries, strict=False):
    if not summary_dict:
        continue

Potential missing kwargs in fallback path

In the legacy .doc parser's fallback path (when PDF conversion fails), the _extract_text call doesn't pass any kwargs. This may be intentional, but it is worth verifying whether text extraction needs any of them.

text = await asyncio.to_thread(self._extract_text, path)
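The convert-then-fall-back flow described for legacy .doc files might be structured along these lines. Every name in this sketch is hypothetical: the three callables are injected so the example stays self-contained, and it does not reflect the actual signatures in legacy_doc.py.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, Optional


async def parse_legacy_doc(
    path: Path,
    convert_to_pdf: Callable[[Path], Optional[Path]],
    parse_pdf: Callable[[Path], Awaitable[str]],
    extract_text: Callable[[Path], str],
) -> str:
    """Prefer the PDF route (keeps images); fall back to plain text extraction."""
    try:
        # Conversion is blocking work, so run it off the event loop.
        pdf_path = await asyncio.to_thread(convert_to_pdf, path)
    except Exception:
        # Broad catch for the sketch: any conversion failure triggers the fallback.
        pdf_path = None
    if pdf_path is not None:
        return await parse_pdf(pdf_path)
    return await asyncio.to_thread(extract_text, path)


# Stand-in implementations used only to exercise both branches.
def failing_converter(path: Path) -> Optional[Path]:
    raise RuntimeError("converter unavailable")


def working_converter(path: Path) -> Optional[Path]:
    return path.with_suffix(".pdf")


async def fake_parse_pdf(path: Path) -> str:
    return f"pdf:{path.name}"


def fake_extract_text(path: Path) -> str:
    return f"text:{path.name}"


fallback = asyncio.run(
    parse_legacy_doc(Path("a.doc"), failing_converter, fake_parse_pdf, fake_extract_text)
)
converted = asyncio.run(
    parse_legacy_doc(Path("a.doc"), working_converter, fake_parse_pdf, fake_extract_text)
)
print(fallback, converted)
```

If the text extractor does need options, they would have to be forwarded explicitly through the `asyncio.to_thread` call, which accepts positional and keyword arguments for the target function.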

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

