feat(parse): preserve document image attachments by qin-ctx · Pull Request #2243 · volcengine/OpenViking

qin-ctx · 2026-05-26T10:05:40Z

Description

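This PR preserves image attachments that are produced or referenced during document parsing, so that images from Markdown, PDF, Word, and legacy .doc files are written to storage, migrated, and included in downstream semantic processing together with their resource.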

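Example:

Upload paper.md with the following content:

This article introduces the CXL memory pool.

![Architecture diagram](diagram.png)

After parsing, the local image is copied into the resource tree and the Markdown is rewritten to use a stable resource URI:

This article introduces the CXL memory pool.

![Architecture diagram](viking://resources/papers/paper/media/images/image-1.png)
<!-- image-summary: A CXL memory pool diagram - with GPU allocation details. -->

The image therefore no longer lives in a temporary directory or behind a local relative path, and later retrieval can use the image summary to understand how the figure relates to the surrounding text.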
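
The rewrite in the example above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name `rewrite_local_images`, the regular expression, and the extension set are all assumptions; only the `media/images/image-N.ext` layout and the `viking://resources/...` URI form come from the description.

```python
import re
from pathlib import PurePosixPath

# Hypothetical names and patterns; the PR's actual utility may differ.
IMAGE_REF = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)\)")
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".svg"}


def rewrite_local_images(markdown: str, resource_uri: str):
    """Return (rewritten_markdown, {original_target: new_relative_path})."""
    mapping = {}

    def replace(match):
        target = match.group("target")
        ext = PurePosixPath(target).suffix.lower()
        # Leave remote URLs, existing viking:// URIs, and non-image targets untouched.
        if "://" in target or ext not in IMAGE_EXTS:
            return match.group(0)
        # Number images in order of first appearance: image-1, image-2, ...
        if target not in mapping:
            mapping[target] = f"media/images/image-{len(mapping) + 1}{ext}"
        return f"![{match.group('alt')}]({resource_uri}/{mapping[target]})"

    return IMAGE_REF.sub(replace, markdown), mapping
```

The returned mapping tells the caller which local files to copy and where; the copy itself is left out of this sketch.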

Related Issue

None

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Changes Made

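Add a unified document image attachment utility that standardizes image extension detection, the media/images/image-N.ext path layout, and Markdown image reference generation.
Markdown parsing now copies locally referenced images and safely writes binary attachments produced by parsers; PDF/Word parsing hands extracted images and table-region images to the Markdown parser as attachments to be written to disk.
Before a resource is persisted, the temporary RESOURCE_ROOT_PLACEHOLDER is replaced with the final viking://resources/... URI, and the Semantic DAG writes each image summary back next to its Markdown image reference.
For legacy .doc files, first try converting to PDF with LibreOffice and reuse the PDF parser, which preserves images and the visual content of tables; the original text extraction is kept as a fallback.
Add unit tests for Markdown, PDF, Word, legacy .doc, resource URI rewriting, and image summary write-back.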
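
As an illustration of the image summary write-back mentioned above, the sketch below inserts an `<!-- image-summary: ... -->` comment on the line after a matching image reference. The function name `embed_image_summary` and the sanitization rules are assumptions made for this example; the PR's `_embed_image_summaries_in_markdown` may work differently.

```python
import re


def embed_image_summary(markdown: str, image_uri: str, summary: str) -> str:
    """Insert an image-summary comment after each line that references image_uri."""
    # Collapse whitespace and runs of hyphens so the HTML comment stays well formed.
    clean = re.sub(r"-{2,}", "-", " ".join(summary.split()))
    comment = f"<!-- image-summary: {clean} -->"

    lines = markdown.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if f"]({image_uri})" in line:
            # Skip lines that already have a summary, so reruns are idempotent.
            following = lines[i + 1] if i + 1 < len(lines) else ""
            if not following.startswith("<!-- image-summary:"):
                out.append(comment)

    # splitlines() drops a trailing newline; restore it if the input had one.
    return "\n".join(out) + ("\n" if markdown.endswith("\n") else "")
```

Keeping the summary in an HTML comment means it travels with the Markdown text without changing how the document renders.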

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this on the following platforms:
- Linux
- macOS
- Windows

Test command:

.venv/bin/python -m pytest tests/parse/test_markdown_attachments.py tests/parse/test_document_image_attachments.py tests/parse/test_pdf_bookmark_extraction.py tests/misc/test_resource_processor_mv.py tests/storage/test_semantic_dag_stats.py tests/parse/test_document_parser_threading.py

Result: 37 passed. The run still emits the project's pre-existing requests dependency version notice and Pydantic deprecation warnings.

Checklist

My code follows the project's coding style
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Screenshots (if applicable)

Not applicable

Additional Notes

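This change deliberately has the parsing stage use a placeholder URI first. ResourceProcessor then rewrites it to the target resource URI in one place, just before the resource is finally persisted, so parsers never have to guess the final path ahead of time.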
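
A minimal sketch of that two-phase approach, under stated assumptions: the placeholder string value and the function name `finalize_resource_uris` are hypothetical, since the description only names the constant RESOURCE_ROOT_PLACEHOLDER and the final `viking://resources/...` form.

```python
# Hypothetical placeholder value; the PR only names the constant itself.
RESOURCE_ROOT_PLACEHOLDER = "viking://__resource_root__"


def finalize_resource_uris(markdown: str, final_root_uri: str) -> str:
    """Swap the parse-time placeholder for the final resource root URI."""
    # Strip a trailing slash so the join does not produce "//" in the path.
    return markdown.replace(RESOURCE_ROOT_PLACEHOLDER, final_root_uri.rstrip("/"))


# Parse time: the parser only knows the placeholder root.
parsed = f"![Architecture diagram]({RESOURCE_ROOT_PLACEHOLDER}/media/images/image-1.png)"

# Persist time: the final location is known and substituted in one place.
final = finalize_resource_uris(parsed, "viking://resources/papers/paper/")
```

Because the substitution happens once, at persist time, moving or renaming the resource before that point needs no changes in the parsers.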

github-actions · 2026-05-26T10:07:30Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 85
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add Markdown image attachment support
Relevant files:
- openviking/parse/parsers/image_attachments.py
- openviking/parse/parsers/markdown.py
- tests/parse/test_markdown_attachments.py

Sub-PR theme: Add PDF image/table attachment support
Relevant files:
- openviking/parse/parsers/pdf.py
- tests/parse/test_pdf_bookmark_extraction.py

Sub-PR theme: Add Word/legacy .doc image support
Relevant files:
- openviking/parse/parsers/word.py
- openviking/parse/parsers/legacy_doc.py
- tests/parse/test_document_image_attachments.py
- tests/parse/test_document_parser_threading.py

Sub-PR theme: Add placeholder rewriting and image summary embedding
Relevant files:
- openviking/storage/queuefs/semantic_dag.py
- openviking/utils/resource_processor.py
- tests/misc/test_resource_processor_mv.py
- tests/storage/test_semantic_dag_stats.py

⚡ Recommended focus areas for review

Possible issue with zip strict=False
In _embed_image_summaries_in_markdown, using zip(..., strict=False) could truncate file_paths or file_summaries if they have different lengths, leading to missing image summaries or mismatched files.

    for file_path, summary_dict in zip(node.file_paths, node.file_summaries, strict=False):
        if not summary_dict:
            continue

Potential missing kwargs in fallback path
In the legacy .doc parser's fallback path (when PDF conversion fails), the _extract_text call doesn't pass any kwargs. This might be intended, but worth verifying if any kwargs are needed for text extraction.

    text = await asyncio.to_thread(self._extract_text, path)
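
To make the zip concern concrete: a non-strict `zip` stops at the shorter input without raising, so a missing summary goes unnoticed. The sketch below shows one possible guard with an explicit length check. The helper name `pair_paths_with_summaries` is hypothetical and is not part of the PR; on Python 3.10+ passing `strict=True` to `zip` would raise `ValueError` in the same situation.

```python
def pair_paths_with_summaries(file_paths, file_summaries):
    """Pair each file path with its summary, refusing mismatched lengths."""
    if len(file_paths) != len(file_summaries):
        raise ValueError(
            f"{len(file_paths)} file paths but {len(file_summaries)} summaries"
        )
    # Skip empty summaries, mirroring the `if not summary_dict: continue` check.
    return [(path, summary) for path, summary in zip(file_paths, file_summaries) if summary]


paths = ["image-1.png", "image-2.png"]
summaries = [{"summary": "A CXL memory pool diagram"}]

# A plain zip silently drops image-2.png because the summaries list is shorter.
silently_truncated = list(zip(paths, summaries))
```

Whether a mismatch should raise or merely be logged depends on how node.file_paths and node.file_summaries are populated, which this page does not show.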

github-actions · 2026-05-26T10:08:14Z

PR Code Suggestions ✨

No code suggestions found for the PR.

feat(parse): preserve document image attachments

9d3a491

github-project-automation Bot added this to OpenViking project May 26, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 26, 2026

github-actions Bot added the Review effort 4/5 label May 26, 2026

qin-ctx wants to merge 1 commit into main from feat/document-image-attachments
