feat(parse): preserve document image attachments #2243
Open
qin-ctx wants to merge 1 commit into
Description
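This PR preserves the image attachments produced or referenced during document parsing, so that images from Markdown, PDF, Word, and legacy .doc files are persisted, migrated, and passed to later semantic processing together with their resource. Example:
Upload paper.md containing a reference to a local image. After parsing, the local image is copied into the resource tree and the Markdown is rewritten to use a stable resource URI.
As a result, images no longer linger in a temporary directory or behind a local relative path, and later retrieval can use the image summary to understand how the image relates to the body text.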
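The copy-and-rewrite step described above can be sketched roughly as follows. This is an illustration only, not the PR's code: the function name `collect_local_images`, the placeholder value `{{RESOURCE_ROOT}}`, the 1-based numbering, and the regex-based matching are all assumptions. Only the `media/images/image-N.ext` layout and the `RESOURCE_ROOT_PLACEHOLDER` name come from the description.

```python
import re
import shutil
from pathlib import Path

# Assumed value; the PR names RESOURCE_ROOT_PLACEHOLDER but does not show its value here.
RESOURCE_ROOT_PLACEHOLDER = "{{RESOURCE_ROOT}}"

# Matches simple inline images: ![alt](target). Titles and reference-style
# images are deliberately not handled in this sketch.
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def collect_local_images(markdown: str, source_dir: Path, resource_dir: Path) -> str:
    """Copy local images under resource_dir/media/images and rewrite their references."""
    images_dir = resource_dir / "media" / "images"
    copied: dict[Path, str] = {}  # resolved source path -> path relative to resource_dir

    def rewrite(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        # Leave anything with a URI scheme (https:, data:, ...) untouched.
        if re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*:", target):
            return match.group(0)
        src = (source_dir / target).resolve()
        # Leave references to files that do not exist untouched.
        if not src.is_file():
            return match.group(0)
        if src not in copied:
            images_dir.mkdir(parents=True, exist_ok=True)
            rel = f"media/images/image-{len(copied) + 1}{src.suffix.lower()}"
            shutil.copyfile(src, resource_dir / rel)
            copied[src] = rel
        return f"![{alt}]({RESOURCE_ROOT_PLACEHOLDER}/{copied[src]})"

    return IMAGE_REF.sub(rewrite, markdown)
```

In this sketch an image that is referenced twice is copied once and both references point at the same file, while remote images and missing files are left as they were.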
Related Issue
None
Type of Change
Changes Made
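- media/images/image-N.ext paths and generation of Markdown image references.
- RESOURCE_ROOT_PLACEHOLDER is replaced with the final viking://resources/... URI, and the Semantic DAG writes image summaries back next to the Markdown image references.
- .doc first tries a LibreOffice conversion to PDF and then reuses the PDF parser, preserving images and the visual content of tables; the original text extraction is kept as a fallback.
- Unit tests for .doc, resource URI rewriting, and image-summary write-back.
Testing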
Test command:
Result: 37 passed. The run still prints the project's pre-existing requests dependency version notice and Pydantic deprecation warnings.
Checklist
Screenshots (if applicable)
Not applicable
Additional Notes
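This change deliberately has the parsing stage emit a placeholder URI first. ResourceProcessor then rewrites it to the target resource URI in one place, just before the resource is finally persisted, so parsers never have to guess the final path in advance.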
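A minimal sketch of that late rewrite, together with the image-summary write-back listed under Changes Made, might look like the following. The function names, the placeholder value `{{RESOURCE_ROOT}}`, the example resource path, and the inline summary format are assumptions made for illustration; the actual ResourceProcessor and Semantic DAG code is not shown in this description.

```python
import re

# Assumed value; only the name RESOURCE_ROOT_PLACEHOLDER appears in the PR text.
RESOURCE_ROOT_PLACEHOLDER = "{{RESOURCE_ROOT}}"

# Matches simple inline images: ![alt](target).
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def finalize_resource_uris(markdown: str, resource_uri: str) -> str:
    """Swap the parse-time placeholder for the final resource URI."""
    return markdown.replace(RESOURCE_ROOT_PLACEHOLDER, resource_uri.rstrip("/"))


def attach_image_summaries(markdown: str, summaries: dict[str, str]) -> str:
    """Append a summary next to each image whose URI has one (format is hypothetical)."""

    def annotate(match: re.Match) -> str:
        summary = summaries.get(match.group(2))
        if not summary:
            return match.group(0)
        return f"{match.group(0)} _(image summary: {summary})_"

    return IMAGE_REF.sub(annotate, markdown)


# Hypothetical usage; the path below "viking://resources/" is made up.
parsed = "See ![Arch]({{RESOURCE_ROOT}}/media/images/image-1.png) for details."
final = finalize_resource_uris(parsed, "viking://resources/papers/paper/")
image_uri = "viking://resources/papers/paper/media/images/image-1.png"
annotated = attach_image_summaries(final, {image_uri: "System architecture diagram"})
```

Because the parser only ever writes the placeholder, the final location is decided in one place. Note that this sketch is not idempotent: running `attach_image_summaries` twice would append the summary twice.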