✨ Add support for new filetypes #2429

Open
yzAiden wants to merge 1 commit into ModelEngine-Group:develop from yzAiden:support_new_filetypes

yzAiden commented Feb 5, 2026

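I. Features:
1. Improved data-cleaning capability: adds cleaning and retrieval support for .epub, .html, .csv, .json, and .xml files.
II. Design:
1. All newly added file types are handled by UnstructuredProcessor.
2. .epub, .html, .csv, and .xml files go through exactly the same processing logic as the existing file types.
3. Chunking for .json files is designed separately. The JSON is first parsed into text and split by length without breaking the semantics of the outermost key-value pairs. When it cannot be split safely by key-value pair, chunking falls back to punctuation-based plain-text splitting. If parsing fails, the file is chunked directly with the plain-text strategy. The rest of the flow matches the existing processing.
III. Main locations changed:
1. sdk/nexent/data_process
IV. Related issue:
#2258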
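
The three-tier JSON chunking strategy described in the design section can be illustrated with a minimal sketch. This is not the code from this PR: the function names `chunk_json` and `split_plain_text`, the `max_len` parameter, the punctuation set, and the `key: value` line format are all assumptions made for illustration, and "cannot be split safely" is interpreted here as "a single top-level entry exceeds the length limit".

```python
import json
import re


def split_plain_text(text, max_len):
    """Hypothetical fallback: split after punctuation, then pack pieces up to max_len."""
    # Zero-width split keeps each punctuation mark attached to the text before it.
    pieces = [p for p in re.split(r"(?<=[。！？；，.!?;,\n])", text) if p]
    chunks, current = [], ""
    for piece in pieces:
        # A single piece longer than max_len is hard-cut so no chunk exceeds the limit.
        while len(piece) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_len])
            piece = piece[max_len:]
        if current and len(current) + len(piece) > max_len:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


def chunk_json(raw, max_len=200):
    """Hypothetical sketch of KV-aware chunking with two plain-text fallbacks."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Tier 3: parsing failed, treat the input as plain text.
        return split_plain_text(raw, max_len)

    if not isinstance(data, dict):
        # No outermost key-value pairs to preserve (assumption: use the text fallback).
        return split_plain_text(json.dumps(data, ensure_ascii=False), max_len)

    entries = [
        f"{json.dumps(k, ensure_ascii=False)}: {json.dumps(v, ensure_ascii=False)}"
        for k, v in data.items()
    ]
    if any(len(e) > max_len for e in entries):
        # Tier 2: an entry cannot fit in one chunk, so KV splitting is not safe.
        return split_plain_text(json.dumps(data, ensure_ascii=False), max_len)

    # Tier 1: pack whole top-level entries into chunks without cutting any entry.
    chunks, current = [], ""
    for entry in entries:
        candidate = entry if not current else current + "\n" + entry
        if len(candidate) > max_len:
            chunks.append(current)
            current = entry
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, each chunk produced by the first tier contains only complete top-level entries, so a retrieved chunk never shows a key without its value. The two fallbacks trade that guarantee for a hard upper bound on chunk length.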

yzAiden closed this Feb 5, 2026
yzAiden reopened this Feb 5, 2026
yzAiden closed this Feb 5, 2026
yzAiden reopened this Feb 5, 2026