Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to fix multi-round chat handling in the multimodal prompt processor by preventing mutation of the original message payload while parsing multimodal items.
Changes:
- Stop removing (
pop-ing) the multimodal source fields (url/data) when extracting the media source in_parse_multimodal_item.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| item_params = item.get(item_type, {}) | ||
| data_src = item_params.pop('url', None) or item_params.pop('data', None) | ||
| data_src = item_params.get('url', None) or item_params.get('data', None) | ||
|
|
There was a problem hiding this comment.
Switching from pop() to get() avoids mutating the original message (good for multi-round reuse), but it also changes the parsed output: item_params will now still contain url/data and gets merged into the emitted content item via **item_params. This can unintentionally carry large base64 blobs downstream and may conflict with the reserved 'data' field (if item_params contains a 'data' key it will overwrite the parsed data). Consider making a shallow copy of the params and popping url/data from the copy so the input messages are not mutated while keeping the output schema the same as before.
There was a problem hiding this comment.
Seems correct, item_params = item.get(item_type, {}).copy() is a shallow copy, won't be costly.
We should bring back this copy and pop out the potential long base64 string to avoid passing down in multiple rounds.
cc @lvhan028
No description provided.