fix: preserve HTML tables in Outlook .msg conversion #1673
octo-patch wants to merge 1 commit into microsoft:main
…#1567) When a .msg file contains an HTML body (PR_BODY_HTML), prefer it over the plain text body so that tables and other HTML formatting are converted to proper markdown instead of being stripped.
- Try Unicode HTML stream (__substg1.0_1013001F) first
- Fall back to binary HTML stream (__substg1.0_10130102)
- Convert HTML to markdown via BeautifulSoup + _CustomMarkdownify
- Fall back to plain text body if no HTML body is present
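The fallback order in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `select_body` and the `streams` mapping are hypothetical stand-ins for reading streams from the OLE container, the plain text stream name is an assumption, and the decoding choice for the binary stream is a guess.

```python
from typing import Mapping, Optional, Tuple

# Stream names taken from the commit message.
HTML_UNICODE = "__substg1.0_1013001F"
HTML_BINARY = "__substg1.0_10130102"
# Assumption: the Unicode plain text body (PR_BODY) lives in this stream.
PLAIN_TEXT = "__substg1.0_1000001F"


def select_body(streams: Mapping[str, bytes]) -> Tuple[str, Optional[str]]:
    """Return (kind, text), where kind is "html", "text", or "none".

    `streams` is a hypothetical mapping of stream name -> raw bytes that
    stands in for the real OLE container.
    """
    raw = streams.get(HTML_UNICODE)
    if raw:
        # Unicode string properties are stored as UTF-16-LE.
        return "html", raw.decode("utf-16-le", errors="replace")

    raw = streams.get(HTML_BINARY)
    if raw:
        # Assumption: try UTF-8 first, then fall back to cp1252.
        try:
            return "html", raw.decode("utf-8")
        except UnicodeDecodeError:
            return "html", raw.decode("cp1252", errors="replace")

    raw = streams.get(PLAIN_TEXT)
    if raw:
        return "text", raw.decode("utf-16-le", errors="replace")

    return "none", None
```

A caller would pass the "html" result on to the HTML-to-markdown step and use the "text" result as-is.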
VANDRANKI
left a comment
The priority order (HTML body first, plain text fallback) is correct - MSG files that have an HTML body should produce better markdown than stripping it to plain text.
A few points:
- script/style removal - good. Without this, CSS and JS in the HTML body would end up in the markdown output.
- `assert olefile is not None` in `_get_binary_stream_html` - the existing `_get_stream_data` method uses `assert olefile is not None` too, so this is consistent with the existing pattern. But both methods are only called after the class-level import check, so the assertion should never fire in practice - it is really just a type-narrowing hint. A comment explaining this would help future readers.
- BeautifulSoup dependency - is this already available in the optional deps for `OutlookMsgConverter`? If not, it needs to be added to the extras in `pyproject.toml` alongside `olefile`.
- Tests - is there a test .msg file with an HTML body that exercises the new path? If not, a test covering table preservation would strengthen this.
The logic itself looks correct.
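To illustrate the script/style point, here is a small standard-library sketch of why the removal matters. It is not the PR's implementation, which uses BeautifulSoup; `VisibleTextExtractor` and `visible_text` are hypothetical names, and `html.parser` is used only so the example has no third-party dependencies.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Without the skip logic, the CSS rule and the JavaScript call in an input like `<style>td{color:red}</style><p>Hello</p><script>alert(1)</script>` would be emitted as if they were body text.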
Fixes #1567
Problem
The OutlookMsgConverter reads only the plain text body, discarding the HTML body. When a .msg file contains HTML tables, the plain text fallback strips all HTML formatting.
Solution
Prefer the HTML body (PR_BODY_HTML) when it exists:
- Try the Unicode HTML stream (__substg1.0_1013001F) first
- Fall back to the binary HTML stream (__substg1.0_10130102)
No new dependencies introduced.
Testing
Tested with a .msg file containing HTML tables. Before: unformatted plain text. After: proper markdown tables.