
fix: preserve HTML tables in Outlook .msg conversion #1673

Open
octo-patch wants to merge 1 commit into microsoft:main from octo-patch:fix/issue-1567-outlook-msg-html-tables


@octo-patch

Fixes #1567

Problem

The OutlookMsgConverter reads only the plain text body, discarding the HTML body. When a .msg file contains HTML tables, the plain text fallback strips all HTML formatting.

Solution

Prefer the HTML body (PR_BODY_HTML) when it exists:

  1. Try Unicode HTML stream __substg1.0_1013001F first
  2. Fall back to binary HTML stream __substg1.0_10130102
  3. Convert HTML to markdown via BeautifulSoup + _CustomMarkdownify (same as HtmlConverter)
  4. Fall back to plain text body if no HTML body is present
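The fallback order above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `pick_body` and the `streams` mapping are hypothetical stand-ins for reading OLE streams, the plain text stream name `__substg1.0_1000001F` is assumed to be the Unicode PR_BODY stream, and the UTF-8 then cp1252 decoding of the binary stream is an assumption (the real code page may come from other message properties).

```python
from typing import Mapping, Optional, Tuple

# Stream names from the PR description.
HTML_UNICODE = "__substg1.0_1013001F"
HTML_BINARY = "__substg1.0_10130102"
# Assumption: Unicode plain text body (PR_BODY) stream name.
PLAIN_UNICODE = "__substg1.0_1000001F"


def pick_body(streams: Mapping[str, bytes]) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (kind, text) following the PR's priority order."""
    # 1. Unicode HTML stream: PT_UNICODE properties are stored as UTF-16-LE.
    raw = streams.get(HTML_UNICODE)
    if raw:
        return "html", raw.decode("utf-16-le", errors="replace")

    # 2. Binary HTML stream: encoding is not known here, so try UTF-8 first
    #    and fall back to cp1252 (an assumption made for this sketch).
    raw = streams.get(HTML_BINARY)
    if raw:
        try:
            return "html", raw.decode("utf-8")
        except UnicodeDecodeError:
            return "html", raw.decode("cp1252", errors="replace")

    # 3. No HTML body: fall back to the plain text body.
    raw = streams.get(PLAIN_UNICODE)
    if raw:
        return "text", raw.decode("utf-16-le", errors="replace")

    return "none", None
```

A caller would convert the result to markdown only when the returned kind is `"html"`, and use the text unchanged when it is `"text"`.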

No new dependencies introduced.

Testing

Tested with a .msg file containing HTML tables. Before: unformatted plain text. After: proper markdown tables.

…#1567)

When a .msg file contains an HTML body (PR_BODY_HTML), prefer it over
the plain text body so that tables and other HTML formatting are
converted to proper markdown instead of being stripped.

- Try Unicode HTML stream (__substg1.0_1013001F) first
- Fall back to binary HTML stream (__substg1.0_10130102)
- Convert HTML to markdown via BeautifulSoup + _CustomMarkdownify
- Fall back to plain text body if no HTML body is present

@VANDRANKI left a comment


The priority order (HTML body first, plain text fallback) is correct - MSG files that have an HTML body should produce better markdown than stripping it to plain text.

A few points:

script/style removal - good. Without this, CSS and JS in the HTML body would end up in the markdown output.

assert olefile is not None in _get_binary_stream_html - the existing _get_stream_data method uses assert olefile is not None too, so this is consistent with the existing pattern. But both methods are only called after the class-level import check, so the assertion should never fire in practice - it is really just a type-narrowing hint. A comment explaining this would help future readers.

BeautifulSoup dependency - is this already available in the optional deps for OutlookMsgConverter? If not, it needs to be added to the extras in pyproject.toml alongside olefile.

Tests - is there a test .msg file with an HTML body that exercises the new path? If not, a test covering table preservation would strengthen this.

The logic itself looks correct.
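To illustrate the script/style point from the review: the sketch below shows why the contents of those elements have to be dropped before conversion. It uses the standard library's `html.parser` as a stand-in, not the BeautifulSoup call the PR uses, and `visible_text` and `_VisibleText` are hypothetical names for this example only.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a script/style element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # CSS and JS arrive here as ordinary data; drop them while skipping.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    """Hypothetical helper: return the body text without CSS or JS content."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Without the skip logic, the rule text of a `<style>` block and the source of a `<script>` block would be emitted alongside the table cells.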

Development

Successfully merging this pull request may close these issues.

[outlook]: HTML Tables in outlook files
