fix(get_addresses): accept email.header.Header inputs (no more 'Header' object has no attribute 'strip')#153
Open
rakurtz wants to merge 3 commits into
Open
fix(get_addresses): accept email.header.Header inputs (no more 'Header' object has no attribute 'strip')#153rakurtz wants to merge 3 commits into
rakurtz wants to merge 3 commits into
Conversation
Message.get() returns email.header.Header for RFC 2047 encoded-word
headers (non-ASCII display names). The Header type has no .strip()
method and is not a valid input to email.utils.getaddresses, which
caused MailParser.from_bytes() to raise
AttributeError: 'Header' object has no attribute 'strip'
for any mail with a non-ASCII display name in an address header.
Normalize input at the top of get_addresses() so callers can pass
either a str, a Header, or None.
Adds three regression tests in tests/test_utils.py:
- get_addresses with an email.header.Header instance
- get_addresses with None
- get_addresses with a plain str (guards the existing happy path)
Co-authored-by: Cursor <cursoragent@cursor.com>
Coercing email.header.Header via str() avoids the AttributeError crash, but can be lossy for some real-world headers and surface replacement characters (�) in rendered output. Use Header.encode() in get_addresses() to preserve RFC 2047 encoded-words. This keeps the header parseable by email.utils.getaddresses while allowing core.py to decode display names later via decode_header_part() without introducing replacement chars. Adds a regression test that parses a full message via MailParser.from_bytes and asserts Unicode display names are preserved on mail.to. Co-authored-by: Cursor <cursoragent@cursor.com>
Some real-world mails produce To/From Header objects that render as encoded-word tokens like =?unknown-8bit?...?=. Passing those directly to email.utils.getaddresses(strict=True) can misclassify the token itself as the address, leaking encoded text into downstream output (e.g. PDF To: line). Decode Header values first via decode_header_part(raw_header.encode()) so getaddresses operates on a parseable Unicode string (e.g. "Álpám Longsom <recipient@example.com>"). Adds regression tests for: - get_addresses with unknown-8bit Header objects - MailParser.from_bytes end-to-end unknown-8bit display-name decoding Co-authored-by: Cursor <cursoragent@cursor.com>
Author
|
after the first commit the PR is based on i realized that the string encoding was broken. the second and third commit fix this by decoding the Header values before address parsing. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
I ran into a .eml file validation error when an email contained certain strings / encodings in the header. I could be tracked down to a call of
.strip()on a non-str object.Summary
mailparser.utils.get_addresses(raw_header)callsraw_header.strip()andpasses
raw_headertoemail.utils.getaddresses([raw_header], strict=True).Both assume a plain
str, butcore.pyfeedsget_addresseswhateveremail.message.Message.get(name)returns:For any address header containing RFC 2047 encoded-words — typical for non-ASCII display names like =?utf-8?q?=C3=81rp=C3=A1d?= user@example.com — Message.get() may return an email.header.Header instance. That instance has no .strip(), so MailParser.from_bytes(...) (or parse_from_bytes) raises:
This is the same shape of bug that #149 fixed in ported_string; that PR covered attachment header values but the address-parsing path was missed.
Fix
Normalise raw_header once at the top of get_addresses:
(edited to reflect the last commit that ensures correct encoding/decoding)
email.header.Header.__str__already produces the encoded-word string form that the existing_ADDR_FALLBACK_REandemail.utils.getaddressesknow how to handle, andcore.pydecodes the display name afterwards viadecode_header_part. The strict-string happy path is unchanged.Tests
Adds three focused unit tests in tests/test_utils.py::TestUtilsEdgeCases (matching the style of the existing test_ported_string_handles_header_object test from #149):
test_get_addresses_handles_header_object — direct regression. Constructs an email.header.Header from an encoded-word value and asserts get_addresses returns the parsed [(display_name, "user@example.com")] without raising.
test_get_addresses_handles_none — defensive: get_addresses(None) returns [] instead of crashing on attribute access.
test_get_addresses_plain_string_unchanged — guards the existing happy path so the new isinstance(raw_header, str) branch can't silently regress the common case.
Full local test run on Python 3.13:
Notes
No behaviour change for str inputs.
Public docstring of get_addresses updated to document the accepted input types (str | email.header.Header | None).
No changes to core.py — the fix is local to the helper so any other callers (current or future) get the same robustness automatically.