Description
IpynbConverter.accepts() reads the raw file stream and decodes it to check if the file is a Jupyter notebook. When the decode fails (non-ASCII bytes in the file), the exception propagates uncaught and crashes the entire conversion pipeline — even though the file has nothing to do with Jupyter notebooks.
Traceback (production environment)
  File "markitdown\_markitdown.py", line 601, in _convert
    _accepts = converter.accepts(file_stream, stream_info, **_kwargs)
  File "markitdown\converters\_ipynb_converter.py", line 36, in accepts
    notebook_content = file_stream.read().decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 43916: ordinal not in range(128)
The affected file was a standard French-language PDF invoice containing UTF-8 encoded characters (é, è, à... → bytes starting with 0xc3).
Root Cause
In accepts(), when the MIME type matches application/json, the code decodes the full file stream content:
# _ipynb_converter.py — current code, unsafe
encoding = stream_info.charset or "utf-8" # was "ascii" in older versions
notebook_content = file_stream.read().decode(encoding)
If the file contains bytes that cannot be decoded (wrong charset, binary content), a UnicodeDecodeError propagates uncaught and kills the full conversion chain. The semantic intent of accepts() is simply to return True or False — it should never raise.
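To make the failure mode concrete, here is a minimal sketch of the decode step in isolation. The helper name `try_decode` and the sample byte strings are made up for illustration and are not part of markitdown; the only behavior relied on is that of the standard `bytes.decode()`.

```python
def try_decode(data: bytes, encoding: str) -> tuple[bool, str]:
    """Return (True, text) on success, or (False, short error description)."""
    try:
        return True, data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the codec rejected.
        return False, f"{exc.encoding}: byte 0x{data[exc.start]:02x} at position {exc.start}"


# Illustrative inputs (assumptions, not taken from the affected invoice).
accented = "Facture réglée".encode("utf-8")  # "é" becomes the bytes 0xc3 0xa9
binary = b"%PDF-1.7\n\xff\xfe"               # 0xff is never valid in UTF-8

print(try_decode(accented, "utf-8"))  # decodes fine
print(try_decode(accented, "ascii"))  # fails on 0xc3, as in the traceback
print(try_decode(binary, "utf-8"))    # fails even with the "utf-8" default
```

The last call suggests that changing the default from "ascii" to "utf-8" narrows the problem but does not remove it: binary content can still be undecodable.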
Expected Behavior
accepts() should return False gracefully when the file cannot be decoded, instead of raising an exception.
Proposed Fix
Wrap the decode block in a try/except (UnicodeDecodeError, ValueError):
for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
    if mimetype.startswith(prefix):
        cur_pos = file_stream.tell()
        try:
            encoding = stream_info.charset or "utf-8"
            notebook_content = file_stream.read().decode(encoding)
            return (
                "nbformat" in notebook_content
                and "nbformat_minor" in notebook_content
            )
        except (UnicodeDecodeError, ValueError):
            # File contains non-decodable bytes, so it is definitely not a notebook
            return False
        finally:
            file_stream.seek(cur_pos)
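The same pattern can be exercised outside markitdown with in-memory streams. The sketch below is a standalone stand-in, not the library's code: `looks_like_notebook` and its `charset` parameter are hypothetical names, and catching `LookupError` (which `bytes.decode()` raises for an unknown codec name) is an extra assumption beyond the fix proposed above.

```python
import io
from typing import BinaryIO, Optional


def looks_like_notebook(file_stream: BinaryIO, charset: Optional[str] = None) -> bool:
    """Hypothetical stand-in for the decode-and-sniff step of accepts()."""
    cur_pos = file_stream.tell()
    try:
        text = file_stream.read().decode(charset or "utf-8")
        return "nbformat" in text and "nbformat_minor" in text
    except (UnicodeDecodeError, ValueError, LookupError):
        # Undecodable bytes or an unknown charset name: report "not a notebook".
        return False
    finally:
        # Runs on every path, so the next converter reads from the same offset.
        file_stream.seek(cur_pos)


notebook = io.BytesIO(b'{"nbformat": 4, "nbformat_minor": 5, "cells": []}')
binary = io.BytesIO(b"%PDF-1.7\n\xff\xfe\xc3")

print(looks_like_notebook(notebook))           # recognised as a notebook
print(looks_like_notebook(binary))             # rejected without raising
print(looks_like_notebook(notebook, "bogus"))  # unknown charset, also rejected
print(binary.tell())                           # stream position is restored
```

A regression test along these lines would cover both the return value and the restored stream position, which is what the other converters in the chain depend on.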
Impact
- Severity: High. Any file that contains non-ASCII bytes and whose MIME type starts with application/json will crash the pipeline.
- Affected environments: Windows production workers (Python default locale may fall back to ASCII).
- Workaround: Catch UnicodeDecodeError in the calling code, around the call to convert().
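The workaround could look roughly like the sketch below. The wrapper name `convert_or_none` is hypothetical, and it takes the conversion function as an argument so the example runs without markitdown installed; in real code one would presumably pass `MarkItDown().convert`.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def convert_or_none(convert: Callable[[str], T], path: str) -> Optional[T]:
    """Call convert(path); return None instead of crashing on a decode error."""
    try:
        return convert(path)
    except UnicodeDecodeError as exc:
        print(f"skipped {path}: {exc}")
        return None


# Stand-in for a converter that hits the bug (for demonstration only).
def failing_convert(path: str) -> str:
    raise UnicodeDecodeError("ascii", b"\xc3", 0, 1, "ordinal not in range(128)")


print(convert_or_none(failing_convert, "invoice_french.pdf"))
# Assumed real usage: convert_or_none(MarkItDown().convert, "invoice_french.pdf")
```

Note the cost of this approach: the affected file is skipped rather than converted, so it only keeps the worker alive until accepts() itself is fixed.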
Steps to Reproduce
from markitdown import MarkItDown
md = MarkItDown()
# Any PDF with French/accented text (non-ASCII bytes)
result = md.convert("invoice_french.pdf")
# → UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...