bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.) #1894

@echavet

Description

IpynbConverter.accepts() reads the raw file stream and decodes it to check if the file is a Jupyter notebook. When the decode fails (non-ASCII bytes in the file), the exception propagates uncaught and crashes the entire conversion pipeline — even though the file has nothing to do with Jupyter notebooks.

Traceback (production environment)

  File "markitdown\_markitdown.py", line 601, in _convert
    _accepts = converter.accepts(file_stream, stream_info, **_kwargs)
  File "markitdown\converters\_ipynb_converter.py", line 36, in accepts
    notebook_content = file_stream.read().decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 43916: ordinal not in range(128)

The affected file was a standard French-language PDF invoice containing UTF-8 encoded characters (é, è, à... → bytes starting with 0xc3).
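The decode failure itself can be reproduced without markitdown. The following is a minimal standard-library sketch, not taken from the markitdown code base: it checks that é, è and à each encode to a UTF-8 sequence whose first byte is 0xc3, and that decoding such bytes with the ascii codec raises the same exception type as in the traceback. The sample string is made up for illustration.

```python
# Each of these accented characters is two bytes in UTF-8, starting with 0xc3.
for ch in "éèà":
    encoded = ch.encode("utf-8")
    assert encoded[0] == 0xC3, ch

# Illustrative text only; any string containing these characters behaves the same.
sample = "Facture réglée".encode("utf-8")

try:
    sample.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    # The ascii codec rejects every byte >= 0x80.
    decode_error = exc

print(decode_error)
```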

Root Cause

In accepts(), when the MIME type matches application/json, the code decodes the full file stream content:

# _ipynb_converter.py — current code, unsafe
encoding = stream_info.charset or "utf-8"  # was "ascii" in older versions
notebook_content = file_stream.read().decode(encoding)

If the file contains bytes that cannot be decoded (wrong charset, binary content), a UnicodeDecodeError propagates uncaught and kills the full conversion chain. The semantic intent of accepts() is simply to return True or False — it should never raise.

Expected Behavior

accepts() should return False gracefully when the file cannot be decoded, instead of raising an exception.

Proposed Fix

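Wrap the decode block in a try/except that catches UnicodeDecodeError and ValueError, and restore the stream position in a finally block. UnicodeDecodeError is already a subclass of ValueError, so listing both is redundant but harmless: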

for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
    if mimetype.startswith(prefix):
        cur_pos = file_stream.tell()
        try:
            encoding = stream_info.charset or "utf-8"
            notebook_content = file_stream.read().decode(encoding)
            return (
                "nbformat" in notebook_content
                and "nbformat_minor" in notebook_content
            )
        except (UnicodeDecodeError, ValueError):
            # File contains non-decodable bytes — definitely not a notebook
            return False
        finally:
            file_stream.seek(cur_pos)
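To exercise the same logic outside the converter class, here is a self-contained sketch. The function name looks_like_notebook and its charset parameter are hypothetical stand-ins for accepts() and stream_info.charset, not part of markitdown. It also catches LookupError, which bytes.decode() raises for an unknown charset name. That goes slightly beyond the fix above and is an assumption about the desired behavior.

```python
import io


def looks_like_notebook(file_stream, charset=None):
    """Hypothetical stand-in for accepts(): True/False, never raises on bad bytes."""
    cur_pos = file_stream.tell()
    try:
        encoding = charset or "utf-8"
        text = file_stream.read().decode(encoding)
        return "nbformat" in text and "nbformat_minor" in text
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes or an unknown charset name: treat as "not a notebook".
        return False
    finally:
        # Always leave the stream where the caller had it.
        file_stream.seek(cur_pos)


# A notebook-like JSON payload is accepted.
notebook = io.BytesIO(b'{"nbformat": 4, "nbformat_minor": 5}')
print(looks_like_notebook(notebook))

# UTF-8 bytes read with an ascii charset are rejected instead of raising.
invoice = io.BytesIO("Facture réglée".encode("utf-8"))
print(looks_like_notebook(invoice, charset="ascii"))
```

In both cases the stream position is unchanged afterwards, so the next converter in the chain still sees the file from where it was.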

Impact

  • Severity: High — any non-ASCII file whose MIME type starts with application/json will crash the pipeline.
  • Affected environments: Windows production workers (Python default locale may fall back to ASCII).
  • Workaround: Catch UnicodeDecodeError in the calling code before conversion.
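The workaround could look roughly like the sketch below. convert_or_none is a hypothetical helper, not a markitdown API. It takes the conversion callable as an argument so the snippet runs on its own. Note that it only prevents the crash: the affected file is skipped, not converted.

```python
def convert_or_none(convert, source):
    """Call convert(source); return None if a UnicodeDecodeError escapes."""
    try:
        return convert(source)
    except UnicodeDecodeError:
        # Raised from IpynbConverter.accepts() on affected versions.
        return None


# Intended usage, assuming markitdown is installed:
#   from markitdown import MarkItDown
#   result = convert_or_none(MarkItDown().convert, "invoice_french.pdf")
#   if result is None:
#       ...  # log and skip the file
```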

Steps to Reproduce

from markitdown import MarkItDown
md = MarkItDown()
# Any PDF with French/accented text (non-ASCII bytes)
result = md.convert("invoice_french.pdf")
# → UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...
