bug: IpynbConverter.accepts() raises UnicodeDecodeError on non-ASCII files (French PDFs, etc.) #1894

## Description

`IpynbConverter.accepts()` reads the raw file stream and decodes it to check if the file is a Jupyter notebook. When the decode fails (non-ASCII bytes in the file), the exception propagates uncaught and **crashes the entire conversion pipeline** — even though the file has nothing to do with Jupyter notebooks.

## Traceback (production environment)

```
  File "markitdown\_markitdown.py", line 601, in _convert
    _accepts = converter.accepts(file_stream, stream_info, **_kwargs)
  File "markitdown\converters\_ipynb_converter.py", line 36, in accepts
    notebook_content = file_stream.read().decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 43916: ordinal not in range(128)
```

The affected file was a standard **French-language PDF invoice** containing UTF-8 encoded characters (`é`, `è`, `à`... → bytes starting with `0xc3`).

## Root Cause

In `accepts()`, when the MIME type matches `application/json`, the code decodes the full file stream content:

```python
# _ipynb_converter.py — current code, unsafe
encoding = stream_info.charset or "utf-8"  # was "ascii" in older versions
notebook_content = file_stream.read().decode(encoding)
```

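If the file contains bytes that cannot be decoded with the selected encoding (wrong charset, binary content), a `UnicodeDecodeError` propagates uncaught and aborts the whole conversion chain. The `'ascii'` codec named in the traceback suggests that `encoding` resolved to ASCII for this file (through `stream_info.charset`, or through the older default), so the first `0xc3` byte was enough to trigger the error. The purpose of `accepts()` is simply to return `True` or `False`; it should **never raise**.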
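The failure mode can be reproduced without a PDF or the library itself. The snippet below is a minimal sketch: `decode_like_accepts` is a hypothetical stand-in that mirrors only the two quoted lines, and the payload is made-up French text rather than the real invoice.

```python
import io

# Illustrative payload (not the real invoice): "é" encodes to 0xc3 0xa9 in UTF-8.
payload = "Réglé à réception".encode("utf-8")


def decode_like_accepts(file_stream, charset):
    """Hypothetical stand-in for the decode step quoted above."""
    encoding = charset or "utf-8"
    return file_stream.read().decode(encoding)


error = None
try:
    # Assumption: the charset resolved for the stream was "ascii".
    decode_like_accepts(io.BytesIO(payload), "ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc.encoding, hex(payload[exc.start]), "at position", exc.start)
```

With UTF-8 as the encoding the same payload decodes cleanly, which is why the error depends on the charset chosen for the stream and not only on the file content.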

## Expected Behavior

`accepts()` should return `False` gracefully when the file cannot be decoded, instead of raising an exception.

## Proposed Fix

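Wrap the decode block in a `try/except (UnicodeDecodeError, ValueError)` and return `False` on failure. `UnicodeDecodeError` is a subclass of `ValueError`, so naming both is redundant but harmless. The existing `finally` clause still restores the stream position: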

```python
for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
    if mimetype.startswith(prefix):
        cur_pos = file_stream.tell()
        try:
            encoding = stream_info.charset or "utf-8"
            notebook_content = file_stream.read().decode(encoding)
            return (
                "nbformat" in notebook_content
                and "nbformat_minor" in notebook_content
            )
        except (UnicodeDecodeError, ValueError):
            # File contains non-decodable bytes — definitely not a notebook
            return False
        finally:
            file_stream.seek(cur_pos)
```
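To check the behaviour in isolation, the same logic can be pulled out into a standalone function. This is a sketch under assumptions, not the library's code: `looks_like_notebook` is a hypothetical name, and catching `LookupError` (raised by `decode()` for an unknown encoding name) is an addition that goes beyond the fix proposed above.

```python
import io
from typing import BinaryIO, Optional


def looks_like_notebook(file_stream: BinaryIO, charset: Optional[str] = None) -> bool:
    """Hypothetical helper: sniff for notebook markers without ever raising on bad bytes."""
    cur_pos = file_stream.tell()
    try:
        text = file_stream.read().decode(charset or "utf-8")
        return "nbformat" in text and "nbformat_minor" in text
    except (ValueError, LookupError):
        # ValueError covers UnicodeDecodeError; LookupError covers unknown codec names.
        return False
    finally:
        # Runs on every path, so later converters see the stream where they expect it.
        file_stream.seek(cur_pos)


pdf_like = io.BytesIO(b"%PDF-1.7\nFactur\xc3\xa9e \xff\xfe")
notebook = io.BytesIO(b'{"nbformat": 4, "nbformat_minor": 5, "cells": []}')

print(looks_like_notebook(pdf_like, "ascii"))
print(looks_like_notebook(notebook))
```

The first call reports that the PDF-like bytes are not a notebook instead of raising, and in both cases the stream position is left unchanged.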

## Impact

- **Severity**: High. Any file whose MIME type starts with `application/json` and whose bytes cannot be decoded with the resolved charset will crash the pipeline.
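- **Affected environments**: Observed on Windows production workers. Because `decode()` is called with an explicit encoding, the trigger is most likely the charset resolved for the stream rather than the Python default locale, so other platforms may be affected as well.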
- **Workaround**: Catch `UnicodeDecodeError` in the calling code before conversion.
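The workaround can be sketched as a small wrapper around the conversion call. `convert_or_none` is a hypothetical helper, and `failing_convert` is a stand-in that only imitates the decode failure from the traceback so the example runs without the library or a PDF.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def convert_or_none(convert: Callable[[str], T], source: str) -> Optional[T]:
    """Hypothetical caller-side guard: return None instead of letting the decode error escape."""
    try:
        return convert(source)
    except UnicodeDecodeError:
        return None


def failing_convert(source: str) -> str:
    # Stand-in for the real conversion call; fails the same way as the traceback.
    return b"\xc3\xa9".decode("ascii")


result = convert_or_none(failing_convert, "invoice_french.pdf")
print(result)
```

In real use the first argument would be the `convert` method of a `MarkItDown` instance. This is only a stopgap: the affected file is skipped rather than converted.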

## Steps to Reproduce

```python
from markitdown import MarkItDown
md = MarkItDown()
# Any PDF with French/accented text (non-ASCII bytes)
result = md.convert("invoice_french.pdf")
# → UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...
```

