
@eureka928
Contributor

Problem

Fixes #4095

When chunk_elements() is used on markdown files that contain code blocks, line breaks within the code were discarded, leaving the code flattened and unreadable:

# Before fix - code becomes flattened:
"def hello(): print('Hello') return True"

# Expected - preserve formatting:
"def hello():\n    print('Hello')\n    return True"

Root Cause

Two issues were identified:

  1. HTML Parser: <pre> elements generated generic Text elements instead of CodeSnippet elements
  2. Chunking: The _iter_text_segments() method normalized all whitespace to single spaces, destroying newlines
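
To make the second cause concrete, here is a minimal standalone sketch of the effect. It does not call the library; it only applies the same kind of split-and-rejoin normalization to the example string from the Problem section.

```python
code = "def hello():\n    print('Hello')\n    return True"

# Split on any run of whitespace (spaces, tabs, newlines) and rejoin with
# single spaces. This is the normalization that flattens the code.
flattened = " ".join(code.strip().split())

# Stripping only the outer whitespace leaves interior newlines and
# indentation untouched.
preserved = code.strip()

print(repr(flattened))
print(repr(preserved))
```

The first value matches the "before fix" string above, the second matches the expected output.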

Solution

1. HTML Parser Change (unstructured/partition/html/parser.py)

Made <pre> elements generate CodeSnippet elements:

class Pre(BlockItem):
    """Custom element-class for `<pre>` element.

    Can only contain phrasing content. Generates CodeSnippet elements to preserve
    code formatting including whitespace and line breaks.
    """

    _ElementCls = CodeSnippet  # Added this line
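
For readers unfamiliar with the parser, the idea of mapping `<pre>` to a dedicated element type can be illustrated with a toy parser built on the standard library's `html.parser`. This is only a sketch under stated assumptions: `PreAwareParser` and its tuple output are hypothetical and are not how the unstructured HTML parser is implemented.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class PreAwareParser(HTMLParser):
    """Toy parser (hypothetical): tags <pre> content as "CodeSnippet", <p> as "Text"."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Tuple[str, str]] = []
        self._tag: Optional[str] = None
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only <pre> and <p> open a new element; nested tags such as <code>
        # are ignored so their text is collected into the enclosing element.
        if tag in ("pre", "p"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        raw = "".join(self._buf)
        if tag == "pre":
            # Keep interior whitespace; drop only surrounding newlines.
            self.elements.append(("CodeSnippet", raw.strip("\n")))
        else:
            # Ordinary prose: collapse whitespace runs to single spaces.
            self.elements.append(("Text", " ".join(raw.split())))
        self._tag = None


parser = PreAwareParser()
parser.feed(
    "<p>Example\n  function:</p>"
    "<pre><code>def hello():\n    return True\n</code></pre>"
)
parser.close()
print(parser.elements)
```

Once the element type records that the text is code, later stages such as chunking can choose to treat its whitespace as significant.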

2. Chunking Change (unstructured/chunking/base.py)

Modified _iter_text_segments() to preserve whitespace for CodeSnippet elements:

def _iter_text_segments(self) -> Iterator[str]:
    """Generate overlap text and each element text segment in order.

    Empty text segments are not included. CodeSnippet elements preserve their
    original whitespace (including newlines) to maintain code formatting.
    """
    if self._overlap_prefix:
        yield self._overlap_prefix
    for e in self._elements:
        if e.text and len(e.text):
            # -- preserve whitespace for code snippets to maintain formatting --
            if isinstance(e, CodeSnippet):
                text = e.text.strip()
            else:
                text = " ".join(e.text.strip().split())
            if text:
                yield text
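
The behavior of the modified method can be tried out in isolation with stand-in classes. In the sketch below, `Text`, `CodeSnippet`, and the free function `iter_text_segments` are simplified stand-ins written for illustration, not the library's own classes; only the branching logic mirrors the change above.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass
class Text:
    """Stand-in for a generic text element (not the library class)."""

    text: str


@dataclass
class CodeSnippet(Text):
    """Stand-in for a code element whose whitespace is significant."""


def iter_text_segments(
    elements: Sequence[Text], overlap_prefix: str = ""
) -> Iterator[str]:
    """Yield the overlap prefix, then one non-empty text segment per element."""
    if overlap_prefix:
        yield overlap_prefix
    for e in elements:
        if not e.text:
            continue
        if isinstance(e, CodeSnippet):
            # Code: trim the ends only, keep newlines and indentation.
            text = e.text.strip()
        else:
            # Prose: collapse every whitespace run to a single space.
            text = " ".join(e.text.split())
        if text:
            yield text


segments = list(
    iter_text_segments(
        [
            Text("Some   prose\nwith a wrapped line."),
            CodeSnippet("def hello():\n    print('Hello')\n    return True\n"),
            Text("   "),  # whitespace-only, so it produces no segment
        ]
    )
)
print(segments)
```

One consequence of using `.strip()` in the code branch is that indentation on the very first line of a snippet is also removed; this only matters when a snippet begins with an indented line.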

Files Changed

| File | Change |
| --- | --- |
| unstructured/partition/html/parser.py | Added CodeSnippet import, set _ElementCls = CodeSnippet in Pre class |
| unstructured/chunking/base.py | Added CodeSnippet import, special handling in _iter_text_segments() |
| test_unstructured/partition/html/test_parser.py | Added test for CodeSnippet generation, updated existing test |
| test_unstructured/chunking/test_base.py | Added 2 tests for whitespace preservation |

Contribution by Gittensor, see my contribution statistics at https://gittensor.io/miners/details?githubId=42954461

@eureka928 eureka928 force-pushed the fix/preserve-code-block-line-breaks branch 3 times, most recently from 169fd08 to 294f471 Compare January 23, 2026 02:07
@eureka928 eureka928 requested a review from badGarnet January 23, 2026 02:17
@eureka928
Contributor Author

The test_ingest_src check is failing, and I don't know the reason.


@eureka928 eureka928 force-pushed the fix/preserve-code-block-line-breaks branch from 0bde867 to cba9237 Compare January 24, 2026 06:42
@eureka928
Contributor Author

@badGarnet finally, all checks passed!
Hope you can merge this when you have a sec.
Have a nice weekend!

@eureka928 eureka928 force-pushed the fix/preserve-code-block-line-breaks branch from 270ce94 to 27fdd9a Compare January 26, 2026 08:09
…x to correct version

- Restored "Resolve GHSA-58pv-8j8x-9vj2" in 0.18.30 section
- Moved Unstructured-IO#4095 fix to 0.18.31-dev1 section
@eureka928 eureka928 requested a review from badGarnet January 26, 2026 16:01
@badGarnet badGarnet enabled auto-merge January 26, 2026 16:19
@badGarnet badGarnet added this pull request to the merge queue Jan 26, 2026
Merged via the queue into Unstructured-IO:main with commit d4caedf Jan 26, 2026
40 checks passed

Development

Successfully merging this pull request may close these issues.

bug/The line breaks in the code after the chunk have been discarded.
