Skip to content

Conversation

@badGarnet
Copy link
Collaborator

@badGarnet badGarnet commented Jan 15, 2026

This PR introduces a new chunking option isolate_tables.

  • default to False and keeps current chunking behavior
  • when True tables are not chunked with any other element when using chunk_elements and chunk_by_title

@badGarnet
Copy link
Collaborator Author

@claude review

@claude
Copy link

claude bot commented Jan 15, 2026

Claude finished @badGarnet's task —— View job


Code Review for Table Isolation in Chunking

Todo List

  • Read repository setup instructions (CLAUDE.md)
  • Review core implementation changes in unstructured/chunking/base.py
  • Analyze test changes in test_unstructured/chunking/test_basic.py
  • Analyze test changes in test_unstructured/chunking/test_title.py
  • Review changelog and version updates
  • Provide comprehensive review feedback

Review Summary

This PR successfully implements table isolation in chunking functionality. The implementation is clean, well-tested, and correctly documented.

Strengths

  1. Simple and Effective Implementation: The core logic change in unstructured/chunking/base.py is minimal and precisely targeted:

    # Lines 386-388: Prevents tables from being added to non-empty pre-chunks
    if element.category == "Table" or self._elements[-1].category == "Table":
        return False
    
    # Lines 454-456: Prevents combining pre-chunks that contain tables
    if self._elements[-1].category == "Table" or pre_chunk._elements[0].category == "Table":
        return False
  2. Comprehensive Test Coverage:

    • Added test_basic_chunk_isolates_tables() in both test files
    • Updated existing tests to reflect new behavior (e.g., test_it_splits_elements_by_title_and_table() now expects 4 chunks instead of 3)
    • Removed tests for combining tables with text elements, which is now disallowed
  3. Proper Documentation:

    • Clear changelog entry explaining the behavior change
    • Updated docstring at line 260: "A Table element is placed into a section by itself"
  4. Consistent Behavior: The isolation logic is applied at both the pre-chunk building stage (PreChunkBuilder.will_fit()) and the pre-chunk combination stage (PreChunk.can_combine())

🔍 Code Quality Assessment

  • Logic: Sound implementation that prevents tables from being grouped with other elements
  • Performance: No performance impact - simple category string comparisons
  • Maintainability: Clean, readable code with appropriate comments
  • Testing: Good test coverage with realistic scenarios using actual Excel files

📝 Minor Observations

  1. Version Bump: Correctly incremented from 0.18.30-dev2 to 0.18.30-dev3
  2. Test Cleanup: Appropriately removed obsolete tests for table-text combination
  3. Backward Compatibility: This is a behavior change that could affect existing users, but it's clearly documented as an enhancement

Recommendation: APPROVE

This PR implements the stated feature correctly with appropriate tests and documentation. The code is clean, the logic is sound, and the implementation aligns perfectly with the PR description goal of isolating tables for "quick identification of tabular data after chunking."

@badGarnet
Copy link
Collaborator Author

@claude review

@claude
Copy link

claude bot commented Jan 16, 2026

Claude finished @badGarnet's task —— View job


Code Review for Table Isolation in Chunking

Todo List

  • Read repository setup instructions (CLAUDE.md)
  • Review core implementation changes in unstructured/chunking/base.py
  • Analyze test changes in test_unstructured/chunking/test_basic.py
  • Analyze test changes in test_unstructured/chunking/test_title.py
  • Review changelog and version updates
  • Examine parameter implementation in basic.py and title.py
  • Provide comprehensive review feedback

✅ Review Summary

This PR successfully implements the isolate_tables feature for chunking. The implementation is clean, well-tested, and properly integrated across the codebase.

📋 Implementation Analysis

1. Core Logic Changes (unstructured/chunking/base.py)

The implementation correctly adds table isolation logic in two key places:

Lines 137-142: New isolate_tables property with proper default handling:

@lazyproperty
def isolate_tables(self) -> bool:
    """when True tables are not combined with any other elements no matter the circumstances.
    Default to False.
    """
    arg_value = self._kwargs.get("isolate_tables")
    return False if arg_value is None else bool(arg_value)

Lines 395-398: Prevention logic in PreChunkBuilder.will_fit():

if self._opts.isolate_tables and (
    element.category == "Table" or self._elements[-1].category == "Table"
):
    return False

Lines 464-467: Prevention logic in PreChunk.can_combine():

if self._opts.isolate_tables and (
    self._elements[-1].category == "Table" or pre_chunk._elements[0].category == "Table"
):
    return False

2. API Integration (basic.py and title.py)

  • ✅ Both chunking strategies properly expose the isolate_tables parameter
  • ✅ Parameter correctly passed through to the underlying chunking options
  • ✅ Default value of False maintains backward compatibility

3. Test Coverage

test_basic.py (lines 172-182):

  • ✅ Comprehensive test using real Excel data (stanley-cups.xlsx)
  • ✅ Validates both default behavior (tables combined) and new behavior (tables isolated)
  • ✅ Proper assertions on chunk count and element types

test_title.py (lines 47-55):

  • ✅ Dedicated test for isolate_tables functionality
  • ✅ Uses realistic test data from JSON fixtures
  • ✅ Validates correct chunk separation and types

4. Documentation & Versioning

  • ✅ Proper changelog entry describing the new feature
  • ✅ Version bumped from 0.18.30-dev3 to 0.18.30-dev4
  • ✅ Clear docstring documentation in the base class

🔍 Code Quality Assessment

Strengths:

  1. Minimal & Targeted: The logic changes are precisely where they need to be with no unnecessary modifications
  2. Consistent Logic: The same isolation check is applied at both the pre-chunk building and combining stages
  3. Performance: Uses simple string comparisons with no performance impact
  4. Backward Compatible: Default False value preserves existing behavior
  5. Type Safety: Proper type hints and boolean conversion

Architecture:

  • The implementation respects the existing chunking architecture
  • Changes are made at the appropriate abstraction levels
  • No breaking changes to existing APIs

🧪 Testing Quality

Test Coverage:

  • ✅ Tests both chunking strategies (basic and by_title)
  • ✅ Uses real document data (Excel files, JSON fixtures)
  • ✅ Validates both positive and negative cases
  • ✅ Proper assertions on element types and chunk structure

Test Design:

  • Tests are focused and test exactly what they claim to test
  • Good use of realistic test data rather than mocked objects
  • Clear assertions that verify the expected behavior

📝 Minor Observations

  1. Documentation: The docstring at line 268 in base.py could be slightly more specific: "A Table element is placed into a section by itself when isolate_tables is True"

  2. Edge Cases: The implementation correctly handles edge cases:

    • Empty pre-chunks always accept any element (line 392)
    • Table isolation is only enforced when the option is enabled
    • Works correctly with both text-to-table and table-to-text scenarios

Final Recommendation: APPROVE

This is a well-implemented feature that:

  • ✅ Solves the stated problem of isolating tables for better chunking
  • ✅ Maintains backward compatibility with a sensible default
  • ✅ Has comprehensive test coverage
  • ✅ Follows existing code patterns and architecture
  • ✅ Is properly documented and versioned

The code is production-ready and the implementation approach is sound. The feature enables users to get cleaner table isolation when needed while preserving the current behavior by default.


@badGarnet badGarnet marked this pull request as ready for review January 16, 2026 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants