Skip to content

Conversation

@harumiWeb
Copy link
Owner

@harumiWeb harumiWeb commented Jan 5, 2026

Summary by CodeRabbit

  • New Features

    • Extract merged-cell ranges from Excel and opt to include/exclude them in outputs
    • New output/filter option to control merged-cell inclusion
    • Added rich example demonstrating form reconstruction with many merged cells
  • Documentation

    • Updated READMEs and docs with merged-cell details, examples, and usage notes
    • Added sample application forms and a Gantt chart example for real-world scenarios

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

Walkthrough

A new feature adds merged cell range extraction and filtering across ExStruct: MergedCell model, extraction via OpenPyXL, backend protocol extension, pipeline step, engine options and output filtering, updated modeling, tests, samples, and docs reflecting merged_cells support.

Changes

Cohort / File(s) Summary
Documentation
README.md, README.ja.md, docs/README.en.md, docs/README.ja.md, docs/release-notes/v0.3.2.md, mkdocs.yml
Added/updated examples and descriptions to include merged cell ranges, renamed benchmark → example, added Example 2 and release notes, and updated navigation.
Data model & public API surface
src/exstruct/models/__init__.py, docs/agents/DATA_MODEL.md, docs/agents/FEATURE_SPEC.md, pyproject.toml
Added MergedCell type and SheetData.merged_cells; bumped docs/spec versions and project version to v0.3.2.
Core extraction utilities
src/exstruct/core/cells.py, src/exstruct/core/backends/base.py
New function extract_sheet_merged_cells(Path) -> dict[str, list[MergedCell]]; added MergedCellData alias and Backend.extract_merged_cells() method.
Backends
src/exstruct/core/backends/openpyxl_backend.py, src/exstruct/core/backends/com_backend.py
Openpyxl backend implements extract_merged_cells() delegating to cell utils with safe-fallback; COM backend exposes method raising NotImplementedError.
Pipeline & integration
src/exstruct/core/pipeline.py, src/exstruct/core/integrate.py, src/exstruct/core/modeling.py
Added include_merged_cells flag to inputs, artifacts, and resolve logic; new pipeline step step_extract_merged_cells_openpyxl; threaded merged_cell_data through raw collection and workbook building; extract_workbook() gains include_merged_cells param.
Engine & filtering
src/exstruct/engine.py
StructOptions and FilterOptions gain include_merged_cells; engine forwards option to extraction and conditionally filters merged_cells from outputs.
Samples
Forms:
sample/forms_with_many_merged_cells/en_form_sf425/*, sample/forms_with_many_merged_cells/ja_general_form/*
Gantt:
sample/gantt_chart/en/*, (deleted) sample/gantt_chart/sample_gantt.json, sample/gantt_chart/sample_gantt_json_for_llm.md
Added merged-cell–heavy form samples (EN/JA) and notebook guidance; reorganized Gantt samples into en/ with new example script and JSON; removed older Japanese Gantt samples.
Tests — core & backends
tests/core/test_merged_cells_core.py, tests/backends/test_merged_cells.py, tests/backends/test_auto_page_breaks.py
Added unit tests for extract_sheet_merged_cells, Openpyxl backend success/failure, and COM backend NotImplemented behavior; adjusted helpers to accept new flag.
Tests — pipeline & integration
tests/core/test_pipeline.py, tests/core/test_pipeline_fallbacks.py, tests/integration/test_integrate_raw_data.py
Updated pipeline tests to expect merged-cells step presence/absence based on flag; propagated merged_cell_data in integration tests.
Tests — modeling, engine & export
tests/models/test_modeling.py, tests/engine/test_engine.py, tests/export/test_export_requirements.py
Updated tests to include merged_cells in raw/model data; added test ensuring empty merged_cells is omitted from serialized JSON.
Planning / QA docs
docs/agents/TASKS.md, docs/agents/TEST_REQUIREMENTS.md, docs/agents/ROADMAP.md
Converted SmartArt-related plans to merged-cell–focused tasks; added test cases (MRG-01..MRG-04), EXP-25, roadmap entries v0.3.2/v0.3.3, and coverage target updates.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Engine as ExStructEngine
    participant Pipeline as Extraction Pipeline
    participant Backend as OpenpyxlBackend
    participant Cells as Cell Utils
    participant Models as Modeling

    Client->>Engine: extract(file, mode, include_merged_cells=True)
    Engine->>Pipeline: extract_workbook(..., include_merged_cells=True)
    Pipeline->>Pipeline: resolve_extraction_inputs(...)

    alt include_merged_cells == True
        Pipeline->>Backend: extract_merged_cells(file)
        Backend->>Cells: extract_sheet_merged_cells(file)
        Cells->>Cells: iterate ws.merged_cells.ranges\ncompute (r1,c1,r2,c2) and top-left value
        Cells-->>Backend: dict[sheet -> list[MergedCell]]
        Backend-->>Pipeline: merged_cell_data
        Pipeline->>Pipeline: step_extract_merged_cells_openpyxl()
        Pipeline->>Models: collect_sheet_raw_data(..., merged_cell_data)
    else
        Pipeline->>Models: collect_sheet_raw_data(..., merged_cells=[])
    end

    Pipeline->>Models: build_sheet_data(raw.merged_cells)
    Models->>Engine: WorkbookData(sheets with merged_cells)
    alt FilterOptions.include_merged_cells == True
        Engine-->>Client: WorkbookData includes merged_cells
    else
        Engine-->>Client: WorkbookData omits merged_cells
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Dev/refactor #23 – Refactors core pipeline/backends/models that this PR extends; closely related at the code/interface level.

Poem

🐇 I hop through rows where ranges hide,
I stitch the cells that once were wide.
r1 to c2, a tidy sweep,
Merged ranges saved for LLMs to keep.
Hooray—JSON tidy, snug, and neat! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.55% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Added function to get merged cell range' accurately describes the primary feature introduced in this PR—merged cell extraction functionality across multiple backend implementations and data models.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 95.83333% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/exstruct/core/cells.py 88.88% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (8)
sample/gantt_chart/en/sample.json_for_llm.md (1)

3-3: Optional: Consider using a proper heading level instead of emphasis.

Line 3 uses italic emphasis for the subtitle. While this is stylistically acceptable, consider using a lower-level heading (e.g., ### (Original, Redistributable Template)) for better semantic structure and accessibility.

🔎 Proposed fix
-_(Original, Redistributable Template)_
+### (Original, Redistributable Template)
src/exstruct/models/__init__.py (1)

79-87: Suggest adding Field descriptor to v for consistency.

The v field (line 86) uses a simple default value "" while all other fields in MergedCell use Field() descriptors with descriptions. For consistency and clarity, consider adding a Field descriptor with a description.

🔎 Proposed fix
-    v: str = ""
+    v: str = Field(default="", description="Representative value of the merged cell range.")
tests/export/test_export_requirements.py (1)

117-120: Parse JSON instead of substring checking for more robust assertions.

The current test performs substring matching on the JSON string ("merged_cells" not in payload), which is fragile. If "merged_cells" appeared anywhere in the JSON (e.g., in a value or nested key), the test would incorrectly fail.

🔎 Proposed refactor to parse and validate JSON structure
 def test_merged_cells_empty_is_omitted_in_sheet_json() -> None:
     sheet = SheetData(rows=[CellRow(r=1, c={"0": "v"})], merged_cells=[])
-    payload = sheet.to_json()
-    assert "merged_cells" not in payload
+    payload_json = sheet.to_json()
+    payload = json.loads(payload_json)
+    assert "merged_cells" not in payload
src/exstruct/core/backends/com_backend.py (1)

157-159: Enhance docstring to match the file's Google-style pattern.

Other methods in this file (extract_print_areas, extract_colors_map, extract_auto_page_breaks) use Google-style docstrings with Returns and/or Raises sections. The new extract_merged_cells method should follow the same pattern for consistency.

🔎 Proposed docstring enhancement
     def extract_merged_cells(self) -> MergedCellData:
-        """Extract merged cell ranges via COM (not implemented)."""
+        """Extract merged cell ranges via COM (not implemented).
+
+        Returns:
+            Mapping of sheet name to merged cell ranges.
+
+        Raises:
+            NotImplementedError: COM merged cell extraction is not yet supported.
+        """
         raise NotImplementedError("COM merged cell extraction is not implemented.")
src/exstruct/core/backends/openpyxl_backend.py (1)

91-100: Consider adding logging for consistency and debugging.

The method correctly delegates to extract_sheet_merged_cells and handles failures gracefully. However, unlike the similar extract_colors_map method (lines 67-89), this method silently swallows exceptions. Adding a debug or warning log would improve observability and align with the established pattern.

🔎 Proposed enhancement for consistency
 def extract_merged_cells(self) -> MergedCellData:
     """Extract merged cell ranges per sheet.

     Returns:
         Mapping of sheet name to merged cell ranges.
     """
     try:
         return extract_sheet_merged_cells(self.file_path)
-    except Exception:
+    except Exception as exc:
+        logger.debug(
+            "Merged cells extraction failed; returning empty dict. (%r)", exc
+        )
         return {}
tests/core/test_merged_cells_core.py (2)

8-27: Add docstrings to test functions per coding guidelines.

The helper function and test function are missing Google-style docstrings. Per the project's coding guidelines, all functions should include docstrings.

🔎 Suggested docstrings
 def _make_merged_book(path: Path) -> None:
+    """Create a test workbook with merged cell ranges.
+
+    Args:
+        path: Destination file path for the workbook.
+    """
     wb = Workbook()
 def test_extract_sheet_merged_cells_basic(tmp_path: Path) -> None:
+    """Test extraction of merged cell ranges from a workbook."""
     path = tmp_path / "merged.xlsx"

30-36: Add docstring to test function.

 def test_extract_sheet_merged_cells_empty(tmp_path: Path) -> None:
+    """Test that extraction returns empty list for workbooks without merged cells."""
     path = tmp_path / "plain.xlsx"
sample/forms_with_many_merged_cells/en_form_sf425/sample.json_for_llm.md (1)

1-149: Optional: Address markdown linting issues.

The static analysis tools flagged several markdown style issues, primarily using bold emphasis instead of proper headings (MD036) and heading level increments that skip levels (MD001). While these don't affect functionality, addressing them would improve document structure and consistency.

Since this is sample documentation, these style issues are minor and can be addressed at your discretion.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea60a35 and 9a87f42.

⛔ Files ignored due to path filters (7)
  • docs/assets/demo_form.ja.png is excluded by !**/*.png
  • docs/assets/demo_form_en.png is excluded by !**/*.png
  • sample/forms_with_many_merged_cells/en_form_sf425/sample.xlsx is excluded by !**/*.xlsx
  • sample/forms_with_many_merged_cells/ja_general_form/ja_form.xlsx is excluded by !**/*.xlsx
  • sample/gantt_chart/en/sample.xlsx is excluded by !**/*.xlsx
  • sample/gantt_chart/sample_gantt.xlsx is excluded by !**/*.xlsx
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (38)
  • README.ja.md
  • README.md
  • docs/README.en.md
  • docs/README.ja.md
  • docs/agents/DATA_MODEL.md
  • docs/agents/FEATURE_SPEC.md
  • docs/agents/ROADMAP.md
  • docs/agents/TASKS.md
  • docs/agents/TEST_REQUIREMENTS.md
  • sample/basic/sample.json_for_llm.md
  • sample/forms_with_many_merged_cells/en_form_sf425/README.md
  • sample/forms_with_many_merged_cells/en_form_sf425/sample.json
  • sample/forms_with_many_merged_cells/en_form_sf425/sample.json_for_llm.md
  • sample/forms_with_many_merged_cells/ja_general_form/ja_form.json
  • sample/forms_with_many_merged_cells/ja_general_form/ja_form.json_for_llm.md
  • sample/gantt_chart/en/example.py
  • sample/gantt_chart/en/sample.json
  • sample/gantt_chart/en/sample.json_for_llm.md
  • sample/gantt_chart/sample_gantt.json
  • sample/gantt_chart/sample_gantt_json_for_llm.md
  • src/exstruct/core/backends/base.py
  • src/exstruct/core/backends/com_backend.py
  • src/exstruct/core/backends/openpyxl_backend.py
  • src/exstruct/core/cells.py
  • src/exstruct/core/integrate.py
  • src/exstruct/core/modeling.py
  • src/exstruct/core/pipeline.py
  • src/exstruct/engine.py
  • src/exstruct/models/__init__.py
  • tests/backends/test_auto_page_breaks.py
  • tests/backends/test_merged_cells.py
  • tests/core/test_merged_cells_core.py
  • tests/core/test_pipeline.py
  • tests/core/test_pipeline_fallbacks.py
  • tests/engine/test_engine.py
  • tests/export/test_export_requirements.py
  • tests/integration/test_integrate_raw_data.py
  • tests/models/test_modeling.py
💤 Files with no reviewable changes (2)
  • sample/gantt_chart/sample_gantt_json_for_llm.md
  • sample/gantt_chart/sample_gantt.json
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Add type hints to all function and method arguments and return values, must be mypy strict compliant
Use Any type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Each function must have a single responsibility; cyclomatic complexity must not exceed 12
Organize imports in the following order: 1) standard library, 2) third-party packages, 3) exstruct internal modules
Add Google-style docstrings to all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external entities to Pydantic models at the boundary layer
Generated code must pass mypy strict mode with zero errors
Generated code must pass Ruff linter checks (E, W, F, I, B, UP, N, C90 rules) with zero errors
Avoid God Functions (large monolithic functions that do too much)
Avoid God Objects (classes that have too many responsibilities)
Eliminate unnecessary nesting and deeply nested conditional branches
Avoid circular dependencies between modules
Write functions that are testable and avoid unnecessary side effects

Files:

  • tests/export/test_export_requirements.py
  • src/exstruct/engine.py
  • tests/integration/test_integrate_raw_data.py
  • tests/backends/test_auto_page_breaks.py
  • src/exstruct/models/__init__.py
  • tests/core/test_merged_cells_core.py
  • src/exstruct/core/backends/com_backend.py
  • src/exstruct/core/cells.py
  • src/exstruct/core/backends/openpyxl_backend.py
  • tests/backends/test_merged_cells.py
  • src/exstruct/core/modeling.py
  • tests/core/test_pipeline.py
  • src/exstruct/core/integrate.py
  • tests/models/test_modeling.py
  • sample/gantt_chart/en/example.py
  • src/exstruct/core/pipeline.py
  • src/exstruct/core/backends/base.py
  • tests/core/test_pipeline_fallbacks.py
  • tests/engine/test_engine.py
🧬 Code graph analysis (14)
tests/export/test_export_requirements.py (1)
src/exstruct/models/__init__.py (2)
  • SheetData (145-237)
  • CellRow (89-98)
src/exstruct/engine.py (1)
src/exstruct/__init__.py (1)
  • process_excel (318-398)
tests/core/test_merged_cells_core.py (2)
src/exstruct/core/cells.py (1)
  • extract_sheet_merged_cells (529-561)
tests/backends/test_merged_cells.py (1)
  • _make_merged_book (14-20)
src/exstruct/core/backends/com_backend.py (3)
src/exstruct/core/backends/base.py (1)
  • extract_merged_cells (41-42)
src/exstruct/core/backends/openpyxl_backend.py (1)
  • extract_merged_cells (91-100)
tests/backends/test_backends.py (2)
  • test_com_backend_extract_print_areas_handles_sheet_error (104-123)
  • test_com_backend_extract_colors_map_returns_none_on_failure (78-101)
src/exstruct/core/cells.py (2)
src/exstruct/models/__init__.py (1)
  • MergedCell (79-86)
src/exstruct/core/workbook.py (1)
  • openpyxl_workbook (17-56)
src/exstruct/core/backends/openpyxl_backend.py (5)
src/exstruct/core/cells.py (1)
  • extract_sheet_merged_cells (529-561)
src/exstruct/core/ranges.py (1)
  • parse_range_zero_based (25-48)
src/exstruct/core/workbook.py (1)
  • openpyxl_workbook (17-56)
src/exstruct/core/backends/base.py (1)
  • extract_merged_cells (41-42)
src/exstruct/core/backends/com_backend.py (1)
  • extract_merged_cells (157-159)
tests/backends/test_merged_cells.py (3)
src/exstruct/core/backends/com_backend.py (1)
  • ComBackend (20-159)
src/exstruct/core/backends/openpyxl_backend.py (1)
  • OpenpyxlBackend (26-114)
src/exstruct/models/__init__.py (1)
  • MergedCell (79-86)
src/exstruct/core/modeling.py (1)
src/exstruct/models/__init__.py (1)
  • MergedCell (79-86)
tests/core/test_pipeline.py (1)
src/exstruct/core/pipeline.py (3)
  • build_pre_com_pipeline (255-338)
  • ExtractionInputs (43-66)
  • ExtractionArtifacts (70-89)
src/exstruct/core/integrate.py (2)
tests/integration/test_integrate_fallback.py (2)
  • test_extract_workbook_fallback_includes_print_areas (38-56)
  • test_extract_workbook_with_links (59-79)
src/exstruct/__init__.py (1)
  • extract (91-125)
sample/gantt_chart/en/example.py (2)
src/exstruct/engine.py (3)
  • ColorsOptions (33-56)
  • ExStructEngine (171-571)
  • StructOptions (60-84)
src/exstruct/__init__.py (1)
  • export (128-167)
src/exstruct/core/pipeline.py (4)
src/exstruct/models/__init__.py (2)
  • MergedCell (79-86)
  • PrintArea (136-142)
src/exstruct/core/backends/openpyxl_backend.py (2)
  • OpenpyxlBackend (26-114)
  • extract_merged_cells (91-100)
src/exstruct/core/backends/base.py (1)
  • extract_merged_cells (41-42)
src/exstruct/core/backends/com_backend.py (1)
  • extract_merged_cells (157-159)
src/exstruct/core/backends/base.py (3)
src/exstruct/models/__init__.py (1)
  • MergedCell (79-86)
src/exstruct/core/backends/com_backend.py (1)
  • extract_merged_cells (157-159)
src/exstruct/core/backends/openpyxl_backend.py (1)
  • extract_merged_cells (91-100)
tests/engine/test_engine.py (2)
src/exstruct/models/__init__.py (1)
  • MergedCell (79-86)
src/exstruct/engine.py (3)
  • OutputOptions (149-168)
  • FilterOptions (101-128)
  • serialize (362-385)
🪛 LanguageTool
sample/forms_with_many_merged_cells/en_form_sf425/sample.json_for_llm.md

[style] ~139-~139: Try using a synonym here to strengthen your wording.
Context: ...ng the collection of information. Send comments regarding the burden estimate or any ot...

(COMMENT_REMARK)

🪛 markdownlint-cli2 (0.18.1)
sample/forms_with_many_merged_cells/ja_general_form/ja_form.json_for_llm.md

32-32: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

sample/gantt_chart/en/sample.json_for_llm.md

3-3: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

sample/forms_with_many_merged_cells/en_form_sf425/sample.json_for_llm.md

5-5: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


11-11: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


15-15: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


21-21: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


55-55: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


57-57: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)


59-59: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


65-65: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


121-121: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


125-125: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)


147-147: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

docs/README.ja.md

500-500: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

README.ja.md

500-500: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: test (windows-latest, 3.11)
🔇 Additional comments (49)
docs/agents/ROADMAP.md (1)

39-45: ✓ LGTM — roadmap entries are well-formatted and correctly positioned.

The new version sections (v0.3.2 and v0.3.3) follow the existing document structure and language conventions. Feature descriptions are clear, and version ordering is correct.

sample/forms_with_many_merged_cells/ja_general_form/ja_form.json_for_llm.md (1)

1-110: Well-structured sample documentation.

This Markdown file provides a clear, human-readable representation of the Japanese care insurance application form. The structure, tables, and formatting are appropriate for documentation purposes.

Note: The static analysis hint about line 32 is a false positive—the bold text **(※)...** is correctly used for emphasis on a parenthetical note, not as a heading.

sample/forms_with_many_merged_cells/ja_general_form/ja_form.json (2)

217-363: Well-structured merged_cells data.

The merged_cells array is comprehensive and properly formatted with valid coordinate ranges and optional values. This provides a good test case for the merged cell extraction feature, covering various merge patterns throughout the form.


188-214: No action required—the data structure is valid.

The values -1, -2, -3, and -4 in rows 64–68 are cell values stored in column "0", not column keys. Column keys are always strings (e.g., "0", "1"). The schema explicitly allows integer values, including negative integers, so this is a valid and supported pattern.

Likely an incorrect or invalid review comment.

tests/backends/test_auto_page_breaks.py (1)

30-30: LGTM! Test scaffolding aligned with expanded API.

The new include_merged_cells parameter properly extends the test helper's signature to match the production extract_workbook interface, even though this specific test focuses on auto_page_breaks. The unused parameter is appropriate for a mock function.

src/exstruct/models/__init__.py (1)

173-175: LGTM! Properly integrated merged_cells field.

The merged_cells field is correctly defined with proper type hints, Field descriptor, and description. It follows the established pattern used for other list fields in SheetData.

docs/agents/TASKS.md (1)

1-17: LGTM! Comprehensive test coverage plan.

The task list covers all critical testing areas for the merged cells feature: backend extraction, pipeline propagation, modeling, engine filtering, and export serialization. The coverage target is clearly stated.

sample/gantt_chart/en/sample.json (1)

460-485: LGTM! Valid merged_cells sample data.

The merged_cells array correctly demonstrates the MergedCell model with:

  • Proper coordinate format (r1/r2 as 1-based rows, c1/c2 as 0-based columns)
  • Consistent horizontal merges for phase headers and title
  • Representative values matching the corresponding row data

This sample effectively demonstrates the merged cell feature.

sample/forms_with_many_merged_cells/en_form_sf425/README.md (1)

9-10: URL is valid and SF-425 form is available.

The EPA forms URL (https://www.epa.gov/financial/forms) is accessible and the SF-425 form is available at that location.

tests/core/test_pipeline_fallbacks.py (1)

37-37: LGTM: Consistent API extension for merged cells parameter.

The addition of include_merged_cells=None across all three test functions correctly aligns with the updated resolve_extraction_inputs signature.

Also applies to: 70-70, 111-111

src/exstruct/core/backends/openpyxl_backend.py (1)

16-16: LGTM: Clean imports for merged cells support.

The new imports are properly organized and support the merged cell extraction feature.

Also applies to: 20-20

tests/backends/test_merged_cells.py (1)

23-29: LGTM: Comprehensive test coverage for merged cell extraction.

The three test functions provide good coverage:

  • Successful extraction from OpenpyxlBackend
  • Failure handling with graceful fallback
  • NotImplementedError verification for ComBackend

Also applies to: 32-43, 46-49

tests/integration/test_integrate_raw_data.py (1)

59-59: LGTM: Clean integration of merged_cell_data parameter.

The addition of merged_cell_data={"Sheet1": []} correctly extends the test calls to match the updated collect_sheet_raw_data signature.

Also applies to: 94-94

src/exstruct/core/modeling.py (1)

9-9: LGTM! Clean integration of merged_cells field.

The changes correctly add the merged_cells field to SheetRawData and propagate it through to SheetData via build_sheet_data. Type hints, docstrings, and import organization all follow the project's coding guidelines.

Also applies to: 30-30, 40-40, 73-73

tests/models/test_modeling.py (1)

28-28: LGTM! Correctly accommodates the new field.

The addition of merged_cells=[] appropriately updates the test to include the new required field in SheetRawData.

docs/agents/DATA_MODEL.md (1)

3-3: LGTM! Documentation accurately reflects the new feature.

The documentation correctly describes the new MergedCell model with proper field types and coordinate system conventions (1-based rows, 0-based columns). The version bump, changelog entry, and section renumbering are all appropriate.

Also applies to: 151-167, 180-180, 248-248

tests/engine/test_engine.py (4)

17-17: LGTM!

Import of MergedCell is correctly added to support the new merged cells feature in the sample workbook.


37-37: LGTM!

The include_merged_cells parameter is correctly added to the fake extract function signature, matching the real extract_workbook API.


68-68: LGTM!

The sample workbook now includes a MergedCell instance, enabling comprehensive testing of the merged cells feature throughout the engine tests.


91-97: LGTM!

The test correctly verifies that include_merged_cells=False filters out merged cell data from serialized output. The pattern follows the existing filter tests (test_engine_serialize_filters_shapes, test_engine_serialize_filters_tables).

Note: Unlike test_engine_serialize_filters_tables (line 82), this test correctly omits the unused tmp_path parameter.

src/exstruct/core/cells.py (2)

18-18: LGTM!

Import of MergedCell alongside CellRow is correct and follows the existing import pattern.


529-561: LGTM!

Well-implemented function for extracting merged cell ranges:

  • Proper type hints and Google-style docstring as per coding guidelines
  • Correctly uses openpyxl_workbook context manager for resource safety
  • Handles edge cases: None merged_cells, None cell values
  • Column indices correctly converted from 1-based (openpyxl) to 0-based (as documented in MergedCell model: "c1: Start column (0-based)")

The implementation aligns with the MergedCell model definition in src/exstruct/models/__init__.py (lines 78-85).

src/exstruct/core/backends/base.py (3)

6-6: LGTM!

Import correctly updated to include MergedCell alongside existing model imports.


11-11: LGTM!

Type alias MergedCellData follows the established pattern of CellData and PrintAreaData.


40-42: LGTM!

The extract_merged_cells method is correctly added to the Backend protocol, following the pattern of existing methods (extract_cells, extract_print_areas). The relevant code snippets confirm that:

  • OpenpyxlBackend implements this method (returns {} on failure)
  • ComBackend raises NotImplementedError

This aligns with test requirements BE-06 and BE-07 in docs/agents/TEST_REQUIREMENTS.md.

docs/agents/TEST_REQUIREMENTS.md (5)

3-3: LGTM!

Version bump to 0.5 and explicit 78% coverage requirement are appropriate additions for tracking the merged cells feature release.

Also applies to: 7-7


51-57: LGTM!

Comprehensive test requirements for merged cells:

  • MRG-01: Mode-specific extraction (standard/verbose only)
  • MRG-02: Coordinate conventions (1-based row, 0-based column)
  • MRG-03: Top-left cell value handling
  • MRG-04: Multiple range retention

These align with the MergedCell model definition and extract_sheet_merged_cells implementation.


154-154: LGTM!

EXP-25 correctly specifies filtering behavior, consistent with the test in tests/engine/test_engine.py (line 91-97).


225-225: LGTM!

PIPE-09 correctly documents conditional pipeline step inclusion for merged cells extraction.


236-237: LGTM!

Backend test requirements:

  • BE-06: OpenpyxlBackend returns empty map on failure (matches openpyxl_backend.py lines 90-99)
  • BE-07: ComBackend raises NotImplementedError (matches com_backend.py lines 156-158)
docs/README.ja.md (5)

5-5: LGTM!

Image path format is consistent with the documentation structure.


11-12: LGTM!

Feature descriptions correctly updated to include "セル結合範囲" (merged cell ranges) in both the feature list and output mode descriptions.


338-495: LGTM!

Excellent Example 2 demonstrating the merged cells feature with a practical application form use case. The JSON excerpt shows:

  • merged_cells array with r1, c1, r2, c2, v fields
  • Real-world form data structure
  • LLM transformation to Markdown

This effectively showcases how ExStruct handles complex Excel structures with merged cells.


496-501: Intentional styling choice.

The static analysis hint (MD036) flags the bold text as "emphasis used instead of heading". This appears to be an intentional formatting choice to highlight the key takeaway without disrupting the document hierarchy.


509-509: LGTM!

Sample directory listing correctly updated with the new merged cells sample folder.

docs/README.en.md (5)

7-7: LGTM!

Feature description correctly expanded to include "merged cell ranges" in the comprehensive list of structured data outputs.


14-14: LGTM!

Output mode descriptions correctly updated to indicate merged cell ranges are included in standard and verbose modes.


136-137: LGTM!

Mode documentation accurately reflects the merged cells feature:

  • standard: includes merged cell ranges
  • verbose: includes merged cell ranges

This is consistent with MRG-01 in TEST_REQUIREMENTS.md.


339-552: LGTM!

Example 2 effectively demonstrates the merged cells feature using the SF-425 Federal Financial Report form. The JSON excerpt shows:

  • merged_cells array with coordinate fields
  • Values preserved from top-left cells
  • Integration with existing features (rows, shapes, print_areas)

The LLM reconstruction demonstrates practical value for document processing workflows.


562-562: LGTM!

Sample directory listing updated consistently with Japanese README.

src/exstruct/core/integrate.py (3)

20-20: LGTM!

Parameter include_merged_cells: bool | None = None correctly added following the established pattern of other optional feature flags (include_cell_links, include_print_areas, include_colors_map).


35-35: LGTM!

Docstring correctly documents the new parameter with consistent formatting: "Whether to include merged cell ranges; None uses mode defaults."


52-52: LGTM!

Parameter correctly threaded through to resolve_extraction_inputs, enabling the pipeline to conditionally include merged cell extraction based on the flag value.

src/exstruct/engine.py (1)

73-73: LGTM! Merged cell support follows established patterns.

The include_merged_cells flag implementation is consistent with existing feature flags throughout the engine:

  • StructOptions uses None for auto-resolution (light=False, others=True)
  • FilterOptions defaults to True for output inclusion
  • Filtering logic in _filter_sheet matches the pattern used for other optional fields
  • Proper forwarding to extract_workbook

Also applies to: 83-83, 126-128, 287-289, 359-359

README.md (2)

339-419: Check placeholder text in Example 2.

The new example effectively demonstrates merged cell handling, but the JSON and Markdown content contains placeholder characters (??? and similar). Verify whether this is intentional demo data or if actual content should be included.

Example at lines 355, 358, 371, 384-418 shows these placeholders.


429-429: Sample directory exists with expected content.

The sample/forms_with_many_merged_cells/ directory is present and contains properly organized samples including English (SF425 form) and Japanese general forms with accompanying documentation files.

README.ja.md (1)

11-12: LGTM! Japanese documentation properly mirrors English version.

The Japanese README updates are consistent with the English version:

  • Feature descriptions include merged cell ranges (セル結合範囲)
  • Example 2 structure matches English version
  • Terminology is properly localized

Also applies to: 152-152, 338-495, 509-509

tests/core/test_pipeline.py (1)

16-188: LGTM! Comprehensive test coverage for merged cell support.

The tests properly validate:

  • Conditional step inclusion based on include_merged_cells flag
  • Step exclusion when the flag is disabled
  • Default resolution behavior (None → True for standard/verbose modes)
  • Data flow through artifacts to final workbook structure
src/exstruct/core/pipeline.py (1)

13-22: LGTM! Complete and well-integrated pipeline support for merged cells.

The implementation demonstrates excellent adherence to coding guidelines:

  • Type hints present throughout (mypy strict compliant)
  • Google-style docstrings for all new functions and parameters
  • Consistent patterns with existing extraction steps
  • Single-responsibility functions
  • Proper data flow from extraction through artifacts to final workbook
  • Fallback path correctly handles merged cells

The merged cell extraction step is properly:

  • Gated by the include_merged_cells flag
  • Integrated into all extraction modes (light/standard/verbose)
  • Using the same pattern as other openpyxl-based steps

Also applies to: 35-35, 43-90, 171-236, 255-338, 470-481, 586-749

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
sample/forms_with_many_merged_cells/ja_general_form/notebooklm.chat.md (1)

1-167: Optional: Add table of contents for lengthy documentation.

With 167 lines across multiple Q&A sections, readers may benefit from a quick navigation aid (e.g., markdown TOC, anchor links to major sections) to locate relevant guidance more quickly. This is especially useful for end users consulting the guide while filling out the form.

tests/backends/test_merged_cells.py (1)

29-55: Optional: Consider adding docstrings to test functions.

While test names are self-documenting (pytest convention), the coding guidelines specify "Add Google-style docstrings to all functions and classes." Adding brief docstrings would ensure strict guideline compliance.

Example docstring format
def test_openpyxl_backend_extract_merged_cells(tmp_path: Path) -> None:
    """Verify OpenpyxlBackend extracts merged cell ranges correctly."""
    # ... test body

def test_openpyxl_backend_extract_merged_cells_handles_failure(
    monkeypatch: MonkeyPatch, tmp_path: Path
) -> None:
    """Verify extract_merged_cells returns empty dict on extraction failure."""
    # ... test body

def test_com_backend_extract_merged_cells_not_implemented() -> None:
    """Verify ComBackend raises NotImplementedError for merged cell extraction."""
    # ... test body

Based on coding guidelines, which specify docstrings on all functions.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a87f42 and 32f14e4.

📒 Files selected for processing (5)
  • docs/release-notes/v0.3.2.md
  • mkdocs.yml
  • pyproject.toml
  • sample/forms_with_many_merged_cells/ja_general_form/notebooklm.chat.md
  • tests/backends/test_merged_cells.py
✅ Files skipped from review due to trivial changes (3)
  • pyproject.toml
  • mkdocs.yml
  • docs/release-notes/v0.3.2.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Add type hints to all function and method arguments and return values, must be mypy strict compliant
Use Any type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Each function must have a single responsibility; cyclomatic complexity must not exceed 12
Organize imports in the following order: 1) standard library, 2) third-party packages, 3) exstruct internal modules
Add Google-style docstrings to all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external entities to Pydantic models at the boundary layer
Generated code must pass mypy strict mode with zero errors
Generated code must pass Ruff linter checks (E, W, F, I, B, UP, N, C90 rules) with zero errors
Avoid God Functions (large monolithic functions that do too much)
Avoid God Objects (classes that have too many responsibilities)
Eliminate unnecessary nesting and deeply nested conditional branches
Avoid circular dependencies between modules
Write functions that are testable and avoid unnecessary side effects

Files:

  • tests/backends/test_merged_cells.py
🧬 Code graph analysis (1)
tests/backends/test_merged_cells.py (3)
src/exstruct/core/backends/com_backend.py (1)
  • ComBackend (20-159)
src/exstruct/core/backends/openpyxl_backend.py (1)
  • OpenpyxlBackend (26-114)
src/exstruct/models/__init__.py (1)
  • MergedCell (79-86)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (6)
sample/forms_with_many_merged_cells/ja_general_form/notebooklm.chat.md (1)

1-167: Clear and comprehensive form-completion guidance.

This new documentation provides well-structured, Japanese-language Q&A guidance for the care insurance application form, with helpful cross-references between sections and visual organization. The intentional duplication (comparing NotebookLM output from JSON vs. PDF inputs) is appropriately labeled and adds pedagogical value.

tests/backends/test_merged_cells.py (5)

1-11: LGTM: Import organization follows guidelines.

The imports are correctly organized in the prescribed order: standard library, third-party packages, and exstruct internal modules.


14-26: LGTM: All past review feedback addressed.

The helper function now includes the return type annotation, Google-style docstring, and properly closes the workbook to prevent resource leaks. All issues from the previous review have been resolved.


29-35: LGTM: Test correctly verifies merged cell extraction.

The test properly validates that OpenpyxlBackend extracts merged cells by checking all MergedCell fields (r1, c1, r2, c2, v) against expected values.


38-49: LGTM: Error handling test is well-structured.

The test correctly verifies that extract_merged_cells returns an empty dict when the underlying extraction fails, using proper monkeypatching and type-safe mock implementation.


52-55: LGTM: NotImplementedError test validates ComBackend behavior.

The test correctly verifies that ComBackend raises NotImplementedError for merged cell extraction, consistent with the implementation shown in the relevant code snippets.

@harumiWeb harumiWeb merged commit a08f4ba into main Jan 5, 2026
9 checks passed
@harumiWeb harumiWeb deleted the dev/merge-cells branch January 5, 2026 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants