Skip to content

Conversation

@harumiWeb
Copy link
Owner

@harumiWeb harumiWeb commented Jan 6, 2026

Summary by CodeRabbit

  • New Features

    • New option to control whether merged cell values appear in row data (configurable, default preserves existing behavior).
    • Merged-cell output redesigned to a compact schema/items format for smaller, more consistent payloads.
  • Documentation

    • Updated release notes, roadmap, specs and READMEs with migration guidance and examples.
    • Added sample examples demonstrating merged-cell handling.
  • Chores

    • Updated ignore list, JSON schemas and package version for the new format.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 6, 2026

Warning

Rate limit exceeded

@harumiWeb has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 22 minutes and 41 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 1cdcebb and ba1ee5a.

📒 Files selected for processing (1)
  • docs/README.ja.md
📝 Walkthrough

Walkthrough

This PR replaces the flat merged-cell representation with a schema/items compressed MergedCells model, renames internal MergedCellMergedCellRange, and adds include_merged_values_in_rows to control whether merged values are retained in row data. Changes touch models, pipeline, extraction, schemas, serialization, docs, samples, and tests.

Changes

Cohort / File(s) Summary
Models & Serialization
src/exstruct/models/__init__.py, src/exstruct/io/__init__.py, schemas/*.json
Introduces public MergedCells (schema + items), updates SheetData.merged_cells to `MergedCells
Core Extraction & Range Types
src/exstruct/core/cells.py, src/exstruct/core/backends/base.py
Adds MergedCellRange dataclass (r1,c1,r2,c2,v), normalizes empty values to " ", and updates type aliases/imports to use MergedCellRange.
Pipeline & Integration
src/exstruct/core/pipeline.py, src/exstruct/core/modeling.py, src/exstruct/core/integrate.py, src/exstruct/engine.py
Adds include_merged_values_in_rows flag (default True) to inputs and public extraction paths; implements row-filtering helpers to drop merged values when flag is False; adds _build_merged_cells to compress ranges into MergedCells.
Schemas & Generation
schemas/merged_cells.json, schemas/sheet.json, schemas/workbook.json, scripts/gen_json_schema.py
Adds MergedCells schema and integrates it into sheet/workbook schemas and generator targets.
Samples & Examples
sample/forms_with_many_merged_cells/*
Adds sample packages, example scripts demonstrating include_merged_values_in_rows=False, and sample JSON using new merged_cells format.
Docs & Release Notes
README.md, README.ja.md, docs/README.*, docs/agents/*, docs/release-notes/*, mkdocs.yml, AGENTS.md
Updates documentation, examples, release notes, feature spec, roadmap, and agent docs to reflect the schema change, migration guidance, and v0.3.5 version.
Tests
tests/*
Updates tests to use MergedCellRange and MergedCells, thread include_merged_values_in_rows through test inputs, and adjust expectations (empty→" ").
Build / Config
pyproject.toml, .gitignore
Bumps project version to 0.3.5; adds drafts/ to .gitignore.
Misc (samples init)
sample/__init__.py, sample/.../__init__.py
Adds package marker files for new sample packages.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as Client/CLI
  participant Engine as ExStructEngine
  participant Pipeline as Pipeline
  participant Cells as CellsExtractor
  participant Modeling as ModelBuilder
  participant IO as Serializer/IO

  CLI->>Engine: extract(file, StructOptions(include_merged_values_in_rows=?))
  Engine->>Pipeline: resolve_extraction_inputs(..., include_merged_values_in_rows)
  Pipeline->>Cells: collect_sheet_raw_data(..., include_merged_values_in_rows)
  alt include_merged_values_in_rows == False
    Cells-->>Pipeline: merged_ranges (MergedCellRange list)
    Pipeline->>Pipeline: _filter_rows_excluding_merged_values(rows, merged_ranges)
  else
    Cells-->>Pipeline: merged_ranges (MergedCellRange list)
  end
  Pipeline->>Modeling: build_sheet_data(raw, merged_ranges)
  Modeling->>Modeling: _build_merged_cells(merged_ranges) → MergedCells(schema, items)
  Modeling-->>Engine: SheetData(merged_cells = MergedCells | null)
  Engine->>IO: serialize(workbook) (model_dump by_alias=True)
  IO-->>CLI: output JSON (merged_cells uses schema/items)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Dev/refactor #23 — Overlaps core changes (cells, modeling, pipeline, models, schemas) and likely contains related structural edits to merged-cell handling.
  • Added function to get merged cell range #35 — Related work touching merged-cell extraction and pipeline integration; strong code-level connection.

Poem

🐰 I hopped through spreadsheets, found cells intertwined,
Flattened to schema, tidy rows aligned.
Ranges now compressed, values gently pruned,
A rabbit’s small cheer for the code that’s improved! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.03% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: Compressing the amount of context in merged cell data' accurately summarizes the main feature: a data structure refactoring that compresses merged cell representation from a flat array of objects to a schema/items structure, reducing output size and context.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link

codecov-commenter commented Jan 6, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 87.27273% with 14 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/exstruct/core/pipeline.py 83.33% 12 Missing ⚠️
src/exstruct/io/__init__.py 75.00% 1 Missing ⚠️
src/exstruct/models/__init__.py 90.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI Agents
In @docs/README.ja.md:
- Around line 363-371: The JSON output for merged_cells was changed from a list
of objects to a schema+items format (fields merged_cells, schema, items), which
is a breaking serialization change; update the docs to add a clear migration
note: describe the old format (array of objects), show the new schema+items
format with an explicit version number where it was introduced, provide a short
migration example showing how to map old objects to the new schema/items
representation and include parsing examples/snippets for consumers to support
both formats during transition.

In @tests/models/test_models_export.py:
- Around line 141-148: Add a Google-style docstring to the test function
test_sheet_json_includes_merged_cells_schema that briefly explains it verifies
that SheetData serialization (via SheetData.to_json) includes the merged_cells
schema and preserves merged cell items; mention the expected schema
["r1","c1","r2","c2","v"] and that the first item equals [1, 0, 1, 1, "merged"].
Locate the function using its name and references to SheetData, MergedCells, and
to_json and place the docstring immediately under the def line following
Google-style conventions.
🧹 Nitpick comments (6)
sample/forms_with_many_merged_cells/ja_general_form/example.py (1)

1-9: Consider adding type annotations for enhanced clarity.

The example script clearly demonstrates the include_merged_values_in_rows option. However, adding type annotations to the module-level variables would improve clarity and align with the project's type-safety standards.

🔎 Proposed enhancement with type annotations
+from pathlib import Path
+
 from exstruct import ExStructEngine, StructOptions
 
-file_path = "ja_form.xlsx"
+file_path: str = "ja_form.xlsx"
 
 engine = ExStructEngine(
     options=StructOptions(include_merged_values_in_rows=False),
 )
-wb = engine.extract(file_path, mode="standard")
-engine.export(wb, "output.json", pretty=False)
+wb = engine.extract(file_path, mode="standard")  # WorkbookData
+engine.export(wb, Path("output.json"), pretty=False)
sample/forms_with_many_merged_cells/en_form_sf425/example.py (1)

1-9: Consider adding a module docstring and main guard for better example clarity.

This example script correctly demonstrates the new include_merged_values_in_rows option. However, as per the coding guidelines, consider adding:

  1. A module-level docstring explaining what this example demonstrates
  2. A if __name__ == "__main__": guard around the execution logic

These additions would improve the example's educational value and follow Python best practices for sample scripts.

🔎 Proposed improvements
+"""
+Example demonstrating ExStruct extraction with merged cell values excluded from row data.
+
+This script extracts data from sample.xlsx in standard mode with
+include_merged_values_in_rows=False, which stores merged cell data separately
+in the merged_cells field while omitting them from row cell values.
+"""
+
 from exstruct import ExStructEngine, StructOptions
 
-file_path = "sample.xlsx"
-
-engine = ExStructEngine(
-    options=StructOptions(include_merged_values_in_rows=False),
-)
-wb = engine.extract(file_path, mode="standard")
-engine.export(wb, "output.json", pretty=False)
+
+if __name__ == "__main__":
+    file_path = "sample.xlsx"
+
+    engine = ExStructEngine(
+        options=StructOptions(include_merged_values_in_rows=False),
+    )
+    wb = engine.extract(file_path, mode="standard")
+    engine.export(wb, "output.json", pretty=False)
tests/integration/test_integrate_raw_data.py (2)

62-62: Consider adding test coverage for the new parameter's behavior.

The test now passes include_merged_values_in_rows=True but doesn't validate whether merged values are actually included in the rows. Consider adding assertions that verify this parameter's effect, or create a dedicated test that checks both True and False cases.


98-98: Consider adding test coverage for the new parameter's behavior.

Similar to the other test, this passes include_merged_values_in_rows=True without validating the parameter's effect. Consider adding test coverage that verifies the behavior difference between True and False values.

src/exstruct/io/__init__.py (1)

32-68: Update the docstring to document alias handling.

The docstring should mention that model instances are serialized using field aliases when by_alias=True is passed to model_dump(). This is important for understanding how fields like schema and items in MergedCells appear in the output.

🔎 Suggested docstring enhancement
 def dict_without_empty_values(obj: object) -> JsonStructure:
     """
     Remove None, empty string, empty list, and empty dict values from a nested structure or supported model object.
 
-    Recursively processes dicts, lists, and supported model types (WorkbookData, CellRow, Chart, PrintArea, PrintAreaView, Shape, Arrow, SmartArt). Model instances are converted to dictionaries with None fields excluded before recursive cleaning. Values considered empty and removed are: `None`, `""` (empty string), `[]` (empty list), and `{}` (empty dict).
+    Recursively processes dicts, lists, and supported model types (WorkbookData, CellRow, Chart, PrintArea, PrintAreaView, Shape, Arrow, SmartArt). Model instances are converted to dictionaries with None fields excluded and field aliases applied before recursive cleaning. Values considered empty and removed are: `None`, `""` (empty string), `[]` (empty list), and `{}` (empty dict).
 
     Parameters:
         obj (object): A value to clean; may be a dict, list, scalar, or one of the supported model instances.
 
     Returns:
         JsonStructure: The input structure with empty values removed, preserving other values and nesting.
     """
schemas/workbook.json (1)

376-424: Consider adding required property to MergedCells definition.

The MergedCells schema defines items and schema properties but does not specify them as required. This could allow empty objects {} to be valid, which may not be the intended behavior.

🔎 Proposed fix
       }
     },
+    "required": [
+      "schema",
+      "items"
+    ],
     "title": "MergedCells",
     "type": "object"
   },
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c6f6e92 and c21d544.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (42)
  • .gitignore
  • AGENTS.md
  • README.ja.md
  • README.md
  • docs/README.en.md
  • docs/README.ja.md
  • docs/agents/CODE_REVIEW.md
  • docs/agents/DATA_MODEL.md
  • docs/agents/FEATURE_SPEC.md
  • docs/agents/ROADMAP.md
  • docs/agents/TASKS.md
  • docs/release-notes/v0.3.2.md
  • sample/__init__.py
  • sample/forms_with_many_merged_cells/__init__.py
  • sample/forms_with_many_merged_cells/en_form_sf425/__init__.py
  • sample/forms_with_many_merged_cells/en_form_sf425/example.py
  • sample/forms_with_many_merged_cells/en_form_sf425/sample.json
  • sample/forms_with_many_merged_cells/ja_general_form/__init__.py
  • sample/forms_with_many_merged_cells/ja_general_form/example.py
  • sample/forms_with_many_merged_cells/ja_general_form/ja_form.json
  • schemas/merged_cells.json
  • schemas/sheet.json
  • schemas/workbook.json
  • scripts/gen_json_schema.py
  • src/exstruct/core/backends/base.py
  • src/exstruct/core/cells.py
  • src/exstruct/core/integrate.py
  • src/exstruct/core/modeling.py
  • src/exstruct/core/pipeline.py
  • src/exstruct/engine.py
  • src/exstruct/io/__init__.py
  • src/exstruct/models/__init__.py
  • tests/backends/test_auto_page_breaks.py
  • tests/backends/test_merged_cells.py
  • tests/core/test_merged_cells_core.py
  • tests/core/test_pipeline.py
  • tests/core/test_pipeline_fallbacks.py
  • tests/engine/test_engine.py
  • tests/export/test_export_requirements.py
  • tests/integration/test_integrate_raw_data.py
  • tests/models/test_modeling.py
  • tests/models/test_models_export.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Add type hints to all function and method arguments and return values, must be mypy strict compliant
Use Any type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Each function must have a single responsibility; cyclomatic complexity must not exceed 12
Organize imports in the following order: 1) standard library, 2) third-party packages, 3) exstruct internal modules
Add Google-style docstrings to all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external entities to Pydantic models at the boundary layer
Generated code must pass mypy strict mode with zero errors
Generated code must pass Ruff linter checks (E, W, F, I, B, UP, N, C90 rules) with zero errors
Avoid God Functions (large monolithic functions that do too much)
Avoid God Objects (classes that have too many responsibilities)
Eliminate unnecessary nesting and deeply nested conditional branches
Avoid circular dependencies between modules
Write functions that are testable and avoid unnecessary side effects

Files:

  • tests/export/test_export_requirements.py
  • sample/forms_with_many_merged_cells/en_form_sf425/example.py
  • sample/forms_with_many_merged_cells/ja_general_form/__init__.py
  • sample/forms_with_many_merged_cells/__init__.py
  • sample/forms_with_many_merged_cells/ja_general_form/example.py
  • tests/core/test_merged_cells_core.py
  • tests/integration/test_integrate_raw_data.py
  • tests/backends/test_merged_cells.py
  • tests/models/test_modeling.py
  • tests/core/test_pipeline_fallbacks.py
  • src/exstruct/engine.py
  • src/exstruct/core/integrate.py
  • src/exstruct/core/backends/base.py
  • tests/models/test_models_export.py
  • src/exstruct/core/modeling.py
  • src/exstruct/core/cells.py
  • tests/backends/test_auto_page_breaks.py
  • sample/__init__.py
  • scripts/gen_json_schema.py
  • tests/engine/test_engine.py
  • sample/forms_with_many_merged_cells/en_form_sf425/__init__.py
  • src/exstruct/io/__init__.py
  • src/exstruct/core/pipeline.py
  • tests/core/test_pipeline.py
  • src/exstruct/models/__init__.py
🧠 Learnings (1)
📚 Learning: 2025-12-27T14:32:54.717Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T14:32:54.717Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external entities to Pydantic models at the boundary layer

Applied to files:

  • src/exstruct/io/__init__.py
🧬 Code graph analysis (15)
tests/export/test_export_requirements.py (2)
src/exstruct/models/__init__.py (2)
  • SheetData (159-253)
  • CellRow (103-112)
tests/models/test_schemas_generated.py (1)
  • test_sheet_schema_definitions_present (23-30)
sample/forms_with_many_merged_cells/en_form_sf425/example.py (1)
src/exstruct/engine.py (2)
  • ExStructEngine (173-574)
  • StructOptions (60-86)
sample/forms_with_many_merged_cells/ja_general_form/example.py (1)
src/exstruct/engine.py (2)
  • ExStructEngine (173-574)
  • StructOptions (60-86)
tests/backends/test_merged_cells.py (2)
src/exstruct/core/cells.py (1)
  • MergedCellRange (71-78)
src/exstruct/core/backends/openpyxl_backend.py (1)
  • extract_merged_cells (91-100)
tests/models/test_modeling.py (1)
src/exstruct/core/cells.py (1)
  • MergedCellRange (71-78)
src/exstruct/engine.py (1)
src/exstruct/__init__.py (1)
  • process_excel (318-398)
src/exstruct/core/integrate.py (1)
src/exstruct/core/backends/openpyxl_backend.py (1)
  • extract_cells (35-48)
src/exstruct/core/backends/base.py (2)
src/exstruct/models/__init__.py (2)
  • CellRow (103-112)
  • PrintArea (150-156)
src/exstruct/core/cells.py (2)
  • MergedCellRange (71-78)
  • WorkbookColorsMap (53-67)
tests/models/test_models_export.py (1)
src/exstruct/models/__init__.py (4)
  • CellRow (103-112)
  • MergedCells (84-100)
  • SheetData (159-253)
  • SmartArt (74-81)
src/exstruct/core/modeling.py (2)
src/exstruct/models/__init__.py (1)
  • MergedCells (84-100)
src/exstruct/core/cells.py (1)
  • MergedCellRange (71-78)
scripts/gen_json_schema.py (1)
src/exstruct/models/__init__.py (1)
  • MergedCells (84-100)
tests/engine/test_engine.py (1)
src/exstruct/models/__init__.py (1)
  • MergedCells (84-100)
src/exstruct/core/pipeline.py (3)
src/exstruct/core/cells.py (1)
  • MergedCellRange (71-78)
src/exstruct/models/__init__.py (1)
  • CellRow (103-112)
src/exstruct/core/modeling.py (1)
  • SheetRawData (20-41)
tests/core/test_pipeline.py (3)
src/exstruct/core/cells.py (1)
  • MergedCellRange (71-78)
src/exstruct/core/pipeline.py (3)
  • ExtractionInputs (42-67)
  • ExtractionArtifacts (71-90)
  • build_cells_tables_workbook (821-870)
src/exstruct/models/__init__.py (1)
  • CellRow (103-112)
src/exstruct/models/__init__.py (1)
src/exstruct/io/__init__.py (1)
  • dict_without_empty_values (32-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (72)
.gitignore (1)

6-6: LGTM!

The addition of the drafts/ directory to .gitignore is straightforward and follows the existing pattern.

tests/backends/test_auto_page_breaks.py (1)

31-31: LGTM!

The new parameter include_merged_values_in_rows is properly typed and aligns the mock signature with the actual extract_workbook interface. It's appropriate for test mocks to include parameters that match the real signature, even when not used in the mock implementation.

sample/forms_with_many_merged_cells/ja_general_form/__init__.py (1)

1-1: LGTM!

This package marker file is appropriately minimal and clearly documented.

docs/release-notes/v0.3.2.md (1)

8-8: LGTM!

The documentation accurately reflects the updated API, changing from MergedCell to MergedCells format in line with the PR's refactoring.

sample/forms_with_many_merged_cells/en_form_sf425/__init__.py (1)

1-1: LGTM!

This package marker file follows the same clean pattern as other sample packages and is appropriately documented.

AGENTS.md (1)

202-210: New documentation section is well-structured and actionable.

The addition of section "10. 各種仕様の確認" provides clear guidance for AI agents to reference relevant documentation when working on ExStruct development. The referenced files (pipeline.md, architecture.md, CODING_GUIDELINES.md, DATA_MODEL.md, TASKS.md) are appropriately listed with descriptions.

sample/__init__.py (1)

1-1: Minimal package marker is appropriate.

The module docstring provides a clear identifier for the sample package.

docs/agents/ROADMAP.md (1)

43-53: Roadmap updates accurately reflect the merged cells compression feature and future work.

The version assignments are clear:

  • v0.3.5: Merged cell range data compression and rows structure improvements (current PR)
  • v0.3.6: Formula retrieval option (forthcoming)
  • v0.5.0: Excel Form Controls analysis (future major milestone)

The changes align with the PR objectives and maintain a logical feature roadmap.

docs/agents/TASKS.md (1)

1-11: LGTM!

The task list accurately reflects the implementation status. One remaining item (Line 10) for documentation consistency check is appropriately marked incomplete.

src/exstruct/core/cells.py (4)

18-19: LGTM!

Import is correctly organized with other internal module imports.


70-78: LGTM!

The frozen dataclass is well-defined with proper type hints. The field names align with the schema definition (r1, c1, r2, c2, v).


540-556: LGTM!

The function signature, type hints, and Google-style docstring are correct. The return type properly reflects the new MergedCellRange model.


556-573: LGTM!

The implementation correctly:

  • Normalizes empty merged cell values to a single space " " per specification
  • Converts column indices to 0-based (min_col - 1, max_col - 1)
  • Preserves 1-based row indices
docs/agents/DATA_MODEL.md (3)

153-165: LGTM!

The MergedCells model documentation accurately describes the new schema-based format with:

  • schema: field ordering array
  • items: array of tuples matching the schema
  • Empty value normalization to " " documented at line 164

168-189: LGTM!

The SheetData model documentation correctly reflects:

  • merged_cells type as MergedCells | null
  • The include_merged_values_in_rows flag behavior documented in the notes

239-253: LGTM!

The changelog entries properly document the evolution from MergedCell (v0.14) to the compressed MergedCells schema+items format (v0.15).

sample/forms_with_many_merged_cells/__init__.py (1)

1-1: LGTM!

Standard package marker with an appropriate docstring.

tests/models/test_modeling.py (2)

1-3: LGTM!

Import correctly added for MergedCellRange alongside existing model imports.


29-42: LGTM!

Test correctly:

  • Uses MergedCellRange with the normalized space value v=" "
  • Verifies that merged_cells is not None after building workbook data, confirming the transformation from list[MergedCellRange] to MergedCells works
tests/export/test_export_requirements.py (1)

117-120: LGTM!

The test correctly uses merged_cells=None instead of an empty list, aligning with the updated SheetData.merged_cells type (MergedCells | None). The assertion verifies that None values are properly omitted from JSON output.

tests/core/test_merged_cells_core.py (1)

18-28: LGTM!

The test correctly expects " " (single space) for the merged range D4:E4 which has no value. This aligns with the normalization logic in extract_sheet_merged_cells where empty strings are replaced with a space character.

docs/README.en.md (1)

380-389: LGTM! Clear documentation of the new schema-based structure.

The documentation clearly illustrates the new merged_cells representation with the schema and items fields. The example effectively demonstrates how merged cell data is now structured as a tabular payload rather than a flat array.

scripts/gen_json_schema.py (2)

13-13: LGTM! Import follows established pattern.

The addition of MergedCells to the imports is consistent with the existing model imports and enables schema generation for the new data structure.


63-63: LGTM! Schema generation target added correctly.

The "merged_cells": MergedCells entry follows the same pattern as other models in the targets dictionary, ensuring the new schema is generated alongside existing schemas.

src/exstruct/core/integrate.py (3)

21-21: LGTM! Parameter correctly typed with backward-compatible default.

The include_merged_values_in_rows parameter is properly typed as bool with a default value of True, maintaining backward compatibility while enabling the new functionality.


37-37: LGTM! Clear parameter documentation.

The docstring accurately describes the purpose of the include_merged_values_in_rows parameter and follows the established documentation pattern.


55-55: LGTM! Parameter correctly forwarded.

The parameter is properly forwarded to resolve_extraction_inputs, maintaining consistency through the extraction pipeline.

tests/core/test_pipeline_fallbacks.py (3)

38-38: LGTM! Test updated with new parameter.

The test correctly includes the include_merged_values_in_rows=True parameter, maintaining explicit clarity in test expectations.


72-72: LGTM! Test updated with new parameter.

The test correctly includes the include_merged_values_in_rows=True parameter, maintaining consistency with the updated API.


114-114: LGTM! Test updated with new parameter.

The test correctly includes the include_merged_values_in_rows=True parameter, ensuring all fallback scenarios are tested with the new API.

README.ja.md (1)

363-371: LGTM! Merged cells representation is consistent with the new schema.

The JSON example correctly demonstrates the new schema-based merged_cells structure with the schema array and items array format. This aligns with the MergedCells model introduced in the PR.

schemas/merged_cells.json (1)

1-50: LGTM! Schema definition is correct and well-structured.

The JSON Schema correctly defines the MergedCells structure with:

  • Proper tuple constraints (minItems/maxItems = 5)
  • Correct type definitions for the 5-element array (4 integers + 1 string)
  • Enum validation for the schema field
  • Clear descriptions for all fields

This aligns perfectly with the Pydantic MergedCells model introduced in the PR.

tests/engine/test_engine.py (3)

17-17: LGTM! Import updated to use the new MergedCells model.

The import change from MergedCell to MergedCells correctly reflects the renamed public API introduced in this PR.


38-38: LGTM! Parameter addition maintains signature consistency.

The include_merged_values_in_rows parameter correctly mirrors the real extract_workbook signature with the appropriate default value of True. This ensures the fake function matches the actual API for testing purposes.


69-69: LGTM! Correctly uses the new MergedCells schema-based constructor.

The usage of MergedCells(items=[(1, 0, 1, 1, "merged")]) properly demonstrates the new compressed representation where merged cell data is provided as a list of tuples rather than individual field arguments.

src/exstruct/engine.py (4)

74-75: LGTM - Clear documentation of new option.

The docstring clearly describes the purpose of the new include_merged_values_in_rows option.


85-85: LGTM - Appropriate default value.

The default value of True preserves backward compatibility by keeping merged values in rows by default.


362-362: LGTM - Correctly propagates new option.

The new include_merged_values_in_rows option is properly passed through to extract_workbook.


289-291: No action needed—None handling is already properly implemented.

The SheetData.merged_cells field is correctly typed as MergedCells | None and all consumers handle None appropriately:

  • The type annotation already supports both values.
  • Serialization uses exclude_none=True, automatically omitting None from JSON output.
  • Tests verify the behavior: test_merged_cells_empty_is_omitted_in_sheet_json() confirms that merged_cells=None is excluded from exported JSON.
  • Callers that reference merged_cells either explicitly check for None (e.g., test_core/test_pipeline.py:197) or rely on safe serialization patterns.
tests/backends/test_merged_cells.py (3)

11-11: LGTM - Import updated for renamed type.

The import correctly reflects the rename from MergedCell to MergedCellRange and the move from exstruct.models to exstruct.core.cells.


35-35: LGTM - Test assertion updated for renamed type.

The test correctly uses the renamed MergedCellRange type.


41-41: LGTM - Type hint updated consistently.

The return type annotation correctly uses the renamed MergedCellRange type.

src/exstruct/core/backends/base.py (2)

6-7: LGTM - Imports updated for type refactoring.

The imports correctly reflect the rename from MergedCell to MergedCellRange and its move from exstruct.models to exstruct.core.cells. The addition of WorkbookColorsMap import is also appropriate.


11-11: LGTM - Type alias updated consistently.

The MergedCellData type alias correctly uses the renamed MergedCellRange type, maintaining consistency with the refactoring across the codebase.

src/exstruct/io/__init__.py (1)

374-376: LGTM! Consistent alias-based serialization.

The addition of by_alias=True across all export and serialization paths ensures that field aliases (e.g., schema and items in MergedCells) are consistently honored in the output. This aligns well with the PR objective to compress merged cell data representation.

Also applies to: 435-437, 462-464, 489-489, 528-528

schemas/sheet.json (2)

376-424: LGTM! Well-structured compressed schema.

The MergedCells definition effectively compresses merged cell data using a schema-based approach. The field descriptions clearly specify the coordinate systems (1-based rows, 0-based columns), and the type constraints properly enforce the 5-element tuple structure [r1, c1, r2, c2, v].


721-732: LGTM! Proper integration with SheetData schema.

The merged_cells field is correctly defined as an optional field (nullable with default null) that references the new MergedCells definition. This aligns with the pattern used for other optional fields in the SheetData schema.

tests/core/test_pipeline.py (3)

5-5: LGTM! Proper test updates for new parameter.

The import of MergedCellRange and the addition of include_merged_values_in_rows=True to all existing test cases ensures backward compatibility while exercising the new API. The consistent pattern across all tests is good for maintainability.

Also applies to: 30-30, 50-50, 75-75, 93-93, 117-117, 135-135, 152-152, 182-182


161-197: LGTM! Correct assertion for empty merged cells.

The assertion assert sheet.merged_cells is None at line 197 is correct. When merged_cell_data is provided as an empty list (line 187), the resulting sheet.merged_cells should be None, reflecting the absence of merged cells in the output.


200-221: LGTM! Comprehensive test for merged value filtering.

The new test properly verifies that when include_merged_values_in_rows=False, cells covered by merged cell ranges are correctly filtered from the row data. The test clearly demonstrates that cells "0" and "1" (covered by the merged range) are excluded, while cell "2" remains. This effectively validates the new feature.

README.md (3)

380-389: LGTM! Clear documentation of new merged_cells format.

The example effectively demonstrates the new compressed merged_cells representation using the schema-based approach. The structure with "schema": ["r1", "c1", "r2", "c2", "v"] and corresponding items array clearly shows how merged cell data is now organized, which aligns perfectly with the PR objective.


597-601: LGTM! Helpful documentation build instructions.

The new "Documentation build" section provides clear, actionable commands for contributors who need to generate and build documentation locally. The use of uv run for command execution is appropriate for the development environment.


5-5: All updated image paths are valid. The files exist at their new locations:

  • docs/assets/icon.webp
  • docs/assets/demo_sheet.png
  • docs/assets/demo_form_en.png
sample/forms_with_many_merged_cells/ja_general_form/ja_form.json (1)

118-219: LGTM! Proper demonstration of new merged_cells format.

The sample file correctly demonstrates the new compressed merged_cells representation with the schema-based approach. The structure matches the JSON schema definition, showing real-world usage with merged cell ranges of varying spans and Japanese text content.

docs/agents/FEATURE_SPEC.md (1)

1-58: Documentation updates are clear and well-structured.

The specification clearly describes:

  • The new schema + items format for merged_cells
  • The include_merged_values_in_rows flag with sensible default (True) for backward compatibility
  • Coordinate conventions (row 1-based, col 0-based)

This aligns well with the implementation changes across the codebase.

schemas/workbook.json (1)

597-608: Schema addition for merged_cells in SheetData looks correct.

The anyOf pattern with MergedCells or null (default null) appropriately models the optional merged cells data.

src/exstruct/core/modeling.py (2)

78-92: Implementation is correct and follows coding guidelines.

The _build_merged_cells function:

  • Has proper type hints for arguments and return value
  • Includes a Google-style docstring
  • Has a single responsibility (transform raw ranges to model)
  • Returns None for empty input, avoiding empty MergedCells objects

41-41: Type change for merged_cells field is consistent with the new data model.

The field type list[MergedCellRange] correctly represents raw extracted data before conversion to the compressed MergedCells model.

src/exstruct/models/__init__.py (3)

11-13: Helper function follows coding guidelines.

The _default_merged_cells_schema function has proper type hints and a concise docstring. The use of Literal types ensures type safety for the schema field names.


84-100: Well-designed MergedCells model with proper alias handling.

The implementation correctly:

  • Uses schema_ with alias="schema" to avoid Python keyword conflict
  • Enables populate_by_name=True for flexible input handling
  • Provides sensible defaults via default_factory
  • Has accurate type hints for the tuple structure

194-196: Serialization correctly uses by_alias=True.

This ensures the schema_ field is serialized as "schema" in JSON output. This is consistent with dict_without_empty_values in src/exstruct/io/__init__.py which also uses by_alias=True.

sample/forms_with_many_merged_cells/en_form_sf425/sample.json (1)

165-245: Sample data correctly demonstrates the new merged_cells format.

The schema + items structure is properly used with:

  • Schema defining ["r1", "c1", "r2", "c2", "v"] field order
  • Items as arrays of 5 elements matching the schema
  • Multi-line strings preserved in the value field (e.g., Paperwork Burden Statement)

This serves as a good reference for the new format.

src/exstruct/core/pipeline.py (10)

24-34: Type alias and import updates are consistent with the new data model.

The change from MergedCell to MergedCellRange aligns with the broader refactoring. The MergedCellData type alias correctly reflects this change.


590-634: Row filtering logic is correct and well-structured.

The function properly:

  • Returns early when no filtering is needed
  • Uses interval-based lookup for efficient column checking
  • Preserves non-merged cells and their associated links
  • Skips entirely empty rows after filtering
  • Creates new CellRow instances to avoid mutation

637-654: Interval building logic is correct.

The function correctly:

  • Expands merged cell ranges to all covered rows
  • Accumulates column intervals per row
  • Merges overlapping intervals for efficient lookup

657-671: Interval merging is correctly implemented.

The standard interval merging algorithm is correctly applied. Minor note: the docstring is minimal but adequate for this straightforward utility function.


674-681: Column interval check is efficient.

The function leverages sorted intervals (from _merge_intervals) to enable early termination when col_index < start, avoiding unnecessary iterations.


684-689: Safe parsing handles edge cases gracefully.

Non-integer column keys are preserved in the output by returning None, ensuring they are not incorrectly filtered out.


692-744: Sheet data collection correctly integrates the new filtering behavior.

The implementation:

  • Accepts and documents the new include_merged_values_in_rows parameter
  • Conditionally applies row filtering based on the flag
  • Preserves merged_cells metadata regardless of the flag value

This aligns with the specification that rows can omit merged values while metadata remains intact.


784-799: Pipeline correctly propagates the flag through extraction.

The include_merged_values_in_rows flag is passed from inputs to collect_sheet_raw_data, ensuring consistent behavior throughout the extraction pipeline.


846-868: Fallback workbook building correctly applies the same filtering logic.

The fallback path mirrors the main extraction path by:

  • Applying row filtering when include_merged_values_in_rows is False
  • Preserving merged_cells metadata in the output

This ensures consistent behavior regardless of whether COM extraction succeeds or falls back.


55-67: New field properly added to ExtractionInputs.

The include_merged_values_in_rows field is:

  • Documented in the class docstring
  • Passed through directly (no mode-based resolution, consistent with the spec's explicit control design)

harumiWeb and others added 2 commits January 6, 2026 22:28
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

resolved_merged_cells = (
include_merged_cells if include_merged_cells is not None else mode != "light"
)

P2 Badge Ensure merged-value filtering works when merged cells are off

When include_merged_values_in_rows=False, the filtering logic depends on merged ranges being extracted, but resolved_merged_cells can still be False (e.g., light mode default or user disables merged cells). In that case no merged ranges are collected, _filter_rows_excluding_merged_values is a no-op, and merged values remain in rows, so the flag silently does nothing. Consider forcing merged-cell extraction when include_merged_values_in_rows is false (or decoupling extraction from output) so the flag consistently removes merged values as documented.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @docs/README.ja.md:
- Around line 378-415: Replace the incorrect version string "v0.15" with
"v0.3.5" in the paragraph that describes the merged_cells format (the line that
currently reads "`merged_cells` は v0.15 で…"); ensure that the text now matches
the section heading "互換性メモ(v0.3.5)", the other mentions in the same file, the
English README, and pyproject.toml so all references uniformly use "v0.3.5".
🧹 Nitpick comments (4)
tests/core/test_pipeline.py (1)

219-240: Consider adding edge case coverage.

The test verifies that merged values are excluded while non-merged data is preserved. Consider adding a test case where:

  • Multiple merged ranges overlap with the same row
  • Merged ranges span multiple rows (verifying all rows are filtered)
src/exstruct/core/pipeline.py (3)

659-673: Add docstring for _merge_intervals.

Per coding guidelines, functions should have Google-style docstrings. This function only has a one-line description without Args/Returns.

🔎 Suggested docstring
 def _merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]:
-    """Merge overlapping or adjacent intervals."""
+    """Merge overlapping or adjacent intervals.
+
+    Args:
+        intervals: List of (start, end) tuples representing column intervals.
+
+    Returns:
+        List of merged non-overlapping intervals, sorted by start index.
+    """
     if not intervals:
         return []

676-683: Linear search on sorted intervals could leverage binary search.

_col_in_intervals iterates linearly over sorted intervals. For workbooks with many merged regions, consider using bisect for O(log n) lookup. However, given typical Excel usage patterns, this is likely not a bottleneck.

🔎 Binary search alternative
import bisect

def _col_in_intervals(col_index: int, intervals: list[tuple[int, int]]) -> bool:
    """Check whether a column index falls in any interval."""
    if not intervals:
        return False
    starts = [iv[0] for iv in intervals]
    idx = bisect.bisect_right(starts, col_index) - 1
    if idx < 0:
        return False
    return intervals[idx][0] <= col_index <= intervals[idx][1]

686-691: Add docstring for _safe_col_index.

Per coding guidelines, functions should have Google-style docstrings with Args/Returns.

🔎 Suggested docstring
 def _safe_col_index(col_key: str) -> int | None:
-    """Parse a column key to int, returning None on failure."""
+    """Parse a column key to int, returning None on failure.
+
+    Args:
+        col_key: String column key to parse.
+
+    Returns:
+        Integer column index, or None if parsing fails.
+    """
     try:
         return int(col_key)
     except ValueError:
         return None
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c21d544 and 1cdcebb.

📒 Files selected for processing (10)
  • README.ja.md
  • README.md
  • docs/README.en.md
  • docs/README.ja.md
  • docs/release-notes/v0.3.5.md
  • mkdocs.yml
  • pyproject.toml
  • src/exstruct/core/pipeline.py
  • tests/core/test_pipeline.py
  • tests/models/test_models_export.py
✅ Files skipped from review due to trivial changes (2)
  • docs/release-notes/v0.3.5.md
  • mkdocs.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • README.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Add type hints to all function and method arguments and return values, must be mypy strict compliant
Use Any type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Each function must have a single responsibility; cyclomatic complexity must not exceed 12
Organize imports in the following order: 1) standard library, 2) third-party packages, 3) exstruct internal modules
Add Google-style docstrings to all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external entities to Pydantic models at the boundary layer
Generated code must pass mypy strict mode with zero errors
Generated code must pass Ruff linter checks (E, W, F, I, B, UP, N, C90 rules) with zero errors
Avoid God Functions (large monolithic functions that do too much)
Avoid God Objects (classes that have too many responsibilities)
Eliminate unnecessary nesting and deeply nested conditional branches
Avoid circular dependencies between modules
Write functions that are testable and avoid unnecessary side effects

Files:

  • tests/models/test_models_export.py
  • tests/core/test_pipeline.py
  • src/exstruct/core/pipeline.py
🧬 Code graph analysis (3)
tests/models/test_models_export.py (1)
src/exstruct/models/__init__.py (7)
  • CellRow (103-112)
  • MergedCells (84-100)
  • SheetData (159-253)
  • SmartArt (74-81)
  • to_json (198-203)
  • to_json (264-270)
  • to_json (348-353)
tests/core/test_pipeline.py (2)
src/exstruct/core/cells.py (1)
  • MergedCellRange (71-78)
src/exstruct/models/__init__.py (1)
  • CellRow (103-112)
src/exstruct/core/pipeline.py (2)
src/exstruct/core/cells.py (1)
  • MergedCellRange (71-78)
src/exstruct/core/modeling.py (1)
  • SheetRawData (20-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (12)
pyproject.toml (1)

3-3: LGTM! Version bump aligns with the feature release.

The version increment from 0.3.2 to 0.3.5 appropriately reflects the merged cell data model changes introduced in this PR.

README.ja.md (1)

378-415: Excellent migration documentation with clear examples.

The compatibility memo effectively documents the breaking change in merged_cells format, providing:

  • Clear version indication (v0.3.5)
  • Side-by-side comparison of old vs new formats
  • Practical migration code supporting both formats during transition

This provides consumers with the guidance needed to handle the serialization change.

tests/models/test_models_export.py (1)

141-155: LGTM! Test validates the new merged_cells schema format.

The test properly verifies that:

  • merged_cells is serialized with a schema field containing the expected column names
  • items are serialized as arrays matching the schema order
  • The docstring clearly explains the test's purpose (addressing the previous review comment)
docs/README.en.md (1)

396-433: Clear migration documentation for English readers.

The migration note effectively communicates the breaking change with:

  • Version indication (v0.3.5)
  • Format comparison (old array-of-objects vs new schema/items)
  • Practical Python helper function for backward compatibility

This provides the same quality guidance as the Japanese version, ensuring all consumers can handle the transition.

tests/core/test_pipeline.py (3)

5-14: LGTM on imports.

The new imports are correctly added for MergedCellRange and the internal _filter_rows_excluding_merged_values function to enable testing of the merged cell filtering functionality.


162-178: Good test coverage for the forced merged_cells behavior.

This test correctly verifies that when include_merged_values_in_rows=False is passed with include_merged_cells=False, the resolved inputs force include_merged_cells to True. This ensures merged cell metadata is available for filtering.


243-261: LGTM on filtering function tests.

These tests effectively verify that:

  1. Links are correctly filtered alongside cell values (line 254)
  2. Empty rows are dropped after filtering (line 261)

Good coverage of the _filter_rows_excluding_merged_values behavior.

src/exstruct/core/pipeline.py (5)

41-67: LGTM on ExtractionInputs extension.

The new include_merged_values_in_rows field is properly documented in the docstring and correctly typed as bool. The dataclass remains immutable with frozen=True.


228-229: Good defensive logic for merged cell forcing.

When include_merged_values_in_rows is False, merged cell metadata is required for filtering. Forcing resolved_merged_cells = True ensures the data is available. This is well-documented behavior tested in test_resolve_extraction_inputs_forces_merged_cells_when_excluding_values.


592-636: LGTM on the filtering implementation.

The function correctly:

  • Returns early when inputs are empty (line 605-606)
  • Preserves non-integer column keys (line 619-620)
  • Filters links consistently with cell values (line 627-634)
  • Drops rows that become empty after filtering (line 624-625)

694-746: LGTM on collect_sheet_raw_data updates.

The function correctly applies filtering when include_merged_values_in_rows is False and always passes merged_cells to SheetRawData for downstream use. The docstring is properly updated to document the new parameter.


853-870: LGTM on build_cells_tables_workbook fallback path.

The fallback workbook construction correctly mirrors the main path's merged-value filtering logic, ensuring consistent behavior regardless of which code path is taken.

@harumiWeb harumiWeb merged commit 3bd891a into main Jan 6, 2026
9 checks passed
@harumiWeb harumiWeb deleted the dev/merged_cells_compress branch January 6, 2026 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants