-
Notifications
You must be signed in to change notification settings - Fork 14
feat: Compressing the amount of context in merged cell data #40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- include_merged_values_in_rows(既定 True)で rows から結合セル値を除外可能に - include_merged_cells を false の場合は merged_cells を None 出力に統一
|
Warning Rate limit exceeded@harumiWeb has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 22 minutes and 41 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR replaces the flat merged-cell representation with a schema/items compressed Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as Client/CLI
participant Engine as ExStructEngine
participant Pipeline as Pipeline
participant Cells as CellsExtractor
participant Modeling as ModelBuilder
participant IO as Serializer/IO
CLI->>Engine: extract(file, StructOptions(include_merged_values_in_rows=?))
Engine->>Pipeline: resolve_extraction_inputs(..., include_merged_values_in_rows)
Pipeline->>Cells: collect_sheet_raw_data(..., include_merged_values_in_rows)
alt include_merged_values_in_rows == False
Cells-->>Pipeline: merged_ranges (MergedCellRange list)
Pipeline->>Pipeline: _filter_rows_excluding_merged_values(rows, merged_ranges)
else
Cells-->>Pipeline: merged_ranges (MergedCellRange list)
end
Pipeline->>Modeling: build_sheet_data(raw, merged_ranges)
Modeling->>Modeling: _build_merged_cells(merged_ranges) → MergedCells(schema, items)
Modeling-->>Engine: SheetData(merged_cells = MergedCells | null)
Engine->>IO: serialize(workbook) (model_dump by_alias=True)
IO-->>CLI: output JSON (merged_cells uses schema/items)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI Agents
In @docs/README.ja.md:
- Around line 363-371: The JSON output for merged_cells was changed from a list
of objects to a schema+items format (fields merged_cells, schema, items), which
is a breaking serialization change; update the docs to add a clear migration
note: describe the old format (array of objects), show the new schema+items
format with an explicit version number where it was introduced, provide a short
migration example showing how to map old objects to the new schema/items
representation and include parsing examples/snippets for consumers to support
both formats during transition.
In @tests/models/test_models_export.py:
- Around line 141-148: Add a Google-style docstring to the test function
test_sheet_json_includes_merged_cells_schema that briefly explains it verifies
that SheetData serialization (via SheetData.to_json) includes the merged_cells
schema and preserves merged cell items; mention the expected schema
["r1","c1","r2","c2","v"] and that the first item equals [1, 0, 1, 1, "merged"].
Locate the function using its name and references to SheetData, MergedCells, and
to_json and place the docstring immediately under the def line following
Google-style conventions.
🧹 Nitpick comments (6)
sample/forms_with_many_merged_cells/ja_general_form/example.py (1)
1-9: Consider adding type annotations for enhanced clarity.The example script clearly demonstrates the
include_merged_values_in_rowsoption. However, adding type annotations to the module-level variables would improve clarity and align with the project's type-safety standards.🔎 Proposed enhancement with type annotations
+from pathlib import Path + from exstruct import ExStructEngine, StructOptions -file_path = "ja_form.xlsx" +file_path: str = "ja_form.xlsx" engine = ExStructEngine( options=StructOptions(include_merged_values_in_rows=False), ) -wb = engine.extract(file_path, mode="standard") -engine.export(wb, "output.json", pretty=False) +wb = engine.extract(file_path, mode="standard") # WorkbookData +engine.export(wb, Path("output.json"), pretty=False)sample/forms_with_many_merged_cells/en_form_sf425/example.py (1)
1-9: Consider adding a module docstring and main guard for better example clarity.This example script correctly demonstrates the new
include_merged_values_in_rowsoption. However, as per the coding guidelines, consider adding:
- A module-level docstring explaining what this example demonstrates
- A
if __name__ == "__main__":guard around the execution logicThese additions would improve the example's educational value and follow Python best practices for sample scripts.
🔎 Proposed improvements
+""" +Example demonstrating ExStruct extraction with merged cell values excluded from row data. + +This script extracts data from sample.xlsx in standard mode with +include_merged_values_in_rows=False, which stores merged cell data separately +in the merged_cells field while omitting them from row cell values. +""" + from exstruct import ExStructEngine, StructOptions -file_path = "sample.xlsx" - -engine = ExStructEngine( - options=StructOptions(include_merged_values_in_rows=False), -) -wb = engine.extract(file_path, mode="standard") -engine.export(wb, "output.json", pretty=False) + +if __name__ == "__main__": + file_path = "sample.xlsx" + + engine = ExStructEngine( + options=StructOptions(include_merged_values_in_rows=False), + ) + wb = engine.extract(file_path, mode="standard") + engine.export(wb, "output.json", pretty=False)tests/integration/test_integrate_raw_data.py (2)
62-62: Consider adding test coverage for the new parameter's behavior.The test now passes
include_merged_values_in_rows=Truebut doesn't validate whether merged values are actually included in the rows. Consider adding assertions that verify this parameter's effect, or create a dedicated test that checks both True and False cases.
98-98: Consider adding test coverage for the new parameter's behavior.Similar to the other test, this passes
include_merged_values_in_rows=Truewithout validating the parameter's effect. Consider adding test coverage that verifies the behavior difference between True and False values.src/exstruct/io/__init__.py (1)
32-68: Update the docstring to document alias handling.The docstring should mention that model instances are serialized using field aliases when
by_alias=Trueis passed tomodel_dump(). This is important for understanding how fields likeschemaanditemsinMergedCellsappear in the output.🔎 Suggested docstring enhancement
def dict_without_empty_values(obj: object) -> JsonStructure: """ Remove None, empty string, empty list, and empty dict values from a nested structure or supported model object. - Recursively processes dicts, lists, and supported model types (WorkbookData, CellRow, Chart, PrintArea, PrintAreaView, Shape, Arrow, SmartArt). Model instances are converted to dictionaries with None fields excluded before recursive cleaning. Values considered empty and removed are: `None`, `""` (empty string), `[]` (empty list), and `{}` (empty dict). + Recursively processes dicts, lists, and supported model types (WorkbookData, CellRow, Chart, PrintArea, PrintAreaView, Shape, Arrow, SmartArt). Model instances are converted to dictionaries with None fields excluded and field aliases applied before recursive cleaning. Values considered empty and removed are: `None`, `""` (empty string), `[]` (empty list), and `{}` (empty dict). Parameters: obj (object): A value to clean; may be a dict, list, scalar, or one of the supported model instances. Returns: JsonStructure: The input structure with empty values removed, preserving other values and nesting. """schemas/workbook.json (1)
376-424: Consider addingrequiredproperty toMergedCellsdefinition.The
MergedCellsschema definesitemsandschemaproperties but does not specify them as required. This could allow empty objects{}to be valid, which may not be the intended behavior.🔎 Proposed fix
} }, + "required": [ + "schema", + "items" + ], "title": "MergedCells", "type": "object" },
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (42)
.gitignoreAGENTS.mdREADME.ja.mdREADME.mddocs/README.en.mddocs/README.ja.mddocs/agents/CODE_REVIEW.mddocs/agents/DATA_MODEL.mddocs/agents/FEATURE_SPEC.mddocs/agents/ROADMAP.mddocs/agents/TASKS.mddocs/release-notes/v0.3.2.mdsample/__init__.pysample/forms_with_many_merged_cells/__init__.pysample/forms_with_many_merged_cells/en_form_sf425/__init__.pysample/forms_with_many_merged_cells/en_form_sf425/example.pysample/forms_with_many_merged_cells/en_form_sf425/sample.jsonsample/forms_with_many_merged_cells/ja_general_form/__init__.pysample/forms_with_many_merged_cells/ja_general_form/example.pysample/forms_with_many_merged_cells/ja_general_form/ja_form.jsonschemas/merged_cells.jsonschemas/sheet.jsonschemas/workbook.jsonscripts/gen_json_schema.pysrc/exstruct/core/backends/base.pysrc/exstruct/core/cells.pysrc/exstruct/core/integrate.pysrc/exstruct/core/modeling.pysrc/exstruct/core/pipeline.pysrc/exstruct/engine.pysrc/exstruct/io/__init__.pysrc/exstruct/models/__init__.pytests/backends/test_auto_page_breaks.pytests/backends/test_merged_cells.pytests/core/test_merged_cells_core.pytests/core/test_pipeline.pytests/core/test_pipeline_fallbacks.pytests/engine/test_engine.pytests/export/test_export_requirements.pytests/integration/test_integrate_raw_data.pytests/models/test_modeling.pytests/models/test_models_export.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Add type hints to all function and method arguments and return values, must be mypy strict compliant
UseAnytype only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Each function must have a single responsibility; cyclomatic complexity must not exceed 12
Organize imports in the following order: 1) standard library, 2) third-party packages, 3) exstruct internal modules
Add Google-style docstrings to all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external entities to Pydantic models at the boundary layer
Generated code must pass mypy strict mode with zero errors
Generated code must pass Ruff linter checks (E, W, F, I, B, UP, N, C90 rules) with zero errors
Avoid God Functions (large monolithic functions that do too much)
Avoid God Objects (classes that have too many responsibilities)
Eliminate unnecessary nesting and deeply nested conditional branches
Avoid circular dependencies between modules
Write functions that are testable and avoid unnecessary side effects
Files:
tests/export/test_export_requirements.pysample/forms_with_many_merged_cells/en_form_sf425/example.pysample/forms_with_many_merged_cells/ja_general_form/__init__.pysample/forms_with_many_merged_cells/__init__.pysample/forms_with_many_merged_cells/ja_general_form/example.pytests/core/test_merged_cells_core.pytests/integration/test_integrate_raw_data.pytests/backends/test_merged_cells.pytests/models/test_modeling.pytests/core/test_pipeline_fallbacks.pysrc/exstruct/engine.pysrc/exstruct/core/integrate.pysrc/exstruct/core/backends/base.pytests/models/test_models_export.pysrc/exstruct/core/modeling.pysrc/exstruct/core/cells.pytests/backends/test_auto_page_breaks.pysample/__init__.pyscripts/gen_json_schema.pytests/engine/test_engine.pysample/forms_with_many_merged_cells/en_form_sf425/__init__.pysrc/exstruct/io/__init__.pysrc/exstruct/core/pipeline.pytests/core/test_pipeline.pysrc/exstruct/models/__init__.py
🧠 Learnings (1)
📚 Learning: 2025-12-27T14:32:54.717Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-27T14:32:54.717Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external entities to Pydantic models at the boundary layer
Applied to files:
src/exstruct/io/__init__.py
🧬 Code graph analysis (15)
tests/export/test_export_requirements.py (2)
src/exstruct/models/__init__.py (2)
SheetData(159-253)CellRow(103-112)tests/models/test_schemas_generated.py (1)
test_sheet_schema_definitions_present(23-30)
sample/forms_with_many_merged_cells/en_form_sf425/example.py (1)
src/exstruct/engine.py (2)
ExStructEngine(173-574)StructOptions(60-86)
sample/forms_with_many_merged_cells/ja_general_form/example.py (1)
src/exstruct/engine.py (2)
ExStructEngine(173-574)StructOptions(60-86)
tests/backends/test_merged_cells.py (2)
src/exstruct/core/cells.py (1)
MergedCellRange(71-78)src/exstruct/core/backends/openpyxl_backend.py (1)
extract_merged_cells(91-100)
tests/models/test_modeling.py (1)
src/exstruct/core/cells.py (1)
MergedCellRange(71-78)
src/exstruct/engine.py (1)
src/exstruct/__init__.py (1)
process_excel(318-398)
src/exstruct/core/integrate.py (1)
src/exstruct/core/backends/openpyxl_backend.py (1)
extract_cells(35-48)
src/exstruct/core/backends/base.py (2)
src/exstruct/models/__init__.py (2)
CellRow(103-112)PrintArea(150-156)src/exstruct/core/cells.py (2)
MergedCellRange(71-78)WorkbookColorsMap(53-67)
tests/models/test_models_export.py (1)
src/exstruct/models/__init__.py (4)
CellRow(103-112)MergedCells(84-100)SheetData(159-253)SmartArt(74-81)
src/exstruct/core/modeling.py (2)
src/exstruct/models/__init__.py (1)
MergedCells(84-100)src/exstruct/core/cells.py (1)
MergedCellRange(71-78)
scripts/gen_json_schema.py (1)
src/exstruct/models/__init__.py (1)
MergedCells(84-100)
tests/engine/test_engine.py (1)
src/exstruct/models/__init__.py (1)
MergedCells(84-100)
src/exstruct/core/pipeline.py (3)
src/exstruct/core/cells.py (1)
MergedCellRange(71-78)src/exstruct/models/__init__.py (1)
CellRow(103-112)src/exstruct/core/modeling.py (1)
SheetRawData(20-41)
tests/core/test_pipeline.py (3)
src/exstruct/core/cells.py (1)
MergedCellRange(71-78)src/exstruct/core/pipeline.py (3)
ExtractionInputs(42-67)ExtractionArtifacts(71-90)build_cells_tables_workbook(821-870)src/exstruct/models/__init__.py (1)
CellRow(103-112)
src/exstruct/models/__init__.py (1)
src/exstruct/io/__init__.py (1)
dict_without_empty_values(32-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (72)
.gitignore (1)
6-6: LGTM!The addition of the
drafts/directory to.gitignoreis straightforward and follows the existing pattern.tests/backends/test_auto_page_breaks.py (1)
31-31: LGTM!The new parameter
include_merged_values_in_rowsis properly typed and aligns the mock signature with the actualextract_workbookinterface. It's appropriate for test mocks to include parameters that match the real signature, even when not used in the mock implementation.sample/forms_with_many_merged_cells/ja_general_form/__init__.py (1)
1-1: LGTM!This package marker file is appropriately minimal and clearly documented.
docs/release-notes/v0.3.2.md (1)
8-8: LGTM!The documentation accurately reflects the updated API, changing from
MergedCelltoMergedCellsformat in line with the PR's refactoring.sample/forms_with_many_merged_cells/en_form_sf425/__init__.py (1)
1-1: LGTM!This package marker file follows the same clean pattern as other sample packages and is appropriately documented.
AGENTS.md (1)
202-210: New documentation section is well-structured and actionable.The addition of section "10. 各種仕様の確認" provides clear guidance for AI agents to reference relevant documentation when working on ExStruct development. The referenced files (pipeline.md, architecture.md, CODING_GUIDELINES.md, DATA_MODEL.md, TASKS.md) are appropriately listed with descriptions.
sample/__init__.py (1)
1-1: Minimal package marker is appropriate.The module docstring provides a clear identifier for the sample package.
docs/agents/ROADMAP.md (1)
43-53: Roadmap updates accurately reflect the merged cells compression feature and future work.The version assignments are clear:
- v0.3.5: Merged cell range data compression and rows structure improvements (current PR)
- v0.3.6: Formula retrieval option (forthcoming)
- v0.5.0: Excel Form Controls analysis (future major milestone)
The changes align with the PR objectives and maintain a logical feature roadmap.
docs/agents/TASKS.md (1)
1-11: LGTM!The task list accurately reflects the implementation status. One remaining item (Line 10) for documentation consistency check is appropriately marked incomplete.
src/exstruct/core/cells.py (4)
18-19: LGTM!Import is correctly organized with other internal module imports.
70-78: LGTM!The frozen dataclass is well-defined with proper type hints. The field names align with the schema definition (
r1,c1,r2,c2,v).
540-556: LGTM!The function signature, type hints, and Google-style docstring are correct. The return type properly reflects the new
MergedCellRangemodel.
556-573: LGTM!The implementation correctly:
- Normalizes empty merged cell values to a single space
" "per specification- Converts column indices to 0-based (
min_col - 1,max_col - 1)- Preserves 1-based row indices
docs/agents/DATA_MODEL.md (3)
153-165: LGTM!The MergedCells model documentation accurately describes the new schema-based format with:
schema: field ordering arrayitems: array of tuples matching the schema- Empty value normalization to
" "documented at line 164
168-189: LGTM!The SheetData model documentation correctly reflects:
merged_cellstype asMergedCells | null- The
include_merged_values_in_rowsflag behavior documented in the notes
239-253: LGTM!The changelog entries properly document the evolution from
MergedCell(v0.14) to the compressedMergedCellsschema+items format (v0.15).sample/forms_with_many_merged_cells/__init__.py (1)
1-1: LGTM!Standard package marker with an appropriate docstring.
tests/models/test_modeling.py (2)
1-3: LGTM!Import correctly added for
MergedCellRangealongside existing model imports.
29-42: LGTM!Test correctly:
- Uses
MergedCellRangewith the normalized space valuev=" "- Verifies that
merged_cellsis notNoneafter building workbook data, confirming the transformation fromlist[MergedCellRange]toMergedCellsworkstests/export/test_export_requirements.py (1)
117-120: LGTM!The test correctly uses
merged_cells=Noneinstead of an empty list, aligning with the updatedSheetData.merged_cellstype (MergedCells | None). The assertion verifies thatNonevalues are properly omitted from JSON output.tests/core/test_merged_cells_core.py (1)
18-28: LGTM!The test correctly expects
" "(single space) for the merged rangeD4:E4which has no value. This aligns with the normalization logic inextract_sheet_merged_cellswhere empty strings are replaced with a space character.docs/README.en.md (1)
380-389: LGTM! Clear documentation of the new schema-based structure.The documentation clearly illustrates the new
merged_cellsrepresentation with theschemaanditemsfields. The example effectively demonstrates how merged cell data is now structured as a tabular payload rather than a flat array.scripts/gen_json_schema.py (2)
13-13: LGTM! Import follows established pattern.The addition of
MergedCellsto the imports is consistent with the existing model imports and enables schema generation for the new data structure.
63-63: LGTM! Schema generation target added correctly.The
"merged_cells": MergedCellsentry follows the same pattern as other models in the targets dictionary, ensuring the new schema is generated alongside existing schemas.src/exstruct/core/integrate.py (3)
21-21: LGTM! Parameter correctly typed with backward-compatible default.The
include_merged_values_in_rowsparameter is properly typed asboolwith a default value ofTrue, maintaining backward compatibility while enabling the new functionality.
37-37: LGTM! Clear parameter documentation.The docstring accurately describes the purpose of the
include_merged_values_in_rowsparameter and follows the established documentation pattern.
55-55: LGTM! Parameter correctly forwarded.The parameter is properly forwarded to
resolve_extraction_inputs, maintaining consistency through the extraction pipeline.tests/core/test_pipeline_fallbacks.py (3)
38-38: LGTM! Test updated with new parameter.The test correctly includes the
include_merged_values_in_rows=Trueparameter, maintaining explicit clarity in test expectations.
72-72: LGTM! Test updated with new parameter.The test correctly includes the
include_merged_values_in_rows=Trueparameter, maintaining consistency with the updated API.
114-114: LGTM! Test updated with new parameter.The test correctly includes the
include_merged_values_in_rows=Trueparameter, ensuring all fallback scenarios are tested with the new API.README.ja.md (1)
363-371: LGTM! Merged cells representation is consistent with the new schema.The JSON example correctly demonstrates the new schema-based merged_cells structure with the
schemaarray anditemsarray format. This aligns with the MergedCells model introduced in the PR.schemas/merged_cells.json (1)
1-50: LGTM! Schema definition is correct and well-structured.The JSON Schema correctly defines the MergedCells structure with:
- Proper tuple constraints (minItems/maxItems = 5)
- Correct type definitions for the 5-element array (4 integers + 1 string)
- Enum validation for the schema field
- Clear descriptions for all fields
This aligns perfectly with the Pydantic MergedCells model introduced in the PR.
tests/engine/test_engine.py (3)
17-17: LGTM! Import updated to use the new MergedCells model.The import change from
MergedCelltoMergedCellscorrectly reflects the renamed public API introduced in this PR.
38-38: LGTM! Parameter addition maintains signature consistency.The
include_merged_values_in_rowsparameter correctly mirrors the realextract_workbooksignature with the appropriate default value ofTrue. This ensures the fake function matches the actual API for testing purposes.
69-69: LGTM! Correctly uses the new MergedCells schema-based constructor.The usage of
MergedCells(items=[(1, 0, 1, 1, "merged")])properly demonstrates the new compressed representation where merged cell data is provided as a list of tuples rather than individual field arguments.src/exstruct/engine.py (4)
74-75: LGTM - Clear documentation of new option.The docstring clearly describes the purpose of the new
include_merged_values_in_rowsoption.
85-85: LGTM - Appropriate default value.The default value of
Truepreserves backward compatibility by keeping merged values in rows by default.
362-362: LGTM - Correctly propagates new option.The new
include_merged_values_in_rowsoption is properly passed through toextract_workbook.
289-291: No action needed—None handling is already properly implemented.The
SheetData.merged_cellsfield is correctly typed asMergedCells | Noneand all consumers handleNoneappropriately:
- The type annotation already supports both values.
- Serialization uses
exclude_none=True, automatically omittingNonefrom JSON output.- Tests verify the behavior:
test_merged_cells_empty_is_omitted_in_sheet_json()confirms thatmerged_cells=Noneis excluded from exported JSON.- Callers that reference
merged_cellseither explicitly check forNone(e.g.,test_core/test_pipeline.py:197) or rely on safe serialization patterns.tests/backends/test_merged_cells.py (3)
11-11: LGTM - Import updated for renamed type.The import correctly reflects the rename from
MergedCelltoMergedCellRangeand the move fromexstruct.modelstoexstruct.core.cells.
35-35: LGTM - Test assertion updated for renamed type.The test correctly uses the renamed
MergedCellRangetype.
41-41: LGTM - Type hint updated consistently.The return type annotation correctly uses the renamed
MergedCellRangetype.src/exstruct/core/backends/base.py (2)
6-7: LGTM - Imports updated for type refactoring.The imports correctly reflect the rename from
MergedCelltoMergedCellRangeand its move fromexstruct.modelstoexstruct.core.cells. The addition ofWorkbookColorsMapimport is also appropriate.
11-11: LGTM - Type alias updated consistently.The
MergedCellDatatype alias correctly uses the renamedMergedCellRangetype, maintaining consistency with the refactoring across the codebase.src/exstruct/io/__init__.py (1)
374-376: LGTM! Consistent alias-based serialization.The addition of
by_alias=Trueacross all export and serialization paths ensures that field aliases (e.g.,schemaanditemsinMergedCells) are consistently honored in the output. This aligns well with the PR objective to compress merged cell data representation.Also applies to: 435-437, 462-464, 489-489, 528-528
schemas/sheet.json (2)
376-424: LGTM! Well-structured compressed schema.The
MergedCellsdefinition effectively compresses merged cell data using a schema-based approach. The field descriptions clearly specify the coordinate systems (1-based rows, 0-based columns), and the type constraints properly enforce the 5-element tuple structure [r1, c1, r2, c2, v].
721-732: LGTM! Proper integration with SheetData schema.The
merged_cellsfield is correctly defined as an optional field (nullable with default null) that references the newMergedCellsdefinition. This aligns with the pattern used for other optional fields in the SheetData schema.tests/core/test_pipeline.py (3)
5-5: LGTM! Proper test updates for new parameter.The import of
MergedCellRangeand the addition ofinclude_merged_values_in_rows=Trueto all existing test cases ensures backward compatibility while exercising the new API. The consistent pattern across all tests is good for maintainability.Also applies to: 30-30, 50-50, 75-75, 93-93, 117-117, 135-135, 152-152, 182-182
161-197: LGTM! Correct assertion for empty merged cells.The assertion
assert sheet.merged_cells is Noneat line 197 is correct. Whenmerged_cell_datais provided as an empty list (line 187), the resultingsheet.merged_cellsshould beNone, reflecting the absence of merged cells in the output.
200-221: LGTM! Comprehensive test for merged value filtering.The new test properly verifies that when
include_merged_values_in_rows=False, cells covered by merged cell ranges are correctly filtered from the row data. The test clearly demonstrates that cells "0" and "1" (covered by the merged range) are excluded, while cell "2" remains. This effectively validates the new feature.README.md (3)
380-389: LGTM! Clear documentation of new merged_cells format.The example effectively demonstrates the new compressed
merged_cellsrepresentation using the schema-based approach. The structure with"schema": ["r1", "c1", "r2", "c2", "v"]and corresponding items array clearly shows how merged cell data is now organized, which aligns perfectly with the PR objective.
597-601: LGTM! Helpful documentation build instructions.The new "Documentation build" section provides clear, actionable commands for contributors who need to generate and build documentation locally. The use of
uv runfor command execution is appropriate for the development environment.
5-5: All updated image paths are valid. The files exist at their new locations:
docs/assets/icon.webp✓docs/assets/demo_sheet.png✓docs/assets/demo_form_en.png✓sample/forms_with_many_merged_cells/ja_general_form/ja_form.json (1)
118-219: LGTM! Proper demonstration of new merged_cells format.The sample file correctly demonstrates the new compressed
merged_cellsrepresentation with the schema-based approach. The structure matches the JSON schema definition, showing real-world usage with merged cell ranges of varying spans and Japanese text content.docs/agents/FEATURE_SPEC.md (1)
1-58: Documentation updates are clear and well-structured.The specification clearly describes:
- The new
schema + itemsformat formerged_cells- The
include_merged_values_in_rowsflag with sensible default (True) for backward compatibility- Coordinate conventions (row 1-based, col 0-based)
This aligns well with the implementation changes across the codebase.
schemas/workbook.json (1)
597-608: Schema addition formerged_cellsinSheetDatalooks correct.The
anyOfpattern withMergedCellsornull(defaultnull) appropriately models the optional merged cells data.src/exstruct/core/modeling.py (2)
78-92: Implementation is correct and follows coding guidelines.The
_build_merged_cellsfunction:
- Has proper type hints for arguments and return value
- Includes a Google-style docstring
- Has a single responsibility (transform raw ranges to model)
- Returns
Nonefor empty input, avoiding emptyMergedCellsobjects
41-41: Type change formerged_cellsfield is consistent with the new data model.The field type
list[MergedCellRange]correctly represents raw extracted data before conversion to the compressedMergedCellsmodel.src/exstruct/models/__init__.py (3)
11-13: Helper function follows coding guidelines.The
_default_merged_cells_schemafunction has proper type hints and a concise docstring. The use ofLiteraltypes ensures type safety for the schema field names.
84-100: Well-designedMergedCellsmodel with proper alias handling.The implementation correctly:
- Uses
schema_withalias="schema"to avoid Python keyword conflict- Enables
populate_by_name=Truefor flexible input handling- Provides sensible defaults via
default_factory- Has accurate type hints for the tuple structure
194-196: Serialization correctly usesby_alias=True.This ensures the
schema_field is serialized as"schema"in JSON output. This is consistent withdict_without_empty_valuesinsrc/exstruct/io/__init__.pywhich also usesby_alias=True.sample/forms_with_many_merged_cells/en_form_sf425/sample.json (1)
165-245: Sample data correctly demonstrates the newmerged_cellsformat.The
schema + itemsstructure is properly used with:
- Schema defining
["r1", "c1", "r2", "c2", "v"]field order- Items as arrays of 5 elements matching the schema
- Multi-line strings preserved in the value field (e.g., Paperwork Burden Statement)
This serves as a good reference for the new format.
src/exstruct/core/pipeline.py (10)
24-34: Type alias and import updates are consistent with the new data model.The change from
MergedCelltoMergedCellRangealigns with the broader refactoring. TheMergedCellDatatype alias correctly reflects this change.
590-634: Row filtering logic is correct and well-structured.The function properly:
- Returns early when no filtering is needed
- Uses interval-based lookup for efficient column checking
- Preserves non-merged cells and their associated links
- Skips entirely empty rows after filtering
- Creates new
CellRowinstances to avoid mutation
637-654: Interval building logic is correct.The function correctly:
- Expands merged cell ranges to all covered rows
- Accumulates column intervals per row
- Merges overlapping intervals for efficient lookup
657-671: Interval merging is correctly implemented.The standard interval merging algorithm is correctly applied. Minor note: the docstring is minimal but adequate for this straightforward utility function.
674-681: Column interval check is efficient.The function leverages sorted intervals (from
_merge_intervals) to enable early termination whencol_index < start, avoiding unnecessary iterations.
684-689: Safe parsing handles edge cases gracefully.Non-integer column keys are preserved in the output by returning
None, ensuring they are not incorrectly filtered out.
692-744: Sheet data collection correctly integrates the new filtering behavior.The implementation:
- Accepts and documents the new
include_merged_values_in_rowsparameter- Conditionally applies row filtering based on the flag
- Preserves
merged_cellsmetadata regardless of the flag valueThis aligns with the specification that rows can omit merged values while metadata remains intact.
784-799: Pipeline correctly propagates the flag through extraction.The
include_merged_values_in_rowsflag is passed frominputstocollect_sheet_raw_data, ensuring consistent behavior throughout the extraction pipeline.
846-868: Fallback workbook building correctly applies the same filtering logic.The fallback path mirrors the main extraction path by:
- Applying row filtering when
include_merged_values_in_rowsisFalse- Preserving
merged_cellsmetadata in the outputThis ensures consistent behavior regardless of whether COM extraction succeeds or falls back.
55-67: New field properly added toExtractionInputs.The
include_merged_values_in_rowsfield is:
- Documented in the class docstring
- Passed through directly (no mode-based resolution, consistent with the spec's explicit control design)
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
exstruct/src/exstruct/core/pipeline.py
Lines 225 to 227 in c21d544
| resolved_merged_cells = ( | |
| include_merged_cells if include_merged_cells is not None else mode != "light" | |
| ) |
When include_merged_values_in_rows=False, the filtering logic depends on merged ranges being extracted, but resolved_merged_cells can still be False (e.g., light mode default or user disables merged cells). In that case no merged ranges are collected, _filter_rows_excluding_merged_values is a no-op, and merged values remain in rows, so the flag silently does nothing. Consider forcing merged-cell extraction when include_merged_values_in_rows is false (or decoupling extraction from output) so the flag consistently removes merged values as documented.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…に resolved_merged_cells=True を強制
…Web/exstruct into dev/merged_cells_compress
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI Agents
In @docs/README.ja.md:
- Around line 378-415: Replace the incorrect version string "v0.15" with
"v0.3.5" in the paragraph that describes the merged_cells format (the line that
currently reads "`merged_cells` は v0.15 で…"); ensure that the text now matches
the section heading "互換性メモ(v0.3.5)", the other mentions in the same file, the
English README, and pyproject.toml so all references uniformly use "v0.3.5".
🧹 Nitpick comments (4)
tests/core/test_pipeline.py (1)
219-240: Consider adding edge case coverage.The test verifies that merged values are excluded while non-merged data is preserved. Consider adding a test case where:
- Multiple merged ranges overlap with the same row
- Merged ranges span multiple rows (verifying all rows are filtered)
src/exstruct/core/pipeline.py (3)
659-673: Add docstring for_merge_intervals.Per coding guidelines, functions should have Google-style docstrings. This function only has a one-line description without Args/Returns.
🔎 Suggested docstring
def _merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]: - """Merge overlapping or adjacent intervals.""" + """Merge overlapping or adjacent intervals. + + Args: + intervals: List of (start, end) tuples representing column intervals. + + Returns: + List of merged non-overlapping intervals, sorted by start index. + """ if not intervals: return []
676-683: Linear search on sorted intervals could leverage binary search.
_col_in_intervalsiterates linearly over sorted intervals. For workbooks with many merged regions, consider usingbisectfor O(log n) lookup. However, given typical Excel usage patterns, this is likely not a bottleneck.🔎 Binary search alternative
import bisect def _col_in_intervals(col_index: int, intervals: list[tuple[int, int]]) -> bool: """Check whether a column index falls in any interval.""" if not intervals: return False starts = [iv[0] for iv in intervals] idx = bisect.bisect_right(starts, col_index) - 1 if idx < 0: return False return intervals[idx][0] <= col_index <= intervals[idx][1]
686-691: Add docstring for_safe_col_index.Per coding guidelines, functions should have Google-style docstrings with Args/Returns.
🔎 Suggested docstring
def _safe_col_index(col_key: str) -> int | None: - """Parse a column key to int, returning None on failure.""" + """Parse a column key to int, returning None on failure. + + Args: + col_key: String column key to parse. + + Returns: + Integer column index, or None if parsing fails. + """ try: return int(col_key) except ValueError: return None
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
README.ja.mdREADME.mddocs/README.en.mddocs/README.ja.mddocs/release-notes/v0.3.5.mdmkdocs.ymlpyproject.tomlsrc/exstruct/core/pipeline.pytests/core/test_pipeline.pytests/models/test_models_export.py
✅ Files skipped from review due to trivial changes (2)
- docs/release-notes/v0.3.5.md
- mkdocs.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- README.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Add type hints to all function and method arguments and return values, must be mypy strict compliant
UseAnytype only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Each function must have a single responsibility; cyclomatic complexity must not exceed 12
Organize imports in the following order: 1) standard library, 2) third-party packages, 3) exstruct internal modules
Add Google-style docstrings to all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external entities to Pydantic models at the boundary layer
Generated code must pass mypy strict mode with zero errors
Generated code must pass Ruff linter checks (E, W, F, I, B, UP, N, C90 rules) with zero errors
Avoid God Functions (large monolithic functions that do too much)
Avoid God Objects (classes that have too many responsibilities)
Eliminate unnecessary nesting and deeply nested conditional branches
Avoid circular dependencies between modules
Write functions that are testable and avoid unnecessary side effects
Files:
tests/models/test_models_export.pytests/core/test_pipeline.pysrc/exstruct/core/pipeline.py
🧬 Code graph analysis (3)
tests/models/test_models_export.py (1)
src/exstruct/models/__init__.py (7)
CellRow(103-112)MergedCells(84-100)SheetData(159-253)SmartArt(74-81)to_json(198-203)to_json(264-270)to_json(348-353)
tests/core/test_pipeline.py (2)
src/exstruct/core/cells.py (1)
MergedCellRange(71-78)src/exstruct/models/__init__.py (1)
CellRow(103-112)
src/exstruct/core/pipeline.py (2)
src/exstruct/core/cells.py (1)
MergedCellRange(71-78)src/exstruct/core/modeling.py (1)
SheetRawData(20-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (12)
pyproject.toml (1)
3-3: LGTM! Version bump aligns with the feature release.The version increment from 0.3.2 to 0.3.5 appropriately reflects the merged cell data model changes introduced in this PR.
README.ja.md (1)
378-415: Excellent migration documentation with clear examples.The compatibility memo effectively documents the breaking change in merged_cells format, providing:
- Clear version indication (v0.3.5)
- Side-by-side comparison of old vs new formats
- Practical migration code supporting both formats during transition
This provides consumers with the guidance needed to handle the serialization change.
tests/models/test_models_export.py (1)
141-155: LGTM! Test validates the new merged_cells schema format.The test properly verifies that:
merged_cellsis serialized with aschemafield containing the expected column namesitemsare serialized as arrays matching the schema order- The docstring clearly explains the test's purpose (addressing the previous review comment)
docs/README.en.md (1)
396-433: Clear migration documentation for English readers.The migration note effectively communicates the breaking change with:
- Version indication (v0.3.5)
- Format comparison (old array-of-objects vs new schema/items)
- Practical Python helper function for backward compatibility
This provides the same quality guidance as the Japanese version, ensuring all consumers can handle the transition.
tests/core/test_pipeline.py (3)
5-14: LGTM on imports.The new imports are correctly added for
MergedCellRangeand the internal_filter_rows_excluding_merged_valuesfunction to enable testing of the merged cell filtering functionality.
162-178: Good test coverage for the forced merged_cells behavior.This test correctly verifies that when
include_merged_values_in_rows=Falseis passed withinclude_merged_cells=False, the resolved inputs forceinclude_merged_cellstoTrue. This ensures merged cell metadata is available for filtering.
243-261: LGTM on filtering function tests.These tests effectively verify that:
- Links are correctly filtered alongside cell values (line 254)
- Empty rows are dropped after filtering (line 261)
Good coverage of the
_filter_rows_excluding_merged_valuesbehavior.src/exstruct/core/pipeline.py (5)
41-67: LGTM on ExtractionInputs extension.The new
include_merged_values_in_rowsfield is properly documented in the docstring and correctly typed asbool. The dataclass remains immutable withfrozen=True.
228-229: Good defensive logic for merged cell forcing.When
include_merged_values_in_rowsisFalse, merged cell metadata is required for filtering. Forcingresolved_merged_cells = Trueensures the data is available. This is well-documented behavior tested intest_resolve_extraction_inputs_forces_merged_cells_when_excluding_values.
592-636: LGTM on the filtering implementation.The function correctly:
- Returns early when inputs are empty (line 605-606)
- Preserves non-integer column keys (line 619-620)
- Filters links consistently with cell values (line 627-634)
- Drops rows that become empty after filtering (line 624-625)
694-746: LGTM oncollect_sheet_raw_dataupdates.The function correctly applies filtering when
include_merged_values_in_rowsisFalseand always passesmerged_cellstoSheetRawDatafor downstream use. The docstring is properly updated to document the new parameter.
853-870: LGTM onbuild_cells_tables_workbookfallback path.The fallback workbook construction correctly mirrors the main path's merged-value filtering logic, ensuring consistent behavior regardless of which code path is taken.
Summary by CodeRabbit
New Features
Documentation
Chores
✏️ Tip: You can customize this high-level summary in your review settings.