Dev/refactor #23
Conversation
- Turned the extraction pre-processing into a pipeline and simplified integrate.py: integrate.py - Added tests verifying the pipeline behavior: test_pipeline.py - Added test requirements: TEST_REQUIREMENTS.md - Reflected task completion checks: TASKS.md
- Implemented the openpyxl backend: openpyxl_backend.py - Implemented the COM backend: com_backend.py - Migrated the pipeline onto the backends: pipeline.py - Tidied up the COM-side calls: integrate.py - Added tests: test_backends.py - Added test requirements: TEST_REQUIREMENTS.md - Task completion checks: TASKS.md
- Added a utility for shared range detection: ranges.py - Added tests for the shared table detection: test_table_detection_common.py - Added test requirements: TEST_REQUIREMENTS.md - Task completion checks: TASKS.md
- Switched the existing output handling to the shared functions: __init__.py - Added tests: test_io_common.py - Added test requirements: TEST_REQUIREMENTS.md - Task completion checks: TASKS.md
- Wrapped xlwings open/close in a contextmanager and applied it to integrate.py: integrate.py - Wrapped the openpyxl usage sites in a contextmanager: cells.py, openpyxl_backend.py - Unified fallback reason codes and logging: errors.py, logging_utils.py - Added tests: test_workbook_context.py, test_logging_utils.py - Added test requirements: TEST_REQUIREMENTS.md - Task updates: TASKS.md (updated the workbook-management and logging-unification sub-items)
…rkbook_data now produces SheetData/WorkbookData. integrate_sheet_content has been removed and split into collection functions, and the fallback generation on the pipeline side has also been reorganized to go through the new builders.
…M overrides, filling in missing print_areas, and adds run_extraction_pipeline - integrate.py is now complete with just resolve_extraction_inputs + run_extraction_pipeline - Updated test_pipeline.py and other related tests (COM step composition, changed monkeypatch targets, etc.) - Added the pipeline's COM integration requirements to TEST_REQUIREMENTS.md
📝 Walkthrough
Introduce a structured extraction pipeline: add backends (OpenPyXL/COM), pipeline orchestration with pre‑COM/COM steps and fallback handling, context managers and range parsing, raw→model builders, nested OutputOptions and serialization utilities, logging helpers, many tests, docs, and CI/tooling updates.
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Client
participant Pipeline
participant OpenpyxlBackend
participant ComBackend
participant Modeling
Client->>Pipeline: run_extraction_pipeline(inputs)
activate Pipeline
Pipeline->>Pipeline: resolve_extraction_inputs()
Pipeline->>Pipeline: build_pipeline_plan()
rect rgb(235,245,255)
note over Pipeline,OpenpyxlBackend: Pre‑COM (OpenPyXL) steps
Pipeline->>OpenpyxlBackend: extract_cells(include_links)
OpenpyxlBackend-->>Pipeline: CellData
Pipeline->>OpenpyxlBackend: extract_print_areas()
OpenpyxlBackend-->>Pipeline: PrintAreaData
Pipeline->>OpenpyxlBackend: extract_colors_map()
OpenpyxlBackend-->>Pipeline: WorkbookColorsMap | None
end
alt use_com == true
rect rgb(235,255,235)
note over Pipeline,ComBackend: COM (xlwings) steps
Pipeline->>ComBackend: extract_shapes_com()
ComBackend-->>Pipeline: ShapeData
Pipeline->>ComBackend: extract_charts_com()
ComBackend-->>Pipeline: ChartData
Pipeline->>ComBackend: extract_print_areas_com()
ComBackend-->>Pipeline: PrintAreaData
Pipeline->>ComBackend: extract_auto_page_breaks_com()
ComBackend-->>Pipeline: PrintAreaData
end
else COM unavailable/failed
rect rgb(255,240,230)
note right of Pipeline: log_fallback(reason)\nuse OpenPyXL-only artifacts
end
end
Pipeline->>Modeling: collect_sheet_raw_data(artifacts)
Modeling-->>Pipeline: dict[str,SheetRawData]
Pipeline->>Modeling: build_workbook_data(raw)
Modeling-->>Pipeline: WorkbookData
Pipeline-->>Client: PipelineResult{workbook, artifacts, state}
deactivate Pipeline
```
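To make the flow above concrete, a minimal client-side sketch might look like the following. Only the function names and the `PipelineResult{workbook, artifacts, state}` shape come from the diagram; the arguments and defaults shown here are assumptions, not the confirmed API.

```python
# Hypothetical usage of the pipeline entry points named in the diagram.
from pathlib import Path

from exstruct.core.pipeline import resolve_extraction_inputs, run_extraction_pipeline

inputs = resolve_extraction_inputs(Path("book.xlsx"), mode="standard")  # args assumed
result = run_extraction_pipeline(inputs)

# Per the diagram, the result bundles the built model plus intermediate artifacts/state.
workbook_data = result.workbook
```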
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ❌ Failed checks (2 warnings), ✅ Passed checks (1 passed)
Welcome to Codecov 🎉 Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. ℹ️ You can also turn on project coverage checks and project coverage reporting on pull request comments. Thanks for integrating Codecov - we've got you covered ☂️
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/exstruct/core/cells.py (2)
502-526: Redundant `wb.close()` call after context manager exit. Line 525 calls `wb.close()` after the `with openpyxl_workbook(...)` block has already exited. The context manager handles closing in its `finally` block, so this call is redundant and may raise an error if the workbook is already closed.
🔎 Proposed fix
```diff
     with openpyxl_workbook(file_path, data_only=True, read_only=False) as wb:
         for ws in wb.worksheets:
             sheet_links: dict[int, dict[str, str]] = {}
             for row in ws.iter_rows():
                 for cell in row:
                     link = getattr(cell, "hyperlink", None)
                     target = getattr(link, "target", None) if link else None
                     if not target:
                         continue
                     col_str = str(
                         cell.col_idx - 1
                     )  # zero-based to align with extract_sheet_cells
                     sheet_links.setdefault(cell.row, {})[col_str] = target
             links_by_sheet[ws.title] = sheet_links

     merged: dict[str, list[CellRow]] = {}
     for sheet_name, rows in cell_rows.items():
         sheet_links = links_by_sheet.get(sheet_name, {})
         merged_rows: list[CellRow] = []
         for row in rows:
             links = sheet_links.get(row.r, {})
             merged_rows.append(CellRow(r=row.r, c=row.c, links=links or None))
         merged[sheet_name] = merged_rows
-    wb.close()
     return merged
```
659-715: Worksheet object used after the context manager closes the workbook. The worksheet `ws` is obtained inside the `with openpyxl_workbook(...)` block (line 666), but it is used in the loop at lines 692-714 after the context manager exits (line 678 ends the block at line 677). This may cause issues since the workbook is closed when the context exits.
🔎 Proposed fix: Move the entire processing inside the context manager
```diff
 def load_border_maps_xlsx(  # noqa: C901
     xlsx_path: Path, sheet_name: str
 ) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray, int, int]:
     with openpyxl_workbook(xlsx_path, data_only=True, read_only=False) as wb:
         if sheet_name not in wb.sheetnames:
             raise KeyError(f"Sheet '{sheet_name}' not found in {xlsx_path}")
         ws = wb[sheet_name]
         try:
             min_col, min_row, max_col, max_row = range_boundaries(
                 ws.calculate_dimension()
             )
         except Exception:
             min_col, min_row, max_col, max_row = (
                 1,
                 1,
                 ws.max_column or 1,
                 ws.max_row or 1,
             )
-    shape = (max_row + 1, max_col + 1)
-    # ... rest of function outside context
+        shape = (max_row + 1, max_col + 1)
+        has_border = np.zeros(shape, dtype=bool)
+        top_edge = np.zeros(shape, dtype=bool)
+        bottom_edge = np.zeros(shape, dtype=bool)
+        left_edge = np.zeros(shape, dtype=bool)
+        right_edge = np.zeros(shape, dtype=bool)
+
+        def edge_has_style(edge: object) -> bool:
+            if edge is None:
+                return False
+            style = getattr(edge, "style", None)
+            return style is not None and style != "none"
+
+        for r in range(min_row, max_row + 1):
+            for c in range(min_col, max_col + 1):
+                cell = ws.cell(row=r, column=c)
+                b = getattr(cell, "border", None)
+                if b is None:
+                    continue
+                # ... rest of processing
+
+        return has_border, top_edge, bottom_edge, left_edge, right_edge, max_row, max_col
```
🧹 Nitpick comments (17)
pyproject.toml (1)
122-128: Task definitions look good; consider removing the redundant `--strict` flag. The taskipy task definitions are well-structured and provide useful shortcuts for common development workflows. However, the `mypy` task explicitly passes the `--strict` flag even though `[tool.mypy]` already sets `strict = true` (line 111), making the flag redundant.
🔎 Optional: Remove redundant flag
```diff
-mypy = "mypy src/exstruct --strict"
+mypy = "mypy src/exstruct"
```
tests/test_table_detection_common.py (1)
3-6: Consider testing public APIs instead of private functions. The tests directly import and invoke private functions (`_merge_rectangles`, `_collect_table_candidates_from_values`) prefixed with underscores. Private functions are implementation details that may change, making tests brittle. If these functions represent important behavior, consider:
- Exposing public wrappers if the functionality should be tested
- Testing them indirectly through public APIs (e.g., backend methods that use these utilities)
tests/test_backends.py (1)
55-73: Consider adding type hints to the DummyWorkbook class. The `DummyWorkbook` class (lines 66-67) lacks type hints and serves as a test stub. While this is acceptable for test code, adding a protocol or explicit type would improve clarity.
Optional: Add Protocol for better type safety
```diff
+from typing import Protocol
+
+class WorkbookProtocol(Protocol):
+    pass
+
 class DummyWorkbook:
     pass

-backend = ComBackend(DummyWorkbook())
+backend = ComBackend(DummyWorkbook())  # type: ignore[arg-type]
```
tests/test_workbook_context.py (1)
28-65: Consider extracting the duplicated DummyWorkbook class. The `DummyWorkbook` class is defined twice (lines 11-13 and lines 42-44), violating DRY principles. Consider extracting it as a module-level fixture or test helper.
Refactor to eliminate duplication
```diff
+class DummyWorkbook:
+    def close(self) -> None:
+        pass
+
+
 def test_openpyxl_workbook_closes(monkeypatch: MonkeyPatch, tmp_path: Path) -> None:
     calls: dict[str, int] = {"close": 0}

-    class DummyWorkbook:
+    class CountingWorkbook(DummyWorkbook):
         def close(self) -> None:
             calls["close"] += 1

     def fake_load_workbook(
         path: Path, *, data_only: bool, read_only: bool
-    ) -> DummyWorkbook:
-        return DummyWorkbook()
+    ) -> CountingWorkbook:
+        return CountingWorkbook()

     monkeypatch.setattr("exstruct.core.workbook.load_workbook", fake_load_workbook)
     with openpyxl_workbook(tmp_path / "book.xlsx", data_only=True, read_only=True):
         pass
     assert calls["close"] == 1


 def test_openpyxl_workbook_sets_warning_filters(
     monkeypatch: MonkeyPatch, tmp_path: Path
 ) -> None:
     calls: list[tuple[str, str, type[Warning], str]] = []

     def fake_filterwarnings(
         action: str,
         message: str = "",
         category: type[Warning] = Warning,
         module: str = "",
         **_kwargs: object,
     ) -> None:
         calls.append((action, message, category, module))

-    class DummyWorkbook:
-        def close(self) -> None:
-            pass
-
     def fake_load_workbook(
         path: Path, *, data_only: bool, read_only: bool
     ) -> DummyWorkbook:
         return DummyWorkbook()
```
tests/test_modeling.py (2)
5-39: LGTM! Consider adding more specific assertions. The test correctly validates the transformation from `SheetRawData`/`WorkbookRawData` to `WorkbookData`. The assertions verify that collections are non-empty, which confirms the data flows through the builder. For more robust validation, consider adding assertions that check specific field values (e.g., `sheet.rows[0].r == 1`, `sheet.shapes[0].text == "S"`).
src/exstruct/core/backends/base.py (2)
9-10: Consider adding the type aliases to the module exports. The type aliases `CellData` and `PrintAreaData` are defined at module level but may not be explicitly exported via `__all__`. If these are intended for public use by backend implementations, consider adding:
```python
__all__ = ["BackendConfig", "Backend", "CellData", "PrintAreaData"]
```
This makes the public API surface explicit and aids type checking and IDE support.
26-38: Consider adding complete docstrings to Protocol methods. The `Backend` Protocol methods (lines 29-38) have brief docstrings but lack Args/Returns sections as recommended by the coding guidelines for Google-style docstrings. While Protocol definitions often use brief docstrings, adding complete documentation would improve clarity for implementers.
Example: Enhanced docstrings
```diff
 def extract_cells(self, *, include_links: bool) -> CellData:
-    """Extract cell rows from the workbook."""
+    """Extract cell rows from the workbook.
+
+    Args:
+        include_links: Whether to include hyperlinks in cell data.
+
+    Returns:
+        Mapping of sheet name to list of cell rows.
+    """

 def extract_print_areas(self) -> PrintAreaData:
-    """Extract print areas from the workbook."""
+    """Extract print areas from the workbook.
+
+    Returns:
+        Mapping of sheet name to list of print areas.
+    """

 def extract_colors_map(
     self, *, include_default_background: bool, ignore_colors: set[str] | None
 ) -> WorkbookColorsMap | None:
-    """Extract colors map from the workbook."""
+    """Extract colors map from the workbook.
+
+    Args:
+        include_default_background: Whether to include default backgrounds.
+        ignore_colors: Optional set of color keys to ignore.
+
+    Returns:
+        WorkbookColorsMap or None if extraction fails.
+    """
```
24-47: Consider narrowing the exception handler to ValueError. Line 40 uses a bare `except Exception`, which catches all exceptions. While this ensures the function returns `None` on any parsing failure, it may hide unexpected errors. According to the openpyxl documentation, `range_boundaries()` raises `ValueError` for malformed range strings (e.g., missing row/column parts), so you can be more specific:
```diff
     try:
         min_col, min_row, max_col, max_row = range_boundaries(cleaned)
-    except Exception:
+    except ValueError:
         return None
```
However, given that this is a parsing utility where graceful failure is desired, the current approach is acceptable for robustness. The code already passes mypy strict mode.
src/exstruct/io/__init__.py (1)
281-286: Inconsistent error types for unsupported format handling. The `_ensure_format_hint` calls use different error types:
- `save_print_area_views` (line 284): `SerializationError`
- `save_auto_page_break_views` (line 342): `SerializationError`
- `serialize_workbook` (line 394): `SerializationError`
- `save_sheets` (line 450): `ValueError`
Consider using `SerializationError` consistently for all format validation failures, as this is a serialization concern.
🔎 Suggested fix for save_sheets
```diff
     format_hint = _ensure_format_hint(
         fmt,
         allowed=_FORMAT_HINTS,
-        error_type=ValueError,
+        error_type=SerializationError,
         error_message="Unsupported sheet export format: {fmt}",
     )
```
Also applies to: 339-344, 391-396, 447-452
src/exstruct/core/workbook.py (1)
56-84: Document the workbook reuse behavior for xlwings. The `xlwings_workbook` context manager reuses an already-open workbook (lines 67-70) without closing it on exit. This is intentional but could surprise callers who expect exclusive access. Consider adding a note to the docstring about this behavior.
🔎 Suggested docstring addition
```diff
 def xlwings_workbook(file_path: Path, *, visible: bool = False) -> Iterator[xw.Book]:
     """Open an Excel workbook via xlwings and close if created.

+    If the workbook is already open in an Excel instance, this context manager
+    yields the existing workbook without closing it on exit.
+
     Args:
         file_path: Workbook path.
         visible: Whether to show the Excel application window.

     Yields:
         xlwings workbook instance.
     """
```
src/exstruct/core/backends/openpyxl_backend.py (1)
163-175: Docstring accurately describes zero-based output, but consider renaming or adding conversion context. The function correctly documents its zero-based output, which aligns with `parse_range_zero_based`. However, since this is consumed by `PrintArea` (which expects 1-based), consider either renaming to clarify intent or documenting the conversion requirement.
src/exstruct/core/backends/com_backend.py (2)
126-134: Exception handling pattern is acceptable for state restoration. The nested try-except to restore the `DisplayPageBreaks` state before continuing is a valid defensive pattern. The silent pass on restoration failure is acceptable since the primary operation already failed. Consider debug-level logging for troubleshooting, but this is optional.
138-150: Duplicate `_parse_print_area_range` function exists in openpyxl_backend.py. This helper is identical to the one in openpyxl_backend.py (lines 163-175). Consider extracting to a shared location like ranges.py to avoid duplication and ensure consistent behavior.
🔎 Suggested approach
Move to src/exstruct/core/ranges.py:
```python
def parse_print_area_bounds(range_str: str) -> tuple[int, int, int, int] | None:
    """Parse an Excel range string into zero-based coordinates.

    Args:
        range_str: Excel range string.

    Returns:
        Zero-based (r1, c1, r2, c2) tuple or None on failure.
    """
    bounds = parse_range_zero_based(range_str)
    if bounds is None:
        return None
    return (bounds.r1, bounds.c1, bounds.r2, bounds.c2)
```
Then import from both backends.
src/exstruct/core/pipeline.py (4)
82-95: Consider using immutable sequences for the frozen dataclass. `PipelinePlan` is frozen but contains `list` fields which remain internally mutable. For full immutability, consider using `tuple` or documenting that consumers should not mutate the lists.
🔎 Suggested immutable alternative
```diff
 @dataclass(frozen=True)
 class PipelinePlan:
-    pre_com_steps: list[ExtractionStep]
-    com_steps: list[ComExtractionStep]
+    pre_com_steps: tuple[ExtractionStep, ...]
+    com_steps: tuple[ComExtractionStep, ...]
     use_com: bool
```
Then update `build_pre_com_pipeline` and `build_com_pipeline` to return tuples.
186-188: Type inconsistency between the validation set and the `ExtractionMode` literal. `allowed_modes` is a runtime `set[str]`, but `mode` is typed as `ExtractionMode` (a `Literal`). Static type checkers already ensure valid modes at call sites. Consider using `typing.get_args()` for DRY validation:
🔎 DRY validation using Literal args
```diff
+from typing import get_args
+
 def resolve_extraction_inputs(
     ...
 ) -> ExtractionInputs:
-    allowed_modes: set[str] = {"light", "standard", "verbose"}
+    allowed_modes = get_args(ExtractionMode)
     if mode not in allowed_modes:
         raise ValueError(f"Unsupported mode: {mode}")
```
262-298: Duplication between the "standard" and "verbose" mode configurations. The step configurations for "standard" and "verbose" modes are identical. Consider extracting a shared definition to reduce duplication and maintenance burden.
🔎 Possible DRY refactor
```python
_com_aware_pre_steps: tuple[StepConfig, ...] = (
    StepConfig(name="cells", step=step_extract_cells, enabled=lambda _: True),
    StepConfig(
        name="print_areas_openpyxl",
        step=step_extract_print_areas_openpyxl,
        enabled=lambda i: i.include_print_areas,
    ),
    StepConfig(
        name="colors_map_openpyxl_if_skip_com",
        step=step_extract_colors_map_openpyxl,
        enabled=lambda i: i.include_colors_map and bool(os.getenv("SKIP_COM_TESTS")),
    ),
)

step_table: dict[ExtractionMode, Sequence[StepConfig]] = {
    "light": (...),  # unique light steps
    "standard": _com_aware_pre_steps,
    "verbose": _com_aware_pre_steps,
}
```
393-433: Each pre-COM step independently opens the workbook file. `step_extract_cells`, `step_extract_print_areas_openpyxl`, and `step_extract_colors_map_openpyxl` each call extraction methods that independently open and read the file (via `pd.read_excel`, `openpyxl_workbook`, etc.). For large files, consider optimizing by opening the workbook once and sharing it across steps, or batching extractions into a single step (a sketch follows below). This is a design trade-off between step independence and I/O efficiency.
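A rough sketch of the shared-workbook idea. The step signature used here is hypothetical; the actual steps currently receive inputs/artifacts rather than an open workbook, so adopting this would mean reshaping them.

```python
# Sketch only: open the file once and hand the live workbook to each pre-COM step,
# instead of letting every step re-open the file. Step signatures are assumptions.
from collections.abc import Callable
from pathlib import Path
from typing import Any

from exstruct.core.workbook import openpyxl_workbook

SharedStep = Callable[[Any, dict[str, Any]], None]  # (open workbook, artifacts) -> None

def run_pre_com_steps_shared(
    file_path: Path, steps: list[SharedStep], artifacts: dict[str, Any]
) -> None:
    with openpyxl_workbook(file_path, data_only=True, read_only=True) as wb:
        for step in steps:
            step(wb, artifacts)  # every step reuses the already-open workbook
```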
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (43)
- .github/workflows/pytest.yml
- .gitignore
- AGENTS.md
- docs/agents/API_DESIGN.md
- docs/agents/ARCHITECTURE.md
- docs/agents/CODING_GUIDELINES.md
- docs/agents/CONTRIBUTING_FOR_AI.md
- docs/agents/EXCEL_EXTRACTION.md
- docs/agents/FEATURE_SPEC.md
- docs/agents/OVERVIEW.md
- docs/agents/ROADMAP.md
- docs/agents/TASKS.md
- docs/agents/TEST_REQUIREMENTS.md
- pyproject.toml
- src/exstruct/__init__.py
- src/exstruct/core/backends/__init__.py
- src/exstruct/core/backends/base.py
- src/exstruct/core/backends/com_backend.py
- src/exstruct/core/backends/openpyxl_backend.py
- src/exstruct/core/cells.py
- src/exstruct/core/integrate.py
- src/exstruct/core/logging_utils.py
- src/exstruct/core/modeling.py
- src/exstruct/core/pipeline.py
- src/exstruct/core/ranges.py
- src/exstruct/core/workbook.py
- src/exstruct/engine.py
- src/exstruct/errors.py
- src/exstruct/io/__init__.py
- src/exstruct/io/serialize.py
- src/exstruct/render/__init__.py
- tests/test_backends.py
- tests/test_engine.py
- tests/test_error_handling.py
- tests/test_integrate_fallback.py
- tests/test_integrate_raw_data.py
- tests/test_io_common.py
- tests/test_logging_utils.py
- tests/test_mode_output.py
- tests/test_modeling.py
- tests/test_pipeline.py
- tests/test_table_detection_common.py
- tests/test_workbook_context.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Add type hints to all function and method arguments and return values (mypy strict compliance)
Use the Any type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)
Organize imports in this order: (1) standard library, (2) third-party packages, (3) exstruct internal modules
Use Google-style docstrings for all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Avoid writing God Functions (large, single-responsibility-violating functions)
Avoid writing God Objects (classes with too many responsibilities)
Avoid excessive nesting and deep conditional branching
Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data
Ensure code passes mypy strict mode with zero errors
Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
Avoid circular dependencies between modules
AI should propose automatic function splitting when functions are too complex
AI should propose additional Pydantic models when data structures are insufficient
AI should automatically fix and organize imports when they are inconsistent
AI should improve docstrings when they are incomplete or insufficient
Files:
- src/exstruct/core/ranges.py
- tests/test_integrate_fallback.py
- tests/test_modeling.py
- src/exstruct/core/logging_utils.py
- src/exstruct/io/serialize.py
- tests/test_workbook_context.py
- src/exstruct/core/backends/base.py
- src/exstruct/core/workbook.py
- src/exstruct/core/backends/openpyxl_backend.py
- tests/test_table_detection_common.py
- tests/test_io_common.py
- src/exstruct/io/__init__.py
- tests/test_backends.py
- tests/test_logging_utils.py
- tests/test_engine.py
- tests/test_mode_output.py
- src/exstruct/core/backends/com_backend.py
- src/exstruct/core/modeling.py
- tests/test_pipeline.py
- tests/test_error_handling.py
- src/exstruct/core/cells.py
- src/exstruct/__init__.py
- src/exstruct/render/__init__.py
- src/exstruct/core/integrate.py
- src/exstruct/engine.py
- src/exstruct/core/pipeline.py
- tests/test_integrate_raw_data.py
- src/exstruct/errors.py
- src/exstruct/core/backends/__init__.py
🧠 Learnings (6)
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
Applied to files:
.gitignore
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Applied to files:
- docs/agents/CODING_GUIDELINES.md
- docs/agents/CONTRIBUTING_FOR_AI.md
- src/exstruct/io/__init__.py
- docs/agents/ARCHITECTURE.md
- src/exstruct/core/modeling.py
- src/exstruct/core/cells.py
- src/exstruct/core/integrate.py
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data
Applied to files:
- docs/agents/CODING_GUIDELINES.md
- docs/agents/CONTRIBUTING_FOR_AI.md
- AGENTS.md
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Applied to files:
- docs/agents/CODING_GUIDELINES.md
- docs/agents/CONTRIBUTING_FOR_AI.md
- AGENTS.md
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : AI should propose additional Pydantic models when data structures are insufficient
Applied to files:
docs/agents/CODING_GUIDELINES.md
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)
Applied to files:
docs/agents/CODING_GUIDELINES.md
🧬 Code graph analysis (20)
tests/test_integrate_fallback.py (2)
src/exstruct/core/integrate.py (1)
extract_workbook(10-52)tests/test_cells_and_tables.py (1)
_raise(64-65)
tests/test_modeling.py (2)
src/exstruct/core/modeling.py (3)
SheetRawData(9-28)WorkbookRawData(32-41)build_workbook_data(64-74)src/exstruct/models/__init__.py (5)
CellRow(49-58)Chart(76-93)ChartSeries(61-73)PrintArea(96-102)Shape(11-46)
src/exstruct/core/logging_utils.py (1)
src/exstruct/errors.py (1)
FallbackReason(40-46)
src/exstruct/io/serialize.py (2)
src/exstruct/errors.py (2)
MissingDependencyError(24-25)SerializationError(20-21)tests/test_io_common.py (1)
safe_dump(12-15)
tests/test_workbook_context.py (1)
src/exstruct/core/workbook.py (1)
openpyxl_workbook(14-53)
src/exstruct/core/backends/openpyxl_backend.py (5)
src/exstruct/models/__init__.py (1)
PrintArea(96-102)src/exstruct/core/cells.py (6)
WorkbookColorsMap(53-67)detect_tables_openpyxl(1409-1447)extract_sheet_cells(467-485)extract_sheet_cells_with_links(488-526)extract_sheet_colors_map(70-91)detect_tables(1450-1487)src/exstruct/core/ranges.py (1)
parse_range_zero_based(24-47)src/exstruct/core/workbook.py (1)
openpyxl_workbook(14-53)src/exstruct/core/backends/base.py (3)
extract_cells(29-30)extract_print_areas(32-33)extract_colors_map(35-38)
tests/test_table_detection_common.py (1)
src/exstruct/core/cells.py (2)
_collect_table_candidates_from_values(1291-1334)_merge_rectangles(1242-1268)
tests/test_io_common.py (2)
src/exstruct/io/__init__.py (1)
save_sheets(435-472)src/exstruct/models/__init__.py (2)
SheetData(105-191)WorkbookData(194-257)
tests/test_backends.py (3)
src/exstruct/core/backends/com_backend.py (2)
extract_colors_map(51-74)extract_print_areas(27-49)src/exstruct/core/backends/openpyxl_backend.py (4)
extract_cells(32-45)detect_tables(88-100)extract_colors_map(64-86)extract_print_areas(47-62)src/exstruct/core/ranges.py (1)
parse_range_zero_based(24-47)
tests/test_logging_utils.py (2)
src/exstruct/core/logging_utils.py (1)
log_fallback(8-16)src/exstruct/errors.py (1)
FallbackReason(40-46)
tests/test_engine.py (1)
src/exstruct/engine.py (6)
DestinationOptions(126-141)ExStructEngine(166-562)FilterOptions(99-123)OutputOptions(144-163)StructOptions(60-82)serialize(353-376)
src/exstruct/core/backends/com_backend.py (5)
src/exstruct/models/__init__.py (1)
PrintArea(96-102)src/exstruct/core/cells.py (2)
WorkbookColorsMap(53-67)extract_sheet_colors_map_com(94-119)src/exstruct/core/ranges.py (1)
parse_range_zero_based(24-47)src/exstruct/core/backends/openpyxl_backend.py (3)
extract_print_areas(47-62)_parse_print_area_range(163-175)extract_colors_map(64-86)src/exstruct/core/backends/base.py (2)
extract_print_areas(32-33)extract_colors_map(35-38)
src/exstruct/core/modeling.py (1)
src/exstruct/models/__init__.py (6)
CellRow(49-58)Chart(76-93)PrintArea(96-102)Shape(11-46)SheetData(105-191)WorkbookData(194-257)
tests/test_pipeline.py (2)
src/exstruct/core/pipeline.py (5)
ExtractionArtifacts(58-75)ExtractionInputs(33-54)build_cells_tables_workbook(654-695)build_com_pipeline(306-348)resolve_extraction_inputs(157-216)src/exstruct/models/__init__.py (2)
CellRow(49-58)PrintArea(96-102)
tests/test_error_handling.py (2)
tests/test_integrate_fallback.py (1)
_raise(71-72)tests/test_cells_and_tables.py (1)
_raise(64-65)
src/exstruct/core/cells.py (2)
src/exstruct/models/__init__.py (1)
CellRow(49-58)src/exstruct/core/workbook.py (1)
openpyxl_workbook(14-53)
src/exstruct/__init__.py (1)
src/exstruct/engine.py (3)
FormatOptions(85-96)FilterOptions(99-123)DestinationOptions(126-141)
src/exstruct/render/__init__.py (1)
src/exstruct/errors.py (1)
RenderError(28-29)
src/exstruct/core/integrate.py (2)
src/exstruct/models/__init__.py (1)
WorkbookData(194-257)src/exstruct/core/pipeline.py (2)
resolve_extraction_inputs(157-216)run_extraction_pipeline(582-651)
src/exstruct/core/backends/__init__.py (3)
src/exstruct/core/backends/base.py (1)
Backend(26-38)src/exstruct/core/backends/com_backend.py (1)
ComBackend(18-135)src/exstruct/core/backends/openpyxl_backend.py (1)
OpenpyxlBackend(23-100)
🪛 GitHub Check: Codacy Static Code Analysis
.github/workflows/pytest.yml
[warning] 43-43: .github/workflows/pytest.yml#L43
An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release.
src/exstruct/core/workbook.py
[warning] 52-52: src/exstruct/core/workbook.py#L52
Try, Except, Pass detected.
[warning] 102-102: src/exstruct/core/workbook.py#L102
Try, Except, Continue detected.
src/exstruct/core/backends/com_backend.py
[warning] 37-37: src/exstruct/core/backends/com_backend.py#L37
Try, Except, Continue detected.
[warning] 132-132: src/exstruct/core/backends/com_backend.py#L132
Try, Except, Pass detected.
src/exstruct/core/cells.py
[warning] 1394-1394: src/exstruct/core/cells.py#L1394
Try, Except, Continue detected.
🪛 markdownlint-cli2 (0.18.1)
docs/agents/CODING_GUIDELINES.md
49-49: Strong style
Expected: asterisk; Actual: underscore
(MD050, strong-style)
49-49: Strong style
Expected: asterisk; Actual: underscore
(MD050, strong-style)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: test (windows-latest, 3.11)
🔇 Additional comments (74)
src/exstruct/render/__init__.py (2)
52-54: LGTM: Error message formatting improved.The string formatting change maintains clarity while consolidating the error message construction. The implicit concatenation of the plain string literal with the f-string is valid and produces the same result as the previous multi-line version.
61-61: LGTM: Concise error message formatting.The single-line f-string format is clear and maintains the same error message content as before.
pyproject.toml (1)
35-35: No action needed. taskipy version 1.14.1 is available on PyPI (released Nov 26, 2024), has no known security vulnerabilities, and fully supports Python 3.11+..gitignore (1)
23-25: Coverage artifacts properly ignored.The additions align with the codecov integration and coverage reporting infrastructure. No concerns.
docs/agents/CONTRIBUTING_FOR_AI.md (1)
29-29: Documentation update aligns with architectural boundary changes.The clarifications about BaseModel for boundaries and dataclass for internals, plus the regression testing guidance, properly communicate the new data structure separation strategy.
Also applies to: 39-39
AGENTS.md (1)
126-126: Prompt templates updated to reflect boundary/internal data structure pattern.The additions consistently communicate the BaseModel-for-boundaries / dataclass-for-internals expectation across both the developer guidance and Codex prompts, ensuring AI-generated code aligns with architectural decisions.
Also applies to: 178-178
docs/agents/TASKS.md (1)
1-93: Task documentation comprehensively captures completed refactoring work.The task list properly documents the major architectural components introduced in this PR: pipeline infrastructure, backend abstraction, utility consolidation, and data structure separation. The implementation notes (especially around lines 70–90) provide valuable context for understanding the design decisions and rationale. Well-organized and maintainable.
docs/agents/CODING_GUIDELINES.md (1)
46-50: Fix markdown formatting: wrap `__init__` in backticks. The content updates correctly document the boundary/internal separation pattern. However, line 49 contains `__init__` without backticks, which markdownlint flags as unintended strong emphasis. Wrap it in backticks to prevent markdown interpretation.
🔎 Proposed fix for markdown formatting
```diff
 Raw*Data (e.g., SheetRawData/WorkbookRawData) は内部専用で公開APIに出さない。
-  Do not expose them via public APIs or re-export from package __init__.
+  Do not expose them via public APIs or re-export from package `__init__`.
```
docs/agents/OVERVIEW.md (1)
1-43: Overview rewrite improves clarity and user orientation.The updated overview is more concise and user-focused, emphasizing extraction targets and practical usage patterns. The shift from feature-centric to target-and-usage-oriented documentation is well-suited to the new pipeline-centric architecture and helps both users and contributors understand the library's scope quickly.
src/exstruct/core/backends/__init__.py (1)
1-7: Clean package structure with proper export definitions. The backends module initializer correctly exports the public API (Backend protocol and implementations). The import organization and `__all__` definition follow best practices. No concerns.
docs/agents/ROADMAP.md (1)
30-36: Roadmap entries properly document refactoring milestone and future shape decomposition.The new v0.3.0 and v0.3.1 entries accurately capture the scope of this PR and position the upcoming SmartArt work. Version sequence is logical and well-placed.
tests/test_mode_output.py (1)
61-61: LGTM! Monkeypatch target correctly updated for pipeline architecture. The update from `exstruct.core.integrate.xw.Book` to `exstruct.core.pipeline.xlwings_workbook` aligns with the pipeline-centric refactoring. The test continues to verify that COM is not accessed in light mode.
9-14: LGTM! Test correctly verifies the standardized fallback logging format. The test properly validates that `log_fallback` emits a warning with the expected `[light_mode] fallback` format. Import ordering follows the coding guidelines (stdlib → third-party → internal), and type hints are present.
tests/test_io_common.py (1)
9-23: LGTM! Test properly verifies the YAML serialization path. The test correctly monkeypatches `_require_yaml` to validate YAML format handling. Import ordering follows guidelines, type hints are present, and the test logic is sound.
.github/workflows/pytest.yml (2)
34-34: LGTM! Coverage collection added for non-COM tests. The addition of `--cov=exstruct --cov-report=xml` enables coverage tracking for the test suite.
40-40: LGTM! Coverage collection added for the full test suite. The addition of `--cov=exstruct --cov-report=xml` enables coverage tracking for the Windows test suite.
8-16: LGTM! Clean implementation of standardized fallback logging. The function provides a clear, single-responsibility utility for logging fallback events. Import ordering follows guidelines, type hints are complete, and the Google-style docstring is well-structured.
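For orientation, a minimal sketch of this logging pattern; the real helper lives in `logging_utils.py` and its exact signature is not shown in this review, so the parameter names below are assumptions. Only the `[<context>] fallback` message shape is taken from the test description.

```python
# Sketch of a standardized fallback logger (names are illustrative).
import logging

logger = logging.getLogger("exstruct")

def log_fallback_sketch(context: str, reason: str) -> None:
    logger.warning("[%s] fallback: %s", context, reason)

log_fallback_sketch("light_mode", "COM unavailable")  # -> "[light_mode] fallback: COM unavailable"
```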
docs/agents/FEATURE_SPEC.md (1)
7-12: Documentation updated to focus on SmartArt parsing.The updated feature specification clearly outlines the plan to introduce separate classes for Shape, Arrow, and SmartArt with recursive node structure. This aligns with the broader architectural refactoring in the PR.
src/exstruct/errors.py (1)
40-46: LGTM! Well-designed enum for fallback reason tracking. The `FallbackReason` enum subclasses both `str` and `Enum`, which enables direct string serialization while maintaining type safety. The four reason codes cover the key fallback scenarios introduced in the pipeline architecture.
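To illustrate why the `str` + `Enum` mixin helps here: members compare and serialize as plain strings without any custom encoder. Only `COM_PIPELINE_FAILED` is named elsewhere in this review; its value string below is an assumption, and the real enum has more members.

```python
# Illustration of the str+Enum mixin behavior (member value is assumed).
import json
from enum import Enum

class Reason(str, Enum):
    COM_PIPELINE_FAILED = "com_pipeline_failed"

assert Reason.COM_PIPELINE_FAILED == "com_pipeline_failed"
assert json.dumps(Reason.COM_PIPELINE_FAILED) == '"com_pipeline_failed"'
```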
3-4: LGTM! Required imports added for the contextmanager pattern. The imports `Iterator` and `contextmanager` support the cleaner test scaffolding introduced later in the file.
36-36: LGTM! Monkeypatch target updated for pipeline architecture. The change from `exstruct.core.integrate.xw.Book` to `exstruct.core.pipeline.xlwings_workbook` correctly reflects the new pipeline-centric extraction flow.
54-69: LGTM! Cleaner test scaffolding with the contextmanager pattern. The refactored approach using `@contextmanager` and `_dummy_workbook` is more idiomatic and aligns with the pipeline's workbook context management. The monkeypatch target updates are correct.
15-23: LGTM! The test correctly validates that `_collect_table_candidates_from_values` builds a range string from a 2x2 value matrix with the expected base coordinates.
9-12: Test is correct - no changes needed. The `_merge_rectangles` function explicitly preserves contained rectangles by returning `False` from `_rectangles_overlap_for_merge` when one rectangle fully contains another. For the test input `[(1, 1, 3, 3), (2, 2, 2, 2)]`, the second rectangle is fully contained in the first, so the containment check triggers and prevents merging. The test expectation `len(merged) == 2` accurately reflects this intentional behavior of preserving both rectangles.
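A small illustration of the containment rule described above; the `(r1, c1, r2, c2)` tuple layout with inclusive bounds is assumed from the test input.

```python
# Containment check sketch (rectangle format and inclusivity are assumptions).
def contains(outer: tuple[int, int, int, int], inner: tuple[int, int, int, int]) -> bool:
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )

# (1, 1, 3, 3) fully contains (2, 2, 2, 2); a merge rule that skips contained
# rectangles therefore keeps both, matching the expected len(merged) == 2.
assert contains((1, 1, 3, 3), (2, 2, 2, 2))
```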
11-37: LGTM! The test effectively verifies that `OpenpyxlBackend.extract_cells` delegates to the correct internal function based on the `include_links` flag, and the assertion confirms both paths are exercised.
40-52: LGTM! The test correctly validates that `OpenpyxlBackend.detect_tables` gracefully handles failures by returning an empty list when the underlying detection function raises an exception.
76-93: LGTM! The integration test creates a real workbook with a print area and verifies that `OpenpyxlBackend.extract_print_areas` correctly extracts and converts the coordinates to zero-based indexing.
95-101: LGTM! The test validates that `parse_range_zero_based` correctly handles sheet-prefixed range strings and produces zero-based coordinate bounds.
8-25: LGTM! The test effectively validates that the `openpyxl_workbook` context manager calls `close()` exactly once upon exit, ensuring proper resource cleanup.
58-75: LGTM!These tests effectively validate:
- COM pipeline construction respects individual flags (lines 58-75)
- Light mode produces an empty COM pipeline (lines 78-90)
- Input resolution applies correct defaults based on mode (lines 93-107)
The test assertions are precise and verify the expected pipeline behavior.
Also applies to: 78-90, 93-107
110-142: LGTM!The integration test correctly verifies that
build_cells_tables_workbookintegrates print areas from artifacts and invokes table detection via the backend, producing the expectedWorkbookDatastructure with print areas and table candidates.docs/agents/API_DESIGN.md (1)
50-56: LGTM! The documentation clearly demonstrates the updated API structure with nested `OutputOptions` containing `format` and `filters` groups, aligning with the PR's refactoring of the public API surface.
src/exstruct/core/ranges.py (1)
8-21: LGTM!The
RangeBoundsNamedTuple is well-documented with clear field descriptions and uses proper type hints. Using a NamedTuple for immutable data is appropriate for this use case.src/exstruct/core/backends/base.py (1)
13-23: LGTM!The
BackendConfigdataclass is properly frozen, has complete type hints, and includes a clear docstring. Usingfrozen=Trueensures immutability, which is appropriate for configuration objects.tests/test_integrate_fallback.py (3)
27-30: LGTM! Helper and patch target updated consistently. The `boom` helper signature now accepts arbitrary `*args` and `**kwargs`, aligning with the pattern used elsewhere (e.g., `_raise` in tests/test_cells_and_tables.py). The patch target correctly points to `exstruct.core.pipeline.xlwings_workbook`, consistent with the pipeline-based architecture refactoring.
38-56: LGTM! New fallback test for print_areas coverage. This test validates the critical requirement [INT-02] (print_areas must be preserved even on COM fallback) from TEST_REQUIREMENTS.md. The test correctly:
- Creates a workbook with
print_areaset via openpyxl- Patches the COM pathway to raise an exception
- Verifies that
print_areasare still extracted via the fallback path
59-79: LGTM! Test updated to use the new patch target.The
test_extract_workbook_with_linkscorrectly updates its patch target toexstruct.core.pipeline.xlwings_workbook, ensuring hyperlink extraction continues to work when COM is unavailable.docs/agents/TEST_REQUIREMENTS.md (1)
1-227: LGTM! Documentation is well-structured and comprehensive.The restructured TEST_REQUIREMENTS.md provides clear coverage of the new pipeline-based architecture with well-organized sections for Pipeline, Backend, Ranges, Table Detection, Workbook context management, and Logging. The bullet-style format improves readability.
tests/test_integrate_raw_data.py (3)
12-29: LGTM! Well-structured helper with proper type hints and docstring.The
_make_charthelper provides a minimal Chart instance for tests with all required fields. The Google-style docstring follows the coding guidelines.
32-73: LGTM! Comprehensive test for raw data collection. This test validates requirement [MOD-02] (collect_sheet_raw_data gathers the extracted data into the raw container) by verifying that all extracted components (rows, shapes, charts, table_candidates, print_areas, auto_print_areas, colors_map) are correctly collected into the raw container.
76-100: LGTM! Light mode test correctly validates chart exclusion. This test ensures charts are excluded in light mode (`mode="light"`), aligning with requirement [MODE-02] (light: cells and tables only, empty shapes/charts, no COM).
74-82: LGTM! Clean wrapper for range parsing. The `_parse_range_zero_based` function correctly delegates to the core `parse_range_zero_based` utility and converts the result to a plain tuple for backward compatibility with existing callers expecting `(r1, c1, r2, c2)`.
486-487: Verify: Exporting underscore-prefixed functions in `__all__`. Exporting `_require_yaml` and `_require_toon` in `__all__` is unconventional since underscore-prefixed names typically denote private/internal APIs. This appears intentional to allow test patching (as seen in tests/test_io_common.py). If this is the intended pattern, consider documenting this decision or renaming to non-underscore names if they're truly public.
1-81: LGTM! Architecture documentation clearly describes the pipeline-centric design.The updated ARCHITECTURE.md provides a clear overview of the refactored system:
- Directory structure reflects new modules (
pipeline.py,modeling.py,workbook.py,backends/)- Pipeline design section explains the separation of concerns (immutable
PipelinePlan, execution state inPipelineState/PipelineResult)- Module responsibilities are well-defined
- AI agent guidelines are updated with relevant context
src/exstruct/core/workbook.py (2)
13-53: LGTM! Robust context manager with appropriate warning suppression. The `openpyxl_workbook` context manager correctly:
- Suppresses known non-critical openpyxl warnings (extensions, conditional formatting, header/footer parsing)
- Uses an `Any` return type appropriately at the external library boundary (per coding guidelines)
- Ensures cleanup in the `finally` block, swallowing exceptions to avoid masking the original error
The static analysis warning about try/except/pass at line 52 is a deliberate pattern here; cleanup failures should not mask the original exception.
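For reference, a stripped-down sketch of this context-manager shape; the real implementation differs in its exact warning filters and typing, and the name below is illustrative.

```python
# Minimal sketch: suppress noisy openpyxl warnings, always close, and never let
# cleanup errors mask the original exception.
import warnings
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from typing import Any

from openpyxl import load_workbook

@contextmanager
def openpyxl_workbook_sketch(path: Path, *, data_only: bool, read_only: bool) -> Iterator[Any]:
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", module="openpyxl")
        wb = load_workbook(path, data_only=data_only, read_only=read_only)
    try:
        yield wb
    finally:
        try:
            wb.close()
        except Exception:
            pass  # cleanup failure must not hide the original error
```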
87-106: LGTM! Defensive helper for finding open workbooks. The `_find_open_workbook` helper has comprehensive exception handling:
- The inner try/except (line 102) handles individual workbook access failures
- The outer try/except (line 104) handles app iteration failures
- Path comparison uses `.resolve()` for reliable matching
The static analysis warning about try/except/continue at line 102 is appropriate here; COM objects can raise unpredictably, and we want to continue checking other workbooks.
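The path-matching idea in isolation, as a sketch rather than the helper's actual code:

```python
# Resolve both paths before comparing so "./book.xlsx" and an absolute spelling
# of the same file still match.
from pathlib import Path

def same_file(a: Path, b: Path) -> bool:
    return a.resolve() == b.resolve()
```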
6-12: LGTM! Import updated to include the new option types. The import statement correctly includes `DestinationOptions` and `FilterOptions` alongside `ExStructEngine`, `OutputOptions`, and `StructOptions`, reflecting the new nested option group API.
70-86: LGTM! Filter tests updated to use nested FilterOptions. The tests `test_engine_serialize_filters_shapes` and `test_engine_serialize_filters_tables` correctly use the new `OutputOptions(filters=FilterOptions(...))` pattern to configure include/exclude flags.
124-136: LGTM! Destination tests updated to use nested DestinationOptions. The tests for `sheets_dir` and `print_areas_dir` correctly use `OutputOptions(destinations=DestinationOptions(...))` to configure output destinations, validating the new API surface.
Also applies to: 138-152
154-168: LGTM! Combined filter and destination tests exercise the full API.These tests correctly combine
FilterOptionsandDestinationOptionswithinOutputOptions, covering:
- Print areas inclusion flag with destination
- Light mode behavior with destinations
- String path handling for destinations
Also applies to: 170-186, 188-207
src/exstruct/io/serialize.py (3)
10-10: LGTM! Format hints centralized in a constant.The
_FORMAT_HINTSset provides a single source of truth for allowed formats, used consistently across the module and exported for use inio/__init__.py.
52-90: LGTM! Clean serialization dispatcher with defensive fallback. The `_serialize_payload_from_hint` function:
- Uses a clear `match` statement for format dispatch
- Handles the JSON indentation logic correctly (line 72: `2 if pretty and indent is None else indent`)
- Lazily loads optional dependencies (yaml, toon)
- Includes a defensive default case (lines 87-90) that is technically unreachable when callers use `_ensure_format_hint`, but provides safety if called directly
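A compact sketch of this dispatch-plus-lazy-import shape. The function and error names below are illustrative; the module's real helpers, messages, and supported formats differ.

```python
# Sketch combining match-based format dispatch with lazy optional-dependency loading.
import json
from importlib import import_module
from typing import Any

class MissingDependencySketch(RuntimeError):
    pass

def serialize_payload_sketch(payload: dict[str, Any], fmt: str, *, pretty: bool = False) -> str:
    match fmt:
        case "json":
            return json.dumps(payload, ensure_ascii=False, indent=2 if pretty else None)
        case "yaml":
            try:
                yaml = import_module("yaml")  # optional dependency, imported lazily
            except ImportError as e:
                raise MissingDependencySketch("install pyyaml for YAML output") from e
            return str(yaml.safe_dump(payload, allow_unicode=True))
        case _:
            raise ValueError(f"Unsupported format: {fmt}")
```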
93-112: LGTM! Dependency checkers provide clear installation guidance. Both `_require_yaml` and `_require_toon` correctly:
- Use `importlib.import_module` for lazy loading
- Raise `MissingDependencyError` with actionable install instructions
- Preserve the original `ImportError` in the exception chain (using `from e`)
src/exstruct/core/modeling.py (2)
1-29: Clean data container design with proper immutability.The frozen dataclasses
SheetRawDataandWorkbookRawDataprovide a clean separation between raw extraction output and typed Pydantic models. This aligns well with the pipeline architecture and enables immutable data flow.
44-74: LGTM!Builder functions are well-typed with clear docstrings. The transformation from raw dataclasses to Pydantic models is straightforward and maintainable.
src/exstruct/core/backends/openpyxl_backend.py (2)
22-45: LGTM!The
OpenpyxlBackendclass is well-structured with proper type hints and docstrings. The conditional delegation inextract_cellsis clean and readable.
47-62: LGTM!The fallback from defined names to sheet properties is a good defensive pattern. The broad exception handling is acceptable here since the caller expects graceful degradation.
src/exstruct/core/integrate.py (1)
10-52: Clean refactoring to pipeline-based extraction.The
extract_workbookfunction now cleanly delegates to the pipeline infrastructure while preserving the public API. This is a significant simplification that centralizes the extraction logic.docs/agents/EXCEL_EXTRACTION.md (1)
1-61: Documentation accurately reflects the new pipeline architecture.The updated specification clearly describes the extraction flow, mode behaviors, and fallback mechanisms. The priority of COM results over OpenPyXL for colors_map is well-documented.
src/exstruct/__init__.py (2)
14-14: LGTM!Adding
FormatOptionsto the public API export is consistent with the new nestedOutputOptionsstructure. This enables users to construct format options separately.Also applies to: 80-80
367-398: LGTM!The
process_excelfunction correctly constructsOutputOptionswith the new nested structure. The parameter passthrough toengine.processmaintains backward compatibility.src/exstruct/core/backends/com_backend.py (1)
173-206: LGTM!The
_split_csv_respecting_quotesfunction correctly handles Excel's CSV-like range syntax with quoted sheet names containing commas. The state machine approach is appropriate for this parsing task.src/exstruct/engine.py (2)
144-163: LGTM!The
OutputOptionsclass now enforces strict field validation withextra="forbid", which will catch configuration errors early. The nested structure withFormatOptions,FilterOptions, andDestinationOptionsis cleaner and more maintainable.
341-351: LGTM!The
extractmethod correctly forwards the new color options fromself.options.colorstoextract_workbook. The use ofignore_colors_set()ensures proper type conversion from list to set.src/exstruct/core/cells.py (5)
4-4: LGTM!Good addition of
CallableandSequencefromcollections.abcand theopenpyxl_workbookcontext manager import for safer resource handling.Also applies to: 19-19
1161-1178: LGTM!The
_extract_listobject_tableshelper cleanly extracts table ranges from Excel ListObjects via COM, with appropriate exception handling.
1227-1239: Good abstraction layer.The
_detect_border_rectanglesfunction provides a clean abstraction overdetect_border_clusters, handling the numpy array conversion internally.
1291-1348: Well-structured table candidate collection.The
_collect_table_candidates_from_valuesfunction centralizes the table detection heuristics with clear parameter documentation. The use of aCallablefor column naming enables reuse across openpyxl and xlwings backends.
1409-1447: LGTM!The
detect_tables_openpyxlfunction cleanly combines openpyxl table references with border-based detection, using proper deduplication. This aligns well with the backend abstraction.src/exstruct/core/pipeline.py (5)
1-29: LGTM!Imports are properly organized following the coding guidelines (stdlib → third-party → internal), and the type aliases are clear and appropriately scoped.
32-55: LGTM!The immutable
ExtractionInputsdataclass is well-structured with complete type hints and a proper Google-style docstring. Usingfrozen=Trueappropriately enforces immutability for pipeline inputs.
57-76: LGTM!Proper use of
field(default_factory=dict)avoids the mutable default argument pitfall. The mutable design is intentional for artifact accumulation during pipeline execution.
351-390: LGTM!The pipeline runner functions are simple and correct. Steps mutate artifacts in-place, and any step failure propagates up appropriately. The pattern is clean and easy to follow.
564-579: [No action required]The
detect_tablesfunction has the signaturedef detect_tables(sheet: xw.Sheet) -> list[str]:and is explicitly designed to accept xlwingsSheetobjects. The usage at line 571 in pipeline.py is correct—passing an xlwings sheet fromworkbook.sheets[sheet_name]is exactly what the function expects. There is no compatibility issue.Likely an incorrect or invalid review comment.
- Resolved the duplicate IDs in TEST_REQUIREMENTS.md ([PIPE-MOD-01/02]) - Documented and unified the convention that rows are 1-based and columns are 0-based - Generalized the reason/message used when the COM pipeline fails and added FallbackReason (errors.py, pipeline.py) - Used the reason from build_cells_tables_workbook in logging - Removed the unnecessary delenv from test_pipeline.py
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/exstruct/io/__init__.py (1)
74-82: Return `RangeBounds` directly instead of converting to a tuple. This wrapper function unpacks the Pydantic `RangeBounds` model into a tuple, which violates the coding guideline: "Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data." Callers should access bounds via the model's fields (`bounds.r1`, `bounds.c1`, etc.) to preserve type safety and semantic clarity. As per coding guidelines, structured data should be returned as Pydantic models, not tuples.
🔎 Proposed refactor to eliminate tuple conversion
Option 1: Remove the wrapper entirely and use `parse_range_zero_based` directly. Update callers (e.g., line 129) to use the model fields:
```diff
 def _filter_table_candidates_to_area(
     table_candidates: list[str], area: PrintArea
 ) -> list[str]:
     filtered: list[str] = []
     for candidate in table_candidates:
-        bounds = _parse_range_zero_based(candidate)
-        if not bounds:
+        bounds = parse_range_zero_based(candidate)
+        if bounds is None:
             continue
-        r1, c1, r2, c2 = bounds
+        r1, c1, r2, c2 = bounds.r1, bounds.c1, bounds.r2, bounds.c2
         r1 += 1
         r2 += 1
         if r1 >= area.r1 and r2 <= area.r2 and c1 >= area.c1 and c2 <= area.c2:
             filtered.append(candidate)
     return filtered
```
Option 2: If the wrapper is needed, change the return type to
`RangeBounds | None`:
```diff
-def _parse_range_zero_based(range_str: str) -> tuple[int, int, int, int] | None:
-    """
-    Parse an Excel range string into zero-based (r1, c1, r2, c2) bounds.
-    Returns None on failure.
-    """
+def _parse_range_zero_based(range_str: str) -> RangeBounds | None:
+    """Parse an Excel range string into zero-based bounds.
+
+    Args:
+        range_str: Excel range string (e.g., "Sheet1!A1:B2").
+
+    Returns:
+        RangeBounds in zero-based coordinates, or None on failure.
+    """
     bounds = parse_range_zero_based(range_str)
-    if bounds is None:
-        return None
-    return (bounds.r1, bounds.c1, bounds.r2, bounds.c2)
+    return bounds
```
♻️ Duplicate comments (3)
src/exstruct/core/pipeline.py (2)
644-645: Previous review concern addressed.The error message has been updated to use a generic "COM pipeline failed" message with
FallbackReason.COM_PIPELINE_FAILED, addressing the earlier review comment about misleading error messages.
670-670: Previous review concern addressed.The
reasonparameter is now properly used for logging, addressing the earlier review comment about the unused parameter.docs/agents/TEST_REQUIREMENTS.md (1)
203-204: Previous review concern addressed.The duplicate requirement IDs have been resolved by renaming them to
[PIPE-MOD-01]and[PIPE-MOD-02], making each requirement ID unique throughout the document.
🧹 Nitpick comments (7)
src/exstruct/io/__init__.py (1)
488-489: Consider renaming publicly exported functions to remove underscore prefix.The functions
_require_yamland_require_toonare now part of the public API (exported in__all__), but their names start with an underscore, which by Python convention indicates private/internal functions. For a cleaner public API, consider renaming them torequire_yamlandrequire_toon(without the underscore).🔎 Suggested changes for consistent public API naming
In
src/exstruct/io/serialize.py, rename the functions:-def _require_yaml() -> ModuleType: +def require_yaml() -> ModuleType: """Ensure pyyaml is installed; otherwise raise with guidance.""" ... -def _require_toon() -> ModuleType: +def require_toon() -> ModuleType: """Ensure python-toon is installed; otherwise raise with guidance.""" ...In this file, update the import and export:
from .serialize import ( _FORMAT_HINTS, _ensure_format_hint, - _require_toon, - _require_yaml, + require_toon, + require_yaml, _serialize_payload_from_hint, ) __all__ = [ "dict_without_empty_values", "save_as_json", "save_as_yaml", "save_as_toon", "save_sheets", "save_sheets_as_json", "build_print_area_views", "save_print_area_views", "save_auto_page_break_views", "serialize_workbook", - "_require_yaml", - "_require_toon", + "require_yaml", + "require_toon", ]tests/test_backends.py (3)
16-22: Add type hints to mock functions.The fake functions should have explicit type hints for maintainability and mypy compliance.
🔎 Proposed refactor
- def fake_cells(_: Path) -> dict[str, list[object]]: + def fake_cells(file_path: Path) -> dict[str, list[object]]: calls.append("cells") return {} - def fake_cells_links(_: Path) -> dict[str, list[object]]: + def fake_cells_links(file_path: Path) -> dict[str, list[object]]: calls.append("links") return {}As per coding guidelines, avoid using
_for actual parameters; use descriptive names with proper type hints.
43-44: Use explicit parameter names with type hints.Replace generic
_and__with descriptive parameter names for better readability.🔎 Proposed refactor
- def fake_detect(_: Path, __: str) -> list[str]: + def fake_detect(file_path: Path, sheet_name: str) -> list[str]: raise RuntimeError("boom")As per coding guidelines, use descriptive parameter names.
58-59: Use explicit type hints instead of generic object.The mock function should use proper type signatures for clarity.
🔎 Proposed refactor
- def fake_colors_map(*_: object, **__: object) -> object: + def fake_colors_map( + workbook: object, + *, + include_default_background: bool, + ignore_colors: set[str] | None + ) -> object: raise RuntimeError("boom")As per coding guidelines, provide explicit type hints for all parameters.
src/exstruct/core/backends/openpyxl_backend.py (1)
103-125: Consider adding a module-level docstring. The file contains well-structured code with proper docstrings for classes and methods, but lacks a module-level docstring explaining the openpyxl backend's role in the extraction pipeline.
🔎 Suggested addition
Add at the top of the file after imports:
"""Openpyxl-based backend for Excel workbook extraction. This module provides the OpenpyxlBackend class which uses the openpyxl library to extract cells, print areas, color maps, and table candidates from Excel files. It serves as the primary extraction backend with fallback support when COM is unavailable. """As per coding guidelines, adding comprehensive module documentation improves maintainability.
src/exstruct/core/backends/com_backend.py (2)
173-206: Consider extracting CSV parsing to a utility function for reusability. The `_split_csv_respecting_quotes` function implements custom CSV parsing with quote handling. While the implementation is correct, this type of utility might be useful elsewhere in the codebase.
Consider moving this to a shared utilities module if similar parsing is needed elsewhere, or documenting that Python's `csv` module with appropriate dialect settings could be an alternative.
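As a rough illustration of that alternative, the standard-library `csv` module already splits a single line while respecting double quotes. The helper name below is hypothetical; this is a sketch, not the backend's `_split_csv_respecting_quotes`.

```python
import csv


def split_csv_line(line: str) -> list[str]:
    """Split one comma-separated line, keeping quoted commas intact (sketch)."""
    # csv.reader treats the string as a one-line file; the default dialect
    # already honours double quotes around fields.
    return next(csv.reader([line], skipinitialspace=True))


print(split_csv_line('a,"b,c",d'))  # ['a', 'b,c', 'd']
```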
1-206: Add module-level docstring for better documentation. The file contains well-structured code with proper class and method docstrings, but lacks a module-level docstring explaining the COM backend's role.
🔎 Suggested addition
Add at the top after imports:
"""COM-based backend for Excel workbook extraction via xlwings. This module provides the ComBackend class which uses xlwings/COM to extract print areas, color maps, and auto page breaks from Excel files. It complements the OpenpyxlBackend and is used when COM automation is available and required features (like auto page breaks) need native Excel API access. """As per coding guidelines, comprehensive module documentation improves maintainability.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- .github/workflows/pytest.yml
- docs/agents/CODE_REVIEW.md
- docs/agents/DATA_MODEL.md
- docs/agents/EXCEL_EXTRACTION.md
- docs/agents/TEST_REQUIREMENTS.md
- src/exstruct/core/backends/com_backend.py
- src/exstruct/core/backends/openpyxl_backend.py
- src/exstruct/core/pipeline.py
- src/exstruct/errors.py
- src/exstruct/io/__init__.py
- src/exstruct/models/__init__.py
- tests/test_auto_page_breaks.py
- tests/test_backends.py
- tests/test_engine.py
- tests/test_integrate_raw_data.py
- tests/test_modeling.py
- tests/test_models_validation.py
- tests/test_pipeline.py
- tests/test_print_area_views.py
- tests/test_print_areas_openpyxl.py
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/test_pipeline.py
- tests/test_modeling.py
- tests/test_integrate_raw_data.py
- src/exstruct/errors.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Add type hints to all function and method arguments and return values (mypy strict compliance)
Use `Any` type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)
Organize imports in this order: (1) standard library, (2) third-party packages, (3) exstruct internal modules
Use Google-style docstrings for all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Avoid writing God Functions (large, single-responsibility-violating functions)
Avoid writing God Objects (classes with too many responsibilities)
Avoid excessive nesting and deep conditional branching
Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data
Ensure code passes mypy strict mode with zero errors
Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
Avoid circular dependencies between modules
AI should propose automatic function splitting when functions are too complex
AI should propose additional Pydantic models when data structures are insufficient
AI should automatically fix and organize imports when they are inconsistent
AI should improve docstrings when they are incomplete or insufficient
Files:
- tests/test_print_areas_openpyxl.py
- tests/test_backends.py
- src/exstruct/core/backends/openpyxl_backend.py
- src/exstruct/io/__init__.py
- src/exstruct/core/backends/com_backend.py
- tests/test_print_area_views.py
- tests/test_engine.py
- tests/test_auto_page_breaks.py
- src/exstruct/core/pipeline.py
- tests/test_models_validation.py
- src/exstruct/models/__init__.py
🧠 Learnings (1)
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Applied to files:
src/exstruct/io/__init__.py
🧬 Code graph analysis (6)
tests/test_backends.py (3)
src/exstruct/core/backends/com_backend.py (3)
`ComBackend` (18-135), `extract_colors_map` (51-74), `extract_print_areas` (27-49)
src/exstruct/core/backends/openpyxl_backend.py (5)
`OpenpyxlBackend` (23-100), `extract_cells` (32-45), `detect_tables` (88-100), `extract_colors_map` (64-86), `extract_print_areas` (47-62)
src/exstruct/core/ranges.py (1)
`parse_range_zero_based` (24-47)
src/exstruct/core/backends/openpyxl_backend.py (3)
src/exstruct/models/__init__.py (1)
`PrintArea` (97-103)
src/exstruct/core/ranges.py (1)
`parse_range_zero_based` (24-47)
src/exstruct/core/workbook.py (1)
`openpyxl_workbook` (14-53)
src/exstruct/io/__init__.py (3)
src/exstruct/core/ranges.py (1)
`parse_range_zero_based` (24-47)
src/exstruct/errors.py (2)
`OutputError` (32-33), `SerializationError` (20-21)
src/exstruct/io/serialize.py (4)
`_ensure_format_hint` (28-49), `_require_toon` (104-112), `_require_yaml` (93-101), `_serialize_payload_from_hint` (52-90)
tests/test_print_area_views.py (1)
src/exstruct/models/__init__.py (2)
`CellRow` (50-59), `PrintArea` (97-103)
tests/test_auto_page_breaks.py (1)
src/exstruct/models/__init__.py (1)
PrintArea(97-103)
tests/test_models_validation.py (1)
src/exstruct/models/__init__.py (1)
CellRow(50-59)
🪛 GitHub Check: Codacy Static Code Analysis
src/exstruct/core/backends/com_backend.py
[warning] 37-37: src/exstruct/core/backends/com_backend.py#L37
Try, Except, Continue detected.
[warning] 132-132: src/exstruct/core/backends/com_backend.py#L132
Try, Except, Pass detected.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: test (windows-latest, 3.11)
🔇 Additional comments (24)
docs/agents/DATA_MODEL.md (2)
96-96: Clear coordinate semantics for PrintArea rows, a good clarification. Adding "(1-based, inclusive)" to `r1` and `r2` aligns with the existing `c1` and `c2` documentation and makes the row coordinate interpretation explicit. This improves clarity for API consumers.
Also applies to: 98-98
140-140: Documentation for the `colors_map` coordinate system is correct and matches the implementation. The codebase confirms that `colors_map` uses (row=1-based, col=0-based) coordinates as documented. Both extraction paths, openpyxl (`cell.row`, `cell.col_idx - 1`) and COM (`row`, `col - 1`), apply the same logic: rows are preserved as 1-based, columns are converted from 1-based to 0-based. Test coverage validates this with `{"#FFFFFF": [(1, 0)]}`, confirming row 1 with column 0.
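To make that shared convention concrete, here is a small, hedged sketch of how a color-to-coordinates map could be built with openpyxl. It is not the repository's extractor; the filter on the default fill color and the output format are assumptions.

```python
from openpyxl import Workbook
from openpyxl.styles import PatternFill
from openpyxl.worksheet.worksheet import Worksheet


def build_colors_map(ws: Worksheet) -> dict[str, list[tuple[int, int]]]:
    """Map fill color -> [(row, col)] with 1-based rows and 0-based columns."""
    colors: dict[str, list[tuple[int, int]]] = {}
    for row in ws.iter_rows():
        for cell in row:
            rgb = getattr(cell.fill.start_color, "rgb", None)
            if not rgb or rgb == "00000000":  # skip cells with no explicit fill
                continue
            # cell.row is already 1-based; cell.col_idx is 1-based, so shift it.
            colors.setdefault(str(rgb), []).append((cell.row, cell.col_idx - 1))
    return colors


wb = Workbook()
ws = wb.active
ws["A1"] = "x"
ws["A1"].fill = PatternFill(start_color="FFFFFFFF", end_color="FFFFFFFF", fill_type="solid")
print(build_colors_map(ws))  # {'FFFFFFFF': [(1, 0)]}
```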
src/exstruct/io/__init__.py (3)
8-18: Well done removing the OpenPyXL dependency! The refactor successfully eliminates the direct dependency on `openpyxl.utils.cell.range_boundaries` by using the internal `parse_range_zero_based` function, and centralizes serialization logic through the new helpers from `.serialize`. This aligns with the coding guideline to avoid depending on internal structures of external libraries.
Based on learnings, this change properly normalizes external data to Pydantic models at boundaries.
283-288: Excellent serialization refactoring! The centralization of format validation and serialization through `_ensure_format_hint` and `_serialize_payload_from_hint` eliminates code duplication and provides a single source of truth for format handling. This improves maintainability and consistency across all export functions.
Also applies to: 315-318, 341-346, 374-377, 393-402, 429-432, 449-454, 469-471
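As a rough illustration of that single-source-of-truth pattern (not the actual `_serialize_payload_from_hint` implementation; the hint names and the YAML branch are assumptions), a hint-driven dispatcher can look like this:

```python
import json
from typing import Any

_FORMAT_HINTS = {"json", "yaml"}


def ensure_format_hint(fmt: str) -> str:
    """Validate and normalize a format hint, raising on unsupported values."""
    hint = fmt.lower()
    if hint not in _FORMAT_HINTS:
        raise ValueError(f"Unsupported output format: {fmt!r}")
    return hint


def serialize_payload_from_hint(payload: Any, hint: str) -> str:
    """Serialize a payload to text according to a validated format hint."""
    if hint == "json":
        return json.dumps(payload, ensure_ascii=False, indent=2)
    # The YAML branch would require pyyaml; kept as a guarded import here.
    import yaml  # type: ignore[import-untyped]

    return yaml.safe_dump(payload, allow_unicode=True)


print(serialize_payload_from_hint({"sheet": "Sheet1"}, ensure_format_hint("JSON")))
```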
133-134: Row index adjustments correctly implement 1-based row coordinates. The adjustments on lines 133-134 (adding 1 to `r1` and `r2`) and lines 148, 150 (subtracting 1 from `area.r1` in pixel calculations) correctly convert between the 0-based coordinates returned by `parse_range_zero_based` and the 1-based row coordinates used in `PrintArea`, as documented in the PR objectives.
Also applies to: 148-148, 150-150
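A minimal sketch of that conversion, using plain dataclasses as stand-ins for the repository's `RangeBounds` and `PrintArea` Pydantic models:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Zero-based range bounds (stand-in for RangeBounds)."""
    r1: int
    c1: int
    r2: int
    c2: int


@dataclass(frozen=True)
class Area:
    """Print area with 1-based rows and 0-based columns (stand-in for PrintArea)."""
    r1: int
    c1: int
    r2: int
    c2: int


def to_print_area(bounds: Bounds) -> Area:
    # Rows shift from 0-based to 1-based; columns stay 0-based.
    return Area(r1=bounds.r1 + 1, c1=bounds.c1, r2=bounds.r2 + 1, c2=bounds.c2)


print(to_print_area(Bounds(r1=0, c1=0, r2=1, c2=1)))  # Area(r1=1, c1=0, r2=2, c2=1)
```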
tests/test_models_validation.py (1)
19-19: LGTM: Coordinate system alignment. The update from `r=0` to `r=1` correctly reflects the 1-based row indexing documented in the `CellRow` model definition.
tests/test_auto_page_breaks.py (1)
47-47: LGTM: Coordinate system alignment. The updated PrintArea coordinates correctly reflect the 1-based row indexing (r1=1, r2=2) and 0-based column indexing (c1=0, c2=1) documented in the PrintArea model.
.github/workflows/pytest.yml (1)
34-46: LGTM: Coverage collection and Codecov integration properly configured. The pytest coverage flags and Codecov upload step are correctly implemented. The Codecov action is now properly pinned to a full commit SHA, addressing the previous security concern.
src/exstruct/models/__init__.py (3)
14-16: LGTM: Field definition reflow. The Shape.id field definition has been reflowed for readability while keeping the same semantic meaning.
101-103: LGTM: Coordinate system documentation clarified. The field descriptions now explicitly document that columns use 0-based indexing, improving clarity for API consumers.
129-132: LGTM: colors_map coordinate semantics documented. The description now explicitly clarifies that rows are 1-based and columns are 0-based, which is essential for correct usage of the colors_map field.
docs/agents/EXCEL_EXTRACTION.md (1)
1-66: LGTM: Comprehensive pipeline documentation. This documentation provides clear guidance on the extraction pipeline architecture, coordinate system semantics, modes, and fallback behavior. The coordinate system clarification (rows 1-based, columns 0-based) is consistent with the code changes throughout the PR.
src/exstruct/core/pipeline.py (1)
1-696: LGTM: Pipeline architecture with proper type safety. The pipeline module demonstrates excellent adherence to coding guidelines:
- Complete type hints on all functions and parameters
- Google-style docstrings throughout
- Immutable dataclasses for pipeline configuration and state
- Well-organized imports (stdlib → third-party → internal)
- Clear separation of pre-COM and COM extraction steps
The architecture provides a solid foundation for the extraction workflow with explicit fallback handling and state tracking.
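The review above summarizes the orchestration; the following is a heavily simplified sketch of that pre-COM/COM split with fallback handling. Names such as `PipelinePlan` and the step callables are illustrative assumptions, not the actual `run_extraction_pipeline` API.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelinePlan:
    """Illustrative plan: openpyxl steps always run, COM steps are optional."""
    pre_com_steps: tuple[Callable[[dict], None], ...]
    com_steps: tuple[Callable[[dict], None], ...] = ()
    use_com: bool = False


def run_pipeline(plan: PipelinePlan) -> dict:
    state: dict = {"fallback_reason": None}
    for step in plan.pre_com_steps:
        step(state)  # openpyxl-based extraction (cells, print areas, colors)
    if plan.use_com:
        try:
            for step in plan.com_steps:
                step(state)  # xlwings/COM extraction (shapes, charts, page breaks)
        except RuntimeError:
            # Record why COM was skipped and keep the openpyxl results.
            state["fallback_reason"] = "COM_PIPELINE_FAILED"
    return state


def extract_cells(state: dict) -> None:
    state["cells"] = {"Sheet1": []}


def extract_shapes_com(state: dict) -> None:
    raise RuntimeError("Excel/COM not available")


print(run_pipeline(PipelinePlan((extract_cells,), (extract_shapes_com,), use_com=True)))
```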
docs/agents/TEST_REQUIREMENTS.md (1)
1-228: LGTM: Comprehensive test requirements specification. The test requirements document provides thorough coverage of functional, non-functional, and integration requirements. The organization by category (pipeline, backend, ranges, etc.) aligns well with the modular architecture introduced in this PR.
tests/test_print_areas_openpyxl.py (1)
26-26: LGTM: Coordinate system alignment. The assertion now correctly expects 1-based row coordinates (r1=1, r2=2) and 0-based column coordinates (c1=0, c2=1), consistent with the PrintArea model definition and the broader coordinate system updates in this PR.
tests/test_print_area_views.py (1)
39-46: LGTM! Coordinate system update correctly applied. The test data has been properly updated to reflect the documented coordinate convention (1-based rows, 0-based columns). The CellRow indices and PrintArea bounds are now consistent with the model definitions.
tests/test_backends.py (2)
76-93: LGTM! Print area extraction test validates the coordinate system. The test correctly verifies that print areas extracted via openpyxl use 1-based row indexing (r1=1) and 0-based column indexing (c1=0), consistent with the PrintArea model definition.
95-101: LGTM! Range parsing test validates the zero-based intermediate representation. The test correctly verifies that `parse_range_zero_based` returns zero-based coordinates (r1=0, c1=0, r2=1, c2=1) for the range "Sheet1!A1:B2", which are then converted to the appropriate coordinate system by backend methods.
src/exstruct/core/backends/openpyxl_backend.py (1)
159-162: LGTM! Coordinate conversion correctly implemented. The conversion from zero-based coordinates (from `_parse_print_area_range`) to the PrintArea model is now correct:
- Rows (r1, r2): converted from 0-based to 1-based with `+ 1`
- Columns (c1, c2): kept as 0-based (no adjustment)
This matches the PrintArea model definition where rows are 1-based and columns are 0-based.
tests/test_engine.py (2)
6-12: LGTM! Improved API design with nested option groups. The introduction of `FilterOptions` and `DestinationOptions` improves the API's organization and clarity, grouping related settings together under `OutputOptions`.
65-65: LGTM! PrintArea coordinate updated correctly. The PrintArea coordinates have been updated to use 1-based row indexing (r1=1), consistent with the model definition and the broader coordinate system clarification in this PR.
docs/agents/CODE_REVIEW.md (1)
1-609: Documentation file - no code review needed. This file contains example code review comments and AI agent prompts for documentation purposes. It does not contain production code requiring review.
src/exstruct/core/backends/com_backend.py (2)
45-48: LGTM! Coordinate conversion correctly implemented. The conversion from zero-based coordinates (from `_parse_print_area_range`) to PrintArea is correct:
- Rows (r1, r2): `+ 1` converts 0-based to 1-based
- Columns (c1, c2): no adjustment keeps 0-based
This matches the PrintArea model definition.
119-125: LGTM! Coordinate conversion for COM values is correct. The coordinate handling here is actually correct, contrary to the past review comment:
- `rows[i]` and `cols[j]` are 1-based values from Excel COM
- Rows (r1, r2): used as-is since PrintArea expects 1-based rows ✓
- Columns (c1, c2): `- 1` converts from 1-based COM to 0-based for PrintArea ✓

This is consistent with the PrintArea model where rows are 1-based and columns are 0-based.
- Change `_parse_range_zero_based` in `__init__.py` to return `RangeBounds` and update callers to use field access
- Add module docstrings to `openpyxl_backend.py` and `com_backend.py`
- Add type hints to the mock functions in `test_backends.py`
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/exstruct/core/cells.py (1)
502-526: Remove the redundant `wb.close()` call. Line 525 explicitly calls `wb.close()`, but the `openpyxl_workbook` context manager on line 502 already handles closing in its `finally` block (see `src/exstruct/core/workbook.py` lines 53-56). This redundant call could:
- Mask exceptions from the context block if `close()` raises
- Create confusion about resource ownership
- Cause double-close issues
🔎 Proposed fix
```diff
             links = sheet_links.get(row.r, {})
             merged_rows.append(CellRow(r=row.r, c=row.c, links=links or None))
         merged[sheet_name] = merged_rows
-    wb.close()
     return merged
```
🧹 Nitpick comments (3)
src/exstruct/core/backends/openpyxl_backend.py (1)
167-179: Consider returning `RangeBounds` directly instead of unpacking to a tuple. The function unpacks the `RangeBounds` Pydantic model from `parse_range_zero_based` into a tuple, which reduces type safety and semantic clarity. The coding guideline recommends using Pydantic models for structured data.
🔎 Proposed refactor

```diff
-def _parse_print_area_range(range_str: str) -> tuple[int, int, int, int] | None:
-    """Parse an Excel range string into zero-based coordinates.
+def _parse_print_area_range(range_str: str) -> RangeBounds | None:
+    """Parse an Excel range string into zero-based bounds.

     Args:
         range_str: Excel range string.

     Returns:
-        Zero-based (r1, c1, r2, c2) tuple or None on failure.
+        RangeBounds in zero-based coordinates, or None on failure.
     """
-    bounds = parse_range_zero_based(range_str)
-    if bounds is None:
-        return None
-    return (bounds.r1, bounds.c1, bounds.r2, bounds.c2)
+    return parse_range_zero_based(range_str)
```

Then update the caller at line 161:

```diff
     for part in str(range_str).split(","):
         parsed = _parse_print_area_range(part)
         if not parsed:
             continue
-        r1, c1, r2, c2 = parsed
+        r1, c1, r2, c2 = parsed.r1, parsed.c1, parsed.r2, parsed.c2
         areas.setdefault(sheet_name, []).append(
             PrintArea(r1=r1 + 1, c1=c1, r2=r2 + 1, c2=c2)
         )
```

As per coding guidelines, structured data should be represented as Pydantic models rather than tuples.
src/exstruct/core/backends/com_backend.py (1)
158-170: Consider returning `RangeBounds` directly instead of a tuple. Same issue as in `openpyxl_backend.py`: the function unpacks the Pydantic `RangeBounds` model into a tuple, reducing type safety.
🔎 Proposed refactor

```diff
-def _parse_print_area_range(range_str: str) -> tuple[int, int, int, int] | None:
-    """Parse an Excel range string into zero-based coordinates.
+def _parse_print_area_range(range_str: str) -> RangeBounds | None:
+    """Parse an Excel range string into zero-based bounds.

     Args:
         range_str: Excel range string.

     Returns:
-        Zero-based (r1, c1, r2, c2) tuple or None on failure.
+        RangeBounds in zero-based coordinates, or None on failure.
     """
-    bounds = parse_range_zero_based(range_str)
-    if bounds is None:
-        return None
-    return (bounds.r1, bounds.c1, bounds.r2, bounds.c2)
+    return parse_range_zero_based(range_str)
```

Then update the caller at line 52:

```diff
     for part in str(raw).split(","):
         parsed = _parse_print_area_range(part)
         if not parsed:
             continue
-        r1, c1, r2, c2 = parsed
+        r1, c1, r2, c2 = parsed.r1, parsed.c1, parsed.r2, parsed.c2
         areas.setdefault(sheet.name, []).append(
             PrintArea(r1=r1 + 1, c1=c1, r2=r2 + 1, c2=c2)
         )
```

As per coding guidelines, structured data should be represented as Pydantic models rather than tuples.
src/exstruct/io/__init__.py (1)
74-83: Consider removing this wrapper function.
`_parse_range_zero_based` simply delegates to `parse_range_zero_based` with duplicate documentation, adding no additional logic or value. Direct usage of `parse_range_zero_based` from `core.ranges` would eliminate this indirection and improve maintainability.
🔎 Proposed refactor

Remove this wrapper entirely and update internal call sites (line 130, if any others) to directly import and use `parse_range_zero_based` from `core.ranges`.

```diff
-def _parse_range_zero_based(range_str: str) -> RangeBounds | None:
-    """Parse an Excel range string into zero-based bounds.
-
-    Args:
-        range_str: Excel range string (e.g., "Sheet1!A1:B2").
-
-    Returns:
-        RangeBounds in zero-based coordinates, or None on failure.
-    """
-    return parse_range_zero_based(range_str)
```

Then update call sites:

```diff
-    bounds = _parse_range_zero_based(candidate)
+    bounds = parse_range_zero_based(candidate)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/agents/CODE_REVIEW.md
- src/exstruct/core/backends/com_backend.py
- src/exstruct/core/backends/openpyxl_backend.py
- src/exstruct/core/cells.py
- src/exstruct/core/ranges.py
- src/exstruct/core/workbook.py
- src/exstruct/io/__init__.py
- tests/test_backends.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/test_backends.py
- src/exstruct/core/ranges.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Add type hints to all function and method arguments and return values (mypy strict compliance)
Use `Any` type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)
Organize imports in this order: (1) standard library, (2) third-party packages, (3) exstruct internal modules
Use Google-style docstrings for all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Avoid writing God Functions (large, single-responsibility-violating functions)
Avoid writing God Objects (classes with too many responsibilities)
Avoid excessive nesting and deep conditional branching
Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data
Ensure code passes mypy strict mode with zero errors
Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
Avoid circular dependencies between modules
AI should propose automatic function splitting when functions are too complex
AI should propose additional Pydantic models when data structures are insufficient
AI should automatically fix and organize imports when they are inconsistent
AI should improve docstrings when they are incomplete or insufficient
Files:
- src/exstruct/core/backends/openpyxl_backend.py
- src/exstruct/core/cells.py
- src/exstruct/io/__init__.py
- src/exstruct/core/workbook.py
- src/exstruct/core/backends/com_backend.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Applied to files:
- src/exstruct/core/cells.py
- src/exstruct/io/__init__.py
🧬 Code graph analysis (4)
src/exstruct/core/cells.py (2)
src/exstruct/models/__init__.py (1)
`CellRow` (50-59)
src/exstruct/core/workbook.py (1)
`openpyxl_workbook` (17-56)
src/exstruct/io/__init__.py (4)
src/exstruct/core/ranges.py (2)
`RangeBounds` (9-22), `parse_range_zero_based` (25-48)
src/exstruct/errors.py (2)
`OutputError` (32-33), `SerializationError` (20-21)
src/exstruct/engine.py (1)
`serialize` (353-376)
src/exstruct/io/serialize.py (4)
`_ensure_format_hint` (28-49), `_require_toon` (104-112), `_require_yaml` (93-101), `_serialize_payload_from_hint` (52-90)
src/exstruct/core/workbook.py (1)
tests/test_workbook_context.py (2)
`close` (12-13), `close` (43-44)
src/exstruct/core/backends/com_backend.py (4)
src/exstruct/models/__init__.py (1)
`PrintArea` (97-103)
src/exstruct/core/cells.py (2)
`WorkbookColorsMap` (53-67), `extract_sheet_colors_map_com` (94-119)
src/exstruct/core/ranges.py (1)
`parse_range_zero_based` (25-48)
src/exstruct/core/backends/base.py (2)
`extract_print_areas` (32-33), `extract_colors_map` (35-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (9)
docs/agents/CODE_REVIEW.md (1)
1-779: Documentation artifact - no code review required. This file contains code review examples and AI agent prompts for documentation purposes. It does not contain production code requiring review.
src/exstruct/core/cells.py (2)
85-91: Excellent refactoring to use context-managed workbook access. The replacement of direct `load_workbook` calls with the `openpyxl_workbook` context manager ensures proper resource cleanup and centralizes warning suppression logic.
Based on learnings, this change properly manages external library resources at boundaries.
1161-1454: Well-structured table detection refactoring. The new helper functions follow good separation of concerns:
- `_extract_listobject_tables`: COM ListObject extraction
- `_detect_border_rectangles_xlwings`: border-based detection via COM
- `_merge_rectangles`: deduplication logic
- `_collect_table_candidates_from_values`: heuristic scoring
- `detect_tables_openpyxl`: public openpyxl entry point

All functions have proper type hints, Google-style docstrings, and appropriate error handling.
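For readers unfamiliar with what rectangle deduplication involves, here is a hedged sketch; the containment rule below is an assumption about what `_merge_rectangles` might do, not its actual implementation.

```python
Rect = tuple[int, int, int, int]  # (r1, c1, r2, c2), inclusive bounds


def contains(outer: Rect, inner: Rect) -> bool:
    """Return True if `outer` fully covers `inner`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def merge_rectangles(rects: list[Rect]) -> list[Rect]:
    """Drop exact duplicates and rectangles fully contained in another."""
    unique = list(dict.fromkeys(rects))  # preserves order, removes duplicates
    return [
        r
        for r in unique
        if not any(other != r and contains(other, r) for other in unique)
    ]


print(merge_rectangles([(0, 0, 5, 5), (1, 1, 2, 2), (0, 0, 5, 5), (6, 0, 8, 3)]))
# [(0, 0, 5, 5), (6, 0, 8, 3)]
```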
src/exstruct/core/workbook.py (1)
1-114: Excellent context manager implementation for workbook lifecycle management. The two context managers demonstrate best practices:
- Resource safety: guaranteed cleanup in `finally` blocks (lines 52-56, 80-87)
- Error resilience: swallows cleanup exceptions to avoid masking original errors
- Defensive reuse: `_find_open_workbook` safely reuses existing Excel instances via path resolution (lines 90-114)
- Warning filtering: suppresses known non-critical openpyxl warnings (lines 30-48)
All functions have complete type hints and proper Google-style docstrings, adhering to coding guidelines.
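A minimal sketch of that pattern (not the repository's `openpyxl_workbook`, whose signature and warning filters may differ):

```python
import warnings
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path

from openpyxl import Workbook, load_workbook


@contextmanager
def openpyxl_workbook(path: Path, *, read_only: bool = False) -> Iterator[Workbook]:
    """Open a workbook, suppress known noisy warnings, and always close it."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UserWarning)
        wb = load_workbook(path, read_only=read_only, data_only=True)
    try:
        yield wb
    finally:
        try:
            wb.close()
        except Exception:
            pass  # never mask the original exception with a close() failure


# Usage (assuming book.xlsx exists):
# with openpyxl_workbook(Path("book.xlsx")) as wb:
#     print(wb.sheetnames)
```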
src/exstruct/core/backends/openpyxl_backend.py (1)
24-102: Well-designed backend abstraction with proper separation of concerns. The `OpenpyxlBackend` class cleanly encapsulates openpyxl-based extraction:
- Single responsibility: each method delegates to specialized utilities
- Error handling: graceful degradation with empty returns or `None`
- Type safety: complete type hints throughout
- Documentation: clear Google-style docstrings

The coordinate conversion at line 163 correctly adjusts rows to 1-based while keeping columns 0-based, matching the `PrintArea` model specification.
29-56: Coordinate conversion correctly implemented. Line 54 properly converts from the zero-based (parsed) bounds to the `PrintArea` model's coordinate system:
- Rows: `+ 1` converts 0-based to 1-based ✓
- Columns: no adjustment keeps them 0-based ✓

This matches the `PrintArea` specification: `r1`/`r2` are 1-based, `c1`/`c2` are 0-based.
83-155: Auto page-break extraction with proper state management. The method demonstrates good practices:
- State preservation: saves and restores `DisplayPageBreaks` in a `finally` block (lines 143-152)
- Error resilience: continues processing other sheets on failure (lines 153-154)
- Clear structure: despite its length, the logic flows clearly through setup → extraction → restoration

The coordinate conversion at line 134 is correct:
- Excel COM returns 1-based row/column values
- Rows stay 1-based (no adjustment) ✓
- Columns converted from 1-based to 0-based (`- 1`) ✓

This aligns with the `PrintArea` model specification.
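A hedged sketch of that save/restore pattern with xlwings (requires Excel on Windows; the surrounding pipeline code and error handling in the real backend differ):

```python
import xlwings as xw


def horizontal_break_rows(sheet: xw.Sheet) -> list[int]:
    """Read 1-based rows of horizontal page breaks, restoring sheet state."""
    ws = sheet.api  # raw Excel COM worksheet
    previous = ws.DisplayPageBreaks
    try:
        # Excel only materializes automatic page breaks while they are displayed.
        ws.DisplayPageBreaks = True
        return [ws.HPageBreaks(i).Location.Row for i in range(1, ws.HPageBreaks.Count + 1)]
    finally:
        ws.DisplayPageBreaks = previous  # always restore the user's setting


# Usage sketch:
# with xw.App(visible=False) as app:
#     book = app.books.open("book.xlsx")
#     print(horizontal_break_rows(book.sheets[0]))
```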
src/exstruct/io/__init__.py (2)
8-18: LGTM: Clean import refactoring. The imports are well-organized and follow the project's structure, centralizing range parsing and serialization utilities.
320-323: Excellent refactoring: centralized serialization. The refactoring to use `_serialize_payload_from_hint` across all serialization call sites eliminates code duplication and improves maintainability. This is a clean application of the DRY principle.
Also applies to: 379-382, 405-407, 434-437, 474-476
| "_require_yaml", | ||
| "_require_toon", |
Reconsider exporting private functions in public API.
_require_yaml and _require_toon use underscore prefixes indicating private/internal usage, yet they're exported in __all__, making them part of the public API. This creates inconsistency:
- If these functions are meant to be public, remove the underscore prefix
- If they should remain internal, remove them from `__all__`
Based on their usage in serialize.py and re-export here, if they're genuinely needed by external consumers, consider renaming to require_yaml and require_toon.
🔎 Proposed fix (if making them truly public)
```diff
 __all__ = [
     "dict_without_empty_values",
     "save_as_json",
     "save_as_yaml",
     "save_as_toon",
     "save_sheets",
     "save_sheets_as_json",
     "build_print_area_views",
     "save_print_area_views",
     "save_auto_page_break_views",
     "serialize_workbook",
-    "_require_yaml",
-    "_require_toon",
+    "require_yaml",
+    "require_toon",
 ]
```

And in the imports:
```diff
 from .serialize import (
     _FORMAT_HINTS,
     _ensure_format_hint,
-    _require_toon,
-    _require_yaml,
+    require_toon as _require_toon,  # If still needed internally with underscore
+    require_yaml as _require_yaml,
     _serialize_payload_from_hint,
 )
```

Alternatively, if they should remain private, simply remove them from `__all__`.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/exstruct/io/__init__.py around lines 493-494, the module exports two
names prefixed with underscores ("_require_yaml", "_require_toon") in __all__,
which exposes private/internal functions as part of the public API; either
remove them from __all__ to keep them internal, or rename the functions and
update all imports/usages (e.g., in serialize.py and re-exports) to public names
(require_yaml, require_toon) so the public API and identifiers are
consistent—pick one approach and apply it across the package.
Actionable comments posted: 0
♻️ Duplicate comments (3)
schemas/sheet.json (1)
219-219: Consistent with the print_area_view.json schema changes. The PrintArea column indexing update to 0-based is consistent across schemas. See the verification request in the print_area_view.json review.
Also applies to: 224-224
schemas/workbook.json (1)
219-219: Schema changes consistent across all files. The PrintArea column indexing (lines 219, 224) and the colors_map coordinate clarification (line 451) are consistent with print_area_view.json and sheet.json. This ensures uniform coordinate semantics across the schema hierarchy.
Also applies to: 224-224, 451-451
src/exstruct/io/__init__.py (1)
493-494: Private naming convention inconsistent with public export. The underscore prefix on `_require_yaml` and `_require_toon` signals internal use, yet they're exported in `__all__`. As noted in previous reviews, consider either:
- Remove from `__all__` if they should remain internal
- Rename to `require_yaml`/`require_toon` if they're intended for external use
🧹 Nitpick comments (1)
pyproject.toml (1)
35-35: taskipy>=1.14.1 follows the project's established pattern for dev dependencies. While the concern about preventing breaking changes is valid, other dev dependencies in the project (pytest, mypy, ruff, etc.) use the same open-ended constraint format. Version pinning is already managed by the uv lock file, which locks taskipy to 1.14.1. Adding an upper bound would diverge from the project's convention without providing additional protection.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- pyproject.toml
- schemas/print_area.json
- schemas/print_area_view.json
- schemas/sheet.json
- schemas/workbook.json
- src/exstruct/io/__init__.py
✅ Files skipped from review due to trivial changes (1)
- schemas/print_area.json
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Add type hints to all function and method arguments and return values (mypy strict compliance)
Use `Any` type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)
Organize imports in this order: (1) standard library, (2) third-party packages, (3) exstruct internal modules
Use Google-style docstrings for all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Avoid writing God Functions (large, single-responsibility-violating functions)
Avoid writing God Objects (classes with too many responsibilities)
Avoid excessive nesting and deep conditional branching
Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data
Ensure code passes mypy strict mode with zero errors
Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
Avoid circular dependencies between modules
AI should propose automatic function splitting when functions are too complex
AI should propose additional Pydantic models when data structures are insufficient
AI should automatically fix and organize imports when they are inconsistent
AI should improve docstrings when they are incomplete or insufficient
Files:
src/exstruct/io/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Codacy Static Code Analysis
- GitHub Check: test (windows-latest, 3.12)
- GitHub Check: test (windows-latest, 3.11)
🔇 Additional comments (10)
pyproject.toml (2)
3-3: LGTM! Version bump to 0.3.0 aligns with the release commit and follows semantic versioning conventions.
122-128: Fix the [tool.mypy] configuration for the src-layout project. The `[tool.mypy]` configuration specifies `packages = ["exstruct"]`, but the package only exists at `src/exstruct/`. For src-layout projects, add `mypy_path = "src"` to the `[tool.mypy]` section so mypy can locate the package. The current task command `mypy src/exstruct --strict` works because it explicitly specifies the path, but the configuration itself is broken for IDE integration and standalone mypy runs.
⛔ Skipped due to learnings
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Ensure code passes mypy strict mode with zero errors

Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors

schemas/sheet.json (1)
452-452: Good clarification of colors_map coordinate semantics. The updated description correctly documents the mixed indexing convention (1-based rows, 0-based columns) for the colors_map tuples. Both the openpyxl and COM backends consistently implement this convention: row values are taken directly from the cell/worksheet (1-based), and column indices are decremented by 1 to convert to 0-based.
schemas/print_area_view.json (1)
219-219: Breaking change: column indexing changed to 0-based. The schema now documents columns as 0-based while rows remain 1-based. Verification confirms this breaking change has been coordinated across the codebase: both extraction backends (openpyxl and COM) convert coordinates to the new format, and all consumers of PrintArea (filtering, shape/chart mapping, range parsing) correctly treat columns as 0-based and rows as 1-based.
src/exstruct/io/__init__.py (6)
8-18: LGTM! The imports are well-organized and follow the coding guidelines: standard library imports first, followed by internal modules. The centralized serialization helpers from `.serialize` support the refactoring goals.
74-83: Good refactoring to centralized range parsing. The delegation pattern eliminates code duplication while maintaining a local interface. Type hints and docstring are complete.
145-156: LGTM - Pixel rectangle calculation. The 1-based row to pixel conversion is handled correctly: `(area.r1 - 1) * row_px` for the top edge and `area.r2 * row_px` for the bottom edge, consistent with the 1-based row indexing in `PrintArea`.
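A tiny sketch of that arithmetic; the column handling below is an assumption mirroring the 0-based column convention, and the real helper may differ:

```python
def print_area_to_pixels(
    r1: int, c1: int, r2: int, c2: int, row_px: int, col_px: int
) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) in pixels for 1-based rows, 0-based columns."""
    top = (r1 - 1) * row_px      # row 1 starts at pixel 0
    bottom = r2 * row_px         # inclusive last row ends at its lower edge
    left = c1 * col_px           # column 0 starts at pixel 0
    right = (c2 + 1) * col_px    # inclusive last column ends at its right edge
    return (left, top, right, bottom)


print(print_area_to_pixels(1, 0, 2, 1, row_px=20, col_px=64))  # (0, 0, 128, 40)
```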
288-293: Consistent error handling across all format validation. All serialization functions now use `SerializationError` via `_ensure_format_hint`. The previous inconsistency (noted in past reviews for `save_sheets`) has been resolved.
Also applies to: 346-351, 398-403, 454-459
320-323: Good centralization of serialization logic. All save/serialize functions now delegate to `_serialize_payload_from_hint`, eliminating duplicated format-switching code and ensuring consistent behavior across JSON, YAML, and TOON outputs.
Also applies to: 379-382, 405-407, 434-437, 474-476
133-140: The coordinate system is correct and intentional. `PrintArea` uses 1-based rows and 0-based columns (as documented in the model); `RangeBounds` uses zero-based values for all coordinates. The code correctly converts rows from zero-based to 1-based (`bounds.r1 + 1`, `bounds.r2 + 1`) and uses columns directly without conversion (both are zero-based). This asymmetric coordinate system is consistent throughout the codebase and poses no issue.
Add v0.3.0 release notes (internal refactor PR #23)
#22 Refactored the internal implementation.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.