
Conversation

@harumiWeb (Owner) commented Dec 27, 2025

Refactored the internal implementation (#22).

Summary by CodeRabbit

  • New Features
    • Nested output options (format/filters/destinations) and new public option types for configuring exports.
    • New extraction pipeline with improved OpenPyXL/COM coordination, safer workbook handling, and more reliable fallback behavior.
  • Documentation
    • Major docs overhaul: architecture, API design, specs, roadmap, and contributor/agent guides updated.
  • Chores
    • CI now collects/uploads test coverage; coverage artifacts are ignored locally.
  • Tests
    • Expanded test suite covering backends, pipeline, IO/serialization, workbook context, logging, and modeling.


- Pipelined the extraction pre-processing and simplified integrate.py: integrate.py
- Added tests verifying the pipeline behavior: test_pipeline.py
- Added test requirements: TEST_REQUIREMENTS.md
- Checked off completed tasks: TASKS.md
- Implemented the openpyxl backend: openpyxl_backend.py
- Implemented the COM backend: com_backend.py
- Migrated the pipeline onto the backends: pipeline.py
- Cleaned up the COM-side calls: integrate.py
- Added tests: test_backends.py
- Added test requirements: TEST_REQUIREMENTS.md
- Checked off completed tasks: TASKS.md
- Added a shared range-detection utility: ranges.py
- Added tests for the shared table detection: test_table_detection_common.py
- Added test requirements: TEST_REQUIREMENTS.md
- Checked off completed tasks: TASKS.md
- Switched the existing output handling to the shared functions: __init__.py
- Added tests: test_io_common.py
- Added test requirements: TEST_REQUIREMENTS.md
- Checked off completed tasks: TASKS.md
- Wrapped the xlwings open/close in a contextmanager and applied it in integrate.py: integrate.py
- Converted the openpyxl usage sites to contextmanagers: cells.py, openpyxl_backend.py
- Unified the fallback reason codes and logging: errors.py, logging_utils.py
- Added tests: test_workbook_context.py, test_logging_utils.py
- Added test requirements: TEST_REQUIREMENTS.md
- Updated tasks: TASKS.md (updated the workbook management and logging unification sub-items)
…rkbook_data now builds SheetData/WorkbookData in the new flow. integrate_sheet_content was removed and split into collection functions, and the pipeline-side fallback generation was also reorganized to go through the new builders.
…M overrides, filling in missing print_areas, and adding run_extraction_pipeline

- integrate.py now consists solely of resolve_extraction_inputs + run_extraction_pipeline
- Updated test_pipeline.py and related tests (COM step composition, changed monkeypatch targets, etc.)
- Added the pipeline COM integration requirements to TEST_REQUIREMENTS.md

coderabbitai bot commented Dec 27, 2025

📝 Walkthrough

Walkthrough

Introduce a structured extraction pipeline: add backends (OpenPyXL/COM), pipeline orchestration with pre‑COM/COM steps and fallback handling, context managers and range parsing, raw→model builders, nested OutputOptions and serialization utilities, logging helpers, many tests, docs, and CI/tooling updates.
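In very rough terms, the delegation from integrate.extract_workbook to the pipeline reads like the sketch below. This is a hedged sketch only: the parameter names passed to resolve_extraction_inputs (file_path, mode) are assumptions, not the confirmed signature; only the two entry points and the PipelineResult fields come from this PR.

from pathlib import Path

from exstruct.core.pipeline import resolve_extraction_inputs, run_extraction_pipeline
from exstruct.models import WorkbookData


def extract_workbook_sketch(file_path: Path, mode: str = "standard") -> WorkbookData:
    # Hypothetical wrapper: resolve user-facing arguments into immutable ExtractionInputs,
    # then run the pre-COM (OpenPyXL) and COM (xlwings) steps with fallback handling.
    inputs = resolve_extraction_inputs(file_path=file_path, mode=mode)
    result = run_extraction_pipeline(inputs)
    # PipelineResult bundles the typed workbook plus raw artifacts and pipeline state.
    return result.workbook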

Changes

Cohort / File(s) Summary
CI & Tooling
.github/workflows/pytest.yml, .gitignore, pyproject.toml
Add coverage flags and a Codecov upload step; ignore coverage artifacts; bump version and add taskipy tasks (ruff, mypy, test, docs).
Package exports & engine
src/exstruct/__init__.py, src/exstruct/engine.py
Re-export FormatOptions; convert OutputOptions to nested groups (format, filters, destinations); remove legacy coercion and legacy properties; forbid extra keys.
Errors & logging
src/exstruct/errors.py, src/exstruct/core/logging_utils.py
Add FallbackReason enum and log_fallback helper for standardized fallback warnings.
Pipeline core & orchestration
src/exstruct/core/pipeline.py, src/exstruct/core/integrate.py, src/exstruct/core/modeling.py
Add ExtractionInputs/Artifacts/Plan/State/Result types, step builders and runners, pre‑COM/COM orchestration with COM fallback, collect/build raw→typed models; integrate.extract_workbook now delegates to pipeline.
Backends
src/exstruct/core/backends/__init__.py, .../base.py, .../openpyxl_backend.py, .../com_backend.py
Introduce Backend protocol and BackendConfig; add OpenpyxlBackend and ComBackend with methods: extract_cells, extract_print_areas, extract_colors_map, detect_tables, extract_auto_page_breaks and error handling.
Workbook & resources
src/exstruct/core/workbook.py, src/exstruct/core/ranges.py
Add openpyxl_workbook and xlwings_workbook context managers, _find_open_workbook helper, RangeBounds dataclass and parse_range_zero_based.
Cells & table detection
src/exstruct/core/cells.py
Refactor table detection and border‑rectangle heuristics; add detect_tables_openpyxl, helper utilities, and use workbook context manager for safe resource handling.
I/O & serialization
src/exstruct/io/__init__.py, src/exstruct/io/serialize.py
Centralize format handling via _serialize_payload_from_hint, normalize/validate format hints, add lazy _require_yaml and _require_toon with clear MissingDependencyError messages; re‑export require helpers.
Integrate / engine behaviour
src/exstruct/core/integrate.py, src/exstruct/engine.py, src/exstruct/__init__.py
integrate delegates extraction to pipeline; engine enforces nested options and stricter OutputOptions model.
I/O surface & schemas
src/exstruct/io/__init__.py, schemas/*.json
Adapted range handling to RangeBounds; update PrintArea column documentation to 0‑based; clarify colors_map row/col semantics in schemas.
Render & minor refactors
src/exstruct/render/__init__.py
Minor message formatting tweaks only.
Tests
tests/... (many files)
Add/update tests for backends, pipeline, modeling, workbook context managers, serialization, fallback behavior; update monkeypatch targets and expectations to reflect pipeline/context managers, nested option types, and coordinate shifts.
Documentation & agent guides
docs/agents/*, AGENTS.md, docs/agents/CODE_REVIEW.md, docs/agents/TASKS.md, docs/agents/DATA_MODEL.md
Large docs rewrite to pipeline‑centric architecture, boundary BaseModel vs internal dataclasses, extraction spec updates, feature roadmap and code‑review notes.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Pipeline
    participant OpenpyxlBackend
    participant ComBackend
    participant Modeling

    Client->>Pipeline: run_extraction_pipeline(inputs)
    activate Pipeline

    Pipeline->>Pipeline: resolve_extraction_inputs()
    Pipeline->>Pipeline: build_pipeline_plan()

    rect rgb(235,245,255)
        note over Pipeline,OpenpyxlBackend: Pre‑COM (OpenPyXL) steps
        Pipeline->>OpenpyxlBackend: extract_cells(include_links)
        OpenpyxlBackend-->>Pipeline: CellData
        Pipeline->>OpenpyxlBackend: extract_print_areas()
        OpenpyxlBackend-->>Pipeline: PrintAreaData
        Pipeline->>OpenpyxlBackend: extract_colors_map()
        OpenpyxlBackend-->>Pipeline: WorkbookColorsMap | None
    end

    alt use_com == true
        rect rgb(235,255,235)
            note over Pipeline,ComBackend: COM (xlwings) steps
            Pipeline->>ComBackend: extract_shapes_com()
            ComBackend-->>Pipeline: ShapeData
            Pipeline->>ComBackend: extract_charts_com()
            ComBackend-->>Pipeline: ChartData
            Pipeline->>ComBackend: extract_print_areas_com()
            ComBackend-->>Pipeline: PrintAreaData
            Pipeline->>ComBackend: extract_auto_page_breaks_com()
            ComBackend-->>Pipeline: PrintAreaData
        end
    else COM unavailable/failed
        rect rgb(255,240,230)
            note right of Pipeline: log_fallback(reason)\nuse OpenPyXL-only artifacts
        end
    end

    Pipeline->>Modeling: collect_sheet_raw_data(artifacts)
    Modeling-->>Pipeline: dict[str,SheetRawData]
    Pipeline->>Modeling: build_workbook_data(raw)
    Modeling-->>Pipeline: WorkbookData

    Pipeline-->>Client: PipelineResult{workbook, artifacts, state}
    deactivate Pipeline

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐇 I nibbled lines of code and stitched a trail,
Pipelines hum where workbooks used to wail,
OpenPyXL first, COM steps take the lead,
Fallbacks whispered gently when things misbehave,
Now sheets unfold their stories, neat and hale.

Pre-merge checks

❌ Failed checks (2 warnings)
  • Title check ⚠️ Warning: The title 'Dev/refactor' is vague and generic, providing no meaningful information about the primary change or objectives of the pull request. Resolution: replace it with a descriptive title summarizing the main refactoring effort, such as 'Refactor extraction pipeline and consolidate backends' or 'Extract pipeline architecture with nested option structure and COM fallback handling'.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 56.64%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.

@codecov-commenter

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/exstruct/core/cells.py (2)

502-526: Redundant wb.close() call after context manager exit.

Line 525 calls wb.close() after the with openpyxl_workbook(...) block has already exited. The context manager handles closing in its finally block, so this call is redundant and may raise an error if the workbook is already closed.

🔎 Proposed fix
     with openpyxl_workbook(file_path, data_only=True, read_only=False) as wb:
         for ws in wb.worksheets:
             sheet_links: dict[int, dict[str, str]] = {}
             for row in ws.iter_rows():
                 for cell in row:
                     link = getattr(cell, "hyperlink", None)
                     target = getattr(link, "target", None) if link else None
                     if not target:
                         continue
                     col_str = str(
                         cell.col_idx - 1
                     )  # zero-based to align with extract_sheet_cells
                     sheet_links.setdefault(cell.row, {})[col_str] = target
             links_by_sheet[ws.title] = sheet_links

     merged: dict[str, list[CellRow]] = {}
     for sheet_name, rows in cell_rows.items():
         sheet_links = links_by_sheet.get(sheet_name, {})
         merged_rows: list[CellRow] = []
         for row in rows:
             links = sheet_links.get(row.r, {})
             merged_rows.append(CellRow(r=row.r, c=row.c, links=links or None))
         merged[sheet_name] = merged_rows
-    wb.close()
     return merged

659-715: Worksheet object used after context manager closes workbook.

The worksheet ws is obtained inside the with openpyxl_workbook(...) block (line 666), but it is then used in the loop at lines 692-714 after the context manager has exited (the block ends at line 677). This may cause issues since the workbook is closed when the context exits.

🔎 Proposed fix: Move the entire processing inside the context manager
 def load_border_maps_xlsx(  # noqa: C901
     xlsx_path: Path, sheet_name: str
 ) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray, int, int]:
     with openpyxl_workbook(xlsx_path, data_only=True, read_only=False) as wb:
         if sheet_name not in wb.sheetnames:
             raise KeyError(f"Sheet '{sheet_name}' not found in {xlsx_path}")

         ws = wb[sheet_name]
         try:
             min_col, min_row, max_col, max_row = range_boundaries(
                 ws.calculate_dimension()
             )
         except Exception:
             min_col, min_row, max_col, max_row = (
                 1,
                 1,
                 ws.max_column or 1,
                 ws.max_row or 1,
             )

-    shape = (max_row + 1, max_col + 1)
-    # ... rest of function outside context
+        shape = (max_row + 1, max_col + 1)
+        has_border = np.zeros(shape, dtype=bool)
+        top_edge = np.zeros(shape, dtype=bool)
+        bottom_edge = np.zeros(shape, dtype=bool)
+        left_edge = np.zeros(shape, dtype=bool)
+        right_edge = np.zeros(shape, dtype=bool)
+
+        def edge_has_style(edge: object) -> bool:
+            if edge is None:
+                return False
+            style = getattr(edge, "style", None)
+            return style is not None and style != "none"
+
+        for r in range(min_row, max_row + 1):
+            for c in range(min_col, max_col + 1):
+                cell = ws.cell(row=r, column=c)
+                b = getattr(cell, "border", None)
+                if b is None:
+                    continue
+                # ... rest of processing
+
+        return has_border, top_edge, bottom_edge, left_edge, right_edge, max_row, max_col
🧹 Nitpick comments (17)
pyproject.toml (1)

122-128: Task definitions look good; consider removing redundant --strict flag.

The taskipy task definitions are well-structured and provide useful shortcuts for common development workflows. However, the mypy task explicitly passes the --strict flag even though [tool.mypy] already sets strict = true (line 111), making the flag redundant.

🔎 Optional: Remove redundant flag
-mypy = "mypy src/exstruct --strict"
+mypy = "mypy src/exstruct"
tests/test_table_detection_common.py (1)

3-6: Consider testing public APIs instead of private functions.

The tests directly import and invoke private functions (_merge_rectangles, _collect_table_candidates_from_values) prefixed with underscores. Private functions are implementation details that may change, making tests brittle. If these functions represent important behavior, consider:

  • Exposing public wrappers if the functionality should be tested
  • Testing them indirectly through public APIs (e.g., backend methods that use these utilities)
tests/test_backends.py (1)

55-73: Consider adding type hints to the DummyWorkbook class.

The DummyWorkbook class (lines 66-67) lacks type hints and serves as a test stub. While this is acceptable for test code, adding a protocol or explicit type would improve clarity.

Optional: Add Protocol for better type safety
+from typing import Protocol
+
+class WorkbookProtocol(Protocol):
+    pass
+
 class DummyWorkbook:
     pass
 
-backend = ComBackend(DummyWorkbook())
+backend = ComBackend(DummyWorkbook())  # type: ignore[arg-type]
tests/test_workbook_context.py (1)

28-65: Consider extracting the duplicated DummyWorkbook class.

The DummyWorkbook class is defined twice (lines 11-13 and lines 42-44), violating DRY principles. Consider extracting it as a module-level fixture or test helper.

Refactor to eliminate duplication
+class DummyWorkbook:
+    def close(self) -> None:
+        pass
+
 def test_openpyxl_workbook_closes(monkeypatch: MonkeyPatch, tmp_path: Path) -> None:
     calls: dict[str, int] = {"close": 0}
 
-    class DummyWorkbook:
+    class CountingWorkbook(DummyWorkbook):
         def close(self) -> None:
             calls["close"] += 1
 
     def fake_load_workbook(
         path: Path, *, data_only: bool, read_only: bool
-    ) -> DummyWorkbook:
-        return DummyWorkbook()
+    ) -> CountingWorkbook:
+        return CountingWorkbook()
 
     monkeypatch.setattr("exstruct.core.workbook.load_workbook", fake_load_workbook)
 
     with openpyxl_workbook(tmp_path / "book.xlsx", data_only=True, read_only=True):
         pass
 
     assert calls["close"] == 1
 
 
 def test_openpyxl_workbook_sets_warning_filters(
     monkeypatch: MonkeyPatch, tmp_path: Path
 ) -> None:
     calls: list[tuple[str, str, type[Warning], str]] = []
 
     def fake_filterwarnings(
         action: str,
         message: str = "",
         category: type[Warning] = Warning,
         module: str = "",
         **_kwargs: object,
     ) -> None:
         calls.append((action, message, category, module))
 
-    class DummyWorkbook:
-        def close(self) -> None:
-            pass
-
     def fake_load_workbook(
         path: Path, *, data_only: bool, read_only: bool
     ) -> DummyWorkbook:
         return DummyWorkbook()
tests/test_modeling.py (1)

5-39: LGTM! Consider adding more specific assertions.

The test correctly validates the transformation from SheetRawData/WorkbookRawData to WorkbookData. The assertions verify that collections are non-empty, which confirms the data flows through the builder.

For more robust validation, consider adding assertions that check specific field values (e.g., sheet.rows[0].r == 1, sheet.shapes[0].text == "S").

src/exstruct/core/backends/base.py (2)

9-10: Consider adding type aliases to module exports.

The type aliases CellData and PrintAreaData are defined at module level but may not be explicitly exported via __all__. If these are intended for public use by backend implementations, consider adding:

__all__ = ["BackendConfig", "Backend", "CellData", "PrintAreaData"]

This makes the public API surface explicit and aids type checking and IDE support.


26-38: Consider adding complete docstrings to Protocol methods.

The Backend Protocol methods (lines 29-38) have brief docstrings but lack Args/Returns sections as recommended by coding guidelines for Google-style docstrings. While Protocol definitions often use brief docstrings, adding complete documentation would improve clarity for implementers.

Example: Enhanced docstrings
     def extract_cells(self, *, include_links: bool) -> CellData:
-        """Extract cell rows from the workbook."""
+        """Extract cell rows from the workbook.
+        
+        Args:
+            include_links: Whether to include hyperlinks in cell data.
+        
+        Returns:
+            Mapping of sheet name to list of cell rows.
+        """

     def extract_print_areas(self) -> PrintAreaData:
-        """Extract print areas from the workbook."""
+        """Extract print areas from the workbook.
+        
+        Returns:
+            Mapping of sheet name to list of print areas.
+        """

     def extract_colors_map(
         self, *, include_default_background: bool, ignore_colors: set[str] | None
     ) -> WorkbookColorsMap | None:
-        """Extract colors map from the workbook."""
+        """Extract colors map from the workbook.
+        
+        Args:
+            include_default_background: Whether to include default backgrounds.
+            ignore_colors: Optional set of color keys to ignore.
+        
+        Returns:
+            WorkbookColorsMap or None if extraction fails.
+        """
src/exstruct/core/ranges.py (1)

24-47: Consider narrowing the exception handler to ValueError.

Line 40 uses a broad except Exception, which catches all exceptions. While this ensures the function returns None on any parsing failure, it may hide unexpected errors. According to openpyxl documentation, range_boundaries() raises ValueError for malformed range strings (e.g., missing row/column parts), so you can be more specific:

    try:
        min_col, min_row, max_col, max_row = range_boundaries(cleaned)
-   except Exception:
+   except ValueError:
        return None

However, given that this is a parsing utility where graceful failure is desired, the current approach is acceptable for robustness. The code already passes mypy strict mode.

src/exstruct/io/__init__.py (1)

281-286: Inconsistent error types for unsupported format handling.

The _ensure_format_hint calls use different error types:

  • save_print_area_views (line 284): SerializationError
  • save_auto_page_break_views (line 342): SerializationError
  • serialize_workbook (line 394): SerializationError
  • save_sheets (line 450): ValueError

Consider using SerializationError consistently for all format validation failures, as this is a serialization concern.

🔎 Suggested fix for save_sheets
     format_hint = _ensure_format_hint(
         fmt,
         allowed=_FORMAT_HINTS,
-        error_type=ValueError,
+        error_type=SerializationError,
         error_message="Unsupported sheet export format: {fmt}",
     )

Also applies to: 339-344, 391-396, 447-452

src/exstruct/core/workbook.py (1)

56-84: Document the workbook reuse behavior for xlwings.

The xlwings_workbook context manager reuses an already-open workbook (lines 67-70) without closing it on exit. This is intentional but could surprise callers who expect exclusive access. Consider adding a note to the docstring about this behavior.

🔎 Suggested docstring addition
 def xlwings_workbook(file_path: Path, *, visible: bool = False) -> Iterator[xw.Book]:
     """Open an Excel workbook via xlwings and close if created.
 
+    If the workbook is already open in an Excel instance, this context manager
+    yields the existing workbook without closing it on exit.
+
     Args:
         file_path: Workbook path.
         visible: Whether to show the Excel application window.
 
     Yields:
         xlwings workbook instance.
     """
src/exstruct/core/backends/openpyxl_backend.py (1)

163-175: Docstring accurately describes zero-based output, but consider renaming or adding conversion context.

The function correctly documents its zero-based output, which aligns with parse_range_zero_based. However, since this is consumed by PrintArea (which expects 1-based), consider either renaming to clarify intent or documenting the conversion requirement.

src/exstruct/core/backends/com_backend.py (2)

126-134: Exception handling pattern is acceptable for state restoration.

The nested try-except to restore DisplayPageBreaks state before continuing is a valid defensive pattern. The silent pass on restoration failure is acceptable since the primary operation already failed. Consider debug-level logging for troubleshooting, but this is optional.


138-150: Duplicate _parse_print_area_range function exists in openpyxl_backend.py.

This helper is identical to the one in openpyxl_backend.py (lines 163-175). Consider extracting to a shared location like ranges.py to avoid duplication and ensure consistent behavior.

🔎 Suggested approach

Move to src/exstruct/core/ranges.py:

def parse_print_area_bounds(range_str: str) -> tuple[int, int, int, int] | None:
    """Parse an Excel range string into zero-based coordinates.
    
    Args:
        range_str: Excel range string.
    
    Returns:
        Zero-based (r1, c1, r2, c2) tuple or None on failure.
    """
    bounds = parse_range_zero_based(range_str)
    if bounds is None:
        return None
    return (bounds.r1, bounds.c1, bounds.r2, bounds.c2)

Then import from both backends.

src/exstruct/core/pipeline.py (4)

82-95: Consider using immutable sequences for frozen dataclass.

PipelinePlan is frozen but contains list fields which remain internally mutable. For full immutability, consider using tuple or documenting that consumers should not mutate the lists.

🔎 Suggested immutable alternative
 @dataclass(frozen=True)
 class PipelinePlan:
-    pre_com_steps: list[ExtractionStep]
-    com_steps: list[ComExtractionStep]
+    pre_com_steps: tuple[ExtractionStep, ...]
+    com_steps: tuple[ComExtractionStep, ...]
     use_com: bool

Then update build_pre_com_pipeline and build_com_pipeline to return tuples.


186-188: Type inconsistency between validation set and ExtractionMode literal.

allowed_modes is a runtime set[str], but mode is typed as ExtractionMode (a Literal). Static type checkers already ensure valid modes at call sites. Consider using typing.get_args() for DRY validation:

🔎 DRY validation using Literal args
+from typing import get_args
+
 def resolve_extraction_inputs(
     ...
 ) -> ExtractionInputs:
-    allowed_modes: set[str] = {"light", "standard", "verbose"}
+    allowed_modes = get_args(ExtractionMode)
     if mode not in allowed_modes:
         raise ValueError(f"Unsupported mode: {mode}")

262-298: Duplication between "standard" and "verbose" mode configurations.

The step configurations for "standard" and "verbose" modes are identical. Consider extracting a shared definition to reduce duplication and maintenance burden.

🔎 Possible DRY refactor
_com_aware_pre_steps: tuple[StepConfig, ...] = (
    StepConfig(name="cells", step=step_extract_cells, enabled=lambda _: True),
    StepConfig(
        name="print_areas_openpyxl",
        step=step_extract_print_areas_openpyxl,
        enabled=lambda i: i.include_print_areas,
    ),
    StepConfig(
        name="colors_map_openpyxl_if_skip_com",
        step=step_extract_colors_map_openpyxl,
        enabled=lambda i: i.include_colors_map and bool(os.getenv("SKIP_COM_TESTS")),
    ),
)

step_table: dict[ExtractionMode, Sequence[StepConfig]] = {
    "light": (...),  # unique light steps
    "standard": _com_aware_pre_steps,
    "verbose": _com_aware_pre_steps,
}

393-433: Each pre-COM step independently opens the workbook file.

step_extract_cells, step_extract_print_areas_openpyxl, and step_extract_colors_map_openpyxl each call extraction methods that independently open and read the file (via pd.read_excel, openpyxl_workbook, etc.). For large files, consider optimizing by opening the workbook once and sharing it across steps, or batching extractions into a single step. This is a design trade-off between step independence and I/O efficiency.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f608a46 and 79d4bbb.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (43)
  • .github/workflows/pytest.yml
  • .gitignore
  • AGENTS.md
  • docs/agents/API_DESIGN.md
  • docs/agents/ARCHITECTURE.md
  • docs/agents/CODING_GUIDELINES.md
  • docs/agents/CONTRIBUTING_FOR_AI.md
  • docs/agents/EXCEL_EXTRACTION.md
  • docs/agents/FEATURE_SPEC.md
  • docs/agents/OVERVIEW.md
  • docs/agents/ROADMAP.md
  • docs/agents/TASKS.md
  • docs/agents/TEST_REQUIREMENTS.md
  • pyproject.toml
  • src/exstruct/__init__.py
  • src/exstruct/core/backends/__init__.py
  • src/exstruct/core/backends/base.py
  • src/exstruct/core/backends/com_backend.py
  • src/exstruct/core/backends/openpyxl_backend.py
  • src/exstruct/core/cells.py
  • src/exstruct/core/integrate.py
  • src/exstruct/core/logging_utils.py
  • src/exstruct/core/modeling.py
  • src/exstruct/core/pipeline.py
  • src/exstruct/core/ranges.py
  • src/exstruct/core/workbook.py
  • src/exstruct/engine.py
  • src/exstruct/errors.py
  • src/exstruct/io/__init__.py
  • src/exstruct/io/serialize.py
  • src/exstruct/render/__init__.py
  • tests/test_backends.py
  • tests/test_engine.py
  • tests/test_error_handling.py
  • tests/test_integrate_fallback.py
  • tests/test_integrate_raw_data.py
  • tests/test_io_common.py
  • tests/test_logging_utils.py
  • tests/test_mode_output.py
  • tests/test_modeling.py
  • tests/test_pipeline.py
  • tests/test_table_detection_common.py
  • tests/test_workbook_context.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Add type hints to all function and method arguments and return values (mypy strict compliance)
Use Any type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)
Organize imports in this order: (1) standard library, (2) third-party packages, (3) exstruct internal modules
Use Google-style docstrings for all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Avoid writing God Functions (large, single-responsibility-violating functions)
Avoid writing God Objects (classes with too many responsibilities)
Avoid excessive nesting and deep conditional branching
Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data
Ensure code passes mypy strict mode with zero errors
Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
Avoid circular dependencies between modules
AI should propose automatic function splitting when functions are too complex
AI should propose additional Pydantic models when data structures are insufficient
AI should automatically fix and organize imports when they are inconsistent
AI should improve docstrings when they are incomplete or insufficient

Files:

  • src/exstruct/core/ranges.py
  • tests/test_integrate_fallback.py
  • tests/test_modeling.py
  • src/exstruct/core/logging_utils.py
  • src/exstruct/io/serialize.py
  • tests/test_workbook_context.py
  • src/exstruct/core/backends/base.py
  • src/exstruct/core/workbook.py
  • src/exstruct/core/backends/openpyxl_backend.py
  • tests/test_table_detection_common.py
  • tests/test_io_common.py
  • src/exstruct/io/__init__.py
  • tests/test_backends.py
  • tests/test_logging_utils.py
  • tests/test_engine.py
  • tests/test_mode_output.py
  • src/exstruct/core/backends/com_backend.py
  • src/exstruct/core/modeling.py
  • tests/test_pipeline.py
  • tests/test_error_handling.py
  • src/exstruct/core/cells.py
  • src/exstruct/__init__.py
  • src/exstruct/render/__init__.py
  • src/exstruct/core/integrate.py
  • src/exstruct/engine.py
  • src/exstruct/core/pipeline.py
  • tests/test_integrate_raw_data.py
  • src/exstruct/errors.py
  • src/exstruct/core/backends/__init__.py
🧠 Learnings (6)
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors

Applied to files:

  • .gitignore
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries

Applied to files:

  • docs/agents/CODING_GUIDELINES.md
  • docs/agents/CONTRIBUTING_FOR_AI.md
  • src/exstruct/io/__init__.py
  • docs/agents/ARCHITECTURE.md
  • src/exstruct/core/modeling.py
  • src/exstruct/core/cells.py
  • src/exstruct/core/integrate.py
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data

Applied to files:

  • docs/agents/CODING_GUIDELINES.md
  • docs/agents/CONTRIBUTING_FOR_AI.md
  • AGENTS.md
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data

Applied to files:

  • docs/agents/CODING_GUIDELINES.md
  • docs/agents/CONTRIBUTING_FOR_AI.md
  • AGENTS.md
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : AI should propose additional Pydantic models when data structures are insufficient

Applied to files:

  • docs/agents/CODING_GUIDELINES.md
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)

Applied to files:

  • docs/agents/CODING_GUIDELINES.md
🧬 Code graph analysis (20)
tests/test_integrate_fallback.py (2)
src/exstruct/core/integrate.py (1)
  • extract_workbook (10-52)
tests/test_cells_and_tables.py (1)
  • _raise (64-65)
tests/test_modeling.py (2)
src/exstruct/core/modeling.py (3)
  • SheetRawData (9-28)
  • WorkbookRawData (32-41)
  • build_workbook_data (64-74)
src/exstruct/models/__init__.py (5)
  • CellRow (49-58)
  • Chart (76-93)
  • ChartSeries (61-73)
  • PrintArea (96-102)
  • Shape (11-46)
src/exstruct/core/logging_utils.py (1)
src/exstruct/errors.py (1)
  • FallbackReason (40-46)
src/exstruct/io/serialize.py (2)
src/exstruct/errors.py (2)
  • MissingDependencyError (24-25)
  • SerializationError (20-21)
tests/test_io_common.py (1)
  • safe_dump (12-15)
tests/test_workbook_context.py (1)
src/exstruct/core/workbook.py (1)
  • openpyxl_workbook (14-53)
src/exstruct/core/backends/openpyxl_backend.py (5)
src/exstruct/models/__init__.py (1)
  • PrintArea (96-102)
src/exstruct/core/cells.py (6)
  • WorkbookColorsMap (53-67)
  • detect_tables_openpyxl (1409-1447)
  • extract_sheet_cells (467-485)
  • extract_sheet_cells_with_links (488-526)
  • extract_sheet_colors_map (70-91)
  • detect_tables (1450-1487)
src/exstruct/core/ranges.py (1)
  • parse_range_zero_based (24-47)
src/exstruct/core/workbook.py (1)
  • openpyxl_workbook (14-53)
src/exstruct/core/backends/base.py (3)
  • extract_cells (29-30)
  • extract_print_areas (32-33)
  • extract_colors_map (35-38)
tests/test_table_detection_common.py (1)
src/exstruct/core/cells.py (2)
  • _collect_table_candidates_from_values (1291-1334)
  • _merge_rectangles (1242-1268)
tests/test_io_common.py (2)
src/exstruct/io/__init__.py (1)
  • save_sheets (435-472)
src/exstruct/models/__init__.py (2)
  • SheetData (105-191)
  • WorkbookData (194-257)
tests/test_backends.py (3)
src/exstruct/core/backends/com_backend.py (2)
  • extract_colors_map (51-74)
  • extract_print_areas (27-49)
src/exstruct/core/backends/openpyxl_backend.py (4)
  • extract_cells (32-45)
  • detect_tables (88-100)
  • extract_colors_map (64-86)
  • extract_print_areas (47-62)
src/exstruct/core/ranges.py (1)
  • parse_range_zero_based (24-47)
tests/test_logging_utils.py (2)
src/exstruct/core/logging_utils.py (1)
  • log_fallback (8-16)
src/exstruct/errors.py (1)
  • FallbackReason (40-46)
tests/test_engine.py (1)
src/exstruct/engine.py (6)
  • DestinationOptions (126-141)
  • ExStructEngine (166-562)
  • FilterOptions (99-123)
  • OutputOptions (144-163)
  • StructOptions (60-82)
  • serialize (353-376)
src/exstruct/core/backends/com_backend.py (5)
src/exstruct/models/__init__.py (1)
  • PrintArea (96-102)
src/exstruct/core/cells.py (2)
  • WorkbookColorsMap (53-67)
  • extract_sheet_colors_map_com (94-119)
src/exstruct/core/ranges.py (1)
  • parse_range_zero_based (24-47)
src/exstruct/core/backends/openpyxl_backend.py (3)
  • extract_print_areas (47-62)
  • _parse_print_area_range (163-175)
  • extract_colors_map (64-86)
src/exstruct/core/backends/base.py (2)
  • extract_print_areas (32-33)
  • extract_colors_map (35-38)
src/exstruct/core/modeling.py (1)
src/exstruct/models/__init__.py (6)
  • CellRow (49-58)
  • Chart (76-93)
  • PrintArea (96-102)
  • Shape (11-46)
  • SheetData (105-191)
  • WorkbookData (194-257)
tests/test_pipeline.py (2)
src/exstruct/core/pipeline.py (5)
  • ExtractionArtifacts (58-75)
  • ExtractionInputs (33-54)
  • build_cells_tables_workbook (654-695)
  • build_com_pipeline (306-348)
  • resolve_extraction_inputs (157-216)
src/exstruct/models/__init__.py (2)
  • CellRow (49-58)
  • PrintArea (96-102)
tests/test_error_handling.py (2)
tests/test_integrate_fallback.py (1)
  • _raise (71-72)
tests/test_cells_and_tables.py (1)
  • _raise (64-65)
src/exstruct/core/cells.py (2)
src/exstruct/models/__init__.py (1)
  • CellRow (49-58)
src/exstruct/core/workbook.py (1)
  • openpyxl_workbook (14-53)
src/exstruct/__init__.py (1)
src/exstruct/engine.py (3)
  • FormatOptions (85-96)
  • FilterOptions (99-123)
  • DestinationOptions (126-141)
src/exstruct/render/__init__.py (1)
src/exstruct/errors.py (1)
  • RenderError (28-29)
src/exstruct/core/integrate.py (2)
src/exstruct/models/__init__.py (1)
  • WorkbookData (194-257)
src/exstruct/core/pipeline.py (2)
  • resolve_extraction_inputs (157-216)
  • run_extraction_pipeline (582-651)
src/exstruct/core/backends/__init__.py (3)
src/exstruct/core/backends/base.py (1)
  • Backend (26-38)
src/exstruct/core/backends/com_backend.py (1)
  • ComBackend (18-135)
src/exstruct/core/backends/openpyxl_backend.py (1)
  • OpenpyxlBackend (23-100)
🪛 GitHub Check: Codacy Static Code Analysis
.github/workflows/pytest.yml

[warning] 43-43: .github/workflows/pytest.yml#L43
An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release.

src/exstruct/core/workbook.py

[warning] 52-52: src/exstruct/core/workbook.py#L52
Try, Except, Pass detected.


[warning] 102-102: src/exstruct/core/workbook.py#L102
Try, Except, Continue detected.

src/exstruct/core/backends/com_backend.py

[warning] 37-37: src/exstruct/core/backends/com_backend.py#L37
Try, Except, Continue detected.


[warning] 132-132: src/exstruct/core/backends/com_backend.py#L132
Try, Except, Pass detected.

src/exstruct/core/cells.py

[warning] 1394-1394: src/exstruct/core/cells.py#L1394
Try, Except, Continue detected.

🪛 markdownlint-cli2 (0.18.1)
docs/agents/CODING_GUIDELINES.md

49-49: Strong style
Expected: asterisk; Actual: underscore

(MD050, strong-style)


49-49: Strong style
Expected: asterisk; Actual: underscore

(MD050, strong-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: test (windows-latest, 3.11)
🔇 Additional comments (74)
src/exstruct/render/__init__.py (2)

52-54: LGTM: Error message formatting improved.

The string formatting change maintains clarity while consolidating the error message construction. The implicit concatenation of the plain string literal with the f-string is valid and produces the same result as the previous multi-line version.


61-61: LGTM: Concise error message formatting.

The single-line f-string format is clear and maintains the same error message content as before.

pyproject.toml (1)

35-35: No action needed. taskipy version 1.14.1 is available on PyPI (released Nov 26, 2024), has no known security vulnerabilities, and fully supports Python 3.11+.

.gitignore (1)

23-25: Coverage artifacts properly ignored.

The additions align with the codecov integration and coverage reporting infrastructure. No concerns.

docs/agents/CONTRIBUTING_FOR_AI.md (1)

29-29: Documentation update aligns with architectural boundary changes.

The clarifications about BaseModel for boundaries and dataclass for internals, plus the regression testing guidance, properly communicate the new data structure separation strategy.

Also applies to: 39-39

AGENTS.md (1)

126-126: Prompt templates updated to reflect boundary/internal data structure pattern.

The additions consistently communicate the BaseModel-for-boundaries / dataclass-for-internals expectation across both the developer guidance and Codex prompts, ensuring AI-generated code aligns with architectural decisions.

Also applies to: 178-178

docs/agents/TASKS.md (1)

1-93: Task documentation comprehensively captures completed refactoring work.

The task list properly documents the major architectural components introduced in this PR: pipeline infrastructure, backend abstraction, utility consolidation, and data structure separation. The implementation notes (especially around lines 70–90) provide valuable context for understanding the design decisions and rationale. Well-organized and maintainable.

docs/agents/CODING_GUIDELINES.md (1)

46-50: Fix markdown formatting: wrap __init__ in backticks.

The content updates correctly document the boundary/internal separation pattern. However, line 49 contains __init__ without backticks, which markdownlint flags as unintended strong emphasis. Wrap it as `__init__` to prevent markdown interpretation.

🔎 Proposed fix for markdown formatting
  Raw*Data (e.g., SheetRawData/WorkbookRawData) は内部専用で公開APIに出さない。
- Do not expose them via public APIs or re-export from package __init__.
+ Do not expose them via public APIs or re-export from package `__init__`.
docs/agents/OVERVIEW.md (1)

1-43: Overview rewrite improves clarity and user orientation.

The updated overview is more concise and user-focused, emphasizing extraction targets and practical usage patterns. The shift from feature-centric to target-and-usage-oriented documentation is well-suited to the new pipeline-centric architecture and helps both users and contributors understand the library's scope quickly.

src/exstruct/core/backends/__init__.py (1)

1-7: Clean package structure with proper export definitions.

The backends module initializer correctly exports the public API (Backend protocol and implementations). The import organization and __all__ definition follow best practices. No concerns.

docs/agents/ROADMAP.md (1)

30-36: Roadmap entries properly document refactoring milestone and future shape decomposition.

The new v0.3.0 and v0.3.1 entries accurately capture the scope of this PR and position the upcoming SmartArt work. Version sequence is logical and well-placed.

tests/test_mode_output.py (1)

61-61: LGTM! Monkeypatch target correctly updated for pipeline architecture.

The update from exstruct.core.integrate.xw.Book to exstruct.core.pipeline.xlwings_workbook aligns with the pipeline-centric refactoring. The test continues to verify that COM is not accessed in light mode.

tests/test_logging_utils.py (1)

9-14: LGTM! Test correctly verifies standardized fallback logging format.

The test properly validates that log_fallback emits a warning with the expected format [light_mode] fallback. Import ordering follows the coding guidelines (stdlib → third-party → internal), and type hints are present.

tests/test_io_common.py (1)

9-23: LGTM! Test properly verifies YAML serialization path.

The test correctly monkeypatches _require_yaml to validate YAML format handling. Import ordering follows guidelines, type hints are present, and the test logic is sound.

.github/workflows/pytest.yml (2)

34-34: LGTM! Coverage collection added for non-COM tests.

The addition of --cov=exstruct --cov-report=xml enables coverage tracking for the test suite.


40-40: LGTM! Coverage collection added for full test suite.

The addition of --cov=exstruct --cov-report=xml enables coverage tracking for the Windows test suite.

src/exstruct/core/logging_utils.py (1)

8-16: LGTM! Clean implementation of standardized fallback logging.

The function provides a clear, single-responsibility utility for logging fallback events. Import ordering follows guidelines, type hints are complete, and the Google-style docstring is well-structured.

docs/agents/FEATURE_SPEC.md (1)

7-12: Documentation updated to focus on SmartArt parsing.

The updated feature specification clearly outlines the plan to introduce separate classes for Shape, Arrow, and SmartArt with recursive node structure. This aligns with the broader architectural refactoring in the PR.

src/exstruct/errors.py (1)

40-46: LGTM! Well-designed enum for fallback reason tracking.

The FallbackReason enum subclasses both str and Enum, which enables direct string serialization while maintaining type safety. The four reason codes cover the key fallback scenarios introduced in the pipeline architecture.
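To make the shape concrete, here is a minimal sketch of how such a reason enum and helper could fit together. The member name and message layout are assumptions; only the str-Enum pattern and the "[light_mode] fallback" prefix asserted in test_logging_utils.py come from the PR.

import logging
from enum import Enum

logger = logging.getLogger("exstruct")


class FallbackReason(str, Enum):
    # Illustrative member only; the real enum defines four reason codes.
    LIGHT_MODE = "light_mode"


def log_fallback(reason: FallbackReason, detail: str = "") -> None:
    # Emits a standardized warning such as "[light_mode] fallback: <detail>".
    logger.warning("[%s] fallback: %s", reason.value, detail)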

tests/test_error_handling.py (3)

3-4: LGTM! Required imports added for contextmanager pattern.

The imports Iterator and contextmanager support the cleaner test scaffolding introduced later in the file.


36-36: LGTM! Monkeypatch target updated for pipeline architecture.

The change from exstruct.core.integrate.xw.Book to exstruct.core.pipeline.xlwings_workbook correctly reflects the new pipeline-centric extraction flow.


54-69: LGTM! Cleaner test scaffolding with contextmanager pattern.

The refactored approach using @contextmanager and _dummy_workbook is more idiomatic and aligns with the pipeline's workbook context management. The monkeypatch target updates are correct.

tests/test_table_detection_common.py (2)

15-23: LGTM!

The test correctly validates that _collect_table_candidates_from_values builds a range string from a 2x2 value matrix with the expected base coordinates.


9-12: Test is correct - no changes needed.

The _merge_rectangles function explicitly preserves contained rectangles by returning False from _rectangles_overlap_for_merge when one rectangle fully contains another. For the test input [(1, 1, 3, 3), (2, 2, 2, 2)], the second rectangle is fully contained in the first, so the containment check triggers and prevents merging. The test expectation len(merged) == 2 accurately reflects this intentional behavior of preserving both rectangles.

tests/test_backends.py (4)

11-37: LGTM!

The test effectively verifies that OpenpyxlBackend.extract_cells delegates to the correct internal function based on the include_links flag, and the assertion confirms both paths are exercised.


40-52: LGTM!

The test correctly validates that OpenpyxlBackend.detect_tables gracefully handles failures by returning an empty list when the underlying detection function raises an exception.


76-93: LGTM!

The integration test creates a real workbook with a print area and verifies that OpenpyxlBackend.extract_print_areas correctly extracts and converts the coordinates to zero-based indexing.


95-101: LGTM!

The test validates that parse_range_zero_based correctly handles sheet-prefixed range strings and produces zero-based coordinate bounds.

tests/test_workbook_context.py (1)

8-25: LGTM!

The test effectively validates that the openpyxl_workbook context manager calls close() exactly once upon exit, ensuring proper resource cleanup.

tests/test_pipeline.py (2)

58-75: LGTM!

These tests effectively validate:

  • COM pipeline construction respects individual flags (lines 58-75)
  • Light mode produces an empty COM pipeline (lines 78-90)
  • Input resolution applies correct defaults based on mode (lines 93-107)

The test assertions are precise and verify the expected pipeline behavior.

Also applies to: 78-90, 93-107


110-142: LGTM!

The integration test correctly verifies that build_cells_tables_workbook integrates print areas from artifacts and invokes table detection via the backend, producing the expected WorkbookData structure with print areas and table candidates.

docs/agents/API_DESIGN.md (1)

50-56: LGTM!

The documentation clearly demonstrates the updated API structure with nested OutputOptions containing format and filters groups, aligning with the PR's refactoring of the public API surface.
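For illustration, a construction along these lines shows the nesting; sheets_dir and the string-path form are taken from the engine tests, while the remaining field names and default constructors are assumptions.

from exstruct.engine import DestinationOptions, FilterOptions, OutputOptions

# Hypothetical configuration: only the nested-group pattern itself is the point here.
options = OutputOptions(
    filters=FilterOptions(),  # include/exclude flags for shapes, tables, ...
    destinations=DestinationOptions(sheets_dir="out/sheets"),
)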

src/exstruct/core/ranges.py (1)

8-21: LGTM!

The RangeBounds NamedTuple is well-documented with clear field descriptions and uses proper type hints. Using a NamedTuple for immutable data is appropriate for this use case.
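A short usage sketch; the concrete numbers assume the zero-based convention described in the tests, and the field names r1/c1/r2/c2 are taken from the suggested shared helper above.

from exstruct.core.ranges import parse_range_zero_based

# Sheet-prefixed A1-style range; the parser returns zero-based bounds, or None on failure.
bounds = parse_range_zero_based("Sheet1!$A$1:$C$3")
if bounds is not None:
    # Assuming fields r1/c1/r2/c2, this prints 0 0 2 2.
    print(bounds.r1, bounds.c1, bounds.r2, bounds.c2)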

src/exstruct/core/backends/base.py (1)

13-23: LGTM!

The BackendConfig dataclass is properly frozen, has complete type hints, and includes a clear docstring. Using frozen=True ensures immutability, which is appropriate for configuration objects.

tests/test_integrate_fallback.py (3)

27-30: LGTM! Helper and patch target updated consistently.

The boom helper signature now accepts arbitrary *args and **kwargs, aligning with the pattern used elsewhere (e.g., _raise in tests/test_cells_and_tables.py). The patch target correctly points to exstruct.core.pipeline.xlwings_workbook, consistent with the pipeline-based architecture refactoring.


38-56: LGTM! New fallback test for print_areas coverage.

This test validates the critical requirement [INT-02] (preserve print_areas even on COM fallback) from TEST_REQUIREMENTS.md. The test correctly:

  • Creates a workbook with print_area set via openpyxl
  • Patches the COM pathway to raise an exception
  • Verifies that print_areas are still extracted via the fallback path

59-79: LGTM! Test updated to use the new patch target.

The test_extract_workbook_with_links correctly updates its patch target to exstruct.core.pipeline.xlwings_workbook, ensuring hyperlink extraction continues to work when COM is unavailable.

docs/agents/TEST_REQUIREMENTS.md (1)

1-227: LGTM! Documentation is well-structured and comprehensive.

The restructured TEST_REQUIREMENTS.md provides clear coverage of the new pipeline-based architecture with well-organized sections for Pipeline, Backend, Ranges, Table Detection, Workbook context management, and Logging. The bullet-style format improves readability.

tests/test_integrate_raw_data.py (3)

12-29: LGTM! Well-structured helper with proper type hints and docstring.

The _make_chart helper provides a minimal Chart instance for tests with all required fields. The Google-style docstring follows the coding guidelines.


32-73: LGTM! Comprehensive test for raw data collection.

This test validates requirement [MOD-02] (collect_sheet_raw_data aggregates the extracted data into the raw container) by verifying that all extracted components (rows, shapes, charts, table_candidates, print_areas, auto_print_areas, colors_map) are correctly collected into the raw container.


76-100: LGTM! Light mode test correctly validates chart exclusion.

This test ensures charts are excluded in light mode (mode="light"), aligning with requirement [MODE-02] (light: cells and tables only, empty shapes/charts, no COM).

src/exstruct/io/__init__.py (2)

74-82: LGTM! Clean wrapper for range parsing.

The _parse_range_zero_based function correctly delegates to the core parse_range_zero_based utility and converts the result to a plain tuple for backward compatibility with existing callers expecting (r1, c1, r2, c2).


486-487: Verify: Exporting underscore-prefixed functions in __all__.

Exporting _require_yaml and _require_toon in __all__ is unconventional since underscore-prefixed names typically denote private/internal APIs. This appears intentional to allow test patching (as seen in tests/test_io_common.py). If this is the intended pattern, consider documenting this decision or renaming to non-underscore names if they're truly public.

docs/agents/ARCHITECTURE.md (1)

1-81: LGTM! Architecture documentation clearly describes the pipeline-centric design.

The updated ARCHITECTURE.md provides a clear overview of the refactored system:

  • Directory structure reflects new modules (pipeline.py, modeling.py, workbook.py, backends/)
  • Pipeline design section explains the separation of concerns (immutable PipelinePlan, execution state in PipelineState/PipelineResult)
  • Module responsibilities are well-defined
  • AI agent guidelines are updated with relevant context
src/exstruct/core/workbook.py (2)

13-53: LGTM! Robust context manager with appropriate warning suppression.

The openpyxl_workbook context manager correctly:

  • Suppresses known non-critical openpyxl warnings (extensions, conditional formatting, header/footer parsing)
  • Uses Any return type appropriately at the external library boundary (per coding guidelines)
  • Ensures cleanup in the finally block, swallowing exceptions to avoid masking the original error

The static analysis warning about try/except/pass at line 52 is a deliberate pattern here—cleanup failures should not mask the original exception.
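Typical call pattern, mirroring the keyword-only usage seen in cells.py and the context-manager tests (the path here is a placeholder):

from pathlib import Path

from exstruct.core.workbook import openpyxl_workbook

# The context manager installs the warning filters, yields the loaded workbook,
# and closes it in its finally block even if the body raises.
with openpyxl_workbook(Path("book.xlsx"), data_only=True, read_only=True) as wb:
    for ws in wb.worksheets:
        print(ws.title)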


87-106: LGTM! Defensive helper for finding open workbooks.

The _find_open_workbook helper has comprehensive exception handling:

  • Inner try/except (line 102) handles individual workbook access failures
  • Outer try/except (line 104) handles app iteration failures
  • Path comparison uses .resolve() for reliable matching

The static analysis warning about try/except/continue at line 102 is appropriate here—COM objects can raise unpredictably, and we want to continue checking other workbooks.

tests/test_engine.py (4)

6-12: LGTM! Import updated to include new option types.

The import statement correctly includes DestinationOptions and FilterOptions alongside ExStructEngine, OutputOptions, and StructOptions, reflecting the new nested option group API.


70-86: LGTM! Filter tests updated to use nested FilterOptions.

Tests test_engine_serialize_filters_shapes and test_engine_serialize_filters_tables correctly use the new OutputOptions(filters=FilterOptions(...)) pattern to configure include/exclude flags.


124-136: LGTM! Destination tests updated to use nested DestinationOptions.

Tests for sheets_dir and print_areas_dir correctly use OutputOptions(destinations=DestinationOptions(...)) to configure output destinations, validating the new API surface.

Also applies to: 138-152


154-168: LGTM! Combined filter and destination tests exercise the full API.

These tests correctly combine FilterOptions and DestinationOptions within OutputOptions, covering:

  • Print areas inclusion flag with destination
  • Light mode behavior with destinations
  • String path handling for destinations

Also applies to: 170-186, 188-207

src/exstruct/io/serialize.py (3)

10-10: LGTM! Format hints centralized in a constant.

The _FORMAT_HINTS set provides a single source of truth for allowed formats, used consistently across the module and exported for use in io/__init__.py.


52-90: LGTM! Clean serialization dispatcher with defensive fallback.

The _serialize_payload_from_hint function:

  • Uses a clear match statement for format dispatch
  • Handles JSON indentation logic correctly (line 72: 2 if pretty and indent is None else indent)
  • Lazily loads optional dependencies (yaml, toon)
  • Includes a defensive default case (lines 87-90) that's technically unreachable when callers use _ensure_format_hint, but provides safety if called directly

93-112: LGTM! Dependency checkers provide clear installation guidance.

Both _require_yaml and _require_toon correctly:

  • Use importlib.import_module for lazy loading
  • Raise MissingDependencyError with actionable install instructions
  • Preserve the original ImportError in the exception chain (using from e)
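The underlying pattern, sketched with an illustrative helper name and error message:

import importlib
from types import ModuleType

from exstruct.errors import MissingDependencyError


def require_yaml_sketch() -> ModuleType:
    # Lazily import the optional dependency and convert a missing package into a
    # MissingDependencyError that tells the user how to install it.
    try:
        return importlib.import_module("yaml")
    except ImportError as e:
        raise MissingDependencyError(
            "PyYAML is required for YAML output; install it with 'pip install pyyaml'."
        ) from e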
src/exstruct/core/modeling.py (2)

1-29: Clean data container design with proper immutability.

The frozen dataclasses SheetRawData and WorkbookRawData provide a clean separation between raw extraction output and typed Pydantic models. This aligns well with the pipeline architecture and enables immutable data flow.


44-74: LGTM!

Builder functions are well-typed with clear docstrings. The transformation from raw dataclasses to Pydantic models is straightforward and maintainable.

src/exstruct/core/backends/openpyxl_backend.py (2)

22-45: LGTM!

The OpenpyxlBackend class is well-structured with proper type hints and docstrings. The conditional delegation in extract_cells is clean and readable.


47-62: LGTM!

The fallback from defined names to sheet properties is a good defensive pattern. The broad exception handling is acceptable here since the caller expects graceful degradation.

src/exstruct/core/integrate.py (1)

10-52: Clean refactoring to pipeline-based extraction.

The extract_workbook function now cleanly delegates to the pipeline infrastructure while preserving the public API. This is a significant simplification that centralizes the extraction logic.

docs/agents/EXCEL_EXTRACTION.md (1)

1-61: Documentation accurately reflects the new pipeline architecture.

The updated specification clearly describes the extraction flow, mode behaviors, and fallback mechanisms. The priority of COM results over OpenPyXL for colors_map is well-documented.

src/exstruct/__init__.py (2)

14-14: LGTM!

Adding FormatOptions to the public API export is consistent with the new nested OutputOptions structure. This enables users to construct format options separately.

Also applies to: 80-80


367-398: LGTM!

The process_excel function correctly constructs OutputOptions with the new nested structure. The parameter passthrough to engine.process maintains backward compatibility.

src/exstruct/core/backends/com_backend.py (1)

173-206: LGTM!

The _split_csv_respecting_quotes function correctly handles Excel's CSV-like range syntax with quoted sheet names containing commas. The state machine approach is appropriate for this parsing task.

src/exstruct/engine.py (2)

144-163: LGTM!

The OutputOptions class now enforces strict field validation with extra="forbid", which will catch configuration errors early. The nested structure with FormatOptions, FilterOptions, and DestinationOptions is cleaner and more maintainable.


341-351: LGTM!

The extract method correctly forwards the new color options from self.options.colors to extract_workbook. The use of ignore_colors_set() ensures proper type conversion from list to set.

src/exstruct/core/cells.py (5)

4-4: LGTM!

Good addition of Callable and Sequence from collections.abc and the openpyxl_workbook context manager import for safer resource handling.

Also applies to: 19-19


1161-1178: LGTM!

The _extract_listobject_tables helper cleanly extracts table ranges from Excel ListObjects via COM, with appropriate exception handling.


1227-1239: Good abstraction layer.

The _detect_border_rectangles function provides a clean abstraction over detect_border_clusters, handling the numpy array conversion internally.


1291-1348: Well-structured table candidate collection.

The _collect_table_candidates_from_values function centralizes the table detection heuristics with clear parameter documentation. The use of a Callable for column naming enables reuse across openpyxl and xlwings backends.


1409-1447: LGTM!

The detect_tables_openpyxl function cleanly combines openpyxl table references with border-based detection, using proper deduplication. This aligns well with the backend abstraction.

src/exstruct/core/pipeline.py (5)

1-29: LGTM!

Imports are properly organized following the coding guidelines (stdlib → third-party → internal), and the type aliases are clear and appropriately scoped.


32-55: LGTM!

The immutable ExtractionInputs dataclass is well-structured with complete type hints and a proper Google-style docstring. Using frozen=True appropriately enforces immutability for pipeline inputs.


57-76: LGTM!

Proper use of field(default_factory=dict) avoids the mutable default argument pitfall. The mutable design is intentional for artifact accumulation during pipeline execution.
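
A two-assert illustration of why default_factory matters for a mutable artifacts container (generic, not the PipelineArtifacts definition):

from dataclasses import dataclass, field


@dataclass
class ArtifactsSketch:
    # field(default_factory=dict) gives each instance its own dict; a shared
    # class-level default would leak state between pipeline runs.
    values: dict[str, object] = field(default_factory=dict)


first = ArtifactsSketch()
second = ArtifactsSketch()
first.values["cells"] = {"Sheet1": []}
assert first.values != second.values
assert second.values == {}  # no cross-instance leakage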


351-390: LGTM!

The pipeline runner functions are simple and correct. Steps mutate artifacts in-place, and any step failure propagates up appropriately. The pattern is clean and easy to follow.


564-579: [No action required]

The detect_tables function has the signature def detect_tables(sheet: xw.Sheet) -> list[str]: and is explicitly designed to accept xlwings Sheet objects. The usage at line 571 in pipeline.py is correct—passing an xlwings sheet from workbook.sheets[sheet_name] is exactly what the function expects. There is no compatibility issue.

Likely an incorrect or invalid review comment.

- Resolve duplicate IDs in TEST_REQUIREMENTS.md ([PIPE-MOD-01/02])
- Document and unify the convention that rows are 1-based and columns are 0-based
- Generalize the reason/wording for COM pipeline failures and add FallbackReason (errors.py, pipeline.py)
- Use the reason from build_cells_tables_workbook in log output
- Remove the unnecessary delenv from test_pipeline.py
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/exstruct/io/__init__.py (1)

74-82: Return RangeBounds directly instead of converting to tuple.

This wrapper function unpacks the Pydantic RangeBounds model into a tuple, which violates the coding guideline: "Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data." Callers should access bounds via the model's fields (bounds.r1, bounds.c1, etc.) to preserve type safety and semantic clarity.

As per coding guidelines, structured data should be returned as Pydantic models, not tuples.

🔎 Proposed refactor to eliminate tuple conversion

Option 1: Remove the wrapper entirely and use parse_range_zero_based directly

Update callers (e.g., line 129) to use the model fields:

 def _filter_table_candidates_to_area(
     table_candidates: list[str], area: PrintArea
 ) -> list[str]:
     filtered: list[str] = []
     for candidate in table_candidates:
-        bounds = _parse_range_zero_based(candidate)
-        if not bounds:
+        bounds = parse_range_zero_based(candidate)
+        if bounds is None:
             continue
-        r1, c1, r2, c2 = bounds
+        r1, c1, r2, c2 = bounds.r1, bounds.c1, bounds.r2, bounds.c2
         r1 += 1
         r2 += 1
         if r1 >= area.r1 and r2 <= area.r2 and c1 >= area.c1 and c2 <= area.c2:
             filtered.append(candidate)
     return filtered

Option 2: If the wrapper is needed, change return type to RangeBounds | None

-def _parse_range_zero_based(range_str: str) -> tuple[int, int, int, int] | None:
-    """
-    Parse an Excel range string into zero-based (r1, c1, r2, c2) bounds.
-    Returns None on failure.
-    """
+def _parse_range_zero_based(range_str: str) -> RangeBounds | None:
+    """Parse an Excel range string into zero-based bounds.
+    
+    Args:
+        range_str: Excel range string (e.g., "Sheet1!A1:B2").
+        
+    Returns:
+        RangeBounds in zero-based coordinates, or None on failure.
+    """
     bounds = parse_range_zero_based(range_str)
-    if bounds is None:
-        return None
-    return (bounds.r1, bounds.c1, bounds.r2, bounds.c2)
+    return bounds
♻️ Duplicate comments (3)
src/exstruct/core/pipeline.py (2)

644-645: Previous review concern addressed.

The error message has been updated to use a generic "COM pipeline failed" message with FallbackReason.COM_PIPELINE_FAILED, addressing the earlier review comment about misleading error messages.


670-670: Previous review concern addressed.

The reason parameter is now properly used for logging, addressing the earlier review comment about the unused parameter.
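
A hedged sketch of the reason-code plus structured-warning pattern; the actual FallbackReason members and message wording in errors.py / logging_utils.py may differ:

import logging
from enum import Enum

logger = logging.getLogger("exstruct.sketch")


class FallbackReasonSketch(str, Enum):
    COM_UNAVAILABLE = "com_unavailable"
    COM_PIPELINE_FAILED = "com_pipeline_failed"


def log_fallback_sketch(reason: FallbackReasonSketch, detail: str) -> None:
    """Emit a uniformly formatted warning when falling back to openpyxl."""
    logger.warning("Falling back to openpyxl (%s): %s", reason.value, detail)


log_fallback_sketch(FallbackReasonSketch.COM_PIPELINE_FAILED, "COM pipeline failed")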

docs/agents/TEST_REQUIREMENTS.md (1)

203-204: Previous review concern addressed.

The duplicate requirement IDs have been resolved by renaming them to [PIPE-MOD-01] and [PIPE-MOD-02], making each requirement ID unique throughout the document.

🧹 Nitpick comments (7)
src/exstruct/io/__init__.py (1)

488-489: Consider renaming publicly exported functions to remove underscore prefix.

The functions _require_yaml and _require_toon are now part of the public API (exported in __all__), but their names start with an underscore, which by Python convention indicates private/internal functions. For a cleaner public API, consider renaming them to require_yaml and require_toon (without the underscore).

🔎 Suggested changes for consistent public API naming

In src/exstruct/io/serialize.py, rename the functions:

-def _require_yaml() -> ModuleType:
+def require_yaml() -> ModuleType:
     """Ensure pyyaml is installed; otherwise raise with guidance."""
     ...

-def _require_toon() -> ModuleType:
+def require_toon() -> ModuleType:
     """Ensure python-toon is installed; otherwise raise with guidance."""
     ...

In this file, update the import and export:

 from .serialize import (
     _FORMAT_HINTS,
     _ensure_format_hint,
-    _require_toon,
-    _require_yaml,
+    require_toon,
+    require_yaml,
     _serialize_payload_from_hint,
 )

 __all__ = [
     "dict_without_empty_values",
     "save_as_json",
     "save_as_yaml",
     "save_as_toon",
     "save_sheets",
     "save_sheets_as_json",
     "build_print_area_views",
     "save_print_area_views",
     "save_auto_page_break_views",
     "serialize_workbook",
-    "_require_yaml",
-    "_require_toon",
+    "require_yaml",
+    "require_toon",
 ]
tests/test_backends.py (3)

16-22: Add type hints to mock functions.

The fake functions should have explicit type hints for maintainability and mypy compliance.

🔎 Proposed refactor
-    def fake_cells(_: Path) -> dict[str, list[object]]:
+    def fake_cells(file_path: Path) -> dict[str, list[object]]:
         calls.append("cells")
         return {}
 
-    def fake_cells_links(_: Path) -> dict[str, list[object]]:
+    def fake_cells_links(file_path: Path) -> dict[str, list[object]]:
         calls.append("links")
         return {}

As per coding guidelines, avoid using _ for actual parameters; use descriptive names with proper type hints.


43-44: Use explicit parameter names with type hints.

Replace generic _ and __ with descriptive parameter names for better readability.

🔎 Proposed refactor
-    def fake_detect(_: Path, __: str) -> list[str]:
+    def fake_detect(file_path: Path, sheet_name: str) -> list[str]:
         raise RuntimeError("boom")

As per coding guidelines, use descriptive parameter names.


58-59: Use explicit type hints instead of generic object.

The mock function should use proper type signatures for clarity.

🔎 Proposed refactor
-    def fake_colors_map(*_: object, **__: object) -> object:
+    def fake_colors_map(
+        workbook: object,
+        *,
+        include_default_background: bool,
+        ignore_colors: set[str] | None
+    ) -> object:
         raise RuntimeError("boom")

As per coding guidelines, provide explicit type hints for all parameters.

src/exstruct/core/backends/openpyxl_backend.py (1)

103-125: Consider adding a module-level docstring.

The file contains well-structured code with proper docstrings for classes and methods, but lacks a module-level docstring explaining the openpyxl backend's role in the extraction pipeline.

🔎 Suggested addition

Add at the top of the file after imports:

"""Openpyxl-based backend for Excel workbook extraction.

This module provides the OpenpyxlBackend class which uses the openpyxl library
to extract cells, print areas, color maps, and table candidates from Excel files.
It serves as the primary extraction backend with fallback support when COM is unavailable.
"""

As per coding guidelines, adding comprehensive module documentation improves maintainability.

src/exstruct/core/backends/com_backend.py (2)

173-206: Consider extracting CSV parsing to a utility function for reusability.

The _split_csv_respecting_quotes function implements custom CSV parsing with quote handling. While the implementation is correct, this type of utility might be useful elsewhere in the codebase.

Consider moving this to a shared utilities module if similar parsing is needed elsewhere, or documenting that Python's csv module with appropriate dialect settings could be an alternative.


1-206: Add module-level docstring for better documentation.

The file contains well-structured code with proper class and method docstrings, but lacks a module-level docstring explaining the COM backend's role.

🔎 Suggested addition

Add at the top after imports:

"""COM-based backend for Excel workbook extraction via xlwings.

This module provides the ComBackend class which uses xlwings/COM to extract
print areas, color maps, and auto page breaks from Excel files. It complements
the OpenpyxlBackend and is used when COM automation is available and required
features (like auto page breaks) need native Excel API access.
"""

As per coding guidelines, comprehensive module documentation improves maintainability.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79d4bbb and 93b10e2.

📒 Files selected for processing (20)
  • .github/workflows/pytest.yml
  • docs/agents/CODE_REVIEW.md
  • docs/agents/DATA_MODEL.md
  • docs/agents/EXCEL_EXTRACTION.md
  • docs/agents/TEST_REQUIREMENTS.md
  • src/exstruct/core/backends/com_backend.py
  • src/exstruct/core/backends/openpyxl_backend.py
  • src/exstruct/core/pipeline.py
  • src/exstruct/errors.py
  • src/exstruct/io/__init__.py
  • src/exstruct/models/__init__.py
  • tests/test_auto_page_breaks.py
  • tests/test_backends.py
  • tests/test_engine.py
  • tests/test_integrate_raw_data.py
  • tests/test_modeling.py
  • tests/test_models_validation.py
  • tests/test_pipeline.py
  • tests/test_print_area_views.py
  • tests/test_print_areas_openpyxl.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/test_pipeline.py
  • tests/test_modeling.py
  • tests/test_integrate_raw_data.py
  • src/exstruct/errors.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Add type hints to all function and method arguments and return values (mypy strict compliance)
Use Any type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)
Organize imports in this order: (1) standard library, (2) third-party packages, (3) exstruct internal modules
Use Google-style docstrings for all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Avoid writing God Functions (large, single-responsibility-violating functions)
Avoid writing God Objects (classes with too many responsibilities)
Avoid excessive nesting and deep conditional branching
Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data
Ensure code passes mypy strict mode with zero errors
Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
Avoid circular dependencies between modules
AI should propose automatic function splitting when functions are too complex
AI should propose additional Pydantic models when data structures are insufficient
AI should automatically fix and organize imports when they are inconsistent
AI should improve docstrings when they are incomplete or insufficient

Files:

  • tests/test_print_areas_openpyxl.py
  • tests/test_backends.py
  • src/exstruct/core/backends/openpyxl_backend.py
  • src/exstruct/io/__init__.py
  • src/exstruct/core/backends/com_backend.py
  • tests/test_print_area_views.py
  • tests/test_engine.py
  • tests/test_auto_page_breaks.py
  • src/exstruct/core/pipeline.py
  • tests/test_models_validation.py
  • src/exstruct/models/__init__.py
🧠 Learnings (1)
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries

Applied to files:

  • src/exstruct/io/__init__.py
🧬 Code graph analysis (6)
tests/test_backends.py (3)
src/exstruct/core/backends/com_backend.py (3)
  • ComBackend (18-135)
  • extract_colors_map (51-74)
  • extract_print_areas (27-49)
src/exstruct/core/backends/openpyxl_backend.py (5)
  • OpenpyxlBackend (23-100)
  • extract_cells (32-45)
  • detect_tables (88-100)
  • extract_colors_map (64-86)
  • extract_print_areas (47-62)
src/exstruct/core/ranges.py (1)
  • parse_range_zero_based (24-47)
src/exstruct/core/backends/openpyxl_backend.py (3)
src/exstruct/models/__init__.py (1)
  • PrintArea (97-103)
src/exstruct/core/ranges.py (1)
  • parse_range_zero_based (24-47)
src/exstruct/core/workbook.py (1)
  • openpyxl_workbook (14-53)
src/exstruct/io/__init__.py (3)
src/exstruct/core/ranges.py (1)
  • parse_range_zero_based (24-47)
src/exstruct/errors.py (2)
  • OutputError (32-33)
  • SerializationError (20-21)
src/exstruct/io/serialize.py (4)
  • _ensure_format_hint (28-49)
  • _require_toon (104-112)
  • _require_yaml (93-101)
  • _serialize_payload_from_hint (52-90)
tests/test_print_area_views.py (1)
src/exstruct/models/__init__.py (2)
  • CellRow (50-59)
  • PrintArea (97-103)
tests/test_auto_page_breaks.py (1)
src/exstruct/models/__init__.py (1)
  • PrintArea (97-103)
tests/test_models_validation.py (1)
src/exstruct/models/__init__.py (1)
  • CellRow (50-59)
🪛 GitHub Check: Codacy Static Code Analysis
src/exstruct/core/backends/com_backend.py

[warning] 37-37: src/exstruct/core/backends/com_backend.py#L37
Try, Except, Continue detected.


[warning] 132-132: src/exstruct/core/backends/com_backend.py#L132
Try, Except, Pass detected.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: test (windows-latest, 3.11)
🔇 Additional comments (24)
docs/agents/DATA_MODEL.md (2)

96-96: Clear coordinate semantics for PrintArea rows—good clarification.

Adding "(1-based, inclusive)" to r1 and r2 aligns with the existing c1 and c2 documentation and makes the row coordinate interpretation explicit. This improves clarity for API consumers.

Also applies to: 98-98


140-140: Documentation for colors_map coordinate system is correct and matches implementation.

The codebase confirms that colors_map uses (row=1-based, col=0-based) coordinates as documented. Both extraction paths—openpyxl (cell.row, cell.col_idx - 1) and COM (row, col - 1)—apply the same logic: rows are preserved as 1-based, columns are converted from 1-based to 0-based. Test coverage validates this with {"#FFFFFF": [(1, 0)]}, confirming row 1 with column 0.
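
A tiny runnable sketch of that convention via openpyxl; the fixed "#FFFFFF" key is a placeholder because fill-color extraction itself is out of scope here:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws["A1"] = "x"

colors_map: dict[str, list[tuple[int, int]]] = {}
for row in ws.iter_rows():
    for cell in row:
        if cell.value is None:
            continue
        # Rows stay 1-based (cell.row); columns convert to 0-based (cell.column - 1).
        colors_map.setdefault("#FFFFFF", []).append((cell.row, cell.column - 1))

assert colors_map == {"#FFFFFF": [(1, 0)]}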

src/exstruct/io/__init__.py (3)

8-18: Well done removing the OpenPyXL dependency!

The refactor successfully eliminates direct dependency on openpyxl.utils.cell.range_boundaries by using the internal parse_range_zero_based function, and centralizes serialization logic through the new helpers from .serialize. This aligns with the coding guideline to avoid depending on internal structures of external libraries.

Based on learnings, this change properly normalizes external data to Pydantic models at boundaries.


283-288: Excellent serialization refactoring!

The centralization of format validation and serialization through _ensure_format_hint and _serialize_payload_from_hint eliminates code duplication and provides a single source of truth for format handling. This improves maintainability and consistency across all export functions.

Also applies to: 315-318, 341-346, 374-377, 393-402, 429-432, 449-454, 469-471


133-134: Row index adjustments correctly implement 1-based row coordinates.

The adjustments on lines 133-134 (adding 1 to r1 and r2) and lines 148, 150 (subtracting 1 from area.r1 in pixel calculations) correctly convert between the 0-based coordinates returned by parse_range_zero_based and the 1-based row coordinates used in PrintArea, as documented in the PR objectives.

Also applies to: 148-148, 150-150
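
A small sketch of that conversion, using a stand-in model with the documented semantics (rows 1-based, columns 0-based); the real PrintArea lives in exstruct.models:

from pydantic import BaseModel


class PrintAreaSketch(BaseModel):
    r1: int  # 1-based, inclusive
    c1: int  # 0-based, inclusive
    r2: int  # 1-based, inclusive
    c2: int  # 0-based, inclusive


def to_print_area_sketch(r1: int, c1: int, r2: int, c2: int) -> PrintAreaSketch:
    """Convert zero-based bounds (as returned by parse_range_zero_based) to PrintArea semantics."""
    return PrintAreaSketch(r1=r1 + 1, c1=c1, r2=r2 + 1, c2=c2)


# "Sheet1!A1:B2" parses to zero-based bounds (0, 0, 1, 1), i.e. rows 1..2 and columns 0..1.
assert to_print_area_sketch(0, 0, 1, 1) == PrintAreaSketch(r1=1, c1=0, r2=2, c2=1)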

tests/test_models_validation.py (1)

19-19: LGTM: Coordinate system alignment.

The update from r=0 to r=1 correctly reflects the 1-based row indexing documented in the CellRow model definition.

tests/test_auto_page_breaks.py (1)

47-47: LGTM: Coordinate system alignment.

The updated PrintArea coordinates correctly reflect the 1-based row indexing (r1=1, r2=2) and 0-based column indexing (c1=0, c2=1) documented in the PrintArea model.

.github/workflows/pytest.yml (1)

34-46: LGTM: Coverage collection and Codecov integration properly configured.

The pytest coverage flags and Codecov upload step are correctly implemented. The Codecov action is now properly pinned to a full commit SHA, addressing the previous security concern.

src/exstruct/models/__init__.py (3)

14-16: LGTM: Field definition reflow.

The Shape.id field definition has been reflowed for readability while maintaining the same semantic meaning.


101-103: LGTM: Coordinate system documentation clarified.

The field descriptions now explicitly document that columns use 0-based indexing, improving clarity for API consumers.


129-132: LGTM: colors_map coordinate semantics documented.

The description now explicitly clarifies that rows are 1-based and columns are 0-based, which is essential for correct usage of the colors_map field.

docs/agents/EXCEL_EXTRACTION.md (1)

1-66: LGTM: Comprehensive pipeline documentation.

This documentation provides clear guidance on the extraction pipeline architecture, coordinate system semantics, modes, and fallback behavior. The coordinate system clarification (rows 1-based, columns 0-based) is consistent with the code changes throughout the PR.

src/exstruct/core/pipeline.py (1)

1-696: LGTM: Pipeline architecture with proper type safety.

The pipeline module demonstrates excellent adherence to coding guidelines:

  • Complete type hints on all functions and parameters
  • Google-style docstrings throughout
  • Immutable dataclasses for pipeline configuration and state
  • Well-organized imports (stdlib → third-party → internal)
  • Clear separation of pre-COM and COM extraction steps

The architecture provides a solid foundation for the extraction workflow with explicit fallback handling and state tracking.

docs/agents/TEST_REQUIREMENTS.md (1)

1-228: LGTM: Comprehensive test requirements specification.

The test requirements document provides thorough coverage of functional, non-functional, and integration requirements. The organization by category (pipeline, backend, ranges, etc.) aligns well with the modular architecture introduced in this PR.

tests/test_print_areas_openpyxl.py (1)

26-26: LGTM: Coordinate system alignment.

The assertion now correctly expects 1-based row coordinates (r1=1, r2=2) and 0-based column coordinates (c1=0, c2=1), consistent with the PrintArea model definition and the broader coordinate system updates in this PR.

tests/test_print_area_views.py (1)

39-46: LGTM! Coordinate system update correctly applied.

The test data has been properly updated to reflect the documented coordinate convention (1-based rows, 0-based columns). The CellRow indices and PrintArea bounds are now consistent with the model definitions.

tests/test_backends.py (2)

76-93: LGTM! Print area extraction test validates coordinate system.

The test correctly verifies that print areas extracted via openpyxl use 1-based row indexing (r1=1) and 0-based column indexing (c1=0), consistent with the PrintArea model definition.


95-101: LGTM! Range parsing test validates zero-based intermediate representation.

The test correctly verifies that parse_range_zero_based returns zero-based coordinates (r1=0, c1=0, r2=1, c2=1) for the range "Sheet1!A1:B2", which are then converted to the appropriate coordinate system by backend methods.

src/exstruct/core/backends/openpyxl_backend.py (1)

159-162: LGTM! Coordinate conversion correctly implemented.

The conversion from zero-based coordinates (from _parse_print_area_range) to the PrintArea model is now correct:

  • Rows (r1, r2): converted from 0-based to 1-based with + 1
  • Columns (c1, c2): kept as 0-based (no adjustment)

This matches the PrintArea model definition where rows are 1-based and columns are 0-based.

tests/test_engine.py (2)

6-12: LGTM! Improved API design with nested option groups.

The introduction of FilterOptions and DestinationOptions improves the API's organization and clarity, grouping related settings together under OutputOptions.


65-65: LGTM! PrintArea coordinate updated correctly.

The PrintArea coordinates have been updated to use 1-based row indexing (r1=1) consistent with the model definition and broader coordinate system clarification in this PR.

docs/agents/CODE_REVIEW.md (1)

1-609: Documentation file - no code review needed.

This file contains example code review comments and AI agent prompts for documentation purposes. It does not contain production code requiring review.

src/exstruct/core/backends/com_backend.py (2)

45-48: LGTM! Coordinate conversion correctly implemented.

The conversion from zero-based coordinates (from _parse_print_area_range) to PrintArea is correct:

  • Rows (r1, r2): + 1 converts 0-based to 1-based
  • Columns (c1, c2): no adjustment keeps 0-based

This matches the PrintArea model definition.


119-125: LGTM! Coordinate conversion for COM values is correct.

The coordinate handling here is actually correct, contrary to the past review comment:

  • rows[i] and cols[j] are 1-based values from Excel COM
  • Rows (r1, r2): used as-is since PrintArea expects 1-based rows ✓
  • Columns (c1, c2): - 1 converts from 1-based COM to 0-based for PrintArea ✓

This is consistent with the PrintArea model where rows are 1-based and columns are 0-based.

- Change _parse_range_zero_based in __init__.py to return RangeBounds, and update callers to use field access
- Add module docstrings to openpyxl_backend.py and com_backend.py
- Add type hints to the mock functions in test_backends.py
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/exstruct/core/cells.py (1)

502-526: Remove redundant wb.close() call.

Line 525 explicitly calls wb.close(), but the openpyxl_workbook context manager on line 502 already handles closing in its finally block (see src/exstruct/core/workbook.py lines 53-56). This redundant call could:

  • Mask exceptions from the context block if close() raises
  • Create confusion about resource ownership
  • Cause double-close issues
🔎 Proposed fix
             links = sheet_links.get(row.r, {})
             merged_rows.append(CellRow(r=row.r, c=row.c, links=links or None))
         merged[sheet_name] = merged_rows
-    wb.close()
     return merged
🧹 Nitpick comments (3)
src/exstruct/core/backends/openpyxl_backend.py (1)

167-179: Consider returning RangeBounds directly instead of unpacking to tuple.

The function unpacks the RangeBounds Pydantic model from parse_range_zero_based into a tuple, which reduces type safety and semantic clarity. The coding guideline recommends using Pydantic models for structured data.

🔎 Proposed refactor
-def _parse_print_area_range(range_str: str) -> tuple[int, int, int, int] | None:
-    """Parse an Excel range string into zero-based coordinates.
+def _parse_print_area_range(range_str: str) -> RangeBounds | None:
+    """Parse an Excel range string into zero-based bounds.
 
     Args:
         range_str: Excel range string.
 
     Returns:
-        Zero-based (r1, c1, r2, c2) tuple or None on failure.
+        RangeBounds in zero-based coordinates, or None on failure.
     """
-    bounds = parse_range_zero_based(range_str)
-    if bounds is None:
-        return None
-    return (bounds.r1, bounds.c1, bounds.r2, bounds.c2)
+    return parse_range_zero_based(range_str)

Then update the caller at line 161:

     for part in str(range_str).split(","):
         parsed = _parse_print_area_range(part)
         if not parsed:
             continue
-        r1, c1, r2, c2 = parsed
+        r1, c1, r2, c2 = parsed.r1, parsed.c1, parsed.r2, parsed.c2
         areas.setdefault(sheet_name, []).append(
             PrintArea(r1=r1 + 1, c1=c1, r2=r2 + 1, c2=c2)
         )

As per coding guidelines, structured data should be represented as Pydantic models rather than tuples.

src/exstruct/core/backends/com_backend.py (1)

158-170: Consider returning RangeBounds directly instead of tuple.

Same issue as in openpyxl_backend.py: the function unpacks the Pydantic RangeBounds model into a tuple, reducing type safety.

🔎 Proposed refactor
-def _parse_print_area_range(range_str: str) -> tuple[int, int, int, int] | None:
-    """Parse an Excel range string into zero-based coordinates.
+def _parse_print_area_range(range_str: str) -> RangeBounds | None:
+    """Parse an Excel range string into zero-based bounds.
 
     Args:
         range_str: Excel range string.
 
     Returns:
-        Zero-based (r1, c1, r2, c2) tuple or None on failure.
+        RangeBounds in zero-based coordinates, or None on failure.
     """
-    bounds = parse_range_zero_based(range_str)
-    if bounds is None:
-        return None
-    return (bounds.r1, bounds.c1, bounds.r2, bounds.c2)
+    return parse_range_zero_based(range_str)

Then update the caller at line 52:

             for part in str(raw).split(","):
                 parsed = _parse_print_area_range(part)
                 if not parsed:
                     continue
-                r1, c1, r2, c2 = parsed
+                r1, c1, r2, c2 = parsed.r1, parsed.c1, parsed.r2, parsed.c2
                 areas.setdefault(sheet.name, []).append(
                     PrintArea(r1=r1 + 1, c1=c1, r2=r2 + 1, c2=c2)
                 )

As per coding guidelines, structured data should be represented as Pydantic models rather than tuples.

src/exstruct/io/__init__.py (1)

74-83: Consider removing this wrapper function.

_parse_range_zero_based simply delegates to parse_range_zero_based with duplicate documentation, adding no additional logic or value. Direct usage of parse_range_zero_based from core.ranges would eliminate this indirection and improve maintainability.

🔎 Proposed refactor

Remove this wrapper entirely and update internal call sites (line 130, and any others) to directly import and use parse_range_zero_based from core.ranges.

-def _parse_range_zero_based(range_str: str) -> RangeBounds | None:
-    """Parse an Excel range string into zero-based bounds.
-
-    Args:
-        range_str: Excel range string (e.g., "Sheet1!A1:B2").
-
-    Returns:
-        RangeBounds in zero-based coordinates, or None on failure.
-    """
-    return parse_range_zero_based(range_str)

Then update call sites:

-        bounds = _parse_range_zero_based(candidate)
+        bounds = parse_range_zero_based(candidate)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93b10e2 and 5195fa0.

📒 Files selected for processing (8)
  • docs/agents/CODE_REVIEW.md
  • src/exstruct/core/backends/com_backend.py
  • src/exstruct/core/backends/openpyxl_backend.py
  • src/exstruct/core/cells.py
  • src/exstruct/core/ranges.py
  • src/exstruct/core/workbook.py
  • src/exstruct/io/__init__.py
  • tests/test_backends.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_backends.py
  • src/exstruct/core/ranges.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Add type hints to all function and method arguments and return values (mypy strict compliance)
Use Any type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)
Organize imports in this order: (1) standard library, (2) third-party packages, (3) exstruct internal modules
Use Google-style docstrings for all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Avoid writing God Functions (large, single-responsibility-violating functions)
Avoid writing God Objects (classes with too many responsibilities)
Avoid excessive nesting and deep conditional branching
Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data
Ensure code passes mypy strict mode with zero errors
Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
Avoid circular dependencies between modules
AI should propose automatic function splitting when functions are too complex
AI should propose additional Pydantic models when data structures are insufficient
AI should automatically fix and organize imports when they are inconsistent
AI should improve docstrings when they are incomplete or insufficient

Files:

  • src/exstruct/core/backends/openpyxl_backend.py
  • src/exstruct/core/cells.py
  • src/exstruct/io/__init__.py
  • src/exstruct/core/workbook.py
  • src/exstruct/core/backends/com_backend.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
📚 Learning: 2025-12-21T10:55:22.238Z
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries

Applied to files:

  • src/exstruct/core/cells.py
  • src/exstruct/io/__init__.py
🧬 Code graph analysis (4)
src/exstruct/core/cells.py (2)
src/exstruct/models/__init__.py (1)
  • CellRow (50-59)
src/exstruct/core/workbook.py (1)
  • openpyxl_workbook (17-56)
src/exstruct/io/__init__.py (4)
src/exstruct/core/ranges.py (2)
  • RangeBounds (9-22)
  • parse_range_zero_based (25-48)
src/exstruct/errors.py (2)
  • OutputError (32-33)
  • SerializationError (20-21)
src/exstruct/engine.py (1)
  • serialize (353-376)
src/exstruct/io/serialize.py (4)
  • _ensure_format_hint (28-49)
  • _require_toon (104-112)
  • _require_yaml (93-101)
  • _serialize_payload_from_hint (52-90)
src/exstruct/core/workbook.py (1)
tests/test_workbook_context.py (2)
  • close (12-13)
  • close (43-44)
src/exstruct/core/backends/com_backend.py (4)
src/exstruct/models/__init__.py (1)
  • PrintArea (97-103)
src/exstruct/core/cells.py (2)
  • WorkbookColorsMap (53-67)
  • extract_sheet_colors_map_com (94-119)
src/exstruct/core/ranges.py (1)
  • parse_range_zero_based (25-48)
src/exstruct/core/backends/base.py (2)
  • extract_print_areas (32-33)
  • extract_colors_map (35-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (9)
docs/agents/CODE_REVIEW.md (1)

1-779: Documentation artifact - no code review required.

This file contains code review examples and AI agent prompts for documentation purposes. It does not contain production code requiring review.

src/exstruct/core/cells.py (2)

85-91: Excellent refactoring to use context-managed workbook access.

The replacement of direct load_workbook calls with the openpyxl_workbook context manager ensures proper resource cleanup and centralizes warning suppression logic.

Based on learnings, this change properly manages external library resources at boundaries.


1161-1454: Well-structured table detection refactoring.

The new helper functions follow good separation of concerns:

  • _extract_listobject_tables: COM ListObject extraction
  • _detect_border_rectangles_xlwings: Border-based detection via COM
  • _merge_rectangles: Deduplication logic
  • _collect_table_candidates_from_values: Heuristic scoring
  • detect_tables_openpyxl: Public openpyxl entry point

All functions have proper type hints, Google-style docstrings, and appropriate error handling.

src/exstruct/core/workbook.py (1)

1-114: Excellent context manager implementation for workbook lifecycle management.

The two context managers demonstrate best practices:

  • Resource safety: Guaranteed cleanup in finally blocks (lines 52-56, 80-87)
  • Error resilience: Swallows cleanup exceptions to avoid masking original errors
  • Defensive reuse: _find_open_workbook safely reuses existing Excel instances via path resolution (lines 90-114)
  • Warning filtering: Suppresses known non-critical openpyxl warnings (lines 30-48)

All functions have complete type hints and proper Google-style docstrings, adhering to coding guidelines.
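
A minimal sketch of the openpyxl context-manager half of this module; warning filtering and the xlwings variant are omitted, and the load_workbook arguments are illustrative:

from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path

from openpyxl import load_workbook
from openpyxl.workbook import Workbook


@contextmanager
def openpyxl_workbook_sketch(path: Path, *, read_only: bool = True) -> Iterator[Workbook]:
    """Open a workbook and guarantee it is closed, even if the body raises."""
    wb = load_workbook(path, read_only=read_only, data_only=True)
    try:
        yield wb
    finally:
        try:
            wb.close()
        except Exception:
            # Swallow cleanup errors so they never mask the original exception.
            pass


# Usage (callers no longer need an explicit wb.close()):
# with openpyxl_workbook_sketch(Path("book.xlsx")) as wb:
#     print(wb.sheetnames)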

src/exstruct/core/backends/openpyxl_backend.py (1)

24-102: Well-designed backend abstraction with proper separation of concerns.

The OpenpyxlBackend class cleanly encapsulates openpyxl-based extraction:

  • Single responsibility: Each method delegates to specialized utilities
  • Error handling: Graceful degradation with empty returns or None
  • Type safety: Complete type hints throughout
  • Documentation: Clear Google-style docstrings

The coordinate conversion at line 163 correctly adjusts rows to 1-based while keeping columns 0-based, matching the PrintArea model specification.

src/exstruct/core/backends/com_backend.py (2)

29-56: Coordinate conversion correctly implemented.

Line 54 properly converts from zero-based (parsed) to the PrintArea model's coordinate system:

  • Rows: +1 converts 0-based to 1-based ✓
  • Columns: no adjustment keeps them 0-based ✓

This matches the PrintArea specification: r1/r2 are 1-based, c1/c2 are 0-based.


83-155: Auto page-break extraction with proper state management.

The method demonstrates good practices:

  • State preservation: Saves and restores DisplayPageBreaks in finally block (lines 143-152)
  • Error resilience: Continues processing other sheets on failure (lines 153-154)
  • Clear structure: Despite length, the logic flows clearly through setup → extraction → restoration

The coordinate conversion at line 134 is correct:

  • Excel COM returns 1-based row/column values
  • Rows stay 1-based (no adjustment) ✓
  • Columns converted from 1-based to 0-based (-1) ✓

This aligns with the PrintArea model specification.
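
The save/restore-in-finally pattern in isolation, using a dummy object in place of the COM worksheet so the sketch runs without Excel; the real code toggles DisplayPageBreaks through xlwings:

from dataclasses import dataclass


@dataclass
class FakeComSheet:
    """Stand-in for a COM worksheet exposing a page-breaks display flag."""

    display_page_breaks: bool = False


def extract_with_page_breaks(sheet: FakeComSheet) -> list[int]:
    """Temporarily enable page breaks, extract, then restore the original state."""
    original = sheet.display_page_breaks
    sheet.display_page_breaks = True
    try:
        # ... query horizontal/vertical page break positions here ...
        return [10, 20]
    finally:
        sheet.display_page_breaks = original


sheet = FakeComSheet()
assert extract_with_page_breaks(sheet) == [10, 20]
assert sheet.display_page_breaks is False  # state restored even if extraction raised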

src/exstruct/io/__init__.py (2)

8-18: LGTM: Clean import refactoring.

The imports are well-organized and follow the project's structure, centralizing range parsing and serialization utilities.


320-323: Excellent refactoring: Centralized serialization.

The refactoring to use _serialize_payload_from_hint across all serialization call sites eliminates code duplication and improves maintainability. This is a clean application of the DRY principle.

Also applies to: 379-382, 405-407, 434-437, 474-476

Comment on lines +493 to +494
"_require_yaml",
"_require_toon",

⚠️ Potential issue | 🟡 Minor

Reconsider exporting private functions in public API.

_require_yaml and _require_toon use underscore prefixes indicating private/internal usage, yet they're exported in __all__, making them part of the public API. This creates inconsistency:

  • If these functions are meant to be public, remove the underscore prefix
  • If they should remain internal, remove them from __all__

Based on their usage in serialize.py and re-export here, if they're genuinely needed by external consumers, consider renaming to require_yaml and require_toon.

🔎 Proposed fix (if making them truly public)
 __all__ = [
     "dict_without_empty_values",
     "save_as_json",
     "save_as_yaml",
     "save_as_toon",
     "save_sheets",
     "save_sheets_as_json",
     "build_print_area_views",
     "save_print_area_views",
     "save_auto_page_break_views",
     "serialize_workbook",
-    "_require_yaml",
-    "_require_toon",
+    "require_yaml",
+    "require_toon",
 ]

And in the imports:

 from .serialize import (
     _FORMAT_HINTS,
     _ensure_format_hint,
-    _require_toon,
-    _require_yaml,
+    require_toon as _require_toon,  # If still needed internally with underscore
+    require_yaml as _require_yaml,
     _serialize_payload_from_hint,
 )

Alternatively, if they should remain private, simply remove them from __all__.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/exstruct/io/__init__.py around lines 493-494, the module exports two
names prefixed with underscores ("_require_yaml", "_require_toon") in __all__,
which exposes private/internal functions as part of the public API; either
remove them from __all__ to keep them internal, or rename the functions and
update all imports/usages (e.g., in serialize.py and re-exports) to public names
(require_yaml, require_toon) so the public API and identifiers are
consistent—pick one approach and apply it across the package.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
schemas/sheet.json (1)

219-219: Consistent with print_area_view.json schema changes.

The PrintArea column indexing update to 0-based is consistent across schemas. See verification request in print_area_view.json review.

Also applies to: 224-224

schemas/workbook.json (1)

219-219: Schema changes consistent across all files.

The PrintArea column indexing (lines 219, 224) and colors_map coordinate clarification (line 451) are consistent with print_area_view.json and sheet.json. This ensures uniform coordinate semantics across the schema hierarchy.

Also applies to: 224-224, 451-451

src/exstruct/io/__init__.py (1)

493-494: Private naming convention inconsistent with public export.

The underscore prefix on _require_yaml and _require_toon signals internal use, yet they're exported in __all__. As noted in previous reviews, consider either:

  • Remove from __all__ if they should remain internal
  • Rename to require_yaml / require_toon if they're intended for external use
🧹 Nitpick comments (1)
pyproject.toml (1)

35-35: taskipy>=1.14.1 follows the project's established pattern for dev dependencies. While the concern about preventing breaking changes is valid, other dev dependencies in the project (pytest, mypy, ruff, etc.) use the same open-ended constraint format. Version pinning is already managed by the uv lock file, which locks taskipy to 1.14.1. Adding an upper bound would diverge from the project's convention without providing additional protection.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5195fa0 and 63d008f.

📒 Files selected for processing (6)
  • pyproject.toml
  • schemas/print_area.json
  • schemas/print_area_view.json
  • schemas/sheet.json
  • schemas/workbook.json
  • src/exstruct/io/__init__.py
✅ Files skipped from review due to trivial changes (1)
  • schemas/print_area.json
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Add type hints to all function and method arguments and return values (mypy strict compliance)
Use Any type only at external library boundaries (xlwings, pandas, numpy, etc.)
Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data
Ensure each function has a single responsibility (cyclomatic complexity should not exceed 12)
Organize imports in this order: (1) standard library, (2) third-party packages, (3) exstruct internal modules
Use Google-style docstrings for all functions and classes
Do not depend on internal structures of external libraries (xlwings, pandas, numpy); normalize external data to Pydantic models at boundaries
Avoid writing God Functions (large, single-responsibility-violating functions)
Avoid writing God Objects (classes with too many responsibilities)
Avoid excessive nesting and deep conditional branching
Do not return dictionaries or tuples; always use Pydantic BaseModel for structured data
Ensure code passes mypy strict mode with zero errors
Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
Avoid circular dependencies between modules
AI should propose automatic function splitting when functions are too complex
AI should propose additional Pydantic models when data structures are insufficient
AI should automatically fix and organize imports when they are inconsistent
AI should improve docstrings when they are incomplete or insufficient

Files:

  • src/exstruct/io/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Codacy Static Code Analysis
  • GitHub Check: test (windows-latest, 3.12)
  • GitHub Check: test (windows-latest, 3.11)
🔇 Additional comments (10)
pyproject.toml (2)

3-3: LGTM!

Version bump to 0.3.0 aligns with the release commit and follows semantic versioning conventions.


122-128: Fix [tool.mypy] configuration for src-layout project.

The [tool.mypy] configuration specifies packages = ["exstruct"], but the package only exists at src/exstruct/. For src-layout projects, add mypy_path = "src" to the [tool.mypy] section so mypy can locate the package. The current task command mypy src/exstruct --strict works because it explicitly specifies the path, but the configuration itself is broken for IDE integration and standalone mypy runs.

⛔ Skipped due to learnings
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Ensure code passes mypy strict mode with zero errors
Learnt from: CR
Repo: harumiWeb/exstruct PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-21T10:55:22.238Z
Learning: Applies to **/*.py : Ensure code passes Ruff checks (E, W, F, I, B, UP, N, C90) with zero errors
schemas/sheet.json (1)

452-452: Good clarification of colors_map coordinate semantics.

The updated description correctly documents the mixed indexing convention (1-based rows, 0-based columns) for the colors_map tuples. Both the openpyxl and COM backends consistently implement this convention: row values are taken directly from the cell/worksheet (1-based), and column indices are decremented by 1 to convert to 0-based.

schemas/print_area_view.json (1)

219-219: Breaking change: Column indexing changed to 0-based.

The schema now documents columns as 0-based while rows remain 1-based. Verification confirms this breaking change has been coordinated across the codebase: both extraction backends (openpyxl and COM) convert coordinates to the new format, and all consumers of PrintArea (filtering, shape/chart mapping, range parsing) correctly treat columns as 0-based and rows as 1-based.

src/exstruct/io/__init__.py (6)

8-18: LGTM!

The imports are well-organized and follow the coding guidelines: standard library imports first, followed by internal modules. The centralized serialization helpers from .serialize support the refactoring goals.


74-83: Good refactoring to centralized range parsing.

The delegation pattern eliminates code duplication while maintaining a local interface. Type hints and docstring are complete.


145-156: LGTM - Pixel rectangle calculation.

The 1-based row to pixel conversion is handled correctly: (area.r1 - 1) * row_px for top edge and area.r2 * row_px for bottom edge, consistent with 1-based row indexing in PrintArea.


288-293: Consistent error handling across all format validation.

All serialization functions now use SerializationError via _ensure_format_hint. The previous inconsistency (noted in past reviews for save_sheets) has been resolved.

Also applies to: 346-351, 398-403, 454-459


320-323: Good centralization of serialization logic.

All save/serialize functions now delegate to _serialize_payload_from_hint, eliminating duplicated format-switching code and ensuring consistent behavior across JSON, YAML, and TOON outputs.

Also applies to: 379-382, 405-407, 434-437, 474-476


133-140: The coordinate system is correct and intentional. PrintArea uses 1-based rows and 0-based columns (as documented in the model). RangeBounds uses zero-based for all coordinates. The code correctly converts rows from zero-based to 1-based (bounds.r1 + 1, bounds.r2 + 1) and uses columns directly without conversion (both are zero-based). This asymmetric coordinate system is consistent throughout the codebase and poses no issue.

@harumiWeb harumiWeb merged commit 9b3478b into main Dec 27, 2025
9 checks passed
@harumiWeb harumiWeb deleted the dev/refactor branch December 27, 2025 14:32
harumiWeb added a commit that referenced this pull request Dec 27, 2025
Add v0.3.0 release notes (internal refactor PR #23)