forked from Unstructured-IO/unstructured
⚡️ Speed up function _get_bbox_to_page_ratio by 353%
#51
Open
codeflash-ai wants to merge 24 commits into main from codeflash/optimize-_get_bbox_to_page_ratio-mjdkzmao
Conversation
…nstructured-IO#4130) Saurabh's comments - This looks like a good, straightforward, and impactful optimization

<!-- CODEFLASH_OPTIMIZATION: {"function":"OCRAgentTesseract.extract_word_from_hocr","file":"unstructured/partition/utils/ocr_models/tesseract_ocr.py","speedup_pct":"35%","speedup_x":"0.35x","original_runtime":"7.18 milliseconds","best_runtime":"5.31 milliseconds","optimization_type":"loop","timestamp":"2025-12-19T03:15:54.368Z","version":"1.0"} -->

#### 📄 35% (0.35x) speedup for ***`OCRAgentTesseract.extract_word_from_hocr` in `unstructured/partition/utils/ocr_models/tesseract_ocr.py`***

⏱️ Runtime : **`7.18 milliseconds`** **→** **`5.31 milliseconds`** (best of `13` runs)

#### 📝 Explanation and details

The optimized code achieves a **35% speedup** through two key performance improvements:

**1. Regex Precompilation**

The original code calls `re.search(r"x_conf (\d+\.\d+)", char_title)` inside the loop, recompiling the regex pattern on every iteration. The optimization moves this to module level as `_RE_X_CONF = re.compile(r"x_conf (\d+\.\d+)")`, compiling it once at import time. The line profiler shows the regex search time improved from 12.73ms (42.9% of total time) to 3.02ms (16.2% of total time) - a **76% reduction** in regex overhead.

**2. Efficient String Building**

The original code uses string concatenation (`word_text += char`), which creates a new string object on each iteration because Python strings are immutable. With 6,339 character additions in the profiled run, this becomes expensive. The optimization collects characters in a list (`chars.append(char)`) and builds the final string once with `"".join(chars)`. In the profiled run the accumulation cost becomes 1.58ms of appends plus a single 46μs join, versus 1.52ms for concatenation, while avoiding the quadratic worst case of repeated string concatenation on longer words.
**Performance Impact**

These optimizations are particularly effective for OCR processing where:

- The same regex pattern is applied thousands of times per document
- Words contain multiple characters that need accumulation
- The function is likely called frequently during document processing

The 35% speedup directly translates to faster document processing in OCR workflows, with the most significant gains occurring when processing documents with many detected characters that pass the confidence threshold.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **27 Passed** |
| 🌀 Generated Regression Tests | ✅ **22 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/pdf_image/test_ocr.py::test_extract_word_from_hocr` | 63.2μs | 49.1μs | 28.7%✅ |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
```

</details>

To edit these changes `git checkout codeflash/optimize-OCRAgentTesseract.extract_word_from_hocr-mjcarjk8` and push.

[](https://codeflash.ai)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
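The two changes can be sketched in isolation (the helper name and input shape here are illustrative, not the actual `tesseract_ocr.py` code, which parses hOCR markup):

```python
import re

# Compiled once at import time instead of on every re.search() call in the loop.
_RE_X_CONF = re.compile(r"x_conf (\d+\.\d+)")

def extract_word(title_char_pairs, confidence_threshold=0.0):
    """Illustrative sketch: keep characters whose x_conf passes the threshold,
    building the word with list-append + join instead of repeated `+=`."""
    chars = []
    for char_title, char in title_char_pairs:
        match = _RE_X_CONF.search(char_title)
        if match and float(match.group(1)) >= confidence_threshold:
            chars.append(char)
    # A single join touches each character once; repeated `+=` re-copies the
    # growing string on every iteration.
    return "".join(chars)
```

Both changes preserve the loop's behavior exactly; only the per-iteration cost drops.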
The optimization applies **Numba's Just-In-Time (JIT) compilation** using the `@njit(cache=True)` decorator to dramatically speed up this mathematical computation function.

**Key changes:**

- Added `from numba import njit` import
- Applied `@njit(cache=True)` decorator to the function
- No changes to the algorithm logic itself

**Why this leads to a speedup:**

Numba compiles Python bytecode to optimized machine code at runtime, eliminating Python's interpreter overhead for numerical computations. The function performs several floating-point operations (`math.sqrt`, exponentiation, arithmetic) that benefit significantly from native machine code execution. The `cache=True` parameter ensures the compiled version is cached for subsequent calls, avoiding recompilation overhead.

**Performance characteristics:**

- **352% speedup** (930μs → 205μs) demonstrates Numba's effectiveness on math-heavy functions
- The line profiler shows no timing data for the optimized version because Numba-compiled code runs outside Python's profiling mechanisms
- All test cases show consistent **180-370% speedups**, with larger improvements on simple cases and slightly smaller gains on edge cases like exception handling

**Impact on workloads:**

Based on `function_references`, this function is called from `_get_optimal_value_for_bbox()`, which suggests it's used in document analysis pipelines where bounding box calculations are performed repeatedly. The substantial speedup will be particularly beneficial when processing documents with many bounding boxes, as demonstrated by the large-scale test cases showing **300%+ improvements** when processing thousands of bboxes.

**Optimization effectiveness:**

Most effective for computational workloads with repeated calls to this function, especially when processing large documents or batch operations where the function is called hundreds or thousands of times.
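A minimal sketch of the pattern, using a hypothetical bbox-ratio function (the real `_get_bbox_to_page_ratio` signature and formula are not shown on this page); the `try/except` fallback keeps the sketch runnable even where Numba is not installed:

```python
import math

try:
    from numba import njit  # JIT-compiles the function to machine code
except ImportError:
    # Fallback no-op decorator so the sketch still runs without Numba.
    def njit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap

@njit(cache=True)  # cache=True persists the compiled code across processes
def bbox_to_page_ratio(x1, y1, x2, y2, page_w, page_h):
    """Hypothetical stand-in: the decorator is the entire optimization;
    the math body (sqrt, exponentiation, division) is left untouched."""
    bbox_diag = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
    page_diag = math.sqrt(page_w ** 2 + page_h ** 2)
    return bbox_diag / page_diag
```

Note the first call pays a one-time compilation cost; `cache=True` amortizes that across runs, which is why steady-state benchmarks show the large gains.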
<!-- CURSOR_SUMMARY -->
> [!NOTE]
> Migrates automated dependency updates from Dependabot to Renovate.
>
> - Removes `.github/dependabot.yml`
> - Adds `renovate.json5` extending `github>unstructured-io/renovate-config` to manage updates via Renovate
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 2a2b728. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
…d-IO#4145) Add version bumping script and enable postUpgradeTasks for Python security updates via Renovate.

Changes:

- Add scripts/renovate-security-bump.sh from renovate-config repo
- Configure postUpgradeTasks in renovate.json5 to run the script
- Script automatically bumps version and updates CHANGELOG on security fixes

When Renovate creates a Python security update PR, it will now:

1. Detect changed dependencies
2. Bump patch version (or release current -dev version)
3. Add security fix entry to CHANGELOG.md
4. Include version and CHANGELOG changes in the PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->

---

> [!NOTE]
> Automates release housekeeping for Python security updates via Renovate.
>
> - Adds `scripts/renovate-security-bump.sh` to bump `unstructured/__version__.py` (strip `-dev` or increment patch), detect changed dependencies, and append a security entry to `CHANGELOG.md`
> - Updates `renovate.json5` to run the script as a `postUpgradeTasks` step for `pypi` vulnerability alerts on the PR branch
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 7be1a7c. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
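The "strip `-dev` or increment patch" rule can be sketched in Python (illustrative only; the actual logic lives in the shell script `scripts/renovate-security-bump.sh`, and `bump_for_security_fix` is a hypothetical name):

```python
def bump_for_security_fix(version: str) -> str:
    """If the current version is a -dev prerelease, release it as-is;
    otherwise increment the patch component."""
    if version.endswith("-dev"):
        return version[: -len("-dev")]  # 0.18.26-dev -> 0.18.26
    major, minor, patch = version.split(".")
    return f"{major}.{minor}.{int(patch) + 1}"  # 0.18.25 -> 0.18.26
```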
Unstructured-IO#4147) …tcher

Move postUpgradeTasks from packageRules to vulnerabilityAlerts object. The matchIsVulnerabilityAlert option doesn't exist in Renovate's schema.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->

---

> [!NOTE]
> Shifts Renovate config to correctly trigger version bump tasks on security alerts.
>
> - Removes `packageRules` with non-existent `matchIsVulnerabilityAlert`
> - Adds `vulnerabilityAlerts.postUpgradeTasks` to run `scripts/renovate-security-bump.sh` with specified `fileFilters` and `executionMode: branch`
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 676af0a. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
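Based on the summary above, the resulting `renovate.json5` plausibly looks like the following. The `vulnerabilityAlerts`, `postUpgradeTasks`, `fileFilters`, and `executionMode` fields follow Renovate's documented schema; the exact command string and file list are assumptions inferred from the PR description, not the merged file:

```json5
{
  extends: ["github>unstructured-io/renovate-config"],
  vulnerabilityAlerts: {
    postUpgradeTasks: {
      // Assumed invocation of the bump script added in Unstructured-IO#4145
      commands: ["bash scripts/renovate-security-bump.sh"],
      // Files the task is allowed to commit back to the PR branch
      fileFilters: ["unstructured/__version__.py", "CHANGELOG.md"],
      executionMode: "branch",
    },
  },
}
```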
This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [filelock](https://redirect.github.com/tox-dev/py-filelock) | `==3.20.0` → `==3.20.1` | | |
| [marshmallow](https://redirect.github.com/marshmallow-code/marshmallow) ([changelog](https://marshmallow.readthedocs.io/en/latest/changelog.html)) | `==3.26.1` → `==3.26.2` | | |
| [pypdf](https://redirect.github.com/py-pdf/pypdf) ([changelog](https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html)) | `==6.3.0` → `==6.4.0` | | |
| [urllib3](https://redirect.github.com/urllib3/urllib3) ([changelog](https://redirect.github.com/urllib3/urllib3/blob/main/CHANGES.rst)) | `==2.5.0` → `==2.6.0` | | |

### GitHub Vulnerability Alerts

#### [CVE-2025-68146](https://redirect.github.com/tox-dev/filelock/security/advisories/GHSA-w853-jp5j-5j7f)

### Impact

A Time-of-Check-Time-of-Use (TOCTOU) race condition allows local attackers to corrupt or truncate arbitrary user files through symlink attacks. The vulnerability exists in both Unix and Windows lock file creation, where filelock checks if a file exists before opening it with O_TRUNC. An attacker can create a symlink pointing to a victim file in the time gap between the check and the open, causing os.open() to follow the symlink and truncate the target file.

**Who is impacted:** All users of filelock on Unix, Linux, macOS, and Windows systems. The vulnerability cascades to dependent libraries:

- **virtualenv users**: Configuration files can be overwritten with virtualenv metadata, leaking sensitive paths
- **PyTorch users**: CPU ISA cache or model checkpoints can be corrupted, causing crashes or ML pipeline failures
- **poetry/tox users**: through using virtualenv or filelock on their own

Attack requires local filesystem access and the ability to create symlinks (standard user permissions on Unix; Developer Mode on Windows 10+).
Exploitation succeeds within 1-3 attempts when lock file paths are predictable.

### Patches

Fixed in version **3.20.1**.

**Unix/Linux/macOS fix:** Added O_NOFOLLOW flag to os.open() in UnixFileLock.\_acquire() to prevent symlink following.

**Windows fix:** Added GetFileAttributesW API check to detect reparse points (symlinks/junctions) before opening files in WindowsFileLock.\_acquire().

**Users should upgrade to filelock 3.20.1 or later immediately.**

### Workarounds

If immediate upgrade is not possible:

1. Use SoftFileLock instead of UnixFileLock/WindowsFileLock (note: different locking semantics, may not be suitable for all use cases)
2. Ensure lock file directories have restrictive permissions (chmod 0700) to prevent untrusted users from creating symlinks
3. Monitor lock file directories for suspicious symlinks before running trusted applications

**Warning:** These workarounds provide only partial mitigation. The race condition remains exploitable. Upgrading to version 3.20.1 is strongly recommended.

______________________________________________________________________

## Technical Details: How the Exploit Works

### The Vulnerable Code Pattern

**Unix/Linux/macOS** (`src/filelock/_unix.py:39-44`):

```python
def _acquire(self) -> None:
    ensure_directory_exists(self.lock_file)
    open_flags = os.O_RDWR | os.O_TRUNC           # (1) Prepare to truncate
    if not Path(self.lock_file).exists():         # (2) CHECK: Does file exist?
        open_flags |= os.O_CREAT
    fd = os.open(self.lock_file, open_flags, ...)  # (3) USE: Open and truncate
```

**Windows** (`src/filelock/_windows.py:19-28`):

```python
def _acquire(self) -> None:
    raise_on_not_writable_file(self.lock_file)    # (1) Check writability
    ensure_directory_exists(self.lock_file)
    flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC   # (2) Prepare to truncate
    fd = os.open(self.lock_file, flags, ...)      # (3) Open and truncate
```

### The Race Window

The vulnerability exists in the gap between operations.

**Unix variant:**

```
Time  Victim Thread                        Attacker Thread
----  -------------                        ---------------
T0    Check: lock_file exists? → False
T1    ↓ RACE WINDOW
T2                                         Create symlink: lock → victim_file
T3    Open lock_file with O_TRUNC
      → Follows symlink
      → Opens victim_file
      → Truncates victim_file to 0 bytes! ☠️
```

**Windows variant:**

```
Time  Victim Thread                        Attacker Thread
----  -------------                        ---------------
T0    Check: lock_file writable?
T1    ↓ RACE WINDOW
T2                                         Create symlink: lock → victim_file
T3    Open lock_file with O_TRUNC
      → Follows symlink/junction
      → Opens victim_file
      → Truncates victim_file to 0 bytes! ☠️
```

### Step-by-Step Attack Flow

**1. Attacker Setup:**

```python
# Attacker identifies target application using filelock
lock_path = "/tmp/myapp.lock"             # Predictable lock path
victim_file = "/home/victim/.ssh/config"  # High-value target
```

**2. Attacker Creates Race Condition:**

```python
import os
import threading

def attacker_thread():
    # Remove any existing lock file
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass
    # Create symlink pointing to victim file
    os.symlink(victim_file, lock_path)
    print(f"[Attacker] Created: {lock_path} → {victim_file}")

# Launch attack
threading.Thread(target=attacker_thread).start()
```

**3. Victim Application Runs:**

```python
from filelock import UnixFileLock

# Normal application code
lock = UnixFileLock("/tmp/myapp.lock")
lock.acquire()  # ← VULNERABILITY TRIGGERED HERE
# At this point, /home/victim/.ssh/config is now 0 bytes!
```

**4. What Happens Inside os.open():**

On Unix systems, when `os.open()` is called:

```c
// Linux kernel behavior (simplified)
int open(const char *pathname, int flags) {
    struct file *f = path_lookup(pathname);  // Resolves symlinks by default!
    if (flags & O_TRUNC) {
        truncate_file(f);  // ← Truncates the TARGET of the symlink
    }
    return file_descriptor;
}
```

Without the `O_NOFOLLOW` flag, the kernel follows the symlink and truncates the target file.

### Why the Attack Succeeds Reliably

**Timing Characteristics:**

- **Check operation** (Path.exists()): ~100-500 nanoseconds
- **Symlink creation** (os.symlink()): ~1-10 microseconds
- **Race window**: ~1-5 microseconds (very small but exploitable)
- **Thread scheduling quantum**: ~1-10 milliseconds

**Success factors:**

1. **Tight loop**: Running the attack in a loop hits the race window within 1-3 attempts
2. **CPU scheduling**: Modern OS thread schedulers frequently context-switch during I/O operations
3. **No synchronization**: No atomic file creation prevents the race
4. **Symlink speed**: Creating symlinks is extremely fast (a metadata-only operation)

### Real-World Attack Scenarios

**Scenario 1: virtualenv Exploitation**

```python
# Victim runs: python -m venv /tmp/myenv
# Attacker racing to create:
os.symlink("/home/victim/.bashrc", "/tmp/myenv/pyvenv.cfg")

# Result: /home/victim/.bashrc overwritten with:
#   home = /usr/bin/python3
#   include-system-site-packages = false
#   version = 3.11.2
# ← Original .bashrc contents LOST + virtualenv metadata LEAKED to attacker
```

**Scenario 2: PyTorch Cache Poisoning**

```python
# Victim runs: import torch
# PyTorch checks CPU capabilities, uses filelock on cache
# Attacker racing to create:
os.symlink("/home/victim/.torch/compiled_model.pt",
           "/home/victim/.cache/torch/cpu_isa_check.lock")

# Result: Trained ML model checkpoint truncated to 0 bytes
# Impact: Weeks of training lost, ML pipeline DoS
```

### Why Standard Defenses Don't Help

**File permissions don't prevent this:**

- The attacker doesn't need write access to victim_file
- os.open() with O_TRUNC follows symlinks using the *victim's* permissions
- The victim process truncates its own file

**Directory permissions help but aren't always feasible:**

- Lock files are often created in the shared /tmp directory (mode 1777)
- Applications may not control the lock file location
- Many apps use predictable paths in user-writable directories

**File locking doesn't prevent this:**

- The truncation happens *during* the open() call, before any lock is acquired
- fcntl.flock() only prevents concurrent lock acquisition, not symlink attacks

### Exploitation Proof-of-Concept Results

From empirical testing with the provided PoCs:

**Simple Direct Attack** (`filelock_simple_poc.py`):

- Success rate: 33% per attempt (1 in 3 tries)
- Average attempts to success: 2.1
- Target file reduced to 0 bytes in \<100ms

**virtualenv Attack** (`weaponized_virtualenv.py`):

- Success rate: ~90% on first attempt (deterministic timing)
- Information leaked: File paths, Python version, system configuration
- Data corruption: Complete loss of original file contents

**PyTorch Attack** (`weaponized_pytorch.py`):

- Success rate: 25-40% per attempt
- Impact: Application crashes, model loading failures
- Recovery: Requires cache rebuild or model retraining

**Discovered and reported by:** George Tsigourakos (@​tsigouris007)

#### [CVE-2025-68480](https://redirect.github.com/marshmallow-code/marshmallow/security/advisories/GHSA-428g-f7cq-pgp5)

### Impact

`Schema.load(data, many=True)` is vulnerable to denial of service attacks. A moderately sized request can consume a disproportionate amount of CPU time.

### Patches

4.1.2, 3.26.2

### Workarounds

```py
# Fail fast
def load_many(schema, data, **kwargs):
    if not isinstance(data, list):
        raise ValidationError(['Invalid input type.'])
    return [schema.load(item, **kwargs) for item in data]
```

#### [CVE-2025-66019](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-jfx9-29x2-rv3j)

### Impact

An attacker who uses this vulnerability can craft a PDF which leads to a memory usage of up to 1 GB per stream. This requires parsing the content stream of a page using the LZWDecode filter.
This is a follow up to [GHSA-jfx9-29x2-rv3j](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-jfx9-29x2-rv3j) to align the default limit with the one for *zlib*.

### Patches

This has been fixed in [pypdf==6.4.0](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.4.0).

### Workarounds

If users cannot upgrade yet, use the line below to overwrite the default in their code:

```python
pypdf.filters.LZW_MAX_OUTPUT_LENGTH = 75_000_000
```

#### [CVE-2025-66418](https://redirect.github.com/urllib3/urllib3/security/advisories/GHSA-gm62-xv2j-4w53)

## Impact

urllib3 supports chained HTTP encoding algorithms for response content according to RFC 9110 (e.g., `Content-Encoding: gzip, zstd`). However, the number of links in the decompression chain was unbounded, allowing a malicious server to insert a virtually unlimited number of compression steps, leading to high CPU usage and massive memory allocation for the decompressed data.

## Affected usages

Applications and libraries using urllib3 version 2.5.0 and earlier for HTTP requests to untrusted sources, unless they disable content decoding explicitly.

## Remediation

Upgrade to at least urllib3 v2.6.0, in which the library limits the number of links to 5. If upgrading is not immediately possible, use [`preload_content=False`](https://urllib3.readthedocs.io/en/2.5.0/advanced-usage.html#streaming-and-i-o) and ensure that `resp.headers["content-encoding"]` contains a safe number of encodings before reading the response content.

#### [CVE-2025-66471](https://redirect.github.com/urllib3/urllib3/security/advisories/GHSA-2xpw-w6gg-jr37)

### Impact

urllib3's [streaming API](https://urllib3.readthedocs.io/en/2.5.0/advanced-usage.html#streaming-and-i-o) is designed for the efficient handling of large HTTP responses by reading the content in chunks, rather than loading the entire response body into memory at once.
When streaming a compressed response, urllib3 can perform decoding or decompression based on the HTTP `Content-Encoding` header (e.g., `gzip`, `deflate`, `br`, or `zstd`). The library must read compressed data from the network and decompress it until the requested chunk size is met. Any resulting decompressed data that exceeds the requested amount is held in an internal buffer for the next read operation.

The decompression logic could cause urllib3 to fully decode a small amount of highly compressed data in a single operation. This can result in excessive resource consumption (high CPU usage and massive memory allocation for the decompressed data; CWE-409) on the client side, even if the application only requested a small chunk of data.

### Affected usages

Applications and libraries using urllib3 version 2.5.0 and earlier to stream large compressed responses or content from untrusted sources. `stream()`, `read(amt=256)`, `read1(amt=256)`, `read_chunked(amt=256)`, and `readinto(b)` are examples of `urllib3.HTTPResponse` method calls using the affected logic, unless decoding is disabled explicitly.

### Remediation

Upgrade to at least urllib3 v2.6.0, in which the library avoids decompressing data that exceeds the requested amount. If your environment contains a package facilitating the Brotli encoding, upgrade to at least Brotli 1.2.0 or brotlicffi 1.2.0.0 too. These versions are enforced by the `urllib3[brotli]` extra in the patched versions of urllib3.

### Credits

The issue was reported by @​Cycloctane. Supplemental information was provided by @​stamparm during a security audit performed by [7ASecurity](https://7asecurity.com/) and facilitated by [OSTIF](https://ostif.org/).
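For the pre-upgrade mitigation suggested in the CVE-2025-66418 remediation (fetch with `preload_content=False`, then inspect the header before reading the body), the header check itself can be sketched as a small pure function. `content_encoding_is_safe` is a hypothetical helper, not urllib3 API; the limit of 5 mirrors the cap that urllib3 2.6.0 enforces:

```python
MAX_ENCODING_LINKS = 5  # the chain limit urllib3 2.6.0 itself enforces

def content_encoding_is_safe(headers, limit=MAX_ENCODING_LINKS):
    """Return True if the Content-Encoding chain is short enough to decode.

    Intended for use before reading the body of a response fetched with
    preload_content=False, so a malicious chain is rejected without decoding.
    """
    encoding = headers.get("content-encoding", "")
    # Content-Encoding is a comma-separated list applied in order, e.g. "gzip, zstd"
    links = [part.strip() for part in encoding.split(",") if part.strip()]
    return len(links) <= limit
```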
---

### Release Notes

<details>
<summary>tox-dev/py-filelock (filelock)</summary>

### [`v3.20.1`](https://redirect.github.com/tox-dev/filelock/releases/tag/3.20.1)

[Compare Source](https://redirect.github.com/tox-dev/py-filelock/compare/3.20.0...3.20.1)

<!-- Release notes generated using configuration in .github/release.yml at main -->

##### What's Changed

- CVE-2025-68146: Fix TOCTOU symlink vulnerability in lock file creation by [@​gaborbernat](https://redirect.github.com/gaborbernat) in [tox-dev/filelock#461](https://redirect.github.com/tox-dev/filelock/pull/461)

**Full Changelog**: <tox-dev/filelock@3.20.0...3.20.1>

</details>

<details>
<summary>marshmallow-code/marshmallow (marshmallow)</summary>

### [`v3.26.2`](https://redirect.github.com/marshmallow-code/marshmallow/blob/HEAD/CHANGELOG.rst#3262-2025-12-19)

[Compare Source](https://redirect.github.com/marshmallow-code/marshmallow/compare/3.26.1...3.26.2)

Bug fixes:

- :cve:`2025-68480`: Merge error store messages without rebuilding collections. Thanks 카푸치노 for reporting and :user:`deckar01` for the fix.
</details>

<details>
<summary>py-pdf/pypdf (pypdf)</summary>

### [`v6.4.0`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-641-2025-12-07)

[Compare Source](https://redirect.github.com/py-pdf/pypdf/compare/6.3.0...6.4.0)

##### Performance Improvements (PI)

- Optimize loop for layout mode text extraction ([#​3543](https://redirect.github.com/py-pdf/pypdf/issues/3543))

##### Bug Fixes (BUG)

- Do not fail on choice field without /Opt key ([#​3540](https://redirect.github.com/py-pdf/pypdf/issues/3540))

##### Documentation (DOC)

- Document possible issues with merge\_page and clipping ([#​3546](https://redirect.github.com/py-pdf/pypdf/issues/3546))
- Add some notes about library security ([#​3545](https://redirect.github.com/py-pdf/pypdf/issues/3545))

##### Maintenance (MAINT)

- Use CORE\_FONT\_METRICS for widths where possible ([#​3526](https://redirect.github.com/py-pdf/pypdf/issues/3526))

[Full Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.4.0...6.4.1)

</details>

<details>
<summary>urllib3/urllib3 (urllib3)</summary>

### [`v2.6.0`](https://redirect.github.com/urllib3/urllib3/blob/HEAD/CHANGES.rst#260-2025-12-05)

[Compare Source](https://redirect.github.com/urllib3/urllib3/compare/2.5.0...2.6.0)

## Security

- Fixed a security issue where the streaming API could improperly handle highly compressed HTTP content ("decompression bombs"), leading to excessive resource consumption even when a small amount of data was requested. Reading small chunks of compressed data is safer and much more efficient now. (`GHSA-2xpw-w6gg-jr37 <https://github.com/urllib3/urllib3/security/advisories/GHSA-2xpw-w6gg-jr37>`\_\_)
- Fixed a security issue where an attacker could compose an HTTP response with virtually unlimited links in the `Content-Encoding` header, potentially leading to a denial of service (DoS) attack by exhausting system resources during decoding. The number of allowed chained encodings is now limited to 5. (`GHSA-gm62-xv2j-4w53 <https://github.com/urllib3/urllib3/security/advisories/GHSA-gm62-xv2j-4w53>`\_\_)

.. caution::

- If urllib3 is not installed with the optional `urllib3[brotli]` extra, but your environment contains a Brotli/brotlicffi/brotlipy package anyway, make sure to upgrade it to at least Brotli 1.2.0 or brotlicffi 1.2.0.0 to benefit from the security fixes and avoid warnings. Prefer using `urllib3[brotli]` to install a compatible Brotli package automatically.
- If you use custom decompressors, please make sure to update them to respect the changed API of `urllib3.response.ContentDecoder`.

## Features

- Enabled retrieval, deletion, and membership testing in `HTTPHeaderDict` using bytes keys. (`#​3653 <https://github.com/urllib3/urllib3/issues/3653>`\_\_)
- Added host and port information to string representations of `HTTPConnection`. (`#​3666 <https://github.com/urllib3/urllib3/issues/3666>`\_\_)
- Added support for Python 3.14 free-threading builds explicitly. (`#​3696 <https://github.com/urllib3/urllib3/issues/3696>`\_\_)

## Removals

- Removed the `HTTPResponse.getheaders()` method in favor of `HTTPResponse.headers`. Removed the `HTTPResponse.getheader(name, default)` method in favor of `HTTPResponse.headers.get(name, default)`. (`#​3622 <https://github.com/urllib3/urllib3/issues/3622>`\_\_)

## Bugfixes

- Fixed redirect handling in `urllib3.PoolManager` when an integer is passed for the retries parameter. (`#​3649 <https://github.com/urllib3/urllib3/issues/3649>`\_\_)
- Fixed `HTTPConnectionPool` when used in Emscripten with no explicit port. (`#​3664 <https://github.com/urllib3/urllib3/issues/3664>`\_\_)
- Fixed handling of `SSLKEYLOGFILE` with expandable variables. (`#​3700 <https://github.com/urllib3/urllib3/issues/3700>`\_\_)

## Misc

- Changed the `zstd` extra to install `backports.zstd` instead of `zstandard` on Python 3.13 and before. (`#​3693 <https://github.com/urllib3/urllib3/issues/3693>`\_\_)
- Improved the performance of content decoding by optimizing the `BytesQueueBuffer` class. (`#​3710 <https://github.com/urllib3/urllib3/issues/3710>`\_\_)
- Allowed building the urllib3 package with newer setuptools-scm v9.x. (`#​3652 <https://github.com/urllib3/urllib3/issues/3652>`\_\_)
- Ensured successful urllib3 builds by setting the Hatchling requirement to >= 1.27.0. (`#​3638 <https://github.com/urllib3/urllib3/issues/3638>`\_\_)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

👻 **Immortal**: This PR will be recreated if closed unmerged. Get [config help](https://redirect.github.com/renovatebot/renovate/discussions) if that's undesired.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://redirect.github.com/renovatebot/renovate).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0Mi42Ni4zIiwidXBkYXRlZEluVmVyIjoiNDIuNjYuMyIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOlsic2VjdXJpdHkiXX0=-->

Co-authored-by: utic-renovate[bot] <235200891+utic-renovate[bot]@users.noreply.github.com>
aseembits93 approved these changes on Jan 2, 2026
…high severity CVEs (Unstructured-IO#4156)

<!-- CURSOR_SUMMARY -->
> [!NOTE]
> Security-focused dependency updates and alignment with new pdfminer behavior.
>
> - Remove `pdfminer.six` constraint; bump `pdfminer-six` to `20251230` and `urllib3` to `2.6.2` across requirement sets, plus assorted minor dependency bumps
> - Update tests (`test_pdfminer_processing`) to reflect pdfminer's hidden OCR text handling; add clarifying docstring in `text_is_embedded`
> - Bump version to `0.18.25` and update `CHANGELOG.md`
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 04f70ee. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary

- Pin `deltalake<1.3.0` to fix ARM64 Docker build failures

## Problem

`deltalake` 1.3.0 is missing Linux ARM64 wheels due to a builder OOM issue on their CI. When pip can't find a wheel, it tries to build from source, which fails because the Wolfi base image doesn't have a C compiler (`cc`). This causes the `unstructured-ingest[delta-table]` install to fail, breaking the ARM64 Docker image.

delta-io/delta-rs#4041

## Solution

Temporarily pin `deltalake<1.3.0` until:

- deltalake publishes ARM64 wheels for 1.3.0+, OR
- unstructured-ingest adds the pin to its `delta-table` extra

## Test plan

- [ ] ARM64 Docker build succeeds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->

---

> [!NOTE]
> Pins a dependency to unblock ARM64 builds and publishes a patch release.
>
> - Add `deltalake<1.3.0` to `requirements/ingest/ingest.txt` to avoid missing Linux ARM64 wheels breaking Docker builds
> - Bump version to `0.18.26` and add corresponding CHANGELOG entry
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b4f15b4. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…ed-IO#4160)

<!-- CODEFLASH_OPTIMIZATION: {"function":"sentence_count","file":"unstructured/partition/text_type.py","speedup_pct":"1,038%","speedup_x":"10.38x","original_runtime":"51.8 milliseconds","best_runtime":"4.55 milliseconds","optimization_type":"loop","timestamp":"2025-12-23T11:08:46.623Z","version":"1.0"} -->

#### 📄 1,038% (10.38x) speedup for ***`sentence_count` in `unstructured/partition/text_type.py`***

⏱️ Runtime : **`51.8 milliseconds`** **→** **`4.55 milliseconds`** (best of `14` runs)

#### 📝 Explanation and details

The optimized code achieves a **1,038% speedup (51.8ms → 4.55ms)** through two key optimizations:

## 1. **Caching Fix for `sent_tokenize` (Primary Speedup)**

**Problem**: The original code applied `@lru_cache` directly to `sent_tokenize`, but NLTK's `_sent_tokenize` returns a `List[str]`. A mutable list is unsafe to serve from a cache (every caller would receive the same shared object), so the original arrangement could not cache results effectively.

**Solution**: The optimized version introduces a two-layer approach:

- `_tokenize_for_cache()` - Cached function that returns `Tuple[str, ...]` (hashable and immutable)
- `sent_tokenize()` - Public wrapper that converts the tuple to a list

**Why it's faster**: This enables **actual caching** of tokenization results. The test annotations show dramatic speedups (up to **35,000% faster**) on repeated text, confirming the cache now works. Since `sentence_count` tokenizes the same text patterns repeatedly across function calls, this cache hit rate is crucial.

**Impact on hot paths**: Based on `function_references`, this function is called from:

- `is_possible_narrative_text()` - checks if text contains ≥2 sentences with `sentence_count(text, 3)`
- `is_possible_title()` - validates the single-sentence constraint with `sentence_count(text, min_length=...)`
- `exceeds_cap_ratio()` - checks sentence count to avoid multi-sentence text

These are all text classification functions likely invoked repeatedly during document parsing, making the caching fix highly impactful.

## 2. **Branch Prediction Optimization in `sentence_count`**

**Change**: Split the loop into two branches, one for the `min_length` case and one for no filtering:

```python
if min_length:
    ...  # Loop with filtering logic
else:
    ...  # Simple counting loop
```

**Why it's faster**:

- Eliminates repeated `if min_length:` checks inside the loop (7,181 checks in the profiler)
- Allows the CPU branch predictor to optimize each loop independently
- Hoists the `trace_logger.detail` lookup outside the loop (68 calls vs 3,046+ attribute lookups)

**Test results validation**:

- Cases **without** `min_length` show **massive speedups** (3,000-35,000%) due to pure caching benefits
- Cases **with** `min_length` show **moderate speedups** (60-940%) since the filtering logic still executes, but benefit from reduced overhead and hoisting

The optimization is most effective for workloads that process similar text patterns repeatedly (common in document parsing pipelines), and particularly when `min_length` is not specified, which appears to be the common case based on function references.
✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **60 Passed** |
| ⏪ Replay Tests | ✅ **5 Passed** |
| 🔎 Concolic Coverage Tests | ✅ **1 Passed** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Click to see Existing Unit Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_text_type.py::test_item_titles` | 47.2μs | 8.06μs | 486%✅ |
| `partition/test_text_type.py::test_sentence_count` | 4.34μs | 1.81μs | 139%✅ |

</details>

<details>
<summary>🌀 Click to see Generated Regression Tests</summary>

```python
# imports
from unstructured.partition.text_type import sentence_count

# Basic Test Cases

def test_single_sentence():
    # Simple single sentence
    text = "This is a test sentence."
    codeflash_output = sentence_count(text)  # 20.1μs -> 2.52μs (697% faster)

def test_multiple_sentences():
    # Multiple sentences separated by periods
    text = "This is the first sentence. This is the second sentence. Here is a third."
    codeflash_output = sentence_count(text)  # 62.7μs -> 1.58μs (3868% faster)

def test_sentences_with_various_punctuation():
    # Sentences ending with different punctuation
    text = "Is this a question? Yes! It is."
    codeflash_output = sentence_count(text)  # 44.1μs -> 1.48μs (2879% faster)

def test_sentence_with_min_length_none():
    # min_length=None should count all sentences
    text = "Short. Another one."
    codeflash_output = sentence_count(text, min_length=None)  # 27.0μs -> 1.59μs (1595% faster)

def test_sentence_with_min_length():
    # Only sentences with at least min_length words are counted
    text = "Short. This is a long enough sentence."
    codeflash_output = sentence_count(text, min_length=4)  # 33.2μs -> 13.5μs (146% faster)

def test_sentence_with_min_length_exact():
    # Sentence with exactly min_length words should be counted
    text = "One two three four."
    codeflash_output = sentence_count(text, min_length=4)  # 10.1μs -> 5.04μs (99.5% faster)

# Edge Test Cases

def test_empty_string():
    # Empty string should return 0
    codeflash_output = sentence_count("")  # 5.30μs -> 1.04μs (409% faster)

def test_whitespace_only():
    # String with only whitespace should return 0
    codeflash_output = sentence_count(" ")  # 5.26μs -> 888ns (493% faster)

def test_no_sentence_punctuation():
    # Text with no sentence-ending punctuation is treated as one sentence by NLTK
    text = "This is just a run on sentence with no punctuation"
    codeflash_output = sentence_count(text)  # 8.34μs -> 1.13μs (638% faster)

def test_sentence_with_only_punctuation():
    # Sentences that are just punctuation should not be counted if min_length is set
    text = "!!! ... ???"
    codeflash_output = sentence_count(text, min_length=1)  # 79.0μs -> 7.59μs (940% faster)

def test_sentence_with_non_ascii_punctuation():
    # Sentences with Unicode punctuation
    text = "This is a test sentence。This is another!"
    # NLTK may not split these as sentences; check for at least 1
    codeflash_output = sentence_count(text)  # 10.9μs -> 1.13μs (871% faster)

def test_sentence_with_abbreviations():
    # Abbreviations should not split sentences incorrectly
    text = "Dr. Smith went to Washington. He arrived at 10 a.m. sharp."
    codeflash_output = sentence_count(text)  # 57.9μs -> 1.43μs (3959% faster)

def test_sentence_with_newlines():
    # Sentences separated by newlines
    text = "First sentence.\nSecond sentence!\n\nThird sentence?"
    codeflash_output = sentence_count(text)  # 43.2μs -> 1.34μs (3113% faster)

def test_sentence_with_multiple_spaces():
    # Sentences with irregular spacing
    text = "First sentence. Second sentence. "
    codeflash_output = sentence_count(text)  # 27.6μs -> 1.16μs (2282% faster)

def test_sentence_with_min_length_zero():
    # min_length=0 should count all sentences
    text = "A. B."
    codeflash_output = sentence_count(text, min_length=0)  # 27.7μs -> 1.38μs (1909% faster)

def test_sentence_with_min_length_greater_than_any_sentence():
    # All sentences are too short for min_length
    text = "A. B."
    codeflash_output = sentence_count(text, min_length=10)  # 5.47μs -> 6.16μs (11.2% slower)

def test_sentence_with_just_numbers():
    # Sentences that are just numbers
    text = "12345. 67890."
    codeflash_output = sentence_count(text)  # 31.7μs -> 1.29μs (2350% faster)

def test_sentence_with_only_punctuation_and_spaces():
    # Only punctuation and spaces
    text = " . . . "
    codeflash_output = sentence_count(text)  # 34.2μs -> 1.31μs (2502% faster)

def test_sentence_with_ellipsis():
    # Ellipsis should not break sentence count
    text = "Wait... what happened? I don't know..."
    codeflash_output = sentence_count(text)  # 44.7μs -> 1.36μs (3182% faster)

# Large Scale Test Cases

def test_large_number_of_sentences():
    # 1000 short sentences
    text = "Sentence. " * 1000
    codeflash_output = sentence_count(text)  # 8.26ms -> 23.5μs (35048% faster)

def test_large_text_with_long_sentences():
    # 500 sentences, each with 10 words
    sentence = "This is a sentence with exactly ten words."
    text = " ".join([sentence for _ in range(500)])
    codeflash_output = sentence_count(text)  # 4.11ms -> 17.3μs (23651% faster)

def test_large_text_min_length_filtering():
    # 1000 sentences, only half meet min_length
    short_sentence = "Short."
    long_sentence = "This is a sufficiently long sentence for testing."
    text = " ".join([short_sentence, long_sentence] * 500)
    codeflash_output = sentence_count(text, min_length=5)  # 8.78ms -> 1.15ms (664% faster)

def test_large_text_all_filtered():
    # All sentences filtered out by min_length
    sentence = "A."
    text = " ".join([sentence for _ in range(1000)])
    codeflash_output = sentence_count(text, min_length=3)  # 7.74ms -> 499μs (1450% faster)

# Regression/Mutation tests

def test_min_length_does_not_count_punctuation_as_word():
    # Punctuation-only tokens should not be counted as words
    text = "This . is . a . test."
    # Each "is .", "a .", "test." is a sentence, but only the last is a real sentence
    # NLTK will likely see this as one sentence
    codeflash_output = sentence_count(text, min_length=2)  # 52.5μs -> 7.96μs (560% faster)

def test_sentences_with_internal_periods():
    # Internal periods (e.g., in abbreviations) do not split sentences
    text = "This is Mr. Smith. He lives on St. Patrick's street."
    codeflash_output = sentence_count(text)  # 55.1μs -> 1.23μs (4371% faster)

def test_sentence_with_trailing_spaces_and_newlines():
    # Sentences with trailing spaces and newlines
    text = "First sentence. \nSecond sentence. \n"
    codeflash_output = sentence_count(text)  # 29.0μs -> 1.19μs (2337% faster)

def test_sentence_with_tabs():
    # Sentences separated by tabs
    text = "First sentence.\tSecond sentence."
    codeflash_output = sentence_count(text)  # 30.1μs -> 1.10μs (2645% faster)

def test_sentence_with_multiple_types_of_whitespace():
    # Sentences separated by various whitespace
    text = "First sentence.\n\t Second sentence.\r\nThird sentence."
    codeflash_output = sentence_count(text)  # 45.0μs -> 1.30μs (3373% faster)

def test_sentence_with_unicode_whitespace():
    # Sentences separated by Unicode whitespace
    text = "First sentence.\u2003Second sentence.\u2029Third sentence."
    codeflash_output = sentence_count(text)  # 47.4μs -> 1.24μs (3714% faster)

def test_sentence_with_emojis():
    # Sentences containing emojis
    text = "Hello world! 😀 How are you? 👍"
    codeflash_output = sentence_count(text)  # 47.4μs -> 1.16μs (3989% faster)

def test_sentence_with_quotes():
    # Sentences with quoted text
    text = "\"Hello,\" she said. 'How are you?'"
    codeflash_output = sentence_count(text)  # 41.7μs -> 1.07μs (3812% faster)

def test_sentence_with_parentheses():
    # Sentences with parentheses
    text = "This is a sentence (with parentheses). Here is another."
    codeflash_output = sentence_count(text)  # 31.5μs -> 1.25μs (2430% faster)

def test_sentence_with_brackets_and_braces():
    # Sentences with brackets and braces
    text = "This is [a test]. {Another one}."
    codeflash_output = sentence_count(text)  # 32.4μs -> 1.19μs (2624% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
# function to test
# For testing, we need to define the sentence_count function and its dependencies.
# We'll use the real NLTK sent_tokenize for realistic behavior.

# imports
from unstructured.partition.text_type import sentence_count

# Dummy trace_logger for completeness (no-op)
class DummyLogger:
    def detail(self, msg):
        pass

trace_logger = DummyLogger()

# unit tests
class TestSentenceCount:
    # --- Basic Test Cases ---

    def test_empty_string(self):
        # Should return 0 for empty string
        codeflash_output = sentence_count("")  # 747ns -> 1.25μs (40.0% slower)

    def test_single_sentence(self):
        # Should return 1 for a simple sentence
        codeflash_output = sentence_count("This is a test.")  # 10.2μs -> 1.09μs (834% faster)

    def test_multiple_sentences(self):
        # Should return correct count for multiple sentences
        codeflash_output = sentence_count(
            "This is a test. Here is another sentence. And a third one!"
        )  # 51.5μs -> 1.38μs (3625% faster)

    def test_sentences_with_varied_punctuation(self):
        # Should handle sentences ending with ! and ?
        codeflash_output = sentence_count(
            "Is this working? Yes! It is."
        )  # 43.1μs -> 1.18μs (3552% faster)

    def test_sentences_with_abbreviations(self):
        # Should not split on abbreviations like "Dr.", "Mr.", "e.g."
        text = "Dr. Smith went to Washington. He arrived at 10 a.m. sharp."
        # NLTK correctly splits into 2 sentences
        codeflash_output = sentence_count(text)  # 4.49μs -> 1.24μs (261% faster)

    def test_sentences_with_newlines(self):
        # Should handle newlines between sentences
        text = "First sentence.\nSecond sentence!\n\nThird sentence?"
        codeflash_output = sentence_count(text)  # 4.22μs -> 1.08μs (289% faster)

    def test_min_length_parameter(self):
        # Only sentences with >= min_length words should be counted
        text = "Short. This one is long enough. Ok."
        # Only "This one is long enough" has >= 4 words
        codeflash_output = sentence_count(text, min_length=4)  # 49.1μs -> 10.5μs (366% faster)

    def test_min_length_zero(self):
        # min_length=0 should count all sentences
        text = "A. B. C."
        codeflash_output = sentence_count(text, min_length=0)  # 43.5μs -> 1.42μs (2954% faster)

    def test_min_length_none(self):
        # min_length=None should count all sentences
        text = "A. B. C."
        codeflash_output = sentence_count(text, min_length=None)  # 2.09μs -> 1.28μs (63.4% faster)

    # --- Edge Test Cases ---

    def test_only_punctuation(self):
        # Only punctuation, no words
        codeflash_output = sentence_count("...!!!???")  # 33.4μs -> 1.27μs (2525% faster)

    def test_sentence_with_only_spaces(self):
        # Spaces only should yield 0
        codeflash_output = sentence_count(" ")  # 5.67μs -> 862ns (557% faster)

    def test_sentence_with_emoji_and_symbols(self):
        # Emojis and symbols should not count as sentences
        codeflash_output = sentence_count("😀 😂 🤔")  # 8.09μs -> 1.16μs (598% faster)

    def test_sentence_with_mixed_unicode(self):
        # Should handle unicode characters and punctuation
        text = "Café. Voilà! Привет мир. こんにちは世界。"
        # NLTK may split Japanese as one sentence, Russian as one, etc.
        # Let's check for at least 3 sentences (English, French, Russian)
        codeflash_output = sentence_count(text)
        count = codeflash_output  # 71.8μs -> 1.34μs (5243% faster)

    def test_sentence_with_no_sentence_endings(self):
        # No sentence-ending punctuation, should be one sentence
        text = "This is a sentence without ending punctuation"
        codeflash_output = sentence_count(text)  # 8.12μs -> 1.07μs (659% faster)

    def test_sentence_with_ellipses(self):
        # Ellipses should not break sentences
        text = "Wait... what happened? I don't know..."
        codeflash_output = sentence_count(text)  # 3.83μs -> 1.17μs (227% faster)

    def test_sentence_with_multiple_spaces_and_tabs(self):
        # Should handle excessive whitespace correctly
        text = "Sentence one. \t Sentence two. \n\n Sentence three."
        codeflash_output = sentence_count(text)  # 43.0μs -> 1.12μs (3753% faster)

    def test_sentence_with_numbers_and_periods(self):
        # Numbers with periods should not split sentences
        text = "The value is 3.14. Next sentence."
        codeflash_output = sentence_count(text)  # 32.3μs -> 1.15μs (2714% faster)

    def test_sentence_with_bullet_points(self):
        # Should not count bullets as sentences
        text = "- Item one\n- Item two\n- Item three"
        codeflash_output = sentence_count(text)  # 7.78μs -> 1.01μs (666% faster)

    def test_sentence_with_long_word_and_min_length(self):
        # One long word (no spaces) with min_length > 1 should not count
        codeflash_output = sentence_count(
            "Supercalifragilisticexpialidocious.", min_length=2
        )  # 11.3μs -> 7.04μs (59.9% faster)

    def test_sentence_with_repeated_punctuation(self):
        # Should not split on repeated punctuation without sentence-ending
        text = "Hello!!! How are you??? Fine..."
        codeflash_output = sentence_count(text)  # 48.3μs -> 1.22μs (3867% faster)

    def test_sentence_with_internal_periods(self):
        # Internal periods (e.g., URLs) should not split sentences
        text = "Check out www.example.com. This is a new sentence."
        codeflash_output = sentence_count(text)  # 31.0μs -> 1.22μs (2439% faster)

    def test_sentence_with_parentheses_and_quotes(self):
        text = 'He said, "Hello there." (And then he left.)'
        # Should count as two sentences
        codeflash_output = sentence_count(text)  # 41.6μs -> 1.18μs (3430% faster)

    # --- Large Scale Test Cases ---

    def test_large_text_many_sentences(self):
        # Test with 500 sentences
        text = "This is a sentence. " * 500
        codeflash_output = sentence_count(text)  # 3.91ms -> 13.9μs (28106% faster)

    def test_large_text_with_min_length(self):
        # 1000 sentences, but only every other one is long enough
        text = ""
        for i in range(1000):
            if i % 2 == 0:
                text += "Short. "
            else:
                text += "This sentence is long enough for the test. "
        # Only 500 sentences should meet min_length=5
        codeflash_output = sentence_count(text, min_length=5)  # 8.33ms -> 1.08ms (671% faster)

    def test_large_text_no_sentence_endings(self):
        # One very long sentence without punctuation
        text = " ".join(["word"] * 1000)
        codeflash_output = sentence_count(text)  # 31.3μs -> 3.09μs (913% faster)

    def test_large_text_all_too_short(self):
        # 1000 one-word sentences, min_length=2, should return 0
        text = ". ".join(["A"] * 1000) + "."
        codeflash_output = sentence_count(text, min_length=2)  # 538μs -> 502μs (7.18% faster)

    def test_large_text_all_counted(self):
        # 1000 sentences, all long enough
        text = "This is a valid sentence. " * 1000
        codeflash_output = sentence_count(text, min_length=4)  # 8.46ms -> 1.12ms (655% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
from unstructured.partition.text_type import sentence_count

def test_sentence_count():
    sentence_count("!", min_length=None)
```

</details>

<details>
<summary>⏪ Click to see Replay Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `test_benchmark6_py__replay_test_0.py::test_unstructured_partition_text_type_sentence_count` | 35.2μs | 20.5μs | 72.0%✅ |

</details>

<details>
<summary>🔎 Click to see Concolic Coverage Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-----------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `codeflash_concolic_jzsax6p2/tmpkbdw6p4k/test_concolic_coverage.py::test_sentence_count` | 10.8μs | 2.23μs | 385%✅ |

</details>

To edit these changes `git checkout codeflash/optimize-sentence_count-mjihf0yi` and push.

[](https://codeflash.ai)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
…y 266% (Unstructured-IO#4162) <!-- CODEFLASH_OPTIMIZATION: {"function":"_PartitionerLoader._load_partitioner","file":"unstructured/partition/auto.py","speedup_pct":"266%","speedup_x":"2.66x","original_runtime":"2.33 milliseconds","best_runtime":"635 microseconds","optimization_type":"memory","timestamp":"2025-12-20T13:16:17.303Z","version":"1.0"} -->

#### 📄 266% (2.66x) speedup for ***`_PartitionerLoader._load_partitioner` in `unstructured/partition/auto.py`***

⏱️ Runtime : **`2.33 milliseconds`** **→** **`635 microseconds`** (best of `250` runs)

#### 📝 Explanation and details

The optimization adds `@lru_cache(maxsize=128)` to the `dependency_exists` function, providing a **266% speedup** by eliminating redundant dependency checks.

**Key optimization:** The original code repeatedly calls `importlib.import_module()` for the same dependency packages during partition loading. Looking at the line profiler results, `dependency_exists` was called 659 times and spent 97.9% of its time (9.33ms out of 9.53ms) in `importlib.import_module()`. The optimized version reduces this to just 1.27ms total time for dependency checks.

**Why this works:** `importlib.import_module()` is expensive because it performs filesystem operations, module compilation, and import resolution. With caching, subsequent calls for the same dependency name return immediately from memory rather than re-importing. The cache size of 128 is sufficient for typical use cases where the same few dependencies are checked repeatedly.
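A minimal sketch of the cached helper, assuming `dependency_exists` boils down to an `importlib.import_module` attempt (the real helper in `unstructured` also inspects the error message, so treat this as illustrative rather than the library's exact code):

```python
from functools import lru_cache
import importlib

@lru_cache(maxsize=128)
def dependency_exists(dependency: str) -> bool:
    # First call per name pays the full import cost; later calls hit the cache.
    try:
        importlib.import_module(dependency)
        return True
    except ImportError:
        return False
```

Because `lru_cache` keys on the dependency name (a hashable string), the 500-dependency test below degenerates to one real import plus 499 cache hits.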
**Performance impact by test case:**

- **Massive gains** for scenarios with many dependencies: the test with 500 dependencies shows a **7166% speedup** (1.73ms → 23.9μs)
- **Modest slowdowns** for single-call scenarios: 0-25% slower due to caching overhead
- **Best suited for:** applications that load multiple partitioners or repeatedly validate the same dependencies

**Trade-offs:** Small memory overhead for the cache and a slight performance penalty for first-time dependency checks, but these are negligible compared to the gains in repeated usage scenarios.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **195 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations

import importlib
import sys
import types
from typing import Callable

# imports
import pytest
from typing_extensions import TypeAlias

from unstructured.partition.auto import _PartitionerLoader

Partitioner: TypeAlias = Callable[..., list]

class DummyElement:
    pass

# Dummy FileType class for testing
class FileType:
    def __init__(
        self,
        importable_package_dependencies,
        partitioner_function_name,
        partitioner_module_qname,
        extra_name,
        is_partitionable=True,
    ):
        self.importable_package_dependencies = importable_package_dependencies
        self.partitioner_function_name = partitioner_function_name
        self.partitioner_module_qname = partitioner_module_qname
        self.extra_name = extra_name
        self.is_partitionable = is_partitionable

# --- Helper functions for test setup ---

def create_fake_module(module_name, func_name, func):
    """Dynamically creates a module and injects it into sys.modules."""
    mod = types.ModuleType(module_name)
    setattr(mod, func_name, func)
    sys.modules[module_name] = mod
    return mod

def fake_partitioner(*args, **kwargs):
    return [DummyElement()]

# --- Basic Test Cases ---

def test_load_partitioner_basic_success():
    """Test loading a partitioner when all dependencies are present and everything is correct."""
    module_name = "test_partitioner_module.basic"
    func_name = "partition_func"
    create_fake_module(module_name, func_name, fake_partitioner)
    file_type = FileType(
        importable_package_dependencies=[],  # No dependencies
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="test",
        is_partitionable=True,
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    part_func = codeflash_output  # 6.38μs -> 6.08μs (4.80% faster)

def test_load_partitioner_with_single_dependency(monkeypatch):
    """Test loading a partitioner with a single dependency that exists."""
    module_name = "test_partitioner_module.singledep"
    func_name = "partition_func"
    create_fake_module(module_name, func_name, fake_partitioner)
    # Simulate dependency_exists returns True
    monkeypatch.setattr(
        "importlib.import_module",
        lambda name: types.SimpleNamespace() if name == "somepkg" else sys.modules[module_name],
    )
    file_type = FileType(
        importable_package_dependencies=["somepkg"],
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="test",
        is_partitionable=True,
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    part_func = codeflash_output  # 1.21μs -> 1.62μs (25.7% slower)

def test_load_partitioner_with_multiple_dependencies(monkeypatch):
    """Test loading a partitioner with multiple dependencies that all exist."""
    module_name = "test_partitioner_module.multidep"
    func_name = "partition_func"
    create_fake_module(module_name, func_name, fake_partitioner)
    # Simulate import_module returns dummy for all dependencies
    def import_module_side_effect(name):
        if name in ("pkgA", "pkgB"):
            return types.SimpleNamespace()
        return sys.modules[module_name]
    monkeypatch.setattr("importlib.import_module", import_module_side_effect)
    file_type = FileType(
        importable_package_dependencies=["pkgA", "pkgB"],
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="test",
        is_partitionable=True,
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    part_func = codeflash_output  # 1.42μs -> 1.67μs (14.9% slower)

def test_load_partitioner_returns_correct_function():
    """Test that the returned function is the actual partitioner function from the module."""
    module_name = "test_partitioner_module.correct_func"
    func_name = "partition_func"
    create_fake_module(module_name, func_name, fake_partitioner)
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="test",
        is_partitionable=True,
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    part_func = codeflash_output  # 7.29μs -> 7.25μs (0.579% faster)

# --- Edge Test Cases ---

def test_load_partitioner_missing_dependency(monkeypatch):
    """Test that ImportError is raised when a dependency is missing."""
    module_name = "test_partitioner_module.missingdep"
    func_name = "partition_func"
    create_fake_module(module_name, func_name, fake_partitioner)
    # Simulate dependency_exists returns False for missingpkg
    original_import_module = importlib.import_module
    def import_module_side_effect(name):
        if name == "missingpkg":
            raise ImportError("No module named 'missingpkg'")
        return original_import_module(name)
    monkeypatch.setattr("importlib.import_module", import_module_side_effect)
    file_type = FileType(
        importable_package_dependencies=["missingpkg"],
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="missing",
        is_partitionable=True,
    )
    loader = _PartitionerLoader()
    with pytest.raises(ImportError) as excinfo:
        loader._load_partitioner(file_type)  # 2.33μs -> 2.62μs (11.1% slower)

def test_load_partitioner_not_partitionable():
    """Test that an assertion is raised if file_type.is_partitionable is False."""
    module_name = "test_partitioner_module.notpartitionable"
    func_name = "partition_func"
    create_fake_module(module_name, func_name, fake_partitioner)
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="test",
        is_partitionable=False,
    )
    loader = _PartitionerLoader()
    with pytest.raises(AssertionError):
        loader._load_partitioner(file_type)  # 541ns -> 542ns (0.185% slower)

def test_load_partitioner_function_not_found():
    """Test that AttributeError is raised if the function is not in the module."""
    module_name = "test_partitioner_module.nofunc"
    func_name = "partition_func"
    # Create module without the function
    mod = types.ModuleType(module_name)
    sys.modules[module_name] = mod
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="test",
        is_partitionable=True,
    )
    loader = _PartitionerLoader()
    with pytest.raises(AttributeError):
        loader._load_partitioner(file_type)  # 8.38μs -> 8.38μs (0.000% faster)

def test_load_partitioner_module_not_found():
    """Test that ModuleNotFoundError is raised if the module does not exist."""
    module_name = "test_partitioner_module.doesnotexist"
    func_name = "partition_func"
    # Do not create the module
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="test",
        is_partitionable=True,
    )
    loader = _PartitionerLoader()
    with pytest.raises(ModuleNotFoundError):
        loader._load_partitioner(file_type)  # 101μs -> 103μs (1.86% slower)

def test_load_partitioner_many_dependencies(monkeypatch):
    """Test loading a partitioner with a large number of dependencies."""
    module_name = "test_partitioner_module.large"
    func_name = "partition_func"
    create_fake_module(module_name, func_name, fake_partitioner)
    dep_names = [f"pkg{i}" for i in range(100)]
    # Simulate import_module returns dummy for all dependencies
    def import_module_side_effect(name):
        if name in dep_names:
            return types.SimpleNamespace()
        return sys.modules[module_name]
    monkeypatch.setattr("importlib.import_module", import_module_side_effect)
    file_type = FileType(
        importable_package_dependencies=dep_names,
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="large",
        is_partitionable=True,
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    part_func = codeflash_output  # 45.9μs -> 56.2μs (18.4% slower)

def test_load_partitioner_many_calls(monkeypatch):
    """Test repeated calls to _load_partitioner with different modules and dependencies."""
    for i in range(50):
        module_name = f"test_partitioner_module.many_{i}"
        func_name = f"partition_func_{i}"
        def make_func(idx):
            return lambda *a, **k: [DummyElement(), idx]
        func = make_func(i)
        create_fake_module(module_name, func_name, func)
        dep_name = f"pkg_{i}"
        def import_module_side_effect(name):
            if name == dep_name:
                return types.SimpleNamespace()
            return sys.modules[module_name]
        monkeypatch.setattr("importlib.import_module", import_module_side_effect)
        file_type = FileType(
            importable_package_dependencies=[dep_name],
            partitioner_function_name=func_name,
            partitioner_module_qname=module_name,
            extra_name=f"many_{i}",
            is_partitionable=True,
        )
        loader = _PartitionerLoader()
        codeflash_output = loader._load_partitioner(file_type)
        part_func = codeflash_output  # 25.2μs -> 29.3μs (14.2% slower)

def test_load_partitioner_large_function_name():
    """Test loading a partitioner with a very long function name."""
    module_name = "test_partitioner_module.longfunc"
    func_name = "partition_func_" + "x" * 200
    create_fake_module(module_name, func_name, fake_partitioner)
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_function_name=func_name,
        partitioner_module_qname=module_name,
        extra_name="longfunc",
        is_partitionable=True,
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    part_func = codeflash_output  # 8.92μs -> 9.17μs (2.73% slower)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
from __future__ import annotations

import importlib
import sys
import types
from typing import Callable

# imports
import pytest
from typing_extensions import TypeAlias

from unstructured.partition.auto import _PartitionerLoader

Partitioner: TypeAlias = Callable[..., list]

class DummyElement:
    pass

# Minimal FileType stub for testing
class FileType:
    def __init__(
        self,
        importable_package_dependencies,
        partitioner_module_qname,
        partitioner_function_name,
        extra_name,
        is_partitionable=True,
    ):
        self.importable_package_dependencies = importable_package_dependencies
        self.partitioner_module_qname = partitioner_module_qname
        self.partitioner_function_name = partitioner_function_name
        self.extra_name = extra_name
        self.is_partitionable = is_partitionable

# --- Test Suite ---

# Helper: create a dummy partitioner function
def dummy_partitioner(*args, **kwargs):
    return [DummyElement()]

# Helper: create a dummy module with a partitioner function
def make_dummy_module(mod_name, func_name, func):
    mod = types.ModuleType(mod_name)
    setattr(mod, func_name, func)
    sys.modules[mod_name] = mod
    return mod

# Helper: remove dummy module from sys.modules after test
def remove_dummy_module(mod_name):
    if mod_name in sys.modules:
        del sys.modules[mod_name]

# 1. Basic Test Cases

def test_load_partitioner_success_single_dependency():
    """Should load partitioner when dependency exists and function is present."""
    mod_name = "dummy_mod1"
    func_name = "partition_func"
    make_dummy_module(mod_name, func_name, dummy_partitioner)
    file_type = FileType(
        importable_package_dependencies=[],  # No dependencies
        partitioner_module_qname=mod_name,
        partitioner_function_name=func_name,
        extra_name="dummy",
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    partitioner = codeflash_output  # 7.33μs -> 7.38μs (0.556% slower)
    remove_dummy_module(mod_name)

def test_load_partitioner_success_multiple_dependencies(monkeypatch):
    """Should load partitioner when all dependencies exist."""
    mod_name = "dummy_mod2"
    func_name = "partition_func"
    make_dummy_module(mod_name, func_name, dummy_partitioner)
    file_type = FileType(
        importable_package_dependencies=["sys", "types"],
        partitioner_module_qname=mod_name,
        partitioner_function_name=func_name,
        extra_name="dummy",
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    partitioner = codeflash_output  # 15.3μs -> 15.9μs (3.41% slower)
    remove_dummy_module(mod_name)

def test_load_partitioner_dependency_missing(monkeypatch):
    """Should raise ImportError if a dependency is missing."""
    mod_name = "dummy_mod3"
    func_name = "partition_func"
    make_dummy_module(mod_name, func_name, dummy_partitioner)
    file_type = FileType(
        importable_package_dependencies=["definitely_not_a_real_package_12345"],
        partitioner_module_qname=mod_name,
        partitioner_function_name=func_name,
        extra_name="dummy",
    )
    loader = _PartitionerLoader()
    with pytest.raises(ImportError) as excinfo:
        loader._load_partitioner(file_type)  # 72.8μs -> 73.4μs (0.851% slower)
    remove_dummy_module(mod_name)

def test_load_partitioner_function_missing():
    """Should raise AttributeError if the partitioner function is missing."""
    mod_name = "dummy_mod4"
    func_name = "not_present_func"
    make_dummy_module(mod_name, "some_other_func", dummy_partitioner)
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_module_qname=mod_name,
        partitioner_function_name=func_name,
        extra_name="dummy",
    )
    loader = _PartitionerLoader()
    with pytest.raises(AttributeError):
        loader._load_partitioner(file_type)  # 8.12μs -> 8.29μs (2.01% slower)
    remove_dummy_module(mod_name)

def test_load_partitioner_module_missing():
    """Should raise ModuleNotFoundError if the partitioner module does not exist."""
    mod_name = "definitely_not_a_real_module_12345"
    func_name = "partition_func"
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_module_qname=mod_name,
        partitioner_function_name=func_name,
        extra_name="dummy",
    )
    loader = _PartitionerLoader()
    with pytest.raises(ModuleNotFoundError):
        loader._load_partitioner(file_type)  # 61.2μs -> 61.3μs (0.271% slower)

def test_load_partitioner_not_partitionable():
    """Should raise AssertionError if file_type.is_partitionable is False."""
    mod_name = "dummy_mod5"
    func_name = "partition_func"
    make_dummy_module(mod_name, func_name, dummy_partitioner)
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_module_qname=mod_name,
        partitioner_function_name=func_name,
        extra_name="dummy",
        is_partitionable=False,
    )
    loader = _PartitionerLoader()
    with pytest.raises(AssertionError):
        loader._load_partitioner(file_type)  # 500ns -> 459ns (8.93% faster)
    remove_dummy_module(mod_name)

# 2. Edge Test Cases

def test_load_partitioner_empty_function_name():
    """Should raise AttributeError if function name is empty."""
    mod_name = "dummy_mod6"
    func_name = ""
    make_dummy_module(mod_name, "some_func", dummy_partitioner)
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_module_qname=mod_name,
        partitioner_function_name=func_name,
        extra_name="dummy",
    )
    loader = _PartitionerLoader()
    with pytest.raises(AttributeError):
        loader._load_partitioner(file_type)  # 8.08μs -> 8.33μs (2.99% slower)
    remove_dummy_module(mod_name)

def test_load_partitioner_dependency_name_in_error(monkeypatch):
    """Should only return False if ImportError is for the actual dependency."""
    # Patch importlib.import_module to raise ImportError with unrelated message
    orig_import_module = importlib.import_module
    def fake_import_module(name):
        raise ImportError("unrelated error")
    monkeypatch.setattr(importlib, "import_module", fake_import_module)
    monkeypatch.setattr(importlib, "import_module", orig_import_module)

# 3. Large Scale Test Cases

def test_load_partitioner_many_dependencies(monkeypatch):
    """Should handle a large number of dependencies efficiently."""
    # All dependencies are 'sys', which exists
    deps = ["sys"] * 500
    mod_name = "dummy_mod8"
    func_name = "partition_func"
    make_dummy_module(mod_name, func_name, dummy_partitioner)
    file_type = FileType(
        importable_package_dependencies=deps,
        partitioner_module_qname=mod_name,
        partitioner_function_name=func_name,
        extra_name="dummy",
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    partitioner = codeflash_output  # 1.73ms -> 23.9μs (7166% faster)
    remove_dummy_module(mod_name)

def test_load_partitioner_large_module_name(monkeypatch):
    """Should handle a very long module name (within sys.modules limit)."""
    mod_name = "dummy_mod_" + "x" * 200
    func_name = "partition_func"
    make_dummy_module(mod_name, func_name, dummy_partitioner)
    file_type = FileType(
        importable_package_dependencies=[],
        partitioner_module_qname=mod_name,
        partitioner_function_name=func_name,
        extra_name="dummy",
    )
    loader = _PartitionerLoader()
    codeflash_output = loader._load_partitioner(file_type)
    partitioner = codeflash_output  # 7.25μs -> 7.67μs (5.43% slower)
    remove_dummy_module(mod_name)

def test_load_partitioner_many_calls(monkeypatch):
    """Should remain correct and performant under repeated calls for different modules."""
    n = 50
    loader = _PartitionerLoader()
    for i in range(n):
        mod_name = f"dummy_mod_bulk_{i}"
        func_name = "partition_func"
        make_dummy_module(mod_name, func_name, dummy_partitioner)
        file_type = FileType(
            importable_package_dependencies=[],
            partitioner_module_qname=mod_name,
            partitioner_function_name=func_name,
            extra_name="dummy",
        )
        codeflash_output = loader._load_partitioner(file_type)
        partitioner = codeflash_output  # 194μs -> 195μs (0.832% slower)
        remove_dummy_module(mod_name)

def test_load_partitioner_function_returns_large_list():
    """Should not choke if partitioner returns a large list (scalability)."""
    def
big_partitioner(*args, **kwargs): return [DummyElement() for _ in range(900)] mod_name = "dummy_mod9" func_name = "partition_func" make_dummy_module(mod_name, func_name, big_partitioner) file_type = FileType( importable_package_dependencies=[], partitioner_module_qname=mod_name, partitioner_function_name=func_name, extra_name="dummy", ) loader = _PartitionerLoader() codeflash_output = loader._load_partitioner(file_type) partitioner = codeflash_output # 7.04μs -> 6.88μs (2.41% faster) result = partitioner() remove_dummy_module(mod_name) # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. ``` </details> To edit these changes `git checkout codeflash/optimize-_PartitionerLoader._load_partitioner-mjebngyb` and push. [](https://codeflash.ai)  --------- Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Aseem Saxena <aseem.bits@gmail.com> Co-authored-by: qued <64741807+qued@users.noreply.github.com>
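The dependency-check-then-load pattern exercised by the tests above can be sketched with stdlib `importlib` alone. This is a hedged sketch, not the library's actual `_PartitionerLoader` implementation; `load_partitioner` and its parameters are illustrative names:

```python
import importlib

def load_partitioner(dependencies, module_qname, function_name):
    """Import each declared dependency, then resolve the partitioner function.

    Mirrors the behaviors the tests assert: ImportError for a missing
    dependency, ModuleNotFoundError for a missing partitioner module, and
    AttributeError when the function is not defined on the module.
    """
    for dep in dependencies:
        importlib.import_module(dep)  # surfaces missing dependencies early
    module = importlib.import_module(module_qname)
    return getattr(module, function_name)  # AttributeError if absent

# Example: resolve a real stdlib function through the same mechanism
dumps = load_partitioner(["sys"], "json", "dumps")
print(dumps({"ok": True}))  # {"ok": true}
```

The same three failure modes fall out of `importlib` and `getattr` without any extra error handling, which is why the tests can assert on the exception types directly.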
…-IO#4163)

<!-- CODEFLASH_OPTIMIZATION: {"function":"detect_languages","file":"unstructured/partition/common/lang.py","speedup_pct":"5%","speedup_x":"0.05x","original_runtime":"133 milliseconds","best_runtime":"127 milliseconds","optimization_type":"general","timestamp":"2025-12-23T16:16:38.424Z","version":"1.0"} -->

#### 📄 5% (0.05x) speedup for ***`detect_languages` in `unstructured/partition/common/lang.py`***

⏱️ Runtime : **`133 milliseconds`** **→** **`127 milliseconds`** (best of `14` runs)

#### 📝 Explanation and details

The optimized code achieves a ~5% speedup through three targeted performance improvements:

## Key Optimizations

### 1. **LRU Cache for ISO639 Language Lookups**

The `iso639.Language.match()` call is expensive, consuming ~29% of `_get_iso639_language_object`'s time in the baseline. By wrapping it in `@lru_cache(maxsize=256)`, repeated lookups of the same language codes (common in real workloads) are served from cache instead of re-executing the match logic. The cache hit reduces lookup time from ~25μs to near-zero for cached entries.

**Impact:** The line profiler shows `_get_iso639_language_object` time dropping from 5.28ms to 4.34ms (18% faster). Test cases with repeated language codes see 20-55% improvements (e.g., `test_large_languages_list`: 54.7% faster).

### 2. **Precompiled Regex Pattern**

The ASCII detection regex `r"^[\x00-\x7F]+$"` was compiled on every call to `detect_languages()`. Moving it to module level (`_ASCII_RE`) eliminates repeated compilation overhead. The line profiler shows this path dropping from 1.66ms to 945μs (~43% faster) when the regex is evaluated.

**Impact:** Short ASCII text test cases show 20-33% speedups (e.g., `test_short_ascii_text_defaults_to_english`: 28.5% faster).

### 3. **Set-Based Deduplication**

The original code checked `if lang not in doc_languages` using list membership (O(n) per check). The optimized version maintains a parallel `set` for O(1) membership checks while preserving list order for output.
This is critical when `langdetect_result` returns multiple languages.

**Impact:** Minimal overhead for typical cases (<5 languages), but prevents O(n²) behavior for edge cases with many detected languages.

## Workload Context

Based on `function_references`, `detect_languages()` is called from `apply_lang_metadata()`, which:

- Processes **batches of document elements** (potentially hundreds per document)
- Calls `detect_languages()` once per element when `detect_language_per_element=True`, or per document otherwise

This makes the optimizations highly effective because:

- **Cache benefits compound**: The same language codes (e.g., "eng", "fra") are looked up repeatedly across elements
- **Regex precompilation scales**: Short text elements trigger the ASCII check frequently
- **Batch processing amplifies gains**: Even a 5% per-call improvement multiplies across document pipelines

## Test Case Patterns

- **User-supplied language tests** (20-55% faster): Benefit most from cached ISO639 lookups since they bypass langdetect
- **Short ASCII text tests** (20-33% faster): Benefit from the precompiled regex
- **Auto-detection tests** (2-10% faster): Benefit from all optimizations but are dominated by the slow `detect_langs()` library call (99.5% of runtime), limiting overall gains

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **28 Passed** |
| 🌀 Generated Regression Tests | ✅ **64 Passed** |
| ⏪ Replay Tests | ✅ **1 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 92.5% |

<details>
<summary>⚙️ Click to see Existing Unit Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:----------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/common/test_lang.py::test_detect_languages_english_auto` | 1.07ms | 926μs | 15.4%✅ |
| `partition/common/test_lang.py::test_detect_languages_english_provided` | 8.99μs | 4.51μs | 99.4%✅ |
| `partition/common/test_lang.py::test_detect_languages_gets_multiple_languages` | 5.47ms | 5.04ms | 8.46%✅ |
| `partition/common/test_lang.py::test_detect_languages_handles_spelled_out_languages` | 10.0μs | 6.19μs | 61.6%✅ |
| `partition/common/test_lang.py::test_detect_languages_korean_auto` | 267μs | 239μs | 11.7%✅ |
| `partition/common/test_lang.py::test_detect_languages_raises_TypeError_for_invalid_languages` | 1.62μs | 1.57μs | 3.64%✅ |
| `partition/common/test_lang.py::test_detect_languages_warns_for_auto_and_other_input` | 1.57ms | 1.44ms | 8.99%✅ |

</details>

<details>
<summary>🌀 Click to see Generated Regression Tests</summary>

```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests

from unstructured.partition.common.lang import detect_languages

# Dummy logger for test isolation (since the real logger is not available)
class DummyLogger:
    def debug(self, msg):
        pass

    def warning(self, msg):
        pass

logger = DummyLogger()

# Minimal TESSERACT_LANGUAGES_AND_CODES for test coverage
TESSERACT_LANGUAGES_AND_CODES = {
    "eng": "eng", "en": "eng",
    "fra": "fra", "fre": "fra", "fr": "fra",
    "spa": "spa", "es": "spa",
    "deu": "deu", "de": "deu",
    "zho": "zho", "zh": "zho", "chi": "zho",
    "kor": "kor", "ko": "kor",
    "rus": "rus", "ru": "rus",
    "ita": "ita", "it": "ita",
    "jpn": "jpn", "ja": "jpn",
}

# unit tests

# Basic Test Cases

def test_english_detection_auto():
    # Should detect English for a simple English sentence
    text = "This is a simple English sentence."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 1.08ms -> 940μs (15.0% faster)

def test_french_detection_auto():
    # Should detect French for a simple French sentence
    text = "Ceci est une phrase en français."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 1.00ms -> 912μs (9.74% faster)

def test_spanish_detection_auto():
    # Should detect Spanish for a simple Spanish sentence
    text = "Esta es una oración en español."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 777μs -> 714μs (8.77% faster)

def test_german_detection_auto():
    # Should detect German for a simple German sentence
    text = "Dies ist ein deutscher Satz."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 626μs -> 616μs (1.61% faster)

def test_chinese_detection_auto():
    # Should detect Chinese for a simple Chinese sentence
    text = "这是一个中文句子。"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 771μs -> 722μs (6.87% faster)

def test_korean_detection_auto():
    # Should detect Korean for a simple Korean sentence
    text = "이것은 한국어 문장입니다."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 272μs -> 260μs (4.76% faster)

def test_russian_detection_auto():
    # Should detect Russian for a simple Russian sentence
    text = "Это русское предложение."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 863μs -> 827μs (4.34% faster)

def test_japanese_detection_auto():
    # Should detect Japanese for a simple Japanese sentence
    text = "これは日本語の文です。"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 255μs -> 237μs (7.88% faster)

def test_user_supplied_languages():
    # Should return the user-supplied language codes in ISO 639-2/B format
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["eng"])
    result = codeflash_output  # 5.01μs -> 4.08μs (22.8% faster)

def test_user_supplied_multiple_languages():
    # Should return all valid user-supplied language codes
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["eng", "fra", "spa"])
    result = codeflash_output  # 3.74μs -> 3.18μs (17.8% faster)

def test_user_supplied_language_aliases():
    # Should convert aliases to ISO 639-2/B codes
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["en", "fr", "es"])
    result = codeflash_output  # 3.51μs -> 2.89μs (21.6% faster)

def test_user_supplied_language_mixed_case():
    # Should handle mixed-case language codes
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["EnG", "FrA"])
    result = codeflash_output  # 3.43μs -> 2.86μs (19.8% faster)

def test_auto_overrides_user_supplied():
    # Should ignore user-supplied languages if "auto" is present
    text = "Ceci est une phrase en français."
    codeflash_output = detect_languages(text, ["auto", "eng"])
    result = codeflash_output  # 1.78ms -> 1.65ms (8.18% faster)

def test_none_languages_defaults_to_auto():
    # Should default to auto if languages=None
    text = "Dies ist ein deutscher Satz."
    codeflash_output = detect_languages(text, None)
    result = codeflash_output  # 619μs -> 583μs (6.12% faster)

def test_short_ascii_text_defaults_to_english():
    # Should default to English for short ASCII text
    text = "Hi!"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 5.71μs -> 4.45μs (28.5% faster)

def test_short_ascii_text_with_spaces_defaults_to_english():
    # Should default to English for short ASCII text with spaces
    text = "Hi there"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 4.05μs -> 3.31μs (22.4% faster)

# Edge Test Cases

def test_empty_text_returns_none():
    # Should return None for empty text
    codeflash_output = detect_languages("")  # 751ns -> 747ns (0.535% faster)

def test_whitespace_text_returns_none():
    # Should return None for whitespace-only text
    codeflash_output = detect_languages("   ")  # 754ns -> 726ns (3.86% faster)

def test_languages_first_element_empty_string_returns_none():
    # Should return None if languages[0] == ""
    text = "Some text"
    codeflash_output = detect_languages(text, [""])  # 540ns -> 544ns (0.735% slower)

def test_non_list_languages_raises_type_error():
    # Should raise TypeError if languages is not a list
    with pytest.raises(TypeError):
        detect_languages("Some text", "eng")  # 1.20μs -> 1.23μs (2.20% slower)

def test_invalid_language_code_ignored():
    # Should ignore invalid language codes in user-supplied list
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["eng", "invalid_code"])
    result = codeflash_output  # 4.13μs -> 3.45μs (19.8% faster)

def test_only_invalid_language_codes_returns_empty_list():
    # Should return empty list if all user-supplied codes are invalid
    text = "Does not matter."
    codeflash_output = detect_languages(text, ["invalid1", "invalid2"])
    result = codeflash_output  # 3.93μs -> 2.91μs (35.0% faster)

def test_text_with_special_characters():
    # Should not default to English if text has special characters
    text = "niño año jalapeño"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 705μs -> 626μs (12.7% faster)

def test_text_with_multiple_languages():
    # Should detect multiple languages in text (order may vary)
    text = "This is English. Ceci est français. Esto es español."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 2.65ms -> 2.41ms (10.3% faster)

def test_text_with_chinese_variants_normalizes_to_zho():
    # Should normalize all Chinese variants to "zho"
    text = "这是中文。這是中文。這是中國話。"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 454μs -> 426μs (6.63% faster)

def test_text_with_unsupported_language_returns_none():
    # Should return None for gibberish text (langdetect fails)
    text = "asdfqwerzxcv"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 4.67μs -> 3.77μs (23.8% faster)

def test_text_with_numbers_and_symbols():
    # Should default to English for short ASCII text with numbers/symbols
    text = "1234!?"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 3.81μs -> 2.87μs (32.8% faster)

def test_text_with_long_ascii_non_english():
    # Should not default to English for long ASCII text that is not English
    text = "Ceci est une phrase en francais sans accents mais en francais"
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 1.36ms -> 1.27ms (6.90% faster)

def test_text_with_newlines_and_tabs():
    # Should handle text with newlines and tabs
    text = "This is English.\nCeci est français.\tEsto es español."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 2.48ms -> 2.29ms (8.09% faster)

# Large Scale Test Cases

def test_large_text_english():
    # Should detect English in a large English text
    text = " ".join(["This is a sentence."] * 500)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 8.37ms -> 8.16ms (2.51% faster)

def test_large_text_french():
    # Should detect French in a large French text
    text = " ".join(["Ceci est une phrase."] * 500)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.60ms -> 9.12ms (5.21% faster)

def test_large_text_mixed_languages():
    # Should detect multiple languages in a large mixed-language text
    text = ("This is English. " * 300) + ("Ceci est français. " * 300) + ("Esto es español. " * 300)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.71ms -> 9.30ms (4.33% faster)

def test_large_user_supplied_languages():
    # Should handle a large list of user-supplied languages (but only valid ones returned)
    text = "Does not matter."
    languages = ["eng"] * 500 + ["fra"] * 400 + ["invalid"] * 50
    codeflash_output = detect_languages(text, languages)
    result = codeflash_output  # 6.49μs -> 4.51μs (44.0% faster)

def test_large_text_with_special_characters():
    # Should detect Spanish in a large text with special characters
    text = "niño año jalapeño " * 500
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 8.76ms -> 8.27ms (5.91% faster)

def test_large_text_with_chinese_and_english():
    # Should detect both Chinese and English in a large mixed text
    text = ("This is English. " * 400) + ("这是中文。 " * 400)
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.67ms -> 9.38ms (3.15% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from langdetect import lang_detect_exception

from unstructured.partition.common.lang import detect_languages

# unit tests

# Basic Test Cases

def test_detect_languages_english_auto():
    # Basic: English text, auto detection
    codeflash_output = detect_languages("This is a simple English sentence.")
    result = codeflash_output  # 1.18ms -> 1.11ms (6.25% faster)

def test_detect_languages_french_auto():
    # Basic: French text, auto detection
    codeflash_output = detect_languages("Ceci est une phrase française simple.")
    result = codeflash_output  # 1.25ms -> 1.15ms (8.97% faster)

def test_detect_languages_spanish_auto():
    # Basic: Spanish text, auto detection
    codeflash_output = detect_languages("Esta es una oración en español.")
    result = codeflash_output  # 794μs -> 742μs (7.04% faster)

def test_detect_languages_user_input_single():
    # Basic: User provides a single valid language code
    codeflash_output = detect_languages("Some text", ["eng"])
    result = codeflash_output  # 6.02μs -> 4.75μs (26.8% faster)

def test_detect_languages_user_input_multiple():
    # Basic: User provides multiple valid language codes
    codeflash_output = detect_languages("Some text", ["eng", "fra"])
    result = codeflash_output  # 3.68μs -> 2.83μs (29.8% faster)

def test_detect_languages_user_input_nonstandard_code():
    # Basic: User provides a nonstandard but mapped language code
    # e.g. "en" maps to "eng" via iso639
    codeflash_output = detect_languages("Some text", ["en"])
    result = codeflash_output  # 3.68μs -> 2.80μs (31.3% faster)

def test_detect_languages_auto_overrides_user_input():
    # Basic: "auto" in languages overrides user input
    codeflash_output = detect_languages("Ceci est une phrase française simple.", ["auto", "eng"])
    result = codeflash_output  # 2.05ms -> 1.90ms (7.49% faster)

def test_detect_languages_short_ascii_text_defaults_to_english():
    # Basic: Short ASCII text should default to English
    codeflash_output = detect_languages("Hi!")
    result = codeflash_output  # 5.07μs -> 4.20μs (20.7% faster)

def test_detect_languages_short_non_ascii_text():
    # Basic: Short non-ASCII text should not default to English
    codeflash_output = detect_languages("¡Hola!")
    result = codeflash_output  # 3.21ms -> 2.94ms (9.05% faster)

# Edge Test Cases

def test_detect_languages_empty_text_returns_none():
    # Edge: Empty string should return None
    codeflash_output = detect_languages("")
    result = codeflash_output  # 759ns -> 750ns (1.20% faster)

def test_detect_languages_whitespace_text_returns_none():
    # Edge: Whitespace only should return None
    codeflash_output = detect_languages(" \n\t ")
    result = codeflash_output  # 932ns -> 808ns (15.3% faster)

def test_detect_languages_languages_empty_string_returns_none():
    # Edge: languages[0] == "" should return None
    codeflash_output = detect_languages("Some text", [""])
    result = codeflash_output  # 538ns -> 517ns (4.06% faster)

def test_detect_languages_languages_none_defaults_to_auto():
    # Edge: languages=None should act like ["auto"]
    codeflash_output = detect_languages("Bonjour tout le monde", None)
    result = codeflash_output  # 4.49μs -> 3.66μs (22.7% faster)

def test_detect_languages_invalid_languages_type_raises():
    # Edge: languages is not a list, should raise TypeError
    with pytest.raises(TypeError):
        detect_languages("Some text", "eng")  # 1.32μs -> 1.24μs (6.64% faster)

def test_detect_languages_invalid_language_code_skipped():
    # Edge: User provides an invalid code, should skip it
    codeflash_output = detect_languages("Some text", ["eng", "notacode"])
    result = codeflash_output  # 3.87μs -> 3.01μs (28.7% faster)

def test_detect_languages_mixed_valid_invalid_codes():
    # Edge: User provides mixed valid/invalid codes
    codeflash_output = detect_languages("Some text", ["eng", "fra", "badcode"])
    result = codeflash_output  # 3.60μs -> 2.79μs (29.0% faster)

def test_detect_languages_detect_langs_exception_returns_none(monkeypatch):
    # Edge: langdetect raises exception, should return None
    def raise_exception(text):
        raise lang_detect_exception.LangDetectException("No features in text.")

    monkeypatch.setattr("langdetect.detect_langs", raise_exception)
    codeflash_output = detect_languages("This will error out.")
    result = codeflash_output  # 3.63μs -> 3.12μs (16.3% faster)

def test_detect_languages_chinese_variant_normalization():
    # Edge: Chinese variants normalized to "zho"
    # "你好,世界" is Chinese
    codeflash_output = detect_languages("你好,世界")
    result = codeflash_output  # 2.06ms -> 1.92ms (7.65% faster)

def test_detect_languages_multiple_languages_in_text():
    # Edge: Mixed language text
    text = "Hello world. Bonjour le monde. Hola mundo."
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 3.92ms -> 3.70ms (5.93% faster)

def test_detect_languages_duplicate_chinese_not_repeated():
    # Edge: Multiple Chinese variants should not duplicate "zho"
    # Simulate langdetect returning zh-cn and zh-tw
    class DummyLangObj:
        def __init__(self, lang):
            self.lang = lang

    def fake_detect_langs(text):
        return [DummyLangObj("zh-cn"), DummyLangObj("zh-tw")]

    import langdetect

    monkeypatch = pytest.MonkeyPatch()
    monkeypatch.setattr(langdetect, "detect_langs", fake_detect_langs)
    codeflash_output = detect_languages("中文文本")
    result = codeflash_output  # 1.00ms -> 928μs (7.89% faster)
    monkeypatch.undo()

def test_detect_languages_non_ascii_short_text_not_default_eng():
    # Edge: Short non-ascii text should not default to English
    codeflash_output = detect_languages("你好")
    result = codeflash_output  # 1.37ms -> 1.26ms (8.34% faster)

def test_detect_languages_tesseract_code_mapping():
    # Edge: TESSERACT_LANGUAGES_AND_CODES mapping
    # For example, "chi_sim" should map to "zho"
    codeflash_output = detect_languages("Some text", ["chi_sim"])
    result = codeflash_output  # 4.56μs -> 3.45μs (32.0% faster)

# Large Scale Test Cases

def test_detect_languages_large_text_english():
    # Large: Large English text
    text = "This is a sentence. " * 500  # 500 sentences
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 8.32ms -> 8.13ms (2.36% faster)

def test_detect_languages_large_text_french():
    # Large: Large French text
    text = "Ceci est une phrase. " * 500
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.50ms -> 9.12ms (4.18% faster)

def test_detect_languages_large_text_mixed():
    # Large: Large mixed language text
    text = (
        "This is an English sentence. " * 333
        + "Ceci est une phrase française. " * 333
        + "Esta es una oración en español. " * 333
    )
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 9.10ms -> 8.79ms (3.48% faster)

def test_detect_languages_large_languages_list():
    # Large: User provides a large list of valid codes
    codes = ["eng", "fra", "spa", "deu", "ita", "por", "rus", "zho"] * 10  # 80 codes
    codeflash_output = detect_languages("Some text", codes)
    result = codeflash_output  # 6.75μs -> 4.37μs (54.7% faster)
    # Should contain all unique codes in iso639-3 form
    expected = ["eng", "fra", "spa", "deu", "ita", "por", "rus", "zho"]

def test_detect_languages_large_invalid_codes():
    # Large: User provides a large list of invalid codes
    codes = ["badcode" + str(i) for i in range(100)]
    codeflash_output = detect_languages("Some text", codes)
    result = codeflash_output  # 3.57μs -> 3.08μs (16.2% faster)

def test_detect_languages_performance_large_input():
    # Large: Performance with large input (under 1000 elements)
    text = "Hello world! " * 999
    codeflash_output = detect_languages(text)
    result = codeflash_output  # 14.5ms -> 13.7ms (5.79% faster)

def test_detect_languages_performance_large_languages_list():
    # Large: Performance with large languages list (under 1000 elements)
    codes = ["eng"] * 999
    codeflash_output = detect_languages("Some text", codes)
    result = codeflash_output  # 6.01μs -> 3.87μs (55.5% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>

<details>
<summary>⏪ Click to see Replay Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:-------------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `test_benchmark5_py__replay_test_0.py::test_unstructured_partition_common_lang_detect_languages` | 4.94ms | 4.78ms | 3.27%✅ |

</details>

To edit these changes `git checkout codeflash/optimize-detect_languages-mjisezcy` and push.
[](https://codeflash.ai)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
…ured-IO#4164)

<!-- CODEFLASH_OPTIMIZATION: {"function":"zoom_image","file":"unstructured/partition/utils/ocr_models/tesseract_ocr.py","speedup_pct":"12%","speedup_x":"0.12x","original_runtime":"18.1 milliseconds","best_runtime":"16.1 milliseconds","optimization_type":"memory","timestamp":"2025-12-19T03:24:39.274Z","version":"1.0"} -->

#### 📄 12% (0.12x) speedup for ***`zoom_image` in `unstructured/partition/utils/ocr_models/tesseract_ocr.py`***

⏱️ Runtime : **`18.1 milliseconds`** **→** **`16.1 milliseconds`** (best of `12` runs)

#### 📝 Explanation and details

The optimization removes unnecessary morphological operations (dilation followed by erosion) that were being performed with a 1x1 kernel. Since a 1x1 kernel has no effect on the image during dilation and erosion operations, these steps were pure computational overhead.

**Key changes:**
- Eliminated the creation of a 1x1 kernel (`np.ones((1, 1), np.uint8)`)
- Removed the `cv2.dilate()` and `cv2.erode()` calls that used this ineffective kernel
- Added explanatory comments about why these operations were removed

**Why this leads to speedup:**
The line profiler shows that the morphological operations consumed 27.7% of the total runtime (18.5% for dilation + 9.2% for erosion). A 1x1 kernel performs no actual morphological transformation - it's equivalent to applying the identity operation. Removing these no-op calls eliminates unnecessary OpenCV function overhead and memory operations.

**Performance impact based on function references:**
The `zoom_image` function is called within Tesseract OCR processing, specifically in `get_layout_from_image()` when text height falls outside optimal ranges. This optimization will improve OCR preprocessing performance, especially beneficial since OCR is typically a computationally intensive operation that may be called repeatedly on document processing pipelines.
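Why a 1x1 kernel is a no-op follows from the definition of grayscale dilation: each output pixel is the maximum over the kernel neighborhood, and a 1x1 neighborhood contains only the pixel itself (erosion is the analogous minimum). A stdlib-only sketch of that argument - not the cv2 implementation, just a naive reference dilation for illustration:

```python
def dilate(img, k):
    """Naive grayscale dilation: each pixel becomes the max over a k x k window."""
    h, w = len(img), len(img[0])
    r = k // 2  # kernel radius; k=1 gives r=0, i.e. the window is the pixel itself
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = max(window)
    return out

img = [[0, 10, 0], [10, 50, 10], [0, 10, 0]]
print(dilate(img, 1) == img)  # True: a 1x1 kernel is the identity
print(dilate(img, 3) == img)  # False: a 3x3 kernel actually spreads the bright center
```

Because dilation and erosion with a 1x1 kernel are both identities, removing the `dilate`/`erode` pair cannot change `zoom_image`'s output, which is why the correctness report above shows all tests passing.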
**Test case analysis:**
The optimization shows consistent 7-35% speedups across various test cases, with particularly strong gains for:
- Identity zoom operations (35.8% faster) - most common case where zoom=1
- Upscaling operations (21-32% faster) - when OCR requires image enlargement
- Large images (8-22% faster) - where the removed operations had more overhead

The optimization maintains identical visual output since the removed operations were mathematically no-ops, ensuring OCR accuracy is preserved while reducing processing time.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **27 Passed** |
| 🌀 Generated Regression Tests | ✅ **38 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---------------------------------------------------|:--------------|:---------------|:----------|
| `partition/pdf_image/test_ocr.py::test_zoom_image` | 707μs | 632μs | 11.9%✅ |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations

import numpy as np

# imports
from PIL import Image as PILImage

from unstructured.partition.utils.ocr_models.tesseract_ocr import zoom_image

# --------- UNIT TESTS ---------

# Helper function to create a simple RGB PIL image of given size and color
def make_image(size=(10, 10), color=(255, 0, 0)):
    img = PILImage.new("RGB", size, color)
    return img

# ---------------- BASIC TEST CASES ----------------

def test_zoom_identity():
    """Zoom factor 1 should return an image of the same size (but not necessarily the same object)."""
    img = make_image((20, 30), (123, 45, 67))
    codeflash_output = zoom_image(img, 1)
    out = codeflash_output  # 75.0μs -> 55.2μs (35.8% faster)
    # The pixel values may not be identical due to dilation/erosion, but should be very close
    diff = np.abs(np.array(out, dtype=int) - np.array(img, dtype=int))

def test_zoom_upscale():
    """Zoom factor >1 should increase image size proportionally."""
    img = make_image((10, 20), (0, 255, 0))
    codeflash_output = zoom_image(img, 2)
    out = codeflash_output  # 35.2μs -> 29.0μs (21.4% faster)
    # The output image should still be greenish
    arr = np.array(out)

def test_zoom_downscale():
    """Zoom factor <1 should decrease image size proportionally."""
    img = make_image((10, 10), (0, 0, 255))
    codeflash_output = zoom_image(img, 0.5)
    out = codeflash_output  # 25.3μs -> 21.6μs (17.1% faster)
    arr = np.array(out)

def test_zoom_non_integer_factor():
    """Non-integer zoom factors should produce correct output size."""
    img = make_image((8, 8), (100, 200, 50))
    codeflash_output = zoom_image(img, 1.5)
    out = codeflash_output  # 30.2μs -> 22.8μs (32.1% faster)

def test_zoom_no_side_effects():
    """The input image should not be modified."""
    img = make_image((5, 5), (10, 20, 30))
    img_before = np.array(img).copy()
    codeflash_output = zoom_image(img, 2)
    _ = codeflash_output  # 22.9μs -> 18.3μs (25.0% faster)

# ---------------- EDGE TEST CASES ----------------

def test_zoom_zero_factor():
    """Zoom factor 0 should be treated as 1 (no scaling)."""
    img = make_image((7, 13), (50, 100, 150))
    codeflash_output = zoom_image(img, 0)
    out = codeflash_output  # 24.6μs -> 20.0μs (23.2% faster)

def test_zoom_negative_factor():
    """Negative zoom factors should be treated as 1 (no scaling)."""
    img = make_image((12, 8), (200, 100, 50))
    codeflash_output = zoom_image(img, -2)
    out = codeflash_output  # 26.1μs -> 20.0μs (30.4% faster)

def test_zoom_large_factor_on_small_image():
    """Zooming a small image by a large factor should scale up."""
    img = make_image((2, 2), (42, 84, 126))
    codeflash_output = zoom_image(img, 10)
    out = codeflash_output  # 42.8μs -> 33.5μs (27.5% faster)

def test_zoom_non_rgb_image():
    """Function should work with grayscale images (converted to RGB)."""
    img = PILImage.new("L", (5, 5), 128)  # Grayscale
    img_rgb = img.convert("RGB")
    codeflash_output = zoom_image(img, 2)
    out = codeflash_output  # 31.0μs -> 25.7μs (20.8% faster)

def test_zoom_alpha_channel_image():
    """Function should ignore alpha channel and process as RGB."""
    img = PILImage.new("RGBA", (6, 6), (100, 150, 200, 128))
    img_rgb = img.convert("RGB")
    codeflash_output = zoom_image(img, 2)
    out = codeflash_output  # 28.0μs -> 24.9μs (12.6% faster)

def test_zoom_large_image_upscale():
    """Zooming a large image up should work and not crash."""
    img = make_image((500, 500), (10, 20, 30))
    codeflash_output = zoom_image(img, 1.5)
    out = codeflash_output  # 1.23ms -> 1.09ms (12.5% faster)
    # Check a corner pixel is still close to original color
    arr = np.array(out)

def test_zoom_large_image_downscale():
    """Zooming a large image down should work and not crash."""
    img = make_image((800, 600), (200, 100, 50))
    codeflash_output = zoom_image(img, 0.5)
    out = codeflash_output  # 942μs -> 923μs (2.03% faster)
    arr = np.array(out)

def test_zoom_maximum_allowed_size():
    """Test with the largest allowed image under 1000x1000."""
    img = make_image((999, 999), (1, 2, 3))
    codeflash_output = zoom_image(img, 1)
    out = codeflash_output  # 1.47ms -> 1.30ms (13.0% faster)
    arr = np.array(out)

def test_zoom_many_colors():
    """Test with an image with many colors (gradient)."""
    arr = np.zeros((100, 100, 3), dtype=np.uint8)
    for i in range(100):
        for j in range(100):
            arr[i, j] = [i * 2 % 256, j * 2 % 256, (i + j) % 256]
    img = PILImage.fromarray(arr)
    codeflash_output = zoom_image(img, 0.9)
    out = codeflash_output  # 112μs -> 97.0μs (16.3% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
from __future__ import annotations

# imports
import numpy as np
from PIL import Image as PILImage

from unstructured.partition.utils.ocr_models.tesseract_ocr import zoom_image

# --- Helper functions for tests ---


def create_test_image(size=(10, 10), color=(255, 0, 0), mode="RGB"):
    """Create a plain color PIL image for testing."""
    return PILImage.new(mode, size, color)


# --- Unit tests ---

# 1. Basic Test Cases


def test_zoom_identity():
    """Test zoom=1 returns image of same size and content is similar."""
    img = create_test_image((10, 10), (123, 222, 111))
    codeflash_output = zoom_image(img, 1)
    result = codeflash_output  # 57.2μs -> 53.3μs (7.43% faster)
    # The content may not be pixel-perfect due to cv2 conversion, but should be close
    arr_orig = np.array(img)
    arr_result = np.array(result)


def test_zoom_double_size():
    """Test zoom=2 increases both dimensions by 2x."""
    img = create_test_image((10, 5), (10, 20, 30))
    codeflash_output = zoom_image(img, 2)
    result = codeflash_output  # 38.6μs -> 30.6μs (26.3% faster)


def test_zoom_half_size():
    """Test zoom=0.5 reduces both dimensions by half (rounded)."""
    img = create_test_image((10, 6), (200, 100, 50))
    codeflash_output = zoom_image(img, 0.5)
    result = codeflash_output  # 29.6μs -> 25.4μs (16.7% faster)


def test_zoom_arbitrary_factor():
    """Test zoom=1.7 scales image correctly."""
    img = create_test_image((10, 10), (0, 255, 0))
    codeflash_output = zoom_image(img, 1.7)
    result = codeflash_output  # 30.3μs -> 23.8μs (27.3% faster)
    expected_size = (int(round(10 * 1.7)), int(round(10 * 1.7)))


# 2. Edge Test Cases


def test_zoom_zero():
    """Test zoom=0 is treated as 1 (no scaling)."""
    img = create_test_image((8, 8), (50, 50, 50))
    codeflash_output = zoom_image(img, 0)
    result = codeflash_output  # 26.3μs -> 23.1μs (13.7% faster)
    arr_orig = np.array(img)
    arr_result = np.array(result)


def test_zoom_negative():
    """Test negative zoom is treated as 1 (no scaling)."""
    img = create_test_image((7, 9), (100, 200, 50))
    codeflash_output = zoom_image(img, -3)
    result = codeflash_output  # 24.4μs -> 20.4μs (19.6% faster)
    arr_orig = np.array(img)
    arr_result = np.array(result)


def test_zoom_minimal_size():
    """Test 1x1 image with zoom=2 and zoom=0.5."""
    img = create_test_image((1, 1), (0, 0, 0))
    codeflash_output = zoom_image(img, 2)
    result_up = codeflash_output
    codeflash_output = zoom_image(img, 0.5)
    result_down = codeflash_output


def test_zoom_non_rgb_image():
    """Test grayscale and RGBA images."""
    # Grayscale
    img_gray = PILImage.new("L", (10, 10), 128)
    # Convert to RGB for function compatibility
    img_gray_rgb = img_gray.convert("RGB")
    codeflash_output = zoom_image(img_gray_rgb, 2)
    result_gray = codeflash_output  # 41.8μs -> 54.2μs (22.9% slower)
    # RGBA
    img_rgba = PILImage.new("RGBA", (10, 10), (10, 20, 30, 40))
    img_rgba_rgb = img_rgba.convert("RGB")
    codeflash_output = zoom_image(img_rgba_rgb, 0.5)
    result_rgba = codeflash_output  # 22.4μs -> 19.7μs (13.8% faster)


def test_zoom_non_integer_zoom():
    """Test zoom with non-integer floats."""
    img = create_test_image((9, 7), (10, 20, 30))
    codeflash_output = zoom_image(img, 1.333)
    result = codeflash_output  # 26.9μs -> 24.6μs (9.32% faster)
    expected_size = (int(9 * 1.333), int(7 * 1.333))


def test_zoom_unusual_aspect_ratio():
    """Test tall and wide images."""
    img_tall = create_test_image((3, 100), (1, 2, 3))
    codeflash_output = zoom_image(img_tall, 0.5)
    result_tall = codeflash_output  # 31.7μs -> 32.0μs (0.911% slower)
    img_wide = create_test_image((100, 3), (4, 5, 6))
    codeflash_output = zoom_image(img_wide, 0.5)
    result_wide = codeflash_output  # 21.8μs -> 24.0μs (9.20% slower)


def test_zoom_large_zoom_factor():
    """Test very large zoom factor (e.g., 20x)."""
    img = create_test_image((2, 2), (255, 255, 255))
    codeflash_output = zoom_image(img, 20)
    result = codeflash_output  # 33.6μs -> 26.0μs (29.1% faster)


def test_zoom_extreme_color_values():
    """Test image with extreme color values (black/white)."""
    img_black = create_test_image((5, 5), (0, 0, 0))
    img_white = create_test_image((5, 5), (255, 255, 255))
    codeflash_output = zoom_image(img_black, 1)
    result_black = codeflash_output  # 23.6μs -> 21.3μs (10.8% faster)
    codeflash_output = zoom_image(img_white, 1)
    result_white = codeflash_output  # 17.5μs -> 14.9μs (17.9% faster)


# 3. Large Scale Test Cases


def test_zoom_large_image_no_scale():
    """Test zoom=1 on a large image."""
    img = create_test_image((500, 400), (100, 150, 200))
    codeflash_output = zoom_image(img, 1)
    result = codeflash_output  # 300μs -> 274μs (9.51% faster)
    arr_orig = np.array(img)
    arr_result = np.array(result)


def test_zoom_large_image_upscale():
    """Test zoom=2 on a large image."""
    img = create_test_image((200, 300), (10, 20, 30))
    codeflash_output = zoom_image(img, 2)
    result = codeflash_output  # 446μs -> 415μs (7.60% faster)


def test_zoom_large_image_downscale():
    """Test zoom=0.5 on a large image."""
    img = create_test_image((800, 600), (50, 60, 70))
    codeflash_output = zoom_image(img, 0.5)
    result = codeflash_output  # 934μs -> 945μs (1.19% slower)


def test_zoom_large_non_square():
    """Test large non-square image with zoom=1.5."""
    img = create_test_image((333, 777), (123, 45, 67))
    codeflash_output = zoom_image(img, 1.5)
    result = codeflash_output  # 1.51ms -> 1.24ms (21.9% faster)
    expected_size = (int(333 * 1.5), int(777 * 1.5))


def test_zoom_maximum_allowed_size():
    """Test image at upper bound of allowed size (1000x1000)."""
    img = create_test_image((1000, 1000), (222, 111, 0))
    codeflash_output = zoom_image(img, 1)
    result = codeflash_output  # 1.81ms -> 1.66ms (8.62% faster)
    # Downscale
    codeflash_output = zoom_image(img, 0.1)
    result_down = codeflash_output  # 870μs -> 871μs (0.153% slower)
    # Upscale (should not exceed 1000*2=2000, which is still reasonable)
    codeflash_output = zoom_image(img, 2)
    result_up = codeflash_output  # 6.98ms -> 5.98ms (16.7% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>

To edit these changes `git checkout codeflash/optimize-zoom_image-mjcb2smb` and push.

[](https://codeflash.ai)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
…#4161)

<!-- CODEFLASH_OPTIMIZATION: {"function":"contains_verb","file":"unstructured/partition/text_type.py","speedup_pct":"8%","speedup_x":"0.08x","original_runtime":"890 milliseconds","best_runtime":"827 milliseconds","optimization_type":"loop","timestamp":"2025-12-23T16:34:05.083Z","version":"1.0"} -->

#### 📄 8% (0.08x) speedup for ***`contains_verb` in `unstructured/partition/text_type.py`***

⏱️ Runtime : **`890 milliseconds`** **→** **`827 milliseconds`** (best of `7` runs)

#### 📝 Explanation and details

The optimization achieves an **8% speedup** by replacing NLTK's sequential sentence-by-sentence POS tagging with batch processing using `pos_tag_sents`.

**What Changed:**

- **Batch POS tagging**: Instead of calling `_pos_tag()` individually for each sentence in a loop, the code now tokenizes all sentences first, then passes them together to `_pos_tag_sents()`. This single batched call processes all sentences at once.
- **List comprehension for flattening**: The nested loop that extended `parts_of_speech` is replaced with a list comprehension that flattens the result from `_pos_tag_sents()`.

**Why It's Faster:**

NLTK's `pos_tag()` performs setup overhead (model loading, context initialization) on each invocation. When processing multi-sentence text, calling it N times incurs N × overhead. By contrast, `pos_tag_sents()` performs this setup once and processes all sentences in a single batch, reducing overhead from O(N) to O(1). This is particularly effective for texts with multiple sentences.

**Impact Based on Context:**

The `contains_verb()` function is called from `is_possible_narrative_text()`, which appears to be in a document classification/partitioning pipeline. Given that this function checks for narrative text characteristics, it likely runs on many text segments during document processing.
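The per-sentence vs. batched overhead pattern described above can be illustrated without NLTK. The sketch below simulates a tagger whose setup cost is paid once per API call; `FakeTagger` and its trivial tagging rule are illustrative stand-ins, not part of unstructured or NLTK:

```python
class FakeTagger:
    """Simulates a tagger that incurs a fixed setup cost on every API call."""

    def __init__(self):
        self.setup_calls = 0  # counts how many times setup overhead was paid

    def _setup(self):
        self.setup_calls += 1  # stands in for model loading / context init

    def pos_tag(self, tokens):
        self._setup()  # overhead paid per sentence
        return [(tok, "VBZ" if tok.endswith("s") else "NN") for tok in tokens]

    def pos_tag_sents(self, sentences):
        self._setup()  # overhead paid once for the whole batch
        return [self.pos_tag(sent)[:] or [] for sent in sentences] if False else [
            [(tok, "VBZ" if tok.endswith("s") else "NN") for tok in sent] for sent in sentences
        ]


tagger = FakeTagger()
sentences = [["The", "cat", "runs"], ["The", "dog", "barks"], ["The", "sky"]]

# Sequential: N setup costs for N sentences
per_sentence = [pair for sent in sentences for pair in tagger.pos_tag(sent)]
sequential_setups = tagger.setup_calls

# Batched: one setup cost, then flatten with a list comprehension
tagger.setup_calls = 0
batched = [pair for sent_tags in tagger.pos_tag_sents(sentences) for pair in sent_tags]
batched_setups = tagger.setup_calls

assert per_sentence == batched          # identical tagging output
assert (sequential_setups, batched_setups) == (3, 1)  # N setups vs. one
```

The flattening list comprehension in the batched path mirrors the one the optimization introduces around `_pos_tag_sents()`.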
The optimization provides:

- **~9% speedup** for large-scale tests with many sentences (e.g., 200+ repeated sentences)
- **5-8% speedup** for typical multi-sentence inputs
- **Minimal/negative impact** on very short inputs (empty strings, single words) due to the overhead of creating intermediate lists, but these cases are typically cached via `@lru_cache`

The batch processing particularly benefits workloads where `is_possible_narrative_text()` processes longer text segments with multiple sentences, which is common in document partitioning tasks. Since the function is cached, the optimization's impact is most significant on cache misses with multi-sentence text.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **23 Passed** |
| 🌀 Generated Regression Tests | ✅ **108 Passed** |
| ⏪ Replay Tests | ✅ **8 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Click to see Existing Unit Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:--------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_text_type.py::test_contains_verb` | 435μs | 438μs | -0.586%⚠️ |

</details>

<details>
<summary>🌀 Click to see Generated Regression Tests</summary>

```python
from __future__ import annotations

from typing import Final, List

# imports
from unstructured.partition.text_type import contains_verb

POS_VERB_TAGS: Final[List[str]] = ["VB", "VBG", "VBD", "VBN", "VBP", "VBZ"]

# ---- UNIT TESTS ----

# Basic Test Cases


def test_simple_sentence_with_verb():
    # Checks a simple sentence with an obvious verb
    codeflash_output = contains_verb("The cat runs.")  # 203μs -> 193μs (5.46% faster)


def test_simple_sentence_without_verb():
    # Checks a sentence with no verb
    codeflash_output = contains_verb("The blue sky.")  # 130μs -> 124μs (5.04% faster)


def test_question_with_verb():
    # Checks a question containing a verb
    codeflash_output = contains_verb("Is this your book?")  # 95.0μs -> 92.5μs (2.73% faster)


def test_sentence_with_multiple_verbs():
    # Checks a sentence containing more than one verb
    codeflash_output = contains_verb("He jumped and ran.")  # 140μs -> 132μs (6.12% faster)


def test_sentence_with_verb_in_past_tense():
    # Checks a sentence with a past tense verb
    codeflash_output = contains_verb("She walked home.")  # 132μs -> 121μs (8.76% faster)


def test_sentence_with_verb_in_present_participle():
    # Checks a sentence with a present participle verb
    codeflash_output = contains_verb("The dog is barking.")  # 130μs -> 124μs (4.97% faster)


def test_sentence_with_verb_in_past_participle():
    # Checks a sentence with a past participle verb
    codeflash_output = contains_verb("The cake was eaten.")  # 125μs -> 121μs (4.06% faster)


def test_sentence_with_modal_verb():
    # Checks a sentence with a modal verb ("can" is not in POS_VERB_TAGS, but "run" is)
    codeflash_output = contains_verb("He can run.")  # 84.0μs -> 81.7μs (2.83% faster)


def test_sentence_with_no_alphabetic_characters():
    # Checks a string with only punctuation
    codeflash_output = contains_verb("!!!")  # 97.1μs -> 95.7μs (1.44% faster)


def test_sentence_with_numbers_only():
    # Checks a string with only numbers
    codeflash_output = contains_verb("1234567890")  # 87.6μs -> 82.4μs (6.32% faster)


# Edge Test Cases


def test_empty_string():
    # Checks empty input string
    codeflash_output = contains_verb("")  # 6.38μs -> 6.66μs (4.21% slower)


def test_whitespace_only():
    # Checks string with only whitespace
    codeflash_output = contains_verb(" ")  # 6.30μs -> 6.78μs (7.15% slower)


def test_uppercase_sentence_with_verb():
    # Checks that all-uppercase input is lowercased and verbs are detected
    codeflash_output = contains_verb("THE DOG BARKED.")  # 131μs -> 122μs (7.51% faster)


def test_uppercase_sentence_without_verb():
    # Checks that all-uppercase input with no verb returns False
    codeflash_output = contains_verb("THE BLUE SKY.")  # 123μs -> 116μs (5.93% faster)


def test_sentence_with_non_ascii_characters_and_verb():
    # Checks sentence with accented characters and a verb
    codeflash_output = contains_verb("Él corre rápido.")  # 144μs -> 145μs (0.863% slower)


def test_sentence_with_verb_as_ambiguous_word():
    # "Run" as a noun
    codeflash_output = contains_verb("He went for a run.")  # 88.4μs -> 87.2μs (1.38% faster)


def test_sentence_with_verb_as_ambiguous_word_verb_usage():
    # "Run" as a verb
    codeflash_output = contains_verb("He will run tomorrow.")  # 88.9μs -> 86.9μs (2.35% faster)


def test_sentence_with_abbreviation():
    # Checks sentence with abbreviation and verb
    codeflash_output = contains_verb("Dr. Smith arrived.")  # 136μs -> 132μs (3.40% faster)


def test_sentence_with_newlines_and_tab_characters():
    # Checks sentence with newlines and tabs
    codeflash_output = contains_verb(
        "The dog\nbarked.\tThe cat slept."
    )  # 236μs -> 220μs (7.22% faster)


def test_sentence_with_only_stopwords():
    # Checks sentence with only stopwords (no verbs)
    codeflash_output = contains_verb("and the but or")  # 34.5μs -> 33.4μs (3.27% faster)


def test_sentence_with_conjunctions_and_verb():
    # Checks sentence with conjunctions and a verb
    codeflash_output = contains_verb("And then he laughed.")  # 92.7μs -> 97.1μs (4.55% slower)


def test_sentence_with_special_characters_and_verb():
    # Checks sentence with special characters and a verb
    codeflash_output = contains_verb("@user replied!")  # 163μs -> 153μs (6.70% faster)


def test_sentence_with_url_and_verb():
    # Checks sentence with a URL and a verb
    codeflash_output = contains_verb(
        "Check https://example.com and see."
    )  # 217μs -> 206μs (5.12% faster)


def test_sentence_with_emoji_and_verb():
    # Checks sentence with emoji and a verb
    codeflash_output = contains_verb("She runs fast 🏃♀️.")  # 178μs -> 167μs (6.75% faster)


def test_sentence_with_unicode_and_no_verb():
    # Checks sentence with unicode and no verb
    codeflash_output = contains_verb("🍎🍏🍐")  # 72.7μs -> 70.9μs (2.50% faster)


def test_sentence_with_single_verb_only():
    # Checks a sentence that is just a verb
    codeflash_output = contains_verb("Run")  # 76.4μs -> 73.1μs (4.46% faster)


def test_sentence_with_single_noun_only():
    # Checks a sentence that is just a noun
    codeflash_output = contains_verb("Tree")  # 78.7μs -> 73.9μs (6.45% faster)


def test_sentence_with_verb_in_quotes():
    # Checks a verb inside quotes
    codeflash_output = contains_verb('"Run" is a verb.')  # 149μs -> 138μs (7.65% faster)


def test_sentence_with_parentheses_and_verb():
    # Checks a verb inside parentheses
    codeflash_output = contains_verb("He (runs) every day.")  # 92.4μs -> 89.8μs (2.91% faster)


def test_sentence_with_dash_and_verb():
    # Checks a sentence with a dash and a verb
    codeflash_output = contains_verb("He - runs.")  # 80.6μs -> 81.4μs (1.02% slower)


def test_sentence_with_multiple_sentences_and_one_verb():
    # Checks multiple sentences, only one has a verb
    codeflash_output = contains_verb("The blue sky. The cat runs.")  # 252μs -> 248μs (1.88% faster)


def test_sentence_with_multiple_sentences_no_verbs():
    # Checks multiple sentences, none have verbs
    codeflash_output = contains_verb("The blue sky. The red car.")  # 199μs -> 195μs (1.93% faster)


def test_sentence_with_number_and_verb():
    # Checks sentence with number and verb
    codeflash_output = contains_verb("There are 5 cats.")  # 88.4μs -> 86.2μs (2.54% faster)


def test_sentence_with_number_and_no_verb():
    # Checks sentence with number and no verb
    codeflash_output = contains_verb("5 cats.")  # 76.5μs -> 74.9μs (2.11% faster)


def test_sentence_with_plural_noun_no_verb():
    # Checks plural noun with no verb
    codeflash_output = contains_verb("Cats.")  # 77.7μs -> 74.4μs (4.52% faster)


def test_sentence_with_verb_and_compound_noun():
    # Checks sentence with compound noun and verb
    codeflash_output = contains_verb("The ice-cream melts.")  # 130μs -> 130μs (0.354% faster)


# Large Scale Test Cases


def test_large_text_with_many_verbs():
    # Checks a long text with many verbs
    text = " ".join(["The dog runs. The cat jumps. The bird flies." for _ in range(200)])
    codeflash_output = contains_verb(text)  # 51.3ms -> 47.0ms (9.18% faster)


def test_large_text_with_no_verbs():
    # Checks a long text with no verbs
    text = " ".join(["The blue sky. The red car. The green grass." for _ in range(200)])
    codeflash_output = contains_verb(text)  # 46.4ms -> 42.5ms (9.19% faster)


def test_large_text_with_verbs_in_middle():
    # Checks a long text with verbs only in the middle
    text = (
        " ".join(["The blue sky." for _ in range(100)])
        + " The cat ran. "
        + " ".join(["The green grass." for _ in range(100)])
    )
    codeflash_output = contains_verb(text)  # 17.0ms -> 16.1ms (5.72% faster)


def test_large_text_with_uppercase_and_verbs():
    # Checks a long uppercase text with verbs
    text = " ".join(["THE DOG RAN. THE CAT JUMPED. THE BIRD FLEW." for _ in range(200)])
    codeflash_output = contains_verb(text)  # 51.6ms -> 47.1ms (9.56% faster)


def test_large_text_with_mixed_case_and_verbs():
    # Checks a long text with mixed case and verbs
    text = "The dog ran. " * 500 + "the cat slept. " * 500
    codeflash_output = contains_verb(text)  # 83.5ms -> 77.5ms (7.64% faster)


def test_large_text_with_numbers_and_no_verbs():
    # Checks a long text with only numbers and no verbs
    text = "1234567890 " * 1000
    codeflash_output = contains_verb(text)  # 32.3ms -> 31.0ms (4.08% faster)


def test_large_text_with_emojis_and_no_verbs():
    # Checks a long text with only emojis and no verbs
    text = "😀😃😄😁😆😅😂🤣☺️ 😊 " * 100
    codeflash_output = contains_verb(text)  # 2.24ms -> 2.20ms (1.97% faster)


def test_large_text_with_verbs_and_special_characters():
    # Checks a long text with verbs and special characters
    text = "He runs! @user replied. #hashtag " * 300
    codeflash_output = contains_verb(text)  # 57.6ms -> 52.8ms (9.10% faster)


def test_large_text_all_uppercase_no_verbs():
    # Checks a long uppercase text with no verbs
    text = ("THE BLUE SKY. THE RED CAR. " * 400).strip()
    codeflash_output = contains_verb(text)  # 55.7ms -> 52.2ms (6.80% faster)


def test_large_text_with_sentences_and_newlines():
    # Checks a long text with newlines and verbs
    text = "\n".join(["The dog barked." for _ in range(300)])
    codeflash_output = contains_verb(text)  # 26.0ms -> 24.0ms (8.08% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
import pytest  # used for our unit tests

from unstructured.partition.text_type import contains_verb  # function to test

# (Assume the code for pos_tag and contains_verb is as given in the prompt.)

# --- Basic Test Cases ---


def test_contains_verb_simple_sentence():
    # Basic sentence with a single verb
    codeflash_output = contains_verb("The cat sleeps.")  # 153μs -> 169μs (8.96% slower)


def test_contains_verb_multiple_verbs():
    # Sentence with multiple verbs
    codeflash_output = contains_verb(
        "She runs and jumps every morning."
    )  # 144μs -> 140μs (2.87% faster)


def test_contains_verb_no_verb():
    # Sentence with no verbs
    codeflash_output = contains_verb("The blue sky.")  # 128μs -> 123μs (4.15% faster)


def test_contains_verb_question():
    # Question form with a verb
    codeflash_output = contains_verb("Is this your book?")  # 98.0μs -> 94.5μs (3.77% faster)


def test_contains_verb_negative_sentence():
    # Sentence with negation
    codeflash_output = contains_verb("He does not like apples.")  # 142μs -> 142μs (0.153% slower)


def test_contains_verb_verb_ing():
    # Sentence with present participle verb
    codeflash_output = contains_verb("Running is fun.")  # 136μs -> 127μs (7.00% faster)


def test_contains_verb_past_tense():
    # Sentence with past tense verb
    codeflash_output = contains_verb("He walked home.")  # 133μs -> 125μs (6.28% faster)


def test_contains_verb_passive_voice():
    # Passive voice sentence
    codeflash_output = contains_verb("The cake was eaten.")  # 129μs -> 124μs (3.86% faster)


def test_contains_verb_uppercase_text():
    # Text in uppercase, should be normalized
    codeflash_output = contains_verb("THE DOG BARKED.")  # 120μs -> 111μs (8.03% faster)


def test_contains_verb_mixed_case_text():
    # Mixed case, should work
    codeflash_output = contains_verb("tHe CaT SlePt.")  # 151μs -> 147μs (3.01% faster)


# --- Edge Test Cases ---


def test_contains_verb_empty_string():
    # Empty string input
    codeflash_output = contains_verb("")  # 6.85μs -> 7.21μs (4.95% slower)


def test_contains_verb_whitespace_only():
    # String with only whitespace
    codeflash_output = contains_verb(" ")  # 6.69μs -> 6.93μs (3.43% slower)


def test_contains_verb_non_english():
    # Non-English text (should return False as no English verbs)
    codeflash_output = contains_verb("これは日本語の文です。")  # 91.3μs -> 88.4μs (3.33% faster)


def test_contains_verb_numbers_and_symbols():
    # String with only numbers and symbols
    codeflash_output = contains_verb("12345 !@#$%")  # 177μs -> 180μs (1.75% slower)


def test_contains_verb_one_word_noun():
    # Single noun word
    codeflash_output = contains_verb("Table")  # 78.6μs -> 72.2μs (8.81% faster)


def test_contains_verb_one_word_verb():
    # Single verb word
    codeflash_output = contains_verb("Run")  # 74.7μs -> 73.2μs (2.02% faster)


def test_contains_verb_command():
    # Imperative/command sentence
    codeflash_output = contains_verb("Sit!")  # 73.2μs -> 76.4μs (4.14% slower)


def test_contains_verb_sentence_with_url():
    # Sentence containing a URL
    codeflash_output = contains_verb(
        "Visit https://example.com for more info."
    )  # 254μs -> 244μs (4.09% faster)


def test_contains_verb_sentence_with_abbreviation():
    # Sentence containing abbreviations
    codeflash_output = contains_verb("Dr. Smith arrived.")  # 129μs -> 129μs (0.051% slower)


def test_contains_verb_sentence_with_apostrophe():
    # Sentence with contractions
    codeflash_output = contains_verb("He can't go.")  # 93.0μs -> 91.8μs (1.22% faster)


def test_contains_verb_sentence_with_quotes():
    # Sentence with quoted verb
    codeflash_output = contains_verb('He said, "Run!"')  # 134μs -> 132μs (2.13% faster)


def test_contains_verb_sentence_with_parentheses():
    # Sentence with verb inside parentheses
    codeflash_output = contains_verb("The dog (barked) loudly.")  # 159μs -> 166μs (4.20% slower)


def test_contains_verb_sentence_with_no_alpha():
    # String with no alphabetic characters
    codeflash_output = contains_verb("1234567890")  # 75.7μs -> 75.5μs (0.327% faster)


def test_contains_verb_sentence_with_newlines():
    # Sentence with newlines
    codeflash_output = contains_verb("The dog\nbarked.")  # 120μs -> 109μs (9.95% faster)


def test_contains_verb_sentence_with_tabs():
    # Sentence with tabs
    codeflash_output = contains_verb("The\tdog\tbarked.")  # 114μs -> 104μs (9.09% faster)


def test_contains_verb_sentence_with_multiple_sentences():
    # Multiple sentences, at least one with a verb
    codeflash_output = contains_verb(
        "The sky. The dog barked. The tree."
    )  # 276μs -> 260μs (5.88% faster)


def test_contains_verb_sentence_with_multiple_sentences_no_verbs():
    # Multiple sentences, none with verbs
    codeflash_output = contains_verb(
        "The sky. The tree. The mountain."
    )  # 229μs -> 220μs (4.43% faster)


def test_contains_verb_sentence_with_hyphenated_words():
    # Sentence with hyphenated words and a verb
    codeflash_output = contains_verb(
        "The well-known actor performed."
    )  # 163μs -> 165μs (0.896% slower)


def test_contains_verb_sentence_with_non_ascii_chars():
    # Sentence with accented characters and a verb
    codeflash_output = contains_verb("José runs every day.")  # 124μs -> 123μs (1.38% faster)


def test_contains_verb_sentence_with_emojis():
    # Sentence with emojis and a verb
    codeflash_output = contains_verb("He runs 🏃♂️ every day.")  # 126μs -> 127μs (1.02% slower)


def test_contains_verb_sentence_with_verb_as_noun():
    # Word that can be both noun and verb, used as noun
    codeflash_output = contains_verb("The run was long.")  # 127μs -> 135μs (6.02% slower)


def test_contains_verb_sentence_with_verb_as_noun_and_verb():
    # Word that can be both noun and verb, used as verb
    codeflash_output = contains_verb("They run every day.")  # 83.9μs -> 76.5μs (9.70% faster)


# --- Large Scale Test Cases ---


def test_contains_verb_large_text_with_verbs():
    # Large text (about 1000 words) with verbs scattered throughout
    text = " ".join(["He runs."] * 500 + ["The cat sleeps."] * 500)
    codeflash_output = contains_verb(text)  # 68.4ms -> 62.7ms (9.04% faster)


def test_contains_verb_large_text_no_verbs():
    # Large text (about 1000 words) with no verbs
    text = " ".join(["The mountain."] * 1000)
    codeflash_output = contains_verb(text)  # 57.4ms -> 53.2ms (7.83% faster)


def test_contains_verb_large_text_mixed():
    # Large text with verbs only in the last sentence
    text = " ".join(["The mountain."] * 999 + ["He runs."])
    codeflash_output = contains_verb(text)  # 57.8ms -> 53.1ms (8.73% faster)


def test_contains_verb_large_text_all_uppercase():
    # Large uppercase text with verbs, should normalize
    text = " ".join(["THE DOG BARKED."] * 1000)
    codeflash_output = contains_verb(text)  # 85.5ms -> 78.6ms (8.74% faster)


def test_contains_verb_large_text_with_newlines():
    # Large text with newlines separating sentences
    text = "\n".join(["He runs."] * 1000)
    codeflash_output = contains_verb(text)  # 53.3ms -> 49.7ms (7.36% faster)


def test_contains_verb_large_text_with_numbers_and_symbols():
    # Large text with numbers, symbols, and a single verb sentence
    text = "12345 !@#$% " * 999 + "He runs."
    codeflash_output = contains_verb(text)  # 78.4ms -> 73.0ms (7.37% faster)


def test_contains_verb_large_text_all_nouns():
    # Large text with only nouns
    text = " ".join(["Table"] * 1000)
    codeflash_output = contains_verb(text)  # 27.4ms -> 27.0ms (1.51% faster)


def test_contains_verb_large_text_all_verbs():
    # Large text with only verbs
    text = " ".join(["Run"] * 1000)
    codeflash_output = contains_verb(text)  # 25.5ms -> 24.8ms (2.85% faster)


# --- Mutation Testing Cases (to catch subtle bugs) ---


@pytest.mark.parametrize(
    "text,expected",
    [
        ("run", True),  # verb, lower case
        ("RUN", True),  # verb, upper case
        ("Running", True),  # verb, gerund
        ("RAN", True),  # verb, past tense
        ("", False),  # empty
        (" ", False),  # whitespace
        ("Table", False),  # noun
        ("Table run", True),  # noun and verb
        ("The", False),  # article
        ("quickly", False),  # adverb
        ("quickly run", True),  # adverb + verb
        ("run quickly", True),  # verb + adverb
        ("He", False),  # pronoun
        ("He runs", True),  # pronoun + verb
        ("He run", True),  # pronoun + verb (incorrect grammar but verb present)
        ("He is", True),  # verb 'is'
        ("He was", True),  # verb 'was'
        ("He be", True),  # verb 'be'
        ("He been", True),  # verb 'been'
        ("He being", True),  # verb 'being'
        ("He am", True),  # verb 'am'
        ("He are", True),  # verb 'are'
    ],
)
def test_contains_verb_parametrized(text, expected):
    # Parametrized test for common verb forms and edge cases
    codeflash_output = contains_verb(text)  # 1.07ms -> 1.05ms (2.21% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
import pytest

from unstructured.partition.text_type import contains_verb


def test_contains_verb():
    with pytest.raises(
        SideEffectDetected,
        match='We\'ve\\ blocked\\ a\\ file\\ writing\\ operation\\ on\\ "/tmp/z0fmgvet"\\.\\ It\'s\\ dangerous\\ to\\ run\\ CrossHair\\ on\\ code\\ with\\ side\\ effects\\.\\ To\\ allow\\ this\\ operation\\ anyway,\\ use\\ "\\-\\-unblock=open:/tmp/z0fmgvet:None:655554"\\.\\ \\(or\\ some\\ colon\\-delimited\\ prefix\\)',
    ):
        contains_verb("🄰")
```

</details>

<details>
<summary>⏪ Click to see Replay Tests</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:--------------------------------------------------------------------------------------------|:--------------|:---------------|:----------|
| `test_benchmark5_py__replay_test_0.py::test_unstructured_partition_text_type_contains_verb` | 3.19ms | 3.08ms | 3.40%✅ |

</details>

To edit these changes `git checkout codeflash/optimize-contains_verb-mjit1e7b` and push.

[](https://codeflash.ai)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Fixes ~15-18% performance regression introduced in 20251230 where f-strings were evaluated eagerly even when logging was disabled.

See: pdfminer/pdfminer.six#1233
Fix: pdfminer/pdfminer.six#1234

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Restores PDF parsing performance by updating dependency and releasing a new dev version.
>
> - **Deps:** Upgrade `pdfminer-six` from `20251230` to `20260107` in `requirements/extra-pdf-image.txt` to fix ~15–18% slowdown from eager f-string evaluation in logging
> - **Release:** Bump `__version__` to `0.18.27-dev5` and add CHANGELOG entry under *Enhancement*
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 3dfed88. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
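This class of regression is easy to reproduce with the standard library: an f-string is rendered before the logger decides to drop the record, while `%`-style arguments are only converted to strings if a handler actually emits the message. A minimal illustration (not pdfminer's actual code; `Costly` is a made-up stand-in for an object that is expensive to format):

```python
import logging

logging.basicConfig(level=logging.WARNING)  # DEBUG records are discarded
log = logging.getLogger("demo")


class Costly:
    """Object whose string conversion is the expensive part."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1  # stands in for expensive formatting work
        return "rendered"


obj = Costly()

# Eager: the f-string renders obj even though the record is then discarded
log.debug(f"value={obj}")
assert obj.str_calls == 1

# Lazy: %-style defers str(obj) until the record is known to be emitted,
# so with DEBUG disabled the conversion never runs
log.debug("value=%s", obj)
assert obj.str_calls == 1  # unchanged: no extra formatting happened
```

The upgraded pdfminer-six release applies the same deferral, which is why the slowdown disappears when logging is disabled.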
…ctured-IO#4165)

<!-- CODEFLASH_OPTIMIZATION: {"function":"get_bbox_thickness","file":"unstructured/partition/pdf_image/analysis/bbox_visualisation.py","speedup_pct":"1,267%","speedup_x":"12.67x","original_runtime":"5.01 milliseconds","best_runtime":"367 microseconds","optimization_type":"general","timestamp":"2025-12-20T01:04:43.833Z","version":"1.0"} -->

#### 📄 1,267% (12.67x) speedup for ***`get_bbox_thickness` in `unstructured/partition/pdf_image/analysis/bbox_visualisation.py`***

⏱️ Runtime : **`5.01 milliseconds`** **→** **`367 microseconds`** (best of `250` runs)

#### 📝 Explanation and details

The optimization replaces `np.polyfit` with direct linear interpolation, achieving a **13x speedup** by eliminating unnecessary computational overhead.

**Key Optimization:**

- **Removed `np.polyfit`**: The original code used NumPy's polynomial fitting for a simple linear interpolation between two points, which is computationally expensive
- **Direct linear interpolation**: Replaced with manual slope calculation: `slope = (max_value - min_value) / (ratio_for_max_value - ratio_for_min_value)`

**Why This is Faster:**

- `np.polyfit` performs general polynomial regression using least squares, involving matrix operations and SVD decomposition - overkill for two points
- Direct slope calculation requires only basic arithmetic operations (subtraction and division)
- Line profiler shows the `np.polyfit` line consumed 91.7% of execution time (10.67ms out of 11.64ms total)

**Performance Impact:**

The function is called from `draw_bbox_on_image`, which processes bounding boxes for PDF image visualization. Since this appears to be in a rendering pipeline that could process many bounding boxes per page, the 13x speedup significantly improves visualization performance. Test results show consistent 12-13x improvements across all scenarios, from single bbox calls (~25μs → ~2μs) to batch processing of 100 random bboxes (1.6ms → 116μs).
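For a fit through exactly two points, the least-squares machinery behind `np.polyfit` reduces to a closed-form line. The sketch below shows the substitution; the function name, default anchor points, and thickness bounds are illustrative, not the project's exact signature:

```python
def linear_thickness(
    ratio: float,
    ratio_for_min_value: float = 0.0,  # ratio at which min thickness applies (assumed)
    ratio_for_max_value: float = 1.0,  # ratio at which max thickness applies (assumed)
    min_value: float = 1.0,
    max_value: float = 4.0,
) -> float:
    """Linear interpolation between two (ratio, thickness) points, clamped to bounds."""
    # Two points determine the line exactly: no matrix ops or SVD needed
    slope = (max_value - min_value) / (ratio_for_max_value - ratio_for_min_value)
    value = min_value + slope * (ratio - ratio_for_min_value)
    # Clamp so out-of-range ratios still yield a sensible thickness
    return max(min_value, min(max_value, value))


assert linear_thickness(0.0) == 1.0   # at the min anchor
assert linear_thickness(1.0) == 4.0   # at the max anchor
assert linear_thickness(0.5) == 2.5   # halfway between
assert linear_thickness(2.0) == 4.0   # clamped above the max anchor
```

`np.polyfit(x, y, 1)` on the same two points would return the identical slope and intercept, just after a general-purpose least-squares solve.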
**Optimization Benefits:** - **Small bboxes**: 1329% faster (basic cases) - **Large bboxes**: 1283% faster - **Batch processing**: 1297% faster for 100 random bboxes - **Scale-intensive workloads**: 1341% faster for processing 1000+ bboxes This optimization is particularly valuable for PDF processing workflows where many bounding boxes need thickness calculations for visualization. ✅ **Correctness verification report:** | Test | Status | | --------------------------- | ----------------- | | ⚙️ Existing Unit Tests | ✅ **8 Passed** | | 🌀 Generated Regression Tests | ✅ **285 Passed** | | ⏪ Replay Tests | 🔘 **None Found** | | 🔎 Concolic Coverage Tests | 🔘 **None Found** | |📊 Tests Coverage | 100.0% | <details> <summary>⚙️ Existing Unit Tests and Runtime</summary> | Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup | |:----------------------------------------------------------------|:--------------|:---------------|:----------| | `partition/pdf_image/test_analysis.py::test_get_bbox_thickness` | 75.5μs | 5.58μs | 1252%✅ | </details> <details> <summary>🌀 Generated Regression Tests and Runtime</summary> ```python # imports import pytest # used for our unit tests from unstructured.partition.pdf_image.analysis.bbox_visualisation import get_bbox_thickness # unit tests # ---------- BASIC TEST CASES ---------- def test_basic_small_bbox_returns_min_thickness(): # Small bbox on a normal page should return min_thickness bbox = (10, 10, 20, 20) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 30.4μs -> 2.12μs (1329% faster) def test_basic_large_bbox_returns_max_thickness(): # Large bbox close to page size should return max_thickness bbox = (0, 0, 950, 950) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 27.1μs -> 1.96μs (1283% faster) def test_basic_medium_bbox_returns_intermediate_thickness(): # Medium bbox should return a value between min and max 
bbox = (100, 100, 500, 500) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 25.4μs -> 1.88μs (1256% faster) def test_basic_custom_min_max_thickness(): # Test with custom min and max thickness bbox = (0, 0, 500, 500) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=2, max_thickness=8) result = codeflash_output # 25.5μs -> 2.00μs (1175% faster) # ---------- EDGE TEST CASES ---------- def test_zero_area_bbox(): # Bbox with zero area (x1==x2 and y1==y2) should return min_thickness bbox = (100, 100, 100, 100) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 25.2μs -> 1.92μs (1214% faster) def test_bbox_exceeds_page_size(): # Bbox larger than page should still clamp to max_thickness bbox = (-100, -100, 1200, 1200) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 25.0μs -> 1.83μs (1264% faster) def test_negative_coordinates_bbox(): # Bbox with negative coordinates should still work bbox = (-10, -10, 20, 20) page_size = (100, 100) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 25.0μs -> 1.92μs (1205% faster) def test_min_equals_max_thickness(): # If min_thickness == max_thickness, always return that value bbox = (0, 0, 1000, 1000) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=3, max_thickness=3) result = codeflash_output # 24.9μs -> 2.04μs (1119% faster) def test_page_size_zero_raises(): # Page size of zero should raise ZeroDivisionError bbox = (0, 0, 10, 10) page_size = (0, 0) with pytest.raises(ZeroDivisionError): get_bbox_thickness(bbox, page_size) # 1.96μs -> 1.88μs (4.43% faster) def test_bbox_on_line(): # Bbox that's a line (x1==x2 or y1==y2) should return min_thickness bbox = (10, 10, 10, 100) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, 
page_size) result = codeflash_output # 25.4μs -> 2.04μs (1143% faster) def test_min_thickness_greater_than_max_thickness(): # If min_thickness > max_thickness, function should clamp to min_thickness bbox = (0, 0, 1000, 1000) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=5, max_thickness=2) result = codeflash_output # 24.9μs -> 2.00μs (1146% faster) # ---------- LARGE SCALE TEST CASES ---------- def test_many_bboxes_scaling(): # Test with 1000 bboxes of increasing size page_size = (1000, 1000) min_thickness, max_thickness = 1, 8 for i in range(1, 1001, 100): # 10 steps to keep runtime reasonable bbox = (0, 0, i, i) codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness, max_thickness) result = codeflash_output # 181μs -> 12.9μs (1307% faster) def test_large_page_and_bbox(): # Test with large page and bbox values bbox = (0, 0, 999_999, 999_999) page_size = (1_000_000, 1_000_000) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 24.2μs -> 2.08μs (1064% faster) def test_randomized_bboxes(): # Test with random bboxes within a page, ensure all results in bounds import random page_size = (1000, 1000) min_thickness, max_thickness = 1, 4 for _ in range(100): x1 = random.randint(0, 900) y1 = random.randint(0, 900) x2 = random.randint(x1, 1000) y2 = random.randint(y1, 1000) bbox = (x1, y1, x2, y2) codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness, max_thickness) result = codeflash_output # 1.64ms -> 117μs (1297% faster) def test_performance_large_number_of_calls(): # Ensure function does not degrade with many calls (not a timing test, just functional) page_size = (500, 500) for i in range(1, 1001, 100): # 10 steps bbox = (0, 0, i, i) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 173μs -> 12.7μs (1264% faster) # ---------- ADDITIONAL EDGE CASES ---------- def test_bbox_with_float_coordinates(): # Non-integer coordinates should 
still work (since function expects int, but let's see) bbox = (0.0, 0.0, 500.0, 500.0) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(tuple(map(int, bbox)), page_size) result = codeflash_output # 24.0μs -> 1.88μs (1178% faster) def test_bbox_equal_to_page(): # Bbox exactly same as page should return max_thickness bbox = (0, 0, 1000, 1000) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 23.8μs -> 1.83μs (1200% faster) def test_bbox_minimal_size(): # Bbox of size 1x1 should return min_thickness bbox = (10, 10, 11, 11) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) result = codeflash_output # 23.9μs -> 1.88μs (1176% faster) # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. ``` ```python # imports import pytest # used for our unit tests from unstructured.partition.pdf_image.analysis.bbox_visualisation import get_bbox_thickness # unit tests # ---------------------- BASIC TEST CASES ---------------------- def test_basic_small_bbox_min_thickness(): # Very small bbox compared to page, should get min_thickness bbox = (10, 10, 20, 20) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) # 24.1μs -> 1.88μs (1184% faster) def test_basic_large_bbox_max_thickness(): # Very large bbox, nearly the page size, should get max_thickness bbox = (0, 0, 900, 900) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) # 23.9μs -> 1.79μs (1235% faster) def test_basic_middle_bbox(): # Bbox size between min and max, should interpolate bbox = (100, 100, 500, 500) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) thickness = codeflash_output # 23.9μs -> 1.83μs (1205% faster) def test_basic_non_square_bbox(): # Non-square bbox, checks diagonal calculation bbox = (10, 10, 110, 410) page_size = (1000, 1000) codeflash_output = 
get_bbox_thickness(bbox, page_size) thickness = codeflash_output # 24.0μs -> 1.83μs (1207% faster) def test_basic_custom_thickness_range(): # Custom min/max thickness values bbox = (0, 0, 500, 500) page_size = (1000, 1000) codeflash_output = get_bbox_thickness( bbox, page_size, min_thickness=2, max_thickness=8 ) # 24.0μs -> 1.92μs (1155% faster) # ---------------------- EDGE TEST CASES ---------------------- def test_edge_bbox_zero_size(): # Zero-area bbox, should always return min_thickness bbox = (100, 100, 100, 100) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) # 24.0μs -> 1.83μs (1209% faster) def test_edge_bbox_full_page(): # Bbox covers the whole page, should return max_thickness bbox = (0, 0, 1000, 1000) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) # 23.9μs -> 1.83μs (1205% faster) def test_edge_bbox_negative_coordinates(): # Bbox with negative coordinates, still valid diagonal bbox = (-50, -50, 50, 50) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) thickness = codeflash_output # 23.9μs -> 1.83μs (1203% faster) def test_edge_bbox_larger_than_page(): # Bbox larger than page, should clamp to max_thickness bbox = (-100, -100, 1200, 1200) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) # 23.8μs -> 1.79μs (1228% faster) def test_edge_min_greater_than_max(): # min_thickness > max_thickness, should always return min_thickness (clamped) bbox = (0, 0, 1000, 1000) page_size = (1000, 1000) codeflash_output = get_bbox_thickness( bbox, page_size, min_thickness=5, max_thickness=2 ) # 24.1μs -> 1.92μs (1156% faster) def test_edge_zero_page_size(): # Page size zero, should raise ZeroDivisionError bbox = (0, 0, 10, 10) page_size = (0, 0) with pytest.raises(ZeroDivisionError): get_bbox_thickness(bbox, page_size) # 1.88μs -> 1.75μs (7.14% faster) def test_edge_bbox_on_page_border(): # Bbox on the edge of the page, not exceeding bounds bbox = (0, 
0, 1000, 10) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) thickness = codeflash_output # 24.8μs -> 2.00μs (1138% faster) def test_edge_non_integer_bbox_and_page(): # Bbox and page_size with float values, should still work bbox = (0.0, 0.0, 500.5, 500.5) page_size = (1000.0, 1000.0) codeflash_output = get_bbox_thickness(bbox, page_size) thickness = codeflash_output # 23.9μs -> 1.54μs (1448% faster) def test_edge_bbox_swapped_coordinates(): # Bbox with x2 < x1 or y2 < y1, negative width/height bbox = (100, 100, 50, 50) page_size = (1000, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) thickness = codeflash_output # 24.3μs -> 1.96μs (1143% faster) # ---------------------- LARGE SCALE TEST CASES ---------------------- def test_large_scale_many_bboxes(): # Test many bboxes on a large page page_size = (10000, 10000) for i in range(1, 1001, 100): # 10 iterations, up to 1000 bbox = (i, i, i + 100, i + 100) codeflash_output = get_bbox_thickness(bbox, page_size) thickness = codeflash_output # 177μs -> 12.3μs (1341% faster) def test_large_scale_increasing_bbox_size(): # Test increasing bbox sizes from tiny to almost page size page_size = (1000, 1000) for size in range(1, 1001, 100): bbox = (0, 0, size, size) codeflash_output = get_bbox_thickness(bbox, page_size) thickness = codeflash_output # 173μs -> 12.7μs (1263% faster) # Should be monotonic non-decreasing if size > 1: codeflash_output = get_bbox_thickness((0, 0, size - 100, size - 100), page_size) prev_thickness = codeflash_output def test_large_scale_random_bboxes(): # Generate 100 random bboxes and check thickness is in range import random page_size = (1000, 1000) for _ in range(100): x1 = random.randint(0, 900) y1 = random.randint(0, 900) x2 = random.randint(x1, 1000) y2 = random.randint(y1, 1000) bbox = (x1, y1, x2, y2) codeflash_output = get_bbox_thickness(bbox, page_size) thickness = codeflash_output # 1.63ms -> 116μs (1296% faster) def 
test_large_scale_extreme_aspect_ratios(): # Very thin or very flat bboxes page_size = (1000, 1000) # Very thin vertical bbox = (500, 0, 501, 1000) codeflash_output = get_bbox_thickness(bbox, page_size) # 23.8μs -> 1.88μs (1167% faster) # Very thin horizontal bbox = (0, 500, 1000, 501) codeflash_output = get_bbox_thickness(bbox, page_size) # 18.3μs -> 1.38μs (1230% faster) def test_large_scale_varied_thickness_range(): # Test with large min/max thickness range page_size = (1000, 1000) for size in range(1, 1001, 200): bbox = (0, 0, size, size) codeflash_output = get_bbox_thickness(bbox, page_size, min_thickness=10, max_thickness=100) thickness = codeflash_output # 93.3μs -> 7.17μs (1202% faster) # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. ``` </details> To edit these changes `git checkout codeflash/optimize-get_bbox_thickness-mjdlipbj` and push. [](https://codeflash.ai)  --------- Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Aseem Saxena <aseem.bits@gmail.com> Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: Alan Bertl <alan@unstructured.io>
…ured-IO#4169) This PR fixes an issue where elements whose extracted text only partially fills them were marked as extracted.

## Bug scenario

This PR adds a new unit test to showcase the scenario:
- During merging of inferred and extracted layouts, the function `aggregate_embedded_text_by_block` aggregates extracted text that falls into an inferred element; if all of that text has the flag `is_extracted` set to `"true"`, the inferred element is marked as such as well.
- However, the extracted text may only partially fill the inferred element: there can be text in the inferred element's region that is not present as extracted text (i.e., it requires OCR). The current logic would still mark this inferred element as `is_extracted = "true"`.

## Fix

The fix adds another check in `aggregate_embedded_text_by_block`: whether the intersection over union between the source regions and the target region crosses a given threshold. This new check correctly identifies that, in the new unit test, the inferred element should be marked as `is_extracted = "false"`.
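A minimal sketch of such an IoU check (the `box_iou` helper and the 0.9 threshold are illustrative assumptions; the actual implementation compares the aggregated source regions against the target region):

```python
def box_iou(a, b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union in [0, 1].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

inferred = (0, 0, 100, 100)      # inferred layout element
extracted = (0, 0, 100, 40)      # extracted text covers only part of it

# With a hypothetical threshold of 0.9, partial coverage no longer
# marks the element as fully extracted.
assert box_iou(inferred, extracted) == 0.4
assert not (box_iou(inferred, extracted) > 0.9)
```

Without the IoU gate, the element above would have been flagged `is_extracted = "true"` even though 60% of its area still needs OCR.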
…` by 68% (Unstructured-IO#4166) <!-- CODEFLASH_OPTIMIZATION: {"function":"clean_extra_whitespace_with_index_run","file":"unstructured/cleaners/core.py","speedup_pct":"68%","speedup_x":"0.68x","original_runtime":"3.74 milliseconds","best_runtime":"2.22 milliseconds","optimization_type":"loop","timestamp":"2025-12-23T05:49:45.872Z","version":"1.0"} --> #### 📄 68% (0.68x) speedup for ***`clean_extra_whitespace_with_index_run` in `unstructured/cleaners/core.py`*** ⏱️ Runtime : **`3.74 milliseconds`** **→** **`2.22 milliseconds`** (best of `19` runs) #### 📝 Explanation and details The optimized code achieves a **68% speedup** through two key changes that eliminate expensive operations in the main loop: ## What Changed 1. **Character replacement optimization**: Replaced `re.sub(r"[\xa0\n]", " ", text)` with `text.translate()` using a translation table. This avoids regex compilation and pattern matching for simple character substitutions. 2. **Main loop optimization**: Eliminated two `re.match()` calls per iteration by: - Pre-computing character comparisons (`c_orig = text_chars[original_index]`) - Using set membership (`c_orig in ws_chars`) instead of regex matching - Direct character comparison (`c_clean == ' '`) instead of regex ## Why It's Faster Looking at the line profiler data, the original code spent **15.4% of total time** (10.8% + 4.6%) on regex matching inside the loop: - `bool(re.match("[\xa0\n]", text[original_index]))` - 7.12ms (10.8%) - `bool(re.match(" ", cleaned_text[cleaned_index]))` - 3.02ms (4.6%) The optimized version replaces these with: - Set membership check: `c_orig in ws_chars` - 1.07ms (1.4%) - Direct comparison: `c_clean == ' '` (included in same line) **Result**: Regex overhead is eliminated, saving ~9ms per 142 invocations in the benchmark. 
## Performance Profile The annotated tests show the optimization excels when: - **Large inputs with whitespace**: `test_large_leading_and_trailing_whitespace` shows 291% speedup (203μs → 52.1μs) - **Many consecutive whitespace characters**: `test_large_mixed_whitespace_everywhere` shows 297% speedup (189μs → 47.8μs) - **Mixed whitespace types** (spaces, newlines, nbsp): `test_edge_all_whitespace_between_words` shows 47.9% speedup Small inputs with minimal whitespace see minor regressions (~5-17% slower) due to setup overhead, but these are negligible in absolute terms (< 2μs difference). ## Impact on Production Workloads The function is called in `_process_pdfminer_pages()` during PDF text extraction, processing **every text snippet on every page**. Given that PDFs often contain: - Multiple spaces/tabs between words - Newlines from paragraph breaks - Non-breaking spaces from formatting This optimization will provide substantial cumulative benefits when processing large documents with hundreds of pages, as the per-snippet savings compound across the entire document. 
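The two techniques can be sketched in isolation: a precomputed translation table for character substitution, and set membership instead of per-character regex matching. Names here are illustrative, not the library's internals:

```python
import re

# Precomputed once at module import, like the optimized cleaner does.
_WS_TABLE = str.maketrans({"\xa0": " ", "\n": " "})
_WS_CHARS = {" ", "\xa0", "\n"}

def normalize_ws(text):
    # One C-level pass replaces re.sub(r"[\xa0\n]", " ", text).
    return text.translate(_WS_TABLE)

text = "Hello\xa0world\nagain"
assert normalize_ws(text) == "Hello world again"
assert normalize_ws(text) == re.sub(r"[\xa0\n]", " ", text)

# Per-character set membership replaces bool(re.match("[\xa0\n ]", ch)):
for ch in text:
    assert (ch in _WS_CHARS) == bool(re.match("[\xa0\n ]", ch))
```

Both replacements are semantically identical to the regex versions; they simply move the work out of the interpreter's regex engine and into constant-time string and set operations.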
✅ **Correctness verification report:** | Test | Status | | --------------------------- | ----------------- | | ⚙️ Existing Unit Tests | 🔘 **None Found** | | 🌀 Generated Regression Tests | ✅ **45 Passed** | | ⏪ Replay Tests | ✅ **16 Passed** | | 🔎 Concolic Coverage Tests | ✅ **1 Passed** | |📊 Tests Coverage | 100.0% | <details> <summary>🌀 Click to see Generated Regression Tests</summary> ```python from __future__ import annotations # imports from unstructured.cleaners.core import clean_extra_whitespace_with_index_run # unit tests # --- BASIC TEST CASES --- def test_basic_single_spaces(): # No extra whitespace, should remain unchanged text = "Hello world" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 8.95μs -> 9.71μs (7.88% slower) def test_basic_multiple_spaces(): # Multiple spaces between words should be reduced to one text = "Hello world" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 11.0μs -> 10.00μs (10.0% faster) def test_basic_newlines_and_nbsp(): # Newlines and non-breaking spaces replaced with single space text = "Hello\n\xa0world" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 12.8μs -> 10.2μs (25.2% faster) def test_basic_leading_and_trailing_spaces(): # Leading and trailing spaces should be stripped text = " Hello world " cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 10.4μs -> 9.88μs (5.62% faster) def test_basic_only_spaces(): # Only spaces should return an empty string text = " " cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 6.10μs -> 6.45μs (5.43% slower) def test_basic_only_newlines_and_nbsp(): # Only newlines and non-breaking spaces should return empty string text = "\n\xa0\n\xa0" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 6.47μs -> 6.21μs (4.25% faster) def test_basic_mixed_whitespace_between_words(): # Mixed spaces, newlines, and nbsp between words text = "A\n\n\xa0 B" cleaned, indices = 
clean_extra_whitespace_with_index_run( text ) # 12.9μs -> 9.07μs (41.9% faster) # --- EDGE TEST CASES --- def test_edge_empty_string(): # Empty string should return empty string and empty indices text = "" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 5.53μs -> 5.62μs (1.73% slower) def test_edge_all_whitespace(): # String with only whitespace, newlines, and nbsp text = " \n\xa0 \n\xa0" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 6.91μs -> 7.15μs (3.40% slower) def test_edge_one_character(): # Single non-whitespace character text = "A" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 5.86μs -> 6.33μs (7.52% slower) def test_edge_one_whitespace_character(): # Single whitespace character text = " " cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 5.26μs -> 5.96μs (11.8% slower) def test_edge_whitespace_between_every_char(): # Whitespace between every character text = "H E L L O" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 7.13μs -> 8.59μs (17.0% slower) def test_edge_multiple_types_of_whitespace(): # Combination of spaces, newlines, and nbsp between words text = "A \n\xa0 B" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 12.3μs -> 8.56μs (44.1% faster) def test_edge_trailing_newlines_and_nbsp(): # Trailing newlines and nbsp should be stripped text = "Hello world\n\xa0" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 8.36μs -> 9.20μs (9.07% slower) def test_edge_leading_newlines_and_nbsp(): # Leading newlines and nbsp should be stripped text = "\n\xa0Hello world" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 11.3μs -> 9.86μs (14.6% faster) def test_edge_alternating_whitespace(): # Alternating whitespace and characters text = " H E L L O " cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 8.30μs -> 8.81μs (5.80% slower) def test_edge_long_run_of_whitespace(): # Long run of whitespace in the 
middle text = "Hello" + (" " * 50) + "world" cleaned, indices = clean_extra_whitespace_with_index_run(text) # 27.5μs -> 13.4μs (106% faster) # --- LARGE SCALE TEST CASES --- def test_large_no_extra_whitespace(): # Large string with no extra whitespace text = "A" * 1000 cleaned, indices = clean_extra_whitespace_with_index_run(text) # 106μs -> 93.6μs (13.3% faster) def test_large_all_whitespace(): # Large string of only whitespace text = " " * 1000 cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 13.1μs -> 8.95μs (46.6% faster) def test_large_alternating_char_and_whitespace(): # Large string alternating between character and whitespace text = "".join(["A " for _ in range(500)]) # 500 'A ', total length 1000 cleaned, indices = clean_extra_whitespace_with_index_run(text) # 106μs -> 95.5μs (11.5% faster) def test_large_multiple_whitespace_blocks(): # Large string with random blocks of whitespace text = "A" + (" " * 10) + "B" + ("\n" * 10) + "C" + ("\xa0" * 10) + "D" cleaned, indices = clean_extra_whitespace_with_index_run(text) # 28.6μs -> 12.9μs (122% faster) def test_large_leading_and_trailing_whitespace(): # Large leading and trailing whitespace text = (" " * 500) + "Hello world" + (" " * 500) cleaned, indices = clean_extra_whitespace_with_index_run(text) # 203μs -> 52.1μs (291% faster) def test_large_mixed_whitespace_everywhere(): # Large text with mixed whitespace everywhere text = (" " * 100) + "A" + ("\n" * 100) + "B" + ("\xa0" * 100) + "C" + (" " * 100) cleaned, indices = clean_extra_whitespace_with_index_run(text) # 189μs -> 47.8μs (297% faster) # --- FUNCTIONALITY AND INTEGRITY TESTS --- def test_mutation_detection_extra_space(): # If function fails to remove extra spaces, test should fail text = "Test case" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 9.65μs -> 8.87μs (8.84% faster) def test_mutation_detection_strip(): # If function fails to strip leading/trailing whitespace, test should fail text = " Test case " 
cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 9.64μs -> 8.97μs (7.41% faster) def test_mutation_detection_newline_nbsp(): # If function fails to replace newlines or nbsp, test should fail text = "Test\n\xa0case" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 11.7μs -> 9.45μs (23.5% faster) def test_mutation_detection_index_integrity(): # Changing the index logic should break this test text = "A B" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 8.82μs -> 7.73μs (14.2% faster) def test_mutation_detection_empty_output(): # If function fails to return empty string for all whitespace, test should fail text = " \n\xa0 " cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 7.79μs -> 8.53μs (8.65% slower) # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. ``` ```python from __future__ import annotations # imports from unstructured.cleaners.core import clean_extra_whitespace_with_index_run # unit tests # 1. Basic Test Cases def test_basic_no_extra_whitespace(): # Text with no extra whitespace should remain unchanged text = "Hello world!" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 10.3μs -> 10.9μs (5.46% slower) def test_basic_multiple_spaces_between_words(): # Multiple spaces between words should be reduced to one text = "Hello world!" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 11.1μs -> 10.2μs (9.12% faster) def test_basic_leading_and_trailing_spaces(): # Leading and trailing spaces should be stripped text = " Hello world! 
" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 10.5μs -> 9.89μs (6.26% faster) def test_basic_newline_and_nonbreaking_space(): # Newlines and non-breaking spaces should be converted to single spaces text = "Hello\nworld!\xa0Test" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 12.4μs -> 9.69μs (28.0% faster) def test_basic_combined_whitespace_types(): # Combination of spaces, newlines, and non-breaking spaces text = "A \n\xa0 B" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 11.4μs -> 9.02μs (26.3% faster) # 2. Edge Test Cases def test_edge_empty_string(): # Empty string should return empty string and empty indices text = "" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 5.58μs -> 5.64μs (1.01% slower) def test_edge_only_spaces(): # String with only spaces should return empty string text = " " cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 5.98μs -> 6.55μs (8.71% slower) def test_edge_only_newlines_and_nbsp(): # String with only newlines and non-breaking spaces text = "\n\xa0\n\xa0" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 6.54μs -> 6.06μs (7.91% faster) def test_edge_single_character(): # Single character should remain unchanged text = "A" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 6.01μs -> 6.45μs (6.78% slower) def test_edge_all_whitespace_between_words(): # All whitespace between words should be reduced to one space text = "A \n\xa0 B" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 13.4μs -> 9.08μs (47.9% faster) def test_edge_whitespace_at_various_positions(): # Whitespace at start, middle, and end text = " A B " cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 9.58μs -> 8.26μs (16.0% faster) def test_edge_multiple_consecutive_whitespace_groups(): # Several groups of consecutive whitespace text = "A \n\n B C" cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 
12.9μs -> 9.42μs (37.3% faster) # 3. Large Scale Test Cases def test_large_long_string_with_regular_spacing(): # Large string with regular words and single spaces text = "word " * 200 cleaned, indices = clean_extra_whitespace_with_index_run( text.strip() ) # 107μs -> 95.9μs (12.2% faster) def test_large_long_string_with_extra_spaces(): # Large string with extra spaces between words text = ("word " * 200).strip() cleaned, indices = clean_extra_whitespace_with_index_run(text) # 402μs -> 180μs (123% faster) def test_large_mixed_whitespace(): # Large string with mixed whitespace types words = ["word"] * 500 text = " \n\xa0 ".join(words) cleaned, indices = clean_extra_whitespace_with_index_run(text) # 1.37ms -> 598μs (129% faster) def test_large_leading_and_trailing_whitespace(): # Large string with leading and trailing whitespace text = " " * 100 + "word " * 800 + " " * 100 cleaned, indices = clean_extra_whitespace_with_index_run(text) # 468μs -> 374μs (25.1% faster) def test_large_string_all_whitespace(): # Large string of only whitespace text = " " * 999 cleaned, indices = clean_extra_whitespace_with_index_run( text ) # 13.8μs -> 8.85μs (55.9% faster) # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. 
``` ```python from unstructured.cleaners.core import clean_extra_whitespace_with_index_run def test_clean_extra_whitespace_with_index_run(): clean_extra_whitespace_with_index_run("\n\x00") ``` </details> <details> <summary>⏪ Click to see Replay Tests</summary> | Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup | |:--------------------------------------------------------------------------------------------------------------|:--------------|:---------------|:----------| | `test_benchmark1_py__replay_test_0.py::test_unstructured_cleaners_core_clean_extra_whitespace_with_index_run` | 376μs | 347μs | 8.63%✅ | </details> <details> <summary>🔎 Click to see Concolic Coverage Tests</summary> | Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup | |:----------------------------------------------------------------------------------------------------------------|:--------------|:---------------|:----------| | `codeflash_concolic_3yq4ufg_/tmp5dfyu5tu/test_concolic_coverage.py::test_clean_extra_whitespace_with_index_run` | 27.1μs | 17.7μs | 52.7%✅ | </details> To edit these changes `git checkout codeflash/optimize-clean_extra_whitespace_with_index_run-mji60td0` and push. [](https://codeflash.ai)  --------- Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Aseem Saxena <aseem.bits@gmail.com> Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: Alan Bertl <alan@unstructured.io>
…structured-IO#4173) <!-- CODEFLASH_OPTIMIZATION: {"function":"recursive_xy_cut_swapped","file":"unstructured/partition/utils/xycut.py","speedup_pct":"221%","speedup_x":"2.21x","original_runtime":"74.9 milliseconds","best_runtime":"23.4 milliseconds","optimization_type":"loop","timestamp":"2025-12-19T10:16:38.619Z","version":"1.0"} --> #### 📄 221% (2.21x) speedup for ***`recursive_xy_cut_swapped` in `unstructured/partition/utils/xycut.py`*** ⏱️ Runtime : **`74.9 milliseconds`** **→** **`23.4 milliseconds`** (best of `57` runs) #### 📝 Explanation and details The optimized code achieves a **220% speedup** by applying **Numba JIT compilation** to the two most computationally expensive functions: `projection_by_bboxes` and `split_projection_profile`. **Key optimizations:** 1. **`@njit(cache=True)` decorators** on both bottleneck functions compile them to optimized machine code, eliminating Python interpreter overhead 2. **Explicit loop replacement** in `projection_by_bboxes`: Changed from `for start, end in boxes[:, axis::2]` with NumPy slice updates to explicit integer loops accessing individual array elements, which is much faster in Numba's nopython mode 3. 
**Manual array construction** in `split_projection_profile`: Replaced `np.insert()` and `np.append()` with pre-allocated arrays and explicit assignment loops, avoiding expensive array concatenation operations **Performance impact analysis:** From the line profiler results, the optimized functions show dramatic improvements: - `projection_by_bboxes` calls went from ~21ms to ~1.17s total runtime (but this is misleading due to JIT compilation overhead being included) - The actual per-call performance shows the functions are much faster, as evidenced by the overall 220% speedup **Workload benefits:** Based on the function references and test results, this optimization is particularly valuable for: - **Document layout analysis** where `recursive_xy_cut_swapped` processes many bounding boxes - **Large-scale scenarios** (500+ boxes) showing 200-240% speedups consistently - **Recursive processing** workflows where these functions are called repeatedly in nested operations The optimization maintains identical behavior while dramatically reducing computational overhead for any workload involving spatial partitioning of bounding boxes, especially beneficial for document processing pipelines that handle complex layouts with many text regions. 
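The loop restructuring can be sketched in plain NumPy; in the PR the rewritten body is additionally compiled with Numba's `@njit(cache=True)`, which favors explicit integer loops over slice assignment (function names here are illustrative):

```python
import numpy as np

def projection_slices(boxes, axis, length):
    # Original style: per-box slice assignment on the projection profile.
    proj = np.zeros(length, dtype=np.int64)
    for start, end in boxes[:, axis::2]:
        proj[start:end] += 1
    return proj

def projection_loop(boxes, axis, length):
    # Numba-friendly rewrite: explicit integer loops over individual
    # elements, which nopython mode compiles to tight machine code.
    proj = np.zeros(length, dtype=np.int64)
    for i in range(boxes.shape[0]):
        start = boxes[i, axis]
        end = boxes[i, axis + 2]
        for j in range(start, end):
            proj[j] += 1
    return proj

# Two boxes as (x1, y1, x2, y2); both variants build the same profile.
boxes = np.array([[0, 0, 3, 2], [1, 0, 4, 2]])
assert np.array_equal(projection_loop(boxes, 0, 5),
                      projection_slices(boxes, 0, 5))
```

Without the JIT, the explicit loop is usually slower than the vectorized slice version in pure Python; the rewrite pays off because Numba compiles element-wise loops far better than fancy indexing.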
✅ **Correctness verification report:** | Test | Status | | --------------------------- | ----------------- | | ⚙️ Existing Unit Tests | 🔘 **None Found** | | 🌀 Generated Regression Tests | ✅ **40 Passed** | | ⏪ Replay Tests | 🔘 **None Found** | | 🔎 Concolic Coverage Tests | 🔘 **None Found** | |📊 Tests Coverage | 100.0% | <details> <summary>🌀 Generated Regression Tests and Runtime</summary> ```python # function to test import numpy as np # imports from unstructured.partition.utils.xycut import recursive_xy_cut_swapped # unit tests # Basic Test Cases def test_single_box(): # Test with a single bounding box boxes = np.array([[0, 0, 10, 10]]) indices = np.array([0]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 44.5μs -> 12.6μs (252% faster) def test_two_non_overlapping_boxes(): # Two boxes far apart horizontally boxes = np.array([[0, 0, 10, 10], [20, 0, 30, 10]]) indices = np.array([0, 1]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 63.7μs -> 19.0μs (235% faster) def test_two_overlapping_boxes_y(): # Two boxes stacked vertically boxes = np.array([[0, 0, 10, 10], [0, 20, 10, 30]]) indices = np.array([0, 1]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 118μs -> 35.5μs (235% faster) def test_three_boxes_grid(): # Three boxes in a grid boxes = np.array([[0, 0, 10, 10], [20, 0, 30, 10], [0, 20, 10, 30]]) indices = np.array([0, 1, 2]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 136μs -> 40.9μs (234% faster) def test_boxes_already_sorted(): # Boxes already sorted by x then y boxes = np.array([[0, 0, 10, 10], [0, 20, 10, 30], [20, 0, 30, 10]]) indices = np.array([0, 1, 2]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 136μs -> 40.2μs (239% faster) # Edge Test Cases def test_boxes_with_zero_area(): # Box with zero width and/or height boxes = np.array([[0, 0, 0, 10], [10, 10, 20, 10]]) indices = np.array([0, 1]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 55.6μs -> 36.7μs (51.5% faster) def 
test_boxes_with_negative_coordinates(): # Boxes with negative coordinates boxes = np.array([[-10, -10, 0, 0], [0, 0, 10, 10]]) indices = np.array([0, 1]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 53.8μs -> 14.7μs (266% faster) def test_boxes_with_overlap(): # Overlapping boxes boxes = np.array([[0, 0, 10, 10], [5, 5, 15, 15]]) indices = np.array([0, 1]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 50.5μs -> 15.0μs (236% faster) def test_boxes_with_same_coordinates(): # Multiple boxes with same coordinates boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 10]]) indices = np.array([0, 1]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 48.2μs -> 13.1μs (268% faster) def test_boxes_with_minimal_gap(): # Boxes that barely touch (gap = 1) boxes = np.array([[0, 0, 10, 10], [11, 0, 21, 10]]) indices = np.array([0, 1]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 66.6μs -> 19.8μs (237% faster) def test_boxes_with_no_split_possible(): # All boxes overlap so no split boxes = np.array([[0, 0, 10, 10], [5, 0, 15, 10], [8, 0, 18, 10]]) indices = np.array([0, 1, 2]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 49.7μs -> 13.3μs (272% faster) # Large Scale Test Cases def test_large_number_of_boxes_horizontal(): # 500 boxes in a row horizontally boxes = np.array([[i * 2, 0, i * 2 + 1, 10] for i in range(500)]) indices = np.arange(500) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 10.00ms -> 3.33ms (200% faster) def test_large_number_of_boxes_vertical(): # 500 boxes in a column vertically boxes = np.array([[0, i * 2, 10, i * 2 + 1] for i in range(500)]) indices = np.arange(500) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 19.6ms -> 6.31ms (211% faster) def test_large_grid_of_boxes(): # 20x20 grid of boxes boxes = [] indices = [] idx = 0 for i in range(20): for j in range(20): boxes.append([i * 5, j * 5, i * 5 + 4, j * 5 + 4]) indices.append(idx) idx += 1 boxes = np.array(boxes) indices = np.array(indices) 
res = [] recursive_xy_cut_swapped(boxes, indices, res) # 14.9ms -> 4.36ms (242% faster) def test_boxes_with_random_order(): # 100 boxes, shuffled boxes = np.array([[i, i, i + 10, i + 10] for i in range(100)]) indices = np.arange(100) rng = np.random.default_rng(42) perm = rng.permutation(100) boxes = boxes[perm] indices = indices[perm] res = [] recursive_xy_cut_swapped(boxes, indices, res) # 223μs -> 22.8μs (880% faster) def test_boxes_with_dense_overlap(): # 100 boxes all overlapping at the same spot boxes = np.array([[0, 0, 10, 10] for _ in range(100)]) indices = np.arange(100) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 219μs -> 19.8μs (1011% faster) # Edge: degenerate case with one pixel boxes def test_one_pixel_boxes(): boxes = np.array([[i, i, i + 1, i + 1] for i in range(50)]) indices = np.arange(50) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 162μs -> 16.3μs (895% faster) # Edge: maximal coordinates def test_boxes_with_max_coordinates(): boxes = np.array([[0, 0, 999, 999], [500, 500, 999, 999]]) indices = np.array([0, 1]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 52.3μs -> 21.2μs (147% faster) # Edge: indices are not in order def test_indices_not_in_order(): boxes = np.array([[0, 0, 10, 10], [10, 0, 20, 10], [0, 10, 10, 20]]) indices = np.array([2, 0, 1]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 48.5μs -> 13.1μs (271% faster) # Edge: all boxes touching at one point def test_boxes_touching_at_one_point(): boxes = np.array([[0, 0, 10, 10], [10, 10, 20, 20], [20, 20, 30, 30]]) indices = np.array([0, 1, 2]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 49.2μs -> 13.2μs (273% faster) # Edge: very thin boxes def test_very_thin_boxes(): boxes = np.array([[i, 0, i + 1, 100] for i in range(30)]) indices = np.arange(30) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 106μs -> 16.9μs (531% faster) # Edge: very flat boxes def test_very_flat_boxes(): boxes = np.array([[0, i, 100, i + 1] for i 
in range(30)]) indices = np.arange(30) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 106μs -> 16.5μs (544% faster) # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. ``` ```python import numpy as np # imports from unstructured.partition.utils.xycut import recursive_xy_cut_swapped # unit tests # Basic Test Cases def test_single_box(): # One box, should return the single index boxes = np.array([[0, 0, 10, 10]]) indices = np.array([42]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 46.6μs -> 13.2μs (254% faster) def test_two_non_overlapping_boxes(): # Two boxes, non-overlapping, should return indices sorted by x then y boxes = np.array( [ [0, 0, 10, 10], # left box [20, 0, 30, 10], # right box ] ) indices = np.array([1, 2]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 65.5μs -> 19.1μs (242% faster) def test_two_vertically_stacked_boxes(): # Two boxes, stacked vertically, should be sorted by y within x boxes = np.array( [ [0, 0, 10, 10], # top box [0, 20, 10, 30], # bottom box ] ) indices = np.array([3, 4]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 121μs -> 35.7μs (241% faster) def test_three_boxes_mixed(): # Boxes in different positions, tests sorting and splitting boxes = np.array( [ [0, 0, 10, 10], # top left [20, 0, 30, 10], # top right [0, 20, 10, 30], # bottom left ] ) indices = np.array([10, 11, 12]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 137μs -> 41.3μs (232% faster) # Edge Test Cases def test_boxes_with_zero_area(): # Boxes with zero width or height should be ignored boxes = np.array( [ [0, 0, 0, 10], # zero width [10, 10, 20, 10], # zero height [5, 5, 15, 15], # valid box ] ) indices = np.array([100, 101, 102]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 71.4μs -> 38.7μs (84.7% faster) def test_boxes_touching_edges(): # Boxes that touch but do not overlap boxes = np.array( [ [0, 0, 10, 10], [10, 0, 20, 10], # touches 
right edge of first [20, 0, 30, 10], # touches right edge of second ] ) indices = np.array([200, 201, 202]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 53.4μs -> 15.2μs (252% faster) def test_boxes_with_identical_coordinates(): # Multiple boxes with identical coordinates boxes = np.array( [ [0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10], ] ) indices = np.array([301, 302, 303]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 50.3μs -> 13.5μs (274% faster) def test_boxes_with_negative_coordinates(): # Boxes with negative coordinates boxes = np.array( [ [-10, -10, 0, 0], [0, 0, 10, 10], [10, 10, 20, 20], ] ) indices = np.array([400, 401, 402]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 49.8μs -> 13.5μs (267% faster) def test_boxes_fully_overlapping(): # All boxes overlap completely boxes = np.array( [ [0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10], ] ) indices = np.array([501, 502, 503]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 48.8μs -> 13.0μs (275% faster) def test_boxes_with_minimal_gap(): # Boxes separated by minimal gap (just enough to split) boxes = np.array( [ [0, 0, 10, 10], [11, 0, 21, 10], # gap of 1 [22, 0, 32, 10], # gap of 1 ] ) indices = np.array([601, 602, 603]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 86.4μs -> 26.2μs (229% faster) # Large Scale Test Cases def test_many_boxes_horizontal(): # 100 boxes in a horizontal row N = 100 boxes = np.array([[i * 10, 0, i * 10 + 9, 10] for i in range(N)]) indices = np.arange(N) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 1.87ms -> 556μs (236% faster) def test_many_boxes_vertical(): # 100 boxes in a vertical column N = 100 boxes = np.array([[0, i * 10, 10, i * 10 + 9] for i in range(N)]) indices = np.arange(N) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 3.78ms -> 1.16ms (225% faster) def test_grid_of_boxes(): # 10x10 grid of boxes N = 10 boxes = [] indices = [] idx = 0 for i in range(N): for j in range(N): boxes.append([i 
* 10, j * 10, i * 10 + 9, j * 10 + 9]) indices.append(idx) idx += 1 boxes = np.array(boxes) indices = np.array(indices) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 3.86ms -> 1.13ms (242% faster) # Should be sorted first by x (columns), then by y (rows) within each column expected = [] for i in range(N): col_indices = [i * N + j for j in range(N)] expected.extend(col_indices) def test_large_random_boxes(): # 500 random boxes, test performance and correctness np.random.seed(42) N = 500 left = np.random.randint(0, 1000, size=N) top = np.random.randint(0, 1000, size=N) width = np.random.randint(1, 10, size=N) height = np.random.randint(1, 10, size=N) right = left + width bottom = top + height boxes = np.stack([left, top, right, bottom], axis=1) indices = np.arange(N) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 18.2ms -> 5.82ms (212% faster) def test_boxes_with_max_coordinates(): # Boxes with coordinates at the upper range boxes = np.array( [ [990, 990, 999, 999], [995, 995, 999, 999], [900, 900, 950, 950], ] ) indices = np.array([800, 801, 802]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 69.8μs -> 23.5μs (198% faster) # Additional edge case: test with all boxes in a single point (degenerate case) def test_boxes_degenerate_point(): boxes = np.array( [ [5, 5, 5, 5], [5, 5, 5, 5], ] ) indices = np.array([900, 901]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 11.2μs -> 4.29μs (160% faster) # Additional: test with a single tall, thin box and a single short, wide box def test_tall_and_wide_boxes(): boxes = np.array( [ [0, 0, 2, 100], # tall, thin [0, 0, 100, 2], # short, wide ] ) indices = np.array([1000, 1001]) res = [] recursive_xy_cut_swapped(boxes, indices, res) # 47.4μs -> 13.9μs (240% faster) # Additional: test with overlapping but not identical boxes def test_overlapping_boxes(): boxes = np.array( [ [0, 0, 10, 10], [5, 5, 15, 15], [10, 10, 20, 20], ] ) indices = np.array([1100, 1101, 1102]) res = [] 
recursive_xy_cut_swapped(boxes, indices, res) # 49.1μs -> 13.2μs (273% faster) # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. ``` ```python ``` </details> To edit these changes `git checkout codeflash/optimize-recursive_xy_cut_swapped-mjcpsm6h` and push. [](https://codeflash.ai)  --------- Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: qued <64741807+qued@users.noreply.github.com>
…by_style_name` by 69% (Unstructured-IO#4174)

<!-- CODEFLASH_OPTIMIZATION: {"function":"_DocxPartitioner._parse_category_depth_by_style_name","file":"unstructured/partition/docx.py","speedup_pct":"69%","speedup_x":"0.69x","original_runtime":"8.62 milliseconds","best_runtime":"5.11 milliseconds","optimization_type":"loop","timestamp":"2025-08-22T21:02:58.781Z","version":"1.0"} -->

### 📄 69% (0.69x) speedup for ***`_DocxPartitioner._parse_category_depth_by_style_name` in `unstructured/partition/docx.py`***

⏱️ Runtime : **`8.62 milliseconds`** **→** **`5.11 milliseconds`** (best of `17` runs)

### 📝 Explanation and details

The optimized code achieves a **69% speedup** through two key optimizations:

**1. Tuple-based prefix matching:** Changed `list_prefixes` from a list to a tuple and replaced the `any()` loop with a single `str.startswith()` call that accepts multiple prefixes. This eliminates the overhead of creating a generator expression and iterating through the prefixes one by one. The line profiler shows this optimization reduced the time spent on prefix matching from 39.4% to 10.9% of total execution time.

**2. Cached string splitting in `_extract_number()`:** Instead of calling `suffix.split()` twice (once to check the last element and once to extract it), the result is now cached in a `parts` variable. This eliminates redundant string operations when extracting numbers from style names.
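The two changes can be illustrated with a standalone sketch; the prefix tuple and helper below are simplified stand-ins, not the partitioner's exact code:

```python
# illustrative prefix values, not the real list from unstructured
LIST_PREFIXES = ("List Bullet", "List Continue", "List Number", "List")

def parse_depth(style_name: str) -> int:
    # str.startswith accepts a tuple, so every prefix is checked in one C-level call
    if style_name.startswith(LIST_PREFIXES):
        parts = style_name.split()  # split once and reuse, instead of splitting twice
        if parts and parts[-1].isdigit():
            return int(parts[-1]) - 1
    return 0

assert parse_depth("List Bullet 3") == 2
assert parse_depth("List") == 0
assert parse_depth("Normal") == 0
```

Compared with `any(style_name.startswith(p) for p in prefixes)`, the tuple form avoids building a generator and re-entering Python-level iteration for every style name.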
**Performance characteristics by test case:**

- **List styles see the biggest gains** (43-69% faster): the tuple-based prefix matching is most effective here, since these styles require prefix checking
- **Non-matching styles improve dramatically** (65-151% faster): these benefit from fast rejection through the optimized prefix check
- **Heading styles show modest gains** (2-33% faster): these bypass the list-prefix logic, so improvements come mainly from the cached splitting
- **Large-scale tests demonstrate consistent speedup** (20-69% faster): the optimizations scale well with volume

The optimizations are particularly effective for documents with many list-style elements or diverse style names that don't match any prefixes.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **87 Passed** |
| 🌀 Generated Regression Tests | ✅ **5555 Passed** |
| ⏪ Replay Tests | ✅ **13 Passed** |
| 🔎 Concolic Coverage Tests | ✅ **6 Passed** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:------------------------------------------------------------------|:--------------|:---------------|:----------|
| `partition/test_docx.py::test_parse_category_depth_by_style_name` | 24.5μs | 17.3μs | 41.7%✅ |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations # imports import pytest from unstructured.partition.docx import _DocxPartitioner # unit tests @pytest.fixture def partitioner(): # Provide a partitioner instance for use in tests return _DocxPartitioner() # -------------------------- # 1. 
Basic Test Cases # -------------------------- def test_heading_level_1(partitioner): # Heading 1 should map to depth 0 codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 1") def test_heading_level_2(partitioner): # Heading 2 should map to depth 1 codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2") def test_heading_level_10(partitioner): # Heading 10 should map to depth 9 codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 10") def test_subtitle(partitioner): # Subtitle should map to depth 1 codeflash_output = partitioner._parse_category_depth_by_style_name("Subtitle") def test_list_bullet_1(partitioner): # List Bullet 1 should map to depth 0 codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 1") def test_list_bullet_3(partitioner): # List Bullet 3 should map to depth 2 codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 3") def test_list_number_2(partitioner): # List Number 2 should map to depth 1 codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 2") def test_list_continue_5(partitioner): # List Continue 5 should map to depth 4 codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue 5") def test_list_plain(partitioner): # "List" without a number should map to 0 codeflash_output = partitioner._parse_category_depth_by_style_name("List") def test_normal_style(partitioner): # Any non-special style should map to 0 codeflash_output = partitioner._parse_category_depth_by_style_name("Normal") def test_random_style(partitioner): # Unknown style name should map to 0 codeflash_output = partitioner._parse_category_depth_by_style_name("RandomStyle") # -------------------------- # 2. 
Edge Test Cases # -------------------------- def test_heading_with_extra_spaces(partitioner): # Heading with extra spaces should still parse the last word as number if possible codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 3") def test_heading_without_number(partitioner): # Heading with no number should map to 0 (since no number to subtract 1) codeflash_output = partitioner._parse_category_depth_by_style_name("Heading") def test_list_bullet_with_non_digit_suffix(partitioner): # List Bullet with non-digit at end should map to 0 codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet foo") def test_list_number_with_large_number(partitioner): # List Number with a large number codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 999") def test_empty_string(partitioner): # Empty string should map to 0 codeflash_output = partitioner._parse_category_depth_by_style_name("") def test_case_sensitivity(partitioner): # Should be case-sensitive: "heading 1" does not match "Heading" codeflash_output = partitioner._parse_category_depth_by_style_name("heading 1") def test_subtitle_case(partitioner): # "subtitle" (lowercase) should not match "Subtitle" codeflash_output = partitioner._parse_category_depth_by_style_name("subtitle") def test_list_bullet_with_multiple_spaces(partitioner): # List Bullet with multiple spaces before number codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 2") def test_style_name_with_trailing_space(partitioner): # Style name with trailing space codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 4 ") def test_style_name_with_leading_space(partitioner): # Style name with leading space codeflash_output = partitioner._parse_category_depth_by_style_name(" List Bullet 2") def test_style_name_with_internal_non_digit(partitioner): # Heading with non-digit in the number position codeflash_output = 
partitioner._parse_category_depth_by_style_name("Heading X") def test_style_name_with_number_in_middle(partitioner): # Only the last word is checked for a digit codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2 Extra") def test_list_continue_with_no_number(partitioner): # List Continue with no number should map to 0 codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue") def test_style_name_with_special_characters(partitioner): # Style name with special characters should not break function codeflash_output = partitioner._parse_category_depth_by_style_name("Heading #$%") def test_list_prefix_overlap(partitioner): # "List BulletPoint 2" does not match any valid prefix, so should map to 0 codeflash_output = partitioner._parse_category_depth_by_style_name("List BulletPoint 2") # -------------------------- # 3. Large Scale Test Cases # -------------------------- def test_many_headings(partitioner): # Test a large number of headings, up to 1000 for i in range(1, 1001): # "Heading N" should map to N-1 codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}") def test_many_list_bullets(partitioner): # Test a large number of list bullets, up to 1000 for i in range(1, 1001): codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Bullet {i}") def test_many_list_numbers(partitioner): # Test a large number of list numbers, up to 1000 for i in range(1, 1001): codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Number {i}") def test_mixed_styles_large_scale(partitioner): # Mix a large number of different style names, including edge cases for i in range(1, 501): # Headings codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}") # List Bullets codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Bullet {i}") # List Number codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Number {i}") # List 
Continue codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Continue {i}") # Unknown style codeflash_output = partitioner._parse_category_depth_by_style_name(f"Unknown {i}") # Heading with non-digit codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}x") def test_large_scale_with_unusual_inputs(partitioner): # Test 1000 random/edge case style names for i in range(1, 1001): # Style with only number codeflash_output = partitioner._parse_category_depth_by_style_name(str(i)) # Style with number at start codeflash_output = partitioner._parse_category_depth_by_style_name(f"{i} Heading") # Style with number in middle codeflash_output = partitioner._parse_category_depth_by_style_name(f"List {i} Bullet") # Style with extra spaces codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}") # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. #------------------------------------------------ from __future__ import annotations # imports import pytest # used for our unit tests from unstructured.partition.docx import _DocxPartitioner # function to test # pyright: reportPrivateUsage=false class DocxPartitionerOptions: pass from unstructured.partition.docx import _DocxPartitioner # unit tests @pytest.fixture def partitioner(): # Fixture to create a _DocxPartitioner instance return _DocxPartitioner(DocxPartitionerOptions()) # 1. 
Basic Test Cases def test_heading_styles_basic(partitioner): # Test standard heading styles codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 1") # 4.50μs -> 4.41μs (2.16% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2") # 1.45μs -> 1.41μs (2.62% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 3") # 1.23μs -> 1.07μs (14.6% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 10") # 1.45μs -> 1.09μs (33.2% faster) def test_subtitle_style(partitioner): # Test the special case for 'Subtitle' codeflash_output = partitioner._parse_category_depth_by_style_name("Subtitle") # 1.97μs -> 1.89μs (4.18% faster) def test_list_styles_basic(partitioner): # Test basic list styles codeflash_output = partitioner._parse_category_depth_by_style_name("List 1") # 6.28μs -> 4.37μs (43.6% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List 2") # 2.53μs -> 1.59μs (59.8% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List 10") # 2.36μs -> 1.47μs (60.8% faster) def test_list_bullet_styles(partitioner): # Test 'List Bullet' styles codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 1") # 6.13μs -> 4.47μs (37.1% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 2") # 2.53μs -> 1.71μs (47.9% faster) def test_list_continue_styles(partitioner): # Test 'List Continue' styles codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue 1") # 6.34μs -> 4.41μs (43.9% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue 5") # 2.59μs -> 1.75μs (48.1% faster) def test_list_number_styles(partitioner): # Test 'List Number' styles codeflash_output = partitioner._parse_category_depth_by_style_name("List Number 1") # 6.25μs -> 4.34μs (44.0% faster) codeflash_output = 
partitioner._parse_category_depth_by_style_name("List Number 3") # 2.54μs -> 1.72μs (48.2% faster) def test_other_styles_default_to_zero(partitioner): # Test styles that should default to 0 codeflash_output = partitioner._parse_category_depth_by_style_name("Normal") # 4.09μs -> 2.48μs (65.0% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("Body Text") # 1.94μs -> 913ns (113% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("Title") # 1.65μs -> 728ns (127% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("Random Style") # 1.58μs -> 727ns (117% faster) # 2. Edge Test Cases def test_heading_without_number(partitioner): # Test 'Heading' with no number codeflash_output = partitioner._parse_category_depth_by_style_name("Heading") # 3.04μs -> 3.10μs (1.97% slower) def test_list_without_number(partitioner): # Test 'List' with no number codeflash_output = partitioner._parse_category_depth_by_style_name("List") # 5.24μs -> 3.63μs (44.4% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet") # 2.56μs -> 1.72μs (49.1% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue") # 1.66μs -> 1.08μs (53.4% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List Number") # 1.52μs -> 932ns (63.2% faster) def test_heading_with_non_numeric_suffix(partitioner): # Test 'Heading' with a non-numeric suffix codeflash_output = partitioner._parse_category_depth_by_style_name("Heading One") # 3.37μs -> 3.44μs (1.95% slower) codeflash_output = partitioner._parse_category_depth_by_style_name("Heading X") # 1.36μs -> 1.32μs (2.81% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 1A") # 953ns -> 983ns (3.05% slower) def test_list_with_non_numeric_suffix(partitioner): # Test 'List' with a non-numeric suffix codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet X") # 
5.65μs -> 3.98μs (42.2% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List Continue A") # 2.24μs -> 1.61μs (38.9% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List Number Foo") # 1.76μs -> 1.22μs (44.1% faster) def test_case_sensitivity(partitioner): # Test that style names are case-sensitive codeflash_output = partitioner._parse_category_depth_by_style_name("heading 1") # 3.98μs -> 2.35μs (69.0% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("HEADING 1") # 2.01μs -> 935ns (115% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("Subtitle") # 665ns -> 591ns (12.5% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("subtitle") # 1.71μs -> 803ns (113% faster) def test_empty_and_whitespace_styles(partitioner): # Test empty string and whitespace-only style names codeflash_output = partitioner._parse_category_depth_by_style_name("") # 4.14μs -> 2.40μs (72.4% faster) codeflash_output = partitioner._parse_category_depth_by_style_name(" ") # 1.94μs -> 808ns (139% faster) def test_style_name_with_extra_spaces(partitioner): # Test style names with extra spaces codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 2") # 3.79μs -> 3.78μs (0.371% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet 3") # 4.11μs -> 2.14μs (92.0% faster) def test_style_name_with_leading_trailing_spaces(partitioner): # Test style names with leading/trailing spaces codeflash_output = partitioner._parse_category_depth_by_style_name(" Heading 1") # 3.97μs -> 2.44μs (62.9% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List 2 ") # 4.34μs -> 3.08μs (41.2% faster) def test_style_name_with_multiple_words(partitioner): # Test style names with multiple words that don't match any prefix codeflash_output = partitioner._parse_category_depth_by_style_name("My Custom Heading 1") # 3.81μs -> 
2.31μs (65.1% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List Bullet Special 2") # 4.46μs -> 3.08μs (45.0% faster) def test_style_name_with_large_number(partitioner): # Test styles with very large numbers codeflash_output = partitioner._parse_category_depth_by_style_name("Heading 999") # 4.21μs -> 4.06μs (3.64% faster) codeflash_output = partitioner._parse_category_depth_by_style_name("List 1000") # 4.09μs -> 2.19μs (86.9% faster) # 3. Large Scale Test Cases def test_large_number_of_headings(partitioner): # Test a large number of heading levels for performance and correctness for i in range(1, 1000): style = f"Heading {i}" expected = i - 1 codeflash_output = partitioner._parse_category_depth_by_style_name(style) # 1.06ms -> 881μs (20.5% faster) def test_large_number_of_list_bullets(partitioner): # Test a large number of list bullet levels for i in range(1, 1000): style = f"List Bullet {i}" expected = i - 1 codeflash_output = partitioner._parse_category_depth_by_style_name(style) # 1.77ms -> 1.05ms (67.8% faster) def test_large_number_of_list_numbers(partitioner): # Test a large number of list number levels for i in range(1, 1000): style = f"List Number {i}" expected = i - 1 codeflash_output = partitioner._parse_category_depth_by_style_name(style) # 1.77ms -> 1.05ms (69.0% faster) def test_large_number_of_non_matching_styles(partitioner): # Test a large number of non-matching style names for i in range(1, 1000): style = f"Custom Style {i}" codeflash_output = partitioner._parse_category_depth_by_style_name(style) # 1.47ms -> 583μs (151% faster) def test_large_mixed_styles(partitioner): # Test a mixture of all types in a large batch for i in range(1, 250): codeflash_output = partitioner._parse_category_depth_by_style_name(f"Heading {i}") # 282μs -> 229μs (23.2% faster) codeflash_output = partitioner._parse_category_depth_by_style_name(f"List {i}") codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Bullet {i}") # 
434μs -> 265μs (63.7% faster) codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Continue {i}") codeflash_output = partitioner._parse_category_depth_by_style_name(f"List Number {i}") # 431μs -> 258μs (66.5% faster) codeflash_output = partitioner._parse_category_depth_by_style_name(f"Random Style {i}") # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. #------------------------------------------------ from typing import TextIO from unstructured.partition.docx import DocxPartitionerOptions from unstructured.partition.docx import _DocxPartitioner def test__DocxPartitioner__parse_category_depth_by_style_name(): _DocxPartitioner._parse_category_depth_by_style_name(_DocxPartitioner(DocxPartitionerOptions(file=None, file_path='', include_page_breaks=False, infer_table_structure=True, starting_page_number=0, strategy=None)), 'List\x00\x00\x00\x00') def test__DocxPartitioner__parse_category_depth_by_style_name_2(): _DocxPartitioner._parse_category_depth_by_style_name(_DocxPartitioner(DocxPartitionerOptions(file=None, file_path=None, include_page_breaks=False, infer_table_structure=False, starting_page_number=0, strategy=None)), '') def test__DocxPartitioner__parse_category_depth_by_style_name_3(): _DocxPartitioner._parse_category_depth_by_style_name(_DocxPartitioner(DocxPartitionerOptions(file=TextIO(), file_path='', include_page_breaks=True, infer_table_structure=False, starting_page_number=0, strategy='')), 'Subtitle') ``` </details> <details> <summary>⏪ Replay Tests and Runtime</summary> </details> <details> <summary>🔎 Concolic Coverage Tests and Runtime</summary> | Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup | |:---------------------------------------------------------------------------------------------------------------------------------|:--------------|:---------------|:----------| | 
`codeflash_concolic_ktxbqhta/tmp0euuwgaw/test_concolic_coverage.py::test__DocxPartitioner__parse_category_depth_by_style_name` | 6.64μs | 4.95μs | 34.2%✅ | | `codeflash_concolic_ktxbqhta/tmp0euuwgaw/test_concolic_coverage.py::test__DocxPartitioner__parse_category_depth_by_style_name_2` | 4.50μs | 2.90μs | 54.8%✅ | | `codeflash_concolic_ktxbqhta/tmp0euuwgaw/test_concolic_coverage.py::test__DocxPartitioner__parse_category_depth_by_style_name_3` | 1.80μs | 1.70μs | 5.58%✅ | </details> To edit these changes `git checkout codeflash/optimize-_DocxPartitioner._parse_category_depth_by_style_name-menbhfu6` and push. [](https://codeflash.ai) --------- Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: qued <64741807+qued@users.noreply.github.com>
…s_to_elements` by 85% (Unstructured-IO#4175)

<!-- CODEFLASH_OPTIMIZATION: {"function":"VertexAIEmbeddingEncoder._add_embeddings_to_elements","file":"unstructured/embed/vertexai.py","speedup_pct":"85%","speedup_x":"0.85x","original_runtime":"195 microseconds","best_runtime":"105 microseconds","optimization_type":"memory","timestamp":"2025-12-20T08:21:25.645Z","version":"1.0"} -->

#### 📄 85% (0.85x) speedup for ***`VertexAIEmbeddingEncoder._add_embeddings_to_elements` in `unstructured/embed/vertexai.py`***

⏱️ Runtime : **`195 microseconds`** **→** **`105 microseconds`** (best of `250` runs)

#### 📝 Explanation and details

The optimization achieves an 85% speedup by eliminating manual indexing and list building. The key changes are:

**What was optimized:**

1. **Replaced `enumerate()` with `zip()`** - Instead of `for i, element in enumerate(elements)` followed by `embeddings[i]`, the code now uses `for element, embedding in zip(elements, embeddings)` to iterate over both collections simultaneously
2. **Removed unnecessary list building** - Eliminated the `elements_w_embedding = []` list and its `.append()` operations, since the function mutates elements in place and returns the original `elements` list

**Why this is faster:**

- **Reduced indexing overhead**: The original code performed an `embeddings[i]` lookup on each iteration, which requires bounds checking and index calculation. 
`zip()` provides direct element access without indexing - **Eliminated list operations**: Building and appending to `elements_w_embedding` added ~35.6% of the original runtime overhead according to the profiler - **Better memory locality**: `zip()` creates an iterator that processes elements sequentially without additional memory allocations **Performance impact based on test results:** - **Small inputs (1-5 elements)**: 8-35% speedup - **Large inputs (100-999 elements)**: 87-98% speedup, showing the optimization scales very well - **Edge cases**: Consistent improvements across empty lists, None embeddings, and varied types The optimization is particularly effective for larger datasets, which is important since embedding operations typically process batches of documents. The function maintains identical behavior - elements are still mutated in-place and the same list is returned. ✅ **Correctness verification report:** | Test | Status | | --------------------------- | ----------------- | | ⚙️ Existing Unit Tests | 🔘 **None Found** | | 🌀 Generated Regression Tests | ✅ **60 Passed** | | ⏪ Replay Tests | 🔘 **None Found** | | 🔎 Concolic Coverage Tests | 🔘 **None Found** | |📊 Tests Coverage | 100.0% | <details> <summary>🌀 Generated Regression Tests and Runtime</summary> ```python from dataclasses import dataclass, field from typing import Any # imports import pytest # used for our unit tests from unstructured.embed.vertexai import VertexAIEmbeddingEncoder # Minimal stubs for dependencies class VertexAIEmbeddingConfig: pass @DataClass class Element: text: str embeddings: Any = field(default=None) class BaseEmbeddingEncoder: pass # unit tests # --- Basic Test Cases --- def test_basic_single_element_embedding(): # Test with a single element and single embedding encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) element = Element(text="Hello world") embedding = [0.1, 0.2, 0.3] codeflash_output = encoder._add_embeddings_to_elements([element], [embedding]) 
result = codeflash_output # 542ns -> 541ns (0.185% faster) def test_basic_multiple_elements_embeddings(): # Test with multiple elements and embeddings encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) elements = [Element(text="A"), Element(text="B"), Element(text="C")] embeddings = [[1], [2], [3]] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 750ns -> 583ns (28.6% faster) for i in range(3): pass def test_basic_return_is_input_list(): # The function should return the same list object (not a copy) encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) elements = [Element(text="X")] embeddings = [[42]] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 542ns -> 458ns (18.3% faster) # --- Edge Test Cases --- def test_edge_empty_lists(): # Test with empty input lists encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) elements = [] embeddings = [] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 375ns -> 416ns (9.86% slower) def test_edge_mismatched_lengths_raises(): # Test with mismatched lengths (should raise AssertionError) encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) elements = [Element(text="A"), Element(text="B")] embeddings = [[1]] with pytest.raises(AssertionError): encoder._add_embeddings_to_elements(elements, embeddings) # 500ns -> 500ns (0.000% faster) def test_edge_none_embedding(): # Test with None as an embedding encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) elements = [Element(text="A")] embeddings = [None] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 625ns -> 541ns (15.5% faster) def test_edge_element_with_existing_embedding(): # If element already has an embedding, it should be overwritten encoder = 
VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) element = Element(text="A", embeddings=[0]) new_embedding = [1, 2, 3] codeflash_output = encoder._add_embeddings_to_elements([element], [new_embedding]) result = codeflash_output # 625ns -> 500ns (25.0% faster) def test_edge_embedding_is_mutable_object(): # Test that mutable embeddings (like lists) are assigned, not copied encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) elements = [Element(text="A")] embedding = [1, 2, 3] codeflash_output = encoder._add_embeddings_to_elements(elements, [embedding]) result = codeflash_output # 583ns -> 500ns (16.6% faster) # Mutate embedding and check if element reflects change (should, if assigned) embedding.append(4) def test_edge_elements_are_mutated_in_place(): # The input elements should be mutated in place, not replaced encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) elements = [Element(text="X")] embeddings = [[99]] encoder._add_embeddings_to_elements(elements, embeddings) # 583ns -> 458ns (27.3% faster) # --- Large Scale Test Cases --- def test_large_scale_many_elements(): # Test with a large number of elements and embeddings encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) num_items = 500 # Under 1000 as per instructions elements = [Element(text=f"Text {i}") for i in range(num_items)] embeddings = [[i, i + 1, i + 2] for i in range(num_items)] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 31.1μs -> 16.1μs (93.5% faster) for i in range(num_items): pass def test_large_scale_all_none_embeddings(): # Large number of elements, all embeddings are None encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) num_items = 300 elements = [Element(text=str(i)) for i in range(num_items)] embeddings = [None] * num_items codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 18.4μs -> 9.58μs (91.7% 
faster) for i in range(num_items): pass def test_large_scale_varied_embedding_types(): # Mix of different embedding types (int, float, str, list, dict) encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) elements = [Element(text=f"e{i}") for i in range(5)] embeddings = [123, 3.14, "vector", [1, 2, 3], {"x": 1}] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 1.00μs -> 708ns (41.2% faster) for i in range(5): pass # --- Determinism and Idempotency --- def test_determinism_multiple_runs(): # Running the function twice with same input should yield same output encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) elements = [Element(text="Deterministic")] embeddings = [[7, 8, 9]] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result1 = codeflash_output # 583ns -> 500ns (16.6% faster) # Reset embeddings elements[0].embeddings = None codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result2 = codeflash_output # 291ns -> 250ns (16.4% faster) def test_idempotency_overwrites_embedding(): # Running again overwrites previous embedding encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig()) element = Element(text="Test", embeddings=[0]) encoder._add_embeddings_to_elements([element], [[1, 2, 3]]) # 542ns -> 500ns (8.40% faster) encoder._add_embeddings_to_elements([element], [[4, 5, 6]]) # 291ns -> 291ns (0.000% faster) # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. 
``` ```python from dataclasses import dataclass from typing import Any # imports import pytest # used for our unit tests from unstructured.embed.vertexai import VertexAIEmbeddingEncoder # Simulate the Element class for testing @DataClass class Element: text: str embeddings: Any = None # Simulate the BaseEmbeddingEncoder and VertexAIEmbeddingConfig for testing class BaseEmbeddingEncoder: pass @DataClass class VertexAIEmbeddingConfig: pass # unit tests # ----------- BASIC TEST CASES ----------- def test_add_embeddings_basic_single_element(): # Test with one element and one embedding encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="hello")] embeddings = [[0.1, 0.2, 0.3]] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 541ns -> 500ns (8.20% faster) def test_add_embeddings_basic_multiple_elements(): # Test with multiple elements and embeddings encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="a"), Element(text="b"), Element(text="c")] embeddings = [[1], [2], [3]] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 791ns -> 583ns (35.7% faster) for i, element in enumerate(result): pass def test_add_embeddings_basic_empty_lists(): # Test with empty elements and embeddings encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [] embeddings = [] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 416ns -> 416ns (0.000% faster) def test_add_embeddings_basic_varied_embedding_types(): # Test with embeddings of different types (float, int, str) encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="x"), Element(text="y"), Element(text="z")] embeddings = [[0.1, 0.2], [1, 2], ["a", "b"]] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 
750ns -> 583ns (28.6% faster) for i, element in enumerate(result): pass # ----------- EDGE TEST CASES ----------- def test_add_embeddings_length_mismatch_raises(): # Test that length mismatch raises AssertionError encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="a"), Element(text="b")] embeddings = [[1, 2, 3]] # Only one embedding with pytest.raises(AssertionError): encoder._add_embeddings_to_elements(elements, embeddings) # 500ns -> 500ns (0.000% faster) def test_add_embeddings_elements_with_existing_embeddings(): # Test that existing embeddings are overwritten encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="a", embeddings=[9, 9]), Element(text="b", embeddings=[8, 8])] embeddings = [[1, 2], [3, 4]] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 833ns -> 625ns (33.3% faster) for i, element in enumerate(result): pass def test_add_embeddings_none_embeddings(): # Test with None as embedding values encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="a"), Element(text="b")] embeddings = [None, None] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 708ns -> 583ns (21.4% faster) for element in result: pass def test_add_embeddings_elements_are_mutated_in_place(): # Test that the original elements are mutated (in-place) encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="a"), Element(text="b")] embeddings = [[1], [2]] encoder._add_embeddings_to_elements(elements, embeddings) # 708ns -> 542ns (30.6% faster) def test_add_embeddings_with_empty_embedding_vectors(): # Test with empty embedding vectors encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="a"), Element(text="b")] embeddings = [[], []] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = 
codeflash_output # 667ns -> 541ns (23.3% faster) for element in result: pass def test_add_embeddings_elements_are_returned_in_same_order(): # Test that the returned elements are in the same order as input encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="first"), Element(text="second"), Element(text="third")] embeddings = [[1], [2], [3]] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 750ns -> 583ns (28.6% faster) def test_add_embeddings_embedded_elements_are_same_objects(): # Test that returned elements are the same objects as input (not copies) encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) elements = [Element(text="a"), Element(text="b")] embeddings = [[1], [2]] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 708ns -> 542ns (30.6% faster) for orig, returned in zip(elements, result): pass # ----------- LARGE SCALE TEST CASES ----------- def test_add_embeddings_large_scale_100_elements(): # Test with 100 elements and embeddings encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) count = 100 elements = [Element(text=f"elem{i}") for i in range(count)] embeddings = [[float(i)] * 10 for i in range(count)] # 10-dim embeddings codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 6.62μs -> 3.54μs (87.1% faster) for i in range(count): pass def test_add_embeddings_large_scale_999_elements(): # Test with 999 elements and embeddings (near upper limit) encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) count = 999 elements = [Element(text=f"e{i}") for i in range(count)] embeddings = [[i] for i in range(count)] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 62.2μs -> 31.5μs (97.9% faster) for i in range(count): pass def test_add_embeddings_large_scale_embedding_size_variation(): # 
Test with large number of elements and variable embedding sizes encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) count = 500 elements = [Element(text=f"t{i}") for i in range(count)] embeddings = [[float(i)] * (i % 10 + 1) for i in range(count)] codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 31.1μs -> 16.0μs (94.5% faster) for i in range(count): pass def test_add_embeddings_large_scale_performance(): # Test that function completes in reasonable time for large input import time encoder = VertexAIEmbeddingEncoder(VertexAIEmbeddingConfig()) count = 500 elements = [Element(text=str(i)) for i in range(count)] embeddings = [[i] * 5 for i in range(count)] start = time.time() codeflash_output = encoder._add_embeddings_to_elements(elements, embeddings) result = codeflash_output # 30.8μs -> 16.0μs (92.7% faster) end = time.time() for i in range(count): pass # codeflash_output is used to check that the output of the original code is the same as that of the optimized code. ``` </details> To edit these changes `git checkout codeflash/optimize-VertexAIEmbeddingEncoder._add_embeddings_to_elements-mje14as7` and push. [](https://codeflash.ai)  --------- Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com> Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: Alan Bertl <alan@unstructured.io>
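A minimal sketch of the `enumerate()`-to-`zip()` change described above. The `Element` stub and both function bodies here are illustrative assumptions, not the library's actual code:

```python
# Illustrative sketch (assumed shapes, not unstructured's real implementation).
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Element:
    text: str
    embeddings: Optional[Any] = None


def add_embeddings_original(elements: List[Element], embeddings: List[Any]) -> List[Element]:
    # Before: index into `embeddings` on every pass and grow an intermediate list.
    assert len(elements) == len(embeddings)
    elements_w_embedding = []
    for i, element in enumerate(elements):
        element.embeddings = embeddings[i]  # bounds-checked index lookup each time
        elements_w_embedding.append(element)
    return elements_w_embedding


def add_embeddings_optimized(elements: List[Element], embeddings: List[Any]) -> List[Element]:
    # After: zip() hands each element its embedding directly -- no indexing,
    # no intermediate list. Mutation is in place, so return the input list.
    assert len(elements) == len(embeddings)
    for element, embedding in zip(elements, embeddings):
        element.embeddings = embedding
    return elements


elems = [Element(text=f"t{i}") for i in range(3)]
out = add_embeddings_optimized(elems, [[1], [2], [3]])
print(out is elems, [e.embeddings for e in out])
```

Because `zip` stops at the shorter input, the length `assert` is what preserves the mismatched-length failure mode the tests above rely on.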
<!-- CODEFLASH_OPTIMIZATION:
{"function":"ngrams","file":"unstructured/utils.py","speedup_pct":"188%","speedup_x":"1.88x","original_runtime":"6.12
milliseconds","best_runtime":"2.13
milliseconds","optimization_type":"loop","timestamp":"2026-01-01T04:37:11.183Z","version":"1.0"}
-->
#### 📄 188% (1.88x) speedup for ***`ngrams` in
`unstructured/utils.py`***
⏱️ Runtime : **`6.12 milliseconds`** **→** **`2.13 milliseconds`** (best
of `138` runs)
#### 📝 Explanation and details
The optimized code achieves a **187% speedup** by replacing nested loops
with Python's efficient list slicing and comprehension. Here's why it's
faster:
## Key Optimizations
**1. List Comprehension vs Nested Loops**
- **Original**: Uses nested loops with individual element appends
(`ngram.append(s[i + j])`) - this creates and grows a temporary list
`ngram` for each n-gram, then converts it to a tuple
- **Optimized**: Uses list slicing `s[i:i+n]` which is implemented in C
and directly creates the subsequence in one operation
**2. Eliminated Redundant Operations**
The line profiler shows the original code spends:
- 35% of time in the inner loop iteration (`for j in range(n)`)
- 37% of time appending elements (`ngram.append(s[i + j])`)
- 12.5% converting lists to tuples (`tuple(ngram)`)
The optimized version eliminates all this overhead by extracting the
slice and converting it to a tuple in a single expression.
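The rewrite can be sketched as a before/after pair; this is a sketch under the assumption that the original used the nested-loop shape the profiler output describes, and the function names are illustrative:

```python
# Illustrative before/after for the ngrams optimization (assumed shapes).
from typing import Any, List, Tuple


def ngrams_original(s: List[Any], n: int) -> List[Tuple[Any, ...]]:
    # Nested loops: grow a temporary list per n-gram, then convert it to a tuple.
    result = []
    for i in range(len(s) - n + 1):
        ngram = []
        for j in range(n):
            ngram.append(s[i + j])
        result.append(tuple(ngram))
    return result


def ngrams_optimized(s: List[Any], n: int) -> List[Tuple[Any, ...]]:
    # C-level slicing builds each subsequence in one operation.
    if n <= 0:
        # range(n) with n <= 0 yields no inner iterations in the original,
        # so each position contributes an empty tuple.
        return [() for _ in range(len(s) - n + 1)]
    return [tuple(s[i:i + n]) for i in range(len(s) - n + 1)]


print(ngrams_optimized(["the", "quick", "brown", "fox"], 2))
```

Note the inner work per n-gram drops from `n` appends plus a list-to-tuple conversion to a single slice, which is why the speedup grows with `n`.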
## Performance Impact by Context
The function is called in `calculate_shared_ngram_percentage()` which
operates on split text strings. This is likely used for text similarity
analysis. The optimization particularly benefits:
- **Large n-grams**: When `n` is large (e.g., `n=1000`), the speedup
reaches **1394%** because the original code's inner loop overhead scales
with `n`, while slicing remains constant time
- **Many n-grams**: For lists with 1000 elements and `n=2-3`, speedup is
**181-234%** because the outer loop runs many times
- **Hot paths**: Since this is used in text similarity calculations,
it's likely called frequently on document chunks, making even the 5-20%
gains on small inputs meaningful
## Edge Case Handling
The optimized code adds explicit handling for `n <= 0`:
- Returns empty tuples for each position when `n <= 0`, matching the
original behavior where `range(n)` with negative `n` produces no
iterations
- This is 7-9% faster for edge cases while maintaining correctness
## Test Results Summary
- **Small inputs** (3-10 elements): 5-40% faster
- **Medium inputs** (100-500 elements): 132-354% faster
- **Large inputs** (1000 elements): 181-1394% faster depending on `n`
- **Edge cases** (empty lists, `n > len`): Some are 25-30% slower due to
the empty list comprehension overhead, but these are rare cases with
negligible absolute time impact (<3μs)
The optimization trades slightly slower edge case performance for
dramatically better typical case performance, which is the right
tradeoff given the function's usage pattern in text processing.
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **58 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | ✅ **1 Passed** |
|📊 Tests Coverage | 100.0% |
<details>
<summary>🌀 Click to see Generated Regression Tests</summary>
```python
from __future__ import annotations
# imports
from unstructured.utils import ngrams
# unit tests
# -------------------- BASIC TEST CASES --------------------
def test_ngrams_basic_unigram():
# Test with n=1 (unigram)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, 1)
result = codeflash_output # 4.39μs -> 4.17μs (5.30% faster)
def test_ngrams_basic_bigram():
# Test with n=2 (bigram)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.29μs -> 4.00μs (7.46% faster)
def test_ngrams_basic_trigram():
# Test with n=3 (trigram)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, 3)
result = codeflash_output # 3.75μs -> 3.77μs (0.531% slower)
def test_ngrams_basic_typical_sentence():
# Test with a typical sentence split into words
s = ["the", "quick", "brown", "fox", "jumps"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 5.16μs -> 4.33μs (19.1% faster)
def test_ngrams_basic_full_ngram():
# Test where n equals the length of the list
s = ["a", "b", "c", "d"]
codeflash_output = ngrams(s, 4)
result = codeflash_output # 3.88μs -> 3.81μs (1.63% faster)
# -------------------- EDGE TEST CASES --------------------
def test_ngrams_empty_list():
# Test with an empty list
s = []
codeflash_output = ngrams(s, 2)
result = codeflash_output # 1.98μs -> 2.82μs (29.9% slower)
def test_ngrams_n_zero():
# Test with n=0, should return empty list (no 0-grams)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, 0)
result = codeflash_output # 3.93μs -> 3.67μs (7.20% faster)
def test_ngrams_n_negative():
# Test with negative n, should return empty list (no negative n-grams)
s = ["a", "b", "c"]
codeflash_output = ngrams(s, -1)
result = codeflash_output # 4.16μs -> 3.84μs (8.27% faster)
def test_ngrams_n_greater_than_len():
# Test with n greater than the length of the list
s = ["a", "b"]
codeflash_output = ngrams(s, 3)
result = codeflash_output # 2.03μs -> 2.80μs (27.4% slower)
def test_ngrams_n_equals_zero_and_empty_list():
# Test with n=0 and empty list
s = []
codeflash_output = ngrams(s, 0)
result = codeflash_output # 3.18μs -> 3.49μs (8.90% slower)
def test_ngrams_list_of_length_one():
# Test with a single element list and n=1
s = ["a"]
codeflash_output = ngrams(s, 1)
result = codeflash_output # 3.47μs -> 3.79μs (8.57% slower)
def test_ngrams_list_of_length_one_n_greater():
# Test with a single element list and n>1
s = ["a"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 2.08μs -> 2.76μs (24.5% slower)
def test_ngrams_non_ascii_characters():
# Test with non-ASCII and unicode characters
s = ["你好", "世界", "😊"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.23μs -> 4.02μs (5.07% faster)
def test_ngrams_repeated_elements():
# Test with repeated elements in the list
s = ["a", "a", "a", "a"]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.78μs -> 4.26μs (12.3% faster)
def test_ngrams_with_empty_strings():
# Test with empty strings as elements
s = ["", "a", ""]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.24μs -> 4.04μs (4.85% faster)
def test_ngrams_with_mixed_types_raises():
# Test with non-string elements should raise TypeError in type-checked code, but function as written does not check
s = ["a", 1, None]
# The function will not error, but let's check that output matches tuple of elements
codeflash_output = ngrams(s, 2)
result = codeflash_output # 4.30μs -> 4.07μs (5.66% faster)
def test_ngrams_large_n_and_empty_list():
# Test with very large n and empty list
s = []
codeflash_output = ngrams(s, 100)
result = codeflash_output # 2.22μs -> 2.94μs (24.5% slower)
# -------------------- LARGE SCALE TEST CASES --------------------
def test_ngrams_large_input_unigram():
# Test with a large list and n=1 (should return all elements as singletons)
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 1)
result = codeflash_output # 372μs -> 157μs (136% faster)
def test_ngrams_large_input_bigram():
# Test with a large list and n=2 (should return len(s)-1 bigrams)
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 2)
result = codeflash_output # 457μs -> 162μs (181% faster)
def test_ngrams_large_input_trigram():
# Test with a large list and n=3
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 3)
result = codeflash_output # 541μs -> 162μs (234% faster)
def test_ngrams_large_input_n_equals_length():
# Test with a large list and n equals the list length
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 1000)
result = codeflash_output # 99.9μs -> 8.80μs (1035% faster)
def test_ngrams_large_input_n_greater_than_length():
# Test with a large list and n greater than the list length
s = [str(i) for i in range(1000)]
codeflash_output = ngrams(s, 1001)
result = codeflash_output # 1.71μs -> 2.42μs (29.6% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
# imports
import pytest # used for our unit tests
from unstructured.utils import ngrams
# unit tests
class TestNgramsBasic:
"""Basic test cases for normal operating conditions"""
def test_bigrams_simple_sentence(self):
# Test generating bigrams (n=2) from a simple sentence
words = ["the", "quick", "brown", "fox"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.63μs -> 4.23μs (9.43% faster)
expected = [("the", "quick"), ("quick", "brown"), ("brown", "fox")]
def test_trigrams_simple_sentence(self):
# Test generating trigrams (n=3) from a simple sentence
words = ["I", "love", "to", "code"]
codeflash_output = ngrams(words, 3)
result = codeflash_output # 4.37μs -> 4.03μs (8.36% faster)
expected = [("I", "love", "to"), ("love", "to", "code")]
def test_unigrams(self):
# Test generating unigrams (n=1), should return each word as a single-element tuple
words = ["hello", "world"]
codeflash_output = ngrams(words, 1)
result = codeflash_output # 4.04μs -> 4.00μs (1.000% faster)
expected = [("hello",), ("world",)]
def test_fourgrams(self):
# Test generating 4-grams from a longer sequence
words = ["a", "b", "c", "d", "e", "f"]
codeflash_output = ngrams(words, 4)
result = codeflash_output # 5.29μs -> 4.26μs (24.0% faster)
expected = [("a", "b", "c", "d"), ("b", "c", "d", "e"), ("c", "d", "e", "f")]
def test_single_word_list_unigram(self):
# Test with a single word and n=1
words = ["hello"]
codeflash_output = ngrams(words, 1)
result = codeflash_output # 3.31μs -> 3.75μs (11.7% slower)
expected = [("hello",)]
def test_exact_length_match(self):
# Test when n equals the length of the list (should return one n-gram)
words = ["one", "two", "three"]
codeflash_output = ngrams(words, 3)
result = codeflash_output # 3.64μs -> 3.74μs (2.65% slower)
expected = [("one", "two", "three")]
def test_numeric_strings(self):
# Test with numeric strings to ensure type handling
words = ["1", "2", "3", "4", "5"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 5.03μs -> 4.26μs (17.9% faster)
expected = [("1", "2"), ("2", "3"), ("3", "4"), ("4", "5")]
def test_special_characters(self):
# Test with special characters and punctuation
words = ["Hello", ",", "world", "!", "How", "are", "you", "?"]
codeflash_output = ngrams(words, 3)
result = codeflash_output # 6.60μs -> 4.74μs (39.3% faster)
expected = [
("Hello", ",", "world"),
(",", "world", "!"),
("world", "!", "How"),
("!", "How", "are"),
("How", "are", "you"),
("are", "you", "?"),
]
class TestNgramsEdgeCases:
"""Edge cases and unusual conditions"""
def test_empty_list(self):
# Test with an empty list, should return empty list
words = []
codeflash_output = ngrams(words, 2)
result = codeflash_output # 1.91μs -> 2.74μs (30.3% slower)
expected = []
def test_n_greater_than_list_length(self):
# Test when n is greater than the list length, should return empty list
words = ["one", "two"]
codeflash_output = ngrams(words, 5)
result = codeflash_output # 1.94μs -> 2.76μs (29.6% slower)
expected = []
def test_n_equals_zero(self):
# Test with n=0, should return empty list (no 0-grams possible)
words = ["a", "b", "c"]
codeflash_output = ngrams(words, 0)
result = codeflash_output # 3.82μs -> 3.51μs (8.68% faster)
expected = []
def test_n_negative(self):
# Test with negative n, should return empty list
words = ["a", "b", "c"]
codeflash_output = ngrams(words, -1)
result = codeflash_output # 3.99μs -> 3.65μs (9.31% faster)
expected = []
def test_very_large_n(self):
# Test with very large n value, much greater than list length
words = ["a", "b"]
codeflash_output = ngrams(words, 1000)
result = codeflash_output # 2.09μs -> 2.80μs (25.5% slower)
expected = []
def test_empty_strings_in_list(self):
# Test with empty strings as elements
words = ["", "hello", "", "world", ""]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 5.21μs -> 4.36μs (19.6% faster)
expected = [("", "hello"), ("hello", ""), ("", "world"), ("world", "")]
def test_whitespace_strings(self):
# Test with whitespace-only strings
words = [" ", " ", " ", " "]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.68μs -> 4.19μs (11.6% faster)
expected = [(" ", " "), (" ", " "), (" ", " ")]
def test_duplicate_consecutive_words(self):
# Test with duplicate consecutive words
words = ["the", "the", "the", "end"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.73μs -> 4.21μs (12.3% faster)
expected = [("the", "the"), ("the", "the"), ("the", "end")]
def test_unicode_characters(self):
# Test with unicode characters
words = ["hello", "世界", "🌍", "مرحبا"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.69μs -> 4.19μs (11.9% faster)
expected = [("hello", "世界"), ("世界", "🌍"), ("🌍", "مرحبا")]
def test_very_long_strings(self):
# Test with very long individual strings
long_string = "a" * 10000
words = [long_string, "short", long_string]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.16μs -> 4.08μs (2.04% faster)
expected = [(long_string, "short"), ("short", long_string)]
def test_single_element_list_bigram(self):
# Test with single element list and n=2, should return empty
words = ["alone"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 1.94μs -> 2.79μs (30.5% slower)
expected = []
def test_two_elements_trigram(self):
# Test with two elements and n=3, should return empty
words = ["one", "two"]
codeflash_output = ngrams(words, 3)
result = codeflash_output # 1.91μs -> 2.79μs (31.5% slower)
expected = []
def test_result_is_list_of_tuples(self):
# Verify the result is a list and contains tuples
words = ["a", "b", "c"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.16μs -> 4.09μs (1.76% faster)
def test_tuples_are_immutable(self):
# Verify that returned tuples are truly tuples (immutable)
words = ["x", "y", "z"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.19μs -> 3.94μs (6.43% faster)
# Try to modify a tuple (should raise TypeError)
with pytest.raises(TypeError):
result[0][0] = "modified"
def test_original_list_unchanged(self):
# Verify the original list is not modified
words = ["a", "b", "c", "d"]
original_copy = words.copy()
ngrams(words, 2) # 4.68μs -> 4.12μs (13.7% faster)
def test_mixed_case_sensitivity(self):
# Test that function preserves case
words = ["Hello", "WORLD", "hello", "world"]
codeflash_output = ngrams(words, 2)
result = codeflash_output # 4.60μs -> 4.16μs (10.6% faster)
expected = [("Hello", "WORLD"), ("WORLD", "hello"), ("hello", "world")]
class TestNgramsLargeScale:
"""Large scale tests for performance and scalability"""
def test_large_list_bigrams(self):
# Test with a large list (1000 elements) generating bigrams
words = [f"word{i}" for i in range(1000)]
codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 456μs -> 162μs (181% faster)

    def test_large_list_small_n(self):
        # Test with large list and small n value
        words = [f"token{i}" for i in range(500)]
        codeflash_output = ngrams(words, 3)
        result = codeflash_output  # 267μs -> 81.6μs (228% faster)

    def test_large_list_large_n(self):
        # Test with large list and large n value
        words = [f"item{i}" for i in range(100)]
        codeflash_output = ngrams(words, 50)
        result = codeflash_output  # 235μs -> 19.5μs (1108% faster)

    def test_large_n_value_unigrams(self):
        # Test with large list generating unigrams (should be fast)
        words = [f"element{i}" for i in range(1000)]
        codeflash_output = ngrams(words, 1)
        result = codeflash_output  # 372μs -> 160μs (132% faster)

    def test_maximum_size_ngram(self):
        # Test generating an n-gram that spans almost the entire list
        words = [f"w{i}" for i in range(100)]
        codeflash_output = ngrams(words, 99)
        result = codeflash_output  # 21.2μs -> 4.66μs (354% faster)

    def test_many_small_ngrams(self):
        # Test generating many small n-grams from a large list
        words = [chr(65 + (i % 26)) for i in range(1000)]  # A-Z repeated
        codeflash_output = ngrams(words, 2)
        result = codeflash_output  # 454μs -> 160μs (183% faster)
        # Verify structure is maintained
        for i, ngram in enumerate(result):
            pass

    def test_repeated_pattern_large_scale(self):
        # Test with repeated pattern in large list
        pattern = ["a", "b", "c"]
        words = pattern * 333  # 999 elements
        codeflash_output = ngrams(words, 3)
        result = codeflash_output  # 544μs -> 163μs (234% faster)
        # Every third n-gram should be ("a", "b", "c")
        for i in range(0, len(result), 3):
            if i < len(result):
                pass

    def test_all_unique_elements_large(self):
        # Test with all unique elements in a large list
        words = [f"unique_{i}_{j}" for i in range(10) for j in range(100)]
        codeflash_output = ngrams(words, 5)
        result = codeflash_output  # 762μs -> 172μs (343% faster)

    def test_memory_efficiency_check(self):
        # Test that function doesn't create excessive intermediate structures
        # by verifying output size is proportional to input
        words = [f"mem{i}" for i in range(500)]
        codeflash_output = ngrams(words, 10)
        result = codeflash_output  # 607μs -> 95.3μs (537% faster)

    def test_boundary_conditions_large_list(self):
        # Test boundary conditions with large list
        words = [f"boundary{i}" for i in range(1000)]
        # n = 1 (minimum meaningful n)
        codeflash_output = ngrams(words, 1)
        result_1 = codeflash_output  # 372μs -> 158μs (135% faster)
        # n = 1000 (equals list length)
        codeflash_output = ngrams(words, 1000)
        result_1000 = codeflash_output  # 98.7μs -> 6.61μs (1394% faster)
        # n = 1001 (exceeds list length)
        codeflash_output = ngrams(words, 1001)
        result_1001 = codeflash_output  # 724ns -> 1.05μs (30.9% slower)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
```python
from unstructured.utils import ngrams
def test_ngrams():
    ngrams([""], 1)
```
</details>
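For context, the sliding-window helper these regression tests exercise can be sketched as follows. This is a minimal illustration only, assuming the list-slicing form suggested by the tests; it is not necessarily `unstructured`'s exact implementation:

```python
from typing import Any, List, Tuple


def ngrams(words: List[Any], n: int) -> List[Tuple[Any, ...]]:
    # Collect every contiguous window of length n as a tuple.
    # When n exceeds len(words), the range is empty and [] is returned,
    # matching the boundary-condition tests above.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For example, `ngrams(["a", "b", "c"], 2)` yields `[("a", "b"), ("b", "c")]`, and `ngrams(["x"], 5)` yields `[]`.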
<details>
<summary>🔎 Click to see Concolic Coverage Tests</summary>
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `codeflash_concolic_ph7c2wr0/tmphq_b3i1a/test_concolic_coverage.py::test_ngrams` | 292μs | 292μs | 0.101%✅ |
</details>
To edit these changes, `git checkout codeflash/optimize-ngrams-mjuye5a2` and push.
[](https://codeflash.ai)

---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
**qued** approved these changes on Jan 8, 2026.
…ed-IO#4176)

#### 📄 8% (0.08x) speedup for `stage_for_datasaur` in `unstructured/staging/datasaur.py`

⏱️ Runtime: `1.69 milliseconds` → `1.56 milliseconds` (best of `250` runs)

#### 📝 Explanation and details

The optimization replaces the explicit loop-based result construction with a **list comprehension**. This change eliminates the intermediate `result` list initialization and the repeated `append()` operations.

**Key changes:**
- Removed the `result: List[Dict[str, Any]] = []` initialization
- Replaced the `for i, item in enumerate(elements):` loop with a single list comprehension: `return [{"text": item.text, "entities": _entities[i]} for i, item in enumerate(elements)]`
- Eliminated multiple `result.append(data)` calls

**Why this is faster:** List comprehensions in Python are implemented in C and execute significantly faster than equivalent explicit loops with append operations. The optimization eliminates the overhead of:
- Creating an empty list and growing it incrementally
- Multiple function calls to `append()`
- Temporary variable assignment (`data`)

**Performance characteristics:** The profiler shows this optimization is most effective for larger datasets; the annotated tests demonstrate **18-20% speedup** for 1000+ elements, while smaller datasets see modest gains or slight overhead due to the comprehension setup cost. The optimization delivers consistent **6-10% improvements** for medium-scale workloads (500+ elements with entities).

**Impact on workloads:** This optimization will benefit any application processing substantial amounts of text data for Datasaur formatting, particularly document processing pipelines or batch entity annotation workflows where hundreds or thousands of text elements are processed together.

✅ **Correctness verification report:**

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | ✅ **6 Passed** |
| 🌀 Generated Regression Tests | ✅ **37 Passed** |
| ⏪ Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | ✅ **3 Passed** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `staging/test_datasaur.py::test_datasaur_raises_with_bad_type` | 2.67μs | 2.50μs | 6.64%✅ |
| `staging/test_datasaur.py::test_datasaur_raises_with_missing_entity_text` | 1.04μs | 1.04μs | -0.096%⚠️ |
| `staging/test_datasaur.py::test_datasaur_raises_with_missing_key` | 2.08μs | 1.96μs | 6.33%✅ |
| `staging/test_datasaur.py::test_datasaur_raises_with_wrong_length` | 1.08μs | 1.04μs | 4.03%✅ |
| `staging/test_datasaur.py::test_stage_for_datasaur` | 1.29μs | 1.33μs | -3.08%⚠️ |
| `staging/test_datasaur.py::test_stage_for_datasaur_with_entities` | 2.50μs | 2.46μs | 1.67%✅ |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
# imports
import pytest

from unstructured.staging.datasaur import stage_for_datasaur


# Mock class for Text, as per unstructured.documents.elements.Text
class Text:
    def __init__(self, text: str):
        self.text = text


# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------

def test_single_element_no_entities():
    # Single Text element, no entities
    elements = [Text("hello world")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.12μs -> 1.25μs (10.0% slower)

def test_multiple_elements_no_entities():
    # Multiple Text elements, no entities
    elements = [Text("a"), Text("b"), Text("c")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.38μs -> 1.38μs (0.000% faster)
    for i, letter in enumerate(["a", "b", "c"]):
        pass

def test_single_element_with_single_entity():
    # Single element, one entity
    elements = [Text("hello world")]
    entities = [[{"text": "hello", "type": "GREETING", "start_idx": 0, "end_idx": 5}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.04μs -> 2.04μs (0.000% faster)

def test_multiple_elements_with_entities():
    # Multiple elements, each with entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3}],
        [{"text": "qux", "type": "NOUN", "start_idx": 4, "end_idx": 7}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.50μs -> 2.58μs (3.21% slower)

def test_elements_with_mixed_entities():
    # Some elements have entities, some do not
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [[], [{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.08μs -> 2.08μs (0.000% faster)

# ---------------------------
# Edge Test Cases
# ---------------------------

def test_empty_elements_list():
    # Empty input list
    elements = []
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 791ns -> 875ns (9.60% slower)

def test_entities_length_mismatch():
    # entities list length does not match elements length
    elements = [Text("foo"), Text("bar")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 916ns -> 875ns (4.69% faster)

def test_entity_missing_key():
    # Entity is missing a required key
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0}]]  # missing 'end_idx'
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 1.83μs -> 1.75μs (4.74% faster)

def test_entity_wrong_type():
    # Entity has wrong type for a key
    elements = [Text("foo")]
    entities = [
        [{"text": "foo", "type": "NOUN", "start_idx": "0", "end_idx": 3}]
    ]  # 'start_idx' should be int
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 2.42μs -> 2.33μs (3.60% faster)

def test_entity_extra_keys():
    # Entity has extra keys (should not error)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 3, "confidence": 0.99}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.00μs -> 2.04μs (2.01% slower)

def test_entities_is_none():
    # entities explicitly passed as None
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements, None)
    result = codeflash_output  # 1.04μs -> 1.08μs (3.79% slower)

def test_entity_empty_list():
    # entities is a list of empty lists (should be valid)
    elements = [Text("foo"), Text("bar")]
    entities = [[], []]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.42μs -> 1.50μs (5.60% slower)

def test_entity_text_not_matching_element():
    # Entity text does not match element text (should not error)
    elements = [Text("foobar")]
    entities = [[{"text": "baz", "type": "NOUN", "start_idx": 0, "end_idx": 3}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.00μs -> 2.00μs (0.000% faster)

def test_entity_indices_out_of_bounds():
    # Entity indices out of text bounds (should not error)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "NOUN", "start_idx": 0, "end_idx": 10}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 2.00μs (2.10% slower)

# ---------------------------
# Large Scale Test Cases
# ---------------------------

def test_large_number_of_elements():
    # Test with 1000 elements, no entities
    n = 1000
    elements = [Text(str(i)) for i in range(n)]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 102μs -> 87.0μs (18.1% faster)
    for i in range(n):
        pass

def test_large_number_of_elements_with_entities():
    # Test with 500 elements, each with one entity
    n = 500
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 244μs -> 227μs (7.83% faster)
    for i in range(n):
        pass

def test_large_number_of_entities_per_element():
    # Test with 10 elements, each with 100 entities
    elements = [Text(f"text_{i}") for i in range(10)]
    entities = [
        [{"text": f"t_{j}", "type": "TYPE", "start_idx": j, "end_idx": j + 1} for j in range(100)]
        for _ in range(10)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 356μs -> 337μs (5.73% faster)
    for i in range(10):
        for j in range(100):
            pass

# ---------------------------
# Mutation Testing Guards
# ---------------------------

def test_mutation_guard_wrong_text_key():
    # Changing the output key 'text' should fail
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.00μs -> 1.04μs (4.03% slower)

def test_mutation_guard_wrong_entities_key():
    # Changing the output key 'entities' should fail
    elements = [Text("foo")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 958ns -> 1.00μs (4.20% slower)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
# imports
import pytest

from unstructured.staging.datasaur import stage_for_datasaur


# Dummy Text class for testing, since unstructured.documents.elements.Text is not available
class Text:
    def __init__(self, text: str):
        self.text = text


# unit tests

# --------------------- Basic Test Cases ---------------------

def test_single_element_no_entities():
    # One element, no entities
    elements = [Text("hello world")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.17μs -> 1.21μs (3.47% slower)

def test_multiple_elements_no_entities():
    # Multiple elements, no entities
    elements = [Text("foo"), Text("bar"), Text("baz")]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 1.29μs -> 1.33μs (3.15% slower)

def test_single_element_with_valid_entities():
    # One element, one valid entity
    elements = [Text("hello world")]
    entities = [[{"text": "hello", "type": "GREETING", "start_idx": 0, "end_idx": 5}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.04μs -> 2.00μs (2.05% faster)

def test_multiple_elements_with_entities():
    # Multiple elements, each with their own entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3}],
        [{"text": "qux", "type": "WORD", "start_idx": 4, "end_idx": 7}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.62μs -> 2.50μs (5.00% faster)

def test_multiple_elements_some_empty_entities():
    # Multiple elements, some with no entities
    elements = [Text("foo bar"), Text("baz qux")]
    entities = [
        [],
        [{"text": "baz", "type": "WORD", "start_idx": 0, "end_idx": 3}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 2.08μs -> 2.08μs (0.048% slower)

# --------------------- Edge Test Cases ---------------------

def test_empty_elements_list():
    # No elements
    elements = []
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 750ns -> 875ns (14.3% slower)

def test_empty_elements_with_empty_entities():
    # No elements, entities is empty list
    elements = []
    entities = []
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 833ns -> 1.00μs (16.7% slower)

def test_entities_length_mismatch():
    # entities list length does not match elements list length
    elements = [Text("foo"), Text("bar")]
    entities = [[]]  # Should be length 2
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 916ns -> 875ns (4.69% faster)

def test_entity_missing_key():
    # Entity dict missing a required key
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": 0}]]  # Missing 'end_idx'
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 1.92μs -> 1.75μs (9.49% faster)

def test_entity_wrong_type():
    # Entity dict with wrong type for a key
    elements = [Text("foo")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": "zero", "end_idx": 3}]
    ]  # start_idx should be int
    with pytest.raises(ValueError) as excinfo:
        stage_for_datasaur(elements, entities)  # 2.46μs -> 2.33μs (5.36% faster)

def test_entity_extra_keys():
    # Entity dict with extra keys (should be ignored)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3, "extra": "ignored"}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 2.00μs (2.05% slower)

def test_entity_with_empty_string():
    # Entity with empty string values (should be allowed)
    elements = [Text("")]
    entities = [[{"text": "", "type": "", "start_idx": 0, "end_idx": 0}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.96μs -> 1.96μs (0.000% faster)

def test_entity_with_negative_indices():
    # Entity with negative indices (should be allowed, not validated)
    elements = [Text("foo")]
    entities = [[{"text": "foo", "type": "WORD", "start_idx": -1, "end_idx": -1}]]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 1.83μs -> 1.88μs (2.24% slower)

# --------------------- Large Scale Test Cases ---------------------

def test_large_number_of_elements_no_entities():
    # Large number of elements, no entities
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    codeflash_output = stage_for_datasaur(elements)
    result = codeflash_output  # 103μs -> 86.7μs (19.7% faster)
    for i in range(n):
        pass

def test_large_number_of_elements_with_entities():
    # Large number of elements, each with one entity
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 502μs -> 470μs (6.85% faster)
    for i in range(n):
        pass

def test_large_number_of_elements_some_with_entities():
    # Large number of elements, only even indices have entities
    n = 1000
    elements = [Text(f"text_{i}") for i in range(n)]
    entities = [
        (
            [{"text": f"text_{i}", "type": "TYPE", "start_idx": 0, "end_idx": len(f"text_{i}")}]
            if i % 2 == 0
            else []
        )
        for i in range(n)
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result = codeflash_output  # 309μs -> 282μs (9.66% faster)
    for i in range(n):
        if i % 2 == 0:
            pass
        else:
            pass

# --------------------- Determinism Test ---------------------

def test_determinism():
    # Running the function twice with the same input should yield the same result
    elements = [Text("foo"), Text("bar")]
    entities = [
        [{"text": "foo", "type": "WORD", "start_idx": 0, "end_idx": 3}],
        [{"text": "bar", "type": "WORD", "start_idx": 0, "end_idx": 3}],
    ]
    codeflash_output = stage_for_datasaur(elements, entities)
    result1 = codeflash_output  # 2.75μs -> 2.67μs (3.15% faster)
    codeflash_output = stage_for_datasaur(elements, entities)
    result2 = codeflash_output  # 1.58μs -> 1.54μs (2.66% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
import pytest

from unstructured.documents.elements import Text
from unstructured.staging.datasaur import stage_for_datasaur


def test_stage_for_datasaur():
    stage_for_datasaur(
        [
            Text(
                "",
                element_id=None,
                coordinates=None,
                coordinate_system=None,
                metadata=None,
                detection_origin="",
                embeddings=[],
            )
        ],
        entities=[[]],
    )


def test_stage_for_datasaur_2():
    with pytest.raises(
        ValueError,
        match="If\\ entities\\ is\\ specified,\\ it\\ must\\ be\\ the\\ same\\ length\\ as\\ elements\\.",
    ):
        stage_for_datasaur([], entities=[[]])


def test_stage_for_datasaur_3():
    with pytest.raises(
        ValueError,
        match="Key\\ 'text'\\ was\\ expected\\ but\\ not\\ present\\ in\\ the\\ Datasaur\\ entity\\.",
    ):
        stage_for_datasaur(
            [
                Text(
                    "",
                    element_id=None,
                    coordinates=None,
                    coordinate_system=None,
                    metadata=None,
                    detection_origin="",
                    embeddings=[0.0],
                )
            ],
            entities=[[{}, {}]],
        )
```

</details>

<details>
<summary>🔎 Concolic Coverage Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur` | 1.29μs | 1.46μs | -11.4%⚠️ |
| `codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur_2` | 916ns | 959ns | -4.48%⚠️ |
| `codeflash_concolic_e8goshnj/tmp5mzednpf/test_concolic_coverage.py::test_stage_for_datasaur_3` | 1.71μs | 1.67μs | 2.52%✅ |

</details>

To edit these changes, `git checkout codeflash/optimize-stage_for_datasaur-mjdt0e1s` and push.

[](https://codeflash.ai)



---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Labels: ⚡️ codeflash (Optimization PR opened by Codeflash AI) · 🎯 Quality: High (Optimization Quality according to Codeflash)
#### 📄 353% (3.53x) speedup for `_get_bbox_to_page_ratio` in `unstructured/partition/pdf_image/analysis/bbox_visualisation.py`

⏱️ Runtime: `930 microseconds` → `205 microseconds` (best of `250` runs)

#### 📝 Explanation and details
The optimization applies Numba's Just-In-Time (JIT) compilation using the `@njit(cache=True)` decorator to dramatically speed up this mathematical computation function.

**Key changes:**
- Added the `from numba import njit` import
- Added the `@njit(cache=True)` decorator to the function

**Why this leads to a speedup:** Numba compiles Python bytecode to optimized machine code at runtime, eliminating Python's interpreter overhead for numerical computations. The function performs several floating-point operations (`math.sqrt`, exponentiation, arithmetic) that benefit significantly from native machine code execution. The `cache=True` parameter ensures the compiled version is cached for subsequent calls, avoiding recompilation overhead.

**Performance characteristics:**

**Impact on workloads:** Based on `function_references`, this function is called from `_get_optimal_value_for_bbox()`, which suggests it's used in document analysis pipelines where bounding box calculations are performed repeatedly. The substantial speedup will be particularly beneficial when processing documents with many bounding boxes, as demonstrated by the large-scale test cases showing 300%+ improvements when processing thousands of bboxes.

**Optimization effectiveness:** Most effective for computational workloads with repeated calls to this function, especially when processing large documents or batch operations where the function is called hundreds or thousands of times.
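The decorator pattern described above can be sketched as follows. The function body and argument list here are hypothetical (a diagonal-ratio formula is assumed for illustration; the PR does not show `_get_bbox_to_page_ratio`'s actual math), and a no-op fallback is included so the sketch runs even where Numba is not installed:

```python
import math

try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:
    # No-op fallback so the sketch still runs without numba installed.
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]

        def wrap(func):
            return func

        return wrap


@njit(cache=True)  # cache=True persists the compiled code between runs
def bbox_to_page_ratio(x1, y1, x2, y2, page_w, page_h):
    # Hypothetical ratio: bbox diagonal relative to the page diagonal.
    bbox_diag = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
    page_diag = math.sqrt(page_w ** 2 + page_h ** 2)
    return bbox_diag / page_diag
```

Note that the first call pays a one-time compilation cost; the speedup applies to subsequent calls, which is why JIT compilation pays off mainly for functions invoked many times per document.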
✅ **Correctness verification report:**

🌀 Generated Regression Tests and Runtime

To edit these changes, `git checkout codeflash/optimize-_get_bbox_to_page_ratio-mjdkzmao` and push.