perf: speed up standardize_quotes with str.translate()#4314

Merged
cragwolfe merged 11 commits into main from codeflash/optimize-standardize_quotes-mklcp188
Apr 3, 2026

@KRRT7 KRRT7 commented Apr 2, 2026

Summary

  • Replace per-character regex with a precomputed str.maketrans() + str.translate() table for standardize_quotes
  • Covers all 36 Unicode fancy-quote codepoints (double + single) from the original regex
  • Adds a benchmark (test_unstructured/benchmarks/) to track standardize_quotes performance

Benchmark

Azure Standard_D8s_v5 — 8 vCPU Intel Xeon Platinum 8473C, 32 GiB RAM, Python 3.12.12

test_benchmark_standardize_quotes

| | Min | Median | Mean | OPS | Rounds |
|---|---|---|---|---|---|
| 6ada488f6c28 (base) | 139.4μs | 175.3μs | 177.2μs | 5.64 Kops/s | 6,137 |
| 8929336e66aa (head) | 86.6μs | 119.7μs | 119.5μs | 8.37 Kops/s | 11,900 |
| Speedup | 1.61x | 1.46x | 1.48x | 1.48x | |

| Function | base | head | Delta | Speedup |
|---|---|---|---|---|
| standardize_quotes | 112.0μs | 50.0μs | ██████░░░░ -55% | 2.24x |

Generated by codeflash agent
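
As a sanity check on the tables above, the speedup and delta columns can be recomputed from the raw timings. This is a minimal sketch: the variable names are illustrative, and the numbers are copied from the benchmark output rather than measured again.

```python
# Timings in microseconds, copied from the benchmark tables above.
base = {"min": 139.4, "median": 175.3, "mean": 177.2}
head = {"min": 86.6, "median": 119.7, "mean": 119.5}

# Speedup is base time divided by head time (higher means head is faster).
speedup = {key: round(base[key] / head[key], 2) for key in base}

# Per-function profile numbers for standardize_quotes.
func_base, func_head = 112.0, 50.0
func_speedup = round(func_base / func_head, 2)
func_delta_pct = round((func_head - func_base) / func_base * 100)

print(speedup, func_speedup, func_delta_pct)
```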

Reproduce the benchmark locally
# Full comparison with codeflash compare:
uv run codeflash compare 6ada488f6c28 8929336e66aa \
  --inject test_unstructured/benchmarks/test_benchmark_standardize_quotes.py \
  --inject test_unstructured/benchmarks/__init__.py \
  --inject pyproject.toml

# Or manually with pytest-benchmark:
git checkout 6ada488f6c28
uv run pytest test_unstructured/benchmarks/test_benchmark_standardize_quotes.py --benchmark-save=baseline

git checkout 8929336e66aa
uv run pytest test_unstructured/benchmarks/test_benchmark_standardize_quotes.py --benchmark-compare=0001_baseline
Benchmark test source
from unstructured.metrics.text_extraction import standardize_quotes

SAMPLE_TEXTS = [
    "She said “Hello” and then whispered ‘Goodbye’ before leaving.",
    "„To be, or not to be, that is the question” - Shakespeare’s famous quote.",
    "«When he said “life is beautiful,” I believed him» wrote Maria.",
    "❝Do you remember when we first met?❞ she asked with a smile.",
    "〝The meeting starts at 10:00, don’t be late!〟 announced the manager.",
    'He told me 「"This is important" yesterday」, she explained.',
    "『The sun was setting. The birds were singing. It was peaceful.』",
    "﹂Meeting #123 @ 15:00 - Don’t forget!﹁",
    "「Hello」, ❝World❞, \"Test\", 'Example', „Quote”, «Final»",
    "It’s John’s book, isn’t it?",
    '‹Testing the system’s capability for "quoted" text›',
    "❛First sentence. Second sentence. Third sentence.❜",
    "「Chapter 1」: ❝The Beginning❞ - „A new story” begins «today».",
]


def run_standardize_quotes():
    for text in SAMPLE_TEXTS:
        standardize_quotes(text)


def test_benchmark_standardize_quotes(benchmark):
    benchmark(run_standardize_quotes)

Changelog

Added entry in CHANGELOG.md under 0.22.13.

Test plan

  • Benchmarked with codeflash compare on Azure VM (Standard_D8s_v5)
  • Existing unit tests pass — standardize_quotes is a drop-in replacement
  • All 36 quote codepoints covered by the translation table

codeflash-ai bot and others added 10 commits April 2, 2026 10:36
The optimized code achieves a **144% speedup** by replacing a loop-based character replacement approach with Python's built-in `str.translate()` method using a pre-computed translation table.

## Key Optimizations

**1. Pre-computed Translation Table at Module Load**
- The quote dictionaries and translation table are now created once at module import time (module-level constants prefixed with `_`)
- Original code recreated these 40+ entry dictionaries on every function call (6.1% + 6.5% = 12.6% of runtime just for dictionary creation)
- Translation table maps Unicode codepoints directly to ASCII quote codepoints, eliminating repeated string operations

**2. Single-Pass O(n) Algorithm with `str.translate()`**
- Original: Two loops iterating through ~40 quote types, calling `unicode_to_char()` 3,096 times (67.5% of total runtime) and performing substring searches with `in` operator (5.9% of runtime)
- Optimized: Single `str.translate()` call that processes the entire string in one pass using efficient C-level implementation
- Eliminates 3,096 function calls to `unicode_to_char()` and all associated string parsing/conversion overhead

**3. Algorithmic Complexity Improvement**
- Original: O(n × m) where n = text length, m = number of quote types (~40), with repeated `text.replace()` creating new string objects
- Optimized: O(n) single pass through the text, with translation table lookups being O(1)
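
The difference between the two approaches can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the tuples hold only a small, hypothetical subset of the quote codepoints, and `standardize_loop` is a simplified stand-in for the original loop-based implementation.

```python
# Hypothetical, trimmed-down codepoint tuples; the real module covers far more.
_DOUBLE = ("\u201c", "\u201d", "\u00ab", "\u00bb")
_SINGLE = ("\u2018", "\u2019", "\u2039", "\u203a")

# Built once at import time: each quote character maps to its ASCII equivalent.
_TABLE = str.maketrans(
    {**dict.fromkeys(_DOUBLE, '"'), **dict.fromkeys(_SINGLE, "'")}
)


def standardize_loop(text: str) -> str:
    """Loop-based variant: one membership check and replace() per quote type."""
    for ch in _DOUBLE:
        if ch in text:
            text = text.replace(ch, '"')
    for ch in _SINGLE:
        if ch in text:
            text = text.replace(ch, "'")
    return text


def standardize_translate(text: str) -> str:
    """Table-based variant: a single pass over the text."""
    return text.translate(_TABLE)


sample = "\u00abShe said \u201cit\u2019s fine\u201d\u00bb"
assert standardize_loop(sample) == standardize_translate(sample)
print(standardize_translate(sample))
```

Both variants return the same string; the table-based one does the work in a single `translate()` call instead of one scan per quote type.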

## Performance Context

Based on `function_references`, this function is called from `calculate_edit_distance()`, which is likely in a **hot path** for text extraction metrics. The function processes strings before edit distance calculations, meaning:
- Any text comparison workflow will call this repeatedly
- The 144% speedup compounds when processing multiple documents or performing batch comparisons
- Reduced memory allocation pressure from eliminating repeated dictionary creation and intermediate string objects

## Test Case Insights

The test with input `"«'"` (containing both double and single quote variants) shows the optimization handles mixed quote types efficiently in a single pass, whereas the original code would iterate through all 40 quote types regardless of actual presence in the text.
…te dict keys

The quote-mapping dicts used literal quote characters as keys, but '"'/'"'/'"'
all encode as byte 0x22 and '''/'''/''' as 0x27. Python deduplicates them,
silently dropping U+201C (left double) and U+2018 (left single) before the
translation table is built. Restructure as tuples of \uXXXX escape sequences
so every codepoint is guaranteed unique.
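
The failure mode can be reproduced in a few lines. This is a hypothetical reconstruction, not the original source: it assumes the dict mapped quote characters to codepoint labels and that the curly-quote keys had been flattened to ASCII.

```python
# Hypothetical reconstruction of the bug, not the original source.
# All three keys below are the same ASCII character (0x22), so Python keeps
# a single entry and the last value written under that key wins.
broken = {
    '"': "U+0022",
    '"': "U+201C",  # meant to be the left double quote, but the key is ASCII
    '"': "U+201D",  # meant to be the right double quote, same ASCII key again
    "\u201e": "U+201E",
}
surviving = list(broken.values())
print(surviving)  # U+0022 and U+201C are gone

# Escape sequences make every codepoint explicit, so nothing can collapse.
fixed = ("\u0022", "\u201c", "\u201d", "\u201e")
assert len(set(fixed)) == len(fixed)
```

Python raises no error for repeated keys in a dict literal, which is why the loss went unnoticed.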

@KRRT7 KRRT7 left a comment

The changelog claims this fixes "a pre-existing bug where left smart quotes were never normalized due to duplicate dictionary keys," but there are no regression assertions that prove the fix works for the specific characters that were allegedly broken.

Add explicit regression assertions for U+201C (“) and U+2018 (‘), the claimed bug-fix characters, and for mixed strings containing both left and right smart quotes (e.g. "\u201cHello\u201d" → "\"Hello\"", "\u2018it\u2019s" → "'it's").

The new benchmark input in test_benchmark_standardize_quotes.py includes those characters, but it only measures runtime; it does not assert correctness. The existing test_standardize_quotes parametrized cases still do not directly cover those exact code points — I checked and neither \u201c nor \u2018 appear anywhere in the test file.

Without these assertions, the bug-fix claim is untested and could silently regress.

…ints

Add explicit tests for U+201C and U+2018 (the characters silently dropped
by duplicate dict keys in the old implementation), plus a parametrized test
that asserts every one of the 39 codepoints in the translation table maps
to its correct ASCII equivalent.
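
A minimal sketch of what such checks might look like is shown below. It is an assumption-laden illustration: the tuples are a small hypothetical subset of the table, `standardize_quotes` here is a local stand-in rather than the function from `unstructured`, and plain loops are used where the actual tests use pytest parametrization.

```python
import unicodedata

# Hypothetical subset of the codepoint tuples; the real table has 39 entries.
DOUBLE_QUOTE_CODEPOINTS = ("\u201c", "\u201d", "\u201e", "\uff02")
SINGLE_QUOTE_CODEPOINTS = ("\u2018", "\u2019", "\u201a", "\uff07")

TABLE = str.maketrans(
    {
        **dict.fromkeys(DOUBLE_QUOTE_CODEPOINTS, '"'),
        **dict.fromkeys(SINGLE_QUOTE_CODEPOINTS, "'"),
    }
)


def standardize_quotes(text: str) -> str:
    """Local stand-in for the function under test."""
    return text.translate(TABLE)


def check_every_codepoint() -> None:
    """Each codepoint must map to its ASCII equivalent on its own."""
    for ch in DOUBLE_QUOTE_CODEPOINTS:
        assert standardize_quotes(ch) == '"', unicodedata.name(ch)
    for ch in SINGLE_QUOTE_CODEPOINTS:
        assert standardize_quotes(ch) == "'", unicodedata.name(ch)


def check_left_smart_quotes_regression() -> None:
    """U+201C and U+2018 are the characters the old code silently skipped."""
    assert standardize_quotes("\u201cHello\u201d") == '"Hello"'
    assert standardize_quotes("\u2018it\u2019s") == "'it's"


check_every_codepoint()
check_left_smart_quotes_regression()
```

Passing the Unicode name as the assertion message means a failure reports which character was not normalized.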

KRRT7 commented Apr 3, 2026

Added regression tests in 8dffad1 — 58/58 pass.

Coverage report — standardize_quotes
════════════════════════════════════════════════════════

▸ Translation table (module-level)

  ✓  52 │ _TRANSLATION_TABLE = str.maketrans(
     53 │     dict.fromkeys(_DOUBLE_QUOTE_CODEPOINTS, '"') | dict.fromkeys(_SINGLE_QUOTE_CODEPOINTS, "'")
     54 │ )

▸ Function

  ✓ 214 │ def standardize_quotes(text: str) -> str:
    215 │     """
    216 │     Converts all unicode quotes to standard ASCII quotes with comprehensive coverage.
    217 │
    218 │     Args:
    219 │         text (str): The input text to be standardized.
    220 │
    221 │     Returns:
    222 │         str: The text with standardized quotes.
    223 │     """
  ✓ 224 │     return text.translate(_TRANSLATION_TABLE)

▸ Codepoint test coverage
  (which of the 39 quote codepoints are exercised by test inputs)

  Double quotes → "
  ✓  U+0022       '"'  QUOTATION MARK
  ✓  U+201C       '“'  LEFT DOUBLE QUOTATION MARK
  ✓  U+201D       '”'  RIGHT DOUBLE QUOTATION MARK
  ✓  U+201E       '„'  DOUBLE LOW-9 QUOTATION MARK
  ✓  U+201F       '‟'  DOUBLE HIGH-REVERSED-9 QUOTATION MARK
  ✓  U+00AB       '«'  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
  ✓  U+00BB       '»'  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
  ✓  U+275D       '❝'  HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
  ✓  U+275E       '❞'  HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
  ✓  U+2E42       '⹂'  DOUBLE LOW-REVERSED-9 QUOTATION MARK
  ✓  U+1F676       '🙶'  SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
  ✓  U+1F677       '🙷'  SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
  ✓  U+1F678       '🙸'  SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
  ✓  U+2826       '⠦'  BRAILLE PATTERN DOTS-236
  ✓  U+2834       '⠴'  BRAILLE PATTERN DOTS-356
  ✓  U+301D       '〝'  REVERSED DOUBLE PRIME QUOTATION MARK
  ✓  U+301E       '〞'  DOUBLE PRIME QUOTATION MARK
  ✓  U+301F       '〟'  LOW DOUBLE PRIME QUOTATION MARK
  ✓  U+FF02       '＂'  FULLWIDTH QUOTATION MARK

  Single quotes → '
  ✓  U+0027       "'"  APOSTROPHE
  ✓  U+2018       '‘'  LEFT SINGLE QUOTATION MARK
  ✓  U+2019       '’'  RIGHT SINGLE QUOTATION MARK
  ✓  U+201A       '‚'  SINGLE LOW-9 QUOTATION MARK
  ✓  U+201B       '‛'  SINGLE HIGH-REVERSED-9 QUOTATION MARK
  ✓  U+2039       '‹'  SINGLE LEFT-POINTING ANGLE QUOTATION MARK
  ✓  U+203A       '›'  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
  ✓  U+275B       '❛'  HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
  ✓  U+275C       '❜'  HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
  ✓  U+300C       '「'  LEFT CORNER BRACKET
  ✓  U+300D       '」'  RIGHT CORNER BRACKET
  ✓  U+300E       '『'  LEFT WHITE CORNER BRACKET
  ✓  U+300F       '』'  RIGHT WHITE CORNER BRACKET
  ✓  U+FE41       '﹁'  PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
  ✓  U+FE42       '﹂'  PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET
  ✓  U+FE43       '﹃'  PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
  ✓  U+FE44       '﹄'  PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
  ✓  U+FF07       '＇'  FULLWIDTH APOSTROPHE
  ✓  U+FF62       '「'  HALFWIDTH LEFT CORNER BRACKET
  ✓  U+FF63       '」'  HALFWIDTH RIGHT CORNER BRACKET

  ██████████████████████████████  🟢 39/39 codepoints (100%)

════════════════════════════════════════════════════════
✓ 58/58 tests pass (19 string-level + 39 per-codepoint)
✓ U+201C and U+2018 now explicitly asserted

@cragwolfe cragwolfe added this pull request to the merge queue Apr 3, 2026
Merged via the queue into main with commit 8929336 Apr 3, 2026
53 checks passed
@cragwolfe cragwolfe deleted the codeflash/optimize-standardize_quotes-mklcp188 branch April 3, 2026 02:50