⚡ Bolt: Optimize keyword density extraction performance #322

Open
anchapin wants to merge 1 commit into main from bolt-keyword-density-optimization-10506461077069710837


@anchapin
Owner

@anchapin anchapin commented May 26, 2026

💡 What:

  • Extracted dynamic regex lists title_patterns and company_patterns to module-level pre-compiled objects _TITLE_PATTERNS and _COMPANY_PATTERNS.
  • Extracted the dynamically allocated list common_keywords to a module-level constant _COMMON_KEYWORDS.
  • Extracted the dynamically allocated list tech_keywords to a module-level set _TECH_KEYWORDS.
  • Refactored _extract_job_details, _simple_keyword_extraction, and _suggest_sections_for_keyword to use these optimizations.
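The hoisting described in the list above can be sketched roughly as follows. The regex strings are placeholders, since the actual patterns in `cli/utils/keyword_density.py` are not shown in this description, and `extract_job_details` and `_first_match` are hypothetical stand-ins for the real `_extract_job_details` method.

```python
import re

# Hypothetical patterns, compiled once at import time.
# The real pattern strings are not part of this PR description.
_TITLE_PATTERNS = [
    re.compile(r"job title:\s*(.+)", re.IGNORECASE),
    re.compile(r"position:\s*(.+)", re.IGNORECASE),
]
_COMPANY_PATTERNS = [
    re.compile(r"company:\s*(.+)", re.IGNORECASE),
]


def _first_match(patterns, text):
    """Return the first captured group from the first pattern that matches."""
    for pattern in patterns:
        # pattern.search reuses the compiled object; no flags are passed per call.
        match = pattern.search(text)
        if match:
            return match.group(1).strip()
    return None


def extract_job_details(text):
    """Simplified stand-in for _extract_job_details (assumed shape)."""
    return {
        "title": _first_match(_TITLE_PATTERNS, text),
        "company": _first_match(_COMPANY_PATTERNS, text),
    }
```

Calling `extract_job_details("Job Title: Senior Engineer\nCompany: Acme Corp")` in this sketch walks each precompiled list in order and stops at the first hit, which mirrors the iteration the refactor describes.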

🎯 Why:
Building large lists or compiling regular expressions inline within instance methods adds avoidable work to every call, particularly when those methods run inside loops over text. Moving them to module level means they are built once at import time. Converting _TECH_KEYWORDS to a set also improves keyword membership checks from $O(N)$ to average-case $O(1)$.

📊 Impact:
Reduces processing time per density analysis run by avoiding repeated regex compilation and list allocation on every method call.

🔬 Measurement:
This can be verified by running the test suite (python -m pytest tests/test_keyword_density.py), which succeeds without regression, and observing improved average processing times across repeated calls to the KeywordDensityGenerator.
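One way to observe the difference directly is a small `timeit` comparison. This is a sketch under stated assumptions: the keyword sample is made up, and `is_tech_per_call` and `is_tech_hoisted` are hypothetical helpers that imitate the before and after shapes of the lookup rather than the actual `KeywordDensityGenerator` code. No timings are claimed here, since they depend on the machine and the size of the real collections.

```python
import timeit

# Hypothetical keyword sample; the real collection lives in keyword_density.py.
_TECH_KEYWORDS = {
    "python", "java", "docker", "kubernetes", "aws", "sql", "react", "terraform",
}


def is_tech_per_call(keyword):
    """Before: the list is rebuilt on every call and scanned linearly."""
    tech_keywords = [
        "python", "java", "docker", "kubernetes", "aws", "sql", "react", "terraform",
    ]
    return keyword.lower() in tech_keywords


def is_tech_hoisted(keyword):
    """After: membership is tested against a module-level set."""
    return keyword.lower() in _TECH_KEYWORDS


def benchmark(number=200_000):
    """Return (per_call_seconds, hoisted_seconds) for `number` lookups each."""
    per_call = timeit.timeit(lambda: is_tech_per_call("Terraform"), number=number)
    hoisted = timeit.timeit(lambda: is_tech_hoisted("Terraform"), number=number)
    return per_call, hoisted


if __name__ == "__main__":
    per_call, hoisted = benchmark()
    print(f"list rebuilt per call: {per_call:.4f}s")
    print(f"module-level set:      {hoisted:.4f}s")
```

Running the same comparison against the real methods before and after the change would give numbers to back the stated impact.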


PR created automatically by Jules for task 10506461077069710837 started by @anchapin

Summary by Sourcery

Optimize keyword density analysis by hoisting reusable patterns and keyword collections to module-level constants for faster execution.

Enhancements:

  • Pre-compile job title and company extraction regex patterns at module scope and reuse them across calls.
  • Centralize common keyword definitions and tech keyword membership into shared module-level collections to reduce per-call allocations and speed up lookups.

Documentation:

  • Add a Bolt learning log entry documenting regex and keyword optimization practices for density analysis.

Identified a performance bottleneck in `cli/utils/keyword_density.py` where regex compilations and list allocations were occurring inside heavily used methods. Extracted these patterns to module-level variables.

Co-authored-by: anchapin <6326294+anchapin@users.noreply.github.com>
@google-labs-jules
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@sourcery-ai

sourcery-ai Bot commented May 26, 2026

Reviewer's Guide

This PR optimizes keyword density analysis by hoisting regex patterns and keyword collections from frequently called methods into module-level precompiled patterns and constants, and by using a set for tech keyword membership checks, reducing per-call overhead without changing behavior.

File-Level Changes

Change: Precompile job title and company extraction regexes at module scope and reuse them in job detail extraction.
Details:
  • Introduce module-level _TITLE_PATTERNS and _COMPANY_PATTERNS lists of precompiled regex objects.
  • Refactor _extract_job_details to iterate over these compiled patterns using pattern.search instead of re.search with inline flags.
Files: cli/utils/keyword_density.py

Change: Hoist common keyword list to a module-level constant and reuse it in simple keyword extraction.
Details:
  • Add _COMMON_KEYWORDS as a module-level list of (keyword, importance) tuples.
  • Refactor _simple_keyword_extraction to iterate over _COMMON_KEYWORDS instead of recreating the list on each call.
Files: cli/utils/keyword_density.py

Change: Replace dynamically created tech keyword list with a module-level set for faster membership checks in keyword section suggestions.
Details:
  • Add _TECH_KEYWORDS as a module-level set of tech-related keywords.
  • Update _suggest_sections_for_keyword to test membership against _TECH_KEYWORDS instead of recreating a list each time.
Files: cli/utils/keyword_density.py

Change: Document the optimization pattern in the Bolt guide for future performance-oriented refactors.
Details:
  • Append a new dated entry describing regex and keyword collection optimizations for density analysis.
  • Capture the learning about pre-compiling regexes and using module-level constants/sets for repeated lookups.
Files: .jules/bolt.md


@sourcery-ai sourcery-ai Bot left a comment


Hey - I've left some high-level feedback:

  • The _TECH_KEYWORDS set is missing entries that were present in the original tech_keywords list (e.g., "react native", "hibernate"), which changes behavior for _suggest_sections_for_keyword; please verify whether this is intentional or bring the set back in sync.
  • To avoid future drift between _COMMON_KEYWORDS and _TECH_KEYWORDS, consider deriving _TECH_KEYWORDS programmatically from _COMMON_KEYWORDS (e.g., by filtering on importance or a separate tag) instead of maintaining two separate hard-coded collections.

