⚡ Bolt: Pre-compile regex constants for job parsing#324
Conversation
Co-authored-by: anchapin <6326294+anchapin@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
Reviewer's GuidePre-compiles all frequently used regexes in JobParser into module-level patterns and refactors parsing helpers to accept and use these compiled patterns, reducing repeated compilation in hot parsing paths while documenting the performance guideline in the Bolt playbook. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- Instead of using
re.Patternwith multiple# type: ignore[call-overload]s, consider annotating the compiled constants asPattern[str](fromtypingorrein 3.11+) and adjusting the BeautifulSoup helper signatures so thatsoup.find(..., string=...)accepts the correctly typed pattern without suppressing type checking. - For helpers like
_extract_text_by_patternand_extract_list_by_keyword, it may be clearer to expose two overloads (one forstr, one forPattern[str]) rather thanUnion[str, Pattern], which would make call sites and type checking more precise and could eliminate the need for runtimeisinstancebranching.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Instead of using `re.Pattern` with multiple `# type: ignore[call-overload]`s, consider annotating the compiled constants as `Pattern[str]` (from `typing` or `re` in 3.11+) and adjusting the BeautifulSoup helper signatures so that `soup.find(..., string=...)` accepts the correctly typed pattern without suppressing type checking.
- For helpers like `_extract_text_by_pattern` and `_extract_list_by_keyword`, it may be clearer to expose two overloads (one for `str`, one for `Pattern[str]`) rather than `Union[str, Pattern]`, which would make call sites and type checking more precise and could eliminate the need for runtime `isinstance` branching.
## Individual Comments
### Comment 1
<location path="cli/integrations/job_parser.py" line_range="844-847" />
<code_context>
-
- for pattern in patterns:
- match = re.search(pattern, html, re.IGNORECASE)
+ for pattern in _JOB_TYPE_PATTERNS:
+ match = pattern.search(html)
if match:
return match.group(1).lower().replace("-", "-")
</code_context>
<issue_to_address>
**issue (bug_risk):** The `replace("-", "-")` call is a no-op and likely not doing what was intended.
This makes the line equivalent to `match.group(1).lower()`. If you meant to normalize separators (e.g., turn spaces into hyphens or collapse multiple hyphens), please update the `replace` to the intended mapping, such as `.replace(" ", "-")` or another appropriate transformation.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| for pattern in _JOB_TYPE_PATTERNS: | ||
| match = pattern.search(html) | ||
| if match: | ||
| return match.group(1).lower().replace("-", "-") |
There was a problem hiding this comment.
issue (bug_risk): The replace("-", "-") call is a no-op and likely not doing what was intended.
This makes the line equivalent to match.group(1).lower(). If you meant to normalize separators (e.g., turn spaces into hyphens or collapse multiple hyphens), please update the replace to the intended mapping, such as .replace(" ", "-") or another appropriate transformation.
💡 What: Extracted 20+ dynamically compiled regular expressions inside
JobParsermethods (like_extract_salary_from_text,_parse_generic,_extract_job_type) into module-levelre.Patternconstants. Refactored helper methods like_extract_text_by_patternto supportUnion[str, re.Pattern].🎯 Why: In bulk parsing scenarios, creating and destroying regex engine states via
re.searchandre.compilewithin loops and multiple function calls creates measurable and unnecessary overhead. Pre-compiling them guarantees the parser pays the regex initialization cost only once at module load.📊 Impact: Reduces CPU allocation and string parsing time inside hot loops for every job parsing request.
🔬 Measurement: Verify tests (
python -m pytest tests/test_job_parser_integration.py) pass and usecProfileon a bulk list of job postings to observe the reduction inre.compilecalls.PR created automatically by Jules for task 11625549520871511255 started by @anchapin
Summary by Sourcery
Pre-compile frequently used regular expressions in the job parser to reduce repeated compilation overhead and improve performance when parsing job postings.
Enhancements:
Documentation: