fix: support vector features in PyOD anomaly detection batching logic #35884
Conversation
Summary of Changes
Hello @vijay-x-Raj, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request resolves an issue in the Apache Beam PyOD anomaly detection module where beam.Row objects containing vector features, such as embeddings, caused failures during processing. The core change involves modifying the batching logic within the run_inference function to properly flatten both scalar and vector fields. This ensures that all types of features are correctly converted into a format suitable for PyOD's anomaly detection models, enabling broader use cases without introducing breaking changes for existing scalar-only rows.
Highlights
- Anomaly Detection Batching Logic: The `run_inference` function in `pyod_adapter.py` has been updated to correctly handle `beam.Row` objects that contain vector features (e.g., embeddings represented as lists, tuples, or NumPy arrays). Previously, such rows would cause the anomaly detection module to fail.
- Support for Vector Features: The batching logic now iterates through each element of a `beam.Row`. If an element is a vector type, its contents are flattened and extended into a single feature array. Scalar values are appended directly, ensuring all features are correctly combined into a `numpy.ndarray` for processing (see the sketch after this list).
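For illustration, here is a minimal, self-contained sketch of the flattening behavior described in the highlights above. It uses plain Python values in place of an actual `beam.Row`, and `flatten_features` is a hypothetical helper name rather than the adapter's actual function:

```python
import numpy as np


def flatten_features(row_values):
  """Flattens one row's field values into a single 1-D float64 array."""
  features = []
  for value in row_values:
    if isinstance(value, (list, tuple, np.ndarray)):
      features.extend(value)  # vector field, e.g. an embedding
    else:
      features.append(value)  # scalar field
  return np.array(features, dtype=np.float64)


# Field values as they might come from beam.Row(score=1.0, embedding=[0.1, 0.2]):
print(flatten_features([1.0, [0.1, 0.2]]))  # -> [1.  0.1 0.2]
```

Scalar-only rows pass through unchanged, which is why the change introduces no breaking behavior for existing pipelines.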
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Assigning reviewers: R: @tvalentyn for label python.
Note: If you would like to opt out of this review, comment
Available commands:
The PR bot will only process comments in the main thread (not review comments).
Thanks! I'll defer to @shunping on whether Beam's anomaly detection intends to support this use case.
/gemini review
Code Review
This pull request correctly adds support for vector features in the PyOD anomaly detection batching logic by flattening nested features. The change is well-implemented. I've added one suggestion to refactor the batch creation logic to be slightly more memory-efficient and performant by using a generator with np.fromiter, which also makes the code more concise.
```diff
 np_batch = []
 for row in batch:
-  np_batch.append(np.fromiter(row, dtype=np.float64))
+  features = []
+  for value in row:
+    if isinstance(value, (list, tuple, np.ndarray)):
+      features.extend(value)
+    else:
+      features.append(value)
+  np_batch.append(np.array(features, dtype=np.float64))
```
The current implementation is correct, but it can be made more efficient and concise.
It currently builds an intermediate list `features` for each row, and uses `np.array()`, which can be slower than `np.fromiter()` for creating 1-D arrays from iterables.
A more performant approach is to use a generator function to flatten the features on the fly and pass it to `np.fromiter`. This avoids creating the intermediate list for each row.
```python
def _flatten_row(row_values):
  for value in row_values:
    if isinstance(value, (list, tuple, np.ndarray)):
      yield from value
    else:
      yield value


np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
```
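For illustration, the suggested generator approach can be exercised on a small, made-up batch; only NumPy is assumed, and the `batch` contents below are hypothetical sample data rather than Beam objects:

```python
import numpy as np


def _flatten_row(row_values):
  # Yield scalars as-is and expand vector fields element by element.
  for value in row_values:
    if isinstance(value, (list, tuple, np.ndarray)):
      yield from value
    else:
      yield value


# Each "row" stands in for the field values of one beam.Row.
batch = [(1.0, [0.1, 0.2]), (2.0, np.array([0.3, 0.4]))]
np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
print(np_batch)  # [array([1. , 0.1, 0.2]), array([2. , 0.3, 0.4])]
```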
Should I implement the changes suggested by the bot?
If it makes sense, why not? I would also like to note that I am not familiar with this change; I just wanted to build some momentum on reviewing this PR.
…D anomaly detection
/gemini review
Code Review
This pull request correctly adds support for vector features in the PyOD anomaly detection batching logic by introducing a flattening mechanism for beam.Row objects. The change is well-implemented and addresses the described issue. I have one suggestion regarding the placement of the new helper function to improve code structure and performance.
R: @shunping
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment
```python
    else:
      yield value


np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
```
This line is too long and is causing the linter and formatter tests to fail.
Could you make the following change?
```diff
-np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
+np_batch = [
+    np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch
+]
```
shunping left a comment
Thanks for contributing the code. There is a minor formatting issue. Could you help to address that?
There were issues in the original file which caused the tests to fail, so I have overwritten the file and rewritten the code.
I think you are using your own formatter, which is causing the linter and formatter problems. Could you run
```diff
 detector = OfflineDetector(
-    model_handler, threshold_criterion=FixedThreshold(threshold), **kwargs)  # type: ignore[arg-type]
+    model_handler, threshold_criterion=FixedThreshold(threshold), **kwargs
```
Looks like the change here is causing a linter error.
Hey @vijay-x-Raj, I see you are working to fix the linter and formatter issues. Have you tried the command I suggested at #35884 (comment)? We use a specific yapf config to ensure our code format is consistent across the project. Please run that command, or the formatter test will fail: https://github.com/apache/beam/actions/runs/17105155269/job/48512023797?pr=35884 After the formatter error is fixed, we can then check the logs in the linter test to see if there are any errors: https://github.com/apache/beam/actions/runs/17105155251/job/48512023540?pr=35884 Hope this helps!
@shunping Sorry, I tried that earlier but the formatting became weird, so I applied the changes manually afterwards. This time I think it is working. Thanks for helping me out.
@vijay-x-Raj – The formatter CI pipeline is passing now 🎉 Only the linter is failing with this error (from the publicly available CI logs). It is okay; it takes a bit of time to go through the process of fixing linting issues, especially on first contributions. You can fix those linting issues from the logs as you go, and perhaps you can test it using
Yeah, I will try that and will keep at it!
```python
from apache_beam.ml.anomaly.specifiable import specifiable
from apache_beam.ml.anomaly.thresholds import FixedThreshold
from apache_beam.ml.inference.base import KeyedModelHandler
from typing import cast, Any
```
Please remove "Any" and move this line after "from typing import Optional" (line 28).
Imports need to be placed in order: system imports come first, then third-party imports, then Apache Beam imports.
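For illustration, a sketch of the grouping the reviewer describes; the module names are taken from the diff above, and their exact placement inside `pyod_adapter.py` is assumed rather than shown:

```python
# Standard library imports first.
from typing import Optional
from typing import cast

# Third-party imports next.
import numpy as np

# Apache Beam imports last.
from apache_beam.ml.anomaly.specifiable import specifiable
from apache_beam.ml.anomaly.thresholds import FixedThreshold
from apache_beam.ml.inference.base import KeyedModelHandler
```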
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@             Coverage Diff              @@
##             master    #35884     +/-   ##
============================================
+ Coverage      56.69%    56.71%   +0.02%
  Complexity      3380      3380
============================================
  Files           1204      1220      +16
  Lines         184114    184843     +729
  Branches        3507      3507
============================================
+ Hits          104379    104833     +454
- Misses         76406     76681     +275
  Partials        3329      3329
```

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
… compact inference logic
A fair amount of those linting issues (as seen in the CI) can be fixed by enforcing proper indentation in the root of your Beam project. An EditorConfig was recently added to Beam (#35956). This can be enforced in Visual Studio Code by installing this extension: https://marketplace.visualstudio.com/items?itemName=EditorConfig.EditorConfig. And please remember to do
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
The anomaly detection module previously failed when a `beam.Row` contained vector features (e.g., embeddings). This PR updates the batching logic in `run_inference` to flatten both scalar and vector fields, enabling support for use cases like `beam.Row(embedding=[...])`. No breaking changes; scalar-only rows continue to work as before.
Issue: #35841