
Conversation

@vijay-x-Raj

The anomaly detection module previously failed when a beam.Row contained vector features (e.g., embeddings).
This PR updates the batching logic in run_inference to flatten both scalar and vector fields, enabling support for use cases like beam.Row(embedding=[...]).
No breaking changes; scalar-only rows continue to work as before.

Issue: #35841
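
For illustration, here is a minimal, self-contained sketch of the flattening behaviour described above. It is not the actual run_inference implementation; the ExampleRow NamedTuple stands in for a schema'd beam.Row, and the field names and values are illustrative only.

    from typing import List, NamedTuple

    import numpy as np


    class ExampleRow(NamedTuple):
      score: float
      embedding: List[float]


    def flatten_row(row) -> np.ndarray:
      """Flatten scalar and vector fields of a row into one 1-D feature array."""
      features = []
      for value in row:
        if isinstance(value, (list, tuple, np.ndarray)):
          features.extend(value)  # vector field: splice its elements in
        else:
          features.append(value)  # scalar field: append as-is
      return np.array(features, dtype=np.float64)


    print(flatten_row(ExampleRow(score=0.5, embedding=[0.1, 0.2, 0.3])))
    # -> [0.5 0.1 0.2 0.3]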

Contributor

@gemini-code-assist bot left a comment


Summary of Changes

Hello @vijay-x-Raj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue in the Apache Beam PyOD anomaly detection module where beam.Row objects containing vector features, such as embeddings, caused failures during processing. The core change involves modifying the batching logic within the run_inference function to properly flatten both scalar and vector fields. This ensures that all types of features are correctly converted into a format suitable for PyOD's anomaly detection models, enabling broader use cases without introducing breaking changes for existing scalar-only rows.

Highlights

  • Anomaly Detection Batching Logic: The run_inference function in pyod_adapter.py has been updated to correctly handle beam.Row objects that contain vector features (e.g., embeddings represented as lists, tuples, or NumPy arrays). Previously, such rows would cause the anomaly detection module to fail.
  • Support for Vector Features: The batching logic now iterates through each element of a beam.Row. If an element is a vector type, its contents are flattened and extended into a single feature array. Scalar values are appended directly, ensuring all features are correctly combined into a numpy.ndarray for processing.

@github-actions
Contributor

Assigning reviewers:

R: @tvalentyn for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@charlespnh
Contributor

Thanks! I'll defer to @shunping on whether Beam's anomaly detection intends to support this use case.

@mohamedawnallah
Contributor

/gemini review

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request correctly adds support for vector features in the PyOD anomaly detection batching logic by flattening nested features. The change is well-implemented. I've added one suggestion to refactor the batch creation logic to be slightly more memory-efficient and performant by using a generator with np.fromiter, which also makes the code more concise.

Comment on lines 78 to 86
     np_batch = []
     for row in batch:
-      np_batch.append(np.fromiter(row, dtype=np.float64))
+      features = []
+      for value in row:
+        if isinstance(value, (list, tuple, np.ndarray)):
+          features.extend(value)
+        else:
+          features.append(value)
+      np_batch.append(np.array(features, dtype=np.float64))
Contributor


medium

The current implementation is correct, but it can be made more efficient and concise.
It currently builds an intermediate list features for each row, and uses np.array() which can be slower than np.fromiter() for creating 1D arrays from iterables.

A more performant approach is to use a generator function to flatten features on-the-fly and pass it to np.fromiter. This avoids creating the intermediate list for each row.

    def _flatten_row(row_values):
      for value in row_values:
        if isinstance(value, (list, tuple, np.ndarray)):
          yield from value
        else:
          yield value
    np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
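
As a standalone sanity check (illustrative values, outside of Beam), the suggested generator-based version produces the same array as the intermediate-list version it replaces:

    import numpy as np


    def _flatten_row(row_values):
      for value in row_values:
        if isinstance(value, (list, tuple, np.ndarray)):
          yield from value
        else:
          yield value


    row = (0.5, [0.1, 0.2, 0.3])  # one scalar field followed by one vector field

    features = []
    for value in row:
      if isinstance(value, (list, tuple, np.ndarray)):
        features.extend(value)
      else:
        features.append(value)

    assert np.array_equal(
        np.array(features, dtype=np.float64),
        np.fromiter(_flatten_row(row), dtype=np.float64))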

@vijay-x-Raj
Author

Should I implement the changes suggested by the bot?

@mohamedawnallah
Contributor

Should I implement the changes suggested by the bot?

If it makes sense, why not? I should also note that I am not familiar with this change; I just wanted to build some momentum on reviewing this PR.

@vijay-x-Raj
Author

/gemini review

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request correctly adds support for vector features in the PyOD anomaly detection batching logic by introducing a flattening mechanism for beam.Row objects. The change is well-implemented and addresses the described issue. I have one suggestion regarding the placement of the new helper function to improve code structure and performance.

@tvalentyn
Contributor

R: @shunping

@github-actions
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

        else:
          yield value

    np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
Collaborator

@shunping commented Aug 18, 2025


This line is too long and is causing the linter and formatter tests to fail.

Could you make the following change?

-    np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
+    np_batch = [
+        np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch
+    ]

Collaborator

@shunping left a comment


Thanks for contributing the code. There is a minor formatting issue. Could you help address that?

@vijay-x-Raj
Author

There were issues in the original file which caused the tests to fail, so I overwrote the file and rewrote the code.

@shunping
Collaborator

I think you are using your own formatter, which is causing the linter and formatter problems.

Could you run tox -c tox.ini run -e py3-yapf under <YOUR_PROJECT_DIR>/sdks/python? That should fix the formatting issue.


     detector = OfflineDetector(
-        model_handler, threshold_criterion=FixedThreshold(threshold), **kwargs)  # type: ignore[arg-type]
+        model_handler, threshold_criterion=FixedThreshold(threshold), **kwargs
Collaborator


Looks like the change here is causing a linter error.

@shunping
Collaborator

shunping commented Aug 21, 2025

Hey @vijay-x-Raj, I see you are working to fix the linter and formatter issues.

Have you ever tried the comment I suggested at #35884 (comment)?

We use a specific yapf config to ensure our code format is consistent across the project. Please run that command, or the formatter test will fail:

https://github.com/apache/beam/actions/runs/17105155269/job/48512023797?pr=35884

After the formatter error is fixed, we can then check the logs in the linter test to see if there is any error:

https://github.com/apache/beam/actions/runs/17105155251/job/48512023540?pr=35884

Hope this helps!

@vijay-x-Raj
Author

@shunping Sorry, I tried that earlier but the formatting became weird, which is why I applied the changes manually afterwards. This time I think it is working. Thanks for helping me out.

@mohamedawnallah
Contributor

mohamedawnallah commented Aug 21, 2025

@vijay-x-Raj – The formatter CI pipeline passed now 🎉

Only the linter is failing, with this error (from the publicly available CI logs):

apache_beam/ml/anomaly/detectors/pyod_adapter.py:142: error: Argument 1 to "OfflineDetector" has incompatible type "ModelHandler[tuple[Any, Row], tuple[Any, AnomalyPrediction], Union[Any, _ModelManager]]"; expected "KeyedModelHandler[Any, Row, Never, Any]"  [arg-type]
apache_beam/ml/anomaly/detectors/pyod_adapter.py:145: error: Unused "type: ignore" comment  [unused-ignore]
Found 2 errors in 1 file (checked 1334 source files)
error: mypy exited with status 1

It is okay; it takes a bit of time to go through the process of fixing linting issues, especially on first contributions. You can fix those linting issues from the logs as you go, and you can also test locally against .pylintrc and mypy.ini (I don't remember the exact commands to run against those files, but I think they can easily be found online). This gives you a fast feedback loop locally instead of (fix an issue that causes the linting to fail -> push to CI -> wait for CI to finish -> oops, there are still CI linting issues -> loop).
https://github.com/apache/beam/blob/master/sdks/python/.pylintrc
https://github.com/apache/beam/blob/master/sdks/python/mypy.ini
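
Purely as an illustrative sketch (stand-in types, not the actual Beam classes or the final fix in this PR), one way to address an arg-type error like the one above is typing.cast, asserting the narrower type where it is known to hold, instead of a "# type: ignore[arg-type]" comment:

    from typing import cast


    class KeyedHandler:
      """Stand-in for the expected handler type in this illustration."""


    class ConcreteHandler(KeyedHandler):
      """Stand-in for the handler actually being passed."""


    def make_detector(handler: KeyedHandler) -> str:
      return f"detector({type(handler).__name__})"


    def build(handler: object) -> str:
      # Passing "object" here would fail mypy's arg-type check; cast asserts
      # the narrower type without changing runtime behaviour.
      return make_detector(cast(KeyedHandler, handler))


    print(build(ConcreteHandler()))  # -> detector(ConcreteHandler)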

@vijay-x-Raj
Author

Yeah, I will try that and will keep trying afterwards!

from apache_beam.ml.anomaly.specifiable import specifiable
from apache_beam.ml.anomaly.thresholds import FixedThreshold
from apache_beam.ml.inference.base import KeyedModelHandler
from typing import cast, Any
Collaborator

@shunping commented Aug 22, 2025


Please remove "Any" and move this line after "from typing import Optional" (line 28).

Imports need to be placed in order: system imports come first, then third-party imports, then Apache Beam imports.
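
For illustration only, a sketch of the ordering being asked for here (the exact set and positions of the surrounding imports in pyod_adapter.py are assumed):

    # Standard library / typing imports first ...
    from typing import Optional
    from typing import cast

    # ... then third-party imports ...
    import numpy as np

    # ... then Apache Beam imports.
    from apache_beam.ml.anomaly.specifiable import specifiable
    from apache_beam.ml.anomaly.thresholds import FixedThreshold
    from apache_beam.ml.inference.base import KeyedModelHandler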

@codecov

codecov bot commented Aug 23, 2025

Codecov Report

❌ Patch coverage is 10.00000% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.71%. Comparing base (14bb49c) to head (33c3295).
⚠️ Report is 30 commits behind head on master.

Files with missing lines | Patch % | Lines
...n/apache_beam/ml/anomaly/detectors/pyod_adapter.py | 10.00% | 27 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #35884      +/-   ##
============================================
+ Coverage     56.69%   56.71%   +0.02%     
  Complexity     3380     3380              
============================================
  Files          1204     1220      +16     
  Lines        184114   184843     +729     
  Branches       3507     3507              
============================================
+ Hits         104379   104833     +454     
- Misses        76406    76681     +275     
  Partials       3329     3329              
Flag | Coverage | Δ
python | 80.79% <10.00%> | (-0.18%) ⬇️

Flags with carried forward coverage won't be shown.


@mohamedawnallah
Contributor

mohamedawnallah commented Aug 28, 2025

A fair amount of those linting issues (as in the CI) can be fixed by enforcing proper indentation at the root of your Beam project. An EditorConfig file was recently added to Beam (#35956); if you are on Visual Studio Code, it can be enforced by installing this extension: https://marketplace.visualstudio.com/items?itemName=EditorConfig.EditorConfig. And please remember to run tox -c tox.ini run -e py3-yapf (#35884 (comment)) before any push, since when the Python Formatter CI workflow fails it is mostly because the author didn't apply that command. This may be helpful to share here as well: https://cwiki.apache.org/confluence/display/BEAM/Python+Tips

apache_beam/ml/anomaly/detectors/pyod_adapter.py:61:0: C0301: Line too long (83/80) (line-too-long)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:64:0: C0301: Line too long (83/80) (line-too-long)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:62:0: W0311: Bad indentation. Found 4 spaces, expected 2 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:73:0: W0311: Bad indentation. Found 4 spaces, expected 2 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:74:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:75:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:77:0: W0311: Bad indentation. Found 4 spaces, expected 2 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:78:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:79:0: W0311: Bad indentation. Found 12 spaces, expected 6 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:81:0: W0311: Bad indentation. Found 4 spaces, expected 2 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:87:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:93:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:94:0: W0311: Bad indentation. Found 12 spaces, expected 6 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:95:0: W0311: Bad indentation. Found 16 spaces, expected 8 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:96:0: W0311: Bad indentation. Found 20 spaces, expected 10 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:97:0: W0311: Bad indentation. Found 16 spaces, expected 8 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:98:0: W0311: Bad indentation. Found 20 spaces, expected 10 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:100:0: C0301: Line too long (86/80) (line-too-long)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:100:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:101:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:102:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:103:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:107:0: W0311: Bad indentation. Found 4 spaces, expected 2 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:109:0: W0311: Bad indentation. Found 4 spaces, expected 2 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:110:0: W0311: Bad indentation. Found 4 spaces, expected 2 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:111:0: C0301: Line too long (95/80) (line-too-long)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:111:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:113:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:114:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:115:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)
apache_beam/ml/anomaly/detectors/pyod_adapter.py:116:0: W0311: Bad indentation. Found 8 spaces, expected 4 (bad-indentation)

@github-actions
Contributor

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@github-actions bot added the stale label Oct 28, 2025
@github-actions
Contributor

github-actions bot commented Nov 4, 2025

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions bot closed this Nov 4, 2025
