fix: support vector features in PyOD anomaly detection batching logic #35884
Conversation
Summary of Changes
Hello @vijay-x-Raj, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request resolves an issue in the Apache Beam PyOD anomaly detection module where beam.Row objects containing vector features, such as embeddings, caused failures during processing. The core change involves modifying the batching logic within the run_inference function to properly flatten both scalar and vector fields. This ensures that all types of features are correctly converted into a format suitable for PyOD's anomaly detection models, enabling broader use cases without introducing breaking changes for existing scalar-only rows.
Highlights
- Anomaly Detection Batching Logic: The `run_inference` function in `pyod_adapter.py` has been updated to correctly handle `beam.Row` objects that contain vector features (e.g., embeddings represented as lists, tuples, or NumPy arrays). Previously, such rows would cause the anomaly detection module to fail.
- Support for Vector Features: The batching logic now iterates through each element of a `beam.Row`. If an element is a vector type, its contents are flattened and extended into a single feature array. Scalar values are appended directly, ensuring all features are correctly combined into a `numpy.ndarray` for processing (see the sketch after this list).
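For illustration, here is a minimal, self-contained sketch of the flattening behavior described in the highlights above. It uses plain Python values in place of an actual `beam.Row`, and `flatten_features` is a hypothetical helper name rather than the adapter's actual function:

```python
import numpy as np


def flatten_features(row_values):
  """Flattens one row's field values into a single 1-D float64 array."""
  features = []
  for value in row_values:
    if isinstance(value, (list, tuple, np.ndarray)):
      features.extend(value)  # vector field, e.g. an embedding
    else:
      features.append(value)  # scalar field
  return np.array(features, dtype=np.float64)


# Field values as they might come from beam.Row(score=1.0, embedding=[0.1, 0.2]):
print(flatten_features([1.0, [0.1, 0.2]]))  # -> [1.  0.1 0.2]
```

Scalar-only rows pass through unchanged, which is why the change introduces no breaking behavior for existing pipelines.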
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Assigning reviewers: R: @tvalentyn for label python.
Note: If you would like to opt out of this review, comment
Available commands:
The PR bot will only process comments in the main thread (not review comments).
Thanks! I'll defer to @shunping on whether Beam's anomaly detection intends to support this use case.
/gemini review
Code Review
This pull request correctly adds support for vector features in the PyOD anomaly detection batching logic by flattening nested features. The change is well-implemented. I've added one suggestion to refactor the batch creation logic to be slightly more memory-efficient and performant by using a generator with np.fromiter, which also makes the code more concise.
```diff
 np_batch = []
 for row in batch:
-  np_batch.append(np.fromiter(row, dtype=np.float64))
+  features = []
+  for value in row:
+    if isinstance(value, (list, tuple, np.ndarray)):
+      features.extend(value)
+    else:
+      features.append(value)
+  np_batch.append(np.array(features, dtype=np.float64))
```
The current implementation is correct, but it can be made more efficient and concise.
It currently builds an intermediate list `features` for each row, and uses `np.array()`, which can be slower than `np.fromiter()` for creating 1-D arrays from iterables.
A more performant approach is to use a generator function to flatten the features on the fly and pass it to `np.fromiter`. This avoids creating the intermediate list for each row.
```python
def _flatten_row(row_values):
  for value in row_values:
    if isinstance(value, (list, tuple, np.ndarray)):
      yield from value
    else:
      yield value


np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
```
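For illustration, the suggested generator approach can be exercised on a small, made-up batch; only NumPy is assumed, and the `batch` contents below are hypothetical sample data rather than Beam objects:

```python
import numpy as np


def _flatten_row(row_values):
  # Yield scalars as-is and expand vector fields element by element.
  for value in row_values:
    if isinstance(value, (list, tuple, np.ndarray)):
      yield from value
    else:
      yield value


# Each "row" stands in for the field values of one beam.Row.
batch = [(1.0, [0.1, 0.2]), (2.0, np.array([0.3, 0.4]))]
np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
print(np_batch)  # [array([1. , 0.1, 0.2]), array([2. , 0.3, 0.4])]
```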
Should I implement the changes suggested by the bot?
If it makes sense, why not? I would also like to note that I am not familiar with this change; I just wanted to build some momentum on reviewing this PR.
…D anomaly detection
/gemini review
Code Review
This pull request correctly adds support for vector features in the PyOD anomaly detection batching logic by introducing a flattening mechanism for beam.Row objects. The change is well-implemented and addresses the described issue. I have one suggestion regarding the placement of the new helper function to improve code structure and performance.
R: @shunping
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment
```python
    else:
      yield value


np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
```
This line is too long and is causing the linter and formatter tests to fail.
Could you make the following change?
```diff
-np_batch = [np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch]
+np_batch = [
+    np.fromiter(_flatten_row(row), dtype=np.float64) for row in batch
+]
```
shunping left a comment
Thanks for contributing the code. There is a minor formatting issue. Could you help to address that?
There were issues in the original file which caused the tests to fail, so I have overwritten the file and rewritten the code.
I think you are using your own formatter, which is causing the linter and formatter problems. Could you run
```diff
 detector = OfflineDetector(
-    model_handler, threshold_criterion=FixedThreshold(threshold), **kwargs)  # type: ignore[arg-type]
+    model_handler, threshold_criterion=FixedThreshold(threshold), **kwargs
```
Looks like the change here is causing a linter error.
Hey @vijay-x-Raj, I see you are working to fix the linter and formatter issues. Have you tried the command I suggested at #35884 (comment)? We use a specific yapf config to ensure our code format is consistent across the project. Please run that command, or the formatter test will fail: https://github.com/apache/beam/actions/runs/17105155269/job/48512023797?pr=35884 After the formatter error is fixed, we can then check the logs in the linter test to see if there are any errors: https://github.com/apache/beam/actions/runs/17105155251/job/48512023540?pr=35884 Hope this helps!
@shunping Sorry, I tried that earlier but the formatting became weird, so I applied the changes manually afterwards. This time I think it is working. Thanks for helping me out.
@vijay-x-Raj – The formatter CI pipeline is passing now 🎉 Only the linter is failing with this error (from the publicly available CI logs). It is okay; it takes a bit of time to go through the process of fixing linting issues, especially on first contributions. You can fix those linting issues from the logs as you go, and perhaps you can test it using
Yeah, I will try that and will keep at it!
```python
from apache_beam.ml.anomaly.specifiable import specifiable
from apache_beam.ml.anomaly.thresholds import FixedThreshold
from apache_beam.ml.inference.base import KeyedModelHandler
from typing import cast, Any
```
Please remove "Any" and move this line after "from typing import Optional" (line 28).
Imports need to be placed in order: system imports come first, then third-party imports, then Apache Beam imports.
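For illustration, a sketch of the grouping the reviewer describes; the module names are taken from the diff above, and their exact placement inside `pyod_adapter.py` is assumed rather than shown:

```python
# Standard library imports first.
from typing import Optional
from typing import cast

# Third-party imports next.
import numpy as np

# Apache Beam imports last.
from apache_beam.ml.anomaly.specifiable import specifiable
from apache_beam.ml.anomaly.thresholds import FixedThreshold
from apache_beam.ml.inference.base import KeyedModelHandler
```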
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@             Coverage Diff              @@
##             master    #35884     +/-   ##
============================================
+ Coverage      56.69%    56.71%   +0.02%
  Complexity      3380      3380
============================================
  Files           1204      1220      +16
  Lines         184114    184843     +729
  Branches        3507      3507
============================================
+ Hits          104379    104833     +454
- Misses         76406     76681     +275
  Partials        3329      3329
```

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
… compact inference logic
A fair amount of those linting issues (as seen in the CI) can be fixed by enforcing proper indentation in the root of your Beam project. An EditorConfig was recently added to Beam (#35956). This can be enforced in Visual Studio Code by installing this extension: https://marketplace.visualstudio.com/items?itemName=EditorConfig.EditorConfig. And please remember to do
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
The anomaly detection module previously failed when a `beam.Row` contained vector features (e.g., embeddings). This PR updates the batching logic in `run_inference` to flatten both scalar and vector fields, enabling support for use cases like `beam.Row(embedding=[...])`. No breaking changes; scalar-only rows continue to work as before.
Issue: #35841