Skip to content

feat: add keyword argument support for train_from_stream (#114)#155

Open
Yegorov wants to merge 4 commits into
cardmagic:masterfrom
Yegorov:114
Open

feat: add keyword argument support for train_from_stream (#114)#155
Yegorov wants to merge 4 commits into
cardmagic:masterfrom
Yegorov:114

Conversation

@Yegorov
Copy link
Copy Markdown
Contributor

@Yegorov Yegorov commented May 21, 2026

Closes #114

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR extends train_from_stream across all four classifiers (Bayes, KNN, LogisticRegression, LSI) to accept multiple categories via keyword arguments (e.g. train_from_stream(spam: io1, ham: io2)), alongside the existing positional form. New tests cover the multi-category path and invalid-IO validation.

  • The LSI implementation moves the begin/ensure block inside the per-category loop, causing build_index to fire once per category rather than once after all streams have been loaded.
  • Making category and io optional without an explicit guard means callers who supply only one of the two arguments silently get a no-op instead of an error.
  • The Streaming base module stub was not updated to match the optional-positional signature now used by every implementation.

Confidence Score: 3/5

The multi-category keyword API is a useful addition, but the LSI implementation rebuilds the index after each category stream and the missing-argument case silently does nothing across all classifiers.

The LSI build_index regression means any caller who streams multiple categories while auto_rebuild is enabled will trigger redundant and increasingly expensive index rebuilds. The silent no-op for partial arguments removes a safety net that existed for free under the old required-parameter contract.

lib/classifier/lsi.rb needs the auto_rebuild guard hoisted outside the loop; lib/classifier/bayes.rb and logistic_regression.rb need an explicit guard for the partial-argument case; lib/classifier/streaming.rb base stub should be updated to match the optional signature.

Important Files Changed

Filename Overview
lib/classifier/lsi.rb Moving the begin/ensure block inside the category loop causes build_index to fire after each individual stream instead of once after all categories, breaking the original optimisation.
lib/classifier/bayes.rb Adds multi-category keyword support. Missing guard means partial argument combinations silently do nothing instead of raising.
lib/classifier/logistic_regression.rb Same pattern as Bayes; same silent no-op concern applies. synchronize and dirty handling looks structurally correct inside the loop.
lib/classifier/knn.rb Thin delegation to lsi.train_from_stream; correctly threads **categories through.
lib/classifier/streaming.rb Base stub not updated to match the optional-positional signature adopted by all implementations; @RBS annotation is also inconsistent.
test/lsi/streaming_test.rb Adds multi-category and invalid-IO tests; does not cover the auto_rebuild regression.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["train_from_stream(category=nil, io=nil, **categories)"] --> B{category && io?}
    B -- Yes --> C["{ category => io }"]
    B -- No --> D["categories hash"]
    C --> E[".each do |cat, stream|"]
    D --> E
    E --> F[Validate stream]
    F --> G[LineReader + Progress]
    G --> H[each_batch]
    H --> I[Train + update progress]
    I --> N{More batches?}
    N -- Yes --> H
    N -- No --> O{More categories?}
    O -- Yes --> E
    O -- No --> P[Done]
    subgraph LSI_BUG ["LSI only"]
        Q["save auto_rebuild, set false"]
        R["ensure: restore + build_index per-category"]
    end
    E -.-> Q
    Q --> G
    G -.-> R
Loading
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
lib/classifier/lsi.rb:667-688
**`build_index` called once per category instead of once after all streams**

The `begin/ensure` block (which saves/restores `@auto_rebuild` and calls `build_index`) is now inside the `.each` loop. When multiple categories are provided, `build_index` fires after every stream rather than once at the end. This defeats the original optimisation: if `auto_rebuild` was `true` before the call, you'll pay the full re-index cost for each category, and every intermediate rebuild immediately becomes obsolete. The fix is to hoist the `auto_rebuild` save, the `@auto_rebuild = false` assignment, and the `ensure` block outside the loop so the index is rebuilt exactly once after all categories have been loaded.

### Issue 2 of 3
lib/classifier/bayes.rb:332-333
**Silent no-op when only one of `category`/`io` is supplied**

When a caller passes only `category` (e.g. `train_from_stream(:spam, batch_size: 100)`) or only `io`, the ternary resolves to `categories` which is an empty hash, so the method returns without training or raising. Before this PR, the required positional parameters made such a call a Ruby `ArgumentError`. Consider an explicit guard to preserve that contract.

```suggestion
    def train_from_stream(category = nil, io = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories)
      raise ArgumentError, 'Provide either (category, io) or keyword category: io pairs' if category.nil? == io.nil? && category.nil? && categories.empty?
      raise ArgumentError, 'Provide both category and io, or use keyword arguments' if category.nil? ^ io.nil?

      (category && io ? { category => io } : categories).each do |category, io|
```

### Issue 3 of 3
lib/classifier/streaming.rb:29-30
**Base method signature still requires `category` and `io` as positional arguments**

The `Streaming` module's stub still declares `category` and `io` as required, but every concrete implementation now makes them optional. A caller coding to the base interface would be incorrectly told these are mandatory. Update the stub and its `@rbs` annotation to match.

```suggestion
    # @rbs (?(Symbol | String), ?IO, ?batch_size: Integer, **Hash[Symbol, IO]) { (Progress) -> void } -> void
    def train_from_stream(category = nil, io = nil, batch_size: DEFAULT_BATCH_SIZE, **categories, &block)
```

Reviews (1): Last reviewed commit: "feat: add keyword argument support for `..." | Re-trigger Greptile

Comment thread lib/classifier/lsi.rb Outdated
Comment thread lib/classifier/bayes.rb Outdated
Comment thread lib/classifier/streaming.rb Outdated
@Yegorov
Copy link
Copy Markdown
Contributor Author

Yegorov commented May 21, 2026

@cardmagic can you take a look, please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add keyword argument support for train_from_stream

1 participant