Skip to content

Conversation

@MaxGhenis
Copy link
Collaborator

Summary

Changes

Issue #265 Fix

Added custom _MicroLocIndexer and _MicroILocIndexer classes that wrap pandas' indexers. These ensure that when filtering rows, the result is a MicroDataFrame with properly filtered weights:

df = MicroDataFrame({'x': [1, 2, 3]}, weights=[10, 20, 30])
subset = df.loc[df.x > 1]  # Now returns MicroDataFrame with weights [20, 30]
subset.x.sum()  # Returns 130 (weighted), not 5 (unweighted)

Issue #193 Fix

Added __getitem__ method to MicroDataFrameGroupBy class. When selecting columns after groupby, it now returns properly configured groupby objects with weights:

d = MicroDataFrame(dict(g=["a", "a", "b"], y=[1, 2, 3]), weights=[4, 5, 6])
d.groupby("g")["y"].sum()     # Now returns a=14, b=18 (weighted)
d.groupby("g")[["y"]].sum()   # Now returns a=14, b=18 (weighted)

Test plan

  • Added test_loc_preserves_weights() - tests .loc[] filtering
  • Added test_iloc_preserves_weights() - tests .iloc[] filtering
  • Added test_groupby_column_selection() - tests groupby column selection
  • All 26 tests pass
  • Code passes linting

🤖 Generated with Claude Code

- Add custom _MicroLocIndexer and _MicroILocIndexer classes that wrap
  pandas indexers and return MicroDataFrame with filtered weights
- Add __getitem__ to MicroDataFrameGroupBy to support weighted operations
  on column selections like groupby("g")["y"].sum()
- Add tests for loc, iloc, and groupby column selection

Fixes #265, Fixes #193

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@MaxGhenis MaxGhenis requested a review from baogorek November 25, 2025 14:32
Copy link
Collaborator

@baogorek baogorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @MaxGhenis , documentation was also broken in -us-data. I have a branch out if you want me to try to bring anything over.

For the fix, the values are looking good but I'm seeing NaN weights in certain circumstances:

from microdf import MicroDataFrame
In [8]: df = MicroDataFrame({'group': ['A', 'A', 'B', 'B'], 'value': [10, 20, 30, 40]}, weights=[1, 2, 3, 4])

In [9]: df.groupby("group").value.sum()
Out[9]: 
group
A     50.0
B    250.0
dtype: float64

In [10]: df.groupby("group")['value'].sum()
Out[10]: 
group
A     50.0
B    250.0
dtype: float64

In [11]: df.groupby("group").sum()
Out[11]: 
       weight  value
group               
A         NaN   50.0
B         NaN  250.0

In [13]: df.groupby("group")[["value"]].sum()
Out[13]: 
       weight  value
group               
A         NaN   50.0
B         NaN  250.0

I think we'd want to sum the weights, right?

…ights

Aggregated results from groupby operations should be plain DataFrames,
not MicroDataFrames, since the weights have already been applied during
aggregation. This fixes the NaN weight column issue noted in review.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@MaxGhenis
Copy link
Collaborator Author

@baogorek Thanks for the review! I've fixed the NaN weights issue.

The problem was that groupby aggregation results were being returned as MicroDataFrame, which automatically adds a weights column. Since weights have already been applied during aggregation, the result should be a plain pd.DataFrame instead.

Now the output is clean:

from microdf import MicroDataFrame
df = MicroDataFrame({'group': ['A', 'A', 'B', 'B'], 'value': [10, 20, 30, 40]}, weights=[1, 2, 3, 4])

df.groupby("group").sum()
#        value
# group       
# A       50.0
# B      250.0

df.groupby("group")[["value"]].sum()
#        value
# group       
# A       50.0
# B      250.0

I've also added a test to verify aggregated results don't have a spurious weight column.

MaxGhenis and others added 2 commits November 25, 2025 15:24
- Add myst.yml configuration for MyST-based documentation
- Update CI workflow to use myst build instead of jb build
- Add Node.js setup step (required by MyST)
- Remove jupyter_book from pyproject.toml (MyST installs via npm)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@MaxGhenis
Copy link
Collaborator Author

I've also fixed the documentation build by migrating to Jupyter Book 2.0 / MyST:

  • Added docs/myst.yml configuration
  • Updated CI workflow to install mystmd via npm and use myst build --html
  • Added Node.js setup step (required by MyST)

All CI checks are now passing! ✅

@MaxGhenis MaxGhenis requested a review from baogorek November 25, 2025 20:30
Copy link
Collaborator

@baogorek baogorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!

@MaxGhenis MaxGhenis merged commit 835e695 into master Nov 26, 2025
8 checks passed
@MaxGhenis MaxGhenis deleted the fix-issues branch November 26, 2025 01:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MicroDataFrame.loc[] returns plain DataFrame, losing weights MicroDataFrame.groupby(col)[[cols]].aggfunc doesn't consider weights

3 participants