Skip to content
This repository was archived by the owner on Dec 1, 2025. It is now read-only.

Conversation

@dougbrn
Copy link
Collaborator

@dougbrn dougbrn commented Mar 13, 2025

Resolves #89.

The usability of this on the nested-dask side is hit pretty hard by meta assignment, as the user needs to build their own nesteddtypes appropriately, like this:

ndd = generate_data(20, 20, npartitions=2, seed=1)

    def complex_output(flux):
        return {
            "max_flux": np.max(flux),
            "lc.flux_quantiles": np.quantile(flux, [0.1, 0.2, 0.3, 0.4, 0.5]),
            "lc.labels": [0.1, 0.2, 0.3, 0.4, 0.5],
            "meta.colors": ["green", "red", "blue"],
        }

    # ouch
    result_meta = npd.NestedFrame(
        {
            "max_flux": pd.Series([], dtype="float"),
            "lc": pd.Series(
                [],
                dtype=NestedDtype(
                    pa.struct(
                        [
                            pa.field("flux_quantiles", pa.list_(pa.float64())),
                            pa.field("labels", pa.list_(pa.float64())),
                        ]
                    )
                ),
            ),
            "meta": pd.Series([], dtype=NestedDtype(pa.struct([pa.field("colors", pa.list_(pa.string()))]))),
        }
    )
    result = ndd.reduce(complex_output, "nested.flux", infer_nesting=True, meta=result_meta)

You can get away from this by just making these object columns, technically, but the downstream consequences of not keeping these dtypes correct will likely be pretty obstructive. This means that unless we find a way to soften the meta requirements (#73), a user would likely want to not use the infer_nesting feature and instead use a pack_lists style call after reduce.

A potential choice to make here could even be to specifically not support nesting inference until #73, but given that the motivator for this feature comes from LSDB it would be unfortunate to not have it be available in dask.

@github-actions
Copy link

github-actions bot commented Mar 13, 2025

Before [49d26eb] After [b2477e3] Ratio Benchmark (Parameter)
876±40ms 898±8ms 1.02 benchmarks.NestedFrameQuery.time_run
281±4ms 283±6ms 1.01 benchmarks.NestedFrameAddNested.time_run
137M 137M 1 benchmarks.NestedFrameAddNested.peakmem_run
139M 139M 1 benchmarks.NestedFrameQuery.peakmem_run
136M 136M 1 benchmarks.NestedFrameReduce.peakmem_run
290±10ms 285±3ms 0.98 benchmarks.NestedFrameReduce.time_run

Click here to view all benchmarks.

@codecov
Copy link

codecov bot commented Mar 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.78%. Comparing base (49d26eb) to head (d7e0d53).
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #90      +/-   ##
==========================================
+ Coverage   95.76%   95.78%   +0.01%     
==========================================
  Files           9        9              
  Lines         260      261       +1     
==========================================
+ Hits          249      250       +1     
  Misses         11       11              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dougbrn
Copy link
Collaborator Author

dougbrn commented Mar 13, 2025

Took a step towards #73 here by adopting dask's native meta inference. Dask tries the function on some dummy data and sets the meta accordingly, note that this will never resolve a correct dtype for nesteddtypes and often will not for even normal dtypes.

@dougbrn dougbrn requested a review from hombit March 14, 2025 16:47
Copy link
Collaborator

@hombit hombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, let's go with it!

@dougbrn dougbrn merged commit cb2bbbf into main Mar 17, 2025
9 checks passed
@dougbrn dougbrn deleted the infer_nesting branch March 17, 2025 16:42
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adopt nested-pandas reduce output nesting inference

3 participants