[Claude] Migrate RNG handling to modern NumPy default_rng/Generator API #31
Merged
The package previously relied on the legacy global ``np.random`` interface and ``np.random.seed()``, which has several drawbacks:

- per-worker ``np.random.seed`` produces correlated streams when distributing work;
- the legacy MT19937 lacks the spawn-based parallelism guarantees of the modern API;
- library-level use of the global RNG makes results depend on whatever the caller has done to ``np.random`` elsewhere.

Each ``EmpiricalDistribution`` (and ``MultiSampleEmpiricalDistribution``) now owns a ``Generator``, accepted via a new ``rng`` constructor argument, and sampling routes through that generator instead of the global one. All public bootstrap entry points (``bootstrap_samples``, ``standard_error``, ``bias``, ``better_bootstrap_bias``, ``bias_corrected``, ``percentile_interval``, ``bcanon_interval``, ``t_interval``, ``calibrate_interval``, ``bootstrap_asl``, ``percentile_asl``, ``bcanon_asl``, ``bootstrap_power``, ``prediction_error_optimism``, ``prediction_error_632``, ``prediction_interval``) take an optional ``rng=`` argument.

Multi-threaded paths use ``SeedSequence.spawn`` to hand each worker an independent stream, replacing the prior pattern of seeding workers from sibling draws of the parent's MT19937 state.

Tests now construct ``EmpiricalDistribution(data, rng=seed)`` (or pass ``rng=`` to inference functions) instead of calling ``np.random.seed()``; expected values were re-recorded under PCG64.

Bumps the minimum versions to ``numpy>=1.25`` (for ``SeedSequence.spawn`` / ``bit_generator.seed_seq``) and ``pandas>=1.4`` (for ``random_state=Generator`` in ``DataFrame.sample``).

https://claude.ai/code/session_01DX4Gi3Vwx6qJwr1ZYQEJiJ
The custom ``EmpiricalDistribution`` subclasses in the zero-inflated and significance guides previously sampled from the global ``np.random``, which contradicts the rng-as-first-class-argument pattern the library now uses. Their ``__init__`` now forwards ``rng=`` to the base class and ``sample()`` draws from ``self._rng``, so the subclasses are reproducible from a single seed and work correctly with the spawn-based parallel paths.

The README quickstart gains a short paragraph showing ``EmpiricalDistribution(df, rng=0)`` and noting that inference functions also accept ``rng=`` directly. The quantiles-at-scale guide passes its existing ``rng`` to the empirical distribution for consistency. ``uv.lock`` is regenerated against the bumped numpy / pandas minimums from the prior commit.

https://claude.ai/code/session_01DX4Gi3Vwx6qJwr1ZYQEJiJ
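The subclass change can be sketched roughly as follows. Both classes are stand-ins written for illustration: ``EmpiricalDistributionSketch`` is not the library's base class, and ``ZeroInflatedSketch``, its attributes, and its sampling scheme are assumptions rather than the code in the guides. The only details taken from the description are that ``__init__`` forwards ``rng=`` and that ``sample()`` draws from ``self._rng``.

```python
import numpy as np


class EmpiricalDistributionSketch:
    """Stand-in for the library's base class; not its real implementation."""

    def __init__(self, data, rng=None):
        self.data = np.asarray(data, dtype=float)
        # The distribution owns its Generator instead of using np.random.
        self._rng = np.random.default_rng(rng)

    def sample(self, size=None):
        n = len(self.data) if size is None else size
        return self._rng.choice(self.data, size=n, replace=True)


class ZeroInflatedSketch(EmpiricalDistributionSketch):
    """Hypothetical subclass: draw the zero count, then resample non-zeros."""

    def __init__(self, data, rng=None):
        # Forward rng= to the base class so one seed controls everything.
        super().__init__(data, rng=rng)
        self.nonzero = self.data[self.data != 0]
        self.p_zero = 1.0 - len(self.nonzero) / len(self.data)

    def sample(self, size=None):
        n = len(self.data) if size is None else size
        # Every random draw goes through self._rng, never the global state.
        n_zero = self._rng.binomial(n, self.p_zero)
        draws = self._rng.choice(self.nonzero, size=n - n_zero, replace=True)
        return np.concatenate([np.zeros(n_zero), draws])
```

Two instances built with the same seed, for example ``ZeroInflatedSketch(data, rng=0)``, produce identical samples regardless of any ``np.random.seed()`` calls made elsewhere in the process.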