Improve reweighting: L-BFGS optimizer, float64, L2 penalty #403
Conversation
…e target filtering

Major changes to the national reweighting optimizer:

1. Drop impossible targets: automatically filter out targets where the data column is all-zero (8 targets for estate income/losses and rent & royalty net income/losses). These caused the loss to plateau at ~8.0. Now 550 targets instead of 558.
2. Switch from float32 to float64: eliminates floating-point precision issues that caused cross-machine non-determinism on the flat loss surface.
3. Run reweighting in a subprocess: isolates it from PyTorch autograd state left by PolicyEngine Microsimulation, which shifted gradient accumulation order by 1 ULP, compounding over many iterations.
4. Pre-scale weights: multiply all weights so the weighted filer total matches the SOI target before optimization. This ensures the L2 deviation penalty only measures redistributive changes, not the level shift.
5. Enable an L2 weight deviation penalty (default 0.0001): penalizes sum((new - original)^2) / sum(original^2), scaled by the initial loss value. Reduces extreme weight distortion while maintaining excellent target accuracy.
6. Switch from Adam to the L-BFGS optimizer: a quasi-Newton method with strong Wolfe line search (sketched below). Dramatically better convergence: 549/550 targets within 0.1% (vs 523/550 with Adam at the same penalty). GPU and CPU produce nearly identical results.
7. Extract build_loss_matrix() to a module-level function for reuse.
8. Add diagnostic output: penalty value, target accuracy statistics, weight change distribution, and a reproducibility fingerprint for cross-machine comparison.

Remove TensorBoard and tqdm dependencies (no longer needed with L-BFGS).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
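For readers who want to see the shape of changes 5 and 6, here is a minimal sketch of an L-BFGS loop with the L2 weight-deviation penalty. The names (loss_matrix, targets, original_weights) and the direct optimization of the weight vector (no bounds handling, no diagnostics) are illustrative assumptions, not the actual tmd/utils/reweight.py code.

```python
# Illustrative sketch only (not the actual tmd/utils/reweight.py implementation).
# Shows an L-BFGS loop in float64 with the L2 penalty on deviations from the
# original weights, scaled by the initial target loss.
import torch

def reweight_sketch(loss_matrix, targets, original_weights,
                    penalty=1e-4, max_iter=200):
    A = torch.as_tensor(loss_matrix, dtype=torch.float64)        # (records, n_targets)
    t = torch.as_tensor(targets, dtype=torch.float64)            # (n_targets,)
    w0 = torch.as_tensor(original_weights, dtype=torch.float64)  # pre-scaled weights
    w = w0.clone().requires_grad_(True)

    def target_loss(weights):
        rel_err = (weights @ A) / t - 1.0                         # relative miss per target
        return (rel_err ** 2).sum()

    with torch.no_grad():
        loss0 = target_loss(w0)                                   # scale for the penalty term

    opt = torch.optim.LBFGS([w], max_iter=max_iter,
                            line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        deviation = ((w - w0) ** 2).sum() / (w0 ** 2).sum()       # L2 weight-deviation penalty
        loss = target_loss(w) + penalty * loss0 * deviation
        loss.backward()
        return loss

    opt.step(closure)
    return w.detach()
```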
@martinholmer, could you please review this draft PR, which addresses #400? Please run it on your machine with

The key changes that probably help make results cross-machine-reproducible are (1) adding a regularization term, a penalty for deviations from the pre-optimization weights (biggest impact), which should lead the optimizer to a unique solution among the many possible solutions, and (2) moving from float32 to float64. I think the regularization term should make the results much better (more plausible and better representations of the real world) in addition to being more reproducible. I was always frustrated that we didn't use regularization before. (It was not working well, likely because it used an L1 loss function; this PR changes the weight deviation penalty to an L2 loss.)

Running it will probably take more time than before, because it does so much more. With a GPU, it runs in about 45 seconds on my machine, up from 13 seconds before. With CPU only (the automatic fallback), it runs in about 13 minutes, which is OK, I guess. If your machine is older and doesn't have a GPU, it may take longer. It runs for 200 iterations.

It currently fails several tests, as noted above. That will have to be fixed. I think the right solution is to change the expectations, but could you please weigh in on that? I can update the expectations if we agree that's the way to go.
Thanks, @martinholmer. Done.
@donboyd5, I downloaded PR #403 and merged in recent changes on the master branch (which is something you should do on your computer), and then executed "make clean ; make test".

Suggestion

One thing I would suggest is removing this print output: This is not a WARNING; these eight variables should never have been included in the reweighting optimization.

Question

I thought these changes would bring the reweighting results across computers into line, but that is not so.
@donboyd5, when on your computer you use your NVIDIA chip to accelerate the reweighting optimization, are the floating-point operations on the NVIDIA chip being done at 32-bit precision or at 64-bit precision? Google AI gives this response (which is not clear to me):
Actually, when using the M4 64-bit CPU on my MacBook Air, the reweighting time is 524 secs (~8.7 mins).
Here's what Claude says with regard to the specific code:

Our code does run in true 64-bit on the GPU. When we create tensors with torch.float64, PyTorch executes FP64 operations on the CUDA cores; the GPU respects the tensor dtype and doesn't silently downgrade. However, Google AI's answer highlights the tradeoff: FP64 on GeForce cards runs at ~1/32 the throughput of FP32 (consumer cards intentionally throttle FP64 to differentiate them from data-center GPUs like the A100/H100).

So, bottom line: we're genuinely computing in 64-bit; it's just slower than 32-bit would be. For our workload (~46 seconds total), it's well worth the precision.
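A quick way to check the dtype claim, assuming a CUDA device is available. This is a verification snippet, not project code:

```python
# Verifies that CUDA matmuls honor the tensor dtype: float64 inputs produce a
# float64 result, at noticeably lower throughput on consumer GeForce cards.
import time
import torch

def timed_matmul(dtype, n=4096, device="cuda"):
    a = torch.randn(n, n, dtype=dtype, device=device)
    b = torch.randn(n, n, dtype=dtype, device=device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    c = a @ b
    torch.cuda.synchronize()
    return c.dtype, time.perf_counter() - start

if torch.cuda.is_available():
    print(timed_matmul(torch.float32))  # (torch.float32, <seconds>)
    print(timed_matmul(torch.float64))  # (torch.float64, <seconds>), much slower on GeForce
```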
Impressive.
I think we need some kind of programmer self-protection: with automated methods of defining targets, even a responsible programmer in the future, testing a large number of possible targets, could include a target that the data cannot hit. That should be reported. One option would be to add a test that checks whether any columns were all zeros (and therefore removed); then we wouldn't need anything in the normal console output.
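A minimal sketch of such a test follows. It assumes the module-level build_loss_matrix() mentioned in the PR yields a loss matrix plus its target names; the fixture and function names here are hypothetical, not the actual project API.

```python
# Hedged sketch of the proposed test. The fixture names and the assumption
# that the loss matrix and its target names are available to the test are
# hypothetical, not the actual project API.
import numpy as np

def test_no_all_zero_target_columns(loss_matrix, target_names):
    column_sums = np.abs(np.asarray(loss_matrix)).sum(axis=0)
    all_zero = [name for name, s in zip(target_names, column_sums) if s == 0]
    assert not all_zero, (
        f"these targets have all-zero data columns and cannot be hit: {all_zero}"
    )
```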
There are still hardware and software differences that could lead to very small differences in results. I think these differences are extremely small and might have no noticeable impact on results. Can you look at the actual values I have for the following test failures and see whether they are noticeably different from what you have? I suspect they are extremely close, if not identical (as far as the displayed amounts are concerned; there would be differences at full precision, of course). We should decide whether the differences are small enough for our purposes. If not, we can always try another 100 iterations and see what we get.
@martinholmer, the note above about differences is just for starters. Most of those numbers have few digits, so it would be better to compare more-precise numbers. If you can provide me with your tmd files in the folder we have been sharing, that would be great. Meanwhile, I'll post my files that result from the run I did for the PR.
This is an excellent idea --- one that should have been part of the original phase of the project. Why don't you add that check and streamline the "normal console output"?
Will definitely do when I revise the PR.
Comparing "actual values" is a good idea, but the test results are too indirect.
Thanks.
@donboyd5 said in PR #403 that, when using his NVIDIA chip, the reweighting FINGERPRINT was this:

Using my Apple M4 chip, I get this FINGERPRINT:

The differences in these fingerprints are considerable.
@martinholmer, you are right. I've learned a lot since yesterday and have a suggested alternative analysis if you have time to do one more run on your computer to compare to one on mine. Here's what I learned, with considerable help from Claude:
The fix, reflected in an update I'll push in a few minutes:
The update is based on current master. It also converts the all-zero-column console print to a proper Python UserWarning with a test, and improves the optimization log output.

The larger penalty for weight deviations means those deviations are more important relative to differences from targets than they were before, so errors in targets can worsen. We have a few targets with errors larger than we would like, but, as always, that can mean those targets are simply hard to achieve. I am OK with that.

Results on my machine are now virtually identical between GPU and CPU. I'm hopeful this means you'll see near-identical results too. The optimization runs in about 77 seconds using my GPU, and in 26 minutes using my CPU. Here is the fingerprint info from the GPU run:

Finally, with Claude I explored many dead ends and suboptimal solutions between yesterday and today, including Adam with different stopping criteria and max iterations (20,000, up from 2,000!), L-BFGS-B through scipy (the bounds version) rather than L-BFGS through PyTorch with clamping to implement bounds (the best approach), and several other approaches. Results are in the archive subfolder of our Drive folder if you have any interest; I will delete them in a day or two. I do have one idea for how we might speed up the CPU-only solution, but I am not sure the CPU implementation is so slow that it merits more work, and I am not sure the idea would work.

My GPU results are in folder tmd_2026-02-19_lbfgs_gpu_800iters_p010_gradnorm_prupdate. Would you be able to run

Many thanks.
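For reference, a minimal sketch of the print-to-UserWarning conversion described above; the function and argument names are illustrative, not the exact reweight.py code.

```python
# Sketch only: emit a UserWarning (instead of a print) when all-zero target
# columns are dropped, so the behavior can be asserted in a test.
import warnings
import numpy as np

def drop_all_zero_columns(loss_matrix, target_names):
    column_sums = np.abs(loss_matrix).sum(axis=0)
    keep = column_sums > 0
    dropped = [n for n, k in zip(target_names, keep) if not k]
    if dropped:
        warnings.warn(
            f"dropping {len(dropped)} target(s) with all-zero data columns: {dropped}",
            UserWarning,
        )
    kept_names = [n for n, k in zip(target_names, keep) if k]
    return loss_matrix[:, keep], kept_names

# A test can then check for the warning, e.g.:
#     with pytest.warns(UserWarning, match="all-zero data columns"):
#         drop_all_zero_columns(matrix_with_a_zero_column, names)
```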
@martinholmer, good to go.
Superseded by #407, which is rebased on current master. |
Summary
Addresses #400. Improves the reweighting optimization in several ways:
- Enables an L2 weight deviation penalty (REWEIGHT_DEVIATION_PENALTY = 0.0001) that penalizes large deviations from original weights using the L2 norm: sum((new - original)^2) / sum(original^2)

The 8 impossible targets (all-zero data columns) dropped are:
Results
Optimization (NVIDIA GeForce RTX 5070 Ti):
Target accuracy (550 targets after filtering impossible ones):
Weight change distribution (vs pre-optimization weights):
Reproducibility fingerprint:
Known test failures
4 tests fail because expected values need updating for the new weights. These are not regressions — they reflect the changed weight distribution. Test expectations should be updated after the approach is reviewed and approved.
- test_weights
- test_variable_totals
- test_imputed_variables
- test_tax_expenditures

40 passed, 4 failed, 2 skipped.
Files changed
- tmd/utils/reweight.py
- tmd/datasets/tmd.py
- tmd/imputation_assumptions.py: REWEIGHT_DEVIATION_PENALTY = 0.0001