⚡️ Speed up function _gridmake2_torch by 7%
#989
base: experimental-jit
Conversation
Code Review for PR #989: ⚡️ Speed up function `_gridmake2_torch` by 7%
📄 7% (0.07x) speedup for `_gridmake2_torch` in `code_to_optimize/discrete_riccati.py`

⏱️ Runtime: 5.63 milliseconds → 5.28 milliseconds (best of 37 runs)

📝 Explanation and details
The optimized code achieves a 6% speedup through two key changes:
## Primary Optimization: Replacing `tile()` with `repeat()`

The line profiler shows that `x1.tile(x2.shape[0])` consumed 68.6% of the original runtime. The optimization replaces this with `x1.repeat(n)`, which is significantly faster because:

- `torch.tile()` creates unnecessary intermediate copies when expanding tensors
- `Tensor.repeat()` is a more direct memory operation for simple replication along a single dimension
- In the 2D case, `x1.repeat(n, 1)` similarly outperforms `x1.tile(n, 1)` by avoiding redundant copy operations
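For illustration, a minimal sketch of the 1-D replication change (the tensor values and `n` here are hypothetical stand-ins, not taken from the PR):

```python
import torch

x1 = torch.arange(3)   # hypothetical input, e.g. tensor([0, 1, 2])
n = 4                  # stands in for x2.shape[0]

# Original pattern: tile the 1-D tensor n times
tiled = x1.tile(n)     # shape (12,)

# Optimized pattern: repeat performs the same replication more directly
repeated = x1.repeat(n)

assert torch.equal(tiled, repeated)  # identical results
```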
## Secondary Optimization: `torch.stack()` vs `torch.column_stack()`

For the 1D-1D case, replacing `torch.column_stack([first, second])` (27.5% of runtime) with `torch.stack((first, second), dim=1)`:

- `torch.stack()` is more efficient when stacking exactly two 1D tensors into a 2D result
- `torch.column_stack()` has additional overhead to handle variable-length lists and more general input shapes
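A small sketch of the equivalence, with hypothetical inputs:

```python
import torch

first = torch.arange(4, dtype=torch.float64)        # hypothetical column 1
second = torch.arange(4, dtype=torch.float64) * 10  # hypothetical column 2

# Original pattern: general-purpose column stacking
a = torch.column_stack([first, second])             # shape (4, 2)

# Optimized pattern: stack the two 1-D tensors along a new dim=1
b = torch.stack((first, second), dim=1)

assert torch.equal(a, b)  # same (4, 2) result either way
```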
## Added JIT Compilation

The `@torch.compile` decorator enables PyTorch 2.0's graph optimization, which can provide additional speedups through:

- Fusion of operations (reducing intermediate tensor allocations)
- Kernel optimizations for the specific tensor operations used
- Note: the first call incurs compilation overhead, but subsequent calls benefit from cached optimized code
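Putting the pieces together, a hedged sketch of what the optimized function plausibly looks like. The PR description only names the changed calls, so the construction of the `second` column via `repeat_interleave`, the 2D assembly with `hstack`, and the exact error message are assumptions:

```python
import torch

@torch.compile  # PyTorch 2.0 graph compilation; first call pays the compile cost
def _gridmake2_torch(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    n = x2.shape[0]
    if x1.dim() == 1 and x2.dim() == 1:
        first = x1.repeat(n)                        # was: x1.tile(x2.shape[0])
        second = x2.repeat_interleave(x1.shape[0])  # assumed: mirrors np.repeat
        return torch.stack((first, second), dim=1)  # was: torch.column_stack
    if x1.dim() == 2 and x2.dim() == 1:
        first = x1.repeat(n, 1)                     # was: x1.tile(n, 1)
        second = x2.repeat_interleave(x1.shape[0]).unsqueeze(1)
        return torch.hstack((first, second))        # assumed layout for 2D case
    raise NotImplementedError("gridmake2 supports (1D, 1D) or (2D, 1D) inputs")
```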
## Impact Assessment

This optimization is most beneficial for workloads that:

- Call `_gridmake2_torch` repeatedly with similar tensor shapes (amortizing JIT compilation cost)
- Use moderately-sized tensors where memory allocation overhead is significant
- Process cartesian products in computational economics, grid-based algorithms, or combinatorial expansions

The changes preserve all behavior, types, and error handling exactly.
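Because `torch.compile` front-loads its cost, any benchmark should warm the function up before measuring. A hypothetical timing harness (names and grid sizes are illustrative, and it assumes the `_gridmake2_torch` sketch above is in scope):

```python
import time
import torch

x1 = torch.linspace(0.0, 1.0, 500)  # illustrative grid sizes
x2 = torch.linspace(0.0, 1.0, 400)

_gridmake2_torch(x1, x2)  # warm-up call: triggers (and caches) compilation

start = time.perf_counter()
for _ in range(100):
    _gridmake2_torch(x1, x2)
elapsed_ms = (time.perf_counter() - start) / 100 * 1e3
print(f"~{elapsed_ms:.3f} ms per call after warm-up")
```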
✅ Correctness verification report:
⚙️ Existing Unit Tests
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_2d_and_1d_matches_numpy
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_2d_and_1d_simple
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_2d_and_1d_single_column
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_both_1d_float_tensors
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_both_1d_matches_numpy
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_both_1d_simple
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_both_1d_single_element
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_large_tensors
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_not_implemented_for_1d_2d
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_not_implemented_for_2d_2d
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_output_shape_1d_1d
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_output_shape_2d_1d
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_preserves_dtype_float64
- test_gridmake2_torch.py::TestGridmake2TorchCPU.test_preserves_dtype_int
- test_gridmake2_torch.py::TestGridmake2TorchCUDA.test_2d_and_1d_cuda
- test_gridmake2_torch.py::TestGridmake2TorchCUDA.test_2d_and_1d_matches_cpu
- test_gridmake2_torch.py::TestGridmake2TorchCUDA.test_both_1d_matches_cpu
- test_gridmake2_torch.py::TestGridmake2TorchCUDA.test_both_1d_simple_cuda
- test_gridmake2_torch.py::TestGridmake2TorchCUDA.test_large_tensors_cuda
- test_gridmake2_torch.py::TestGridmake2TorchCUDA.test_matches_numpy_via_cpu_conversion
- test_gridmake2_torch.py::TestGridmake2TorchCUDA.test_output_stays_on_cuda
- test_gridmake2_torch.py::TestGridmake2TorchCUDA.test_preserves_dtype_float32_cuda
- test_gridmake2_torch.py::TestGridmake2TorchCUDA.test_preserves_dtype_float64_cuda

To edit these changes, `git checkout codeflash/optimize-_gridmake2_torch-mjj3mowi` and push.