Add Schur completment and its mat-free mode#35
Conversation
Added a section for future plans including a new backend for distributed solver.
There was a problem hiding this comment.
Code Review
This pull request introduces high-performance Triton kernels for sparse BSR operations, including matrix-vector multiplication, matrix-matrix multiplication, and transposition. It also implements a matrix-free NormalMatVec operator and a new Schur complement-based optimizer to improve the efficiency of bundle adjustment tasks. The bundle adjustment example was updated with CUDA memory snapshotting and Warp mempool reporting. Review feedback highlights a critical issue where in-place diagonal modifications in the LM and Schur optimizers cause damping factors to accumulate incorrectly during step rejections. Additionally, the reviewer recommends removing performance-hindering torch.cuda.empty_cache() calls, addressing potential divisions by zero in the Conjugate Gradient solver, and cleaning up redundant or commented-out code.
| diag_scale *= 1.0 + pg['damping'] | ||
| A.set_damping(diag_scale - 1.0) | ||
| else: | ||
| diagonal_op_(A, op=partial(torch.mul, other=1+pg['damping'])) |
There was a problem hiding this comment.
The diagonal_op_ function performs an in-place multiplication on the matrix A. Since this is inside the while loop, if a step is rejected and the loop repeats, the damping will be applied cumulatively (e.g.,
| R = R.tensor() | ||
| else: | ||
| R = R.detach() | ||
| torch.cuda.empty_cache() |
There was a problem hiding this comment.
Calling torch.cuda.empty_cache() inside the optimization step is generally discouraged as it triggers a GPU synchronization and can significantly degrade performance. If memory management is a concern, it's better to optimize tensor lifecycles or use a dedicated memory pool. If this was added for debugging memory usage, it should be removed before merging.
|
|
||
| Ap = matvec(p) | ||
| Ap_flat = Ap.reshape(-1) | ||
| alpha = (rz / torch.dot(p.reshape(-1), Ap_flat)).item() |
There was a problem hiding this comment.
Potential division by zero if torch.dot(p.reshape(-1), Ap_flat) is zero (e.g., if the matrix is singular or not positive definite). While
|
|
||
| rz_new = torch.dot(r_flat, z_flat) | ||
| beta = (rz_new / rz).item() | ||
| p.mul_(beta).add_(z) |
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
This pull request introduces significant improvements to the optimizer infrastructure, focusing on enhanced memory profiling, a new Schur complement optimizer, and better support for matrix-free operations.
Optimizer Enhancements
Added a new
Schuroptimizer class inbae.optim.optimizer, implementing the Schur complement method with support for both standard and matrix-free normal equations, block Jacobi preconditioning, and efficient memory usage.Updated the
LMoptimizer to support amatrix_free_normalmode, allowing for more efficient computation and memory usage in large-scale problems.Add a custom
TrustRegionclass that supports Warp, especially for use with the Schur optimizer.Sparse Matrix and PyOps Improvements
inv_opfor correct tensor creation and a new test block inpy_ops.pyfor diagonal operations on CUDA.