Skip to content

Speed and memory optimizations: rewrites, inplace ops, and gradient graphs #9

@bwengals

Description

@bwengals

Collecting ideas for performance work. ptgp relies on PyTensor's rewrite system to choose efficient algorithms, but we haven't done a thorough audit of what's actually happening at compile time.

The ptgp-local rewrites have already made a significant difference. For the unapproximated GP MLL, the joint forward+gradient graph compiles down to exactly 1 Cholesky and 0 MatrixInverse ops (the naive graph before rewrites has multiple of each). The rewrites that achieve this include PSD-aware slogdet lowering, Cholesky reuse across the graph, and a composite-merge pass that prevents the fusion optimizer from duplicating cubic ops.

Rewrites that are generally useful should be upstreamed into PyTensor. Rewrites that are GP-specific can live locally in ptgp, or live there until they're mature enough to upstream.

Areas to investigate

  • Memory profiling and inplace operations — are Cholesky, solve, and other cubic ops getting inplace flags where they should? I haven't looked closely at whether PyTensor is handling this well. There may be missed inplace opportunities, especially when tensors pass through assumption annotations.

  • PyTensor rewrites — which rewrites are firing on our graphs and which aren't? Are there rewrites we should be registering ourselves (e.g. Woodbury for low-rank-plus-diagonal structure)?

  • Gradient graph quality — PyTensor generates gradient graphs via autodiff, but for known kernels (Matern, ExpQuad, etc.) there are analytic gradient forms. It would be worth comparing the autodiff graphs to the analytic forms to see if there are unnecessary intermediate allocations or missed simplifications.

  • Structured matrix operations — Kronecker, Toeplitz, banded, and block-diagonal solvers all exploit specific matrix structure to reduce the cost of Cholesky / solve below the dense O(N^3) baseline. Which of these structures show up in ptgp's graphs, and could PyTensor dispatch to them?

Advanced / subjective rewrites

PyTensor can arbitrarily manipulate graphs, and some potential rewrites are subjective and situation-specific. The simplest example is the Woodbury lemma: whether to apply it depends on the relative shapes of the matrices involved. A more ambitious example is choosing between Cholesky and iterative solvers (CG, Lanczos) for solve and slogdet (see #11). PyTensor can inspect the incoming matrix shapes at compile time (if the user specifies them) and could potentially choose the algorithm automatically.

Could this be done automatically, or should users choose the approach? Both?

Other libraries handle this at runtime rather than compile time:

  • GPyTorch uses a LinearOperator class where structured matrix types (DiagLinearOperator, LowRankRootLinearOperator, etc.) dispatch to specialized algorithms at runtime
  • GPJax uses CoLA (Compositional Linear Algebra), which combines a linear operator abstraction with over 70 dispatch rules that select algorithms based on detected structure (e.g. solve[Sum[LowRank, Diagonal]] dispatches to Woodbury automatically)

ptgp's approach could be the compile-time version of the same idea: use shape info and assumption tags in the symbolic graph to choose algorithms after the initial PyTensor graph is created. How should this be implemented?

Open questions

  • What tools or approaches are people using to profile PyTensor graph execution?
  • Are there other optimization opportunities we should be looking at?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions