Skip to content

Conversation

@swfsql
Copy link
Contributor

@swfsql swfsql commented Nov 26, 2025

Pull Request Template

Checklist

  • Confirmed that cargo run-checks command has been executed.
  • Made sure the book is up to date with changes in this PR.

Related Issues/PRs

Changes

  • Adds the graph_cleanup method to the burn::tensor::backend::AutodiffBackend trait.
  • Also changed how the ordering of node removal is made, ensuring all leaf nodes are removed first.
    • This is not necessary since a given parent can only have a single child, but I think this could be a good little change.

Testing

Example of usage:

use burn::prelude::*;

type NdArray = burn::backend::NdArray<f32, i32>;
type NdArrayAuto = burn::backend::Autodiff<NdArray>;

fn main() {
    train::<NdArrayAuto>();
}

pub fn train<AutoB: burn::tensor::backend::AutodiffBackend>() {
    let train_device = <<AutoB as Backend>::Device>::default();
    for _ in 0..10_000_000 {
        let a: Tensor<AutoB, 2> = Tensor::zeros([2, 2], &train_device);
        let a = a.require_grad(); // an autodiff node is created
        drop(a); // the tensor is dropped but its unusable graph persists
        AutoB::graph_cleanup(); // either here, after the tensor is dropped, or after the loop
    }
    // in here, without graph_cleanup, AutoB/train_device cannot be used to cleanup the "hanging nodes"
}

Then by accompanying the system's ram usage, the memory usage doesn't grow (in case it's in the loop). Also, the are no vector/hashmaps/hashsets re-allocations, possibly improving runtime by a little.


I recently rebased the graph-sweep branch to main, as I'm using it locally.
In the current main, the "hot loop stress test" (increased to 10 million iterations) eventually still has the ram going up to around ~1GB, whereas graph-sweep leaves it limited at ~138kB. I haven't looked much into it why some nodes are missed on main, but I suspect it could be related to Arc counting (but I'm not sure).

Well, the reason why I created another PR is because all updates to the graph-sweep branch are ignored on the previous PR, because it is closed. And If the PR was already closed and if I rebased the branch (i.e. force-pushed the branch), I can no longer re-open it (-.-)', it appears to be a Github thing. So if it is ok, and contrary to what I did, I'd like to leave this PR open, even if it's open-ended.

@swfsql swfsql mentioned this pull request Nov 26, 2025
2 tasks
@swfsql swfsql marked this pull request as draft November 26, 2025 22:23
@swfsql
Copy link
Contributor Author

swfsql commented Nov 27, 2025

To note, this PR probably has deadlocks on it, given the github action test runs.

@vadixidav
Copy link

Let me know when it is stable again and I will battle test your branch. I am running training continuously and benefit a lot for my application from graph cleanup. I will say rc4 has been better with automated graph cleanup, but it isn't perfect. It helps a whole lot to call graph cleanup after I am done running all my backwards calls.

@swfsql
Copy link
Contributor Author

swfsql commented Dec 3, 2025

@vadixidav I still haven't investigated the deadlocks, but I often rebase it to main (those are the force-push updates). But I'll ping you once I do -- but plz note that I'm not planning to investigate it too soon as I'm not getting the deadlocks for my own use.

And also to note, this PR as it is, you don't need to manually run it (you can if you want), but it automatically runs on every backward as well.

Edit: @laggui Also to note, if I moved the "extra cleanup procedure" from this PR to below the current cleanup procedure (not above as it currently is), then it fails (deadlocks aside) in cleaning up. So this is why I believe that this is related to Arc counting on the standard main cleanup. But then again, I haven't checked this too closely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants