
Conversation

@Ig-dolci
Contributor

@Ig-dolci Ig-dolci commented Feb 8, 2025

Description

This PR seeks to fix the memory leak by calling the function.name method in the CheckpointFunction constructor and storing its result, rather than storing the bound method itself. In addition, this PR proposes disk checkpointing using the SingleDiskStorageSchedule schedule together with the pyadjoint checkpoint manager to execute the solvers. It also enables the user to choose other schedules from checkpoint_schedules with disk storage.

There is a point to consider here:

  • The SingleDiskStorageSchedule schedule stores only the adjoint dependency data. We therefore have to pay attention to cases that require storing all checkpoint data on disk at every single time step, i.e., both the data required to restart the forward solver and the data required for the adjoint computation. SingleDiskStorageSchedule does not support that.
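To illustrate the single-disk-storage idea described above, here is a toy sketch: the forward sweep writes only the data the adjoint needs for each step to disk, and the reverse sweep reads those checkpoints back in reverse order. This mimics the concept behind SingleDiskStorageSchedule; it is not the real checkpoint_schedules API, and all names below are illustrative.

```python
# Toy single-disk-storage checkpointing: store only the adjoint dependency
# data per step on the forward sweep, replay it in reverse on the adjoint
# sweep. Illustrative only; not the checkpoint_schedules implementation.
import os
import pickle
import shutil
import tempfile

def forward_step(u):
    return 1.1 * u + 1.0  # stand-in for one timestep of a solver

def adjoint_step(adj, u_checkpoint):
    # Stand-in adjoint update: d(forward_step)/du = 1.1 here, so the
    # checkpoint value is unused, but it illustrates the data dependency.
    _ = u_checkpoint
    return 1.1 * adj

def run(total_steps, u0=0.0):
    ckpt_dir = tempfile.mkdtemp()
    u = u0
    for step in range(total_steps):
        # Store only the adjoint dependency (the state entering the step).
        with open(os.path.join(ckpt_dir, f"step_{step}.pkl"), "wb") as f:
            pickle.dump(u, f)
        u = forward_step(u)
    adj = 1.0
    for step in reversed(range(total_steps)):
        with open(os.path.join(ckpt_dir, f"step_{step}.pkl"), "rb") as f:
            u_ckpt = pickle.load(f)
        adj = adjoint_step(adj, u_ckpt)
    shutil.rmtree(ckpt_dir)
    return u, adj

u_final, dJ_du0 = run(5)
```

Because nothing beyond the per-step adjoint dependency is kept, restarting the forward solver from an intermediate step is not possible in this scheme, which is exactly the limitation noted above.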

Observation:

  • The tests will pass once pyadjoint PR 195 is merged.

  • I will wait for PR 4023 to be merged before opening this PR for review.

@Ig-dolci Ig-dolci linked an issue Feb 8, 2025 that may be closed by this pull request
@github-actions

github-actions bot commented Feb 8, 2025

Tests (Firedrake real): 8241 ran, 7525 passed ✅, 716 skipped ⏭️, 0 failed ❌

@github-actions

github-actions bot commented Feb 8, 2025

Tests (Firedrake complex): 8234 ran, 6590 passed ✅, 1644 skipped ⏭️, 0 failed ❌

@dham
Member

dham commented Feb 9, 2025

I don't believe this is correct. Merely referencing function.function_space() in CheckpointFunction.__init__ does not create persistent references to function. That isn't going to be the source of the memory leak.

@sghelichkhani
Contributor

Just confirming that this doesn't change the behaviour of any of our reproducers. Also noting that the behaviour differs slightly between my Mac and our HPC system: unlike on my Mac, the RSS seems to go flat after the first derivative (see image). Not sure why it behaves differently on the Mac. At any rate, memory usage seems to go higher with disk checkpointing.
[Screenshot: RSS over time, 2025-02-10]

@dham
Member

dham commented Feb 10, 2025

Can I just check how many Degrees of Freedom this is? I'm trying to work out how many vectors we must have leaked.

@Ig-dolci
Contributor Author

Ig-dolci commented Feb 10, 2025

Firstly, I have identified memory leaks in two cases involving disk checkpointing:

  1. When using only enable_disk_checkpointing().
  2. When using enable_disk_checkpointing() along with checkpoint_schedules.

Additionally, function.function_space() is not the source of the leak. However, assigning self.name = function.name (where function.name is a callable) contributes to the leak in both cases.

The following chart evaluates Case 2 on the master branch (black curve) and with a modification replacing self.name = function.name with self.name = function.name() (blue curve):

[Figure 1: memory usage for Case 2, master (black) vs. modified (blue)]

The red curve represents the case where no disk checkpointing is applied.

Although changing self.name = function.name to self.name = function.name() mitigates the issue in Case 2, it does not resolve the memory leak in Case 1. The cause of this leak is still unknown.
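The mechanism behind this can be shown with a minimal pure-Python sketch: a bound method holds a strong reference to its instance, so storing it keeps the whole object (and any large buffers it owns) alive, while storing the method's result keeps only a string. The class names below are illustrative, not the real Firedrake/CheckpointFunction code.

```python
# Minimal sketch of the leak mechanism: storing `function.name` (a bound
# method) pins the whole object; storing `function.name()` does not.
# Illustrative classes only; not the Firedrake implementation.
import gc
import weakref

class BigFunction:
    def __init__(self):
        self.data = bytearray(10**6)  # stand-in for DoF storage

    def name(self):
        return "u"

class LeakyCheckpoint:
    def __init__(self, function):
        self.name = function.name    # bound method -> keeps `function` alive

class FixedCheckpoint:
    def __init__(self, function):
        self.name = function.name()  # plain string -> no reference kept

f = BigFunction()
ref = weakref.ref(f)
leaky = LeakyCheckpoint(f)
del f
gc.collect()
alive_with_bound_method = ref() is not None  # the bound method pins the object

f2 = BigFunction()
ref2 = weakref.ref(f2)
fixed = FixedCheckpoint(f2)
del f2
gc.collect()
alive_with_string = ref2() is not None  # only the name string is kept
```

Because a checkpointing run creates one such object per stored function, pinning each one this way would accumulate memory in proportion to the number of time steps, which matches the observed growth.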

Next Steps

We have two options moving forward:

  1. Investigate the memory leak in Case 1 (I know it comes from the derivative/gradient computations).
  2. Use disk checkpointing only with checkpoint_schedules.

For Option 2 (this PR proposal), here are the results obtained:

[Figure 2: memory usage with this PR (black), enable_disk_checkpointing() alone (blue), and no disk checkpointing (red)]

  • The black curve corresponds to the results with this PR.
  • The blue curve represents the use of enable_disk_checkpointing() without checkpoint_schedules.
  • The red curve corresponds to the case where no disk checkpointing is applied.

These charts use the example added in issue #4014.
I used 500 time steps; 20 time steps are not enough to expose these leaks.
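For readers who want to reproduce curves like these, one way to sample peak memory use (RSS) from within a run uses only the standard library. This is a generic sketch, not the script used for the charts above; note that ru_maxrss is reported in kilobytes on Linux but bytes on macOS, which can explain apparent platform differences when comparing raw numbers.

```python
# Sample peak RSS from inside a run using only the standard library.
# Generic illustration; not the measurement script used for the charts.
import resource
import sys

def peak_rss_mb():
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss /= 1024  # macOS reports bytes, Linux reports kilobytes
    return rss / 1024  # convert kilobytes -> MiB

samples = []
for step in range(5):
    _ = [0.0] * 100_000  # stand-in for one timestep's allocations
    samples.append(peak_rss_mb())
```

Plotting such samples against the step index gives curves directly comparable to the figures in this thread.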

@dham
Member

dham commented Feb 10, 2025

I believe this answer is correct: storing the name method rather than its result will definitely cause the leak observed.

Moving to checkpoint schedules for everything is our aim in any event, so I am happy to always use it in this case. @sghelichkhani: the PR enables this in a way that shouldn't require any code changes for G-Adopt.

@Ig-dolci
Contributor Author

Perfect. I will convert it to a draft to run more tests, and will reopen it for review as soon as possible.

@Ig-dolci Ig-dolci marked this pull request as draft February 10, 2025 09:59
@sghelichkhani
Contributor

@Ig-dolci I have tried hard to reproduce your black curve, but I still see a big leak on dolci/fix_mem_leak_diskcheckpointing:
[Figure: memory usage comparison]
Am I missing something?

@sghelichkhani
Contributor

@dham For the graph I have above (and presumably what @Ig-dolci is plotting) I am using this reproducer https://github.com/g-adopt/g-adopt/blob/adjoint-memory/demos/mantle_convection/test/tester.py, where Q.dim() is 10201, and we are doing 500 timesteps.

@Ig-dolci
Contributor Author

> @Ig-dolci I have tried hard to reproduce your black curve, but I still see a big leak on dolci/fix_mem_leak_diskcheckpointing: Am I missing something?

I think you are missing the replacement of the time loop from for i in range(total_steps): to for i in tape.timestepper(iter(range(total_steps))). My bad; I have to document this properly or add a warning so users are informed.
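The reason the wrapped loop matters can be illustrated with a toy sketch: a tape can only write checkpoints and release per-step data if it is told where each timestep ends, which is what the wrapping iterator provides. The class below mimics the pattern of tape.timestepper; it is not the pyadjoint implementation.

```python
# Toy illustration of why the time loop must be wrapped: the end-of-step
# hook lets the tape flush and release per-step data. Not the pyadjoint
# implementation; class and method names are illustrative.
class ToyTape:
    def __init__(self):
        self.blocks = []          # operations recorded in the current step
        self.checkpoints = 0      # steps whose data has been flushed

    def record(self, op):
        self.blocks.append(op)

    def end_timestep(self):
        self.checkpoints += 1
        self.blocks.clear()       # finished-step data can now be released

    def timestepper(self, iterator):
        for item in iterator:
            yield item
            self.end_timestep()   # boundary hook runs after each loop body

tape = ToyTape()
for i in tape.timestepper(iter(range(3))):
    tape.record(f"solve at step {i}")
```

With a plain for i in range(...) loop, the end-of-step hook never fires, so per-step data accumulates for the whole run, which is consistent with the leak seen without the wrapped loop.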

@angus-g
Contributor

angus-g commented Feb 11, 2025

> I think you are missing the replacement of the time loop from for i in range(total_steps): to for i in tape.timestepper(iter(range(total_steps))).

Are you also using the extra garbage collection from dolfin-adjoint/pyadjoint#187?

@Ig-dolci
Contributor Author

> Are you also using the extra garbage collection from dolfin-adjoint/pyadjoint#187?

Yes, I just noticed that I used it. Applying garbage collection every 20 time steps helped decrease memory usage by almost 20% for this case.
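The periodic-collection pattern mentioned here can be sketched in a few lines: forcing a collection every N timesteps bounds how long unreachable reference cycles (such as dropped tape blocks) can accumulate. The interval of 20 matches the one reported above; the loop body is a placeholder for a real timestep.

```python
# Sketch of periodic garbage collection every N timesteps, as described
# above. The loop body is a stand-in for one forward/adjoint timestep.
import gc

GC_INTERVAL = 20  # interval reported in the comment above

collections = 0
total_steps = 100
for step in range(total_steps):
    pass  # stand-in for one timestep's work
    if (step + 1) % GC_INTERVAL == 0:
        gc.collect()  # reclaim cyclic garbage promptly
        collections += 1
```

Choosing the interval is a trade-off: collecting every step adds overhead, while collecting too rarely lets peak memory grow between collections.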

@Ig-dolci Ig-dolci marked this pull request as ready for review February 13, 2025 16:53
@github-actions

github-actions bot commented Mar 5, 2025

Tests (Firedrake default): 8193 ran, 7493 passed ✅, 700 skipped ⏭️, 0 failed ❌

@dham dham merged commit 7355522 into master Mar 14, 2025
15 checks passed
@dham dham deleted the dolci/fix_mem_leak_diskcheckpointing branch March 14, 2025 13:20
connorjward pushed a commit that referenced this pull request Mar 17, 2025
* deepcopy name and count function attributes
* Avoid reference the callable function name; Disk checkpoint using checkpoint_schedules


Successfully merging this pull request may close these issues.

Memory Growth and Unexpected Behaviour in Firedrake Adjoint

5 participants