Skip to content

Conversation

@ChSonnabend
Copy link
Collaborator

This PR fixes memory leaks and changes (u)int -> (u)int32_t in the GPU kernel code

@ChSonnabend ChSonnabend requested a review from davidrohr as a code owner May 14, 2025 07:37
@github-actions
Copy link
Contributor

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

Please consider the following formatting changes to AliceO2Group#14272
@ChSonnabend ChSonnabend changed the title VRAM memory leak fix + (u)int -> (u)int32_t NN clustering: VRAM memory leak fix + (u)int -> (u)int32_t May 14, 2025
@alibuild
Copy link
Collaborator

Error while checking build/O2/fullCI_slc9 for b0e8923 at 2025-05-14 12:44:

## sw/BUILD/O2-latest/log
clang: error: unable to execute command: Segmentation fault (core dumped)
clang: error: llvm-spirv command failed due to signal (use -v to see invocation)
ninja: build stopped: subcommand failed.

Full log here.

@ChSonnabend ChSonnabend requested a review from a team as a code owner May 14, 2025 22:28
@alibuild
Copy link
Collaborator

Error while checking build/O2/fullCI_slc9 for afbdade at 2025-05-15 01:54:

## sw/BUILD/O2-latest/log
/sw/SOURCES/O2/14272-slc9_x86-64/0/GPU/GPUTracking/TPCClusterFinder/GPUTPCNNClusterizerKernels.cxx:302:54: error: no matching function for call to 'o2::gpu::GPUTPCCFClusterizer::sortIntoBuckets(o2::gpu::GPUTPCClusterFinder&, o2::tpc::ClusterNative&, o2::gpu::tpccf::Row, uint32_t&, uint32_t*&, o2::tpc::ClusterNative*&, uint32_t&)'
ninja: build stopped: subcommand failed.

Full log here.

@ChSonnabend
Copy link
Collaborator Author

@drohr Can be merged from my side

// ORT variables -> need to be hidden as pImpl
struct OrtVariables;
OrtVariables* mPImplOrt;
std::shared_ptr<OrtVariables> mPImplOrt = nullptr;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use a shared_ptr and no unique_ptr?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I can't assign a nullptr in the header file since OrtVariables is not known there. It will throw an error for invalid size of incomplete type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to declare the constructor and destructor in the header, but instanciate it in the cxx.
I.e. only OrtInterface() in the header, and
OrtInterface::OrtInterface() = default; in the cxx.
Then you can use unique_ptr.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why we need a unique_ptr? Especially for the OrtAllocator (which gets created only for one environment but used in multiple), shared_ptr would probably be preferable. Unique_ptr would also require a deleter, which will make the code more cumbersome...

@davidrohr
Copy link
Collaborator

davidrohr commented May 16, 2025 via email

@ChSonnabend
Copy link
Collaborator Author

ChSonnabend commented May 16, 2025

But a unique_ptr is simpler than a shared_ptr. shared_ptr is basically a unique_ptr with refcounting. So if you don't need particularly a shared_ptr, you should use a unique_ptr. And what kind of deleter do you need? Kind Regards David Rohr Sent from my mobile. (Excuse the typos!)

On 16 May 2025 16:13:48 CEST, Christian Sonnabend @.> wrote: @ChSonnabend commented on this pull request. > @@ -113,7 +113,7 @@ class OrtModel private: // ORT variables -> need to be hidden as pImpl struct OrtVariables; - OrtVariables mPImplOrt; + std::shared_ptr mPImplOrt = nullptr; Is there any reason why we need a unique_ptr? Especially for the OrtAllocator (which gets created only for one environment but used in multiple), shared_ptr would probably be preferable. Unique_ptr would also require a deleter, which will make the code more cumbersome... -- Reply to this email directly or view it on GitHub: #14272 (comment) You are receiving this because your review was requested. Message ID: @.*>

An example.

OrtInterface.h:
'''
private:
// ORT variables -> need to be hidden as pImpl
struct OrtVariables;
struct OrtVariablesDeleter {
void operator()(OrtVariables* ptr) const;
};
std::unique_ptr<OrtVariables, OrtVariablesDeleter> mPImplOrt;
...
'''

OrtInterface.cxx:

'''
void OrtModel::OrtVariablesDeleter::operator()(OrtVariables* ptr) const {
delete ptr;
}

// General purpose
void OrtModel::initOptions(std::unordered_map<std::string, std::string> optionsMap)
{
mPImplOrt = std::unique_ptr<OrtVariables, OrtVariablesDeleter>(new OrtVariables());
...
}
'''

@davidrohr
Copy link
Collaborator

I would do it like in the patch below. You are basically using a shared_ptr as a unique_ptr, since your pointer is never really shared. The reason it works for you with a shared ptr is that for a shared ptr the constructor / destructor do not need sizeof(...), while for unique_ptr they do. Thus, with a unique_ptr the constructors must not be in the header. My patch moves them to the cxx.

And then, your setEnv / getEnv functions certainly cause memory corruption, irrespective of whether they use shared or unique ptr. With get(...), you are getting a raw C pointer, and then you are setting that to as the env of another OrtVariables struct, thus you have 2 unique_ptr (or in your case 2 shared_ptr, which do not know about each other), both owning the same object, which will lead to use after free memory corruption and to double delete.
It is not a problem right now, since from what I see in your code, you are never calling setEnv. But you must not use the unique_shared ptr in that way. If you use the get() semantic, you must never make something else own that pointer. If you want to get an owning pointer, you have to use the release() semantic.

diff --git a/Common/ML/include/ML/OrtInterface.h b/Common/ML/include/ML/OrtInterface.h
index 248542cebe..7224645425 100644
--- a/Common/ML/include/ML/OrtInterface.h
+++ b/Common/ML/include/ML/OrtInterface.h
@@ -45,14 +45,10 @@ class OrtModel
 
  public:
   // Constructors & destructors
-  OrtModel() = default;
-  OrtModel(std::unordered_map<std::string, std::string> optionsMap) { init(optionsMap); }
-  void init(std::unordered_map<std::string, std::string> optionsMap)
-  {
-    initOptions(optionsMap);
-    initEnvironment();
-  }
-  virtual ~OrtModel() = default;
+  OrtModel();
+  OrtModel(std::unordered_map<std::string, std::string> optionsMap);
+  void init(std::unordered_map<std::string, std::string> optionsMap);
+  virtual ~OrtModel();
 
   // General purpose
   void initOptions(std::unordered_map<std::string, std::string> optionsMap);
@@ -113,7 +109,7 @@ class OrtModel
  private:
   // ORT variables -> need to be hidden as pImpl
   struct OrtVariables;
-  std::shared_ptr<OrtVariables> mPImplOrt = nullptr;
+  std::unique_ptr<OrtVariables> mPImplOrt;
 
   // Input & Output specifications of the loaded network
   std::vector<const char*> mInputNamesChar, mOutputNamesChar;
diff --git a/Common/ML/src/OrtInterface.cxx b/Common/ML/src/OrtInterface.cxx
index 9d6da4c1ec..4af3b97aee 100644
--- a/Common/ML/src/OrtInterface.cxx
+++ b/Common/ML/src/OrtInterface.cxx
@@ -27,11 +27,20 @@ namespace o2
 namespace ml
 {
 
+OrtModel::OrtModel() = default;
+OrtModel::OrtModel(std::unordered_map<std::string, std::string> optionsMap) { init(optionsMap); }
+OrtModel::~OrtModel() = default;
+void OrtModel::init(std::unordered_map<std::string, std::string> optionsMap)
+{
+  initOptions(optionsMap);
+  initEnvironment();
+}
+
 struct OrtModel::OrtVariables { // The actual implementation is hidden in the .cxx file
   // ORT runtime objects
   Ort::RunOptions runOptions;
-  std::shared_ptr<Ort::Env> env = nullptr;
-  std::shared_ptr<Ort::Session> session = nullptr; ///< ONNX session
+  std::unique_ptr<Ort::Env> env;
+  std::unique_ptr<Ort::Session> session; ///< ONNX session
   Ort::SessionOptions sessionOptions;
   Ort::AllocatorWithDefaultOptions allocator;
   Ort::MemoryInfo memoryInfo = Ort::MemoryInfo("Cpu", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemType::OrtMemTypeDefault);
@@ -41,7 +50,7 @@ struct OrtModel::OrtVariables { // The actual implementation is hidden in the .c
 // General purpose
 void OrtModel::initOptions(std::unordered_map<std::string, std::string> optionsMap)
 {
-  mPImplOrt = std::make_shared<OrtVariables>();
+  mPImplOrt = std::make_unique<OrtVariables>();
 
   // Load from options map
   if (!optionsMap.contains("model-path")) {
@@ -101,7 +110,7 @@ void OrtModel::initOptions(std::unordered_map<std::string, std::string> optionsM
 
 void OrtModel::initEnvironment()
 {
-  mPImplOrt->env = std::make_shared<Ort::Env>(
+  mPImplOrt->env = std::make_unique<Ort::Env>(
     OrtLoggingLevel(mLoggingLevel),
     (mEnvName.empty() ? "ORT" : mEnvName.c_str()),
     // Integrate ORT logging into Fairlogger
@@ -129,7 +138,7 @@ void OrtModel::initSession()
   if (mAllocateDeviceMemory) {
     memoryOnDevice(mDeviceId);
   }
-  mPImplOrt->session = std::make_shared<Ort::Session>(*mPImplOrt->env, mModelPath.c_str(), mPImplOrt->sessionOptions);
+  mPImplOrt->session = std::make_unique<Ort::Session>(*mPImplOrt->env, mModelPath.c_str(), mPImplOrt->sessionOptions);
   mPImplOrt->ioBinding = std::make_unique<Ort::IoBinding>(*mPImplOrt->session);
 
   setIO();
@@ -166,7 +175,7 @@ void OrtModel::memoryOnDevice(int32_t deviceIndex)
 
 void OrtModel::resetSession()
 {
-  mPImplOrt->session = std::make_shared<Ort::Session>(*(mPImplOrt->env), mModelPath.c_str(), mPImplOrt->sessionOptions);
+  mPImplOrt->session = std::make_unique<Ort::Session>(*(mPImplOrt->env), mModelPath.c_str(), mPImplOrt->sessionOptions);
 }
 
 // Getters
@@ -252,7 +261,7 @@ void OrtModel::setIO()
 
 void OrtModel::setEnv(Ort::Env* env)
 {
-  mPImplOrt->env = std::shared_ptr<Ort::Env>(env);
+  mPImplOrt->env.reset(env);
 }
 
 // Inference

@davidrohr davidrohr merged commit 0c5140e into AliceO2Group:dev May 19, 2025
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants