Currently, downstream consumers (e.g., InfiniTensor) that call Operator::Call() must include device-specific headers, which forces them to replicate platform-specific build logic (vendor compiler detection, SDK paths, compile definitions) in their own CMake configuration.
Example: InfiniTensor's redundancy
InfiniTensor's adapter kernels (src/kernels/*.cc) call InfiniOps operators like:
```cpp
#include "cuda/nvidia/add/kernel.h"  // transitively includes kernel.cuh with __global__
#include "cuda/moore/add/kernel.h"   // transitively includes MUSA-specific headers
#include "core/kernel.h"

infini::ops::Add::Call(handle, config, input, other, output);
```
Operator::Call() is a template function defined in operator.h. Its dispatch mechanism (DispatchFunc<ActiveDevices>) is resolved at compile time, requiring all device specializations to be visible at the call site:
```cpp
// operator.h
template <typename... Args>
static auto Call(const Handle& handle, const Config& config, Args&&... args) {
    auto op = Make(config, args...);
    // Make() uses DispatchFunc<ActiveDevices<Key>>() which expands to:
    //   case kCpu:    return Operator<Add, kCpu>(...);    ← needs cpu/add/add.h
    //   case kNvidia: return Operator<Add, kNvidia>(...); ← needs cuda/nvidia/add/kernel.h
    //   case kMoore:  return Operator<Add, kMoore>(...);  ← needs cuda/moore/add/kernel.h
    return (*op)(handle, args...);
}
```
Even though libinfiniops.so already contains compiled code for all device specializations, the template dispatch code is generated in the caller's translation unit, so the caller must include device-specific headers and use the vendor compiler.
Proposed Solution
Use C++ explicit instantiation definitions to compile the device-dispatching specializations behind Operator::Call() into libinfiniops.so, and matching extern template (explicit instantiation) declarations in the public headers to suppress re-instantiation in downstream translation units.