Skip to content

Refactor tensor class in C++ unit tests#2962

Open
timmoon10 wants to merge 10 commits intoNVIDIA:mainfrom
timmoon10:tmoon/refactor-cpp-test-tensor
Open

Refactor tensor class in C++ unit tests#2962
timmoon10 wants to merge 10 commits intoNVIDIA:mainfrom
timmoon10:tmoon/refactor-cpp-test-tensor

Conversation

@timmoon10
Copy link
Copy Markdown
Collaborator

Description

The tensor wrapper in the C++ unit tests has become unwieldy, with complicated interactions between recipes and memory management. This has recently resulted in bugs where we accidently didn't allocate a required buffer (#2943). This PR disentangles the memory management from the recipe logic by adding a simple RAII class to manage GPU and CPU buffers. I've also added more explicit checks, e.g. when we assume a tensor is a single FP32.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add class to manage GPU buffer, CPU buffer, and memory transfers between them.
  • Remove memory management logic from tensor class in C++ tests.
  • Add checks to accessors that make implicit assumptions on buffer size and dtype.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 and others added 7 commits April 30, 2026 01:51
Refactor test tensor wrapper by removing recipe-specific logic whenever possible.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
- Fix syntax error in switch case (:: -> :)
- Fix double-underscore typo in variable name
- Fix wrong buffer passed to set_amax_columnwise
- Fix unique_ptr assignment from raw pointer (use reset())
- Remove dead duplicate NVTE_MXFP8_1D_SCALING branch in get_scales()
- Rename cpu_data -> cpu_buffer to match Buffer class API
- Remove const from Tensor::to_cpu/from_cpu and their callers,
  since both methods write to the CPU buffer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
CPU and GPU types are inconsistent, so the type checks cause too many problems.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR refactors the tensor wrapper in the C++ unit test infrastructure by introducing a dedicated Buffer RAII class that manages paired CPU/GPU allocations separately from recipe/scaling-mode logic. The primary motivation is a recent bug where a required buffer was not allocated because recipe logic was tangled with memory management.

  • New Tensor::Buffer inner class manages matching CPU (unique_ptr<unsigned char[]>) and GPU (cudaMalloc + custom deleter) allocations, with explicit to_cpu()/from_cpu() transfers and zero-initialization on construction.
  • Recipe logic separated: the Tensor constructor now switches on scaling mode to allocate the correct set of Buffer objects, then wires their GPU pointers into the TensorWrapper.
  • Pre-capture of scale before kernel launch: all operator tests now snapshot ref_scale = output.scale() before calling nvte_quantize, avoiding a correctness bug where the post-kernel (possibly modified) scale was used for the reference computation.

Confidence Score: 5/5

Safe to merge; changes are confined to the C++ test infrastructure with no production code paths affected.

All substantive logic is in test helpers. The refactoring cleanly separates memory management from scaling-mode logic, the Buffer RAII class handles all allocation/transfer paths correctly, and the pre-capture of ref_scale before the kernel launch is a genuine correctness improvement. The remaining observations are diagnostic-quality items in the test layer.

No files require special attention; the two minor items are both in test-only code.

Important Files Changed

Filename Overview
tests/cpp/test_common.h Introduces inner Buffer RAII class and rewrites Tensor public API; rowwise_cpu_dptr/columnwise_cpu_dptr now have dtype and rowwise/columnwise guards but their ordering produces a misleading error message for the non-byte-type case.
tests/cpp/test_common.cu Buffer constructor, transfer helpers, and refactored Tensor constructor look correct; set_scale/set_scale_inv/amax methods all include size and dtype guards; fillUniform and fillCase_special retain existing behavior.
tests/cpp_distributed/test_comm_gemm.cu Correctly guards set_scale/set_scale_inv calls with isFp8Type, replacing the previous exception-based control flow.
tests/cpp/operator/test_cast.cu Pre-captures ref_scale before the kernel launch; passes it to both the reference computation and the scale_inv comparison, fixing a prior correctness issue.
tests/cpp/operator/test_qdq.cu Pre-captures ref_scale = output.scale() without an isFp8Type guard, unlike every other test in this PR; safe in practice since QDQ tests are FP8-only but inconsistent with the established pattern.
tests/cpp/operator/test_normalization.cu Uses already-defined ref_scale (captured with isFp8Type guard) instead of re-reading z.scale() after the kernel; correct fix.
tests/cpp/operator/test_cast_nvfp4_transpose.cu Correctly changes compareResults_nvfp4 and compute_amax signatures from const Tensor& to Tensor& to allow non-const to_cpu() calls.
tests/cpp/operator/test_multi_cast_transpose.cu Applies the isFp8Type guard when building ref_scale_list and uses it consistently for the scale_inv comparison; correct.

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class Tensor {
        -Buffer data_rowwise_
        -Buffer data_columnwise_
        -Buffer scale_inv_rowwise_
        -Buffer scale_inv_columnwise_
        -Buffer amax_rowwise_
        -Buffer amax_columnwise_
        -Buffer scale_
        -TensorWrapper tensor_
        -bool rowwise_
        -bool columnwise_
        +rowwise_cpu_dptr() T*
        +columnwise_cpu_dptr() T*
        +scale() float
        +amax() float
        +rowwise_scale_inv() float
        +set_scale(float)
        +set_scale_inv(float)
        +to_cpu()
        +from_cpu()
    }

    class Buffer {
        -unique_ptr cpu_buffer_
        -unique_ptr gpu_buffer_
        -size_t size_
        -DType dtype_
        -size_t bytes_
        +cpu_buffer() T*
        +gpu_buffer() T*
        +to_cpu()
        +from_cpu()
        +size() size_t
        +dtype() DType
    }

    class GPUDeleter {
        +operator()(void*) void
    }

    Tensor *-- Buffer : owns 7 buffers
    Buffer *-- GPUDeleter : uses
    Tensor *-- TensorWrapper : wraps
Loading

Reviews (3): Last reviewed commit: "Copy-paste error" | Re-trigger Greptile

Comment thread tests/cpp/test_common.h
Comment thread tests/cpp/test_common.cu Outdated
Also adopt review suggestions from @greptile-apps.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Comment thread tests/cpp/test_common.cu
Comment on lines +940 to +950
// Fill scales
if (t->scaling_mode() == NVTE_DELAYED_TENSOR_SCALING) {
if (isFp8Type(t->dtype())) {
// FP8 tensor scale is set to 1
t->set_scale_inv(1.0);
}
} else {
// Block scales are filled randomly
t->fill_uniform_rowwise_scale_inv();
t->fill_uniform_columnwise_scale_inv();
}
Copy link
Copy Markdown
Collaborator Author

@timmoon10 timmoon10 May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird, but it approximates the previous behavior.

Comment thread tests/cpp/test_common.cu Outdated
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Comment thread tests/cpp/test_common.h Outdated
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
@timmoon10
Copy link
Copy Markdown
Collaborator Author

/te-ci core L1

Copy link
Copy Markdown
Collaborator

@Oleg-Goncharov Oleg-Goncharov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this looks much cleaner now, but the cast+transpose current scaling tests are failing with a segmentation fault.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants