Skip to content

[SM6.10][HLK] Restructure LinAlg Tests#8323

Merged
V-FEXrt merged 5 commits intomicrosoft:mainfrom
V-FEXrt:linalg-hlk-infra
Apr 3, 2026
Merged

[SM6.10][HLK] Restructure LinAlg Tests#8323
V-FEXrt merged 5 commits intomicrosoft:mainfrom
V-FEXrt:linalg-hlk-infra

Conversation

@V-FEXrt
Copy link
Copy Markdown
Collaborator

@V-FEXrt V-FEXrt commented Apr 1, 2026

There are several infra improvements in this PR

  • Use std::variant to not duplicate every variable for every type in the test
  • Extract common code into several helper functions
  • Set ELEM_SIZE on every test
  • BUGFIX: Use ELEM_SIZE as the align parameter for Load/Store
  • Add a flag that runs an alternative "emulated" shader that has the expected side effects of the real shader. Useful for exercising the infra even though the shader itself can't run
  • Use F16 type instead of F32/I32 as F16 is the T1 type

Co-authored-by: Alex Sepkowski alexsepkowski@gmail.com

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

✅ With the latest revision this PR passed the C/C++ code formatter.

@V-FEXrt V-FEXrt marked this pull request as ready for review April 1, 2026 04:23
Copy link
Copy Markdown
Contributor

@alsepkow alsepkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A few minor comments.

Copy link
Copy Markdown
Member

@damyanp damyanp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superficial review - LGTM.

Copy link
Copy Markdown
Contributor

@tex3d tex3d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of changes in this PR.

Some are (mostly) code moves, which I had to do an external comparison on to see changes made during the move (there were some). I wish pure moves and formatting changes could have been made as a separate NFC PR, or at least as a separate commit.

There are also new classes: HLSLNorm_t (and variants), F8E4M3_t, and F8E5M2_t, which are not used anywhere except in isFloatingPointType (which is changed to include these).

I think the new classes should be kept out of this PR, since they are not needed/used by it. Doing so would save some required extra careful review, since these can be tricky.

I reviewed moved code by copy-pasting it into the original locations on a local clone and looking at the diff. This reveals several types of changes:

  • Fixed spelling in comment: constuctor -> constructor.
  • inline required for moving to header
  • Parameter and member name changes:
    • ValidationType -> VType
      • Only for some overloads, why?
    • ValidationConfig -> Validation
      • If you wanted to avoid the type name collision, why not go with Config, which seems more intuitive?
  • Deleted comments: // Tolerance is in ULPs. Convert to (int, int64_t) for the comparison.
  • Deleted extraneous ;
  • Modified HLSLHalf_t constructor (see comment)

I had some more comments too.

return static_cast<bool>(Val) != static_cast<bool>(Other.Val);
}

bool operator<(const HLSLBool_t &Other) const { return Val < Other.Val; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever we use Val in these operators, shouldn't we cast to bool, since we only ever want to interpret the value as either true or false? Also, where are these comparison operators used?

I realize now that this code has simply been moved. Is there a way to separate moved code changes from other changes?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved code yeah. imo the comment should be addressed separately

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the LT, GT, LTE, or GTE are used with HLSLBool_t anymore. Can probably remove them. But I can do that in a subsequent PR.

WEX::Logging::Log::Result(WEX::Logging::TestResults::Skipped);
compileShader(DxcSupport, LoadStoreShader, Target.c_str(), Args, Verbose);

if (shouldSkipBecauseSM610Unsupported(Device))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we always end the test here, since if we don't have a device in HLK mode, we still can't proceed, and we must have emitted an error earlier when trying to create the device, right?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its helpful to talk about when we intend to skip. (Specifically, skip, not end)

Right now there is a useful flow through the test that is basically

  • try to get a SM6.10 device
  • fail because non exist yet
  • keep going anyway and use dxc to compile the SM6.10 shader
    • if it compiles exit with skip (setup was okay, but we can't run yet)
    • if it doesn't, exit with fail (can't compile the SM6.10 shader, that's a bug that shouldn't pass CI)

This function is in service of that flow. If we have a device or if we are in HLK mode we should always just continue the test and let something else be the cause of failure

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, it's odd to proceed with a null Device with any other operations. That's a setup failure, and if we're on HLK, we should abort at that point, otherwise, we should skip at that point. I understand the desire to still try a shader compile at this time because you don't have a device that supports 6.10 at all, and that's fine for now, but this is a test-development-time convenience, and it doesn't justify executing test code beyond this point.

if we are in HLK mode we should always just continue the test and let something else be the cause of failure

I don't think so. I think we should abort the test with a failure if we failed to create the device (which would have logged an error). You return at this point when not HLK because the usefulness of proceeding has ended. The same goes for HLK, but usefulness of proceeding ended even earlier (there's no point in going through with anything else in the HLK if device creation failed).

Any other failure that happens down the line as a result of a null Device will just add noise to the failing test result that needs to be ignored (the real failure was the device creation failure).

I think you want to return from this point if (!Device) whether or not you're in HLK mode. The difference is whether you log a comment and call this a skip or log an error and abort.


if (EmulateTest) {
hlsl_test::LogWarningFmt(L"EmulateTest flag set. Tests are NOT REAL");
return D3D12SDK->createDevice(&D3DDevice, D3D_SHADER_MODEL_6_8, false);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't create the device twice when EmulateTest is specified. This will break if the first device create succeeded (device reports SM 6.10 support).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts on this were that the EmulateTest flag takes precedent. Even if we successfully create the SM6.10 device the downstream tests are still going to run the emulated shaders. If we don't take this branch then it'll look like the tests pass when they didn't actually run at all. Maybe the fix is to put this branch above the one that attempts to create the SM6.10 device.

Thoughts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you should put it above the other device creation. Otherwise, the non-null D3DDevice CComPtr will cause the second call to fail because the pointer is already initialized.

I would normally recommend you move this into setupClass and call this createDevice() method instead of embedding the logic inside this createDevice() method, but I see that it's really a convenient way for you to keep going when you have no valid device so you can perform compilation before skipping/aborting, and to use the same emulation path when recreating a device in setupMethod. So perhaps it's ok to just move it earlier.

It's unexpected that your createDevice always returns true, even when it fails, unless it's in HLK mode. That's counterintuitive. I get that you return this from setupClass, which means it would fail on device creation failure, but that specific decision probably belongs in setupClass, not inside createDevice. And by extension, in setupMethod when recreating the device. That way it's more obvious you've chosen to proceed with a device creation failure in this code.

Your createDevice seems to invert the SkipUnsupported parameter when calling D3D12SDK->createDevice. If this flag is true, D3D12SDK->createDevice will register a skip. You do not want that in HLK mode, yet you initialize it to FailIfRequirementsNotMet which is true in HLK mode. If you'd rather control test skips separately, you should just pass false here instead.

Also, regarding FailIfRequirementsNotMet, when in HLK mode, your log message says it's failing because this FailIfRequirementsNotMet is set. But this is a minute internal code detail of this particular method, not visible to the testing interface. I think it should suffice to say that it is a device creation failure for the required shader model support.

Also, in setupMethod, you assume/log that the device was lost if you have a null D3DDevice, or it's been removed. But you also explicitly proceed if device creation fails, resulting in a null D3DDevice, just so you can try compilation before aborting. So you're going to get a bunch of these misleading "device lost" messages, unless you update the method to account for this condition.

@V-FEXrt
Copy link
Copy Markdown
Collaborator Author

V-FEXrt commented Apr 1, 2026

I wish pure moves and formatting changes could have been made as a separate NFC PR, or at least as a separate commit.

The vast majority of the code moves were triggered at the 11th hour of a pending deadline to satisfy your requirement of the use of F16 for the smoke test. There wasn't time to split this up into multiple PRs and get the necessary round of reviews and merge by deadlines. Multiple commits may have been possible, but I had a bunch of other pending changes before the last minute request came through. I'll break this up into several PRs now and adjust expectations on when things will land

@tex3d
Copy link
Copy Markdown
Contributor

tex3d commented Apr 1, 2026

I wish pure moves and formatting changes could have been made as a separate NFC PR, or at least as a separate commit.

The vast majority of the code moves were triggered at the 11th hour of a pending deadline to satisfy your requirement of the use of F16 for the smoke test. There wasn't time to split this up into multiple PRs and get the necessary round of reviews and merge by deadlines. Multiple commits may have been possible, but I had a bunch of other pending changes before the last minute request came through. I'll break this up into several PRs now and adjust expectations on when things will land

I'm not asking you to break it up now. But I do have some feedback about removing unused additions and a few other changes.

@V-FEXrt V-FEXrt force-pushed the linalg-hlk-infra branch from 5c10f18 to 70fefce Compare April 2, 2026 01:03
@V-FEXrt
Copy link
Copy Markdown
Collaborator Author

V-FEXrt commented Apr 2, 2026

@tex3d

It turns out it was easiest start from "scratch" on the code moves and just move the exact minimal set of code vs trying to go back and handle each case. You can now review by commit where 3eb0d28 does the moves and 70fefce does the rest.

I'm going to pull in a couple quick fixes (align => 128, fix a messy switch I wasn't happy with) but all the changes in the moved code has been reverted.

If there are any of those changes you'd like to see kept, please not them and I'll get them included in a separate PR

bool Verbose, HLSLHalf_t Tolerance = 0.0f) {
bool Success = true;
for (size_t I = 0; I < Count; I++) {
HLSLHalf_t Diff = Actual[I] - Expected[I];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't these using comparison methods from HlslTestUtils.h?

WEX::Logging::Log::Result(WEX::Logging::TestResults::Skipped);
compileShader(DxcSupport, LoadStoreShader, Target.c_str(), Args, Verbose);

if (shouldSkipBecauseSM610Unsupported(Device))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, it's odd to proceed with a null Device with any other operations. That's a setup failure, and if we're on HLK, we should abort at that point, otherwise, we should skip at that point. I understand the desire to still try a shader compile at this time because you don't have a device that supports 6.10 at all, and that's fine for now, but this is a test-development-time convenience, and it doesn't justify executing test code beyond this point.

if we are in HLK mode we should always just continue the test and let something else be the cause of failure

I don't think so. I think we should abort the test with a failure if we failed to create the device (which would have logged an error). You return at this point when not HLK because the usefulness of proceeding has ended. The same goes for HLK, but usefulness of proceeding ended even earlier (there's no point in going through with anything else in the HLK if device creation failed).

Any other failure that happens down the line as a result of a null Device will just add noise to the failing test result that needs to be ignored (the real failure was the device creation failure).

I think you want to return from this point if (!Device) whether or not you're in HLK mode. The difference is whether you log a comment and call this a skip or log an error and abort.


if (EmulateTest) {
hlsl_test::LogWarningFmt(L"EmulateTest flag set. Tests are NOT REAL");
return D3D12SDK->createDevice(&D3DDevice, D3D_SHADER_MODEL_6_8, false);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you should put it above the other device creation. Otherwise, the non-null D3DDevice CComPtr will cause the second call to fail because the pointer is already initialized.

I would normally recommend you move this into setupClass and call this createDevice() method instead of embedding the logic inside this createDevice() method, but I see that it's really a convenient way for you to keep going when you have no valid device so you can perform compilation before skipping/aborting, and to use the same emulation path when recreating a device in setupMethod. So perhaps it's ok to just move it earlier.

It's unexpected that your createDevice always returns true, even when it fails, unless it's in HLK mode. That's counterintuitive. I get that you return this from setupClass, which means it would fail on device creation failure, but that specific decision probably belongs in setupClass, not inside createDevice. And by extension, in setupMethod when recreating the device. That way it's more obvious you've chosen to proceed with a device creation failure in this code.

Your createDevice seems to invert the SkipUnsupported parameter when calling D3D12SDK->createDevice. If this flag is true, D3D12SDK->createDevice will register a skip. You do not want that in HLK mode, yet you initialize it to FailIfRequirementsNotMet which is true in HLK mode. If you'd rather control test skips separately, you should just pass false here instead.

Also, regarding FailIfRequirementsNotMet, when in HLK mode, your log message says it's failing because this FailIfRequirementsNotMet is set. But this is a minute internal code detail of this particular method, not visible to the testing interface. I think it should suffice to say that it is a device creation failure for the required shader model support.

Also, in setupMethod, you assume/log that the device was lost if you have a null D3DDevice, or it's been removed. But you also explicitly proceed if device creation fails, resulting in a null D3DDevice, just so you can try compilation before aborting. So you're going to get a bunch of these misleading "device lost" messages, unless you update the method to account for this condition.

Copy link
Copy Markdown
Contributor

@tex3d tex3d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving assuming some follow-up adjustments are coming to address some concerns in a subsequent PR, since none should be blocking for the merge now.

@V-FEXrt
Copy link
Copy Markdown
Collaborator Author

V-FEXrt commented Apr 3, 2026

Thanks Tex! Here is the tracking issue! I'll be working on this feedback first thing tomorrow #8332

@V-FEXrt V-FEXrt merged commit eb67a90 into microsoft:main Apr 3, 2026
12 checks passed
@github-project-automation github-project-automation bot moved this from New to Done in HLSL Roadmap Apr 3, 2026
@V-FEXrt V-FEXrt deleted the linalg-hlk-infra branch April 3, 2026 01:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants