Conversation

@zasdfgbnm
Collaborator

No description provided.

@zasdfgbnm
Collaborator Author

!test

@github-actions

github-actions bot commented Dec 18, 2025

Description

  • Simplified ParallelType::Stream handling in the unshardedSizes function

  • Removed the conditional logic that checked whether a stream-parallelized dimension is in the logical domain

  • Eliminated the error check requiring the stream extent to be constant

  • Unconditionally returns 1 for all stream-parallelized dimensions (a toy before/after sketch follows this list)
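
For reviewers who want the before/after side by side, here is a minimal self-contained toy model of the change. All names here are hypothetical stand-ins, not actual nvFuser types; the real code lives in csrc/multidevice/execution_utils.cpp and is summarized in the walkthrough below.

    // Toy model of the Stream branch of unshardedSizes, before and after
    // this PR. Types and names are illustrative only.
    #include <cassert>
    #include <cstdint>
    #include <optional>

    enum class ParallelType { Serial, Stream };

    struct ToyIterDomain {
      ParallelType parallel_type;
      bool in_logical_domain;
      std::optional<int64_t> constant_extent;  // nullopt = non-constant extent
    };

    // Old behavior (per the walkthrough): a stream-parallelized logical
    // dimension maps to 1; any other stream dimension must have a constant
    // extent, which is returned.
    int64_t streamSizeBefore(const ToyIterDomain& id) {
      if (id.in_logical_domain) {
        return 1;
      }
      assert(id.constant_extent.has_value() &&
             "DIDs/Stream extent must be constant");
      return *id.constant_extent;
    }

    // New behavior: every stream-parallelized dimension is treated as size 1.
    int64_t streamSizeAfter(const ToyIterDomain& /*id*/) {
      return 1;
    }

    int main() {
      // A stream dimension outside the logical domain with constant extent 4:
      // the two versions disagree, which is the "Behavioral Change Risk"
      // flagged in the reviewer guide below.
      ToyIterDomain loop_id{
          ParallelType::Stream, /*in_logical_domain=*/false, 4};
      assert(streamSizeBefore(loop_id) == 4);
      assert(streamSizeAfter(loop_id) == 1);
      return 0;
    }

Note that under the old logic the assertion would also fire for a non-constant extent; the new code never inspects the extent at all.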

Changes walkthrough

Relevant files
Bug fix: execution_utils.cpp: Simplify stream parallelization handling in unshardedSizes
csrc/multidevice/execution_utils.cpp (+1/-23)

  • Removed the TODO comment about the MultiDeviceExecutor hack for
    stream parallelization
  • Eliminated the conditional logic checking whether sharded_id is in
    the logical domain
  • Removed the error check for non-constant stream extent evaluation
  • Simplified to unconditionally return 1 for ParallelType::Stream

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review
    Removed Error Checking

    The PR removes error checking that validated that the DID/Stream extent is constant. That check ensured non-constant extents were caught early rather than surfacing as runtime issues; removing it could lead to silent failures or incorrect behavior when a non-constant extent is encountered.

    return 1;
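
    For context, the removed validation plausibly looked something like the sketch below. This is reconstructed from the walkthrough, not the actual diff: expr_eval, sharded_id, and the message wording are assumptions, though ExpressionEvaluator, IterDomain::extent(), and NVF_ERROR are real nvFuser facilities.

    // Hypothetical reconstruction of the removed check (not the actual diff):
    auto extent = expr_eval.evaluate(sharded_id->extent());
    NVF_ERROR(
        extent.hasValue(),
        "Expected DIDs/Stream extent to be constant, got: ",
        sharded_id->extent()->toString());
    return extent.as<int64_t>();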
    Behavioral Change Risk

    The change removes the conditional logic that returned 1 only for logical domains while evaluating the extent for other domains. The new unconditional return of 1 for all ParallelType::Stream cases could be incorrect for non-logical domains and may break existing functionality that depends on the extent evaluation.

    if (parallel_type == ParallelType::Stream) {
      return 1;
    }
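
    Likewise, the removed conditional plausibly distinguished logical from non-logical domains along these lines (the same caveats apply: logical and sharded_id are assumed names, and std::find from <algorithm> is used for the membership test):

    // Hypothetical reconstruction of the removed branch (not the actual diff):
    if (parallel_type == ParallelType::Stream) {
      if (std::find(logical.begin(), logical.end(), sharded_id) !=
          logical.end()) {
        return 1;  // stream-parallelized logical dimension
      }
      // Otherwise fall through to evaluating the extent (see the sketch above).
    }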
    Missing Documentation

    The PR removes a detailed TODO comment that explained the reasoning behind the original implementation and what future work was planned. Removing this documentation makes it harder for future maintainers to understand the design rationale.

    if (parallel_type == ParallelType::Stream) {
      return 1;
    }

    Test failures

    • (High, 95) CUDA driver/runtime mismatch causing init-time failures in nvFuser matmul & top-k test suites on dlcluster_h100

      Test Name H100 Source
      Ampere/MmaTest.SingleTile/Ampere_16_8_16__bfloat Link
      ArgsortParameterizedWithBlockAndBatch.SharedMemoryRequirement/2048_1_1_0 Link
      BlockSizeAndItemsPerThread/ArgSortComprehensiveTest.ComprehensiveValidation/BlockSize32_ItemsPerThread4 Link
      ClusterReductionTest.SimpleFusionNotAllReduce/cluster_15_dtype_double Link
      ClusterReductionTest.SimpleFusionNotAllReduce/cluster_4_dtype_double Link
      CutlassExecutorTest.Nvfp4Matmul_BiasEpilogue Link
      General/HopperPlusMatmulSchedulerTest.FusedMultiplySum/KK_512_256_128_MmaMacro_m64_n128_k16_splitk_2 Link
      General/HopperPlusMatmulSchedulerTest.FusedMultiplySum/MK_512_256_128_MmaMacro_m128_n128_k16_tma_store Link
      General/HopperPlusMatmulSchedulerTest.FusedMultiplySumBiasNeg/MN_512_256_128_MmaMacro_m64_n128_k16_tma_store_splitk_2 Link
      GreedySchedulerTest.ScanNonLocalOutput Link
      ... with 85 more test failures omitted. Check internal logs.
    • (High, 1) Outdated NVIDIA driver on dlcluster_h100 causing CUDA initialization failure in RNG tests

      Test Name H100 Source
      RNGTest.BroadcastingRNG Link
    • (Medium, 12) nvFuser internal assert on non-divisible split (tensor_metadata.cpp) in test_stream.test_two_matmuls_inlinable and multidevice.test_overlap suites

      Test Name A100 A100 (dist.) GB200 GB200 (dist.) H100 H100 (dist.) Source
      tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=eager]
      tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=lru_cache]
      tests.python.multidevice.test_overlap.test_row_parallel_linear_forward
    • (Low, 1) Small numerical mismatch in nvFuser reduction tutorial test (tests.python.direct.test_tutorial)

      Test Name GB200 Source
      tests.python.direct.test_tutorial.test_tutorial_reduction
