Explicit copy memset nodes by ThrudPrimrose · Pull Request #2380 · spcl/dace

ThrudPrimrose · 2026-05-22T11:17:08Z

No description provided.

Introduce explicit copy / memset library nodes and a pass that lifts every implicit AccessNode -> AccessNode (or scope-staging) edge into a CopyLibraryNode, so the dataflow shape post-simplification is self-describing and the legacy "copy edge gets lowered by ad-hoc codegen" path can be deprecated. dace/libraries/standard/nodes/copy_node.py (CopyLibraryNode + 8 expansions) dace/libraries/standard/nodes/memset_node.py (MemsetLibraryNode + 3 expansions) dace/libraries/standard/helper.py (shared expansion helpers) dace/libraries/standard/environments/cpu.py (CPU environment used by ExpandMemcpyCPU) dace/sdfg/construction_utils.py (small utilities used by copy_node) dace/transformation/passes/insert_explicit_copies.py (the pass) Pass wiring: appended `InsertExplicitCopies` to `SIMPLIFY_PASSES` so ``SDFG.simplify()`` lifts implicit copies as part of standard cleanup (idempotent — once lifted, no further implicit edges remain to match). Tests: tests/library/copy_node_test.py tests/library/memset_node_test.py tests/passes/insert_explicit_copies_test.py

…ryNode Three small fixes that unblock the Copy/Memset libnodes' schedule inference: * ``infer_out_connector_type`` / ``infer_connector_types``: wrap the ``e.data.subset and e.data.subset.num_elements() == 1`` expression in ``bool()``. A single-element ``Range`` returns False from ``__bool__`` and the bare ``and`` chain leaks the Range object instead of a bool, which later trips the ``scalar |= …`` operator with TypeError when the libnode's collapsed subset is empty. * ``_determine_schedule_from_storage``: when the node under inspection is a ``CopyLibraryNode`` / ``MemsetLibraryNode`` AND any neighbouring memlet imposes a ``GPU_Device`` constraint, return ``GPU_Device`` directly. Without this, an H2D copy has both ``CPU_Multicore`` (from the CPU source) and ``GPU_Device`` (from the GPU sink) in its constraint set and the existing ``len(constraints) > 1`` branch raises ``InvalidSDFGNodeError: Cannot determine default schedule for node copy_A_to_gpu_A``. The libnode is exactly the node class designed to bridge storages; routing it to GPU_Device when GPU is involved is the intended resolution.

…simplify InsertExplicitCopies materialises implicit AccessNode->AccessNode edges into CopyLibraryNodes — a shape-changing lowering step, not a shape-preserving simplification. Including it in SIMPLIFY_PASSES broke 22 tests (numpy reshape/flatten/view, redundant_copy count assertions, range_indirection / reinterpret validators) that rely on simplify preserving the implicit-edge form. Pass remains available as a standalone Pass for consumers that want it (e.g. the GPU codegen lowering pipeline, which calls it explicitly via InsertExplicitGPUGlobalMemoryCopies).

Three xfail-strict pins: - AccessNode<->View edges must not be lifted (policy: views are aliases). - Rank-changing reshape lift produces mismatched-rank memlets that trip codegen IndexError in cpp_offset_expr. - Dtype-reinterpret View lift produces CopyLibraryNode with mismatched element types, failing sdfg.validate().

…atch InsertExplicitCopies changes: - Drop the AN<->View edge lift (was inserting a Copy + intermediate buffer for every AN<->View edge, including pure aliasing edges that don't need one). - Add a round-trip collapse: AN_src -> View -> AN_dst becomes a single AN_src -> CopyLibraryNode -> AN_dst direct edge with the composed memlet (src side from the view-underlying subset, dst side from the access-side subset). The View AccessNode is removed when it has no other consumers. CopyLibraryNode expansion: - select_copy_implementation now routes rank-mismatched volume-equal copies (different rank after collapse_shape_and_strides) to CopyNDTemplate when both sides are same-storage C-packed contiguous. MappedTasklet reuses a single access expression for both endpoints, which produces a rank-mismatched memlet on the smaller side and crashes codegen. - ExpandCopyNDTemplate flattens to a 1D pointer walk when the in/out collapsed shapes have different ranks (both sides packed contiguous, so the linearization is sound). Tests: - Update the 4 view-lift tests to assert the new policy (round-trip collapses to a single Copy; View is removed if it has no other consumers; repeat applies are no-ops). - Add 10 new test_iec_* pins covering both patterns: AN<->View edges kept direct, AN->View->AN round-trip collapse, AN->View round-trip with another consumer keeping the View, AN<->AN copies with rank differing because of a constant-index dim, AN<->AN rank-mismatched volume-equal copies routed through CopyND, plus @dace.program reshape and reinterpret cases.

…h legacy and experimental codegen)

…TREAM_NAME

… explicit-copy-memset-nodes

…cstring

…onstruction_utils Use SDFG.parent (O(1)) instead of the recursive _get_parent_state scan to find the state containing a nested-SDFG node; verified equivalent. copy_node.py imports the helper from dace.transformation.helpers (top-level, no import cycle). dace/sdfg/construction_utils.py removed (it only held these two helpers on this branch).

…T_STORAGES

…copy_node

…sinstance checks

…appedTasklet

…nearization; selector bypasses CopyND

…ted reshape impl

…se feature-regression tests

… helpers

…trides in memset, Range.num_elements over reduce

…D only for shared memory)

…y2D 1D+2D branches

…re_contiguous param, _make_memset_skeleton unused returns

…nodes

…) on copy/memset libnode helpers

…nnectors correctly

… 3D/2D collapse, F-order collapse, step-2 stride

…o '_out'; add transpose-pattern + dst-non-contig rejection tests

…the failure mode

The marker is unregistered on this branch (new GPU codegen lives on a separate branch); the 3 single-element h2d/d2h tests work on any codegen. test_copy_single_element_memcpy_connector_types asserted the dropped _memcpy_connector_typing special case (value-typed CPU output + ``&_cpy_out`` in body). Now that both connectors are pointer-typed, those assertions are stale -- delete the test.

Replace _make_same_storage_sdfg / _make_cross_storage_sdfg / _make_multidim_libnode_sdfg / _build_explicit_copy_sdfg with a single _make_copy_sdfg(src, dst, *, ...) taking _ArraySpec dataclasses per side. The dataclass groups (shape, storage, strides, total_size, transient, subset, name, dtype); the helper returns (sdfg, libnode). Convert the inline-built SDFGs throughout the file (Fortran-packed same/cross-storage, no-common-stride1, register, single-element, shared-memory rejection, etc.) to the unified helper. Tests with GPU_Device / GPU_ThreadBlock map wiring stay inline since they use memlet_path instead of a single libnode edge. Net: -118 LoC (-486 / +368).

…rings Same-rank copy with mismatched per-dim shapes (e.g. (3,4) -> (4,3)) previously slipped through libnode validation and tripped DaCe's generic ``out-of-bounds memlet`` post-expansion validate -- the rejection was incidental. Add a per-dim shape check at the head of ExpandMappedTasklet's same-rank branch with a specific error message; the test now asserts on that explicit contract instead of the incidental SDFG-level error. Trim multi-line test docstrings (transpose-pattern, rank-mismatch variants, no-common-stride1, shared-memory collective, single- element-in-kernel) to one line each.

… array shapes The transpose-pattern check compares ``in_shape_collapsed`` vs ``out_shape_collapsed``, which come from ``collapse_shape_and_strides(subset, strides)`` -- per-dim subset sizes after singleton-collapse, NOT the underlying array shapes. Add two pin tests: test_copy_same_subset_different_array_shapes -- 0:N slice between arrays of different total size is fine when the per-dim subset sizes match. test_copy_1d_slice_from_2d_source -- a row-slice ``[i, 0:N]`` of a 2D array copies into a 1D array; the leading singleton collapses to the same rank on both sides. Both pass; transpose-pattern still rejected.

….X form Consistency over brevity: every test now uses the full dace.dtypes.StorageType.{CPU_Heap,GPU_Global,GPU_Shared,Register} inline instead of a mix of local cpu/gpu aliases and inline long forms. Drops 26 local-alias lines and removes the inconsistency across the file. Two over-long docstrings trimmed to keep yapf happy.

Previously select_memset_implementation picked CUDA / CPU purely from storage type. cudaMemsetAsync / memset zero ``num_elements * sizeof(T)`` consecutive bytes from the dst pointer, so a non-contiguous subset (e.g. a middle 2-D slice of a row-major array) silently zeroes memory outside the region. Auto now routes such cases to the 'pure' (mapped-tasklet) expansion, which writes per element via the subset. ExpandCUDA / ExpandCPU also reject non-contiguous subsets upfront with a clear error so explicit forcing still raises. Tests: test_memset_auto_routes_non_contiguous_to_pure_cpu -- Auto succeeds via 'pure', zeroes only the 6x10 sub-block. test_memset_cpu_rejects_non_contiguous_subset -- explicit CPU raises. test_memset_cuda_rejects_non_contiguous_subset -- explicit CUDA raises. Also: in copy_node_test.py, factor out _make_copy_skeleton + add _make_legacy_copy_sdfg (canonical Memlet(data=dst, subset=dst_subset, other_subset=src_subset) form). The previous _strip_libnodes mutation was producing memlets without other_subset, which made the legacy codegen look broken on patterns it actually handles correctly. Keep the genuine libnode-advantage pin -- rank-mismatch 4D->2D Fortran reshape -- delete the three false-positive tests.

- helper.py: drop collapsed_map_lengths, inlined as [s for s in subset.size() if s != 1] (2 callsites; Range.size() is the existing util that does the same job). - helper.py: shorten CURRENT_STREAM_NAME comment. - insert_explicit_copies.py: drop src_locations / dst_locations / skip_inside_device_scope properties + _storage_allowed helper. Zero callers in dace/ or tests/ pass any of them. Drop the Iterable / is_devicelevel_gpu imports that go with them. - insert_explicit_copies.py: fix the class docstring -- the pass only handles AN -> AN (and AN -> View -> AN via the round-trip collapse). Map-staging patterns are not handled on this branch (they were on explicit-gpu-global-copies; D1 outermost-subset bug; removed pending a correct rewrite).

…side the map scope AN -> MapEntry -> AN (stage-in) and AN -> MapExit -> AN (stage-out) edges now lift to a CopyLibraryNode placed INSIDE the map scope, wired directly to MapEntry's output connector / MapExit's input connector (no intermediate AN inserted on the scope-side). Chains of MapEntries / MapExits are followed via memlet_path; the body of the map (tasklets, NestedSDFGs, nested maps) is irrelevant to the lift. Views on the outer side stay in place. The outer-side memlet (per-iteration subset on the outer array) is preserved verbatim on the new MapEntry -> libnode (or libnode -> MapExit) edge; the inner-array-side memlet is derived via the existing _derive_matching_dst_subset against the inner array's descriptor. Tests in tests/passes/insert_explicit_copies_test.py: test_lift_stage_in_copy test_lift_stage_out_copy test_lift_stage_in_copy_through_view test_lift_stage_out_copy_through_view test_lift_stage_in_copy_chained_map_entries test_lift_stage_out_copy_chained_map_exits test_lift_stage_in_copy_with_nested_sdfg_consumer Each asserts: exactly one libnode in the lifted state, libnode's scope owner is the (innermost) MapEntry, libnode input/output wired directly to MapEntry/MapExit (not via an inserted AN), numerical match against NumPy, and zero `CopyND<` template instantiations in generated code. New helper _assert_no_copynd(sdfg) calls generate_code and greps each CodeObject for `CopyND<`; pinned in all seven new staging tests. Endpoint resolution refactored to sdutils.find_input_arraynode / find_output_arraynode instead of inline memlet_path[0/-1] + isinstance checks.

- Drop the unused third argument from _derive_matching_dst_subset (docstring already said "unused; kept for symmetric signature"); update both direct-copy callsites accordingly. - Fold _is_stage_in_candidate / _is_stage_out_candidate / _insert_stage_in_libnode / _insert_stage_out_libnode into one _lift_staging_edge(..., stage_in: bool) -- the four methods differed only by which side of the edge was the inner AN and which was the MapEntry/MapExit. -30 LoC net.

Introduce _compile_no_copynd(sdfg) wrapper: greps every CodeObject emitted by sdfg.generate_code() for 'CopyND<' and asserts none, then returns sdfg.compile(). Apply to all 17 compile sites in copy_node_test.py (including the libnode-side compile of the legacy comparison test). Pins the contract: libnode expansions displace the runtime CopyND fallback entirely. The only intentional CopyND user is ExpandSharedMemoryCollective, whose test inspects tasklet bodies directly without compiling, so no exemption is needed.

…cks; factor staging-test scaffold dace/transformation/passes/insert_explicit_copies.py - Drop dead ``if src_subset is None or dst_subset is None`` guard (_resolve_subset_for always returns a Range). - Inline _expr_lt (4 lines, one caller in _is_consecutive_reshape); the try/except + intent comment now live where they're used. dace/libraries/standard/nodes/copy_node.py - Inline _coarse_pick_for_storage_pair (single caller in select_copy_implementation). - Inline _cuda2d_strides_are_supported (single caller in _refine_cuda_impl_for_subsets); the explanatory comment moves with the logic. tests/passes/insert_explicit_copies_test.py - Factor the structural assertion shared across the seven staging tests into _assert_lifted_libnode(state, side, expected_scope=...) -- replaces ~5 lines of "find libnode + check scope + check wire" per test with one call. - Pull "list View AccessNodes" into _view_an_names helper. tests/library/{copy_node,memset_node}_test.py + tests/passes/insert_explicit_copies_test.py - Replace explicit ``if __name__ == "__main__": test_a(); test_b(); ...`` blocks with ``pytest.main([__file__])`` so the script form picks up tests automatically without a hand-maintained list.

…shape Three lines collapsed to one. The comment still explains why the try/except swallows the exception silently (symbolic indeterminacy + equal-product safety net) without the docstring-shaped prose that was inherited from the deleted _expr_lt helper.

…ion pass Functions that already receive parent_state no longer take a redundant parent_sdfg/sdfg argument; they read the owning SDFG from state.sdfg. This covers the copy/memset expansion helpers, the auto_dispatch shim, CopyLibraryNode.src_storage/dst_storage, and the InsertExplicitCopies private methods. The framework-fixed validate(self, sdfg, state) and the ExpandTransformation.expansion(node, parent_state, parent_sdfg) signatures are left intact. Add copy_node tests for a padded (1, N) array whose unit leading dim carries a non-packed stride: is_contiguous_subset is False, so Auto falls back to a map (same storage) or a pitched cudaMemcpy2D (cross storage), and the copy stays numerically exact.

ExpandMemcpyCPU and the CUDA1D helper duplicated the same validate -> contiguous-check -> size -> pointer-Tasklet body, differing only in the cross-storage flag and the memcpy vs cudaMemcpyAsync code string. Fold them into one _make_memcpy_tasklet(node, parent_state, *, cuda), mirroring MemsetLibraryNode's _make_memset_tasklet. Generated code is unchanged.

Imports: drop the copy/subsets/data aliases and inline imports for plain top-level imports; add type hints to the module helpers. Reduce _derive_matching_dst_subset to a single volume check via subsets.Range.num_elements + dace.symbolic.equal, removing the four-branch shape ladder, the hand-rolled symbolic equality, and _is_consecutive_reshape (whose two-pointer walk only ever compared total volumes). Stop collapsing AN -> View -> AN round-trips. A View is an Array subclass with its own shape/strides, so _replace_direct_copies now lifts any View<->Array movement edge and skips only the view's alias edge, leaving the view in place as a copy endpoint (AN -> View -> Copy -> AN). Deletes _collapse_round_trip_views and the now-unused _resolve_subset_for; updates the view tests with structural + numerical checks for both view directions.

…e, test dedup - _replace_direct_copies now resolves src/dst subsets via Memlet.get_src_subset / get_dst_subset (the memlet path) instead of an ambiguous data-name else branch; the self-copy convention stays explicit. - ExpandSharedMemoryCollective's GPU_ThreadBlock guard uses the existing dace.sdfg.scope.is_in_scope, so get_parent_map_and_loop_scopes is dropped from transformation/helpers.py (the PR no longer modifies that file). - Tests route SDFG construction through shared builders: view round-trips (_make_view_round_trip_sdfg, src + dst), chained-map staging (_build_chained_stage_sdfg), reshape (_run_reshape_copy_test), and the array_to_array cases (one parametrized test) via _build_copy_sdfg. - Polybench tests reuse the canonical tests/polybench kernels instead of inline copies (covariance/correlation imported, fdtd-2d loaded by path under a clean module name); the __main__-only 'import polybench' in those three files moves under __main__ so they import without the absl CLI dependency. - Shortened/cleaned comments and docstrings.

ThrudPrimrose and others added 30 commits May 13, 2026 16:43

cleanup: dead code, docstrings, comments

f82811b

docs: ASCII-only, repo backtick convention, concise docstrings

92f031c

docs: ASCII-only, repo backtick convention, concise docstrings

9cf409e

Unify libnode stream connector on __dace_current_stream (valid in bot…

26547d6

…h legacy and experimental codegen)

Remove libnode stream-input plumbing; rename STREAM_CONN -> CURRENT_S…

3812133

…TREAM_NAME

Reuse subset.num_elements(); unify memset tasklet builder

dae3798

Merge remote-tracking branch 'origin/explicit-copy-memset-nodes' into…

6f26f73

… explicit-copy-memset-nodes

Drop stale stream-descriptor mention from CopyLibraryNode.validate do…

18cc384

…cstring

Move no-cycle inline imports to module top in copy/memset libnodes

a5540a2

Add length-1<->scalar conversion passes

a40c343

Passes are the only API (drop standalone function)

84a0079

Merge branch 'main' into explicit-copy-memset-nodes

dad18c1

Move connector names to class constants; use dtypes.{CPU,GPU}_RESIDEN…

1cbc095

…T_STORAGES

Apply connector-contract + inner-literal consolidation to memset and …

8e39e71

…copy_node

helpers: use imported nodes module for MapEntry/Tasklet/LibraryNode i…

a9b8bea

…sinstance checks

Accept Fortran-packed layouts in CopyNDTemplate; route mixed C/F to M…

ef777dd

…appedTasklet

MappedTasklet handles rank-mismatch via 1-D walker + int_floor/% deli…

192d6c9

…nearization; selector bypasses CopyND

Revert MappedTasklet rank-mismatch; CopyNDTemplate is the only suppor…

3074fd0

…ted reshape impl

Refactor multi-dim copy tests onto shared helper + add unsupported-ca…

6a42deb

…se feature-regression tests

Extract auto_dispatch + merge memset test helpers + type hints on new…

31b963a

… helpers

DRY/YAGNI: drop dead CopyExpansion fields, reuse collapse_shape_and_s…

7b3e859

…trides in memset, Range.num_elements over reduce

Drop ExpandCopyNDTemplate; MappedTasklet handles rank-mismatch (CopyN…

6c35792

…D only for shared memory)

Simplify _coarse_pick, _cuda2d_strides_are_supported, merge cudaMemcp…

a8aceb7

…y2D 1D+2D branches

Tailor _build_copynd_call to its sole shared-memory caller

f42e134

ThrudPrimrose and others added 30 commits May 20, 2026 13:45

Prune dead code: drop length-one<->scalar pass (unused), TODOs, requi…

cb62493

…re_contiguous param, _make_memset_skeleton unused returns

Soften _is_cross_cpu_gpu docstring on Register handling assumption

b55c5f5

Merge remote-tracking branch 'origin/main' into explicit-copy-memset-…

2e47679

…nodes

Inline _require_contiguous_subset as combined check at both call sites

567bd79

Add type hints (StorageType, Range, SymExpr, LibraryNode forward refs…

411b83a

…) on copy/memset libnode helpers

Improve _refine_cuda_impl_for_subsets docstring (accurate routing table)

b29b36f

Drop _memcpy_connector_typing: DaCe handles single-element pointer co…

5e2ba18

…nnectors correctly

Add 5 Auto-routed copy edge-case tests: 4D/1D flatten, 1D/4D inflate,…

d99c6b0

… 3D/2D collapse, F-order collapse, step-2 stride

Trim inner-tasklet connector comments; unify memset inner connector t…

03f97ef

…o '_out'; add transpose-pattern + dst-non-contig rejection tests

Rename memset tests: <expansion>_<rank>_<storage>; reject-test names …

3bf59c8

…the failure mode

Merge branch 'main' into explicit-copy-memset-nodes

9451d5a

Improve comments

1a52336

Pre-commit

786eb82

Merge origin/main into explicit-copy-memset-nodes (prefer theirs)

21d7ebb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explicit copy memset nodes#2380

Explicit copy memset nodes#2380
ThrudPrimrose wants to merge 61 commits into
mainfrom
explicit-copy-memset-nodes

ThrudPrimrose commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ThrudPrimrose commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant