Skip to content

Explicit copy memset nodes#2380

Open
ThrudPrimrose wants to merge 61 commits into
mainfrom
explicit-copy-memset-nodes
Open

Explicit copy memset nodes#2380
ThrudPrimrose wants to merge 61 commits into
mainfrom
explicit-copy-memset-nodes

Conversation

@ThrudPrimrose
Copy link
Copy Markdown
Collaborator

No description provided.

ThrudPrimrose and others added 30 commits May 13, 2026 16:43
Introduce explicit copy / memset library nodes and a pass that lifts
every implicit AccessNode -> AccessNode (or scope-staging) edge into
a CopyLibraryNode, so the dataflow shape post-simplification is
self-describing and the legacy "copy edge gets lowered by ad-hoc
codegen" path can be deprecated.

  dace/libraries/standard/nodes/copy_node.py     (CopyLibraryNode + 8 expansions)
  dace/libraries/standard/nodes/memset_node.py   (MemsetLibraryNode + 3 expansions)
  dace/libraries/standard/helper.py              (shared expansion helpers)
  dace/libraries/standard/environments/cpu.py    (CPU environment used by ExpandMemcpyCPU)
  dace/sdfg/construction_utils.py                (small utilities used by copy_node)
  dace/transformation/passes/insert_explicit_copies.py
                                                 (the pass)

Pass wiring: appended `InsertExplicitCopies` to `SIMPLIFY_PASSES` so
``SDFG.simplify()`` lifts implicit copies as part of standard cleanup
(idempotent — once lifted, no further implicit edges remain to match).

Tests:
  tests/library/copy_node_test.py
  tests/library/memset_node_test.py
  tests/passes/insert_explicit_copies_test.py
…ryNode

Three small fixes that unblock the Copy/Memset libnodes' schedule
inference:

  * ``infer_out_connector_type`` / ``infer_connector_types``: wrap the
    ``e.data.subset and e.data.subset.num_elements() == 1`` expression
    in ``bool()``. A single-element ``Range`` returns False from
    ``__bool__`` and the bare ``and`` chain leaks the Range object
    instead of a bool, which later trips the ``scalar |= …`` operator
    with TypeError when the libnode's collapsed subset is empty.
  * ``_determine_schedule_from_storage``: when the node under inspection
    is a ``CopyLibraryNode`` / ``MemsetLibraryNode`` AND any neighbouring
    memlet imposes a ``GPU_Device`` constraint, return ``GPU_Device``
    directly. Without this, an H2D copy has both ``CPU_Multicore``
    (from the CPU source) and ``GPU_Device`` (from the GPU sink) in its
    constraint set and the existing ``len(constraints) > 1`` branch
    raises ``InvalidSDFGNodeError: Cannot determine default schedule
    for node copy_A_to_gpu_A``. The libnode is exactly the node class
    designed to bridge storages; routing it to GPU_Device when GPU is
    involved is the intended resolution.
…simplify

InsertExplicitCopies materialises implicit AccessNode->AccessNode edges
into CopyLibraryNodes — a shape-changing lowering step, not a
shape-preserving simplification. Including it in SIMPLIFY_PASSES broke
22 tests (numpy reshape/flatten/view, redundant_copy count assertions,
range_indirection / reinterpret validators) that rely on simplify
preserving the implicit-edge form.

Pass remains available as a standalone Pass for consumers that want it
(e.g. the GPU codegen lowering pipeline, which calls it explicitly via
InsertExplicitGPUGlobalMemoryCopies).
Three xfail-strict pins:
- AccessNode<->View edges must not be lifted (policy: views are aliases).
- Rank-changing reshape lift produces mismatched-rank memlets that trip
  codegen IndexError in cpp_offset_expr.
- Dtype-reinterpret View lift produces CopyLibraryNode with mismatched
  element types, failing sdfg.validate().
…atch

InsertExplicitCopies changes:
- Drop the AN<->View edge lift (was inserting a Copy + intermediate buffer
  for every AN<->View edge, including pure aliasing edges that don't need
  one).
- Add a round-trip collapse: AN_src -> View -> AN_dst becomes a single
  AN_src -> CopyLibraryNode -> AN_dst direct edge with the composed memlet
  (src side from the view-underlying subset, dst side from the access-side
  subset). The View AccessNode is removed when it has no other consumers.

CopyLibraryNode expansion:
- select_copy_implementation now routes rank-mismatched volume-equal copies
  (different rank after collapse_shape_and_strides) to CopyNDTemplate when
  both sides are same-storage C-packed contiguous. MappedTasklet reuses a
  single access expression for both endpoints, which produces a
  rank-mismatched memlet on the smaller side and crashes codegen.
- ExpandCopyNDTemplate flattens to a 1D pointer walk when the in/out
  collapsed shapes have different ranks (both sides packed contiguous, so
  the linearization is sound).

Tests:
- Update the 4 view-lift tests to assert the new policy (round-trip collapses
  to a single Copy; View is removed if it has no other consumers; repeat
  applies are no-ops).
- Add 10 new test_iec_* pins covering both patterns: AN<->View edges kept
  direct, AN->View->AN round-trip collapse, AN->View round-trip with another
  consumer keeping the View, AN<->AN copies with rank differing because of a
  constant-index dim, AN<->AN rank-mismatched volume-equal copies routed
  through CopyND, plus @dace.program reshape and reinterpret cases.
…onstruction_utils

Use SDFG.parent (O(1)) instead of the recursive _get_parent_state scan
to find the state containing a nested-SDFG node; verified equivalent.
copy_node.py imports the helper from dace.transformation.helpers
(top-level, no import cycle).  dace/sdfg/construction_utils.py removed
(it only held these two helpers on this branch).
…trides in memset, Range.num_elements over reduce
ThrudPrimrose and others added 30 commits May 20, 2026 13:45
…re_contiguous param, _make_memset_skeleton unused returns
… 3D/2D collapse, F-order collapse, step-2 stride
…o '_out'; add transpose-pattern + dst-non-contig rejection tests
The marker is unregistered on this branch (new GPU codegen lives on a
separate branch); the 3 single-element h2d/d2h tests work on any
codegen.

test_copy_single_element_memcpy_connector_types asserted the
dropped _memcpy_connector_typing special case (value-typed CPU
output + ``&_cpy_out`` in body). Now that both connectors are
pointer-typed, those assertions are stale -- delete the test.
Replace _make_same_storage_sdfg / _make_cross_storage_sdfg /
_make_multidim_libnode_sdfg / _build_explicit_copy_sdfg with a
single _make_copy_sdfg(src, dst, *, ...) taking _ArraySpec
dataclasses per side. The dataclass groups (shape, storage,
strides, total_size, transient, subset, name, dtype); the
helper returns (sdfg, libnode).

Convert the inline-built SDFGs throughout the file (Fortran-packed
same/cross-storage, no-common-stride1, register, single-element,
shared-memory rejection, etc.) to the unified helper. Tests with
GPU_Device / GPU_ThreadBlock map wiring stay inline since they
use memlet_path instead of a single libnode edge.

Net: -118 LoC (-486 / +368).
…rings

Same-rank copy with mismatched per-dim shapes (e.g. (3,4) -> (4,3))
previously slipped through libnode validation and tripped DaCe's
generic ``out-of-bounds memlet`` post-expansion validate -- the
rejection was incidental. Add a per-dim shape check at the head of
ExpandMappedTasklet's same-rank branch with a specific error
message; the test now asserts on that explicit contract instead of
the incidental SDFG-level error.

Trim multi-line test docstrings (transpose-pattern, rank-mismatch
variants, no-common-stride1, shared-memory collective, single-
element-in-kernel) to one line each.
… array shapes

The transpose-pattern check compares ``in_shape_collapsed`` vs
``out_shape_collapsed``, which come from
``collapse_shape_and_strides(subset, strides)`` -- per-dim subset
sizes after singleton-collapse, NOT the underlying array shapes.

Add two pin tests:

  test_copy_same_subset_different_array_shapes -- 0:N slice between
  arrays of different total size is fine when the per-dim subset
  sizes match.

  test_copy_1d_slice_from_2d_source -- a row-slice ``[i, 0:N]`` of
  a 2D array copies into a 1D array; the leading singleton
  collapses to the same rank on both sides.

Both pass; transpose-pattern still rejected.
….X form

Consistency over brevity: every test now uses the full
dace.dtypes.StorageType.{CPU_Heap,GPU_Global,GPU_Shared,Register}
inline instead of a mix of local cpu/gpu aliases and inline long
forms. Drops 26 local-alias lines and removes the inconsistency
across the file. Two over-long docstrings trimmed to keep yapf
happy.
Previously select_memset_implementation picked CUDA / CPU purely from
storage type. cudaMemsetAsync / memset zero ``num_elements * sizeof(T)``
consecutive bytes from the dst pointer, so a non-contiguous subset
(e.g. a middle 2-D slice of a row-major array) silently zeroes memory
outside the region. Auto now routes such cases to the 'pure'
(mapped-tasklet) expansion, which writes per element via the subset.

ExpandCUDA / ExpandCPU also reject non-contiguous subsets upfront with
a clear error so explicit forcing still raises.

Tests:
  test_memset_auto_routes_non_contiguous_to_pure_cpu  -- Auto succeeds
    via 'pure', zeroes only the 6x10 sub-block.
  test_memset_cpu_rejects_non_contiguous_subset  -- explicit CPU raises.
  test_memset_cuda_rejects_non_contiguous_subset  -- explicit CUDA raises.

Also: in copy_node_test.py, factor out _make_copy_skeleton + add
_make_legacy_copy_sdfg (canonical Memlet(data=dst, subset=dst_subset,
other_subset=src_subset) form). The previous _strip_libnodes mutation
was producing memlets without other_subset, which made the legacy
codegen look broken on patterns it actually handles correctly. Keep
the genuine libnode-advantage pin -- rank-mismatch 4D->2D Fortran
reshape -- delete the three false-positive tests.
- helper.py: drop collapsed_map_lengths, inlined as
  [s for s in subset.size() if s != 1] (2 callsites; Range.size()
  is the existing util that does the same job).
- helper.py: shorten CURRENT_STREAM_NAME comment.
- insert_explicit_copies.py: drop src_locations / dst_locations /
  skip_inside_device_scope properties + _storage_allowed helper.
  Zero callers in dace/ or tests/ pass any of them. Drop the
  Iterable / is_devicelevel_gpu imports that go with them.
- insert_explicit_copies.py: fix the class docstring -- the pass
  only handles AN -> AN (and AN -> View -> AN via the round-trip
  collapse). Map-staging patterns are not handled on this branch
  (they were on explicit-gpu-global-copies; D1 outermost-subset
  bug; removed pending a correct rewrite).
…side the map scope

AN -> MapEntry -> AN (stage-in) and AN -> MapExit -> AN (stage-out)
edges now lift to a CopyLibraryNode placed INSIDE the map scope,
wired directly to MapEntry's output connector / MapExit's input
connector (no intermediate AN inserted on the scope-side). Chains
of MapEntries / MapExits are followed via memlet_path; the body
of the map (tasklets, NestedSDFGs, nested maps) is irrelevant to
the lift. Views on the outer side stay in place.

The outer-side memlet (per-iteration subset on the outer array)
is preserved verbatim on the new MapEntry -> libnode (or
libnode -> MapExit) edge; the inner-array-side memlet is derived
via the existing _derive_matching_dst_subset against the inner
array's descriptor.

Tests in tests/passes/insert_explicit_copies_test.py:

  test_lift_stage_in_copy
  test_lift_stage_out_copy
  test_lift_stage_in_copy_through_view
  test_lift_stage_out_copy_through_view
  test_lift_stage_in_copy_chained_map_entries
  test_lift_stage_out_copy_chained_map_exits
  test_lift_stage_in_copy_with_nested_sdfg_consumer

Each asserts: exactly one libnode in the lifted state, libnode's
scope owner is the (innermost) MapEntry, libnode input/output
wired directly to MapEntry/MapExit (not via an inserted AN),
numerical match against NumPy, and zero `CopyND<` template
instantiations in generated code.

New helper _assert_no_copynd(sdfg) calls generate_code and
greps each CodeObject for `CopyND<`; pinned in all seven new
staging tests.

Endpoint resolution refactored to sdutils.find_input_arraynode /
find_output_arraynode instead of inline memlet_path[0/-1] +
isinstance checks.
- Drop the unused third argument from _derive_matching_dst_subset
  (docstring already said "unused; kept for symmetric signature");
  update both direct-copy callsites accordingly.
- Fold _is_stage_in_candidate / _is_stage_out_candidate /
  _insert_stage_in_libnode / _insert_stage_out_libnode into one
  _lift_staging_edge(..., stage_in: bool) -- the four methods
  differed only by which side of the edge was the inner AN and
  which was the MapEntry/MapExit.

-30 LoC net.
Introduce _compile_no_copynd(sdfg) wrapper: greps every CodeObject
emitted by sdfg.generate_code() for 'CopyND<' and asserts none,
then returns sdfg.compile(). Apply to all 17 compile sites in
copy_node_test.py (including the libnode-side compile of the
legacy comparison test).

Pins the contract: libnode expansions displace the runtime CopyND
fallback entirely. The only intentional CopyND user is
ExpandSharedMemoryCollective, whose test inspects tasklet bodies
directly without compiling, so no exemption is needed.
…cks; factor staging-test scaffold

dace/transformation/passes/insert_explicit_copies.py
- Drop dead ``if src_subset is None or dst_subset is None`` guard
  (_resolve_subset_for always returns a Range).
- Inline _expr_lt (4 lines, one caller in _is_consecutive_reshape);
  the try/except + intent comment now live where they're used.

dace/libraries/standard/nodes/copy_node.py
- Inline _coarse_pick_for_storage_pair (single caller in
  select_copy_implementation).
- Inline _cuda2d_strides_are_supported (single caller in
  _refine_cuda_impl_for_subsets); the explanatory comment moves with
  the logic.

tests/passes/insert_explicit_copies_test.py
- Factor the structural assertion shared across the seven staging
  tests into _assert_lifted_libnode(state, side, expected_scope=...)
  -- replaces ~5 lines of "find libnode + check scope + check wire"
  per test with one call.
- Pull "list View AccessNodes" into _view_an_names helper.

tests/library/{copy_node,memset_node}_test.py +
tests/passes/insert_explicit_copies_test.py
- Replace explicit ``if __name__ == "__main__": test_a(); test_b(); ...``
  blocks with ``pytest.main([__file__])`` so the script form picks up
  tests automatically without a hand-maintained list.
…shape

Three lines collapsed to one. The comment still explains why the
try/except swallows the exception silently (symbolic indeterminacy +
equal-product safety net) without the docstring-shaped prose that
was inherited from the deleted _expr_lt helper.
…ion pass

Functions that already receive parent_state no longer take a redundant
parent_sdfg/sdfg argument; they read the owning SDFG from state.sdfg.
This covers the copy/memset expansion helpers, the auto_dispatch shim,
CopyLibraryNode.src_storage/dst_storage, and the InsertExplicitCopies
private methods. The framework-fixed validate(self, sdfg, state) and the
ExpandTransformation.expansion(node, parent_state, parent_sdfg) signatures
are left intact.

Add copy_node tests for a padded (1, N) array whose unit leading dim
carries a non-packed stride: is_contiguous_subset is False, so Auto falls
back to a map (same storage) or a pitched cudaMemcpy2D (cross storage),
and the copy stays numerically exact.
ExpandMemcpyCPU and the CUDA1D helper duplicated the same validate ->
contiguous-check -> size -> pointer-Tasklet body, differing only in the
cross-storage flag and the memcpy vs cudaMemcpyAsync code string. Fold them
into one _make_memcpy_tasklet(node, parent_state, *, cuda), mirroring
MemsetLibraryNode's _make_memset_tasklet. Generated code is unchanged.
Imports: drop the copy/subsets/data aliases and inline imports for plain
top-level imports; add type hints to the module helpers.

Reduce _derive_matching_dst_subset to a single volume check via
subsets.Range.num_elements + dace.symbolic.equal, removing the four-branch
shape ladder, the hand-rolled symbolic equality, and _is_consecutive_reshape
(whose two-pointer walk only ever compared total volumes).

Stop collapsing AN -> View -> AN round-trips. A View is an Array subclass
with its own shape/strides, so _replace_direct_copies now lifts any
View<->Array movement edge and skips only the view's alias edge, leaving the
view in place as a copy endpoint (AN -> View -> Copy -> AN). Deletes
_collapse_round_trip_views and the now-unused _resolve_subset_for; updates the
view tests with structural + numerical checks for both view directions.
…e, test dedup

- _replace_direct_copies now resolves src/dst subsets via Memlet.get_src_subset /
  get_dst_subset (the memlet path) instead of an ambiguous data-name else branch;
  the self-copy convention stays explicit.
- ExpandSharedMemoryCollective's GPU_ThreadBlock guard uses the existing
  dace.sdfg.scope.is_in_scope, so get_parent_map_and_loop_scopes is dropped from
  transformation/helpers.py (the PR no longer modifies that file).
- Tests route SDFG construction through shared builders: view round-trips
  (_make_view_round_trip_sdfg, src + dst), chained-map staging
  (_build_chained_stage_sdfg), reshape (_run_reshape_copy_test), and the
  array_to_array cases (one parametrized test) via _build_copy_sdfg.
- Polybench tests reuse the canonical tests/polybench kernels instead of inline
  copies (covariance/correlation imported, fdtd-2d loaded by path under a clean
  module name); the __main__-only 'import polybench' in those three files moves
  under __main__ so they import without the absl CLI dependency.
- Shortened/cleaned comments and docstrings.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant