Explicit copy memset nodes#2380
Open
ThrudPrimrose wants to merge 61 commits into
Open
Conversation
Introduce explicit copy / memset library nodes and a pass that lifts
every implicit AccessNode -> AccessNode (or scope-staging) edge into
a CopyLibraryNode, so the dataflow shape post-simplification is
self-describing and the legacy "copy edge gets lowered by ad-hoc
codegen" path can be deprecated.
dace/libraries/standard/nodes/copy_node.py (CopyLibraryNode + 8 expansions)
dace/libraries/standard/nodes/memset_node.py (MemsetLibraryNode + 3 expansions)
dace/libraries/standard/helper.py (shared expansion helpers)
dace/libraries/standard/environments/cpu.py (CPU environment used by ExpandMemcpyCPU)
dace/sdfg/construction_utils.py (small utilities used by copy_node)
dace/transformation/passes/insert_explicit_copies.py
(the pass)
Pass wiring: appended `InsertExplicitCopies` to `SIMPLIFY_PASSES` so
``SDFG.simplify()`` lifts implicit copies as part of standard cleanup
(idempotent — once lifted, no further implicit edges remain to match).
Tests:
tests/library/copy_node_test.py
tests/library/memset_node_test.py
tests/passes/insert_explicit_copies_test.py
…ryNode
Three small fixes that unblock the Copy/Memset libnodes' schedule
inference:
* ``infer_out_connector_type`` / ``infer_connector_types``: wrap the
``e.data.subset and e.data.subset.num_elements() == 1`` expression
in ``bool()``. A single-element ``Range`` returns False from
``__bool__`` and the bare ``and`` chain leaks the Range object
instead of a bool, which later trips the ``scalar |= …`` operator
with TypeError when the libnode's collapsed subset is empty.
* ``_determine_schedule_from_storage``: when the node under inspection
is a ``CopyLibraryNode`` / ``MemsetLibraryNode`` AND any neighbouring
memlet imposes a ``GPU_Device`` constraint, return ``GPU_Device``
directly. Without this, an H2D copy has both ``CPU_Multicore``
(from the CPU source) and ``GPU_Device`` (from the GPU sink) in its
constraint set and the existing ``len(constraints) > 1`` branch
raises ``InvalidSDFGNodeError: Cannot determine default schedule
for node copy_A_to_gpu_A``. The libnode is exactly the node class
designed to bridge storages; routing it to GPU_Device when GPU is
involved is the intended resolution.
…simplify InsertExplicitCopies materialises implicit AccessNode->AccessNode edges into CopyLibraryNodes — a shape-changing lowering step, not a shape-preserving simplification. Including it in SIMPLIFY_PASSES broke 22 tests (numpy reshape/flatten/view, redundant_copy count assertions, range_indirection / reinterpret validators) that rely on simplify preserving the implicit-edge form. Pass remains available as a standalone Pass for consumers that want it (e.g. the GPU codegen lowering pipeline, which calls it explicitly via InsertExplicitGPUGlobalMemoryCopies).
Three xfail-strict pins: - AccessNode<->View edges must not be lifted (policy: views are aliases). - Rank-changing reshape lift produces mismatched-rank memlets that trip codegen IndexError in cpp_offset_expr. - Dtype-reinterpret View lift produces CopyLibraryNode with mismatched element types, failing sdfg.validate().
…atch InsertExplicitCopies changes: - Drop the AN<->View edge lift (was inserting a Copy + intermediate buffer for every AN<->View edge, including pure aliasing edges that don't need one). - Add a round-trip collapse: AN_src -> View -> AN_dst becomes a single AN_src -> CopyLibraryNode -> AN_dst direct edge with the composed memlet (src side from the view-underlying subset, dst side from the access-side subset). The View AccessNode is removed when it has no other consumers. CopyLibraryNode expansion: - select_copy_implementation now routes rank-mismatched volume-equal copies (different rank after collapse_shape_and_strides) to CopyNDTemplate when both sides are same-storage C-packed contiguous. MappedTasklet reuses a single access expression for both endpoints, which produces a rank-mismatched memlet on the smaller side and crashes codegen. - ExpandCopyNDTemplate flattens to a 1D pointer walk when the in/out collapsed shapes have different ranks (both sides packed contiguous, so the linearization is sound). Tests: - Update the 4 view-lift tests to assert the new policy (round-trip collapses to a single Copy; View is removed if it has no other consumers; repeat applies are no-ops). - Add 10 new test_iec_* pins covering both patterns: AN<->View edges kept direct, AN->View->AN round-trip collapse, AN->View round-trip with another consumer keeping the View, AN<->AN copies with rank differing because of a constant-index dim, AN<->AN rank-mismatched volume-equal copies routed through CopyND, plus @dace.program reshape and reinterpret cases.
…h legacy and experimental codegen)
… explicit-copy-memset-nodes
…onstruction_utils Use SDFG.parent (O(1)) instead of the recursive _get_parent_state scan to find the state containing a nested-SDFG node; verified equivalent. copy_node.py imports the helper from dace.transformation.helpers (top-level, no import cycle). dace/sdfg/construction_utils.py removed (it only held these two helpers on this branch).
…nearization; selector bypasses CopyND
…se feature-regression tests
…trides in memset, Range.num_elements over reduce
…D only for shared memory)
…y2D 1D+2D branches
…re_contiguous param, _make_memset_skeleton unused returns
…) on copy/memset libnode helpers
…nnectors correctly
… 3D/2D collapse, F-order collapse, step-2 stride
…o '_out'; add transpose-pattern + dst-non-contig rejection tests
The marker is unregistered on this branch (new GPU codegen lives on a separate branch); the 3 single-element h2d/d2h tests work on any codegen. test_copy_single_element_memcpy_connector_types asserted the dropped _memcpy_connector_typing special case (value-typed CPU output + ``&_cpy_out`` in body). Now that both connectors are pointer-typed, those assertions are stale -- delete the test.
Replace _make_same_storage_sdfg / _make_cross_storage_sdfg / _make_multidim_libnode_sdfg / _build_explicit_copy_sdfg with a single _make_copy_sdfg(src, dst, *, ...) taking _ArraySpec dataclasses per side. The dataclass groups (shape, storage, strides, total_size, transient, subset, name, dtype); the helper returns (sdfg, libnode). Convert the inline-built SDFGs throughout the file (Fortran-packed same/cross-storage, no-common-stride1, register, single-element, shared-memory rejection, etc.) to the unified helper. Tests with GPU_Device / GPU_ThreadBlock map wiring stay inline since they use memlet_path instead of a single libnode edge. Net: -118 LoC (-486 / +368).
…rings Same-rank copy with mismatched per-dim shapes (e.g. (3,4) -> (4,3)) previously slipped through libnode validation and tripped DaCe's generic ``out-of-bounds memlet`` post-expansion validate -- the rejection was incidental. Add a per-dim shape check at the head of ExpandMappedTasklet's same-rank branch with a specific error message; the test now asserts on that explicit contract instead of the incidental SDFG-level error. Trim multi-line test docstrings (transpose-pattern, rank-mismatch variants, no-common-stride1, shared-memory collective, single- element-in-kernel) to one line each.
… array shapes The transpose-pattern check compares ``in_shape_collapsed`` vs ``out_shape_collapsed``, which come from ``collapse_shape_and_strides(subset, strides)`` -- per-dim subset sizes after singleton-collapse, NOT the underlying array shapes. Add two pin tests: test_copy_same_subset_different_array_shapes -- 0:N slice between arrays of different total size is fine when the per-dim subset sizes match. test_copy_1d_slice_from_2d_source -- a row-slice ``[i, 0:N]`` of a 2D array copies into a 1D array; the leading singleton collapses to the same rank on both sides. Both pass; transpose-pattern still rejected.
….X form
Consistency over brevity: every test now uses the full
dace.dtypes.StorageType.{CPU_Heap,GPU_Global,GPU_Shared,Register}
inline instead of a mix of local cpu/gpu aliases and inline long
forms. Drops 26 local-alias lines and removes the inconsistency
across the file. Two over-long docstrings trimmed to keep yapf
happy.
Previously select_memset_implementation picked CUDA / CPU purely from
storage type. cudaMemsetAsync / memset zero ``num_elements * sizeof(T)``
consecutive bytes from the dst pointer, so a non-contiguous subset
(e.g. a middle 2-D slice of a row-major array) silently zeroes memory
outside the region. Auto now routes such cases to the 'pure'
(mapped-tasklet) expansion, which writes per element via the subset.
ExpandCUDA / ExpandCPU also reject non-contiguous subsets upfront with
a clear error so explicit forcing still raises.
Tests:
test_memset_auto_routes_non_contiguous_to_pure_cpu -- Auto succeeds
via 'pure', zeroes only the 6x10 sub-block.
test_memset_cpu_rejects_non_contiguous_subset -- explicit CPU raises.
test_memset_cuda_rejects_non_contiguous_subset -- explicit CUDA raises.
Also: in copy_node_test.py, factor out _make_copy_skeleton + add
_make_legacy_copy_sdfg (canonical Memlet(data=dst, subset=dst_subset,
other_subset=src_subset) form). The previous _strip_libnodes mutation
was producing memlets without other_subset, which made the legacy
codegen look broken on patterns it actually handles correctly. Keep
the genuine libnode-advantage pin -- rank-mismatch 4D->2D Fortran
reshape -- delete the three false-positive tests.
- helper.py: drop collapsed_map_lengths, inlined as [s for s in subset.size() if s != 1] (2 callsites; Range.size() is the existing util that does the same job). - helper.py: shorten CURRENT_STREAM_NAME comment. - insert_explicit_copies.py: drop src_locations / dst_locations / skip_inside_device_scope properties + _storage_allowed helper. Zero callers in dace/ or tests/ pass any of them. Drop the Iterable / is_devicelevel_gpu imports that go with them. - insert_explicit_copies.py: fix the class docstring -- the pass only handles AN -> AN (and AN -> View -> AN via the round-trip collapse). Map-staging patterns are not handled on this branch (they were on explicit-gpu-global-copies; D1 outermost-subset bug; removed pending a correct rewrite).
…side the map scope AN -> MapEntry -> AN (stage-in) and AN -> MapExit -> AN (stage-out) edges now lift to a CopyLibraryNode placed INSIDE the map scope, wired directly to MapEntry's output connector / MapExit's input connector (no intermediate AN inserted on the scope-side). Chains of MapEntries / MapExits are followed via memlet_path; the body of the map (tasklets, NestedSDFGs, nested maps) is irrelevant to the lift. Views on the outer side stay in place. The outer-side memlet (per-iteration subset on the outer array) is preserved verbatim on the new MapEntry -> libnode (or libnode -> MapExit) edge; the inner-array-side memlet is derived via the existing _derive_matching_dst_subset against the inner array's descriptor. Tests in tests/passes/insert_explicit_copies_test.py: test_lift_stage_in_copy test_lift_stage_out_copy test_lift_stage_in_copy_through_view test_lift_stage_out_copy_through_view test_lift_stage_in_copy_chained_map_entries test_lift_stage_out_copy_chained_map_exits test_lift_stage_in_copy_with_nested_sdfg_consumer Each asserts: exactly one libnode in the lifted state, libnode's scope owner is the (innermost) MapEntry, libnode input/output wired directly to MapEntry/MapExit (not via an inserted AN), numerical match against NumPy, and zero `CopyND<` template instantiations in generated code. New helper _assert_no_copynd(sdfg) calls generate_code and greps each CodeObject for `CopyND<`; pinned in all seven new staging tests. Endpoint resolution refactored to sdutils.find_input_arraynode / find_output_arraynode instead of inline memlet_path[0/-1] + isinstance checks.
- Drop the unused third argument from _derive_matching_dst_subset (docstring already said "unused; kept for symmetric signature"); update both direct-copy callsites accordingly. - Fold _is_stage_in_candidate / _is_stage_out_candidate / _insert_stage_in_libnode / _insert_stage_out_libnode into one _lift_staging_edge(..., stage_in: bool) -- the four methods differed only by which side of the edge was the inner AN and which was the MapEntry/MapExit. -30 LoC net.
Introduce _compile_no_copynd(sdfg) wrapper: greps every CodeObject emitted by sdfg.generate_code() for 'CopyND<' and asserts none, then returns sdfg.compile(). Apply to all 17 compile sites in copy_node_test.py (including the libnode-side compile of the legacy comparison test). Pins the contract: libnode expansions displace the runtime CopyND fallback entirely. The only intentional CopyND user is ExpandSharedMemoryCollective, whose test inspects tasklet bodies directly without compiling, so no exemption is needed.
…cks; factor staging-test scaffold
dace/transformation/passes/insert_explicit_copies.py
- Drop dead ``if src_subset is None or dst_subset is None`` guard
(_resolve_subset_for always returns a Range).
- Inline _expr_lt (4 lines, one caller in _is_consecutive_reshape);
the try/except + intent comment now live where they're used.
dace/libraries/standard/nodes/copy_node.py
- Inline _coarse_pick_for_storage_pair (single caller in
select_copy_implementation).
- Inline _cuda2d_strides_are_supported (single caller in
_refine_cuda_impl_for_subsets); the explanatory comment moves with
the logic.
tests/passes/insert_explicit_copies_test.py
- Factor the structural assertion shared across the seven staging
tests into _assert_lifted_libnode(state, side, expected_scope=...)
-- replaces ~5 lines of "find libnode + check scope + check wire"
per test with one call.
- Pull "list View AccessNodes" into _view_an_names helper.
tests/library/{copy_node,memset_node}_test.py +
tests/passes/insert_explicit_copies_test.py
- Replace explicit ``if __name__ == "__main__": test_a(); test_b(); ...``
blocks with ``pytest.main([__file__])`` so the script form picks up
tests automatically without a hand-maintained list.
…shape Three lines collapsed to one. The comment still explains why the try/except swallows the exception silently (symbolic indeterminacy + equal-product safety net) without the docstring-shaped prose that was inherited from the deleted _expr_lt helper.
…ion pass Functions that already receive parent_state no longer take a redundant parent_sdfg/sdfg argument; they read the owning SDFG from state.sdfg. This covers the copy/memset expansion helpers, the auto_dispatch shim, CopyLibraryNode.src_storage/dst_storage, and the InsertExplicitCopies private methods. The framework-fixed validate(self, sdfg, state) and the ExpandTransformation.expansion(node, parent_state, parent_sdfg) signatures are left intact. Add copy_node tests for a padded (1, N) array whose unit leading dim carries a non-packed stride: is_contiguous_subset is False, so Auto falls back to a map (same storage) or a pitched cudaMemcpy2D (cross storage), and the copy stays numerically exact.
ExpandMemcpyCPU and the CUDA1D helper duplicated the same validate -> contiguous-check -> size -> pointer-Tasklet body, differing only in the cross-storage flag and the memcpy vs cudaMemcpyAsync code string. Fold them into one _make_memcpy_tasklet(node, parent_state, *, cuda), mirroring MemsetLibraryNode's _make_memset_tasklet. Generated code is unchanged.
Imports: drop the copy/subsets/data aliases and inline imports for plain top-level imports; add type hints to the module helpers. Reduce _derive_matching_dst_subset to a single volume check via subsets.Range.num_elements + dace.symbolic.equal, removing the four-branch shape ladder, the hand-rolled symbolic equality, and _is_consecutive_reshape (whose two-pointer walk only ever compared total volumes). Stop collapsing AN -> View -> AN round-trips. A View is an Array subclass with its own shape/strides, so _replace_direct_copies now lifts any View<->Array movement edge and skips only the view's alias edge, leaving the view in place as a copy endpoint (AN -> View -> Copy -> AN). Deletes _collapse_round_trip_views and the now-unused _resolve_subset_for; updates the view tests with structural + numerical checks for both view directions.
…e, test dedup - _replace_direct_copies now resolves src/dst subsets via Memlet.get_src_subset / get_dst_subset (the memlet path) instead of an ambiguous data-name else branch; the self-copy convention stays explicit. - ExpandSharedMemoryCollective's GPU_ThreadBlock guard uses the existing dace.sdfg.scope.is_in_scope, so get_parent_map_and_loop_scopes is dropped from transformation/helpers.py (the PR no longer modifies that file). - Tests route SDFG construction through shared builders: view round-trips (_make_view_round_trip_sdfg, src + dst), chained-map staging (_build_chained_stage_sdfg), reshape (_run_reshape_copy_test), and the array_to_array cases (one parametrized test) via _build_copy_sdfg. - Polybench tests reuse the canonical tests/polybench kernels instead of inline copies (covariance/correlation imported, fdtd-2d loaded by path under a clean module name); the __main__-only 'import polybench' in those three files moves under __main__ so they import without the absl CLI dependency. - Shortened/cleaned comments and docstrings.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.