Extend problem cache with hardware provenance metadata #4835

Open
danieyan-amd wants to merge 3 commits into ROCm:develop from danieyan-amd:feature/problem-cache-schema-extension
Conversation

@danieyan-amd
@danieyan-amd danieyan-amd commented Apr 30, 2026

Two changes to problem_cache.cpp:

  1. load(): Project deserialized keys to only {name, problem} so that extra metadata fields in the JSON don't break cache key matching. Previously, the full JSON object (all fields) was used as the map key, causing 100% cache misses when metadata was present.

  2. save(): Enrich each key with hardware provenance before writing: gpu_arch, cu_count, graphics_clock_mhz, memory_clock_mhz, memory_bus_bits, vram_bytes, wavefront_size, regs_per_block, max_threads_per_cu. Queried once via hipGetDeviceProperties at session end — negligible performance cost.

The in-memory map always uses {name, problem} keys for O(1) lookups. The on-disk JSON carries additional hardware context for traceability. On load, the extra fields are projected away, preserving fast matching.
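The round trip described above can be sketched with standard-library types only. This is a minimal illustration, not the code in problem_cache.cpp: `record`, `cache_key`, `cache_key_hash`, `enrich_key`, `project_key`, and `load_entries` are hypothetical names, and a `std::map` of strings stands in for the JSON object that MIGraphX stores as a `value`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one JSON object: field name -> field value as text.
using record = std::map<std::string, std::string>;

// In-memory key: only the two fields that identify a cache entry.
struct cache_key
{
    std::string name;
    std::string problem;
    bool operator==(const cache_key& other) const
    {
        return name == other.name && problem == other.problem;
    }
};

// Simple hash combiner so cache_key can be used in std::unordered_map.
struct cache_key_hash
{
    std::size_t operator()(const cache_key& k) const
    {
        return std::hash<std::string>{}(k.name) ^ (std::hash<std::string>{}(k.problem) << 1);
    }
};

using problem_cache = std::unordered_map<cache_key, std::string, cache_key_hash>;

// save() direction: start from the hardware fields and add the identifying fields.
record enrich_key(const cache_key& key, const record& hardware)
{
    record out     = hardware;
    out["name"]    = key.name;
    out["problem"] = key.problem;
    return out;
}

// load() direction: drop everything except name and problem.
// std::map::at throws std::out_of_range if either field is missing.
cache_key project_key(const record& on_disk)
{
    return {on_disk.at("name"), on_disk.at("problem")};
}

// Rebuild the in-memory cache from (on-disk key, solution) pairs.
problem_cache load_entries(const std::vector<std::pair<record, std::string>>& on_disk)
{
    problem_cache cache;
    for(const auto& [key, solution] : on_disk)
        cache[project_key(key)] = solution;
    // Projection can merge entries that differ only in metadata, never add any.
    assert(cache.size() <= on_disk.size());
    return cache;
}
```

In this sketch, a key written with `enrich_key` and read back with `project_key` compares equal to the original `cache_key`, so a lookup succeeds no matter which hardware fields the file carries. Two on-disk entries that differ only in metadata collapse into one in-memory entry, and the later one wins.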

Motivation

This PR adds hardware information to the problem cache and handles the extra hardware fields when looking up cached solutions.

Technical Details

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

@danieyan-amd danieyan-amd marked this pull request as ready for review April 30, 2026 19:44
@danieyan-amd danieyan-amd requested a review from causten as a code owner April 30, 2026 19:44
Copilot AI review requested due to automatic review settings April 30, 2026 19:44
@danieyan-amd danieyan-amd marked this pull request as draft April 30, 2026 19:44
@danieyan-amd
Author

Sorry Chris, I didn't mean to hit ready for review.

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the GPU problem cache persistence format to remain resilient to extra on-disk metadata while also recording hardware provenance for traceability.

Changes:

  • In load(), deserialize into a temporary map and project keys down to {name, problem} to prevent metadata fields from breaking cache-key matching.
  • In save(), enrich persisted keys with HIP device properties (e.g., arch, CU count, clocks, VRAM) before writing the JSON file.

// Enrich keys with hardware provenance metadata on write.
// This runs once at session end — negligible cost.
hipDeviceProp_t props{};
auto status = hipGetDeviceProperties(&props, get_device_id());
Comment on lines +61 to +67
std::unordered_map<value, value> raw;
from_value(from_json_string(read_string(pc_path)), raw);
for(auto& [k, v] : raw)
{
    auto projected = create_key(k.at("name").to<std::string>(), k.at("problem"));
    cache[projected] = v;
}
Author

@copilot apply changes based on this feedback

Comment thread src/targets/gpu/problem_cache.cpp Outdated
@codecov
codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4835      +/-   ##
===========================================
+ Coverage    92.32%   92.80%   +0.48%     
===========================================
  Files          583      584       +1     
  Lines        29332    30146     +814     
===========================================
+ Hits         27080    27976     +896     
+ Misses        2252     2170      -82     

see 75 files with indirect coverage changes

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@danieyan-amd danieyan-amd marked this pull request as ready for review May 7, 2026 19:34
{
    auto projected = create_key(k.at("name").to<std::string>(), k.at("problem"));
    cache[projected] = v;
}
Collaborator

Making an extra copy can get slow with larger problem caches.
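One possible way to reduce that copying is to move the strings out of the temporary container instead of copying them. The sketch below shows the idea with standard-library stand-ins; it is not necessarily what the reviewer has in mind. `disk_entry`, `problem_cache`, and `build_cache` are hypothetical names. A `std::vector` replaces the temporary `std::unordered_map<value, value>` because `unordered_map` keys are const and cannot be moved from without `extract()`, and a `std::map` replaces the real hash map to avoid a custom hash.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical on-disk entry: identifying fields, one metadata field, and the solution.
struct disk_entry
{
    std::string name;
    std::string problem;
    std::string gpu_arch; // metadata, dropped on load
    std::string solution;
};

// (name, problem) -> solution.
using problem_cache = std::map<std::pair<std::string, std::string>, std::string>;

// Takes the entries by value so the caller can hand them over with std::move.
// The name, problem, and solution strings are then moved, not copied, into the cache.
problem_cache build_cache(std::vector<disk_entry> entries)
{
    const std::size_t entry_count = entries.size();
    problem_cache cache;
    for(auto& e : entries)
    {
        auto key = std::make_pair(std::move(e.name), std::move(e.problem));
        // Like cache[key] = v in the PR, a later duplicate overwrites an earlier one.
        cache.insert_or_assign(std::move(key), std::move(e.solution));
    }
    assert(cache.size() <= entry_count);
    return cache;
}
```

A caller would pass the deserialized entries as `build_cache(std::move(entries))`. The temporary container still exists in this version, but its string payloads are transferred into the cache, not duplicated.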

@pfultz2
Collaborator

pfultz2 commented May 8, 2026

I think the metadata should be managed externally. In the future, we may use SQLite databases to manage problem caches, and inserting metadata like this may not be efficient there.

3 participants