⚡️ Speed up function _create_device_sync_precompute_statements by 14% in PR #1015 (gpu-sync-instrumentation)
#1040
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
⚡️ This pull request contains optimizations for PR #1015
If you approve this dependent PR, these changes will be merged into the original PR branch
gpu-sync-instrumentation.📄 14% (0.14x) speedup for
_create_device_sync_precompute_statementsincodeflash/code_utils/instrument_existing_tests.py⏱️ Runtime :
691 microseconds→605 microseconds(best of173runs)📝 Explanation and details
The optimized code achieves a 14% speedup by reducing redundant AST node allocations through strategic object reuse, particularly for PyTorch framework handling which dominates the function's workload.
Key Optimizations
1. Context Object Reuse (3-4% gain)
The optimized version creates
ast.Load()andast.Store()context objects once and reuses them throughout, rather than creating new instances inline for every AST node. Since these are singleton-like objects, this reduces allocation overhead.2. Shared AST Attribute Chains (8-10% gain for torch-heavy workloads)
For PyTorch, the code now creates intermediate AST nodes once and reuses them:
torch_name- reused for both CUDA and MPS statementstorch_cuda- reused for bothis_available()andis_initialized()callstorch_backends- reused for MPS hasattr check andbackends.mps.is_available()calltorch_mps_attr- reused for thehasattr(torch.mps, 'synchronize')checkThe original code reconstructed these attribute chains from scratch each time, creating duplicate
ast.Nameandast.Attributenodes with identical structure.Performance Impact Analysis
The test results show clear patterns:
test_torch_with_custom_alias_and_empty_alias: 27.7% faster with empty alias)Context and Impact
Based on
function_references, this function is called bycreate_wrapper_function()which instruments test functions for profiling. The wrapper is generated for every test function being monitored, making this a performance-critical code path during test suite instrumentation.The optimization is particularly valuable when:
used_frameworksdict frequently contains "torch" (the most common case)The speedup compounds across hundreds or thousands of test instrumentations, reducing overall profiling setup overhead.
✅ Correctness verification report:
🌀 Click to see Generated Regression Tests
To edit these changes
git checkout codeflash/optimize-pr1015-2026-01-09T21.40.59and push.