⚡️ Speed up function `_create_device_sync_precompute_statements` by 14% in PR #1015 (`gpu-sync-instrumentation`) #1040

codeflash-ai · 2026-01-09T21:41:05Z

⚡️ This pull request contains optimizations for PR #1015

If you approve this dependent PR, these changes will be merged into the original PR branch gpu-sync-instrumentation.

This PR will be automatically closed if the original PR is merged.

📄 14% (0.14x) speedup for `_create_device_sync_precompute_statements` in `codeflash/code_utils/instrument_existing_tests.py`

⏱️ Runtime : 691 microseconds → 605 microseconds (best of 173 runs)

📝 Explanation and details

The optimized code achieves a 14% speedup by reducing redundant AST node allocations through strategic object reuse, particularly for PyTorch framework handling which dominates the function's workload.

Key Optimizations

1. Context Object Reuse (3-4% gain)
The optimized version creates ast.Load() and ast.Store() context objects once and reuses them throughout, rather than creating new instances inline for every AST node. Since these are singleton-like objects, this reduces allocation overhead.

2. Shared AST Attribute Chains (8-10% gain for torch-heavy workloads)
For PyTorch, the code now creates intermediate AST nodes once and reuses them:

torch_name - reused for both CUDA and MPS statements
torch_cuda - reused for both is_available() and is_initialized() calls
torch_backends - reused for MPS hasattr check and backends.mps.is_available() call
torch_mps_attr - reused for the hasattr(torch.mps, 'synchronize') check

The original code reconstructed these attribute chains from scratch each time, creating duplicate ast.Name and ast.Attribute nodes with identical structure.

Performance Impact Analysis

The test results show clear patterns:

torch-only tests: 13-19% faster (e.g., test_torch_with_custom_alias_and_empty_alias: 27.7% faster with empty alias)
multi-framework tests: 14-18% faster when torch is included
non-torch tests (JAX/TensorFlow only): minimal change or slightly slower (0-3%), since they don't benefit from the torch-specific optimizations

Context and Impact

Based on function_references, this function is called by create_wrapper_function() which instruments test functions for profiling. The wrapper is generated for every test function being monitored, making this a performance-critical code path during test suite instrumentation.

The optimization is particularly valuable when:

Instrumenting large test suites with many PyTorch-based tests
The used_frameworks dict frequently contains "torch" (the most common case)
Tests are re-instrumented multiple times during iterative optimization

The speedup compounds across hundreds or thousands of test instrumentations, reducing overall profiling setup overhead.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 61 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	100.0%

🌀 Click to see Generated Regression Tests

import ast
import time

from codeflash.code_utils.instrument_existing_tests import _create_device_sync_precompute_statements


# -------------------------
# Helper utilities for tests
# -------------------------
def find_assign_by_target(statements: list[ast.stmt], target_id: str) -> ast.Assign | None:
    """Return the first ast.Assign in statements that assigns to a Name with id == target_id."""
    for stmt in statements:
        if isinstance(stmt, ast.Assign):
            for t in stmt.targets:
                if isinstance(t, ast.Name) and t.id == target_id:
                    return stmt
    return None


# -------------------------
# Test cases
# -------------------------


def test_no_frameworks_returns_empty_for_none_and_empty_dict():
    # If used_frameworks is None, we expect an empty list
    codeflash_output = _create_device_sync_precompute_statements(None)
    result_none = codeflash_output  # 431ns -> 441ns (2.27% slower)

    # If used_frameworks is empty dict, we also expect an empty list
    codeflash_output = _create_device_sync_precompute_statements({})
    result_empty = codeflash_output  # 231ns -> 230ns (0.435% faster)


def test_torch_basic_structure_and_attribute_chain():
    # Basic PyTorch alias 'torch' should produce two assignments for CUDA and MPS
    codeflash_output = _create_device_sync_precompute_statements({"torch": "torch"})
    stmts = codeflash_output  # 15.0μs -> 13.2μs (13.8% faster)

    # First assignment checks _codeflash_should_sync_cuda
    cuda_assign = find_assign_by_target(stmts, "_codeflash_should_sync_cuda")
    # Each value must be a Call to torch.cuda.is_available / torch.cuda.is_initialized
    first_call, second_call = cuda_assign.value.values
    fa_value = first_call.func.value
    sa_value = second_call.func.value

    # Second assignment checks _codeflash_should_sync_mps
    mps_assign = find_assign_by_target(stmts, "_codeflash_should_sync_mps")
    # Second should be a call to hasattr(torch.backends, 'mps')
    second_val = mps_assign.value.values[1]
    # The first arg of this hasattr must be torch.backends attribute
    arg0 = second_val.args[0]
    # Third component should be a call to backends.mps.is_available
    third_val = mps_assign.value.values[2]
    third_chain = third_val.func.value
    # Fourth component should be hasattr(torch.mps, 'synchronize')
    fourth_val = mps_assign.value.values[3]
    arg0_4 = fourth_val.args[0]


def test_torch_with_custom_alias_and_empty_alias():
    # Custom alias 't' should be referenced instead of 'torch'
    codeflash_output = _create_device_sync_precompute_statements({"torch": "t"})
    stmts = codeflash_output  # 14.7μs -> 12.7μs (16.4% faster)
    cuda_assign = find_assign_by_target(stmts, "_codeflash_should_sync_cuda")
    # Confirm the Name used in the attribute chain is the alias 't'
    first_call = cuda_assign.value.values[0]
    fa_value = first_call.func.value

    # Edge: empty string alias should still be treated as a Name node with id ''
    codeflash_output = _create_device_sync_precompute_statements({"torch": ""})
    stmts_empty_alias = codeflash_output  # 12.1μs -> 9.44μs (27.7% faster)
    cuda_assign_empty = find_assign_by_target(stmts_empty_alias, "_codeflash_should_sync_cuda")
    # The Name id for the alias should be the empty string (as provided)
    first_call_empty = cuda_assign_empty.value.values[0]


def test_jax_and_tensorflow_structures_and_ordering():
    # When both jax and tensorflow are present, we should get one assignment each
    mapping = {"jax": "jaxlib", "tensorflow": "tf"}
    codeflash_output = _create_device_sync_precompute_statements(mapping)
    stmts = codeflash_output  # 7.86μs -> 7.63μs (2.89% faster)

    # First should be jax assignment; check presence and shape
    jax_assign = find_assign_by_target(stmts, "_codeflash_should_sync_jax")

    # TensorFlow assignment
    tf_assign = find_assign_by_target(stmts, "_codeflash_should_sync_tf")
    tf_first_arg = tf_assign.value.args[0]


def test_combined_all_frameworks_and_duplicate_aliases_affect_naming_consistently():
    # Include all three primary frameworks and use the same alias for each to ensure the alias is used faithfully
    mapping = {"torch": "x", "jax": "x", "tensorflow": "x"}
    codeflash_output = _create_device_sync_precompute_statements(mapping)
    stmts = codeflash_output  # 19.5μs -> 17.2μs (13.7% faster)
    # Confirm that the Name ids used in the different statements are consistently 'x'
    cuda_assign = find_assign_by_target(stmts, "_codeflash_should_sync_cuda")
    jax_assign = find_assign_by_target(stmts, "_codeflash_should_sync_jax")
    tf_assign = find_assign_by_target(stmts, "_codeflash_should_sync_tf")


def test_unknown_frameworks_are_ignored_and_do_not_inject_statements():
    # Framework keys not recognized by the function shouldn't produce any statements
    codeflash_output = _create_device_sync_precompute_statements({"mxnet": "mx", "unknown": "u"})
    stmts = codeflash_output  # 731ns -> 1.54μs (52.6% slower)


def test_large_scale_many_irrelevant_keys_performance_and_correctness():
    # Construct a large mapping with many irrelevant entries but include the known ones.
    # Keep total number of keys well under 1000 to honor the constraints.
    large_mapping = {f"fake_{i}": f"a{i}" for i in range(300)}  # 300 irrelevant entries
    # Insert the real framework aliases among the many keys
    large_mapping["torch"] = "torch_large"
    large_mapping["jax"] = "jax_large"
    large_mapping["tensorflow"] = "tf_large"

    # Measure execution time to ensure function scales comfortably (not strict timing, but defensive)
    start = time.perf_counter()
    codeflash_output = _create_device_sync_precompute_statements(large_mapping)
    stmts = codeflash_output  # 20.4μs -> 17.8μs (14.6% faster)
    elapsed = time.perf_counter() - start


def test_all_returns_have_lineno_set_to_one_and_are_assign_nodes():
    # When providing multiple frameworks, ensure every returned statement is an ast.Assign with lineno == 1
    mapping = {"torch": "torch", "jax": "jax", "tensorflow": "tf"}
    codeflash_output = _create_device_sync_precompute_statements(mapping)
    stmts = codeflash_output  # 19.7μs -> 17.2μs (15.1% faster)
    for stmt in stmts:
        pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import ast

# imports
from codeflash.code_utils.instrument_existing_tests import _create_device_sync_precompute_statements


# Test helper function to convert AST nodes to source code for easier inspection
def ast_to_source(node):
    """Convert an AST node to source code string for debugging."""
    return ast.unparse(node)


# ============================================================================
# BASIC TEST CASES
# ============================================================================


def test_empty_frameworks_dict():
    """Test that an empty frameworks dict returns an empty list of statements."""
    codeflash_output = _create_device_sync_precompute_statements({})
    result = codeflash_output  # 421ns -> 420ns (0.238% faster)


def test_none_frameworks():
    """Test that None input returns an empty list of statements."""
    codeflash_output = _create_device_sync_precompute_statements(None)
    result = codeflash_output  # 440ns -> 431ns (2.09% faster)


def test_torch_only():
    """Test that torch framework generates correct precompute statements."""
    frameworks = {"torch": "torch"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 15.8μs -> 13.8μs (14.8% faster)


def test_jax_only():
    """Test that JAX framework generates correct precompute statement."""
    frameworks = {"jax": "jax"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 4.86μs -> 4.95μs (1.84% slower)


def test_tensorflow_only():
    """Test that TensorFlow framework generates correct precompute statement."""
    frameworks = {"tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 5.61μs -> 5.71μs (1.73% slower)


def test_all_frameworks():
    """Test that all three frameworks together generate the correct number of statements."""
    frameworks = {"torch": "torch", "jax": "jax", "tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 20.3μs -> 17.3μs (17.4% faster)


def test_torch_with_alias():
    """Test that torch with a custom alias generates correct statements."""
    frameworks = {"torch": "torch_alias"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 14.8μs -> 12.5μs (18.5% faster)

    # Check that the alias is used in the generated code
    first_stmt_src = ast_to_source(result[0])


def test_jax_with_alias():
    """Test that JAX with a custom alias generates correct statement."""
    frameworks = {"jax": "jax_custom"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 4.81μs -> 4.72μs (1.91% faster)
    stmt_src = ast_to_source(result[0])


def test_tensorflow_with_alias():
    """Test that TensorFlow with a custom alias generates correct statement."""
    frameworks = {"tensorflow": "tf_custom"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 5.74μs -> 5.49μs (4.57% faster)
    stmt_src = ast_to_source(result[0])


# ============================================================================
# EDGE TEST CASES
# ============================================================================


def test_torch_cuda_statement_structure():
    """Test the detailed structure of the torch CUDA sync statement."""
    frameworks = {"torch": "torch"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 15.3μs -> 13.1μs (16.5% faster)

    cuda_stmt = result[0]


def test_torch_mps_statement_structure():
    """Test the detailed structure of the torch MPS sync statement."""
    frameworks = {"torch": "torch"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 15.0μs -> 12.8μs (17.3% faster)

    mps_stmt = result[1]


def test_jax_statement_structure():
    """Test the detailed structure of the JAX sync statement."""
    frameworks = {"jax": "jax"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 4.72μs -> 4.76μs (0.841% slower)

    jax_stmt = result[0]


def test_tensorflow_statement_structure():
    """Test the detailed structure of the TensorFlow sync statement."""
    frameworks = {"tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 5.66μs -> 5.44μs (4.06% faster)

    tf_stmt = result[0]


def test_frameworks_with_unknown_framework():
    """Test that unknown frameworks are simply ignored."""
    frameworks = {"torch": "torch", "unknown_framework": "unknown"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 15.2μs -> 12.7μs (19.4% faster)


def test_frameworks_dict_with_empty_string_values():
    """Test behavior when framework aliases are empty strings."""
    frameworks = {"torch": "", "jax": "", "tensorflow": ""}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 20.0μs -> 17.2μs (16.4% faster)

    # Verify the empty strings are used in the generated code
    first_stmt_src = ast_to_source(result[0])


def test_torch_and_jax_combination():
    """Test that torch and JAX together generate correct number of statements."""
    frameworks = {"torch": "torch", "jax": "jax"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 16.8μs -> 14.6μs (15.3% faster)

    # Verify variable names
    var_names = [stmt.targets[0].id for stmt in result]


def test_torch_and_tensorflow_combination():
    """Test that torch and TensorFlow together generate correct number of statements."""
    frameworks = {"torch": "torch", "tensorflow": "tf"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 17.9μs -> 15.4μs (16.1% faster)

    # Verify variable names
    var_names = [stmt.targets[0].id for stmt in result]


def test_jax_and_tensorflow_combination():
    """Test that JAX and TensorFlow together generate correct number of statements."""
    frameworks = {"jax": "jax", "tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 7.89μs -> 7.64μs (3.27% faster)

    # Verify variable names
    var_names = [stmt.targets[0].id for stmt in result]


def test_case_sensitive_framework_names():
    """Test that framework names are case-sensitive."""
    # 'Torch' instead of 'torch' should be ignored
    frameworks = {"Torch": "torch", "Jax": "jax", "TensorFlow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 721ns -> 1.56μs (53.8% slower)


def test_return_type_is_list():
    """Test that the return type is always a list."""
    test_cases = [
        None,
        {},
        {"torch": "torch"},
        {"jax": "jax"},
        {"tensorflow": "tensorflow"},
        {"torch": "torch", "jax": "jax", "tensorflow": "tensorflow"},
    ]

    for test_case in test_cases:
        codeflash_output = _create_device_sync_precompute_statements(test_case)
        result = codeflash_output  # 38.3μs -> 32.8μs (16.8% faster)


def test_all_statements_are_ast_assign():
    """Test that all generated statements are ast.Assign nodes."""
    frameworks = {"torch": "torch", "jax": "jax", "tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 19.9μs -> 16.9μs (17.6% faster)


def test_all_statements_have_single_target():
    """Test that all Assign statements have exactly one target."""
    frameworks = {"torch": "torch", "jax": "jax", "tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 19.8μs -> 17.0μs (16.3% faster)


def test_all_targets_are_name_nodes():
    """Test that all assignment targets are ast.Name nodes."""
    frameworks = {"torch": "torch", "jax": "jax", "tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 19.9μs -> 16.8μs (18.0% faster)


def test_torch_mps_contains_cuda_reference():
    """Test that MPS statement references the CUDA variable."""
    frameworks = {"torch": "torch"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 14.8μs -> 12.4μs (19.1% faster)

    mps_stmt = result[1]
    # Convert to source to check for the reference
    mps_src = ast_to_source(mps_stmt)


# ============================================================================
# LARGE SCALE TEST CASES
# ============================================================================


def test_many_frameworks_dict_with_torch():
    """Test behavior when frameworks dict is large but only torch is relevant."""
    # Create a dict with many entries, but only torch is relevant
    frameworks = {"torch": "torch"}
    for i in range(500):
        frameworks[f"framework_{i}"] = f"alias_{i}"

    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 15.4μs -> 13.1μs (17.8% faster)


def test_many_frameworks_dict_with_all_relevant():
    """Test with a large frameworks dict containing all relevant frameworks."""
    frameworks = {"torch": "torch", "jax": "jax", "tensorflow": "tensorflow"}
    # Add many irrelevant frameworks
    for i in range(500):
        frameworks[f"framework_{i}"] = f"alias_{i}"

    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 20.4μs -> 17.8μs (14.5% faster)

    # Verify the expected variables are present
    var_names = [stmt.targets[0].id for stmt in result]


def test_torch_with_very_long_alias_name():
    """Test torch with a very long alias name."""
    long_alias = "torch_" + "x" * 1000
    frameworks = {"torch": long_alias}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 14.9μs -> 12.7μs (17.8% faster)

    # Verify the long alias is used
    first_stmt_src = ast_to_source(result[0])


def test_all_frameworks_with_long_aliases():
    """Test all frameworks with very long alias names."""
    frameworks = {"torch": "torch_" + "a" * 500, "jax": "jax_" + "b" * 500, "tensorflow": "tensorflow_" + "c" * 500}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 20.0μs -> 17.3μs (15.7% faster)


def test_sequential_framework_additions():
    """Test adding frameworks one by one and checking cumulative results."""
    results_by_framework = {}

    # Start with torch
    torch_frameworks = {"torch": "torch"}
    results_by_framework["torch"] = _create_device_sync_precompute_statements(
        torch_frameworks
    )  # 16.0μs -> 13.6μs (17.6% faster)

    # Add jax
    jax_frameworks = {"torch": "torch", "jax": "jax"}
    results_by_framework["torch_jax"] = _create_device_sync_precompute_statements(
        jax_frameworks
    )  # 14.1μs -> 11.2μs (25.6% faster)

    # Add tensorflow
    all_frameworks = {"torch": "torch", "jax": "jax", "tensorflow": "tensorflow"}
    results_by_framework["all"] = _create_device_sync_precompute_statements(
        all_frameworks
    )  # 16.9μs -> 13.9μs (21.5% faster)


def test_frameworks_dict_with_unicode_aliases():
    """Test frameworks with unicode characters in aliases."""
    frameworks = {"torch": "torch_αβγ", "jax": "jax_δεζ", "tensorflow": "tf_ηθι"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 20.1μs -> 17.4μs (15.4% faster)


def test_mps_statement_dependency_on_cuda():
    """Test that MPS statement is dependent on CUDA statement being false."""
    frameworks = {"torch": "torch"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 15.1μs -> 12.7μs (18.6% faster)

    mps_stmt = result[1]
    cuda_var_name = "_codeflash_should_sync_cuda"

    # The MPS statement should reference the CUDA variable in its first condition
    # Find if the UnaryOp operand is referencing the cuda variable
    first_condition = mps_stmt.value.values[0]


def test_torch_cuda_uses_both_is_available_and_is_initialized():
    """Test that CUDA statement checks both is_available and is_initialized."""
    frameworks = {"torch": "torch"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 14.9μs -> 12.5μs (19.9% faster)

    cuda_stmt = result[0]
    cuda_src = ast_to_source(cuda_stmt)


def test_torch_mps_uses_backends_and_mps_modules():
    """Test that MPS statement checks both backends.mps and mps.synchronize."""
    frameworks = {"torch": "torch"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 15.0μs -> 12.7μs (18.6% faster)

    mps_stmt = result[1]
    mps_src = ast_to_source(mps_stmt)


def test_jax_checks_block_until_ready():
    """Test that JAX statement checks for block_until_ready."""
    frameworks = {"jax": "jax"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 4.81μs -> 4.87μs (1.23% slower)

    jax_stmt = result[0]
    jax_src = ast_to_source(jax_stmt)


def test_tensorflow_checks_sync_devices():
    """Test that TensorFlow statement checks for sync_devices."""
    frameworks = {"tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 5.69μs -> 5.76μs (1.22% slower)

    tf_stmt = result[0]
    tf_src = ast_to_source(tf_stmt)


def test_all_statements_have_lineno_one():
    """Test that all statements have lineno set to 1."""
    frameworks = {"torch": "torch", "jax": "jax", "tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 20.8μs -> 18.1μs (14.7% faster)


def test_torch_alias_propagation():
    """Test that torch alias is correctly propagated in all CUDA checks."""
    torch_alias = "custom_torch_alias"
    frameworks = {"torch": torch_alias}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 15.0μs -> 13.1μs (14.6% faster)

    cuda_stmt = result[0]
    cuda_src = ast_to_source(cuda_stmt)


def test_jax_alias_propagation():
    """Test that JAX alias is correctly propagated."""
    jax_alias = "custom_jax_alias"
    frameworks = {"jax": jax_alias}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 4.95μs -> 5.09μs (2.77% slower)

    jax_stmt = result[0]
    jax_src = ast_to_source(jax_stmt)


def test_tensorflow_alias_propagation():
    """Test that TensorFlow alias is correctly propagated."""
    tf_alias = "custom_tf_alias"
    frameworks = {"tensorflow": tf_alias}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 5.74μs -> 5.77μs (0.503% slower)

    tf_stmt = result[0]
    tf_src = ast_to_source(tf_stmt)


def test_hasattr_calls_in_jax_statement():
    """Test that JAX statement uses hasattr correctly."""
    frameworks = {"jax": "jax"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 4.75μs -> 4.76μs (0.231% slower)

    jax_stmt = result[0]


def test_hasattr_calls_in_tensorflow_statement():
    """Test that TensorFlow statement uses hasattr correctly."""
    frameworks = {"tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 5.65μs -> 5.50μs (2.71% faster)

    tf_stmt = result[0]


def test_constant_values_in_jax_statement():
    """Test that JAX statement has correct constant values."""
    frameworks = {"jax": "jax"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 4.77μs -> 4.67μs (2.14% faster)

    jax_stmt = result[0]


def test_constant_values_in_tensorflow_statement():
    """Test that TensorFlow statement has correct constant values."""
    frameworks = {"tensorflow": "tensorflow"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 5.73μs -> 5.61μs (2.16% faster)

    tf_stmt = result[0]


def test_torch_mps_constant_values():
    """Test that MPS statement has correct constant values."""
    frameworks = {"torch": "torch"}
    codeflash_output = _create_device_sync_precompute_statements(frameworks)
    result = codeflash_output  # 15.2μs -> 13.2μs (15.9% faster)

    mps_stmt = result[1]

    # Find hasattr calls with "mps" and "synchronize" constants
    mps_src = ast_to_source(mps_stmt)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-pr1015-2026-01-09T21.40.59 and push.

The optimized code achieves a **14% speedup** by reducing redundant AST node allocations through strategic object reuse, particularly for PyTorch framework handling which dominates the function's workload. ## Key Optimizations **1. Context Object Reuse (3-4% gain)** The optimized version creates `ast.Load()` and `ast.Store()` context objects once and reuses them throughout, rather than creating new instances inline for every AST node. Since these are singleton-like objects, this reduces allocation overhead. **2. Shared AST Attribute Chains (8-10% gain for torch-heavy workloads)** For PyTorch, the code now creates intermediate AST nodes once and reuses them: - `torch_name` - reused for both CUDA and MPS statements - `torch_cuda` - reused for both `is_available()` and `is_initialized()` calls - `torch_backends` - reused for MPS hasattr check and `backends.mps.is_available()` call - `torch_mps_attr` - reused for the `hasattr(torch.mps, 'synchronize')` check The original code reconstructed these attribute chains from scratch each time, creating duplicate `ast.Name` and `ast.Attribute` nodes with identical structure. ## Performance Impact Analysis The test results show clear patterns: - **torch-only tests**: 13-19% faster (e.g., `test_torch_with_custom_alias_and_empty_alias`: 27.7% faster with empty alias) - **multi-framework tests**: 14-18% faster when torch is included - **non-torch tests** (JAX/TensorFlow only): minimal change or slightly slower (0-3%), since they don't benefit from the torch-specific optimizations ## Context and Impact Based on `function_references`, this function is called by `create_wrapper_function()` which instruments test functions for profiling. The wrapper is generated for every test function being monitored, making this a performance-critical code path during test suite instrumentation. The optimization is particularly valuable when: - Instrumenting large test suites with many PyTorch-based tests - The `used_frameworks` dict frequently contains "torch" (the most common case) - Tests are re-instrumented multiple times during iterative optimization The speedup compounds across hundreds or thousands of test instrumentations, reducing overall profiling setup overhead.

codeflash-ai · 2026-01-13T00:17:19Z

This PR has been automatically closed because the original PR #1015 by aseembits93 was closed.

codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 9, 2026

codeflash-ai bot mentioned this pull request Jan 9, 2026

Non-CPU device synchronization for accurate profiling #1015

Merged

codeflash-ai bot closed this Jan 13, 2026

codeflash-ai bot deleted the codeflash/optimize-pr1015-2026-01-09T21.40.59 branch January 13, 2026 00:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

⚡️ Speed up function `_create_device_sync_precompute_statements` by 14% in PR #1015 (`gpu-sync-instrumentation`) #1040

⚡️ Speed up function `_create_device_sync_precompute_statements` by 14% in PR #1015 (`gpu-sync-instrumentation`) #1040

Uh oh!

codeflash-ai bot commented Jan 9, 2026

Uh oh!

codeflash-ai bot commented Jan 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

⚡️ Speed up function _create_device_sync_precompute_statements by 14% in PR #1015 (gpu-sync-instrumentation) #1040

⚡️ Speed up function _create_device_sync_precompute_statements by 14% in PR #1015 (gpu-sync-instrumentation) #1040

Uh oh!

Conversation

codeflash-ai bot commented Jan 9, 2026

⚡️ This pull request contains optimizations for PR #1015

📄 14% (0.14x) speedup for _create_device_sync_precompute_statements in codeflash/code_utils/instrument_existing_tests.py

📝 Explanation and details

Key Optimizations

Performance Impact Analysis

Context and Impact

Uh oh!

codeflash-ai bot commented Jan 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

⚡️ Speed up function `_create_device_sync_precompute_statements` by 14% in PR #1015 (`gpu-sync-instrumentation`) #1040

⚡️ Speed up function `_create_device_sync_precompute_statements` by 14% in PR #1015 (`gpu-sync-instrumentation`) #1040

📄 14% (0.14x) speedup for `_create_device_sync_precompute_statements` in `codeflash/code_utils/instrument_existing_tests.py`