@codeflash-ai codeflash-ai bot commented Dec 23, 2025

📄 16,072% (160.72x) speedup for find_last_node in src/algorithms/graph.py

⏱️ Runtime : 91.0 milliseconds → 563 microseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a 160x speedup by eliminating a nested loop that created O(n*m) complexity where n = number of nodes and m = number of edges.

Key Optimization

Original approach: For each node, iterate through ALL edges to check if that node is a source

  • For each node, scans the edge list until it finds a match (up to m edges per node): all(e["source"] != n["id"] for e in edges), where n in this expression is the current node dict
  • Time complexity: O(n * m)

Optimized approach: Pre-compute all source node IDs once into a set, then do O(1) lookups

  • Build a set of all source IDs: sources = {e["source"] for e in edges} - O(m)
  • Check each node against the set: n["id"] not in sources - O(1) per node
  • Time complexity: O(n + m)
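
The two approaches above can be sketched side by side. This is a reconstruction from the expressions quoted in this comment, not the actual contents of src/algorithms/graph.py: the function names `find_last_node_nested` and `find_last_node_set` are hypothetical, and wrapping the generator in `next(..., None)` is an assumption based on the tests, which expect the first non-source node or None.

```python
from typing import Any, Dict, List, Optional

Node = Dict[str, Any]
Edge = Dict[str, Any]


def find_last_node_nested(nodes: List[Node], edges: List[Edge]) -> Optional[Node]:
    # Assumed shape of the original: for each node, scan the edges
    # until one names it as a source. Worst case O(n * m).
    return next(
        (n for n in nodes if all(e["source"] != n["id"] for e in edges)),
        None,
    )


def find_last_node_set(nodes: List[Node], edges: List[Edge]) -> Optional[Node]:
    # Assumed shape of the optimized version: collect every source ID
    # once (O(m)), then test each node with a hash lookup (O(1) average).
    sources = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in sources), None)
```

One behavioral difference worth keeping in mind: the set-based version needs every `source` value to be hashable, while the nested version only needs `!=` to work. The generated tests use strings, integers, and tuples, all of which are hashable.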

Why This Is Faster

  1. Set lookup is O(1) on average vs iterating through all edges which is O(m)
  2. Single pass through edges instead of scanning them repeatedly for each node
  3. Hash-based membership testing (in operator on sets) is dramatically faster than list iteration
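
To make the difference concrete, the sketch below counts the work each approach does instead of timing it. The helper names `count_nested_checks` and `count_set_checks` are hypothetical, and the early exits mirror an assumed `next(... all(...) ...)` structure rather than the confirmed source.

```python
def count_nested_checks(nodes, edges):
    """Count edge comparisons made by a nested all(...) scan."""
    checks = 0
    for node in nodes:
        is_source = False
        for edge in edges:
            checks += 1
            if edge["source"] == node["id"]:
                is_source = True
                break  # all(...) short-circuits at the first match
        if not is_source:
            break  # the first non-source node is the answer
    return checks


def count_set_checks(nodes, edges):
    """Count edge reads plus set lookups for the set-based approach."""
    checks = len(edges)  # one read per edge to build the set
    sources = {e["source"] for e in edges}
    for node in nodes:
        checks += 1  # one hash lookup per node
        if node["id"] not in sources:
            break
    return checks
```

For the 1,000-node linear chain used in test_large_linear_chain, these return 500,499 comparisons for the nested scan and 1,999 operations for the set-based version. The counts ignore constant factors such as hashing cost, so they indicate scaling rather than predict the measured ratio.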

Performance Impact by Test Case

The optimization shines particularly well with:

  • Large graphs with many edges (linear chains: 33,000% faster, complete graphs: 8,857% faster)
  • Graphs where the last node appears late in the nodes list (forces the original code to check many nodes)
  • Small graphs show speedups of roughly 25-110%, while graphs with 500+ nodes/edges show gains in the thousands of percent, because the original cost grows with n * m and the optimized cost with n + m

Even on tiny graphs (1-4 nodes), the optimization provides 25-110% speedups. The only regressions are on empty node lists (5-22% slower, with total runtimes under 1μs), where building the set is pure overhead.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 41 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from src.algorithms.graph import find_last_node

# unit tests

# 1. Basic Test Cases


def test_single_node_no_edges():
    # Only one node, no edges. Should return the node itself.
    nodes = [{"id": "A"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.21μs -> 958ns (26.1% faster)


def test_two_nodes_one_edge():
    # Two nodes, one edge from A to B. Only B is not a source.
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.96μs -> 1.17μs (67.8% faster)


def test_three_nodes_chain():
    # A -> B -> C. Only C is not a source.
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.42μs -> 1.21μs (99.9% faster)


def test_multiple_possible_last_nodes():
    # A->B, C->D. B and D are both possible last nodes, but function returns the first one found.
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}, {"id": "D"}]
    edges = [{"source": "A", "target": "B"}, {"source": "C", "target": "D"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 1.21μs (58.7% faster)


def test_cycle_graph():
    # A->B->C->A (cycle). All nodes are sources, so should return None.
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "B", "target": "C"},
        {"source": "C", "target": "A"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.42μs -> 1.29μs (87.2% faster)


# 2. Edge Test Cases


def test_empty_nodes_and_edges():
    # No nodes, no edges. Should return None.
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 791ns -> 875ns (9.60% slower)


def test_edges_with_nonexistent_nodes():
    # Edges refer to nodes not present in nodes list. No listed node is a source, so should return the first node (A).
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "X", "target": "A"}, {"source": "Y", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.58μs -> 1.17μs (35.7% faster)


def test_node_with_multiple_incoming_edges():
    # D has multiple incoming edges, but is not a source, so should be last node.
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}, {"id": "D"}]
    edges = [
        {"source": "A", "target": "D"},
        {"source": "B", "target": "D"},
        {"source": "C", "target": "D"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.79μs -> 1.33μs (109% faster)


def test_node_with_self_loop():
    # Node with a self-loop is a source, so should not be last node.
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "A"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 1.08μs (77.0% faster)


def test_all_nodes_are_sources():
    # Every node is a source in at least one edge, so should return None.
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "B", "target": "C"},
        {"source": "C", "target": "A"},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.29μs -> 1.25μs (83.4% faster)


def test_duplicate_edges():
    # Multiple edges with same source/target. Should not affect result.
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}, {"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.96μs -> 1.25μs (56.6% faster)


def test_nodes_with_extra_properties():
    # Nodes have additional properties, should still match by id only.
    nodes = [{"id": "A", "val": 1}, {"id": "B", "val": 2}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 1.12μs (70.4% faster)


def test_edges_with_extra_properties():
    # Edges have extra properties, should not affect result.
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B", "weight": 10}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.12μs (63.0% faster)


def test_node_id_is_int():
    # Node ids are integers, not strings.
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.12μs (62.9% faster)


def test_node_id_is_tuple():
    # Node ids are tuples.
    nodes = [{"id": (1, 2)}, {"id": (3, 4)}]
    edges = [{"source": (1, 2), "target": (3, 4)}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.04μs -> 1.21μs (69.0% faster)


# 3. Large Scale Test Cases


def test_large_linear_chain():
    # Large chain: 0->1->2->...->999. Only node 999 is not a source.
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i + 1} for i in range(N - 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.3ms -> 55.2μs (33094% faster)


def test_large_star_graph():
    # Star: 0->1, 0->2, ..., 0->999. All except 0 are not sources. Should return 1.
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": 0, "target": i} for i in range(1, N)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 37.5μs -> 19.8μs (88.9% faster)


def test_large_disconnected_graph():
    # 500 isolated nodes, 500 in a chain. First 500 are not sources, so should return node 0.
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i + 1} for i in range(500, N - 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 20.5μs -> 13.8μs (48.8% faster)


def test_large_complete_graph():
    # Complete graph: every node is a source. Should return None.
    N = 100
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": j} for i in range(N) for j in range(N) if i != j]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 17.0ms -> 190μs (8857% faster)


def test_large_graph_with_multiple_last_nodes():
    # Two separate chains, each of length 500. Both last nodes are not sources, should return first one.
    N = 500
    nodes = [{"id": f"A{i}"} for i in range(N)] + [{"id": f"B{i}"} for i in range(N)]
    edges = [{"source": f"A{i}", "target": f"A{i+1}"} for i in range(N - 1)] + [
        {"source": f"B{i}", "target": f"B{i+1}"} for i in range(N - 1)
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 4.84ms -> 71.0μs (6722% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from src.algorithms.graph import find_last_node

# unit tests

# --- Basic Test Cases ---


def test_single_node_no_edges():
    # A single node with no edges should be returned as the last node
    nodes = [{"id": 1, "name": "A"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.25μs -> 1.00μs (25.0% faster)


def test_two_nodes_one_edge():
    # Node 2 is not a source, so it should be returned
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.96μs -> 1.17μs (67.8% faster)


def test_three_nodes_linear_chain():
    # Only the last node in the chain is not a source
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.42μs -> 1.29μs (87.1% faster)


def test_multiple_last_nodes_returns_first():
    # Both 2 and 3 are not sources; function should return the first such node
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.12μs (66.7% faster)


def test_no_edges_multiple_nodes():
    # All nodes are not sources; should return the first node
    nodes = [{"id": "x"}, {"id": "y"}, {"id": "z"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.25μs -> 1.00μs (25.0% faster)


# --- Edge Test Cases ---


def test_empty_nodes_and_edges():
    # No nodes, no edges: should return None
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 792ns -> 834ns (5.04% slower)


def test_edges_but_no_nodes():
    # Edges but no nodes: should return None
    nodes = []
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 750ns -> 959ns (21.8% slower)


def test_all_nodes_are_sources():
    # All nodes are sources; none should be returned
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.25μs (46.7% faster)


def test_node_with_multiple_incoming_edges():
    # Node 3 has two incoming edges, but is not a source
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 3}, {"source": 2, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.29μs -> 1.25μs (83.3% faster)


def test_node_with_self_loop():
    # Node with a self-loop is a source, so should not be returned
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.12μs (62.9% faster)


def test_node_with_nonexistent_source():
    # Edge with a source not in nodes; should ignore and return node 1
    nodes = [{"id": 1}]
    edges = [{"source": 2, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.50μs -> 1.08μs (38.4% faster)


def test_nodes_with_non_integer_ids():
    # IDs are strings
    nodes = [{"id": "foo"}, {"id": "bar"}]
    edges = [{"source": "foo", "target": "bar"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 1.17μs (64.2% faster)


def test_nodes_with_tuple_ids():
    # IDs are tuples (hashable, so they work as set members)
    nodes = [{"id": (1, 2)}, {"id": (3, 4)}]
    edges = [{"source": (1, 2), "target": (3, 4)}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.00μs -> 1.21μs (65.6% faster)


def test_edges_with_extra_keys():
    # Edges have extra keys, should be ignored
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2, "weight": 5}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.12μs (66.7% faster)


def test_nodes_with_extra_keys():
    # Nodes have extra keys, should be preserved in output
    nodes = [{"id": 1, "meta": "x"}, {"id": 2, "meta": "y"}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.12μs (62.9% faster)


def test_duplicate_nodes():
    # Duplicate node definitions; should return the first non-source node
    nodes = [{"id": 1}, {"id": 2}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.79μs -> 1.12μs (59.3% faster)


# --- Large Scale Test Cases ---


def test_large_linear_chain():
    # Large chain of 1000 nodes, each points to the next
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i + 1} for i in range(N - 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.3ms -> 55.5μs (32931% faster)


def test_large_tree_with_multiple_leaves():
    # Binary tree with 10 levels (1023 nodes), all leaves are not sources
    def build_binary_tree(levels):
        nodes = [{"id": i} for i in range(2**levels - 1)]
        edges = []
        for i in range(2 ** (levels - 1) - 1):
            edges.append({"source": i, "target": 2 * i + 1})
            edges.append({"source": i, "target": 2 * i + 2})
        return nodes, edges

    nodes, edges = build_binary_tree(10)
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 9.47ms -> 38.4μs (24546% faster)


def test_large_graph_all_sources():
    # All nodes are sources, so result should be None
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": (i + 1) % N} for i in range(N)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.3ms -> 54.6μs (33478% faster)


def test_large_graph_no_edges():
    # All nodes are not sources; should return the first node
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.42μs -> 1.04μs (36.0% faster)


def test_large_graph_multiple_last_nodes():
    # First 500 nodes are sources, last 500 are not
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i + 1} for i in range(500)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 4.54ms -> 27.9μs (16170% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
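
The generated tests above record outputs for comparison rather than asserting expected values. A minimal sketch of the same idea as a self-contained check is shown below: it compares a nested-scan reference against a set-based candidate on seeded random graphs. Both implementations are reconstructions from the expressions quoted in this comment, and every name here (`reference_impl`, `candidate_impl`, `random_graph`, `check_equivalence`) is hypothetical.

```python
import random


def reference_impl(nodes, edges):
    # Assumed original behavior: first node that is never an edge source.
    return next(
        (n for n in nodes if all(e["source"] != n["id"] for e in edges)),
        None,
    )


def candidate_impl(nodes, edges):
    # Assumed optimized behavior: same result via a precomputed set.
    sources = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in sources), None)


def random_graph(rng, max_nodes=20, max_edges=40):
    # Integer IDs only; edges always refer to existing nodes.
    n = rng.randint(0, max_nodes)
    nodes = [{"id": i} for i in range(n)]
    edge_count = rng.randint(0, max_edges) if n else 0
    edges = [
        {"source": rng.randrange(n), "target": rng.randrange(n)}
        for _ in range(edge_count)
    ]
    return nodes, edges


def check_equivalence(trials=500, seed=0):
    # Raises AssertionError on the first graph where the outputs differ.
    rng = random.Random(seed)
    for _ in range(trials):
        nodes, edges = random_graph(rng)
        assert reference_impl(nodes, edges) == candidate_impl(nodes, edges)
    return trials
```

Calling `check_equivalence()` returns the number of trials if every comparison passes. It does not exercise unhashable IDs or edges that reference missing nodes, so it complements the listing above rather than replacing it.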

To edit these changes, run `git checkout codeflash/optimize-find_last_node-mjiw54c4` and push.

@codeflash-ai codeflash-ai bot requested a review from KRRT7 December 23, 2025 18:01
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Dec 23, 2025