Skip to content

FIX: Release GIL during blocking ODBC connect/disconnect calls#497

Open
saurabh500 wants to merge 5 commits intomainfrom
dev/saurabh/gilduringconnect
Open

FIX: Release GIL during blocking ODBC connect/disconnect calls#497
saurabh500 wants to merge 5 commits intomainfrom
dev/saurabh/gilduringconnect

Conversation

@saurabh500
Copy link
Copy Markdown
Contributor

@saurabh500 saurabh500 commented Apr 6, 2026

Summary

The Python GIL is held during the entire ODBC connection establishment (SQLDriverConnect) and teardown (SQLDisconnect) calls. These are blocking I/O operations involving DNS resolution, TCP handshake, TLS negotiation, and SQL Server authentication that can take hundreds of milliseconds to seconds. While the GIL is held, all other Python threads are blocked, even those doing unrelated work.

This causes severe throughput degradation in multi-threaded applications that establish connections concurrently.

Relates to #491

Benchmark Results

A/B benchmark with 10 threads concurrently establishing connections (pooling disabled, SQL Server 2025 on localhost):

Metric Before (GIL held) After (GIL released)
Serial single connection 85.1 ms 86.2 ms
10 threads concurrent (wall-clock) 862.9 ms 150.1 ms
Speedup vs serial estimate 0.99× (fully serialized) 5.74× (parallel I/O)

Before this change, 10 concurrent threads take the same time as 10 serial connections — the GIL completely serializes them. After, threads overlap their I/O and achieve ~6× speedup.

Raw benchmark output (click to expand)

Before (GIL held during ODBC calls):

--- Serial Baseline (5 rounds) ---
  Individual: [86.0ms, 89.8ms, 82.9ms, 85.0ms, 85.1ms]
  Median:     85.1 ms

--- Concurrent (10 threads) ---
  Trial 1: wall=862.9ms  per-thread=[257, 172, 86, 255, 428, 173, 341, 172, 88, 173]ms
  Trial 2: wall=873.5ms  per-thread=[90, 350, 599, 354, 433, 187, 269, 101, 173, 427]ms
  Trial 3: wall=852.2ms  per-thread=[337, 91, 258, 170, 89, 253, 338, 172, 254, 848]ms

  Speedup: 0.99x ❌

After (GIL released during ODBC calls):

--- Serial Baseline (5 rounds) ---
  Individual: [89.1ms, 85.7ms, 86.2ms, 85.6ms, 87.7ms]
  Median:     86.2 ms

--- Concurrent (10 threads) ---
  Trial 1: wall=150.1ms  per-thread=[91, 98, 137, 134, 133, 109, 137, 143, 136, 117]ms
  Trial 2: wall=148.2ms  per-thread=[105, 112, 131, 108, 118, 110, 111, 111, 142, 108]ms
  Trial 3: wall=184.6ms  per-thread=[87, 92, 97, 180, 91, 178, 179, 92, 175, 181]ms

  Speedup: 5.74x ✅

Solution

1. GIL Release in connection.cpp

  • Connection::connect(): Wrap SQLDriverConnect_ptr in a py::gil_scoped_release block. All py::dict access (applyAttrsBefore) completes before the release scope, and checkError() runs after GIL is re-acquired.
  • Connection::disconnect(): Conditionally release GIL around SQLDisconnect_ptr, guarded by PyGILState_Check() to safely handle destructor/shutdown paths where the GIL may not be held.
  • Connection::reset(): Removed internal disconnect() calls from failure paths. The pool caller already adds failed connections to to_disconnect for cleanup outside the mutex, so the internal disconnect was both redundant and problematic (it would release the GIL while the pool mutex was held).

2. Pool Lock Restructuring in connection_pool.cpp

Releasing the GIL inside ODBC calls while holding C++ mutexes creates a mutex/GIL lock-ordering deadlock. The pool changes are a necessary prerequisite for the GIL release to be safe.

Why the pool changes are required

Without restructuring, we get this deadlock:

Thread A                              Thread B
────────                              ────────
pool.acquire()
  lock(_mutex)           ✅
  connect()
    gil_scoped_release   → GIL freed
    SQLDriverConnect...  (blocking)
                                      Acquires GIL           ✅
                                      pool.acquire()
                                        lock(_mutex)         ❌ BLOCKED (Thread A holds it)
    ...SQLDriverConnect done
    gil_scoped_release   → tries to
      destructor            re-acquire GIL  ❌ BLOCKED (Thread B holds it)

💀 DEADLOCK: Thread A has mutex, needs GIL
             Thread B has GIL, needs mutex

How the fix works — ConnectionPool::acquire()

The solution separates mutex-protected bookkeeping from blocking ODBC calls:

Thread A                              Thread B
────────                              ────────
pool.acquire()                        pool.acquire()
  lock(_mutex)           ✅
    reserve slot (++_current_size)
    create Connection object
  unlock(_mutex)         → released
                                        lock(_mutex)         ✅
                                          reserve slot
                                          create Connection
                                        unlock(_mutex)       → released
  connect()                             connect()
    gil_scoped_release   → GIL freed      gil_scoped_release → GIL freed
    SQLDriverConnect...  (parallel)       SQLDriverConnect... (parallel)
    ...done              → GIL back      ...done             → GIL back

✅ Both threads connect in parallel — no deadlock

If connect() fails, the reserved slot is rolled back:

catch (...) {
    std::lock_guard<std::mutex> lock(_mutex);
    if (_current_size > 0) --_current_size;
    throw;
}

The same pattern applies to all pool/manager methods

Method Before (❌ deadlock risk) After (✅ safe)
Pool::acquire() connect() inside _mutex Reserve slot inside, connect() outside
Pool::release() disconnect() inside _mutex Decide inside, disconnect() outside
Manager::acquireConnection() pool->acquire() inside _manager_mutex Lookup pool inside, acquire() outside
Manager::returnConnection() pool->release() inside _manager_mutex Lookup pool inside, release() outside
Manager::closePools() pool->close() inside _manager_mutex Collect pools inside, close() outside

The invariant is: never hold a C++ mutex across a scope that releases the GIL.

Testing

Added test_021_concurrent_connection_perf.py with three @pytest.mark.stress tests:

  1. test_concurrent_connection_gil_release — 10 threads establish connections concurrently (pooling disabled); asserts >2× speedup over serial baseline, proving I/O parallelism.
  2. test_concurrent_disconnect_gil_release — 10 threads disconnect concurrently; asserts >1.5× speedup.
  3. test_mixed_connect_disconnect_under_load — I/O threads (connect/disconnect) run alongside CPU-bound Python threads for 5 seconds; verifies no starvation or deadlocks.

Added test_pool_release_overflow_disconnects_outside_mutex in test_009_pooling.py to cover the pool release overflow path.

Added benchmarks/bench_concurrent_connect.py — standalone A/B benchmark script for reproducing the results above.

Safety

  • LOG() macro acquires GIL internally via py::gil_scoped_acquire — safe to call from any context
  • checkError() is pure C++ (throws std::runtime_error) — caught by pybind11 at the boundary where GIL is held
  • PyGILState_Check() guard in disconnect() handles destructor paths safely
  • All pool data structures remain mutex-protected; only blocking ODBC calls moved outside lock scopes

Release the Python GIL during SQLDriverConnect and SQLDisconnect calls
so that other Python threads can run concurrently during connection I/O
(DNS resolution, TCP handshake, TLS negotiation, SQL Server auth).

Changes:
- connection.cpp: Wrap SQLDriverConnect_ptr in py::gil_scoped_release
- connection.cpp: Conditionally release GIL around SQLDisconnect_ptr
  (guarded by PyGILState_Check for destructor safety)
- connection.cpp: Remove disconnect() from reset() failure paths to
  avoid GIL release inside pool mutex; callers handle cleanup
- connection_pool.cpp: Move connect() outside pool mutex in acquire()
  (reserve slot inside mutex, connect outside, rollback on failure)
- connection_pool.cpp: Move disconnect() outside mutex in release()
- connection_pool.cpp: Release _manager_mutex before calling pool
  acquire/release/close to prevent mutex/GIL lock-ordering deadlocks
- Add test_021_concurrent_connection_perf.py with stress tests proving
  concurrent connection parallelism and no GIL starvation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

📊 Code Coverage Report

🔥 Diff Coverage

82%


🎯 Overall Coverage

78%


📈 Total Lines Covered: 6657 out of 8446
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/connection/connection.cpp (78.9%): Missing lines 160-161,167-168
  • mssql_python/pybind/connection/connection_pool.cpp (83.7%): Missing lines 107-108,113-118

Summary

  • Total: 68 lines
  • Missing: 12 lines
  • Coverage: 82%

mssql_python/pybind/connection/connection.cpp

Lines 156-165

  156             py::gil_scoped_release release;
  157             ret = SQLDisconnect_ptr(_dbcHandle->get());
  158         } else {
  159             // Destructor / shutdown path — GIL is not held, call directly.
! 160             ret = SQLDisconnect_ptr(_dbcHandle->get());
! 161         }
  162         // In destructor/shutdown paths, suppress errors to avoid
  163         // std::terminate() if this throws during stack unwinding.
  164         if (hasGil) {
  165             checkError(ret);

Lines 163-172

  163         // std::terminate() if this throws during stack unwinding.
  164         if (hasGil) {
  165             checkError(ret);
  166         } else if (!SQL_SUCCEEDED(ret)) {
! 167             LOG("SQLDisconnect failed in destructor/shutdown path; ignoring error");
! 168         }
  169         // triggers SQLFreeHandle via destructor, if last owner
  170         _dbcHandle.reset();
  171     } else {
  172         LOG("No connection handle to disconnect");

mssql_python/pybind/connection/connection_pool.cpp

Lines 103-122

  103         if (_pool.size() < _max_size) {
  104             conn->updateLastUsed();
  105             _pool.push_back(conn);
  106         } else {
! 107             should_disconnect = true;
! 108         }
  109     }
  110     // Disconnect outside the mutex to avoid holding it during the
  111     // blocking ODBC call (which releases the GIL).
  112     if (should_disconnect) {
! 113         try {
! 114             conn->disconnect();
! 115         } catch (const std::exception& ex) {
! 116             LOG("ConnectionPool::release: disconnect failed: %s", ex.what());
! 117         }
! 118         std::lock_guard<std::mutex> lock(_mutex);
  119         if (_current_size > 0)
  120             --_current_size;
  121     }
  122 }


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 73.9%
mssql_python.pybind.connection.connection.cpp: 75.6%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

saurabh500 and others added 2 commits April 6, 2026 14:57
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add test_pool_release_overflow_disconnects_outside_mutex to exercise the
ConnectionPool::release() overflow path where a connection is returned to
a pool that is already at max_size. This triggers the should_disconnect
branch that disconnects outside the mutex.

The disconnect-without-GIL path (connection.cpp lines 160-161) is a C++
destructor safety guard only reachable during interpreter shutdown when
the last shared_ptr drops without the GIL held. This cannot be reliably
exercised from a Python test since __del__/GC always holds the GIL.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@saurabh500 saurabh500 changed the title Release GIL during blocking ODBC connect/disconnect calls FIX: Release GIL during blocking ODBC connect/disconnect calls Apr 6, 2026
@github-actions github-actions bot added pr-size: medium Moderate update size pr-size: large Substantial code update and removed pr-size: medium Moderate update size labels Apr 6, 2026
@saurabh500 saurabh500 force-pushed the dev/saurabh/gilduringconnect branch from 004d32d to 83995e4 Compare April 6, 2026 23:45
@saurabh500 saurabh500 marked this pull request as ready for review April 6, 2026 23:46
Copilot AI review requested due to automatic review settings April 6, 2026 23:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves multi-threaded throughput by releasing the Python GIL around the blocking ODBC connect (SQLDriverConnect) and disconnect (SQLDisconnect) calls, and restructures connection pooling/manager locking to avoid mutex↔GIL deadlocks during those operations.

Changes:

  • Release the GIL during Connection::connect() and (conditionally) during Connection::disconnect().
  • Restructure ConnectionPool / ConnectionPoolManager to avoid holding C++ mutexes across connect/disconnect operations.
  • Add new stress/performance tests for concurrent connect/disconnect behavior, plus a pooling regression test.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
mssql_python/pybind/connection/connection.cpp Releases GIL during ODBC connect/disconnect; adjusts reset failure handling.
mssql_python/pybind/connection/connection_pool.cpp Moves connect/disconnect work outside pool/manager mutexes to prevent GIL/mutex deadlocks.
tests/test_021_concurrent_connection_perf.py Adds stress tests to validate parallelism and absence of starvation/deadlocks under load.
tests/test_009_pooling.py Adds a test intended to cover pool release overflow/disconnect behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

saurabh500 and others added 2 commits April 6, 2026 23:59
- connection.cpp: Suppress checkError() in destructor/shutdown path
  to avoid std::terminate() during stack unwinding
- connection_pool.cpp release(): Move _current_size decrement after
  successful disconnect to keep pool accounting consistent on failure
- connection_pool.cpp closePools(): Keep _manager_mutex held during
  close to prevent races with new pool creation
- test_009_pooling.py: Fix overflow test to use max_size=2 for
  acquisition then shrink to max_size=1 before release, matching
  actual pool semantics where max_size limits concurrent connections

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Disconnect on localhost is sub-millisecond (~0.3ms), so thread scheduling
overhead dominates and speedup ratios are meaningless. Changed the test
to verify all concurrent disconnects complete without errors or deadlocks
instead of asserting a speedup threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions bot added pr-size: medium Moderate update size and removed pr-size: large Substantial code update labels Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: medium Moderate update size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants