Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds an optional connection_timeout parameter to PyHive's Hive connection class to prevent connections from hanging indefinitely when issues occur. The timeout is specified in milliseconds and applies to both HTTP and TSocket transport types.
Changes:
- Added
connection_timeoutparameter to theConnection.__init__method signature and documentation - Implemented timeout setting for both HTTP (THttpClient) and TSocket transport paths
- Added integration test to verify connections work with the timeout parameter
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/pyhive/hive.py | Added connection_timeout parameter to Connection class, implemented setTimeout calls for both HTTP and socket transports, and added parameter documentation |
| python/pyhive/tests/test_hive.py | Added integration test to verify connection works with timeout parameter |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7300 +/- ##
======================================
Coverage 0.00% 0.00%
======================================
Files 698 698
Lines 43646 43646
Branches 5894 5894
======================================
Misses 43646 43646 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| auth = 'NONE' | ||
| socket = thrift.transport.TSocket.TSocket(host, port) | ||
| if connection_timeout: | ||
| socket.setTimeout(connection_timeout) |
There was a problem hiding this comment.
FYI this sets both connection and socket timeout. There's a nice description of the difference here.
| socket.setTimeout(connection_timeout) | |
| socket.setConnectTimeout(connection_timeout) |
There was a problem hiding this comment.
That's a good mention. My goal with this was originally to manage the socket timeout. However, I think I will leave this way to not get too granular.
There was a problem hiding this comment.
In Java, a socket can have different connectTimeout, readTimeout, and writeTimeout values. Does Python have an equivalent? Generally, we want a small connectTimeout, and long readTimeout and writeTimeout, so it can fail fast on connecting dead server and failover to health one, if it has multiple server instances
There was a problem hiding this comment.
From my short research, it looks like the short answer is no, but possible with an explicit create_connection At least, python thrift is only using supporting one timeout and not doing this.
https://github.com/apache/thrift/blob/c99d09a231648d72e05a89d80281b38c9d0d1b9a/lib/py/src/transport/TSocket.py#L145-L147
There was a problem hiding this comment.
we can call socket.setTimeout after the socket is connected, right? if so, it's easy to implement connectTimeout, socketTimeout
There was a problem hiding this comment.
okay, I read the code, self._transport.open() happens in Connection __init__, so we can call setTimeout(connect_time) before it, and call setTimeout(socket_time), same for both host/port and thrift_transport cases
There was a problem hiding this comment.
Got it. Thanks for the feedback.
There was a problem hiding this comment.
Pushed up a solution for this, don't love it but should work. We can't guarantee a public API for setting a socket timeout so I believe it's this or reverting to only setting a timeout once, for connection and socket. Let me know what you think
There was a problem hiding this comment.
I'm leaning towards reverting to the connection timeout only, for both connections and socket, plus only on host, port combos for tradeoffs in simplicity. A user with a provided thrift object could fine tune themselves
There was a problem hiding this comment.
Reverting to the previous PR for these reasons. Plus, the pyhive connection wrapper provides polling and ability to run async, which lessons the need for a separate socket timeout. We can always add onto this given user feedback and a user can use a custom thrift transport to fine tune this still
Line 483 in c8a0ecb
Line 524 in c8a0ecb
1055bc5 to
c0e5ba9
Compare
Why are the changes needed?
Adds an optional parameter to pyhive hive connection class init to allow setting the socket connection timeout. By default, thrift socket connections have no timeout which causes connections to hang indefinitely when something goes wrong. This helps us in dbt by allowing for multi-threaded management of connections while continuing to rely on pyhive thrift setup.
Reference thrift lines:
https://github.com/apache/thrift/blob/0d18fb2e97a00fc56997fa059d6819e54cdff64b/lib/py/src/transport/THttpClient.py#L130
https://github.com/apache/thrift/blob/0d18fb2e97a00fc56997fa059d6819e54cdff64b/lib/py/src/transport/TSocket.py#L111
How was this patch tested?
Manually on TSocket path and added unit test
Was this patch authored or co-authored using generative AI tooling?
No