Crashes when submitting a large number of jobs #2

@szs8

I am running into crashes when submitting a large number of jobs (tens of thousands) from the Python bindings.

The submit code looks like this:

import htcondor

schedd = htcondor.Schedd()
for i in some_list:
    j = build_job_dict(i)  # build the job description for item i
    schedd.submit(j)       # one submit call per job

Here is the output with debugging turned on. Lines starting with "Processing .." are output from my code.

Tue Feb  2 16:13:58 2016 Processing A
02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
02/02/16 16:15:18 IO: Failed to read packet header
02/02/16 16:15:18 SECMAN: no classad from server, failing
02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad message.
Can't send RESCHEDULE command to schedd.
Tue Feb  2 16:16:46 2016 Processing B
02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
02/02/16 16:18:43 IO: Failed to read packet header
02/02/16 16:18:43 SECMAN: no classad from server, failing
02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad message.
Can't send RESCHEDULE command to schedd.
Tue Feb  2 16:20:13 2016 Processing C
02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
02/02/16 16:22:10 IO: Failed to read packet header
02/02/16 16:22:10 SECMAN: no classad from server, failing
02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad message.
Can't send RESCHEDULE command to schedd.
02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
02/02/16 16:22:10 Buf::write(): condor_write() failed
terminate called after throwing an instance of 'boost::python::error_already_set'
Aborted
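
The delay between each "Processing" line and the timeout that follows it can be read off the output above. A minimal sketch of that measurement, using only the standard library: the names `LOG` and `stall_seconds` are made up for illustration, the excerpt is copied from the output above, and the debug timestamps are assumed to be in MM/DD/YY order.

```python
from datetime import datetime

# Excerpt of the debug output above: each "Processing" line from the
# submit script, followed by the first timeout line after it.
LOG = """\
Tue Feb  2 16:13:58 2016 Processing A
02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
Tue Feb  2 16:16:46 2016 Processing B
02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
Tue Feb  2 16:20:13 2016 Processing C
02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
"""


def stall_seconds(log_text):
    """Return (item, seconds) pairs: the time from each "Processing"
    line to the next condor_read() timeout line."""
    stalls = []
    started = None  # (item, datetime) of the most recent "Processing" line
    for line in log_text.splitlines():
        if " Processing " in line:
            stamp, item = line.split(" Processing ", 1)
            # Collapse the double space in "Feb  2" before parsing.
            stamp = " ".join(stamp.split())
            when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
            started = (item.strip(), when)
        elif "condor_read(): timeout" in line and started is not None:
            # Assumption: debug lines begin with "MM/DD/YY HH:MM:SS".
            when = datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
            delta = when - started[1]
            stalls.append((started[0], int(delta.total_seconds())))
            started = None
    return stalls


if __name__ == "__main__":
    for item, seconds in stall_seconds(LOG):
        print(item, seconds)
```

On this excerpt the gaps are 80 seconds for A and 117 seconds for B and C.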

My initial suspicion was that I was running many jobs that finished very quickly and thrashed the schedd process. However, after killing all my workers and doing nothing but queueing jobs, I got the same error. This is not a one-off occurrence; it happens fairly deterministically.

Any idea what is going on?

Both HTCondor and the Python bindings are version 8.4.3.

Installed Packages
Name : condor-python
Arch : x86_64
Version : 8.4.3
Release : 1.el7
Size : 4.8 M
Repo : installed
From repo : htcondor-stable
Summary : Python bindings for HTCondor.
URL : http://www.cs.wisc.edu/condor/
License : ASL 2.0
Description : The python bindings allow one to directly invoke the C++ implementations of
: the ClassAd library and HTCondor from python
