Crashes when submitting a large number of jobs

I am running into issues when submitting lots of jobs (tens of thousands) from the Python bindings.

The submit code looks like this:

``` python
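import htcondor

# some_list and build_job_dict() are defined elsewhere in my code
schedd = htcondor.Schedd()
for i in some_list:
    j = build_job_dict(i)  # job description for item i
    schedd.submit(j)       # one submit call per job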
```

Here is the output with debugging turned on. Lines starting with "Processing .." are output from my code.

```
Tue Feb  2 16:13:58 2016 Processing A
02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
02/02/16 16:15:18 IO: Failed to read packet header
02/02/16 16:15:18 SECMAN: no classad from server, failing
02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad message.
Can't send RESCHEDULE command to schedd.
Tue Feb  2 16:16:46 2016 Processing B
02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
02/02/16 16:18:43 IO: Failed to read packet header
02/02/16 16:18:43 SECMAN: no classad from server, failing
02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad message.
Can't send RESCHEDULE command to schedd.
Tue Feb  2 16:20:13 2016 Processing C
02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from <10.x.xxx.xxx:12731>.
02/02/16 16:22:10 IO: Failed to read packet header
02/02/16 16:22:10 SECMAN: no classad from server, failing
02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad message.
Can't send RESCHEDULE command to schedd.
02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
02/02/16 16:22:10 Buf::write(): condor_write() failed
terminate called after throwing an instance of 'boost::python::error_already_set'
Aborted
```

My initial suspicion was that I was running a lot of jobs that finished very quickly and thrashed the schedd process. But then I killed all my workers, simply tried to queue jobs, and got the same error. This is not a one-off occurrence; it happens pretty deterministically.
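One way to separate submit latency from worker load might be to time each submit call on its own. The following is only a sketch under assumptions: `timed_submit`, `build_ad`, `submit_fn`, and `clock` are names made up for illustration and are not part of the HTCondor bindings. In my case `build_ad` would be `build_job_dict` and `submit_fn` would be `schedd.submit`.

```python
import time


def timed_submit(items, build_ad, submit_fn, clock=time.time):
    """Yield (item, seconds) for each submit call, one at a time.

    build_ad and submit_fn are hypothetical placeholders for the
    job-building function and the actual submit call.
    """
    for item in items:
        ad = build_ad(item)
        start = clock()
        submit_fn(ad)
        # Yield right away so the caller can log this timing before
        # the next submit is attempted.
        yield item, clock() - start


if __name__ == "__main__":
    # Stand-in submit function so the sketch runs without a schedd.
    submitted = []
    for item, seconds in timed_submit(["A", "B", "C"],
                                      lambda i: {"Arguments": i},
                                      submitted.append):
        print("%s took %.3f s" % (item, seconds))
```

Because this is a generator, each timing can be printed before the next call starts, so the measurements collected so far should still be visible if the process aborts partway through.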

Any idea what is going on?

Both HTCondor and the Python bindings are version 8.4.3.

Installed Packages
Name        : condor-python
Arch        : x86_64
Version     : 8.4.3
Release     : 1.el7
Size        : 4.8 M
Repo        : installed
From repo   : htcondor-stable
Summary     : Python bindings for HTCondor.
URL         : http://www.cs.wisc.edu/condor/
License     : ASL 2.0
Description : The python bindings allow one to directly invoke the C++ implementations of
            : the ClassAd library and HTCondor from python

