Start your first job
This part is much easier: in principle, all you need to do is run
condor_submit analyze.sub
Submitting job(s)....................
20 job(s) submitted to cluster 210254.
Condor will probably accept your jobs, but when it tries to run them, it
will almost certainly hit a problem and place them in the “hold”
state. You can check this by running condor_q:
condor_q
-- Schedd: condor1.atlas.local : <10.20.30.16:9618?... @ 05/25/20 05:53:58
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
carsten ID: 210254 5/25 05:49 _ _ _ 20 20 210254.0-19
Total for query: 20 jobs; 0 completed, 0 removed, 0 idle, 0 running, 20 held, 0 suspended
Total for carsten: 20 jobs; 0 completed, 0 removed, 0 idle, 0 running, 20 held, 0 suspended
Total for all users: 29362 jobs; 0 completed, 0 removed, 7532 idle, 21810 running, 20 held, 0 suspended
As there is quite a lot of information here, let us focus on the
important bits. Both condor_submit and condor_q tell you the ClusterId
of your jobs (210254); the latter also shows that there are 20 processes
in this cluster (210254.0-19). Unfortunately, all of them are “on hold”,
so let us investigate what went wrong:
condor_q -hold 210254.0
-- Schedd: condor1.atlas.local : <10.20.30.16:9618?... @ 05/25/20 06:01:25
ID OWNER HELD_SINCE HOLD_REASON
210254.0 carsten 5/25 05:50 Error from slot1_64@a4606.atlas.local: Failed to open '/work/carsten//Condor/FirstSteps/out/0.out' as standard output: No such file or directory (errno 2)
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Total for all users: 29996 jobs; 0 completed, 0 removed, 7559 idle, 22417 running, 20 held, 0 suspended
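With many held processes, reading each hold reason by eye gets tedious. The following is a minimal sketch, assuming the hold reasons have the "Failed to open '<path>'" form shown above; missing_dirs is a hypothetical helper name, not a Condor tool, and other kinds of hold reasons are simply ignored.

```shell
# Hypothetical helper (not part of Condor): read `condor_q -hold` output on
# stdin and print the directory of every file Condor failed to open.
missing_dirs() {
    # Keep only the quoted path from "Failed to open '<path>'" hold reasons.
    sed -n "s/.*Failed to open '\([^']*\)'.*/\1/p" |
    while IFS= read -r path; do
        # Print the directory part of the path that could not be opened.
        dirname "$path"
    done
}
```

It could be used as condor_q -hold 210254 | missing_dirs | sort -u to list each missing directory once.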
Luckily, fixing this problem is easy: in our submit file analyze.sub
we told Condor to write the stdout and stderr outputs into files under
the directories out and err, but we simply forgot to create them.
Therefore, let us create these directories and let Condor restart the jobs:
mkdir err out
condor_release 210254
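To avoid this kind of hold in future submissions, the directories can be created before calling condor_submit. The following is a minimal sketch, assuming the submit file uses plain "output = ...", "error = ..." and "log = ..." lines with relative paths and no macros in the directory part; make_submit_dirs is a hypothetical helper name, and it has to be run from the directory that contains the submit file.

```shell
# Hypothetical helper (not part of Condor): create the directories named in
# the output/error/log lines of a submit file.
make_submit_dirs() {
    submit_file=$1
    # Print the value of every "output", "error" or "log" assignment.
    awk -F= 'tolower($1) ~ /^[ \t]*(output|error|log)[ \t]*$/ {
        sub(/^[^=]*=[ \t]*/, "")   # drop the "key = " prefix
        sub(/[ \t]+$/, "")         # drop trailing whitespace
        print
    }' "$submit_file" |
    while IFS= read -r path; do
        # Create the directory part of the path; existing directories are fine.
        mkdir -p "$(dirname "$path")"
    done
}
```

A typical call would be make_submit_dirs analyze.sub, followed by condor_submit analyze.sub.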
Running condor_q again, one can see whether the jobs are idle, on hold
again, or running:
# all jobs are idle:
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
carsten ID: 210254 5/25 05:49 _ _ 20 20 210254.0-19
# [a short while later]:
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
carsten ID: 210254 5/25 05:49 9 11 _ 20 210254.9-19
This means your jobs have started (and a few have already finished).
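The JOB_IDS column only lists the processes that are still in the queue, written as a range such as 210254.9-19. The following is a minimal sketch for expanding such a range into individual job IDs, assuming the simple "cluster.first-last" form shown above; expand_job_ids is a hypothetical helper name, and more complex JOB_IDS values are not handled.

```shell
# Hypothetical helper (not part of Condor): expand "cluster.first-last"
# into one "cluster.proc" job ID per line.
expand_job_ids() {
    range=$1
    cluster=${range%%.*}   # part before the first dot, e.g. 210254
    procs=${range#*.}      # part after the first dot, e.g. 9-19
    first=${procs%-*}      # first process number
    last=${procs#*-}       # last process number (same as first if no dash)
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s.%s\n' "$cluster" "$i"
        i=$((i + 1))
    done
}
```

For example, expand_job_ids 210254.9-19 prints the eleven IDs from 210254.9 to 210254.19, each of which can be passed to condor_q individually.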