Your initial submit file
Let us assume you have an executable called analyze placed in the directory you are currently in. The first step is to ensure this program really is executable, i.e. run chmod a+x analyze.
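As a quick sanity check, this step can be sketched in the shell (the touch line merely creates a stand-in file so the snippet is self-contained; use your real binary instead):

```shell
#!/bin/sh
# Stand-in for your compiled program; replace with your real binary.
touch analyze
# Grant execute permission to all users.
chmod a+x analyze
# Verify the execute bit is set before submitting.
test -x analyze && echo "analyze is executable"
```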
Usually, you call your program with a couple of arguments. For now, let us assume your program performs some special analysis on an input data set detector.data, which it partitions into 20 different parts. Thus, on your laptop, you would start this program 20 times, e.g. sequentially in a for loop like this:
for i in {0..19}; do ./analyze detector.data $i; done
As your paper should be ready sooner than your laptop will be able to provide answers, let us hand the problem over to condor and create the file analyze.sub:
# what to run
executable = analyze
arguments = detector.data $(ProcId)
# where to place output/logfiles
output = out/$(ProcId).out
error = err/$(ProcId).err
log = analyze.log
# which resources are needed
request_cpus = 1
request_memory = 1024
request_disk = 10240
accounting_group = burst.test.first_steps
# use "spot" resources? (see bottom of page)
# +KillableJob = true
# how many jobs to start
queue 20
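Once the submit file is in place, you hand it to the scheduler with condor_submit. Note that condor will not create the directories referenced by output and error for you, so create them first (a sketch):

```
# create the directories referenced by output/error first,
# condor will not create them for you
mkdir -p out err
# submit the job cluster described in analyze.sub
condor_submit analyze.sub
```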
Even though this submit file is fairly simple, it does contain a lot of information:
how many jobs to start
Let us start with the bottom of the file: queue 20 tells condor to parse all statements since the beginning of the file (or since the most recent queue line) and submit jobs accordingly. In this case, it tells condor you want to run your executable exactly 20 times.
Condor will start a job cluster for you, identified by a unique number; this is an ever-growing integer which depends on how many jobs have already been submitted on your chosen submit host in the past. If you want to, you can access this number as $(ClusterId) in your submit file.
queue without any argument will submit only a single job, but here we want to start 20 jobs. To distinguish these jobs, each of them is assigned a process id, which can be referenced using $(ProcId).
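To illustrate, the following submit-file fragments are equivalent ways of using queue (a sketch; only one queue statement is needed per batch of jobs):

```
queue        # submits a single job, with ProcId = 0
queue 1      # identical: one job, ProcId = 0
queue 20     # submits 20 jobs, with ProcId = 0, 1, ..., 19
```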
what to run
Let us get back to the beginning of your submit file:
executable = analyze
arguments = detector.data $(ProcId)
This tells condor to run the program analyze each time with the arguments detector.data $(ProcId). The first argument is the file name of your data file; the second argument will be one of the numbers 0,…,19.
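In effect, condor runs the same 20 command lines you would have generated with the for loop on your laptop; a small shell sketch of what $(ProcId) expands to:

```shell
#!/bin/sh
# Print the command line condor effectively executes for each job;
# ProcId takes the values 0 through 19, one per job in the cluster.
for i in $(seq 0 19); do
    echo "./analyze detector.data $i"
done
```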
where to place output/logfiles
output = out/$(ProcId).out
error = err/$(ProcId).err
log = analyze.log
This tells condor that the output (stdout and stderr of each program) should be placed into the files out/0.out and err/0.err for the first job, out/1.out and err/1.err for the second job, and so on. The general log file analyze.log will contain information about job starts, exit codes, run times and some more details.
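While the jobs are running, you can watch their state from the submit host; a sketch of the two most common commands:

```
# show your jobs currently idle, queued or running
condor_q
# follow events (starts, evictions, terminations) as they are
# appended to the plain-text job log
tail -f analyze.log
```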
which resources are needed
request_cpus = 1
request_memory = 1024
request_disk = 10240
accounting_group = burst.test.first_steps
Finally, this section tells condor what resources it should reserve for each job. In this example, you tell condor your program needs a single CPU core, 1024 MByte of memory and 10240 KByte of disk space (request_memory is interpreted in MByte and request_disk in KByte by default).
As a rule of thumb, try to be as strict and accurate with these values as possible, as smaller requests are usually matched faster with available resources. But please leave a little bit of headroom for your program, as condor is very strict in enforcing these limits, especially for memory usage, and may simply kill your job if it exceeds them.
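To pick a sensible request_memory value, you can look up how much memory a finished job actually used in condor's job history; a sketch, assuming the hypothetical cluster id 123:

```
# print the peak memory usage (in MByte) recorded
# for each finished job of cluster 123
condor_history 123 -af ProcId MemoryUsage
```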
accounting_group is another special variable which helps organize Atlas' resource scheduling. As choosing this variable is a bit involved, please read the section on accounting tags.
KillableJob
The Atlas nodes are partitioned into three distinct pools. Sometimes the main pool may be completely full while you have jobs which do not waste much CPU time if they are killed before finishing; e.g. these could either be short-running jobs, or jobs which write their own checkpoint files and can easily restart from them.
For these jobs - and only for these, please - you may specify the line
+KillableJob = true
in the submit file to enable flocking. This means a job may be scheduled to run on the specialized (parallel or interactive) pools if resources are available there, but it may be killed immediately if those resources are requested!