Your initial submit file
Let us assume you have an executable called analyze placed in the directory you are currently in. The first step is to ensure this program really is executable, i.e. run chmod a+x analyze.
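As a quick sanity check, this step can be sketched in the shell (the touch line merely creates a stand-in file so the snippet is self-contained; use your real binary instead):

```shell
#!/bin/sh
# Stand-in for your compiled program; replace with your real binary.
touch analyze
# Grant execute permission to all users.
chmod a+x analyze
# Verify the execute bit is set before submitting.
test -x analyze && echo "analyze is executable"
```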
Usually, you call your program with a couple of arguments. For now, let us assume your program performs some special analysis on an input data set detector.data, which it partitions into 20 different parts. Thus, on your laptop, you would start this program 20 times, e.g. sequentially in a for loop like this:
for i in {0..19}; do ./analyze detector.data $i; done
As your paper should be ready sooner than your laptop will be able to provide answers, let us hand the problem over to condor and create the file analyze.sub:
# what to run
executable = analyze
arguments = detector.data $(ProcId)
# where to place output/logfiles
output = out/$(ProcId).out
error = err/$(ProcId).err
log = analyze.log
# which resources are needed
request_cpus = 1
request_memory = 1024
request_disk = 10240
accounting_group = burst.test.first_steps
# use "spot" resources? (see bottom of page)
# +KillableJob = true
# how many jobs to start
queue 20
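Once the submit file is in place, you hand it to the scheduler with condor_submit. Note that condor will not create the directories referenced by output and error for you, so create them first (a sketch):

```
# create the directories referenced by output/error first,
# condor will not create them for you
mkdir -p out err
# submit the job cluster described in analyze.sub
condor_submit analyze.sub
```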
Even though this submit file is fairly simple, it does contain a lot of information:
how many jobs to start
Let us start with the bottom of the file: queue 20 tells condor to parse all statements since the beginning of the file (or since the most recent queue line) and submit jobs accordingly. In this case, it tells condor you want to run your executable exactly 20 times.
Condor will start a job cluster for you, identified by a unique number; this is an ever-growing integer which depends on how many jobs have already been submitted on your chosen submit host in the past. If you want to, you can access this number as $(ClusterId) in your submit file.
queue without any argument will submit only a single job, but here we want to start 20 jobs. To distinguish these jobs, each of them is assigned a process id, which can be referenced using $(ProcId).
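To illustrate, the following submit-file fragments are equivalent ways of using queue (a sketch; only one queue statement is needed per batch of jobs):

```
queue        # submits a single job, with ProcId = 0
queue 1      # identical: one job, ProcId = 0
queue 20     # submits 20 jobs, with ProcId = 0, 1, ..., 19
```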
what to run
Let us get back to the beginning of your submit file:
executable = analyze
arguments = detector.data $(ProcId)
This tells condor to run the program analyze each time with the arguments detector.data $(ProcId). The first argument is the file name of your data file; the second argument will be one of the numbers 0,…,19.
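In effect, condor runs the same 20 command lines you would have generated with the for loop on your laptop; a small shell sketch of what $(ProcId) expands to:

```shell
#!/bin/sh
# Print the command line condor effectively executes for each job;
# ProcId takes the values 0 through 19, one per job in the cluster.
for i in $(seq 0 19); do
    echo "./analyze detector.data $i"
done
```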
where to place output/logfiles
output = out/$(ProcId).out
error = err/$(ProcId).err
log = analyze.log
This tells condor that the output (stdout and stderr of each program) should be placed into the files out/0.out and err/0.err for the first job, out/1.out and err/1.err for the second job, and so on. The general log file analyze.log will contain information about job starts, exit codes, run times and some more details.
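While the jobs are running, you can watch their state from the submit host; a sketch of the two most common commands:

```
# show your jobs currently idle, queued or running
condor_q
# follow events (starts, evictions, terminations) as they are
# appended to the plain-text job log
tail -f analyze.log
```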
which resources are needed
request_cpus = 1
request_memory = 1024
request_disk = 10240
accounting_group = burst.test.first_steps
Finally, this section tells condor what resources it should reserve for each job. In this example, you tell condor your program needs a single CPU core, 1024 MByte of memory and 10240 KByte of disk space (request_memory is interpreted in MByte and request_disk in KByte by default).
As a rule of thumb, try to be as strict and accurate with these values as possible, as smaller requests are usually matched faster with available resources. But please leave a little bit of headroom for your program, as condor is very strict in enforcing these limits, especially for memory usage, and may simply kill your job if it exceeds them.
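To pick a sensible request_memory value, you can look up how much memory a finished job actually used in condor's job history; a sketch, assuming the hypothetical cluster id 123:

```
# print the peak memory usage (in MByte) recorded
# for each finished job of cluster 123
condor_history 123 -af ProcId MemoryUsage
```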
accounting_group is another special variable which helps organize Atlas' resource scheduling. As choosing this variable is a bit involved, please read the section on accounting tags.
KillableJob
The Atlas nodes are partitioned into three distinct pools. Sometimes the main pool may be completely full while you have jobs which do not waste much CPU time if they are killed before finishing; e.g. these could either be short-running jobs, or jobs which write their own checkpoint files and can easily restart from them.
For these jobs - and only for these, please - you may specify the line
+KillableJob = true
in the submit file to enable flocking. This means a job may be scheduled to run on the specialized (parallel or interactive) pools if resources are available there, but it may be killed immediately if those resources are requested!