Online resources

Use condor_status to query the resources currently online in the connected pool. Keep in mind that we currently have three pools and that the command only reports on the one you are connected to.

condor_status -constraint PartitionableSlot \
-af:V lscpu_model_name 'int(TotalCpus)' TotalMemory/1024 |\
sort | uniq -c | sort -rn

What this does is the following:

The given constraint targets only the “main slot” of each machine. When first started, an empty machine usually has only a single “slot” containing all available resources. Whenever a job is started, its requirements are “carved” away from the main slot, creating a subslot. This is why jobs run under names such as slot1_15@a0101.atlas.local, i.e. the 15th job carved away from the main slot1 on server a0101.
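The naming scheme can be taken apart with plain shell parameter expansion. The following is only a small sketch; the function name split_slot_name is made up for this illustration and is not part of HTCondor.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a slot name into main slot, subslot number and host.
split_slot_name() {
    name=$1
    slot=${name%%@*}        # part before the @, e.g. slot1_15
    host=${name#*@}         # part after the @, e.g. a0101.atlas.local
    parent=${slot%%_*}      # main slot, e.g. slot1
    case $slot in
        *_*) sub=${slot#*_} ;;   # subslot number, e.g. 15
        *)   sub=- ;;            # a main slot has no subslot number
    esac
    printf '%s %s %s\n' "$parent" "$sub" "$host"
}

split_slot_name slot1_15@a0101.atlas.local   # prints: slot1 15 a0101.atlas.local
split_slot_name slot1@a0101.atlas.local      # prints: slot1 - a0101.atlas.local
```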

Omitting the constraint while still using the Total* attributes counts the same resources multiple times, leading to wrong results.
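To illustrate the double counting, here is a toy sketch on invented data. It assumes one 128-core machine whose main slot and two subslots each advertise the same TotalCpus value; the host name nodeA.example and the three-column layout are made up for this example and are not real condor_status output.

```shell
#!/usr/bin/env bash
# Made-up slot list for ONE machine: slot name, PartitionableSlot, TotalCpus.
sample_slots() {
    cat <<'EOF'
slot1@nodeA.example true 128
slot1_1@nodeA.example false 128
slot1_2@nodeA.example false 128
EOF
}

# Sum TotalCpus over every slot: the machine is counted once per slot.
sum_all_slots() {
    awk '{ sum += $3 } END { print sum }'
}

# Sum TotalCpus over the main (partitionable) slots only.
sum_main_slots() {
    awk '$2 == "true" { sum += $3 } END { print sum }'
}

echo "all slots:  $(sample_slots | sum_all_slots)"    # 384, three times too many
echo "main slots: $(sample_slots | sum_main_slots)"   # 128, the real core count
```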

Then we ask condor_status to output only the CPU model, the number of configured cores and the total memory in GiB.

That output is then sorted and piped into uniq to summarize and count the number of occurrences of each line. The final sort then re-orders all this to place the highest line/machine count at the top.
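The counting idiom can be tried without access to a pool. The sketch below feeds a few invented lines through the same sort, uniq and sort chain; the sample lines (including the CPU name "Example CPU Model X") are made up and only mimic the shape of the condor_status output.

```shell
#!/usr/bin/env bash
# Made-up stand-in for the condor_status output: one line per machine.
sample_machines() {
    cat <<'EOF'
"AMD EPYC 7452 32-Core Processor" 128 495
"Example CPU Model X" 16 62
"AMD EPYC 7452 32-Core Processor" 128 495
"AMD EPYC 7452 32-Core Processor" 128 464
"Example CPU Model X" 16 62
"AMD EPYC 7452 32-Core Processor" 128 495
EOF
}

# sort groups identical lines, uniq -c collapses each group and prefixes
# its count, and sort -rn puts the largest count first.
count_machine_types() {
    sort | uniq -c | sort -rn
}

sample_machines | count_machine_types
```

With this input, the configuration that occurs three times ends up on the first line, followed by the one that occurs twice and the one that occurs once.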

Online and available resources

The command above only tells you which resources are currently online; it does not show which of them are available for your jobs. To get this information, we need to switch away from the Total* attributes:

condor_status -constraint "PartitionableSlot && Cpus > 0 && Memory > 0" \
-af:V 'int(Cpus)' Memory/1024 Machine lscpu_model_name |\
sort -n -r | head

This lists the 10 machines with the most available cores. The first column shows the number of available cores, the second the available memory in GiB.

Note: The machines listed may require special attributes or job requirements to be matched. Unfortunately, it is not possible to gather all this information from a single call.

If you suspect something is amiss, the following quite complex expression may yield an answer:

join -1 3 -2 1 \
<(condor_status -constraint "PartitionableSlot && Cpus > 0 && Memory > 0" \
-af:V 'int(Cpus)' Memory/1024 Machine lscpu_model_name | sort -k3) \
<(condor_status -constraint "PartitionableSlot && Cpus > 0 && Memory > 0" \
-af:rV Machine "'Start:'" Start | sort -k1) |\
sort -k2n,2 | tail | tac
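As a hint for the central step: join pairs up lines from two sorted inputs that share a key, here the machine name. The sketch below replays that step on invented data. The nodeX.example hosts and their values are made up, temporary files stand in for the process substitutions, and the function name join_demo is only illustrative.

```shell
#!/usr/bin/env bash
# Toy illustration of the join step; all host names and values are made up.
join_demo() {
    resources=$(mktemp)
    starts=$(mktemp)

    # Input 1: cores, memory, machine (the machine name is field 3).
    printf '%s\n' \
        '16 62 nodeB.example' \
        '128 495 nodeA.example' \
        '70 2 nodeC.example' | sort -k3,3 > "$resources"

    # Input 2: machine name first, then its Start expression.
    printf '%s\n' \
        'nodeC.example Start: true' \
        'nodeA.example Start: false' \
        'nodeB.example Start: true' | sort -k1,1 > "$starts"

    # Match field 3 of input 1 with field 1 of input 2. join prints the
    # join field first, then the rest of input 1, then the rest of input 2.
    # Finally sort by the core count (now field 2), largest first.
    join -1 3 -2 1 "$resources" "$starts" | sort -k2,2nr

    rm -f "$resources" "$starts"
}

join_demo
```

The first output line is "nodeA.example 128 495 Start: false": the machine with the most free cores, followed by its Start expression.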

Understanding the inner workings is left as an exercise for the reader. Suffice it to say that the output of the condor_status pipeline above (with a comment line added) could look like this:

# Machine name    Cores Mem[GiByte]  CPU type                 Start expression which may indicate special requirements
"a7004.atlas.local" 128 464 "AMD EPYC 7452 32-Core Processor" 'Start:' false
"a3305.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a3304.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a3303.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a3302.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a3301.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a6201.atlas.local" 70 2 "AMD EPYC 7452 32-Core Processor" 'Start:' true
"a5104.atlas.local" 70 2 "AMD EPYC 7452 32-Core Processor" 'Start:' true
"a3707.atlas.local" 70 2 "AMD EPYC 7452 32-Core Processor" 'Start:' true
"a3008.atlas.local" 65 0 "AMD EPYC 7452 32-Core Processor" 'Start:' true