How to find out which (and how many) CPU models are available?
Online resources
Use condor_status to query the resources currently online in the connected pool. Remember that we currently have three pools.
condor_status -constraint PartitionableSlot \
-af:V lscpu_model_name 'int(TotalCpus)' TotalMemory/1024 |\
sort | uniq -c | sort -rn
What this does is the following:
The given constraint only targets the “main slot” of each machine. When first started, an empty machine usually has only a single “slot” containing all available resources. Whenever a job is started, its requirements are “carved” away from the main slot, creating a subslot. This is why jobs run under names such as slot1_15@a0101.atlas.local, i.e. the 15th job carved away from the main slot1 on server a0101.
Omitting the constraint while still using the Total* attributes counts the same resources multiple times, leading to wrong results.
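The double counting can be illustrated with a small sketch on made-up data. The three-line slot listing below is hypothetical (slot name, PartitionableSlot flag, TotalCpus), and it assumes that every slot of a machine reports the machine-wide TotalCpus value; sample_slots, sum_all and sum_main are illustrative helper names, not part of HTCondor.

```shell
# Hypothetical slot listing: slot name, PartitionableSlot flag, TotalCpus.
# Assumption: every slot of a machine reports the machine-wide TotalCpus.
sample_slots() {
    cat <<'EOF'
slot1@a0101.atlas.local true 128
slot1_1@a0101.atlas.local false 128
slot1_15@a0101.atlas.local false 128
EOF
}

# Sum TotalCpus over every slot (what happens without the constraint).
sum_all() { awk '{ total += $3 } END { print total + 0 }'; }

# Sum TotalCpus over partitionable (main) slots only.
sum_main() { awk '$2 == "true" { total += $3 } END { print total + 0 }'; }

echo "all slots:  $(sample_slots | sum_all)"    # 384, the machine counted three times
echo "main slots: $(sample_slots | sum_main)"   # 128, the machine counted once
```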
Then we ask condor_status to output only the CPU model, the number of configured cores and the total memory in GiB (TotalMemory is reported in MiB, hence the division by 1024).
That output is then sorted and piped into uniq -c, which collapses identical lines and counts how often each one occurs. The final sort -rn then re-orders the result to place the highest line/machine count at the top.
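To see the counting step in isolation, the same sort | uniq -c | sort -rn chain can be run on a few hand-written lines. The sample below is invented for illustration ("Example CPU Model B" and the core/memory figures are placeholders, not real pool data), and sample_machines and summarize are hypothetical helper names.

```shell
# Hypothetical stand-in for the condor_status output, one line per machine:
# CPU model, configured cores, total memory in GiB.
sample_machines() {
    cat <<'EOF'
"AMD EPYC 7452 32-Core Processor" 128 503
"Example CPU Model B" 8 15
"AMD EPYC 7452 32-Core Processor" 128 503
"AMD EPYC 7452 32-Core Processor" 128 503
"Example CPU Model B" 8 15
"AMD EPYC 7452 32-Core Processor" 64 251
EOF
}

# sort groups identical lines, uniq -c collapses and counts them,
# sort -rn puts the most frequent configuration first.
summarize() {
    sort | uniq -c | sort -rn
}

sample_machines | summarize
```

The result has three lines, one per distinct configuration, each prefixed with its machine count (3, 2 and 1). Note that the same CPU model with a different core or memory figure counts as a separate configuration.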
Online and available resources
The command above only tells you which resources are currently online; it does not tell you which of them are available for your jobs. To get this information, we need to switch away from the Total* attributes:
condor_status -constraint "PartitionableSlot && Cpus > 0 && Memory > 0" \
-af:V 'int(Cpus)' Memory/1024 Machine lscpu_model_name |\
sort -n -r | head
This lists the ten machines with the most free cores, with the number of available cores in the first column and the available memory in GiB in the second.
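Because the list is ordered by free cores only, a machine near the top may still have almost no free memory. As a rough sketch, the output can be filtered for machines that offer at least a given number of cores and GiB. The sample lines reuse values from the example output further down this page; sample_free and fits are hypothetical helper names, and the thresholds are arbitrary.

```shell
# Sample lines in the column order of the command above:
# free cores, free memory in GiB, machine, CPU model.
sample_free() {
    cat <<'EOF'
128 464 "a7004.atlas.local" "AMD EPYC 7452 32-Core Processor"
70 2 "a6201.atlas.local" "AMD EPYC 7452 32-Core Processor"
65 0 "a3008.atlas.local" "AMD EPYC 7452 32-Core Processor"
EOF
}

# fits MIN_CORES MIN_GIB: keep only machines with at least that much free.
fits() {
    awk -v cores="$1" -v mem="$2" '$1 >= cores && $2 >= mem'
}

# Only a7004 has both 16 free cores and 32 GiB of free memory.
sample_free | fits 16 32
```

In practice the same filter could be appended to the real pipeline in place of the sample function. Passing the filter does not mean a job will actually match the machine.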
Note: The machines listed may require special attributes or job requirements to be matched. Unfortunately, it is not possible to gather all this information from a single call.
If you suspect something is amiss, the following quite complex expression may yield an answer:
join -1 3 -2 1 \
<(condor_status -constraint "PartitionableSlot && Cpus > 0 && Memory > 0" \
-af:V 'int(Cpus)' Memory/1024 Machine lscpu_model_name | sort -k3) \
<(condor_status -constraint "PartitionableSlot && Cpus > 0 && Memory > 0" \
-af:rV Machine "'Start:'" Start | sort -k1) |\
sort -k2n,2 | tail | tac
Understanding the inner workings is left as an exercise for the reader. Suffice it to say that a result (with an added comment line) could look like this:
# Machine name Cores Mem[GiByte] CPU type Start expression which may indicate special requirements
"a7004.atlas.local" 128 464 "AMD EPYC 7452 32-Core Processor" 'Start:' false
"a3305.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a3304.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a3303.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a3302.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a3301.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a6201.atlas.local" 70 2 "AMD EPYC 7452 32-Core Processor" 'Start:' true
"a5104.atlas.local" 70 2 "AMD EPYC 7452 32-Core Processor" 'Start:' true
"a3707.atlas.local" 70 2 "AMD EPYC 7452 32-Core Processor" 'Start:' true
"a3008.atlas.local" 65 0 "AMD EPYC 7452 32-Core Processor" 'Start:' true
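As a rough way to read such a result, the lines whose Start expression is a plain true can be separated from those that are closed (false) or tied to an accounting group. The sketch below works on four lines copied from the output above; sample_joined and unrestricted_only are hypothetical helper names, and a plain true is no guarantee that a job will match.

```shell
# Four lines taken from the example output above.
sample_joined() {
    cat <<'EOF'
"a7004.atlas.local" 128 464 "AMD EPYC 7452 32-Core Processor" 'Start:' false
"a3305.atlas.local" 128 495 "AMD EPYC 7452 32-Core Processor" 'Start:' true && (REGEXP("^cbc\.(imp|prod|test|dev)\.post-processing$",AcctGroup))
"a6201.atlas.local" 70 2 "AMD EPYC 7452 32-Core Processor" 'Start:' true
"a3008.atlas.local" 65 0 "AMD EPYC 7452 32-Core Processor" 'Start:' true
EOF
}

# Keep only lines that end in exactly: 'Start:' true
unrestricted_only() {
    grep -E "'Start:' true\$"
}

# Prints the a6201 and a3008 lines; a7004 (false) and a3305 (group-restricted) are dropped.
sample_joined | unrestricted_only
```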