Talk & Discussion at GW group meeting 2020-07-08
Atlas Admin Team
https://www.atlas.aei.uni-hannover.de/userdoc/buster/
request_cpus = 128 is possible (but maybe not advisable)
request_gpus = 3 is possible
The python executable will vanish, only python3 will remain
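For your own scripts this mainly means updating shebang lines; a minimal sketch (script content is hypothetical):

#!/usr/bin/env python3
# a bare "python" will no longer resolve on the Buster nodes,
# so name python3 explicitly in shebangs and submit files
print("running under Python 3")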
Steps for migration:
Set up ~/.ssh/config
Use ssh-agent to prevent typing your passphrase for every login
Important: Do this sooner rather than later!
https://www.atlas.aei.uni-hannover.de/userdoc/topic/gettingstarted/ssh-keypair
Important: You need to specify your username (explicitly or via ~/.ssh/config):
ssh user.name@condor1.atlas.aei.uni-hannover.de
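A minimal ~/.ssh/config sketch (the host alias and user.name are placeholders for your own values):

# ~/.ssh/config
Host condor1
    HostName condor1.atlas.aei.uni-hannover.de
    User user.name

With this in place, ssh condor1 suffices; additionally, running ssh-add once registers your key with ssh-agent so the passphrase is not requested on every login.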
Right now: /work, some /home as initial $HOME
atlasX: as before, /home as initial home directory, /work still available in parallel
condorX: /work as initial home directory, /home still available in parallel
Regular submission hosts (condor1, condor2, …), interactive submissions (condor8), interactive pool (condor9)
vanilla, scheduler and local universes are possible
Changes and Information:
Everyone using Condor should at least read HTCondor’s user manual. Seriously! There is a lot of information to discover!
Accounting tags now follow a strict scheme.
Before, e.g. aei.sim.description101
Now, it must match:
/^(admin|boinc|burst|cbc|cw|gamma|radio)\.(back|dev|imp|prod|test)\.[a-zA-Z0-9_-]+$/x
e.g. radio.prod.arecibo_follow_up
First part is simply the appropriate group:
admin, boinc, burst, cbc, cw, gamma, radio
We will be able to quickly add more groups if PIs need them!
Second part describes your relative priority factor:
imp: 10
prod: 100
test: 1,000
dev: 100,000
back: 10,000,000
Final part is almost free-form, use it to describe the jobs/pipeline!
It must match [a-zA-Z0-9_-]+, for example:
GW201224_follow_up
GRB1234-tests
awesome-catalogue
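Put together, such a tag could be set in a submit file as follows (the tag itself is hypothetical; this assumes the tag goes into HTCondor's standard accounting_group command):

# group "cbc", priority class "dev", free-form description
accounting_group = cbc.dev.awesome-catalogue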
Derived values for user priority calculation have changed!
Before: aei.dev.r1b2.user.name@atlas.local
Now: radio.prod.user.name@atlas.local
This means that using suffixes on the description will no longer reset the user priority for this accounting tag!
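To inspect the derived identities and their current priorities, HTCondor's stock condor_userprio tool can be used:

# list all accounting identities with priority records
condor_userprio -allusers

Each accounting tag should show up as its own entry of the form group.class.user.name@atlas.local.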
If you have small jobs which can tolerate being killed at any time, add
+KillableJob = true
to your submit file.
These jobs may run on idle machines but may be readily killed when resources are needed.
Hence, only use this flag where appropriate!
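A minimal submit-file sketch using the flag (executable and resource values are hypothetical):

# small, safely restartable job that may be evicted at any time
executable     = short_task.sh
request_cpus   = 1
request_memory = 512
+KillableJob   = true
queue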
Interactive submissions (condor8):
request_cpus, request_gpus, request_memory and requirements are available
queue keyword works as usual
Submit via condor_submit -interactive (see sketch below)
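A minimal sketch of such an interactive submission (file name and resource values are hypothetical):

# interactive.sub -- resources for an interactive session
request_cpus   = 4
request_memory = 8192
queue

condor_submit -interactive interactive.sub should then open a shell on the claimed slot once resources are allocated.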
Things to note:
Use tmux or screen on the submit host when submitting
Predicting run times of similar jobs will become next to impossible: jobs may land on very different CPU generations and clock speeds (see below)
All these may lead to job run times varying a lot!
For early testers: we will soon remove the +MaxRunTimeHours requirement again.
Atlas has become quite heterogeneous:
Need to distinguish between partitionable slots (slot1@host) and dynamic slots (slot1_31@host)
List the CPU models of all partitionable slots:
condor_status \
  -constraint 'PartitionableSlot =?= True' \
  -autoformat lscpu_model_name | sort |\
  uniq -c | sort -g
# of machines | CPU types (machine may have multiple CPUs)
1 AMD EPYC 7551 32-Core Processor
1 AMD Opteron(tm) Processor 6136
1 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
1 Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
1 Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
1 Intel(R) Xeon(R) E-2134 CPU @ 3.50GHz
43 Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz
45 Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
264 Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz
400 AMD EPYC 7452 32-Core Processor
495 Intel(R) Xeon(R) CPU E5-2658 v4 @ 2.30GHz
1669 Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
This is especially important for the interactive pool via condor9!
List all connected, online machines:
condor_status -constraint 'PartitionableSlot =?= True'
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@a3102.atlas.local LINUX X86_64 Unclaimed Idle 0.000 117884 0+00:13:50
slot1@a3104.atlas.local LINUX X86_64 Unclaimed Idle 0.000 28631 41+23:39:56
slot1@a3108.atlas.local LINUX X86_64 Unclaimed Idle 0.000 28631 41+23:39:24
slot1@a3112.atlas.local LINUX X86_64 Unclaimed Idle 0.000 28631 38+18:33:53
[...]
condor_status -long slot1@a3102.atlas.local
AcceptedWhileDraining = false
Activity = "Idle"
AddressV1 = "{[ p=\"primary\"; a=\"10.10.31.2\"; port=9618; n=\"Internet\"; [..]]}"
Arch = "X86_64"
AssignedGPUs = "CUDA0,CUDA1"
AuthenticatedIdentity = "condor_pool@atlas.local"
AuthenticationMethod = "PASSWORD"
[...]
Any of the given keys/values can be used as a filter!
condor_status -constraint 'PartitionableSlot =?= True && Cpus > 0' \
  -autoformat:V Cpus "Memory/1024" GPUs CUDADeviceName lscpu_model_name |\
  sort | uniq -c | sort -r -g -k2,2 | head -n 7
1 96 450 undefined undefined "AMD EPYC 7452 32-Core Processor"
37 32 182 8 "GeForce RTX 2070 SUPER" "Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz"
1 32 182 undefined undefined "Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz"
1 32 167 8 "GeForce RTX 2070 SUPER" "Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz"
4 31 260 undefined undefined "AMD EPYC 7452 32-Core Processor"
2 31 206 undefined undefined "AMD EPYC 7452 32-Core Processor"
5 24 152 undefined undefined "Intel(R) Xeon(R) CPU E5-2658 v4 @ 2.30GHz"
To only run on a specific Intel CPU:
requirements = (lscpu_model_name =?= "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz")
or require a CPU with the AVX2 extension and at least ~2.5 GHz (BogoMIPS is roughly twice the nominal clock rate on these CPUs, hence the 4900 threshold):
requirements = (lscpu_has_avx2 =?= true && lscpu_bogomips > 4900)
Target GPU based on CUDA driver version and capability:
request_cpus = 2
request_gpus = 1
requirements = (CUDACapability isnt undefined && \
                CUDACapability >= 7.5 && \
                CUDADriverVersion >= 10.2)
or simply by type:
[...]
requirements = (CUDADeviceName =?= "GeForce RTX 2070 SUPER")
Using an optimized binary for the target node:
executable = /path/to/compiled/$(lscpu_vendor_id)/executable
will use either AuthenticAMD or GenuineIntel on Atlas.
Store log files by machine name first
output = /path/to/logs/$(Machine)-$(ClusterId)-$(ProcId).out
All of the following are possible (cf. HTCondor submission):
Change one variable, keep everything else the same:
[...]
initialdir = run_1
queue
initialdir = run_2
queue
Start one job per .dat file in the submit directory:
[...]
transfer_input_files = $(filename)
arguments = -infile $(filename)
queue filename matching files *.dat
Submit 3 jobs where $(Item) will be 7, 8 or 9:
[...]
arguments = -a -b -c $(Item)
queue from seq 7 9 |
Submit jobs based on the tuples given:
[...]
queue input,arguments from (
file1, -a -b 26
file2, -c -d 92
)
We have quite a number of changes still on our long to-do list.
Some will be more visible and some may be completely invisible to you, but they will nevertheless use up our time! For example:
/work and /home … again
spack + modules
/software/cw as rsync target for central build artifacts, images, …