Atlas Upgrade and Changes

Talk & Discussion at GW group meeting 2020-07-08

Atlas Admin Team

https://www.atlas.aei.uni-hannover.de/userdoc/buster/

Topics to cover quickly

  1. Adding 500 new nodes

  2. OS upgrade

  3. Good-bye gsissh - welcome openssh

  4. Head nodes: New naming convention and roles

  5. Condor job submission

  6. How to make better use of Condor

  7. Changes on the horizon

New nodes

  • Last year, we procured ~500 new machines
  • 444 Gigabyte nodes with many CPU cores and plenty of RAM
  • 44 Supermicro nodes with 8 GPUs each

444 new CPU nodes

  • 1U Gigabyte servers
  • 2 AMD CPUs with 32 cores each (twice that with SMT/hyperthreading)
  • 512 GByte RAM per machine
  • set up as a single partitionable slot, i.e. request_cpus = 128 is possible (but maybe not advisable; sketch below)
  • still single HDD and 1 Gbit/s networking (should still be good enough)
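
Since each node is a single partitionable slot, a job simply requests what it needs and Condor carves out a matching sub-slot. A minimal sketch of a submit file for the new CPU nodes (executable name and resource numbers are placeholders):

executable     = my_analysis
request_cpus   = 16
request_memory = 64 GB
queue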

44 new GPU nodes

  • 4U Supermicro servers
  • 2 Intel CPUs with 8 cores each (twice that with SMT/hyperthreading)
  • 192 GByte RAM per machine (double that for the special V100 node)
  • set up as a single partitionable slot, i.e. request_gpus = 3 is possible
  • also single HDD and 1 Gbit/s networking

OS upgrade to Debian buster

  • new nodes already installed with Debian 10 (buster)
  • old nodes will follow soon, as Debian 8 (jessie) is out of security support
  • buster is the last OS version with Python 2 support
  • much newer versions of main system components (gcc, libc, …)
  • may require recompiling/relinking your code
  • next update cycle will probably be much shorter:
    • Debian 11 (bullseye) release probably mid 2021
    • the python executable will vanish, only python3 remains!
    • upgrade hopefully smoother than this time

No grid certificates anymore!!

Globus became paid service

  • NSF funding of the Globus project ceased a few years ago
  • Globus moved to a paid service model
  • the GSI-openssh patch was (temporarily) abandoned in 2018
  • a new fork was created by the Grid Community Forum
  • the current status is unclear, as is the availability of packages

Back to regular openssh

Steps for migration:

  1. create a dedicated ssh key-pair for Atlas use
  2. send us the public key and we will add it to Atlas
  3. set up your local ~/.ssh/config
  4. use ssh-agent to avoid typing your passphrase for every login (sketch below)

Important: Do this sooner rather than later!
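
A minimal sketch of steps 1 and 4 on your local machine (the key file name is only a suggestion; the docs linked below have the full instructions):

# 1. create a dedicated key-pair, protected by a passphrase
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_atlas

# 4. load the key into ssh-agent once, then log in without retyping the passphrase
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519_atlas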

Instructions in our docs

https://www.atlas.aei.uni-hannover.de/userdoc/topic/gettingstarted/ssh-keypair

Important:

You need to specify your username (explicitly or via ~/.ssh/config)

ssh user.name@condor1.atlas.aei.uni-hannover.de
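
A minimal sketch of a matching ~/.ssh/config entry (the host alias and key file name are placeholders):

Host atlas-condor1
    HostName     condor1.atlas.aei.uni-hannover.de
    User         user.name
    IdentityFile ~/.ssh/id_ed25519_atlas

after which ssh atlas-condor1 is enough.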

Changing naming convention for head nodes

Right now:

  • each head node is used both for interactive work and for condor submission
  • some machines use /work, some /home as the initial $HOME
  • this setup is prone to cause errors and problems

Interactive head nodes

  • remote workstations
  • named atlasX as before
  • no access to condor anymore!
  • /home as initial home directory
  • /work still available in parallel

Condor-centric head nodes

  • job submission only possible on head nodes named condorX
  • /work as initial home directory
  • /home still available in parallel
  • partition compute nodes into three pools:
    • main pool (submit from condor1, condor2, …)
    • parallel pool (submit from condor8)
    • interactive job pool (submit from condor9)

Regular submit hosts: condor1, …

  • submission of regular jobs
  • vanilla, scheduler and local universes are possible
  • limited to about 30,000 jobs per submit machine
  • will start with 4 submit nodes
  • will probably expand to 5 or 6 soon

Special submit host: condor8

  • only parallel universe jobs can be submitted here
  • dedicated set of nodes attached to this host (“DedicatedScheduler”)
  • will accept killable jobs from the main pool, if idle
  • parallel jobs will always have priority over overflow jobs

Special submit host: condor9

  • only interactive jobs can be submitted here
  • former “dev” nodes will be attached here, i.e. a wild mix of old and new hardware
  • can be used to
    • run test jobs on various machine types
    • get more resources quickly (e.g. a remote Jupyter session)
  • will accept killable jobs from the main pool, if idle
  • interactive jobs will always be favored over overflow jobs
  • please do NOT bypass Condor!

Overview

[Diagram: condor submit hosts]

Job submission

Changes and Information:

  • accounting tags
  • killable jobs
  • interactive jobs
  • run times

Everyone using Condor should at least read HTCondor’s user manual. Seriously! There is a lot of information to discover!

[Diagram: new structure]

Accounting tags have changed!

Before, e.g.

aei.sim.description101

Now, it must match

/^(admin|boinc|burst|cbc|cw|gamma|radio)\.
(back|dev|imp|prod|test)\.
[a-zA-Z0-9_-]+$/x

e.g.

radio.prod.arecibo_follow_up
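
In a submit file this would be set as usual, e.g. (a sketch, assuming the tag is passed via HTCondor's standard accounting_group command):

accounting_group = radio.prod.arecibo_follow_up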

(search) group first

The first part is simply the appropriate group

  • admin
  • boinc
  • burst
  • cbc
  • cw
  • gamma
  • radio

We will be able to quickly add more groups if PIs need them!

individual priority factor

The second part selects your relative priority factor (rough worked example below):

  • imp: 10
  • prod: 100
  • test: 1,000
  • dev: 100,000
  • back: 10,000,000
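
As a rough worked example (assuming HTCondor's usual scheme, where the effective user priority is the real user priority multiplied by this factor and lower values win):

effective priority (prod job) = real user priority x 100
effective priority (back job) = real user priority x 10,000,000

So, for the same usage history, back jobs effectively only get resources when jobs with smaller factors are not asking for them.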

custom identifier

The final part is almost free-form; use it to describe the jobs/pipeline!

[a-zA-Z0-9_-]+

For example,

GW201224_follow_up
GRB1234-tests
awesome-catalogue

effect on user priorities

Derived values for user priority calculation have changed!

Before

aei.dev.r1b2.user.name@atlas.local

Now

radio.prod.user.name@atlas.local

This means that using suffixes on the description part will no longer reset the user priority for this accounting tag!

Killable jobs

If you have small jobs which

  • run for less than an hour, or
  • write small checkpoints and can automatically restart from them

add +KillableJob = true to your submit file.

These jobs may run on idle machines but may be readily killed when resources are needed.

Hence, only use this flag where appropriate!
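
A minimal sketch of a submit file with the flag set (executable and resources are placeholders):

executable     = short_task
request_cpus   = 1
request_memory = 2 GB

# this job may be killed as soon as the resources are needed elsewhere
+KillableJob   = true

queue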

Interactive jobs

  • Available on all submit nodes except condor8
  • Create a regular submit file (see the sketch below)
  • Target the wanted machine via request_cpus, request_gpus, request_memory and requirements
  • Omit the queue keyword
  • Submit the job via condor_submit -interactive
  • Wait
  • Once scheduled, Condor will log you into the remote machine
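
A minimal sketch (the requested resources are placeholders):

# interactive.sub -- note: no queue statement
request_cpus   = 4
request_memory = 16 GB
requirements   = (lscpu_has_avx2 =?= true)

and then

condor_submit -interactive interactive.sub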

Interactive jobs

Things to note:

  • Automatic logout after being idle for 2 hours
  • Consider using tmux or screen on the submit host when submitting
  • Please do not block resources just because you want to hold on to them!

Predicting run times will be hard

Predicting run times of similar jobs will become next to impossible:

  • heterogeneous CPU model mix
  • SMT/Hyperthreading
  • Turbo boost
  • NUMA

All these may lead to job run times varying a lot!

For early testers:

we will soon remove the +MaxRunTimeHours requirement again

Condor tutorial/examples

Atlas has become quite heterogeneous:

  • many different CPUs (2011-2019, 2.0 GHz - 3.6 GHz)
  • RAM sizes from 16 GByte to 512 GByte/machine
  • small, large or no GPUs at all

Need to distinguish between

  • partitionable slot - currently available resources (slot1@host)
  • partitioned (sub)slot - resources allocated to a running job (slot1_31@host)

Asking Condor about available machines by CPU

condor_status \
  -constraint 'PartitionableSlot =?= True' \
  -autoformat lscpu_model_name | sort | \
  uniq -c | sort -g
# of machines | CPU types (machine may have multiple CPUs)
     1          AMD EPYC 7551 32-Core Processor
     1          AMD Opteron(tm) Processor 6136
     1          Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
     1          Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
     1          Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
     1          Intel(R) Xeon(R) E-2134 CPU @ 3.50GHz
    43          Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz
    45          Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
   264          Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz
   400          AMD EPYC 7452 32-Core Processor
   495          Intel(R) Xeon(R) CPU E5-2658 v4 @ 2.30GHz
  1669          Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz

Which machines are online?

This is especially important for the interactive pool behind condor9!

List number of connected, online machines:

condor_status -constraint 'PartitionableSlot =?= True'

Name                    OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@a3102.atlas.local LINUX      X86_64 Unclaimed Idle      0.000 117884  0+00:13:50
slot1@a3104.atlas.local LINUX      X86_64 Unclaimed Idle      0.000  28631 41+23:39:56
slot1@a3108.atlas.local LINUX      X86_64 Unclaimed Idle      0.000  28631 41+23:39:24
slot1@a3112.atlas.local LINUX      X86_64 Unclaimed Idle      0.000  28631 38+18:33:53
[...]

Get all details from a node

condor_status -long slot1@a3102.atlas.local

AcceptedWhileDraining = false
Activity = "Idle"
AddressV1 = "{[ p=\"primary\"; a=\"10.10.31.2\"; port=9618; n=\"Internet\"; [..]]}"
Arch = "X86_64"
AssignedGPUs = "CUDA0,CUDA1"
AuthenticatedIdentity = "condor_pool@atlas.local"
AuthenticationMethod = "PASSWORD"
[...]

Any of the given keys/values can be used as a filter!
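
For example, to list only the partitionable slots that have GPUs assigned (using the AssignedGPUs attribute shown above):

condor_status -constraint 'PartitionableSlot =?= True && AssignedGPUs =!= undefined'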

What resources are currently available?

condor_status -constraint 'PartitionableSlot =?= True && Cpus > 0' \
  -autoformat:V Cpus "Memory/1024" GPUs CUDADeviceName lscpu_model_name | \
  sort | uniq -c | sort -r -g -k2,2 | head -n 7

 1 96 450 undefined undefined                "AMD EPYC 7452 32-Core Processor"
37 32 182 8         "GeForce RTX 2070 SUPER" "Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz"
 1 32 182 undefined undefined                "Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz"
 1 32 167 8         "GeForce RTX 2070 SUPER" "Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz"
 4 31 260 undefined undefined                "AMD EPYC 7452 32-Core Processor"
 2 31 206 undefined undefined                "AMD EPYC 7452 32-Core Processor"
 5 24 152 undefined undefined                "Intel(R) Xeon(R) CPU E5-2658 v4 @ 2.30GHz"

Targeting: CPU

To only run on a specific Intel CPU:

requirements = (lscpu_model_name =?= "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz")

or require a CPU with AVX2 extension and at least ~2.5GHz

requirements = (lscpu_has_avx2 =?= true && lscpu_bogomips > 4900)

Targeting: GPU

Target GPU based on CUDA driver version and capability:

request_cpus = 2
request_gpus = 1
requirements = (CUDACapability =!= undefined &&
                CUDACapability >= 7.5 &&
                CUDADriverVersion >= 10.2)

or simply by type:

[...]
requirements = (CUDADeviceName =?= "GeForce RTX 2070 SUPER")

Usage of machine class ads

Using an optimized binary for the target node

executable = /path/to/compiled/$(lscpu_vendor_id)/executable

will use either AuthenticAMD or GenuineIntel on Atlas.

Store log files by machine name first

output = /path/to/logs/$(Machine)-$(ClusterId)-$(ProcId).out

Final examples: special submit expressions

All of the following are possible (cf. the HTCondor submit documentation)

change one variable, keep everything else the same

[...]
initialdir     = run_1
queue

initialdir     = run_2
queue

Final examples: special submit expressions

Start one job per .dat file in the submit directory

[...]
transfer_input_files = $(filename)
arguments            = -infile $(filename)
queue filename matching files *.dat

Final examples: special submit expressions

Submit 3 jobs where $(Item) will be 7, 8 or 9.

[...]
arguments = -a -b -c $(Item)
queue from seq 7 9 |

Final examples: special submit expressions

Submit jobs based on the given tuples

[...]
queue input,arguments from (
  file1, -a -b 26
  file2, -c -d 92
)

Changes on the horizon

We have quite a number of changes still on our long to-do list.

Some will be more visible, some may be completely invisible to you, but all of them will use up our time!

Phase out HSM

  • Our current HSM software stack reaches its end of support in 2021
  • Currently preparing the move of old users to an external entity
  • This will probably also mean we may merge /work and /home again
  • Need to spread the load across even more machines
  • Aim for 1-2 users per physical machine to reduce bad interactions

Centralized /software service

  • Slimming down compute nodes
  • Provide various software stacks as opt-ins (probably via spack + modules; hypothetical sketch after this list)
  • Reinhard already prepared /software/cw
  • From the admin side this is still in its infancy
  • If other working groups are interested, please talk to us!
  • Straightforward to set up an rsync target for central build artifacts, images, …
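
Purely as a hypothetical sketch of what such an opt-in could look like once it exists (module path and module name are made up):

module use /software/cw/modulefiles
module avail
module load lalsuite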

Promote/evaluate singularity use

  • We would like scientists to investigate the use of Singularity images (rough sketch after this list)
  • Could be a well-defined starting point for summer interns
  • Could serve as the basis for a static production environment
  • Potentially useful for re-runs years later!
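
A rough sketch of the idea (image name, definition file and command are placeholders):

# build the image once and keep it next to the analysis
singularity build my_env.sif my_env.def

# re-run the analysis years later inside the identical environment
singularity exec my_env.sif python3 run_analysis.py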

Provide web based user auth management

  • Currently, there are only limited possibilities for collaborating with external people
  • The plan is to have a web service which allows “easy” creation of groups
  • This can be tied into sharing information via our web server

Misc other stuff

  • renew cluster-internal monitoring along with better user-facing dashboards
  • use ssh certificates
    • host-based, to get rid of nagging “host key has changed” messages
    • user-based, to ensure your key does not get used by someone else (2FA)

Questions?