Atlas Upgrade and Changes

Talk & Discussion at GW group meeting 2020-07-08

Atlas Admin Team

https://www.atlas.aei.uni-hannover.de/userdoc/buster/

Topics to cover quickly

  1. Adding 500 new nodes

  2. OS upgrade

  3. Good-bye gsissh - welcome openssh

  4. Head nodes: New naming convention and roles

  5. Condor job submission

  6. How to make better use of Condor

  7. Changes on the horizon

New nodes

  • Last year, we procured ~500 new machines
  • 444 Gigabyte nodes with many CPU cores and plenty of RAM
  • 44 Supermicro nodes with 8 GPUs each

444 new CPU nodes

  • 1U Gigabyte servers
  • 2 AMD CPUs with 32 cores each (twice that with SMT/hyperthreading)
  • 512 GByte RAM per machine
  • set up as a single partitionable slot, i.e. request_cpus = 128 is possible (but maybe not advisable; sketch below)
  • still single HDD and 1 Gbit/s networking (should still be good enough)
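
Since each node is a single partitionable slot, a job simply requests what it needs and Condor carves out a matching sub-slot. A minimal sketch of a submit file for the new CPU nodes (executable name and resource numbers are placeholders):

executable     = my_analysis
request_cpus   = 16
request_memory = 64 GB
queue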

44 new GPU nodes

  • 4U Supermicro servers
  • 2 Intel CPUs with 8 cores each (twice that with SMT/hyperthreading)
  • 192 GByte RAM per machine (double that for the special V100 node)
  • set up as a single partitionable slot, i.e. request_gpus = 3 is possible
  • also single HDD and 1 Gbit/s networking

OS upgrade to Debian buster

  • new nodes already installed with Debian 10 (buster)
  • old nodes will follow soon, as Debian 8 (jessie) is out of security support
  • buster is the last OS version with Python 2 support
  • much newer versions of main system components (gcc, libc, …)
  • may require recompiling/relinking your code
  • next update cycle will probably be much shorter:
    • Debian 11 (bullseye) release probably mid 2021
    • the python executable will vanish, only python3 remains!
    • upgrade hopefully smoother than this time

No grid certificates anymore!!

Globus became paid service

  • NSF funding of the Globus project ceased a few years ago
  • Globus moved to a paid service model
  • the GSI-openssh patch was (temporarily) abandoned in 2018
  • a new fork was created by the Grid Community Forum
  • the current status is unclear, as is the availability of packages

Back to regular openssh

Steps for migration:

  1. create a dedicated ssh key-pair for Atlas use
  2. send us the public key and we will add it to Atlas
  3. set up your local ~/.ssh/config
  4. use ssh-agent to avoid typing your passphrase for every login (sketch below)

Important: Do this sooner rather than later!
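
A minimal sketch of steps 1 and 4 on your local machine (the key file name is only a suggestion; the docs linked below have the full instructions):

# 1. create a dedicated key-pair, protected by a passphrase
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_atlas

# 4. load the key into ssh-agent once, then log in without retyping the passphrase
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519_atlas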

Instructions in our docs

https://www.atlas.aei.uni-hannover.de/userdoc/topic/gettingstarted/ssh-keypair

Important:

You need to specify your username (explicitly or via ~/.ssh/config)

ssh user.name@condor1.atlas.aei.uni-hannover.de
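
A minimal sketch of a matching ~/.ssh/config entry (the host alias and key file name are placeholders):

Host atlas-condor1
    HostName     condor1.atlas.aei.uni-hannover.de
    User         user.name
    IdentityFile ~/.ssh/id_ed25519_atlas

after which ssh atlas-condor1 is enough.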

Changing naming convention for head nodes

Right now:

  • each head node is used both for interactive work and for condor submission
  • some machines use /work, some /home as the initial $HOME
  • this setup is prone to cause errors and problems

Interactive head nodes

  • remote workstations
  • named atlasX as before
  • no access to condor anymore!
  • /home as initial home directory
  • /work still available in parallel

Condor-centric head nodes

  • job submission only possible on head nodes named condorX
  • /work as initial home directory
  • /home still available in parallel
  • partition compute nodes into three pools:
    • main pool (submit from condor1, condor2, …)
    • parallel pool (submit from condor8)
    • interactive job pool (submit from condor9)

Regular submit hosts: condor1, …

  • submission of regular jobs
  • vanilla, scheduler and local universes are possible
  • limited to about 30,000 jobs per submit machine
  • will start with 4 submit nodes
  • will probably expand to 5 or 6 soon

Special submit host: condor8

  • only parallel universe jobs can be submitted here
  • dedicated set of nodes attached to this host (“DedicatedScheduler”)
  • will accept killable jobs from the main pool, if idle
  • parallel jobs will always have priority over overflow jobs

Special submit host: condor9

  • only interactive jobs can be submitted here
  • former “dev” nodes will be attached here, i.e. a wild mix of old and new hardware
  • can be used to
    • run test jobs on various machine types
    • get more resources quickly (e.g. a remote Jupyter session)
  • will accept killable jobs from the main pool, if idle
  • interactive jobs will always be favored over overflow jobs
  • please do NOT bypass Condor!

Overview

[Diagram: condor submit hosts]

Job submission

Changes and Information:

  • accounting tags
  • killable jobs
  • interactive jobs
  • run times

Everyone using Condor should at least read HTCondor’s user manual. Seriously! There is a lot of information to discover!

[Diagram: new structure]

Accounting tags have changed!

Before, e.g.

aei.sim.description101

Now, it must match

/^(admin|boinc|burst|cbc|cw|gamma|radio)\.
(back|dev|imp|prod|test)\.
[a-zA-Z0-9_-]+$/x

e.g.

radio.prod.arecibo_follow_up
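
In a submit file this would be set as usual, e.g. (a sketch, assuming the tag is passed via HTCondor's standard accounting_group command):

accounting_group = radio.prod.arecibo_follow_up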

(search) group first

The first part is simply the appropriate group

  • admin
  • boinc
  • burst
  • cbc
  • cw
  • gamma
  • radio

We will be able to quickly add more groups if PIs need them!

individual priority factor

The second part selects your relative priority factor (rough worked example below):

  • imp: 10
  • prod: 100
  • test: 1,000
  • dev: 100,000
  • back: 10,000,000
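
As a rough worked example (assuming HTCondor's usual scheme, where the effective user priority is the real user priority multiplied by this factor and lower values win):

effective priority (prod job) = real user priority x 100
effective priority (back job) = real user priority x 10,000,000

So, for the same usage history, back jobs effectively only get resources when jobs with smaller factors are not asking for them.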

custom identifier

The final part is almost free-form; use it to describe the jobs/pipeline!

[a-zA-Z0-9_-]+

For example,

GW201224_follow_up
GRB1234-tests
awesome-catalogue

effect on user priorities

Derived values for user priority calculation have changed!

Before

aei.dev.r1b2.user.name@atlas.local

Now

radio.prod.user.name@atlas.local

This means that using suffixes on the description part will no longer reset the user priority for this accounting tag!

Killable jobs

If you have small jobs which

  • run for less than an hour, or
  • write small checkpoints and can automatically restart from them

add +KillableJob = true to your submit file.

These jobs may run on idle machines but may be readily killed when resources are needed.

Hence, only use this flag where appropriate!
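
A minimal sketch of a submit file with the flag set (executable and resources are placeholders):

executable     = short_task
request_cpus   = 1
request_memory = 2 GB

# this job may be killed as soon as the resources are needed elsewhere
+KillableJob   = true

queue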

Interactive jobs

  • Available on all submit nodes except condor8
  • Create a regular submit file (see the sketch below)
  • Target the wanted machine via request_cpus, request_gpus, request_memory and requirements
  • Omit the queue keyword
  • Submit the job via condor_submit -interactive
  • Wait
  • Once scheduled, Condor will log you into the remote machine
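
A minimal sketch (the requested resources are placeholders):

# interactive.sub -- note: no queue statement
request_cpus   = 4
request_memory = 16 GB
requirements   = (lscpu_has_avx2 =?= true)

and then

condor_submit -interactive interactive.sub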

Interactive jobs

Things to note:

  • Automatic logout after being idle for 2 hours
  • Consider using tmux or screen on the submit host when submitting
  • Please do not block resources just because you want to hold on to them!

Predicting run times will be hard

Predicting run times of similar jobs will become next to impossible:

  • heterogeneous CPU model mix
  • SMT/Hyperthreading
  • Turbo boost
  • NUMA

All these may lead to job run times varying a lot!

For early testers:

we will soon remove the +MaxRunTimeHours requirement again

Condor tutorial/examples

Atlas has become quite heterogeneous:

  • many different CPUs (2011-2019, 2.0 GHz - 3.6 GHz)
  • RAM sizes from 16 GByte to 512 GByte/machine
  • small, large or no GPUs at all

Need to distinguish between

  • partitionable slot - currently available resources (slot1@host)
  • partitioned (sub)slot - resources allocated to a running job (slot1_31@host)

Asking Condor about available machines by CPU

condor_status \
  -constraint 'PartitionableSlot =?= True' \
  -autoformat lscpu_model_name | sort | \
  uniq -c | sort -g
# of machines | CPU types (machine may have multiple CPUs)
     1          AMD EPYC 7551 32-Core Processor
     1          AMD Opteron(tm) Processor 6136
     1          Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
     1          Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
     1          Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
     1          Intel(R) Xeon(R) E-2134 CPU @ 3.50GHz
    43          Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz
    45          Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
   264          Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz
   400          AMD EPYC 7452 32-Core Processor
   495          Intel(R) Xeon(R) CPU E5-2658 v4 @ 2.30GHz
  1669          Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz

Which machines are online?

This is especially important for the interactive pool behind condor9!

List number of connected, online machines:

condor_status -constraint 'PartitionableSlot =?= True'

Name                    OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@a3102.atlas.local LINUX      X86_64 Unclaimed Idle      0.000 117884  0+00:13:50
slot1@a3104.atlas.local LINUX      X86_64 Unclaimed Idle      0.000  28631 41+23:39:56
slot1@a3108.atlas.local LINUX      X86_64 Unclaimed Idle      0.000  28631 41+23:39:24
slot1@a3112.atlas.local LINUX      X86_64 Unclaimed Idle      0.000  28631 38+18:33:53
[...]

Get all details from a node

condor_status -long slot1@a3102.atlas.local

AcceptedWhileDraining = false
Activity = "Idle"
AddressV1 = "{[ p=\"primary\"; a=\"10.10.31.2\"; port=9618; n=\"Internet\"; [..]]}"
Arch = "X86_64"
AssignedGPUs = "CUDA0,CUDA1"
AuthenticatedIdentity = "condor_pool@atlas.local"
AuthenticationMethod = "PASSWORD"
[...]

Any of the given keys/values can be used as a filter!
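
For example, to list only the partitionable slots that have GPUs assigned (using the AssignedGPUs attribute shown above):

condor_status -constraint 'PartitionableSlot =?= True && AssignedGPUs =!= undefined'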

What resources are currently available?

condor_status -constraint 'PartitionableSlot =?= True && Cpus > 0' \
  -autoformat:V Cpus "Memory/1024" GPUs CUDADeviceName lscpu_model_name | \
  sort | uniq -c | sort -r -g -k2,2 | head -n 7

 1 96 450 undefined undefined                "AMD EPYC 7452 32-Core Processor"
37 32 182 8         "GeForce RTX 2070 SUPER" "Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz"
 1 32 182 undefined undefined                "Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz"
 1 32 167 8         "GeForce RTX 2070 SUPER" "Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz"
 4 31 260 undefined undefined                "AMD EPYC 7452 32-Core Processor"
 2 31 206 undefined undefined                "AMD EPYC 7452 32-Core Processor"
 5 24 152 undefined undefined                "Intel(R) Xeon(R) CPU E5-2658 v4 @ 2.30GHz"

Targeting: CPU

To only run on a specific Intel CPU:

requirements = (lscpu_model_name =?= "Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz")

or require a CPU with AVX2 extension and at least ~2.5GHz

requirements = (lscpu_has_avx2 =?= true && lscpu_bogomips > 4900)

Targeting: GPU

Target GPU based on CUDA driver version and capability:

request_cpus = 2
request_gpus = 1
requirements = (CUDACapability =!= undefined &&
                CUDACapability >= 7.5 &&
                CUDADriverVersion >= 10.2)

or simply by type:

[...]
requirements = (CUDADeviceName =?= "GeForce RTX 2070 SUPER")

Usage of machine class ads

Using an optimized binary for the target node

executable = /path/to/compiled/$(lscpu_vendor_id)/executable

will use either AuthenticAMD or GenuineIntel on Atlas.

Store log files by machine name first

output = /path/to/logs/$(Machine)-$(ClusterId)-$(ProcId).out

Final examples: special submit expressions

All of the following are possible (cf. the HTCondor submit documentation)

change one variable, keep everything else the same

[...]
initialdir     = run_1
queue

initialdir     = run_2
queue

Final examples: special submit expressions

Start one job per .dat file in the submit directory

[...]
transfer_input_files = $(filename)
arguments            = -infile $(filename)
queue filename matching files *.dat

Final examples: special submit expressions

Submit 3 jobs where $(Item) will be 7, 8 or 9.

[...]
arguments = -a -b -c $(Item)
queue from seq 7 9 |

Final examples: special submit expressions

Submit jobs based on the given tuples

[...]
queue input,arguments from (
  file1, -a -b 26
  file2, -c -d 92
)

Changes on the horizon

We have quite a number of changes still on our long to-do list.

Some will be more visible, some may be completely invisible to you, but all of them will use up our time!

Phase out HSM

  • Our current HSM software stack reaches its end of support in 2021
  • Currently preparing the move of old users to an external entity
  • This will probably also mean we may merge /work and /home again
  • Need to spread the load across even more machines
  • Aim for 1-2 users per physical machine to reduce bad interactions

Centralized /software service

  • Slimming down compute nodes
  • Provide various software stacks as opt-ins (probably via spack + modules; hypothetical sketch after this list)
  • Reinhard already prepared /software/cw
  • From the admin side this is still in its infancy
  • If other working groups are interested, please talk to us!
  • Straightforward to set up an rsync target for central build artifacts, images, …
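
Purely as a hypothetical sketch of what such an opt-in could look like once it exists (module path and module name are made up):

module use /software/cw/modulefiles
module avail
module load lalsuite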

Promote/evaluate singularity use

  • We would like scientists to investigate the use of Singularity images (rough sketch after this list)
  • Could be a well-defined starting point for summer interns
  • Could serve as the basis for a static production environment
  • Potentially useful for re-runs years later!
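
A rough sketch of the idea (image name, definition file and command are placeholders):

# build the image once and keep it next to the analysis
singularity build my_env.sif my_env.def

# re-run the analysis years later inside the identical environment
singularity exec my_env.sif python3 run_analysis.py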

Provide web based user auth management

  • Currently, there are only limited possibilities for collaborating with external people
  • The plan is to have a web service which allows “easy” creation of groups
  • This can be tied into sharing information via our web server

Misc other stuff

  • renew cluster-internal monitoring along with better user-facing dashboards
  • use ssh certificates
    • host-based, to get rid of nagging “host key has changed” messages
    • user-based, to ensure your key does not get used by someone else (2FA)

Questions?