Slurm: jobs are pending even though resources are available
We have started to work with Slurm. We are operating a cluster with a number of nodes with 4 GPUs each, and some nodes with only CPUs. We would like to start jobs using GPUs with higher priority. Therefore, we have 2 partitions, however, with overlapping node lists. The partition with GPUs, called 'batch', has a higher 'prioritytier' value. The partition without GPUs is called 'cpubatch'.
The main reason for this construction is that we want to use the idle CPUs on the nodes with GPUs, if they are not needed by GPU jobs.
Now we encounter the problem that jobs in the 'cpubatch' partition do not start on nodes where jobs of the 'batch' partition are running, even if there are sufficiently many CPUs idle on these nodes.
Here is our slurm.conf file:
controlmachine=qbig
authtype=auth/none
cryptotype=crypto/openssl
jobcredentialprivatekey=/qbigwork/slurm/etc/slurm.key
jobcredentialpubliccertificate=/qbigwork/slurm/etc/slurm.cert
mailprog=/qbigwork/slurm/etc/mailwrapper.sh
mpidefault=none
mpiparams=ports=12000-12999
returntoservice=1
slurmctldpidfile=/var/run/slurmctld.pid
slurmctldport=6817
slurmdpidfile=/var/run/slurmd.pid
slurmdport=6818
slurmdspooldir=/var/spool/slurmd
slurmuser=slurm
statesavelocation=/var/spool/slurm.state/
switchtype=switch/none
taskplugin=task/cgroup
#
# timers
inactivelimit=0
killwait=30
minjobage=300
slurmctldtimeout=120
slurmdtimeout=300
waittime=0
#
# scheduling
defmempernode=1024
fastschedule=1
schedulertype=sched/backfill
selecttype=select/cons_res
selecttypeparameters=cr_cpu
#
# job priority
prioritytype=priority/multifactor
priorityfavorsmall=no
priorityweightjobsize=1000
priorityweightqos=0
#
# logging and accounting
accountingstoragetype=accounting_storage/none
accountingstorejobcomment=yes
clustername=cluster
jobcomptype=jobcomp/none
jobacctgatherfrequency=30
jobacctgathertype=jobacct_gather/none
slurmctlddebug=3
slurmctldlogfile=/var/log/slurm/slurmctld.log
slurmddebug=3
slurmdlogfile=/var/log/slurm/slurmd.log
slurmschedlogfile=/var/log/slurm/slurmsched.log
slurmschedloglevel=3
#
# gres
grestypes=gpu,bandwidth
#
# compute nodes
nodename=lnode[01-12] cpus=8 realmemory=64525 sockets=2 corespersocket=4 threadspercore=1 gres=gpu:4
nodename=lcpunode01 cpus=32 realmemory=129174 sockets=2 corespersocket=8 threadspercore=2
nodename=qbig cpus=4 realmemory=40000 sockets=2 corespersocket=2 threadspercore=1
#
partitionname=batch maxtime=48:00:00 nodes=lnode[01-12] prioritytier=50000 defaulttime=30 oversubscribe=no maxnodes=12 state=up
partitionname=cpubatch maxtime=48:00:00 nodes=lnode[01-12],qbig,lcpunode01 default=yes prioritytier=5000 defaulttime=30 oversubscribe=no maxnodes=14 state=up

And here is the gres.conf file:
nodename=lnode[01-12] name=gpu file=/dev/nvidia[0-3]
nodename=default name=bandwidth type=lustre count=4m

We have a freshly compiled Slurm 17.02.7. 'squeue' gives, for instance:
$> squeue
[...]
1030 batch    swc_a2p1 user1 pd 0:00 1 (priority)
1029 batch    swc_a2p1 user1 pd 0:00 1 (resources)
951  cpubatch 002_e_11 user2 pd 0:00 1 (resources)
1062 batch    swc_a2p1 user1 pd 0:00 1 (priority)
[...]

But e.g. on lnode[02-12] there are resources available:
$> scontrol show node lnode02
nodename=lnode02 arch=x86_64 corespersocket=4
cpualloc=4 cpuerr=0 cputot=8 cpuload=4.06
availablefeatures=(null)
activefeatures=(null)
gres=gpu:4
nodeaddr=lnode02 nodehostname=lnode02 version=17.02
os=linux realmemory=64525 allocmem=0 freemem=38915 sockets=2 boards=1
state=mixed threadspercore=1 tmpdisk=0 weight=1 owner=n/a mcs_label=n/a
partitions=batch,cpubatch
boottime=2016-09-14t11:55:35 slurmdstarttime=2017-09-11t13:49:36
cfgtres=cpu=8,mem=64525m
alloctres=cpu=4
capwatts=n/a
currentwatts=0 lowestjoules=0 consumedjoules=0
extsensorsjoules=n/s extsensorswatts=0 extsensorstemp=n/s

And job 951 asks for 4 CPUs:
$> scontrol show job 951
jobid=951 jobname=002_e_110000
userid=user2(1416) groupid=theorie(149) mcs_label=n/a
priority=50 nice=0 account=(null) qos=(null)
jobstate=pending reason=resources dependency=(null)
requeue=1 restarts=0 batchflag=1 reboot=0 exitcode=0:0
runtime=00:00:00 timelimit=2-00:00:00 timemin=n/a
submittime=2017-09-11t11:41:01 eligibletime=2017-09-11t11:41:01
starttime=2017-09-12t13:31:30 endtime=2017-09-14t13:31:30 deadline=n/a
preempttime=none suspendtime=none secspresuspend=0
partition=cpubatch allocnode:sid=qbig:30138
reqnodelist=(null) excnodelist=(null)
nodelist=(null) schednodelist=lcpunode01
numnodes=1-1 numcpus=4 numtasks=1 cpus/task=4 reqb:s:c:t=0:0:*:*
tres=cpu=4,mem=1024,node=1
socks/node=* ntaskspern:b:s:c=1:0:*:* corespec=*
mincpusnode=4 minmemorynode=1g mintmpdisknode=0
features=(null) delayboot=00:00:00
gres=(null) reservation=(null)
oversubscribe=ok contiguous=0 licenses=(null) network=(null)
command=/hiskp2/user2/testtoy/corr.sh
workdir=/hiskp2/user2/testtoy
stderr=/hiskp2/user2/testtoy/test.%j.out
stdin=/dev/null
stdout=/hiskp2/user2/testtoy/test.%j.out
power=

The pending GPU job looks as follows:
$> scontrol show job 1029
jobid=1029 jobname=swc_a2p1_mpi270_l24t96_strange_0589_1
userid=use1(1407) groupid=theorie(149) mcs_label=n/a
priority=50 nice=0 account=(null) qos=(null)
jobstate=running reason=none dependency=(null)
requeue=1 restarts=0 batchflag=1 reboot=0 exitcode=0:0
runtime=00:02:49 timelimit=07:00:00 timemin=n/a
submittime=2017-09-11t15:37:40 eligibletime=2017-09-11t15:37:40
starttime=2017-09-12t12:43:34 endtime=2017-09-12t19:43:34 deadline=n/a
preempttime=none suspendtime=none secspresuspend=0
partition=batch allocnode:sid=qbig:12473
reqnodelist=(null) excnodelist=(null)
nodelist=lnode06
batchhost=lnode06
numnodes=1 numcpus=4 numtasks=4 cpus/task=1 reqb:s:c:t=0:0:*:*
tres=cpu=4,mem=25g,node=1
socks/node=* ntaskspern:b:s:c=4:0:*:* corespec=*
mincpusnode=4 minmemorynode=25g mintmpdisknode=0
features=(null) delayboot=00:00:00
gres=gpu:4 reservation=(null)
oversubscribe=ok contiguous=0 licenses=(null) network=(null)
command=/hiskp2/user1/peram_generation/0120-mpi270-l24-t96/strange/cnfg0589/rnd_vec_01/quda.job.slurm.0589_01.cmd
workdir=/hiskp2/user1/peram_generation/0120-mpi270-l24-t96/strange/cnfg0589/rnd_vec_01
stderr=/hiskp2/user1/peram_generation/0120-mpi270-l24-t96/strange/cnfg0589/rnd_vec_01/slurm-1029.out
stdin=/dev/null
stdout=/hiskp2/user1/peram_generation/0120-mpi270-l24-t96/strange/cnfg0589/rnd_vec_01/slurm-1029.out
power=

Please provide hints, or another solution for giving priority to GPU jobs. Thanks!
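To make the "resources are available" claim concrete, here is a minimal sketch that does the CPU arithmetic from the 'scontrol' output quoted in this post. The helper names (parse_scontrol, idle_cpus) are hypothetical and are not part of Slurm, and the input strings are only excerpts of the output for lnode02 and job 951. It compares CPU counts only; it does not model priority tiers, backfill, memory, or GRES, so it cannot say why the scheduler holds the job.

```python
def parse_scontrol(text):
    """Turn whitespace-separated key=value tokens into a dict.

    Hypothetical helper: splits each token at its first '=' only,
    so a value such as 'cpu=4' (from alloctres=cpu=4) stays intact.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key.lower()] = value
    return fields


def idle_cpus(node_fields):
    """CPUs configured on the node minus CPUs currently allocated."""
    return int(node_fields["cputot"]) - int(node_fields["cpualloc"])


# Excerpts copied from the 'scontrol show node' and 'scontrol show job' output.
node_text = "nodename=lnode02 cpualloc=4 cpuerr=0 cputot=8 state=mixed"
job_text = "jobid=951 partition=cpubatch numnodes=1-1 numcpus=4 mincpusnode=4"

node = parse_scontrol(node_text)
job = parse_scontrol(job_text)

free = idle_cpus(node)
needed = int(job["mincpusnode"])
fits = free >= needed  # CPU count only; says nothing about scheduling policy

print(f"{node['nodename']}: {free} idle CPUs, job {job['jobid']} needs {needed}")
```

By CPU count alone, lnode02 has 4 idle CPUs and job 951 needs 4, which is why we expected the job to start there.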