I want to use srun to submit a 4-node MPI program:
[root@mu01 MPI_IniteDiff3]# srun -N 4 -n 4 -p gpu --gres=gpu:1 ./test
srun: Required node not available (down, drained or reserved)
srun: job 289 queued and waiting for resources
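While the job sits in the queue, Slurm can report why it is still pending. A check along these lines should show the reason (job ID 289 is taken from the output above; the job must still be queued when this runs):

```shell
# %R prints the reason a job is pending (e.g. which nodes are down/reserved)
squeue -j 289 -o "%.8i %.9P %.8T %R"
```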
So I checked with sinfo:
[root@mu01 MPI_IniteDiff3]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu* up infinite 4 down* cu[01-04]
The nodes are in the down state rather than idle. After searching online, I found this command:
[root@mu01 ~]# scontrol update NodeName=cu[01-04] State=idle
[root@mu01 ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu* up infinite 4 idle* cu[01-04]
The state did change to idle, but rerunning the srun command above gave the same error (and the state seems to flip back to down on its own after a while). So here is the detailed status of a single node for reference:
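For reference, the Slurm documentation lists RESUME (not idle) as the value for returning DOWN or DRAINED nodes to service; either way the change only sticks if the controller can actually reach slurmd on the node again:

```shell
# RESUME clears the DOWN/DRAIN flag and lets the node register normally;
# if slurmd on the node is still unreachable, it will go DOWN again.
scontrol update NodeName=cu[01-04] State=resume
```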
[root@mu01 ~]# scontrol show node
NodeName=cu01 CoresPerSocket=14
CPUAlloc=0 CPUErr=0 CPUTot=28 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:2
NodeAddr=192.168.100.101 NodeHostName=cu01
RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=gpu
BootTime=None SlurmdStartTime=None
CfgTRES=cpu=28,mem=1M
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [slurm@2018-05-30T14:18:24]
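From what I've read, Reason=Not responding (together with CPULoad=N/A, FreeMem=N/A, RealMemory=1, and SlurmdStartTime=None above) means the controller has never heard from slurmd on the compute node. Checks along these lines might narrow it down (the service name and log path assume a typical systemd install; adjust to your slurm.conf):

```shell
# On the compute node (e.g. cu01): is slurmd running at all?
systemctl status slurmd
# If not, start it and look at the log for registration errors
systemctl start slurmd
tail -n 50 /var/log/slurmd.log    # actual path is set by SlurmdLogFile in slurm.conf

# From the head node: is the node reachable, and does munge auth work?
ping -c 3 192.168.100.101
munge -n | ssh cu01 unmunge       # both sides must share the same munge key
```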
Could anyone take a look and give me some pointers? I'm brand new to clusters and the Slurm scheduler. Thanks, everyone!