Tchaos2023 2023-10-14 16:11
85 views
Closed

slurm slurmctld Active: failed

Slurm, slurmctld status failed
Could someone help me figure out why slurmctld is in the failed state? It installed fine yesterday, but today it won't start.
slurmd and slurmdbd are both running normally.

slurmctld.service - Slurm controller daemon
Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sat 2023-10-14 06:39:17 UTC; 1h 20min ago
Process: 6988 ExecStart=/opt/slurm/23.02.6/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 6988 (code=exited, status=1/FAILURE)

Oct 14 06:39:17 DUT7152ATSM systemd[1]: Started Slurm controller daemon.
Oct 14 06:39:17 DUT7152ATSM slurmctld[6988]: slurmctld: error: Configured MailProg is invalid
Oct 14 06:39:17 DUT7152ATSM slurmctld[6988]: slurmctld: slurmctld version 23.02.6 started on cluster cool
Oct 14 06:39:17 DUT7152ATSM slurmctld[6988]: slurmctld: fatal: Can not recover assoc_usage state, incompatible version, got 8704 need >= 9472 <= 9984, start with '-i' to ignore this. Warning>
Oct 14 06:39:17 DUT7152ATSM systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Oct 14 06:39:17 DUT7152ATSM systemd[1]: slurmctld.service: Failed with result 'exit-code'.
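The decisive line here is the fatal: "Can not recover assoc_usage state, incompatible version, got 8704 need >= 9472 <= 9984". A minimal sketch of what those numbers mean, assuming Slurm's usual protocol-version encoding (release index shifted left by 8 bits):

```shell
# Assumption: Slurm encodes its state/protocol version as (release index << 8).
# Decoding the three numbers from the fatal message:
echo "got:     $((8704 >> 8))"   # 34 -> state file written by Slurm 19.05
echo "need >=: $((9472 >> 8))"   # 37 -> 21.08, the oldest state 23.02 can read
echo "need <=: $((9984 >> 8))"   # 39 -> 23.02 itself
# The message's own suggestion, if the old usage state is expendable:
# stop the unit, start once in the foreground with -i to discard it, e.g.
#   /opt/slurm/23.02.6/sbin/slurmctld -D -i
# (this permanently drops the unreadable assoc_usage history)
```

In other words, the saved association-usage state under StateSaveLocation was written by a much older slurmctld than the 23.02.6 binary the unit is starting, so 23.02.6 refuses to load it.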


1 answer

  • Tchaos2023 2023-10-14 16:23

    Running the command netstat -tulpen | grep 6988 produces no output at all.
    Running slurmctld -Dvvv reports the errors below:
    slurmctld: debug: Log file re-opened
    slurmctld: pidfile not locked, assuming no running daemon
    slurmctld: error: Configured MailProg is invalid
    slurmctld: slurmctld version 19.05.5 started on cluster cool
    slurmctld: Munge credential signature plugin loaded
    slurmctld: debug: Munge authentication plugin loaded
    slurmctld: Cray/Aries node selection plugin loaded
    slurmctld: preempt/none loaded
    slurmctld: debug: Checkpoint plugin loaded: checkpoint/none
    slurmctld: debug: AcctGatherEnergy NONE plugin loaded
    slurmctld: debug: AcctGatherProfile NONE plugin loaded
    slurmctld: debug: AcctGatherInterconnect NONE plugin loaded
    slurmctld: debug: AcctGatherFilesystem NONE plugin loaded
    slurmctld: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
    slurmctld: debug: Job accounting gather NOT_INVOKED plugin loaded
    slurmctld: ExtSensors NONE plugin loaded
    slurmctld: debug: switch NONE plugin loaded
    slurmctld: debug: power_save module disabled, SuspendTime < 0
    slurmctld: Accounting storage NOT INVOKED plugin loaded
    slurmctld: debug: Recovered 8 tres
    slurmctld: debug: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
    slurmctld: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/etc/slurm-llnl/cgroup.conf)
    slurmctld: No memory enforcing mechanism configured.
    slurmctld: layouts: no layout to initialize
    slurmctld: topology NONE plugin loaded
    slurmctld: debug: No DownNodes
    slurmctld: debug: Log file re-opened
    slurmctld: sched: Backfill scheduler plugin loaded
    slurmctld: error: read_slurm_conf: default partition not set.
    slurmctld: route default plugin loaded
    slurmctld: layouts: loading entities/relations information
    slurmctld: debug: layouts: 1/1 nodes in hash table, rc=0
    slurmctld: debug: layouts: loading stage 1
    slurmctld: debug: layouts: loading stage 1.1 (restore state)
    slurmctld: debug: layouts: loading stage 2
    slurmctld: debug: layouts: loading stage 3
    slurmctld: Recovered state of 1 nodes
    slurmctld: Down nodes: DUT7152ATSM
    slurmctld: Recovered information about 0 jobs
    slurmctld: debug2: init_requeue_policy: kill_invalid_depend is set to 0
    slurmctld: debug: Updating partition uid access list
    slurmctld: Recovered state of 0 reservations
    slurmctld: State of 0 triggers recovered
    slurmctld: _preserve_plugins: backup_controller not specified
    slurmctld: Running as primary controller
    slurmctld: debug: No backup controllers, not launching heartbeat.
    slurmctld: debug: Priority BASIC plugin loaded
    slurmctld: No parameter for mcs plugin, default values set
    slurmctld: mcs: MCSParameters = (null). ondemand set.
    slurmctld: debug: mcs none plugin loaded
    slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
    slurmctld: debug: power_save mode not enabled
    slurmctld: debug: Spawning registration agent for DUT7152ATSM 1 hosts
    slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
    slurmctld: debug2: Tree head got back 0 looking for 1
    slurmctld: debug: slurm_recv_timeout at 0 of 4, recv zero bytes
    slurmctld: error: slurm_receive_msgs: Zero Bytes were transmitted or received
    slurmctld: debug2: Tree head got back 1
    slurmctld: agent/is_node_resp: node:DUT7152ATSM RPC:REQUEST_NODE_REGISTRATION_STATUS : Zero Bytes were transmitted or received
    slurmctld: debug: backfill: beginning
    slurmctld: debug: backfill: no jobs to backfill
    slurmctld: debug2: Testing job time limits and checkpoints
    slurmctld: Terminate signal (SIGINT or SIGTERM) received
    slurmctld: debug: sched: slurmctld terminating
    slurmctld: Saving all slurm state
    slurmctld: debug2: _purge_files_thread: starting, 0 jobs to purge
    slurmctld: debug: mcs none plugin fini
    slurmctld: debug: layouts/base: dumping 4 records into state file
    slurmctld: layouts: all layouts are now unloaded.
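Note a mismatch in the two logs above: the systemd unit starts /opt/slurm/23.02.6/sbin/slurmctld, which reports version 23.02.6, while the bare slurmctld -Dvvv run resolved to a different install reporting 19.05.5. Comparing the two version strings from the log excerpts makes that explicit (a sketch over the quoted log lines, not a live check):

```shell
# Version strings copied from the two log excerpts above.
unit_line="slurmctld: slurmctld version 23.02.6 started on cluster cool"
path_line="slurmctld: slurmctld version 19.05.5 started on cluster cool"
unit_ver=$(echo "$unit_line" | awk '{print $4}')
path_ver=$(echo "$path_line" | awk '{print $4}')
if [ "$unit_ver" != "$path_ver" ]; then
    echo "PATH slurmctld ($path_ver) is not the unit's binary ($unit_ver)"
fi
# On the actual host, the equivalent live check would be (paths may vary):
#   command -v slurmctld
#   slurmctld -V
#   /opt/slurm/23.02.6/sbin/slurmctld -V
```

If the 19.05.5 binary was run manually and saved state, that would also explain how the 23.02.6 controller ended up facing a state file with protocol version 8704 (19.05), i.e. the fatal error in the systemd log.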



Question events

  • Closed by the system on Oct 22
  • Question created on Oct 14