DoubleSin 2020-03-01 17:44 采纳率: 0%
浏览 334

启用docker后,nodemanagers无法启动

想搭建一个环境使用GPU来跑tensorflow任务,添加了yarn 使用 GPU 的配置之后,修改了hadoop的配置

在 container-executor.cfg 中添加配置:

[docker]

docker.allowed.volume-drivers=/usr/bin/nvidia-docker
docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
docker.allowed.ro-mounts=nvidia_driver_375.26

[gpu]
module.enabled=true

[cgroups]
# /sys/fs/cgroup是cgroup的mount路径
# /hadoop-yarn是yarn在cgroup路径下默认创建的path
root=/sys/fs/cgroup
yarn-hierarchy=/hadoop-yarn

然后启动nodemanager失败,日志如下:

2020-03-01 09:21:07,116 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Controller devices not mounted. You either need to mount it with yarn.nodemanager.linux-container-executor.cgroups.mount or mount cgroups before launching Yarn
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:392)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:370)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:93)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:146)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
2020-03-01 09:21:07,118 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler failed in state INITED
java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:150)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
2020-03-01 09:21:07,118 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at oorg.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:150)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        ... 8 more
2020-03-01 09:21:07,119 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:150)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        ... 8 more
2020-03-01 09:21:07,120 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2020-03-01 09:21:07,120 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2020-03-01 09:21:07,120 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2020-03-01 09:21:07,120 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:150)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        ... 8 more
2020-03-01 09:21:07,124 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at 4c2cc1e6479d/172.17.0.6
************************************************************/

另外,在本地也做了docker的镜像来验证环境,在本地的docker到时没有什么问题,可以正常启动服务,将镜像上传到服务器测试的时候,遇到的也还是同一个问题,也不知道是不是cgroup挂载的问题,有没有大神解答一二,感谢感谢。

  • 写回答

0条回答 默认 最新

      报告相同问题?

      悬赏问题

      • ¥15 下载CSGO服务端时出错
      • ¥20 esp8266知心天气blinker上面显示
      • ¥20 python代码的实现讲解
      • ¥20 python代码实现,讲解
      • ¥30 六足机器人硬件配置问题
      • ¥15 uniapp ios端配置universal link无法唤起app
      • ¥70 java通过ISUP SDK开发连接海康威视获取PS封装流如何推送到web页面
      • ¥15 并且拖拽过去在picturebox上的数据也可以实现拖拽换位置的功能,
      • ¥15 AndroidStudio:fragment里的content怎么写?
      • ¥30 关于#vba数组#的问题,如何解决?