DoubleSin 2020-03-01 17:44 采纳率: 0%
浏览 336

启用docker后,nodemanagers无法启动

想搭建一个环境使用GPU来跑tensorflow任务,添加了yarn 使用 GPU 的配置之后,修改了hadoop的配置

在 container-executor.cfg 中添加配置:

[docker]

docker.allowed.volume-drivers=/usr/bin/nvidia-docker
docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
docker.allowed.ro-mounts=nvidia_driver_375.26

[gpu]
module.enabled=true

[cgroups]
# /sys/fs/cgroup是cgroup的mount路径
# /hadoop-yarn是yarn在cgroup路径下默认创建的path
root=/sys/fs/cgroup
yarn-hierarchy=/hadoop-yarn

然后启动nodemanager失败,日志如下:

2020-03-01 09:21:07,116 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Controller devices not mounted. You either need to mount it with yarn.nodemanager.linux-container-executor.cgroups.mount or mount cgroups before launching Yarn
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:392)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:370)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:93)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:146)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
2020-03-01 09:21:07,118 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler failed in state INITED
java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:150)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
2020-03-01 09:21:07,118 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at oorg.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:150)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        ... 8 more
2020-03-01 09:21:07,119 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:150)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        ... 8 more
2020-03-01 09:21:07,120 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2020-03-01 09:21:07,120 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2020-03-01 09:21:07,120 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2020-03-01 09:21:07,120 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:323)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:516)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:974)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1054)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler.serviceInit(ContainerScheduler.java:150)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        ... 8 more
2020-03-01 09:21:07,124 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at 4c2cc1e6479d/172.17.0.6
************************************************************/

另外,在本地也做了docker的镜像来验证环境,在本地的docker到时没有什么问题,可以正常启动服务,将镜像上传到服务器测试的时候,遇到的也还是同一个问题,也不知道是不是cgroup挂载的问题,有没有大神解答一二,感谢感谢。

  • 写回答

0条回答 默认 最新

    报告相同问题?

    悬赏问题

    • ¥20 C#调用虚拟键盘TabTip.exe
    • ¥15 Qt4代码实现下面的界面
    • ¥15 CCS离散化传递函数与仿真不一致
    • ¥15 prism提示我reinstall prism 如何解决
    • ¥15 asp.core 权限控制怎么做,需要控制到每个方法
    • ¥20 while循环中OLED显示中断中的数据不正确
    • ¥15 这个视频里的stm32f4代码是怎么写的
    • ¥15 JNA调用DLL报堆栈溢出错误(0xC00000FD)
    • ¥15 请教SGeMs软件的使用
    • ¥15 自己用vb.net编写了一个dll文件,如何只给授权的用户使用这个dll文件进行打包编译,未授权用户不能进行打包编译操作?