Alex_SCY 2021-11-30 19:38

TensorFlow 2.3 detects the GPU but fails at runtime

Problem description and background

All the packages are installed and the GPUs are detected, but the script will not run.
Linux version: Ubuntu 20.04
CUDA version: 10.1
TensorFlow version: 2.3.0
GPUs: two A100 compute cards

Relevant code (please do not paste screenshots)

The code I ran:

import tensorflow as tf
import time

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])


start = time.time()

model.fit(x_train, y_train, epochs=5)

end = time.time()

model.evaluate(x_test, y_test)
print(end - start)

Output and error messages

Main error message:

tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

(py38) yunhao@qwe-SYS-7049GP-TRT:~/Desktop/alex$ /home/yunhao/anaconda3/envs/py38/bin/python /home/yunhao/Desktop/alex/testingCode/1.py
2021-11-30 19:39:28.324346: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-11-30 19:39:30.110735: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-11-30 19:39:30.329530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:3b:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-11-30 19:39:30.331399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:af:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-11-30 19:39:30.331438: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-11-30 19:39:30.333522: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-11-30 19:39:30.335195: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-11-30 19:39:30.335477: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-11-30 19:39:30.337318: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-11-30 19:39:30.338334: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-11-30 19:39:30.342294: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-11-30 19:39:30.348777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1
2021-11-30 19:39:30.349369: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-30 19:39:30.365731: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2400000000 Hz
2021-11-30 19:39:30.367866: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f4c695b120 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-30 19:39:30.367890: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-11-30 19:39:30.579338: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f4c878f1b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-11-30 19:39:30.579397: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA A100-PCIE-40GB, Compute Capability 8.0
2021-11-30 19:39:30.579415: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): NVIDIA A100-PCIE-40GB, Compute Capability 8.0
2021-11-30 19:39:30.583335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:3b:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-11-30 19:39:30.586511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:af:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-11-30 19:39:30.586548: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-11-30 19:39:30.586578: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-11-30 19:39:30.586592: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-11-30 19:39:30.586607: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-11-30 19:39:30.586621: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-11-30 19:39:30.586636: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-11-30 19:39:30.586651: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-11-30 19:39:30.592484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1
2021-11-30 19:39:30.592526: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "/home/yunhao/Desktop/alex/testingCode/1.py", line 9, in <module>
    model = tf.keras.models.Sequential([
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 457, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/keras/engine/sequential.py", line 116, in __init__
    super(functional.Functional, self).__init__(  # pylint: disable=bad-super-call
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 457, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 308, in __init__
    self._init_batch_counters()
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 457, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 317, in _init_batch_counters
    self._train_counter = variables.Variable(0, dtype='int64', aggregation=agg)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 262, in __call__
    return cls._variable_v2_call(*args, **kwargs)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 244, in _variable_v2_call
    return previous_getter(
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 237, in <lambda>
    previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/ops/variable_scope.py", line 2633, in default_variable_creator_v2
    return resource_variable_ops.ResourceVariable(
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1507, in __init__
    self._init_from_args(
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1650, in _init_from_args
    initial_value = ops.convert_to_tensor(
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1499, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 263, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 275, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 300, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 97, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/home/yunhao/anaconda3/envs/py38/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 539, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
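
One possible reading of the log above, offered as a hypothesis rather than a confirmed diagnosis: both GPUs report computeCapability 8.0 (Ampere), while the runtime that was loaded is libcudart.so.10.1. NVIDIA added support for compute capability 8.0 in CUDA 11.0, so a CUDA 10.1 toolkit does not target this architecture, which would be consistent with "device kernel image is invalid". The sketch below extracts both values from a log like this one and compares them. The function names and the `MIN_CUDA_FOR_CC` table are illustrative, and the table covers only a few architectures.

```python
import re

# Minimum CUDA toolkit release that can target each compute capability.
# Partial table for illustration only; see NVIDIA's documentation for others.
MIN_CUDA_FOR_CC = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (GA10x)
}


def parse_log(log_text):
    """Return (set of compute capabilities, CUDA runtime version or None)."""
    ccs = {
        (int(major), int(minor))
        for major, minor in re.findall(r"computeCapability: (\d+)\.(\d+)", log_text)
    }
    m = re.search(r"libcudart\.so\.(\d+)\.(\d+)", log_text)
    cuda = (int(m.group(1)), int(m.group(2))) if m else None
    return ccs, cuda


def unsupported_gpus(log_text):
    """List compute capabilities that need a newer CUDA than the one loaded."""
    ccs, cuda = parse_log(log_text)
    if cuda is None:
        return []
    return sorted(
        cc for cc in ccs
        if cc in MIN_CUDA_FOR_CC and MIN_CUDA_FOR_CC[cc] > cuda
    )


# Two lines copied from the log in the question.
sample = """\
Successfully opened dynamic library libcudart.so.10.1
pciBusID: 0000:3b:00.0 name: NVIDIA A100-PCIE-40GB computeCapability: 8.0
"""

print(unsupported_gpus(sample))  # [(8, 0)]
```

If the list is non-empty, the usual route is a TensorFlow release built against CUDA 11 (2.4 or later) together with a matching CUDA and cuDNN installation. That comes from TensorFlow's release history in general, not from anything confirmed in this thread.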

Desired outcome

Could anyone suggest a solution? Thanks.

2 answers

  • 爱晚乏客游 2021-12-01 09:10
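    What cuDNN version do you have? As far as I know, TF 2.3 only supports cuDNN 7.6, so check whether a cuDNN version mismatch is the cause. If that does not help, you will have to downgrade TensorFlow.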
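
    The cuDNN check suggested here could be sketched roughly as follows. The log in the question shows `libcudnn.so.7` being loaded, so the sketch compares the major version found in such a log line with the version the TensorFlow build expects. `tf.sysconfig.get_build_info()` is called only if TensorFlow can be imported, and the `cudnn_version` key is an assumption about what that dictionary contains in a given release. The helper names are illustrative.

```python
import re


def loaded_cudnn_major(log_text):
    """Extract the cuDNN major version from a 'libcudnn.so.N' log line."""
    m = re.search(r"libcudnn\.so\.(\d+)", log_text)
    return int(m.group(1)) if m else None


def cudnn_matches(log_text, expected_version):
    """Compare the loaded cuDNN major version with the expected one.

    expected_version is a string such as "7" or "7.6"; only the major
    part is compared. Returns None when either side is unknown.
    """
    loaded = loaded_cudnn_major(log_text)
    if loaded is None or not expected_version:
        return None
    return loaded == int(str(expected_version).split(".")[0])


# Line copied from the log in the question.
log_line = "Successfully opened dynamic library libcudnn.so.7"
print(cudnn_matches(log_line, "7.6"))

try:
    import tensorflow as tf

    # Assumption: the build-info dictionary exposes a 'cudnn_version' key.
    build = tf.sysconfig.get_build_info()
    print(cudnn_matches(log_line, build.get("cudnn_version")))
except (ImportError, AttributeError):
    print("TensorFlow build info not available")
```

    In the question's log cuDNN 7 loads without complaint, so if this check passes, the mismatch may lie elsewhere in the CUDA stack.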
