TensorFlow 2.x: process is killed partway through a custom training loop

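import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # set before importing TensorFlow so the GPU selection takes effect

import datetime
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras.layers import (Input, Conv2D, Activation, Dropout,
                                     Concatenate, UpSampling2D)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
# Assumption: the post does not show where InstanceNormalization comes from;
# tensorflow_addons is one package that provides a layer with this name.
from tensorflow_addons.layers import InstanceNormalization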

class CycleGAN():
    def __init__(self):
        # Input shape
        self.img_rows = 40
        self.img_cols = 40
        self.channels = 1
        self.img_shape = (self.img_rows, self.img_cols, self.channels)
        # Calculate output shape of D (PatchGAN): four stride-2 'same' convs, 40 -> 20 -> 10 -> 5 -> 3
        patch = -(-self.img_rows // 2**4)  # ceiling division
        self.disc_patch = (patch, patch, 1)
        # Number of filters in the first layer of G and D
        self.gf = 32
        self.df = 64
        # Loss weights
        self.lambda_cycle = 10.0                    # Cycle-consistency loss
        self.lambda_id = 0.1 * self.lambda_cycle    # Identity loss
        optimizer = Adam(0.002, 0.5)
        # Build and compile the discriminators
        self.d_A = self.build_discriminator()
        self.d_B = self.build_discriminator()
        self.d_A.compile(loss='mse',
            optimizer=optimizer,
            metrics=['accuracy'])
        self.d_B.compile(loss='mse',
            optimizer=optimizer,
            metrics=['accuracy'])
        # Build the generators
        self.g_A2B = self.build_generator()
        self.g_B2A = self.build_generator()
        # Input images from both domains
        img_A = Input(shape=self.img_shape)
        img_B = Input(shape=self.img_shape)
        # Translate images to the other domain
        fake_B = self.g_A2B(img_A)
        fake_A = self.g_B2A(img_B)
        # Translate images back to original domain
        reconstr_A = self.g_B2A(fake_B)
        reconstr_B = self.g_A2B(fake_A)
        # Identity mapping of images
        img_A_id_truth = self.g_B2A(img_A)
        img_B_id_truth = self.g_A2B(img_B)
        # For the combined model we will only train the generators
        self.d_A.trainable = False
        self.d_B.trainable = False
        # Discriminators determine validity of translated images
        valid_A = self.d_A(fake_A)
        valid_B = self.d_B(fake_B)
        # Combined model trains generators to fool discriminators
        self.combined = Model(inputs=[img_A, img_B],
                              outputs=[ valid_A, valid_B,
                                        reconstr_A, reconstr_B,
                                        img_A_id_truth, img_B_id_truth ])
        self.combined.compile(loss=['mse', 'mse',
                                    'mae', 'mae',
                                    'mae', 'mae'],
                            loss_weights=[  1, 1,
                                            self.lambda_cycle, self.lambda_cycle,
                                            self.lambda_id, self.lambda_id ],
                            optimizer=optimizer)
 
    def build_generator(self):
        """U-Net Generator"""
 
        def conv2d(layer_input, filters, f_size=3):
            """Layers used during downsampling"""
            d = Conv2D(filters, kernel_size=f_size, strides=2, padding='same')(layer_input)
            d = Activation("relu")(d)#ReLU(alpha=0.2)(d)
            d = InstanceNormalization()(d)
            return d
 
        def deconv2d(layer_input, skip_input, filters, f_size=4, dropout_rate=0.2):
            """Layers used during upsampling"""
            u = UpSampling2D(size=2)(layer_input)
            u = Conv2D(filters, kernel_size=f_size, strides=1, padding='same', activation='relu')(u)
            if dropout_rate:
                u = Dropout(dropout_rate)(u)
            u = InstanceNormalization()(u)
            u = Concatenate()([u, skip_input])
            return u
 
        # Image input
        d0 = Input(shape=self.img_shape)
        # Downsampling
        d1 = conv2d(d0, self.gf)
        d2 = conv2d(d1, self.gf*2)
        d3 = conv2d(d2, self.gf*4)
        # Upsampling
        u1 = deconv2d(d3, d2, self.gf*4)
        u2 = deconv2d(u1, d1, self.gf*2)
        u4 = UpSampling2D(size=2)(u2)
        output_img = Conv2D(self.channels, kernel_size=4, strides=1, padding='same', activation='relu')(u4)
        model=Model(d0, output_img)
        model.summary()
        return model
 
    def build_discriminator(self):
 
        def d_layer(layer_input, filters, f_size=4, normalization=True):
            """Discriminator layer"""
            d = Conv2D(filters, kernel_size=f_size, strides=2, padding='same')(layer_input)
            d = Activation("relu")(d)#LeakyReLU(alpha=0.2)(d)
            if normalization:
                d = InstanceNormalization()(d)
            return d
 
        img = Input(shape=self.img_shape)
 
        d1 = d_layer(img, self.df, normalization=True)
        d2 = d_layer(d1, self.df*2)
        d3 = d_layer(d2, self.df*4)
        d4 = d_layer(d3, self.df*8)
 
        validity = Conv2D(1, kernel_size=4, strides=1, padding='same')(d4)
 
        return Model(img, validity)
 
    def train(self,dataset,X_train,X_true,time,lat,lon,epochs,batch_size=1, sample_interval=50):
        start_time = datetime.datetime.now()
 
        # Adversarial loss ground truths

        for epoch in range(epochs):
            for batch_i, (imgs_A, imgs_B) in enumerate(dataset.shuffle(len(dataset)).batch(batch_size)):
                valid = np.ones((imgs_A.shape[0],3,3,1))
                fake = np.zeros((imgs_A.shape[0],3,3,1))
                imgs_A = np.expand_dims(imgs_A, axis=3)
                imgs_B=np.expand_dims(imgs_B, axis=3)
                fake_B = self.g_A2B.predict(imgs_A)
                fake_A = self.g_B2A.predict(imgs_B)
 
                # Train the discriminators (original images = real / translated = fake)
                dA_loss_real = self.d_A.train_on_batch(imgs_A, valid)
                dA_loss_fake = self.d_A.train_on_batch(fake_A, fake)
                dA_loss = 0.5 * np.add(dA_loss_real, dA_loss_fake)
                dB_loss_real = self.d_B.train_on_batch(imgs_B, valid)
                dB_loss_fake = self.d_B.train_on_batch(fake_B, fake)
                dB_loss = 0.5 * np.add(dB_loss_real, dB_loss_fake)
                d_loss = 0.5 * np.add(dA_loss, dB_loss)
                g_loss = self.combined.train_on_batch([imgs_A, imgs_B],
                                                        [valid, valid,
                                                        imgs_A, imgs_B,
                                                        imgs_A, imgs_B])
                elapsed_time = datetime.datetime.now() - start_time
                del valid,fake
                if batch_i  % 20 == 0:
                    print ("[Epoch %d/%d] [Batch %d/%d] [D loss: %f, acc: %3d%%] [G loss: %05f, adv: %05f, recon: %05f, id: %05f] time: %s " \
                                                                       % ( epoch, epochs,
                                                                            batch_i, len(dataset)//batch_size,
                                                                            d_loss[0], 100*d_loss[1],
                                                                            g_loss[0],
                                                                            np.mean(g_loss[1:3]),
                                                                            np.mean(g_loss[3:5]),
                                                                            np.mean(g_loss[5:7]),  # both identity losses
                                                                            elapsed_time))
            if epoch % sample_interval == 0:
                self.sample_images(epoch,X_train,X_true,time,lat,lon)
 
    def sample_images(self, epoch,X_train,X_true,time,lat,lon):
        r, c = 4, 4
        idx = np.random.randint(0, X_train.shape[0], r)
        imgs_A= X_train[idx]
        imgs_B=X_true[idx]
        time=time[idx]
        imgs_A = np.expand_dims(imgs_A, axis=3)
        imgs_B=np.expand_dims(imgs_B, axis=3)
        # Translate images to the other domain
        fake_B = self.g_A2B.predict(imgs_A)
        fig, axs = plt.subplots(r, c,figsize=(12,8),constrained_layout=True)
        for i in range(r):       
            axs[i,0].contourf(lon,lat,imgs_A[i,:,:,0],cmap='RdBu_r')
            axs[i,1].contourf(lon,lat,fake_B[i,:,:,0],cmap='RdBu_r')
            axs[i,2].contourf(lon,lat,imgs_B[i,:,:,0],cmap='RdBu_r')
            ax3=axs[i,3].contourf(lon,lat,imgs_B[i,:,:,0]-fake_B[i,:,:,0],levels=5,cmap='RdBu_r')
            fig.colorbar(ax3,ax=axs[i,3])
            axs[i,0].set_title(str(time[i].values).split("T")[0])
            axs[i,1].set_title("Translated")
            axs[i,2].set_title("obs")
            axs[i,3].set_title("difference_%04f" % np.sqrt(np.mean((imgs_B[i,:,:,0]-fake_B[i,:,:,0])**2)))
        fig.savefig("../figure/cycle/wpsh_%d.png" % epoch)
        plt.close()
if __name__ == '__main__':

#     profiler.warmup()
#     profiler.start(logdir='./logdir')
    keras.backend.clear_session()
    cgan = CycleGAN()
    cgan.train(train_dataset, train_x, train_y, time_date,
               lat, lon, epochs=2000, batch_size=128, sample_interval=100)

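This is a CycleGAN model I wrote by following examples from the forum. The custom training loop is essentially the same as the one posted there; the only difference is that I built my own dataset. The problem is that after a certain number of epochs the program is killed automatically, with no error message at all. When run from a terminal it only reports that the process was killed; when run in Jupyter the kernel simply restarts. Neither case produces a traceback. So I suspect an out-of-memory condition, or possibly a mismatch in my CUDA setup. My machine has two RTX 3090 cards (only one is used for training, with 24268 MiB of GPU memory available), Ubuntu 20.04, CUDA 11.0 (with the 11.0 ptxas replaced by the one from CUDA 11.1), and TensorFlow 2.4. The data has shape roughly (10936, 40, 40). Any help would be much appreciated.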
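
A process that the terminal reports only as killed, with no Python traceback, is often the work of the Linux out-of-memory killer acting on host RAM rather than a GPU or CUDA error (checking `dmesg` after the crash may confirm this). One way to test the memory hypothesis is to log the peak resident memory of the process every few batches and see whether it keeps climbing. The following is only a minimal sketch: `peak_rss_mib` and `MemoryLogger` are hypothetical helper names, not part of the original code, and the KiB unit of `ru_maxrss` is the Linux convention.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of this process in MiB.

    Assumes Linux, where ru_maxrss is reported in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


class MemoryLogger:
    """Hypothetical helper: records and prints peak host memory every `every` batches."""

    def __init__(self, every=20):
        self.every = every
        self.history = []

    def step(self, batch_i):
        # Only sample on multiples of `every` to keep the log short.
        if batch_i % self.every == 0:
            mib = peak_rss_mib()
            self.history.append(mib)
            print("[batch %d] peak host RSS: %.1f MiB" % (batch_i, mib))
        return self.history
```

In the `train` method above, this could be used by creating `mem_log = MemoryLogger(every=20)` before the epoch loop and calling `mem_log.step(batch_i)` at the end of each batch. If the printed value rises steadily across epochs until the crash, host memory growth is the likely culprit.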

climate_ling 2021-03-07 21:34
Adding
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
tf.config.experimental.set_memory_growth(gpus[0], True)
did not help either. The run still dies after about 40 minutes.
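
Note that `set_memory_growth` only changes how TensorFlow allocates GPU memory, so it would not be expected to help if the growth is in host RAM. One candidate worth ruling out is the pair of `predict` calls inside the batch loop: repeated `Model.predict` calls in a loop have been reported to grow host memory in some TF 2.x releases, and the Keras documentation suggests calling the model directly for small inputs that fit in one batch. The sketch below shows that substitution; `translate_batch` is a hypothetical helper name, and whether this resolves the crash here is an assumption, not a confirmed fix.

```python
import numpy as np


def translate_batch(generator, batch):
    """Run one forward pass by calling the model directly instead of predict().

    training=False keeps layers such as Dropout in inference mode,
    which matches what predict() does.
    """
    out = generator(batch, training=False)
    # Convert the result (an eager tensor for a Keras model) to a NumPy array
    # so it can be passed to train_on_batch like the output of predict().
    return np.asarray(out)
```

In `train`, the two lines that call `predict` would then become `fake_B = translate_batch(self.g_A2B, imgs_A)` and `fake_A = translate_batch(self.g_B2A, imgs_B)`. `Model.predict_on_batch` is another existing Keras method that could serve the same purpose.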
