使用 tensorflow 训练网络 loss 突然出现 nan 的情况[已解决] 5C

在第167次epoch时模型loss突然变为nan,之前情况都是正常的,之后模型 loss 便一直为 nan,两个准确率变为 1 和 0。
尝试把学习率改为0或0.0000001,nan还是会在167次epoch出现。
尝试把loss改为loss = tf.log(tf.clip _ by _ value(y,1e-8,1.0)) 或 loss = tf.log(tf.cli _ p _ by _ value(y,1e-8,tf.reducemax(y))),nan还是会在167次epoch出现。
把softmax函数,改为log _ softmax函数,nan还是会在167次epoch出现。
把batch _ size改大五倍(从20改为100),nan会在33次epoch出现。
各位大佬们,谁能救救我啊,这是因为什么原因呢???调试了一星期了(悲伤)

qq_33614902
kepcum loss没问题吧,前面应该加一个负号吧。我遇到了同样的问题,刚开始训练没问题,训练到第96个epoch的时候就NAN了,请问您是怎么解决的?是train_data的问题吗?
9 个月之前 回复
weixin_39886667
lighter222 我也是这种情况,在没有用suffle的训练条件下,在某个特定迭代次数下就出现nan,多次重复训练都是在同一个位置出现nan
9 个月之前 回复
linger2012liu
lingerlanlan 楼主找到解决方案了吗
10 个月之前 回复
weixin_42329758
weixin_42329758 您好,能不能问问您画出loss曲线的程序代码,打扰了
10 个月之前 回复

2个回答

看下是不是梯度爆炸或者消失了,加上正则化或者随机化,或者逐层训练你的模型。

caozhy
贵阳老马马善福专业维修游泳池堵漏防水工程 回复Sugar_girl: 用tensorborad看
一年多之前 回复
Sugar_girl
FishSugar 请问如何看梯度是否爆炸或消失呢,现在我利用150epoch训练出的模型(因为在167epoch会出现loss为nan)进行在val,test上运行得到val,test上模型损失也为nan,但是模型准确率却在随着train数据的增加逐次增大,请问您觉得这种情况大概是因为什么原因呢?非常感谢您的回复!
一年多之前 回复

要不,吧sigmoid去掉试试,不过这样会不收敛的,应该还是网络的问题,看下是否在最后一层加了BN层

Csdn user default icon
上传中...
上传图片
插入图片
抄袭、复制答案,以达到刷声望分或其他目的的行为,在CSDN问答是严格禁止的,一经发现立刻封号。是时候展现真正的技术了!
其他相关推荐
tensorflow 里loss 出现nan问题 新手问题
大家好, 新手刚刚学,IDE:spyder 预测一个停车场车辆的驶出率, 输入(时间和车辆驶入率)二维。 训练的数据就是,驶出率car/min,时间(时间:1代表一天,从凌晨10秒=10/24/3600的时候开始到晚上23点多),驶入率car/min, 只建了一层的hidden layer,然后print loss是都是nan... 不知道哪里出了问题,是因为层太简单了么?还是激活函数有问题呢? 看网上说排除零的影响,我把输入数和输出数都+1,变得非零了也还是nan... 代码如下: ``` import tensorflow as tf import numpy as np import pandas as pd data=pd.read_csv('0831new.csv') date=data['date'] erate=data['erate'] x=pd.concat([date,erate],axis=1) drate=data['drate'] y=np.array(drate) x=np.array(x) y=y.reshape([7112,1]) x=x+1 y=y+1 z=[] def add_layer(inputs, in_size, out_size, activation_function=None): # add one more layer and return the output of this layer Weights = tf.Variable(tf.random_normal([in_size, out_size])) biases = tf.Variable(tf.zeros([1, out_size]) + 0.01) Wx_plus_b = tf.matmul(inputs, Weights) + biases if activation_function is None: outputs = Wx_plus_b else: outputs = activation_function(Wx_plus_b) return outputs xs = tf.placeholder(tf.float32, [None, 2]) ys = tf.placeholder(tf.float32, [None, 1]) l1 = add_layer(xs, 2, 5, activation_function=tf.nn.tanh) prediction = add_layer(l1,5, 1, activation_function=None) loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction), reduction_indices=[1,0])) train_step= tf.train.GradientDescentOptimizer(0.001).minimize(loss) if int((tf.__version__).split('.')[1]) < 12: init = tf.initialize_all_variables() else: init = tf.global_variables_initializer() sess = tf.Session() sess.run(init) for i in range(1000): # training sess.run(train_step, feed_dict={xs: x, ys: y}) if i % 25 == 0: # to see the step improvement print('loss:',sess.run(loss, feed_dict={xs:x, ys:y})) z.append(loss) ``` ![图片说明](https://img-ask.csdn.net/upload/201909/20/1568984044_999083.png) 帮忙给件建议吧~ 谢谢
Mask RCNN训练过程中loss为nan的情况(使用labelme标注的数据)
1. 不是batchsize的问题,不是学习率的问题。我已经将学习率调成了0,结果也是这样,即迭代几次之后(不是一上来就是nan),loss就为nan了,但是后面5个loss正常收敛。 2. 训练类别数与数据集中的类别数一致。 * 想问问 帖子里面有大神知道原因,希望告知!多谢!! ![图片说明](https://img-ask.csdn.net/upload/201904/01/1554111703_91454.png)
卷积神经网络训练loss变为nan
卷积神经网络训练,用的是mnist数据集,第一次训练前损失函数还是一个值,训练一次之后就变成nan了,使用的损失函数是ce = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y, labels=tf.argmax(y_, 1)),cem = tf.reduce.mean(ce),应该不会出现真数为零或负的情况,而且训练前loss是存在的,只是训练后变为nan,求各位大牛答疑解惑,感激不尽。![图片说明](https://img-ask.csdn.net/upload/201902/18/1550420090_522553.png)![图片说明](https://img-ask.csdn.net/upload/201902/18/1550420096_230838.png)![图片说明](https://img-ask.csdn.net/upload/201902/18/1550420101_914053.png)
在Spyder界面中使用tensorflow进行fashion_mnist数据集学习,结果loss为非数,并且准确率一直未变
1.建立了一个3个全连接层的神经网络; 2.代码如下: ``` import matplotlib as mpl import matplotlib.pyplot as plt #%matplotlib inline import numpy as np import sklearn import pandas as pd import os import sys import time import tensorflow as tf from tensorflow import keras print(tf.__version__) print(sys.version_info) for module in mpl, np, sklearn, tf, keras: print(module.__name__,module.__version__) fashion_mnist = keras.datasets.fashion_mnist (x_train_all, y_train_all), (x_test, y_test) = fashion_mnist.load_data() x_valid, x_train = x_train_all[:5000], x_train_all[5000:] y_valid, y_train = y_train_all[:5000], y_train_all[5000:] #tf.keras.models.Sequential model = keras.models.Sequential() model.add(keras.layers.Flatten(input_shape= [28,28])) model.add(keras.layers.Dense(300, activation="relu")) model.add(keras.layers.Dense(100, activation="relu")) model.add(keras.layers.Dense(10,activation="softmax")) ###sparse为最后输出为index类型,如果为one hot类型,则不需加sparse model.compile(loss = "sparse_categorical_crossentropy",optimizer = "sgd", metrics = ["accuracy"]) #model.layers #model.summary() history = model.fit(x_train, y_train, epochs=10, validation_data=(x_valid,y_valid)) ``` 3.输出结果: ``` runfile('F:/new/new world/deep learning/tensorflow/ex2/tf_keras_classification_model.py', wdir='F:/new/new world/deep learning/tensorflow/ex2') 2.0.0 sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0) matplotlib 3.1.1 numpy 1.16.5 sklearn 0.21.3 tensorflow 2.0.0 tensorflow_core.keras 2.2.4-tf Train on 55000 samples, validate on 5000 samples Epoch 1/10 WARNING:tensorflow:Entity <function Function._initialize_uninitialized_variables.<locals>.initialize_variables at 0x0000025EAB633798> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: WARNING: Entity <function Function._initialize_uninitialized_variables.<locals>.initialize_variables at 0x0000025EAB633798> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: 55000/55000 [==============================] - 3s 58us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 Epoch 2/10 55000/55000 [==============================] - 3s 48us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 Epoch 3/10 55000/55000 [==============================] - 3s 47us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 Epoch 4/10 55000/55000 [==============================] - 3s 48us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 Epoch 5/10 55000/55000 [==============================] - 3s 47us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 Epoch 6/10 55000/55000 [==============================] - 3s 48us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 Epoch 7/10 55000/55000 [==============================] - 3s 47us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 Epoch 8/10 55000/55000 [==============================] - 3s 48us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 Epoch 9/10 55000/55000 [==============================] - 3s 48us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 Epoch 10/10 55000/55000 [==============================] - 3s 48us/sample - loss: nan - accuracy: 0.1008 - val_loss: nan - val_accuracy: 0.0914 ```
Tensorflow中训练cifar10出现name 'train_data' is not defined问题
用神经网络训练cifar10 ``` init=tf.global_variables_initializer() batch_size=20 train_steps=1000 with tf.Session() as sess: for i in range(train_steps): batch_data,batch_labels=train_data.next_batch(batch_size) loss_val,acc_val,_=sess.run( [loss,accuracy,train_op], feed_dict= {x: batch_data, y: batch_labels}) if i%500==0: print('[Train] Step:%d,loss:%4.5f,acc:%4.5f'\ %(i,loss_val,acc_val)) # ``` 出现name 'train_data' is not defined问题不知道怎么解决了
tensorflow训练完模型直接测试和导入模型进行测试的结果不同,一个很好,一个略差,这是为什么?
在tensorflow训练完模型,我直接采用同一个session进行测试,得到结果较好,但是采用训练完保存的模型,进行重新载入进行测试,结果较差,不懂是为什么会出现这样的结果。注:测试数据是一样的。以下是模型结果: 训练集:loss:0.384,acc:0.931. 验证集:loss:0.212,acc:0.968. 训练完在同一session内的测试集:acc:0.96。导入保存的模型进行测试:acc:0.29 ``` def create_model(hps): global_step = tf.Variable(tf.zeros([], tf.float64), name = 'global_step', trainable = False) scale = 1.0 / math.sqrt(hps.num_embedding_size + hps.num_lstm_nodes[-1]) / 3.0 print(type(scale)) gru_init = tf.random_normal_initializer(-scale, scale) with tf.variable_scope('Bi_GRU_nn', initializer = gru_init): for i in range(hps.num_lstm_layers): cell_bw = tf.contrib.rnn.GRUCell(hps.num_lstm_nodes[i], activation = tf.nn.relu, name = 'cell-bw') cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, output_keep_prob = dropout_keep_prob) cell_fw = tf.contrib.rnn.GRUCell(hps.num_lstm_nodes[i], activation = tf.nn.relu, name = 'cell-fw') cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, output_keep_prob = dropout_keep_prob) rnn_outputs, _ = tf.nn.bidirectional_dynamic_rnn(cell_bw, cell_fw, inputs, dtype=tf.float32) embeddedWords = tf.concat(rnn_outputs, 2) finalOutput = embeddedWords[:, -1, :] outputSize = hps.num_lstm_nodes[-1] * 2 # 因为是双向LSTM,最终的输出值是fw和bw的拼接,因此要乘以2 last = tf.reshape(finalOutput, [-1, outputSize]) # reshape成全连接层的输入维度 last = tf.layers.batch_normalization(last, training = is_training) fc_init = tf.uniform_unit_scaling_initializer(factor = 1.0) with tf.variable_scope('fc', initializer = fc_init): fc1 = tf.layers.dense(last, hps.num_fc_nodes, name = 'fc1') fc1_batch_normalization = tf.layers.batch_normalization(fc1, training = is_training) fc_activation = tf.nn.relu(fc1_batch_normalization) logits = tf.layers.dense(fc_activation, hps.num_classes, name = 'fc2') with tf.name_scope('metrics'): softmax_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits = logits, labels = tf.argmax(outputs, 1)) loss = tf.reduce_mean(softmax_loss) # [0, 1, 5, 4, 2] ->argmax:2 因为在第二个位置上是最大的 y_pred = tf.argmax(tf.nn.softmax(logits), 1, output_type = tf.int64, name = 'y_pred') # 计算准确率,看看算对多少个 correct_pred = tf.equal(tf.argmax(outputs, 1), y_pred) # tf.cast 将数据转换成 tf.float32 类型 accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32)) with tf.name_scope('train_op'): tvar = tf.trainable_variables() for var in tvar: print('variable name: %s' % (var.name)) grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvar), hps.clip_lstm_grads) optimizer = tf.train.AdamOptimizer(hps.learning_rate) train_op = optimizer.apply_gradients(zip(grads, tvar), global_step) # return((inputs, outputs, is_training), (loss, accuracy, y_pred), (train_op, global_step)) return((inputs, outputs), (loss, accuracy, y_pred), (train_op, global_step)) placeholders, metrics, others = create_model(hps) content, labels = placeholders loss, accuracy, y_pred = metrics train_op, global_step = others def val_steps(sess, x_batch, y_batch, writer = None): loss_val, accuracy_val = sess.run([loss,accuracy], feed_dict = {inputs: x_batch, outputs: y_batch, is_training: hps.val_is_training, dropout_keep_prob: 1.0}) return loss_val, accuracy_val loss_summary = tf.summary.scalar('loss', loss) accuracy_summary = tf.summary.scalar('accuracy', accuracy) # 将所有的变量都集合起来 merged_summary = tf.summary.merge_all() # 用于test测试的summary merged_summary_test = tf.summary.merge([loss_summary, accuracy_summary]) LOG_DIR = '.' run_label = 'run_Bi-GRU_Dropout_tensorboard' run_dir = os.path.join(LOG_DIR, run_label) if not os.path.exists(run_dir): os.makedirs(run_dir) train_log_dir = os.path.join(run_dir, timestamp, 'train') test_los_dir = os.path.join(run_dir, timestamp, 'test') if not os.path.exists(train_log_dir): os.makedirs(train_log_dir) if not os.path.join(test_los_dir): os.makedirs(test_los_dir) # saver得到的文件句柄,可以将文件训练的快照保存到文件夹中去 saver = tf.train.Saver(tf.global_variables(), max_to_keep = 5) # train 代码 init_op = tf.global_variables_initializer() train_keep_prob_value = 0.2 test_keep_prob_value = 1.0 # 由于如果按照每一步都去计算的话,会很慢,所以我们规定每100次存储一次 output_summary_every_steps = 100 num_train_steps = 1000 # 每隔多少次保存一次 output_model_every_steps = 500 # 测试集测试 test_model_all_steps = 4000 i = 0 session_conf = tf.ConfigProto( gpu_options = tf.GPUOptions(allow_growth=True), allow_soft_placement = True, log_device_placement = False) with tf.Session(config = session_conf) as sess: sess.run(init_op) # 将训练过程中,将loss,accuracy写入文件里,后面是目录和计算图,如果想要在tensorboard中显示计算图,就想sess.graph加上 train_writer = tf.summary.FileWriter(train_log_dir, sess.graph) # 同样将测试的结果保存到tensorboard中,没有计算图 test_writer = tf.summary.FileWriter(test_los_dir) batches = batch_iter(list(zip(x_train, y_train)), hps.batch_size, hps.num_epochs) for batch in batches: train_x, train_y = zip(*batch) eval_ops = [loss, accuracy, train_op, global_step] should_out_summary = ((i + 1) % output_summary_every_steps == 0) if should_out_summary: eval_ops.append(merged_summary) # 那三个占位符输进去 # 计算loss, accuracy, train_op, global_step的图 eval_ops.append(merged_summary) outputs_train = sess.run(eval_ops, feed_dict={ inputs: train_x, outputs: train_y, dropout_keep_prob: train_keep_prob_value, is_training: hps.train_is_training }) loss_train, accuracy_train = outputs_train[0:2] if should_out_summary: # 由于我们想在100steps之后计算summary,所以上面 should_out_summary = ((i + 1) % output_summary_every_steps == 0)成立, # 即为真True,那么我们将训练的内容放入eval_ops的最后面了,因此,我们想获得summary的结果得在eval_ops_results的最后一个 train_summary_str = outputs_train[-1] # 将获得的结果写训练tensorboard文件夹中,由于训练从0开始,所以这里加上1,表示第几步的训练 train_writer.add_summary(train_summary_str, i + 1) test_summary_str = sess.run([merged_summary_test], feed_dict = {inputs: x_dev, outputs: y_dev, dropout_keep_prob: 1.0, is_training: hps.val_is_training })[0] test_writer.add_summary(test_summary_str, i + 1) current_step = tf.train.global_step(sess, global_step) if (i + 1) % 100 == 0: print("Step: %5d, loss: %3.3f, accuracy: %3.3f" % (i + 1, loss_train, accuracy_train)) # 500个batch校验一次 if (i + 1) % 500 == 0: loss_eval, accuracy_eval = val_steps(sess, x_dev, y_dev) print("Step: %5d, val_loss: %3.3f, val_accuracy: %3.3f" % (i + 1, loss_eval, accuracy_eval)) if (i + 1) % output_model_every_steps == 0: path = saver.save(sess,os.path.join(out_dir, 'ckp-%05d' % (i + 1))) print("Saved model checkpoint to {}\n".format(path)) print('model saved to ckp-%05d' % (i + 1)) if (i + 1) % test_model_all_steps == 0: # test_loss, test_acc, all_predictions= sess.run([loss, accuracy, y_pred], feed_dict = {inputs: x_test, outputs: y_test, dropout_keep_prob: 1.0}) test_loss, test_acc, all_predictions= sess.run([loss, accuracy, y_pred], feed_dict = {inputs: x_test, outputs: y_test, is_training: hps.val_is_training, dropout_keep_prob: 1.0}) print("test_loss: %3.3f, test_acc: %3.3d" % (test_loss, test_acc)) batches = batch_iter(list(x_test), 128, 1, shuffle=False) # Collect the predictions here all_predictions = [] for x_test_batch in batches: batch_predictions = sess.run(y_pred, {inputs: x_test_batch, is_training: hps.val_is_training, dropout_keep_prob: 1.0}) all_predictions = np.concatenate([all_predictions, batch_predictions]) correct_predictions = float(sum(all_predictions == y.flatten())) print("Total number of test examples: {}".format(len(y_test))) print("Accuracy: {:g}".format(correct_predictions/float(len(y_test)))) test_y = y_test.argmax(axis = 1) #生成混淆矩阵 conf_mat = confusion_matrix(test_y, all_predictions) fig, ax = plt.subplots(figsize = (4,2)) sns.heatmap(conf_mat, annot=True, fmt = 'd', xticklabels = cat_id_df.category_id.values, yticklabels = cat_id_df.category_id.values) font_set = FontProperties(fname = r"/usr/share/fonts/truetype/wqy/wqy-microhei.ttc", size=15) plt.ylabel(u'实际结果',fontsize = 18,fontproperties = font_set) plt.xlabel(u'预测结果',fontsize = 18,fontproperties = font_set) plt.savefig('./test.png') print('accuracy %s' % accuracy_score(all_predictions, test_y)) print(classification_report(test_y, all_predictions,target_names = cat_id_df['category_name'].values)) print(classification_report(test_y, all_predictions)) i += 1 ``` 以上的模型代码,请求各位大神帮我看看,为什么出现这样的结果?
tensorflow当中的loss里面的logits可不可以是placeholder
我使用tensorflow实现手写数字识别,我希望softmax_cross_entropy_with_logits里面的logits先用一个placeholder表示,然后在计算的时候再通过计算出的值再传给placeholder,但是会报错ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients。我知道直接把logits那里改成outputs就可以了,但是如果我一定要用logits的结果先是一个placeholder,我应该怎么解决呢。 ``` import tensorflow as tf from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets("/home/as/下载/resnet-152_mnist-master/mnist_dataset", one_hot=True) from tensorflow.contrib.layers import fully_connected x = tf.placeholder(dtype=tf.float32,shape=[None,784]) y = tf.placeholder(dtype=tf.float32,shape=[None,1]) hidden1 = fully_connected(x,100,activation_fn=tf.nn.elu, weights_initializer=tf.random_normal_initializer()) hidden2 = fully_connected(hidden1,200,activation_fn=tf.nn.elu, weights_initializer=tf.random_normal_initializer()) hidden3 = fully_connected(hidden2,200,activation_fn=tf.nn.elu, weights_initializer=tf.random_normal_initializer()) outputs = fully_connected(hidden3,10,activation_fn=None, weights_initializer=tf.random_normal_initializer()) a = tf.placeholder(tf.float32,[None,10]) loss = tf.nn.softmax_cross_entropy_with_logits(labels=y,logits=a) reduce_mean_loss = tf.reduce_mean(loss) equal_result = tf.equal(tf.argmax(outputs,1),tf.argmax(y,1)) cast_result = tf.cast(equal_result,dtype=tf.float32) accuracy = tf.reduce_mean(cast_result) train_op = tf.train.AdamOptimizer(0.001).minimize(reduce_mean_loss) with tf.Session() as sess: sess.run(tf.global_variables_initializer()) for i in range(30000): xs,ys = mnist.train.next_batch(128) result = outputs.eval(feed_dict={x:xs}) sess.run(train_op,feed_dict={a:result,y:ys}) print(i) ```
tensorflow自定义的损失函数 focal_loss出现inf,在训练过程中出现inf
![图片说明](https://img-ask.csdn.net/upload/201905/05/1557048780_248292.png) ``` python def focal_loss(alpha=0.25, gamma=2.): """ focal loss used for train positive/negative samples rate out of balance, improve train performance """ def focal_loss_calc(y_true, y_pred): positive = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred)) negative = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred)) return -(alpha*K.pow(1.-positive, gamma)*K.log(positive) + (1-alpha)*K.pow(negative, gamma)*K.log(1.-negative)) return focal_loss_calc ``` ```python self.keras_model.compile(optimizer=optimizer, loss=dice_focal_loss, metrics=[ mean_iou, dice_loss, focal_loss()]) ``` 上面的focal loss 开始还是挺正常的,随着训练过程逐渐减小大0.025左右,然后就突然变成inf。何解
tensorflow训练过程权重不更新,loss不下降,输出保持不变,只有bias在非常缓慢地变化?
模型里没有参数被初始化为0 ,学习率从10的-5次方试到了0.1,输入数据都已经被归一化为了0-1之间,模型是改过的vgg16,有四个输出,使用了vgg16的预训练模型来初始化参数,输出中间结果也没有nan或者inf值。是不是不能自定义损失函数呢?但输出中间梯度发现并不是0,非常奇怪。 **train.py的部分代码** ``` def train(): x = tf.placeholder(tf.float32, [None, 182, 182, 2], name = 'image_input') y_ = tf.placeholder(tf.float32, [None, 8], name='label_input') global_step = tf.Variable(0, trainable=False) learning_rate = tf.train.exponential_decay(learning_rate=0.0001,decay_rate=0.9, global_step=TRAINING_STEPS, decay_steps=50,staircase=True) # 读取图片数据,pos是标签为1的图,neg是标签为0的图 pos, neg = get_data.get_image(img_path) #输入标签固定,输入数据每个batch前4张放pos,后4张放neg label_batch = np.reshape(np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]),[1, 8]) vgg = vgg16.Vgg16() vgg.build(x) #loss函数的定义在后面 loss = vgg.side_loss( y_,vgg.output1, vgg.output2, vgg.output3, vgg.output4) train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step) init_op = tf.global_variables_initializer() saver = tf.train.Saver() with tf.device('/gpu:0'): with tf.Session() as sess: sess.run(init_op) for i in range(TRAINING_STEPS): #在train.py的其他部分定义了batch_size= 4 start = i * batch_size end = start + batch_size #制作输入数据,前4个是标签为1的图,后4个是标签为0的图 image_list = [] image_list.append(pos[start:end]) image_list.append(neg[start:end]) image_batch = np.reshape(np.array(image_list),[-1,182,182,2]) _,loss_val,step = sess.run([train_step,loss,global_step], feed_dict={x: image_batch,y_:label_batch}) if i % 50 == 0: print("the step is %d,loss is %f" % (step, loss_val)) if loss_val < min_loss: min_loss = loss_val saver.save(sess, 'ckpt/vgg.ckpt', global_step=2000) ``` **Loss 函数的定义** ``` **loss函数的定义(写在了Vgg16类里)** ``` class Vgg16: #a,b,c,d都是vgg模型里的输出,是多输出模型 def side_loss(self,yi,a,b,c,d): self.loss1 = self.f_indicator(yi, a) self.loss2 = self.f_indicator(yi, b) self.loss3 = self.f_indicator(yi, c) self.loss_fuse = self.f_indicator(yi, d) self.loss_side = self.loss1 + self.loss2 + self.loss3 + self.loss_fuse res_loss = tf.reduce_sum(self.loss_side) return res_loss #损失函数的定义,标签为0时为log(1-yj),标签为1时为log(yj) def f_indicator(self,yi,yj): b = tf.where(yj>=1,yj*50,tf.abs(tf.log(tf.abs(1 - yj)))) res=tf.where(tf.equal(yi , 0.0), b,tf.abs(tf.log(tf.clip_by_value(yj, 1e-8, float("inf"))))) return res ```
修改的SSD—Tensorflow 版本在训练的时候遇到loss输入维度不一致
目前在学习目标检测识别的方向。 自己参考了一些论文 对原版的SSD进行了一些改动工作 前面的网络模型部分已经修改完成且不报错。 但是在进行训练操作的时候会出现 ’ValueError: Dimension 0 in both shapes must be equal, but are 233920 and 251392. Shapes are [233920] and [251392]. for 'ssd_losses/Select' (op: 'Select') with input shapes: [251392], [233920], [251392]. ‘ ‘两个形状中的尺寸0必须相等,但分别为233920和251392。形状有[233920]和[251392]。对于输入形状为[251392]、[233920]、[251392]的''ssd_losses/Select' (op: 'Select') ![图片说明](https://img-ask.csdn.net/upload/201904/06/1554539638_631515.png) ![图片说明](https://img-ask.csdn.net/upload/201904/06/1554539651_430990.png) # SSD loss function. # =========================================================================== # def ssd_losses(logits, localisations, gclasses, glocalisations, gscores, match_threshold=0.5, negative_ratio=3., alpha=1., label_smoothing=0., device='/cpu:0', scope=None): with tf.name_scope(scope, 'ssd_losses'): lshape = tfe.get_shape(logits[0], 5) num_classes = lshape[-1] batch_size = lshape[0] # Flatten out all vectors! flogits = [] fgclasses = [] fgscores = [] flocalisations = [] fglocalisations = [] for i in range(len(logits)): flogits.append(tf.reshape(logits[i], [-1, num_classes])) fgclasses.append(tf.reshape(gclasses[i], [-1])) fgscores.append(tf.reshape(gscores[i], [-1])) flocalisations.append(tf.reshape(localisations[i], [-1, 4])) fglocalisations.append(tf.reshape(glocalisations[i], [-1, 4])) # And concat the crap! logits = tf.concat(flogits, axis=0) gclasses = tf.concat(fgclasses, axis=0) gscores = tf.concat(fgscores, axis=0) localisations = tf.concat(flocalisations, axis=0) glocalisations = tf.concat(fglocalisations, axis=0) dtype = logits.dtype # Compute positive matching mask... pmask = gscores > match_threshold fpmask = tf.cast(pmask, dtype) n_positives = tf.reduce_sum(fpmask) # Hard negative mining... no_classes = tf.cast(pmask, tf.int32) predictions = slim.softmax(logits) nmask = tf.logical_and(tf.logical_not(pmask), gscores > -0.5) fnmask = tf.cast(nmask, dtype) nvalues = tf.where(nmask, predictions[:, 0], 1. - fnmask) nvalues_flat = tf.reshape(nvalues, [-1]) # Number of negative entries to select. max_neg_entries = tf.cast(tf.reduce_sum(fnmask), tf.int32) n_neg = tf.cast(negative_ratio * n_positives, tf.int32) + batch_size n_neg = tf.minimum(n_neg, max_neg_entries) val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg) max_hard_pred = -val[-1] # Final negative mask. nmask = tf.logical_and(nmask, nvalues < max_hard_pred) fnmask = tf.cast(nmask, dtype) # Add cross-entropy loss. with tf.name_scope('cross_entropy_pos'): loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=gclasses) loss = tf.div(tf.reduce_sum(loss * fpmask), batch_size, name='value') tf.losses.add_loss(loss) with tf.name_scope('cross_entropy_neg'): loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=no_classes) loss = tf.div(tf.reduce_sum(loss * fnmask), batch_size, name='value') tf.losses.add_loss(loss) # Add localization loss: smooth L1, L2, ... with tf.name_scope('localization'): # Weights Tensor: positive mask + random negative. weights = tf.expand_dims(alpha * fpmask, axis=-1) loss = custom_layers.abs_smooth(localisations - glocalisations) loss = tf.div(tf.reduce_sum(loss * weights), batch_size, name='value') tf.losses.add_loss(loss) ``` ``` 研究了一段时间的源码 (因为只是SSD-Tensorflow-Master中的ssd_vgg_300.py中定义网络结构的那部分做了修改 ,loss函数代码部分并没有进行改动)所以没所到错误所在,网上也找不到相关的解决方案。 希望大神能够帮忙解答 感激不尽~
tensorflow重载模型继续训练得到的loss比原模型继续训练得到的loss大,是什么原因??
我使用tensorflow训练了一个模型,在第10个epoch时保存模型,然后在一个新的文件里重载模型继续训练,结果我发现重载的模型在第一个epoch的loss比原模型在epoch=11的loss要大,我感觉既然是重载了原模型,那么重载模型训练的第一个epoch应该是和原模型训练的第11个epoch相等的,一直找不到问题或者自己逻辑的问题,希望大佬能指点迷津。源代码和重载模型的代码如下: ``` 原代码: from tensorflow.examples.tutorials.mnist import input_data import tensorflow as tf import os import numpy as np os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' mnist = input_data.read_data_sets("./",one_hot=True) tf.reset_default_graph() ###定义数据和标签 n_inputs = 784 n_classes = 10 X = tf.placeholder(tf.float32,[None,n_inputs],name='X') Y = tf.placeholder(tf.float32,[None,n_classes],name='Y') ###定义网络结构 n_hidden_1 = 256 n_hidden_2 = 256 layer_1 = tf.layers.dense(inputs=X,units=n_hidden_1,activation=tf.nn.relu,kernel_regularizer=tf.contrib.layers.l2_regularizer(0.01)) layer_2 = tf.layers.dense(inputs=layer_1,units=n_hidden_2,activation=tf.nn.relu,kernel_regularizer=tf.contrib.layers.l2_regularizer(0.01)) outputs = tf.layers.dense(inputs=layer_2,units=n_classes,name='outputs') pred = tf.argmax(tf.nn.softmax(outputs,axis=1),axis=1) print(pred.name) err = tf.count_nonzero((pred - tf.argmax(Y,axis=1))) cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=outputs,labels=Y),name='cost') print(cost.name) ###定义优化器 learning_rate = 0.001 optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost,name='OP') saver = tf.train.Saver() checkpoint = 'softmax_model/dense_model.cpkt' ###训练 batch_size = 100 training_epochs = 11 with tf.Session() as sess: sess.run(tf.global_variables_initializer()) for epoch in range(training_epochs): batch_num = int(mnist.train.num_examples / batch_size) epoch_cost = 0 sumerr = 0 for i in range(batch_num): batch_x,batch_y = mnist.train.next_batch(batch_size) c,e = sess.run([cost,err],feed_dict={X:batch_x,Y:batch_y}) _ = sess.run(optimizer,feed_dict={X:batch_x,Y:batch_y}) epoch_cost += c / batch_num sumerr += e / mnist.train.num_examples if epoch == (training_epochs - 1): print('batch_cost = ',c) if epoch == (training_epochs - 2): saver.save(sess, checkpoint) print('test_error = ',sess.run(cost, feed_dict={X: mnist.test.images, Y: mnist.test.labels})) ``` ``` 重载模型的代码: from tensorflow.examples.tutorials.mnist import input_data import tensorflow as tf import os os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' mnist = input_data.read_data_sets("./",one_hot=True) #one_hot=True指对样本标签进行独热编码 file_path = 'softmax_model/dense_model.cpkt' saver = tf.train.import_meta_graph(file_path + '.meta') graph = tf.get_default_graph() X = graph.get_tensor_by_name('X:0') Y = graph.get_tensor_by_name('Y:0') cost = graph.get_operation_by_name('cost').outputs[0] train_op = graph.get_operation_by_name('OP') training_epoch = 10 learning_rate = 0.001 batch_size = 100 with tf.Session() as sess: saver.restore(sess,file_path) print('test_cost = ',sess.run(cost, feed_dict={X: mnist.test.images, Y: mnist.test.labels})) for epoch in range(training_epoch): batch_num = int(mnist.train.num_examples / batch_size) epoch_cost = 0 for i in range(batch_num): batch_x, batch_y = mnist.train.next_batch(batch_size) c = sess.run(cost, feed_dict={X: batch_x, Y: batch_y}) _ = sess.run(train_op, feed_dict={X: batch_x, Y: batch_y}) epoch_cost += c / batch_num print(epoch_cost) ``` 值得注意的是,我在原模型和重载模型里都计算了测试集的cost,两者的结果是一致的。说明参数载入应该是对的
请问tensorflow的训练的loss一直在1.几和0.几之间跳来跳去是算没收敛还是收敛了?
UCI上面找的训练集的输入是连续的,而标签是离散的,我见别人的LOSS都是持续下降到0.00几的,我这个把学习率调到0.001,激活函数是sigmoid,但还是在1.几和0.几之间徘徊,这是正常现象吗?不是的话是哪里出问题了? 跪求大佬解答!!
tensorflow载入训练好的模型进行预测,同一张图片预测的结果却不一样????
最近在跑deeplabv1,在测试代码的时候,跑通了训练程序,但是用训练好的模型进行与测试却发现相同的图片预测的结果不一样??请问有大神知道怎么回事吗? 用的是saver.restore()方法载入模型。代码如下: ``` def main(): """Create the model and start the inference process.""" args = get_arguments() # Prepare image. img = tf.image.decode_jpeg(tf.read_file(args.img_path), channels=3) # Convert RGB to BGR. img_r, img_g, img_b = tf.split(value=img, num_or_size_splits=3, axis=2) img = tf.cast(tf.concat(axis=2, values=[img_b, img_g, img_r]), dtype=tf.float32) # Extract mean. img -= IMG_MEAN # Create network. net = DeepLabLFOVModel() # Which variables to load. trainable = tf.trainable_variables() # Predictions. pred = net.preds(tf.expand_dims(img, dim=0)) # Set up TF session and initialize variables. config = tf.ConfigProto() config.gpu_options.allow_growth = True sess = tf.Session(config=config) #init = tf.global_variables_initializer() sess.run(tf.global_variables_initializer()) # Load weights. saver = tf.train.Saver(var_list=trainable) load(saver, sess, args.model_weights) # Perform inference. preds = sess.run([pred]) print(preds) if not os.path.exists(args.save_dir): os.makedirs(args.save_dir) msk = decode_labels(np.array(preds)[0, 0, :, :, 0]) im = Image.fromarray(msk) im.save(args.save_dir + 'mask1.png') print('The output file has been saved to {}'.format( args.save_dir + 'mask.png')) if __name__ == '__main__': main() ``` 其中load是 ``` def load(saver, sess, ckpt_path): '''Load trained weights. Args: saver: TensorFlow saver object. sess: TensorFlow session. ckpt_path: path to checkpoint file with parameters. ''' ckpt = tf.train.get_checkpoint_state(ckpt_path) if ckpt and ckpt.model_checkpoint_path: saver.restore(sess, ckpt.model_checkpoint_path) print("Restored model parameters from {}".format(ckpt_path)) ``` DeepLabLFOVMode类如下: ``` class DeepLabLFOVModel(object): """DeepLab-LargeFOV model with atrous convolution and bilinear upsampling. This class implements a multi-layer convolutional neural network for semantic image segmentation task. This is the same as the model described in this paper: https://arxiv.org/abs/1412.7062 - please look there for details. """ def __init__(self, weights_path=None): """Create the model. Args: weights_path: the path to the cpkt file with dictionary of weights from .caffemodel. """ self.variables = self._create_variables(weights_path) def _create_variables(self, weights_path): """Create all variables used by the network. This allows to share them between multiple calls to the loss function. Args: weights_path: the path to the ckpt file with dictionary of weights from .caffemodel. If none, initialise all variables randomly. Returns: A dictionary with all variables. """ var = list() index = 0 if weights_path is not None: with open(weights_path, "rb") as f: weights = cPickle.load(f) # Load pre-trained weights. for name, shape in net_skeleton: var.append(tf.Variable(weights[name], name=name)) del weights else: # Initialise all weights randomly with the Xavier scheme, # and # all biases to 0's. for name, shape in net_skeleton: if "/w" in name: # Weight filter. w = create_variable(name, list(shape)) var.append(w) else: b = create_bias_variable(name, list(shape)) var.append(b) return var def _create_network(self, input_batch, keep_prob): """Construct DeepLab-LargeFOV network. Args: input_batch: batch of pre-processed images. keep_prob: probability of keeping neurons intact. Returns: A downsampled segmentation mask. """ current = input_batch v_idx = 0 # Index variable. # Last block is the classification layer. for b_idx in xrange(len(dilations) - 1): for l_idx, dilation in enumerate(dilations[b_idx]): w = self.variables[v_idx * 2] b = self.variables[v_idx * 2 + 1] if dilation == 1: conv = tf.nn.conv2d(current, w, strides=[ 1, 1, 1, 1], padding='SAME') else: conv = tf.nn.atrous_conv2d( current, w, dilation, padding='SAME') current = tf.nn.relu(tf.nn.bias_add(conv, b)) v_idx += 1 # Optional pooling and dropout after each block. if b_idx < 3: current = tf.nn.max_pool(current, ksize=[1, ks, ks, 1], strides=[1, 2, 2, 1], padding='SAME') elif b_idx == 3: current = tf.nn.max_pool(current, ksize=[1, ks, ks, 1], strides=[1, 1, 1, 1], padding='SAME') elif b_idx == 4: current = tf.nn.max_pool(current, ksize=[1, ks, ks, 1], strides=[1, 1, 1, 1], padding='SAME') current = tf.nn.avg_pool(current, ksize=[1, ks, ks, 1], strides=[1, 1, 1, 1], padding='SAME') elif b_idx <= 6: current = tf.nn.dropout(current, keep_prob=keep_prob) # Classification layer; no ReLU. # w = self.variables[v_idx * 2] w = create_variable(name='w', shape=[1, 1, 1024, n_classes]) # b = self.variables[v_idx * 2 + 1] b = create_bias_variable(name='b', shape=[n_classes]) conv = tf.nn.conv2d(current, w, strides=[1, 1, 1, 1], padding='SAME') current = tf.nn.bias_add(conv, b) return current def prepare_label(self, input_batch, new_size): """Resize masks and perform one-hot encoding. Args: input_batch: input tensor of shape [batch_size H W 1]. new_size: a tensor with new height and width. Returns: Outputs a tensor of shape [batch_size h w 18] with last dimension comprised of 0's and 1's only. """ with tf.name_scope('label_encode'): # As labels are integer numbers, need to use NN interp. input_batch = tf.image.resize_nearest_neighbor( input_batch, new_size) # Reducing the channel dimension. input_batch = tf.squeeze(input_batch, squeeze_dims=[3]) input_batch = tf.one_hot(input_batch, depth=n_classes) return input_batch def preds(self, input_batch): """Create the network and run inference on the input batch. Args: input_batch: batch of pre-processed images. Returns: Argmax over the predictions of the network of the same shape as the input. """ raw_output = self._create_network( tf.cast(input_batch, tf.float32), keep_prob=tf.constant(1.0)) raw_output = tf.image.resize_bilinear( raw_output, tf.shape(input_batch)[1:3, ]) raw_output = tf.argmax(raw_output, dimension=3) raw_output = tf.expand_dims(raw_output, dim=3) # Create 4D-tensor. return tf.cast(raw_output, tf.uint8) def loss(self, img_batch, label_batch): """Create the network, run inference on the input batch and compute loss. Args: input_batch: batch of pre-processed images. Returns: Pixel-wise softmax loss. """ raw_output = self._create_network( tf.cast(img_batch, tf.float32), keep_prob=tf.constant(0.5)) prediction = tf.reshape(raw_output, [-1, n_classes]) # Need to resize labels and convert using one-hot encoding. label_batch = self.prepare_label( label_batch, tf.stack(raw_output.get_shape()[1:3])) gt = tf.reshape(label_batch, [-1, n_classes]) # Pixel-wise softmax loss. loss = tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=gt) reduced_loss = tf.reduce_mean(loss) return reduced_loss ``` 按理说载入模型应该没有问题,可是不知道为什么结果却不一样? 图片:![图片说明](https://img-ask.csdn.net/upload/201911/15/1573810836_83106.jpg) ![图片说明](https://img-ask.csdn.net/upload/201911/15/1573810850_924663.png) 预测的结果: ![图片说明](https://img-ask.csdn.net/upload/201911/15/1573810884_985680.png) ![图片说明](https://img-ask.csdn.net/upload/201911/15/1573810904_577649.png) 两次结果不一样,与保存的模型算出来的结果也不一样。 我用的是GitHub上这个人的代码: https://github.com/minar09/DeepLab-LFOV-TensorFlow 急急急,请问有大神知道吗???
运行tensorflow时出现tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed这个错误
运行tensorflow时出现tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed这个错误,查了一下说是gpu被占用了,从下面这里开始出问题的: ``` 2019-10-17 09:28:49.495166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6382 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1) (60000, 28, 28) (60000, 10) 2019-10-17 09:28:51.275415: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cublas64_100.dll'; dlerror: cublas64_100.dll not found ``` ![图片说明](https://img-ask.csdn.net/upload/201910/17/1571277238_292620.png) 最后显示的问题: ![图片说明](https://img-ask.csdn.net/upload/201910/17/1571277311_655722.png) 试了一下网上的方法,比如加代码: ``` gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333) sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) ``` 但最后提示: ![图片说明](https://img-ask.csdn.net/upload/201910/17/1571277460_72752.png) 现在不知道要怎么解决了。新手想试下简单的数字识别,步骤也是按教程一步步来的,可能用的版本和教程不一样,我用的是刚下的:2.0tensorflow和以下: ![图片说明](https://img-ask.csdn.net/upload/201910/17/1571277627_439100.png) 不知道会不会有版本问题,现在紧急求助各位大佬,还有没有其它可以尝试的方法。测试程序加法运算可以执行,数字识别图片运行的时候我看了下,GPU最大占有率才0.2%,下面是完整数字图片识别代码: ``` import os import tensorflow as tf from tensorflow import keras from tensorflow.keras import layers, optimizers, datasets os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' #gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2) #sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333) sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) (x, y), (x_val, y_val) = datasets.mnist.load_data() x = tf.convert_to_tensor(x, dtype=tf.float32) / 255. y = tf.convert_to_tensor(y, dtype=tf.int32) y = tf.one_hot(y, depth=10) print(x.shape, y.shape) train_dataset = tf.data.Dataset.from_tensor_slices((x, y)) train_dataset = train_dataset.batch(200) model = keras.Sequential([ layers.Dense(512, activation='relu'), layers.Dense(256, activation='relu'), layers.Dense(10)]) optimizer = optimizers.SGD(learning_rate=0.001) def train_epoch(epoch): # Step4.loop for step, (x, y) in enumerate(train_dataset): with tf.GradientTape() as tape: # [b, 28, 28] => [b, 784] x = tf.reshape(x, (-1, 28 * 28)) # Step1. compute output # [b, 784] => [b, 10] out = model(x) # Step2. compute loss loss = tf.reduce_sum(tf.square(out - y)) / x.shape[0] # Step3. optimize and update w1, w2, w3, b1, b2, b3 grads = tape.gradient(loss, model.trainable_variables) # w' = w - lr * grad optimizer.apply_gradients(zip(grads, model.trainable_variables)) if step % 100 == 0: print(epoch, step, 'loss:', loss.numpy()) def train(): for epoch in range(30): train_epoch(epoch) if __name__ == '__main__': train() ``` 希望能有人给下建议或解决方法,拜谢!
训练准确率很高,测试准确率很低(loss有一直下降)
tensorflow模型,4层cnn 跑出来的结果训练集能达到1,测试集准确率基本保持在0.5左右,二分类 数据有用shuffle, 有dropout,有正则化,learning rate也挺小的。 不知道是什么原因,求解答! ![图片说明](https://img-ask.csdn.net/upload/201912/11/1576068416_852077.png)
用TensorFlow 训练mask rcnn时,总是在执行训练语句时报错,进行不下去了,求大神
用TensorFlow 训练mask rcnn时,总是在执行训练语句时报错,进行不下去了,求大神 执行语句是: ``` python model_main.py --model_dir=C:/Users/zoyiJiang/Desktop/mask_rcnn_test-master/training --pipeline_config_path=C:/Users/zoyiJiang/Desktop/mask_rcnn_test-master/training/mask_rcnn_inception_v2_coco.config ``` 报错信息如下: ``` WARNING:tensorflow:Forced number of epochs for all eval validations to be 1. WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1. WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x000001C1EA335C80>) includes params argument, but params are not passed to Estimator. WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards. Traceback (most recent call last): File "model_main.py", line 109, in <module> tf.app.run() File "E:\Python3.6\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run _sys.exit(main(argv)) File "model_main.py", line 105, in main tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0]) File "E:\Python3.6\lib\site-packages\tensorflow\python\estimator\training.py", line 439, in train_and_evaluate executor.run() File "E:\Python3.6\lib\site-packages\tensorflow\python\estimator\training.py", line 518, in run self.run_local() File "E:\Python3.6\lib\site-packages\tensorflow\python\estimator\training.py", line 650, in run_local hooks=train_hooks) File "E:\Python3.6\lib\site-packages\tensorflow\python\estimator\estimator.py", line 363, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "E:\Python3.6\lib\site-packages\tensorflow\python\estimator\estimator.py", line 843, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "E:\Python3.6\lib\site-packages\tensorflow\python\estimator\estimator.py", line 853, in _train_model_default input_fn, model_fn_lib.ModeKeys.TRAIN)) File "E:\Python3.6\lib\site-packages\tensorflow\python\estimator\estimator.py", line 691, in _get_features_and_labels_from_input_fn result = self._call_input_fn(input_fn, mode) File "E:\Python3.6\lib\site-packages\tensorflow\python\estimator\estimator.py", line 798, in _call_input_fn return input_fn(**kwargs) File "D:\Tensorflow\tf\models\research\object_detection\inputs.py", line 525, in _train_input_fn batch_size=params['batch_size'] if params else train_config.batch_size) File "D:\Tensorflow\tf\models\research\object_detection\builders\dataset_builder.py", line 149, in build dataset = data_map_fn(process_fn, num_parallel_calls=num_parallel_calls) File "E:\Python3.6\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 853, in map return ParallelMapDataset(self, map_func, num_parallel_calls) File "E:\Python3.6\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 1870, in __init__ super(ParallelMapDataset, self).__init__(input_dataset, map_func) File "E:\Python3.6\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 1839, in __init__ self._map_func.add_to_graph(ops.get_default_graph()) File "E:\Python3.6\lib\site-packages\tensorflow\python\framework\function.py", line 484, in add_to_graph self._create_definition_if_needed() File "E:\Python3.6\lib\site-packages\tensorflow\python\framework\function.py", line 319, in _create_definition_if_needed self._create_definition_if_needed_impl() File "E:\Python3.6\lib\site-packages\tensorflow\python\framework\function.py", line 336, in _create_definition_if_needed_impl outputs = self._func(*inputs) File "E:\Python3.6\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 1804, in tf_map_func ret = map_func(nested_args) File "D:\Tensorflow\tf\models\research\object_detection\builders\dataset_builder.py", line 130, in process_fn processed_tensors = transform_input_data_fn(processed_tensors) File "D:\Tensorflow\tf\models\research\object_detection\inputs.py", line 515, in transform_and_pad_input_data_fn tensor_dict=transform_data_fn(tensor_dict), File "D:\Tensorflow\tf\models\research\object_detection\inputs.py", line 129, in transform_input_data tf.expand_dims(tf.to_float(image), axis=0)) File "D:\Tensorflow\tf\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py", line 543, in preprocess parallel_iterations=self._parallel_iterations) File "D:\Tensorflow\tf\models\research\object_detection\utils\shape_utils.py", line 237, in static_or_dynamic_map_fn outputs = [fn(arg) for arg in tf.unstack(elems)] File "D:\Tensorflow\tf\models\research\object_detection\utils\shape_utils.py", line 237, in <listcomp> outputs = [fn(arg) for arg in tf.unstack(elems)] File "D:\Tensorflow\tf\models\research\object_detection\core\preprocessor.py", line 2264, in resize_to_range lambda: _resize_portrait_image(image)) File "E:\Python3.6\lib\site-packages\tensorflow\python\util\deprecation.py", line 432, in new_func return func(*args, **kwargs) File "E:\Python3.6\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2063, in cond orig_res_t, res_t = context_t.BuildCondBranch(true_fn) File "E:\Python3.6\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 1913, in BuildCondBranch original_result = fn() File "D:\Tensorflow\tf\models\research\object_detection\core\preprocessor.py", line 2263, in <lambda> lambda: _resize_landscape_image(image), File "D:\Tensorflow\tf\models\research\object_detection\core\preprocessor.py", line 2245, in _resize_landscape_image align_corners=align_corners, preserve_aspect_ratio=True) TypeError: resize_images() got an unexpected keyword argument 'preserve_aspect_ratio' ``` 根据提示的最后一句,是说没有一个有效参数 我用的是TensorFlow1.8 python3.6,下载的最新的TensorFlow-models-master
为什么用vgg16网络训练我自己的数据集,loss一直在1.7左右震荡,用训练好的模型进行预测出来的都是一个值?
最近用vgg16网络训练自己的数据集,最开始的学习率设为了1e-3,发现loss到了1.7左右就不再降了,然后我调小学习率,调到了1e-5,loss到了1.7还是不降,一直在1.7左右震荡,训练了很多个epoch还是不变,于是我想测试看看,发现测试出来的图片分类结果都是一个值,请问有大佬知道这是为什么吗?
keras 训练网络时出现ValueError
rt 使用keras中的model.fit函数进行训练时出现错误:ValueError: None values not supported. 错误信息如下: ``` File "C:/Users/Desktop/MNISTpractice/mnist.py", line 93, in <module> model.fit(x_train,y_train, epochs=2, callbacks=callback_list,validation_data=(x_val,y_val)) File "C:\Anaconda3\lib\site-packages\keras\engine\training.py", line 1575, in fit self._make_train_function() File "C:\Anaconda3\lib\site-packages\keras\engine\training.py", line 960, in _make_train_function loss=self.total_loss) File "C:\Anaconda3\lib\site-packages\keras\legacy\interfaces.py", line 87, in wrapper return func(*args, **kwargs) File "C:\Anaconda3\lib\site-packages\keras\optimizers.py", line 432, in get_updates m_t = (self.beta_1 * m) + (1. - self.beta_1) * g File "C:\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 820, in binary_op_wrapper y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y") File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 639, in convert_to_tensor as_ref=False) File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 704, in internal_convert_to_tensor ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref) File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py", line 113, in _constant_tensor_conversion_function return constant(v, dtype=dtype, name=name) File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in constant tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape)) File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py", line 360, in make_tensor_proto raise ValueError("None values not supported.") ValueError: None values not supported. ```
使用GPU训练深度神经网络,模型不收敛是什么原因?
无论是用caffe还是tensorflow来训练模型,只要是用GPU来训练,loss都不会下降,但同样情况下,用CPU来训练却没问题是什么原因呢?
Kafka实战(三) - Kafka的自我修养与定位
Apache Kafka是消息引擎系统,也是一个分布式流处理平台(Distributed Streaming Platform) Kafka是LinkedIn公司内部孵化的项目。LinkedIn最开始有强烈的数据强实时处理方面的需求,其内部的诸多子系统要执行多种类型的数据处理与分析,主要包括业务系统和应用程序性能监控,以及用户行为数据处理等。 遇到的主要问题: 数据正确性不足 数据的收集主要...
volatile 与 synchronize 详解
Java支持多个线程同时访问一个对象或者对象的成员变量,由于每个线程可以拥有这个变量的拷贝(虽然对象以及成员变量分配的内存是在共享内存中的,但是每个执行的线程还是可以拥有一份拷贝,这样做的目的是加速程序的执行,这是现代多核处理器的一个显著特性),所以程序在执行过程中,一个线程看到的变量并不一定是最新的。 volatile 关键字volatile可以用来修饰字段(成员变量),就是告知程序任何对该变量...
Java学习的正确打开方式
在博主认为,对于入门级学习java的最佳学习方法莫过于视频+博客+书籍+总结,前三者博主将淋漓尽致地挥毫于这篇博客文章中,至于总结在于个人,实际上越到后面你会发现学习的最好方式就是阅读参考官方文档其次就是国内的书籍,博客次之,这又是一个层次了,这里暂时不提后面再谈。博主将为各位入门java保驾护航,各位只管冲鸭!!!上天是公平的,只要不辜负时间,时间自然不会辜负你。 何谓学习?博主所理解的学习,它是一个过程,是一个不断累积、不断沉淀、不断总结、善于传达自己的个人见解以及乐于分享的过程。
程序员必须掌握的核心算法有哪些?
由于我之前一直强调数据结构以及算法学习的重要性,所以就有一些读者经常问我,数据结构与算法应该要学习到哪个程度呢?,说实话,这个问题我不知道要怎么回答你,主要取决于你想学习到哪些程度,不过针对这个问题,我稍微总结一下我学过的算法知识点,以及我觉得值得学习的算法。这些算法与数据结构的学习大多数是零散的,并没有一本把他们全部覆盖的书籍。下面是我觉得值得学习的一些算法以及数据结构,当然,我也会整理一些看过...
有哪些让程序员受益终生的建议
从业五年多,辗转两个大厂,出过书,创过业,从技术小白成长为基层管理,联合几个业内大牛回答下这个问题,希望能帮到大家,记得帮我点赞哦。 敲黑板!!!读了这篇文章,你将知道如何才能进大厂,如何实现财务自由,如何在工作中游刃有余,这篇文章很长,但绝对是精品,记得帮我点赞哦!!!! 一腔肺腑之言,能看进去多少,就看你自己了!!! 目录: 在校生篇: 为什么要尽量进大厂? 如何选择语言及方...
大学四年自学走来,这些私藏的实用工具/学习网站我贡献出来了
大学四年,看课本是不可能一直看课本的了,对于学习,特别是自学,善于搜索网上的一些资源来辅助,还是非常有必要的,下面我就把这几年私藏的各种资源,网站贡献出来给你们。主要有:电子书搜索、实用工具、在线视频学习网站、非视频学习网站、软件下载、面试/求职必备网站。 注意:文中提到的所有资源,文末我都给你整理好了,你们只管拿去,如果觉得不错,转发、分享就是最大的支持了。 一、电子书搜索 对于大部分程序员...
linux系列之常用运维命令整理笔录
本博客记录工作中需要的linux运维命令,大学时候开始接触linux,会一些基本操作,可是都没有整理起来,加上是做开发,不做运维,有些命令忘记了,所以现在整理成博客,当然vi,文件操作等就不介绍了,慢慢积累一些其它拓展的命令,博客不定时更新 free -m 其中:m表示兆,也可以用g,注意都要小写 Men:表示物理内存统计 total:表示物理内存总数(total=used+free) use...
比特币原理详解
一、什么是比特币 比特币是一种电子货币,是一种基于密码学的货币,在2008年11月1日由中本聪发表比特币白皮书,文中提出了一种去中心化的电子记账系统,我们平时的电子现金是银行来记账,因为银行的背后是国家信用。去中心化电子记账系统是参与者共同记账。比特币可以防止主权危机、信用风险。其好处不多做赘述,这一层面介绍的文章很多,本文主要从更深层的技术原理角度进行介绍。 二、问题引入 假设现有4个人...
GitHub开源史上最大规模中文知识图谱
近日,一直致力于知识图谱研究的 OwnThink 平台在 Github 上开源了史上最大规模 1.4 亿中文知识图谱,其中数据是以(实体、属性、值),(实体、关系、实体)混合的形式组织,数据格式采用 csv 格式。 到目前为止,OwnThink 项目开放了对话机器人、知识图谱、语义理解、自然语言处理工具。知识图谱融合了两千五百多万的实体,拥有亿级别的实体属性关系,机器人采用了基于知识图谱的语义感...
程序员接私活怎样防止做完了不给钱?
首先跟大家说明一点,我们做 IT 类的外包开发,是非标品开发,所以很有可能在开发过程中会有这样那样的需求修改,而这种需求修改很容易造成扯皮,进而影响到费用支付,甚至出现做完了项目收不到钱的情况。 那么,怎么保证自己的薪酬安全呢? 我们在开工前,一定要做好一些证据方面的准备(也就是“讨薪”的理论依据),这其中最重要的就是需求文档和验收标准。一定要让需求方提供这两个文档资料作为开发的基础。之后开发...
网页实现一个简单的音乐播放器(大佬别看。(⊙﹏⊙))
今天闲着无事,就想写点东西。然后听了下歌,就打算写个播放器。 于是乎用h5 audio的加上js简单的播放器完工了。 演示地点演示 html代码如下` music 这个年纪 七月的风 音乐 ` 然后就是css`*{ margin: 0; padding: 0; text-decoration: none; list-...
微信支付崩溃了,但是更让马化腾和张小龙崩溃的竟然是……
loonggg读完需要3分钟速读仅需1分钟事件还得还原到昨天晚上,10 月 29 日晚上 20:09-21:14 之间,微信支付发生故障,全国微信支付交易无法正常进行。然...
Python十大装B语法
Python 是一种代表简单思想的语言,其语法相对简单,很容易上手。不过,如果就此小视 Python 语法的精妙和深邃,那就大错特错了。本文精心筛选了最能展现 Python 语法之精妙的十个知识点,并附上详细的实例代码。如能在实战中融会贯通、灵活使用,必将使代码更为精炼、高效,同时也会极大提升代码B格,使之看上去更老练,读起来更优雅。
数据库优化 - SQL优化
以实际SQL入手,带你一步一步走上SQL优化之路!
2019年11月中国大陆编程语言排行榜
2019年11月2日,我统计了某招聘网站,获得有效程序员招聘数据9万条。针对招聘信息,提取编程语言关键字,并统计如下: 编程语言比例 rank pl_ percentage 1 java 33.62% 2 cpp 16.42% 3 c_sharp 12.82% 4 javascript 12.31% 5 python 7.93% 6 go 7.25% 7 p...
通俗易懂地给女朋友讲:线程池的内部原理
餐盘在灯光的照耀下格外晶莹洁白,女朋友拿起红酒杯轻轻地抿了一小口,对我说:“经常听你说线程池,到底线程池到底是个什么原理?”
《奇巧淫技》系列-python!!每天早上八点自动发送天气预报邮件到QQ邮箱
将代码部署服务器,每日早上定时获取到天气数据,并发送到邮箱。 也可以说是一个小型人工智障。 知识可以运用在不同地方,不一定非是天气预报。
经典算法(5)杨辉三角
杨辉三角 是经典算法,这篇博客对它的算法思想进行了讲解,并有完整的代码实现。
英特尔不为人知的 B 面
从 PC 时代至今,众人只知在 CPU、GPU、XPU、制程、工艺等战场中,英特尔在与同行硬件芯片制造商们的竞争中杀出重围,且在不断的成长进化中,成为全球知名的半导体公司。殊不知,在「刚硬」的背后,英特尔「柔性」的软件早已经做到了全方位的支持与支撑,并持续发挥独特的生态价值,推动产业合作共赢。 而对于这一不知人知的 B 面,很多人将其称之为英特尔隐形的翅膀,虽低调,但是影响力却不容小觑。 那么,在...
腾讯算法面试题:64匹马8个跑道需要多少轮才能选出最快的四匹?
昨天,有网友私信我,说去阿里面试,彻底的被打击到了。问了为什么网上大量使用ThreadLocal的源码都会加上private static?他被难住了,因为他从来都没有考虑过这个问题。无独有偶,今天笔者又发现有网友吐槽了一道腾讯的面试题,我们一起来看看。 腾讯算法面试题:64匹马8个跑道需要多少轮才能选出最快的四匹? 在互联网职场论坛,一名程序员发帖求助到。二面腾讯,其中一个算法题:64匹...
面试官:你连RESTful都不知道我怎么敢要你?
干货,2019 RESTful最贱实践
刷了几千道算法题,这些我私藏的刷题网站都在这里了!
遥想当年,机缘巧合入了 ACM 的坑,周边巨擘林立,从此过上了"天天被虐似死狗"的生活… 然而我是谁,我可是死狗中的战斗鸡,智力不够那刷题来凑,开始了夜以继日哼哧哼哧刷题的日子,从此"读题与提交齐飞, AC 与 WA 一色 ",我惊喜的发现被题虐既刺激又有快感,那一刻我泪流满面。这么好的事儿作为一个正直的人绝不能自己独享,经过激烈的颅内斗争,我决定把我私藏的十几个 T 的,阿不,十几个刷题网...
为啥国人偏爱Mybatis,而老外喜欢Hibernate/JPA呢?
关于SQL和ORM的争论,永远都不会终止,我也一直在思考这个问题。昨天又跟群里的小伙伴进行了一番讨论,感触还是有一些,于是就有了今天这篇文。 声明:本文不会下关于Mybatis和JPA两个持久层框架哪个更好这样的结论。只是摆事实,讲道理,所以,请各位看官勿喷。 一、事件起因 关于Mybatis和JPA孰优孰劣的问题,争论已经很多年了。一直也没有结论,毕竟每个人的喜好和习惯是大不相同的。我也看...
白话阿里巴巴Java开发手册高级篇
不久前,阿里巴巴发布了《阿里巴巴Java开发手册》,总结了阿里巴巴内部实际项目开发过程中开发人员应该遵守的研发流程规范,这些流程规范在一定程度上能够保证最终的项目交付质量,通过在时间中总结模式,并推广给广大开发人员,来避免研发人员在实践中容易犯的错误,确保最终在大规模协作的项目中达成既定目标。 无独有偶,笔者去年在公司里负责升级和制定研发流程、设计模板、设计标准、代码标准等规范,并在实际工作中进行...
SQL-小白最佳入门sql查询一
不要偷偷的查询我的个人资料,即使你再喜欢我,也不要这样,真的不好;
项目中的if else太多了,该怎么重构?
介绍 最近跟着公司的大佬开发了一款IM系统,类似QQ和微信哈,就是聊天软件。我们有一部分业务逻辑是这样的 if (msgType = "文本") { // dosomething } else if(msgType = "图片") { // doshomething } else if(msgType = "视频") { // doshomething } else { // doshom...
Nginx 原理和架构
Nginx 是一个免费的,开源的,高性能的 HTTP 服务器和反向代理,以及 IMAP / POP3 代理服务器。Nginx 以其高性能,稳定性,丰富的功能,简单的配置和低资源消耗而闻名。 Nginx 的整体架构 Nginx 里有一个 master 进程和多个 worker 进程。master 进程并不处理网络请求,主要负责调度工作进程:加载配置、启动工作进程及非停升级。worker 进程负责处...
YouTube排名第一的励志英文演讲《Dream(梦想)》
Idon’t know what that dream is that you have, I don't care how disappointing it might have been as you've been working toward that dream,but that dream that you’re holding in your mind, that it’s po...
“狗屁不通文章生成器”登顶GitHub热榜,分分钟写出万字形式主义大作
一、垃圾文字生成器介绍 最近在浏览GitHub的时候,发现了这样一个骨骼清奇的雷人项目,而且热度还特别高。 项目中文名:狗屁不通文章生成器 项目英文名:BullshitGenerator 根据作者的介绍,他是偶尔需要一些中文文字用于GUI开发时测试文本渲染,因此开发了这个废话生成器。但由于生成的废话实在是太过富于哲理,所以最近已经被小伙伴们给玩坏了。 他的文风可能是这样的: 你发现,...
程序员:我终于知道post和get的区别
是一个老生常谈的话题,然而随着不断的学习,对于以前的认识有很多误区,所以还是需要不断地总结的,学而时习之,不亦说乎
相关热词 c# clr dll c# 如何orm c# 固定大小的字符数组 c#框架设计 c# 删除数据库 c# 中文文字 图片转 c# 成员属性 接口 c#如何将程序封装 16进制负数转换 c# c#练手项目
立即提问

相似问题

3
关于Tensorflow的minst数字识别出现name XXX is not defined的问题
2
用darkenet训练yolov3,跑着跑着LOSS越来越大,然后就出现了大面积NAN,LOSS,IOU等都是NAN值
1
如何可视化tensorflow版的fater rcnn的训练过程?
0
急求!!!用tensorflow搭了lstm的模型进行负荷预测,求问有没有大佬知道loss值已知很大的原因是什么?
2
tensorflow当中的loss里面的logits可不可以是placeholder
1
deeplab v3+训练loss不收敛问题
1
卷积神经网络训练loss变为nan
1
tensorflow重载模型继续训练得到的loss比原模型继续训练得到的loss大,是什么原因??
1
请问tensorflow的训练的loss一直在1.几和0.几之间跳来跳去是算没收敛还是收敛了?
2
度量学习中三元组损失不收敛(loss无法下降到margin以下,样本的降维输出聚在一起)
1
Mask RCNN训练过程中loss为nan的情况(使用labelme标注的数据)
2
用TensorFlow 训练mask rcnn时,总是在执行训练语句时报错,进行不下去了,求大神
1
修改的SSD—Tensorflow 版本在训练的时候遇到loss输入维度不一致
1
训练时,loss一直变大是哪里错了?
1
tensorflow训练过程权重不更新,loss不下降,输出保持不变,只有bias在非常缓慢地变化?
1
TensorFlow的Keras如何使用Dataset作为数据输入?
1
tensorflow自定义的损失函数 focal_loss出现inf,在训练过程中出现inf
0
tensorflow CNN训练图片分类的时候,模型训练不出来,准确率0.1(分了十类),模型失效,是什么原因?
2
初学TensorFlow,跟着视频上写的代码,出现SyntaxError: invalid syntax错误
2
用tensorflow做机器翻译时训练代码有问题