2020-11-29 05:20

Negative CRF loss if mask_zero=False

When training a simple model with a CRF layer, the loss becomes negative after some time if mask_zero=False. I noticed this while working on a larger BiLSTM+CRF model for NER. The bigger model converges to ~90% accuracy without the CRF layer and to ~95% with it, but its loss starts slightly positive and keeps decreasing until it becomes negative. This behaviour is quite unexpected since, as far as I can tell, the model is optimised by minimising the negative log-likelihood of the predictions, which should never be negative.

To study this further I've put together the following toy model, which yields a training loss of -0.0777 if mask_zero=False and 0.0217 otherwise. The reader might argue that the number of epochs is exaggerated, and indeed it is; however, the purpose of this code is simply to reproduce an issue that was originally observed on a model trained with far more data over only 5 epochs. Furthermore, the loss stays positive if mask_zero=True even for larger values of EPOCHS. I've tried to investigate this issue myself by reading the code, without much success... maybe the layer's main author could point me in some direction...

PS: Please note that my word indices in the embedding layer start at 1, so mask_zero should not change anything...

import numpy

from keras.models import Sequential
from keras.layers import Embedding
from keras_contrib.layers import CRF

from numpy.random import seed

from tensorflow import set_random_seed

seed(1)  # the seed values are arbitrary; they just make the run reproducible
set_random_seed(2)

def build_dict(items):
    table = dict()
    for item in items:
        if item not in table:
            table[item] = len(table) + 1
    return table

def prepare_sequence(sequences, table):
    prepared = list()
    for seq in sequences:
        prep_seq = list()
        for item in seq:
            prep_seq.append(table.get(item, -1))
        prepared.append(prep_seq)  # was missing: collect each prepared sequence

    return numpy.asarray(prepared)

data = [
    ('I went to Chicago from New York yesterday'.split(),
     'O O O B_LOC O B_LOC I_LOC O'.split())
]
words_table = build_dict(data[0][0])
labels_table = build_dict(data[0][1])

train_x = prepare_sequence([data[0][0]], words_table)
train_y = prepare_sequence([data[0][1]], labels_table)
train_y = numpy.expand_dims(train_y, -1)

EPOCHS = 700
EMBED_DIM = 200  # embedding size (value is arbitrary for this toy example)


model = Sequential()
model.add(Embedding(len(words_table) + 1, EMBED_DIM, mask_zero=False))  # Random embedding
crf = CRF(len(labels_table) + 1, sparse_target=True)
model.add(crf)  # the CRF layer was never added to the model

model.compile('adam', loss=crf.loss_function, metrics=[crf.accuracy])
history = model.fit(train_x, train_y, epochs=EPOCHS, validation_data=(train_x, train_y), verbose=0)
print(history.history['loss'][-1])  # final training loss

# outputs -0.0777437686920166 if mask_zero=False and 0.02171158790588379 otherwise.




  • weixin_39835158 4 months ago

    Hey, I found something: change the learn mode of the CRF. The default mode is 'join'; if you change it to 'marginal', you get sparse categorical cross-entropy when the sparse argument is True, or categorical cross-entropy otherwise.

  • weixin_39797381 4 months ago

    I just found a bug in the CRF code related to computing the loss, which causes a negative loss when mask_zero=False and a rather large positive loss when mask_zero=True. The CRF loss (negative log-likelihood, nlogL) is composed of two parts: one is logZ and the other is the energy E (input/emission energy plus chain/transition energy). Of these two, the code computing logZ had a bug related to padding/masking. logZ is computed by a recursion in which each step computes an intermediate term logS_k = logsumexp(logS_{k-1} - E_k); the final logS_L becomes logZ (L is the length of the sequence). The code applies the mask when computing E_k but still updates logS_k even for padded inputs, and this causes the negative loss (or the large positive loss). The code computing logS_k is in the step() method in crf.py. In that method, the 'if return_logZ' clause had a bug and needs to be modified as follows:

            if return_logZ:
                energy = chain_energy + K.expand_dims(input_energy_t - prev_target_val, 2)
                new_target_val = K.logsumexp(-energy, 1)
                # added from here
                if len(states) > 3:
                    if K.backend() == 'theano':
                        m = states[3][:, t:(t + 2)]
                    else:
                        m = K.slice(states[3], [0, t], [-1, 2])
                    is_valid = K.expand_dims(m[:, 0])
                    new_target_val = is_valid * new_target_val + (1 - is_valid) * prev_target_val
                # added until here
                return new_target_val, [new_target_val, i + 1]

    I've checked that this solves both issues: the negative loss when mask_zero=False and the large positive loss when mask_zero=True on the Embedding layer. However, it seems to have no effect on the learning performance of the model.
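    The mechanism can be illustrated with a toy forward algorithm in plain numpy (the `log_z` helper and the random energies below are made up for illustration; this is not the actual crf.py code). When padded time steps keep updating logS_k, logZ drifts away from the logZ of the real, unpadded sequence; freezing logS_k on masked steps makes logZ equal to that of the truncated sequence, keeping logZ and the energy term consistent:

```python
import numpy as np

def log_z(emissions, transitions, mask, apply_mask):
    # Forward algorithm: log of the sum of exp(score) over all tag paths.
    # emissions: (T, K) per-step tag scores; transitions: (K, K);
    # mask: (T,) with 1.0 for real steps and 0.0 for padding.
    log_s = emissions[0]
    for t in range(1, len(emissions)):
        scores = log_s[:, None] + transitions + emissions[t][None, :]
        new_log_s = np.logaddexp.reduce(scores, axis=0)  # logsumexp over previous tag
        if apply_mask:
            # fixed behaviour: padded steps leave logS unchanged
            log_s = mask[t] * new_log_s + (1.0 - mask[t]) * log_s
        else:
            log_s = new_log_s  # buggy behaviour: padding keeps accumulating
    return np.logaddexp.reduce(log_s)

rng = np.random.default_rng(0)
emissions = rng.normal(size=(6, 3))
transitions = rng.normal(size=(3, 3))
mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # last three steps are padding

z_fixed = log_z(emissions, transitions, mask, apply_mask=True)
z_buggy = log_z(emissions, transitions, mask, apply_mask=False)
z_trunc = log_z(emissions[:3], transitions, np.ones(3), apply_mask=True)

print(np.isclose(z_fixed, z_trunc))  # masked logZ == logZ of the real sequence
print(np.isclose(z_buggy, z_trunc))  # buggy logZ keeps summing over padding
```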

  • weixin_39687542 4 months ago

    🎉 Congratulations on solving this issue that has been around for so long! Unfortunately, keras-contrib has been discontinued, but you might try opening a new PR with your fix.

  • weixin_39734458 4 months ago

    Found this too

  • weixin_39829501 4 months ago

    I ran into this situation as well, but I am using a CNN for sequence labelling, so I cannot set mask_zero=True in Keras. With an RNN the CRF loss is about 5-6, but with a CNN the loss becomes quite small and then turns negative.

  • weixin_39637723 4 months ago

    I am facing the exact same issue. Any updates?

  • weixin_39886251 4 months ago

    What if you add a masking layer?

  • weixin_39687542 4 months ago

    If you add a masking layer, the problem disappears, if I remember correctly.

  • glin_mk (ReeseIMK) 1 month ago

    I think it is because there is a log() in the CRF's loss_function: log(x) is negative when 0 < x < 1.

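    To spell out why the sign matters (this is a generic negative-log-likelihood identity, not code from keras_contrib, and the path scores below are made up): the CRF loss is logZ minus the gold path's score, and since Z sums exp(score) over all paths including the gold one, the loss cannot go below zero. A negative loss therefore means the normaliser and the score term were computed over inconsistent sets of terms, which is exactly what a masking bug can produce:

```python
import math

# Unnormalised log-scores for three candidate tag paths; path 0 is the gold path.
path_scores = [2.0, 0.5, -1.0]
gold = 0

# Correct loss: -log P(gold) = logZ - score(gold), with Z summed over ALL paths.
log_z = math.log(sum(math.exp(s) for s in path_scores))
nll = log_z - path_scores[gold]
print(nll >= 0.0)  # True: the gold path is one of Z's terms, so logZ >= score(gold)

# A broken normaliser that is inconsistent with the score term (here it drops
# the gold path) can fall below the gold score, and the "loss" goes negative.
buggy_log_z = math.log(math.exp(path_scores[1]) + math.exp(path_scores[2]))
buggy_nll = buggy_log_z - path_scores[gold]
print(buggy_nll < 0.0)  # True
```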