Question
When multi-head attention splits the input tensor, how exactly is the split performed, and why is it done this way?
Code
Parameters:
    x: Tensor
        A tensor with shape [batch_size, seq_length, depth]
Returns:
    A tensor with shape [batch_size, num_heads, seq_length, depth / num_heads]
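Based on the shapes in the docstring above, the split is the standard Transformer `split_heads` step: the last dimension `depth` is reshaped into `(num_heads, depth // num_heads)`, then the head axis is transposed in front of the sequence axis so every head can be processed as an independent batch in one matmul. The snippet below is a minimal sketch of that idea in NumPy (the function name `split_heads` and the example sizes are illustrative, not taken from the original code):

```python
import numpy as np

def split_heads(x, num_heads):
    """Split [batch, seq, depth] into [batch, num_heads, seq, depth // num_heads]."""
    batch_size, seq_length, depth = x.shape
    assert depth % num_heads == 0, "depth must be divisible by num_heads"
    # Reshape: group the depth axis into (heads, depth_per_head).
    # [batch, seq, depth] -> [batch, seq, num_heads, depth_per_head]
    x = x.reshape(batch_size, seq_length, num_heads, depth // num_heads)
    # Transpose the head axis forward:
    # [batch, seq, num_heads, depth_per_head] -> [batch, num_heads, seq, depth_per_head]
    return x.transpose(0, 2, 1, 3)

# Hypothetical example: batch=2, seq=3, depth=8, split into 4 heads.
x = np.arange(2 * 3 * 8, dtype=np.float32).reshape(2, 3, 8)
out = split_heads(x, num_heads=4)
print(out.shape)  # (2, 4, 3, 2)
```

Note that nothing is copied or discarded: head 0 simply sees channels 0..1 of every position, head 1 sees channels 2..3, and so on. Each head then attends over the full sequence but in its own lower-dimensional subspace, which is the usual motivation for the split: several cheap attention patterns in parallel instead of one expensive full-width one.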
What I want
A diagram illustrating how the split works.