If you have been following my last two posts, you will know that I've been trying (unsuccessfully so far) to prove to myself that the addition of an attention layer does indeed make a network better at predicting similarity between a pair of inputs. I have had good results with various self-attention mechanisms in a document classification system, but I just couldn't replicate that success with a similarity network.
Upon visualizing the intermediate outputs of the network, I observed that the attention layer seemed to be corrupting the data - instead of highlighting specific portions of the input, as might be expected, the attention layer's output actually looked more uniform than its input. This led me to suspect that either my model or the attention mechanism was somehow flawed or ill-suited for the problem I was trying to solve.
My similarity network treated the similarity problem as a classification problem, i.e., it would predict one of 6 discrete "similarity classes". However, the training data provided the similarities as continuous floating point numbers between 0 and 5. The examples I had seen for similar Siamese network architectures (such as this example from the Keras distribution) typically minimize a continuous function such as contrastive loss. So I decided to change my network to a regression network, more in keeping with the data provided and the examples I had seen. This new network would learn to predict a similarity score by minimizing the Mean Squared Error (MSE) between label and prediction, using the RMSProp optimizer.
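To make the change concrete, here is a minimal sketch of swapping the 6-way softmax head for a single linear output trained with MSE and RMSProp. This is not my actual model code - the input sizes, layer names and the trivial "encoder" here are illustrative assumptions:

```python
# Minimal sketch of the classification-to-regression change (illustrative only).
from keras.layers import Input, Dense, concatenate
from keras.models import Model
from keras.optimizers import RMSprop

ENC_DIM = 100  # assumed size of each sentence encoding

input_left = Input(shape=(ENC_DIM,))
input_right = Input(shape=(ENC_DIM,))
merged = concatenate([input_left, input_right])

# before (classification): pred = Dense(6, activation="softmax")(merged)
#                          compiled with loss="categorical_crossentropy"
# after (regression): a single linear output predicting the similarity score
pred = Dense(1, activation="linear")(merged)

model = Model(inputs=[input_left, input_right], outputs=pred)
model.compile(optimizer=RMSprop(), loss="mse")
model.summary()
```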
With the classification network, the validation loss (categorical cross-entropy) kept oscillating even for the baseline (no attention) network, which led me to believe that the network was overfitting on the training data. The learning curve from the baseline regression network (w/o Attn) looked much better, so the change certainly appears to be a step in the right direction. The network was evaluated using the Root Mean Square Error (RMSE) between label and prediction on a held-out test set. I also computed the Pearson correlation and Spearman (rank) correlation coefficients between labels and predictions on the test set.
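The evaluation itself boils down to something like the following sketch, where y_test and y_pred are assumed to be 1-D arrays of labels and model predictions on the held-out test set:

```python
# Sketch of the three evaluation metrics: RMSE, Pearson and Spearman correlation.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(y_test, y_pred):
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))   # root mean squared error
    pearson, _ = pearsonr(y_test, y_pred)              # linear correlation
    spearman, _ = spearmanr(y_test, y_pred)            # rank correlation
    return rmse, pearson, spearman
```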
In addition, I decided to experiment with some different attention implementations I found on the TensorFlow Neural Machine Translation (NMT) page - the additive style proposed by Bahdanau, and the multiplicative style proposed by Luong. The equations there are in the context of NMT, so I modified them a bit for my use case. I also found that the attention style I was using from the Parikh paper is called the dot product style, so I included that below as well, in similar notation, for comparison. Note that the difference in "style" pertains only to how the alignment matrix α is computed, as shown below.
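Written out in a common notation (my reconstruction, with x_i and x_j denoting rows of the input signal and e_ij the unnormalized score), the three styles compute the alignment matrix roughly as follows:

```latex
\alpha_{ij} = \mathrm{softmax}_j\left(e_{ij}\right), \quad \text{where} \quad
e_{ij} =
\begin{cases}
  x_i^\top x_j                        & \text{dot (Parikh)} \\
  x_i^\top W \, x_j                   & \text{multiplicative (Luong)} \\
  v^\top \tanh(W_1 x_i + W_2 x_j)     & \text{additive (Bahdanau)}
\end{cases}
```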
In all three cases, the alignment matrix is combined with the input signal to form the context vector; the context vector is then concatenated with the input signal, multiplied by a learned weight matrix, and passed through a tanh layer.
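In equation form (again my reconstruction of the prose above), the context vector and the output for position i are roughly:

```latex
c_i = \sum_j \alpha_{ij} \, x_j, \qquad
o_i = \tanh\!\left(W_c \, [x_i ; c_i]\right)
```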
One other thing to note is that unlike my original attention implementation, the alignment matrix in these equations is formed out of the raw inputs rather than the ones scaled through a tanh layer. I did try using scaled inputs with the dot style attention (MM-dot(s)) - this was my original attention layer without any change - but the results weren't as good as dot style attention without scaling (MM-dot).
For the two new attention styles, I added two new custom Keras layers: AttentionMMA for the additive (Bahdanau) style, and AttentionMMM for the multiplicative (Luong) style. These are called from the model with additive attention (MM-add) and the model with multiplicative attention (MM-mult) notebooks respectively. The RMSE, Pearson and Spearman correlation coefficients for each of these models, each trained for 10 epochs, are summarized in the chart below.
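To give an idea of what such a layer looks like, here is a rough sketch of an additive (Bahdanau-style) attention layer over a single input matrix of shape (batch, timesteps, dim), written as a custom Keras layer. This is not the actual AttentionMMA code - the weight names, shapes and initializers are my own assumptions for illustration:

```python
# Rough sketch of an additive (Bahdanau-style) attention layer (illustrative only).
from keras import backend as K
from keras.layers import Layer

class AdditiveSelfAttention(Layer):

    def build(self, input_shape):
        dim = input_shape[-1]
        # score(x_i, x_j) = v^T tanh(W1 x_i + W2 x_j)
        self.W1 = self.add_weight(name="W1", shape=(dim, dim),
                                  initializer="glorot_uniform", trainable=True)
        self.W2 = self.add_weight(name="W2", shape=(dim, dim),
                                  initializer="glorot_uniform", trainable=True)
        self.v = self.add_weight(name="v", shape=(dim, 1),
                                 initializer="glorot_uniform", trainable=True)
        # Wc projects the concatenated [x_i ; c_i] back down to dim
        self.Wc = self.add_weight(name="Wc", shape=(2 * dim, dim),
                                  initializer="glorot_uniform", trainable=True)
        super(AdditiveSelfAttention, self).build(input_shape)

    def call(self, x):
        # x: (batch, T, dim)
        p = K.dot(x, self.W1)                              # (batch, T, dim)
        q = K.dot(x, self.W2)                              # (batch, T, dim)
        # pairwise scores e_ij for every (i, j): (batch, T, T)
        s = K.tanh(K.expand_dims(p, 2) + K.expand_dims(q, 1))
        e = K.squeeze(K.dot(s, self.v), axis=-1)
        alpha = K.softmax(e)                               # softmax over j
        c = K.batch_dot(alpha, x)                          # context: (batch, T, dim)
        # concatenate input and context, apply learned weight and tanh
        o = K.tanh(K.dot(K.concatenate([x, c], axis=-1), self.Wc))
        return o

    def compute_output_shape(self, input_shape):
        return input_shape
```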
As you can see, the dot style attention doesn't do too well against the baseline, regardless of whether the input signal is scaled or not. However, both the additive and multiplicative attention styles result in a significantly lower RMSE and higher correlation coefficients than the baseline, with additive attention giving the best results.
That's all I have for today. I hope you found it interesting. There are many variations among Attention mechanisms, and I was happy to find two that worked well with my similarity network.