Hi, Wang-Cheng,
Thanks for your awesome work on SASRec! We are planning to develop a standardized sequential recommendation benchmark repository, and SASRec is of course the first model we plan to implement. However, we found some potential issues in this official code, mainly related to the LayerNorm (LN) implementation.
As the original paper describes, Pre-LN is used in SASRec, i.e., each sublayer `g` is applied as `x + Dropout(g(LayerNorm(x)))`.
However, in the official code, only the query (Q) is normalized before the attention layer, while the key and value (K, V) are not (cf. lines 79-83 in `model.py`):

```python
Q = self.attention_layernorms[i](seqs)
mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs,
                                          attn_mask=attention_mask)
seqs = Q + mha_outputs  # Q is LN(seqs)
```

In the standard Pre-LN, the key and value should also be normalized, i.e., Q = K = V = LN(seqs). We have indeed found a previous issue #33 on this point. Nonetheless, we believe that at least the value (V) should be normalized.
We conducted some small experiments to compare the original normalization (Original), the standard Pre-LN (Pre Norm), and Post-LN (Post Norm), where Post-LN is the same scheme as the original Transformer and BERT. The results can be reproduced with the same hyperparameters as the original SASRec paper, on two processed datasets: Amazon-Beauty and MovieLens-1M.
- Num_Block = 2

  |           | Beauty NDCG@10 | Beauty HR@10 | MovieLens-1M NDCG@10 | MovieLens-1M HR@10 |
  |-----------|----------------|--------------|----------------------|--------------------|
  | Original  | 0.05244        | 0.09345      | 0.16997              | 0.29685            |
  | Pre Norm  | 0.05138        | 0.09265      | 0.16629              | 0.29122            |
  | Post Norm | 0.05275        | 0.09363      | 0.17199              | 0.29719            |

- Num_Block = 3

  |           | Beauty NDCG@10 | Beauty HR@10 | MovieLens-1M NDCG@10 | MovieLens-1M HR@10 |
  |-----------|----------------|--------------|----------------------|--------------------|
  | Original  | 0.05095        | 0.09193      | 0.17113              | 0.30017            |
  | Pre Norm  | 0.05314        | 0.09475      | 0.16916              | 0.30248            |
  | Post Norm | 0.05124        | 0.09215      | 0.17257              | 0.3050             |
The experiments show that the Post-LN implementation outperforms the original one, while the performance of the standard Pre-LN is unstable. We suspect this is due to the limited number of Transformer blocks (<= 3) in SASRec: with so few blocks, training with Post-LN may not suffer from its usual instability issue, while the expressive power of Pre-LN is limited (cf. this blog).
Given the above results, we suggest applying the standard Post-LN in SASRec (or at least providing an option in the code, or a clarification in the README about the LN implementation). The code modifications are simple:

- Line 19:

  ```python
  # outputs += inputs  # Remove this line
  ```

- Lines 79-87:
  ```python
  mha_outputs, _ = self.attention_layers[i](seqs, seqs, seqs,
                                            attn_mask=attention_mask)
  seqs = seqs + mha_outputs
  seqs = self.attention_layernorms[i](seqs)  # Apply Post-LN

  seqs = torch.transpose(seqs, 0, 1)
  seqs = seqs + self.forward_layers[i](seqs)
  seqs = self.forward_layernorms[i](seqs)  # Apply Post-LN
  ```

Thank you for your time and consideration.
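Put together, the proposed modification amounts to stacking Post-LN residual blocks: attention, residual add, LN, then FFN, residual add, LN. A minimal torch-free sketch of that forward loop (all helper names here are ours, for illustration only):

```python
def layer_norm(x, eps=1e-5):
    """Plain-Python LayerNorm over one feature vector (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def post_ln_block(x, sublayer):
    """Residual add first, then LayerNorm (the Post-LN ordering)."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def forward(seqs, attn_sublayers, ffn_sublayers):
    """Stack of Post-LN Transformer blocks: each block applies
    attention then FFN, normalizing after each residual connection."""
    for attn, ffn in zip(attn_sublayers, ffn_sublayers):
        seqs = post_ln_block(seqs, attn)  # residual + LN after attention
        seqs = post_ln_block(seqs, ffn)   # residual + LN after FFN
    return seqs
```

One consequence of this ordering is that every block's output is itself normalized, so the final sequence representation always has (near-)zero mean per position.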