In the ProtTrans paper, the authors state:
No auxiliary tasks like BERT's next-sentence prediction were used for any model described here.
But in PEER, the [CLS] token is used as the protein-level embedding for ProtBert. Since ProtBert was pre-trained without any sequence-level objective such as next-sentence prediction, the [CLS] token may not be a meaningful representation of the whole sequence.
For ProtBert, should we use the same strategy as for ESM (i.e., mean pooling over all residues) to get a fairer comparison?
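For concreteness, below is a minimal sketch of the two pooling options, assuming the `Rostlab/prot_bert` checkpoint on the HuggingFace hub and the standard `transformers` API; the variable names are illustrative, not taken from the PEER code.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

# Load ProtBert (checkpoint name assumed to be Rostlab/prot_bert on the HF hub)
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAKQR"
# ProtBert expects space-separated residues, with rare amino acids mapped to X
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)

# Option A: [CLS] embedding (the strategy used for ProtBert in PEER)
cls_embedding = hidden[:, 0, :]

# Option B: mean pooling over residue positions, excluding [CLS], [SEP], and padding
mask = inputs["attention_mask"].clone()
mask[:, 0] = 0                                   # drop [CLS]
sep_index = mask.sum(dim=1)                      # position of [SEP] per sequence
mask[torch.arange(mask.size(0)), sep_index] = 0  # drop [SEP]
mean_embedding = (hidden * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
```

Option B mirrors the mean-pooling strategy applied to ESM, so using it for ProtBert as well would make the protein-level comparison more consistent.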