The computational cost of self-attention in Transformer models grows quadratically with sequence length, which severely constrains their applicability to long input sequences. We propose Contextual Priority Attention ...