Overview
Transformer models compute representations for each token by aggregating information from other tokens via attention mechanisms. But a fundamental question remains: Do transformers treat all tokens equally, or does the predictability of a token affect how it attends to context?
We investigate the relationship between a token’s likelihood (how expected it is given context) and its attention patterns. This analysis reveals systematic differences in how transformers process common versus rare tokens, with implications for model robustness and interpretability.
Method: Measuring Attention-Likelihood Correlation
Token Likelihood
For a token \(x_t\) at position \(t\), we compute its likelihood given preceding context:
\[p(x_t | x_{<t}) = \text{softmax}(f_\theta(x_{<t}))_{x_t}\]
Lower likelihood indicates the token is more surprising or unexpected given the context.
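Below is a minimal sketch of how these per-token likelihoods could be computed with a Hugging Face causal language model. The checkpoint name and the helper `token_log_likelihoods` are illustrative choices, not necessarily the paper's exact setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative setup: GPT-2 small (the paper also reports medium, large, and BERT).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_log_likelihoods(text: str) -> torch.Tensor:
    """Return log p(x_t | x_{<t}) for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids    # (1, T)
    with torch.no_grad():
        logits = model(ids).logits                          # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for positions 1..T-1
    targets = ids[:, 1:]                                    # tokens actually observed
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).squeeze(0)

print(token_log_likelihoods("The cat sat on the mat."))
```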
Self-Attention Strength
For each token \(x_t\), we measure how much it attends to itself:
\[\alpha_{t \to t} = \text{Attention}(x_t, x_t)\]
This captures the degree to which a token’s representation incorporates information from its own embedding versus from context.
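A companion sketch for reading \(\alpha_{t \to t}\) off the attention matrices the same model returns. Averaging over heads is an assumption here; the paper may aggregate differently.

```python
def self_attention_strength(text: str) -> torch.Tensor:
    """Return alpha_{t->t} for each layer and position, averaged over heads: (layers, T)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    # out.attentions is a tuple with one (1, heads, T, T) tensor per layer.
    diag_per_layer = []
    for attn in out.attentions:
        attn = attn.mean(dim=1)                    # average over heads -> (1, T, T)
        diag_per_layer.append(attn[0].diagonal())  # alpha_{t->t} at each position
    return torch.stack(diag_per_layer)             # (num_layers, T)
```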
Correlation Analysis
Across many examples, we compute the Pearson correlation between \(\log p(x_t | x_{<t})\) and \(\alpha_{t \to t}\) at each layer \(\ell\).
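Building on the two sketches above, a rough illustration of the layer-wise correlation over a corpus; the tiny `corpus` list is a placeholder, and SciPy's `pearsonr` is just one convenient way to compute the statistic.

```python
import numpy as np
from scipy.stats import pearsonr

def layerwise_correlation(corpus: list[str]) -> np.ndarray:
    """Pearson r between log p(x_t | x_{<t}) and alpha_{t->t}, one value per layer."""
    log_liks, self_attn = [], []
    for text in corpus:
        ll = token_log_likelihoods(text)            # (T-1,)
        sa = self_attention_strength(text)[:, 1:]   # drop the first token to align with ll
        log_liks.append(ll.numpy())
        self_attn.append(sa.numpy())
    log_liks = np.concatenate(log_liks)             # (N,)
    self_attn = np.concatenate(self_attn, axis=1)   # (layers, N)
    return np.array([pearsonr(log_liks, layer)[0] for layer in self_attn])

# Placeholder corpus for illustration only:
corpus = ["The quick brown fox jumps over the lazy dog.",
          "Attention weights vary with how predictable a token is."]
print(layerwise_correlation(corpus))
```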
Results
Main Finding: Negative Correlation in Higher Layers
Correlation between log-likelihood and self-attention across layers:
| Layer | GPT-2 Small | GPT-2 Medium | GPT-2 Large | BERT Base |
|---|---|---|---|---|
| 1-3 (Early) | +0.12 | +0.15 | +0.11 | +0.09 |
| 4-6 (Middle) | -0.03 | -0.05 | -0.08 | -0.04 |
| 7-9 (Late) | -0.24 | -0.31 | -0.37 | -0.28 |
| 10-12 (Final) | -0.41 | -0.48 | -0.52 | -0.45 |
In final layers, low-likelihood (surprising) tokens attend significantly less to themselves, with correlations reaching -0.52 for GPT-2 Large.
Interpretation
This pattern suggests transformers employ a context-dependent aggregation strategy:
- High-likelihood tokens (common, predictable): Retain more self-information, suggesting these tokens are well-understood and require less contextual refinement
- Low-likelihood tokens (rare, surprising): Attend more to context, compensating for uncertainty by gathering more information from surrounding tokens
Implications
1. Robustness to Distribution Shift
Low-likelihood tokens, which occur more frequently in out-of-distribution data, rely heavily on context in the final layers. This suggests:
- Vulnerability: If context is corrupted, rare tokens suffer disproportionately
- Adaptation: Models may naturally adapt to new tokens through contextual integration
2. Interpretability
The attention-likelihood correlation provides a lens for understanding which tokens a model finds “easy” vs. “hard”:
- Tokens with high self-attention are handled confidently with minimal context
- Tokens with low self-attention require extensive contextual reasoning
3. Model Calibration
The relationship between likelihood and attention can be used to:
- Identify tokens where the model is uncertain
- Develop better uncertainty estimates beyond raw probability scores
- Improve selective prediction and abstention mechanisms (a rough sketch follows this list)
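As an illustration of that last point, here is a hypothetical heuristic (not from the paper) that flags a token when it is both surprising and, in the final layer, relies little on its own embedding. The thresholds are arbitrary placeholders, and the helpers are the sketches from the Method section.

```python
def flag_uncertain_tokens(text: str,
                          ll_threshold: float = -8.0,   # arbitrary placeholder
                          sa_threshold: float = 0.05):  # arbitrary placeholder
    """Hypothetical heuristic: flag tokens with low log-likelihood AND low
    final-layer self-attention, as candidates for abstention or review."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    ll = token_log_likelihoods(text)                   # (T-1,)
    sa_final = self_attention_strength(text)[-1, 1:]   # final layer, aligned with ll
    flagged = []
    for tok_id, lp, a in zip(ids[1:], ll, sa_final):
        if lp.item() < ll_threshold and a.item() < sa_threshold:
            flagged.append(tokenizer.decode(int(tok_id)))
    return flagged
```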
Key Findings
- Transformers modulate attention based on token predictability: Self-attention strength is negatively correlated with token likelihood in higher layers, indicating adaptive information aggregation.
- The effect emerges in deeper layers: Early layers show weak positive correlation, but the negative relationship strengthens progressively, reaching -0.5 in final layers.
- Pattern is consistent across model scales and architectures: The attention-likelihood relationship holds for GPT-2 (small, medium, large) and BERT, suggesting a fundamental aspect of transformer computation.
- Implications for robustness: Low-likelihood tokens (common in out-of-distribution data) rely more heavily on context, making them more vulnerable to context perturbations.
Citation
@article{ruscio2023attention,
title = {Attention-Likelihood Relationship in Transformers},
author = {Ruscio, Valeria and Maiorca, Valentino and Silvestri, Fabrizio},
journal = {ICLR Tiny Papers Track},
year = {2023}
}
Authors
Valeria Ruscio¹ · Valentino Maiorca¹ · Fabrizio Silvestri¹
¹Sapienza University of Rome