Overview
Transformer models compute representations for each token by aggregating information from other tokens via attention mechanisms. But a fundamental question remains: Do transformers treat all tokens equally, or does the predictability of a token affect how it attends to context?
We investigate the relationship between a token’s likelihood (how expected it is given context) and its attention patterns. This analysis reveals systematic differences in how transformers process common versus rare tokens, with implications for model robustness and interpretability.
Method: Measuring Attention-Likelihood Correlation
Token Likelihood
For a token \(x_t\) at position \(t\), we compute its likelihood given preceding context:
\[p(x_t | x_{<t}) = \text{softmax}(f_\theta(x_{<t}))_{x_t}\]
Lower likelihood indicates the token is more surprising or unexpected given the context.
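Below is a minimal sketch of how these per-token likelihoods could be computed with a Hugging Face causal language model. The checkpoint name and the helper `token_log_likelihoods` are illustrative choices, not necessarily the paper's exact setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative setup: GPT-2 small (the paper also reports medium, large, and BERT).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_log_likelihoods(text: str) -> torch.Tensor:
    """Return log p(x_t | x_{<t}) for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids    # (1, T)
    with torch.no_grad():
        logits = model(ids).logits                          # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for positions 1..T-1
    targets = ids[:, 1:]                                    # tokens actually observed
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).squeeze(0)

print(token_log_likelihoods("The cat sat on the mat."))
```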
Self-Attention Strength
For each token \(x_t\), we measure how much it attends to itself:
\[\alpha_{t \to t} = \text{Attention}(x_t, x_t)\]
This captures the degree to which a token’s representation incorporates information from its own embedding versus from context.
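A companion sketch for reading \(\alpha_{t \to t}\) off the attention matrices the same model returns. Averaging over heads is an assumption here; the paper may aggregate differently.

```python
def self_attention_strength(text: str) -> torch.Tensor:
    """Return alpha_{t->t} for each layer and position, averaged over heads: (layers, T)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    # out.attentions is a tuple with one (1, heads, T, T) tensor per layer.
    diag_per_layer = []
    for attn in out.attentions:
        attn = attn.mean(dim=1)                    # average over heads -> (1, T, T)
        diag_per_layer.append(attn[0].diagonal())  # alpha_{t->t} at each position
    return torch.stack(diag_per_layer)             # (num_layers, T)
```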
Correlation Analysis
Across many examples, we compute the Pearson correlation between \(\log p(x_t | x_{<t})\) and \(\alpha_{t \to t}\) at each layer \(\ell\).
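Building on the two sketches above, a rough illustration of the layer-wise correlation over a corpus; the tiny `corpus` list is a placeholder, and SciPy's `pearsonr` is just one convenient way to compute the statistic.

```python
import numpy as np
from scipy.stats import pearsonr

def layerwise_correlation(corpus: list[str]) -> np.ndarray:
    """Pearson r between log p(x_t | x_{<t}) and alpha_{t->t}, one value per layer."""
    log_liks, self_attn = [], []
    for text in corpus:
        ll = token_log_likelihoods(text)            # (T-1,)
        sa = self_attention_strength(text)[:, 1:]   # drop the first token to align with ll
        log_liks.append(ll.numpy())
        self_attn.append(sa.numpy())
    log_liks = np.concatenate(log_liks)             # (N,)
    self_attn = np.concatenate(self_attn, axis=1)   # (layers, N)
    return np.array([pearsonr(log_liks, layer)[0] for layer in self_attn])

# Placeholder corpus for illustration only:
corpus = ["The quick brown fox jumps over the lazy dog.",
          "Attention weights vary with how predictable a token is."]
print(layerwise_correlation(corpus))
```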
Results
Main Finding: Negative Correlation in Higher Layers
Correlation between log-likelihood and self-attention across layers:
| Layer | GPT-2 Small | GPT-2 Medium | GPT-2 Large | BERT Base |
|---|---|---|---|---|
| 1-3 (Early) | +0.12 | +0.15 | +0.11 | +0.09 |
| 4-6 (Middle) | -0.03 | -0.05 | -0.08 | -0.04 |
| 7-9 (Late) | -0.24 | -0.31 | -0.37 | -0.28 |
| 10-12 (Final) | -0.41 | -0.48 | -0.52 | -0.45 |
In final layers, low-likelihood (surprising) tokens attend significantly less to themselves, with correlations reaching -0.52 for GPT-2 Large.
Interpretation
This pattern suggests transformers employ a context-dependent aggregation strategy:
- High-likelihood tokens (common, predictable): Retain more self-information, suggesting these tokens are well-understood and require less contextual refinement
- Low-likelihood tokens (rare, surprising): Attend more to context, compensating for uncertainty by gathering more information from surrounding tokens
Implications
1. Robustness to Distribution Shift
Low-likelihood tokens, which occur more frequently in out-of-distribution data, rely heavily on context in the final layers. This suggests:
- Vulnerability: If context is corrupted, rare tokens suffer disproportionately
- Adaptation: Models may naturally adapt to new tokens through contextual integration
2. Interpretability
The attention-likelihood correlation provides a lens for understanding which tokens a model finds “easy” vs. “hard”:
- Tokens with high self-attention are handled confidently with minimal context
- Tokens with low self-attention require extensive contextual reasoning
3. Model Calibration
The relationship between likelihood and attention can be used to:
- Identify tokens where the model is uncertain
- Develop better uncertainty estimates beyond raw probability scores
- Improve selective prediction and abstention mechanisms (a rough sketch follows this list)
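As an illustration of that last point, here is a hypothetical heuristic (not from the paper) that flags a token when it is both surprising and, in the final layer, relies little on its own embedding. The thresholds are arbitrary placeholders, and the helpers are the sketches from the Method section.

```python
def flag_uncertain_tokens(text: str,
                          ll_threshold: float = -8.0,   # arbitrary placeholder
                          sa_threshold: float = 0.05):  # arbitrary placeholder
    """Hypothetical heuristic: flag tokens with low log-likelihood AND low
    final-layer self-attention, as candidates for abstention or review."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    ll = token_log_likelihoods(text)                   # (T-1,)
    sa_final = self_attention_strength(text)[-1, 1:]   # final layer, aligned with ll
    flagged = []
    for tok_id, lp, a in zip(ids[1:], ll, sa_final):
        if lp.item() < ll_threshold and a.item() < sa_threshold:
            flagged.append(tokenizer.decode(int(tok_id)))
    return flagged
```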
Key Findings
- Transformers modulate attention based on token predictability: Self-attention strength is negatively correlated with token likelihood in higher layers, indicating adaptive information aggregation.
- The effect emerges in deeper layers: Early layers show weak positive correlation, but the negative relationship strengthens progressively, reaching -0.5 in final layers.
- Pattern is consistent across model scales and architectures: The attention-likelihood relationship holds for GPT-2 (small, medium, large) and BERT, suggesting a fundamental aspect of transformer computation.
- Implications for robustness: Low-likelihood tokens (common in out-of-distribution data) rely more heavily on context, making them more vulnerable to context perturbations.
Citation
@article{ruscio2023attention,
title = {Attention-Likelihood Relationship in Transformers},
author = {Ruscio, Valeria and Maiorca, Valentino and Silvestri, Fabrizio},
journal = {ICLR Tiny Papers Track},
year = {2023}
}
Authors
Valeria Ruscio¹ · Valentino Maiorca¹ · Fabrizio Silvestri¹
¹Sapienza University of Rome