TL;DR
We accelerate transformer inference for translation by reformulating standard greedy autoregressive decoding as a parallel fixed-point iteration solved with Jacobi and Gauss-Seidel methods, achieving speedups of up to 38% (nearly 2× on parallel hardware) while preserving translation quality, with no model retraining or architectural changes.
Overview
Autoregressive transformer models generate text one token at a time, with each prediction depending on all previous tokens. This sequential dependency makes decoding slow, especially for long sequences. While parallel training is well-optimized, inference remains inherently sequential—or does it?
We show that greedy decoding can be reformulated as a fixed-point iteration problem, enabling parallel approximate solutions that converge to the sequential result.
Method
Reformulation as Fixed-Point Iteration
Standard autoregressive decoding: \(y_t = \arg\max_{y} p_\theta(y | y_{<t}, x)\)
We rewrite this as finding a fixed point of operator \(\mathcal{T}\): \(\mathbf{y} = \mathcal{T}(\mathbf{y})\)
where \(\mathcal{T}(\mathbf{y})_t = \arg\max_{y} p_\theta(y | \mathbf{y}_{-t}, x)\) and \(\mathbf{y}_{-t}\) denotes all tokens except position \(t\).
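To make the operator \(\mathcal{T}\) concrete, here is a minimal PyTorch sketch of a single evaluation, assuming a Hugging Face-style encoder-decoder translation model whose forward pass returns logits for every decoder position at once; `fixed_point_operator`, `src_ids`, `draft_ids`, and `bos_id` are illustrative names, not the paper's code. Because the decoder is causal, each position is effectively re-predicted from the draft tokens to its left.

```python
import torch

def fixed_point_operator(model, src_ids, draft_ids, bos_id):
    """One evaluation of T: re-predict every target position of the current
    draft translation in a single parallel forward pass."""
    # Shift the draft right and prepend BOS so that position t is predicted
    # from the draft tokens at positions < t (the decoder mask is causal).
    decoder_input = torch.cat(
        [torch.full_like(draft_ids[:, :1], bos_id), draft_ids[:, :-1]], dim=1
    )
    with torch.no_grad():
        logits = model(input_ids=src_ids, decoder_input_ids=decoder_input).logits
    return logits.argmax(dim=-1)  # greedy choice at every position, in parallel
```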
Jacobi Iteration (Fully Parallel)
Update all positions simultaneously using previous iteration: \(\mathbf{y}^{(k+1)} = \mathcal{T}(\mathbf{y}^{(k)})\)
Advantage: Fully parallelizable
Trade-off: May require more iterations to converge
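A Jacobi sweep is then just repeated application of \(\mathcal{T}\) to the whole draft until it stops changing. The sketch below reuses the `fixed_point_operator` helper above; `init_ids` (any initial guess, e.g. all pad tokens) and `max_iters` are assumed names.

```python
import torch

def jacobi_decode(model, src_ids, init_ids, bos_id, max_iters=10):
    """Jacobi iteration: y^(k+1) = T(y^(k)), updating all positions at once.
    Stops as soon as the draft is a fixed point of T."""
    y = init_ids
    for _ in range(max_iters):
        y_next = fixed_point_operator(model, src_ids, y, bos_id)
        if torch.equal(y_next, y):  # converged: y = T(y)
            break
        y = y_next
    return y
```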
Gauss-Seidel Iteration (Semi-Parallel)
Use updated values as soon as available: \(y^{(k+1)}_t = \mathcal{T}_t(y^{(k+1)}_{<t}, y^{(k)}_{>t})\)
Advantage: Faster convergence
Trade-off: Partially sequential (but still faster than full sequential)
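A fully sequential Gauss-Seidel sweep would update one position at a time, so a practical compromise is a blockwise hybrid: blocks are finalized left to right while positions inside a block are refined in parallel. The sketch below illustrates this idea under the same assumed interface and is not necessarily the authors' exact scheduling; `block_size` and `max_iters_per_block` are illustrative parameters.

```python
import torch

def gauss_seidel_decode(model, src_ids, init_ids, bos_id,
                        block_size=5, max_iters_per_block=10):
    """Blockwise Gauss-Seidel-style decoding: blocks are finalized left to
    right, so later positions always condition on already-updated tokens,
    while positions inside a block are still refined in parallel."""
    y = init_ids.clone()
    target_len = y.shape[1]
    for start in range(0, target_len, block_size):
        end = min(start + block_size, target_len)
        for _ in range(max_iters_per_block):
            y_next = fixed_point_operator(model, src_ids, y, bos_id)
            if torch.equal(y_next[:, start:end], y[:, start:end]):
                break  # this block has converged; move to the next one
            y[:, start:end] = y_next[:, start:end]  # earlier blocks stay fixed
    return y
```

Because each block conditions on already-finalized tokens to its left, fewer sweeps are typically needed than with pure Jacobi, at the cost of some sequential structure across blocks.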
Results
Translation Quality
BLEU scores on WMT benchmarks (En→De, En→Fr):
| Method | WMT14 En→De | WMT14 En→Fr | Parameters Changed |
|---|---|---|---|
| Autoregressive (baseline) | 27.3 | 41.4 | - |
| Jacobi (3 iter) | 27.2 | 41.2 | 0 |
| Gauss-Seidel (2 iter) | 27.3 | 41.4 | 0 |
Translation quality is essentially maintained (BLEU within 0.2 points of the baseline) with no model changes.
Speedup
Inference time (ms per sentence) on CPU and GPU:

| Method | CPU, single core (ms) | GPU, parallel (ms) | Speedup (vs. GPU autoregressive) |
|---|---|---|---|
| Autoregressive | 1,250 | 850 | 1.0× |
| Jacobi (3 iter) | 980 | 480 | 1.77× |
| Gauss-Seidel (2 iter) | 890 | 540 | 1.57× |
On GPU, Jacobi decoding achieves a 1.77× speedup and Gauss-Seidel a 1.57× speedup over the autoregressive baseline; both variants are also faster than sequential decoding on a single CPU core.
Convergence Analysis
Average number of iterations until convergence (99% of tokens stable):

| Sequence Length | Jacobi | Gauss-Seidel |
|---|---|---|
| 10-20 tokens | 2.1 | 1.4 |
| 20-40 tokens | 2.8 | 1.9 |
| 40-60 tokens | 3.4 | 2.3 |
Gauss-Seidel consistently requires ~30% fewer iterations than Jacobi.
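For reference, the "99% of tokens stable" criterion can be checked between consecutive iterates with a small helper (the name below is hypothetical):

```python
def fraction_stable(y_prev, y_next):
    """Fraction of positions whose token did not change between two
    consecutive iterates (tensors of equal shape)."""
    return (y_prev == y_next).float().mean().item()
```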
Key Findings
- Greedy decoding admits parallel solutions: Despite sequential dependencies, fixed-point iteration enables parallel approximate decoding that converges to the sequential result.
- No model changes needed: The method works with any pretrained transformer without retraining or architectural modifications; it is purely an inference-time algorithm.
- Significant speedups on modern hardware: GPUs and TPUs benefit most, with speedups approaching 2× for Jacobi iteration on highly parallel hardware.
- Gauss-Seidel offers a better iteration-speed trade-off: While Jacobi is more parallel, Gauss-Seidel's faster convergence often yields better wall-clock time.
- Quality-speed trade-off is tunable: By controlling the number of iterations, practitioners can choose their preferred point on the quality-speed Pareto frontier (see the usage sketch after this list).
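As a usage sketch of the last point, with the same illustrative names as in the Method section, capping the iteration budget trades exactness for latency:

```python
# Reusing the jacobi_decode sketch from the Method section (illustrative names):
fast_out = jacobi_decode(model, src_ids, init_ids, bos_id, max_iters=2)    # faster, approximate
exact_out = jacobi_decode(model, src_ids, init_ids, bos_id, max_iters=32)  # converges to the greedy output
```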
Citation
@inproceedings{santilli2023accelerating,
title = {Accelerating Transformer Inference for Translation via Parallel Decoding},
author = {Santilli, Andrea and Severino, Silvio and Postolache, Emilian and
Maiorca, Valentino and Mancusi, Michele and Marin, Riccardo and Rodolà, Emanuele},
booktitle = {Annual Meeting of the Association for Computational Linguistics},
year = {2023}
}
Authors
Andrea Santilli¹ · Silvio Severino¹ · Emilian Postolache¹ · Valentino Maiorca¹ · Michele Mancusi¹ · Riccardo Marin¹ · Emanuele Rodolà¹
¹Sapienza University of Rome