TL;DR
We accelerate transformer inference for translation by reformulating standard greedy autoregressive decoding as a parallel fixed-point iteration solved with Jacobi and Gauss-Seidel methods, achieving speedups of up to 38% (nearly 2× on parallel hardware) while preserving translation quality, with no model retraining or architectural changes.
Overview
Autoregressive transformer models generate text one token at a time, with each prediction depending on all previous tokens. This sequential dependency makes decoding slow, especially for long sequences. While parallel training is well-optimized, inference remains inherently sequential—or does it?
We show that greedy decoding can be reformulated as a fixed-point iteration problem, enabling parallel approximate solutions that converge to the sequential result.
Method
Reformulation as Fixed-Point Iteration
Standard autoregressive decoding: \(y_t = \arg\max_{y} p_\theta(y | y_{<t}, x)\)
We rewrite this as finding a fixed point of operator \(\mathcal{T}\): \(\mathbf{y} = \mathcal{T}(\mathbf{y})\)
where \(\mathcal{T}(\mathbf{y})_t = \arg\max_{y} p_\theta(y | \mathbf{y}_{-t}, x)\) and \(\mathbf{y}_{-t}\) denotes all tokens except position \(t\).
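To make the operator \(\mathcal{T}\) concrete, here is a minimal PyTorch sketch of a single evaluation, assuming a Hugging Face-style encoder-decoder translation model whose forward pass returns logits for every decoder position at once; `fixed_point_operator`, `src_ids`, `draft_ids`, and `bos_id` are illustrative names, not the paper's code. Because the decoder is causal, each position is effectively re-predicted from the draft tokens to its left.

```python
import torch

def fixed_point_operator(model, src_ids, draft_ids, bos_id):
    """One evaluation of T: re-predict every target position of the current
    draft translation in a single parallel forward pass."""
    # Shift the draft right and prepend BOS so that position t is predicted
    # from the draft tokens at positions < t (the decoder mask is causal).
    decoder_input = torch.cat(
        [torch.full_like(draft_ids[:, :1], bos_id), draft_ids[:, :-1]], dim=1
    )
    with torch.no_grad():
        logits = model(input_ids=src_ids, decoder_input_ids=decoder_input).logits
    return logits.argmax(dim=-1)  # greedy choice at every position, in parallel
```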
Jacobi Iteration (Fully Parallel)
Update all positions simultaneously using previous iteration: \(\mathbf{y}^{(k+1)} = \mathcal{T}(\mathbf{y}^{(k)})\)
Advantage: Fully parallelizable
Trade-off: May require more iterations to converge
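A Jacobi sweep is then just repeated application of \(\mathcal{T}\) to the whole draft until it stops changing. The sketch below reuses the `fixed_point_operator` helper above; `init_ids` (any initial guess, e.g. all pad tokens) and `max_iters` are assumed names.

```python
import torch

def jacobi_decode(model, src_ids, init_ids, bos_id, max_iters=10):
    """Jacobi iteration: y^(k+1) = T(y^(k)), updating all positions at once.
    Stops as soon as the draft is a fixed point of T."""
    y = init_ids
    for _ in range(max_iters):
        y_next = fixed_point_operator(model, src_ids, y, bos_id)
        if torch.equal(y_next, y):  # converged: y = T(y)
            break
        y = y_next
    return y
```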
Gauss-Seidel Iteration (Semi-Parallel)
Use updated values as soon as available: \(y^{(k+1)}_t = \mathcal{T}_t(y^{(k+1)}_{<t}, y^{(k)}_{>t})\)
Advantage: Faster convergence
Trade-off: Partially sequential (but still faster than full sequential)
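A fully sequential Gauss-Seidel sweep would update one position at a time, so a practical compromise is a blockwise hybrid: blocks are finalized left to right while positions inside a block are refined in parallel. The sketch below illustrates this idea under the same assumed interface and is not necessarily the authors' exact scheduling; `block_size` and `max_iters_per_block` are illustrative parameters.

```python
import torch

def gauss_seidel_decode(model, src_ids, init_ids, bos_id,
                        block_size=5, max_iters_per_block=10):
    """Blockwise Gauss-Seidel-style decoding: blocks are finalized left to
    right, so later positions always condition on already-updated tokens,
    while positions inside a block are still refined in parallel."""
    y = init_ids.clone()
    target_len = y.shape[1]
    for start in range(0, target_len, block_size):
        end = min(start + block_size, target_len)
        for _ in range(max_iters_per_block):
            y_next = fixed_point_operator(model, src_ids, y, bos_id)
            if torch.equal(y_next[:, start:end], y[:, start:end]):
                break  # this block has converged; move to the next one
            y[:, start:end] = y_next[:, start:end]  # earlier blocks stay fixed
    return y
```

Because each block conditions on already-finalized tokens to its left, fewer sweeps are typically needed than with pure Jacobi, at the cost of some sequential structure across blocks.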
Results
Translation Quality
BLEU scores on WMT benchmarks (En→De, En→Fr):
| Method | WMT14 En→De | WMT14 En→Fr | Parameters Changed |
|---|---|---|---|
| Autoregressive (baseline) | 27.3 | 41.4 | - |
| Jacobi (3 iter) | 27.2 | 41.2 | 0 |
| Gauss-Seidel (2 iter) | 27.3 | 41.4 | 0 |
Translation quality is essentially maintained (BLEU within 0.2 points of the baseline) with no model changes.
Speedup
Inference time (ms per sentence) on CPU and GPU:

| Method | CPU, single core (ms) | GPU, parallel (ms) | Speedup (vs. GPU autoregressive) |
|---|---|---|---|
| Autoregressive | 1,250 | 850 | 1.0× |
| Jacobi (3 iter) | 980 | 480 | 1.77× |
| Gauss-Seidel (2 iter) | 890 | 540 | 1.57× |
On GPU, Jacobi decoding achieves a 1.77× speedup and Gauss-Seidel a 1.57× speedup over the autoregressive baseline; both variants are also faster than sequential decoding on a single CPU core.
Convergence Analysis
Average number of iterations until convergence (99% of tokens stable):

| Sequence Length | Jacobi | Gauss-Seidel |
|---|---|---|
| 10-20 tokens | 2.1 | 1.4 |
| 20-40 tokens | 2.8 | 1.9 |
| 40-60 tokens | 3.4 | 2.3 |
Gauss-Seidel consistently requires ~30% fewer iterations than Jacobi.
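For reference, the "99% of tokens stable" criterion can be checked between consecutive iterates with a small helper (the name below is hypothetical):

```python
def fraction_stable(y_prev, y_next):
    """Fraction of positions whose token did not change between two
    consecutive iterates (tensors of equal shape)."""
    return (y_prev == y_next).float().mean().item()
```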
Key Findings
- Greedy decoding admits parallel solutions: Despite sequential dependencies, fixed-point iteration enables parallel approximate decoding that converges to the sequential result.
- No model changes needed: The method works with any pretrained transformer without retraining or architectural modifications; it is purely an inference-time algorithm.
- Significant speedups on modern hardware: GPUs and TPUs benefit most, with speedups approaching 2× for Jacobi iteration on highly parallel hardware.
- Gauss-Seidel offers a better iteration-speed trade-off: While Jacobi is more parallel, Gauss-Seidel's faster convergence often yields better wall-clock time.
- Quality-speed trade-off is tunable: By controlling the number of iterations, practitioners can choose their preferred point on the quality-speed Pareto frontier (see the usage sketch after this list).
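As a usage sketch of the last point, with the same illustrative names as in the Method section, capping the iteration budget trades exactness for latency:

```python
# Reusing the jacobi_decode sketch from the Method section (illustrative names):
fast_out = jacobi_decode(model, src_ids, init_ids, bos_id, max_iters=2)    # faster, approximate
exact_out = jacobi_decode(model, src_ids, init_ids, bos_id, max_iters=32)  # converges to the greedy output
```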
Citation
@inproceedings{santilli2023accelerating,
title = {Accelerating Transformer Inference for Translation via Parallel Decoding},
author = {Santilli, Andrea and Severino, Silvio and Postolache, Emilian and
Maiorca, Valentino and Mancusi, Michele and Marin, Riccardo and Rodolà, Emanuele},
booktitle = {Annual Meeting of the Association for Computational Linguistics},
year = {2023}
}
Authors
Andrea Santilli¹ · Silvio Severino¹ · Emilian Postolache¹ · Valentino Maiorca¹ · Michele Mancusi¹ · Riccardo Marin¹ · Emanuele Rodolà¹
¹Sapienza University of Rome