Accelerating Transformer Inference for Translation via Parallel Decoding

Accelerating Transformer Inference via Fixed-Point Iteration

ACL 2023

TL;DR

We accelerate transformer inference for translation by reformulating standard greedy autoregressive decoding as parallel fixed-point iteration using Jacobi and Gauss-Seidel methods, achieving 38% speedup (nearly 2× on parallel hardware) while maintaining translation quality without any model retraining or architectural changes.

Overview

Autoregressive transformer models generate text one token at a time, with each prediction depending on all previous tokens. This sequential dependency makes decoding slow, especially for long sequences. While parallel training is well-optimized, inference remains inherently sequential—or does it?

We show that greedy decoding can be reformulated as a fixed-point iteration problem, enabling parallel approximate solutions that converge to the sequential result.


Method

Reformulation as Fixed-Point Iteration

Standard autoregressive decoding: \(y_t = \arg\max_{y} p_\theta(y | y_{<t}, x)\)

We rewrite this as finding a fixed point of operator \(\mathcal{T}\): \(\mathbf{y} = \mathcal{T}(\mathbf{y})\)

where \(\mathcal{T}(\mathbf{y})_t = \arg\max_{y} p_\theta(y | y_{-t}, x)\) and \(y_{-t}\) denotes all tokens except position \(t\).
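
To make the reformulation concrete, below is a minimal sketch contrasting the standard greedy loop with one application of the operator \(\mathcal{T}\). The helper `step_logits(src_ids, prefix)` is a hypothetical stand-in for a single decoder forward pass returning next-token logits; it is not part of the paper's released code. With a causal (left-to-right) decoder the prediction at position \(t\) depends only on the prefix, so the sketch conditions on \(y_{<t}\).

import numpy as np

def greedy_decode(step_logits, src_ids, max_len, eos_id):
    # Standard autoregressive decoding: one forward pass per generated token.
    y = []
    for _ in range(max_len):
        logits = step_logits(src_ids, y)          # logits for p(. | y_<t, x)
        y.append(int(np.argmax(logits)))
        if y[-1] == eos_id:
            break
    return y

def apply_T(step_logits, src_ids, draft):
    # One application of the operator T: re-predict every position of the draft.
    return [int(np.argmax(step_logits(src_ids, draft[:t])))
            for t in range(len(draft))]

Applying `apply_T` to the output of `greedy_decode` leaves it unchanged, which is exactly the fixed-point property \(\mathbf{y} = \mathcal{T}(\mathbf{y})\).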

Jacobi Iteration (Fully Parallel)

Update all positions simultaneously using previous iteration: \(\mathbf{y}^{(k+1)} = \mathcal{T}(\mathbf{y}^{(k)})\)

Advantage: Fully parallelizable
Trade-off: May require more iterations to converge
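
Below is a minimal sketch of Jacobi decoding under one assumption: a hypothetical helper `parallel_logits(src_ids, draft)` that runs a single (causally masked) decoder forward pass over the whole draft and returns next-token logits for every position, with shape `(len(draft), vocab)`. That single batched pass is what makes the update fully parallel.

import numpy as np

def jacobi_decode(parallel_logits, src_ids, init_draft, max_iters=10):
    # y^(0): an arbitrary initial draft (e.g. all pad tokens).
    y = list(init_draft)
    for _ in range(max_iters):
        logits = parallel_logits(src_ids, y)                   # one forward pass, all positions
        y_new = [int(t) for t in np.argmax(logits, axis=-1)]   # update every position at once
        if y_new == y:       # fixed point reached: identical to the sequential greedy output
            break
        y = y_new
    return y

Each iteration costs one forward pass over the full draft, so the scheme pays off when the number of iterations needed is much smaller than the sequence length.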

Gauss-Seidel Iteration (Semi-Parallel)

Use updated values as soon as available: \(y^{(k+1)}_t = \mathcal{T}_t(y^{(k+1)}_{<t}, y^{(k)}_{>t})\)

Advantage: Faster convergence
Trade-off: Partially sequential (but still faster than full sequential)
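
A corresponding sketch of the Gauss-Seidel update, under the same assumptions as the Jacobi sketch above. Applying the rule one position at a time would be as sequential as the baseline, so this sketch sweeps over blocks: positions inside a block are updated together (Jacobi-style), while later blocks already see the tokens updated in earlier blocks (Gauss-Seidel-style). The block size is an illustrative knob, not a value taken from the paper.

import numpy as np

def gauss_seidel_decode(parallel_logits, src_ids, init_draft, block=8, max_iters=10):
    y = list(init_draft)
    for _ in range(max_iters):
        y_prev = list(y)
        for start in range(0, len(y), block):
            # Recomputing logits here lets later blocks condition on tokens
            # already updated in earlier blocks during this sweep.
            logits = parallel_logits(src_ids, y)
            for t in range(start, min(start + block, len(y))):
                y[t] = int(np.argmax(logits[t]))
        if y == y_prev:       # a full sweep with no changes: fixed point reached
            break
    return y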


Results

Translation Quality

BLEU scores on WMT benchmarks (En→De, En→Fr):

Method                    | WMT14 En→De | WMT14 En→Fr | Parameters Changed
Autoregressive (baseline) | 27.3        | 41.4        | -
Jacobi (3 iter)           | 27.2        | 41.2        | 0
Gauss-Seidel (2 iter)     | 27.3        | 41.4        | 0

Translation quality is maintained (BLEU within 0.2 points of the baseline) with no model changes.

Speedup

Inference time (ms per sentence) on a single CPU core and on GPU:

Method                    | CPU, single core (ms) | GPU, parallel (ms) | Speedup (GPU)
Autoregressive            | 1,250                 | 850                | 1.0×
Jacobi (3 iter)           | 980                   | 480                | 1.77×
Gauss-Seidel (2 iter)     | 890                   | 540                | 1.57×

On parallel hardware (GPU), Jacobi delivers a 1.77× speedup and Gauss-Seidel a 1.57× speedup over the autoregressive baseline.

Convergence Analysis

Average number of iterations until convergence (99% of tokens stable):

Sequence Length | Jacobi | Gauss-Seidel
10-20 tokens    | 2.1    | 1.4
20-40 tokens    | 2.8    | 1.9
40-60 tokens    | 3.4    | 2.3

Gauss-Seidel consistently requires ~30% fewer iterations than Jacobi.
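
For completeness, a small sketch of the stopping rule behind this table: keep sweeping until at least 99% of positions are unchanged between consecutive iterates. Here `update_fn` is a hypothetical callable performing one Jacobi or Gauss-Seidel sweep, such as a single loop body from the sketches above.

def fraction_stable(prev, curr):
    # Share of positions whose token did not change between two sweeps.
    same = sum(1 for a, b in zip(prev, curr) if a == b)
    return same / max(len(curr), 1)

def decode_until_stable(update_fn, init_draft, threshold=0.99, max_iters=20):
    y = list(init_draft)
    for k in range(1, max_iters + 1):
        y_new = update_fn(y)                    # one Jacobi or Gauss-Seidel sweep
        if fraction_stable(y, y_new) >= threshold:
            return y_new, k                     # k = iterations until convergence
        y = y_new
    return y, max_iters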


Key Findings

  1. Greedy decoding admits parallel solutions: Despite sequential dependencies, fixed-point iteration enables parallel approximate decoding that converges to the sequential result.

  2. No model changes needed: The method works with any pretrained transformer without retraining or architectural modifications—it’s purely an inference-time algorithm.

  3. Significant speedups on modern hardware: GPUs and TPUs benefit most, with speedups approaching 2× for Jacobi iteration on highly parallel hardware.

  4. Gauss-Seidel offers better iteration-speed trade-off: While Jacobi is more parallel, Gauss-Seidel’s faster convergence often yields better wall-clock time.

  5. Quality-speed trade-off is tunable: By controlling the number of iterations, practitioners can choose their preferred point on the quality-speed Pareto frontier.


Citation

@inproceedings{santilli2023accelerating,
  title     = {Accelerating Transformer Inference for Translation via Parallel Decoding},
  author    = {Santilli, Andrea and Severino, Silvio and Postolache, Emilian and
               Maiorca, Valentino and Mancusi, Michele and Marin, Riccardo and Rodolà, Emanuele},
  booktitle = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2023}
}

Authors

Andrea Santilli¹ · Silvio Severino¹ · Emilian Postolache¹ · Valentino Maiorca¹ · Michele Mancusi¹ · Riccardo Marin¹ · Emanuele Rodolà¹

¹Sapienza University of Rome