Zero-Shot Policy Stitching via Semantic Alignment

Mapping representations in Reinforcement Learning for modular policy reuse

Overview

Training deep Reinforcement Learning (RL) agents is computationally expensive, often requiring millions of environment interactions. When visual appearance, task dynamics, or action spaces change—even slightly—most methods require costly retraining from scratch or extensive fine-tuning.

SAPS (Semantic Alignment for Policy Stitching) tackles this problem by enabling zero-shot modular policy reuse. Given independently trained RL agents, SAPS learns a simple affine transformation to map between their latent representations, allowing encoders and controllers from different agents to be “stitched” together without any additional training.

The key insight: independently trained RL encoders learn representations that differ primarily by geometric transformations (rotations and translations) rather than in semantic structure. By estimating these transformations from a small set of semantically aligned “anchor” observations, we can seamlessly compose policies across visual and task variations.

Figure: left, the green-background environment on which the source encoder is trained; right, the yellow-background environment where the zero-shot stitched policy is deployed. Using SAPS, a controller trained on an environment with a given visual variation (left) can be reused, without any training or fine-tuning, on a different environment (right) with comparable performance. The red trajectory shows the same controller driving successfully when connected to a different encoder via semantic alignment.

Method: Semantic Alignment for Policy Stitching

The Stitching Problem

Consider two independently trained RL agents:

  • Agent u: Trained on environment variation u (e.g., green background, standard speed)
  • Agent v: Trained on environment variation v (e.g., red background, slow speed)

Each agent consists of:

  • Encoder \(\phi\): Maps observations to latent representations
  • Controller \(\psi\): Maps latent representations to actions

Goal: Create a new agent that combines encoder \(\phi_u\) with controller \(\psi_v\) to operate in a new environment variation (u’s visual appearance with v’s task dynamics) without any training.
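A minimal sketch of this decomposition, assuming a PyTorch-style agent whose encoder and controller are separate frozen modules (the concrete architectures are placeholders, not the ones used in the paper):

```python
import torch.nn as nn

class Agent(nn.Module):
    """An RL agent split into an encoder (phi) and a controller (psi)."""
    def __init__(self, encoder: nn.Module, controller: nn.Module):
        super().__init__()
        self.encoder = encoder        # phi: observation -> latent vector
        self.controller = controller  # psi: latent vector -> action

    def forward(self, obs):
        return self.controller(self.encoder(obs))
```

Stitching swaps in an encoder and a controller taken from two different agents, with an alignment step in between, as described next.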

Affine Transformation via SVD

SAPS estimates an affine transformation \(\tau_u^v: \mathcal{X}_u \rightarrow \mathcal{X}_v\) that maps latent representations from encoder u to be compatible with controller v:

\[\tau_u^v(\mathbf{x}_u) = \mathbf{R} \mathbf{x}_u + \mathbf{b}\]

where:

  • \(\mathbf{R}\) is an orthogonal rotation matrix obtained via Singular Value Decomposition (SVD)
  • \(\mathbf{b}\) is a translation vector
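Both quantities have a closed form via the standard orthogonal Procrustes solution. As a sketch, assuming the anchor embeddings introduced below are stacked row-wise into matrices \(\mathbf{X}_u, \mathbf{X}_v\) with means \(\boldsymbol{\mu}_u, \boldsymbol{\mu}_v\) and centered versions \(\bar{\mathbf{X}}_u, \bar{\mathbf{X}}_v\):

\[
\bar{\mathbf{X}}_v^\top \bar{\mathbf{X}}_u = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top
\quad\Rightarrow\quad
\mathbf{R} = \mathbf{U}\mathbf{V}^\top,
\qquad
\mathbf{b} = \boldsymbol{\mu}_v - \mathbf{R}\boldsymbol{\mu}_u
\]

i.e., the SVD of the cross-covariance between the centered anchor sets yields the rotation, and the translation aligns the means.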

The stitched policy is then:

\[\tilde{\pi}_u^v(\mathbf{o}_u) = \psi_v\big(\tau_u^v(\phi_u(\mathbf{o}_u))\big)\]
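A minimal sketch of the stitched forward pass, assuming `encoder_u` and `controller_v` are the frozen sub-networks of the two agents and `(R, b)` has already been estimated from anchors (see below):

```python
def stitched_policy(obs_u, encoder_u, controller_v, R, b):
    """Zero-shot stitched policy: encoder of agent u + controller of agent v."""
    x_u = encoder_u(obs_u)       # latent representation in agent u's space
    x_v = R @ x_u + b            # semantic alignment tau_u^v into agent v's space
    return controller_v(x_v)     # action chosen by agent v's controller
```

No parameter of either network is updated; only `R` and `b` are estimated.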

Anchor Collection

The transformation is estimated from pairs of semantically aligned observations called “anchors”. Two approaches are used:

  1. Pixel-space transformation: When visual variation is well-defined (e.g., color changes), directly transform observations in pixel space
  2. Action replay: Replay the same action sequence in both environments (requires deterministic environments initialized with the same seed)

From anchor pairs \((\mathbf{a}_u, \mathbf{a}_v)\), we obtain embeddings \((\mathbf{x}_u, \mathbf{x}_v)\) and estimate \(\tau\) via SVD to find the optimal orthogonal alignment.
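A minimal NumPy sketch of this estimation step, assuming `X_u` and `X_v` are arrays of shape `(n_anchors, d)` holding the embeddings of the same anchors under the two encoders:

```python
import numpy as np

def estimate_alignment(X_u, X_v):
    """Fit tau(x) = R x + b from paired anchor embeddings (orthogonal Procrustes)."""
    mu_u, mu_v = X_u.mean(axis=0), X_v.mean(axis=0)
    Xu_c, Xv_c = X_u - mu_u, X_v - mu_v          # center both anchor sets

    # SVD of the cross-covariance gives the optimal orthogonal map.
    U, _, Vt = np.linalg.svd(Xv_c.T @ Xu_c)
    R = U @ Vt                                   # orthogonal rotation
    b = mu_v - R @ mu_u                          # translation aligning the means
    return R, b
```

Because the transformation is a single rotation plus translation, it can be estimated from a relatively small set of anchor pairs.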


Experiments: CarRacing

We evaluate SAPS on CarRacing with multiple visual variations (green, red, blue backgrounds, far camera) and task variations (standard, slow speed, scrambled actions, no idle).

Zero-Shot Stitching Performance

Encoder   Controller   Naive     R3L       SAPS (ours)   Improvement (vs R3L)
Green     Green        175±304   781±108   822±62        +5%
Green     Slow         148±328   268±14    764±287       +185%
Red       Red          43±205    776±92    807±52        +4%
Blue      Blue         11±122    792±48    814±52        +3%
Far       Green        152±204   527±142   714±45        +35%

Key observations:

  • Matches or exceeds R3L: SAPS achieves R3L-level performance without specialized training
  • Excels on challenging cases: Slow task (764 vs 268 for R3L, +185%) and far camera (714 vs 527 for R3L, +35%)
  • Massive improvement over the naive baseline: 4-74x better than stitching without any alignment
  • Near end-to-end quality: Performance approaches end-to-end trained models (800-850 range)

Experiments: LunarLander

LunarLander tests SAPS on a precision control task with visual variations (white vs red background) and physics variations (gravity -10 vs -3).

Results: Success and Limitations

Encoder         Controller       Naive      SAPS (ours)   End-to-end
White (g=-10)   White (g=-10)    -413±72    19±56         221±86
White (g=-10)   Red (g=-10)      -390±176   8±60          -
White (g=-10)   White (g=-3)     -276±8     -242±51       -
Red (g=-10)     White (g=-10)    -444±116   52±44         192±30

Key findings:

  • Same-gravity stitching: SAPS achieves positive scores (8-52), a massive improvement over naive stitching (-444 to -390)
  • Cross-gravity stitching: SAPS fails (-242), highlighting the limitations of affine transformations for physics variations
  • Interpretation: The precision landing task is highly sensitive to latent space misalignment; small errors cascade into crashes

The LunarLander results reveal an important principle: affine transformations work well for visual variations and robust control tasks, but struggle with precision control and physics variations.


Latent Space Analysis

PCA Visualization

We visualize how SAPS aligns latent spaces by projecting encoder outputs to 3D via PCA.

Left: After applying SAPS alignment, embeddings from the green and red background variations become thoroughly intermixed in the shared latent space; the same track positions with different colors map to similar vectors (see the frame screenshots embedded in the figure). Right: Without alignment, green and red embeddings remain completely separated.
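A minimal sketch of this kind of plot, assuming `X_u`, `X_v` hold matched embeddings from the two variations and `(R, b)` is the estimated alignment (skip the alignment step to reproduce the unaligned view):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_alignment(X_u, X_v, R, b):
    X_u_aligned = X_u @ R.T + b                    # apply tau_u^v row-wise
    pts = PCA(n_components=3).fit_transform(np.vstack([X_u_aligned, X_v]))

    ax = plt.figure().add_subplot(projection="3d")
    n = len(X_u)
    ax.scatter(*pts[:n].T, s=5, label="encoder u (aligned)")
    ax.scatter(*pts[n:].T, s=5, label="encoder v")
    ax.legend()
    plt.show()
```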

Cosine Similarity Analysis

We measure pairwise cosine similarity between matched frames from different environment variations:

CarRacing: Histogram of pairwise cosine similarities between matched states. SAPS (left) achieves near-perfect alignment (mean similarity 0.92-0.99), matching R3L (middle) without requiring specialized training. Naive stitching (right) shows very low similarity (0.23-0.30), confirming incompatibility.
LunarLander: Both SAPS and R3L achieve high cosine similarity (0.92-0.99), yet performance gaps remain. This confirms that high alignment quality is necessary but not sufficient for precision control tasks.
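A minimal sketch of this measurement, assuming the rows of `X_u` and `X_v` are embeddings of matched frames (row i of each corresponds to the same underlying state) and `(R, b)` is the estimated alignment; passing `R=None` gives the naive, unaligned baseline:

```python
import numpy as np

def matched_cosine_similarity(X_u, X_v, R=None, b=None):
    """Cosine similarity between matched embeddings, optionally after alignment."""
    if R is not None:
        X_u = X_u @ R.T + b                       # apply tau_u^v row-wise
    X_u = X_u / np.linalg.norm(X_u, axis=1, keepdims=True)
    X_v = X_v / np.linalg.norm(X_v, axis=1, keepdims=True)
    return np.sum(X_u * X_v, axis=1)              # one similarity per matched pair
```

Histograms of these per-pair similarities are what the figures above summarize.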

Quantitative results:

  • SAPS/R3L: Mean cosine similarity 0.92-0.99 across both environments
  • Naive: Mean cosine similarity 0.23-0.30
  • Interpretation: Independently trained models exhibit near-identical encodings for semantically identical frames once aligned

Key Findings

  1. Zero-shot stitching works for robust tasks: SAPS enables modular policy composition across visual and task variations in CarRacing, achieving performance comparable to end-to-end training without any fine-tuning.

  2. Near-perfect latent alignment: Cosine similarity analysis (0.92-0.99) confirms that independently trained RL encoders learn quasi-isometric representations differing primarily by geometric transformations.

  3. Practical advantage over R3L: SAPS achieves R3L-level alignment and performance using standard pre-trained models, without requiring specialized relative representation training.

  4. Precision control reveals limits: LunarLander results show that affine transformations have boundaries—they handle visual variations well but struggle with physics variations and precision-sensitive tasks.

  5. Task robustness as moderator: The gap between alignment quality and task performance highlights that robust control tasks (navigation) tolerate small alignment errors, while precision tasks (landing) amplify them.

  6. Enables compositional RL: By treating encoders and controllers as modular components connected via learned transformations, SAPS opens the door to compositional RL where policies can be assembled like LEGO blocks.


Citation

@article{ricciardi2025zeroshot,
  title   = {Mapping representations in Reinforcement Learning via Semantic Alignment for Zero-Shot Stitching},
  author  = {Antonio Pio Ricciardi and Valentino Maiorca and Luca Moschella and
             Riccardo Marin and Emanuele Rodol{\`a}},
  journal = {arXiv preprint arXiv:2503.01881},
  year    = {2025}
}

Authors

Antonio Pio Ricciardi¹ · Valentino Maiorca¹ · Luca Moschella¹ · Riccardo Marin² · Emanuele Rodolà¹

¹Sapienza University of Rome, Italy · ²University of Tübingen, Germany