Zero-Shot Policy Stitching via Semantic Alignment

Mapping representations in Reinforcement Learning for modular policy reuse

Overview

Training deep Reinforcement Learning (RL) agents is computationally expensive, often requiring millions of environment interactions. When visual appearance, task dynamics, or action spaces change—even slightly—most methods require costly retraining from scratch or extensive fine-tuning.

SAPS (Semantic Alignment for Policy Stitching) tackles this problem by enabling zero-shot modular policy reuse. Given independently trained RL agents, SAPS learns a simple affine transformation to map between their latent representations, allowing encoders and controllers from different agents to be “stitched” together without any additional training.

The key insight: independently trained RL encoders learn representations that differ primarily by geometric transformations (rotations and translations) rather than in semantic structure. By estimating these transformations from a small set of semantically aligned “anchor” observations, we can seamlessly compose policies across visual and task variations.

Figure: left, the green-background environment on which the source encoder is trained; right, the yellow-background environment where the zero-shot stitched policy is deployed. Using SAPS, a controller trained on an environment with a given visual variation (left) can be reused, without any training or fine-tuning, on a different environment (right) with comparable performance. The red trajectory shows the same controller driving successfully when connected to a different encoder via semantic alignment.

Method: Semantic Alignment for Policy Stitching

The Stitching Problem

Consider two independently trained RL agents:

  • Agent u: Trained on environment variation u (e.g., green background, standard speed)
  • Agent v: Trained on environment variation v (e.g., red background, slow speed)

Each agent consists of:

  • Encoder \(\phi\): Maps observations to latent representations
  • Controller \(\psi\): Maps latent representations to actions

Goal: Create a new agent that combines encoder \(\phi_u\) with controller \(\psi_v\) to operate in a new environment variation (u’s visual appearance with v’s task dynamics) without any training.
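A minimal sketch of this decomposition, assuming a PyTorch-style agent whose encoder and controller are separate frozen modules (the concrete architectures are placeholders, not the ones used in the paper):

```python
import torch.nn as nn

class Agent(nn.Module):
    """An RL agent split into an encoder (phi) and a controller (psi)."""
    def __init__(self, encoder: nn.Module, controller: nn.Module):
        super().__init__()
        self.encoder = encoder        # phi: observation -> latent vector
        self.controller = controller  # psi: latent vector -> action

    def forward(self, obs):
        return self.controller(self.encoder(obs))
```

Stitching swaps in an encoder and a controller taken from two different agents, with an alignment step in between, as described next.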

Affine Transformation via SVD

SAPS estimates an affine transformation \(\tau_u^v: \mathcal{X}_u \rightarrow \mathcal{X}_v\) that maps latent representations from encoder u to be compatible with controller v:

\[\tau_u^v(\mathbf{x}_u) = \mathbf{R} \mathbf{x}_u + \mathbf{b}\]

where:

  • \(\mathbf{R}\) is an orthogonal rotation matrix obtained via Singular Value Decomposition (SVD)
  • \(\mathbf{b}\) is a translation vector
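Both quantities have a closed form via the standard orthogonal Procrustes solution. As a sketch, assuming the anchor embeddings introduced below are stacked row-wise into matrices \(\mathbf{X}_u, \mathbf{X}_v\) with means \(\boldsymbol{\mu}_u, \boldsymbol{\mu}_v\) and centered versions \(\bar{\mathbf{X}}_u, \bar{\mathbf{X}}_v\):

\[
\bar{\mathbf{X}}_v^\top \bar{\mathbf{X}}_u = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top
\quad\Rightarrow\quad
\mathbf{R} = \mathbf{U}\mathbf{V}^\top,
\qquad
\mathbf{b} = \boldsymbol{\mu}_v - \mathbf{R}\boldsymbol{\mu}_u
\]

i.e., the SVD of the cross-covariance between the centered anchor sets yields the rotation, and the translation aligns the means.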

The stitched policy is then:

\[\tilde{\pi}_u^v(\mathbf{o}_u) = \psi_v\big(\tau_u^v(\phi_u(\mathbf{o}_u))\big)\]
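A minimal sketch of the stitched forward pass, assuming `encoder_u` and `controller_v` are the frozen sub-networks of the two agents and `(R, b)` has already been estimated from anchors (see below):

```python
def stitched_policy(obs_u, encoder_u, controller_v, R, b):
    """Zero-shot stitched policy: encoder of agent u + controller of agent v."""
    x_u = encoder_u(obs_u)       # latent representation in agent u's space
    x_v = R @ x_u + b            # semantic alignment tau_u^v into agent v's space
    return controller_v(x_v)     # action chosen by agent v's controller
```

No parameter of either network is updated; only `R` and `b` are estimated.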

Anchor Collection

The transformation is estimated from pairs of semantically aligned observations called “anchors”. Two approaches are used:

  1. Pixel-space transformation: When visual variation is well-defined (e.g., color changes), directly transform observations in pixel space
  2. Action replay: Replay the same action sequence in both environments (requires deterministic environments initialized with the same seed)

From anchor pairs \((\mathbf{a}_u, \mathbf{a}_v)\), we obtain embeddings \((\mathbf{x}_u, \mathbf{x}_v)\) and estimate \(\tau\) via SVD to find the optimal orthogonal alignment.
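A minimal NumPy sketch of this estimation step, assuming `X_u` and `X_v` are arrays of shape `(n_anchors, d)` holding the embeddings of the same anchors under the two encoders:

```python
import numpy as np

def estimate_alignment(X_u, X_v):
    """Fit tau(x) = R x + b from paired anchor embeddings (orthogonal Procrustes)."""
    mu_u, mu_v = X_u.mean(axis=0), X_v.mean(axis=0)
    Xu_c, Xv_c = X_u - mu_u, X_v - mu_v          # center both anchor sets

    # SVD of the cross-covariance gives the optimal orthogonal map.
    U, _, Vt = np.linalg.svd(Xv_c.T @ Xu_c)
    R = U @ Vt                                   # orthogonal rotation
    b = mu_v - R @ mu_u                          # translation aligning the means
    return R, b
```

Because the transformation is a single rotation plus translation, it can be estimated from a relatively small set of anchor pairs.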


Experiments: CarRacing

We evaluate SAPS on CarRacing with multiple visual variations (green, red, blue backgrounds, far camera) and task variations (standard, slow speed, scrambled actions, no idle).

Zero-Shot Stitching Performance

Encoder   Controller   Naive     R3L       SAPS (ours)   Improvement (vs R3L)
Green     Green        175±304   781±108   822±62        +5%
Green     Slow         148±328   268±14    764±287       +185%
Red       Red          43±205    776±92    807±52        +4%
Blue      Blue         11±122    792±48    814±52        +3%
Far       Green        152±204   527±142   714±45        +35%

Key observations:

  • Matches or exceeds R3L: SAPS achieves R3L-level performance without specialized training
  • Excels on challenging cases: Slow task (764 vs 268 for R3L, +185%) and far camera (714 vs 527 for R3L, +35%)
  • Massive improvement over the naive baseline: 4-74x better than stitching without any alignment
  • Near end-to-end quality: Performance approaches end-to-end trained models (800-850 range)

Experiments: LunarLander

LunarLander tests SAPS on a precision control task with visual variations (white vs red background) and physics variations (gravity -10 vs -3).

Results: Success and Limitations

Encoder         Controller       Naive      SAPS (ours)   End-to-end
White (g=-10)   White (g=-10)    -413±72    19±56         221±86
White (g=-10)   Red (g=-10)      -390±176   8±60          -
White (g=-10)   White (g=-3)     -276±8     -242±51       -
Red (g=-10)     White (g=-10)    -444±116   52±44         192±30

Key findings:

  • Same-gravity stitching: SAPS achieves positive scores (8-52), a massive improvement over naive stitching (-444 to -390)
  • Cross-gravity stitching: SAPS fails (-242), highlighting the limitations of affine transformations for physics variations
  • Interpretation: The precision landing task is highly sensitive to latent space misalignment; small errors cascade into crashes

The LunarLander results reveal an important principle: affine transformations work well for visual variations and robust control tasks, but struggle with precision control and physics variations.


Latent Space Analysis

PCA Visualization

We visualize how SAPS aligns latent spaces by projecting encoder outputs to 3D via PCA.

Left: After applying SAPS alignment, embeddings from the green and red background variations become thoroughly intermixed in the shared latent space; the same track positions with different colors map to similar vectors (see the frame screenshots embedded in the figure). Right: Without alignment, green and red embeddings remain completely separated.
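A minimal sketch of this kind of plot, assuming `X_u`, `X_v` hold matched embeddings from the two variations and `(R, b)` is the estimated alignment (skip the alignment step to reproduce the unaligned view):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_alignment(X_u, X_v, R, b):
    X_u_aligned = X_u @ R.T + b                    # apply tau_u^v row-wise
    pts = PCA(n_components=3).fit_transform(np.vstack([X_u_aligned, X_v]))

    ax = plt.figure().add_subplot(projection="3d")
    n = len(X_u)
    ax.scatter(*pts[:n].T, s=5, label="encoder u (aligned)")
    ax.scatter(*pts[n:].T, s=5, label="encoder v")
    ax.legend()
    plt.show()
```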

Cosine Similarity Analysis

We measure pairwise cosine similarity between matched frames from different environment variations:

CarRacing: Histogram of pairwise cosine similarities between matched states. SAPS (left) achieves near-perfect alignment (mean similarity 0.92-0.99), matching R3L (middle) without requiring specialized training. Naive stitching (right) shows very low similarity (0.23-0.30), confirming incompatibility.
LunarLander: Both SAPS and R3L achieve high cosine similarity (0.92-0.99), yet performance gaps remain. This confirms that high alignment quality is necessary but not sufficient for precision control tasks.
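A minimal sketch of this measurement, assuming the rows of `X_u` and `X_v` are embeddings of matched frames (row i of each corresponds to the same underlying state) and `(R, b)` is the estimated alignment; passing `R=None` gives the naive, unaligned baseline:

```python
import numpy as np

def matched_cosine_similarity(X_u, X_v, R=None, b=None):
    """Cosine similarity between matched embeddings, optionally after alignment."""
    if R is not None:
        X_u = X_u @ R.T + b                       # apply tau_u^v row-wise
    X_u = X_u / np.linalg.norm(X_u, axis=1, keepdims=True)
    X_v = X_v / np.linalg.norm(X_v, axis=1, keepdims=True)
    return np.sum(X_u * X_v, axis=1)              # one similarity per matched pair
```

Histograms of these per-pair similarities are what the figures above summarize.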

Quantitative results:

  • SAPS/R3L: Mean cosine similarity 0.92-0.99 across both environments
  • Naive: Mean cosine similarity 0.23-0.30
  • Interpretation: Independently trained models exhibit near-identical encodings for semantically identical frames once aligned

Key Findings

  1. Zero-shot stitching works for robust tasks: SAPS enables modular policy composition across visual and task variations in CarRacing, achieving performance comparable to end-to-end training without any fine-tuning.

  2. Near-perfect latent alignment: Cosine similarity analysis (0.92-0.99) confirms that independently trained RL encoders learn quasi-isometric representations differing primarily by geometric transformations.

  3. Practical advantage over R3L: SAPS achieves R3L-level alignment and performance using standard pre-trained models, without requiring specialized relative representation training.

  4. Precision control reveals limits: LunarLander results show that affine transformations have boundaries—they handle visual variations well but struggle with physics variations and precision-sensitive tasks.

  5. Task robustness as moderator: The gap between alignment quality and task performance highlights that robust control tasks (navigation) tolerate small alignment errors, while precision tasks (landing) amplify them.

  6. Enables compositional RL: By treating encoders and controllers as modular components connected via learned transformations, SAPS opens the door to compositional RL where policies can be assembled like LEGO blocks.


Citation

@article{ricciardi2025zeroshot,
  title   = {Mapping representations in Reinforcement Learning via Semantic Alignment for Zero-Shot Stitching},
  author  = {Antonio Pio Ricciardi and Valentino Maiorca and Luca Moschella and
             Riccardo Marin and Emanuele Rodol{\`a}},
  journal = {arXiv preprint arXiv:2503.01881},
  year    = {2025}
}

Authors

Antonio Pio Ricciardi¹ · Valentino Maiorca¹ · Luca Moschella¹ · Riccardo Marin² · Emanuele Rodolà¹

¹Sapienza University of Rome, Italy · ²University of Tübingen, Germany