R3L: Relative Representations for Reinforcement Learning

Enabling zero-shot composition of RL agents through coordinate-free latent spaces

ICML 2025

Overview

Reinforcement learning agents typically fail when deployed in environments with even minor perceptual variations from their training conditions. An agent trained to drive on green grass will fail on red grass, despite the task being identical. This brittleness stems from a fundamental limitation: encoders trained in different visual contexts produce incompatible latent representations, making it impossible to reuse learned components.

R3L (Relative Representations for Reinforcement Learning) solves this by transforming absolute encoder outputs into a coordinate-free space where representations are defined relative to anchor points rather than in arbitrary coordinate systems. This enables unprecedented compositional generalization: encoders and controllers trained independently across different visual variations and task objectives can be stitched together zero-shot, without any fine-tuning.

The result? Train encoders for N visual variations and controllers for M task variations separately (N+M models), then compose them to obtain N×M agents, achieving a 75% reduction in training time while maintaining performance comparable to end-to-end training.

Latent space visualization comparing standard (absolute) and R3L (relative) encoder outputs for two visual variations (green/red grass). R3L transforms incompatible coordinate systems into an aligned, coordinate-free space.

Method: Coordinate-Free Representations for RL

The Problem: Incompatible Latent Spaces

When two RL agents are trained independently on perceptually different environments (e.g., green vs red background), their encoders learn semantically similar but coordinate-wise incompatible representations. Even when observing aligned frames showing the same semantic content, the encoder outputs occupy completely separate regions of the latent space.

This is analogous to two cartographers mapping the same territory using different projections—the maps describe the same reality but cannot be overlaid due to arbitrary coordinate choices.

The Solution: Relative Representations

R3L adapts the relative representations framework from representation learning to the RL setting. Instead of using raw encoder embeddings \(\phi(s)\), we express them relative to a set of anchor points \(\mathcal{A} = \{A_1, \ldots, A_K\}\):

\[\psi(s) = \left[ d(\phi(s), \phi(A_1)), \ldots, d(\phi(s), \phi(A_K)) \right]\]

where \(d(\cdot, \cdot)\) is a distance metric (we use cosine distance). This coordinate-free representation:

  1. Removes arbitrary rotations/translations: Only relative distances matter, not absolute coordinates
  2. Aligns latent spaces: Encoders trained independently produce comparable relative representations
  3. Enables composition: Controllers can operate on any encoder’s relative space
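
As a concrete illustration, here is a minimal sketch of the relative encoding step in PyTorch. The function and variable names (relative_representation, phi_s, phi_anchors) are ours for illustration and not taken from the official implementation:

import torch
import torch.nn.functional as F

def relative_representation(phi_s: torch.Tensor, phi_anchors: torch.Tensor) -> torch.Tensor:
    # phi_s:       (batch, dim) absolute encoder outputs phi(s)
    # phi_anchors: (K, dim)     embeddings of the K anchors phi(A_1), ..., phi(A_K)
    # returns:     (batch, K)   one cosine score per anchor
    phi_s = F.normalize(phi_s, dim=-1)             # project onto the unit sphere
    phi_anchors = F.normalize(phi_anchors, dim=-1)
    sim = phi_s @ phi_anchors.T                    # cosine similarity to every anchor
    return sim                                     # use (1 - sim) if a distance is preferred

Because only anchor-relative scores are kept, a global rotation of the underlying latent space leaves \(\psi(s)\) unchanged.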

Anchor Collection and Stabilization

Anchors are collected by playing identical action sequences in both environment variations, ensuring corresponding observations are aligned. During training, we use Exponential Moving Average (EMA) to stabilize anchor embeddings:

\[\bar{\phi}_t(A) = \alpha \cdot \bar{\phi}_{t-1}(A) + (1 - \alpha) \cdot \phi_t(A)\]

where \(\bar{\phi}_t(A)\) is the stabilized anchor embedding used for the relative encoding and \(\phi_t(A)\) is the anchor's current raw embedding. This prevents chaotic shifts in the relative coordinate system as the encoder evolves during training. We use \(\alpha = 0.999\) for all experiments.
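
A one-line sketch of this update, continuing the PyTorch snippet above (ema_anchors holds \(\bar{\phi}_{t-1}(A)\) for all anchors, new_anchors the freshly re-encoded \(\phi_t(A)\); the names are again ours):

def update_anchors_ema(ema_anchors: torch.Tensor,
                       new_anchors: torch.Tensor,
                       alpha: float = 0.999) -> torch.Tensor:
    # Keep most of the previous anchor embeddings and blend in a small fraction of the
    # re-encoded anchors, so the relative coordinate frame drifts slowly but steadily.
    return alpha * ema_anchors + (1.0 - alpha) * new_anchors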

Anchor collection pipeline: aligned observations from different environment variations are encoded and stored as reference points for the relative coordinate system.

Latent Space Alignment Analysis

To verify that relative representations successfully align latent spaces, we analyze pairwise cosine similarities between ~800 aligned frames from two CarRacing environments (green vs red grass).

Left: Cosine similarity matrices comparing absolute (top) vs relative (bottom) representations. Absolute spaces show consistently low similarity even on the diagonal (aligned frames). Relative spaces reveal strong diagonal similarity plus semantic off-diagonal matches. Right: Qualitative frame pairs from high-similarity regions show semantically similar track segments despite different positions.

Key insight: Relative representations reveal quasi-isometric structure—independently trained encoders learn the same semantic content in different coordinate systems, which R3L makes invariant.
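
The alignment check itself is simple to reproduce. A sketch, reusing the helpers above (z_green/z_red denote absolute embeddings of the ~800 aligned frames, psi_green/psi_red the corresponding relative representations; all names are illustrative):

def pairwise_cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: (N, d) embeddings of N aligned frames from variation A
    # b: (N, d) embeddings of the same N frames from variation B
    # returns (N, N); a strong diagonal indicates aligned spaces
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.T

sim_abs = pairwise_cosine(z_green, z_red)        # absolute space: weak diagonal
sim_rel = pairwise_cosine(psi_green, psi_red)    # relative space: strong diagonal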


Zero-Shot Stitching Results

The core contribution of R3L is enabling modular composition of RL agents through zero-shot stitching: combining encoders and controllers trained independently without any fine-tuning.
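
In code, stitching amounts to routing one encoder's relative output into a controller taken from a different training run. A conceptual sketch (the StitchedAgent class and its argument names are ours, and relative_representation is the helper sketched in the method section):

import torch
import torch.nn as nn

class StitchedAgent(nn.Module):
    def __init__(self, encoder: nn.Module, anchors: torch.Tensor, controller: nn.Module):
        super().__init__()
        self.encoder = encoder                      # e.g. trained on the red-grass variation
        self.register_buffer("anchors", anchors)    # anchor embeddings for this encoder
        self.controller = controller                # e.g. trained on a different task variation

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        z_abs = self.encoder(obs)                              # absolute embedding
        z_rel = relative_representation(z_abs, self.anchors)   # coordinate-free embedding
        return self.controller(z_rel)                          # controller acts on the relative space

No parameter of either module is updated: the relative space is the only interface between them.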

CarRacing: Visual and Task Variations

We train separate encoders for 4 visual variations (green/red/blue grass, far camera) and separate controllers for 4 task variations (standard, slow, scrambled actions, no-idle). Stitching each encoder with each controller yields 16 distinct agents.

Encoder   Method          Green      Red        Blue       Slow       Scrambled   No-idle
Green     S.Abs (naive)   175±304    167±226    -4±79      148±328    106±217     213±201
Green     S.R3L (ours)    781±108    787±62     794±61     268±14     781±126     824±82
Red       S.Abs (naive)   157±248    43±205     22±112     83±191     138±244     252±228
Red       S.R3L (ours)    810±52     776±92     803±58     476±430    790±72      817±69
Blue      S.Abs (naive)   137±225    130±274    11±122     95±128     138±224     144±206
Blue      S.R3L (ours)    791±64     793±40     792±48     564±440    804±41      828±50

Zero-shot stitching results (mean score ± std over 160 evaluations per cell). On most combinations S.R3L scores 750-830, a 4-10× improvement over naive absolute stitching (S.Abs: roughly 50-250) that approaches end-to-end R3L performance (797-832); the slow task variation is the main exception.

Training Time Savings

Traditional approach: train all 16 visual-task combinations independently.
  Total training time: 52 hours
  Complexity: O(V×T)

R3L approach: train only 4+4 models (the diagonal cells below) and stitch the rest.
  Total training time: 13 hours
  Complexity: O(V+T)

                 V1 (green)   V2 (red)   V3 (blue)   V4 (far)
T1 (standard)    3h           -          -           -
T2 (slow)        -            4h         -           -
T3 (no-idle)     -            -          3h          -
T4 (scrambled)   -            -          -           3h

Train only the four diagonal cells above (13 hours total), then compose to obtain all 16 combinations via zero-shot stitching.
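
To make the arithmetic concrete: with V = T = 4, end-to-end training covers 4×4 = 16 combinations at roughly 3-4 hours each (≈52 hours in total), while the four diagonal runs cost 3 + 4 + 3 + 3 = 13 hours, a 75% reduction. Because the stitching approach grows additively rather than multiplicatively, the gap widens as more visual or task variations are added.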

Training Dynamics

To verify that relative representations maintain training stability, we visualize learning curves across CarRacing and Atari environments. The plots show that R3L training converges smoothly without instability from the coordinate transformation.

Training curves (six panels): CarRacing green (standard visual variation), CarRacing slow (temporal task variation), CarRacing far camera (visual perspective variation), and the Atari games Pong, Boxing, and Breakout.

Key observation: All training curves show stable convergence patterns comparable to standard absolute training. The relative encoding does not introduce training instabilities or convergence issues, validating that coordinate-free representations are compatible with standard RL optimization procedures.


EMA Stabilization Analysis

Anchor embeddings evolve during training as the encoder learns. To prevent chaotic shifts in the relative coordinate system, we stabilize anchors using Exponential Moving Average (EMA). Below we compare different EMA coefficients:

Training curves for different EMA coefficients (α). Higher α values (0.999) provide better stability by smoothing anchor updates. We use α=0.999 for all experiments.

The EMA update formula is:

\[\bar{\phi}_t(A) = \alpha \cdot \bar{\phi}_{t-1}(A) + (1 - \alpha) \cdot \phi_t(A)\]

With α = 0.999, anchors update slowly, providing a stable coordinate frame while still allowing gradual adaptation as the encoder improves. The coefficient trades stability against adaptivity: update the anchors too quickly and the relative coordinate frame shifts chaotically; too slowly and it lags behind the improving encoder.
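
As a rough rule of thumb (a general property of exponential moving averages, not a result from the paper), an EMA with coefficient \(\alpha\) averages over an effective window of about \(1/(1-\alpha)\) updates: \(\alpha = 0.999\) therefore averages over roughly the last 1,000 anchor re-encodings, whereas \(\alpha = 0.99\) would shrink that window to about 100.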


Environment Visualizations

R3L enables composition across diverse visual variations. Below are examples of the different CarRacing environment appearances used in our experiments:

Green Grass - Standard CarRacing appearance
Red/Yellow Grass - Visual variation with color shift

These perceptually different environments produce incompatible latent spaces when using standard encoders, but R3L’s relative representations enable seamless stitching of components trained on different variations.


Atari Experiments

We test generalization beyond CarRacing on three Atari games (Pong, Boxing, Breakout) with background color variations to evaluate R3L on precision-critical, frame-perfect environments.

Game       Variation         E.Abs        E.R3L       Status
Pong       Plain/Green/Red   21±0         20-21±0     Perfect
Boxing     Plain/Green/Red   95-96±2      88-95±4     Competitive
Breakout   Plain/Green/Red   132-298±60   77-146±60   Degraded
Pong

Perfect performance maintained (21/21). Simple visual structure enables complete preservation of task-relevant information.

Boxing

Slight degradation but competitive. Moderate visual complexity handled well by relative representations.

Breakout

Significant performance drop. Numerous bricks create detailed patterns that may lose fine-grained spatial information in coordinate transformation.


Key Findings

1. Quasi-Isometric Latent Spaces

RL encoders trained independently on visual variations learn semantically aligned representations obscured only by arbitrary coordinate differences—a structure that relative representations make invariant.

Like different map projections of the same territory

2. Zero-Shot Composition Works

Stitching encoders and controllers across visual and task variations achieves 4-10× improvement over naive composition, approaching end-to-end performance on visual variations (750-830 vs 800-850).

3. Quadratic Efficiency Gains

Training complexity reduces from O(V×T) to O(V+T), achieving 75% time savings that scale with variation diversity.

From multiplicative to additive scaling

4. Precision-Critical Tasks Remain Challenging

Temporal/dynamic variations (slow task: 268-564 vs E.Abs 996) and visually complex scenes (Breakout) expose limitations, suggesting coordinate transformation may lose fine-grained information.

5. Training Stability Maintained

Relative encoding does not compromise RL training dynamics—convergence and stability match standard absolute training across all environments. The coordinate-free transformation integrates seamlessly with standard RL optimization procedures.


Citation

@misc{ricciardi2025r3l,
  title         = {R3L: Relative Representations for Reinforcement Learning},
  author        = {Antonio Pio Ricciardi and Valentino Maiorca and Luca Moschella and
                   Riccardo Marin and Emanuele Rodolà},
  year          = {2025},
  eprint        = {2404.12917},
  archiveprefix = {arXiv},
  primaryclass  = {cs.LG},
  url           = {https://arxiv.org/abs/2404.12917}
}

Authors

Antonio Pio Ricciardi¹ · Valentino Maiorca¹,² · Luca Moschella¹ · Riccardo Marin³ · Emanuele Rodolà¹

¹Sapienza University of Rome, Italy · ²Institute of Science and Technology Austria (ISTA) · ³University of Tübingen, Germany