ASIF

Coupled Data Turns Unimodal Models to Multimodal Without Training

NeurIPS 2023

Overview

Multimodal models like CLIP have revolutionized computer vision by aligning visual and language spaces, but they require training massive encoders from scratch on hundreds of millions of image-text pairs. ASIF demonstrates a radical alternative: create a common multimodal space without any training at all, using pre-trained unimodal encoders and a much smaller collection of image-text pairs.

The key insight is surprisingly simple: captions of similar images are themselves similar. By representing inputs through their similarities to a collection of ground-truth multimodal pairs—what we call relative representations—we can create a quasi mode-invariant space where images and their captions naturally align.

ASIF is a simple recipe for aligning the representations of two frozen pre-trained encoders; unlike CLIP or LiT, it requires no training at all.

Method: Relative Representations

The Core Idea

Rather than training encoders to produce aligned embeddings, ASIF uses a collection of image-text pairs as a “Rosetta stone” between modalities. Each new input is represented as its similarities to the same-modality anchors in this collection.

The ASIF construction: Two unimodal pretrained encoders and a collection of coupled embeddings are sufficient to compare elements from different modes through representations made of similarities with ground-truth pairs.

Relative Representation Definition

Given an encoder \(E: X \to \mathbb{R}^d\) and a set of anchor samples \(\{x_1, \dots, x_n\}\), the relative representation of a new sample \(x'\) is the \(n\)-dimensional vector:

\[\mathbf{r}(x') = \big[\,\text{sim}(E(x'), E(x_1)), \dots, \text{sim}(E(x'), E(x_n))\,\big]\]

where \(\text{sim}\) is a similarity function applied to the embeddings (e.g., cosine similarity).

The crucial insight: When each anchor exists in two modalities (image and text), relative representations from both modalities live in the same \(n\)-dimensional space, enabling direct comparison between images and captions.
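
In code, building a relative representation amounts to a vector of cosine similarities against the anchors. The sketch below is illustrative rather than the reference implementation; it assumes NumPy and that the frozen encoders have already produced the embeddings.

```python
import numpy as np

def relative_representation(query_emb, anchor_embs):
    """Relative representation of one sample: its cosine similarities
    to the n same-modality anchors.

    query_emb:   (d,) embedding from a frozen unimodal encoder
    anchor_embs: (n, d) embeddings of the anchors in the *same* modality
    returns:     (n,) vector living in the shared relative space
    """
    q = query_emb / np.linalg.norm(query_emb)                            # L2-normalize the query
    A = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True) # L2-normalize the anchors
    return A @ q                                                         # cosine similarity to each anchor
```

Because every anchor exists both as an image and as a caption, an image encoded against the anchor images and a caption encoded against the anchor captions both land in the same \(\mathbb{R}^n\), where they can be compared directly.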

Zero-Shot Classification with ASIF

Zero-shot classification with ASIF: (a) Compute and store embeddings of the multimodal dataset, (b) Compute relative representations of the test image and candidate captions, (c) Treat the image's relative representation as if it were the ideal caption's relative representation, (d) Choose the most similar candidate caption.

The ASIF procedure follows these steps:

  1. Compute and store the embeddings of every image and caption in the multimodal collection with the two frozen encoders (done once, offline).
  2. Compute the relative representation of the test image with respect to the anchor images, and of each candidate caption with respect to the anchor captions.
  3. Treat the image's relative representation as if it were the relative representation of its ideal caption.
  4. Choose the candidate caption whose relative representation is most similar to the image's.

Key hyperparameters:

  • \(k\): Number of non-zero entries (typically 800 out of millions)
  • \(p\): Exponent for similarity weighting (typically 8)

These create sparse, interpretable representations where each dimension corresponds to similarity with a specific training pair.
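
The sketch below ties the steps and hyperparameters together for a single test image. Function and variable names are ours, embeddings are assumed L2-normalized, step 1 (embedding the collection) is assumed done offline, and clipping negative similarities before the even exponent is a safeguard we add rather than part of the published recipe.

```python
import numpy as np

def sparsify(rel, k=800, p=8):
    """Keep only the k largest similarities, raise them to the power p,
    zero everything else, and renormalize (the value-weighting step)."""
    out = np.zeros_like(rel)
    idx = np.argpartition(rel, -k)[-k:]                  # indices of the top-k entries
    out[idx] = np.clip(rel[idx], 0, None) ** p           # clip so the even exponent cannot flip signs
    return out / (np.linalg.norm(out) + 1e-12)

def zero_shot_classify(img_emb, caption_embs, anchor_img_embs, anchor_txt_embs, k=800, p=8):
    """Steps 2-4: compare the test image and the candidate captions
    in the shared n-dimensional relative space; return the best caption index."""
    rel_img = sparsify(anchor_img_embs @ img_emb, k, p)              # image vs. anchor images
    rel_caps = np.stack([sparsify(anchor_txt_embs @ c, k, p)         # captions vs. anchor captions
                         for c in caption_embs])
    return int(np.argmax(rel_caps @ rel_img))                        # image's vector stands in for its ideal caption
```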


The Similarity Principle

Why does this work? Because captions of similar images are themselves similar.

Distribution of the similarities of 100k embedded pairs to a query image and its caption. The 1000 pairs with the highest image similarity (orange) also have high caption similarity, validating our core assumption.

This natural correlation between visual and textual similarity in good encoders is what enables ASIF’s mode-invariant representations to work without any training.
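
This assumption is easy to probe on any coupled collection: rank the anchors by similarity to a query image and inspect how similar their captions are to the query's caption, as in the figure above. A rough sketch, assuming normalized embeddings and names of our own choosing:

```python
import numpy as np

def caption_sims_of_top_image_neighbors(query_img_emb, query_txt_emb,
                                        anchor_img_embs, anchor_txt_embs, top=1000):
    """Caption similarities of the `top` anchor pairs whose images are closest
    to the query image; a right-skewed result supports the similarity principle."""
    img_sims = anchor_img_embs @ query_img_emb           # similarity in the image space
    txt_sims = anchor_txt_embs @ query_txt_emb           # similarity in the text space
    top_idx = np.argsort(img_sims)[-top:]                # anchors most similar as images
    return txt_sims[top_idx]
```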


Experiments

Zero-Shot Classification Performance

We evaluated ASIF against CLIP and LiT on standard benchmarks, using only 1.6M image-text pairs from CC12M, a multimodal collection up to 250× smaller than those used by prior work:

Method               Dataset Size     ImageNet   CIFAR100   Pets   ImageNet-v2
CLIP                 400M (private)   68.6       68.7       88.9   -
CLIP                 15M (public)     31.3       -          -      -
LiT                  901M (private)   70.1       70.9       88.1   61.7
ASIF (supervised)    1.6M (public)    60.9       50.2       81.5   52.2
ASIF (unsupervised)  1.6M (public)    53.0       46.5       74.7   45.9

ASIF achieves competitive performance with a fraction of the data, working even with unsupervised vision encoders (DINO).

ASIF performance scales with encoder quality. Even with smaller encoders and different architectures, the method remains effective and performance does not saturate, suggesting further improvements with larger models.

Scaling Laws

ImageNet accuracy improves smoothly as the multimodal dataset size grows. ASIF becomes effective very quickly, achieving 18% accuracy with just 10,000 pairs.

Performance scales smoothly with dataset size without saturation, even with smaller encoders. This suggests ASIF could benefit from larger multimodal collections while maintaining its training-free approach.

Interpretability

Every ASIF prediction can be traced back to specific training examples. Each dimension in the sparse relative representation corresponds to a unique image-text pair.

Deep dive into a classification: The relative representations (green vectors) reveal exactly which training samples contribute to the prediction. We can visualize the 23 most relevant image-text pairs, understand why the model chose "triumphal arch," and even identify broken training samples that can be removed.

This transparency enables:

  • Understanding correct and incorrect predictions
  • Identifying problematic training data
  • Curating better datasets through rapid iteration
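
Because the dot product between two sparse relative representations decomposes into one term per anchor, the training pairs behind any prediction can be read off directly. A minimal sketch of this tracing (function names are ours):

```python
import numpy as np

def explain_prediction(rel_img, rel_caption, anchor_pairs, top=23):
    """Return the anchor (image, caption) pairs contributing most to the score
    between a test image and the chosen caption, strongest first.

    rel_img, rel_caption: sparse relative representations, shape (n,)
    anchor_pairs:         list of n (image_id, caption) tuples, one per dimension
    """
    contribution = rel_img * rel_caption                      # per-anchor term of the dot product
    order = np.argsort(contribution)[::-1][:top]              # largest contributions first
    return [(anchor_pairs[i], float(contribution[i]))
            for i in order if contribution[i] > 0]
```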

Case Study: EuroSAT Classification

To further demonstrate ASIF’s interpretability, we examine satellite image classification on the EuroSAT dataset. The visualizations below show how ASIF makes predictions by identifying the most similar training samples in both image and text spaces.

Correct classification: The scatter plot reveals training samples closest to the query image (horizontal axis) and caption candidates (vertical axis). Larger markers indicate stronger combined similarity.
Analysis of prediction: Even with limited satellite imagery in CC12M, ASIF identifies relevant training samples. The visualized pairs explain both successes and failures.

These examples highlight the semantic gap between CC12M (general web images) and specialized domains like satellite imagery. Despite this challenge, ASIF’s interpretability makes it easy to diagnose issues and improve performance by adding targeted examples.


Unique Properties

No Training Required: ASIF combines independently pre-trained encoders with the embeddings of a multimodal collection; no neuron is tuned. Deploying or updating a model takes seconds, not days.

Data Efficiency: By leveraging pre-trained encoders, ASIF requires far less multimodal data. We achieve competitive zero-shot performance with 1.6M pairs versus 400M+ for CLIP.

Radically Data-Centric

  • Add new concepts: Encode new image-text pairs and immediately use them
  • Remove data: Delete embeddings to remove training samples (crucial for data rights compliance)
  • Iterate rapidly: Test different multimodal collections without retraining

Built-in Interpretability: Every prediction traces back to at most \(k\) training pairs. No expensive attribution methods are needed; interpretability is free by construction.

Fast Model Updates: For example, ASIF initially achieved 29.4% accuracy on EuroSAT (satellite images). Adding just 100 EuroSAT image-text pairs boosted accuracy to 82.2%, in seconds rather than hours.
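
In practice such an update is just an append (or delete) on the stored anchor matrices; nothing is retrained. A minimal sketch, with our own function names:

```python
import numpy as np

def add_pairs(anchor_img_embs, anchor_txt_embs, new_img_embs, new_txt_embs):
    """Add new concepts (e.g., 100 freshly encoded EuroSAT pairs):
    the anchor matrices simply gain a few rows."""
    return (np.concatenate([anchor_img_embs, new_img_embs]),
            np.concatenate([anchor_txt_embs, new_txt_embs]))

def remove_pairs(anchor_img_embs, anchor_txt_embs, forget_indices):
    """Forget specific training pairs (e.g., for data-rights requests):
    deleting their rows guarantees they no longer influence any prediction."""
    keep = np.setdiff1d(np.arange(len(anchor_img_embs)), forget_indices)
    return anchor_img_embs[keep], anchor_txt_embs[keep]
```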


Key Findings

  1. Training-free alignment works: Frozen unimodal encoders + coupled data embeddings create effective multimodal models

  2. Data efficiency matters: 1.6M pairs achieve competitive results versus models trained on 250× more data

  3. Relative representations enable mode-invariance: Similarities to multimodal anchors create naturally aligned spaces

  4. Interpretability comes free: Sparse representations directly link predictions to training examples

  5. Memory and retrieval rival learning: ASIF blurs the line between learning algorithms and retrieval systems, raising questions about the role of external memory in machine learning


Relation to Other Approaches

vs. k-Nearest Neighbors: Like k-NN, ASIF is non-parametric and requires explicit training data at test time. Unlike k-NN, ASIF performs open-ended classification with arbitrary new captions, functioning as a drop-in CLIP replacement.

vs. Kernel Methods: The relative representation can be viewed as an explicit feature map in a kernel space. However, ASIF emphasizes explicit coordinates for interpretability rather than operating in an implicit feature space.

vs. Prototypical Networks: ASIF aligns with classical few-shot learning approaches that leverage similarity to prototypes, but extends to the multimodal setting without task-specific training.
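
Concretely, if the top-\(k\) sparsification and exponent are set aside, the ASIF image-caption score is a plain dot product of relative representations, i.e. an explicit kernel over the anchor collection \(\{(x_i, y_i)\}_{i=1}^{n}\) (the notation below is ours):

\[\text{score}(x, y) = \langle \mathbf{r}_{\text{img}}(x), \mathbf{r}_{\text{txt}}(y) \rangle = \sum_{i=1}^{n} \text{sim}\!\big(E_{\text{img}}(x), E_{\text{img}}(x_i)\big)\,\text{sim}\!\big(E_{\text{txt}}(y), E_{\text{txt}}(y_i)\big)\]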


Implementation Considerations

Scalability: Our non-optimized implementation is less than 2× slower than CLIP. Techniques like product quantization (for memory) and inverted indexing (for speed) could enable scaling to billions of entries using libraries like FAISS.
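
As a sketch of what such an optimization could look like (not the paper's implementation), the anchor similarities can be served by a FAISS inner-product index, which also yields the top-\(k\) entries of the sparse relative representation directly:

```python
import numpy as np
import faiss  # e.g., the faiss-cpu package

def build_anchor_index(anchor_embs):
    """Exact inner-product index over L2-normalized anchor embeddings; for billions
    of entries one would move to a quantized/inverted FAISS index instead."""
    index = faiss.IndexFlatIP(anchor_embs.shape[1])
    index.add(np.ascontiguousarray(anchor_embs, dtype=np.float32))
    return index

def sparse_relative_representation(index, query_emb, n_anchors, k=800, p=8):
    """Retrieve only the k most similar anchors and assemble the sparse relative vector."""
    sims, idx = index.search(query_emb.reshape(1, -1).astype(np.float32), k)
    rel = np.zeros(n_anchors, dtype=np.float32)
    rel[idx[0]] = np.clip(sims[0], 0, None) ** p         # same value weighting as before
    return rel / (np.linalg.norm(rel) + 1e-12)
```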

Memory vs. Computation Trade-off: ASIF avoids training costs but requires storing all anchor embeddings and computing similarities at inference. This trade-off favors scenarios where:

  • Training resources are limited
  • Models need frequent updates
  • Interpretability is critical
  • Multimodal data is curated/evolving

Broader Impact

ASIF raises fundamental questions about foundation models:

Data Efficiency: If competitive performance is achievable with 250× less data and no training, how much of large-scale training is essential versus dataset quality?

Learning vs. Retrieval: ASIF satisfies the definition of a learning algorithm (performance improves with more data) yet stores information in an external memory rather than learned parameters. This challenges traditional views of machine learning.

Perception vs. Interpretation: ASIF separates perception (unimodal encoding) from interpretation (relative representation construction). The encoders act as sensors; meaning attribution happens through the multimodal dataset, not neural weights.

Data Rights and Forgetting: Unlike trained models requiring sophisticated unlearning techniques, ASIF enables trivial data removal—simply delete the corresponding embeddings.


Limitations

  • Performance still below CLIP/LiT when abundant multimodal data and training resources are available
  • Large memory footprint for the anchor embeddings (though compressible via quantization)
  • High-dimensional sparse representations may be less suitable for tasks like text-to-image generation
  • Limited evaluation on the full spectrum of multimodal tasks (focused on zero-shot classification)

Citation

@inproceedings{norelli2023asif,
  title     = {ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training},
  author    = {Norelli, Antonio and Fumero, Marco and Maiorca, Valentino and
               Moschella, Luca and Rodol\`{a}, Emanuele and Locatello, Francesco},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year      = {2023},
  url       = {https://arxiv.org/abs/2210.01738}
}

Authors

Antonio Norelli¹ · Marco Fumero¹ · Valentino Maiorca¹ · Luca Moschella¹ · Emanuele Rodolà¹ · Francesco Locatello²

¹Sapienza Università di Roma, Dipartimento di Informatica · ²Institute of Science and Technology Austria (ISTA)

Correspondence: Antonio Norelli (norelli@di.uniroma1.it)