Geometric relations in the latent space should depend on the data, not on how we choose to represent it


Generative models, their latent spaces, geometry and identifiability

Latent spaces are central to many modern generative machine learning models, whether explicitly built with a bottleneck or implicitly driven by hidden states, as in VAEs, diffusion models, flows, and GANs.

In some models, such as diffusion models or GANs, the latent space is secondary, and the focus is mainly on sample quality. But in models like VAEs, the latent representation plays a crucial role in understanding both the data-generating process and its underlying structure.

Researchers in fields like biology, neuroscience, physics, chemistry, and the social sciences often rely on these representations to interpret complex data and make predictions with real-world impact.

A major challenge across these fields is the lack of identifiability in the latent space. This refers to the fundamental impossibility of uniquely recovering the latent variables from observed data. This limitation, which has been formally proven, undermines the reliability of predictions and hinders efforts to learn causal or disentangled representations.

While the problem, as stated above, is fundamentally unsolvable, a substantial body of research has focused on characterizing it and proposing potential workarounds. Many of these approaches, however, rely either on additional data, which may be expensive or impractical to collect, or on restrictive modeling assumptions that constrain the model’s flexibility and expressiveness.

However, in many scientific and practical applications, what matters most is not the specific values of the latent variables, but the relations between them. The estimated latent codes are often used in downstream tasks that exploit a measure of similarity, such as clustering, classification, or regression. This observation allows us to take a different approach by reframing the problem through a geometric lens.

Viewed this way, the identifiability problem reduces to a simple postulate:

Geometric relations in the latent space should depend on the data, not on how we choose to represent it

This intuitive idea dates back to Gauss, who, when tasked with mapping the Earth, explained his failure by proving that the Earth’s geometry cannot be accurately represented on a flat plane. This notion of representation invariance later became a cornerstone of Einstein’s theory of general relativity, where the geometry of spacetime is independent of the coordinate system used to describe it.

Inspired by these foundational ideas, we show how identifiability in generative models can be reframed as a problem of the intrinsic geometry of data and its latent representation. This approach not only offers a principled way to address the problem but also provides fresh insights into the current state of the art in identifiability of generative models.

In what follows, we assume familiarity with generative models, particularly latent variable models. While some prior knowledge of differential geometry is helpful, we will introduce the necessary concepts informally and intuitively. For a rigorous treatment, we encourage readers to consult the full paper.

This post is organized as follows:

  • Latent space geometry
  • Identifiability
  • Identifiable metric structures
  • What can we reliably measure in the latent space?
  • Does it work in practice?
  • FAQ round
  • Take-aways

Latent space geometry

Given high dimensional data representing observed phenomena such as images, audio, text, or protein sequences, we often assume the underlying mechanisms have lower intrinsic dimension. Therefore, learning useful representations of these high dimensional observations should be closely tied to uncovering the lower dimensional latent structure.

The central object for learning such representations in latent variable models is the generator function, often called the decoder. This is a smooth mapping:

\[f: \mathbb{R}^d \to \mathbb{R}^D\]

that transforms latent variables $ \mathbf{z} \in \mathbb{R}^d $ into observed data $ \mathbf{x} \in \mathbb{R}^D $, assuming that $ d \leq D $. The decoder effectively discovers the data manifold embedded in the observed space and provides a way to transfer the geometry of this manifold back into the latent space.

To formalize this, we start with the notion of a manifold. In simple terms, a manifold is a space that looks Euclidean when examined locally. We can think of it as a surface that is flat in small neighborhoods but may have a more complex, curved global shape. The global structure emerges by smoothly joining these local neighborhoods.

Our decoder serves as a bridge between local neighborhoods in the latent space and their counterparts on the data manifold. Since the neighborhood around a point on the manifold is Euclidean, it naturally carries the standard inner product. We can then pull back this local geometry from the manifold to the latent space.

To recover this pullback metric, consider a point $\textcolor{#800080}{\mathbf{z}}$ in the latent space and two small perturbations around it, $\textcolor{#017100}{\Delta \mathbf{z}_1}$ and $\textcolor{#017100}{\Delta \mathbf{z}_2}$. In a local neighborhood of $\textcolor{#800080}{\mathbf{z}}$, we approximate the generator function using a first-order Taylor expansion:

\[f(\textcolor{#800080}{\mathbf{z}} + \textcolor{#017100}{\Delta \mathbf{z}}) \approx f(\textcolor{#800080}{\mathbf{z}}) + \mathbf{J}_{\textcolor{#800080}{\mathbf{z}}} \textcolor{#017100}{\Delta \mathbf{z}}\]

where $\mathbf{J}_{\textcolor{#800080}{\mathbf{z}}}$ is the Jacobian of the generator function $f(\mathbf{z})$ with respect to the latent variables $ \mathbf{z} $. It is helpful to think of the Jacobian as a linear map that takes small changes in the latent space and translates them into changes in the data space.
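To make this concrete, here is a minimal sketch in PyTorch of the Jacobian as a linear map and of the first-order Taylor approximation above. The decoder architecture is an arbitrary toy placeholder, not the one from the paper:

```python
import torch

# Toy decoder f: R^d -> R^D. The architecture is an arbitrary placeholder.
d, D = 2, 5
decoder = torch.nn.Sequential(
    torch.nn.Linear(d, 16), torch.nn.Tanh(), torch.nn.Linear(16, D)
)

z = torch.randn(d)
J = torch.autograd.functional.jacobian(decoder, z)  # shape (D, d)

# First-order Taylor check: f(z + dz) ~= f(z) + J @ dz for a small dz.
dz = 1e-3 * torch.randn(d)
lhs = decoder(z + dz)
rhs = decoder(z) + J @ dz
print(torch.allclose(lhs, rhs, atol=1e-4))  # True, up to second-order terms
```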

Using this, we can now compute the squared distance between the two perturbed points in the data space as follows:

\[\begin{aligned} \|f(\textcolor{#800080}{\mathbf{z}} + \textcolor{#017100}{\Delta \mathbf{z}_1}) - f(\textcolor{#800080}{\mathbf{z}} + \textcolor{#017100}{\Delta \mathbf{z}_2})\|^2 &= \|f(\textcolor{#800080}{\mathbf{z}}) + \mathbf{J}_{\textcolor{#800080}{\mathbf{z}}} \textcolor{#017100}{\Delta \mathbf{z}_1} - (f(\textcolor{#800080}{\mathbf{z}}) + \mathbf{J}_{\textcolor{#800080}{\mathbf{z}}} \textcolor{#017100}{\Delta \mathbf{z}_2})\|^2 \\ &= \|\mathbf{J}_{\textcolor{#800080}{\mathbf{z}}} (\textcolor{#017100}{\Delta \mathbf{z}_1} - \textcolor{#017100}{\Delta \mathbf{z}_2})\|^2 \\ &= (\textcolor{#017100}{\Delta \mathbf{z}_1} - \textcolor{#017100}{\Delta \mathbf{z}_2})^\top \mathbf{J}_{\textcolor{#800080}{\mathbf{z}}}^\top \mathbf{J}_{\textcolor{#800080}{\mathbf{z}}} (\textcolor{#017100}{\Delta \mathbf{z}_1} - \textcolor{#017100}{\Delta \mathbf{z}_2}) \\ &= (\textcolor{#017100}{\Delta \mathbf{z}_1} - \textcolor{#017100}{\Delta \mathbf{z}_2})^\top \mathbf{g}(\textcolor{#800080}{\mathbf{z}})(\textcolor{#017100}{\Delta \mathbf{z}_1} - \textcolor{#017100}{\Delta \mathbf{z}_2}) \end{aligned}\]

where the pullback metric is given by the smooth function $\mathbf{g}$ that assigns to each point $\mathbf{z}$ a symmetric positive definite matrix $\mathbf{g}(\mathbf{z}) = \mathbf{J}_{\mathbf{z}}^\top \mathbf{J}_{\mathbf{z}}$, defining the inner product between the two vectors $\textcolor{#017100}{\Delta \mathbf{z}_1}$ and $\textcolor{#017100}{\Delta \mathbf{z}_2}$.
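Continuing the sketch, the pullback metric at a point is just $\mathbf{J}_{\mathbf{z}}^\top \mathbf{J}_{\mathbf{z}}$, and the derivation above can be checked numerically (again with a toy placeholder decoder):

```python
import torch

d, D = 2, 5
decoder = torch.nn.Sequential(  # same toy placeholder decoder as before
    torch.nn.Linear(d, 16), torch.nn.Tanh(), torch.nn.Linear(16, D)
)

z = torch.randn(d)
J = torch.autograd.functional.jacobian(decoder, z)  # (D, d)
g = J.T @ J  # pullback metric: (d, d), symmetric positive definite

dz1, dz2 = 1e-3 * torch.randn(d), 1e-3 * torch.randn(d)
# Squared distance between the perturbed points, measured in data space...
data_dist2 = (decoder(z + dz1) - decoder(z + dz2)).pow(2).sum()
# ...agrees, to first order, with the quadratic form under the pullback metric.
latent_dist2 = (dz1 - dz2) @ g @ (dz1 - dz2)
print(data_dist2.item(), latent_dist2.item())  # approximately equal
```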

[Figure: two small perturbations around a latent point $\textcolor{#800080}{\mathbf{z}}$ mapped through the decoder $f$ onto the data manifold, where their inner product is measured.]

As we will see, this construction is not only an intuitive and practical way to define a metric, but it also has a key property needed for our geometric analysis.

To summarize, using a decoder and the pullback metric, we capture the geometric structure of the data manifold in terms of local coordinates (the latent space). Can we then be sure that this metric is meaningful and does not depend on the particular parametrization of the manifold?

In standard differential geometry, if two functions $f_a$ and $f_b$ both parameterize the same manifold, then their pullback metrics $ \mathbf{g}_a $ and $ \mathbf{g}_b $ are related by a smooth, one-to-one, and onto map known as a diffeomorphism. Crucially, this transformation is an isometry, meaning it preserves the geometric structure of the data manifold. This means that once we define a pullback metric, it provides a well-defined and intrinsic description of the manifold's geometry, independent of the specific parameterization.

The pullback metrics $ \mathbf{g}_a $ and $ \mathbf{g}_b $ are related by an isometry, meaning the geometry is preserved and does not depend on how the manifold is parameterized.

In terms of latent space geometry, this already sounds like we have what we want: a meaningful metric that is invariant to the parametrization of the manifold.

However, we still need to connect this to the identifiability problem. Starting from the identifiability of model parameters, we need a way to analyze the problem in terms of a map between the latent spaces of equivalent models: an indeterminacy transformation. If we can show that all indeterminacy transformations are isometries with respect to the pullback metric, then we can conclude that the pullback metric is identifiable. This will be our strategy for the rest of the post.

Identifiability

In the statistics literature, when we talk about the parameters of a model, we usually think of the weights themselves: the (possibly concatenated) vector of weights $\theta$ that defines an instance of the designed model. In this context, two models with parameters $\theta_a$ and $\theta_b$ are said to be equivalent if they generate the same distribution over the data, i.e., $p_{\theta_a}(\mathbf{x}) = p_{\theta_b}(\mathbf{x})$ for all $\mathbf{x}$. In other words, the identifiability problem arises when there are whole sets of equivalent models with no principled way to distinguish between them.

However, in the context of latent variable models, an alternative but equivalent perspective on the parameters was proposed by Johnny and Ben in their paper (link cite). They make a small abstraction and consider the pair of a decoder function and its associated latent distribution as the parameters of the model. In this framework, a model is defined by the generator function and the associated latent distribution, $ \theta_a = (P_{Z_{a}}, f_a) $. Visualized in the figure below is a situation where two models are equivalent, i.e., they generate the same manifold (the image of the generator) and the same data distribution on it, but the latent distributions and the associated generator functions are different.

[Figure: two equivalent models, $ \theta_a = (P_{Z_{a}}, f_a) $ and $ \theta_b = (P_{Z_{b}}, f_b) $, generating the same manifold and data distribution from different latent distributions and generator functions, with the transformation $ \textcolor{#9058F3}{A_{a,b}} $ mapping between their latent spaces.]

While this abstraction may seem dubious at first, it allows us to analyze the identifiability problem in latent variable models in terms of transformations $ \textcolor{#9058F3}{A_{a,b}} $ between the latent spaces, as visualized above. The paper by Johnny and Ben shows that all possible indeterminacy transformations are probabilistically equivalent to traveling back and forth along the generator functions. That is, to go from one latent space $ \mathcal{Z}_a $ to another $ \mathcal{Z}_b $, we first travel to the manifold via $ f_a $ and then to $ \mathcal{Z}_b $ through the inverse of $ f_b $. This leads to the following:

All indeterminacy transformations between latent spaces of equivalent models are of the form $ \textcolor{#9058F3}{A_{a,b}}(\mathbf{z}) := f_b^{-1} \circ f_a(\mathbf{z}) $.

In the next section, we will show how a geometric interpretation of latent variable models and analysis of the indeterminacy transformations allows us to establish identifiability of the pullback metric in the latent space.

Identifiable metric structures

From the previous section, we have that for any two equivalent models $ \theta_a $ and $ \theta_b $, the transformation between the latent spaces is given by the indeterminacy transformation $ \textcolor{#9058F3}{A_{a,b}}(\mathbf{z}) := f_b^{-1} \circ f_a(\mathbf{z}) $. More importantly, given a set of equivalent models, we know that all possible indeterminacy transformations are of this form.

To treat $ \textcolor{#9058F3}{A_{a,b}}(\mathbf{z}) $ in geometric terms, we need to look at its components: $ f_a $ and $ f_b^{-1} $. For this construction to be meaningful, we implicitly assume that the generator functions are injective and have the same image. The first assumption ensures that the inverse function $ f_b^{-1} $ is well-defined, and the second ensures that the transport of probabilities between the two latent spaces is well-defined.

These two assumptions, together with some regularity conditions, ensure that the image of the generator function $ f_a $ is a smooth manifold in the ambient space. This, in turn, renders $ \textcolor{#9058F3}{A_{a,b}}(\mathbf{z}) $ a smooth reparametrization (a diffeomorphism) of the same manifold. Here we loosely refer to both the latent space and its image in the ambient space through $ f $ as manifolds. Although this may feel confusing, it is allowed since these objects are made equivalent (in geometric terms) by the decoder function.

Since our possible equivalent latent spaces are related by $ \textcolor{#9058F3}{A_{a,b}} $, the last piece of the puzzle is to show that $ \textcolor{#9058F3}{A_{a,b}} $ of this form is an isometry with respect to the pullback metric. This is exactly the main result of our paper (cite), where we prove it by simply plugging into the definition of an isometry and checking that it holds. The details and a formal treatment are in the paper, but the intuition is grounded in the fact that our decoders learn the same data manifold. Since the object is the same, the same must be true for its local neighborhoods, even if we express them in a different coordinate system. Hence, the pullback metrics $ \mathbf{g}_a $ and $ \mathbf{g}_b $ must measure the geometric structure in the same way, i.e., they are related by an isometry.
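As a sanity check, the isometry property can be observed numerically on a toy example of our own making (not an experiment from the paper): two decoders that parameterize the same arc of the unit circle, with the indeterminacy transformation available in closed form, $ A_{a,b}(z) = z/2 $:

```python
import numpy as np

# Two parameterizations of the same manifold (an arc of the unit circle).
f_a = lambda z: np.stack([np.cos(z), np.sin(z)], axis=-1)          # g_a(z) = 1
f_b = lambda w: np.stack([np.cos(2 * w), np.sin(2 * w)], axis=-1)  # g_b(w) = 4
A = lambda z: z / 2  # indeterminacy transformation f_b^{-1} o f_a

def curve_length(latent_points, decoder):
    # Length with respect to the pullback metric, approximated by summing
    # chord lengths of the decoded curve in the ambient space.
    x = decoder(latent_points)
    return np.linalg.norm(np.diff(x, axis=0), axis=1).sum()

z_curve = np.linspace(0.0, 1.0, 200)  # a curve in the latent space Z_a
print(curve_length(z_curve, f_a))     # ~1.0 under g_a
print(curve_length(A(z_curve), f_b))  # ~1.0 under g_b: A is an isometry
```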

What can we reliably measure in the latent space?

The main result above guarantees that any geometric relation based on the pullback metric is identifiable. This means that we can reliably measure distances, angles, and volumes in the latent space, and these measurements will be invariant to the choice of the generator function. The figure below illustrates how we can measure the length of a curve on the data manifold and how the pullback metric makes this possible to do in the latent space.

Given the latent space on the left and the ambient space on the right, the flat $ \mathcal{Z}_a $ serves as a map of the data manifold $ \mathcal{M} $. At each point $ f(\mathbf{z}) $, corresponding to a latent variable $ \mathbf{z} $, we can see a tangent plane to the manifold. The picture is drawn so that the unit circle in the tangent plane corresponds to the unit ellipse in the latent space. All vectors within the circle (and the ellipse) are of unit length, but whereas the ones on the right are Euclidean, the ones on the left are measured with respect to the pullback metric.

[Figure: the flat latent space $ \mathcal{Z}_a $ (left) as a map of the data manifold $ \mathcal{M} $ (right); unit circles in the tangent planes on the manifold correspond to unit ellipses under the pullback metric in the latent space.]

Using this intuition, we can now, for example, measure the distance between two points $ (\mathbf{z}_1, \mathbf{z}_2) $ on the data manifold by finding a shortest path (geodesic) between them. In practice, this is done by minimizing the energy of a curve $ \gamma $ in the latent space that corresponds to the path on the data manifold, i.e., by solving the following optimization problem: \(\min_{\gamma} \int_{0}^{1} \|\dot{\gamma}(t)\|_{\mathbf{g}}^2 \, dt\) where $ \|\dot{\gamma}(t)\|_{\mathbf{g}}^2 = \dot{\gamma}(t)^\top \mathbf{g}(\gamma(t)) \dot{\gamma}(t) $ is the squared norm of the curve velocity under the pullback metric at $ \gamma(t) $, and $ \gamma $ is a curve in the latent space that starts at $ \gamma(0) = \mathbf{z}_1 $ and ends at $ \gamma(1) = \mathbf{z}_2 $.

The result of this optimization in the latent space of one model is guaranteed to be identifiable: it will yield the same distance if we were to use another equivalent model with a potentially very different looking latent space.
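A minimal sketch of this optimization, assuming a toy PyTorch decoder and using a simple point discretization of $ \gamma $ rather than the spline parametrization used in the paper. Because $ \mathbf{g} $ is a pullback metric, the energy of $ \gamma $ equals the ambient-space energy of $ f \circ \gamma $, so we never need to form the Jacobian explicitly:

```python
import torch

d, D, n = 2, 5, 32  # latent dim, ambient dim, number of curve points
decoder = torch.nn.Sequential(  # toy placeholder decoder
    torch.nn.Linear(d, 16), torch.nn.Tanh(), torch.nn.Linear(16, D)
)
z1, z2 = torch.randn(d), torch.randn(d)  # endpoints in the latent space

# Initialize the interior points of the curve on the straight line z1 -> z2.
t = torch.linspace(0, 1, n)[1:-1, None]
interior = torch.nn.Parameter(z1 * (1 - t) + z2 * t)
opt = torch.optim.Adam([interior], lr=1e-2)

for _ in range(500):
    gamma = torch.cat([z1[None], interior, z2[None]])  # (n, d) curve points
    x = decoder(gamma)                                 # curve on the manifold
    energy = (n - 1) * (x[1:] - x[:-1]).pow(2).sum()   # discretized energy
    opt.zero_grad()
    energy.backward()
    opt.step()

with torch.no_grad():
    x = decoder(torch.cat([z1[None], interior, z2[None]]))
    geodesic_dist = (x[1:] - x[:-1]).norm(dim=1).sum()  # length of the curve
print(geodesic_dist.item())
```

Minimizing the energy rather than the length directly is the standard trick: it yields the same shortest path, but with an approximately constant-speed parametrization and a better-behaved objective.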

Does it work in practice?

The short answer is yes. To provide the long answer, we need to discuss what we see as evidence for it working in practice.

The form of the indeterminacy transformations $ \textcolor{#9058F3}{A_{a,b}}(\mathbf{z}) $ from the section on identifiability is proven in the limit of infinite data. Although using such theoretical limits is common in statistics, in practice we always work with finite data, which means that the models do not learn the true data manifold exactly and their approximations are stochastic. Furthermore, both the identifiability theory and our geometric treatment rely on the assumption that the generator functions are injective. This restriction is milder than what is usually needed for identifiability, but still a restriction we would like to relax in practice.

The consequence of these two considerations is that while the theory provides strong guarantees, in practice we should still expect some variability in whatever we measure in the latent space using the pullback metric. The good news, however, is that this variability is demonstrated to be smaller than the variability of the same measurements using the standard Euclidean metric. And furthermore, the relaxation of the injectivity assumption is shown not to affect the outcome of the experiments in practice.

To illustrate this, we compute the coefficient of variation of the distances between the same pairs of points in the latent space across 30 model retrainings. The distances are computed using both the pullback metric and the Euclidean metric. For the MNIST and CIFAR10 datasets, we use an injective decoder based on the $ \mathcal{M} $-flow architecture. For FMNIST and CelebA, we use a non-injective decoder based on a CNN architecture. Below we show the histograms of the coefficient of variation and observe that, in all cases, the pullback metric has a lower coefficient of variation than the Euclidean metric, with a narrower distribution.

[Figure: histograms of the coefficient of variation of latent-space distances across 30 retrainings; the pullback metric yields lower and more concentrated values than the Euclidean metric on all four datasets.]
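For concreteness, the statistic behind these histograms is the standard coefficient of variation (standard deviation over mean). A sketch with placeholder data, where the real `distances` array would hold one estimated distance per retraining for a fixed pair of points:

```python
import numpy as np

n_retrainings = 30
distances = np.abs(10.0 + np.random.randn(n_retrainings))  # placeholder values

cv = distances.std() / distances.mean()  # coefficient of variation
print(f"coefficient of variation: {cv:.3f}")  # lower = more reproducible
```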

In the following figure, we show how a geodesic parametrized by a spline looks in the 2D latent space of the $ \mathcal{M} $-flow model trained on MNIST.

[Figure: a spline-parametrized geodesic in the 2D latent space of the $ \mathcal{M} $-flow model trained on MNIST.]

Furthermore, we show that this approach is feasible with bigger non-injective image models by training a CNN-based VAE on the CelebA dataset. The figure below shows interpolations according to the geodesic and the Euclidean curves in the latent space.

[Figure: interpolations along the geodesic and the Euclidean straight line in the latent space of a CNN-based VAE trained on CelebA.]

What if we really want the Euclidean metric in the latent space to be identifiable?

FAQ round

  • We have assumed the Euclidean metric in the ambient space to calculate the dot product. However, this is not a limitation: the pullback metric can be defined for any valid ambient metric (see the sketch after this list).

  • How to implement it? Any advice? A future blog post?

  • Should we expect the histograms to overlap?
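On the first point above, here is a sketch of how a non-Euclidean ambient metric would enter the construction: the pullback becomes $ \mathbf{g}(\mathbf{z}) = \mathbf{J}_{\mathbf{z}}^\top \mathbf{M}(f(\mathbf{z})) \mathbf{J}_{\mathbf{z}} $, where $ \mathbf{M} $ is the ambient metric. Both the decoder and the metric below are placeholders:

```python
import torch

d, D = 2, 5
decoder = torch.nn.Sequential(  # toy placeholder decoder
    torch.nn.Linear(d, 16), torch.nn.Tanh(), torch.nn.Linear(16, D)
)

def ambient_metric(x):
    # Placeholder: a position-dependent diagonal metric on the ambient space.
    return torch.diag(1.0 + x.pow(2))

z = torch.randn(d)
J = torch.autograd.functional.jacobian(decoder, z)  # (D, d)
M = ambient_metric(decoder(z))                      # (D, D)
g = J.T @ M @ J  # pullback of the ambient metric M, instead of J.T @ J
```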

Take-aways

Seen from a geometric point of view, current solutions are essentially trying to make the Euclidean notion of geometric relations (distances, angles, volumes, etc.) identifiable—disregarding the geometry of the models involved. Even though some guarantees have been established by way of extra labeled data or by restricting the flexibility of the model (e.g., through multiple views or factorized structures), these approaches implicitly assume that the data manifold is flat. However, this is not always the case in practice, especially with high-dimensional data.

Instead, we propose to leverage the geometry of the learned model by pulling the ambient metric back to the latent space, and show that this makes the metric structure identifiable. This theoretical result guarantees that the pullback metric is suitable for use in downstream tasks such as clustering, classification, and regression, and may also be beneficial for causal discovery and disentanglement.