On the Permutation Invariance of Graph Generative Models

This blog post discusses the permutation invariance principle of graph generative models, which has often been taken for granted in graph-related tasks. While permutation symmetry is an elegant property of graph data, there is still more to learn about its empirical implications.

Introduction

Graphs are ubiquitous mathematical objects that arise in various domains, such as social networks, protein structures, and chemical molecules. A graph can be formally represented as a set of nodes $V$, optionally associated with node features $X$, and a set of edges $E$. Its topology is commonly expressed using an adjacency matrix $A$.

Plain graphs composed of nodes and edges (visuals taken from ).
Molecule graphs and scene graphs with node and edge features.

The goal of graph generative models is to generate novel graph samples that resemble those drawn from the true data distribution. Graphs can be stored in several ways in computational systems, including tree-based structures, SMILES strings (for molecule graphs), and, most commonly, adjacency matrices. In this post, we focus on the adjacency matrix representation, which is widely used due to its flexibility. Formally, graph generative models learn a distribution $p_\theta$ and generate adjacency matrices $\hat{A} \sim p_\theta$.

Permutation Symmetry for Graph Representation Learning

Permutation symmetry (i.e., permutation invariance and equivariance) is a fundamental design principle in graph representation learning tasks. Below, we provide a quick recap of the key ideas using resources from .

A straightforward idea for constructing a deep neural network on graphs is to use the adjacency matrix directly as input. For instance, to obtain an embedding of the entire graph, one could flatten the adjacency matrix and pass it through a multi-layer perceptron (MLP):

\[\mathbf{z}_G = \text{MLP}({A}[1] \oplus {A}[2] \oplus \dots \oplus {A}[ \vert V \vert]),\]

where ${A}[i] \in \mathbb{R}^{\vert V \vert}$ corresponds to the $i$-th row of the adjacency matrix, and $\oplus$ denotes vector concatenation. The drawback of this method is that it relies on the arbitrary ordering of the nodes in the adjacency matrix. Consequently, such a model is not permutation invariant.
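To make the ordering sensitivity concrete, here is a minimal PyTorch sketch (the MLP sizes and the example graph are illustrative assumptions of ours, not from the original post) showing that the flattened-adjacency embedding changes under a node relabeling:

```python
import torch

torch.manual_seed(0)

n = 4  # number of nodes (toy example)
# A toy MLP acting on the flattened adjacency matrix: R^{n*n} -> R^8.
mlp = torch.nn.Sequential(
    torch.nn.Linear(n * n, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 8),
)

# A small undirected graph (path 0-1-2-3) as an adjacency matrix.
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])

# A permutation matrix P that relabels the nodes.
perm = torch.tensor([2, 0, 3, 1])
P = torch.eye(n)[perm]
A_perm = P @ A @ P.T  # isomorphic graph, different node ordering

z_G = mlp(A.flatten())
z_G_perm = mlp(A_perm.flatten())

# The two embeddings differ even though the graphs are isomorphic,
# i.e., the flatten-then-MLP model is not permutation invariant.
print(torch.allclose(z_G, z_G_perm))  # False (almost surely)
```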

A fundamental requirement for graph representation learning models is that they should exhibit permutation invariance (or equivariance). Formally, a function $f$ that processes an adjacency matrix ${A}$ should ideally satisfy one of the following conditions:

\[\begin{aligned} f({PAP}^\top) &= f({A}) \quad \text{(Permutation Invariance)} \\ f({PAP}^\top) &= {P} f({A}) P^\top \quad \text{(Permutation Equivariance)} \end{aligned}\]

where ${P} \in \mathbb{R}^{\vert V \vert \times \vert V \vert}$ represents a permutation matrix .
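As a quick numerical illustration of these two conditions, here is a minimal sketch; the one-layer sum-aggregation model and the sum pooling below are standard toy choices of ours, not constructions from the post:

```python
import torch

torch.manual_seed(0)
n, d = 5, 3

A = (torch.rand(n, n) < 0.4).float()
A = torch.triu(A, diagonal=1)
A = A + A.T                      # random symmetric adjacency matrix
X = torch.rand(n, d)             # node features
W = torch.rand(d, d)             # shared weight matrix

perm = torch.randperm(n)
P = torch.eye(n)[perm]           # random permutation matrix

def gnn_layer(A, X):
    """One sum-aggregation message-passing layer: equivariant in the nodes."""
    return torch.relu(A @ X @ W)

def graph_embedding(A, X):
    """Sum-pool the node embeddings: invariant to node ordering."""
    return gnn_layer(A, X).sum(dim=0)

# Equivariance: f(P A P^T, P X) == P f(A, X)
print(torch.allclose(gnn_layer(P @ A @ P.T, P @ X), P @ gnn_layer(A, X)))  # True

# Invariance: pooling the equivariant output is order-independent.
print(torch.allclose(graph_embedding(P @ A @ P.T, P @ X), graph_embedding(A, X)))  # True
```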

Common graph representation learning tasks include node-level, edge-level, and graph-level tasks (shown from left to right; image credit: ).

Specifically, node- and edge-level tasks require permutation equivariance, while graph-level tasks require permutation invariance. A well-designed Graph Neural Network (GNN) should generalize across arbitrary permutations of node and edge indices, ensuring that its outputs are symmetric with respect to the node ordering of the input graph. This symmetry has two key advantages: (i) during training, the model need not consider all possible node orderings (i.e., exhaustive data augmentation), which reduces learning complexity and brings theoretical benefits ; and (ii) at inference, the model generalizes to unseen graphs without ordering bias, since any permutation is handled consistently.

Rethinking Invariance Principle for Generative Models

For graph generative models, permutation symmetry boils down to the invariance of the learned probability distribution: $p_\theta(\hat{A}) = p_\theta(P \hat{A} P^\top)$ must hold for any valid permutation matrix $P$. Two adjacency matrices related by such a permutation are said to be isomorphic . Thus, a permutation-invariant graph generative model assigns equal probability to every adjacency matrix in the same isomorphism class. One of the pioneering works achieves this property in the context of score-based (a.k.a. diffusion) generative models.

Permutation invariance of probability distributions means the likelihood is the same regardless of the node ordering. For example, $p_\theta({A}^{\pi_1}) = p_\theta({A}^{\pi_2}) = p_\theta({A}^{\pi_3}) = p_\theta({A}^{\pi_4}) = \cdots $.

Invariance of Probability Distributions

Theorem 1 (from ). If $ \mathbf{s} : \mathbb{R}^{N \times N} \to \mathbb{R}^{N \times N} $ is a permutation equivariant function, then the scalar function $ f_s = \int_{\gamma[\mathbf{0}, {A}]} \langle \mathbf{s}({X}), \mathrm{d}{X} \rangle_F + C $ is permutation invariant, where $ \langle {A}, {B} \rangle_F = \mathrm{tr}({A}^\top {B}) $ is the Frobenius inner product, $ \gamma[\mathbf{0}, {A}] $ is any curve from $ \mathbf{0} = \{0\}_{N \times N} $ to $ {A} $, and $ C \in \mathbb{R} $ is a constant.

We refer readers to for proof details. The implication of this theorem is that the scalar function $f_s$ can characterize the probability density $p_\theta(\cdot)$ up to a constant. This happens when the vector-valued function $\mathbf{s}(\cdot)$ represents the gradient of the log-likelihood, commonly called the (Stein) score function in the context of generative models .

Thus, if the learned gradient of the log-likelihood $\mathbf{s}_\theta(A) = \nabla_{A} \log p_\theta(A)$ is permutation equivariant, then the implicitly defined log-likelihood function $\log p_\theta(A)$ is permutation invariant, according to Theorem 1, as given by the line integral of $\mathbf{s}_\theta(A)$:

\[\log p_\theta({A}) = \int_{\gamma[0, {A}]} \langle \mathbf{s}_\theta({X}), \mathrm{d} {X} \rangle_F + \log p_\theta(\mathbf{0}).\]

Based on this theorem, as long as the score estimation neural network (equivalently, diffusion denoising network) is permutation equivariant, we can prove that the learned probability distribution is permutation invariant. Later works such as GDSS and DiGress also follow this idea.
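As a sanity check of Theorem 1, here is a small numerical sketch of our own (not from the referenced paper): we take the permutation-equivariant score of a standard Gaussian, $\mathbf{s}(X) = -X$, integrate it along the straight line $\gamma(t) = tA$, and verify that the resulting scalar is unchanged under a node permutation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# A random symmetric adjacency matrix and a permuted (isomorphic) copy of it.
A = np.triu((rng.random((n, n)) < 0.5).astype(float), k=1)
A = A + A.T
perm = rng.permutation(n)
P = np.eye(n)[perm]
A_perm = P @ A @ P.T

def score(X):
    """Permutation-equivariant score of a standard Gaussian: s(X) = -X."""
    return -X

def line_integral(A, steps=2000):
    """Numerically integrate <s(X), dX>_F along the straight line gamma(t) = t * A."""
    ts = np.linspace(0.0, 1.0, steps + 1)
    vals = np.array([np.sum(score(t * A) * A) for t in ts])  # dX = A dt on this curve
    dt = ts[1] - ts[0]
    return dt * (0.5 * vals[0] + vals[1:-1].sum() + 0.5 * vals[-1])  # trapezoidal rule

f_A, f_A_perm = line_integral(A), line_integral(A_perm)

# Both values match the closed form -||A||_F^2 / 2 (up to the constant C) and agree
# with each other, illustrating that the implied log-density is permutation invariant.
print(f_A, f_A_perm, -0.5 * np.sum(A * A))
```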

The diffusion processes for graph generation using adjacency matrices in the continuous state space.

Implications on Scalability

While it is appealing to design a provably permutation invariant graph generative model, we find the empirical implications of this property are not as straightforward as one might expect.

Data distribution and target distribution for a 3-node tree graph. For permutation matrix $P_i$ and adjacency matrix $A_i$, filled/blank cells mean one/zero. The probability mass function (PMF) highlights the difference in modes. Our example here also shows graph automorphism (e.g., $P_1$ and $P_2$).

Theoretical Analysis

Here we provide a simple example to illustrate the potential challenges of learning permutation invariant graph generative models. As shown in the above figure, the empirical graph distribution $p_{\text{data}}$ (i.e., the observed training samples) may only assign a non-zero probability to a single observed adjacency matrix in its isomorphism class.

The ultimate goal of a graph generative model is to match this empirical distribution, which may be biased by the observed permutation. However, the target distribution that generative models are trained to match may differ from the empirical one, depending on the model design w.r.t. permutation symmetry.

For clarity, we define the effective target distribution as the closest distribution (e.g., measured in total variation distance) to the empirical data distribution achievable by the generative model, assuming sufficient data and model capacity.

Formally, given a training set of adjacency matrices $\{A_i\}_{i=1}^m$, each with $n$ nodes, we define the union of their isomorphism classes as $\mathcal{A}^* = \bigcup_{i=1}^m \mathcal{I}_{A_i}$. Each isomorphism class $\mathcal{I}_{A_i}$ contains all adjacency matrices that are topologically equivalent to $A_i$ but may have different matrix representations. The corresponding effective target distribution is

\[p_{\text{data}}^*(A) = \frac{1}{Z} \sum_{A^* \in \mathcal{A}^*} \delta(A - A^*),\]

where $Z = \vert \mathcal{A}^* \vert = O(n!m)$ is the normalizing constant. Note that $Z = n!m$ may not always be achievable due to graph automorphisms .
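To see why $Z$ can fall short of $n!m$, the brute-force sketch below (assuming NumPy and NetworkX are available) counts the distinct adjacency matrices in the isomorphism class of a small graph; the count equals $n!/\vert \mathrm{Aut}(G) \vert$, so graphs with non-trivial automorphisms contribute fewer than $n!$ matrices.

```python
import itertools
import math
import numpy as np
import networkx as nx

def isomorphism_class_size(A):
    """Count distinct adjacency matrices obtainable by permuting the nodes of A."""
    n = A.shape[0]
    seen = set()
    for perm in itertools.permutations(range(n)):
        P = np.eye(n, dtype=int)[list(perm)]
        seen.add((P @ A @ P.T).tobytes())
    return len(seen)

# A 4-cycle has 8 automorphisms, so only 4!/8 = 3 distinct adjacency matrices.
G = nx.cycle_graph(4)
A = nx.to_numpy_array(G, dtype=int)
n = A.shape[0]
num_automorphisms = sum(
    1 for _ in nx.algorithms.isomorphism.GraphMatcher(G, G).isomorphisms_iter()
)
print(isomorphism_class_size(A))               # 3
print(math.factorial(n) // num_automorphisms)  # 3, matches n!/|Aut(G)|
```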

Lemma 2 (from ). Assume at least one training graph has $\Omega(n!)$ distinct adjacency matrices in its isomorphism class. Let $\mathcal{P}$ denote all discrete permutation-invariant distributions. The closest distributions in $\mathcal{P}$ to $p_{\text{data}}$, measured by total variation, have at least $\Omega(n!)$ modes. If, in addition, we restrict $\mathcal{P}$ to be the set of permutation-invariant distributions such that $p(A_i) = p(A_j) > 0$ for all matrices in the training set $\{A_l\}_{l=1}^m$, then the closest distribution is $\arg\min_{q \in \mathcal{P}} TV(q, p_{\text{data}}) = p_{\text{data}}^*.$

Under mild conditions, the distribution $p_{\text{data}}^*(A)$ defined above, with its $O(n!m)$ modes, becomes the effective target distribution; this is the case for permutation-invariant models built on equivariant networks. In contrast, if we employ a non-equivariant network (i.e., the learned density is not invariant), the effective target distribution is the empirical $p_{\text{data}}(A)$, which has only $O(m)$ modes. While we discuss the number of modes from a general perspective, the analysis also applies to diffusion models: the modes of the Dirac delta target distribution determine the components of the Gaussian mixture model (GMM) that arises at each noise scale, with each component centered exactly on a target mode. For an invariant model, the GMM takes the form $p_\sigma^*(A) = \frac{1}{Z} \sum_{A^* \in \mathcal{A}^*} \mathcal{N}(A; A^*, \sigma^2\mathbf{I}),$ which has an $O(n!)$ factor more components than the non-invariant one. Thus, learning under the permutation-invariance principle is arguably harder than without it, due to the $O(n!)$ surge in both the number of modes of the target distribution and the number of GMM components across noise scales.
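For a concrete feel of how the two diffusion targets differ, the toy sketch below (our own construction, with an arbitrary noise scale $\sigma$) builds the $\sigma$-smoothed GMM for a single observed 3-node path graph: the non-invariant target has one component centered at the observed matrix, while the invariant one places a component on every matrix in its isomorphism class.

```python
import itertools
import numpy as np

def log_gmm(A, centers, sigma):
    """Log-density of a uniform mixture of isotropic Gaussians N(.; C, sigma^2 I),
    dropping the additive normalizing constant shared by all components."""
    logs = np.array([-0.5 * np.sum((A - C) ** 2) / sigma**2 for C in centers])
    m = logs.max()
    return m + np.log(np.exp(logs - m).sum()) - np.log(len(centers))

# A single observed 3-node path graph plays the role of the training matrix A_1.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
n = 3

# Invariant target: one GMM component per distinct matrix in the isomorphism class.
perms = itertools.permutations(range(n))
iso_class = {(np.eye(n)[list(p)] @ A1 @ np.eye(n)[list(p)].T).tobytes() for p in perms}
centers_inv = [np.frombuffer(b).reshape(n, n) for b in iso_class]

sigma = 0.3
P = np.eye(n)[[1, 0, 2]]
A_query = P @ A1 @ P.T  # an isomorphic but differently ordered matrix

print(len(centers_inv))                      # 3 components vs. 1 for the non-invariant target
print(log_gmm(A_query, [A1], sigma))         # low: far from the single observed matrix
print(log_gmm(A_query, centers_inv, sigma))  # high: the invariant target covers this mode
```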

Empirical Investigation

In practice, we typically observe only one adjacency matrix from each isomorphism class in the training data $\{A_i\}_{i=1}^m$. By applying all $n!$ node permutations to each training matrix, one can construct $p_{\text{data}}^*$ (the invariant distribution) from $p_{\text{data}}$ (the non-invariant distribution).

We define a trade-off distribution, called the $l$-permuted empirical distribution:

\[p_{\text{data}}^l(A) = \frac{1}{ml} \sum_{i=1}^{m} \sum_{j=1}^{l} \delta(A - P_j A_i P_j^{\top}),\]

where $P_1, \ldots, P_l$ are $l$ distinct permutation matrices. The construction of $p_{\text{data}}^l$ has the following properties: (1) $p_{\text{data}}^l$ has $O(lm)$ modes governed by $l$; (2) With proper permutations, $p_{\text{data}}^l = p_{\text{data}}$ when $l=1$; (3) $p_{\text{data}}^l \approx p_{\text{data}}^*$ when $l=n!$ (identical if no non-trivial automorphisms). We use $p_{\text{data}}^l$ as the diffusion model’s target by tuning $l$ to study the impact of mode count on empirical performance.
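A minimal sketch of how such an $l$-permuted training set could be materialized (our own illustrative helper, not the paper's data pipeline; it samples permutations independently per graph rather than fixing $l$ distinct matrices as in the definition):

```python
import numpy as np

def l_permuted_dataset(adj_list, l, seed=0):
    """Replicate each adjacency matrix under l node permutations.

    The resulting multiset of matrices is an empirical sample of the l-permuted
    distribution p_data^l (duplicates may occur for graphs with automorphisms,
    and distinctness of the sampled permutations is not enforced here).
    """
    rng = np.random.default_rng(seed)
    out = []
    for A in adj_list:
        n = A.shape[0]
        for j in range(l):
            # Keep the identity first so that l = 1 recovers p_data exactly.
            perm = np.arange(n) if j == 0 else rng.permutation(n)
            P = np.eye(n, dtype=A.dtype)[perm]
            out.append(P @ A @ P.T)
    return out

# Toy usage: two training graphs, l = 4 permutations each -> 8 matrices.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
A2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # triangle
dataset = l_permuted_dataset([A1, A2], l=4)
print(len(dataset))  # 8
```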

The experimental setup is as follows. We use 10 random regular graphs with 16 nodes, with degrees in $[2, 11]$. The parameter $l$ ranges from 1 to 500, and all models are trained to convergence. As invariant baselines targeting $p_{\text{data}}^*$, we consider DiGress and PPGN, a highly-expressive 3WL-discriminative GNN (PPGN was originally proposed for graph representation learning; we reimplemented it based on the official codebase and adapted it to the diffusion objective). As non-invariant models (corresponding to $p_{\text{data}}^l$ with $l < n!$), we evaluate PPGN with index-based positional embeddings and SwinGNN (ours) . We measure recall, defined as the proportion of generated graphs that are isomorphic to some training graph. Recall lies in $[0, 1]$, requires isomorphism testing to compute, and is itself permutation invariant. A higher recall indicates a stronger ability to capture the toy data distribution.
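The recall metric can be computed with off-the-shelf isomorphism tests; the sketch below assumes NetworkX and is not necessarily the exact evaluation code used in the paper:

```python
import numpy as np
import networkx as nx

def recall(generated_adj, training_adj):
    """Fraction of generated graphs that are isomorphic to at least one training graph.

    Isomorphism testing makes the metric independent of node ordering.
    """
    train_graphs = [nx.from_numpy_array(A) for A in training_adj]
    hits = sum(
        any(nx.is_isomorphic(nx.from_numpy_array(A_hat), G) for G in train_graphs)
        for A_hat in generated_adj
    )
    return hits / max(len(generated_adj), 1)

# Example: a permuted copy of a training graph counts as a hit.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
P = np.eye(3)[[2, 0, 1]]
print(recall([P @ A @ P.T], [A]))  # 1.0
```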

Quantitative results on the $l$-permuted empirical distribution. Invariant models perform significantly worse than non-invariant models when the number of applied permutations ($l$) is small.

As shown in the plot above, invariant models such as DiGress and PPGN consistently fail to achieve high recall. In contrast, non-invariant models perform exceptionally well when $l$ is small, where only a few permutations are imposed, but their sample quality degrades significantly as $l$ increases, reflecting the difficulty of learning from distributions with many modes. Importantly, in practice, one typically sets $l=1$ for non-invariant models, which often leads to empirically stronger performance than their invariant counterparts.

Post-processing to Reclaim Sample Invariance

While non-invariant models often perform better empirically, they cannot guarantee permutation-invariant sampling. To bridge this gap, we propose a simple trick with a provable guarantee: apply a random permutation to each generated sample, which yields invariant sampling at no extra cost.

Lemma 3 (from ). Let $A$ be a random adjacency matrix distributed according to any graph distribution on $n$ vertices. Let $P_r \sim \mathrm{Unif}(\mathcal{S}_n)$ be uniform over the set of permutation matrices. Then, the induced distribution of the random matrix $A_r = P_r A P_r^{\top}$, denoted as $q_\theta(A_r)$, is permutation invariant, i.e., $q_\theta(A_r) = q_\theta(P A_r P^{\top}), \forall P \in \mathcal{S}_n$.

This trick applies to all generative models. Importantly, the random permutation preserves the isomorphism class: $q_\theta$ is invariant but covers the same set of isomorphism classes as $p_\theta$. Thus, graphs from $q_\theta$ always have isomorphic counterparts under $p_\theta$.
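In code, the post-processing amounts to one random relabeling per sample; a minimal sketch with a stand-in generator (the sampler interface is our assumption, not a specific model's API):

```python
import numpy as np

def permute_sample(A_hat, rng):
    """Apply a uniformly random node relabeling to one generated adjacency matrix."""
    n = A_hat.shape[0]
    P = np.eye(n, dtype=A_hat.dtype)[rng.permutation(n)]
    return P @ A_hat @ P.T

def invariant_sampler(sample_fn, num_samples, seed=0):
    """Wrap any (possibly non-invariant) sampler so that its output distribution is
    permutation invariant (Lemma 3); each sample's isomorphism class is unchanged."""
    rng = np.random.default_rng(seed)
    return [permute_sample(sample_fn(), rng) for _ in range(num_samples)]

# Usage with a stand-in generator that always emits the same path graph.
toy_model = lambda: np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
samples = invariant_sampler(toy_model, num_samples=3)
```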

In summary, there are two key observations: (1) invariant models ensure invariant sampling but may harm empirical performance, and (2) invariant sampling does not require invariant models.

Additional Experimental Results

Motivated by the aforementioned analysis, we design a new diffusion model that combines non-invariant objectives with invariant sampling to better capture graph distributions.

click here for more details on model design
Overall architecture of the proposed model.
We propose an efficient high-order graph transformer, SwinGNN , for graph diffusion. Drawing inspiration from $k$-order and $k$-WL GNNs , SwinGNN approximates the expressive 2-WL message passing to enhance graph isomorphism testing and function approximation capacity. To address the computational complexity of $O(n^4)$ in 2-WL GNNs, SwinGNN employs a transformer with window-based self-attention , treating edge values as tokens and reducing complexity to $O(n^2M^2)$ by confining attention to local $M \times M$ windows. A shifted window technique further enables cross-window interactions. Additionally, SwinGNN incorporates multi-scale edge representation learning through channel mixing-based downsampling and upsampling layers, constructing hierarchical graph representations to capture long-range interactions efficiently. Experimental results demonstrate its superior performance in graph generation tasks.
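To illustrate the window-attention idea on edge tokens, here is a simplified sketch under our own assumptions (not the actual SwinGNN implementation): the $n \times n$ edge-feature grid is partitioned into non-overlapping $M \times M$ windows and standard multi-head self-attention is applied within each window, so the overall cost scales as $O(n^2 M^2)$ instead of $O(n^4)$. The shifted-window step and the multi-scale layers are omitted here.

```python
import torch
import torch.nn as nn

class WindowEdgeAttention(nn.Module):
    """Self-attention over edge tokens within non-overlapping M x M windows."""

    def __init__(self, channels: int, window: int, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        # E: edge features of shape (batch, n, n, channels); n must be divisible by M here.
        B, n, _, C = E.shape
        M = self.window
        # Partition the n x n grid into (n/M)^2 windows of M x M edge tokens.
        x = E.view(B, n // M, M, n // M, M, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B * (n // M) ** 2, M * M, C)
        # Attention is confined to each window: O(n^2 * M^2) overall.
        x, _ = self.attn(x, x, x)
        # Reverse the partition back to the (batch, n, n, channels) grid.
        x = x.view(B, n // M, n // M, M, M, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, n, n, C)

# Toy usage: 16-node graphs, 8-dim edge features, 4 x 4 windows.
layer = WindowEdgeAttention(channels=8, window=4)
E = torch.randn(2, 16, 16, 8)
print(layer(E).shape)  # torch.Size([2, 16, 16, 8])
```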
Qualitative comparison between invariant and non-invariant models.

From the qualitative results, especially on the grid dataset, the difference between non-invariant and invariant models is clearly visible in the comparison of the then-SOTA models GRAN and GDSS with ours.

We also demonstrate the model’s superiority quantitatively on more synthetic and real-world datasets. The metrics are maximum mean discrepancy (MMD) scores that characterize graph structure properties, such as degree distributions. Please refer to the expandable box below and the original paper for more details.

Quantitative comparison between invariant and non-invariant models.
click here for more quantitative results
More quantitative results on synthetic and real-world graph datasets. Table 2 demonstrates further results on larger graph datasets. Table 3 shows results on molecule datasets using domain-specific metrics.
We also apply non-invariant models to the scene graph generation task and show consistent scene-graph-to-image pair generation results. Please refer to for more details.

Discussion with Recent Works

It is interesting to see concurrent works adopting similar principles or arriving at similar findings regarding the permutation invariance property of generative models in the graph domain.

For instance, AlphaFold 3 (2024) from DeepMind employs a non-equivariant diffusion model architecture to predict properties of complex proteins. We quote from their paper:

The diffusion module operates directly on raw atom coordinates, and on a coarse abstract token representation, without rotational frames or any equivariant processing.

AlphaFold 3 indeed applies data augmentation to encourage equivariance but does not impose this condition through network design.

Similarly, DiffAlign, published at ICLR 2025, also discards the equivariance property for diffusion models on the retrosynthesis task and shows improved performance. From a theoretical perspective, the authors use copying graphs as a case study to illustrate the limitations of equivariance. Both works provide further empirical evidence that permutation invariance should not simply be taken for granted in generative models.

More interestingly, research from the optimization perspective also addresses this problem and provides fresh insights, studying the relationship between intrinsic equivariance and data augmentation .

Summary

This post examines the role of permutation invariance in graph generative models. While symmetry is essential in graph representation learning, enforcing it in generative models can make learning harder by introducing exponentially many modes. Empirical results show that non-invariant models often outperform invariant ones, especially with limited permutations. A simple post-processing trick—random permutation—restores invariant sampling without requiring invariant model design. Building on this, we propose non-invariant graph generative models that achieve strong performance on synthetic and real-world datasets . Recent works like AlphaFold 3 and DiffAlign further support the view that permutation invariance should not be taken for granted for generative models in the graph domain. It appears more rigorous theoretical analysis is needed to understand the relationship between permutation invariance and the performance of generative models.