Graph Foundation Models¶
Goal: Extend LLMs to true in-context prediction over relational/graph-structured data
Challenge: Neural graph architectures lack the structural inductive biases needed for any-task relational reasoning
Take-away: In-context relational prediction requires novel mathematically-grounded architectures
The Need for Graph Foundation Models¶
- Beyond relational algebra + filters: queries can be relational and predictive (link prediction, missing values, predictive ranking, node regression & classification, supply chain predictions).
- The answer isn't stored in the database; it is a predictive query.
- Disruption at location X is predicted to create supply chain delays, ORDER NOW!
- Unseen graph: unseen schema and data uploaded.
- Unknown tasks: no fine-tuning, no training data, no task-specific prompts.
Generalization across graphs and tasks requires graph foundation models
Graph Foundation Models¶
- This talk: Diverse graph tasks
- Beatrice Bevilacqua, Joshua Robinson, Jure Leskovec, Bruno Ribeiro, Holographic node representations: Pre-training task-agnostic node embeddings, ICLR 2025.
Other aspects:
In-context graph learning
- Qian Huang, Hongyu Ren, Peng Chen, Gregor Kržmanc, Daniel Zeng, Percy Liang, Jure Leskovec, PRODIGY: Enabling In-context Learning Over Graphs, NeurIPS 2023.
- Jianan Zhao, Hesham Mostafa, Michael Galkin, Michael Bronstein, Zhaocheng Zhu, and Jian Tang. Graphany: A foundation model for node classification on any graph, ICLR 2025.
OOD node feature spaces
- Yangyi Shen, Jincheng Zhou, Beatrice Bevilacqua, Joshua Robinson, Charilaos Kanatsoulis, Jure Leskovec, Bruno Ribeiro, Zero-shot generalization of GNNS over distinct attribute domains, ICML 2025.
- Moshe Eliasof, Krishna Sri Ipsit Mantri, Beatrice Bevilacqua, Bruno Ribeiro, Carola-Bibiane Schönlieb, All In: Bridging Input Feature Spaces Towards Graph Foundation Models, NeurIPS 2025.
Holographic Node Representations¶
Question. Can we build a single set of pre-stored node embeddings of our graph database that LLMs can use for any type of query:
- node classification,
- link prediction, and
- higher-order tasks (3-node, 4-node, ...)?
- Similar to how Transformer/CLIP embeddings work across downstream tasks?
TL;DR: impossible with fixed-size node embeddings. We will see why, and how holographic node embeddings fix this.
(Bevilacqua et al., ICLR 2025; Srinivasan & Ribeiro, ICLR 2020)
1. The pre-training gap¶
In NLP and vision, one pre-trained representation serves many downstream tasks.
In graphs, practice is the opposite:
| Task (order) | Typical architecture |
|---|---|
| Node classification | GCN, GAT, GraphSAGE, ... |
| Link prediction | SEAL, NBFNet, Neo-GNN, ... |
| Higher-order / motif | Subgraph GNNs, $k$-WL models, ... |
Each AI model is tailored to a task order. Embeddings that are great for one order often fail on another.
Challenge: the barrier is not engineering; it is a fundamental limitation of the neural architecture (the permutation symmetries each task demands).
2. Graph Task Orders¶
Let $G = (\mathbf{A}, \mathbf{X})$ with $n$ nodes. A task of order $r$ makes predictions about sets of $r$ nodes:
- $r = 1$: node classification — predict $y_v$ for each $v$.
- $r = 2$: link prediction — predict $y_{\{u,v\}}$.
- $r = 3$: triangle / triplet prediction (e.g. user–order–product predictions).
- general $r$: joint $r$-node predictions.
Structural representations of order $r$ are equivariant functions $$ f: (\mathbf{A}, \mathbf{X}) \;\mapsto\; \mathbb{R}^{\binom{n}{r} \times d} $$ with $f(\mathbf{A},\mathbf{X})_S = f(\pi\circ\mathbf{A}, \pi\circ\mathbf{X})_{\pi\circ S}$ for all $\pi \in \mathbb{S}_n$.
Storage cost to pre-store representations of every order $r$ for $n$ nodes (worst case at $r \approx n/2$): $$\approx O(n^{(n/2)})$$
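The order-1 case of this equivariance condition can be checked numerically; here is a minimal sketch using node degree as a stand-in structural representation (an illustrative choice, not a model from the talk):

```python
import numpy as np

def degree_embedding(A):
    # Toy order-1 structural representation: each node's degree.
    # Any permutation-equivariant function of A would do here.
    return A.sum(axis=1)

# 4-node path v1-v2-v3-v4.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Equivariance: f(pi . A)_{pi(v)} = f(A)_v for every permutation pi.
P = np.eye(4)[[2, 0, 3, 1]]                # a permutation matrix
assert np.array_equal(degree_embedding(P @ A @ P.T),
                      P @ degree_embedding(A))
```

The same check passes for any graph and any permutation, which is exactly what the displayed condition demands of $f$.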
2. Simple prediction examples¶
A small heterogeneous database:
- 2 customers, 4 orders, 4 products; edges "customer places order" and "order contains product":
- $r = 1$ task. "How many units of product $P_2$ will sell next week?" — regression over a single node.
- $r = 2$ task. "Next week, will the customer cancel product $P_1$ at order $O_2$?" — link prediction (removal), joint over the pair $\{O_2, P_1\}$.
- $r = 3$ task. "Next week, will customer $C_2$ place a new order with product $P_3$?" — joint over the triple $\{C_2, O_{\text{new}}, P_3\}$.
Observation on this graph: $C_1 \simeq C_2$ and $P_1 \simeq P_3$
- Structurally pairwise isomorphic (by the bilateral symmetry of the graph).
- A node embedding that considers only graph structure (e.g., produced by a GNN) must assign $\mathbf{Z}_{C_1} = \mathbf{Z}_{C_2}$ and $\mathbf{Z}_{P_1} = \mathbf{Z}_{P_3}$.
- This is a consequence of structure: equivariance of the node embedding function.
- Note that $(P_1, O_1)$ is a real (observed) edge but $(P_3, O_1)$ is absent.
- Yet edge predictions from node embeddings must agree: $g(\mathbf{Z}_{P_1}, \mathbf{Z}_{O_1}) = g(\mathbf{Z}_{P_3}, \mathbf{Z}_{O_1})$, where $g$ is a neural network.
2.1. Two Types of Node Embeddings¶
2.1.1. Structural Node Embeddings¶
Structural node representations: GNN node embeddings encode a node's topological view
- structurally equivalent (isomorphic) nodes get identical representations.
$$Z_{\text{Lynx}} = Z_{\text{Orca}} \quad \longleftarrow \text{ same local topology, different position}$$
- Colors show node embedding values
2.1.2. Positional Embeddings (as Random Variables)¶
Positional node embeddings provide relative position of nodes
- E.g.: Laplacian eigenvector node embeddings are only defined up to sign (and may have rotation ambiguities), so $\mathbf{Z}_i$ is a random variable with joint distribution:
$$p(\mathbf{Z}_1, \dots, \mathbf{Z}_n \mid \mathbf{A}) \;=\; p(\mathbf{Z}_{\pi(1)}, \dots, \mathbf{Z}_{\pi(n)} \mid \pi \circ \mathbf{A}) \quad \forall\, \pi \in \mathbb{S}_n,$$ where $\pi \circ \mathbf{A}$ is a permutation of the adjacency matrix (isomorphic graph).
Positional embeddings = samples from this joint (e.g., Laplacian eigenvector embeddings)
Structural embeddings = sufficient statistics of its marginals (e.g., GNNs)
Two strategies for handling this randomness.
| Positional embeddings | Structural embeddings | |
|---|---|---|
| What they are | Samples from $p(\mathbf{Z}_1,\dots,\mathbf{Z}_n\mid\mathbf{A})$ | Sufficient statistics of the marginals $p(\mathbf{Z}_i \mid \mathbf{A}),p(\mathbf{Z}_i,\mathbf{Z}_j \mid \mathbf{A}),\ldots$ |
| Encode | Relative global position of each node | Structural role |
| Distinguish | Nodes with identical local structure | Cannot distinguish isomorphic nodes, by design |
| Drawback | Randomness + impossible tasks without multiple samples | Lose positional resolution |
| Examples | Random-sign eigenvectors, random-walk encodings | Degree, eigenvalue moments, distance distributions |
Positional embeddings are draws from the equivariant joint $p(\mathbf{Z}_1,\dots,\mathbf{Z}_n\mid\mathbf{A})$.
Structural embeddings summarise the statistics (e.g., moments, parameters) of pre-defined marginals.
Main challenge: some tasks are impossible with a single positional sample, just as a variance cannot be estimated from a single draw of a random variable.
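To see the randomness concretely, here is a minimal numpy sketch with Laplacian eigenvectors of a 4-node path (the graph is an arbitrary illustrative choice):

```python
import numpy as np

# Unnormalized Laplacian of a 4-node path.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

w, U = np.linalg.eigh(L)

# Flipping the sign of any eigenvector column yields an equally valid
# positional embedding: the "sample" Z is random even though A is fixed.
U_flipped = U * np.array([1, -1, 1, -1])
assert np.allclose(U_flipped @ np.diag(w) @ U_flipped.T, L)
assert not np.allclose(U_flipped, U)
```

Both `U` and `U_flipped` diagonalize the same Laplacian, so an eigensolver may return either; this is the sampling randomness the joint distribution above formalizes.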
3. Structural Representations Designed for Specific Task Size¶
Concrete example — bivariate Gaussian. Let $$ (X_1, X_2) \sim \mathcal{N}\!\left(\boldsymbol{\mu},\; \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right), \qquad \rho \in (-1, 1). $$ Marginals: $X_i \sim \mathcal{N}(\mu_i, 1)$ — they see $\mu_1,\mu_2$ but not $\rho$.
Every $\rho$ produces the same pair of marginals but a different joint.
- The off-diagonal of $\boldsymbol{\Sigma}$ is invisible to the marginals.
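A short numerical check of this point (the evaluation point $(1,1)$ and the values of $\rho$ are arbitrary illustrative choices):

```python
import numpy as np

mu = np.zeros(2)

def joint_cov(rho):
    return np.array([[1.0, rho], [rho, 1.0]])

# The marginal of X_i is N(mu_i, Sigma_ii): it only reads the diagonal,
# so every rho yields the same pair of marginals.
for rho in (-0.5, 0.0, 0.9):
    S = joint_cov(rho)
    assert ((mu[0], S[0, 0]), (mu[1], S[1, 1])) == ((0.0, 1.0), (0.0, 1.0))

# But the joints differ: compare the densities at the point (1, 1).
def density(x, S):
    z = x - mu
    return np.exp(-0.5 * z @ np.linalg.solve(S, z)) / \
        np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(S))

p_indep = density(np.ones(2), joint_cov(0.0))
p_corr  = density(np.ones(2), joint_cov(0.9))
assert not np.isclose(p_indep, p_corr)
```

Positively correlated mass concentrates near the diagonal, so the joint density at $(1,1)$ grows with $\rho$ while the marginals never notice.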
Graph analogue.
- Node classification only needs $\Gamma(\mathbf A)_v$ (marginal statistics) — structural node embeddings.
- Link prediction needs information about the joint $p(\mathbf Z_u, \mathbf Z_v \mid \mathbf A)$ — the "$\rho$" between nodes — which structural node embeddings throw away.
Going from order-$r$ to $r'<r$ is marginalization (easy). Going up is recovering a joint from its marginals — impossible in general.
3.1. Understanding Task Orders ($r$)¶
On a 4-node path $v_1$–$v_2$–$v_3$–$v_4$, ask: does a link exist between $v_1$ and $v_2$? And between $v_1$ and $v_3$?
- Links $(v_1, v_2)$ and $(v_4, v_3)$ are isomorphic — there's an automorphism swapping $v_1\leftrightarrow v_4$, $v_2\leftrightarrow v_3$.
- Links $(v_1, v_2)$ and $(v_1, v_3)$ are not isomorphic — the first is an edge (distance 1), the second is not (distance 2).
Yet any structural node embedding gives $\mathbf Z_{v_2} = \mathbf Z_{v_3}$.
- So any downstream predictor $g(\mathbf Z_{v_1}, \cdot)$ satisfies $$ g(\mathbf Z_{v_1}, \mathbf Z_{v_2}) \;=\; g(\mathbf Z_{v_1}, \mathbf Z_{v_3}). $$ The model is forced to give non-isomorphic links the same prediction. ☠️
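The collapse can be reproduced in a few lines, with 1-WL colour refinement standing in for a most-expressive structural embedding on this graph (an illustrative stand-in, not the talk's model):

```python
# 4-node path v1-v2-v3-v4, 0-indexed.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def wl_colors(adj, rounds=3):
    # 1-WL colour refinement; colours play the role of node embeddings Z.
    colors = {v: len(nbrs) for v, nbrs in adj.items()}
    for _ in range(rounds):
        colors = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                  for v in adj}
    return colors

Z = wl_colors(adj)
# v2 and v3 are isomorphic (reverse the path), so their colours agree ...
assert Z[1] == Z[2]

# ... which forces any pairwise predictor g built on node embeddings to
# score the edge (v1, v2) and the non-edge (v1, v3) identically.
def g(z_u, z_v):
    return hash((z_u, z_v))   # any deterministic function of the inputs
assert g(Z[0], Z[1]) == g(Z[0], Z[2])
```

No choice of `g` can repair this: the inputs to `g` are already identical.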
3.2. (Details) Isomorphic nodes must have equal marginals¶
Marginal for node $v$: $$ p(\mathbf Z_v \mid \mathbf A) \;=\; \int p(\mathbf Z_1,\dots,\mathbf Z_n \mid \mathbf A)\, \prod_{i\ne v}\! d\mathbf Z_i. $$
Claim. If $u \simeq v$ (isomorphic) then $p(\mathbf Z_u \mid \mathbf A) = p(\mathbf Z_v \mid \mathbf A)$.
Proof sketch. $u\simeq v$ means $\exists\,\pi$ with $\pi\circ\mathbf A = \mathbf A$ and $\pi(u)=v$. By equivariance, $$ p(\mathbf Z_1,\dots,\mathbf Z_n \mid \mathbf A) \;=\; p(\mathbf Z_{\pi(1)},\dots,\mathbf Z_{\pi(n)} \mid \mathbf A). $$ Marginalizing both sides over everything except position $v$ (resp. $u$) gives the result. $\square$
Write $\Gamma(\mathbf A)_v$ structural embedding of node $v$. Then $\Gamma$ is $\mathbb S_n$-equivariant and $$ u \simeq v \;\Longrightarrow\; \Gamma(\mathbf A)_u = \Gamma(\mathbf A)_v. $$
If $\Gamma(\mathbf A)_v$ is most expressive, then $$\Gamma(\mathbf A)_u = \Gamma(\mathbf A)_v \;\Longrightarrow\; u \simeq v \;\Longrightarrow\; p(\mathbf Z_u \mid \mathbf A) = p(\mathbf Z_v \mid \mathbf A).$$
Most-expressive structural node representations are sufficient statistics of the marginals of the positional joint.
3.3. (Details) Joint structural representations have a Storage Problem¶
Definition (Joint structural representations). A function $$ f: \mathbf A \;\longrightarrow\; \bigcup_{S \in \mathcal{P}(V)} \mathbb{R}^{|S| \times d}, $$ where $\mathcal{P}(V)$ denotes the power set of $V$, is a joint structural representation if for isomorphic node sets $S_1 \simeq S_2$, $S_1,S_2 \subseteq V$, $$ f(\mathbf A)_{S_1} = f(\mathbf A)_{S_2}. $$
Theorem (Srinivasan & Ribeiro, 2020, informal). Joint structural representations of order $r$ exist and are sufficient to solve any order-$r$ task — including link prediction for $r=2$.
Corollary. Most-expressive structural node representations alone do not solve link prediction; most-expressive joint (order-2) structural representations do.
Goal: we need joint structural representations of every order we care about — and we'd like to pre-store them (compactly).
4. Impossibility for fixed-size pre-trained node embeddings¶
Proposition 2.3 (Bevilacqua et al., informal). For any node embedding model $f: (\mathbf{A},\mathbf{X}) \to \mathbb{R}^{n\times d}$, there exist two tasks of different orders such that at least one is not solvable from $f$.
- True for any fixed $d$ independent of the graph, any architecture
This is a statement about the shape $\mathbb{R}^{n\times d}$ with $d$ fixed, not about any specific architecture.
Group-theoretic reading. Structural node embeddings are invariant under the automorphism group $\text{Aut}(G) \le \mathbb S_n$. So for any $\sigma \in \text{Aut}(G)$ and any downstream neural network $g$, $$ g(\mathbf Z_u, \mathbf Z_v) \;=\; g(\mathbf Z_{\sigma(u)}, \mathbf Z_{\sigma(v)}). $$
Theory: predictions built on node embeddings are constant on products of node orbits, $\text{orbit}(u) \times \text{orbit}(v)$, a strictly coarser partition than the $\text{Aut}(G)$-orbits of pairs, which are the link isomorphism classes we actually need. Hence the collapse on the previous slide.
Takeaway: fixed-size node representations cannot be most-expressive across all task orders.
5. Today's approach: Labeling tricks (joint structural reps, one conditioning at a time)¶
We need order-$r$ joint structural reps. How do existing link-prediction models (SEAL, NBFNet) get them?
Labeling trick (Zhang et al., 2021). To represent the pair $\{u, w\}$ ($r=2$):
- Mark node $u$ with a one-hot feature: $\tilde{\mathbf X}^{(u)} = \mathbf X \oplus \mathbb 1_u$.
- Run a structural GNN on $(\mathbf A, \tilde{\mathbf X}^{(u)})$.
- Read off the embedding of $w$ in the augmented graph: $$ \mathbf Z_w^{(u)} \;=\; \text{GNN}(\mathbf A, \tilde{\mathbf X}^{(u)})_w. $$
Theorem (Zhang et al., 2021, informal). $\mathbf Z_w^{(u)}$ is a most-expressive joint structural representation of the pair $\{u, w\}$.
Why? Marking $u$ collapses $\text{Aut}(G)$ to the stabilizer $\text{Stab}_u = \{\sigma \in \text{Aut}(G) : \sigma(u) = u\}$. The GNN output at $w$ is now equivariant only under this subgroup — part of the automorphism group of the pair $\{u, w\}$, not of $w$ alone.
Labeling tricks compute conditional joint structural representations.
Labeling trick: works only for a pre-set task order $r$; it does not generalize to other task orders.
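A minimal sketch of the labeling trick on the 4-node path from earlier, again with WL colour refinement standing in for the GNN (an illustrative stand-in): marking $u = v_1$ breaks the symmetry that merged $v_2$ and $v_3$.

```python
# Path v1-v2-v3-v4, 0-indexed; represent the pair {u, w} by marking u.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def wl_colors_marked(adj, u, rounds=3):
    # The initial colour (degree, is-this-the-marked-node?) plays the
    # role of the one-hot augmentation X ⊕ 1_u.
    colors = {v: (len(adj[v]), v == u) for v in adj}
    for _ in range(rounds):
        colors = {v: (colors[v], tuple(sorted(colors[w] for w in adj[v])))
                  for v in adj}
    return colors

Zu = wl_colors_marked(adj, u=0)
# Conditioned on u = v1, the views of v2 and v3 now differ, so the edge
# {v1, v2} and the non-edge {v1, v3} become distinguishable -- at the
# price of one GNN pass per conditioning.
assert Zu[1] != Zu[2]
```

This is exactly the conditional joint representation $\mathbf Z_w^{(u)}$ above: expressive for this pair, but tied to the choice of $u$.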
5.1. Labeling Trick Space Complexity (link prediction): The $n \times (n-1)$ tensor¶
One conditioning on $u$ yields $n-1$ joint pair embeddings at once: $$ \underbrace{\bigl\{\mathbf Z_w^{(u)}\bigr\}_{w \neq u}}_{n - 1 \text{ joint reps } \{u, w\}} \;\;\longleftarrow\;\; \text{one GNN pass with } u \text{ marked}. $$
To materialize the full order-2 joint structural tensor, repeat for every $u$: $$ u = 1, 2, \dots, n \;\Longrightarrow\; n \cdot (n-1) \text{ joint pair embeddings total.} $$
Generalizing to order $r$:
| Conditioning | What you get (per pass) | # fwd passes |
|---|---|---|
| Mark 1 node ($r=2$) | $n-1$ order-2 joint reps | 1 |
| All $n$ nodes ($r=2$) | Full $n(n-1)$ tensor | $n$ |
| Mark $r-1$ nodes (order $r$) | $n - r + 1$ order-$r$ joint reps | 1 |
| All markings (order $r$) | Full order-$r$ tensor | $\mathcal{O}(n^{r-1})$ |
Labeling tricks are simply the brute-force computation of joint structural reps — one conditioning at a time.
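The pass and storage counts in the table can be sanity-checked in a few lines (`labeling_trick_cost` is a hypothetical helper for this back-of-envelope count, not from the paper):

```python
from math import comb

def labeling_trick_cost(n, r):
    # One forward pass per choice of r-1 marked nodes; each pass yields
    # n - r + 1 order-r joint representations.
    passes = comb(n, r - 1)
    stored = passes * (n - r + 1)
    return passes, stored

# Order 2: n passes, n(n-1) pair embeddings -- the full order-2 tensor.
assert labeling_trick_cost(100, 2) == (100, 9900)

# Order 3: already O(n^{r-1}) = O(n^2) forward passes.
passes3, _ = labeling_trick_cost(100, 3)
assert passes3 == comb(100, 2) == 4950
```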
5.1.2. Labeling Trick: Storage Challenge¶
Total memory to store a labeling-trick model's output:
| Task order $r$ | Storage for the joint tensor |
|---|---|
| $r = 1$ (node) | $\mathcal{O}(n \, d)$ |
| $r = 2$ (link) | $\mathcal{O}(n^2 \, d)$ |
| $r = 3$ (triplet) | $\mathcal{O}(n^3 \, d)$ |
| general $r$ | $\mathcal{O}(n^r \, d)$ |
Worse: each conditioning gives a different set of embeddings, and they cannot be factored back into a single fixed-size per-node table (by Prop. 2.3). Two unpleasant options:
- Store the full tensor — $\mathcal{O}(n^r d)$ memory, infeasible for $r \ge 2$ on large graphs.
- Recompute at query time — no pre-training artifact to hand off to a downstream task.
Labeling tricks solve expressivity by sacrificing the very property that makes pre-training valuable: re-usability of a compact embedding across tasks.
6. The Desiderata¶
We want:
- ✅ Pre-computed node representations.
- ✅ Adaptable to any task order $r \ge 1$.
- ✅ Sufficiently expressive at every order.
We have seen:
- Fixed-size node embeddings $\mathbb{R}^{n\times d}$: small but not expressive across orders (Prop. 2.3).
- Labeling-trick embeddings: expressive but $\mathcal{O}(n^r d)$ space needed for order $r$ tasks.
There is a gap in the middle. Holographic representations live in this gap.
6.1. Solution: Holographic representations¶
The key insight.
Positional representations are more powerful than structural ones, if we can remove the randomness. Canonicalizing positional samples recovers joint structural representations of any order.
Holographic node representations do exactly this.
- In the early days of neural networks, Feldman (1986) and Hinton (1990) proposed holographic representations that distribute a concept across multiple views rather than encoding it in a single one.
Holographic node representations: Instead of a single fixed-size vector per node, give each node a sequence of $T$ vectors, each computed from a different symmetry-broken "view" of the graph.
These $T$ views are derandomized samples of the positional joint; the reduction map averages out the sampling randomness in a controlled way.
Holographic node embeddings form a 3-axis tensor $\mathbf E_\theta(\mathbf A, \mathbf V_1) \in \mathbb{R}^{n \times T \times d_e}$:
- one slice per node ($n$)
- one slice per symmetry-breaking view, where the number of views $T$ depends on the graph, and
- a $d_e$-dimensional feature vector in each cell.
Pre-computed once, shared across all tasks and orders.
6.2. Two maps¶
Holographic representations factor into two learnable maps:
(1) Expansion map — symmetry-free, high-dim: $$ E_\theta: (\mathbf{A}, \mathbf V_1) \;\longrightarrow\; \underbrace{\mathbf V \in \mathbb{R}^{n \times T \times d_e}}_{\text{T views per node}} \; + \; \Lambda \text{ (labels recording which symmetry each view breaks)}. $$
(2) Reduction map — task-specific, light-weight: $$ R_\psi: \mathbb{R}^{n \times T \times d_e} \;\longrightarrow\; \mathbb{R}^{\binom{n}{r}\times d_r}. $$
Composition $R_\psi \circ E_\theta$ is structural at the target order $r$: $$ \pi \circ (R_\psi \circ E_\theta)(\mathbf A,\mathbf V_1) \;=\; (R_\psi \circ E_\theta)(\pi\circ\mathbf A, \pi\circ \mathbf V_1). $$
Pre-training $E_\theta$ once; learn a small $R_\psi$ per new task.
6.3. Symmetry Breaking Expansion: Generating the $T$ Views¶
One breaking step — mark node $v_t$ with a one-hot indicator, re-run GNN: $$ \mathbf{V}_t = f_\lambda\!\left(\mathbf{A},\; \mathbf{V}_1 \oplus \mathbb{1}_{v_t}\right) $$
Group-theoretic effect: each marking collapses the automorphism group to a stabilizer, $$ \operatorname{Aut}(G) \;\longrightarrow\; \operatorname{Stab}_{v_t} = \{\sigma \in \operatorname{Aut}(G) : \sigma(v_t)=v_t\}, $$ and sequential breakings $v_1, v_2, \dots, v_T$ shrink it further — eventually to trivial (Albertson–Collins, 1996).
| Role | |
|---|---|
| $v_1,\dots,v_T$ | Breaking nodes → distinct views |
| $h_\lambda(\mathbf{A}, \mathbf{V}_1)$ | Selector: picks which nodes to break (degree, learned score, Gumbel) |
Thm. 4.3. Sequential breaking lets $E_\theta$ exactly express canonical eigenvectors, subsuming Laplacian embeddings while eliminating sign/basis ambiguity (cheaper parallel variants trade away some of this expressiveness; see §7.3).
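A toy numpy sketch of the expansion: mean-aggregation message passing stands in for $f_\lambda$ and degree-based selection for $h_\lambda$ (both are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def simple_gnn(A, X, layers=2):
    # Minimal mean-aggregation message passing, standing in for f_lambda.
    A_hat = A + np.eye(len(A))
    A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)
    H = X
    for _ in range(layers):
        H = np.tanh(A_hat @ H)
    return H

def expansion(A, X, T):
    # Sketch of E_theta: one symmetry-broken view per marked node,
    # stacked into an (n, T, d_e) tensor. Breaking nodes are chosen by
    # degree here; the paper's selector h_lambda can be learned.
    n = len(A)
    breakers = np.argsort(-A.sum(axis=1), kind="stable")[:T]
    views = []
    for v_t in breakers:
        mark = np.zeros((n, 1)); mark[v_t] = 1.0     # one-hot 1_{v_t}
        views.append(simple_gnn(A, np.concatenate([X, mark], axis=1)))
    return np.stack(views, axis=1)

# 4-node path, constant input features.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
E = expansion(A, np.ones((4, 1)), T=2)
assert E.shape == (4, 2, 2)                          # (n, T, d_e)
# Holographic property: the isomorphic nodes v2, v3 differ along the
# view axis, even though any single structural embedding merges them.
assert not np.allclose(E[1], E[2])
```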
6.4. The reduction map, at a glance¶
The expansion produced $T$ views indexed by $t$. Each $t$ belongs to a breaking group $\lambda \in \Lambda$
- The rule that picked the broken nodes (e.g., $\lambda=1$ = "highest-degree nodes", $\lambda=2$ = "second-highest", …).
- Views in the same group are interchangeable; views from different groups carry different kinds of information.
For a target $r$-set $\{v_1, \dots, v_r\}$, the reduction map runs in two steps.
Step 1 — pool nodes within a view (reintroduce order-$r$ symmetry): $$ \mathbf U_{\lambda, t} \;=\; \phi_\lambda\!\left(\{\mathbf V_{t,u}\}_{u \in \{v_1,\dots,v_r\}}\right), \qquad \phi_\lambda \text{ permutation-invariant in the } r \text{ nodes.} $$
Step 2 — pool views within a breaking group (kill the arbitrariness of which node was broken): $$ \mathbf U_{\lambda,\{v_1,\dots,v_r\}} \;=\; \rho_\lambda\!\left(\{\mathbf U_{\lambda,t}\}_{t\,:\,\Lambda_t = \lambda}\right), \qquad \rho_\lambda \text{ permutation-invariant in } t. $$
Finally, concatenate across $\lambda$ — order between groups is preserved.
Within group $\lambda$: invariant ⇒ output is structural. Between groups: ordered ⇒ output is expressive.
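The two pooling steps can be sketched directly on a random expansion tensor; sum and mean stand in for $\phi_\lambda$ and $\rho_\lambda$ (illustrative choices, since any permutation-invariant pooling qualifies):

```python
import numpy as np

def reduction(E, nodes, groups):
    # Sketch of R_psi for a target r-set. E: (n, T, d) expansion tensor,
    # groups: length-T array of breaking-group labels (the lambda axis).
    # Step 1 (phi): pool over the r target nodes inside each view.
    per_view = E[list(nodes)].sum(axis=0)               # (T, d)
    # Step 2 (rho): pool over interchangeable views within each group,
    # then concatenate the groups in a fixed order.
    return np.concatenate([per_view[groups == lam].mean(axis=0)
                           for lam in sorted(set(groups.tolist()))])

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 3, 2))                          # n=4, T=3, d=2
groups = np.array([0, 0, 1])

z = reduction(E, (1, 2), groups)
assert np.allclose(z, reduction(E, (2, 1), groups))     # invariant in the pair
assert z.shape == (4,)                                  # 2 groups x d, ordered
```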
6.5. The $\lambda$-Partition as a Spectral Partition¶
Core idea: The invariance structure of $\rho_\lambda$ mirrors the graph's eigenspace structure (Thm. 4.3).
| Within $\lambda$ | Between $\lambda$'s | |
|---|---|---|
| Spectral meaning | Same eigenspace | Different eigenvalues |
| Ambiguity | Basis choice is arbitrary | Ordering carries real info |
| Reduction map | Permutation-invariant $\rho_\lambda$ | Concatenate (preserve order) |
$$\underbrace{\text{invariant within } \lambda}_{\text{eigenspace basis ambiguity}} \;\Big|\; \underbrace{\text{ordered between } \lambda\text{'s}}_{\text{spectral multiplicity}}$$
Connection to SignNet (Lim et al., 2022; 2023): HoloGNN gets sign/basis canonicalization for free from the reduction map — no explicit quotient needed.
7. Why the size of $T$ matters¶
Key question: how large is $T$?
- If $T = \mathcal{O}(n)$: we are back to $\mathcal{O}(n^2 d_e)$ storage — no better than labeling tricks.
- If $T$ can be $\ll n$: we win.
Fact (paper, and theory of graph symmetry breaking, Albertson–Collins 1996). For almost all graphs, a small number of carefully chosen breakings suffices to canonicalize all nodes. Empirically:
| Dataset | $T^\star$ used |
|---|---|
| Planetoid (CORA, Citeseer, PubMed) | 8 |
| MovieLens | 8 |
| RelBench (38M nodes!) | 10 |
So in practice $T$ is a small constant — not a function of $n$.
7.1 Storage comparison¶
Let $k$ = max task order you care about.
| Approach | Pre-stored size | Works for order $\le k$? |
|---|---|---|
| Fixed-size GNN node embedding | $\mathcal{O}(n d)$ | Only $r=1$ |
| SEAL-style labeling trick | $\mathcal{O}(n^r d)$ per order | Yes, but not pre-storable |
| Holographic (HoloGNN) | $\mathcal{O}(n \cdot T \cdot d_e)$, $T = \text{const}$ | Yes, any $r$ — with caveats (next slide) |
$\mathcal{O}(n \cdot T \cdot d_e)$ is linear in $n$ — the same scaling as a fixed-size node embedding, with a small constant blow-up. And the same stored tensor serves node classification, link prediction, and higher-order tasks, by swapping out the reduction map.
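The scaling claim in one back-of-envelope computation ($d = d_e = 128$ and $T = 10$ are assumed round numbers; only $n$ comes from the table):

```python
# Entry counts for the three rows of the table, at RelBench scale.
n, d, T = 38_000_000, 128, 10

fixed_node  = n * d        # fixed-size node embeddings: O(n d)
holographic = n * T * d    # T-view holographic tensor:  O(n T d)
pairwise    = n * n * d    # order-2 labeling trick:     O(n^2 d)

assert holographic == T * fixed_node           # a constant 10x blow-up
assert pairwise // holographic == n // T       # vs. a 3,800,000x blow-up
```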
7.2 Why this escapes Prop. 2.3¶
Prop. 2.3 is about models producing output in $\mathbb{R}^{n\times d}$. Holographic representations live in $\mathbb{R}^{n \times T \times d_e}$ — where $T$ is variable, a function of the graph.
The probabilistic picture makes this clearer:
| Object | What it captures | Suffices for |
|---|---|---|
| Fixed-size structural node emb. | Marginal parameters $\Gamma(\mathbf A)_v$ | Order $r=1$ |
| Single positional sample | One random draw from the joint | Nothing canonically |
| $T$ views + reduction map | Joint structural reps of any order | **Any $r\ge 1$*** |
- Holographic property: isomorphic nodes $u \simeq v$ may have different $T$-view representations along the $T$ axis
- Each view corresponds to a different element of the automorphism orbit.
- The reduction map $R_\psi$ then aggregates permutation-invariantly within each breaking group, restoring structurality at the output while the joint information from different views has already been combined.
Maximal expressivity per Def. 2.2 requires sequential breakings, which can make $T^\star$ combinatorially large; see next slide.
We split graph tasks into two jobs that prior models conflated: capturing joint node information (expansion) and enforcing task-order symmetries (reduction).
7.3 Caveat: Parallel Breaking Has Expressivity Limits¶
Practical HoloGNN uses single-node parallel breaking with $T^\star \le 10 \ll n$. Here is what the paper does — and does not — prove.
Expressivity ladder (Sec. 4.3, App. B.3):
| Variant | $T^\star$ | Guarantee |
|---|---|---|
| Single-node parallel | $< n$ | Links only ($r=2$) — Thm. 4.5 |
| $(r{-}1)$-node parallel | $\binom{n}{r-1}$ | Order-$r$ structural — Thm. B.4 |
| Sequential (Lanczos) | $< n$ | Canonical eigenvectors, any $r$ — Thm. 4.3 |
| Practical HoloGNN | $\le 10$ | No worst-case maximality guarantee |
Two gaps to keep in mind:
Higher-order tasks ($r \ge 3$): Thm. 4.5 covers only $r=2$. Guaranteeing order-$r$ expressivity requires $(r{-}1)$-node markings (Thm. B.4) — restoring $\mathcal{O}(n^{r-1})$ cost. No free lunch.
Links ($r=2$), small $T^\star$: Thm. 4.5 needs $T^\star = n$. On highly symmetric graphs (regular, strongly-regular), a small $T^\star$ may fail to collapse $\operatorname{Aut}(G)$, and expressivity degrades.
Bottom line: HoloGNN trades worst-case expressivity for scalability. Empirically the approximation is benign (Planetoid, MovieLens, RelBench); on adversarial graphs, the trade-off bites.
8. Empirical validation — one highlight¶
CORA: pre-train on link prediction, adapt to node classification.
| Model | Drop in accuracy |
|---|---|
| GCN | $\approx 7\%$ |
| GPS graph transformer | $\approx 41\%$ |
| NBFNet / SEAL | up to $41\%$ |
| HoloGNN | $\approx 1.5\%$ |
Holographic embeddings really do transfer across orders.
And on RelBench (38M-node industrial graph) HoloGNN is the only pre-trainable model in the comparison that handles triplet-level tasks at all.
References¶
- Bevilacqua, Robinson, Leskovec, Ribeiro. Holographic Node Representations: Pre-training Task-Agnostic Node Embeddings. ICLR 2025.
- Srinivasan & Ribeiro. On the equivalence between positional node embeddings and structural graph representations. ICLR 2020.
- Zhang, Li, Xia, Wang, Jin. Labeling trick: A theory of using graph neural networks for multi-node representation learning. NeurIPS 2021.
- Zhu, Zhang, Xhonneux, Tang. Neural Bellman-Ford Networks: a general graph neural network framework for link prediction (NBFNet). NeurIPS 2021.
- Zhang & Chen. Link prediction based on graph neural networks (SEAL). NeurIPS 2018.
- Albertson & Collins. Symmetry breaking in graphs. Electronic J. Combinatorics, 1996.
- Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. 1950.
- Belkin & Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 2003.
- Lim, Robinson, Zhao, Smidt, Sra, Maron, Jegelka. Sign and basis invariant networks for spectral graph representation learning (SignNet). ICLR 2022.
- Lim, Robinson, Jegelka, Maron. Expressive sign equivariant networks for spectral geometric learning. NeurIPS 2023.
- Rampášek, Galkin, Dwivedi, Luu, Wolf, Beaini. Recipe for a general, powerful, scalable graph transformer (GPS). NeurIPS 2022.
- Feldman. Neural representation of conceptual knowledge. 1986.
- Hinton. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 1990.