The adjacency matrix $A \in \mathbb{A}^{n \times n}$ is a square matrix whose elements indicate whether pairs of vertices have a relationship in some appropriate space $\mathbb{A}$.
Assumption:
Theory:
A sequence of $n^2$ random variables $(X_1,\ldots,X_{n^2}) \in \mathbb{A}^{n^2}$ is said to be a graph whenever its distribution has the following property: $$ P(X_1,\ldots,X_{n^2}) = P(\pi \circ (X_1,\ldots,X_{n^2})), $$ for all $\pi \in \Pi$, where $\Pi \subseteq \mathbb{S}_{n^2}$ is a special subset of permutations of the symmetric group $\mathbb{S}_{n^2}$ on $n^2$ elements.
More specifically, $\Pi$ is defined such that there exists a bijection between the $n^2$ elements of the sequence $(X_1,\ldots,X_{n^2})$ and the entries of the adjacency matrix $A_{i,j}$, $i,j \in V$, and the following condition holds: $$ P(A_{1,1},A_{1,2},\ldots,A_{n,n}) = P(A_{\pi \circ 1, \pi \circ 1},A_{\pi \circ 1,\pi \circ 2},\ldots,A_{\pi \circ n,\pi \circ n}), \quad \forall \pi \in \mathbb{S}_n. $$
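To make this concrete, below is a minimal sketch (illustrative code, not part of the lecture material) of how every node permutation $\pi \in \mathbb{S}_n$ induces an element of $\Pi \subseteq \mathbb{S}_{n^2}$ acting on the positions of $\text{vec}(A)$:

# Each node permutation pi in S_n jointly permutes rows and columns of A,
# and therefore induces a permutation of the n^2 positions of vec(A):
# these induced permutations form the subgroup Pi of S_{n^2}.
import itertools
import numpy as np

n = 3
A = np.arange(n * n).reshape(n, n)           # dummy "adjacency" with distinct entries 0..n^2-1

Pi = []
for pi in itertools.permutations(range(n)):  # all n! node permutations
    pi = np.array(pi)
    A_perm = A[pi][:, pi]                    # jointly permute rows and columns
    Pi.append(A_perm.flatten())              # the induced reordering of the n^2 positions

print(len(Pi))                               # n! = 6 elements, a tiny subgroup of S_9 (|S_9| = 362880)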
Q: Consider the graph
Define $\text{vec}(A) = (0,0,0,0,1,\ldots,0,1,1,1)$. Consider a subgroup $\Pi \subseteq \mathbb{S}_{n^2}$ (a subgroup is a subset of a group that also forms a group) defined by some unknown bijection between the elements in the sequence and the edges in the adjacency matrix $A$. All we know about $\Pi$ is that for some $\pi \in \Pi$ we obtain $\pi \circ \text{vec}(A) = (0,0,0,0,0,\ldots,1,1,1,1)$.
Mark the only permutation performed by $\pi' \in \mathbb{S}_{n^2} \backslash \Pi$.
Assume we have a function $f: \mathbb{A}^{n\times n} \times V \to \mathbb{R}^d$ that produces a representation of node $v \in V$ in graph ${\bf A}$.
Let's consider the Arctic food web graph below.
Definition (Structural node representations): Structural node representations of an $n\times n$ adjacency matrix ${\bf A}$ (or $n\times n \times k$, $k \geq 2$, adjacency tensor ${\bf A}$) are obtained from a function $f : \mathbb{R}^{n\times n} \to \mathbb{R}^{n \times d}$ where $$ u \simeq v \implies (f({\bf A}))_{v,\cdot} = (f({\bf A}))_{u,\cdot},$$ where $(f({\bf A}))_{u,\cdot}$ denotes the row of $f({\bf A})$ corresponding to vertex $u$.
Note that structural node representations are equivariant to permutations of the adjacency matrix $\bf A$: $$\forall \pi \in \mathbb{S}_n \quad \quad f(\pi \circ {\bf A}) = \pi \circ f({\bf A}),$$ where $\pi$ applied to ${\bf A}$ jointly permutes rows and columns, while applied to $f({\bf A})$ it only permutes the rows.
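As a quick illustration (a sketch with a hand-picked $f$, not the lecture code), the node degrees form a structural representation, and we can check the equivariance property numerically:

# Check f(pi o A) = pi o f(A) for a simple structural representation: node degrees.
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1)
A = A + A.T                                   # random simple undirected graph

def f(A):                                     # structural node representation: degrees
    return A.sum(axis=1, keepdims=True)       # shape (n, 1), one row per node

pi = rng.permutation(n)
A_perm = A[pi][:, pi]                         # pi o A: jointly permutes rows and columns
print(np.array_equal(f(A_perm), f(A)[pi]))    # True: pi o f(A) permutes only the rows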
Definition (Most Expressive Structural node representations):
We say the structural representations are most expressive iff:
$$v \simeq u \Leftrightarrow f({\bf A})_{u,\cdot} = f({\bf A})_{v,\cdot},$$
that is, $f$ gives different embeddings to $u,v \in V$ if they are non-isomorphic, and the same embeddings if $u,v$ are isomorphic.
Graph neural networks (GNNs): GNNs output structural node embeddings.
(Murphy et al., 2019b) gave the following (then) surprising result. The work defined Relational Pooling, a neural representation defined as $$ \Gamma(A,i) = \frac{1}{n!} \sum_{\pi \in \mathbb{S}_n} f(\pi \circ A, \pi \circ i), $$ where $f: \mathbb{A}^{n^2} \times V \to \mathbb{R}^d$ is an arbitrary learnable function (e.g., neural network) that outputs a $d$-dimensional node representation of node $i\in V$ in graph $A \in \mathbb{A}^{n^2}$.
Theorem 2.1 (Murphy et al., 2019b) If node and edge attributes come from a finite set and $f: \mathbb{A}^{n^2} \times V \to \mathbb{R}^d$ is a universal function approximator, then $$ \Gamma(A,i) = \frac{1}{n!} \sum_{\pi \in \mathbb{S}_n} f(\pi \circ A, \pi \circ i) $$ is a most-expressive graph representation of node $i\in V$ in graph $A \in\mathbb{A}^{n^2}$.
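For intuition, here is a brute-force sketch of $\Gamma$ on a tiny graph, with a hand-made $f$ standing in for the neural network (illustrative only; in practice the sum over $\mathbb{S}_n$ is approximated, e.g., by sampling permutations):

# Brute-force Relational Pooling on a 3-node path; f is an arbitrary stand-in
# for a learnable function and is NOT permutation invariant by itself.
import itertools
from math import factorial
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])                     # path graph 0 - 1 - 2
n = A.shape[0]

def f(A, i):
    return A[i] @ np.arange(n) + A[i].sum()   # depends on the labelling of the nodes

def relational_pooling(A, i):
    out = 0.0
    for pi in itertools.permutations(range(n)):
        pi = np.array(pi)
        A_pi = A[pi][:, pi]                   # pi o A
        i_pi = int(np.where(pi == i)[0][0])   # position of node i after the permutation
        out += f(A_pi, i_pi)
    return out / factorial(n)                 # average over all n! permutations

print(relational_pooling(A, 0), relational_pooling(A, 2))   # equal: the two end nodes are isomorphic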
Using $k$-ary representations in Relational Pooling is equivalent to using subgraphs to represent a graph.
Graph $k$-reconstruction conjecture. Kelly (1957) generalizes the graph reconstruction conjecture by considering the multiset of induced subgraphs of size $k$, which we denote $D_k(G) = \{\{I(H): H \in S^{(k)}(G)\}\}$, where $S^{(k)}(G)$ is the set of induced subgraphs of $G$ on the $\binom{n}{k}$ $k$-size subsets of $V$ and $I(H)$ denotes the isomorphism class of $H$. We often call an element of $D_k(G)$ a $k$-card and $D_k(G)$ the $k$-deck. Let $G$ and $H$ be graphs; then $H$ is a $k$-reconstruction of $G$ if $H$ and $G$ have the same $k$-deck, denoted $H \sim_k G$. A graph $G$ is $k$-reconstructible if every $k$-reconstruction of $G$ is isomorphic to $G$, i.e., $H \sim_k G$ implies $H \cong G$.
Results for $k$-reconstruction usually state the least $k$ as a function of $n$ such that all graphs $G$ in $\mathcal{G}$ (or some subset) are $k$-reconstructible (Nydl, 2001). There exist extensive partial results in this direction, mostly describing $k$-reconstructibility (as a function of $n$) for particular families of graphs, such as trees, disconnected graphs, complete multipartite graphs, and paths. More concretely, Nydl (2001) and Spinoza and West (2019) exhibited graphs with $2k$ vertices that are not $k$-reconstructible. In practice, these results imply that for any fixed $k$ there are graphs with not many more vertices than $k$ that are not $k$-reconstructible. Further, Manvel (1974) and Taylor (1990) studied which graph functions, such as the degree sequence and connectedness, can be recovered from the $k$-deck, depending on the size of $k$. Please refer to Cotta et al. (2021) for a more complete discussion.
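As a small illustration of these objects, the following sketch computes $k$-decks with networkx, using WL hashes of the $k$-cards as a heuristic isomorphism label (exact for such tiny subgraphs):

# D_k(G): multiset of (isomorphism classes of) induced k-node subgraphs.
from collections import Counter
from itertools import combinations
import networkx as nx

def k_deck(G, k):
    deck = Counter()
    for S in combinations(G.nodes(), k):                # all (n choose k) k-subsets of V
        H = G.subgraph(S)                               # induced subgraph: a "k-card"
        deck[nx.weisfeiler_lehman_graph_hash(H)] += 1
    return deck

G = nx.path_graph(4)                                    # P4: 4 nodes, 3 edges
H = nx.star_graph(3)                                    # K_{1,3}: 4 nodes, 3 edges
print(k_deck(G, 2) == k_deck(H, 2))                     # True: same 2-deck, yet G and H are non-isomorphic
print(k_deck(G, 3) == k_deck(H, 3))                     # False: the 3-deck tells them apart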
Let's start with a CNN filter:
Note that the CNN filter is applied to a constant number of inputs in the last layer.
Topologically, an image is a lattice and the convolution is a filter applied over a patch of the lattice
Then, we apply this convolution in layers, multiple times. Note that each pixel in the previous layer has a convolution representation in the next layer.
Note that in CNNs:
Today, we are going to explore convolutions over more complex topologies.
Let $\bf A$ represent an arbitrary simple (no loops) undirected social network graph. We now illustrate how a convolution should act over the topological space of Facebook friends:
The graph neural layer applied to Bruno will act on his neighbors
The graph neural layer applied to one of his friends will act on her neighbors
The representation ${\bf h}^{(k)}_v \in \mathbb{R}^{1 \times d}$, $d \in \mathbb{N}$, at layer $k$ of node $v \in V$ can be described through:
The resulting filter recursion is: $$ {\bf h}^{(k)}_v = \sigma\left( {\bf h}^{(k-1)}_v {\bf W}_\text{self}^{(k)} + \text{Pooling}\left(({\bf h}^{(k-1)}_u)_{u \in \mathcal{N}_v}\right) {\bf W}_\text{neigh}^{(k)} + {\bf b}^{(k)} \right), $$ where ${\bf W}_\text{self}^{(k)} \in \mathbb{R}^{d' \times d}$, ${\bf W}_\text{neigh}^{(k)} \in \mathbb{R}^{d' \times d}$, ${\bf b}^{(k)} \in \mathbb{R}^{1 \times d}$, with $d' \in \mathbb{N}$ the dimension of the previous layer's representations (so ${\bf h}^{(k-1)}_v \in \mathbb{R}^{1 \times d'}$), $\mathcal{N}_v$ the set of neighbors of $v$, and $\text{Pooling}$ a function whose output does not depend on the order of its inputs.
Given a sequence (input vector) of arbitrary length $n \geq 1$ $$ {\bf x} = (x_1,\ldots,x_n) $$ a permutation $\pi$ over the integers $\{1, 2,\ldots,n\}$ is such that ${\bf x}_\pi = (x_{\pi_1},\ldots,x_{\pi_n})$ is the correspondingly permuted version of ${\bf x}$.
Pooling is such that $$ \text{Pooling}({\bf x}) = \text{Pooling}(\pi \circ {\bf x}), \quad \forall \pi \in \mathbb{S}_n, $$
See our lecture on set representations for a complete list of possible Pooling functions.
Sum pooling example:
Using sum pooling, the resulting filter recursion is: $$ {\bf h}^{(k)}_v = \sigma\left( {\bf h}^{(k-1)}_v {\bf W}_\text{self}^{(k)} + \sum_{u \in \mathcal{N}_v} {\bf h}^{(k-1)}_u {\bf W}_\text{neigh}^{(k)} + {\bf b}^{(k)} \right), $$ where again $\mathcal{N}_v$ denotes the neighbors of node $v$ in the graph.
This variant is commonly known as GraphConv in the literature, and has been thoroughly studied in Morris et al., 2019.
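A minimal NumPy sketch of this sum-pooling recursion (shapes as assumed below; this is not the PyTorch Geometric implementation we will use later):

# One GraphConv-style layer with sum pooling and a ReLU nonlinearity.
import numpy as np

def graph_conv_layer(A, H, W_self, W_neigh, b):
    """A: (n, n) adjacency, H: (n, d') previous-layer features,
    W_self, W_neigh: (d', d), b: (1, d)."""
    neigh_sum = A @ H                                   # row v holds sum_{u in N_v} h_u
    return np.maximum(0.0, H @ W_self + neigh_sum @ W_neigh + b)   # sigma = ReLU

rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])                            # a 3-node star
H0 = np.ones((3, 2))                                    # constant initial features
H1 = graph_conv_layer(A, H0, rng.normal(size=(2, 4)), rng.normal(size=(2, 4)), rng.normal(size=(1, 4)))
print(np.allclose(H1[1], H1[2]))                        # True: the two leaves are isomorphic nodes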
Most existing Graph Neural Networks are a variant of the above recursion.
Graph Attention Networks (GATs) (Veličković et al., 2018) replace ${\bf h}^{(k-1)}_u$ inside the pooling with $\alpha({\bf h}^{(k-1)}_v,{\bf h}^{(k-1)}_u) {\bf h}^{(k-1)}_u$, where $\alpha(\cdot,\cdot) \in (0,1)$ is an attention mechanism (Bahdanau et al., 2015).
More generally, an edge function can also be added before the set representation function, to better represent the connection between $v$ and its neighbors: $$ {\bf h}^{(k)}_v = \sigma\left( {\bf h}^{(k-1)}_v {\bf W}_\text{self}^{(k)} + \text{Pooling}\!\left(\left(\text{edgefunc}\left({\bf h}^{(k-1)}_v,{\bf h}^{(k-1)}_u\right)\right)_{u \in \mathcal{N}_v}\right) {\bf W}_\text{neigh}^{(k)} + {\bf b}^{(k)} \right), $$ where GAT is just one such example.
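A compact sketch of this edge-function variant, with a simplified attention-style edgefunc (the actual GAT uses a LeakyReLU scoring function, a softmax over the neighbourhood, and multiple heads):

# Edge-function message passing:
# h_v <- sigma(h_v W_self + (sum_{u in N_v} edgefunc(h_v, h_u)) W_neigh + b).
# Shapes: H: (n, d'), W_self/W_neigh: (d', d), b: (d,), a: (2 * d',).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_func(h_v, h_u, a):
    alpha = sigmoid(a @ np.concatenate([h_v, h_u]))     # attention-style weight in (0, 1)
    return alpha * h_u

def gnn_layer(A, H, W_self, W_neigh, b, a):
    out = []
    for v in range(A.shape[0]):
        neigh = np.where(A[v] > 0)[0]
        pooled = sum((edge_func(H[v], H[u], a) for u in neigh), np.zeros(H.shape[1]))   # sum pooling
        out.append(np.maximum(0.0, H[v] @ W_self + pooled @ W_neigh + b))               # sigma = ReLU
    return np.vstack(out)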
Definition (Most Expressive function): A function is most expressive if it gives the same representation to isomorphic graphs and can give different representations to non-isomorphic graphs.
Q: Are GNNs most-expressive representations for graphs?
A: No, there exist pairs of non-isomorphic graphs that cannot be distinguished by standard GNNs (Xu et al., 2019).
Independently of the learning procedure, certain non-isomorphic graphs will get the same representation.
Example 1: Circular Skip Link (CSL) task (Murphy et al., 2019)
We will see that we can quantify the expressive power of GNNs in terms of the WL test, a well-known heuristic used in graph theory to test graph isomorphism.
Recall our desideratum, which is not obtained by standard GNNs: different representations to non-isomorphic graphs.
Obviously, if a network is able to test for graph isomorphism, then it is able to assign a unique representation to each isomorphism class.
The graph isomorphism problem is a candidate NP-intermediate problem (i.e., it is not known to be solvable in polynomial time, nor known to be NP-complete).
The WL test is a combinatorial heuristic to test graph isomorphism.
Figure credit: Expressive power of graph neural networks and the Weisfeiler-Lehman test.
Theorem (Xu et al. 2019, Morris et al. 2019): the expressive power of GNNs is upper bounded by the WL test.
Remark: since there exist pairs of non-isomorphic graphs that are undistinguishable by the WL test, there exist pairs of non-isomorphic graphs that are undistinguishable by GNNs.
GNNs are at most as expressive as the WL test in distinguishing non-isomorphic graphs.
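Before the GNN example, here is a compact sketch of the 1-WL colour-refinement heuristic itself (illustrative; later we will use the networkx implementation):

# 1-WL: iteratively hash each node's colour together with the multiset of its
# neighbours' colours, then compare the colour histograms of the two graphs.
import networkx as nx

def wl_colour_histogram(G, iterations=3):
    colours = {v: 0 for v in G}                          # uniform initial colours
    for _ in range(iterations):
        colours = {
            v: hash((colours[v], tuple(sorted(colours[u] for u in G[v]))))
            for v in G
        }
    return sorted(colours.values())

G = nx.cycle_graph(6)
H = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))
print(wl_colour_histogram(G) == wl_colour_histogram(H))  # True: 1-WL cannot tell C6 from two triangles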
Let us see an example of non-isomorphic graphs that are indistinguishable by a GraphConv-based GNN.
In your environment on the scholar cluster, install PyG as follows:
pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html
pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
pip install -q git+https://github.com/pyg-team/pytorch_geometric.git
where TORCH is the version of PyTorch you have installed, which you can find with Python as:
import torch
print(torch.__version__)
# @title [RUN] Import modules
import os.path as osp
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import scipy.io as sio
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from scipy.special import comb
from sklearn.metrics import accuracy_score
from torch_geometric.data import InMemoryDataset
from torch_geometric.data.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GraphConv, global_add_pool
from torch_geometric.utils import to_networkx, to_undirected
# @title [RUN] Helper functions for plots and visualisations
def get_color(node_emb, hashmap):
max_val = max(hashmap.values(), default=0) + 1
crt_node_emb = np.round(node_emb, decimals=2)
if tuple(crt_node_emb) not in hashmap:
hashmap[tuple(crt_node_emb)] = max_val
max_val += 1
crt_node_emb = hashmap[tuple(crt_node_emb)]
# Map float number to a color
crt_color = cm.tab10(crt_node_emb, bytes=True)
crt_color = (
crt_color[0] / 255.0,
crt_color[1] / 255.0,
crt_color[2] / 255.0,
crt_color[3] / 255.0,
)
return crt_color
def draw_one_graph(
ax, edges, hash, label=None, node_emb=None, layout=None,
):
graph = nx.Graph()
graph.add_edges_from(zip(edges[0], edges[1]))
node_pos = layout(graph)
# Add colors according to node embeding
if node_emb is not None:
color_map = [
get_color(node_emb[node_id], hash)
for node_id in graph.nodes()
]
nx.draw_networkx_nodes(
graph, node_pos, node_color=color_map, nodelist=graph.nodes(), ax=ax
)
nx.draw_networkx_edges(graph, node_pos, ax=ax)
nx.draw_networkx_labels(graph, node_pos, ax=ax)
else:
nx.draw_networkx(graph, node_pos, ax=ax)
def gallery(
graphs,
hash=None,
labels=None,
node_emb=None,
max_fig_size=(40, 10),
layout=nx.layout.kamada_kawai_layout,
):
if hash is None:
hash = {}
num_graphs = len(graphs)
ff, axes = plt.subplots(
1, num_graphs, figsize=max_fig_size, subplot_kw={"xticks": [], "yticks": []}
)
if num_graphs == 1:
axes = [axes]
if node_emb is None:
node_emb = num_graphs * [None]
if labels is None:
labels = num_graphs * [" "]
for i in range(num_graphs):
draw_one_graph(
axes[i],
graphs[i].edge_index.numpy(),
hash,
labels[i],
node_emb[i],
layout,
)
if labels[i] != " ":
axes[i].set_title(f"Target: {labels[i]}", fontsize=28)
axes[i].set_axis_off()
plt.show()
Let us consider three exemplary graphs chosen from the Graph8c dataset, which contains all the possible connected non-isomorphic simple graphs with 8 nodes.
The goal is to understand whether a GraphConv-based GNN can disambiguate them.
To do so, we are going to assign a different class to each of them, and we are going to place them in our training set. A GNN can disambiguate them if and only if it can learn to correctly predict their targets.
# @title [RUN] `Graph8c` data retrieval
# Let's get three (selected) graphs from the Graph8c dataset from
# “Breaking the Limits of Message Passing Graph Neural Networks”
# (https://arxiv.org/pdf/2106.04319.pdf)
!wget https://raw.githubusercontent.com/balcilar/gnn-matlang/main/dataset/graph8c/raw/graph8c.g6
--2024-02-08 17:52:16--  https://raw.githubusercontent.com/balcilar/gnn-matlang/main/dataset/graph8c/raw/graph8c.g6
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 77819 (76K) [text/plain]
Saving to: ‘graph8c.g6.2’

graph8c.g6.2        100%[===================>]  76.00K  --.-KB/s    in 0.05s

2024-02-08 17:52:16 (1.55 MB/s) - ‘graph8c.g6.2’ saved [77819/77819]
IDS = [379, 4625, 4657]
class Grapg8cDataset(InMemoryDataset):
def __init__(self, root, transform=None, pre_transform=None):
super(Grapg8cDataset, self).__init__(root, transform, pre_transform)
self.data, self.slices = torch.load(self.processed_paths[0])
@property
def raw_file_names(self):
return ["graph8c.g6"]
@property
def processed_file_names(self):
return "data.pt"
def download(self):
# Download to `self.raw_dir`.
pass
def process(self):
# Read data into huge `Data` list.
dataset = nx.read_graph6(self.raw_file_names[0])
data_list = []
for i, datum in enumerate(dataset):
if i not in IDS:
continue
x = torch.ones(datum.number_of_nodes(), 1)
edge_index = to_undirected(
torch.tensor(list(datum.edges())).transpose(1, 0)
)
data_list.append(
Data(edge_index=edge_index, x=x, y=torch.tensor(IDS.index(i)))
)
if self.pre_filter is not None:
data_list = [data for data in data_list if self.pre_filter(data)]
if self.pre_transform is not None:
data_list = [self.pre_transform(data) for data in data_list]
data, slices = self.collate(data_list)
torch.save((data, slices), self.processed_paths[0])
ds = Grapg8cDataset(root="tmp/Graph8c")
G1, G2, G3 = ds[0], ds[1], ds[2]
Let us assign a different color for every different input feature.
gallery(
[G1, G2, G3],
labels=np.array([G1.y.item(), G2.y.item(), G3.y.item()]),
)
Now, it is time to implement our GNN and train it to distinguish these three graphs.
So far so good, it seems an easy task.
class MLP(nn.Module):
"""A simple feed forward neural network"""
def __init__(self, in_dim, emb_dim, num_layers=2):
super(MLP, self).__init__()
layer_list = []
layer_list.append(torch.nn.Linear(in_dim, emb_dim))
for _ in range(num_layers - 1):
layer_list.append(torch.nn.BatchNorm1d(emb_dim))
layer_list.append(torch.nn.ReLU())
layer_list.append(torch.nn.Linear(emb_dim, emb_dim))
self.layers = torch.nn.Sequential(*layer_list)
def forward(self, x):
return self.layers(x)
class GraphConvNet(torch.nn.Module):
def __init__(self, in_dim, emb_dim, out_dim, num_layers, num_final_layers=1):
super(GraphConvNet, self).__init__()
self.convs = torch.nn.ModuleList()
for layer in range(num_layers):
self.convs.append(
GraphConv(
emb_dim if layer != 0 else in_dim,
emb_dim,
)
)
self.pool = global_add_pool
self.final_layers = MLP(emb_dim, out_dim, num_layers=num_final_layers)
def forward(self, data):
h_node = data.x
for conv in self.convs:
h_node = F.relu(conv(h_node, data.edge_index))
h_graph = self.pool(h_node, data.batch)
return self.final_layers(h_graph), h_node.detach()
# @title [RUN] Hyperparameters GNNs
BATCH_SIZE = 128 # @param {type:"integer"}
NUM_EPOCHS = 25 # @param {type:"integer"}
HIDDEN_DIM = 64 # @param {type:"integer"}
NUM_LAYERS = 4 # @param {type:"integer"}
LR = 0.001 # @param {type:"number"}
SEED = 42 # @param {type:"integer"}
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# @title [RUN] Set the seed
torch.manual_seed(SEED)
np.random.seed(SEED)
def train(train_loader, model, optimiser, loss_fn, metric_fn):
"""Train model for one epoch"""
model.train()
total_loss = 0
num_graphs = 0
for data in train_loader:
optimiser.zero_grad()
data = data.to(DEVICE)
y_hat, _ = model(data)
loss = loss_fn(y_hat, data.y)
loss.backward()
optimiser.step()
total_loss += loss.item() * len(data.y)
num_graphs += len(data.y)
return total_loss / num_graphs
def evaluate(loader, model, metric_fn):
"""Evaluate model on dataset"""
y_pred, y_true = [], []
model.eval()
for data in loader:
data = data.to(DEVICE)
y_hat, _ = model(data)
y_pred.append(y_hat.detach().cpu())
y_true.append(data.y.detach().cpu())
y_pred = torch.cat(y_pred, dim=0)
y_true = torch.cat(y_true, dim=0)
return metric_fn(y_true, y_pred)
def run(
model,
train_loader,
loaders,
loss_fn,
metric_fn,
use_scheduler=False,
print_steps=True,
):
"""Train the model for NUM_EPOCHS epochs"""
# Instantiate optimiser and scheduler
optimiser = optim.Adam(model.parameters(), lr=LR)
scheduler = (
optim.lr_scheduler.StepLR(optimiser, step_size=DECAY_STEP, gamma=DECAY_RATE)
if use_scheduler
else None
)
curves = {name: [] for name in loaders.keys()}
for epoch in range(NUM_EPOCHS):
train_loss = train(
train_loader, model, optimiser, loss_fn, metric_fn
)
if scheduler is not None:
scheduler.step()
for name, loader in loaders.items():
curves[name].append(evaluate(loader, model, metric_fn))
if print_steps:
print(
f"[Epoch {epoch}]",
f"train loss: {train_loss:.6f}",
end=" ",
)
for name, metric in curves.items():
print(f"{name} metric: {metric[-1]:.3f}", end=" ")
print("\n")
return curves
# Instantiate our models
graphconv_model = GraphConvNet(
in_dim=ds.num_features,
emb_dim=HIDDEN_DIM,
out_dim=ds.num_classes,
num_layers=NUM_LAYERS,
).to(DEVICE)
# Create the Dataloader
train_loader = DataLoader(ds, BATCH_SIZE, shuffle=True)
# Train our GNN model
_ = run(
graphconv_model,
train_loader,
{"train": train_loader},
loss_fn=F.cross_entropy,
metric_fn=lambda x, y: accuracy_score(x, y.argmax(-1)),
)
[Epoch 0] train loss: 7.958117 train metric: 0.333
[Epoch 1] train loss: 3.211746 train metric: 0.333
[Epoch 2] train loss: 2.457222 train metric: 0.333
[Epoch 3] train loss: 1.130873 train metric: 0.333
[Epoch 4] train loss: 1.671703 train metric: 0.333
[Epoch 5] train loss: 1.296378 train metric: 0.667
[Epoch 6] train loss: 0.790557 train metric: 0.667
[Epoch 7] train loss: 0.870086 train metric: 0.333
[Epoch 8] train loss: 0.936117 train metric: 0.333
[Epoch 9] train loss: 0.965901 train metric: 0.333
[Epoch 10] train loss: 0.866404 train metric: 0.667
[Epoch 11] train loss: 0.778652 train metric: 0.667
[Epoch 12] train loss: 0.634940 train metric: 0.667
[Epoch 13] train loss: 0.609730 train metric: 0.667
[Epoch 14] train loss: 0.646420 train metric: 0.667
[Epoch 15] train loss: 0.644886 train metric: 0.667
[Epoch 16] train loss: 0.573604 train metric: 0.667
[Epoch 17] train loss: 0.552052 train metric: 0.667
[Epoch 18] train loss: 0.556559 train metric: 0.667
[Epoch 19] train loss: 0.563795 train metric: 0.667
[Epoch 20] train loss: 0.552862 train metric: 0.667
[Epoch 21] train loss: 0.539810 train metric: 0.667
[Epoch 22] train loss: 0.519177 train metric: 0.667
[Epoch 23] train loss: 0.504517 train metric: 0.667
[Epoch 24] train loss: 0.505564 train metric: 0.667
Since the training accuracy plateaus at 2/3, the GNN model correctly classifies only two of the three graphs: it cannot tell the remaining pair apart.
Let us visualise our three graphs, and use a different color for every different node representation learnt by our GNN.
_, G1_h_nodes = graphconv_model.to("cpu")(G1)
_, G2_h_nodes = graphconv_model.to("cpu")(G2)
_, G3_h_nodes = graphconv_model.to("cpu")(G3)
stateful_hash = {}
gallery(
[G1, G2, G3],
hash=stateful_hash,
labels=np.array([G1.y.item(), G2.y.item(), G3.y.item()]),
node_emb=[G1_h_nodes.numpy(), G2_h_nodes.numpy(), G3_h_nodes.numpy()],
)
Our GNN model is unable to distinguish the second and the third graph!
Indeed, you can see that their nodes have the same colors; more precisely, the two graphs have the same multiset of node colors.
Thus any graph representation obtained from these node representations will be the same.
We have seen that any standard GNN is at most as powerful as the WL test.
Does this mean that the WL test is also unable to distinguish these two non-isomorphic graphs?
hash2 = nx.weisfeiler_lehman_graph_hash(to_networkx(G2), iterations=6)
hash3 = nx.weisfeiler_lehman_graph_hash(to_networkx(G3), iterations=6)
hash2 == hash3
True
Indeed, the WL test also fails to tell them apart!
Actually, GraphConv is as powerful as the WL test (Morris et al., 2019).
But note that this is not true for all GNNs, as some are even less powerful.
Q: Is there a way to go beyond this expressivity limit?
There are many existing works that try to increase the expressive power of GNNs, trading off expressive power, complexity and generalisation capabilities.
Many of these works are inspired by, or connected to, higher-order versions of the WL test (namely, the $k$-WL test).
While the WL test updates the colors of individual nodes, the $k$-WL test updates the colors of $k$-tuples of nodes.
Morris et al., 2019 directly construct neural counterparts of the $k$-WL test.
Maron et al., 2019a,b,c characterize invariant and equivariant layers for learning on graphs.
We will focus on the latter, providing a characterization of the collection of permutation invariant and equivariant linear layers acting on graph data.
Consider our adjacency matrix ${\bf A} \in \mathbb{A}^{n\times n}$.
Given a matrix ${\bf X} \in \mathbb{R}^{a\times b}$, denote by $\mathrm{vec}({\bf X}) \in \mathbb{R}^{ab \times 1}$ its column stack.
In the following we show how to obtain equivariant linear layers.
Let ${\bf W} \in \mathbb{R}^{n^2\times n^2}$ denote our neuron weights, so that the linear layer maps $\mathrm{vec}({\bf A})$ to another vectorized $n \times n$ array.
Goal: ${\bf W}$ should be permutation equivariant.
Let $\pi \in \mathbb{S}_n$ be an element of the symmetric group on $n$ elements (a permutation) and let $\rho: \mathbb{S}_n \to \mathbb{R}^{n^2 \times n^2}$ be a representation of $\pi$ that acts on the (vectorized) adjacency matrix to permute it, i.e., $\rho(\pi)\text{vec}({\bf A})$ is a (vectorized) permutation of ${\bf A}$.
Then, from our G-invariance and set lectures, we know that G-equivariance implies $$\begin{equation}\tag{8} \label{eq:eq} \forall \pi \in \mathbb{S}_n, \forall {\bf A} \in \mathbb{A}^{n \times n}:\quad \rho(\pi){\bf W}\text{vec}({\bf A}) = {\bf W}\rho(\pi)\mathrm{vec}({\bf A}) \end{equation}$$ where $\mathrm{vec}$ flattens the matrix into a vector.
Since Equation $(\ref{eq:eq})$ is true for all ${\bf A}$, we can impose the following: $$ \rho(\pi){\bf W} = {\bf W}\rho(\pi). $$
We can now use the Kronecker product property we saw in our G-equivariance lecture and in our set lecture to write $$ \rho(\pi){\bf W} = {\bf W}\rho(\pi) \implies \left(\rho(\pi)\otimes \rho(\pi^{-1})^T\right)\mathrm{vec}({\bf W}) = \mathrm{vec}({\bf W}). $$ For any permutation $\pi \in \mathbb{S}_n$ the permutation matrix satisfies $\rho(\pi^{-1}) = \rho(\pi)^T$. Then, $$ \rho(\pi){\bf W} = {\bf W}\rho(\pi) \implies \left(\rho(\pi)\otimes \rho(\pi)\right)\mathrm{vec}({\bf W}) = \mathrm{vec}({\bf W}). $$
Finally, we need to compute the Reynolds operator $$\bar{P} = \frac{1}{n!}\sum_{\pi \in \mathbb{S}_n} \rho(\pi)\otimes \rho(\pi).$$ And we know from our G-invariance lecture that the Reynolds operator guarantees $\bar{P}\left(\rho(\pi)\otimes \rho(\pi)\right) = \bar{P}$, $\forall \pi \in \mathbb{S}_n$.
We now define the subspace of neuron parameters that will make our (linear) neural layer equivariant.
As seen in the G-invariant lecture, weights $$\mathrm{vec}({\bf W}) \in \text{Left-1-Eig}(\bar{P}) = \{\mathrm{vec}({\bf W}') \in \mathbb{R}^{n^4} : \mathrm{vec}({\bf W}')^T \bar{P} = \mathrm{vec}({\bf W}')^T \}$$ define an equivariant subspace.
Moreover, we know $\text{Left-1-Eig}(\bar{P})$ is spanned by the left eigenvectors of $\bar{P}$ with eigenvalue 1; therefore our neuron weights are defined as $$ \mathrm{vec}({\bf W}) = \sum_{i=1}^k \alpha_i {\bf v}_i, $$ where ${\bf v}_1^T,\ldots,{\bf v}_k^T$ are the $k$ left eigenvectors of $\bar{P}$ with eigenvalue 1 and $k$ is the rank of $\bar{P}$.
For example, the 1-left eigenvector basis for ${\bf A} \in \mathbb{A}^{n \times n}$ for $n = 5$ can be visualized as
where the number of basis elements is $\text{bell}(k + l) = \text{bell}(2 + 2) = 15$, with $k = l = 2$ the orders of the input and output tensors, and $\text{bell}(m)$ the $m$-th Bell number.
Figure credit (Maron et al. 2019a).
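A numerical sanity check of this count (illustrative; here we take $n = 4$, the smallest $n$ for which the fixed subspace reaches the full dimension 15):

# Build P_bar = (1/n!) sum_pi rho(pi) (x) rho(pi), with rho(pi) = P_pi (x) P_pi acting on vec(A),
# and verify that its rank (the dimension of its eigenvalue-1 eigenspace) is bell(4) = 15.
import itertools
import numpy as np

n = 4
rhos = []
for pi in itertools.permutations(range(n)):
    P = np.eye(n)[list(pi)]                              # n x n permutation matrix
    rhos.append(np.kron(P, P))                           # rho(pi): n^2 x n^2, permutes vec(A)

P_bar = sum(np.kron(R, R) for R in rhos) / len(rhos)     # Reynolds operator, n^4 x n^4

print(np.linalg.matrix_rank(P_bar))                      # 15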
Note that this result can be extended analogously to find the 1-left-eigenvector basis for higher-order adjacency tensors ${\bf A} \in \mathbb{A}^{n^k}$.
We have seen how to obtain invariant and equivariant linear layers, which can be used to construct Higher-order GNNs, our first attempt at more expressive GNNs.
The message-passing paradigm of GNNs is limited by its anonymity (no sender info).
Thus, one can address the problem by augmenting the node features with unique identifiers.
Several methods fall in this category. We will expand on two.
Identifiers as node features + average (Murphy et al. 2019)
Run a standard GNN model $f$ on the original graph augmented with randomly assigned one-hot node ids as additional features. This gives a positional node representation by breaking symmetries. Then, use symmetrization to get a structural node representation from the positional ones:
$$ {\bf h}_v = \frac{1}{n!} \sum_{\pi \in \mathbb{S}_n} f({\bf A} \Vert (\pi \circ I_n), v),$$
where ${\bf A} \Vert (\pi \circ I_n)$ concatenates given features with the additional permuted node ids (thus, concatenates on the diagonal of ${\bf A}$).
Pros:
Cons:
Note that in practice this approach is simply applying the Reynolds operator to the permutation group $\mathbb{S}_n$, as we have seen in the G-invariant lecture and in the set lecture.
Random node features (Abboud et al. 2021, Sato et al., 2021)
Run a standard GNN model $f$ on the original graph augmented with additional randomized initial node features $R$, one for every node $v$, $R_v \sim \mathcal{D}$, $R_v \in \mathbb{R}^d$:
$${\bf h}_v = f({\bf A} \Vert R, v),$$
where ${\bf A} \Vert R$ concatenates given features with the additional random ones (thus, concatenates on the diagonal of ${\bf A}$).
Random values can act as ids of the nodes, and can "always" make nodes distinguishable. But it also makes the node embeddings positional. As we will see in our positional node embedding lecture, averaging over the distribution of random features (and node permutations) would allow us to recover structural node embeddings.
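A quick sketch of this idea on our running example (reusing GraphConvNet, G2, G3, ds, HIDDEN_DIM and NUM_LAYERS defined above; the random-feature dimension 8 is an arbitrary choice):

# Append i.i.d. random features R_v to every node before running the GNN.
import torch
from torch_geometric.data import Batch, Data

def add_random_features(data, d=8):
    r = torch.randn(data.num_nodes, d)                   # R_v ~ N(0, I_d)
    return Data(x=torch.cat([data.x, r], dim=-1), edge_index=data.edge_index, y=data.y)

rni_model = GraphConvNet(in_dim=1 + 8, emb_dim=HIDDEN_DIM, out_dim=ds.num_classes, num_layers=NUM_LAYERS)
rni_model.eval()
h2, _ = rni_model(Batch.from_data_list([add_random_features(G2)]))
h3, _ = rni_model(Batch.from_data_list([add_random_features(G3)]))
print(torch.allclose(h2, h3))    # almost surely False, even without training: the random ids break
                                 # the symmetry, at the price of positional (noisy) representations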
Pros:
Cons:
The inability of GNNs to distinguish non-isomorphic graphs is directly tied to the inability to count substructures.
Theorem (Chen et al. 2020): GNNs cannot perform induced-subgraph-count of substructures consisting of 3 or more nodes.
Thus:
Count the number of substructures each node belongs to as a pre-processing step.
Use the counts to augment node features (Bouritsas et al. 2022) or to guide the message passing (Bodnar et al. 2021).
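A sketch of such a pre-processing step, using triangle counts as a simple instance of the substructure counts of Bouritsas et al. 2022 (reusing G2 from the example above):

# Count, for every node, the number of triangles it belongs to, and append the
# count to the node features.
import networkx as nx
import torch
from torch_geometric.utils import to_networkx

def add_triangle_counts(data):
    G = to_networkx(data, to_undirected=True)
    tri = nx.triangles(G)                                # node -> number of triangles through it
    counts = torch.tensor([tri[v] for v in range(data.num_nodes)], dtype=torch.float).view(-1, 1)
    data.x = torch.cat([data.x, counts], dim=-1)
    return data

G2_aug = add_triangle_counts(G2.clone())
print(G2_aug.x[:, -1])                                   # per-node triangle counts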
Figure credit: Bouritsas et al. 2022.
View each graph as a bag of subgraphs, obtained through a deterministic function (which is called subgraph selection policy)
Employ an architecture (or encoder) returning a representation that is invariant to permutations of node ids and invariant to permutations of the elements in the bag.
For example, a valid architecture respecting those symmetries consists of a GNN applied to each subgraph independently, followed by a pooling function that aggregates the results.
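A minimal sketch of such an architecture with the node-deletion selection policy (in the spirit of Cotta et al., 2021), reusing GraphConvNet and graphconv_model from above; real subgraph GNNs batch the subgraphs and are trained end to end:

# Run the GNN on every node-deleted subgraph and sum-pool the resulting graph representations.
import torch
from torch_geometric.data import Batch, Data
from torch_geometric.utils import subgraph

def node_deleted_bag(data):
    """Return the bag of node-deleted subgraphs of `data`."""
    bag = []
    for v in range(data.num_nodes):
        keep = torch.tensor([u for u in range(data.num_nodes) if u != v])
        edge_index, _ = subgraph(keep, data.edge_index, relabel_nodes=True, num_nodes=data.num_nodes)
        bag.append(Data(x=data.x[keep], edge_index=edge_index))
    return bag

def subgraph_gnn(model, data):
    # Sum over the bag (invariant to the order of its elements); the GNN itself
    # is invariant to permutations of node ids within each subgraph.
    return sum(model(Batch.from_data_list([sub]))[0] for sub in node_deleted_bag(data))

print(subgraph_gnn(graphconv_model, G2).shape)           # a graph-level representation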
Several variants exist (Cotta et al., 2021, Zhang and Li, 2021, You et al., 2021, Bevilacqua et al., 2022, Zhao et al., 2022, Papp et al., 2021, Frasca et al., 2022, Qian et al., 2022).
Different from structural node representations, positional node representations are not generally invariant to permutations.
For instance, the Singular Value Decomposition of $A$ gives positional node representations, which are markedly different from the structural ones:
Consider an undirected graph $G$ (then ${\bf U} = {\bf V} = {\bf Z}$).
Definition (Positional node representations [Srinivasan & Ribeiro, 2020]): Positional node representations of a graph $G$ are defined as joint samples of random variables $(\mathbf{Z}_i)_{i\in V}|{\bf A} \sim p(\cdot |{\bf A})$, $\mathbf{Z}_i \in \mathbb{R}^{d}$, $d \geq 1$, where $p(\cdot|{\bf A})$ is a $\mathbb{S}_n$-equivariant probability distribution on ${\bf A}$. That is, $$(Z_1,\ldots,Z_n)|{\bf A} \stackrel{d}{=} (Z_{\pi(1)},\ldots,Z_{\pi(n)})|(\pi \circ {\bf A}), \quad \pi \in \mathbb{S}_n,$$ where $\stackrel{d}{=}$ means equivalent in distribution.
Positional representations are often obtained via matrix factorization (MF), which decomposes the adjacency matrix ${\bf A}$ into a matrix of $k$-dimensional node factors $\Phi \in \mathbb{R}^{k \times n}$ (one column per node).
Q: What is the impact of a permutation ${\bf B} = \pi \circ {\bf A}$ on a matrix factorization?
Now consider the following example:
Assume our graph is our Arctic food web, with adjacency matrix ${\bf A}$.
Consider the node representations of the food web using the left singular vectors obtained by applying SVD to ${\bf A}$.
Since ${\bf A}$ is symmetric, these are just the eigenvectors $\Psi$ of ${\bf A}$.
In the figure, the color of node $v \in V$ is an RGB characterization of its eigenvector $\Psi_{\cdot,v}$.
Now consider the permutation $\pi \in \mathbb{S}_n$ such that ${\bf B} = \pi\circ {\bf A} = {\bf A}$, i.e., $\pi$ is an automorphism of the food web that swaps the land and sea components.
If we run the same SVD algorithm on ${\bf B}$ (recall ${\bf B} = {\bf A}$), we obtain exactly the same factor matrix; however, since the rows now refer to the relabeled nodes, each land node receives the embedding previously assigned to its sea counterpart, and vice versa.
Since permuting ${\bf A}$ results in different positional embeddings of the nodes, then positional embeddings can be interpreted as samples from a joint distribution.
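A tiny concrete analogue of this phenomenon (a 3-node path instead of the food web; illustrative code, not the lecture material):

# Path 0 - 1 - 2: the two end nodes are isomorphic, yet the eigenvector
# (positional) embeddings assign them different vectors, and the automorphism
# swapping them exchanges which node receives which embedding.
import numpy as np

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

_, Psi = np.linalg.eigh(A)                   # columns: eigenvectors; row v = embedding of node v
print(np.allclose(Psi[0], Psi[2]))           # False: isomorphic nodes, different positional embeddings

pi = [2, 1, 0]                               # automorphism swapping the two end nodes
B = A[pi][:, pi]
print(np.allclose(B, A))                     # True: B = pi o A = A
_, Psi_b = np.linalg.eigh(B)
print(np.allclose(Psi_b, Psi))               # True: same matrix, same factorization, but node 0 now
                                             # plays node 2's role, so its embedding changed relative
                                             # to the original labelling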
Let us consider the marginals of this joint distribution. The parameters of the marginal distribution are related to structural representations of the nodes.
Definition: Given a graph $G = (V, E)$ we say that two nodes $u, v \in V$ are isomorphic (denoted as $u \simeq v$) if they have symmetric positions in the graph.
In the figure (Zhang et al., 2021) $ v_1 \simeq v_4 $, $ v_2 \simeq v_3 $
Consider a bivariate normal distribution $(X_1, X_2) \sim \mathcal{N}\left(\boldsymbol{0}, \begin{bmatrix} 1 & \sigma \\ \sigma & 1\end{bmatrix}\right)$, where $\sigma \in \mathbb{R}$ is a scalar.
Note that from the marginal distributions we cannot estimate the parameter $\sigma$, which requires joint samples $(x_1, x_2)$ for it to be estimated.
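A quick numerical illustration of this point (the sample size and $\sigma$ values below are arbitrary choices):

# Two bivariate normals with different sigma have (numerically) identical marginals;
# only a joint statistic, the empirical correlation, reveals sigma.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

def sample(sigma):
    cov = np.array([[1.0, sigma], [sigma, 1.0]])
    return rng.multivariate_normal(np.zeros(2), cov, size=n_samples)

x_a = sample(0.0)                                        # independent coordinates
x_b = sample(0.9)                                        # strongly correlated coordinates

print(x_a.mean(0).round(2), x_a.var(0).round(2))         # marginals: mean ~ 0, variance ~ 1
print(x_b.mean(0).round(2), x_b.var(0).round(2))         # marginals: mean ~ 0, variance ~ 1 (the same)
print(np.corrcoef(x_a.T)[0, 1].round(2), np.corrcoef(x_b.T)[0, 1].round(2))   # 0.0 vs 0.9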
A link prediction task is a joint task between two nodes via link function $\rho(Z_u, Z_v)$, where $Z_u, Z_v$ denote embeddings of nodes $u, v$ sampled jointly.
As we have seen in the bivariate case, marginal distributions are insufficient to solve a joint task. Therefore, GNNs may fail in solving link prediction tasks.
$\text{GNN}(G)_{\text{Lynx}} = \text{GNN}(G)_{\text{Orca}}$ and therefore $\rho(\text{GNN}(G)_{\text{Hare}}, \text{GNN}(G)_{\text{Lynx}}) = \rho(\text{GNN}(G)_{\text{Hare}}, \text{GNN}(G)_{\text{Orca}})$.
However, the permutation sensitivity in matrix factorization algorithms makes it difficult to practically use them with a final link function for link prediction tasks (unless we can ensure invariance to permutations in the link function).
So how can we predict links without being sensitive to permutations?
Definition (Joint structural representations): Joint structural node set representations of an $n\times n$ adjacency matrix ${\bf A}$ (or $n\times n \times k$, $k \geq 2$, adjacency tensor ${\bf A}$) are obtained from a function $f : \mathbb{R}^{n\times n} \to \mathbb{R}^{|P^{\star}(V)| \times d}$, with $P^{\star}(V)$ the power set excluding the empty set, where isomorphic node sets (denoted by $\simeq$) obtain identical representations, i.e., $$ s_1 \simeq s_2 \implies (f({\bf A}))_{s_1,\cdot} = (f({\bf A}))_{s_2,\cdot},$$ where $s_1, s_2 \in P^{\star}(V)$ and $(f({\bf A}))_{s_2,\cdot}$ denotes the row of $f({\bf A})$ corresponding to $s_2$.
Definition (Most Expressive Joint Structural Representations): We say that the joint structural representations of a node set are most expressive iff $$ s_1 \simeq s_2 \Leftrightarrow (f({\bf A}))_{s_1,\cdot} = (f({\bf A}))_{s_2,\cdot}.$$
Theorem (Srinivasan & Ribeiro, 2020) (informal): For a graph $G$, given a sequence of node sets $\vec{S}=(s_1,\ldots,s_n),\;s_i \in P^{\star}(V)$ and their associated random variables as a sequence $\vec{Y}=(y_1,\ldots,y_n)$, if for all pairs of isomorphic node sets $s_i, s_j$ in $\vec{S}$, their corresponding random variables $y_i, y_j$ are equal in distribution (i.e., $y_i \stackrel{d}{=} y_j$), then there exists a function $g$ and a most expressive joint structural representation $f$ s.t. $$g((f({\bf A}))_{s_i,\cdot},\epsilon) \stackrel{\text{a.s.}}{=} y_i,$$ where $\epsilon$ is a pure source of random noise.
Corollary (Srinivasan & Ribeiro, 2020) (informal): While most expressive structural node representations independently are not sufficient to perform link prediction, the most expressive joint structural representations (of a pair of nodes) are sufficient.
Thanks to the corollary, then we have that most expressive joint structural representations can be used to predict links.
Essentially, from the above results we can conclude that:
Importantly, we can use joint structural representations for link prediction.
We have presented GNNs as standard architectures applied to graph data, which guarantee invariance to permutations of node ids.
We have expanded on their limited expressive power, which results, for example, in
We have introduced several lines of work trying to break this expressivity limit.
We have discussed the limitations of existing more expressive methods, and shown that the expressivity problem is not solved, yet.
We have shown that structural representations are all you need. The challenge is to define the number of nodes in your structural representation before solving the task.
Srinivasan and Ribeiro, 2020. Srinivasan, B., and Ribeiro, B. On the Equivalence between Positional Node Embeddings and Structural Graph Representations. ICLR 2020.
Morris et al., 2019. Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. AAAI 2019
Veličković et al., 2018. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention Networks. ICLR 2018
Bahdanau et al., 2015. Bahdanau, D., Cho, K. H., and Bengio, Y. Neural machine translation by jointly learning to align and translate. ICLR 2015
Xu et al., 2019. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? ICLR 2019
Murphy et al., 2019. Murphy, R., Srinivasan, B., Rao., V. and Ribeiro, B. Relational Pooling for Graph Representations. ICML 2019
Maron et al., 2019a. Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and Equivariant Graph Networks. ICLR 2019
Maron et al., 2019b. Maron, H., Ben-Hamu, H., Segol, N., and Lipman, Y. On the Universality of Invariant Networks. ICML 2019
Maron et al., 2019c. Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. NeurIPS 2019
Geerts, 2020. Geerts, F. The expressive power of kth-order invariant graph networks. ArXiv Preprint
Abboud et al., 2021. Abboud, R., Ceylan, I. I., Grohe, M., and Lukasiewicz, T. The surprising power of graph neural networks with random node initialization. IJCAI-21
Sato et al., 2021. Sato, R., Yamada, M., and Kashima, H. Random features strengthen graph neural networks. SDM 2021
Chen et al., 2020. Chen, Z., Chen, L., Villar, S., Bruna, J. Can Graph Neural Networks Count Substructures? NeurIPS 2020
Bouritsas et al., 2022. Bouritsas, G., Frasca, F., Zafeiriou, S., and Bronstein, M. M. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022
Bodnar et al., 2021. Bodnar, C., Frasca, F., Otter, N., Wang, Y. G., Liò, P., Montúfar, G., and Bronstein, M. M. Weisfeiler and lehman go cellular: CW networks. NeurIPS 2021
Cotta et al., 2021. Cotta, L., Morris, C., and Ribeiro, B. Reconstruction for powerful graph representations. NeurIPS 2021
Zhang and Li, 2021. Zhang, M. and Li, P. Nested graph neural networks. NeurIPS 2021
You et al., 2021. You, J., Gomes-Selman, J., Ying, R., and Leskovec, J. Identity-aware graph neural networks. AAAI 2021.
Bevilacqua et al., 2022. Bevilacqua, B., Frasca, F., Lim, D., Srinivasan, D., Cai, C., Balamurugan, G., Bronstein, M. M., Maron, H. Equivariant Subgraph Aggregation Networks. ICLR 2022
Zhao et al., 2022. Zhao, L., Jin, W., Akoglu, L., and Shah, N. From stars to subgraphs: Uplifting any GNN with local structure awareness. ICLR 2022.
Papp et al., 2021. Papp, P. A., Martinkus, K., Faber, L., and Wattenhofer, R. Dropgnn: Random dropouts increase the expressiveness of graph neural networks. NeurIPS 2021
Frasca et al., 2022. Frasca, F., Bevilacqua, B., Bronstein, M. M., Maron, H. Understanding and Extending Subgraph GNNs by Rethinking their Symmetries. NeurIPS 2022
Qian et al., 2022. Qian, C., Rattan, G., Geerts, F., Morris, C., Niepert, M. Ordered Subgraph Aggregation Networks. NeurIPS 2022