Graph Representation Learning¶

Copyright: Bruno Ribeiro, Beatrice Bevilacqua¶

Purdue University¶


If reusing this material, please keep copyright notice above¶


What is a graph?¶

  • Graph $G=(V,E,X_V,X_E)$, where
    • $V$ is the set of vertex ids.
      • Without loss of generality $V=\{1,\ldots, n\}$.
    • $E \subseteq V \times V$ is the set of edges
    • $X_V$ and $X_E$ are the sets of node and edge attributes, respectively

Adjacency Matrix and Adjacency Tensor¶

The adjacency matrix $A \in \mathbb{A}^{n \times n}$ is a square matrix whose elements indicate whether pairs of vertices have a relationship in some appropriate space $\mathbb{A}$.

  • In the simplest case, $\mathbb{A} = \{0,1\}$, and $A_{ij} = 1$ if there is a connection from node $i$ to $j$, and otherwise $A_{ij} = 0$ (indicating a non-edge).
  • If there are edge attributes (e.g., edge relations), then $\mathbb{A}$ is some appropriate space that represents these relations (often encoded as a binary vector) or some real-valued connection strength, $\mathbb{A} \subseteq \mathbb{R}^d$. This makes the adjacency matrix into an adjacency tensor ${\bf A}$.
  • Self-loops, e.g., $A_{ii}$, will encode node attributes (self-relations).
  • For an undirected graph, keep in mind that $A$ is a symmetric matrix ($A_{ij}=A_{ji}$). For example, the graph below has adjacency matrix: $$ A = \begin{bmatrix} 0 & 1 & 0 & 0\\ 1 & 0 & 1 & 1\\ 0 & 1 & 0 & 1\\ 0 & 1 & 1 & 0 \end{bmatrix} $$
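
As a minimal sketch (assuming NumPy), we can build this adjacency matrix from the edge list and verify its symmetry:

import numpy as np

# Undirected edges of the example graph, written with 0-based node ids 0..3
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]

n = 4
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph: store both directions

print(A)                          # the matrix shown above
print(np.array_equal(A, A.T))     # True: A is symmetric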

Types of graphs¶

Some less common graph tasks¶

Graph Tasks¶

Whole-graph classification task¶


Classical graph classification¶
  • We would like to design a function $f({\bf A})$ that produces a representation of graph ${\bf A} \in \mathbb{A}^{n\times n}$.

Whole-graph classification via end-to-end learning¶

  • Now, we would like to learn a function $f({\bf A})$ that produces a representation of graph ${\bf A} \in \mathbb{A}^{n\times n}$.

The above tasks are inductive graph tasks¶

  • These tasks are inductive graph tasks:
    • We train on a set of graphs
    • Apply them to a different set of graphs

Node classification task (via self-supervised learning)¶

  • Given a set of node labels as training data (e.g., label $\in$ {Bot, Real user})
  • Predict labels of unlabeled nodes
    • In graphs, the boundaries between training and test data can sometimes be blurry
    • We will cover the different training and test splits in our class on Positional vs Structural Node Embeddings

Node classification via end-to-end learning¶

  • Assume we have a function $f: \mathbb{A}^{n\times n} \times V \to \mathbb{R}^d, d \geq 1$.
    • $f({\bf A},v)$ produces a representation of node $v \in V$ in graph ${\bf A} \in \mathbb{A}^{n\times n}$.

Definition of a Graph through Invariant Theory¶

Assumption:

  • Vertex ids are arbitrary; the output of our graph predictions should not change if we permute them.

Theory:

A sequence of $n^2$ random variables $(X_1,\ldots,X_{n^2}) \in \mathbb{A}^{n^2}$ is said to be a graph whenever its distribution has the following property: $$ P(X_1,\ldots,X_{n^2}) = P(\pi \circ (X_1,\ldots,X_{n^2})), $$ for $\pi \in \Pi$, where $\Pi \subseteq \mathbb{S}_{n^2}$ is a special subset of permutations of the symmetric group $\mathbb{S}_{n^2}$ of order $n^2$.

  • The symmetric group $\mathbb{S}_k$ is the group of permutations of the elements $(1,\ldots,k)$.

More specifically, $\Pi$ is defined such that there exists a bijection between each of the $n^2$ elements in the sequence $(X_1,\ldots,X_{n^2})$ and each element in the adjacency tensor $A_{i,j}$, $i,j \in V$, and the following condition holds $$ P(A_{1,1},A_{1,2},\ldots,A_{n,n}) = P(A_{\pi \circ 1, \pi \circ1},A_{\pi \circ 1,\pi \circ 2},\ldots,A_{\pi \circ n,\pi \circ n}), \quad \forall \pi \in \mathbb{S}_n. $$

Q: Consider the graph

with the following adjacency matrix $$ A = \begin{bmatrix} 0 & 1 & 1 & {\color{red}1}\\ 1 & 0 & 1 & 1\\ 1 & 1 & 0 & 1\\ {\color{red}0} & 1 & 1 & 0 \end{bmatrix}. $$

Define $\text{vec}(A) = (0,0,0,0,1,\ldots,0,1,1,1)$. Consider a subgroup $\Pi \subseteq \mathbb{S}_{n^2}$ (a subgroup is a subset of a group that also forms a group) defined by some unknown bijection between the elements in the sequence and the edges in the adjacency matrix $A$. All we know about $\Pi$ is that for some $\pi \in \Pi$ we obtain $\pi \circ \text{vec}(A) = (0,0,0,0,0,\ldots,1,1,1,1)$.

Mark the only permutation performed by $\pi' \in \mathbb{S}_{n^2} \backslash \Pi$.

  • That is, mark the only permutation that would be a valid permutation for a set but invalid for a graph.
  1. $(0,0,0,1,0,\ldots,1,1,1,1)$
  2. $(0,0,0,0,0,\ldots,1,1,1,1)$
  3. $(0,0,0,0,1,\ldots,0,1,1,1)$
  4. $(0,0,0,0,1,\ldots,1,0,1,1)$
  5. $(0,0,0,0,1,\ldots,1,1,0,1)$
  6. $(0,0,0,0,1,\ldots,1,1,1,0)$

Permutation action on graphs¶

  • It is easier to define the graph permutations over the adjacency tensor $A$ using all the elements of $\mathbb{S}_n$ rather than a subgroup of $\mathbb{S}_{n^2}$.
  • A group action is the application of a group representation of an element of the group onto some object.
    • For instance, the group action "$\circ$" of permutation $\pi \in \mathbb{S}_n$ on the adjacency matrix is denoted $\pi \circ {\bf A}$ and jointly permutes the rows and columns of ${\bf A}$ according to the permutation in $\pi$.

Permutation action on the graph¶

  • We call ${\bf A}$ and $\pi \circ {\bf A}$ isomorphic graphs, where $\pi \in \mathbb{S}_n$ is an element of the symmetric group of order $n$ and "$\circ$" is the appropriate action (representation) over the input.
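
A minimal sketch (assuming NumPy) of this action: jointly permuting the rows and columns of ${\bf A}$ yields an isomorphic graph, e.g., with the same degree multiset:

import numpy as np

rng = np.random.default_rng(0)
n = 5
# Random simple undirected graph: symmetric 0/1 adjacency matrix, zero diagonal
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, k=1)
A = A + A.T

pi = rng.permutation(n)           # a permutation of the vertex ids
P = np.eye(n, dtype=int)[pi]      # its permutation-matrix representation rho(pi)
A_perm = P @ A @ P.T              # pi o A: jointly permute rows and columns

print(np.array_equal(A_perm, A[np.ix_(pi, pi)]))           # True: same action, index form
print(sorted(A.sum(axis=0)), sorted(A_perm.sum(axis=0)))   # same degree multiset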

The Consequence of Permutation Invariance on Node Representations¶

Structural Node Embeddings¶

  • Assume we have a function $f: \mathbb{A}^{n\times n} \times V \to \mathbb{R}^d$ that produces a representation of node $v \in V$ in graph ${\bf A}$.

  • Let's consider the Arctic food web graph below.

    • We have added the constraint that $f$ is invariant to permutations of the vertex ids
      • That is, $f(\pi \circ {\bf A}, \pi \circ v) = f({\bf A}, v).$
    • Let $Z_v = f({\bf A},v)$ be a permutation-invariant representation of node $v \in V$
      • Nodes drawn with the same color have the same representation vectors.
  • Node representations that are invariant to permutations are denoted structural node representations.
    • Because nodes that are structurally equivalent on the graph must get the same representations

  • In the above picture the coloring is obtained through $f$ which is a Graph Neural Network (which we will see next).

Structural node embeddings¶

Definition (Structural node representations): Structural node representations of an $n\times n$ adjacency matrix ${\bf A}$ (or $n\times n \times k$, $k \geq 2$, adjacency tensor ${\bf A}$) are obtained from a function $f : \mathbb{R}^{n\times n} \to \mathbb{R}^{n \times d}$ where $$ u \simeq v \implies (f({\bf A}))_{v,\cdot} = (f({\bf A}))_{u,\cdot},$$ where $(f({\bf A}))_{u,\cdot}$ denotes the row of $f({\bf A})$ corresponding to vertex $u$.

Note that structural node representations are equivariant to permutations of the adjacency matrix $\bf A$: $$\forall \pi \in \mathbb{S}_n \quad \quad f(\pi \circ {\bf A}) = \pi \circ f({\bf A}),$$ noting that $\pi$ over ${\bf A}$ jointly permutes rows and columns, while applied to $f({\bf A})$ it just permutes the rows.

Definition (Most Expressive Structural node representations): We say the structural representations are most expressive iff:
$$v \simeq u \Leftrightarrow f({\bf A})_{u,\cdot} = f({\bf A})_{v,\cdot},$$ that is, $f$ gives different embeddings to $u,v \in V$ if they are non-isomorphic, and the same embeddings if $u,v$ are isomorphic.

Graph neural networks (GNNs): GNNs output structural node embeddings.
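
As a small sketch (assuming NumPy) of the equivariance property above, take node degree, which is a simple (far from most expressive) structural node representation, and check $f(\pi \circ {\bf A}) = \pi \circ f({\bf A})$ directly:

import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, k=1)
A = A + A.T                       # random simple undirected graph

def f(A):
    # Node degree: structurally equivalent (isomorphic) nodes get the same value
    return A.sum(axis=1)

pi = rng.permutation(n)
lhs = f(A[np.ix_(pi, pi)])        # f(pi o A)
rhs = f(A)[pi]                    # pi o f(A): permute the rows of the output
print(np.array_equal(lhs, rhs))   # True: f is permutation equivariant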

Most-expressive Neural Networks for Graph Representation Learning through Symmetrization¶

(Murphy et al., 2019b) gave the following (then) surprising result. The work defined Relational Pooling, a neural representation defined as $$ \Gamma(A,i) = \frac{1}{n!} \sum_{\pi \in \mathbb{S}_n} f(\pi \circ A, \pi \circ i), $$ where $f: \mathbb{A}^{n^2} \times V \to \mathbb{R}^d$ is an arbitrary learnable function (e.g., neural network) that outputs a $d$-dimensional node representation of node $i\in V$ in graph $A \in \mathbb{A}^{n^2}$.

Theorem 2.1 (Murphy et al., 2019b) If node and edge attributes come from a finite set and $f: \mathbb{A}^{n^2} \times V \to \mathbb{R}^d$ is a universal function approximator, then $$ \Gamma(A,i) = \frac{1}{n!} \sum_{\pi \in \mathbb{S}_n} f(\pi \circ A, \pi \circ i) $$ is a most-expressive graph representation of node $i\in V$ in graph $A \in\mathbb{A}^{n^2}$.

  • For instance, $f$ could be an MLP or a recurrent neural network (RNN).
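
Below is a minimal sketch (assuming NumPy) of $\Gamma$ on a tiny graph, where the placeholder $f$ is just a fixed random linear map applied to the flattened permuted adjacency matrix concatenated with a one-hot encoding of the relabeled node index; averaging over all $n!$ permutations makes the output invariant to how the vertex ids were assigned:

import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])

# Placeholder f: a fixed random linear map on (vec(pi o A), one-hot(pi o i))
W = rng.normal(size=(n * n + n, 3))

def f(A_perm, i_perm):
    return np.concatenate([A_perm.reshape(-1), np.eye(n)[i_perm]]) @ W

def relational_pooling(A, i):
    # Gamma(A, i): average f over all n! joint relabelings of the graph and node i
    outs = []
    for pi in itertools.permutations(range(n)):
        pi = np.array(pi)
        outs.append(f(A[np.ix_(pi, pi)], int(np.where(pi == i)[0][0])))
    return np.mean(outs, axis=0)

# Relabel the graph with a random permutation sigma and query the "same" node:
sigma = rng.permutation(n)
z1 = relational_pooling(A, 2)
z2 = relational_pooling(A[np.ix_(sigma, sigma)], int(np.where(sigma == 2)[0][0]))
print(np.allclose(z1, z2))   # True: the symmetrized representation is invariant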

K-ary representations for graphs¶

$k$-ary representations in Relational Pooling are equivalent to using subgraphs to represent a graph.


  • Graph $k$-reconstruction conjecture. Kelly et al. (1957) generalize the graph reconstruction conjecture, considering the multiset of subgraphs of size $k$, which we denote $D_k(G) = \{\{I(H): H \in S^{(k)}(G)\}\}$, where $S^{(k)}$ is the set of all $\binom{n}{k}$ $k$-size subsets of $V$. We often call an element in $D_k(G)$ a $k$-card. Let $G$ and $H$ be graphs, then $H$ is a $k$-reconstruction of $G$ if $H$ and $G$ have the same $k$-deck, denoted $H \sim_k G$. A graph $G$ is $k$-reconstructible if every $k$-reconstruction of $G$ is isomorphic to $G$, i.e., $H \sim_k G$ implies $H \cong G$.

  • Results for $k$-reconstruction usually state the least $k$ as a function of $n$ such that all graphs $G$ in $\mathcal{G}$ (or some subset) are $k$-reconstructible (Nydl, 2001). There exist extensive partial results in this direction, mostly describing $k$-reconstructibility (as a function of $n$) for a particular family of graphs, such as trees, disconnected graphs, complete multipartite graphs, and paths. More concretely, Nydl (2001), Spinoza and West (2019) showed graphs with $2k$ vertices that are not $k$-reconstructible. In practice, these results imply that for some fixed $k$ there will be graphs with not many more vertices than $k$ that are not $k$-reconstructible. Further, $k$-reconstructible graph functions such as degree sequence and connectedness have been studied in Manvel (1974) and Taylor (1990) depending on the size of $k$. Please refer to Cotta et al., (2021) for a more complete discussion.

    • Nydl, V. (2001). Graph reconstruction from subgraphs. Discrete Mathematics, 235(1-3):335–341.
    • Spinoza, H. and West, D. B. (2019). Reconstruction from the deck of $k$-vertex induced subgraphs. Journal of Graph Theory, 90(4):497–522.
    • Kostochka, A. V. and West, D. B. (2020). On reconstruction of n-vertex graphs from the multiset of ($n-l$)-vertex induced subgraphs. IEEE Transactions on Information Theory, PP:1–1.
    • Manvel, B. (1974). Some basic observations on Kelly’s conjecture for graphs. Discrete Mathematics, 8(2):181–185.
    • Taylor, R. (1990). Reconstructing degree sequences from k-vertex-deleted subgraphs. Discrete mathematics, 79(2):207–213.

Graph Neural Networks: Permutation-Equivariant Graph Representations (Inspired by Image Convolutions)¶

Let's start with a CNN filter:

Images as a lattice¶

Note that the CNN filter is applied to a constant number of inputs in the last layer.
Topologically, an image is a lattice and the convolution is a filter applied over a patch of the lattice

Layered representations¶

Then, we apply this convolution in layers, multiple times. Note that each pixel in the previous layer has a convolution representation in the next layer.

Note that in CNNs:

  • Convolutions are applied over fixed-size neighborhoods of a vertex (pixel)
  • Convolutions are permutation-sensitive

Today, we are going to explore convolutions over more complex topologies.

Can we design "convolutions" for arbitrary graphs?¶

Let $\bf A$ represent an arbitrary simple (no loops) undirected social network graph. We now illustrate how a convolution should act over the topological space of Facebook friends:

The graph neural layer applied to Bruno will act on his neighbors

The graph neural layer applied to one of his friends will act on her neighbors

Message Passing Graph Neural Layer¶

Since neighborhoods have varying sizes, Graph Neural Layers require a set representation
The composition of multiple Graph Neural layers is called a Graph Neural Network

Message Passing Graph Neural Networks (MPNNs, GNNs)¶

The representation ${\bf h}^{(k)}_v \in \mathbb{R}^{1 \times d}$, $d \in \mathbb{N}$, at layer $k$ of node $v \in V$ can be described through:

  • Its own embedding at layer $k-1$, ${\bf h}^{(k-1)}_v \in \mathbb{R}^{1 \times d'}$, $d' \in \mathbb{N}$ and
  • The embeddings of its neighbors $({\bf h}^{(k-1)}_u)_{\forall u \in \mathcal{N}_v}$, where $\mathcal{N}_v$ is the set of neighbors of $v$ in $V$.

Recursion¶

The resulting filter recursion is: $$ {\bf h}^{(k)}_v = \sigma\left( {\bf h}^{(k-1)}_v {\bf W}_\text{self}^{(k)} + \text{Pooling}\left(({\bf h}^{(k-1)}_u)_{\forall u \in \mathcal{N}_v}\right) {\bf W}_\text{neigh}^{(k)} + {\bf b}^{(k)} \right), $$ where ${\bf W}_\text{self}^{(k)} \in \mathbb{R}^{d' \times d}$, ${\bf W}_\text{neigh}^{(k)} \in \mathbb{R}^{d' \times d}$, ${\bf b}^{(k)} \in \mathbb{R}^{1 \times d}$, $d'\in \mathbb{N}$ and $\text{Pooling}$ is a function that outputs a value that does not depend on the order of the inputs.

Pooling Function¶

Given a sequence (input vector) of arbitrary length $n \geq 1$ $$ {\bf x} = (x_1,\ldots,x_n) $$ a permutation $\pi$ over the integers $\{1, 2,\ldots,n\}$ is such that ${\bf x}_\pi = (x_{\pi_1},\ldots,x_{\pi_n})$ is the correspondingly permuted version of ${\bf x}$.

Pooling is such that $$ \text{Pooling}({\bf x}) = \text{Pooling}(\pi \circ {\bf x}), \quad \forall \pi \in \mathbb{S}_n, $$

See our lecture on set representations for a complete list of possible Pooling functions.

Sum pooling example:

Using sum pooling, the resulting filter recursion is: $$ {\bf h}^{(k)}_v = \sigma\left( {\bf h}^{(k-1)}_v {\bf W}_\text{self}^{(k)} + \sum_{\forall u \in \mathcal{N}_v} {\bf h}^{(k-1)}_u {\bf W}_\text{neigh}^{(k)} + {\bf b}^{(k)} \right), $$ where again $\mathcal{N}_v$ are the neighbors of node $v$ in the graph.

This variant is commonly known as GraphConv in the literature, and has been thoroughly studied in Morris et al. 2019.
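
A minimal from-scratch sketch (assuming PyTorch) of this sum-pooling recursion on a dense adjacency matrix; up to where the nonlinearity is applied, it is essentially the same update as the GraphConv layer used later in this notebook:

import torch
import torch.nn as nn

class SumPoolingGNNLayer(nn.Module):
    """One message-passing layer with sum pooling (a sketch of the recursion above)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim, bias=True)    # W_self and b
        self.w_neigh = nn.Linear(in_dim, out_dim, bias=False)  # W_neigh

    def forward(self, H, A):
        # H: (n, d') node embeddings at layer k-1; A: (n, n) dense adjacency matrix.
        # A @ H sums the embeddings of each node's neighbors (sum pooling).
        return torch.relu(self.w_self(H) + self.w_neigh(A @ H))

# Tiny usage example on the 4-node graph from the beginning of the lecture
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 1.],
                  [0., 1., 0., 1.],
                  [0., 1., 1., 0.]])
H0 = torch.ones(4, 3)                 # constant initial node features
layer = SumPoolingGNNLayer(3, 8)
print(layer(H0, A).shape)             # torch.Size([4, 8])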

MPNN (GNN) Variants¶

Most existing Graph Neural Networks are a variant of the above recursion.

Graph Attention Networks (GATs) (Veličković et al., 2018) replace ${\bf h}^{(k-1)}_u$ inside the pooling with $\alpha({\bf h}^{(k-1)}_v,{\bf h}^{(k-1)}_u) {\bf h}^{(k-1)}_u$, where $\alpha(\cdot,\cdot) \in (0,1)$ is an attention mechanism (Bahdanau et al., 2015).

More generally, an edge function can also be added before the set representation function, to better represent the connection between $v$ and its neighbors: $$ {\bf h}^{(k)}_v = \sigma\left( {\bf h}^{(k-1)}_v {\bf W}_\text{self}^{(k)} + \text{Pooling}\!\left(\left(\text{edgefunc}\left({\bf h}^{(k-1)}_v,{\bf h}^{(k-1)}_u\right)\right)_{\forall u \in \mathcal{N}_v}\right) {\bf W}_\text{neigh}^{(k)} + {\bf b}^{(k)} \right), $$ where GAT is just one such example.

Most-expressive graph function¶

  • GNNs assign the same representation to isomorphic graphs.
    • In view of the required constraint that $f$ is invariant to permutations of the vertex ids.
  • Additional desideratum: Different representations to non-isomorphic graphs.
    • Non-isomorphic graphs have different structures, so they should have different representations.
    • A downstream classifier can then still map them to the same output, if needed.

Definition (Most Expressive function): A function is most expressive if it gives the same representation to isomorphic graphs and can give different representations to non-isomorphic graphs.

Q: Are GNNs most-expressive representations for graphs?

A: No, there exist pairs of non-isomorphic graphs that cannot be distinguished by standard GNNs (Xu et al., 2019).

Independently of the learning procedure, certain non-isomorphic graphs will get the same representation.

Example 1: Circular Skip Link (CSL) task (Murphy et al., 2019)

Task: Whole-graph classification for graphs with $M$ vertices. Goal: Predict skip length $L$.

MPNN (GNN) representation power¶

  • If GNNs are not most expressive, is there a way of quantifying their expressive power?
  • Do all GNNs have the same expressive power?

We will see that we can quantify the expressive power of GNNs in terms of the WL test, a well-known heuristic used in graph theory to test for graph isomorphism.

Detour: The WL test¶

Recall our desideratum, which is not obtained by standard GNNs: different representations to non-isomorphic graphs.

Obviously, if a network is able to test for graph isomorphism, then it is able to assign a unique representation to each isomorphism class.

  • But testing for graph isomorphism is not an easy problem.

The graph isomorphism problem is a candidate for being NP-intermediate (i.e., it is not known to be solvable in polynomial time, nor known to be NP-complete).

The WL test is a combinatorial heuristic to test graph isomorphism.

  • If two graphs are distinguished by the WL test, they are non-isomorphic.
  • The contrary is not true!
WL test pseudocode. Given two graphs ${\bf A}_1$, ${\bf A}_2$:

  1. In each graph, assign the same initial color to every node (or start from the given node features, if present).
  2. Repeat until the colors are stable. Update the color of every node as: $$ c_{v, {\bf A}_1}^{(t)} = \text{HASH}\left( c_{v, {\bf A}_1}^{(t-1)}, \{\!\{ c_{u, {\bf A}_1}^{(t-1)} \}\!\}_{u \in \mathcal{N}_v}\right)$$ $$ c_{v, {\bf A}_2}^{(t)} = \text{HASH}\left( c_{v, {\bf A}_2}^{(t-1)}, \{\!\{ c_{u, {\bf A}_2}^{(t-1)} \}\!\}_{u \in \mathcal{N}_v}\right)$$ If $\{\!\{ c_{v, {\bf A}_1}^{(t)} \}\!\}_{v \in {\bf A}_1} \neq \{\!\{ c_{v, {\bf A}_2}^{(t)} \}\!\}_{v \in {\bf A}_2}$, then ${\bf A}_1$ and ${\bf A}_2$ are non-isomorphic.
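
A minimal sketch (assuming NetworkX) of this color-refinement loop with a fixed number of iterations, run on the classic pair of a 6-cycle versus two triangles, which the WL test cannot distinguish:

from collections import Counter
import networkx as nx

def wl_color_histogram(G, iterations=3):
    # Start from a uniform color; refine by hashing (own color, multiset of neighbor colors)
    colors = {v: 0 for v in G.nodes()}
    for _ in range(iterations):
        colors = {
            v: hash((colors[v], tuple(sorted(colors[u] for u in G.neighbors(v)))))
            for v in G.nodes()
        }
    return Counter(colors.values())   # multiset of node colors

G1 = nx.cycle_graph(6)                                         # one 6-cycle
G2 = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))   # two triangles

# Non-isomorphic graphs, yet identical WL color histograms:
print(wl_color_histogram(G1) == wl_color_histogram(G2))        # True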

WL test: example¶

Figure credit Expressive power of graph neural networks and the Weisfeiler-Lehman test

Implications of WL test¶

Theorem (Xu et al. 2019, Morris et al. 2019): the expressive power of GNNs is upper bounded by the WL test.

Remark: since there exist pairs of non-isomorphic graphs that are indistinguishable by the WL test, there exist pairs of non-isomorphic graphs that are indistinguishable by GNNs.

GNNs are at most as expressive as the WL test in distinguishing non-isomorphic graphs.

Let us see an example of non-isomorphic graphs that are indistinguishable by a GraphConv-based GNN.

  • We will use Pytorch Geometric for our GNN implementation.

In the environment you have on the Scholar cluster, install PyG as follows:

pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html
pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
pip install -q git+https://github.com/pyg-team/pytorch_geometric.git

where TORCH is simply the version of PyTorch you have installed, which you can find in Python as:

import torch

print(torch.__version__)
In [1]:
# @title [RUN] Import modules

import os.path as osp

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import scipy.io as sio
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from scipy.special import comb
from sklearn.metrics import accuracy_score
from torch_geometric.data import InMemoryDataset
from torch_geometric.data.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GraphConv, global_add_pool
from torch_geometric.utils import to_networkx, to_undirected
In [2]:
# @title [RUN] Helper functions for plots and visualisations

def get_color(node_emb, hashmap):
    max_val = max(hashmap.values(), default=0) + 1
    crt_node_emb = np.round(node_emb, decimals=2)

    if tuple(crt_node_emb) not in hashmap:
        hashmap[tuple(crt_node_emb)] = max_val
        max_val += 1

    crt_node_emb = hashmap[tuple(crt_node_emb)]

    # Map float number to a color
    crt_color = cm.tab10(crt_node_emb, bytes=True)
    crt_color = (
        crt_color[0] / 255.0,
        crt_color[1] / 255.0,
        crt_color[2] / 255.0,
        crt_color[3] / 255.0,
    )
    return crt_color


def draw_one_graph(
    ax, edges, hash, label=None, node_emb=None, layout=None,
):

    graph = nx.Graph()
    graph.add_edges_from(zip(edges[0], edges[1]))
    node_pos = layout(graph)
    # Add colors according to node embeding
    if node_emb is not None:
        color_map = [
            get_color(node_emb[node_id], hash)
            for node_id in graph.nodes()
        ]

        nx.draw_networkx_nodes(
            graph, node_pos, node_color=color_map, nodelist=graph.nodes(), ax=ax
        )
        nx.draw_networkx_edges(graph, node_pos, ax=ax)
        nx.draw_networkx_labels(graph, node_pos, ax=ax)
    else:
        nx.draw_networkx(graph, node_pos, ax=ax)


def gallery(
    graphs,
    hash=None,
    labels=None,
    node_emb=None,
    max_fig_size=(40, 10),
    layout=nx.layout.kamada_kawai_layout,
):
    if hash is None:
        hash = {}

    num_graphs = len(graphs)
    ff, axes = plt.subplots(
        1, num_graphs, figsize=max_fig_size, subplot_kw={"xticks": [], "yticks": []}
    )
    if num_graphs == 1:
        axes = [axes]
    if node_emb is None:
        node_emb = num_graphs * [None]
    if labels is None:
        labels = num_graphs * [" "]

    for i in range(num_graphs):
        draw_one_graph(
            axes[i],
            graphs[i].edge_index.numpy(),
            hash,
            labels[i],
            node_emb[i],
            layout,
        )
        if labels[i] != " ":
            axes[i].set_title(f"Target: {labels[i]}", fontsize=28)
        axes[i].set_axis_off()

    plt.show()

Let us consider three exemplary graphs chosen from the Graph8c dataset, which contains all the possible connected non-isomorphic simple graphs with 8 nodes.

The goal is to understand whether a GraphConv-based GNN can disambiguate them.

To do so, we are going to assign a different class to each of them, and we are going to place them in our training set. A GNN can disambiguate them if and only if it can learn to correctly predict their targets.

In [3]:
# @title [RUN] `Graph8c` data retrieval
# Let's get three (selected) graphs from the Graph8c dataset from 
# “Breaking the Limits of Message Passing Graph Neural Networks”
# (https://arxiv.org/pdf/2106.04319.pdf)

!wget https://raw.githubusercontent.com/balcilar/gnn-matlang/main/dataset/graph8c/raw/graph8c.g6
--2024-02-08 17:52:16--  https://raw.githubusercontent.com/balcilar/gnn-matlang/main/dataset/graph8c/raw/graph8c.g6
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 77819 (76K) [text/plain]
Saving to: ‘graph8c.g6.2’

graph8c.g6.2        100%[===================>]  76.00K  --.-KB/s    in 0.05s   

2024-02-08 17:52:16 (1.55 MB/s) - ‘graph8c.g6.2’ saved [77819/77819]

In [4]:
IDS = [379, 4625, 4657]


class Graph8cDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(Graph8cDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ["graph8c.g6"]

    @property
    def processed_file_names(self):
        return "data.pt"

    def download(self):
        # Download to `self.raw_dir`.
        pass

    def process(self):
        # Read data into huge `Data` list.
        dataset = nx.read_graph6(self.raw_file_names[0])
        data_list = []
        for i, datum in enumerate(dataset):
            if i not in IDS:
                continue
            x = torch.ones(datum.number_of_nodes(), 1)
            edge_index = to_undirected(
                torch.tensor(list(datum.edges())).transpose(1, 0)
            )
            data_list.append(
                Data(edge_index=edge_index, x=x, y=torch.tensor(IDS.index(i)))
            )

        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
In [5]:
ds = Graph8cDataset(root="tmp/Graph8c")
G1, G2, G3 = ds[0], ds[1], ds[2]

What are those graphs?¶

Let us assign a different color for every different input feature.

In [6]:
gallery(
    [G1, G2, G3],
    labels=np.array([G1.y.item(), G2.y.item(), G3.y.item()]),
)

Now it is time to implement our GNN and train it to distinguish these three graphs.

So far so good; it seems like an easy task.

In [7]:
class MLP(nn.Module):
    """A simple feed forward neural network"""

    def __init__(self, in_dim, emb_dim, num_layers=2):
        super(MLP, self).__init__()
        layer_list = []
        layer_list.append(torch.nn.Linear(in_dim, emb_dim))

        for _ in range(num_layers - 1):
            layer_list.append(torch.nn.BatchNorm1d(emb_dim))
            layer_list.append(torch.nn.ReLU())
            layer_list.append(torch.nn.Linear(emb_dim, emb_dim))

        self.layers = torch.nn.Sequential(*layer_list)

    def forward(self, x):
        return self.layers(x)
In [8]:
class GraphConvNet(torch.nn.Module):
    def __init__(self, in_dim, emb_dim, out_dim, num_layers, num_final_layers=1):
        super(GraphConvNet, self).__init__()

        self.convs = torch.nn.ModuleList()

        for layer in range(num_layers):
            self.convs.append(
                GraphConv(
                    emb_dim if layer != 0 else in_dim,
                    emb_dim,
                )
            )

        self.pool = global_add_pool
        self.final_layers = MLP(emb_dim, out_dim, num_layers=num_final_layers)

    def forward(self, data):
        h_node = data.x
        for conv in self.convs:
            h_node = F.relu(conv(h_node, data.edge_index))
        h_graph = self.pool(h_node, data.batch)
        return self.final_layers(h_graph), h_node.detach()
In [9]:
# @title [RUN] Hyperparameters GNNs

BATCH_SIZE = 128  # @param {type:"integer"}
NUM_EPOCHS = 25  # @param {type:"integer"}
HIDDEN_DIM = 64  # @param {type:"integer"}
NUM_LAYERS = 4  # @param {type:"integer"}
LR = 0.001  # @param {type:"number"}
SEED = 42  # @param {type:"integer"}


DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
In [10]:
# @title [RUN] Set the seed

torch.manual_seed(SEED)
np.random.seed(SEED)
In [11]:
def train(train_loader, model, optimiser, loss_fn, metric_fn):
    """Train model for one epoch"""
    model.train()

    total_loss = 0
    num_graphs = 0
    for data in train_loader:
        optimiser.zero_grad()

        data = data.to(DEVICE)
        y_hat, _ = model(data)
        loss = loss_fn(y_hat, data.y)

        loss.backward()
        optimiser.step()

        total_loss += loss.item() * len(data.y)
        num_graphs += len(data.y)

    return total_loss / num_graphs


def evaluate(loader, model, metric_fn):
    """Evaluate model on dataset"""
    y_pred, y_true = [], []
    model.eval()
    for data in loader:
        data = data.to(DEVICE)
        y_hat, _ = model(data)

        y_pred.append(y_hat.detach().cpu())
        y_true.append(data.y.detach().cpu())

    y_pred = torch.cat(y_pred, dim=0)
    y_true = torch.cat(y_true, dim=0)

    return metric_fn(y_true, y_pred)
In [12]:
def run(
    model,
    train_loader,
    loaders,
    loss_fn,
    metric_fn,
    use_scheduler=False,
    print_steps=True,
):
    """Train the model for NUM_EPOCHS epochs"""

    # Instantiate optimiser and scheduler
    optimiser = optim.Adam(model.parameters(), lr=LR)
    scheduler = (
        optim.lr_scheduler.StepLR(optimiser, step_size=DECAY_STEP, gamma=DECAY_RATE)
        if use_scheduler
        else None
    )

    curves = {name: [] for name in loaders.keys()}

    for epoch in range(NUM_EPOCHS):

        train_loss = train(
            train_loader, model, optimiser, loss_fn, metric_fn
        )
        if scheduler is not None:
            scheduler.step()

        for name, loader in loaders.items():
            curves[name].append(evaluate(loader, model, metric_fn))

        if print_steps:
            print(
                f"[Epoch {epoch}]",
                f"train loss: {train_loss:.6f}",
                end=" ",
            )
            for name, metric in curves.items():
                print(f"{name} metric: {metric[-1]:.3f}", end=" ")
            print("\n")
    return curves
In [13]:
# Instantiate our models
graphconv_model = GraphConvNet(
    in_dim=ds.num_features,
    emb_dim=HIDDEN_DIM,
    out_dim=ds.num_classes,
    num_layers=NUM_LAYERS,
).to(DEVICE)
In [14]:
# Create the Dataloader
train_loader = DataLoader(ds, BATCH_SIZE, shuffle=True)
In [15]:
# Train our GNN model
_ = run(
    graphconv_model,
    train_loader,
    {"train": train_loader},
    loss_fn=F.cross_entropy,
    metric_fn=lambda x, y: accuracy_score(x, y.argmax(-1)),
)
[Epoch 0] train loss: 7.958117 train metric: 0.333 

[Epoch 1] train loss: 3.211746 train metric: 0.333 

[Epoch 2] train loss: 2.457222 train metric: 0.333 

[Epoch 3] train loss: 1.130873 train metric: 0.333 

[Epoch 4] train loss: 1.671703 train metric: 0.333 

[Epoch 5] train loss: 1.296378 train metric: 0.667 

[Epoch 6] train loss: 0.790557 train metric: 0.667 

[Epoch 7] train loss: 0.870086 train metric: 0.333 

[Epoch 8] train loss: 0.936117 train metric: 0.333 

[Epoch 9] train loss: 0.965901 train metric: 0.333 

[Epoch 10] train loss: 0.866404 train metric: 0.667 

[Epoch 11] train loss: 0.778652 train metric: 0.667 

[Epoch 12] train loss: 0.634940 train metric: 0.667 

[Epoch 13] train loss: 0.609730 train metric: 0.667 

[Epoch 14] train loss: 0.646420 train metric: 0.667 

[Epoch 15] train loss: 0.644886 train metric: 0.667 

[Epoch 16] train loss: 0.573604 train metric: 0.667 

[Epoch 17] train loss: 0.552052 train metric: 0.667 

[Epoch 18] train loss: 0.556559 train metric: 0.667 

[Epoch 19] train loss: 0.563795 train metric: 0.667 

[Epoch 20] train loss: 0.552862 train metric: 0.667 

[Epoch 21] train loss: 0.539810 train metric: 0.667 

[Epoch 22] train loss: 0.519177 train metric: 0.667 

[Epoch 23] train loss: 0.504517 train metric: 0.667 

[Epoch 24] train loss: 0.505564 train metric: 0.667 

Since the accuracy plateaus at 2/3, the GNN model is able to distinguish only 2 of the 3 graphs.

  • And since we had 3 graphs with 3 distinct labels, this suggests that 2 of them are indistinguishable to the model (they receive the same representation, hence the same prediction).

Let us visualise our three graphs, and use a different color for every different node representation learnt by our GNN.

In [16]:
_, G1_h_nodes = graphconv_model.to("cpu")(G1)
_, G2_h_nodes = graphconv_model.to("cpu")(G2)
_, G3_h_nodes = graphconv_model.to("cpu")(G3)

stateful_hash = {}
gallery(
    [G1, G2, G3],
    hash=stateful_hash,
    labels=np.array([G1.y.item(), G2.y.item(), G3.y.item()]),
    node_emb=[G1_h_nodes.numpy(), G2_h_nodes.numpy(), G3_h_nodes.numpy()],
)

Our GNN model is unable to distinguish the second and the third graph!

Indeed, you can see that their nodes have the same colors or, more precisely, that the two graphs have the same multiset of node colors (the same number of nodes of each color).

Thus any graph representation obtained from these node representations will be the same.

We have seen that any standard GNN is at most as powerful as the WL test.

Does it mean that also the WL test is unable to distinguish these two non-isomorphic graphs?

In [17]:
hash2 = nx.weisfeiler_lehman_graph_hash(to_networkx(G2), iterations=6)
hash3 = nx.weisfeiler_lehman_graph_hash(to_networkx(G3), iterations=6)

hash2 == hash3
Out[17]:
True

Indeed, the WL test also fails to tell them apart!

Actually, GraphConv is as powerful as the WL test (Morris et al., 2019).

But note that this is not true for all GNNs, as some are even less powerful.

Q: Is there a way to go beyond this expressivity limit?

Expressive GNNs¶

There are many existing works that try to increase the expressive power of GNNs, trading off expressive power, complexity and generalisation capabilities.

Higher order GNNs (Morris et al., 2019, Maron et al., 2019a,b,c)¶

Inspired by, or connected to, higher-order versions of the WL test (namely, the $k$-WL test).

While the WL test updates the color of individual nodes, the $k$-WL test updates the color of tuples of $k$ nodes.

  • Morris et al., 2019 directly construct neural counterparts of the k-WL test.

  • Maron et al., 2019a,b,c characterize invariant and equivariant layers for learning on graphs.

    • The resulting architectures (which have a dependency on $k$) have been shown to have the same expressive power as the corresponding $k$-WL test (Maron et al., 2019c, Geerts, 2020).

We will focus on the latter, providing a characterization of the collection of permutation invariant and equivariant linear layers acting on graph data.

G-equivariant layers for learning on graphs (Maron et al., 2019a)¶

Consider our adjacency tensor ${\bf A} \in \mathbb{A}^{n\times n}$.

Given a matrix ${\bf X} \in \mathbb{R}^{a\times b}$, denote $\text{vec}({\bf X}) \in \mathbb{R}^{a b \times 1}$ its column stack.

In the following we show how to obtain equivariant linear layers.

  • Those layers are applied to order-2 data, such as our adjacency tensor ${\bf A} \in \mathbb{A}^{n \times n}$.
  • However, they can be extended to general order-$k$ data ${\bf A} \in \mathbb{A}^{n^k}$.
  • It can be shown that stacking those layers leads to architectures that are universal, for sufficiently large $k$.

G-equivariant layers¶

Let ${\bf W} \in \mathbb{R}^{n^2 \times n^2}$ denote our layer weights (acting on the vectorized adjacency matrix).

Goal: ${\bf W}$ should be permutation equivariant.

Let $\pi \in \mathbb{S}_n$ be an element of the symmetric group of order $n$ (a permutation) and let $\rho: \mathbb{S}_n \to \mathbb{R}^{n^2 \times n^2}$ be a representation of $\pi$ that acts on the (vectorized) adjacency matrix to permute it, i.e., $\rho(\pi)\text{vec}({\bf A})$ is a (vectorized) permutation of ${\bf A}$.

Then, from our G-invariance and set lectures, we know that G-equivariance implies $$\begin{equation}\tag{8} \label{eq:eq} \forall \pi \in \mathbb{S}_n, \forall {\bf A} \in \mathbb{A}^{n \times n}:\quad \rho(\pi){\bf W}\text{vec}({\bf A}) = {\bf W}\rho(\pi)\mathrm{vec}({\bf A}) \end{equation}$$ where $\mathrm{vec}$ flattens the matrix into a vector.

Since Equation $(\ref{eq:eq})$ is true for all ${\bf A}$, we can impose the following: $$ \rho(\pi){\bf W} = {\bf W}\rho(\pi). $$

We can now use the Kronecker product property we saw in our G-equivariance lecture and in our set lecture to write $$ \rho(\pi){\bf W} = {\bf W}\rho(\pi) \implies \left(\rho(\pi)\otimes \rho(\pi^{-1})^T\right)\mathrm{vec}({\bf W}) = \mathrm{vec}({\bf W}). $$ For any permutation $\pi \in \mathbb{S}_n$, the permutation matrix is such that $\rho(\pi^{-1}) = \rho(\pi)^T$. Then, $$ \rho(\pi){\bf W} = {\bf W}\rho(\pi) \implies \left(\rho(\pi)\otimes \rho(\pi)\right)\mathrm{vec}({\bf W}) = \mathrm{vec}({\bf W}). $$

Finally, we need to compute the Reynolds operator $$\bar{P} = \frac{1}{n!}\sum_{\pi \in \mathbb{S}_n} \rho(\pi)\otimes \rho(\pi).$$ And we know from our G-invariance lecture that the Reynolds operator guarantees $\bar{P} \left(\rho(\pi)\otimes \rho(\pi)\right) = \bar{P}$, $\forall \pi \in \mathbb{S}_n$.

We now define the subspace of neuron parameters that will make our (linear) neural layer equivariant.

As seen in the G-invariant lecture, weights $$\mathrm{vec}({\bf W}) \in \text{Left-1-Eig}(\bar{P}) = \{\mathrm{vec}({\bf W}') \in \mathbb{R}^{n^4} : \mathrm{vec}({\bf W}')^T \bar{P} = \mathrm{vec}({\bf W}')^T \}$$ define an equivariant subspace.

Moreover, we know $\text{Left-1-Eig}(\bar{P})$ is spanned by the left eigenvectors of $\bar{P}$ with eigenvalue 1, therefore our neuron weights are defined as $$ \mathrm{vec}({\bf W})^T = \sum_{i=1}^k \alpha_i {\bf v}_i^T, $$ where ${\bf v}_1^T,\ldots,{\bf v}_k^T$ are the $k$ left eigenvectors of $\bar{P}$ with eigenvalue 1 ($k$ being the rank of $\bar{P}$).

For example, the 1-left eigenvector basis for ${\bf A} \in \mathbb{A}^{n \times n}$ for $n = 5$ can be visualized as

where the number of basis elements is $\text{bell}(k + l) = \text{bell}(2+2) = 15$, with $k=l=2$ the tensor orders of the input and output, and $\text{bell}(n)$ denotes the $n$-th Bell number.

Figure credit (Maron et al. 2019a).
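
As a quick check of this count, here is a small sketch (using the standard Bell-triangle recurrence; not part of the lecture code) that computes the Bell numbers:

def bell(n):
    # Bell triangle: bell(n) is the number of partitions of an n-element set
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
    return row[-1]

print([bell(k) for k in range(1, 6)])   # [1, 2, 5, 15, 52]; bell(4) = 15 as above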

Note that this result can be extended to find, analogously, the 1-left-eigenvector basis for order-$k$ data ${\bf A} \in \mathbb{A}^{n^k}$.

  • Note: $k$-th order tensor ${\bf A} \in \mathbb{A}^{n ^k}$ encodes hyper-edge-values, where $\mathbf{A}_{i_1,...,i_k}$ represents the value of the hyper-edge $(i_1, \dots, i_k)$.
  • For sufficiently large $k$, architectures obtained by stacking equivariant layers are universal.

Applicability of higher order GNNs (including Maron et al., 2019a,b,c)¶

We have seen how to obtain invariant and equivariant linear layers, which can be used to construct Higher-order GNNs, our first attempt at more expressive GNNs.

  • Pros:
    • Expressivity of the resulting architecture increases as $k$ increases.
    • Universality is obtained for sufficiently large $k$.
  • Cons:
    • Computationally intractable for $k > 3$.

Feature augmentation approaches¶

The message-passing paradigm of GNNs is limited by its anonymity (no sender info).

Thus, these approaches solve the problem by augmenting the node features with unique identifiers.

Figure credit (Sato et al. 2021).

Several methods fall in this category. We will expand on two.

  1. Identifiers as node features + average (Murphy et al. 2019)

    Run a standard GNN model $f$ on the original graph augmented with randomly assigned one-hot node ids as additional features. This gives a positional node representation by breaking symmetries. Then, use symmetrization to get a structural node representation from the positional ones (a sketch follows below):
    $$ {\bf h}_v = \frac{1}{n!} \sum_{\pi \in \mathcal{S}_n} f({\bf A} \Vert (\pi \circ I_n), v),$$ where ${\bf A} \Vert (\pi \circ I_n)$ concatenates the given features with the additional permuted node ids (thus, concatenates on the diagonal of ${\bf A}$).

  • Pros:

    • It can be proved that they are universal (Murphy et al, 2019), and thus can approximate every function defined on graphs of any order.
  • Cons:

    • Computationally expensive (but tractable solutions can be obtained by trading off expressive power, exactly as we saw for Janossy Pooling in our set lecture.)

Note that in practice this approach is simply applying the Reynolds operator to the permutation group $\mathbb{S}_n$, as we have seen in the G-invariant lecture and in the set lecture.
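
A minimal sketch (assuming PyTorch Geometric and the GraphConvNet, dataset, and hyperparameters defined earlier in this notebook) of this idea, where the intractable sum over all $n!$ permutations is replaced by a few sampled permutations of the one-hot ids, i.e., the tractability trade-off mentioned above, so the result is only approximately invariant (and the model would, of course, still need to be trained):

import torch
from torch_geometric.data import Data

def rp_node_embeddings(model, data, num_samples=8):
    # Approximate the average over S_n with a few sampled permutations of one-hot ids
    n = data.num_nodes
    outs = []
    for _ in range(num_samples):
        pi = torch.randperm(n)
        ids = torch.eye(n)[pi]                        # permuted one-hot node ids
        aug = Data(x=torch.cat([data.x, ids], dim=-1),
                   edge_index=data.edge_index, y=data.y)
        _, h_node = model(aug)                        # positional embeddings for this sample
        outs.append(h_node)
    return torch.stack(outs).mean(dim=0)              # approximately structural embeddings

# The model must expect the augmented input dimension: original features + n id channels
rp_model = GraphConvNet(in_dim=ds.num_features + G1.num_nodes,
                        emb_dim=HIDDEN_DIM, out_dim=ds.num_classes,
                        num_layers=NUM_LAYERS)
print(rp_node_embeddings(rp_model, G1).shape)          # torch.Size([8, 64]) with the hyperparameters above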

  2. Random node features (Abboud et al., 2021; Sato et al., 2021)

    Run a standard GNN model $f$ on the original graph augmented with additional randomized initial node features $R$, one for every node $v$, $R_v \sim \mathcal{D}$, $R_v \in \mathbb{R}^d$:
    $${\bf h}_v = f({\bf A} \Vert R, v),$$ where ${\bf A} \Vert R$ concatenates given features with the additional random ones (thus, concatenates on the diagonal of ${\bf A}$).

Random values can act as ids of the nodes, and can "always" make nodes distinguishable. But this also makes the node embeddings positional. As we will see in our positional node embedding lecture, averaging over the distribution of random features (and node permutations) would allow us to recover structural node embeddings.
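
A minimal sketch (assuming PyTorch Geometric and the graphs loaded earlier in this notebook) of the augmentation itself: concatenate a few random channels to the node features before running a standard GNN whose input dimension accounts for them:

import torch

def append_random_features(data, d=4):
    # R_v ~ N(0, I_d) for every node v, concatenated to the given node features
    data = data.clone()
    data.x = torch.cat([data.x, torch.randn(data.num_nodes, d)], dim=-1)
    return data

aug = append_random_features(G1, d=4)
print(aug.x.shape)   # torch.Size([8, 5]); feed this to a GNN with in_dim = ds.num_features + 4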

  • Pros:

    • Same computational complexity as standard GNNs.
    • It can be proved that they are universal (Abboud et al, 2021), and thus can approximate every function defined on graphs of any order.
  • Cons:

    • The resulting function $f$ (taking as input ${\bf A}$ augmented with random node ids) is invariant to permutations of the vertex ids only in expectation.
    • In practice: Unless the support of the random features is small, these are hard to train, hard to converge, and hard to generalize.

Substructure-aware models¶

The inability of GNNs to distinguish non-isomorphic graphs is directly tied to the inability to count substructures.

Theorem (Chen et al. 2020): GNNs cannot perform induced-subgraph-count of substructures consisting of 3 or more nodes.

Thus:

  1. Count the number of substructures each node belongs to as a pre-processing step.

  2. Use the counts to augment node features (Bouritsas et al. 2022) or to guide the message passing (Bodnar et al. 2021).

Figure credit Bouritsas et al. 2022.
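
A minimal sketch (assuming NetworkX and PyTorch Geometric, and the graphs loaded earlier in this notebook) of the pre-processing step for one substructure, counting the triangles each node belongs to and appending the counts as node features:

import torch
import networkx as nx
from torch_geometric.utils import to_networkx

def append_triangle_counts(data):
    # nx.triangles returns, for each node, the number of triangles it belongs to
    G = to_networkx(data, to_undirected=True)
    counts = torch.tensor([nx.triangles(G, v) for v in range(data.num_nodes)],
                          dtype=torch.float).unsqueeze(-1)
    data = data.clone()
    data.x = torch.cat([data.x, counts], dim=-1)
    return data

print(append_triangle_counts(G2).x.T)   # second row: per-node triangle counts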

  • Pros:
    • More expressive than standard GNNs
    • Might be tractable.
  • Cons:
    • Domain knowledge is required to choose which substructures to count.

Subgraph-based architectures¶

View each graph as a bag of subgraphs, obtained through a deterministic function (called a subgraph selection policy).

Employ an architecture (or encoder) returning a node representation that is invariant to permutations of the node ids and invariant to permutations of the elements in the bag.

For example, a valid architecture respecting those symmetries consists of a GNN applied to each subgraph independently, followed by a pooling function that aggregates the results.
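
Concretely, here is a minimal sketch (assuming PyTorch Geometric and the graphconv_model trained earlier in this notebook) with one simple subgraph selection policy, node deletion: run the same GNN on every node-deleted subgraph and sum the resulting graph-level outputs into a single, bag-invariant representation:

import torch
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

def node_deletion_bag_embedding(model, data):
    # Bag of subgraphs obtained by deleting one node at a time (node-deletion policy)
    outs = []
    for v in range(data.num_nodes):
        keep = torch.tensor([u for u in range(data.num_nodes) if u != v])
        edge_index, _ = subgraph(keep, data.edge_index, relabel_nodes=True,
                                 num_nodes=data.num_nodes)
        sub = Data(x=data.x[keep], edge_index=edge_index)
        graph_out, _ = model(sub)          # GNN applied to each subgraph independently
        outs.append(graph_out)
    return torch.stack(outs).sum(dim=0)    # sum pooling over the bag: order invariant

print(node_deletion_bag_embedding(graphconv_model, G2).shape)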

Several variants exist (Cotta et al., 2021; Zhang and Li, 2021; You et al., 2021; Bevilacqua et al., 2022; Zhao et al., 2022; Papp et al., 2021; Frasca et al., 2022; Qian et al., 2022).

  • Pros:
    • More expressive.
    • No domain knowledge is required, assuming domain agnostic subgraph selection policies.
  • Cons:
    • Computationally expensive for large graphs, where the number of subgraphs might be too large.

Positional Node Representations¶

Different from structural node representations, positional node representations are not generally invariant to permutations.

For instance, the Singular Value Decomposition of $A$ gives positional node representations, which are markedly different from the structural ones:


A General view of Matrix Factorization (Positional) Node Embeddings¶

Example:

Matrix Factorization as Metric Embedding of Nodes¶

Consider an undirected graph $G$ (then ${\bf U} = {\bf V} = {\bf Z}$)

Positional node embeddings through invariant theory (Srinivasan & Ribeiro, 2020)¶

Definition (Positional node representations [Srinivasan & Ribeiro, 2020]): Positional node representations of a graph $G$ are defined as joint samples of random variables $(\mathbf{Z}_i)_{i\in V}|{\bf A} \sim p(\cdot |{\bf A})$, $\mathbf{Z}_i \in \mathbb{R}^{d}$, $d \geq 1$, where $p(\cdot|{\bf A})$ is a $\mathbb{S}_n$-equivariant probability distribution on ${\bf A}$. That is, $$(Z_1,\ldots,Z_n)|{\bf A} \stackrel{d}{=} (Z_{\pi(1)},\ldots,Z_{\pi(n)})|(\pi \circ {\bf A}), \quad \pi \in \mathbb{S}_n,$$ where $\stackrel{d}{=}$ means equivalent in distribution.

Intrinsic Randomness in Positional Representations¶

Positional representations are often obtained via matrix factorization (MF), which decomposes the adjacency matrix ${\bf A}$ into a matrix of $k$-dimensional node factors $\Phi \in \mathbb{R}^{k \times n}$ (one column per node).

  • Example: The spectral decomposition of a symmetric graph with adjacency matrix ${\bf A}$ is $${\bf A} = \Psi \Lambda \Psi^T, $$ where $\Lambda$ is the diagonal matrix of eigenvalues of ${\bf A}$ and $\Psi$ is an orthonormal matrix holding its eigenvectors.

Q: What is the impact of a permutation ${\bf B} = \pi \circ {\bf A}$ on a matrix factorization?

  • In what follows we will see that positional node embeddings may be permutation sensitive since the node order influences the algorithm output.
  • Let's consider spectral decompositions of adjacency matrices of undirected graphs.

Impact of permutations on spectral decompositions¶

  • Assume $G$ is undirected, so that its adjacency matrix ${\bf A}$ is a symmetric matrix.
  • Consider two graphs $G$ and $H$. We want to test if they are isomorphic. Let ${\bf A}$ and ${\bf B}$ be the adjacency matrices of $G$ and $H$, respectively.
    • If $G$ and $H$ are isomorphic, ${\bf A}$ and ${\bf B}$ must have the same eigenvalues.
    • However, there are many pairs of graphs that are non-isomorphic but which have the same eigenvalues.
  • Assume that ${\bf A}$ and ${\bf B}$ have the same eigenvalues $\Lambda$, and that each of these eigenvalues has multiplicity 1, i.e., they are distinct (for simplicity).
    • First, note that if ${\bf S}$ is a diagonal matrix with $\pm 1$ entries on its diagonal, then $${\bf A} = \Psi \Lambda \Psi^T = \Psi {\bf S} \Lambda {\bf S} \Psi^T,$$ since ${\bf S} = {\bf S}^T$ and ${\bf S}^2 = I$, where $I$ is the identity matrix.
    • Since ${\bf B}$ has the same eigenvalues, we can write $${\bf B} = \Phi \Lambda \Phi^T.$$ Then, $\exists \pi \in \mathbb{S}_n$ such that $$ \pi \circ {\bf B} = \rho(\pi) \Phi \Lambda \Phi^T \rho(\pi)^T = \Psi \Lambda \Psi^T = {\bf A},$$ where $\rho(\pi)$ is the group representation of $\pi$ that permutes the rows and columns of $n \times n$ adjacency matrices.
    • As each entry of $\Lambda$ is distinct, one might be excused for thinking that this implies $\rho(\pi)\Phi = \Psi$.
    • But the eigenvectors (columns of $\Psi$ and $\Phi$) are only determined up to sign. So, this implies $$\rho(\pi)\Phi = \Psi {\bf S},$$ where ${\bf S}$ is a diagonal matrix with $\pm 1$ entries on its diagonal (see the small numerical sketch after this list).
  • This means that if we factorize the permuted $\pi \circ {\bf A}$, the new ${\bf S}'$ diagonal matrix is not guaranteed to simply be a permutation ${\bf S}' = \pi \circ {\bf S}$, i.e., we could have ${\bf S}'\neq \pi \circ {\bf S}$.
    • The $n$ values of ${\bf S}$ are defined by the factorization algorithm arbitrarily.
      • There are $2^n$ distinct matrices ${\bf S}$. And this is one of the reasons why determining if two graphs are isomorphic is not an easy task.
      • The other reason is the presence of isomorphic nodes, which means repeated eigenvalues (which we will not cover here).
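
A small numerical sketch (assuming NumPy) of this point: factorizing ${\bf A}$ and a permuted copy gives eigenvectors that agree only up to an arbitrary $\pm 1$ sign matrix ${\bf S}$ chosen by the factorization routine:

import numpy as np

n = 5
# Path graph 1-2-3-4-5: its adjacency matrix has n distinct eigenvalues
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

rng = np.random.default_rng(0)
pi = rng.permutation(n)
P = np.eye(n)[pi]                # rho(pi)
B = P @ A @ P.T                  # an isomorphic (relabeled) copy of A

_, Psi = np.linalg.eigh(A)       # A = Psi Lambda Psi^T
_, Phi = np.linalg.eigh(B)       # B = Phi Lambda Phi^T

# Un-permuting Phi recovers Psi only up to a diagonal +-1 sign matrix S
signs = np.sign(np.sum(Psi * (P.T @ Phi), axis=0))   # diag(Psi^T rho(pi)^T Phi)
print(np.allclose(P.T @ Phi, Psi * signs))           # True (eigenvalues are distinct here)
print(signs)                                         # the arbitrary +-1 choices made by eigh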

Isomorphic nodes and further randomness of positional embeddings¶

Now consider the following example:

Assume our graph is our Arctic food web, with adjacency matrix ${\bf A}$.

Consider the node representations of the food web using the left singular vectors obtained by applying SVD to ${\bf A}$.

  • Since ${\bf A}$ is symmetric, these are just the eigenvectors $\Psi$ of ${\bf A}$.

  • In the figure, the color of node $v \in V$ is an RGB characterization of its eigenvector $\Psi_{\cdot,v}$.

Now consider the permutation $\pi \in \mathbb{S}_n$ such that ${\bf B} = \pi\circ {\bf A} = {\bf A}$ is the permuted adjacency matrix where the land and sea components are now swapped.

  • Orca and Lynx swapped their ids, Penguin swapped its id with the Ground Squirrel, etc.

Now we run the same SVD algorithm on ${\bf B}$ (recall ${\bf B} = {\bf A}$):

  • While the $n$ values of ${\bf S}$ are defined by the factorization algorithm arbitrarily, the same algorithm will likely output the same ${\bf S}$ given the same input.
  • Hence, we get the same eigenvectors $\Psi$, but their association to the animals is different:
  • Note that compared to our previous example, now the Orca traded eigenvectors with the Lynx.
    • More generally, all land animals have traded eigenvectors with their isomorphic sea animals.
    • The eigenvector assigned to an animal depends on its node id.
  • Despite assuming the factorization algorithm is deterministic:
    • If we assume node ids $V = \{1,\ldots,n\}$ are given to the animals arbitrarily, then the eigenvector given by the deterministic algorithm to isomorphic nodes is arbitrary.
    • Note (further arbitrary choices): The above is not even taking into account the fact that isomorphic nodes imply eigenvalue multiplicity greater than one, which makes the eigenvectors in $\Psi$ not unique even regardless of ${\bf S}$.

Matrix Factorization as a Sampled Positional Node Embeddings¶

  • Consider the spectral decomposition of ${\bf A}$ is $${\bf A} = \Psi \Lambda \Psi^T, $$ where $\Lambda$ is the diagonal matrix of eigenvalues of ${\bf A}$ and $\Psi$ is an orthonormal matrix holding its eigenvectors.
  • You should have by now realized that the eigenvector of node $v \in V$, $\Psi_{\cdot,v}$, satisfies our definition of a positional node embedding.
    • The eigenvector assigned to a specific node depends on the node id and other arbitrary decisions of the factorization algorithm (e.g., ${\bf S}$)
  • This permutation sensitivity makes it difficult to use positional embeddings in node classification tasks.
  • It also creates difficulties in link prediction tasks, say using a link function $$P({\bf A}_{uv} = 1|\Psi_{\cdot,u},\Psi_{\cdot,v}) \approx f(\Psi_{\cdot,u},\Psi_{\cdot,v})$$ since ${\bf S}$ may change with some permutation $\pi \in \mathbb{S}_n$.
    • In the presence of isomorphic nodes, the challenge is even greater, since the eigenvectors are not unique.

Marginal distribution¶

Since permuting ${\bf A}$ results in different positional embeddings of the nodes, then positional embeddings can be interpreted as samples from a joint distribution.

Let us consider the marginals of this joint distribution. The parameters of the marginal distribution are related to structural representations of the nodes.

Definition: Given a graph $G = (V, E)$ we say that two nodes $u, v \in V$ are isomorphic (denoted as $u \simeq v$) if they have symmetric positions in the graph.

  • More formally, we say two nodes $u,v \in V$, $u\neq v$, are isomorphic (denoted $v \simeq u$) if $\exists \pi \in \mathbb{S}_n$ that is not the identity, $\pi \neq (1,\ldots,n)$, such that ${\bf A} = \pi \circ {\bf A}$ and $\pi(u) = v$, $\pi(v) = u$.

In the figure (Zhang et al., 2021) $ v_1 \simeq v_4 $, $ v_2 \simeq v_3 $

Detour: Distributions and their samples¶

Consider a bivariate normal distribution $(X_1, X_2) \sim \mathcal{N}\left(\boldsymbol{0}, \begin{bmatrix} 1 & \sigma \\ \sigma & 1\end{bmatrix}\right)$, where $\sigma \in [-1, 1]$ is the correlation between $X_1$ and $X_2$.

Note that from the marginal distributions we cannot estimate the parameter $\sigma$, which requires joint samples $(x_1, x_2)$ for it to be estimated.
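
A small sketch (assuming NumPy) of this point: the two marginals look like standard normals regardless of $\sigma$, so only joint samples reveal it:

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.8
cov = np.array([[1.0, sigma], [sigma, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)

# Each marginal has mean ~0 and variance ~1 for any value of sigma
print(samples.mean(axis=0), samples.var(axis=0))

# Only the joint samples recover sigma (the off-diagonal of the covariance)
print(np.cov(samples.T)[0, 1])   # ~0.8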

Link Prediction Task¶

A link prediction task is a joint task between two nodes via link function $\rho(Z_u, Z_v)$, where $Z_u, Z_v$ denote embeddings of nodes $u, v$ sampled jointly.

As we have seen in the bivariate case, marginal distributions are insufficient to solve a joint task. Therefore, GNNs may fail in solving link prediction tasks.

$\text{GNN}(G)_{\text{Lynx}} = \text{GNN}(G)_{\text{Orca}}$ and therefore $\rho(\text{GNN}(G)_{\text{Hare}}, \text{GNN}(G)_{\text{Lynx}}) = \rho(\text{GNN}(G)_{\text{Hare}}, \text{GNN}(G)_{\text{Orca}})$

However, the permutation sensitivity in matrix factorization algorithms makes it difficult to practically use them with a final link function for link prediction tasks (unless we can ensure invariance to permutations in the link function).

So how can we predict links without being sensitive to permutations?

Joint structural representations¶

Definition (Joint structural representations): Joint structural node set representations of an $n\times n$ adjacency matrix ${\bf A}$ (or $n\times n \times k$, $k \geq 2$, adjacency tensor ${\bf A}$) are obtained from a function $f : \mathbb{R}^{n\times n} \to \mathbb{R}^{|P^{\star}(V)| \times d}$, with $P^{\star}(V)$ the power set excluding the empty set, where isomorphic node sets (denoted by $\simeq$) obtain identical representations, i.e., $$ s_1 \simeq s_2 \implies (f({\bf A}))_{s_1,\cdot} = (f({\bf A}))_{s_2,\cdot},$$ where $s_1, s_2 \in P^{\star}(V)$ and $(f({\bf A}))_{s_2,\cdot}$ denotes the row of $f({\bf A})$ corresponding to $s_2$.

Example of a 3-node representation

Definition (Most Expressive Joint Structural Representations): We say that the joint structural representations of a node set are most expressive iff $$ s_1 \simeq s_2 \Leftrightarrow (f({\bf A}))_{s_1,\cdot} = (f({\bf A}))_{s_2,\cdot}.$$

Theorem (Srinivasan & Ribeiro, 2020) (informal): For a graph $G$, given a sequence of node sets $\vec{S}=(s_1,\ldots,s_n),\;s_i \in P^{\star}(V)$ and their associated random variables as a sequence $\vec{Y}=(y_1,\ldots,y_n)$, if for all pairs of isomorphic node sets $s_i, s_j$ in $\vec{S}$, their corresponding random variables $y_i, y_j$ are equal in distribution (i.e., $y_i \stackrel{d}{=} y_j$), then there exists a function $g$ and a most expressive joint structural representation $f$ s.t. $$g((f({\bf A}))_{s_i,\cdot},\epsilon) \stackrel{\text{a.s.}}{=} y_i,$$ where $\epsilon$ is a pure source of random noise.

Corollary (Srinivasan & Ribeiro, 2020) (informal): While most expressive structural node representations independently are not sufficient to perform link prediction, the most expressive joint structural representations (of a pair of nodes) are sufficient.

Thanks to the corollary, most expressive joint structural representations can be used to predict links.

Joint structural representations for link prediction¶

Essentially, from the above results we can conclude that:

  1. Positional representations are simply samples from the joint structural distribution
  2. Joint structural representations are connected to the parameters of the distribution.

Importantly, we can use joint structural representations for link prediction.

Double Equivariance: Graphs Beyond Node Permutation Invariances¶

  • Consider a graph with discrete edge attributes (e.g., a Knowledge Graph)
  • Let $\mathbb{S}_n$ be the permutation group of the $n$ nodes in the graph
  • Let $\mathbb{S}_m$ be the permutation group of the $m$ relations in the graph
  • What if we defined the equivariance of the GNN to be over the group formed by $\mathbb{S}_n \times \mathbb{S}_m$?
    • We will denote these to be Double G-equivariant, since they will be equivariant to the product group $\mathbb{S}_n \times \mathbb{S}_m$.
  • Since $\forall \pi \in \mathbb{S}_n$ and $\forall \tau \in \mathbb{S}_m$, $\pi \circ \tau = \tau \circ \pi$, we just need to combine neural architectures designed to be equivariant to each one separately into multiple message-passing layers.

Domain Transfer Task¶

  • E.g.: Can we train in Wikigraph domains of Sports and Arts and inductively (zero-shot) transfer to Healthcare?
  • For more details, please refer to (Gao et al., 2023)

Conclusions¶

  • We have presented GNNs as standard architectures applied to graph data, which guarantee invariance to permutations of node ids.

  • We have expanded on their limited expressive power, which results, for example, in

    1. Non-isomorphic graphs getting the same representations,
    2. Inability to count substructures.
  • We have introduced several lines of work trying to break this expressivity limit.

  • We have discussed the limitations of existing more expressive methods, and shown that the expressivity problem is not solved, yet.

  • We have shown that structural representations are all you need. The challenge is to define the number of nodes in your structural representation before solving the task.

    • Once you have chosen, you may not be able to recover a higher-order representation from a smaller one.
    • Positional representations can do higher-order tasks more easily if we can remove the randomness, since they can answer a query on any joint number of nodes (by getting the appropriate-size structural representations back and using them for the task).

References¶

Srinivasan and Ribeiro, 2020. Srinivasan, B., and Ribeiro, B. On the Equivalence between Positional Node Embeddings and Structural Graph Representations. ICLR 2020.

Cotta, Leonardo, Christopher Morris, and Bruno Ribeiro. "Reconstruction for powerful graph representations." Advances in Neural Information Processing Systems 34 (2021): 1713-1726.

Morris et al., 2019. Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. AAAI 2019

Veličković et al., 2018. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention Networks. ICLR 2018

Bahdanau et al., 2015. Bahdanau, D., Cho, K. H., and Bengio, Y. Neural machine translation by jointly learning to align and translate. ICLR 2015

Xu et al., 2019. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? ICLR 2019

Murphy et al., 2019. Murphy, R., Srinivasan, B., Rao., V. and Ribeiro, B. Relational Pooling for Graph Representations. ICML 2019

Maron et al., 2019a. Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and Equivariant Graph Networks. ICLR 2019

Maron et al., 2019b. Maron, H., Ben-Hamu, H., Segol, N., and Lipman, Y. On the Universality of Invariant Networks. ICML 2019

Maron et al., 2019c. Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. NeurIPS 2019

Geerts, 2020. Geerts, F. The expressive power of kth-order invariant graph networks. ArXiv Preprint

Abboud et al., 2021. Abboud, R., Ceylan, I. I., Grohe, M., and Lukasiewicz, T. The surprising power of graph neural networks with random node initialization. IJCAI-21

Sato et al., 2021. Sato, R., Yamada, M., and Kashima, H. Random features strengthen graph neural networks. SDM 2021

Chen et al., 2020. Chen, Z., Chen, L., Villar, S., Bruna, J. Can Graph Neural Networks Count Substructures? NeurIPS 2020

Bouritsas et al., 2022. Bouritsas, G., Frasca, F., Zafeiriou, S., and Bronstein, M. M. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022

Bodnar et al., 2021. Bodnar, C., Frasca, F., Otter, N., Wang, Y. G., Liò, P., Montúfar, G., and Bronstein, M. M. Weisfeiler and lehman go cellular: CW networks. NeurIPS 2021

Cotta et al., 2021. Cotta, L., Morris, C., and Ribeiro, B. Reconstruction for powerful graph representations. NeurIPS 2021

Zhang and Li, 2021. Zhang, M. and Li, P. Nested graph neural networks. NeurIPS 2021

You et al., 2021. You, J., Gomes-Selman, J., Ying, R., and Leskovec, J. Identity-aware graph neural networks. AAAI 2021.

Bevilacqua et al., 2022. Bevilacqua, B., Frasca, F., Lim, D., Srinivasan, D., Cai, C., Balamurugan, G., Bronstein, M. M., Maron, H. Equivariant Subgraph Aggregation Networks. ICLR 2022

Zhao et al., 2022. Zhao, L., Jin, W., Akoglu, L., and Shah, N. From stars to subgraphs: Uplifting any GNN with local structure awareness. ICLR 2022.

Papp et al., 2021. Papp, P. A., Martinkus, K., Faber, L., and Wattenhofer, R. Dropgnn: Random dropouts increase the expressiveness of graph neural networks. NeurIPS 2021

Frasca et al., 2022. Frasca, F., Bevilacqua, B., Bronstein, M. M., Maron, H. Understanding and Extending Subgraph GNNs by Rethinking their Symmetries. NeurIPS 2022

Qian et al., 2022. Qian, C., Rattan, G., Geerts, F., Morris, C., Niepert, M. Ordered Subgraph Aggregation Networks. NeurIPS 2022
