Graph Convolutional Network (GCN) - Background

Background Knowledge

Why do we need GCN?

  1. CNNs cannot handle non-Euclidean data. The reason is that each vertex has a different set (and number) of neighbors, so a convolutional kernel of fixed size cannot be applied uniformly.

  2. We need a method that effectively extracts information from graph structures for machine learning training.

  3. Many data sources are naturally non-Euclidean, e.g. social networks, computer networks, IoT, DNA information, etc. A graph can represent such data more faithfully and preserve more information than a Euclidean representation.

Essentially, the difficulty in processing graph data is that the number of neighbors of each vertex is not constant. --> A convolutional kernel of fixed size cannot be used for feature extraction.

==> There are two main directions for processing graph data:

  • Transform non-Euclidean data into a Euclidean space

  • Find a convolutional kernel that can handle neighborhoods of varying size. --> GCN

Spatial Domain and Spectral Domain

  • Spatial Domain. Extract graph features by finding the neighbors of each vertex:

    There are two problems when extracting features in the spatial domain:

    • How to find the neighbors of a vertex, i.e., how to determine the receptive field?

    • Given the receptive field, how to handle the features of a varying number of neighbors?

    The main drawbacks are:

    • The neighbors of each vertex are different -> the computation has to be done separately for every vertex

    • The extracted features lose too much information

  • Spectral Domain. Compute the convolution between the kernel and the graph data in the spectral domain. Intuitively, we want to use the eigenvalues and eigenvectors of the Laplacian matrix of a graph to extract graph feature information.

What does the GCN really do?

The core operation of a graph neural network is to convolve the node embeddings in the spatial domain, i.e., to aggregate the embeddings of neighboring nodes. However, unlike image data, the number and the order of a node's neighbors are not fixed, so the convolution operator used in traditional CNNs for images cannot be applied to graphs directly. We therefore redefine the convolution in the spectral domain and map it back to the spatial domain via the convolution theorem.

To move between the spectral and spatial domains we rely on the Fourier transform, defining a graph Fourier transform (spatial domain -> spectral domain) and a graph inverse Fourier transform (spectral domain -> spatial domain). Concretely, the node embeddings $f(i), i \in (1, ..., N)$ are mapped from the spatial domain to the spectral domain $\hat{f}$ by the forward transform, convolved with the kernel $h$ in the spectral domain, and then mapped back to the spatial domain by the inverse transform, to be used in downstream tasks such as classification.[2]

Laplacian Matrix

For an undirected graph G = (V, E), its Laplacian matrix is L = D - A, where D is the degree matrix and A is the adjacency matrix.

There are two other common definitions of the Laplacian matrix:

  • Symmetric normalized Laplacian: $L^{sym} = D^{-1/2}LD^{-1/2}$. --> Most papers use this form.

  • Random walk normalized Laplacian: $L^{rw} = D^{-1}L$

For $L^{sym} = D^{-1/2}LD^{-1/2}$: since (1) L = D - A is symmetric and (2) D is a diagonal matrix, $L^{sym}$ is also symmetric ==> $L^{sym}$ can be eigendecomposed.

-> This is a very important property. The reason is that the Fourier transform needs the eigenvalues and eigenvectors of the Laplacian matrix => to transform graph data from the spatial domain to the spectral domain, where the convolution is computed.
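As a concrete illustration, here is a minimal numpy sketch (assuming a small, hypothetical 4-vertex undirected graph) that builds the three Laplacians defined above:

```python
# A minimal sketch, assuming a small hypothetical 4-vertex undirected graph.
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)   # adjacency matrix

D = np.diag(A.sum(axis=1))                  # degree matrix
L = D - A                                   # combinatorial Laplacian L = D - A

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt         # symmetric normalized Laplacian
L_rw = np.linalg.inv(D) @ L                 # random-walk normalized Laplacian

print(L)
print(L_sym)
```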

Laplacian matrix decomposition

The Laplacian matrix is positive semi-definite. It has three important properties:

  • A symmetric matrix has n linearly independent eigenvectors

    ==> $L = U \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} U^{-1}$

  • The eigenvalues of a positive semi-definite matrix are non-negative

    Since U is orthogonal, $UU^T = I$, i.e., $U^T = U^{-1}$,

    ==> $L = U \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} U^{T}$

  • The eigenvectors of a symmetric matrix are orthogonal
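A quick numerical check of these properties, again on the hypothetical 4-vertex graph (the eigendecomposition uses numpy's `eigh`, which is intended for symmetric matrices):

```python
# Numerical check of the three properties on the same toy graph.
import numpy as np

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A              # symmetric, positive semi-definite

eigvals, U = np.linalg.eigh(L)              # columns of U are eigenvectors

print(eigvals)                                          # all eigenvalues >= 0
print(np.allclose(U @ U.T, np.eye(len(eigvals))))       # eigenvectors are orthonormal: U U^T = I
print(np.allclose(U @ np.diag(eigvals) @ U.T, L))       # L = U diag(lambda) U^T
```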

NOTES: Definitions of positive definite and positive semi-definite matrices:

  • Positive definite matrix: an n×n matrix A such that $x^TAx > 0$ for every non-zero vector x, where $x^T$ denotes the transpose of x.

    <==> all eigenvalues of A are > 0

  • Positive semi-definite matrix: an n×n real symmetric matrix A such that $x^TAx \geq 0$ for every real non-zero column vector x.

    <==> all eigenvalues of A are >= 0

Fourier Transform on Graph

Fourier Transform: Spatial Domain -> Spectral Domain

  • Continuous Fourier transform: $F(\omega) = F[f(t)] = \int f(t)e^{-i\omega t} dt$

    The Fourier transform of f(t) is the integral of the product of $f(t)$ and the basis function $e^{-i\omega t}$

  • The basis function $e^{-i\omega t}$ satisfies: $\Delta e^{-i\omega t} = \frac{\partial^2}{\partial t^2}e^{-i\omega t} = -\omega^2 e^{-i\omega t}$

    $\Delta$ is the Laplacian operator: $\Delta f = \nabla^2 f$

  • Recall the definition of eigenvalues and eigenvectors: $A\mu = \lambda\mu$, where $A$ is a matrix, $\mu$ is an eigenvector, and $\lambda$ is the corresponding eigenvalue.

    => $e^{-i\omega t}$ is an eigenfunction of $\Delta$, and $\omega$ is closely related to the corresponding eigenvalue ($-\omega^2$)

  • The Laplacian matrix is the discrete Laplacian operator, and we obtained the Laplacian matrix of a graph above.

    => To perform a Fourier transform on a graph, we need to find the basis functions. Intuitively, we already have $\Delta$ (the graph Laplacian), and we need to find the counterpart of $e^{-i\omega t}$.

    Short Summary:

    The key step in transferring the Fourier transform to a graph is to map the eigenfunctions $e^{-i\omega t}$ of the Laplacian operator onto the eigenvectors of the graph Laplacian matrix, so computing those eigenvectors is essential. As introduced above, the graph Laplacian is positive semi-definite, so its eigenvectors $\{\mu_1, \mu_2, ..., \mu_n\}$ can be obtained by eigendecomposition (spectral decomposition).

  • With the eigenvectors of a graph, $\{\mu_1, \mu_2, ..., \mu_n\}$, we can define the graph Fourier transform analogously to the discrete Fourier transform: $F(\lambda_l) = \sum_{i=1}^N f(i)\mu_l(i)$

    Where $f$ is the signal on the graph vertices (e.g. the vertex embeddings), $f(i)$ corresponds one-to-one to vertex $i$, and $\mu_l(i)$ is the $i^{th}$ element of the $l^{th}$ eigenvector.

  • The graph Fourier transform of the embedding $f$ at eigenvalue $\lambda_l$

    <=> the inner product of $f$ with the eigenvector $\mu_l$

  • Therefore, the graph Fourier transform over all eigenvalues (equivalently, over all graph vertices) is

    $\begin{pmatrix} \hat{f}(\lambda_1)\\ \hat{f}(\lambda_2)\\ \vdots\\ \hat{f}(\lambda_N) \end{pmatrix} = \begin{pmatrix} \mu_1(1) & \mu_1(2) & ... & \mu_1(N) \\ \mu_2(1) & \mu_2(2) & ... & \mu_2(N) \\ \vdots & \vdots & \ddots & \vdots\\ \mu_N(1) & \mu_N(2) & ... & \mu_N(N) \end{pmatrix} \begin{pmatrix} f(1)\\ f(2)\\ \vdots\\ f(N) \end{pmatrix}$

    => $\hat{f} = U^Tf$

    Left-multiplying the input vertex embeddings $f$ by $U^T$ gives the graph Fourier transform $\hat{f}$.

Inverse Fourier Transform: Spectral Domain -> Spatial Domain

  • Inverse Fourier transform over the frequency $\omega$: $F^{-1}[F(\omega)] = \frac{1}{2\pi}\int F(\omega)e^{i\omega t}d\omega$

  • Similarly, the graph inverse Fourier transform over the eigenvalues $\lambda_l$: $f(i) = \sum_{l=1}^N\hat{f}(\lambda_l)\mu_l(i)$

  • The inverse Fourier transform over all graph vertices is:

    $\begin{pmatrix} f(1)\\ f(2)\\ \vdots\\ f(N) \end{pmatrix} = \begin{pmatrix} \mu_1(1) & \mu_2(1) & ... & \mu_N(1) \\ \mu_1(2) & \mu_2(2) & ... & \mu_N(2) \\ \vdots & \vdots & \ddots & \vdots\\ \mu_1(N) & \mu_2(N) & ... & \mu_N(N) \end{pmatrix} \begin{pmatrix} \hat{f}(\lambda_1)\\ \hat{f}(\lambda_2)\\ \vdots\\ \hat{f}(\lambda_N) \end{pmatrix}$

    => $f = U\hat{f}$
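The forward and inverse transforms can be sanity-checked numerically. The sketch below (toy graph and an arbitrary signal, purely illustrative) verifies that $f = U\hat{f}$ recovers the original signal from $\hat{f} = U^Tf$:

```python
# Round-trip check of the graph Fourier transform on a toy signal.
import numpy as np

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
_, U = np.linalg.eigh(L)                    # columns of U are the eigenvectors mu_1 ... mu_N

f = np.array([1.0, 2.0, 3.0, 4.0])          # an arbitrary signal / embedding, one value per vertex
f_hat = U.T @ f                             # forward transform: spatial -> spectral
f_back = U @ f_hat                          # inverse transform: spectral -> spatial

print(np.allclose(f, f_back))               # True, since U U^T = I
```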

Convolution on Graph

  • Convolution theorem: under suitable conditions, the Fourier transform of the convolution of two signals is the pointwise product of their Fourier transforms: $f*g = F^{-1}(F(f)\cdot F(g))$

  • The steps for computing a convolution on a graph are:

    • The Fourier transform of $f$: $\hat{f} = U^Tf$

    • The Fourier transform of the convolutional kernel $h$: $\hat{h} = U^Th$, which can also be written as the diagonal matrix $\begin{pmatrix} \hat{h}(\lambda_1) & & \\ & \ddots & \\ & & \hat{h}(\lambda_N) \end{pmatrix}$, where $\hat{h}(\lambda_l) = \sum_{i=1}^N h(i)\mu_l^*(i)$

    • The element-wise product of the two: $\hat{h}\odot\hat{f} = U^Th \odot U^Tf = \begin{pmatrix} \hat{h}(\lambda_1) & & \\ & \ddots & \\ & & \hat{h}(\lambda_N) \end{pmatrix} U^Tf$

    • Lastly, left-multiply by $U$ to apply the inverse Fourier transform:

      $f*h = h*f = U\begin{pmatrix} \hat{h}(\lambda_1) & & \\ & \ddots & \\ & & \hat{h}(\lambda_N) \end{pmatrix} U^Tf$
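Putting the steps together, here is a minimal sketch of spectral convolution on the toy graph, with arbitrary (hypothetical) spectral kernel values $\hat{h}(\lambda_l)$:

```python
# Spectral convolution f * h = U diag(h_hat) U^T f with arbitrary kernel values.
import numpy as np

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
_, U = np.linalg.eigh(L)

f = np.array([1.0, 2.0, 3.0, 4.0])          # vertex signal
h_hat = np.array([0.5, 1.0, 0.2, 0.0])      # kernel already expressed in the spectral domain (illustrative)

f_conv_h = U @ np.diag(h_hat) @ U.T @ f     # filter in the spectral domain, then transform back
print(f_conv_h)
```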

Evolution of GCN

Gen 1: Naive method

From the graph convolution formula $f*h = U\begin{pmatrix} \hat{h}(\lambda_1) & & \\ & \ddots & \\ & & \hat{h}(\lambda_N) \end{pmatrix} U^Tf$, we see that convolving the vertex embeddings $f$ is just a matrix operation. The information of the graph is stored in $U$, so we need to find an appropriate convolutional kernel $h$ that reduces the loss of a model.

==> The goal of machine learning on the graph is to find an $h$ that minimizes the loss of a model. In other words, we need to learn the values $\hat{h}(\lambda_1), \hat{h}(\lambda_2), ..., \hat{h}(\lambda_n)$.

==> Treating $\hat{h}(\lambda_1), \hat{h}(\lambda_2), ..., \hat{h}(\lambda_n)$ as the model parameters, we can use gradient descent to find their values. Therefore, $diag(\hat{h}(\lambda_i))$ is regarded as the parameter $diag(\theta_i)$:

  • $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx)$

  • $g_{\theta}(\Lambda) = \begin{pmatrix} \theta_1 & & \\ & \ddots & \\ & & \theta_n \end{pmatrix}$

    Where $\theta_i$ plays the role of the weights in a neural network, and $x$ is the signal $f$.
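A rough sketch of this Gen-1 layer (toy graph, randomly initialized $\theta$, ReLU standing in for $\sigma$; all names are illustrative):

```python
# Gen-1 "naive" spectral layer: y = sigma(U g_theta(Lambda) U^T x), one theta per eigenvalue.
import numpy as np

def naive_spectral_layer(x, U, theta):
    # theta has n entries (one free parameter per eigenvalue); ReLU plays the role of sigma
    return np.maximum(0.0, U @ np.diag(theta) @ U.T @ x)

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
_, U = np.linalg.eigh(L)                    # O(n^3) eigendecomposition: the main cost of this generation

x = np.array([1.0, -1.0, 2.0, 0.5])
theta = np.random.randn(4)                  # would be learned by gradient descent in practice
print(naive_spectral_layer(x, U, theta))
```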

Drawbacks:

  • $U$ is the matrix of eigenvectors of the Laplacian matrix $L$, and the time complexity of computing the eigendecomposition is O(n^3)

  • Each forward propagation needs to compute the product of $U$, $g_{\theta}(\Lambda)$, and $U^T$

  • The convolutional kernel has n parameters, where n is the number of vertices in the graph. For a large graph, this is too many parameters for a model.

Gen 2: Reduce the number of parameters

The naive model has n parameters, which is too many to learn, so an approximation method was proposed:

  • $\hat{h}(\lambda_i) = \sum_{j=0}^K\alpha_j\lambda_i^j$

  • Then, $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx)$, where

    $g_{\theta}(\Lambda) = \begin{pmatrix} \sum_{j=0}^K\alpha_j\lambda_1^j & & \\ & \ddots & \\ & & \sum_{j=0}^K\alpha_j\lambda_n^j \end{pmatrix}$

  • Further,

    $g_{\theta}(\Lambda) = \begin{pmatrix} \sum_{j=0}^K\alpha_j\lambda_1^j & & \\ & \ddots & \\ & & \sum_{j=0}^K\alpha_j\lambda_n^j \end{pmatrix} = \sum_{j=0}^K\alpha_j\Lambda^j$

    Since $L^2 = U\Lambda U^T U\Lambda U^T = U\Lambda^2 U^T$ ==> $L^j = U\Lambda^j U^T$,

    $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx) = \sigma(\sum_{j=0}^K\alpha_jU\Lambda^jU^Tx) = \sigma(\sum_{j=0}^K\alpha_jL^jx)$

  • Finally, $y_{out} = \sigma(\sum_{j=0}^K\alpha_jL^jx)$

    Where $\alpha_j, j\in(0,...,K)$ are the model parameters.
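A literal numpy sketch of this polynomial filter (toy graph, hypothetical coefficients $\alpha_j$, ReLU as $\sigma$); note that it forms the powers $L^j$ explicitly, which is exactly the cost Gen 3 removes:

```python
# Gen-2 polynomial filter: y = sigma(sum_j alpha_j L^j x), written literally with matrix powers.
import numpy as np

def poly_filter(x, L, alpha):
    # alpha = [alpha_0, ..., alpha_K]; forming L^j explicitly is the expensive part
    out = sum(a * (np.linalg.matrix_power(L, j) @ x) for j, a in enumerate(alpha))
    return np.maximum(0.0, out)             # ReLU as sigma

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
x = np.array([1.0, -1.0, 2.0, 0.5])
alpha = [0.5, 0.3, 0.1]                     # K = 2: a 2-hop receptive field
print(poly_filter(x, L, alpha))
```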

Improvements:

  • Compared to the n parameters of the naive model, this model has only the K+1 parameters $\alpha_j, j\in(0,...,K)$.

  • Since $y_{out} = \sigma(\sum_{j=0}^K\alpha_jL^jx)$, the eigendecomposition $L = U\Lambda U^T$ is no longer needed, which saves O(n^3) computation.

  • The approximation of the convolutional kernel introduces spatial localization: K is the receptive field, meaning the update of a vertex embedding only depends on its K-hop neighborhood, and all nodes at hop k share the same aggregation weight $\alpha_k$.

  • However, this method still needs to compute $L^j$, which costs $O(n^3)$.

Gen 3: Reduce time complexity

Since the Gen 2 model did not reduce the time complexity of the convolution, subsequent work approximates the convolutional kernel with Chebyshev polynomials:

  • $g_{\theta}(\Lambda) \approx \sum_{k=0}^K \theta_k T_k(\hat{\Lambda})$

    Where $\theta_k$ are the Chebyshev coefficients.

  • In $T_k(\hat{\Lambda})$, $\hat{\Lambda} = \frac{2\Lambda}{\lambda_{max}} - I$.

    The reason is that Chebyshev polynomials require their input to lie in the range [-1, 1].

  • The Chebyshev polynomials are defined by the recurrence:

    • $T_k(\hat{\Lambda})x = 2\hat{\Lambda}T_{k-1}(\hat{\Lambda})x - T_{k-2}(\hat{\Lambda})x$

    • $T_0 = I, T_{1}(\hat{\Lambda}) = \hat{\Lambda}$

Since applying $g_{\theta}(\Lambda)$ no longer requires an eigendecomposition, but only the recurrence above, the cost of computing $T_k(\hat{L})x$ becomes O(n^2) (one matrix-vector product per step). ==> the time complexity of applying $g_{\theta}(\Lambda)$ is O(Kn^2).

==> $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx) \approx \sigma(\sum_{k=0}^K \theta_k T_k(\hat{L}) x)$
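A sketch of the Chebyshev filter using only the recurrence (toy graph, hypothetical coefficients $\theta_k$, $\lambda_{max}$ taken as 2 for the normalized Laplacian, ReLU as $\sigma$):

```python
# Gen-3 Chebyshev filter: y ~= sigma(sum_k theta_k T_k(L_hat) x), using only the recurrence.
import numpy as np

def cheb_filter(x, L, theta, lam_max=2.0):
    L_hat = (2.0 / lam_max) * L - np.eye(L.shape[0])    # rescale eigenvalues into [-1, 1]
    Tx_prev, Tx_curr = x, L_hat @ x                     # T_0(L_hat)x = x, T_1(L_hat)x = L_hat x
    out = theta[0] * Tx_prev
    if len(theta) > 1:
        out = out + theta[1] * Tx_curr
    for k in range(2, len(theta)):
        Tx_next = 2.0 * (L_hat @ Tx_curr) - Tx_prev     # T_k x = 2 L_hat T_{k-1} x - T_{k-2} x
        out = out + theta[k] * Tx_next
        Tx_prev, Tx_curr = Tx_curr, Tx_next
    return np.maximum(0.0, out)                         # ReLU as sigma

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L_sym = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt         # symmetric normalized Laplacian, eigenvalues in [0, 2]
x = np.array([1.0, -1.0, 2.0, 0.5])
theta = [0.4, 0.3, 0.2]                                 # K = 2, hypothetical coefficients
print(cheb_filter(x, L_sym, theta))
```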

Gen 4: GCN

In GCN, two assumptions are made: (1) $\lambda_{max} \approx 2$; (2) $K = 1$. Therefore,

  • $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx)$

  • $Ug_{\theta}(\Lambda)U^Tx \approx \sum_{k=0}^1 \theta_k T_k(\hat{L})x = \theta_0x + \theta_1(L-I)x = \theta_0x - \theta_1D^{-1/2}AD^{-1/2}x$

  • There are two parameters, $\theta_0$ and $\theta_1$, which are shared by all vertices in the graph.

  • To further restrict the number of parameters and avoid over-fitting, we set $\theta = \theta_0 = -\theta_1$:

    $Ug_{\theta}(\Lambda)U^Tx \approx \theta(I+D^{-1/2}AD^{-1/2})x$

  • Since the eigenvalues of $I+D^{-1/2}AD^{-1/2}$ lie in the range [0, 2], and the output $y_{out}$ is fed into the next layer, stacking layers can lead to exploding or vanishing gradients.

    We need a renormalization: $I+D^{-1/2}AD^{-1/2} \rightarrow \hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$

    Where $\hat{A} = A+I$, $\hat{D}_{ii}=\sum_j\hat{A}_{ij}$

  • Finally, we obtain the GCN propagation rule:

    The input matrix is $X \in \mathbb{R}^{N\times C}$, i.e. every graph node has $C$ channels (a C-dimensional feature vector). The convolution has $F$ filters or feature maps:

    $Z = \hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}X\Theta$

    Where $\Theta \in \mathbb{R}^{C\times F}$ is the parameter matrix of the filters, and $Z \in \mathbb{R}^{N\times F}$ is the output of the convolution.

  • Furthermore, since $X$ is only the input of the first layer and a neural network has multiple layers, we write the layer activations as $H$ and the layer parameters as $W$. The forward propagation rule of the graph neural network becomes:

    $H^{l+1} = \sigma(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}H^{l}W^l)$
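A minimal sketch of one GCN layer with the renormalization trick (toy graph, random placeholder weights, ReLU as $\sigma$):

```python
# One GCN layer with the renormalization trick: H' = ReLU(D_hat^{-1/2} A_hat D_hat^{-1/2} H W).
import numpy as np

def gcn_layer(H, A, W):
    A_hat = A + np.eye(A.shape[0])                      # add self-loops: A_hat = A + I
    D_hat_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt    # renormalized adjacency
    return np.maximum(0.0, A_norm @ H @ W)              # ReLU as sigma

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
H0 = np.random.randn(4, 3)                              # N = 4 nodes, C = 3 input channels
W0 = np.random.randn(3, 5)                              # maps C = 3 channels to F = 5 feature maps
print(gcn_layer(H0, A, W0).shape)                       # (4, 5)
```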

Explanation of GCN

Intuitive understanding

The feed-forward propagation function of the GCN is

$H^{l+1} = \sigma(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}H^{l}W^l)$

  • Intuitively, this rule means that every vertex of the graph gathers the information of its neighbor nodes and aggregates it into its own embedding vector.

  • $\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$ --> the normalized adjacency matrix

  • $H^{l}W^l$ --> a linear transform of all vertex embeddings in the $l^{th}$ layer.

  • Left-multiplying by the normalized adjacency matrix --> replaces the features of a vertex with the (weighted) sum of its neighbors' features

Example

Assume we have a 2-layer GCN for classification. Let $A_2 = \hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$. Based on the forward propagation rule, the output of the GCN is

$Z = f(X,A) = softmax(A_2\, ReLU(A_2XW^0)W^1)$

  • Where,

    • $W^0 \in \mathbb{R}^{C\times H}$ is the input-to-hidden weight matrix, used to transform the input embedding X into the hidden embedding H.

    • $W^1 \in \mathbb{R}^{H\times F}$ is the hidden-to-output weight matrix, used to transform the hidden-layer embedding H into the F output classes.

  • For semi-supervised learning, the loss function is the cross-entropy over the labeled nodes:

    $L = -\sum_{l\in y_L}\sum_{f=1}^F Y_{lf}\ln Z_{lf}$

    Where $y_L$ is the set of labeled node indices.

  • Lastly, we can use gradient descent to learn $W^0$ and $W^1$.
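A sketch of the full 2-layer forward pass and the semi-supervised loss above (toy graph, random weights, hypothetical labels; a real implementation would optimize $W^0$ and $W^1$ with gradient descent):

```python
# 2-layer GCN forward pass and the semi-supervised cross-entropy loss over labeled nodes.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)
D_hat_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A2 = D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt            # A_2 in the text

X = np.random.randn(4, 3)                               # N = 4 nodes, C = 3 input features
W0 = np.random.randn(3, 8)                              # C x H
W1 = np.random.randn(8, 2)                              # H x F, with F = 2 classes

Z = softmax(A2 @ np.maximum(0.0, A2 @ X @ W0) @ W1)     # Z = f(X, A)

Y = np.array([[1, 0], [0, 1], [0, 0], [0, 0]], float)   # one-hot labels; nodes 2 and 3 unlabeled
labeled = [0, 1]                                        # y_L: indices of labeled nodes
loss = -np.sum(Y[labeled] * np.log(Z[labeled]))         # L = -sum_{l in y_L} sum_f Y_lf ln Z_lf
print(loss)
```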

Drawback

Since GCN takes the whole adjacency matrix $A$ and the whole feature matrix $X$ as input, it needs a lot of memory. The authors used sparse matrices to reduce memory usage and GPUs to accelerate the computation. However, it is still not suitable for very large graphs.

GraphSAGE introduces another way to address this issue.

Local Connectivity and Parameter Sharing

Reference
