Graph Convolutional Network (GCN) - Background

Background Knowledge

Why do we need GCN?

  1. CNNs cannot handle non-Euclidean data. The reason is that each vertex has a different set (and number) of neighbors, so a convolutional kernel of fixed size cannot be applied uniformly.

  2. We need a method that effectively extracts information from graph structures for machine learning training.

  3. Many data sources are naturally non-Euclidean, e.g. social networks, computer networks, IoT, DNA information, etc. A graph can represent such data more faithfully and preserve more information than a Euclidean representation.

Essentially, the difficulty in processing graph data is that the number of neighbors of each vertex is not constant. --> A convolutional kernel of fixed size cannot be used for feature extraction.

==> There are two main directions for processing graph data:

  • Transform non-Euclidean data into a Euclidean space

  • Find a convolutional kernel that can handle neighborhoods of varying size. --> GCN

Spatial Domain and Spectral Domain

  • Spatial Domain. Extract graph features by finding the neighbors of each vertex:

    There are two problems when extracting features in the spatial domain:

    • How to find the neighbors of a vertex, i.e., how to determine the receptive field?

    • Given the receptive field, how to handle the features of a varying number of neighbors?

    The main drawbacks are:

    • The neighbors of each vertex are different -> the computation has to be done separately for every vertex

    • The extracted features lose too much information

  • Spectral Domain. Compute the convolution between the kernel and the graph data in the spectral domain. Intuitively, we want to use the eigenvalues and eigenvectors of the Laplacian matrix of a graph to extract graph feature information.

What does the GCN really do?

The core operation of a graph neural network is to convolve the node embeddings in the spatial domain, i.e., to aggregate the embeddings of neighboring nodes. However, unlike image data, the number and the order of a node's neighbors are not fixed, so the convolution operator used in traditional CNNs for images cannot be applied to graphs directly. We therefore redefine the convolution in the spectral domain and map it back to the spatial domain via the convolution theorem.

To move between the spectral and spatial domains we rely on the Fourier transform, defining a graph Fourier transform (spatial domain -> spectral domain) and a graph inverse Fourier transform (spectral domain -> spatial domain). Concretely, the node embeddings $f(i), i \in (1, ..., N)$ are mapped from the spatial domain to the spectral domain $\hat{f}$ by the forward transform, convolved with the kernel $h$ in the spectral domain, and then mapped back to the spatial domain by the inverse transform, to be used in downstream tasks such as classification.[2]

Laplacian Matrix

For an undirected graph G = (V, E), its Laplacian matrix is L = D - A, where D is the degree matrix and A is the adjacency matrix.

There are two other common definitions of the Laplacian matrix:

  • Symmetric normalized Laplacian: $L^{sym} = D^{-1/2}LD^{-1/2}$. --> Most papers use this form.

  • Random walk normalized Laplacian: $L^{rw} = D^{-1}L$

For $L^{sym} = D^{-1/2}LD^{-1/2}$: since (1) L = D - A is symmetric and (2) D is a diagonal matrix, $L^{sym}$ is also symmetric ==> $L^{sym}$ can be eigendecomposed.

-> This is a very important property. The reason is that the Fourier transform needs the eigenvalues and eigenvectors of the Laplacian matrix => to transform graph data from the spatial domain to the spectral domain, where the convolution is computed.
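As a concrete illustration, here is a minimal numpy sketch (assuming a small, hypothetical 4-vertex undirected graph) that builds the three Laplacians defined above:

```python
# A minimal sketch, assuming a small hypothetical 4-vertex undirected graph.
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)   # adjacency matrix

D = np.diag(A.sum(axis=1))                  # degree matrix
L = D - A                                   # combinatorial Laplacian L = D - A

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt         # symmetric normalized Laplacian
L_rw = np.linalg.inv(D) @ L                 # random-walk normalized Laplacian

print(L)
print(L_sym)
```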

Laplacian matrix decomposition

The Laplacian matrix is positive semi-definite. It has three important properties:

  • A symmetric matrix has n linearly independent eigenvectors

    ==> $L = U \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} U^{-1}$

  • The eigenvalues of a positive semi-definite matrix are non-negative

    Since U is orthogonal, $UU^T = I$, i.e., $U^T = U^{-1}$,

    ==> $L = U \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} U^{T}$

  • The eigenvectors of a symmetric matrix are orthogonal
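A quick numerical check of these properties, again on the hypothetical 4-vertex graph (the eigendecomposition uses numpy's `eigh`, which is intended for symmetric matrices):

```python
# Numerical check of the three properties on the same toy graph.
import numpy as np

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A              # symmetric, positive semi-definite

eigvals, U = np.linalg.eigh(L)              # columns of U are eigenvectors

print(eigvals)                                          # all eigenvalues >= 0
print(np.allclose(U @ U.T, np.eye(len(eigvals))))       # eigenvectors are orthonormal: U U^T = I
print(np.allclose(U @ np.diag(eigvals) @ U.T, L))       # L = U diag(lambda) U^T
```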

NOTES: Definitions of positive definite and positive semi-definite matrices:

  • Positive definite matrix: an n×n matrix A such that $x^TAx > 0$ for every non-zero vector x, where $x^T$ denotes the transpose of x.

    <==> all eigenvalues of A are > 0

  • Positive semi-definite matrix: an n×n real symmetric matrix A such that $x^TAx \geq 0$ for every real non-zero column vector x.

    <==> all eigenvalues of A are >= 0

Fourier Transform on Graph

Fourier Transform: Spatial Domain -> Spectral Domain

  • Continuous Fourier transform: $F(\omega) = F[f(t)] = \int f(t)e^{-i\omega t} dt$

    The Fourier transform of f(t) is the integral of the product of $f(t)$ and the basis function $e^{-i\omega t}$

  • The basis function $e^{-i\omega t}$ satisfies: $\Delta e^{-i\omega t} = \frac{\partial^2}{\partial t^2}e^{-i\omega t} = -\omega^2 e^{-i\omega t}$

    $\Delta$ is the Laplacian operator: $\Delta f = \nabla^2 f$

  • Recall the definition of eigenvalues and eigenvectors: $A\mu = \lambda\mu$, where $A$ is a matrix, $\mu$ is an eigenvector, and $\lambda$ is the corresponding eigenvalue.

    => $e^{-i\omega t}$ is an eigenfunction of $\Delta$, and $\omega$ is closely related to the corresponding eigenvalue ($-\omega^2$)

  • The Laplacian matrix is the discrete Laplacian operator, and we obtained the Laplacian matrix of a graph above.

    => To perform a Fourier transform on a graph, we need to find the basis functions. Intuitively, we already have $\Delta$ (the graph Laplacian), and we need to find the counterpart of $e^{-i\omega t}$.

    Short Summary:

    The key step in transferring the Fourier transform to a graph is to map the eigenfunctions $e^{-i\omega t}$ of the Laplacian operator onto the eigenvectors of the graph Laplacian matrix, so computing those eigenvectors is essential. As introduced above, the graph Laplacian is positive semi-definite, so its eigenvectors $\{\mu_1, \mu_2, ..., \mu_n\}$ can be obtained by eigendecomposition (spectral decomposition).

  • With the eigenvectors of a graph, $\{\mu_1, \mu_2, ..., \mu_n\}$, we can define the graph Fourier transform analogously to the discrete Fourier transform: $F(\lambda_l) = \sum_{i=1}^N f(i)\mu_l(i)$

    Where $f$ is the signal on the graph vertices (e.g. the vertex embeddings), $f(i)$ corresponds one-to-one to vertex $i$, and $\mu_l(i)$ is the $i^{th}$ element of the $l^{th}$ eigenvector.

  • The graph Fourier transform of the embedding $f$ at eigenvalue $\lambda_l$

    <=> the inner product of $f$ with the eigenvector $\mu_l$

  • Therefore, the graph Fourier transform over all eigenvalues (equivalently, over all graph vertices) is

    $\begin{pmatrix} \hat{f}(\lambda_1)\\ \hat{f}(\lambda_2)\\ \vdots\\ \hat{f}(\lambda_N) \end{pmatrix} = \begin{pmatrix} \mu_1(1) & \mu_1(2) & ... & \mu_1(N) \\ \mu_2(1) & \mu_2(2) & ... & \mu_2(N) \\ \vdots & \vdots & \ddots & \vdots\\ \mu_N(1) & \mu_N(2) & ... & \mu_N(N) \end{pmatrix} \begin{pmatrix} f(1)\\ f(2)\\ \vdots\\ f(N) \end{pmatrix}$

    => $\hat{f} = U^Tf$

    Left-multiplying the input vertex embeddings $f$ by $U^T$ gives the graph Fourier transform $\hat{f}$.

Inverse Fourier Transform: Spectral Domain -> Spatial Domain

  • Inverse Fourier transform over the frequency $\omega$: $F^{-1}[F(\omega)] = \frac{1}{2\pi}\int F(\omega)e^{i\omega t}d\omega$

  • Similarly, the graph inverse Fourier transform over the eigenvalues $\lambda_l$: $f(i) = \sum_{l=1}^N\hat{f}(\lambda_l)\mu_l(i)$

  • The inverse Fourier transform over all graph vertices is:

    $\begin{pmatrix} f(1)\\ f(2)\\ \vdots\\ f(N) \end{pmatrix} = \begin{pmatrix} \mu_1(1) & \mu_2(1) & ... & \mu_N(1) \\ \mu_1(2) & \mu_2(2) & ... & \mu_N(2) \\ \vdots & \vdots & \ddots & \vdots\\ \mu_1(N) & \mu_2(N) & ... & \mu_N(N) \end{pmatrix} \begin{pmatrix} \hat{f}(\lambda_1)\\ \hat{f}(\lambda_2)\\ \vdots\\ \hat{f}(\lambda_N) \end{pmatrix}$

    => $f = U\hat{f}$
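The forward and inverse transforms can be sanity-checked numerically. The sketch below (toy graph and an arbitrary signal, purely illustrative) verifies that $f = U\hat{f}$ recovers the original signal from $\hat{f} = U^Tf$:

```python
# Round-trip check of the graph Fourier transform on a toy signal.
import numpy as np

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
_, U = np.linalg.eigh(L)                    # columns of U are the eigenvectors mu_1 ... mu_N

f = np.array([1.0, 2.0, 3.0, 4.0])          # an arbitrary signal / embedding, one value per vertex
f_hat = U.T @ f                             # forward transform: spatial -> spectral
f_back = U @ f_hat                          # inverse transform: spectral -> spatial

print(np.allclose(f, f_back))               # True, since U U^T = I
```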

Convolution on Graph

  • Convolution theorem: under suitable conditions, the Fourier transform of the convolution of two signals is the pointwise product of their Fourier transforms: $f*g = F^{-1}(F(f)\cdot F(g))$

  • The steps for computing a convolution on a graph are:

    • The Fourier transform of $f$: $\hat{f} = U^Tf$

    • The Fourier transform of the convolutional kernel $h$: $\hat{h} = U^Th$, which can also be written as the diagonal matrix $\begin{pmatrix} \hat{h}(\lambda_1) & & \\ & \ddots & \\ & & \hat{h}(\lambda_N) \end{pmatrix}$, where $\hat{h}(\lambda_l) = \sum_{i=1}^N h(i)\mu_l^*(i)$

    • The element-wise product of the two: $\hat{h}\odot\hat{f} = U^Th \odot U^Tf = \begin{pmatrix} \hat{h}(\lambda_1) & & \\ & \ddots & \\ & & \hat{h}(\lambda_N) \end{pmatrix} U^Tf$

    • Lastly, left-multiply by $U$ to apply the inverse Fourier transform:

      $f*h = h*f = U\begin{pmatrix} \hat{h}(\lambda_1) & & \\ & \ddots & \\ & & \hat{h}(\lambda_N) \end{pmatrix} U^Tf$
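Putting the steps together, here is a minimal sketch of spectral convolution on the toy graph, with arbitrary (hypothetical) spectral kernel values $\hat{h}(\lambda_l)$:

```python
# Spectral convolution f * h = U diag(h_hat) U^T f with arbitrary kernel values.
import numpy as np

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
_, U = np.linalg.eigh(L)

f = np.array([1.0, 2.0, 3.0, 4.0])          # vertex signal
h_hat = np.array([0.5, 1.0, 0.2, 0.0])      # kernel already expressed in the spectral domain (illustrative)

f_conv_h = U @ np.diag(h_hat) @ U.T @ f     # filter in the spectral domain, then transform back
print(f_conv_h)
```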

Evolution of GCN

Gen 1: Naive method

From the graph convolution formula $f*h = U\begin{pmatrix} \hat{h}(\lambda_1) & & \\ & \ddots & \\ & & \hat{h}(\lambda_N) \end{pmatrix} U^Tf$, we see that convolving the vertex embeddings $f$ is just a matrix operation. The information of the graph is stored in $U$, so we need to find an appropriate convolutional kernel $h$ that reduces the loss of a model.

==> The goal of machine learning on the graph is to find an $h$ that minimizes the loss of a model. In other words, we need to learn the values $\hat{h}(\lambda_1), \hat{h}(\lambda_2), ..., \hat{h}(\lambda_n)$.

==> Treating $\hat{h}(\lambda_1), \hat{h}(\lambda_2), ..., \hat{h}(\lambda_n)$ as the model parameters, we can use gradient descent to find their values. Therefore, $diag(\hat{h}(\lambda_i))$ is regarded as the parameter $diag(\theta_i)$:

  • $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx)$

  • $g_{\theta}(\Lambda) = \begin{pmatrix} \theta_1 & & \\ & \ddots & \\ & & \theta_n \end{pmatrix}$

    Where $\theta_i$ plays the role of the weights in a neural network, and $x$ is the signal $f$.
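A rough sketch of this Gen-1 layer (toy graph, randomly initialized $\theta$, ReLU standing in for $\sigma$; all names are illustrative):

```python
# Gen-1 "naive" spectral layer: y = sigma(U g_theta(Lambda) U^T x), one theta per eigenvalue.
import numpy as np

def naive_spectral_layer(x, U, theta):
    # theta has n entries (one free parameter per eigenvalue); ReLU plays the role of sigma
    return np.maximum(0.0, U @ np.diag(theta) @ U.T @ x)

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
_, U = np.linalg.eigh(L)                    # O(n^3) eigendecomposition: the main cost of this generation

x = np.array([1.0, -1.0, 2.0, 0.5])
theta = np.random.randn(4)                  # would be learned by gradient descent in practice
print(naive_spectral_layer(x, U, theta))
```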

Drawbacks:

  • $U$ is the matrix of eigenvectors of the Laplacian matrix $L$, and the time complexity of computing the eigendecomposition is O(n^3)

  • Each forward propagation needs to compute the product of $U$, $g_{\theta}(\Lambda)$, and $U^T$

  • The convolutional kernel has n parameters, where n is the number of vertices in the graph. For a large graph, this is too many parameters for a model.

Gen 2: Reduce the number of parameters

The naive model has n parameters, which is too many to learn, so an approximation method was proposed:

  • $\hat{h}(\lambda_i) = \sum_{j=0}^K\alpha_j\lambda_i^j$

  • Then, $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx)$, where

    $g_{\theta}(\Lambda) = \begin{pmatrix} \sum_{j=0}^K\alpha_j\lambda_1^j & & \\ & \ddots & \\ & & \sum_{j=0}^K\alpha_j\lambda_n^j \end{pmatrix}$

  • Further,

    $g_{\theta}(\Lambda) = \begin{pmatrix} \sum_{j=0}^K\alpha_j\lambda_1^j & & \\ & \ddots & \\ & & \sum_{j=0}^K\alpha_j\lambda_n^j \end{pmatrix} = \sum_{j=0}^K\alpha_j\Lambda^j$

    Since $L^2 = U\Lambda U^T U\Lambda U^T = U\Lambda^2 U^T$ ==> $L^j = U\Lambda^j U^T$,

    $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx) = \sigma(\sum_{j=0}^K\alpha_jU\Lambda^jU^Tx) = \sigma(\sum_{j=0}^K\alpha_jL^jx)$

  • Finally, $y_{out} = \sigma(\sum_{j=0}^K\alpha_jL^jx)$

    Where $\alpha_j, j\in(0,...,K)$ are the model parameters.
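A literal numpy sketch of this polynomial filter (toy graph, hypothetical coefficients $\alpha_j$, ReLU as $\sigma$); note that it forms the powers $L^j$ explicitly, which is exactly the cost Gen 3 removes:

```python
# Gen-2 polynomial filter: y = sigma(sum_j alpha_j L^j x), written literally with matrix powers.
import numpy as np

def poly_filter(x, L, alpha):
    # alpha = [alpha_0, ..., alpha_K]; forming L^j explicitly is the expensive part
    out = sum(a * (np.linalg.matrix_power(L, j) @ x) for j, a in enumerate(alpha))
    return np.maximum(0.0, out)             # ReLU as sigma

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
x = np.array([1.0, -1.0, 2.0, 0.5])
alpha = [0.5, 0.3, 0.1]                     # K = 2: a 2-hop receptive field
print(poly_filter(x, L, alpha))
```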

Improvements:

  • Compared to the n parameters of the naive model, this model has only the K+1 parameters $\alpha_j, j\in(0,...,K)$.

  • Since $y_{out} = \sigma(\sum_{j=0}^K\alpha_jL^jx)$, the eigendecomposition $L = U\Lambda U^T$ is no longer needed, which saves O(n^3) computation.

  • The approximation of the convolutional kernel introduces spatial localization: K is the receptive field, meaning the update of a vertex embedding only depends on its K-hop neighborhood, and all nodes at hop k share the same aggregation weight $\alpha_k$.

  • However, this method still needs to compute $L^j$, which costs $O(n^3)$.

Gen 3: Reduce time complexity

Since the Gen 2 model did not reduce the time complexity of the convolution, subsequent work approximates the convolutional kernel with Chebyshev polynomials:

  • $g_{\theta}(\Lambda) \approx \sum_{k=0}^K \theta_k T_k(\hat{\Lambda})$

    Where $\theta_k$ are the Chebyshev coefficients.

  • In $T_k(\hat{\Lambda})$, $\hat{\Lambda} = \frac{2\Lambda}{\lambda_{max}} - I$.

    The reason is that Chebyshev polynomials require their input to lie in the range [-1, 1].

  • The Chebyshev polynomials are defined by the recurrence:

    • $T_k(\hat{\Lambda})x = 2\hat{\Lambda}T_{k-1}(\hat{\Lambda})x - T_{k-2}(\hat{\Lambda})x$

    • $T_0 = I, T_{1}(\hat{\Lambda}) = \hat{\Lambda}$

Since applying $g_{\theta}(\Lambda)$ no longer requires an eigendecomposition, but only the recurrence above, the cost of computing $T_k(\hat{L})x$ becomes O(n^2) (one matrix-vector product per step). ==> the time complexity of applying $g_{\theta}(\Lambda)$ is O(Kn^2).

==> $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx) \approx \sigma(\sum_{k=0}^K \theta_k T_k(\hat{L}) x)$
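A sketch of the Chebyshev filter using only the recurrence (toy graph, hypothetical coefficients $\theta_k$, $\lambda_{max}$ taken as 2 for the normalized Laplacian, ReLU as $\sigma$):

```python
# Gen-3 Chebyshev filter: y ~= sigma(sum_k theta_k T_k(L_hat) x), using only the recurrence.
import numpy as np

def cheb_filter(x, L, theta, lam_max=2.0):
    L_hat = (2.0 / lam_max) * L - np.eye(L.shape[0])    # rescale eigenvalues into [-1, 1]
    Tx_prev, Tx_curr = x, L_hat @ x                     # T_0(L_hat)x = x, T_1(L_hat)x = L_hat x
    out = theta[0] * Tx_prev
    if len(theta) > 1:
        out = out + theta[1] * Tx_curr
    for k in range(2, len(theta)):
        Tx_next = 2.0 * (L_hat @ Tx_curr) - Tx_prev     # T_k x = 2 L_hat T_{k-1} x - T_{k-2} x
        out = out + theta[k] * Tx_next
        Tx_prev, Tx_curr = Tx_curr, Tx_next
    return np.maximum(0.0, out)                         # ReLU as sigma

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L_sym = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt         # symmetric normalized Laplacian, eigenvalues in [0, 2]
x = np.array([1.0, -1.0, 2.0, 0.5])
theta = [0.4, 0.3, 0.2]                                 # K = 2, hypothetical coefficients
print(cheb_filter(x, L_sym, theta))
```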

Gen 4: GCN

In GCN, two assumptions are made: (1) $\lambda_{max} \approx 2$; (2) $K = 1$. Therefore,

  • $y_{out} = \sigma(Ug_{\theta}(\Lambda)U^Tx)$

  • $Ug_{\theta}(\Lambda)U^Tx \approx \sum_{k=0}^1 \theta_k T_k(\hat{L})x = \theta_0x + \theta_1(L-I)x = \theta_0x - \theta_1D^{-1/2}AD^{-1/2}x$

  • There are two parameters, $\theta_0$ and $\theta_1$, which are shared by all vertices in the graph.

  • To further restrict the number of parameters and avoid over-fitting, we set $\theta = \theta_0 = -\theta_1$:

    $Ug_{\theta}(\Lambda)U^Tx \approx \theta(I+D^{-1/2}AD^{-1/2})x$

  • Since the eigenvalues of $I+D^{-1/2}AD^{-1/2}$ lie in the range [0, 2], and the output $y_{out}$ is fed into the next layer, stacking layers can lead to exploding or vanishing gradients.

    We need a renormalization: $I+D^{-1/2}AD^{-1/2} \rightarrow \hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$

    Where $\hat{A} = A+I$, $\hat{D}_{ii}=\sum_j\hat{A}_{ij}$

  • Finally, we obtain the GCN propagation rule:

    The input matrix is $X \in \mathbb{R}^{N\times C}$, i.e. every graph node has $C$ channels (a C-dimensional feature vector). The convolution has $F$ filters or feature maps:

    $Z = \hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}X\Theta$

    Where $\Theta \in \mathbb{R}^{C\times F}$ is the parameter matrix of the filters, and $Z \in \mathbb{R}^{N\times F}$ is the output of the convolution.

  • Furthermore, since $X$ is only the input of the first layer and a neural network has multiple layers, we write the layer activations as $H$ and the layer parameters as $W$. The forward propagation rule of the graph neural network becomes:

    $H^{l+1} = \sigma(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}H^{l}W^l)$
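A minimal sketch of one GCN layer with the renormalization trick (toy graph, random placeholder weights, ReLU as $\sigma$):

```python
# One GCN layer with the renormalization trick: H' = ReLU(D_hat^{-1/2} A_hat D_hat^{-1/2} H W).
import numpy as np

def gcn_layer(H, A, W):
    A_hat = A + np.eye(A.shape[0])                      # add self-loops: A_hat = A + I
    D_hat_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt    # renormalized adjacency
    return np.maximum(0.0, A_norm @ H @ W)              # ReLU as sigma

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
H0 = np.random.randn(4, 3)                              # N = 4 nodes, C = 3 input channels
W0 = np.random.randn(3, 5)                              # maps C = 3 channels to F = 5 feature maps
print(gcn_layer(H0, A, W0).shape)                       # (4, 5)
```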

Explanation of GCN

Intuitive understanding

The feed-forward propagation function of the GCN is

$H^{l+1} = \sigma(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}H^{l}W^l)$

  • Intuitively, this rule means that every vertex of the graph gathers the information of its neighbor nodes and aggregates it into its own embedding vector.

  • $\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$ --> the normalized adjacency matrix

  • $H^{l}W^l$ --> a linear transform of all vertex embeddings in the $l^{th}$ layer.

  • Left-multiplying by the normalized adjacency matrix --> replaces the features of a vertex with the (weighted) sum of its neighbors' features

Example

Assume we have a 2-layer GCN for classification. Let $A_2 = \hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$. Based on the forward propagation rule, the output of the GCN is

$Z = f(X,A) = softmax(A_2\, ReLU(A_2XW^0)W^1)$

  • Where,

    • $W^0 \in \mathbb{R}^{C\times H}$ is the input-to-hidden weight matrix, used to transform the input embedding X into the hidden embedding H.

    • $W^1 \in \mathbb{R}^{H\times F}$ is the hidden-to-output weight matrix, used to transform the hidden-layer embedding H into the F output classes.

  • For semi-supervised learning, the loss function is the cross-entropy over the labeled nodes:

    $L = -\sum_{l\in y_L}\sum_{f=1}^F Y_{lf}\ln Z_{lf}$

    Where $y_L$ is the set of labeled node indices.

  • Lastly, we can use gradient descent to learn $W^0$ and $W^1$.
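A sketch of the full 2-layer forward pass and the semi-supervised loss above (toy graph, random weights, hypothetical labels; a real implementation would optimize $W^0$ and $W^1$ with gradient descent):

```python
# 2-layer GCN forward pass and the semi-supervised cross-entropy loss over labeled nodes.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)
D_hat_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A2 = D_hat_inv_sqrt @ A_hat @ D_hat_inv_sqrt            # A_2 in the text

X = np.random.randn(4, 3)                               # N = 4 nodes, C = 3 input features
W0 = np.random.randn(3, 8)                              # C x H
W1 = np.random.randn(8, 2)                              # H x F, with F = 2 classes

Z = softmax(A2 @ np.maximum(0.0, A2 @ X @ W0) @ W1)     # Z = f(X, A)

Y = np.array([[1, 0], [0, 1], [0, 0], [0, 0]], float)   # one-hot labels; nodes 2 and 3 unlabeled
labeled = [0, 1]                                        # y_L: indices of labeled nodes
loss = -np.sum(Y[labeled] * np.log(Z[labeled]))         # L = -sum_{l in y_L} sum_f Y_lf ln Z_lf
print(loss)
```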

Drawback

Since GCN takes the whole adjacency matrix $A$ and the whole feature matrix $X$ as input, it needs a lot of memory. The authors used sparse matrices to reduce memory usage and GPUs to accelerate the computation. However, it is still not suitable for very large graphs.

GraphSAGE introduces another way to address this issue.

Local Connectivity and Parameter Sharing

Reference
