GraphSAGE

Why do we need GraphSAGE?

  • GCN needs to load the whole adjacency matrix and feature matrix into memory at once,

    => which is impossible for very large graphs.

  • GCN needs to know the structure of the whole graph, including the vertices in the testing set,

    => which is impossible for some tasks like time series prediction.

Inductive learning vs. transductive learning

Graph data is different from other data formats: each vertex in a graph can use information from its neighbors. Therefore, if some vertices belong to the training set while their neighbors belong to the testing set, how can we use that information? There are two types of methods:

  • Transductive Learning: training a machine learning model WITH the information of samples in the test set.

    Since GCN uses the information of all neighbors (including test-set vertices) while training a model, it is transductive learning.

  • Inductive Learning: training a machine learning model WITHOUT the information of samples in the test set. e.g. GraphSAGE

GraphSAGE

General framework

GraphSAGE is an inductive framework. Specifically, it uses only the vertices and edges in the training set for training. An advantage of inductive learning is that it can use the information of known vertices to infer the embedding vectors of unknown vertices.

  • The name GraphSAGE comes from Graph SAmple and aggreGatE:

    • SAmple: how to sample the neighbors

    • aggreGatE: how to aggregate the embedding information of the neighbors to update a vertex's own embedding

  • There are three steps in GraphSAGE:

    • Sample the neighbors

    • Aggregate the information from the sampled neighbors at each level to update the embedding of the current vertex

    • Predict the label of the vertex based on its updated embedding

Detailed algorithms

Vertex embedding generation (Feed-forward propagation) method:

  • How to generate/update the embedding information of the vertices in a graph?

    • We assume that the GraphSAGE model is trained and all parameters are known

      • Parameters of the aggregator functions $AGGREGATE_k, k \in \{1,...,K\}$

      • Weight matrices $W^k, k \in \{1,...,K\}$, which are used to transform the embeddings between layers of the model

  • Detail of the embedding generation algorithm (a code sketch follows this list)

    • (line 1) Initialize the embedding of each vertex with its input feature vector $h_v^0$

    • (line 3) For each vertex $v$

    • (line 4) Sample the embeddings $h_u, u \in N(v)$ of its neighbors.

      Here, $N(v)$ denotes a fixed-size sample of the neighbors of $v$

    • (line 5) Concatenate the neighbors' aggregated embedding $h_{N(v)}$ with the current vertex's embedding $h_v$, and apply a non-linear transform $\sigma(W \cdot \square)$ to update the vertex's embedding

  • How to understand K?

    • K is the number of aggregators and the number of weight matrices, which is also the number of layers in the topology.

    • Since the aggregator and weight matrix of each layer are shared across vertices, and each layer propagates information one hop further, K equals the maximum number of hops to the farthest neighbors whose information reaches a vertex.

    • For example, in the visual illustration, the red vertex needs information from its neighbors at hop = 1 and hop = 2. Therefore,

      • First, when k = 1, we aggregate the embedding information from the blue vertices to the red vertex, AND from the green vertices to the blue vertices.

      • Then, when k = 2, the embedding of the red vertex is updated again based on the updated embeddings of the blue vertices.

      ==> the updated embedding of the red vertex contains the information of both the blue vertices and the green vertices.
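
Below is a minimal NumPy sketch of this embedding-generation loop with K = 2, a mean aggregator, and ReLU as $\sigma$; the toy graph, dimensions, and random weights are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: adjacency lists and 4-dimensional input features (assumed for illustration).
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
features = {v: rng.normal(size=4) for v in neighbors}

K = 2                                       # number of layers / hops
dims = [4, 8, 8]                            # embedding size before/after each layer
# One weight matrix per layer; its input is the concatenation [h_v ; h_N(v)].
W = [rng.normal(size=(dims[k + 1], 2 * dims[k])) for k in range(K)]

def sample(v, size=2):
    # Fixed-size neighbor sampling with replacement (see the sampling section below).
    return rng.choice(neighbors[v], size=size, replace=True)

# Line 1: initialize each vertex's embedding h_v^0 with its feature vector.
h = {v: features[v] for v in neighbors}

for k in range(K):                          # K aggregation layers
    h_next = {}
    for v in neighbors:                     # line 3: for each vertex v
        # Line 4: aggregate the sampled neighbors' embeddings (mean aggregator).
        h_neigh = np.mean([h[u] for u in sample(v)], axis=0)
        # Line 5: concatenate h_v with h_N(v), transform, and apply sigma.
        h_next[v] = np.maximum(W[k] @ np.concatenate([h[v], h_neigh]), 0.0)
    h = h_next                              # after layer k+1, h_v mixes (k+1)-hop info

print(h[0].shape)                           # final embedding of vertex 0 -> (8,)
```

After two layers, the red-vertex analogy holds: vertex 0's final embedding mixes information from both its 1-hop and 2-hop neighborhoods.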

Sampling algorithm

GraphSAGE uses a fixed sample size. Specifically, it defines the required number of neighbors, S, and samples the neighbors of each vertex with replacement until the fixed sample size S is reached.

==> This ensures that every vertex has the same number of sampled neighbors, so the information of multiple vertices can be stacked into a tensor and processed on a GPU for training (see the sketch below).
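
A small sketch of this fixed-size sampling rule; falling back to sampling with replacement only when a vertex has fewer than S neighbors is an assumption consistent with the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_fixed(neighbor_list, S):
    # Without replacement when the vertex has at least S neighbors,
    # with replacement otherwise, so the output always has exactly S entries.
    replace = len(neighbor_list) < S
    return rng.choice(neighbor_list, size=S, replace=replace)

print(sample_fixed([4, 7, 9], S=5))            # too few neighbors -> repeats allowed
print(sample_fixed([1, 2, 3, 4, 5, 6], S=5))   # enough neighbors -> no repeats
```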

Aggregator Architectures

There are three aggregators in GraphSAGE:

  • Mean aggregator: take the elementwise mean of the vectors in $\{h_u^{k-1}, \forall u \in N(v)\}$

    -> best performance

    $$h_v^k \gets \sigma\left(W \cdot \mathrm{MEAN}\left(\{h_v^{k-1}\} \cup \{h_u^{k-1}, \forall u \in N(v)\}\right)\right)$$

  • LSTM aggregator: a more complex and expressive architecture; note that an LSTM is not permutation-invariant, so the neighbors are fed in a random order

  • Pooling aggregator: feed each neighbor's vector independently through a fully connected neural network, then apply an elementwise max-pooling to aggregate the information (both the mean and pooling aggregators are sketched below)
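
Minimal NumPy sketches of the mean and pooling aggregators for one vertex; the dimensions, random weights, and the choice of ReLU for $\sigma$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # embedding dimension (assumed)
h_v = rng.normal(size=d)                # current vertex embedding h_v^{k-1}
h_neigh = rng.normal(size=(3, d))       # three sampled neighbor embeddings h_u^{k-1}

relu = lambda x: np.maximum(x, 0.0)     # sigma, assumed to be ReLU here

# Mean aggregator: elementwise mean over {h_v} U {h_u : u in N(v)},
# then one linear transform and a non-linearity, as in the update rule above.
W_mean = rng.normal(size=(d, d))
h_v_new = relu(W_mean @ np.mean(np.vstack([h_v, h_neigh]), axis=0))

# Pooling aggregator: feed each neighbor independently through a fully
# connected layer, then take the elementwise max across neighbors.
W_pool, b_pool = rng.normal(size=(d, d)), rng.normal(size=d)
pooled = np.max(relu(h_neigh @ W_pool.T + b_pool), axis=0)

print(h_v_new.shape, pooled.shape)      # both (4,)
```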

Learning the parameters of GraphSAGE

For supervised learning tasks, the cross-entropy between the predicted labels and the true labels can be used as the loss function.
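
For instance, with a hypothetical linear classifier on top of the final embeddings (all names, shapes, and values here are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

h_v = rng.normal(size=8)                 # final embedding of a labeled vertex
W_cls = rng.normal(size=(3, 8))          # hypothetical 3-class linear classifier
y = 1                                    # true class label of the vertex

logits = W_cls @ h_v
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # softmax over the 3 classes
loss = -np.log(probs[y])                 # cross-entropy with the true label
print(loss)
```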

For unsupervised learning tasks, the loss function below is used. It encourages neighboring vertices to have similar embeddings (a code sketch follows the notation list below).

$$J_{\mathcal{G}}\left(\mathbf{z}_{u}\right)=-\log \left(\sigma\left(\mathbf{z}_{u}^{\top} \mathbf{z}_{v}\right)\right)-Q \cdot \mathbb{E}_{v_{n} \sim P_{n}(v)} \log \left(\sigma\left(-\mathbf{z}_{u}^{\top} \mathbf{z}_{v_{n}}\right)\right)$$

  • Where,

    • $\mathbf{z}_u$ is the embedding of vertex $u$

    • $v$ is a neighbor of vertex $u$

    • $P_n$ is the negative-sampling distribution, defined over vertices that are not neighbors of $u$

    • $Q$ is the number of negative samples.
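
A NumPy sketch of this loss for a single $(u, v)$ pair, approximating the expectation with the mean over $Q$ sampled negatives; all embeddings here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, Q = 8, 5                              # embedding size and negative-sample count (assumed)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

z_u = rng.normal(size=d)                 # embedding z_u of vertex u
z_v = rng.normal(size=d)                 # embedding z_v of a neighbor v of u
z_neg = rng.normal(size=(Q, d))          # Q negative samples drawn from P_n(v)

# First term pulls neighbors together; second term pushes negatives apart.
positive_term = -np.log(sigmoid(z_u @ z_v))
negative_term = -Q * np.mean(np.log(sigmoid(-(z_neg @ z_u))))
print(positive_term + negative_term)     # J_G(z_u) for this (u, v) pair
```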

In summary

  • GraphSAGE uses a sampling mechanism to avoid the restrictions of GCN

  • GraphSAGE avoids using the information of the testing dataset

  • However, it does not take into account that different neighbors may have different importance to a given vertex, which is addressed by the GAT model

