Graph Attention Network (GAT)

Fundamentals

  • Two kinds of features in graph data

    • Graph structure feature: for a vertex in a graph, its neighbors form the first kind of feature, i.e., the graph structure, or the adjacency matrix.

    • Vertex feature: besides the graph structure, the feature vector attached to each vertex is the second kind.

  • Limitations of GCN

    • Cannot handle inductive tasks, i.e., tasks on dynamic graphs whose structure changes after training.

    • Bottleneck on directed graphs: it is not easy to assign different weights to a vertex's different neighbors.

  • Masked graph attention vs. global graph attention

    • Global graph attention

      For a vertex, attention is computed over all other vertices in the graph.

      • Pros: does not rely on the graph structure at all

      • Cons: loses the graph structure feature and has a high computation cost

    • Masked graph attention

      For a vertex, attention is computed only over its neighbors. This is the variant GAT uses.

Graph Attention Network

GAT consists of two main steps:

Attention coefficient

  • For a vertex $i$, calculate the similarities between it and all of its neighbors:

    $e_{ij} = a([Wh_i \| Wh_j]), \quad j \in N_i$

    • The shared weight matrix $W$ transforms the embedding features of each vertex into a higher-dimensional space.

    • $[\cdot \| \cdot]$ denotes concatenation of the transformed feature vectors.

    • $a(\cdot)$ maps the concatenated high-dimensional features to a scalar; in GAT it is implemented as a single-layer feed-forward neural network.

  • Then a softmax is used to normalize the similarities (see the sketch after this list):

    $\alpha_{ij} = \frac{\exp(\mathrm{LeakyReLU}(e_{ij}))}{\sum_{k\in N_i}\exp(\mathrm{LeakyReLU}(e_{ik}))}$
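
A minimal PyTorch sketch of both steps on a toy graph; the sizes, the random adjacency matrix, and the variable names (`W`, `a`, `adj`) are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

# Toy setup (assumed sizes): N vertices with F_in-dimensional features.
N, F_in, F_out = 5, 8, 4
h = torch.randn(N, F_in)                   # vertex features
adj = torch.randint(0, 2, (N, N)).bool()   # random adjacency matrix (True = edge)
adj |= torch.eye(N, dtype=torch.bool)      # add self-loops, as in the paper

W = torch.randn(F_in, F_out)               # shared weight matrix W
a = torch.randn(2 * F_out)                 # single-layer feed-forward a(.)

Wh = h @ W                                 # transform every vertex: (N, F_out)

# e_ij = a([W h_i || W h_j]) for every ordered pair (i, j)
Wh_i = Wh.unsqueeze(1).expand(N, N, F_out) # [i, j] holds W h_i
Wh_j = Wh.unsqueeze(0).expand(N, N, F_out) # [i, j] holds W h_j
e = torch.cat([Wh_i, Wh_j], dim=-1) @ a    # (N, N)

# Masked attention: keep only real neighbors, then softmax row-wise.
e = F.leaky_relu(e, negative_slope=0.2)
e = e.masked_fill(~adj, float('-inf'))
alpha = torch.softmax(e, dim=1)            # alpha[i, j] = attention of vertex i on neighbor j
```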

Aggregate

The second step is simply a weighted sum (see the sketch after this list):

  • $h_i' = \sigma\left(\sum_{j\in N_i} \alpha_{ij} W h_j\right)$

    • $h_i'$ is the output embedding vector of vertex $i$ produced by GAT

    • $\sigma(\cdot)$ is the activation function

  • Multi-head attention, where each head $k$ has its own $W^{k}$ and coefficients $\alpha_{ij}^{k}$, and the $K$ head outputs are concatenated:

    $h_i'(K) = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j\in N_i} \alpha_{ij}^{k} W^{k} h_j\right)$
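
Continuing the sketch above, a rough rendering of the aggregation and of multi-head concatenation; `gat_head` is a hypothetical helper, and ELU is used for $\sigma$ as in the GAT paper.

```python
# Aggregation: h_i' = sigma(sum_j alpha_ij * W h_j), reusing alpha and Wh from the sketch above.
h_prime = F.elu(alpha @ Wh)                # (N, F_out)

# Multi-head attention (illustrative): K independent heads, each with its own W^k and a^k,
# whose outputs are concatenated along the feature dimension.
def gat_head(h, adj, F_out):
    W_k = torch.randn(h.size(1), F_out)    # per-head weight matrix W^k
    a_k = torch.randn(2 * F_out)           # per-head attention vector a^k
    Wh_k = h @ W_k
    e_k = torch.cat([Wh_k.unsqueeze(1).expand(-1, h.size(0), -1),
                     Wh_k.unsqueeze(0).expand(h.size(0), -1, -1)], dim=-1) @ a_k
    e_k = F.leaky_relu(e_k, 0.2).masked_fill(~adj, float('-inf'))
    return F.elu(torch.softmax(e_k, dim=1) @ Wh_k)

K = 3
h_multi = torch.cat([gat_head(h, adj, F_out) for _ in range(K)], dim=-1)  # (N, K * F_out)
```

In the GAT paper, concatenation is used for intermediate layers, while the final (prediction) layer averages the heads instead.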

Three types of attention mechanisms

There are mainly three types of attention mechanisms; all of them can be used to compute the relevance of a vertex's neighbors.

  • Learn attention weights

  • Similarity-based attention

  • Attention-guided walk (Not covered)

Learn attention weights

  • The idea is to learn the relevance between a vertex and its neighbors from their embedding features.

  • Given a vertex $v_0$, its neighbors $v_1, \dots, v_{|N_{v_0}|}$, and their embedding vectors $x_0, x_1, \dots, x_{|N_{v_0}|}$, the attention weight $\alpha_{0,j}$ between $v_0$ and $v_j$ is

    $\alpha_{0,j} = \frac{e_{0,j}}{\sum_{k\in N_{v_0}} e_{0,k}}$

    where $e_{0,j}$ is the learned relevance between $v_0$ and $v_j$.

  • In GAT, the attention weight is calculated as follows (a snippet after this list compares the two normalizations):

    $\alpha_{0,j} = \frac{\exp(\mathrm{LeakyReLU}(e_{0,j}))}{\sum_{k\in N_{v_0}}\exp(\mathrm{LeakyReLU}(e_{0,k}))}$
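
A tiny numeric illustration of the difference between the two normalizations; the relevance scores $e_{0,j}$ below are made-up values.

```python
import torch
import torch.nn.functional as F

e0 = torch.tensor([1.0, 2.0, 4.0])        # made-up relevances e_{0,j} for three neighbors

alpha_plain = e0 / e0.sum()               # direct normalization, as in the generic formula
alpha_gat = torch.softmax(F.leaky_relu(e0, 0.2), dim=0)  # GAT-style LeakyReLU + softmax

print(alpha_plain)  # tensor([0.1429, 0.2857, 0.5714])
print(alpha_gat)    # the softmax puts much more weight on the most relevant neighbor
```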

Similarity-based attention

  • In Learn attention weights, the similarity between two vertices is not computed directly; instead, a single-layer feed-forward neural network learns it.

  • Alternatively, we can compute the similarity between the vertices directly and use it as the relevance between a vertex and its neighbors:

    $\alpha_{0,j} = \frac{\exp\left(\beta \cdot \cos\left(W x_0, W x_j\right)\right)}{\sum_{k \in N_{v_0}} \exp\left(\beta \cdot \cos\left(W x_0, W x_k\right)\right)}$

    where

    • $\beta$ is a trainable bias term

    • $\cos(\cdot, \cdot)$ computes the cosine similarity between the transformed embedding features of the vertices

  • An obvious difference between Similarity-based attention and Learn attention weights:

    • Similarity-based attention uses cosine similarity to measure the similarity between vertex features, while Learn attention weights uses $a(\cdot)$ to learn the similarity (see the sketch after this list).
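
A minimal sketch of the similarity-based variant, reusing `h`, `adj`, `W`, and the imports from the first sketch; treating $\beta$ as a single trainable scalar is an assumption for illustration.

```python
# Similarity-based attention over the same toy graph.
beta = torch.tensor(1.0, requires_grad=True)  # trainable scalar beta

Wh = h @ W
Wh_unit = F.normalize(Wh, dim=1)              # unit-length rows, so the dot product is cos(., .)
cos = Wh_unit @ Wh_unit.t()                   # cos(W x_0, W x_j) for every pair of vertices

scores = (beta * cos).masked_fill(~adj, float('-inf'))
alpha_sim = torch.softmax(scores, dim=1)      # row i: attention of vertex i over its neighbors
```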

More thoughts

Connections between GAT and GCN

  • Essentially, both GCN and GAT aggregate the features of a vertex's neighbors to form its new embedding vector, exploiting the local stationarity of the graph to learn new vertex representations.

  • The difference is that

    • GCN uses the Laplacian matrix

    • GAT uses attention coefficients

  • GAT arguably looks better, since it encodes the different relevance of each neighbor to the central vertex.

Why is GAT suitable for directed graphs?

  • Essentially because GAT is based on node-wise computation.

    • The computation loops over the vertices one at a time, and each vertex attends only to its own neighbors

    => No need for the Laplacian matrix, which assumes an undirected (symmetric) graph

Why is GAT suitable for inductive tasks?

  • The learnable parameters in GAT are $W$ and $a(\cdot)$, and the computation is node-wise

    => so they are only loosely coupled to the graph structure.

    => A change in the graph structure has little impact on GAT: we only need to update $N_i$ and recompute the attention coefficients for the affected vertices.

  • GCN, in contrast, learns parameters that are tied to the whole graph through the Laplacian matrix.

    => A single change in the graph structure requires a complete retraining of the parameters.
