Graph Attention Network (GAT)

Fundamentals

  • Two kinds of features in graph data

    • Graph structure feature: for a vertex in a graph, its neighbors form the first kind of feature, i.e., the graph structure, or the adjacency matrix.

    • Vertex feature: besides the graph structure, the feature vector attached to each vertex is the second kind.

  • Limitations of GCN

    • Cannot handle inductive tasks, i.e., tasks on dynamic graphs whose structure changes after training.

    • Bottleneck on directed graphs: it is not easy to assign different weights to a vertex's different neighbors.

  • Masked graph attention vs. global graph attention

    • Global graph attention

      For a vertex, attention is computed over all other vertices in the graph.

      • Pros: does not rely on the graph structure at all

      • Cons: loses the graph structure feature and has a high computation cost

    • Masked graph attention

      For a vertex, attention is computed only over its neighbors. This is the variant GAT uses.

Graph Attention Network

GAT consists of two main steps:

Attention coefficient

  • For a vertex $i$, calculate the similarities between it and all of its neighbors:

    $e_{ij} = a([Wh_i \| Wh_j]), \quad j \in N_i$

    • The shared weight matrix $W$ transforms the embedding features of each vertex into a higher-dimensional space.

    • $[\cdot \| \cdot]$ denotes concatenation of the transformed feature vectors.

    • $a(\cdot)$ maps the concatenated high-dimensional features to a scalar; in GAT it is implemented as a single-layer feed-forward neural network.

  • Then a softmax is used to normalize the similarities (see the sketch after this list):

    $\alpha_{ij} = \frac{\exp(\mathrm{LeakyReLU}(e_{ij}))}{\sum_{k\in N_i}\exp(\mathrm{LeakyReLU}(e_{ik}))}$
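
A minimal PyTorch sketch of both steps on a toy graph; the sizes, the random adjacency matrix, and the variable names (`W`, `a`, `adj`) are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

# Toy setup (assumed sizes): N vertices with F_in-dimensional features.
N, F_in, F_out = 5, 8, 4
h = torch.randn(N, F_in)                   # vertex features
adj = torch.randint(0, 2, (N, N)).bool()   # random adjacency matrix (True = edge)
adj |= torch.eye(N, dtype=torch.bool)      # add self-loops, as in the paper

W = torch.randn(F_in, F_out)               # shared weight matrix W
a = torch.randn(2 * F_out)                 # single-layer feed-forward a(.)

Wh = h @ W                                 # transform every vertex: (N, F_out)

# e_ij = a([W h_i || W h_j]) for every ordered pair (i, j)
Wh_i = Wh.unsqueeze(1).expand(N, N, F_out) # [i, j] holds W h_i
Wh_j = Wh.unsqueeze(0).expand(N, N, F_out) # [i, j] holds W h_j
e = torch.cat([Wh_i, Wh_j], dim=-1) @ a    # (N, N)

# Masked attention: keep only real neighbors, then softmax row-wise.
e = F.leaky_relu(e, negative_slope=0.2)
e = e.masked_fill(~adj, float('-inf'))
alpha = torch.softmax(e, dim=1)            # alpha[i, j] = attention of vertex i on neighbor j
```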

Aggregate

The second step is simply a weighted sum (see the sketch after this list):

  • $h_i' = \sigma\left(\sum_{j\in N_i} \alpha_{ij} W h_j\right)$

    • $h_i'$ is the output embedding vector of vertex $i$ produced by GAT

    • $\sigma(\cdot)$ is the activation function

  • Multi-head attention, where each head $k$ has its own $W^{k}$ and coefficients $\alpha_{ij}^{k}$, and the $K$ head outputs are concatenated:

    $h_i'(K) = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j\in N_i} \alpha_{ij}^{k} W^{k} h_j\right)$
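
Continuing the sketch above, a rough rendering of the aggregation and of multi-head concatenation; `gat_head` is a hypothetical helper, and ELU is used for $\sigma$ as in the GAT paper.

```python
# Aggregation: h_i' = sigma(sum_j alpha_ij * W h_j), reusing alpha and Wh from the sketch above.
h_prime = F.elu(alpha @ Wh)                # (N, F_out)

# Multi-head attention (illustrative): K independent heads, each with its own W^k and a^k,
# whose outputs are concatenated along the feature dimension.
def gat_head(h, adj, F_out):
    W_k = torch.randn(h.size(1), F_out)    # per-head weight matrix W^k
    a_k = torch.randn(2 * F_out)           # per-head attention vector a^k
    Wh_k = h @ W_k
    e_k = torch.cat([Wh_k.unsqueeze(1).expand(-1, h.size(0), -1),
                     Wh_k.unsqueeze(0).expand(h.size(0), -1, -1)], dim=-1) @ a_k
    e_k = F.leaky_relu(e_k, 0.2).masked_fill(~adj, float('-inf'))
    return F.elu(torch.softmax(e_k, dim=1) @ Wh_k)

K = 3
h_multi = torch.cat([gat_head(h, adj, F_out) for _ in range(K)], dim=-1)  # (N, K * F_out)
```

In the GAT paper, concatenation is used for intermediate layers, while the final (prediction) layer averages the heads instead.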

Three types of attention mechanisms

There are mainly three types of attention mechanisms; all of them can be used to compute the relevance of a vertex's neighbors.

  • Learn attention weights

  • Similarity-based attention

  • Attention-guided walk (Not covered)

Learn attention weights

  • The idea is to learn the relevance between a vertex and its neighbors from their embedding features.

  • Given a vertex $v_0$, its neighbors $v_1, \dots, v_{|N_{v_0}|}$, and their embedding vectors $x_0, x_1, \dots, x_{|N_{v_0}|}$, the attention weight $\alpha_{0,j}$ between $v_0$ and $v_j$ is

    $\alpha_{0,j} = \frac{e_{0,j}}{\sum_{k\in N_{v_0}} e_{0,k}}$

    where $e_{0,j}$ is the learned relevance between $v_0$ and $v_j$.

  • In GAT, the attention weight is calculated as follows (a snippet after this list compares the two normalizations):

    $\alpha_{0,j} = \frac{\exp(\mathrm{LeakyReLU}(e_{0,j}))}{\sum_{k\in N_{v_0}}\exp(\mathrm{LeakyReLU}(e_{0,k}))}$
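
A tiny numeric illustration of the difference between the two normalizations; the relevance scores $e_{0,j}$ below are made-up values.

```python
import torch
import torch.nn.functional as F

e0 = torch.tensor([1.0, 2.0, 4.0])        # made-up relevances e_{0,j} for three neighbors

alpha_plain = e0 / e0.sum()               # direct normalization, as in the generic formula
alpha_gat = torch.softmax(F.leaky_relu(e0, 0.2), dim=0)  # GAT-style LeakyReLU + softmax

print(alpha_plain)  # tensor([0.1429, 0.2857, 0.5714])
print(alpha_gat)    # the softmax puts much more weight on the most relevant neighbor
```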

Similarity-based attention

  • In Learn attention weights, the similarity between two vertices is not computed directly; instead, a single-layer feed-forward neural network learns it.

  • Alternatively, we can compute the similarity between the vertices directly and use it as the relevance between a vertex and its neighbors:

    $\alpha_{0,j} = \frac{\exp\left(\beta \cdot \cos\left(W x_0, W x_j\right)\right)}{\sum_{k \in N_{v_0}} \exp\left(\beta \cdot \cos\left(W x_0, W x_k\right)\right)}$

    where

    • $\beta$ is a trainable bias term

    • $\cos(\cdot, \cdot)$ computes the cosine similarity between the transformed embedding features of the vertices

  • An obvious difference between Similarity-based attention and Learn attention weights:

    • Similarity-based attention uses cosine similarity to measure the similarity between vertex features, while Learn attention weights uses $a(\cdot)$ to learn the similarity (see the sketch after this list).
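
A minimal sketch of the similarity-based variant, reusing `h`, `adj`, `W`, and the imports from the first sketch; treating $\beta$ as a single trainable scalar is an assumption for illustration.

```python
# Similarity-based attention over the same toy graph.
beta = torch.tensor(1.0, requires_grad=True)  # trainable scalar beta

Wh = h @ W
Wh_unit = F.normalize(Wh, dim=1)              # unit-length rows, so the dot product is cos(., .)
cos = Wh_unit @ Wh_unit.t()                   # cos(W x_0, W x_j) for every pair of vertices

scores = (beta * cos).masked_fill(~adj, float('-inf'))
alpha_sim = torch.softmax(scores, dim=1)      # row i: attention of vertex i over its neighbors
```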

More thoughts

Connections between GAT and GCN

  • Essentially, both GCN and GAT aggregate the features of a vertex's neighbors to form its new embedding vector, exploiting the local stationarity of the graph to learn new vertex representations.

  • The difference is that

    • GCN uses the Laplacian matrix

    • GAT uses attention coefficients

  • GAT arguably looks better, since it encodes the different relevance of each neighbor to the central vertex.

Why is GAT suitable for directed graphs?

  • Essentially because GAT is based on node-wise computation.

    • The computation loops over the vertices one at a time, and each vertex attends only to its own neighbors

    => No need for the Laplacian matrix, which assumes an undirected (symmetric) graph

Why is GAT suitable for inductive tasks?

  • The learnable parameters in GAT are $W$ and $a(\cdot)$, and the computation is node-wise

    => so they are only loosely coupled to the graph structure.

    => A change in the graph structure has little impact on GAT: we only need to update $N_i$ and recompute the attention coefficients for the affected vertices.

  • GCN, in contrast, learns parameters that are tied to the whole graph through the Laplacian matrix.

    => A single change in the graph structure requires a complete retraining of the parameters.
