Graph Attention Network (GAT)
Fundamentals
Two features in graph data
Graph structure feature: for a vertex in a graph, its neighbors form the first feature, which is the graph structure, i.e., the adjacency matrix.
Vertex feature: besides the graph structure, the features of the vertex itself form the second feature.
Limitations of GCN
Cannot handle inductive tasks, i.e., dynamic graph problems where vertices unseen during training appear at test time.
Bottleneck on directed graphs: it is not easy to assign different weights to a vertex's different neighbors.
Mask graph attention or global graph attention
Global graph attention
For a vertex, calculate attention over all other vertices in the graph.
Pros: does not rely on the graph structure at all.
Cons: loses the graph structure feature and has a high computation cost.
Mask graph attention
For a vertex, attention only focuses on its neighbors.
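As a rough illustration, the sketch below contrasts the two choices with dense matrices (the score matrix `scores` and adjacency matrix `adj` are hypothetical placeholders): global attention normalizes scores over every vertex, while mask attention restricts the softmax to each vertex's neighbors.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical raw attention scores for a 4-vertex graph: scores[i, j] = relevance of j to i
scores = np.random.randn(4, 4)
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)      # adjacency matrix with self-loops

# Global graph attention: normalize over every vertex in the graph
alpha_global = softmax(scores, axis=-1)

# Mask graph attention: non-neighbors get -inf, so the softmax only covers neighbors
alpha_mask = softmax(np.where(adj > 0, scores, -np.inf), axis=-1)
```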
Graph Attention Network
There are mainly two steps for GAT:
Attention coefficient
For a vertex $i$, calculate the similarity between it and each of its neighbors $j \in \mathcal{N}_i$:

$$e_{ij} = a\left(\left[W\vec{h}_i \,\Vert\, W\vec{h}_j\right]\right), \quad j \in \mathcal{N}_i$$
The shared weight matrix $W$ is used to transform the embedding features of a vertex to a higher dimension.
$\Vert$ denotes concatenation of the transformed feature vectors.
$a(\cdot)$ maps the concatenated high-dimensional features to a scalar; it is implemented as a single-layer feed-forward neural network.
Then a softmax (applied after a LeakyReLU nonlinearity) is used to normalize the similarities:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}(e_{ij})\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}(e_{ik})\right)}$$
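As a minimal sketch of this first step, the snippet below scores one vertex against its neighbors and normalizes the scores; the names `h`, `W`, `a`, `i`, and `neighbors` are illustrative, and the parameters are random here rather than learned.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# Hypothetical setup: 5 vertices with 8-dim input features, transformed to 16 dims
F_in, F_out = 8, 16
h = np.random.randn(5, F_in)            # input embeddings h_j
W = np.random.randn(F_out, F_in)        # shared weight matrix (learned in practice)
a = np.random.randn(2 * F_out)          # single-layer feed-forward "a" (learned in practice)

i = 0
neighbors = [0, 1, 3]                   # N_i, including the self-loop

# e_ij = a^T [W h_i || W h_j], then LeakyReLU + softmax over the neighborhood
Wh = h @ W.T
e = np.array([a @ np.concatenate([Wh[i], Wh[j]]) for j in neighbors])
alpha = np.exp(leaky_relu(e))
alpha = alpha / alpha.sum()             # alpha[k] = attention of vertex i on neighbors[k]
```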
Aggregate
The second step is just a weighted sum:

$$\vec{h}'_i = \sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}\, W\vec{h}_j\right)$$
$\vec{h}'_i$ is the output embedding vector of vertex $i$ produced by the GAT layer.
$\sigma$ is the activation function.
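Continuing the sketch above, the aggregation for vertex $i$ is the $\alpha$-weighted sum of the transformed neighbor features, followed by a nonlinearity (ELU is an assumed choice for $\sigma$ here):

```python
# Assumed choice of activation; any sigma would do
def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1)

# New embedding of vertex i: weighted sum of transformed neighbor features, then activation
h_i_new = elu(sum(alpha[k] * Wh[j] for k, j in enumerate(neighbors)))
```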
Multi-head attention: $K$ independent attention heads are computed and their outputs are concatenated:

$$\vec{h}'_i = \big\Vert_{k=1}^{K}\, \sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}^{k}\, W^{k}\vec{h}_j\right)$$
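A minimal multi-head sketch, reusing the helpers above: each head has its own $W^k$ and $a^k$ (random placeholders here), and the head outputs are concatenated; a final GAT layer would typically average them instead.

```python
def gat_head(h, W_k, a_k, i, neighbors):
    """One attention head: score neighbors, normalize, then aggregate."""
    Wh = h @ W_k.T
    e = np.array([a_k @ np.concatenate([Wh[i], Wh[j]]) for j in neighbors])
    alpha = np.exp(leaky_relu(e))
    alpha = alpha / alpha.sum()
    return elu(sum(alpha[k] * Wh[j] for k, j in enumerate(neighbors)))

K = 3
heads = [gat_head(h, np.random.randn(F_out, F_in), np.random.randn(2 * F_out), i, neighbors)
         for _ in range(K)]
h_i_multi = np.concatenate(heads)       # shape (K * F_out,)
```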
3 types of Attention mechanisms
There are mainly three types of attention mechanisms. All of them can be used to find the relevance of a vertex's neighbors.
Learn attention weights
Similarity-based attention
Attention-guided walk (Not covered)
Learn attention weights
The idea is to learn the relevance between a vertex and its neighbors based on their embedding features.
Given a set of vertices and their embedding vectors $\{\vec{h}_1, \dots, \vec{h}_n\}$, the attention weight between vertices $v_i$ and $v_j$ is

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_i}\exp(e_{ik})}$$

where $e_{ij}$ is the relevance between $v_i$ and $v_j$.
In GAT, the relevance is calculated as

$$e_{ij} = \mathrm{LeakyReLU}\left(\vec{a}^{\top}\left[W\vec{h}_i \,\Vert\, W\vec{h}_j\right]\right)$$
Similarity-based attention
In Learn attention weights, the similarity between two vertices is not calculated directly; instead, a single-layer feed-forward neural network is used to learn it.
Alternatively, we can calculate the similarity between vertices directly and use it as the relevance between a vertex and its neighbors:

$$e_{ij} = \beta \cdot \cos\left(W\vec{h}_i,\, W\vec{h}_j\right)$$

where
$\beta$ is a trainable bias (a learned scaling parameter),
$\cos(\cdot,\cdot)$ calculates the cosine similarity between the transformed embedding features of the vertices.
An obvious difference between Similarity-based attention and Learn attention weights is that
Similarity-based attention uses the cosine similarity to measure the similarity between vertex features, while Learn attention weights uses a single-layer feed-forward network ($\vec{a}$) to learn the similarity.
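Continuing the earlier sketch (reusing the hypothetical `Wh`, `i`, and `neighbors`), the similarity-based scores could be computed along these lines, with `beta` standing in for the trainable scaling parameter:

```python
def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

beta = 1.0                               # trainable in practice, fixed here for illustration
e_sim = np.array([beta * cosine(Wh[i], Wh[j]) for j in neighbors])
alpha_sim = np.exp(e_sim)
alpha_sim = alpha_sim / alpha_sim.sum()  # softmax-normalized, as in the learned-weights case
```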
More thoughts
Connections between GAT and GCN
Essentially, both GCN and GAT aggregate the features of a vertex's neighbors to form a new embedding vector, exploiting the local stationarity of the graph to learn vertex representations.
The difference is that
GCN uses the Laplacian matrix
GAT uses attention coefficients
GAT looks like the better choice, since it encodes the different relevance of each neighbor to the central vertex.
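The contrast can be sketched as follows: in GCN the neighbor weights are fixed by the normalized graph structure, while in GAT they are learned per edge (the matrices below are illustrative placeholders, not trained values).

```python
import numpy as np

A_hat = np.array([[1, 1, 0],
                  [1, 1, 1],
                  [0, 1, 1]], dtype=float)        # adjacency matrix with self-loops
D_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)

# GCN: aggregation weights are fixed by the (symmetrically normalized) graph structure
gcn_weights = D_inv_sqrt @ A_hat @ D_inv_sqrt

# GAT: aggregation weights are learned attention coefficients, one per edge, so two
# neighbors of the same vertex can contribute with different importance
gat_weights = np.where(A_hat > 0, np.random.rand(3, 3), 0.0)
gat_weights = gat_weights / gat_weights.sum(axis=1, keepdims=True)

# In both cases the layer output has the form  H' = sigma(weights @ H @ W.T)
```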
Why is GAT suitable for a directed graph?
Essentially, because GAT is based on node-wise computation:
each layer is computed by looping over the vertices one by one, each attending only to its own neighbors
=> the Laplacian matrix is not needed, so edge direction is no longer a bottleneck.
Why is GAT suitable for inductive tasks?
Since the learnable parameters in GAT are $W$ and $a(\cdot)$, and the computation is node-wise,
=> it depends only weakly on the graph architecture.
=> A change in the graph architecture has little impact on GAT: we just need to update the neighbor sets $\mathcal{N}_i$ and recompute.
However, GCN's computation is tied to the parameters of the whole graph (through the graph Laplacian).
=> A single change in the graph structure means the Laplacian changes and the parameters must be completely retrained.