Introduction

In this blog post we introduce SIGN (Scalable Graph Inception Network), our new Graph Neural Network model which is able to scale to huge graphs.

Graph Neural Networks (GNNs) is a class of models which has emerged in recent years and which specializes in learning from graph-structured data. This blog post from Thomas Kipf provides a nice introduction to the topic.

GNNs have been extremely successful in modeling relational data in a variety of different domains: social networks, human-object interactions, computer graphics, particle physics, chemistry and medicine to mention a few.

However, until recently, most of the research in the field has focused on relatively small graphs, with Cora (∼ 5K nodes) still being among the most widely used. However, some of the most interesting possible industrial applications of GNNs deal with very large graphs, such as the Twitter of Facebook social networks, with billions of nodes.

Why is scaling Graph Neural Networks Challenging?

Let’s focus on the task of node-classification. Most of the popular GNN models, such as GCN, ChebNet, GAT, or MoNet, require to have the whole adjacency matrix (or some normalized version of it) of the graph, as well as all node features, in memory during training, in order to perform full-batch gradient descent. This is usually unfeasible for graphs with more than few hundred thousands nodes.

The easiest solution that comes to mind to solve the above problem would be to use mini-batch gradient descent instead of full batch. That is, to train on only a fraction of the nodes at each iteration. However, while this is easy to do when the data points are IID, the situation becomes more complex when the data points are related to each other in complex ways such in the case of graphs. While in classic deep learning you can sample any subset of data points to be your batch, with graph deep learning you need to make sure the subsampled nodes, which constitu a subgraph, still mantain a meaningful structure.

Sampling to Scale GNNs

The first work to tackle this problem was GraphSage, which introduced neighborhood sampling and mini-batch training as a way to train GNNs on large graphs. In particular,

While allowing to train GNNs on large graphs, GraphSage had the issue of introducing a lot of redundant computation. After GraphSage, many works focused on improving sampling of mini-batches, so that they could be more efficient and the models learnt faster.

The most recent works in this directions are ClusterGCN and GraphSaint. ClusterGCN clusters the graph, so that each batch consists of one (or more) cluster. This allows the nodes in the batch to share as many edges as possible, so that the left out edges are as few as possible. GraphSaint instead proposes the general framework of having a graph sampler, which at each batch samples a subgraph of the original graph on which we train.

It is also important to note that one of the advantages of sampling is that during training it acts as a sort of edge dropout, which regularizes the model and can help the performance \cite{dropedge}. However, edge dropout would require to still see all the edges at inference time, whereas it is not possible in this case, where still a subsample of the edges are seen at training time. Actually, the code for this methods do inference on the whole graph on CPU. However, this would not be possible for really huge graphs.

Do we really need sampling?

In our work we present an alternative to sampling. Our model has been inspired by the two following empirical findings:

  • Simple fixed aggregators (such as GCN \formula) have been shown to often outperform more complex ones, such as GAT or MPNN. Therefore, it looks like to obtain good results on most applications it is enough to have a fixed neighborhood aggregator
  • While deep learning success was built on models with a lot of layers, in graph deep learning it is still an open question whether many non-linear layers are needed. In particular, in the paper “Simplifying Graph Convolutional Networks”, it is shown that a model with a single layer (but multi-hop diffusion) can perform on par with models with multiple layers

If we combine this ingredient, that is, a fixed neighborhood aggregator and having a single layer, we can have a model which scales to huge graph while sidestepping graph sampling.

SIGN

Accordingly, we propose the following scalable architecture for node-wise classification:

where is a fixed graph diffusion matrix, , are learnable matrices, is the concatenation operation and , are non-linearities, the second one computing class probabilities.

Here we observe that the matrix products in the equation above do not depend on the learnable model parameters and can be easily precomputed. In particular, for very large graphs this pre-computation can be scaled efficiently using distributed computing infrastructures such as Apache Spark. This effectively reduces the computational complexity of the overall model to that of a multi-layer perceptron.

Moreover, by moving the diffusion to the precomputation step, we can aggregate information from all the neighbors, avoiding sampling and the possible loss of information that comes with it.

The main advantage of our model is it’s scalability. Since all the diffusion can be computed as a pre-processing, at training and inference time the model becomes an MLP. This means it can be trained using standard mini-batch gradient descent, and it is extremely fast! In particular, we found it to be between 1 and 2 orders of magnitude faster than ClusterGCN and GraphSaint at inference time, while also being much faster in converging at training time. All of this while mantaining accuracy performances very close to the state-of-the-art GraphSaint.

Conclusion and Future Work

Despite the limitation of allowing for a single graph convolutional layer, limiting the theoretical expressiveness of the model, SIGN has been shown to perform well in practice. Given it’s speed and simplicity of implementation, we envision SIGN to be a great solution to graph deep learning tasks in industry.