
**Accepted to CVPR2021 :tada:**

Official PyTorch code of **Unsupervised Hyperbolic Representation Learning via Message Passing Auto-Encoders**

**Jiwoong Park\*, Junho Cho\*, Hyung Jin Chang, Jin Young Choi** (\* indicates equal contribution)

Embeddings of the Cora dataset. GAE is Graph Auto-Encoders in Euclidean space; HGCAE is our method. P is the Poincaré ball, H is the hyperboloid.

This repository provides HGCAE code in PyTorch for reproducibility with

- PoincareBall manifold
- Link prediction and node clustering tasks on graph data
  - 6 datasets: Cora, Citeseer, Wiki, Pubmed, Blog Catalog, Amazon Photo
  - Amazon Photo was downloaded via the torch-geometric package.
- Image clustering task on images
  - 2 datasets: ImageNet10, ImageNetDog
  - Image features extracted from ImageNet10 and ImageNetDog with the PICA image clustering algorithm
  - A mutual k-NN graph built from the image features is provided.
- ImageNet-BNCR
  - We constructed a new dataset, ImageNet-BNCR (Balanced Number of Classes across Roots), by randomly choosing 3 leaf classes per root. We chose three roots: Artifacts, Natural objects, and Animals. Thus there are 9 leaf classes, and each leaf class contains 1,300 images.

We use docker to reproduce performance. Please refer to guide.md.

Before training, run our docker image:

`docker run --gpus all -it --rm --shm-size 100G -v $PWD:/workspace junhocho/hyperbolicgraphnn:8 bash`

If you want to cache the train/val edge splits and load them faster afterwards, run `mkdir ~/tmp` and then:

`docker run --gpus all -it --rm --shm-size 100G -v $PWD:/workspace -v ~/tmp:/root/tmp junhocho/hyperbolicgraphnn:8 bash`

In the docker session, run the training shell script for each dataset to reproduce the reported performance. For link prediction:

`sh script/train_cora_lp.sh`

`sh script/train_citeseer_lp.sh`

`sh script/train_wiki_lp.sh`

`sh script/train_pubmed_lp.sh`

`sh script/train_blogcatalog_lp.sh`

`sh script/train_amazonphoto_lp.sh`

| Dataset | ROC | AP |
|---|---|---|
| Cora | 0.94890703 | 0.94726805 |
| Citeseer | 0.96059407 | 0.96305937 |
| Wiki | 0.95510805 | 0.96200790 |
| Pubmed | 0.96207212 | 0.96083080 |
| Blog Catalog | 0.89683939 | 0.88651569 |
| Amazon Photo | 0.98240673 | 0.97655753 |

For node clustering:

`sh script/train_cora_nc.sh`

`sh script/train_citeseer_nc.sh`

`sh script/train_wiki_nc.sh`

`sh script/train_pubmed_nc.sh`

`sh script/train_blogcatalog_nc.sh`

`sh script/train_amazonphoto_nc.sh`

| Dataset | ACC | NMI | ARI |
|---|---|---|---|
| Cora | 0.74667651 | 0.57252940 | 0.55212928 |
| Citeseer | 0.69311692 | 0.42249294 | 0.44101404 |
| Wiki | 0.45945946 | 0.46777881 | 0.21517031 |
| Pubmed | 0.74849115 | 0.37759262 | 0.40770875 |
| Blog Catalog | 0.55061586 | 0.32557388 | 0.25227964 |
| Amazon Photo | 0.78130719 | 0.69623651 | 0.60342107 |

For image clustering:

`sh script/train_ImageNet10.sh`

`sh script/train_ImageNetDog.sh`

| Dataset | ACC | NMI | ARI |
|---|---|---|---|
| ImageNet10 | 0.85592308 | 0.79019131 | 0.74181220 |
| ImageNetDog | 0.38738462 | 0.36059650 | 0.22696503 |

- At least 11GB of VRAM is required to run on Pubmed, BlogCatalog, and Amazon Photo.
- We used only GTX 1080 Ti GPUs in our experiments.
- Other GPU architectures may not reproduce the above performance.

- `dataset` : Choose the dataset. Refer to the training scripts.
- `c` : Curvature of the hyperbolic space. Must be > 0. Preferably choose from 0.1, 0.5, 1, 2.
- `c_trainable` : 0 or 1. Train `c` if 1.
- `dropout` : Dropout ratio.
- `weight_decay` : Weight decay.
- `hidden_dim` : Hidden layer dimension. The same dimension is used in the encoder and decoder.
- `dim` : Embedding dimension.
- `lambda_rec` : Weight of the input reconstruction loss.
- `act` : Activation function: relu, elu, or tanh.
- `--manifold PoincareBall` : Use `Euclidean` to train Euclidean models.
- `--node-cluster 1` : If specified, perform the node clustering task; otherwise, link prediction.

This repo is inspired by hgcn, and some of the code was forked from the following repositories:

This work is licensed under the MIT License

```
@inproceedings{park2021unsupervised,
title={Unsupervised Hyperbolic Representation Learning via Message Passing Auto-Encoders},
author={Jiwoong Park and Junho Cho and Hyung Jin Chang and Jin Young Choi},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2021}
}
```

Author: junhocho

Download The Source Code : https://github.com/junhocho/HGCAE/archive/refs/heads/master.zip

GITHUB: https://github.com/junhocho/HGCAE


A typical feedforward neural network takes the features of each data point as input and outputs a prediction. The network is trained using the features and the label of each data point in the training set. This framework has proven very effective in applications such as face identification, handwriting recognition, and object detection, where no explicit relationships exist between data points. However, in some use cases, the prediction for a data point *v*(*i*) can be determined not only by its own features but also by the features of other data points *v*(*j*) when the relationship between *v*(*i*) and *v*(*j*) is given. For example, the topic of a journal paper (e.g., computer science, physics, or biology) can be inferred from the frequency of the words appearing in it. On the other hand, the references in a paper can also be informative when predicting its topic. In this example, we know not only the features of each individual data point (the word frequencies) but also the relationships between data points (citation relations). So how can we combine them to increase the accuracy of the prediction?

By applying a graph convolutional network (GCN), the features of each data point and of its connected data points are combined and fed into the neural network. Let's use the paper classification problem again as an example. In a citation graph (Fig. 1), each paper is represented by a vertex, and the edges between vertices represent citation relationships. For simplicity, the edges are treated as undirected. Each paper and its feature vector are denoted *v_i* and *x_i* respectively. Following the GCN model of Kipf and Welling [1], we can predict the topics of papers using a neural network with one hidden layer, in the following steps:

Figure 1. (Image by Author) The architecture of graph convolutional networks. Each vertex *v_i* represents a paper in the citation graph, and *x_i* is its feature vector. W(0) and W(1) are the weight matrices of the neural network. A, D, and I are the adjacency, degree, and identity matrices respectively. Horizontal and vertical propagation are highlighted in orange and blue respectively.

In the above workflow, steps 1 and 4 perform horizontal propagation, where the information of each vertex is propagated to its neighbors, while steps 2 and 5 perform vertical propagation, where the information on each layer is propagated to the next layer (see Fig. 1). For a GCN with multiple hidden layers, there are multiple iterations of horizontal and vertical propagation. It is worth noting that each time horizontal propagation is performed, the information of a vertex is propagated one hop further on the graph. In this example, horizontal propagation is performed twice (steps 1 and 4), so the prediction for each vertex depends not only on its own features but also on the features of all vertices within a 2-hop distance. Additionally, since the weight matrices W(0) and W(1) are shared by all vertices, the size of the neural network does not have to grow with the graph, which makes this approach scalable.
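To make the two rounds of propagation concrete, here is a minimal NumPy sketch of the Kipf and Welling propagation rule on a made-up three-paper citation graph (the adjacency, feature, and weight values are purely illustrative, not from any real dataset):

```python
import numpy as np

# Toy citation graph: paper 1 cites papers 0 and 2 (undirected).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = A + np.eye(3)                      # add self-loops: A + I
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalisation

X = np.array([[1.0, 0.0],                  # word-frequency features x_i
              [0.0, 1.0],
              [1.0, 1.0]])
rng = np.random.default_rng(0)
W0 = rng.normal(size=(2, 4))               # weights shared by all vertices
W1 = rng.normal(size=(4, 3))               # 3 output topics

# Each A_norm multiplication is one horizontal propagation (neighbour mixing);
# each weight multiplication is one vertical propagation (layer to layer).
H1 = np.maximum(A_norm @ X @ W0, 0)        # hidden layer with ReLU
logits = A_norm @ H1 @ W1                  # 2-hop receptive field after two rounds
print(logits.shape)                        # (3, 3): one topic-score row per paper
```

Note that the two multiplications by `A_norm` are exactly why each prediction depends on the 2-hop neighbourhood.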

#classification #machine-learning #graph-convolution-network #semi-supervised-learning #graph-database


**TL;DR:** *In this post, I discuss how to design local and computationally efficient provably powerful graph neural networks that are not based on the Weisfeiler-Lehman test hierarchy. This is the second in the series of posts on the expressivity of graph neural networks. See Part 1, describing the relation between graph neural networks and the Weisfeiler-Lehman graph isomorphism test. In Part 3, I will argue why we should abandon the graph isomorphism problem altogether.*

Recent groundbreaking papers [1–2] established the connection between graph neural networks and graph isomorphism tests, observing the analogy between the message passing mechanism and the Weisfeiler-Lehman (WL) test [3]. The WL test is a general name for a hierarchy of graph-theoretical polynomial-time iterative algorithms for determining graph isomorphism. The *k*-WL test recolours *k*-tuples of vertices of a graph at each step according to some neighbourhood aggregation rules and stops upon reaching a stable colouring. If the colour histograms of the two graphs are not the same, the graphs are deemed non-isomorphic; otherwise, the graphs are possibly (but not necessarily) isomorphic.
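To make the recolouring procedure concrete, here is a minimal sketch of 1-WL colour refinement (an illustrative implementation of the classical algorithm, not code from [3]); it returns the sorted colour-class sizes, so equal outputs mean "possibly isomorphic":

```python
def wl_refinement(adj):
    """1-WL colour refinement: repeatedly recolour each node by its own
    colour plus the multiset of its neighbours' colours, until stable.
    Returns the sorted sizes of the final colour classes."""
    colours = {v: 0 for v in adj}
    for _ in range(len(adj)):
        sigs = {v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
                for v in adj}
        # Canonicalise signatures to small integers
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new = {v: relabel[sigs[v]] for v in adj}
        if new == colours:
            break
        colours = new
    sizes = {}
    for c in colours.values():
        sizes[c] = sizes.get(c, 0) + 1
    return sorted(sizes.values())

# A 6-cycle and two disjoint triangles are not isomorphic, yet 1-WL
# assigns both the same colour histogram (all six nodes in one class):
c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_refinement(c6) == wl_refinement(triangles))  # True
```

Both graphs are 2-regular, so refinement never splits the initial colour class, which is exactly why 1-WL fails on such simple pairs.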

Message passing neural networks are at most as powerful as the 1-WL test (also known as node colour refinement), and are thus unable to distinguish even very simple instances of non-isomorphic graphs. For example, message passing neural networks cannot count triangles [4], a motif known to play an important role in social networks, where it is associated with the clustering coefficient indicative of how “tightly knit” the users are [5]. It is possible to design more expressive graph neural networks that replicate the increasingly powerful *k*-WL tests [2,6]. However, such architectures result in high complexity and a large number of parameters and, most importantly, typically require non-local operations that make them impractical.

Examples of non-isomorphic graphs that cannot be distinguished by 1-WL but can be distinguished by 3-WL due to its capability of counting triangles.

Thus, provably powerful graph neural networks based on the Weisfeiler-Lehman hierarchy are either not very powerful but practical, or powerful but impractical [7]. I argue that there is a different simple way to design efficient and provably powerful graph neural networks, which we proposed in a new paper with Giorgos Bouritsas and Fabrizio Frasca [8].

**Graph Substructure Networks.** The idea is actually very simple and conceptually similar to positional encoding or graphlet descriptors [9]: we make the message passing mechanism aware of the local graph structure, allowing for computing messages differently depending on the topological relationship between the endpoint nodes. This is done by passing to message passing functions additional structural descriptors associated with each node [10], which are constructed by subgraph isomorphism counting. In this way, we can partition the nodes of the graph into different equivalence classes reflecting topological characteristics that are shared both between nodes in each graph individually and across different graphs.

We call this architecture Graph Substructure Network (GSN). It has the same algorithmic design and memory and computational complexity as standard message passing neural networks, with an additional pre-computation step in which the structural descriptors are constructed. The choice of the substructures to count is crucial both to the expressive power of GSNs and the computational complexity of the pre-computation step.
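As an illustration of that pre-computation step, here is a sketch of one possible structural descriptor, per-node triangle counts (GSN supports general substructures; this toy counter is my own, not the authors' code):

```python
from itertools import combinations

def triangle_counts(n, edges):
    """For each of n nodes, count the triangles it participates in."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = [0] * n
    for u in range(n):
        # A triangle at u is a pair of u's neighbours that are adjacent.
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                counts[u] += 1
    return counts

# 4-cycle 0-1-2-3 with chord (0, 2): the chord creates two triangles.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(triangle_counts(4, edges))  # [2, 1, 2, 1]
```

Descriptors like these would then be attached to the node features before message passing, so that messages can depend on the local topology.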

The worst-case complexity of counting substructures of size *k* in a graph with *n* nodes is 𝒪(*nᵏ*), similar to the high-order graph neural network models of Morris [2] and Maron [6]. However, GSN has several advantages over these methods. First, for some types of substructures, such as paths and cycles, the counting can be done with significantly lower complexity. Secondly, the computationally expensive step is done only once, as preprocessing, and thus does not affect network training and inference, which remain linear, the same way as in message passing neural networks. The memory complexity in training and inference is linear as well. Thirdly and most importantly, the expressive power of GSN is different from that of the *k*-WL tests and in some cases is stronger.

**How powerful are GSNs?** Substructure counting endows GSN with more expressive power than standard message passing neural networks. First, it is important to clarify that the expressive power of GSN depends on the graph substructures used. In the same way that we have a hierarchy of *k*-WL tests, we may have different variants of GSN based on counting one or more structures. Using structures more complex than star graphs, GSN can be made strictly more powerful than 1-WL (or the equivalent 2-WL), and thus also more powerful than standard message passing architectures. With 4-cliques, GSN is at least not less powerful than 3-WL, as shown by the following example of strongly regular graphs on which GSN succeeds while 3-WL fails:

Example of non-isomorphic strongly regular graphs with 16 vertices and node degree 6, where every two adjacent vertices have 2 mutual neighbours, and every two non-adjacent vertices also have 2 mutual neighbours. The 3-WL test fails on this example, while GSN with 4-clique structure can distinguish between them. In the graph on the left (known as the Rook’s graph) each node participates in exactly one 4-clique. The graph on the right (Shrikhande graph) has maximum cliques of size 3 (triangles). Figure from [8].

More generally speaking, for various substructures of 𝒪(1) size, as long as they cannot be counted by 3-WL, there exist graphs where GSN succeeds and 3-WL fails [11]. While we could not find examples to the contrary, they might in principle exist — that is why our statement about the power of GSN is of a weak form, “at least not less powerful”.

This holds for larger *k* as well; a generalisation of strongly regular graphs in the above figure, called *k*-*isoregular*, are instances on which the (*k*+1)-WL test fails [12]. These examples can also be distinguished by GSN with appropriate structures. The expressive power of GSNs can thus be captured by the following figure:

GSN is outside the Weisfeiler-Lehman hierarchy. With the appropriate structure (e.g. cliques or cycles of certain size), it is likely to be made at least not less powerful than k-WL.

How powerful can GSN be in principle? This is still an open question. The Graph Reconstruction Conjecture [13] postulates the possibility of recovering a graph from all its node-deleted substructures. Thus, if the Reconstruction Conjecture is correct, a GSN with substructures of size _n_−1 would be able to correctly test the isomorphism of any graphs. However, the Reconstruction Conjecture has so far been proven only for graphs of size _n_≤11 [14], and such large structures would in any case be impractical.

The more interesting question is whether a similar result exists for “small” structures (of 𝒪(1) size, independent of the number of nodes *n*). Our empirical results show that GSN with small substructures such as cycles, paths, and cliques works on strongly regular graphs, which are known to be a tough nut for the Weisfeiler-Lehman tests.

#geometric-deep-learning #deep-learning #graph-neural-networks #graph-theory #machine-learning


**TL;DR** *This is the first in a [series of posts] where I will discuss the evolution and future trends in the field of deep learning on graphs.*

Deep learning on graphs, also known as Geometric deep learning (GDL) [1], Graph representation learning (GRL), or relational inductive biases [2], has recently become one of the hottest topics in machine learning. While early works on graph learning go back at least a decade [3] if not two [4], it is undoubtedly the past few years’ progress that has taken these methods from a niche into the spotlight of the ML community and even to the popular science press (with *Quanta Magazine* running a series of excellent articles on geometric deep learning for the study of manifolds, drug discovery, and protein science).

Graphs are powerful mathematical abstractions that can describe complex systems of relations and interactions in fields ranging from biology and high-energy physics to social science and economics. Since the amount of graph-structured data produced in some of these fields nowadays is enormous (prominent examples being social networks like Twitter and Facebook), it is very tempting to try to apply deep learning techniques that have been remarkably successful in other data-rich settings.

There are multiple flavours to graph learning problems that are largely application-dependent. One dichotomy is between *node-wise* and *graph-wise* problems, where in the former one tries to predict properties of individual nodes in the graph (e.g. identify malicious users in a social network), while in the latter one tries to make a prediction about the entire graph (e.g. predict solubility of a molecule). Furthermore, like in traditional ML problems, we can distinguish between *supervised* and *unsupervised* (or *self-supervised*) settings, as well as *transductive* and *inductive* problems.

Similarly to convolutional neural networks used in image analysis and computer vision, the key to efficient learning on graphs is designing local operations with shared weights that do message passing [5] between every node and its neighbours. A major difference compared to classical deep neural networks dealing with grid-structured data is that on graphs such operations are *permutation-invariant*, i.e. independent of the order of neighbour nodes, as there is usually no canonical way of ordering them.
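Such a permutation-invariant local operation can be sketched in a few lines (a generic sum-aggregation step over plain Python lists, assumed for illustration, not any particular library's API):

```python
def message_passing_layer(X, adj, W):
    """One permutation-invariant message passing step: each node sums its
    neighbours' feature vectors (order-independent), then a weight matrix
    shared by all nodes is applied, followed by ReLU.
    X: list of feature vectors; adj: dict of neighbour lists; W: d_in x d_out."""
    d_in, d_out = len(W), len(W[0])
    out = []
    for i in range(len(X)):
        # Sum-aggregate neighbour features: the order of adj[i] does not matter
        agg = [sum(X[j][k] for j in adj[i]) for k in range(d_in)]
        # Shared linear transform + ReLU nonlinearity
        out.append([max(0.0, sum(agg[k] * W[k][c] for k in range(d_in)))
                    for c in range(d_out)])
    return out
```

Because the aggregation is a sum, reordering each neighbour list leaves the output unchanged, which is exactly the permutation invariance described above.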

Despite their promise and a series of success stories of graph representation learning (among which I can selfishly list the [acquisition by Twitter] of the graph-based fake news detection startup Fabula AI I have founded together with my students), we have not witnessed so far anything close to the smashing success convolutional networks have had in computer vision. In the following, I will try to outline my views on the possible reasons and how the field could progress in the next few years.

**Standardised benchmarks** like ImageNet were surely one of the key success factors of deep learning in computer vision, with some [6] even arguing that data was more important than algorithms for the deep learning revolution. We have nothing similar to ImageNet in scale and complexity in the graph learning community yet. The [Open Graph Benchmark] launched in 2019 is perhaps the first attempt toward this goal, introducing challenging graph learning tasks on interesting real-world graph-structured datasets. One of the hurdles is that tech companies producing diverse and rich graphs from their users’ activity are reluctant to share these data due to concerns over privacy laws such as GDPR. A notable exception is Twitter, which made a dataset of 160 million tweets with corresponding user engagement graphs available to the research community under certain privacy-preserving restrictions as part of the [RecSys Challenge]. I hope that many companies will follow suit in the future.

**Software libraries** available in the public domain played a paramount role in “democratising” deep learning and making it a popular tool. While until recently graph learning implementations were primarily a collection of poorly written and scarcely tested code, nowadays there are libraries such as [PyTorch Geometric] or [Deep Graph Library (DGL)] that are professionally written and maintained with the help of industry sponsorship. It is not uncommon to see an implementation of a new graph deep learning architecture weeks after it appears on arXiv.

**Scalability** is one of the key factors limiting industrial applications that often need to deal with very large graphs (think of Twitter social network with hundreds of millions of nodes and billions of edges) and low latency constraints. The academic research community has until recently almost ignored this aspect, with many models described in the literature completely inadequate for large-scale settings. Furthermore, graphics hardware (GPU), whose happy marriage with classical deep learning architectures was one of the primary forces driving their mutual success, is not necessarily the best fit for graph-structured data. In the long run, we might need specialised hardware for graphs [7].

**Dynamic graphs** are another aspect that is scarcely addressed in the literature. While graphs are a common way of modelling complex systems, such an abstraction is often too simplistic, as real-world systems are dynamic and evolve in time. Sometimes it is the temporal behaviour that provides crucial insights about the system. Despite some recent progress, designing graph neural network models capable of efficiently dealing with continuous-time graphs represented as a stream of node- or edge-wise events is still an open research question.

#deep-learning #representation-learning #network-science #graph-neural-networks #geometric-deep-learning


As 2020 comes to an end, let’s see it off in style: our journey in the world of Graph Analytics, Graph Databases, Knowledge Graphs and Graph AI culminates here.

The representation of the relationships among data, information, knowledge and, ultimately, wisdom, known as the data pyramid, has long been part of the language of information science. Digital transformation has made this relevant beyond the confines of information science. COVID-19 has brought years’ worth of digital transformation in just a few short months.

In this new knowledge-based digital world, encoding and making use of business and operational knowledge is the key to making progress and staying competitive. So how do we go from data to information, and from information to knowledge? This is the key question Knowledge Connexions aims to address.

Graphs in all shapes and forms are a key part of this.

Knowledge Connexions is a visionary event featuring a rich array of technological building blocks to support the transition to a knowledge-based economy: Connecting data, people and ideas, building a global knowledge ecosystem.

The Year of the Graph will be there, in the workshop “From databases to platforms: the evolution of Graph databases”. George Anadiotis, Alan Morrison, Steve Sarsfield, Juan Sequeda and Steven Xi bring many years of expertise in the domain, and will analyze Graph Databases from all possible angles.

This is the first step in the relaunch of the Year of the Graph Database Report. Year of the Graph Newsletter subscribers just got a 25% discount code. To be always in the know, subscribe to the newsletter, and follow the newly launched Year of the Graph account on Twitter! In addition to getting the famous YotG news stream every day, you will also get a 25% discount code.

#database #machine learning #artificial intelligence #data science #graph databases #graph algorithms #graph analytics #emerging technologies #knowledge graphs #semantic technologies


**TL;DR:** A graph representation in terms of an adjacency matrix does not capture the spatial correlation needed for convolution. This post builds intuition on how the Fourier domain helps with convolution.

Pre-requisite for Graph Representation

Before we dive deep into graph convolutions, let us understand what convolution does in the spatial domain. The standard go-to answer is to say that “it captures latent spatial features”. This tells us what convolution does, but we still need to address the why of convolution.

Let’s take the standard MNIST data with 28×28 images.

MNIST Data Samples

In the case of an MLP, we flatten each image (a 28×28 matrix) into a single row vector of dimension 784 (28×28) and pass that as input. Say we want the hidden layer to produce a 256-dimensional embedding; the corresponding weight matrix is then 784 × 256.

Pictorial Representation

In convolution, we define a kernel (or mask 😢) that slides over the image to get _“spatial features”_ 😏

One may ask: what’s new in this? It still captures the spatial relationships in the data. The answer is to look at the number of parameters. In the case of an MLP, since we treat each pixel as independent of every other pixel, we have 784 weights feeding each hidden unit (with a 256-dimensional embedding, the weight matrix is 784 × 256).

In the case of a 2-D convolution (of a greyscale image), our input is 28×28 and the kernel is, say, 3×3. The total number of trainable parameters for a single channel is then 9 (3×3) plus 1 (an optional bias). If we choose to have 256 feature maps, we get 256 maps × 10 = 2,560 trainable weights, compared to 200,704 weights in the case of the MLP.
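The parameter arithmetic above can be checked directly (28×28 input, 256 hidden units / feature maps, and a 3×3 kernel, as assumed in the text):

```python
# MLP: every pixel of the flattened 28x28 image connects to each of the
# 256 hidden units, so the weight matrix alone holds 784 x 256 entries.
mlp_weights = 28 * 28 * 256
print(mlp_weights)        # 200704

# Convolution: one 3x3 kernel (9 weights) plus an optional bias is shared
# across the whole image; with 256 feature maps that is 256 x 10 parameters.
conv_weights = (3 * 3 + 1) * 256
print(conv_weights)       # 2560
```

The roughly 78× reduction comes entirely from weight sharing across spatial positions.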

The key idea of convolution is to capitalise on image locality: in the local view (mask region) of the kernel, the pixels in the neighbourhood are similar, with very few jumps/edges. This is why convolution applied to images makes sense.

Let’s look at a case where it does not. Suppose we need to build a neural network to decide whether to approve a bank loan. Our input is a 5-dimensional vector with the following features:

Features — [Age, Income, Gender, Marital Status, Asset-Worth]

In this case, convolution makes no sense, because there is no shared information across the feature space: each dimension of the feature vector is *“independent”*.

#artificial-intelligence #machine-learning #graph #deep-learning