1
Introduction
The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to be beyond reach, such as computer vision, playing Go, or protein folding, are in fact feasible with appropriate data and computational scale. Remarkably, the essence of deep learning is built from two simple algorithmic principles: first, the notion of representation or feature learning, whereby adapted, often hierarchical, features capture the appropriate notion of regularity for each task, and second, learning by gradient descent-type optimisation, typically implemented as backpropagation.
While learning generic functions in high dimensions is a cursed estimation problem, most tasks of interest are not generic, and come with essential predefined regularities arising from the underlying low-dimensionality and structure of the physical world. This book is concerned with exposing these regularities through unified geometric principles that can be applied throughout a wide spectrum of applications.
Exploiting the known symmetries of a large system is a powerful and classical remedy against the curse of dimensionality, and forms the basis of most physical theories. Deep learning systems are no exception, and since the early days researchers have adapted neural networks to exploit the low-dimensional geometry arising from physical measurements, e.g. grids in images, sequences in time-series, or position and momentum in molecules, and their associated symmetries, such as translation or rotation. Throughout our exposition, we will describe these models, as well as many others, as natural instances of the same underlying principle of geometric regularity.
Geometric Deep Learning is a ‘geometric unification’ endeavour in the spirit of Klein’s Erlangen Programme that serves a dual purpose. On one hand, it provides a common mathematical framework to study the classical successful neural network architectures, such as CNNs, RNNs, GNNs, and Transformers. We will show that all of the above can be obtained by the choice of a geometric domain, its symmetry group, and appropriately constructed invariant and equivariant neural network layers: what we refer to as the ‘Geometric Deep Learning blueprint’. We will exemplify instances of this blueprint on the ‘5G of Geometric Deep Learning’: Graphs, Grids, Groups, Geometric Graphs, and Gauges. On the other hand, Geometric Deep Learning gives a constructive procedure to incorporate prior physical knowledge into neural networks and provides a principled way to build new architectures. With this premise, we shall now explore how the ideas central to Geometric Deep Learning have crystallised through time.
1.1 On the Shoulders of Giants
“Symmetry, as wide or as narrow as you may define its meaning, is one idea by which man through the ages has tried to comprehend and create order, beauty, and perfection.” This somewhat poetic definition of symmetry is given in the eponymous book of the great mathematician Hermann Weyl (1952), his Schwanengesang on the eve of retirement from the Institute for Advanced Study in Princeton. Weyl traces the special place symmetry has occupied in science and art to ancient times, from Sumerian symmetric designs to the Pythagoreans, who believed the circle to be perfect due to its rotational symmetry. Plato considered the five regular polyhedra bearing his name today so fundamental that they must be the basic building blocks shaping the material world.
Yet, though Plato is credited with coining the term συμμετρία, which literally translates as ‘same measure’, he used it only vaguely to convey the beauty of proportion in art and harmony in music. It was the astronomer and mathematician Johannes Kepler (1611) who attempted the first rigorous analysis of the symmetric shape of water crystals. In his treatise ‘On the Six-Cornered Snowflake’, he attributed the six-fold dihedral structure of snowflakes to the hexagonal packing of particles, an idea that, though it preceded a clear understanding of how matter is formed, still holds today as the basis of crystallography (Ball 2011).
Plato (ca. 370 BC); Johannes Kepler (1571–1630)
Figure 1.1
Plato believed that symmetric polyhedra (“Platonic solids”) were the fundamental building blocks of nature. Johannes Kepler attributed for the first time the six-fold symmetry of water crystals to the hexagonal packing of particles, antedating modern crystallography.
In modern mathematics, symmetry is almost univocally expressed in the language of group theory. The origins of this theory are attributed to Évariste Galois, who coined the term and used it to study the solvability of polynomial equations in the 1830s. Two other names associated with group theory are those of Sophus Lie and Felix Klein, who met and worked fruitfully together for a period of time (Tobies 2019). The former would develop the theory of continuous symmetries that today bears his name (Lie groups); the latter proclaimed group theory to be the organising principle of geometry in his Erlangen Programme, which we already mentioned in the Preface. Given that Klein’s Programme is the inspiration for our book, it is worthwhile to spend more time on its historical context and revolutionary impact.
1.1.1 A Strange New Universe out of Nothing
The foundations of modern geometry were formalised in ancient Greece nearly 2300 years ago by Euclid in a treatise named the Elements (Στοιχεῖα). Euclidean geometry (which is still taught at school as ‘the geometry’) was a set of results built upon five intuitive axioms or postulates. The Fifth Postulate, stating that it is possible to pass only one line parallel to a given line through a point outside it, appeared less obvious, and an illustrious row of mathematicians broke their teeth trying to prove it since antiquity, to no avail. An early approach to the problem of the parallels appears in the eleventh-century Persian treatise A commentary on the difficulties concerning the postulates of Euclid’s Elements by Omar Khayyam. The eighteenth-century Italian Jesuit priest Giovanni Saccheri (1733) was likely aware of this previous work, judging by the title of his own work Euclides ab omni nævo vindicatus (‘Euclid cleared of every stain’, see Figure 1.2). Like Khayyam, he considered the summit angles of a quadrilateral with sides perpendicular to the base. The conclusion that acute angles lead to infinitely many non-intersecting lines that can be passed through a point not on a straight line seemed so counter-intuitive that he rejected it as ‘repugnatis naturæ linæ rectæ’ (‘repugnant to the nature of straight lines’).
Euclid (ca. 300 BC); Omar Khayyam (1048–1131)
Figure 1.2
The “Father of Geometry”, Euclid, laid the foundations of modern geometry in the Elements. Omar Khayyam made early attempts to prove Euclid’s fifth postulate, which were also referenced in Saccheri’s work (Euclides ab omni nævo vindicatus).
The nineteenth century brought the realisation that the Fifth Postulate is not essential and that one can construct alternative geometries based on different notions of parallelism. One such early example is projective geometry, arising, as the name suggests, in perspective drawing and architecture. In this geometry, points and lines are interchangeable, and there are no parallel lines in the usual sense: any two lines meet in a ‘point at infinity’. While results in projective geometry have been known since antiquity, it was first systematically studied by Jean-Victor Poncelet (1822); see Figure 1.3.
Gérard Desargues (1591–1661); Jean-Victor Poncelet (1788–1867)
Figure 1.3
Based on the prior work of Gérard Desargues, Jean-Victor Poncelet revived the interest in projective geometry, one of the earliest examples of geometries not requiring the parallel postulate.
The credit for the first construction of a true non-Euclidean geometry is disputed. The princeps mathematicorum Carl Friedrich Gauss worked on it around 1813 but never published any results. The first publication on the subject of non-Euclidean geometry was ‘On the Origins of Geometry’ by the Russian mathematician Nikolai Lobachevsky (1829), as depicted in Figure 1.4. In this work, he considered the Fifth Postulate an arbitrary limitation and proposed an alternative one: that more than one line parallel to a given line can pass through a point outside it. Such a construction requires a space with negative curvature, what we now call a hyperbolic space, a notion that was still not fully mastered at that time. Lobachevsky’s idea appeared heretical, and he was openly derided by colleagues. A similar construction was independently discovered by the Hungarian János Bolyai, who published it in 1832 under the name ‘absolute geometry’. In an earlier letter to his father dated 1823, he wrote enthusiastically about this new development: ‘I have discovered such wonderful things that I was amazed... out of nothing I have created a strange new universe.’
Carl Friedrich Gauss (1777–1855); János Bolyai (1802–1860); Nikolai Lobachevsky (1792–1856); Bernhard Riemann (1826–1866)
Figure 1.4
Several mathematicians played pivotal roles in proposing early variants of non-Euclidean geometries in the nineteenth century. Gauss reportedly worked on the topic without publishing any material on it. The first publication, focussed on hyperbolic geometry, came from Lobachevsky (On the Origins of Geometry, depicted here), with Bolyai having worked on it concurrently. Riemann introduced many such geometries later on, one of which was elliptic geometry.
In the meantime, new geometries continued to emerge as if from a cornucopia. August Möbius (1827), of the eponymous surface fame, studied affine geometry. Gauss’ student Bernhard Riemann introduced a very broad class of geometries, called today Riemannian in his honour, in his habilitation lecture, subsequently published under the title Über die Hypothesen, welche der Geometrie zu Grunde liegen (‘On the Hypotheses on which Geometry is Based’, 1854). A special case of Riemannian geometry is the ‘elliptic’ geometry of the sphere, another construction violating Euclid’s Fifth Postulate, as there is no point on the sphere through which a line can be drawn that never intersects a given line. Towards the second half of the nineteenth century, Euclid’s monopoly over geometry was completely shattered. New types of geometry (Euclidean, affine, projective, hyperbolic, spherical) emerged and became independent fields of study. However, the relationships of these geometries and their hierarchy were not understood.
It was in this exciting but messy situation that Felix Klein came forth, with a genius insight to use group theory as an algebraic abstraction of symmetry to organise the ‘geometric zoo’. It appeared that Euclidean geometry is a special case of affine geometry, which is in turn a special case of projective geometry (or, in terms of group theory, the Euclidean group is a subgroup of the projective group). Klein, and independently the Italian geometer Eugenio Beltrami, further showed that constant-curvature non-Euclidean geometries (i.e., the hyperbolic geometry of Lobachevsky and Bolyai and the spherical geometry of Riemann) could be obtained as special cases of projective geometry; also see Figure 1.5. More general Riemannian geometry with non-constant curvature was not included in Klein’s unified geometric picture, and it took another fifty years before it was integrated, largely thanks to the work of Élie Cartan in the 1920s.
Felix Klein (1849–1925); Eugenio Beltrami (1835–1900)
Figure 1.5
Felix Klein’s Erlangen Programme offered a way to organise and categorise existing geometries according to their symmetries. Additionally, Klein and Beltrami independently proved that non-Euclidean geometries with constant curvature are special cases of projective geometry.
Klein’s Erlangen Programme has had a profound methodological and cultural impact on geometry and mathematics in general. It was in a sense the ‘second algebraisation’ of geometry (the first one being the analytic geometry of René Descartes and the method of coordinates bearing his latinised name Cartesius) that made it possible to produce results unattainable by previous methods. Category Theory, abstracting the relations between objects and now pervasive in pure mathematics, can be “regarded as a continuation of the Klein Erlangen Programme, in the sense that a geometrical space with its group of transformations is generalized to a category with its algebra of mappings”, in the words of its creators Samuel Eilenberg and Saunders Mac Lane; see Marquis (2009) and Figure 1.6.
Élie Cartan (1869–1951); Samuel Eilenberg (1913–1998); Saunders Mac Lane (1909–2005)
Figure 1.6
While the Erlangen Programme outlined a path towards unifying all geometries under a common lens, it was not until the work of Élie Cartan several decades later that this unification was entirely formalised. Additionally, the Erlangen Programme inspired the creation of entirely new fields of mathematics, as evidenced by ‘General Theory of Natural Equivalences’, the original text on Category Theory by Eilenberg and Mac Lane.
1.1.2 The Theory of Everything
Considering projective geometry the most general of all, Klein complained ‘how persistently the mathematical physicist disregards the advantages afforded him in many cases by only a moderate cultivation of the projective view’ (Klein 1872). His advocacy of exploiting geometry and the principles of symmetry in physics foretold the following century, which was truly revolutionary for the field. In Göttingen, Klein’s colleague Emmy Noether (1918) proved that every differentiable symmetry of the action of a physical system has a corresponding conservation law. It was by all means a stunning result: beforehand, meticulous experimental observation was required to discover fundamental laws such as the conservation of energy, and even then, it was an empirical result not stemming from anywhere. Noether’s Theorem, “a guiding star to 20th and 21st century physics” in the words of the Nobel laureate Frank Wilczek, allowed one to show, for example, that the conservation of energy emerges from the translational symmetry of time, a rather intuitive idea that the results of an experiment should not depend on whether it is conducted today or tomorrow. The first page of this landmark result is depicted in Figure 1.7.
Another symmetry associated with charge conservation, the global gauge invariance of the electromagnetic field, first appeared in Maxwell’s formulation of electrodynamics (Maxwell 1865); however, its importance initially remained unnoticed. The same Hermann Weyl who wrote so dithyrambically about symmetry is the one who first introduced the concept of gauge invariance in physics in the early 20th century, emphasizing its role as a principle from which electromagnetism can be derived. It took several decades until this fundamental principle, in its generalised form developed by Yang and Mills (1954), proved successful in providing a unified framework to describe the quantum-mechanical behaviour of electromagnetism and the weak and strong forces, finally culminating in the Standard Model that captures all the fundamental forces of nature but gravity. As succinctly put by another Nobel-winning physicist, Philip Anderson (1972), “it is only slightly overstating the case to say that physics is the study of symmetry.”
Emmy Noether (1882–1935); Hermann Weyl (1885–1955); Chen-Ning Yang (b. 1922); Robert Mills (1927–1999)
Figure 1.7
The Erlangen Programme had significant spillover effects in physics. The landmark result was Noether’s theorem, which specifies how conservation laws arise directly from symmetry constraints. Later on, gauge symmetries, first introduced by Weyl and then developed by Yang and Mills, proved an effective abstraction for discovering the presently best-known model of the physical world: the Standard Model.
1.2 Towards Geometric Deep Learning
An impatient reader might wonder at this point, what does all this excursion
into the history of geometry and physics, however exciting it might be, have
to do with deep learning? As we will see, the geometric notions of symmetry
and invariance have been recognised as crucial even in early attempts to do
‘pattern recognition, and it is fair to say that geometry has accompanied the
nascent field of artificial intelligence from its very beginning. While it is hard
to agree on a specific point in time when ‘artificial intelligence’ was born as
a scientific field (at the end, humans have been obsessed with comprehending
intelligence and learning from the dawn of civilisation), and even the history
of deep learning is disputed, we will try a less risky task of looking at the
precursors of geometric deep learning the main topic of our book. This
history can be packed into less than a century.
1.2.1 Early Neural Networks and the AI Winter
By the 1930s, it had become clear that the mind resides in the brain, and research efforts turned to explaining brain functions such as memory, perception, and reasoning in terms of brain network structures. McCulloch and Pitts (1943) are credited with the first mathematical abstraction of the neuron, showing its capability to compute logical functions. Just a year after the legendary Dartmouth College workshop that coined the very term ‘artificial intelligence’, the American psychologist Frank Rosenblatt (1957) from the Cornell Aeronautical Laboratory proposed a class of neural networks he called ‘perceptrons’, cf. Figure 1.8. Perceptrons, first implemented on a digital machine and then in dedicated hardware, managed to solve simple pattern recognition problems such as classification of geometric shapes. However, the quick rise of ‘connectionism’ (how AI researchers working on artificial neural networks labeled themselves) received a bucket of cold water in the form of the now infamous book Perceptrons by Minsky and Papert (1969); the neural network model they used is depicted in Figure 1.9.
In the deep learning community, it is common to retrospectively blame Minsky and Papert for the onset of the first AI Winter, which made neural networks fall out of fashion for over a decade. A typical narrative mentions the ‘XOR Affair’, a proof that perceptrons were unable to learn even very simple logical functions, as evidence of their poor expressive power. Some sources even add a pinch of drama, recalling that Rosenblatt and Minsky went to the same school, and even alleging that Rosenblatt’s premature death in a boating accident in 1971 was a suicide in the aftermath of the criticism of his work by colleagues.
[Diagram of the perceptron: retinal units, sensory units (S-units), association units (A-units), and response units (R-units).] Frank Rosenblatt (1928–1971)
Figure 1.8
The Perceptron proposed by Frank Rosenblatt (1957) was one of the simplest neural network architectures.
The reality is probably more mundane and more nuanced at the same time. First, a far more plausible reason for the ‘AI Winter’ in the USA is the 1969 Mansfield Amendment, which required the military to fund “mission-oriented direct research” rather than basic undirected research. Since many efforts in artificial intelligence at the time, including Rosenblatt’s research, were funded by military agencies and did not show immediate utility, the cut in funding had dramatic effects. Second, neural networks and artificial intelligence in general were over-hyped: it is enough to recall a 1958 New Yorker article calling perceptrons a “first serious rival to the human brain ever devised” and “remarkable machines” that were “capable of what amounts to thought”, or the overconfident MIT Summer Vision Project expecting the “construction of a significant part of a visual system” and the ability to perform “pattern recognition” to be achieved during one summer term of 1966. A realisation by the research community that initial hopes to ‘solve intelligence’ had been overly optimistic was just a matter of time.
If one, however, looks into the substance of the dispute, it is apparent that what Rosenblatt called a ‘perceptron’ is rather different from what Minsky and Papert understood under this term. Minsky and Papert focused their analysis and criticism on a narrow class of single-layer neural networks they called ‘simple perceptrons’ (what is typically associated with this term in modern times, see Figure 1.8) that compute a weighted linear combination of the inputs followed by a nonlinear function. On the other hand, Rosenblatt considered a broader class of architectures that antedated many ideas of what would now be considered ‘modern’ deep learning, including multi-layered networks with random and local connectivity. Rosenblatt could probably have rebutted some of the criticism concerning the expressive power of perceptrons had he known about the proof of the Thirteenth Hilbert Problem by Vladimir Arnold (1956) and Andrey Kolmogorov (1957), establishing that a continuous multivariate function can be written as a superposition of continuous functions of a single variable. The Arnold–Kolmogorov theorem was a precursor of a subsequent class of results known as ‘universal approximation theorems’ for multi-layer (or ‘deep’) neural networks that put these concerns to rest.
Marvin Minsky (1927–2016); Seymour Papert (1928–2016)
Figure 1.9
The influential book Perceptrons by Minsky and Papert (1969) considered simple single-layer neural networks, depicted here. It was perhaps the earliest geometric approach to learning, including the introduction of group invariance.
While most remember the book of Minsky and Papert for the role it played in cutting the wings of the early-day connectionists and lament the lost opportunities, an important overlooked aspect is that it for the first time presented a geometric analysis of learning problems. This fact is reflected in the very name of the book, subtitled An Introduction to Computational Geometry. At the time it was a radically new idea, and Block (1970), in his critical review of the book (which essentially stood in defense of Rosenblatt), wondered whether “the new subject of ‘Computational Geometry’” would “grow into an active field of mathematics, or will it peter out in a miscellany of dead ends?” (the former happened: computational geometry is now a well-established field).
Furthermore, Minsky and Papert probably deserve the credit for the first introduction of group theory into the realm of machine learning: their Group Invariance theorem stated that if a neural network is invariant to some group, then its output could be expressed as functions of the orbits of the group (we will define these terms in subsequent Chapters). While they used this result to prove limitations of what a perceptron could learn, similar approaches were subsequently used by Amari (1978) for the construction of invariant features in pattern recognition problems. An evolution of these ideas in the works of Sejnowski, Kienker, and Hinton (1986) and Shawe-Taylor (1989, 1993), unfortunately rarely cited today, provided the foundations of the geometric learning blueprint described in this book.
1.2.2 Universal Approximation and the Curse of Dimensionality
The aforementioned notion of universal approximation deserves further discussion. The term refers to the ability to approximate any continuous multivariate function to any desired accuracy; in the machine learning literature, this type of result is usually credited to Cybenko (1989) and Hornik (1991). Figure 1.10 depicts the pioneering researchers who worked on universal approximation results.
David Hilbert (1862–1943); Andrey Kolmogorov (1903–1987); Vladimir Arnold (1937–2010); George Cybenko; Kurt Hornik
Figure 1.10
David Hilbert’s Thirteenth Problem, resolved by Andrey Kolmogorov and Vladimir Arnold, led to one of the first results showing that a multivariate continuous function can be expressed as a composition and sum of simple one-dimensional functions. George Cybenko and Kurt Hornik proved results specific to neural networks, showing that a perceptron with one hidden layer can approximate any continuous function to any desired accuracy.
Unlike the ‘simple’ (single-layer) perceptrons criticised by Minsky and Papert (1969), multilayer neural networks are universal approximators and thus are an appealing choice of architecture for learning problems. We can think of supervised machine learning as a function approximation problem: given the outputs of some unknown function (e.g., an image classifier) on a training set (e.g., images of cats and dogs), we try to find a function from some hypothesis class that fits the training data well and allows us to predict the outputs on previously unseen inputs (‘generalisation’).
Universal approximation guarantees that we can express functions from a very broad regularity class (continuous functions) by means of a multi-layer neural network. In other words, there exists a neural network with a certain number of neurons and certain weights that approximates a given function mapping from the input to the output space (e.g., from the space of images to the space of labels). However, universal approximation theorems do not tell us how to find such weights. In fact, learning (i.e., finding weights) in neural networks was a big challenge in the early days. Rosenblatt showed a learning algorithm only for a single-layer perceptron; to train multi-layer neural networks, Ivakhnenko and Lapa (1966) used a layer-wise learning algorithm called the ‘group method of data handling’. This allowed Ivakhnenko (1971) to go as deep as eight layers, a remarkable feat for the early 1970s!
A breakthrough came with the invention of backpropagation, an algorithm that uses the chain rule to compute the gradient of the loss function with respect to the weights, allowing gradient descent-based optimisation techniques to be used to train neural networks. As of today, this is the standard approach in deep learning. While the origins of backpropagation date back to at least the 1960s, the first convincing demonstration of this method in neural networks was in the widely cited Nature paper of Rumelhart, Hinton, and Williams (1986). The introduction of this simple and efficient learning method has been a key contributing factor to the return of neural networks to the AI scene in the 1980s and 1990s. The key contributors in this space are shown in Figure 1.11.
Alexey Ivakhnenko (1913–2007); Seppo Linnainmaa (b. 1945); Paul Werbos (b. 1947); David Rumelhart (1942–2011)
Figure 1.11
Four of the most important figures in the development of modern methods of learning via gradient descent. Ivakhnenko derived a layer-wise learning algorithm which allowed for training an eight-layer neural network in 1971. The backpropagation algorithm was initially described by Linnainmaa, and first deployed in deep neural networks by Werbos. Rumelhart’s work offered the first convincing demonstration of the method in deep neural networks, which propelled the use of backpropagation in future works.
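To make the training procedure concrete, here is a minimal sketch (in NumPy, with toy sizes and an assumed target function, not an example from the text) of a two-layer perceptron trained by backpropagation: the chain rule yields the gradient of a squared loss with respect to each weight matrix, and a gradient descent step follows.

```python
import numpy as np

# Minimal sketch: a two-layer perceptron fitted to a toy 1D function by
# backpropagation (chain rule) and gradient descent. Sizes, learning rate, and
# the target function are illustrative assumptions.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(256, 1))          # training inputs
y = np.sin(3 * x)                              # samples of an 'unknown' target

W1 = rng.normal(0, 0.5, size=(1, 32)); b1 = np.zeros(32)   # hidden layer
W2 = rng.normal(0, 0.5, size=(32, 1)); b2 = np.zeros(1)    # output layer
lr = 0.1

for step in range(5000):
    # forward pass
    h = np.tanh(x @ W1 + b1)                   # hidden activations
    y_hat = h @ W2 + b2                        # network output
    loss = np.mean((y_hat - y) ** 2)           # squared loss

    # backward pass: chain rule, layer by layer
    d_yhat = 2 * (y_hat - y) / len(x)          # dL/dy_hat
    dW2 = h.T @ d_yhat; db2 = d_yhat.sum(0)    # gradients of the output layer
    d_h = (d_yhat @ W2.T) * (1 - h ** 2)       # propagate through tanh
    dW1 = x.T @ d_h;   db1 = d_h.sum(0)        # gradients of the hidden layer

    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final training loss: {loss:.4f}")
```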
Looking at neural networks through the lens of approximation theory has led some cynics to call deep learning a “glorified curve fitting.” We will let the reader judge how true this maxim is by trying to answer the following important question: how many samples (training examples) are needed to accurately approximate a function? Approximation theorists will immediately retort that the class of continuous functions that multilayer perceptrons can represent is obviously way too large: one can pass infinitely many different continuous functions through a finite collection of points. It is necessary to impose additional regularity assumptions such as Lipschitz continuity, in which case one can provide a bound on the required number of samples. Unfortunately, these bounds scale exponentially with dimension, a phenomenon colloquially known as the ‘curse of dimensionality’, which is unacceptable in machine learning problems: even small-scale pattern recognition problems, such as image classification, deal with input spaces of thousands of dimensions. If one had to rely only on classical results from approximation theory, machine learning would be impossible. In our illustration, the number of examples of cat and dog images that would in theory be required in order to learn to tell them apart would be way larger than the number of atoms in the universe: there are simply not enough cats and dogs around to do it.
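As a rough illustration of the scaling involved (a standard back-of-the-envelope bound for a 1-Lipschitz target on the unit cube, stated here as an assumption rather than a result from the text), uniformly approximating such a function to accuracy ε requires on the order of ε^{-d} samples:

```latex
% Back-of-the-envelope sample complexity for uniformly approximating a
% 1-Lipschitz function on [0,1]^d to accuracy \epsilon (illustrative bound):
N(\epsilon, d) \sim \epsilon^{-d},
\qquad N(0.1,\, 1000) \sim 10^{1000} \gg 10^{80} \approx \text{atoms in the observable universe}.
```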
The struggle of machine learning methods to scale to high dimensions was brought up by the British mathematician Sir James Lighthill (1973) in a paper that AI historians call the ‘Lighthill Report’, in which he used the term ‘combinatorial explosion’ and claimed that existing AI methods could only work on toy problems and would become intractable in real-world applications. Lighthill further complained that “most workers in AI research and in related fields confess to a pronounced feeling of disappointment in what has been achieved in the past twenty-five years” and that “in no part of the field have the discoveries made so far produced the major impact that was then promised”. These were not mere frustrations of a grumpy academic stuck with a hard problem: the Report was commissioned by the British Science Research Council to evaluate academic research in the field of artificial intelligence, and its pessimistic conclusions resulted in funding cuts across the pond. Together with similar decisions by the American funding agencies, this amounted to a wrecking ball for AI research in the 1970s.
For us, the realisation that classical functional analysis cannot provide an adequate framework to deal with learning problems will be the motivation to seek stronger, geometric forms of regularity that can be implemented in a particular wiring of the neural network, such as the local connectivity of convolutional neural networks. It is fair to say that the triumphant reemergence of deep learning a decade ago owes, at least in part, to these insights. It is also probably true that the role of symmetry and invariance as a broad organising and design principle in neural networks was not given enough credit at the time, and our highlighting of these principles here benefits from hindsight.
1.2.3 Secrets of the Visual Cortex and the Neocognitron
The inspiration for the first neural network architectures of the new ‘geometric’ type came from neuroscience. In a series of experiments that would become classical and bring them a Nobel Prize in medicine, the duo of Harvard neurophysiologists David Hubel and Torsten Wiesel (1959; 1962) unveiled the structure and function of a part of the brain responsible for pattern recognition: the visual cortex. By presenting changing light patterns to a cat and measuring the response of its brain cells (neurons), as depicted in Figure 1.12, they showed that the neurons in the visual cortex have a multi-layer structure with local spatial connectivity: a cell would produce a response only if cells in its proximity (its ‘receptive field’) were activated. Furthermore, the organisation appeared to be hierarchical, where the responses of ‘simple cells’ reacting to local primitive oriented step-like stimuli were aggregated by ‘complex cells’, which produced responses to more complex patterns. It was hypothesized that cells in deeper layers of the visual cortex would respond to increasingly complex patterns composed of simpler ones, with a semi-joking suggestion of the existence of a ‘grandmother cell’ that reacts only when shown the face of one’s grandmother.
[Diagram of the experiment: a stimulus is presented while a recording electrode measures the electrical signal from the visual area of the brain.] David Hubel (1926–2013); Torsten Wiesel (b. 1924)
Figure 1.12
The classical experiment of Hubel and Wiesel (1962) revealed the structure of the brain’s visual cortex and inspired a new generation of neural network architectures mimicking its local connectivity.
The understanding of the structure of the visual cortex has had a profound impact on early works in computer vision and pattern recognition, with multiple attempts to imitate its main ingredients. Kunihiko Fukushima (1980), at that time a researcher at the Japan Broadcasting Corporation, developed a new neural network architecture “similar to the hierarchy model of the visual nervous system proposed by Hubel and Wiesel”, which was given the name neocognitron. The neocognitron consisted of interleaved S- and C-layers of neurons (a naming convention reflecting its inspiration in the biological visual cortex); the neurons in each layer were arranged in 2D arrays following the structure of the input image (‘retinotopic’), with multiple ‘cell-planes’ (feature maps in modern terminology) per layer. The S-layers were designed to be translationally symmetric: they aggregated inputs from a local receptive field using shared learnable weights, resulting in cells in a single cell-plane having receptive fields of the same function, but at different positions. The rationale was to pick up patterns that could appear anywhere in the input. The C-layers were fixed and performed local pooling (a weighted average), affording insensitivity to the specific location of the pattern: a C-neuron would be activated if any of the neurons in its input were activated.
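To illustrate the two ingredients, here is a minimal one-dimensional sketch (a single cell-plane with toy sizes; an illustration only, not Fukushima’s original formulation, which also used inhibitory connections and unsupervised learning) of an S-layer with shared local weights followed by a fixed pooling C-layer.

```python
import numpy as np

# Minimal 1D sketch of the neocognitron's interleaved layers: an S-layer applies
# the same local weights at every position (shared weights, rectified output),
# and a C-layer performs fixed local pooling. Sizes and weights are toy values.
def s_layer(signal, weights):
    """Shared-weight local feature extraction (translationally equivariant)."""
    k = len(weights)
    out = np.array([signal[i:i + k] @ weights for i in range(len(signal) - k + 1)])
    return np.maximum(out, 0.0)            # half-rectification

def c_layer(features, pool=2):
    """Fixed local pooling: insensitivity to the exact position of the pattern."""
    n = len(features) // pool
    return features[:n * pool].reshape(n, pool).mean(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=16)                    # toy input 'retina'
w = np.array([1.0, -2.0, 1.0])             # one local template (a single cell-plane)

h = c_layer(s_layer(x, w))                 # S-layer followed by C-layer
h_shift = c_layer(s_layer(np.roll(x, 2), w))   # the same pattern, shifted input
print(h.shape, h_shift.shape)
```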
Since the main application of the neocognitron was character recognition, translation invariance was crucial. This property was a fundamental difference from earlier neural networks such as Rosenblatt’s perceptron: in order to use a perceptron reliably, one had to first normalise the position of the input pattern, whereas in the neocognitron the insensitivity to the pattern position was baked into the architecture. The neocognitron achieved it by interleaving translationally-equivariant local feature extraction layers with pooling, creating a multiscale representation; we will refer to this principle as scale separation and study in subsequent Chapters why it can also help deal with a broader class of geometric transformations in addition to translations. Computational experiments showed that Fukushima’s architecture was able to successfully recognise complex patterns such as letters or digits, even in the presence of noise and geometric distortions.
Looking from the vantage point of four decades of progress in the field, one finds that the neocognitron already had strikingly many characteristics of modern deep learning architectures: depth (Fukushima simulated a seven-layer network in his paper), local receptive fields, shared weights, and pooling. It even used the half-rectifier (ReLU) activation function, which is often believed to have been introduced in recent deep learning architectures. The main distinction from modern systems was in the way the network was trained: the neocognitron was a ‘self-organised’ architecture trained in an unsupervised manner, since backpropagation had still not been widely used in the neural network community.
1.2.4 Convolutional Neural Networks
Fukushima’s design was further developed by Yann LeCun, a fresh graduate from the University of Paris with a PhD thesis on the use of backpropagation for training neural networks. In his first post-doctoral position at AT&T Bell Laboratories, LeCun and colleagues built a system to recognise hand-written digits on envelopes in order to allow the US Postal Service to automatically route mail. In a paper that is now classical, LeCun et al. (1989) described the first three-layer convolutional neural network (CNN). Similarly to the neocognitron, LeCun’s CNN also used local connectivity with shared weights and pooling (Figure 1.13). However, it forwent Fukushima’s more complex nonlinear filtering (inhibitory connections) in favour of simple linear filters that could be efficiently implemented as convolutions using multiply-accumulate operations on a digital signal processor (DSP). This design choice, departing from the neuroscience inspiration and terminology and moving into the realm of signal processing, would play a crucial role in the ensuing success of deep learning. Another key novelty of the CNN was the use of backpropagation for training.
Kunihiko Fukushima; [Schematic of LeCun’s 1989 network: 256 input units; layer H1 (12 × 64 = 768 hidden units, 20,000 links from 12 kernels of size 5 × 5); layer H2 (12 × 16 = 192 hidden units, 40,000 links from 12 kernels of size 5 × 5 × 8); layer H3 (30 hidden units, fully connected, 6,000 links); 10 output units (fully connected, 300 links).] Yann LeCun
Figure 1.13
The original variants of convolutional neural network architectures were introduced by Fukushima and LeCun (though the name “convolutional” would appear later). Implemented on a digital signal processor, LeCun’s CNN allowed real-time handwritten digit recognition for the first time.
LeCun’s work convincingly showed the power of gradient-based methods for complex pattern recognition tasks and was one of the first practical deep learning-based systems for computer vision. An evolution of this architecture, a five-layer CNN named LeNet-5 as a pun on the author’s name (LeCun et al. 1998), was used by US banks to read handwritten cheques. The computer vision research community, however, in its vast majority steered away from neural networks and took a different path. The typical architecture of visual recognition systems of the first decade of the new millennium was a carefully hand-crafted feature extractor (typically detecting interesting points in an image and providing their local description in a way that is robust to perspective transformations and contrast changes) followed by a simple classifier (most often a support vector machine (SVM) and, more rarely, a small neural network).
1.2.5 Recurrent Neural Networks, Vanishing Gradients, and LSTMs
While CNNs were mainly applied to modelling data with spatial symmetries such as images, another development was brewing: one that recognises that data is often non-static, but can evolve in a temporal manner as well. The simplest form of temporally-evolving data is a time series, consisting of a sequence of steps with a data point provided at each step.
A model handling time-series data hence needs to be capable of meaningfully adapting to sequences of arbitrary length, a feat that CNNs were not trivially capable of. This led to the development of Recurrent Neural Networks (RNNs) in the late 1980s and early 1990s, the earliest examples of which simply apply a shared update rule distributed through time (Jordan 1986; Elman 1990). At every step, the RNN state is updated as a function of the previous state and the current input.
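As a concrete illustration, the following minimal sketch (with assumed dimensions, in the spirit of an Elman-style update rather than any formulation quoted in the text) applies one shared update rule at every step of a sequence of arbitrary length.

```python
import numpy as np

# Minimal sketch of a simple recurrent update: one shared rule applied at every
# step of the sequence. Dimensions and weights are illustrative assumptions.
rng = np.random.default_rng(0)
d_in, d_h = 4, 8                           # input and state sizes (assumed)
W_xh = rng.normal(0, 0.3, (d_in, d_h))     # input-to-state weights
W_hh = rng.normal(0, 0.3, (d_h, d_h))      # state-to-state weights
b = np.zeros(d_h)

def rnn(sequence):
    h = np.zeros(d_h)                      # initial state
    for x_t in sequence:                   # works for sequences of any length
        h = np.tanh(x_t @ W_xh + h @ W_hh + b)   # shared update rule
    return h                               # final state summarises the sequence

sequence = rng.normal(size=(10, d_in))     # a toy time series of 10 steps
print(rnn(sequence).shape)                 # (8,)
```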
One issue that plagued RNNs from the beginning was the vanishing gradient problem. Vanishing gradients arise in very deep neural networks with sigmoidal activations, of which RNNs are often an instance, given that their effective depth is equal to the sequence length: when such a network is unrolled over many steps, the gradients quickly approach zero as backpropagation is performed. This effect strongly limits the influence of earlier steps in the sequence on the predictions made in the later steps.
A solution to this problem was first elaborated in Sepp Hochreiter’s Diploma thesis (1991), carried out in Munich under the supervision of Jürgen Schmidhuber, in the form of an architecture that was dubbed Long Short-Term Memory (LSTM; Figure 1.14). LSTMs combat the vanishing gradient problem by having a memory cell, with explicit gating mechanisms deciding how much of that cell’s state to overwrite at every step, allowing the degree of forgetting to be learned from data (or, alternatively, remembering for a long time). In contrast, the simple RNNs of Jordan (1986) and Elman (1990) perform a full overwrite at every step.
This ability to remember context for a long time turned out to be crucial in natural language processing and speech analysis applications, where recurrent models were shown to be successful starting from the early 2000s (Gers and Schmidhuber 2001; Graves et al. 2004). However, as had happened with CNNs in computer vision, the breakthrough would need to wait for another decade to come.
Sepp Hochreiter; Jürgen Schmidhuber
Figure 1.14
The creators of the long short-term memory cell (the first recurrent neural network module capable of dealing with long-range dependencies), Hochreiter and Schmidhuber, pictured alongside one of the earliest schematic depictions of their creation.
1.2.6 Gating Mechanisms and Time Warping
In our context, it may not be initially obvious whether recurrent models embody any kind of symmetry or invariance principle similar to the translational invariance we have seen in CNNs. Nearly three decades after the development of RNNs, Corentin Tallec and Yann Ollivier (2018) showed that there is a type of symmetry underlying recurrent neural networks: time warping.
Time series have a very important but subtle nuance: it is rarely the case that
the individual steps of a time series are evenly spaced. Indeed, in many natural
phenomena, we may deliberately choose to take many measurements at some
time points and very few (or no) measurements during others. In a way, the
notion of ‘time’ presented to the RNN working on such data undergoes a form
of warping operation. Can we somehow guarantee that our RNN is “resistant”
to time warping, in the sense that we will always be able to find a set of useful
parameters for it, regardless of the warping operation applied?
In order to handle time warping with a dynamic rate, the network needs to be able to dynamically estimate how quickly time is being warped (the so-called ‘warping derivative’) at every step. This estimate is then used to selectively overwrite the RNN’s state. Intuitively, for larger values of the warping derivative, a stricter state overwrite should be performed, as more time has elapsed since the previous step.
The idea of dynamically overwriting data in a neural network is implemented through the gating mechanism. Tallec and Ollivier effectively show that, in order to satisfy time warping invariance, an RNN needs to have a gating mechanism. This provides a theoretical justification for gated RNN models, such as the aforementioned LSTMs of Hochreiter and Schmidhuber or the Gated Recurrent Unit (GRU) (Cho et al. 2014). The full overwriting used in simple RNNs corresponds to an implicit assumption of a constant time warping derivative equal to one, a situation unlikely to occur in most real-world scenarios, which also explains the success of LSTMs.
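The contrast can be sketched as follows (toy dimensions and a simplified, GRU-flavoured gate; an illustration rather than the exact LSTM or GRU equations): a simple RNN fully overwrites its state, whereas a gated cell interpolates, per unit, between keeping and overwriting it.

```python
import numpy as np

# Minimal sketch contrasting a full state overwrite (simple RNN) with a gated,
# partial overwrite (a simplified, GRU-flavoured update). Dimensions and weights
# are illustrative assumptions.
rng = np.random.default_rng(1)
d_in, d_h = 4, 8
W_xh, W_hh = rng.normal(0, 0.3, (d_in, d_h)), rng.normal(0, 0.3, (d_h, d_h))
W_xg, W_hg = rng.normal(0, 0.3, (d_in, d_h)), rng.normal(0, 0.3, (d_h, d_h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step_simple(h, x):
    # full overwrite: the old state survives only inside the tanh argument
    return np.tanh(x @ W_xh + h @ W_hh)

def step_gated(h, x):
    g = sigmoid(x @ W_xg + h @ W_hg)          # learned degree of overwriting
    h_tilde = np.tanh(x @ W_xh + h @ W_hh)    # candidate new state
    return (1.0 - g) * h + g * h_tilde        # partial, selective overwrite

h = np.zeros(d_h)
for x_t in rng.normal(size=(10, d_in)):
    h = step_gated(h, x_t)
print(h.shape)
```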
1.2.7 The Triumph of Deep Learning
As already mentioned, the reception of what would later be called “deep learning” had initially been rather lukewarm. The computer vision community, where the first decade of the new century was dominated by handcrafted feature descriptors, appeared to be particularly hostile to the use of neural networks. However, the balance of power was changed by the rapid growth in computing power and the amounts of available annotated visual data. It became possible to implement and train increasingly bigger and more complex CNNs that allowed addressing increasingly challenging visual pattern recognition tasks, culminating in a Holy Grail of computer vision at that time: the ImageNet Large Scale Visual Recognition Challenge (Figure 1.15). Established by the American-Chinese researcher Fei-Fei Li, ImageNet was an annual challenge consisting of the classification of millions of human-labelled images into 1000 different categories.
[Chart of ImageNet top-5 error by year and winning entry: 2010 NEC-UIUC (28%), 2011 XRCE (26%), 2012 AlexNet (16.4%), 2013 ZFNet (11.7%), 2014 GoogLeNet/VGGNet (6.7%), 2015 ResNet (3.6%), 2016 GoogLeNet-v4 (3.1%); human performance is around 5%.] Fei-Fei Li
Figure 1.15
ImageNet, a benchmark developed by Fei-Fei Li at Stanford University, was one of the ‘Holy Grail’ challenges in computer vision in the early 2010s. The dramatic performance improvement provided by the convolutional neural network AlexNet in 2012 is considered the turning point leading to the widespread adoption of deep learning in the field.
A CNN architecture developed at the University of Toronto by Krizhevsky, Sutskever, and Hinton (2012) managed to beat, by a large margin, all the competing approaches, such as smartly engineered feature detectors based on decades of research in the field. AlexNet (as the architecture was called in honour of its developer, Alex Krizhevsky; see Figure 1.16) was significantly bigger in terms of the number of parameters and layers compared to its older sibling LeNet-5, but conceptually the same. The key difference was the use of a graphics processor (GPU) for training, now the mainstream hardware platform for deep learning.
[Schematic of the AlexNet architecture: five convolutional layers (CONV1–CONV5, with 11 × 11, 5 × 5, and 3 × 3 kernels, stride 4 in the first layer, and interleaved max pooling) followed by three fully connected layers (FC1–FC3) and a 1000-way output.] Alex Krizhevsky
Figure 1.16
Alex Krizhevsky, pictured next to a schematic of AlexNet, the first deep neural network-based solution to win the ImageNet contest, by a significant margin. AlexNet’s result is commonly seen as a pivotal moment in the development of modern deep learning, ushering in significant developments in subsequent years and eventually leading to surpassing human performance on ImageNet only a few years later.
The success of CNNs on ImageNet became the turning point for deep learning and heralded its broad acceptance in the following decade. A similar transformation happened in natural language processing and speech recognition, which moved entirely to neural network-based approaches during the 2010s; indicatively, Google and Facebook switched their machine translation systems to LSTM-based architectures around 2016–2017. Multi-billion dollar industries emerged as a result of this breakthrough, with deep learning successfully used in commercial systems ranging from speech recognition in the Apple iPhone to Tesla self-driving cars. More than forty years after the scathing review of Rosenblatt’s work, the connectionists were finally vindicated.
1.2.8 Graph Neural Networks and Their Chemical Precursors
If the history of symmetry is tightly intertwined with physics, the history of graph neural networks, one of the central topics of our book, has roots in another branch of natural science: chemistry. Chemistry has historically been, and still is, one of the most data-intensive academic disciplines. The emergence of modern chemistry in the eighteenth century resulted in the rapid growth of known chemical compounds and an early need for their organisation. This role was initially played by periodicals such as the Chemisches Zentralblatt and “chemical dictionaries” like the Gmelins Handbuch der anorganischen Chemie (an early compendium of inorganic compounds first published in 1817) and Beilsteins Handbuch der organischen Chemie (a similar effort for organic chemistry), all initially published in German, which was the dominant language of science until the early 20th century.
In the English-speaking world, the Chemical Abstracts Service (CAS) was created in 1907 and has gradually become the central repository for the world’s published chemical information. However, the sheer amount of data (the Beilstein alone has grown to over 500 volumes and nearly half a million pages over its lifetime) quickly made it impractical to print and use such chemical databases.
Since the mid-nineteenth century, chemists have established a universally
understood way to refer to chemical compounds through structural formulae,
indicating a compound’s atoms, the bonds between them, and even their 3D
geometry. But such structures did not lend themselves to easy retrieval. In
the first half of the 20th century, with the rapid growth of newly discovered
compounds and their commercial use, the problem of organising, searching,
and comparing molecules became of crucial importance: for example, when a
pharmaceutical company sought to patent a new drug, the Patent Office had to
verify whether a similar compound had been previously deposited.
To address this challenge, several systems for indexing molecules were introduced in the 1940s, forming the foundations for a new discipline that would later be called chemoinformatics. One such system, named the ‘GKD chemical cipher’ after its authors Gordon, Kendall, and Davison (1948), was developed at the English tire firm Dunlop to be used with early punchcard-based computers. In essence, the GKD cipher was an algorithm for parsing a molecular structure into a string that could be more easily looked up by a human or a computer.
However, the GKD cipher and other related methods were far from satisfactory. In chemical compounds, similar structures often result in similar properties. Chemists are trained to develop intuition to spot such analogies, and look for them when comparing compounds. On the other hand, when a molecule is represented as a string (such as in the GKD cipher), the constituents of a single chemical structure may be mapped into different positions of the cipher. As a result, two molecules containing a similar substructure (and thus possibly similar properties) might be encoded in very different ways (see an example in Figure 1.17).
George Vlăduț
Figure 1.17
A figure from Vlăduț et al. (1959) showing a chemical molecule (top left) and its fragment (top right) and the corresponding GKD-ciphers (bottom). Note that this coding system breaks the spatial locality of connected atoms in the molecule, such that the fragment cipher cannot be found by simple substring matching in that of the full molecule. This drawback of early chemical representation methods was one of the motivations for the search for structural representations of molecules as graphs.
This realisation encouraged the development of “topological ciphers”, trying to capture the structure of the molecule. The first works of this kind were done at the Dow Chemicals company by Opler and Norton (1956) and at the US Patent Office by Ray and Kirsch (1957), both heavy users of chemical databases. One of the most famous such descriptors, known as the ‘Morgan fingerprint’, was developed by Harry Morgan (1965) at the Chemical Abstracts Service and is used to this day.
A figure that played a key role in developing early “structural” approaches for searching chemical databases is the Romanian-born Soviet researcher George Vlăduț (Figure 1.17). A chemist by training (with a PhD in organic chemistry defended at the Moscow Mendeleev Institute in 1952), he experienced a traumatic encounter with the gargantuan Beilstein handbook in his freshman years. This steered his research interests towards chemoinformatics (Vlăduț et al. 1959), a field in which he worked for the rest of his life.
Vlăduț is credited as one of the pioneers of using graph theory for modeling the structures and reactions of chemical compounds. In a sense, this should not come as a surprise: graph theory has been historically tied to chemistry, and even the term ‘graph’ (referring to a set of nodes and edges, rather than a plot of a function) was introduced by the mathematician James Sylvester (1878) as a mathematical abstraction of chemical molecules; cf. Figure 1.18.
August Kekulé (1829–1896); James Joseph Sylvester (1814–1897)
Figure 1.18
The structural formula of benzene (C6H6) proposed by the 19th-century German chemist August Kekulé. The term “graph” (in the sense used in graph theory) was first introduced as a model of molecules by James Sylvester in an 1878 Nature note, explicitly relating molecules to graphs expressing their “Kekuléan diagrams”.
In particular, Vlăduț advocated the formulation of molecular structure comparison as the graph isomorphism problem; his most famous work was on classifying chemical reactions as the partial isomorphism (maximum common subgraph) of the reactant and product molecules (Vleduts 1963).
Vlăduț’s work inspired a pair of young researchers, Boris Weisfeiler (an algebraic geometer) and Andrey Lehman (self-described as a “programmer”). In a classical joint paper (Weisfeiler and Leman 1968), the duo introduced an iterative algorithm for testing whether a pair of graphs are isomorphic (i.e., whether the graphs have the same structure up to a reordering of nodes), which became known as the Weisfeiler-Lehman (WL) test; see Figure 1.19. Though the two had known each other since their school years, their ways parted shortly after their publication and each became accomplished in his respective field.
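The iteration can be sketched as follows (a minimal Python rendering of the 1-dimensional colour refinement variant on a toy adjacency-list representation; an illustration, not the authors’ original formulation): each node repeatedly combines its colour with the multiset of its neighbours’ colours, and two graphs whose resulting colour histograms differ cannot be isomorphic.

```python
def wl_refine(adj, iterations=3):
    """1-WL colour refinement on a graph given as {node: [neighbours]}.
    Colours are nested tuples, so they are directly comparable across graphs."""
    colours = {v: () for v in adj}                 # uniform initial colouring
    for _ in range(iterations):
        colours = {
            v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
            for v in adj
        }
    return sorted(colours.values())                # node-order-independent summary

def maybe_isomorphic(adj1, adj2):
    """WL test: False means 'definitely not isomorphic'; True means 'possibly'."""
    return wl_refine(adj1) == wl_refine(adj2)

# Toy example: a 4-cycle versus a path on 4 nodes (not isomorphic; WL detects it).
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
path  = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(maybe_isomorphic(cycle, path))               # False
print(maybe_isomorphic(cycle, cycle))              # True
```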
Weisfeiler and Lehman’s initial conjecture that their algorithm solved the graph isomorphism problem (and did so in polynomial time) was incorrect: while Leman (1970) demonstrated it computationally for graphs with at most nine nodes, a larger counterexample was found a year later (Adelson-Velskii et al. 1969) (and in fact, a strongly regular graph failing the WL test, called the ‘Shrikhande graph’, had been known even earlier (Shrikhande 1959)).
The paper of Weisfeiler and Lehman has become foundational in understanding graph isomorphism. To put their work in historical perspective, one should remember that in the 1960s, complexity theory was still embryonic and algorithmic graph theory was only taking its first baby steps. As Lehman recollected in the late 1990s: “in the 60s, one could in matter of days re-discover all the facts, ideas, and techniques in graph isomorphism theory. I doubt, that the word ‘theory’ is applicable; everything was at such a basic level.” In the context of graph neural networks, Weisfeiler and Lehman have recently become household names with the proof of the equivalence of their graph isomorphism test to message passing (Xu et al. 2018; Morris et al. 2019).
Boris Weisfeiler; Andrey Lehman
Figure 1.19
The creators of the eponymous Weisfeiler-Lehman graph isomorphism test, the bedrock of expressivity analysis for both graph isomorphism and graph neural networks, are pictured alongside the front page of their paper.
1.2.9 Back to the Origins
Though chemists have been using GNN-like algorithms for decades, it is likely that their works on molecular representation remained practically unknown in the machine learning community. We find it hard to pinpoint precisely when the concept of graph neural networks began to emerge: partly due to the fact that most of the early work did not place graphs as a first-class citizen, partly since graph neural networks became practical only in the late 2010s, and partly because this field emerged from the confluence of several adjacent research areas; nonetheless, here we will discuss several pioneering works, many of which were designed by the researchers depicted in Figure 1.20.
Early forms of graph neural networks can be traced back at least to the
1990s, with examples including “Labeling RAAM” by Alessandro Sperduti
(1994), the “backpropagation through structure” of Goller and Kuchler (1996),
and adaptive processing of data structures (Sperduti and Starita 1997; Frasconi,
Gori, and Sperduti 1998). While these works were primarily concerned with
operating over “structures” (often trees or directed acyclic graphs), many of
the invariances preserved in their architectures are reminiscent of the GNNs
more commonly in use today.
The first proper treatment of the processing of generic graph structures (and
the coining of the term ‘graph neural network’) happened after the turn of the
twenty-first century. A University of Siena team led by Marco Gori (2005) and
28 Chapter 1
Franco Scarselli (2008) proposed the first “GNN. They relied on recurrent
mechanisms, required the neural network parameters to specify contraction
mappings, and thus computing node representations by searching for a fixed
point this in itself necessitated a special form of backpropagation and did
not depend on node features at all. All of the above issues were rectified by
the Gated GNN (GGNN) model of Yujia Li et al. (2015), which brought many
benefits of modern RNNs, such as gating mechanisms (Cho et al. 2014) and
backpropagation through time. The neural network for graphs (NN4G) pro-
posed by Alessio Micheli (2009) around the same time used a feedforward
rather than recurrent architecture, in fact more closely resembling modern GNNs.
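As a loose numerical sketch of the fixed-point idea just described (a schematic illustration under our own simplifying assumptions, not the authors' exact formulation), one can iterate a shared, deliberately contractive update of the node states until they stop changing:

import numpy as np

def fixed_point_gnn(adj, d=4, w_scale=0.1, tol=1e-6, max_iters=1000):
    """Iterate a shared update until the node states reach a fixed point.
    Keeping the weights small makes the update a contraction, which
    guarantees convergence; note that the states depend only on the graph
    structure (not on node features), mirroring the limitation of the
    earliest recurrent GNNs noted in the text.

    adj: (n, n) adjacency matrix; d: dimensionality of the node states."""
    n = adj.shape[0]
    rng = np.random.default_rng(0)
    W = w_scale * rng.standard_normal((d, d))  # small weights => contraction
    b = rng.standard_normal(d)
    h = np.zeros((n, d))
    for _ in range(max_iters):
        h_new = np.tanh(adj @ h @ W + b)       # aggregate neighbour states
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    return h  # a readout layer would map these states to the task output

adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # a 3-node path
print(fixed_point_gnn(adj).shape)  # (3, 4)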
Figure 1.20
Six of the pioneering authors of graph neural networks (GNNs) within the
machine learning community: Alessandro Sperduti, Christoph Goller, Andreas
Küchler, Marco Gori, Franco Scarselli, and Alessio Micheli. Sperduti developed
the earliest known GNN-like architecture, published at NeurIPS in 1994. Goller
and Küchler published an early form of backpropagation through structures in
the ’90s. The term “GNN” was coined by Gori and Scarselli’s work in 2005,
firmly establishing the term that remains popular to this day, although the
concurrent NN4G work of Micheli uses a mechanism that more closely resembles
modern GNN implementations.
Another important class of graph neural networks, often referred to as
“spectral”, relied on the notion of the Graph Fourier transform (Bruna et al.
2013). The roots of this construction are in the signal processing and
computational harmonic analysis communities, where dealing with non-Euclidean
signals became prominent in the late 2000s and early 2010s.⁵⁸ Influential papers by
Influential papers by
Shuman et al. (2013) and Sandryhaila and Moura (2013) popularised the notion
of “Graph Signal Processing” (GSP) and the generalisation of Fourier trans-
forms based on the eigenvectors of graph adjacency and Laplacian matrices.
The graph convolutional neural networks relying on spectral filters by Def-
ferrard, Bresson, and Vandergheynst (2016) and Kipf and Welling (2016) are
among the most cited in the field.
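As a schematic illustration of the spectral construction (a minimal sketch under our own choice of Laplacian and filter, not the specific parameterisations of the cited papers), a signal on the nodes can be filtered by transforming it into the Laplacian eigenbasis, reweighting the ‘graph frequencies’, and transforming back:

import numpy as np

def graph_fourier_filter(adj, x, filter_fn):
    """Filter a node signal in the spectral domain of the graph Laplacian.

    adj: (n, n) symmetric adjacency matrix; x: (n,) signal on the nodes;
    filter_fn: a function applied to the Laplacian eigenvalues
    (the 'graph frequencies'), e.g. a low-pass response."""
    lap = np.diag(adj.sum(axis=1)) - adj   # combinatorial graph Laplacian
    evals, evecs = np.linalg.eigh(lap)     # eigenvectors = graph Fourier basis
    x_hat = evecs.T @ x                    # forward graph Fourier transform
    y_hat = filter_fn(evals) * x_hat       # multiply by the spectral filter
    return evecs @ y_hat                   # inverse transform back to nodes

# Example: smooth a noisy signal on a 4-cycle with a low-pass filter.
cycle = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)
signal = np.array([1.0, -0.2, 0.9, 0.1])
print(graph_fourier_filter(cycle, signal, lambda lam: np.exp(-lam)))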
It is worth noting that, while the concept of GNNs experienced several inde-
pendent re-derivations in the 2010s arising from several perspectives (besides
the already mentioned connections to computational chemistry and signal pro-
cessing, we should highlight probabilistic graphical models (Dai, Dai, and
Song 2016) and natural language processing (Vaswani et al. 2017)), the fact
that all of these models arrived at different instances of a common blueprint
is certainly telling. In fact, it has recently been posited (Veličković 2022) that
GNNs may offer a universal framework for processing discretised data. Other
neural architectures we will discuss in this book (such as CNNs or RNNs) may
be recovered as special cases of GNNs, by inserting appropriate priors into the
message passing functions or the graph structure over which the messages are
computed.
In a somewhat ironic twist of fate, modern GNNs were triumphantly re-
introduced to chemistry (Figure 1.21), a field they originated from, by David
Duvenaud et al. (2015) as a replacement for handcrafted Morgan molecular
fingerprints, and by Justin Gilmer et al. (2017) in the form of message-passing
neural networks equivalent to the Weisfeiler-Lehman test. After fifty years,
the circle finally closed. At the time of writing, graph neural networks have
become a standard tool in chemistry and are already used in drug discovery and
design pipelines. A notable accolade was claimed with the GNN-based discov-
ery of novel antibiotic compounds (Stokes et al. 2020). DeepMind’s AlphaFold
2 (Jumper et al. 2021) used a form of GNNs in order to address a hallmark
problem in structural biology – the problem of protein folding.
Figure 1.21
The two works spearheaded by David Duvenaud and Justin Gilmer, respectively,
have greatly popularised the use of GNNs for both drug screening and quantum
chemistry, applications that remain prominent to this day.
In 1999, Andrey Lehman wrote to a mathematician colleague that he had
the “pleasure to learn that ‘Weisfeiler-Leman’ was known and still caused
interest.”⁵⁹ He did not live to see the rise of GNNs based on his work of
fifty years earlier. Nor did George Vlăduț see the realisation of his ideas in
chemoinformatics, many of which remained on paper during his lifetime.
1.3 The ‘Erlangen Programme’ of Deep Learning
Our historical overview of the geometric foundations of deep learning has now
naturally brought us to the blueprint that underpins this book. Taking Convo-
lutional and Graph Neural Networks as two prototypical examples, at first
glance completely unrelated, we find several common traits. First, both operate
on data (images in the case of CNNs or molecules in the case of GNNs) that has
some underlying geometric domain (respectively, a grid or a graph). Second,
in both cases the tasks have a natural notion of invariance (e.g. to the position
of an object in image classification, or the numbering of atoms in a molecule
in chemical property prediction) that can be formulated through an appropriate
symmetry group (translation in the former example and permutation in the lat-
ter). Third, both CNNs and GNNs incorporate the respective symmetries as an
inductive bias by making their layers interact appropriately with the action of
the symmetry group on the input. In CNNs, it comes in the form of convolu-
tional layers whose output transforms in the same way as the input (we will
call this property translation-equivariance, which is the same as saying that
convolution commutes with the shift operator). In GNNs, it assumes the form
of a symmetric message passing function that aggregates the neighbour nodes
irrespective of their order, and the overall output of a message passing layer
transforms in the same way under a permutation of the input (permutation
equivariance). Finally, in some architectural instances, the data are processed
in a multi-scale fashion; this corresponds to pooling in CNNs (uniformly sub-
sampling the grid that underlies the image) or graph coarsening in some types
of GNNs.
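These symmetry requirements can be stated compactly: if a symmetry g of the domain acts on the input as ρ(g), an equivariant layer f satisfies f(ρ(g)x) = ρ(g)f(x), while an invariant function satisfies f(ρ(g)x) = f(x). The following sketch checks this numerically for permutations, using a deliberately simple sum-aggregation layer of our own choosing (not a layer from any specific paper):

import numpy as np

def message_passing_layer(adj, x):
    """A simple permutation-equivariant layer: each node combines its own
    features with the sum of its neighbours' features (sum aggregation is
    symmetric, i.e. independent of the order of the neighbours)."""
    return np.tanh(x + adj @ x)

rng = np.random.default_rng(0)
n, d = 5, 3
adj = rng.integers(0, 2, size=(n, n)).astype(float)
adj = np.triu(adj, 1)
adj = adj + adj.T                                  # random undirected graph
x = rng.standard_normal((n, d))                    # node features

P = np.eye(n)[rng.permutation(n)]                  # a random permutation matrix

# Equivariance: permuting the input permutes the output in the same way.
print(np.allclose(message_passing_layer(P @ adj @ P.T, P @ x),
                  P @ message_passing_layer(adj, x)))          # True

# Invariance: a graph-level readout (sum over nodes) ignores the ordering.
print(np.allclose(message_passing_layer(P @ adj @ P.T, P @ x).sum(axis=0),
                  message_passing_layer(adj, x).sum(axis=0)))  # True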
Overall, this appears to be a very general principle that can be applied to
a broad range of problems, types of data, and architectures. We will use the
Geometric Deep Learning Blueprint to derive from first principles some of
the most common and popular neural network architectures (CNNs, GNNs,
LSTMs, DeepSets, Transformers), which as of today constitute the majority
of the deep learning toolkit. As we will see in Chapters ?–?, all of the above
can be obtained by an appropriate choice of the domain and the associated
symmetry group.
Further extensions of the Blueprint, such as incorporating the symmetries
of the data in addition to those of the domain, allow one to obtain a new class
of equivariant GNN architectures that will be discussed later on in this book.
Such architectures have recently come to the limelight in molecular modelling,
and account for the fact that in molecular graphs (where nodes represent atoms
and edges are chemical bonds), the permutation of the nodes and the rotation of
the node features (atomic coordinates) are different symmetries.
Suggested Further Reading
On the topic of symmetry, Weyl’s classic (1952) is hard to beat. Straumann
(1996) provides a fascinating account on the development of early gauge the-
ories and Weyl’s role in it. An overview of the historical development of
non-Euclidean geometry is given by Faber (1983). Brannan, Esplen, and Gray
(2011) give a ‘Kleinian’ introduction into geometry, and Sharpe (2000) shows
Cartan’s generalisation of the Erlangen Programme. For readers interested in
the role of symmetry in physics we recommend the book Physics from Symme-
try (Schwichtenberg 2015), which lives up to the promise in its title, and the
magnum opus of Sir Roger Penrose (2005). For biographical notes on the life and
work of Klein and Noether, see Tobies (2019) and Quigg (2019); the book of
Uspensky gives a good overview of the intellectual and political environment
in which Weisfeiler and Lehman worked. Finally, we invite the readers to actu-
ally read the controversial book of Minsky and Papert (1969) and its shorter
critical review by Block (1970).
Notes
1
Fully titled Strena, Seu De Nive Sexangula (‘New Year’s gift, or on the Six-
Cornered Snowflake’, see Figure 1.1), it was, as suggested by the title, a small
booklet sent by Kepler in 1611 as a Christmas gift to his patron and friend
Johannes Matthäus Wackher von Wackenfels.
2
Galois famously described the ideas of group theory (which he considered
in the context of finding solutions to polynomial equations) and coined the
term ‘group’ (groupe in French) in a letter to a friend written on the eve of his
fatal duel. He asked to communicate his ideas to prominent mathematicians
of the time, expressing the hope that they would be able to ‘decipher all this
mess’ (‘déchiffrer tout ce gâchis’). Galois died two days later from wounds
suffered in the duel aged only 20, but his work has been transformational in
mathematics.
3
Omar Khayyam is nowadays mainly remembered as a poet and author of the
immortal line “a flask of wine, a book of verse, and thou beside me.”
4
The publication of Euclides vindicatus required the approval of the Inqui-
sition, which came just a few months before the author’s death. Rediscovered
by the Italian differential geometer Eugenio Beltrami in the nineteenth cen-
tury, Saccheri’s work is now considered an early, almost-successful, attempt to
construct hyperbolic geometry.
5
Poncelet was a military engineer and participant of Napoleon’s Russian cam-
paign, where he was captured and held as prisoner until the end of the war. It
was during this captivity period that he wrote the Traité des propriétés projec-
tives des figures (‘Treatise on the projective properties of figures’, 1822) that
revived the interest in projective geometry. Earlier foundation work on this
subject was done by his compatriot Gérard Desargues (1643).
6
In the 1832 letter to Farkas Bolyai following the publication of his son’s
results, Gauss famously wrote: “To praise it would amount to praising myself.
For the entire content of the work coincides almost exactly with my own
meditations which have occupied my mind for the past thirty or thirty-five
years.” Gauss was also the first to use the term ‘non-Euclidean geometry’ (in
a letter to Heinrich Christian Schumacher), referring strictu sensu to his own
construction of hyperbolic geometry (Faber 1983).
7
A model for hyperbolic geometry known as the pseudosphere, a surface
with constant negative curvature, was shown by Beltrami, who also proved that
hyperbolic geometry was logically consistent. The term ‘hyperbolic geometry’
was introduced by Klein in his 1873 paper Über die sogenannte nicht-
Euklidische Geometrie (‘On the so-called non-Euclidean geometry’). The
prefix ‘so-called’ might strike the reader as having a somewhat negative flavour, and indeed
it does: in his Erlangen Programme, Klein wrote that “with the name non-
Euclidean geometry have been associated a multitude of non-mathematical
ideas, which have been as zealously cherished by some as resolutely rejected
by others.” He however dropped the derogatory adjective in his later works.
8
For example, an 1834 pamphlet signed only with the initials “S.S.”
(believed by some to belong to Lobachevsky’s long-time opponent Ostrograd-
sky) claimed that Lobachevsky made “an obscure and heavy theory” out of
“the lightest and clearest chapter of mathematics, geometry,” wondered why
one would print such “ridiculous fantasies,” and suggested that the book was a
“joke or satire.”
9
According to a popular belief, repeated in many sources including
Wikipedia, the Erlangen Programme was delivered in Klein’s inaugural
address in October 1872. Klein indeed gave such a talk (though on December
7, 1872), but it was for a non-mathematical audience and concerned primarily
his ideas of mathematical education (Tobies 2019).
10
Klein’s projective model of hyperbolic geometry is often called the ‘Klein
disk’ or the ‘Cayley-Beltrami-Klein model’.
11
At the time, Göttingen was Germany’s and the world’s leading centre
of mathematics. Though Erlangen is proud of its association with Klein, he
stayed there for only three years, moving in 1875 to the Technical University
of Munich (then called Technische Hochschule), followed by Leipzig (1880),
and finally settling down in Göttingen from 1886 until his retirement.
12
Emmy Noether is rightfully regarded as one of the most important women in
mathematics and one of the greatest mathematicians of the twentieth century.
She was unlucky to be born and live in an epoch when the academic world
was still entrenched in the medieval beliefs of the unsuitability of women
for science. Her career as one of the few women in mathematics having to
overcome prejudice and contempt was a truly trailblazing one. It should be
said to the credit of her male colleagues that some of them tried to break the
rules. When Klein and David Hilbert first unsuccessfully attempted to secure
a teaching position for Noether at Göttingen, they met fierce opposition from
the academic hierarchs. Hilbert reportedly retorted sarcastically to concerns
brought up in one such discussion: “I do not see that the sex of the candidate
is an argument against her admission as a Privatdozent. After all, the Senate
is not a bathhouse.” (Reid 1976) Nevertheless, Noether enjoyed great esteem
among her close collaborators and students, and her male peers in Göttingen
affectionately referred to her as Der Noether, in the masculine (Quigg 2019).
13
Weyl first conjectured (incorrectly) in 1919 that invariance under the change
of scale or “gauge” was a local symmetry of electromagnetism. The term
gauge, or Eich in German, was chosen by analogy to the various track gauges
of railroads. After the development of quantum mechanics, Weyl (1929) mod-
ified the gauge choice by replacing the scale factor with a change of wave
phase. See Straumann (1996).
14
The Dartmouth Summer Research Project on Artificial Intelligence was
a 1956 summer workshop at Dartmouth College that is considered to be the
founding event of the field of artificial intelligence.
15
From ‘perception’ and the Greek suffix -τρον denoting an instrument.
16
That is why the authors can only smile at similar recent claims about the
‘consciousness’ of deep neural networks: nihil sub sole novum.
17
Sic in quotes, as pattern recognition had not yet become a common term.
18
Specifically, Minsky and Papert (1969) considered binary classification
problems on a 2D grid (‘retina’ in their terminology) and a set of linear thresh-
old functions. While the inability to compute the XOR function is always
brought up as the main point of criticism in the book, much of the attention
was dedicated to geometric predicates such as parity and connectedness. This
problem is alluded to on the cover of the book, which is adorned by two pat-
terns: one is connected, one is not. Even for a human it is very difficult to
determine which is which.
19
Kussul et al. (2001) showed that Rosenblatt’s 3-layer perceptron trained
with modern methods and implemented on 21st-century hardware was able
to achieve an accuracy of 99.2% on the MNIST digit recognition task, on par
with modern models.
20
Hilbert’s Thirteenth Problem is one of the 23 problems compiled by
David Hilbert in 1900, concerning whether the solution of the general
7th-degree equation can be expressed using composition of continuous functions
of two arguments. Kolmogorov and his student Arnold solved a generalised
version of this problem, which is now known as the Arnold–Kolmogorov
Superposition Theorem.
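For reference, the superposition theorem (in its commonly quoted form, paraphrased here) states that any continuous function of n variables on the unit cube can be written using only addition and continuous functions of a single variable:

f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),

where \Phi_q and \phi_{q,p} are continuous univariate functions.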
21
Computational geometry has subject code I.3.5 in the ACM Computing
Classification System.
22
Backpropagation is based on the chain rule of differentiation that itself
goes back to the co-inventor of differential calculus Gottfried Wilhelm von
Leibniz in 1676. A precursor of backpropagation was used by Kelley 1960
to perform optimisation of complex nonlinear multi-stage systems. Efficient
backpropagation that is still in use today was described by Seppo Linnainmaa
(1970) in his master’s thesis in Finnish. The earliest use in neural networks is
due to Paul Werbos (1982), which is usually cited as the origin of the method.
See Schmidhuber (2015).
23
There are even examples of continuous nowhere differentiable functions
such as the construction of Weierstrass (1872).
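For the curious reader, Weierstrass’s construction (quoted here for reference) is the uniformly convergent series

W(x) = \sum_{n=0}^{\infty} a^n \cos(b^n \pi x), \qquad 0 < a < 1,

which he showed to be continuous everywhere yet differentiable nowhere when b is an odd integer and ab > 1 + 3\pi/2.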
24
Roughly, Lipschitz-continuous functions do not arbitrarily shrink or expand
the distance between points on the domain. For differentiable functions, Lip-
schitz continuity can be expressed as an upper bound on the norm of the
gradient, implying that the function does not ‘jump’ too abruptly.
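In symbols (a standard statement added here for concreteness): a function f is C-Lipschitz if

\| f(x) - f(y) \| \le C \, \| x - y \| \quad \text{for all } x, y,

and for a differentiable function on a convex domain this is equivalent to the gradient (more generally, the Jacobian) having norm bounded by C everywhere.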
25
The first to use the term was Richard Bellman (1957) in the preface of
his book Dynamic Programming, where he refers to dimensionality as ‘a curse
which has hung over the head of the physicist and astronomer for many a year.’
26
The number of protons in the observable universe, known as the Eddington
number, is estimated at 10⁸⁰.
27
The term ‘receptive field’ predates Hubel and Wiesel and was used by
neurophysiologists from the early twentieth century, see Sherrington (1906).
28
The term ‘grandmother cell’ is likely to have first appeared in Jerry Lettvin’s
course ‘Biological Foundations for Perception and Knowledge’ held at MIT in
1969. A similar concept of ‘gnostic neurons’ was introduced two years earlier
in a book by a Polish neuroscientist Jerzy Konorski (1967). See Gross (2002).
29
The name ‘neocognitron’ suggests it was an improved version of an earlier
architecture of Fukushima (1975), the cognitron.
30
In the words of the author himself, having an output that is “dependent
only upon the shape of the stimulus pattern, and is not affected by the position
where the pattern is presented.”
31
ReLU-type activations date back to at least the 1960s and have been
previously employed by Fukushima (1969, 1975).
32
Université Pierre-et-Marie-Curie, today part of the Sorbonne University.
33
In LeCun’s 1989 paper, the architecture was not named; the term ‘con-
volutional neural network’ or ‘convnet’ would appear in a later paper in
1998.
34
LeCun’s first CNN was trained on a CPU (a SUN-4/250 machine). However,
the image recognition system using a trained CNN was run on AT&T DSP-32C
(a second-generation digital signal processor with 256KB of memory capa-
ble of performing 125m floating point multiply-and-accumulate operations per
second with 32-bit precision), achieving over 30 classifications per second.
35
One of the most popular feature descriptors was the scale-invariant feature
transform (SIFT), introduced by David Lowe (1999). It is one of the most cited
computer vision papers.
36
A prototypical approach was “bag-of-words” representing images as
histograms of vector-quantised local descriptors (Sivic and Zisserman 2003).
37
As it often happens, it is hard to point to the first RNN design. Recurrence in
neural networks was already described by McCulloch and Pitts (1943), who noted
that “the nervous system contains many circular paths” and referred to “precise
specification of these implications by means of recursive functions and
determination of those that can be embodied in the activity of nervous nets.”
The book of Minsky (1967) uses McCulloch-Pitts neurons and calls recurrent
architectures “networks with cycles.” The technical report of Rumelhart, Hin-
ton, and Williams (1985) (an earlier version of the famous Nature paper (1986)
by the same authors) contains generalisations for learning in RNNs, which are
named “recurrent nets,” and credited to Minsky and Papert (1969).
38
The paper presenting LSTMs was initially rejected from NIPS in 1995 and
appeared only two years later as Hochreiter and Schmidhuber (1997).
39
In particular, the group of Jürgen Schmidhuber developed deep large-scale
CNN models that won several vision competitions, including Chinese character
recognition (Cireșan et al. 2010) and traffic sign recognition (Cireșan et
al. 2012).
40
AlexNet achieved an error over 10.8 percentage points lower than that of the runner-up.
41
AlexNet had eleven layers and was trained on 1.2M images from ImageNet (for
comparison, LeNet-5 had five layers and was trained on 60K MNIST digits).
Additional important changes compared to LeNet-5 were the use of ReLU acti-
vation (instead of tanh), maximum pooling, dropout regularisation, and data
augmentation.
42
It took nearly a week to train AlexNet on a pair of GPUs.
43
Though GPUs were initially designed for graphics applications, they turned
out to be a convenient hardware platform for general-purpose computations
(“GPGPU”). First such works showed linear algebra algorithms (Krüger and
Westermann 2005). The first use of GPUs for neural networks was by Oh and
Jung (2004), predating AlexNet by nearly a decade.
44
Originally Pharmaceutisches Central-Blatt, it was the oldest German
chemical abstracts journal published between 1830 and 1969.
45
Named after Leopold Gmelin who published the first version in 1817,
the Gmelins Handbuch last print edition appeared in the 1990s. The database
currently contains 1.5 million compounds and 1.3 million different reactions
discovered between 1772 and 1995.
46
In 1906, the American Chemical Society authorised the publication of
Chemical Abstracts, charging it with the mission of abstracting the world’s
literature of chemistry and assigning an initial budget of fifteen thousand dol-
lars. The first publication appeared in 1907 under the stewardship of William
Noyes. Over a century since its establishment, the CAS contains nearly 200
million organic and inorganic substances disclosed in publications since the
early 1800s.
47
A special purpose computer (“Electronic Structural Correlator”), to be used
in conjunction with a punch card sorter, was proposed by the Gordon-Kendall-
Davison group in connection with their system of chemical ciphering, but never
built.
48
There were several contemporaneous systems that competed with each
other, see Wiswesser (1968).
49
For example, the association of the benzene ring with odouriferous proper-
ties was the reason for the naming of the chemical class of aromatic compounds
in the 19th century.
50
Not much biographical information is available about Harry Morgan.
According to an obituary, after publishing his famous molecular fingerprints
paper, he moved to a managerial position at IBM, where he stayed until
retirement in 1993. He died in 2007.
51
According to Vladimir Uspensky, Vlăduț told the anecdote of his encounter
with Beilstein in the first lecture of his undergraduate course on organic chem-
istry during his Patterson-Crane Award acceptance speech at the American
Chemical Society.
52
We were unable to find solid proof of whether or how Weisfeiler and
Lehman interacted with Vlăduț, as most of the people who had known both are
now dead. The strongest evidence is a comment in their classical paper (Weis-
feiler and Leman 1968) acknowledging Vlăduț for “formulating the problem.”
It is also certain that Weisfeiler and Lehman were aware of the methods devel-
oped in the chemical community, in particular the method of Morgan (1965),
which they cited in their paper as a “similar procedure.”
53
Andrey Lehman’s surname is often also spelled Leman, a variant that he
preferred himself, stating in an email that the former spelling arose from a
book by the German publisher Springer who believed “that every Leman is
a hidden Lehman.” Since Lehman’s family had Teutonic origins by his own
admission, we stick here to the German spelling.
54
Lehman unsuccessfully attempted to defend a thesis based on his work on
graph isomorphism in 1971; it was rejected due to the personal enmity of
the head of the dissertation committee, with the verdict “it is not mathematics.”
To this, Lehman bitterly responded: “I am not a mathematician, I am a pro-
grammer.” He eventually defended another dissertation in 1973, on topics in
databases.
55
There are in fact multiple versions of the Weisfeiler-Lehman test. The orig-
inal paper described what is now called the “2-WL test”, which is however
equivalent to 1-WL, also known as the node colour refinement algorithm.
56
Since little biographical information is available in English on our heroes,
we will use this note to outline the rest of their careers. All three ended
up in the United States. George Vlăduț applied for emigration in 1974, which
was a shock to his bosses and resulted in his demotion from the head of lab-
oratory post (emigration was considered a “mortal sin” in the USSR; to
people from the West, it is now hard to imagine what an epic effort it used to
be for Soviet citizens). Vlăduț left his family behind and worked at the Insti-
tute for Scientific Information in Philadelphia until his death in 1990. Being
of Jewish origin, Boris Weisfeiler decided to emigrate in 1975 due to growing
official antisemitism in the USSR; the last straw was the refusal to publish a
monograph on which he had worked extensively, as too many authors had “non-
Russian surnames.” He became a professor at Pennsylvania State University
working on algebraic geometry, after a short period of stay at the Institute for
Advanced Study in Princeton. An avid mountaineer, he disappeared during a
hike in Chile in 1985. Andrey Lehman left the USSR in 1990 and subsequently
worked as a programmer in multiple American startups. He died in 2012.
57
In the chemical community, multiple works proposed GNN-like models
including Kireev (1995), Baskin, Palyulin, and Zefirov (1997), and Merkwirth
and Lengauer (2005).
58
It is worth noting that in the field of computer graphics and geometry pro-
cessing, non-Euclidean harmonic analysis predates Graph Signal Processing
by at least a decade. We can trace spectral filters on manifolds and meshes to
the works of Taubin, Zhang, and Golub (1996), Karni and Gotsman (2000),
and Lévy (2006).
59
From a private email sent by Lehman to Ilia Ponomarenko in 1999.