Bayes Coffee Talk | University of Edinburgh

Do Deep Ensembles Capture Uncertainty
in Graph Neural Networks?

Viacheslav (Slava) Borovitskiy

https://vab.im

Talk Outline

Introduction

  • Motivating uncertainty quantification
  • Main approaches for uncertainty quantification
  • Graph Machine Learning
  • Origins of the research question

Do Deep Ensembles Capture Uncertainty in Graph Neural Networks?

  • Benchmarking
  • Analysis
  • Hypothesis

Summary and Future Work

Motivating uncertainty quantification

Motivating uncertainty quantification

:

Model-based decision-making

Model-based decision-making

Probabilistic models: input distribution (prediction + uncertainty)

Allows computing queries like probability of improvement: $\!\alpha(x) \!=\! \mathbb{P} \! \del{f(x) \!>\!\! f^*}$

They can drive selection of the next point to evaluate [Bayesian optimization]

Main approaches for uncertainty quantification

Main approaches for uncertainty quantification

Bayesian neural networks

Deep ensembles

$\underbrace{\hphantom{\text{Bayesian Neural Networks}\qquad{Deep ensembles}}}_{\text{defined by neural network architectures}}$

Gaussian processes

$\underbrace{\hphantom{\text{Gaussian Processes}}}_{\text{defined by kernels}}$

Graph Machine Learning

Includes many different settings.

Node-level tasks

Graph with features
$ \rightarrow $
Graph with labels

Graph-level tasks

Graph with features
$ \rightarrow $
$ y $

Also: edge-level tasks, graph generation, etc.

Example setting: Node regression

Example setting: Node regression

$$ f : V \to \R $$

$$ f\Big(\smash{\includegraphics[height=2.75em,width=1.5em]{figures/g1.svg}}\Big) = 2 $$

$$ f\Big(\smash{\includegraphics[height=2.75em,width=1.5em]{figures/g2.svg}}\Big) = 3 $$

$$ f\Big(\smash{\includegraphics[height=2.75em,width=1.5em]{figures/g3.svg}}\Big) = 5 $$

Example setting: Probabilistic node regression

$$ f : V \to \text{distributions over } \R $$

$$ f\Big(\smash{\includegraphics[height=2.75em,width=1.5em]{figures/g1.svg}}\Big) = \f{N}(2, 0.5^2) $$

$$ f\Big(\smash{\includegraphics[height=2.75em,width=1.5em]{figures/g2.svg}}\Big) = \f{N}(3, 0.2^2) $$

$$ f\Big(\smash{\includegraphics[height=2.75em,width=1.5em]{figures/g3.svg}}\Big) = \f{N}(5, 0.7^2) $$

Origins of the research question

Benchmark: Caltrans Performance Measurement System (PeMS)

San Jose highway network: graph with 1016 nodes

325 labeled nodes with known traffic speed in miles per hour

Use 250 labeled nodes for training data and 75 for test data

Dataset details: Borovitskiy et al. (AISTATS 2021)

Trivial baseline

Trivial baseline

: prediction

: uncertainty

Results

Results

Figure: Uncertainty estimates of a geometric Gaussian process.     

PEMS benchmark: GNN ensemble fails miserably w.r.t. NLL.

Do Deep Ensembles Capture Uncertainty in Graph Neural Networks?

Do Deep Ensembles Capture Uncertainty in Graph Neural Networks?

Is the failure on PEMS an isolated anomaly?

Or is it symptomatic of a broader incompatibility
between deep ensembles and graph neural networks?

We investigated this for message passing graph neural networks.
(GCN, GAT, etc.)

A Comprehensive Benchmark

Evaluating across diverse tasks, scales, and structural properties.

Node Classification

  • Cora & Citeseer
    (Classic, Small)
  • Tolokers2
    (Modern, larger)

Node Regression

  • Artnetviews
    (Modern, larger)
  • Chameleon
    (Benign heterophily)

Graph Regression

  • QM9-5% HOMO-LUMO
    (Classic, data-scarce)

Evaluating Predictive Uncertainty

Our primary metric is the Negative Log-Likelihood (NLL).

Conceptually:

$$ \text{NLL} \approx \text{Prediction Error} + \text{Uncertainty Calibration} $$

To achieve a good NLL, a model must be accurate
AND "know what it doesn't know".

Deep Ensembles: Graphs vs. Computer Vision

How much more likelihood does the ensemble assign to test data?

Classification Likelihood Ratios
Likelihood ratios $\exp(\mathrm{NLL}_{\mathrm{GNN}}-\mathrm{NLL}_{\mathrm{DE}})$.
Regression Likelihood Ratios
Likelihood ratios $\exp(\mathrm{NLL}_{\mathrm{GNN}}-\mathrm{NLL}_{\mathrm{DE}})$.

Computer Vision

Typical improvements: ~20%

Graph Neural Networks

Improvements: often as low as 0.1%

The gains exist, but they are surprisingly marginal. Why?

Disentangling NLL Gains

Do deep ensembles improve uncertainty, or just point predictions?

Deconstructing NLL Gains

NLL

  • DE-R Baseline: ensemble's mean prediction + variance of a single model.
  • Observation: DE-R — nearly identical NLL improvements to DE.

Conclusion: Ensembling on graphs acts primarily as a variance-reducing smoother for point predictions, not a better uncertainty estimator.

How do Deep Ensembles capture uncertainty?

Total predictive uncertainty breaks down into:

Aleatoric Uncertainty

Irreducible uncertainty
(e.g., measurement errors)

DE: Averaging individual model uncertainties

Epistemic Uncertainty

The model's lack of knowledge
(e.g., in regions with no training data)

DE: Captured by disagreement among models

If individual models in an ensemble do not disagree,
they cannot capture epistemic uncertainty.

The Root Cause: Epistemic Collapse

Let's decompose the uncertainty of our GNN ensembles.

GNNs consistently converge to highly similar predictions.
Epistemic uncertainty collapses to near-zero.

Because they do not meaningfully disagree, they neutralize the very mechanism that makes deep ensembles work (for uncertainty quantification).

Why do GNNs Collapse? (Hypothesis)

Deep ensembles rely on a highly non-convex loss to find diverse solutions.

Weight space: Non-convex

Averaging weights degrades performance dramatically. Thus individual models live in different regions of the weight space.

$\rightarrow$

Function space: Convex-like

Despite having completely different weights, individual models all implement virtually the same predictive function.

Hypothesis: The structural inductive bias of message passing acts as a homogenizing force, severely restricting functional diversity.

Summary & Future Work

Summary & Future Work

Key Takeaways

  • Message passing GNN ensembles only offer marginal NLL gains (in contrast to deep ensembles in computer vision).
  • Improvements come from smoothing predictions, not qualitatively better uncertainty estimates.
  • Epistemic collapse: Functional convexity destroys the core mechanism through which ensembles improve uncertainty.

Future Directions

  • How does message passing induce "convexity"?
  • What about graph transformers?
Main author
Pedro C. Vieira
Paper Info
Do Deep Ensembles Actually Capture Uncertainty in Graph Neural Networks?
P. C. Vieira, P. Ribeiro, V. Borovitskiy
(Available on arXiv)

Thank you!