# Probabilistic Learning for Geological Modeling

Problem: model geology deep underground.
Data: measurements from sparse wells.

One deterministic answer is not an option!

# Probabilistic Learning for Black Box Optimization

Problem: decide where to evaluate the black-box function next in order to minimize it.

We can learn the unknown function, but we also need to quantify uncertainty.

# Gaussian Processes

Definition. A Gaussian process is a random function $f : X \times \Omega \to \R$ such that for any $x_1,..,x_n$, the vector $(f(x_1),..,f(x_n))$ is multivariate Gaussian.

The distribution of a Gaussian process is characterized by

• a mean function $m(x) = \E(f(x))$,
• a kernel (covariance) function $k(x, x') = \Cov(f(x), f(x'))$.

Notation: $f \sim \f{GP}(m, k)$.

# Gaussian Process Regression

Takes

• a prior Gaussian process $\f{GP}(m, k)$
• and data $(x_1, y_1), .., (x_n, y_n) \in X \x \R$,

giving the posterior Gaussian process $\f{GP}(\tilde{m}, \tilde{k})$.

The functions $\tilde{m}$ and $\tilde{k}$ may be explicitly expressed in terms of $m$ and $k$.

# Another Application: Bayesian Optimization

Goal: minimize an unknown function $\phi$ in as few evaluations as possible.

1. Build GP posterior $f \given \v{y}$ using data $(x_1,\phi(x_1)),..,(x_n,\phi(x_n))$.
2. Choose $$\htmlClass{fragment}{ x_{n+1} = \argmax_{x\in\c{X}} \underbrace{\alpha_{f\given\v{y}}(x)}_{\text{acquisition function}}. }$$ For instance, the expected improvement $$\alpha_{f\given\v{y}}(x) = \E_{f\given\v{y}} \max(0, {\displaystyle\min_{i=1,..,n}} \phi(x_i) - f(x)).$$
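Under a Gaussian posterior $f(x) \given \v{y} \sim \f{N}(\mu(x), \sigma(x)^2)$, expected improvement has a closed form. A minimal sketch, assuming NumPy/SciPy; the names `post_mean`, `post_std`, and `candidates` are illustrative placeholders:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_y):
    """Closed-form EI for minimization: E[max(0, best_y - f(x))]
    with f(x) | y ~ N(mean, std**2); assumes std > 0."""
    gap = best_y - mean
    z = gap / std
    return gap * norm.cdf(z) + std * norm.pdf(z)

# Picking the next evaluation point over a finite candidate set (illustrative):
# x_next = candidates[np.argmax(expected_improvement(post_mean, post_std, y.min()))]
```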

# Which Algorithms?

• Conditioning
• Sampling

Complexity: $O(n^3)$ for conditioning and $O(n^3 + s^3)$ for sampling, with $n$ data points and $s$ sampling nodes.

# Which Algorithms?

• Conditioning
• Sampling
• And hyperparameter optimization!

Complexity: $O(n^3 + s^3)$ for $n$ data points and $s$ sampling nodes.

# Conditioning Gaussian Processes

Assume we have data $(x_1, y_1), .., (x_n, y_n)$ and the standard Bayesian model $$y_i = f(x_i) + \varepsilon_i, \qquad f \sim \f{GP}(0, k), \qquad \varepsilon_i \stackrel{\text{i.i.d.}}{\sim} \f{N}(0, \sigma^2).$$

# Conditioning Gaussian Processes

The posterior Gaussian process $\tilde{f} \sim \f{GP}(\tilde{m}, \tilde{k})$ is then used for predictions \begin{aligned} \tilde{m}(a)~=~&k(a, \v{x}) \del{k(\v{x}, \v{x}) + \sigma^2 I}^{-1} \v{y}, \\ \tilde{k}(a, b) ~=~ k(a, b) - &\underbrace{k(a, \v{x})}_{\text{vector}~1 \x n} (\underbrace{k(\v{x}, \v{x}) + \sigma^2 I}_{\text{matrix}~n \x n})^{-1} \underbrace{k(\v{x}, b)}_{\text{vector}~n \x 1}. \end{aligned}

Notation: $\v{x} \stackrel{\text{def}}{=} (x_1, .., x_n)^{\top}$ and $k(\v{x}, \v{x}) \stackrel{\text{def}}{=} \del{k(x_i, x_j)}_{\substack{1 \leq i \leq n \\ 1 \leq j \leq n}}$.
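These formulas map directly onto a Cholesky-based implementation. A minimal NumPy sketch, assuming a squared-exponential kernel and an illustrative noise level:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix k(a, b), a of shape (n, d), b of shape (m, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(x, y, a, noise_var=1e-2, kernel=rbf_kernel):
    """Posterior mean m~(a) and covariance k~(a, a) given data (x, y)."""
    K = kernel(x, x) + noise_var * np.eye(len(x))        # k(x, x) + sigma^2 I
    L = np.linalg.cholesky(K)                            # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (k(x, x) + sigma^2 I)^{-1} y
    K_ax = kernel(a, x)
    mean = K_ax @ alpha                                  # m~(a)
    V = np.linalg.solve(L, K_ax.T)
    cov = kernel(a, a) - V.T @ V                         # k~(a, b)
    return mean, cov
```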

# Efficient Approximate Conditioning: Idea

Idea. Approximate the posterior process $\f{GP}(\tilde{m}, \tilde{k})$ with a simpler process belonging to some parametric family $\f{GP}(m_{\theta}, k_{\theta})$.

Take a vector $\v{z} = (z_1, .., z_m)^{\top}$ of pseudo-locations, $m \ll n$.
Consider pseudo-observations $\v{u} = (u_1, .., u_m)^{\top} \sim \f{N}(\v{\mu}, \m{\Sigma})$.

Put $\theta = (\v{z}, \v{\mu}, \m{\Sigma})$ and let $f_{\theta}$ be the posterior under the Bayesian model \begin{aligned} u_i = f(z_i), && f \sim \f{GP}(0, k), && \v{u} = (u_1, .., u_m)^{\top} \sim \f{N}(\v{\mu}, \m{\Sigma}). \end{aligned}

# Efficient Approximate Conditioning: Approximate Posterior

We have $f_{\theta} \sim \f{GP}(m_{\theta}, k_{\theta})$ with \begin{alignedat}{3} m_{\theta}(a) &= k(a, \v{z}) &&k(\v{z}, \v{z})^{-1} \v{\mu}, \\ k_{\theta}(a, b) &= k(a, b) &&- k(a, \v{z}) k(\v{z}, \v{z})^{-1} k(\v{z}, b) \\ &&&+ k(a, \v{z}) \underbrace{ k(\v{z}, \v{z})^{-1} \m{\Sigma} k(\v{z}, \v{z})^{-1} }_{\text{matrix}~m \x m} k(\v{z}, b). \end{alignedat} If $\v{z}, \v{\mu}, \m{\Sigma}$ are given, computing the approximate posterior costs $O(m^3)$.
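In code, the approximate posterior needs only $m \times m$ solves. A minimal sketch in the same spirit, reusing an arbitrary `kernel` function (e.g. the one sketched earlier); the jitter term is an illustrative numerical safeguard:

```python
import numpy as np

def approx_posterior(z, mu, Sigma, a, kernel, jitter=1e-9):
    """Mean m_theta(a) and covariance k_theta(a, a) of the approximate posterior,
    given pseudo-locations z, pseudo-mean mu and pseudo-covariance Sigma."""
    K_zz = kernel(z, z) + jitter * np.eye(len(mu))
    K_az = kernel(a, z)
    A = np.linalg.solve(K_zz, K_az.T)                # K_zz^{-1} k(z, a), an m x m solve
    mean = K_az @ np.linalg.solve(K_zz, mu)          # m_theta(a)
    cov = kernel(a, a) - K_az @ A + A.T @ Sigma @ A  # k_theta(a, b)
    return mean, cov
```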

# Efficient Conditioning: Optimization Problem

The optimization problem is $D_{KL}(f_{\theta}||\tilde{f}) \to \min_{\theta}$. \begin{aligned} D_{KL}(f_{\theta}||\tilde{f}) \stackrel{\text{thm}}{=}& D_{KL}(f_{\theta}(\v{x} \oplus \v{z})||\tilde{f}(\v{x} \oplus \v{z})) \\ \stackrel{\text{thm}}{=}& D_{KL}(\f{N}(\v{\mu}, \m{\Sigma})||\f{N}(0, k(\v{z}, \v{z}))) \\ & - \sum_{i=1}^n \E_{f_{\theta}} \ln p_{\f{N}(f(x_i), \sigma^2)}(y_i) + C. \end{aligned}

It can be solved by stochastic gradient descent with a per-iteration cost of $O(m^3)$.
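Both terms of the objective are available in closed form: the KL divergence between two Gaussians, and the expected log-density of a Gaussian observation. A minimal sketch of the objective, reusing the `approx_posterior` sketch above; in practice $\m{\Sigma}$ is parameterized through its Cholesky factor and the sum over data points is replaced by minibatches:

```python
import numpy as np

def gaussian_kl(mu, Sigma, K_zz):
    """KL( N(mu, Sigma) || N(0, K_zz) ) between m-dimensional Gaussians."""
    m = len(mu)
    trace_term = np.trace(np.linalg.solve(K_zz, Sigma))
    mahalanobis = mu @ np.linalg.solve(K_zz, mu)
    _, logdet_K = np.linalg.slogdet(K_zz)
    _, logdet_S = np.linalg.slogdet(Sigma)
    return 0.5 * (trace_term + mahalanobis - m + logdet_K - logdet_S)

def objective(z, mu, Sigma, x, y, noise_var, kernel):
    """D_KL(f_theta || f~) up to the constant C: Gaussian KL minus the sum of
    expected log-densities E ln N(y_i; f(x_i), sigma^2) under f_theta."""
    mean, cov = approx_posterior(z, mu, Sigma, x, kernel)
    var = np.diag(cov)
    expected_loglik = (-0.5 * np.log(2 * np.pi * noise_var)
                       - ((y - mean) ** 2 + var) / (2 * noise_var))
    return gaussian_kl(mu, Sigma, kernel(z, z)) - expected_loglik.sum()
```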

# Efficient Conditioning: Result

Result. We approximate the posterior process $\f{GP}(\tilde{m}, \tilde{k})$ with a simpler process belonging to a certain parametric family $\f{GP}(m_{\theta}, k_{\theta})$.

$O(n^3)$ turns into $O(m^3)$, where $m$ controls accuracy.

# Naive Sampling

Take $f \sim \f{GP}(m, k)$ and a finite set of locations $\v{v} = (v_1, .., v_s)$.

Restricting $f$ to $\v{v}$ we get the random vector $f(\v{v}) \sim \f{N}(m(\v{v}), k(\v{v}, \v{v}))$.

Then \begin{aligned} f(\v{v}) \stackrel{\text{d}}{=} m(\v{v}) + k(\v{v}, \v{v})^{1/2} \v{\varepsilon}, && \v{\varepsilon} \sim \f{N}(\v{0}, \m{I}) \end{aligned}.

Problem: it takes $O(s^3)$ to find $k(\v{v}, \v{v})^{1/2}$ and the sample lives on a grid.
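A minimal sketch of this naive sampler, using a Cholesky factor as the matrix square root (the jitter is an illustrative numerical safeguard):

```python
import numpy as np

def sample_naive(m_v, K_vv, n_samples=1, jitter=1e-9):
    """Draw samples of f(v) ~ N(m(v), k(v, v)) via a Cholesky square root."""
    s = len(m_v)
    L = np.linalg.cholesky(K_vv + jitter * np.eye(s))    # the O(s^3) step
    eps = np.random.randn(s, n_samples)                  # eps ~ N(0, I)
    return m_v[:, None] + L @ eps                        # columns are samples
```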

# Efficient Sampling From Priors

Priors are usually assumed stationary. If $f \sim \f{GP}(0, k)$ is stationary, then \begin{aligned} k(a, b) = \int e^{2 \pi i \langle a-b, \xi \rangle} s(\xi) \mathrm{d} \xi \approx \frac{1}{l} \sum_{j=1}^l e^{2 \pi i \langle a-b, \xi_j \rangle}, && \xi_j \stackrel{\text{i.i.d.}}{\sim} s(\xi). \end{aligned}

This gives an approximation of the random process itself \begin{aligned} f(a) \approx \frac{1}{\sqrt{l}} \sum_{j=1}^l w_j e^{2 \pi i \langle a, \xi_j \rangle}, && w_j \stackrel{\text{i.i.d.}}{\sim} \f{N}(0,1), && \xi_j \stackrel{\text{i.i.d.}}{\sim} s(\xi). \end{aligned}

It takes only $O(s l)$ to sample on a grid of size $s$. And we do not need a grid!
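A minimal real-valued sketch of this random-feature construction, trading the complex exponential for cosine and sine features; the returned callable can be evaluated at arbitrary points, so no grid is needed:

```python
import numpy as np

def rff_prior_sample(xi, seed=None):
    """One approximate prior sample built from l frequencies xi of shape (l, d),
    drawn i.i.d. from the spectral density s."""
    rng = np.random.default_rng(seed)
    l = len(xi)
    w = rng.standard_normal(2 * l)        # N(0, 1) weights for cos and sin features
    def f(a):                             # a has shape (s, d)
        phase = 2 * np.pi * a @ xi.T      # (s, l)
        feats = np.concatenate([np.cos(phase), np.sin(phase)], axis=1)
        return feats @ w / np.sqrt(l)     # O(s l) per evaluation
    return f

# For the squared-exponential kernel the spectral density is Gaussian, so one may take
# xi = np.random.default_rng().standard_normal((l, d)) / (2 * np.pi * lengthscale).
```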

# Matheron's Rule

The posterior Gaussian process $\tilde{f} \sim \f{GP}(\tilde{m}, \tilde{k})$ can be expressed by $$\tilde{f}(a) \stackrel{\text{d}}{=} f(a) + k(a, \v{x}) \del{k(\v{x}, \v{x}) + \sigma^2 I}^{-1} (\v{y} - f(\v{x}) -\v{\varepsilon}),$$ where $f \sim \f{GP}(0, k)$ is the prior and $\v{\varepsilon} \sim \f{N}(\v{0}, \sigma^2 \m{I})$.

With this, sampling on a grid of size $s$ with $n$ data points costs $O(s l + n^3)$.
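A minimal sketch of a Matheron-rule sampler, combining a prior sample (e.g. from the random-feature sketch above) with the data-driven update; the $O(n^3)$ cost sits in the linear solve:

```python
import numpy as np

def matheron_posterior_sample(f_prior, x, y, a, noise_var, kernel):
    """One posterior sample at points a, given a prior sample f_prior (a callable)."""
    n = len(x)
    K = kernel(x, x) + noise_var * np.eye(n)              # k(x, x) + sigma^2 I
    eps = np.sqrt(noise_var) * np.random.randn(n)         # eps ~ N(0, sigma^2 I)
    residual = y - f_prior(x) - eps
    update = kernel(a, x) @ np.linalg.solve(K, residual)  # the O(n^3) step
    return f_prior(a) + update
```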

Can we improve the $O(n^3)$ scaling in this case?

# Efficient Sampling From Posteriors

Proposition. A sample from $f_{\theta}$ may be obtained as follows.

1. Sample $\v{u} = (u_1, .., u_m)^{\top} \sim \f{N}(\v{\mu}, \m{\Sigma})$.
2. Return a sample from the posterior Gaussian process with $(z_1, u_1), .., (z_m, u_m)$ as noise-free data.

$O(n^3 + s^3)$ turns into $O(m^3 + s l)$, where $m \ll n$ and $l$ control accuracy.
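A minimal sketch of the resulting sampler, reusing the earlier prior-sample and kernel sketches; pseudo-observations are treated as noise-free data in Matheron's rule:

```python
import numpy as np

def pathwise_posterior_sample(f_prior, z, mu, Sigma, a, kernel, jitter=1e-9):
    """One approximate posterior sample at points a via pseudo-observations."""
    m = len(mu)
    L = np.linalg.cholesky(Sigma + jitter * np.eye(m))
    u = mu + L @ np.random.randn(m)                       # step 1: u ~ N(mu, Sigma)
    K_zz = kernel(z, z) + jitter * np.eye(m)              # the O(m^3) step
    update = kernel(a, z) @ np.linalg.solve(K_zz, u - f_prior(z))
    return f_prior(a) + update                            # step 2: Matheron's rule
```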

# An Application: Modeling Dynamical Systems with Uncertainty

$$\htmlData{fragment-index=0,class=fragment}{ x_0 } \qquad \htmlData{fragment-index=1,class=fragment}{ x_1 = x_0 + f(x_0)\Delta t } \qquad \htmlData{fragment-index=2,class=fragment}{ x_2 = x_1 + f(x_1)\Delta t } \qquad \htmlData{fragment-index=3,class=fragment}{ .. }$$
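A minimal sketch of such a rollout, where `f_sample` is one function drawn from the posterior (e.g. via the pathwise sampler above); repeating the rollout with independent samples propagates the uncertainty about $f$ into uncertainty about trajectories:

```python
def euler_rollout(f_sample, x0, dt, n_steps):
    """Euler steps x_{k+1} = x_k + f(x_k) * dt along one sampled drift f."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + f_sample(xs[-1]) * dt)
    return xs
```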

# Viacheslav Borovitskiy (Slava)

viacheslav.borovitskiy@gmail.com                       https://vab.im