Centro Pi Seminar, IMPA
On a Common Ground Between
Geological Modeling and Machine Learning
Viacheslav Borovitskiy (Slava)

Problem: model geology deep underground.
Data: measurements from sparse wells.
One deterministic answer is not an option!
Definition. A Gaussian process is a random function $f : X \times \Omega \to \mathbb{R}$ such that for any $x_1, \ldots, x_n$, the vector $(f(x_1), \ldots, f(x_n))$ is multivariate Gaussian.
The distribution of a Gaussian process is characterized by its mean function $m(x) = \mathbb{E} f(x)$ and its covariance (kernel) function $k(x, x') = \mathrm{Cov}(f(x), f(x'))$.
Notation: f∼GP(m,k).
Gaussian process regression takes data and a prior $f \sim \mathrm{GP}(m, k)$, giving the posterior Gaussian process $\mathrm{GP}(\tilde m, \tilde k)$.
The functions m~ and k~ may be explicitly expressed in terms of m and k.
Goal: minimize unknown function ϕ in as few evaluations as possible.
Assume we have data $(x_1, y_1), \ldots, (x_n, y_n)$ and the standard Bayesian model
$$y_i = f(x_i) + \varepsilon_i, \qquad f \sim \mathrm{GP}(0, k), \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$
The posterior Gaussian process $\tilde f \sim \mathrm{GP}(\tilde m, \tilde k)$ is then used for predictions:
$$\tilde m(a) = \underbrace{k(a,x)}_{\text{vector } 1 \times n} \big(\underbrace{k(x,x) + \sigma^2 I}_{\text{matrix } n \times n}\big)^{-1} \underbrace{y}_{\text{vector } n \times 1},$$
$$\tilde k(a,b) = k(a,b) - \underbrace{k(a,x)}_{\text{vector } 1 \times n} \big(\underbrace{k(x,x) + \sigma^2 I}_{\text{matrix } n \times n}\big)^{-1} \underbrace{k(x,b)}_{\text{vector } n \times 1}.$$
Notation: $x \overset{\text{def}}{=} (x_1, \ldots, x_n)^\top$ and $k(x,x) \overset{\text{def}}{=} \big(k(x_i, x_j)\big)_{1 \le i, j \le n}$.
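As an illustration, the closed-form posterior above fits in a few lines of NumPy. This is a minimal sketch, not the speaker's code; the kernel choice and the names `rbf` and `gp_posterior` are my own.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-d location arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(a, x, y, noise=0.1):
    """Posterior mean and covariance at test locations a, given data (x, y); O(n^3)."""
    K = rbf(x, x) + noise**2 * np.eye(len(x))      # k(x,x) + sigma^2 I, n x n
    K_ax = rbf(a, x)                               # cross-covariances, s x n
    mean = K_ax @ np.linalg.solve(K, y)            # k(a,x)(k(x,x)+sigma^2 I)^{-1} y
    cov = rbf(a, a) - K_ax @ np.linalg.solve(K, K_ax.T)
    return mean, cov
```

Note that the $n \times n$ solve is what makes exact GP regression scale as $O(n^3)$.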
Idea. Approximate the posterior process GP(m~,k~) with a simpler process belonging to some parametric family GP(mθ,kθ).
Take a vector z=(z1,..,zm)⊤ of pseudo-locations, m≪n.
Consider pseudo-observations u=(u1,..,um)⊤∼N(μ,Σ).
Put $\theta = (z, \mu, \Sigma)$ and define $\tilde f_\theta$ to be the posterior under the Bayesian model
$$u_i = f(z_i), \qquad f \sim \mathrm{GP}(0, k), \qquad u = (u_1, \ldots, u_m)^\top \sim N(\mu, \Sigma).$$
We have $\tilde f_\theta \sim \mathrm{GP}(m_\theta, k_\theta)$ with
$$m_\theta(a) = k(a,z)\, k(z,z)^{-1} \mu,$$
$$k_\theta(a,b) = k(a,b) - k(a,z)\, k(z,z)^{-1} k(z,b) + k(a,z)\, \underbrace{k(z,z)^{-1} \Sigma\, k(z,z)^{-1}}_{\text{matrix } m \times m}\, k(z,b).$$
If $z, \mu, \Sigma$ are given, computing the approximate posterior is $O(m^3)$.
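The pseudo-point formulas above translate directly to NumPy. A hedged sketch, with an illustrative squared-exponential kernel and names of my choosing:

```python
import numpy as np

def rbf(a, b):
    """Squared-exponential kernel matrix."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def sparse_posterior(a, z, mu, Sigma, jitter=1e-8):
    """Mean and covariance of the approximate posterior GP(m_theta, k_theta)
    at locations a, for pseudo-observations u ~ N(mu, Sigma) at z; O(m^3)."""
    K_zz = rbf(z, z) + jitter * np.eye(len(z))     # m x m
    K_az = rbf(a, z)
    A = np.linalg.solve(K_zz, K_az.T)              # k(z,z)^{-1} k(z,a)
    mean = K_az @ np.linalg.solve(K_zz, mu)
    cov = rbf(a, a) - K_az @ A + A.T @ Sigma @ A
    return mean, cov
```

Only $m \times m$ systems are solved, which is the source of the $O(m^3)$ cost.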
The optimization problem is $D_{\mathrm{KL}}(\tilde f_\theta \,\|\, \tilde f) \to \min$. By a theorem,
$$D_{\mathrm{KL}}(\tilde f_\theta \,\|\, \tilde f) \overset{\text{thm}}{=} D_{\mathrm{KL}}\big(\tilde f_\theta(x \oplus z) \,\|\, \tilde f(x \oplus z)\big) \overset{\text{thm}}{=} D_{\mathrm{KL}}\big(N(\mu, \Sigma) \,\|\, N(0, k(z,z))\big) - \sum_{i=1}^{n} \mathbb{E} \ln p_{N(f(x_i), \sigma^2)}(y_i) + C.$$
Can be solved by stochastic gradient descent with iteration cost O(m3).
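For the Gaussian likelihood used here, both terms of the objective have closed forms (the Gaussian-to-Gaussian KL and the expected log-density under the marginals of $\tilde f_\theta$). A sketch of one objective evaluation, under my own naming and an illustrative RBF kernel; in practice one would differentiate this with autodiff and run stochastic gradient descent over minibatches:

```python
import numpy as np

def rbf(a, b):
    """Squared-exponential kernel matrix."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def neg_elbo(x, y, z, mu, Sigma, noise=0.1, jitter=1e-8):
    """KL(N(mu,Sigma) || N(0,k(z,z))) minus the expected log likelihood:
    the quantity minimized over theta = (z, mu, Sigma); O(m^3) per call."""
    m = len(z)
    K_zz = rbf(z, z) + jitter * np.eye(m)
    K_inv = np.linalg.inv(K_zz)
    kl = 0.5 * (np.trace(K_inv @ Sigma) + mu @ K_inv @ mu - m
                + np.linalg.slogdet(K_zz)[1] - np.linalg.slogdet(Sigma)[1])
    A = rbf(x, z) @ K_inv                          # k(x,z) k(z,z)^{-1}
    mean = A @ mu                                  # marginal means of f(x_i)
    var = 1.0 - np.sum(A * rbf(x, z), axis=1) + np.sum((A @ Sigma) * A, axis=1)
    exp_ll = (-0.5 * np.log(2 * np.pi * noise**2)
              - ((y - mean) ** 2 + var) / (2 * noise**2))
    return kl - exp_ll.sum()
```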
Result. We approximate the posterior process GP(m~,k~) with a simpler process belonging to a certain parametric family GP(mθ,kθ).
$O(n^3)$ turns into $O(m^3)$, where $m$ controls accuracy.
Take $f \sim \mathrm{GP}(m, k)$ and a finite set of locations $v = (v_1, \ldots, v_s)$.
Restricting f to v we get the random vector f(v)∼N(m(v),k(v,v)).
Then $f(v) \overset{d}{=} m(v) + k(v,v)^{1/2} \varepsilon, \quad \varepsilon \sim N(0, I)$.
Problem: it takes O(s3) to find k(v,v)1/2 and the sample lives on a grid.
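The Cholesky-based sampler just described can be sketched as follows (a toy illustration under my own naming; the cubic cost comes from the Cholesky factorization):

```python
import numpy as np

def sample_on_grid(v, kernel, rng, jitter=1e-6):
    """Draw f(v) = k(v,v)^{1/2} eps (zero prior mean) via Cholesky; O(s^3)."""
    K = kernel(v[:, None], v[None, :]) + jitter * np.eye(len(v))
    L = np.linalg.cholesky(K)                  # matrix square root k(v,v)^{1/2}
    return L @ rng.standard_normal(len(v))

rng = np.random.default_rng(0)
v = np.linspace(0.0, 1.0, 50)
sample = sample_on_grid(v, lambda a, b: np.exp(-0.5 * (a - b) ** 2), rng)
```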
Priors are usually assumed stationary. If $f \sim \mathrm{GP}(0, k)$ is stationary, then
$$k(a,b) = \int e^{2\pi i \langle a - b,\, \xi \rangle} s(\xi)\, d\xi \approx \frac{1}{l} \sum_{j=1}^{l} e^{2\pi i \langle a - b,\, \xi_j \rangle}, \qquad \xi_j \overset{\text{i.i.d.}}{\sim} s(\xi).$$
This gives an approximation of the random process itself:
$$f(a) = \frac{1}{\sqrt{l}} \sum_{j=1}^{l} w_j\, e^{2\pi i \langle a,\, \xi_j \rangle}, \qquad w_j \overset{\text{i.i.d.}}{\sim} N(0, 1), \quad \xi_j \overset{\text{i.i.d.}}{\sim} s(\xi).$$
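A sketch of this random-feature sampler, using the standard real-valued cosine variant of the sum above (my own choice, equivalent in distribution) and assuming the squared-exponential kernel, whose spectral density in the $e^{2\pi i \langle \cdot, \xi \rangle}$ convention is $N(0, 1/(2\pi)^2)$:

```python
import numpy as np

def rff_prior_sample(a, l=2000, rng=None):
    """Random-Fourier-feature prior sample for k(a,b) = exp(-0.5 (a-b)^2).
    Real-valued cosine variant; costs O(s l) for s locations, grid or not."""
    rng = rng or np.random.default_rng(0)
    xi = rng.standard_normal(l) / (2 * np.pi)   # xi_j ~ s(xi) = N(0, 1/(2 pi)^2)
    b = rng.uniform(0.0, 2 * np.pi, l)          # random phases
    w = rng.standard_normal(l)                  # w_j ~ N(0, 1)
    feats = np.sqrt(2.0 / l) * np.cos(2 * np.pi * np.outer(a, xi) + b)
    return feats @ w
```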
It takes only $O(sl)$ to sample on a grid of size $s$. And we do not need a grid!
The posterior Gaussian process $\tilde f \sim \mathrm{GP}(\tilde m, \tilde k)$ can be expressed by
$$\tilde f(a) \overset{d}{=} f(a) + k(a,x)\big(k(x,x) + \sigma^2 I\big)^{-1}\big(y - f(x) - \varepsilon\big), \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$
With this, sampling on a grid of size $s$ with $n$ data points costs $O(sl + n^3)$.
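The pathwise update above can be sketched in NumPy; this toy version (names and kernel are mine) draws the joint prior sample by brute-force Cholesky rather than random features, just to keep the update itself visible:

```python
import numpy as np

def rbf(a, b):
    """Squared-exponential kernel matrix."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def pathwise_posterior_sample(a, x, y, f_a, f_x, rng, noise=0.1):
    """Map a joint prior sample (f_a at locations a, f_x at data locations x)
    to a posterior sample at a; the data term costs O(n^3) once."""
    eps = noise * rng.standard_normal(len(x))      # simulated observation noise
    K = rbf(x, x) + noise**2 * np.eye(len(x))      # k(x,x) + sigma^2 I
    return f_a + rbf(a, x) @ np.linalg.solve(K, y - f_x - eps)

rng = np.random.default_rng(0)
x = np.array([0.25, 0.8]); y = np.array([0.5, -0.5])
a = np.linspace(0.0, 1.0, 20)
pts = np.concatenate([a, x])                       # joint prior draw at a and x
L = np.linalg.cholesky(rbf(pts, pts) + 1e-6 * np.eye(len(pts)))
f = L @ rng.standard_normal(len(pts))
post = pathwise_posterior_sample(a, x, y, f[:20], f[20:], rng)
```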
Can we improve O(n3) scaling in this case?
Proposition. A sample from $\tilde f_\theta$ may be obtained as follows.
$$x_1 = x_0 + f(x_0)\, \Delta t, \qquad x_2 = x_1 + f(x_1)\, \Delta t, \qquad \ldots$$
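The recursion above is the explicit Euler scheme. A toy sketch with a deterministic stand-in field (in the talk's setting, $f$ would be a sample from $\tilde f_\theta$; the function name is mine):

```python
import numpy as np

def euler_trajectory(f, x0, dt=0.1, steps=10):
    """Iterate x_{k+1} = x_k + f(x_k) * dt starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + f(xs[-1]) * dt)
    return np.array(xs)

traj = euler_trajectory(lambda x: -x, x0=1.0)      # toy vector field f(x) = -x
```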