Bayesian Optimization with GPopt

Today, give a try to Techtonique web app, a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. Here is a tutorial with audio, video, code, and slides: https://moudiki2.gumroad.com/l/nrhgb

Due to the way it mixes several – relatively – simple concepts, Bayesian optimization (BO) is one of the most elegant mathematical tool I’ve encountered so far. In this post, I introduce GPopt, a tool for BO that I implemented in Python (no technical docs yet, but coming soon). The examples of GPopt’s use showcased here are based on Gaussian Processes (GP) and Expected Improvement (EI): what does that mean?

GPs are Bayesian statistical/machine learning (ML) models which create a distribution on functions, and especially on black-box functions. If we let $f$ be the black-box – and expensive-to-evaluate – function whose minimum is searched, a GP is firstly adjusted (in a supervised learning way) to a small set of points at which $f$ is evaluated. This small set of points is the initial design. Then the GP, thanks to its probabilistic nature, will exploit its knowledge of previous points at which $f$ has already been evaluated, to explore new points: potential minimas maximizing an EI criterion. It’s a reinforcement learning question!

For more details on Bayesian optimization applied to hyperparameters calibration in ML, you can read Chapter 6 of this document. In this post, a Branin (2D) and a Hartmann (3D) functions will be used as examples of objective functions $f$ , and Matérn 5/2 is the GP’s covariance.

Installing and importing the packages:

!pip install GPopt

!pip install matplotlib==3.1.3

import GPopt as gp 
import time
import sys
import numpy as np
import matplotlib.pyplot as plt
from os import chdir
from scipy.optimize import minimize

The objective function to be minimized:

# branin function
def branin(x):
    x1 = x[0]
    x2 = x[1]
    term1 = (x2 - (5.1*x1**2)/(4*np.pi**2) + (5*x1)/np.pi - 6)**2
    term2 = 10*(1-1/(8*np.pi))*np.cos(x1)    
    return (term1 + term2 + 10)

# A local minimum for verification
print("\n")
res = minimize(branin, x0=[0, 0], method='Nelder-Mead', tol=1e-6)
print(res.x)
print(branin(res.x))

[3.14159271 2.27500017]
0.397887357729795

Firstly, I demonstrate the convergence of the optimizer towards a minimum (and show that the optimizer can pick up the procedure where it left):

# n_init is the number of points in the initial design 
# n_iter is the number of iterations of the optimizer 
gp_opt = gp.GPOpt(objective_func=branin, 
                lower_bound = np.array([-5, 0]), 
                 upper_bound = np.array([10, 15]),
                 n_init=10, n_iter=10)    
gp_opt.optimize(verbose=1)

Plotting the changes in expected improvement as a function of the number of iterations.

print(gp_opt.x_min) # current minimum
print(gp_opt.y_min) # current minimum
plt.plot(np.diff(gp_opt.max_ei))

[9.31152344 1.68457031]
0.9445903336427559

image-title-here

Adding more iterations to the optimizer:

gp_opt.optimize(verbose=1, n_more_iter=10)

 ...Done. 

 Optimization loop... 

10/10 [██████████████████████████████] - 2s 186ms/step

(array([3.22692871, 2.63122559]), 0.6107733232129569)

Plotting the changes in expected improvement as a function of the number of iterations (again).

print(gp_opt.x_min) # current minimum
print(gp_opt.y_min) # current minimum
plt.plot(np.diff(gp_opt.max_ei))

[3.22692871 2.63122559]
0.6107733232129569

image-title-here

Adding more iterations to the optimizer (again):

gp_opt.optimize(verbose=1, n_more_iter=80)

Plotting the changes in expected improvement as a function of the number of iterations (again).

print(gp_opt.x_min) # current minimum
print(gp_opt.y_min) # current minimum
plt.plot(np.diff(gp_opt.max_ei))

[9.44061279 2.48199463]
0.3991320518189241

image-title-here

The 3 previous graphs suggest the possibility of stopping the optimizer earlier, when the algorithm is not improving on previous points’ results anymore:

# # early stopping
# abs_tol is the parameter that controls early stopping

gp_opt2 = gp.GPOpt(objective_func=branin, 
                  lower_bound = np.array([-5, 0]), 
                 upper_bound = np.array([10, 15]),
                 n_init=10, n_iter=190)    
gp_opt2.optimize(verbose=2, abs_tol=1e-4) 
print("\n")

We can observe that only 58 iterations are necessary when abs_tol = 1e-4

print(gp_opt2.n_iter)
print(gp_opt2.x_min)
print(gp_opt2.y_min)

58
[9.44061279 2.48199463]
0.3991320518189241

Illustrating convergence:

plt.plot(gp_opt2.max_ei)

image-title-here

We notice that in this example, GPopt falls into a local minimum but is very close to the previous minimum found with gradient-free optimizer (Nelder-Mead). The opposite situation can occur too:

# [0, 1]^3
def hartmann3(xx):
    
    alpha = np.array([1.0, 1.2, 3.0, 3.2])
    
    A = np.array([3.0, 10, 30, 
                  0.1, 10, 35,
                  3.0, 10, 30,
                  0.1, 10, 35]).reshape(4, 3)
            
    P = 1e-4 * np.array([3689, 1170, 2673,
                        4699, 4387, 7470, 
                        1091, 8732, 5547, 
                        381, 5743, 8828]).reshape(4, 3)

    xxmat = np.tile(xx,4).reshape(4, 3)
    
    inner = np.sum(A*(xxmat-P)**2, axis = 1)
    outer = np.sum(alpha * np.exp(-inner))

    return(-outer)

# Fails, but may work with multiple restarts
print("\n")
res = minimize(hartmann3, x0=[0, 0, 0], method='Nelder-Mead', tol=1e-6)
print(res.x)
print(hartmann3(res.x))

[0.36872308 0.11756145 0.26757363]
-1.00081686355956

# hartmann 3D
gp_opt4 = gp.GPOpt(objective_func=hartmann3, 
                lower_bound = np.repeat(0, 3), 
                upper_bound = np.repeat(1, 3), 
                 n_init=20, n_iter=280)    
gp_opt4.optimize(verbose=2, abs_tol=1e-4)

print(gp_opt4.n_iter)
print(gp_opt4.x_min)
print(gp_opt4.y_min)
print("\n")

51
[0.07220459 0.55792236 0.85662842]
-3.8600590626769904

The question is, in the case of BO applied to ML cross-validation, do we really want to find the global minimum of the objective function?