Rust needs BFGS. What is BFGS?

06 Aug 2018
Paul Kernfeld dot com

Rust has the potential to be a language for building world-class machine learning tools, but it’s currently missing some important pieces of mathematical infrastructure. BFGS is one of these missing pieces.

In many popular machine learning algorithms, the goal is to find parameters that minimize the errors between the algorithm’s predictions and the training data. This can be formulated as finding an input that minimizes a loss function. Linear regression, logistic regression, neural networks, and estimation techniques such as maximum likelihood and maximum a posteriori (MAP) estimation can all be formulated as minimization problems.
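For example, ordinary least-squares linear regression fits this mold: it looks for the weight vector w that minimizes the squared error between its predictions and the training targets,

\min_w \| Xw - y \|_2^2

where X holds the training inputs and y the training targets (symbols chosen just for this illustration).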

There is no general “best” way to minimize a function; different kinds of functions require different strategies. However, Python’s scipy and R’s optim both prominently feature an algorithm called BFGS. I’ll explain what BFGS stands for, the problem that it solves, and how it solves it.

Introduction

BFGS stands for Broyden-Fletcher-Goldfarb-Shanno, the names of four researchers who each independently published the algorithm in 1970. I have never heard of so many people independently discovering the same thing at once!

First, a couple of definitions:

- f is the objective function that we want to minimize. It takes a vector of P parameters and returns a scalar.
- g is the gradient of f, a vector of length P.
- P is the number of parameters.

BFGS is useful when f:

- is differentiable, so that g can be computed, and
- has a Hessian that would be too expensive to compute and invert directly at every step.

In high dimensions, it is important to make use of g. Gradient descent is the simplest way to do this: it takes tiny steps in the direction of the negative gradient.
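Here is a minimal sketch of what that might look like with ndarray; the function name, the fixed step size, and the convergence tolerance are arbitrary choices for illustration, not part of the bfgs crate:

use ndarray::Array1;

// A sketch for illustration only; not part of the bfgs crate.
pub fn gradient_descent<F, G>(x0: Array1<f64>, f: F, g: G, step_size: f64) -> Array1<f64>
    where
        F: Fn(&Array1<f64>) -> f64,
        G: Fn(&Array1<f64>) -> Array1<f64>,
{
    let mut x = x0;
    let mut f_x = f(&x);

    loop {
        let f_x_old = f_x;

        // Take a small step in the direction of the negative gradient
        x.scaled_add(-step_size, &g(&x));
        f_x = f(&x);

        // If f only improved by a tiny amount, call it a day
        if (f_x_old - f_x).abs() < 1e-10 {
            return x;
        }
    }
}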

Newton’s method improves on gradient descent by using a second-degree Taylor series to approximate the objective function with a quadratic bowl. Here is what the Newton update looks like:

x_{k+1} = x_k - H(x_k)^{-1} g(x_k)

Here’s how we might write this in Rust:

use ndarray::{Array1, Array2};
use ndarray_linalg::Inverse; // provides .inv()

pub fn newton<F, G, H>(x0: Array1<f64>, f: F, g: G, h: H) -> Array1<f64>
    where
        F: Fn(&Array1<f64>) -> f64,
        G: Fn(&Array1<f64>) -> Array1<f64>,
        H: Fn(&Array1<f64>) -> Array2<f64>,
{
    let mut x = x0;
    let mut f_x = f(&x);

    loop {
        let f_x_old = f_x;

        // Newton step: x <- x - H(x)^-1 g(x)
        x.scaled_add(-1.0, &h(&x).inv().expect("Hessian must be invertible").dot(&g(&x)));

        f_x = f(&x);

        // If f only improved by a small amount, call it a day
        if stop(f_x_old, f_x) {
            return x;
        }
    }
}

This can still be improved in a couple of ways, though. Notably:

- The user has to compute and supply the Hessian, which can be difficult or expensive.
- Inverting the Hessian at every iteration is costly.

The rest of this post is about how BFGS addresses both problems.

Time for math

Feel free to skim over this section if you’d like to get to the code.

The Hessian

The Hessian is the matrix of second derivatives of f, where element H_{i,j} is \frac{\partial^2 f}{\partial x_i \partial x_j}.

A Hessian is always square. In our case, it’s P \times P.

In our case, the Hessian is symmetric because \frac{\partial}{\partial x_i}\left(\frac{\partial f}{\partial x_j}\right) = \frac{\partial}{\partial x_j}\left(\frac{\partial f}{\partial x_i}\right), i.e. the order of differentiation doesn’t matter. This opens up the possibility of storing only half of H, although I haven’t tried to do that.
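For example, for the (made-up, purely illustrative) function f(x) = x_1^2 + 3 x_1 x_2 + 5 x_2^2, the Hessian is the constant matrix

H = \begin{pmatrix} 2 & 3 \\ 3 & 10 \end{pmatrix}

which is symmetric and, as it happens, positive definite.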

Approximating the Hessian

Since we’re only approximating the objective function anyway, we might as well approximate the Hessian too. Methods that do this are called quasi-Newton methods. I’ll call the approximate Hessian B.

When minimizing, it’s important that B is positive definite. This means that our quadratic approximation has a single point that is a global minimum: there are no saddle points, the minimum can’t be a line or a plane, and there can’t be any maxima. If B stops being positive definite, we can reset it to its initial state.

The secant method

In order to approximate the Hessian, we can use the secant method, which is like Newton’s method except that instead of computing the derivative it uses finite differences to approximate it. This lets us approximate H based on successive evaluations of g. Thus, the user never needs to compute the Hessian. Note that we are not using the secant method to approximate g from evaluations of f; we are actually calculating g.
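Concretely, with s_k = x_{k+1} - x_k and y_k = g(x_{k+1}) - g(x_k), the new approximation B_{k+1} is chosen to satisfy the secant condition

B_{k+1} s_k = y_k

which is the multidimensional analogue of estimating a second derivative from a finite difference of first derivatives.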

The secant method in 1d. This animation by Ralf Pfeifer is from Wikimedia Commons and is licensed under CC BY-SA 3.0.

The Sherman-Morrison formula

Inverting the Hessian for each step of the optimization process would be expensive. It would be much nicer if we could work directly with the inverse Hessian, avoiding inversions altogether. The Sherman-Morrison formula provides a way to apply a rank-one update directly to the inverse of a matrix.

(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}
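Here is a minimal sketch of what applying this update might look like with ndarray; the helper name is made up for illustration and is not part of the bfgs crate:

use ndarray::{Array1, Array2};

// Illustrative helper, not part of the bfgs crate: given a_inv = A^-1 and
// vectors u and v, compute (A + u v^T)^-1 without any matrix inversion.
fn sherman_morrison(a_inv: &Array2<f64>, u: &Array1<f64>, v: &Array1<f64>) -> Array2<f64> {
    let a_inv_u = a_inv.dot(u);        // A^-1 u, a vector of length p
    let vt_a_inv = v.dot(a_inv);       // v^T A^-1, a vector of length p
    let denom = 1.0 + v.dot(&a_inv_u); // 1 + v^T A^-1 u, a scalar

    // Build the outer product (A^-1 u)(v^T A^-1) as a column matrix times a row matrix
    let p = u.len();
    let col = a_inv_u.into_shape((p, 1)).unwrap();
    let row = vt_a_inv.into_shape((1, p)).unwrap();

    a_inv - &(col.dot(&row) / denom)
}

BFGS uses this trick (twice, since its update is rank-two) to maintain a running estimate of the inverse Hessian instead of the Hessian itself.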

BFGS

BFGS is a quasi-Newton method that uses the secant method to approximate H from successive evaluations of g, along with the Sherman-Morrison formula to perform the updates efficiently without any matrix inversions. The derivation of the update is kind of complicated, but you can find it on Wikipedia.
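For reference, the resulting update to the inverse approximation, where s_k = x_{k+1} - x_k and y_k = g(x_{k+1}) - g(x_k), is

B_{k+1}^{-1} = B_k^{-1} + \frac{(s_k^T y_k + y_k^T B_k^{-1} y_k) s_k s_k^T}{(s_k^T y_k)^2} - \frac{B_k^{-1} y_k s_k^T + s_k y_k^T B_k^{-1}}{s_k^T y_k}

which is exactly what the code below computes as to_add and to_sub.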

Here is the main loop. I’ve left out some details but the full code is here.

pub fn bfgs<F, G>(x0: Array1<f64>, f: F, g: G) -> Array1<f64>
    where
        F: Fn(&Array1<f64>) -> f64,
        G: Fn(&Array1<f64>) -> Array1<f64>,
{
    let mut x = x0;
    let mut f_x = f(&x);
    let mut g_x = g(&x);

    // The number of parameters
    let p = x.len();

    // Initialize the inverse approximate Hessian to the identity matrix
    let mut b_inv = new_identity_matrix(p);

    loop {
        // Find the search direction
        let search_dir = -1.0 * b_inv.dot(&g_x);

        // Find a good step size
        let epsilon = line_search(|epsilon| f(&(&search_dir * epsilon + &x)));

        // Save the old state
        let f_x_old = f_x;
        let g_x_old = g_x;

        // Take a step in the search direction
        x.scaled_add(epsilon, &search_dir);
        f_x = f(&x);
        g_x = g(&x);

        // Compute deltas between the old and new iterates
        let y = (&g_x - &g_x_old).into_shape((p, 1)).unwrap(); // change in gradient
        let s = (epsilon * search_dir).into_shape((p, 1)).unwrap(); // step taken
        let sy = s.t().dot(&y).into_shape(()).unwrap()[()]; // s^T y, a scalar
        let ss = s.dot(&s.t()); // s s^T, a P x P matrix

        if stop(f_x_old, f_x) {
            return x;
        }

        // Update the inverse Hessian approximation (the BFGS update above)
        let to_add = ss * (sy + &y.t().dot(&b_inv.dot(&y))) / sy.powi(2);
        let to_sub = (b_inv.dot(&y).dot(&s.t()) + s.dot(&y.t().dot(&b_inv))) / sy;
        b_inv = b_inv + to_add - to_sub;
    }
}

I have tested this on a few simple functions, including the Rosenbrock function.
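For example, here is roughly what calling the bfgs function above on the two-dimensional Rosenbrock function, f(x, y) = (1 - x)^2 + 100(y - x^2)^2, might look like; the starting point is arbitrary, and the true minimum is at (1, 1):

use ndarray::{array, Array1};

// Sketch: calls the bfgs function defined above; the published crate's API may differ.
fn main() {
    // The Rosenbrock function and its gradient
    let f = |x: &Array1<f64>| (1.0 - x[0]).powi(2) + 100.0 * (x[1] - x[0].powi(2)).powi(2);
    let g = |x: &Array1<f64>| {
        array![
            -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0].powi(2)),
            200.0 * (x[1] - x[0].powi(2))
        ]
    };

    // The result should be close to [1, 1]
    let x_min = bfgs(array![-1.0, 2.0], f, g);
    println!("{}", x_min);
}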

Limitations

BFGS is not trivially parallelizable, since each iteration depends on the previous one and the work is dominated by matrix arithmetic. However, particular parts of the computation might be parallelizable, e.g. using ndarray-parallel.

The memory usage of BFGS can be high if there are a lot of parameters, since it stores the dense P \times P approximate inverse Hessian.

Future

The line search that I have implemented could be improved quite a bit. In fact, it’s terrible!
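As one illustration of a possible improvement, here is a minimal backtracking (Armijo) line search sketch; it is not what the crate currently does, and the function name and constants are arbitrary:

use ndarray::Array1;

// Illustrative sketch, not the crate's actual line search: halve the step
// until f decreases enough along the search direction.
fn backtracking_line_search<F>(f: F, g_x: &Array1<f64>, search_dir: &Array1<f64>, f_x: f64) -> f64
    where
        F: Fn(f64) -> f64, // evaluates f at x + step * search_dir, like the closure passed to line_search above
{
    let c = 1e-4;                    // sufficient-decrease constant
    let slope = g_x.dot(search_dir); // directional derivative; negative for a descent direction
    let mut step = 1.0;

    for _ in 0..50 {
        // Armijo condition: f must decrease at least proportionally to the step size
        if f(step) <= f_x + c * step * slope {
            return step;
        }
        step *= 0.5;
    }
    step
}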

There are also a few algorithms derived from BFGS that provide various benefits:

- L-BFGS stores only a handful of recent update vectors instead of the full P \times P matrix, which drastically reduces memory usage.
- L-BFGS-B extends L-BFGS to handle bound constraints on the parameters.

Want to help?

If you want to use this from Rust, it’s published as the crate bfgs.

Rustaceans: I think that my code is doing some unnecessary copying and allocation, and I’d love help optimizing it.

Mathematicians: any suggestions for functions that I should use to test this code? In particular, I don’t feel confident that everything is as numerically stable as it should be.

Thanks

To my brother Eric for walking me through most of this math. I did not think that I could ever understand BFGS, but now I mostly understand BFGS. Also thanks to Max Livingston for reviewing.