Solvers

Proximal-Gradient

copt.minimize_proximal_gradient(fun, x0[, …])

Proximal gradient descent.

The proximal-gradient method [BT2009], [N2013] is a method to solve problems of the form

\[\argmin_{\bs{x} \in \mathbb{R}^d} f(\bs{x}) + g(\bs{x})\]

where \(f\) is a differentiable function for which we have access to the gradient, and \(g\) is a potentially non-smooth function for which we have access to the proximal operator.
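For illustration, a minimal usage sketch on a lasso-type problem is given below. The prox and jac keyword arguments, as well as the (x, step_size) convention for the proximal operator, are assumptions that go beyond the abbreviated signature above and should be checked against the solver's docstring:

    import numpy as np
    import copt

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    b = rng.standard_normal(50)
    alpha = 0.1

    def fun(x):
        # smooth part f(x) = 0.5 * ||A x - b||^2, returned with its gradient
        residual = A @ x - b
        return 0.5 * residual @ residual, A.T @ residual

    def prox_l1(x, step_size):
        # proximal operator of g(x) = alpha * ||x||_1 (soft-thresholding);
        # the (x, step_size) signature is an assumption
        return np.sign(x) * np.maximum(np.abs(x) - alpha * step_size, 0.0)

    # prox= and jac=True (fun returns both value and gradient) are assumed
    # keyword arguments; see the solver's docstring for the exact names
    result = copt.minimize_proximal_gradient(fun, np.zeros(20),
                                             prox=prox_l1, jac=True)
    print(result.x)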

References

BT2009

Beck, Amir, and Marc Teboulle. “Gradient-based algorithms with applications to signal recovery.” Convex Optimization in Signal Processing and Communications (2009).

N2013

Nesterov, Yu. “Gradient methods for minimizing composite functions.” Mathematical Programming 140.1 (2013): 125-161.

Primal-dual hybrid gradient

copt.minimize_primal_dual(f_grad, x0[, …])

Primal-dual hybrid gradient splitting method.

The primal-dual hybrid gradient method [C2013] [V2013] [CP2016] is a method to solve problems of the form

\[\argmin_{\bs{x} \in \mathbb{R}^d} f(\bs{x}) + g(\bs{x}) + h(\bs{A}\bs{x})\]

where \(f\) is a differentiable function for which we have access to the gradient, \(g\) and \(h\) are potentially non-smooth functions for which we have access to the proximal operators, and \(\bs{A}\) is a linear operator.
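As an illustration, the sketch below sets up a one-dimensional fused-lasso-type problem in which \(h\) is composed with a finite-difference operator. The keyword names prox_1, prox_2 and L are assumptions about the full signature (abbreviated above), as is the (x, step_size) convention for the proximal operators:

    import numpy as np
    import copt

    rng = np.random.default_rng(0)
    n = 100
    # noisy piecewise-constant signal
    b = (np.arange(n) > n // 2).astype(float) + 0.1 * rng.standard_normal(n)
    alpha, beta = 0.05, 1.0

    # finite-difference operator, (D x)_i = x_{i+1} - x_i
    D = np.diff(np.eye(n), axis=0)

    def f_grad(x):
        # smooth term f(x) = 0.5 * ||x - b||^2 and its gradient
        return 0.5 * np.sum((x - b) ** 2), x - b

    def prox_g(x, step_size):
        # prox of g(x) = alpha * ||x||_1
        return np.sign(x) * np.maximum(np.abs(x) - alpha * step_size, 0.0)

    def prox_h(y, step_size):
        # prox of h(y) = beta * ||y||_1, where y = D x
        return np.sign(y) * np.maximum(np.abs(y) - beta * step_size, 0.0)

    # prox_1, prox_2 and L are assumed keyword names for the proximal
    # operators of g and h and for the linear operator, respectively
    result = copt.minimize_primal_dual(f_grad, np.zeros(n),
                                       prox_1=prox_g, prox_2=prox_h, L=D)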

References

C2013

Condat, Laurent. “A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms.” Journal of Optimization Theory and Applications 158.2 (2013): 460-479.

V2013

Vũ, Bằng Công. “A splitting algorithm for dual monotone inclusions involving cocoercive operators.” Advances in Computational Mathematics 38.3 (2013).

CP2016

Chambolle, Antonin, and Thomas Pock. “An introduction to continuous optimization for imaging.” Acta Numerica 25 (2016).

Three-operator splitting

copt.minimize_three_split(f_grad, x0[, …])

Davis-Yin three operator splitting method.

The three operator splitting [DY2017] [PG2018] is a method to solve problems of the form

\[\argmin_{\bs{x} \in \mathbb{R}^d} f(\bs{x}) + g(\bs{x}) + h(\bs{x})\]

where \(f\) is a differentiable function for which we have access to the gradient, and \(g\) and \(h\) are potentially non-smooth functions for which we have access to the proximal operators.
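For illustration, the sketch below solves a non-negative lasso, where \(g\) is the \(\ell_1\) penalty and \(h\) is the indicator function of the non-negative orthant. The prox_1 and prox_2 keyword names and the (x, step_size) prox convention are assumptions about the full signature (abbreviated above):

    import numpy as np
    import copt

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    b = rng.standard_normal(50)
    alpha = 0.1

    def f_grad(x):
        # smooth term f(x) = 0.5 * ||A x - b||^2 and its gradient
        residual = A @ x - b
        return 0.5 * residual @ residual, A.T @ residual

    def prox_l1(x, step_size):
        # prox of g(x) = alpha * ||x||_1
        return np.sign(x) * np.maximum(np.abs(x) - alpha * step_size, 0.0)

    def prox_nonneg(x, step_size):
        # prox of the indicator of {x >= 0}: projection onto the orthant
        return np.maximum(x, 0.0)

    # prox_1 and prox_2 are assumed keyword names for the proximal
    # operators of g and h
    result = copt.minimize_three_split(f_grad, np.zeros(20),
                                       prox_1=prox_l1, prox_2=prox_nonneg)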

References

DY2017

Davis, Damek, and Wotao Yin. “A three-operator splitting scheme and its optimization applications.” Set-Valued and Variational Analysis, 2017.

PG2018

Pedregosa, Fabian, and Gauthier Gidel. “Adaptive Three Operator Splitting.” Proceedings of the 35th International Conference on Machine Learning, 2018.

Frank-Wolfe

copt.minimize_frank_wolfe(fun, x0, lmo[, …])

Frank-Wolfe algorithm.

The Frank-Wolfe (FW) or conditional gradient algorithm [J2013], [P2018], [PANJ2018] is a method for constrained optimization. It can solve problems of the form

\[\argmin_{\bs{x} \in \mathcal{D}} f(\bs{x})\]

where \(f\) is a differentiable function for which we have access to the gradient and \(\mathcal{D}\) is a compact set for which we have access to the linear minimization oracle (lmo). This is a routine that, given a vector \(\bs{u}\), returns a solution to

\[\argmin_{\bs{x} \in \mathcal{D}}\, \langle\bs{u}, \bs{x}\rangle~.\]
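For example, for the \(\ell_1\) ball of radius \(r\), the linear minimization oracle has a closed form: the minimum of a linear function over the ball is attained at a signed vertex. A standalone NumPy sketch follows (this is not the interface expected by the library, which may differ):

    import numpy as np

    def lmo_l1_ball(u, r=1.0):
        # argmin of <u, x> over the l1 ball ||x||_1 <= r: pick the
        # coordinate with the largest |u_i| and move against its sign
        i = np.argmax(np.abs(u))
        s = np.zeros_like(u)
        s[i] = -r * np.sign(u[i])
        return s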

Contrary to other constrained optimization algorithms like projected gradient descent, the Frank-Wolfe algorithm does not require access to a projection, which is why it is sometimes referred to as a projection-free algorithm. It instead relies exclusively on the linear minimization oracle described above.

The Frank-Wolfe algorithm is implemented in this library in the method copt.minimize_frank_wolfe(). Like most other methods it takes as argument an objective function to minimize, but unlike most other methods it also requires access to a linear minimization oracle, which is a routine that, for a given \(d\)-dimensional vector \(\bs{u}\), solves the linear problem \(\argmin_{\bs{z} \in \mathcal{D}}\, \langle \bs{u}, \bs{z}\rangle\).
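A hypothetical usage sketch is shown below, minimizing a logistic loss over an \(\ell_1\) ball. The copt.constraint.L1Ball helper and the jac=True convention (the objective returns both its value and its gradient) are assumptions that go beyond the abbreviated signature above:

    import numpy as np
    import copt
    from copt import constraint  # assumed location of the L1Ball helper

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 30))
    b = rng.integers(0, 2, size=100).astype(float)

    def fun(x):
        # logistic loss and its gradient
        z = A @ x
        loss = np.mean(np.log1p(np.exp(z)) - b * z)
        grad = A.T @ (1.0 / (1.0 + np.exp(-z)) - b) / A.shape[0]
        return loss, grad

    l1_ball = constraint.L1Ball(2.0)  # feasible set: ||x||_1 <= 2
    result = copt.minimize_frank_wolfe(fun, np.zeros(30), l1_ball.lmo,
                                       jac=True)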

At each iteration, the Frank-Wolfe algorithm uses the linear minimization oracle to identify the vertex \(\bs{s}\) that correlates most with the negative gradient. The next iterate \(\bs{x}^+\) is then constructed as a convex combination of the current iterate \(\bs{x}\) and the newly acquired vertex \(\bs{s}\):

\[\bs{x}^+ = (1 - \gamma)\bs{x} + \gamma \bs{s}\]
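A bare-bones version of this loop looks as follows (a standalone sketch independent of the library, using for simplicity the oblivious 2/(t+2) step-size discussed below):

    def frank_wolfe(grad, lmo, x, max_iter=100):
        # grad(x) returns the gradient; lmo(u) returns a vertex of the
        # feasible set minimizing <u, x>, e.g. lmo_l1_ball above
        for t in range(max_iter):
            s = lmo(grad(x))                  # vertex most correlated with -grad
            gamma = 2.0 / (t + 2.0)           # oblivious step-size
            x = (1 - gamma) * x + gamma * s   # convex combination update
        return x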

The step-size \(\gamma\) can be chosen by different strategies:

  • Backtracking line-search. This is the default option and corresponds to the keyword argument step_size="backtracking". This is typically the fastest and simplest method; if unsure, use this option.

  • Demyanov-Rubinov step-size. This is a step-size of the form

    \[\gamma = \langle \nabla f(\bs{x}), \bs{s} - \bs{x}\rangle / (L \|\bs{s} - \bs{x}\|^2)~.\]

    This step-size typically performs well but has the drawback that it requires knowledge of the Lipschitz constant of \(\nabla f\). It can be selected with the keyword argument step_size="DR", in which case the Lipschitz constant \(L\) needs to be specified through the keyword argument lipschitz. For example, if the Lipschitz constant is 0.1, then the call should include step_size="DR", lipschitz=0.1 (see the sketch after this list).

  • Oblivious step-size. This is a very simple step-size of the form

    \[\gamma = \frac{2}{t+2}~,\]

    where \(t\) is the number of iterations. This step-size is oblivious since it does not use any information about the objective. It typically performs worse than the alternatives, but it is simple to implement and can be competitive in the case of noisy objectives.
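The sketch below shows how the documented keyword arguments select the first two strategies on a toy quadratic objective (the jac=True convention and the L1Ball helper are assumptions, as in the earlier sketch):

    import numpy as np
    import copt
    from copt import constraint  # assumed location of the L1Ball helper

    def fun(x):
        # f(x) = 0.5 * ||x||^2; its gradient is x, with Lipschitz constant 1
        return 0.5 * x @ x, x

    lmo = constraint.L1Ball(1.0).lmo
    x0 = np.zeros(5)

    # default: backtracking line-search
    res_bt = copt.minimize_frank_wolfe(fun, x0, lmo, jac=True,
                                       step_size="backtracking")

    # Demyanov-Rubinov step-size, which requires the Lipschitz constant
    res_dr = copt.minimize_frank_wolfe(fun, x0, lmo, jac=True,
                                       step_size="DR", lipschitz=1.0)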

References

J2013

Jaggi, Martin. “Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization.” ICML 2013.

P2018

Pedregosa, Fabian. “Notes on the Frank-Wolfe Algorithm.” 2018.

PANJ2018

Pedregosa, Fabian, Armin Askari, Geoffrey Negiar, and Martin Jaggi. “Step-Size Adaptivity in Projection-Free Optimization.” arXiv:1806.05123 (2018).

LJ2015

Lacoste-Julien, Simon, and Martin Jaggi. “On the global linear convergence of Frank-Wolfe optimization variants.” Advances in Neural Information Processing Systems. 2015.

Stochastic methods

copt.minimize_saga(f_deriv, A, b, x0, step_size)

Stochastic average gradient augmented (SAGA) algorithm.

copt.minimize_svrg(f_deriv, A, b, x0, step_size)

Stochastic variance-reduced gradient (SVRG) algorithm.

copt.minimize_vrtos(f_deriv, A, b, x0, step_size)

Variance-reduced three operator splitting (VRTOS) algorithm.

copt.minimize_sfw(f_deriv, A, b, x0, lmo[, …])

Stochastic Frank-Wolfe (SFW) algorithm.
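These solvers operate on objectives with the finite-sum structure implied by the data matrix A and target vector b. A heavily hedged sketch of copt.minimize_saga on a logistic-regression problem is given below; it assumes f_deriv(p, y) should return the derivative of the per-sample loss with respect to the linear predictor p = A[i] @ x, and some versions of the library compile the inner loop with numba, in which case the callback may need to be numba-compatible. Check the solver's docstring for the exact conventions:

    import numpy as np
    import copt

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 20))
    b = rng.integers(0, 2, size=200).astype(float)

    def logistic_deriv(p, y):
        # derivative of the per-sample loss log(1 + exp(p)) - y * p
        # with respect to the linear predictor p
        return 1.0 / (1.0 + np.exp(-p)) - y

    # a conservative step-size based on the largest squared row norm of A
    step_size = 1.0 / (3 * np.max(np.sum(A ** 2, axis=1)))
    result = copt.minimize_saga(logistic_deriv, A, b, np.zeros(20), step_size)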