Solvers

Proximal-Gradient

copt.minimize_proximal_gradient(fun, x0[, …])

Proximal gradient descent.

The proximal-gradient method [BT2009], [N2013] is a method to solve problems of the form

\[\argmin_{\bs{x} \in \mathbb{R}^d} f(\bs{x}) + g(\bs{x})\]

where \(f\) is a differentiable function for which we have access to the gradient, and \(g\) is a potentially non-smooth function for which we have access to the proximal operator.
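For illustration, a minimal usage sketch on a lasso-type problem is given below. The prox and jac keyword arguments, as well as the (x, step_size) convention for the proximal operator, are assumptions that go beyond the abbreviated signature above and should be checked against the solver's docstring:

    import numpy as np
    import copt

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    b = rng.standard_normal(50)
    alpha = 0.1

    def fun(x):
        # smooth part f(x) = 0.5 * ||A x - b||^2, returned with its gradient
        residual = A @ x - b
        return 0.5 * residual @ residual, A.T @ residual

    def prox_l1(x, step_size):
        # proximal operator of g(x) = alpha * ||x||_1 (soft-thresholding);
        # the (x, step_size) signature is an assumption
        return np.sign(x) * np.maximum(np.abs(x) - alpha * step_size, 0.0)

    # prox= and jac=True (fun returns both value and gradient) are assumed
    # keyword arguments; see the solver's docstring for the exact names
    result = copt.minimize_proximal_gradient(fun, np.zeros(20),
                                             prox=prox_l1, jac=True)
    print(result.x)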

References

BT2009

Beck, Amir, and Marc Teboulle. “Gradient-based algorithms with applications to signal recovery.” Convex Optimization in Signal Processing and Communications (2009).

N2013

Nesterov, Yu. “Gradient methods for minimizing composite functions.” Mathematical Programming 140.1 (2013): 125-161.

Primal-dual hybrid gradient

copt.minimize_primal_dual(f_grad, x0[, …])

Primal-dual hybrid gradient splitting method.

The primal-dual hybrid gradient method [C2013] [V2013] [CP2016] is a method to solve problems of the form

\[\argmin_{\bs{x} \in \mathbb{R}^d} f(\bs{x}) + g(\bs{x}) + h(\bs{A}\bs{x})\]

where \(f\) is a differentiable function for which we have access to the gradient, \(g\) and \(h\) are potentially non-smooth functions for which we have access to the proximal operators, and \(\bs{A}\) is a linear operator.
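As an illustration, the sketch below sets up a one-dimensional fused-lasso-type problem in which \(h\) is composed with a finite-difference operator. The keyword names prox_1, prox_2 and L are assumptions about the full signature (abbreviated above), as is the (x, step_size) convention for the proximal operators:

    import numpy as np
    import copt

    rng = np.random.default_rng(0)
    n = 100
    # noisy piecewise-constant signal
    b = (np.arange(n) > n // 2).astype(float) + 0.1 * rng.standard_normal(n)
    alpha, beta = 0.05, 1.0

    # finite-difference operator, (D x)_i = x_{i+1} - x_i
    D = np.diff(np.eye(n), axis=0)

    def f_grad(x):
        # smooth term f(x) = 0.5 * ||x - b||^2 and its gradient
        return 0.5 * np.sum((x - b) ** 2), x - b

    def prox_g(x, step_size):
        # prox of g(x) = alpha * ||x||_1
        return np.sign(x) * np.maximum(np.abs(x) - alpha * step_size, 0.0)

    def prox_h(y, step_size):
        # prox of h(y) = beta * ||y||_1, where y = D x
        return np.sign(y) * np.maximum(np.abs(y) - beta * step_size, 0.0)

    # prox_1, prox_2 and L are assumed keyword names for the proximal
    # operators of g and h and for the linear operator, respectively
    result = copt.minimize_primal_dual(f_grad, np.zeros(n),
                                       prox_1=prox_g, prox_2=prox_h, L=D)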

References

C2013

Condat, Laurent. “A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms.” Journal of Optimization Theory and Applications 158.2 (2013): 460-479.

V2013

Vũ, Bằng Công. “A splitting algorithm for dual monotone inclusions involving cocoercive operators.” Advances in Computational Mathematics 38.3 (2013).

CP2016

Chambolle, Antonin, and Thomas Pock. “An introduction to continuous optimization for imaging.” Acta Numerica 25 (2016).

Three-operator splitting

copt.minimize_three_split(f_grad, x0[, …])

Davis-Yin three operator splitting method.

The three operator splitting [DY2017] [PG2018] is a method to solve problems of the form

\[\argmin_{\bs{x} \in \mathbb{R}^d} f(\bs{x}) + g(\bs{x}) + h(\bs{x})\]

where \(f\) is a differentiable function for which we have access to the gradient, and \(g\) and \(h\) are potentially non-smooth functions for which we have access to the proximal operators.
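For illustration, the sketch below solves a non-negative lasso, where \(g\) is the \(\ell_1\) penalty and \(h\) is the indicator function of the non-negative orthant. The prox_1 and prox_2 keyword names and the (x, step_size) prox convention are assumptions about the full signature (abbreviated above):

    import numpy as np
    import copt

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    b = rng.standard_normal(50)
    alpha = 0.1

    def f_grad(x):
        # smooth term f(x) = 0.5 * ||A x - b||^2 and its gradient
        residual = A @ x - b
        return 0.5 * residual @ residual, A.T @ residual

    def prox_l1(x, step_size):
        # prox of g(x) = alpha * ||x||_1
        return np.sign(x) * np.maximum(np.abs(x) - alpha * step_size, 0.0)

    def prox_nonneg(x, step_size):
        # prox of the indicator of {x >= 0}: projection onto the orthant
        return np.maximum(x, 0.0)

    # prox_1 and prox_2 are assumed keyword names for the proximal
    # operators of g and h
    result = copt.minimize_three_split(f_grad, np.zeros(20),
                                       prox_1=prox_l1, prox_2=prox_nonneg)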

References

DY2017

Davis, Damek, and Wotao Yin. “A three-operator splitting scheme and its optimization applications.” Set-Valued and Variational Analysis, 2017.

PG2018

Pedregosa, Fabian, and Gauthier Gidel. “Adaptive Three Operator Splitting.” Proceedings of the 35th International Conference on Machine Learning, 2018.

Frank-Wolfe

copt.minimize_frank_wolfe(fun, x0, lmo[, …])

Frank-Wolfe algorithm.

The Frank-Wolfe (FW) or conditional gradient algorithm [J2013], [P2018], [PANJ2018] is a method for constrained optimization. It can solve problems of the form

\[\argmin_{\bs{x} \in \mathcal{D}} f(\bs{x})\]

where \(f\) is a differentiable function for which we have access to the gradient and \(\mathcal{D}\) is a compact set for which we have access to the linear minimization oracle (lmo). This is a routine that, given a vector \(\bs{u}\), returns a solution to

\[\argmin_{\bs{x} \in \mathcal{D}}\, \langle\bs{u}, \bs{x}\rangle~.\]
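For example, for the \(\ell_1\) ball of radius \(r\), the linear minimization oracle has a closed form: the minimum of a linear function over the ball is attained at a signed vertex. A standalone NumPy sketch follows (this is not the interface expected by the library, which may differ):

    import numpy as np

    def lmo_l1_ball(u, r=1.0):
        # argmin of <u, x> over the l1 ball ||x||_1 <= r: pick the
        # coordinate with the largest |u_i| and move against its sign
        i = np.argmax(np.abs(u))
        s = np.zeros_like(u)
        s[i] = -r * np.sign(u[i])
        return s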

Contrary to other constrained optimization algorithms like projected gradient descent, the Frank-Wolfe algorithm does not require access to a projection, which is why it is sometimes referred to as a projection-free algorithm. It instead relies exclusively on the linear minimization oracle described above.

The Frank-Wolfe algorithm is implemented in this library in the method copt.minimize_frank_wolfe(). Like most other methods it takes as argument an objective function to minimize, but unlike most other methods it also requires access to a linear minimization oracle, which is a routine that, for a given \(d\)-dimensional vector \(\bs{u}\), solves the linear problem \(\argmin_{\bs{z} \in \mathcal{D}}\, \langle \bs{u}, \bs{z}\rangle\).
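A hypothetical usage sketch is shown below, minimizing a logistic loss over an \(\ell_1\) ball. The copt.constraint.L1Ball helper and the jac=True convention (the objective returns both its value and its gradient) are assumptions that go beyond the abbreviated signature above:

    import numpy as np
    import copt
    from copt import constraint  # assumed location of the L1Ball helper

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 30))
    b = rng.integers(0, 2, size=100).astype(float)

    def fun(x):
        # logistic loss and its gradient
        z = A @ x
        loss = np.mean(np.log1p(np.exp(z)) - b * z)
        grad = A.T @ (1.0 / (1.0 + np.exp(-z)) - b) / A.shape[0]
        return loss, grad

    l1_ball = constraint.L1Ball(2.0)  # feasible set: ||x||_1 <= 2
    result = copt.minimize_frank_wolfe(fun, np.zeros(30), l1_ball.lmo,
                                       jac=True)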

At each iteration, the Frank-Wolfe algorithm uses the linear minimization oracle to identify the vertex \(\bs{s}\) that correlates most with the negative gradient. The next iterate \(\bs{x}^+\) is then constructed as a convex combination of the current iterate \(\bs{x}\) and the newly acquired vertex \(\bs{s}\):

\[\bs{x}^+ = (1 - \gamma)\bs{x} + \gamma \bs{s}\]
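A bare-bones version of this loop looks as follows (a standalone sketch independent of the library, using for simplicity the oblivious 2/(t+2) step-size discussed below):

    def frank_wolfe(grad, lmo, x, max_iter=100):
        # grad(x) returns the gradient; lmo(u) returns a vertex of the
        # feasible set minimizing <u, x>, e.g. lmo_l1_ball above
        for t in range(max_iter):
            s = lmo(grad(x))                  # vertex most correlated with -grad
            gamma = 2.0 / (t + 2.0)           # oblivious step-size
            x = (1 - gamma) * x + gamma * s   # convex combination update
        return x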

The step-size \(\gamma\) can be chosen by different strategies:

  • Backtracking line-search. This is the default option and corresponds to the keyword argument step_size="backtracking". This is typically the fastest and simplest method; if unsure, use this option.

  • Demyanov-Rubinov step-size. This is a step-size of the form

    \[\gamma = \langle \nabla f(\bs{x}), \bs{s} - \bs{x}\rangle / (L \|\bs{s} - \bs{x}\|^2)~.\]

    This step-size typically performs well but has the drawback that it requires knowledge of the Lipschitz constant of \(\nabla f\). It can be selected with the keyword argument step_size="DR", in which case the Lipschitz constant \(L\) needs to be specified through the keyword argument lipschitz. For example, if the Lipschitz constant is 0.1, then the call should include step_size="DR", lipschitz=0.1 (see the sketch after this list).

  • Oblivious step-size. This is a very simple step-size of the form

    \[\gamma = \frac{2}{t+2}~,\]

    where \(t\) is the number of iterations. This step-size is oblivious since it does not use any information about the objective. It typically performs worse than the alternatives, but it is simple to implement and can be competitive in the case of noisy objectives.
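The sketch below shows how the documented keyword arguments select the first two strategies on a toy quadratic objective (the jac=True convention and the L1Ball helper are assumptions, as in the earlier sketch):

    import numpy as np
    import copt
    from copt import constraint  # assumed location of the L1Ball helper

    def fun(x):
        # f(x) = 0.5 * ||x||^2; its gradient is x, with Lipschitz constant 1
        return 0.5 * x @ x, x

    lmo = constraint.L1Ball(1.0).lmo
    x0 = np.zeros(5)

    # default: backtracking line-search
    res_bt = copt.minimize_frank_wolfe(fun, x0, lmo, jac=True,
                                       step_size="backtracking")

    # Demyanov-Rubinov step-size, which requires the Lipschitz constant
    res_dr = copt.minimize_frank_wolfe(fun, x0, lmo, jac=True,
                                       step_size="DR", lipschitz=1.0)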

References

J2013

Jaggi, Martin. “Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization.” ICML 2013.

P2018

Pedregosa, Fabian. “Notes on the Frank-Wolfe Algorithm.” 2018.

PANJ2018

Pedregosa, Fabian, Armin Askari, Geoffrey Negiar, and Martin Jaggi. “Step-Size Adaptivity in Projection-Free Optimization.” arXiv:1806.05123 (2018).

LJ2015

Lacoste-Julien, Simon, and Martin Jaggi. “On the global linear convergence of Frank-Wolfe optimization variants.” Advances in Neural Information Processing Systems. 2015.

Stochastic methods

copt.minimize_saga(f_deriv, A, b, x0, step_size)

Stochastic average gradient augmented (SAGA) algorithm.

copt.minimize_svrg(f_deriv, A, b, x0, step_size)

Stochastic variance-reduced gradient (SVRG) algorithm.

copt.minimize_vrtos(f_deriv, A, b, x0, step_size)

Variance-reduced three operator splitting (VRTOS) algorithm.

copt.minimize_sfw(f_deriv, A, b, x0, lmo[, …])

Stochastic Frank-Wolfe (SFW) algorithm.
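These solvers operate on objectives with the finite-sum structure implied by the data matrix A and target vector b. A heavily hedged sketch of copt.minimize_saga on a logistic-regression problem is given below; it assumes f_deriv(p, y) should return the derivative of the per-sample loss with respect to the linear predictor p = A[i] @ x, and some versions of the library compile the inner loop with numba, in which case the callback may need to be numba-compatible. Check the solver's docstring for the exact conventions:

    import numpy as np
    import copt

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 20))
    b = rng.integers(0, 2, size=200).astype(float)

    def logistic_deriv(p, y):
        # derivative of the per-sample loss log(1 + exp(p)) - y * p
        # with respect to the linear predictor p
        return 1.0 / (1.0 + np.exp(-p)) - y

    # a conservative step-size based on the largest squared row norm of A
    step_size = 1.0 / (3 * np.max(np.sum(A ** 2, axis=1)))
    result = copt.minimize_saga(logistic_deriv, A, b, np.zeros(20), step_size)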