Solvers¶
Proximal-Gradient¶
Proximal gradient descent.
The proximal-gradient method [BT2009], [N2013] solves problems of the form
\[\argmin_{\bs{x} \in \mathbb{R}^d} f(\bs{x}) + g(\bs{x})~,\]
where \(f\) is a differentiable function for which we have access to its gradient and \(g\) is a potentially non-smooth function for which we have access to its proximal operator.
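As a concrete illustration, consider the lasso problem, where \(f(\bs{x}) = \tfrac{1}{2}\|\bs{A}\bs{x} - \bs{b}\|^2\) and \(g(\bs{x}) = \lambda \|\bs{x}\|_1\), whose proximal operator is soft-thresholding. The following NumPy sketch of the update \(\bs{x}^+ = \operatorname{prox}_{\gamma g}(\bs{x} - \gamma \nabla f(\bs{x}))\) is only illustrative (the data, names and step-size choice below are assumptions of this sketch, not the library's implementation):

    import numpy as np

    # Illustrative lasso data: f(x) = 0.5 * ||A x - b||^2, g(x) = lam * ||x||_1
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 20))
    b = rng.standard_normal(50)
    lam = 0.1

    def grad_f(x):
        # gradient of the smooth term f
        return A.T @ (A @ x - b)

    def prox_g(x, gamma):
        # proximal operator of gamma * g, i.e. soft-thresholding
        return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

    gamma = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / L, with L the Lipschitz constant of grad_f
    x = np.zeros(A.shape[1])
    for _ in range(200):
        # forward (gradient) step followed by backward (proximal) step
        x = prox_g(x - gamma * grad_f(x), gamma)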
Examples
References
- BT2009
Beck, Amir, and Marc Teboulle. “Gradient-based algorithms with applications to signal recovery.” Convex Optimization in Signal Processing and Communications (2009).
- N2013
Nesterov, Yu. “Gradient methods for minimizing composite functions.” Mathematical Programming 140.1 (2013): 125-161.
Primal-dual hybrid gradient¶
Primal-dual hybrid gradient splitting method.
The primal-dual hybrid gradient method [C2013], [V2013], [CP2016] solves problems of the form
\[\argmin_{\bs{x} \in \mathbb{R}^d} f(\bs{x}) + g(\bs{x}) + h(\bs{A}\bs{x})~,\]
where \(f\) is a differentiable function for which we have access to its gradient, \(g\) and \(h\) are potentially non-smooth functions for which we have access to their proximal operator, and \(\bs{A}\) is a linear operator.
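For orientation, one standard form of the primal-dual hybrid gradient iteration (the Condat-Vũ scheme of [C2013], [V2013]) alternates a primal proximal-gradient step with a dual proximal step. In the sketch below, \(\tau, \sigma > 0\) are step-size parameters and \(h^*\) denotes the convex conjugate of \(h\); these symbols are introduced only for this sketch:
\[\begin{aligned}
\bs{x}_{t+1} &= \operatorname{prox}_{\tau g}\big(\bs{x}_t - \tau \nabla f(\bs{x}_t) - \tau \bs{A}^\top \bs{y}_t\big)\\
\bs{y}_{t+1} &= \operatorname{prox}_{\sigma h^*}\big(\bs{y}_t + \sigma \bs{A}(2\bs{x}_{t+1} - \bs{x}_t)\big)~.
\end{aligned}\]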
Examples
References
- C2013
Condat, Laurent. “A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms.” Journal of Optimization Theory and Applications 158.2 (2013): 460-479.
- V2013
Vũ, Bằng Công. “A splitting algorithm for dual monotone inclusions involving cocoercive operators.” Advances in Computational Mathematics 38.3 (2013).
- CP2016
Chambolle, Antonin, and Thomas Pock. “An introduction to continuous optimization for imaging.” Acta Numerica 25 (2016).
Three-operator splitting¶
Davis-Yin three operator splitting method.
The three-operator splitting method [DY2017], [PG2018] solves problems of the form
\[\argmin_{\bs{x} \in \mathbb{R}^d} f(\bs{x}) + g(\bs{x}) + h(\bs{x})~,\]
where \(f\) is a differentiable function for which we have access to its gradient and \(g\) and \(h\) are potentially non-smooth functions for which we have access to their proximal operator.
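For orientation, one standard form of the Davis-Yin iteration [DY2017] maintains an auxiliary point \(\bs{z}_t\) and, with a step-size \(\gamma > 0\) and unit relaxation (symbols introduced only for this sketch), performs:
\[\begin{aligned}
\bs{x}_t &= \operatorname{prox}_{\gamma g}(\bs{z}_t)\\
\bs{v}_t &= \operatorname{prox}_{\gamma h}\big(2\bs{x}_t - \bs{z}_t - \gamma \nabla f(\bs{x}_t)\big)\\
\bs{z}_{t+1} &= \bs{z}_t + \bs{v}_t - \bs{x}_t~.
\end{aligned}\]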
Examples
References
- DY2017
Davis, Damek, and Wotao Yin. “A three-operator splitting scheme and its optimization applications.” Set-Valued and Variational Analysis, 2017.
- PG2018
Pedregosa, Fabian, and Gauthier Gidel. “Adaptive Three Operator Splitting.” Proceedings of the 35th International Conference on Machine Learning, 2018.
Frank-Wolfe¶
Frank-Wolfe algorithm.
The Frank-Wolfe (FW) or conditional gradient algorithm [J2013], [P2018], [PANJ2018] is a method for constrained optimization. It can solve problems of the form
\[\argmin_{\bs{x} \in \mathcal{D}} f(\bs{x})~,\]
where \(f\) is a differentiable function for which we have access to its gradient and \(\mathcal{D}\) is a compact set for which we have access to its linear minimization oracle (lmo). This is a routine that, given a vector \(\bs{u}\), returns a solution to
\[\argmin_{\bs{z} \in \mathcal{D}}\, \langle \bs{u}, \bs{z}\rangle~.\]
Contrary to other constrained optimization algorithms like projected gradient descent, the Frank-Wolfe algorithm does not require access to a projection, which is why it is sometimes referred to as a projection-free algorithm. It instead relies exclusively on the linear minimization oracle described above.
The Frank-Wolfe algorithm is implemented in this library in the method copt.minimize_frank_wolfe(). Like most other methods, it takes as argument an objective function to minimize, but unlike most other methods it also requires access to a linear minimization oracle, a routine that for a given \(d\)-dimensional vector \(\bs{u}\) solves the linear problem \(\argmin_{\bs{z} \in \mathcal{D}}\, \langle \bs{u}, \bs{z}\rangle\).
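For example, the lmo of the \(\ell_1\) ball of radius \(r\) puts all of its mass on the coordinate of \(\bs{u}\) with the largest magnitude. The following plain NumPy sketch illustrates the idea only; it is not the oracle interface expected by copt.minimize_frank_wolfe():

    import numpy as np

    def l1_ball_lmo(u, radius=1.0):
        # argmin of <u, z> over ||z||_1 <= radius: a signed vertex of the l1 ball
        s = np.zeros_like(u)
        i = np.argmax(np.abs(u))
        s[i] = -radius * np.sign(u[i])
        return s

The returned point is always a vertex of the ball, which is what makes Frank-Wolfe iterates sparse in this setting.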
At each iteration, the Frank-Wolfe algorithm uses the linear minimization oracle to identify the vertex \(\bs{s}_t\) that correlates most with the negative gradient. The next iterate \(\bs{x}^+\) is then constructed as a convex combination of the current iterate \(\bs{x}\) and the newly acquired vertex \(\bs{s}\):
\[\bs{x}^+ = (1 - \gamma)\,\bs{x} + \gamma\,\bs{s}~.\]
The step-size \(\gamma\) can be chosen by different strategies:
- Backtracking line-search. This is the default option and corresponds to the keyword argument step_size="backtracking". It is typically the fastest and simplest strategy; if unsure, use this option.
- Demyanov-Rubinov step-size. This is a step-size of the form
\[\gamma = \langle \nabla f(\bs{x}), \bs{x} - \bs{s}\rangle / (L \|\bs{x} - \bs{s}\|^2)~.\]
It typically performs well, but has the drawback that it requires knowledge of the Lipschitz constant of \(\nabla f\). This step-size is selected with the keyword argument step_size="DR"; the Lipschitz constant \(L\) then needs to be specified through the keyword argument lipschitz. For example, if the Lipschitz constant is 0.1, the call should include step_size="DR", lipschitz=0.1.
- Oblivious step-size. This is the very simple step-size of the form
\[\gamma = \frac{2}{t+2}~,\]
where \(t\) is the iteration counter. This step-size is oblivious since it uses no information gathered about the objective during the run. It typically performs worse than the alternatives, but it is simple to implement and can be competitive for noisy objectives (a sketch of the plain Frank-Wolfe loop with this step-size follows this list).
Examples
References
- J2013
Jaggi, Martin. “Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization.” ICML 2013.
- P2018
Pedregosa, Fabian. “Notes on the Frank-Wolfe Algorithm.” 2018.
- PANJ2018
Pedregosa, Fabian, Armin Askari, Geoffrey Negiar, and Martin Jaggi. “Step-Size Adaptivity in Projection-Free Optimization.” arXiv:1806.05123 (2018).
- LJ2015
Lacoste-Julien, Simon, and Martin Jaggi. “On the global linear convergence of Frank-Wolfe optimization variants.” Advances in Neural Information Processing Systems. 2015.
Stochastic methods¶
- Stochastic average gradient augmented (SAGA) algorithm.
- Stochastic average gradient augmented (SAGA) algorithm.
- Variance-reduced three operator splitting (VRTOS) algorithm.
- Stochastic Frank-Wolfe (SFW) algorithm.