The Bregman method is an iterative method for solving convex optimization problems with equality constraints. The linearized Bregman method approximates the subproblems in Bregman iterations with a linearized version. The split Bregman method decouples the variables in the original problem to make the subproblems in Bregman iterations much easier to solve, combining the advantages of alternating minimization and the Bregman method. In this post, I’ll use the denominator layout for gradients.
What is Bregman Divergence
The Bregman divergence (or Bregman distance) of a convex function $F$ between points $x$ and $y$ is defined as the difference between $F(x)$ and the first-order Taylor expansion of $F$ around $y$:

$$D_F^p(x, y) = F(x) - F(y) - p^T (x - y) \tag{1}$$

where $p \in \partial F(y)$ is a subgradient.
Here I list some of its properties and more can be found on wiki:
- $D_F^p(x, y) \ge 0$
- $D_F^p(y, y) = 0$
- $D_F^p(x, y) + D_F^q(y, z) - D_F^q(x, z) = (p - q)^T (y - x)$, where $q \in \partial F(z)$
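To make the definition concrete, here is a small numerical sketch (the function choice is mine, not from the post): for $F(x) = \|x\|_2^2$ the Bregman divergence reduces to the squared Euclidean distance.

```python
import numpy as np

# D_F^p(x, y) = F(x) - F(y) - p^T (x - y), with p a (sub)gradient at y.
def bregman_div(F, grad_F, x, y):
    return F(x) - F(y) - grad_F(y) @ (x - y)

# Example choice (not from the post): F(x) = ||x||_2^2, so p = 2y and
# the divergence works out to ||x - y||_2^2.
F = lambda v: v @ v
grad_F = lambda v: 2.0 * v

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 5.0])

d = bregman_div(F, grad_F, x, y)
print(d)                             # 6.0, which equals ||x - y||_2^2
print(bregman_div(F, grad_F, y, y))  # 0.0: D(y, y) = 0
```

The first two properties above can be checked numerically in the same way for any convex $F$.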
Bregman Method
Consider the following constrained problem:
$$\arg\min_x F(x) \quad \text{s.t.} \quad G(x) = 0 \tag{2}$$
where $F(x)$ and $G(x)$ are convex, $G(x)$ is differentiable, and $\min_x G(x) = 0$.
An intuitive way to solve the above problem is to transform it into an unconstrained form:
$$\arg\min_x F(x) + \lambda G(x) \tag{3}$$
where $\lambda \to \infty$ enforces $G(x) \approx 0$. However, for many problems a large $\lambda$ makes the problem numerically unstable, and it is also difficult to determine how “large” $\lambda$ must be.
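A quick numerical sketch of this instability (the specific $F$, $G$, and matrix are my own illustrative choices): with $F(x) = \frac{1}{2}\|x\|_2^2$ and $G(x) = \frac{1}{2}\|Ax - b\|_2^2$, the Hessian of $F + \lambda G$ is $I + \lambda A^T A$, and its condition number blows up as $\lambda$ grows whenever $A$ has a nontrivial null space.

```python
import numpy as np

# Hessian of F + lambda*G for F(x) = 0.5*||x||^2, G(x) = 0.5*||Ax - b||^2.
# A is wide (rank 2 out of 4 columns), so A^T A is singular and the
# conditioning of the penalized problem degrades as lambda grows.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))

for lam in (1e0, 1e3, 1e6):
    H = np.eye(4) + lam * A.T @ A
    print(f"lambda={lam:.0e}  cond(H)={np.linalg.cond(H):.2e}")
```

The condition number grows roughly linearly in $\lambda$, which is exactly the numerical trouble the Bregman method avoids by keeping $\lambda$ fixed and moderate.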
The Bregman method turns Equation (2) into a sequence of unconstrained problems. Rather than choosing a large $\lambda$, it approaches the equality constraint with arbitrary precision by solving subproblems iteratively, with a fixed and smaller $\lambda$.
Instead of solving Equation (3), the Bregman method solves the following problem:
$$x^{k+1} = \arg\min_x D_F^{p^k}(x, x^k) + \lambda G(x) = \arg\min_x F(x) - (p^k)^T x + \lambda G(x) \tag{4}$$
By the subgradient optimality condition, we know that:
$$0 \in \partial F(x^{k+1}) - p^k + \lambda \nabla G(x^{k+1}) \tag{5}$$
Let $p^{k+1} \in \partial F(x^{k+1})$; the Bregman iteration then reads:
$$\begin{aligned} x^{k+1} &= \arg\min_x F(x) - (p^k)^T x + \lambda G(x) \\ p^{k+1} &= p^k - \lambda \nabla G(x^{k+1}) \end{aligned} \tag{6}$$
where $x^0 = \arg\min_x G(x)$ and $p^0 = 0$.
It can be shown that $G(x^k) \to 0$ as $k \to \infty$, so the minimization problem in Equation (6) approaches the original problem in Equation (2).
If $G(x) = \frac{1}{2}\|Ax - b\|_2^2$, then $\nabla G(x^{k+1}) = A^T(Ax^{k+1} - b)$, and Equation (6) can be transformed into a simpler form:
$$\begin{aligned} x^{k+1} &= \arg\min_x F(x) + \frac{\lambda}{2}\|Ax - b^k\|_2^2 \\ b^{k+1} &= b^k + b - Ax^{k+1} \end{aligned} \tag{7}$$
which simply adds the error in the constraint back to the right-hand side. Equation (7) can be proved by merging $(p^k)^T x$ into $G(x)$ and substituting $p^k$ with its explicit form $p^k = p^0 - \lambda \sum_{i=1}^{k} \nabla G(x^i)$.
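As a sketch of this “add the error back” iteration, take $F(x) = \frac{1}{2}\|x\|_2^2$ so that the subproblem in Equation (7) has a closed-form solution (the matrix, right-hand side, and $\lambda$ are illustrative choices, not from the post):

```python
import numpy as np

# Bregman iteration in the "add the error back" form of Equation (7),
# for F(x) = 0.5*||x||^2 and G(x) = 0.5*||Ax - b||^2. The x-subproblem
# is quadratic: x = (lam * A^T A + I)^{-1} (lam * A^T b_k).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
lam = 1.0  # fixed, moderate penalty weight

x = np.zeros(3)
bk = b.copy()
H = np.linalg.inv(lam * A.T @ A + np.eye(3))
for _ in range(100):
    x = H @ (lam * A.T @ bk)  # x_{k+1} = argmin_x F(x) + lam/2 * ||Ax - b_k||^2
    bk = bk + b - A @ x       # b_{k+1} = b_k + b - A x_{k+1}

print(np.linalg.norm(A @ x - b))  # constraint error shrinks toward 0
```

Note that $\lambda = 1$ stays fixed throughout; the constraint $Ax = b$ is driven to hold by the $b^k$ updates alone, not by a growing penalty.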
Linearized Bregman Method
The Bregman method doesn’t reduce the complexity of solving Equation (3), especially when $F(x)$ is not differentiable (e.g. the $\ell_1$ norm) and $G(x)$ is not separable (its elements are coupled with each other, as in $\frac{1}{2}\|Ax - b\|_2^2$). Supposing $F(x)$ is separable, the subproblem would be much easier to solve if we could separate the elements of $G(x)$ as well.
That is what the linearized Bregman method does: it replaces $G(x)$ with its first-order Taylor expansion at $x^k$:
$$\begin{aligned} x^{k+1} &= \arg\min_x D_F^{p^k}(x, x^k) + \lambda G(x^k) + \lambda \nabla G(x^k)^T (x - x^k) + \frac{\lambda\mu}{2}\|x - x^k\|_2^2 \\ &= \arg\min_x D_F^{p^k}(x, x^k) + \frac{\lambda\mu}{2}\left\|x - \left(x^k - \frac{1}{\mu}\nabla G(x^k)\right)\right\|_2^2 \end{aligned} \tag{8}$$
where $\frac{\lambda\mu}{2}\|x - x^k\|_2^2$ is a penalty term, since the linearization is only accurate when $x$ is near $x^k$. Note that $\nabla G(x^k)$ is merged into the $\ell_2$ norm in the same way as Equation (6) in the RPCA post. Treating the $\ell_2$ term as a new function in place of $G(x)$, we can derive $p^{k+1}$ from Equation (6):
$$p^{k+1} = p^k - \lambda\mu\left(x^{k+1} - x^k + \frac{1}{\mu}\nabla G(x^k)\right) \tag{9}$$
Equation (8) is much easier to solve since the elements of $x$ are now decoupled.
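As a sketch, here is the classic linearized Bregman iteration for $F(x) = \|x\|_1$ and $G(x) = \frac{1}{2}\|Ax - b\|_2^2$, where the separable subproblem is solved element-wise by soft-thresholding (the matrix, parameter values, and the `shrink` helper are illustrative assumptions, not from the post):

```python
import numpy as np

def shrink(v, t):
    """Element-wise soft-thresholding (solves the separable l1 subproblem)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Linearized Bregman for min ||x||_1 s.t. Ax = b, in the common two-line
# form: accumulate gradient information in v, then shrink element-wise.
# A small enough step delta (here 1 / ||A||_2^2) keeps it stable.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])
b = A @ np.array([2.0, 0.0, 0.0])  # consistent system with a sparse solution

mu = 10.0                           # shrinkage (regularization) parameter
delta = 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(3)
v = np.zeros(3)
for _ in range(3000):
    v = v + A.T @ (b - A @ x)       # gradient step on the linearized G
    x = delta * shrink(v, mu)       # element-wise shrinkage

print(x)  # approaches the sparse solution [2, 0, 0]
```

Each iteration costs only two matrix-vector products plus an element-wise shrink, which is the whole point of the linearization.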
Split Bregman Method
The split Bregman method is used when $F(x)$ is not obviously separable. Its key idea is to decouple the $\ell_1$ and $\ell_2$ terms via an equality constraint.
Consider the following optimization problem:
$$\arg\min_x \|F(x)\|_1 + \lambda G(x) \tag{10}$$
where $F(x)$ and $G(x)$ are both convex and differentiable. Let $d = F(x)$; then we can transform the original problem into a constrained one:
$$\arg\min_{x, d} \|d\|_1 + \lambda G(x) \quad \text{s.t.} \quad F(x) - d = 0 \tag{11}$$
which is a joint optimization over $x$ and $d$. As in the Bregman method, this can be further transformed into an unconstrained problem with a penalty term:
$$\arg\min_{x, d} \|d\|_1 + \lambda G(x) + \frac{\mu}{2}\|F(x) - d\|_2^2 \tag{12}$$
Let $H(x, d) = \|d\|_1 + \lambda G(x)$ and $A(x, d) = F(x) - d$ (both separable in $x$ and $d$). By Equation (6), we have:
$$\begin{aligned} x^{k+1}, d^{k+1} &= \arg\min_{x, d} H(x, d) - (p_x^k)^T x - (p_d^k)^T d + \frac{\mu}{2}\|A(x, d)\|_2^2 \\ &= \arg\min_{x, d} H(x, d) - (p_x^k)^T x - (p_d^k)^T d + \frac{\mu}{2}\|F(x) - d\|_2^2 \\ p_x^{k+1} &= p_x^k - \mu \left(\nabla_x A(x^{k+1}, d^{k+1})\right)^T A(x^{k+1}, d^{k+1}) = p_x^k - \mu \left(\nabla F(x^{k+1})\right)^T \left(F(x^{k+1}) - d^{k+1}\right) \\ p_d^{k+1} &= p_d^k - \mu \left(\nabla_d A(x^{k+1}, d^{k+1})\right)^T A(x^{k+1}, d^{k+1}) = p_d^k - \mu \left(d^{k+1} - F(x^{k+1})\right) \end{aligned} \tag{13}$$
If $F(x)$ is linear, the above iteration can be simplified as in Equation (7):
$$\begin{aligned} x^{k+1}, d^{k+1} &= \arg\min_{x, d} H(x, d) + \frac{\mu}{2}\|F(x) - d - b^k\|_2^2 \\ b^{k+1} &= b^k - F(x^{k+1}) + d^{k+1} \end{aligned} \tag{14}$$
Equation (14) is still a joint minimization problem, which can be solved by alternating minimization (coordinate descent), i.e. minimizing with respect to $x$ and $d$ separately:
$$\begin{aligned} x^{k+1} &= \arg\min_x \lambda G(x) + \frac{\mu}{2}\|F(x) - d^k - b^k\|_2^2 \\ d^{k+1} &= \arg\min_d \|d\|_1 + \frac{\mu}{2}\|F(x^{k+1}) - d - b^k\|_2^2 \\ b^{k+1} &= b^k - F(x^{k+1}) + d^{k+1} \end{aligned} \tag{15}$$
The first subproblem can be solved by setting the gradient to zero. The second subproblem has an explicit solution via the soft-thresholding operator, as described in this post. Note that these two subproblems are just one iteration of the alternating minimization method; to achieve the same convergence rate as Equation (14), more iterations are required (these are called inner loops, while the Bregman iterations are outer loops). For most applications, however, only one inner iteration is needed.
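Putting the pieces together, here is a minimal split Bregman sketch for the special case $F(x) = x$ and $G(x) = \frac{1}{2}\|Ax - f\|_2^2$, i.e. an $\ell_1$-regularized least-squares problem; the matrix, data, and parameter values are illustrative assumptions, not from the post:

```python
import numpy as np

def shrink(v, t):
    """Soft-thresholding: the explicit solution of the d-subproblem."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def split_bregman(A, f, lam, mu=1.0, iters=500):
    """min ||x||_1 + lam/2*||Ax - f||^2 via the alternating scheme above,
    with F(x) = x so the constraint is simply x - d = 0."""
    n = A.shape[1]
    x, d, bk = np.zeros(n), np.zeros(n), np.zeros(n)
    # The x-subproblem is quadratic: (lam*A^T A + mu*I) x = lam*A^T f + mu*(d + b_k)
    H = np.linalg.inv(lam * A.T @ A + mu * np.eye(n))
    for _ in range(iters):
        x = H @ (lam * A.T @ f + mu * (d + bk))  # gradient of x-subproblem = 0
        d = shrink(x - bk, 1.0 / mu)             # d-subproblem: soft-thresholding
        bk = bk - x + d                          # add the constraint error back
    return x

# With A = I the problem has the closed-form answer shrink(f, 1/lam),
# which the iteration should reproduce.
f = np.array([3.0, 0.2, -1.0])
x = split_bregman(np.eye(3), f, lam=2.0)
print(x)  # close to shrink(f, 0.5) = [2.5, 0, -0.5]
```

Here one inner iteration per outer loop suffices, matching the remark above; only the $\mu\,I$ term couples the quadratic solve, so $H$ can be factored once and reused.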