The Bregman method is an iterative method for solving convex optimization problems with equality constraints. The linearized Bregman method approximates the subproblems in the Bregman iterations with a linearized version. The split Bregman method decouples the variables in the original problem so that the subproblems in the Bregman iterations become much easier to solve, combining the advantages of alternating minimization and the Bregman method. In this post, I’ll use the denominator layout for gradients.

What is Bregman Divergence

The Bregman divergence (or Bregman distance) of a convex function $F(\mathbf{x})$ at a point $\mathbf{y}$ is defined as the difference between the function and its 1st-order Taylor expansion:

\begin{equation}
\begin{split}
D_F^{\mathbf{p}}(\mathbf{x}, \mathbf{y}) = F(\mathbf{x}) - F(\mathbf{y}) - \mathbf{p}^T(\mathbf{x} - \mathbf{y})
\end{split}
\end{equation}

where $\mathbf{p} \in \partial F(\mathbf{y})$ is a subgradient.

Here I list some of its properties; more can be found on the wiki:

  • $D_F^{\mathbf{p}}(\mathbf{x}, \mathbf{y}) \ge 0$
  • $D_F^{\mathbf{p}}(\mathbf{y}, \mathbf{y}) = 0$
  • $D_F^{\mathbf{p}}(\mathbf{x}, \mathbf{y}) + D_F^{\mathbf{q}}(\mathbf{y}, \mathbf{z}) - D_F^{\mathbf{q}}(\mathbf{x}, \mathbf{z}) = (\mathbf{p}-\mathbf{q})^T(\mathbf{y}-\mathbf{x})$
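For example, for $F(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|_2^2$, which is differentiable with $\mathbf{p} = \nabla F(\mathbf{y}) = \mathbf{y}$, the divergence reduces to the squared Euclidean distance: $D_F^{\mathbf{p}}(\mathbf{x}, \mathbf{y}) = \frac{1}{2}\|\mathbf{x}\|_2^2 - \frac{1}{2}\|\mathbf{y}\|_2^2 - \mathbf{y}^T(\mathbf{x} - \mathbf{y}) = \frac{1}{2}\|\mathbf{x} - \mathbf{y}\|_2^2$. In general, however, the Bregman divergence is not symmetric, so it is not a true metric.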

Bregman Method

Consider the following constrained problem:

\begin{equation}
\begin{split}
\argmin_{\mathbf{x}}\ &F(\mathbf{x}) \\
&s.t.\ G(\mathbf{x}) = 0
\end{split}
\end{equation}

where $F(\mathbf{x})$ and $G(\mathbf{x})$ are convex, $G(\mathbf{x})$ is differentiable, and $\min_{\mathbf{x}} G(\mathbf{x}) = 0$.

An intuitive way to solve the above problem is to transform it into an unconstrained form:

\begin{equation}
\begin{split}
\argmin_{\mathbf{x}}\ F(\mathbf{x}) + \lambda G(\mathbf{x})
\end{split}
\end{equation}

where $\lambda \rightarrow \infty$ to enforce that $G(\mathbf{x}) \approx 0$. However, for many problems, a large $\lambda$ would make the problem numerically unstable. It’s also difficult to determine how “large” is large enough for $\lambda$.

The Bregman method turns Equation (2) into a sequence of unconstrained problems. Rather than choosing a large $\lambda$, the Bregman method approaches the equality constraint with arbitrary precision by solving subproblems iteratively, with a fixed and smaller $\lambda$.

Instead of solving Equation (3), the Bregman method solves the following problem:

\begin{equation}
\begin{split}
\mathbf{x}_{k+1} &= \argmin_{\mathbf{x}}\ D_F^{\mathbf{p}_k}(\mathbf{x}, \mathbf{x}_k) + \lambda G(\mathbf{x})\\
&= \argmin_{\mathbf{x}}\ F(\mathbf{x}) - \mathbf{p}_k^T\mathbf{x} + \lambda G(\mathbf{x})
\end{split}
\end{equation}

By the subgradient optimality condition, we know that:

\begin{equation}
\begin{split}
0 \in \partial F(\mathbf{x}_{k+1}) - \mathbf{p}_k + \lambda \nabla G(\mathbf{x}_{k+1})
\end{split}
\end{equation}

Let $\mathbf{p}_{k+1} = \mathbf{p}_k - \lambda \nabla G(\mathbf{x}_{k+1})$, which by Equation (5) lies in $\partial F(\mathbf{x}_{k+1})$. The Bregman iteration then reads:

\begin{equation}
\begin{split}
\mathbf{x}_{k+1} &= \argmin_{\mathbf{x}}\ F(\mathbf{x}) - \mathbf{p}_k^T\mathbf{x} + \lambda G(\mathbf{x})\\
\mathbf{p}_{k+1} &= \mathbf{p}_k - \lambda \nabla G(\mathbf{x}_{k+1})
\end{split}
\end{equation}

where $\mathbf{x}_0 = \argmin_{\mathbf{x}} G(\mathbf{x})$ and $\mathbf{p}_0 = \mathbf{0}$.

It can be shown that as $k \rightarrow \infty$, $G(\mathbf{x}_k) \rightarrow 0$, so the minimization problem in Equation (6) approaches the original problem in Equation (2).
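To make the iteration concrete, here is a minimal NumPy sketch of Equation (6). It assumes, purely for illustration, $F(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|_2^2$ and $G(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2$, so that the subproblem has a closed-form solution; the function and variable names are mine, not from any library.

```python
import numpy as np

def bregman_iteration(A, b, lam=1.0, n_iter=200):
    """Bregman iteration, Equation (6), for the illustrative choice
    F(x) = 0.5*||x||^2 and G(x) = 0.5*||Ax - b||^2."""
    m, n = A.shape
    p = np.zeros(n)                  # p_0 = 0 (x_k enters Equation (6) only through p_k)
    H = np.eye(n) + lam * (A.T @ A)  # Hessian of F(x) + lam*G(x)
    for _ in range(n_iter):
        # x_{k+1} = argmin_x F(x) - p_k^T x + lam*G(x); here a linear solve
        x = np.linalg.solve(H, p + lam * A.T @ b)
        # p_{k+1} = p_k - lam * grad G(x_{k+1})
        p = p - lam * A.T @ (A @ x - b)
    return x

# usage: the iterates drive G(x_k) = 0.5*||A x_k - b||^2 toward zero
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
x = bregman_iteration(A, b)
print(np.linalg.norm(A @ x - b))   # should be close to zero
```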

If $G(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x} -\mathbf{b} \|_2^2$, then $\nabla G(\mathbf{x}_{k+1}) = \mathbf{A}^T\left(\mathbf{A}\mathbf{x}_{k+1} - \mathbf{b}\right)$, and Equation (6) can be transformed into a simpler form:

\begin{equation}
\begin{split}
\mathbf{x}_{k+1} &= \argmin_{\mathbf{x}}\ F(\mathbf{x}) + \frac{\lambda}{2} \|\mathbf{A}\mathbf{x} -\mathbf{b}_k \|_2^2\\
\mathbf{b}_{k+1} &= \mathbf{b}_k + \mathbf{b} - \mathbf{A}\mathbf{x}_{k+1}
\end{split}
\end{equation}

which simply adds the error in the constraint back to the right-hand side, starting from $\mathbf{b}_0 = \mathbf{b}$. Equation (7) can be derived by merging $\mathbf{p}_k^T\mathbf{x}$ into $G(\mathbf{x})$ and substituting $\mathbf{p}_k$ with its explicit form $\mathbf{p}_k = \mathbf{p}_0 - \lambda \sum_{i=1}^{k} \nabla G(\mathbf{x}_i)$.
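As a sketch of Equation (7), take $F(\mathbf{x}) = \|\mathbf{x}\|_1$ and assume, purely to keep the subproblem in closed form, that $\mathbf{A}$ is orthogonal, so $\|\mathbf{A}\mathbf{x} - \mathbf{b}_k\|_2^2 = \|\mathbf{x} - \mathbf{A}^T\mathbf{b}_k\|_2^2$ and each subproblem is solved by soft-thresholding. The names below are mine.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: argmin_x ||x||_1 + 1/(2t)*||x - v||^2."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bregman_l1_orthogonal(A, b, lam=1.0, n_iter=50):
    """Equation (7) with F(x) = ||x||_1 and an orthogonal A (A^T A = I)."""
    b_k = b.copy()                      # b_0 = b
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # x_{k+1} = argmin ||x||_1 + lam/2 * ||A x - b_k||^2 = soft(A^T b_k, 1/lam)
        x = soft_threshold(A.T @ b_k, 1.0 / lam)
        # b_{k+1} = b_k + b - A x_{k+1}
        b_k = b_k + b - A @ x
    return x
```

With an orthogonal $\mathbf{A}$ the constraint already determines $\mathbf{x} = \mathbf{A}^T\mathbf{b}$, so this toy case mainly illustrates the mechanics of adding the residual back; the non-trivial $l_1$ behaviour appears for a general $\mathbf{A}$, which is exactly the case the linearized Bregman method below addresses.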

Linearized Bregman Method

The Bregman method doesn’t reduce the complexity of solving Equation (3), especially when $F(\mathbf{x})$ is not differentiable (e.g. the $l_1$ norm) and $G(\mathbf{x})$ is not separable (its elements are coupled with each other). When $F(\mathbf{x})$ is separable, the problem would be much easier to solve if we could also separate the elements in $G(\mathbf{x})$.

That is what the linearized Bregman method does. It linearizes $G(\mathbf{x})$ with its 1st-order Taylor expansion at $\mathbf{x}_k$:

\begin{equation}
\begin{split}
\mathbf{x}_{k+1} &= \argmin_{\mathbf{x}}\ D_F^{\mathbf{p}_k}(\mathbf{x}, \mathbf{x}_k) + \lambda G(\mathbf{x}_k) + \lambda\nabla G(\mathbf{x}_k)^T(\mathbf{x} - \mathbf{x}_k) + \frac{\lambda\mu}{2} \|\mathbf{x} -\mathbf{x}_k\|_2^2\\
&= \argmin_{\mathbf{x}}\ D_F^{\mathbf{p}_k}(\mathbf{x}, \mathbf{x}_k) + \frac{\lambda\mu}{2} \left\|\mathbf{x} -\left(\mathbf{x}_k - \frac{1}{\mu} \nabla G(\mathbf{x}_k) \right)\right\|_2^2
\end{split}
\end{equation}

where $\frac{\mu}{2} \|\mathbf{x} -\mathbf{x}_k\|_2^2$ is a penalty term, since the linearization is only accurate when $\mathbf{x}$ is near $\mathbf{x}_k$. Note that $\nabla G(\mathbf{x}_k)$ is merged into the $l_2$ norm in the same way as Equation (6) in the RPCA post. Treating the $l_2$ norm term as a new function, we can derive $\mathbf{p}_{k+1}$ from Equation (6):

\begin{equation}
\begin{split}
\mathbf{p}_{k+1} = \mathbf{p}_{k} -\lambda\mu \left(\mathbf{x}_{k+1} - \mathbf{x}_k + \frac{1}{\mu} \nabla G(\mathbf{x}_k)\right)
\end{split}
\end{equation}

Equation (8) is much easier to solve since the elements of $\mathbf{x}$ are now decoupled (provided $F(\mathbf{x})$ is separable).
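For instance, with $F(\mathbf{x}) = \|\mathbf{x}\|_1$ and $G(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2$, completing the square in Equation (8) gives $\mathbf{x}_{k+1} = \mathrm{soft}\left(\mathbf{x}_k - \frac{1}{\mu}\nabla G(\mathbf{x}_k) + \frac{1}{\lambda\mu}\mathbf{p}_k,\ \frac{1}{\lambda\mu}\right)$, and Equation (9) updates $\mathbf{p}_k$ accordingly. Below is a minimal NumPy sketch under this choice; the parameter defaults and names are mine.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def linearized_bregman_l1(A, b, lam=10.0, mu=None, n_iter=500):
    """Linearized Bregman iteration, Equations (8)-(9), for
    F(x) = ||x||_1 and G(x) = 0.5*||Ax - b||^2."""
    m, n = A.shape
    if mu is None:
        # 1/mu is the gradient step on G; mu >= ||A||_2^2 is a conservative choice
        mu = np.linalg.norm(A, 2) ** 2
    x = np.zeros(n)
    p = np.zeros(n)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)        # grad G(x_k)
        c = x - grad / mu               # x_k - (1/mu) grad G(x_k)
        # x_{k+1} = argmin ||x||_1 - p_k^T x + lam*mu/2 * ||x - c||^2
        x_new = soft_threshold(c + p / (lam * mu), 1.0 / (lam * mu))
        # p_{k+1} = p_k - lam*mu*(x_{k+1} - x_k + (1/mu) grad G(x_k))   (Equation (9))
        p = p - lam * mu * (x_new - c)
        x = x_new
    return x
```

Each iteration costs only two matrix-vector products plus an elementwise shrinkage, which is the point of linearizing $G(\mathbf{x})$.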

Split Bregman Method

The split Bregman method is used when $F(\mathbf{x})$ is not obviously separable. The key idea of the split Bregman method is to decouple the $l_1$ and $l_2$ terms by equality constraints.

Consider the following optimization problem:

\begin{equation}
\begin{split}
\argmin_{\mathbf{x}}\ \|F(\mathbf{x})\|_1 + \lambda G(\mathbf{x})
\end{split}
\end{equation}

where $F(\mathbf{x})$ and $G(\mathbf{x})$ are both convex and differentiable. Let $\mathbf{d} = F(\mathbf{x})$; then we transform the original problem into a constrained problem:

\begin{equation}
\begin{split}
\argmin_{\mathbf{x}, \mathbf{d}}\ &\|\mathbf{d}\|_1 + \lambda G(\mathbf{x})\\
&s.t.\ F(\mathbf{x}) - \mathbf{d} = 0
\end{split}
\end{equation}

which is a joint optimization over $\mathbf{x}$ and $\mathbf{d}$. This can be further transformed into an unconstrained problem with a penalty term, as in the Bregman method:

\begin{equation}
\begin{split}
\argmin_{\mathbf{x}, \mathbf{d}}\ \|\mathbf{d}\|_1 + \lambda G(\mathbf{x}) + \frac{\mu}{2}\|F(\mathbf{x}) - \mathbf{d}\|_2^2
\end{split}
\end{equation}

Let $H(\mathbf{x}, \mathbf{d}) = \|\mathbf{d}\|_1 + \lambda G(\mathbf{x})$ and $A(\mathbf{x}, \mathbf{d}) = F(\mathbf{x}) - \mathbf{d}$ (both separable in $\mathbf{x}$ and $\mathbf{d}$). By Equation (6), we have:

\begin{equation}
\begin{split}
\mathbf{x}_{k+1},\mathbf{d}_{k+1} &= \argmin_{\mathbf{x}, \mathbf{d}}\ H(\mathbf{x}, \mathbf{d}) - \mathbf{p}_{\mathbf{x},k}^T\mathbf{x} - \mathbf{p}_{\mathbf{d},k}^T\mathbf{d} + \frac{\mu}{2}\|A(\mathbf{x}, \mathbf{d})\|_2^2\\
&= \argmin_{\mathbf{x}, \mathbf{d}}\ H(\mathbf{x}, \mathbf{d}) - \mathbf{p}_{\mathbf{x},k}^T\mathbf{x} - \mathbf{p}_{\mathbf{d},k}^T\mathbf{d} + \frac{\mu}{2}\|F(\mathbf{x}) - \mathbf{d}\|_2^2\\
\mathbf{p}_{\mathbf{x},k+1} &= \mathbf{p}_{\mathbf{x},k} - \mu \left(\nabla_{\mathbf{x}} A(\mathbf{x}_{k+1}, \mathbf{d}_{k+1})\right)^T A(\mathbf{x}_{k+1}, \mathbf{d}_{k+1})\\
&= \mathbf{p}_{\mathbf{x},k} - \mu \left(\nabla F(\mathbf{x}_{k+1})\right)^T\left(F(\mathbf{x}_{k+1}) - \mathbf{d}_{k+1}\right)\\
\mathbf{p}_{\mathbf{d},k+1} &= \mathbf{p}_{\mathbf{d},k} - \mu \left(\nabla_{\mathbf{d}} A(\mathbf{x}_{k+1}, \mathbf{d}_{k+1})\right)^T A(\mathbf{x}_{k+1}, \mathbf{d}_{k+1})\\
&= \mathbf{p}_{\mathbf{d},k} - \mu \left(\mathbf{d}_{k+1}- F(\mathbf{x}_{k+1})\right)
\end{split}
\end{equation}

If $F(\mathbf{x})$ is linear, the above iteration can be simplified in the same way as Equation (7):

\begin{equation}
\begin{split}
\mathbf{x}_{k+1},\mathbf{d}_{k+1} &= \argmin_{\mathbf{x}, \mathbf{d}}\ H(\mathbf{x}, \mathbf{d}) + \frac{\mu}{2}\|F(\mathbf{x}) - \mathbf{d} - \mathbf{b}_k \|_2^2\\
\mathbf{b}_{k+1} &= \mathbf{b}_k - F(\mathbf{x}_{k+1}) + \mathbf{d}_{k+1}
\end{split}
\end{equation}

Equation (14) is still a joint minimization problem, which can be solved by alternating minimization (coordinate descent), i.e. minimizing with respect to $\mathbf{x}$ and $\mathbf{d}$ separately:

\begin{equation}
\begin{split}
\mathbf{x}_{k+1} &= \argmin_{\mathbf{x}}\ \lambda G(\mathbf{x}) + \frac{\mu}{2}\|F(\mathbf{x}) - \mathbf{d}_k - \mathbf{b}_k \|_2^2\\
\mathbf{d}_{k+1} &= \argmin_{\mathbf{d}}\ \|\mathbf{d}\|_1 + \frac{\mu}{2}\|F(\mathbf{x}_{k+1}) - \mathbf{d} - \mathbf{b}_k \|_2^2\\
\mathbf{b}_{k+1} &= \mathbf{b}_k - F(\mathbf{x}_{k+1}) + \mathbf{d}_{k+1}
\end{split}
\end{equation}

The first subproblem can be solved by setting the gradient to zero. The second subproblem has an explicit solution given by the soft-thresholding operator in this post. Note that these two subproblems form just one iteration of the alternating minimization method; to match the accuracy of solving Equation (14) exactly, more such iterations would be required (these are called inner loops, while the Bregman iterations are the outer loops). For most applications, however, only one inner iteration is needed.
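As a sketch of Equation (15), assume (purely for illustration) a linear $F(\mathbf{x}) = \mathbf{D}\mathbf{x}$ with $\mathbf{D}$ a 1-D finite-difference matrix and $G(\mathbf{x}) = \frac{1}{2}\|\mathbf{x} - \mathbf{f}\|_2^2$, i.e. a TV-like denoising of a signal $\mathbf{f}$. The $\mathbf{x}$-subproblem is then a linear solve, the $\mathbf{d}$-subproblem is a soft-thresholding, and one inner iteration per outer step is used, as discussed above. The names and parameter values are mine.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def split_bregman_tv1d(f, lam=10.0, mu=5.0, n_iter=100):
    """Split Bregman iteration, Equation (15), for
    argmin_x ||D x||_1 + lam/2 * ||x - f||^2 with D a 1-D difference matrix."""
    n = f.size
    D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)   # forward differences
    H = lam * np.eye(n) + mu * D.T @ D             # matrix of the x-subproblem
    x = f.copy()
    d = np.zeros(n - 1)
    b_k = np.zeros(n - 1)
    for _ in range(n_iter):
        # x_{k+1} = argmin lam/2*||x - f||^2 + mu/2*||D x - d_k - b_k||^2
        x = np.linalg.solve(H, lam * f + mu * D.T @ (d + b_k))
        # d_{k+1} = argmin ||d||_1 + mu/2*||D x_{k+1} - d - b_k||^2
        d = soft_threshold(D @ x - b_k, 1.0 / mu)
        # b_{k+1} = b_k - F(x_{k+1}) + d_{k+1}
        b_k = b_k - D @ x + d
    return x

# usage: smooth a noisy piecewise-constant signal
rng = np.random.default_rng(0)
f = np.repeat([0.0, 1.0, 0.5], 50) + 0.1 * rng.standard_normal(150)
x_denoised = split_bregman_tv1d(f)
```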