Background of PCA

Principal component analysis (PCA) is one of the most widely used dimensionality reduction techniques. Given a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, in which each column of $\mathbf{X}$ represents a measurement, and whose rank is $r$, PCA solves the following optimization problem:

\begin{equation} \begin{split} \argmin_{\mathbf{L} \in \mathbb{R}^{m \times n}} &\ \|\mathbf{X} - \mathbf{L}\|_F^2\\ \text{s.t.} &\ \ rank(\mathbf{L}) = l, l \ll r \end{split} \end{equation}

which is also known as low-rank approximation. The underlying statistical model is $\mathbf{X} = \mathbf{L} + \mathbf{N}$, where each element of $\mathbf{N}$ follows an i.i.d. normal distribution.

The Eckart-Young-Mirsky theorem states that the best rank-$l$ approximation in terms of the Frobenius norm is $\mathbf{L} = \sum_{i=1}^{l} \sigma_i u_i v_i^T$, where $\mathbf{X} = \sum_{i=1}^{r} \sigma_i u_i v_i^T$ is the compact SVD of $\mathbf{X}$. The proof of this theorem can be found in this note.
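
As a concrete illustration, here is a minimal NumPy sketch of this result (the function name and the toy data are my own, not taken from any referenced code):

```python
import numpy as np

def low_rank_approx(X, l):
    """Best rank-l approximation of X in Frobenius norm (Eckart-Young-Mirsky)."""
    # Compact SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the l largest singular values and their singular vectors
    return U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]

# Example: a rank-2 matrix plus small Gaussian noise is well approximated at rank 2
rng = np.random.default_rng(0)
X_clean = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
X = X_clean + 0.01 * rng.standard_normal((50, 40))
L = low_rank_approx(X, 2)
print(np.linalg.norm(X_clean - L, 'fro') / np.linalg.norm(X_clean, 'fro'))  # small relative error
```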

I also have a note discussing PCA from the decomposition point of view.

Introduction to RPCA

One drawback of PCA is that it is highly sensitive to corrupted data. "Corrupted" here means that some measurements undergo gross changes, such as specularities on human face images or missing records in a database. This is where robust principal component analysis (RPCA) comes in. RPCA aims to recover a low-rank matrix $\mathbf{L}$ from highly corrupted measurements $\mathbf{X} = \mathbf{L} + \mathbf{S}$, where the elements of $\mathbf{S}$ can have arbitrarily large magnitude but are assumed to be sparse. The optimization problem is:

\begin{equation} \begin{split} \argmin_{\mathbf{L}, \mathbf{S} \in \mathbb{R}^{m \times n}} &\ \|\mathbf{L}\|_* + \lambda \|vec(\mathbf{S})\|_1\\ \text{s.t.} &\ \mathbf{L} + \mathbf{S} = \mathbf{X} \end{split} \end{equation}

where $vec(\cdot)$ denotes the vectorization operator, and $\|\cdot\|_*$ and $\|\cdot\|_1$ denote the matrix nuclear norm and the vector $l_1$-norm, respectively. At first sight, the problem seems impossible to solve due to insufficient information. In fact, we need to assume that the low-rank component $\mathbf{L}$ is not sparse and that the sparsity pattern of $\mathbf{S}$ is selected uniformly at random. More formally, an incoherence condition must hold; see Candès' paper for more information.

Here are two instances in which RPCA cannot disentangle the two matrices. First, suppose that the matrix $\mathbf{X}$ has a single one at the top-left corner and zeros everywhere else. How can we decide whether $\mathbf{X}$ is the low-rank component or the sparse one? Second, suppose that the first column of $\mathbf{S}$ is the negative of the first column of $\mathbf{L}$, with zeros everywhere else. Then the column space of $\mathbf{X}$ falls inside that of $\mathbf{L}$, and we would not be able to recover $\mathbf{L}$ and $\mathbf{S}$.

The main result of Candès' paper states that, under the above assumptions, the convex program recovers $\mathbf{L}$ and $\mathbf{S}$ with high probability, given $\lambda = 1/\sqrt{\max(m, n)}$. This choice of $\lambda$ comes purely from the mathematical analysis, and it works correctly in general. However, we may improve the performance of RPCA by choosing $\lambda$ in accordance with prior knowledge about the solution.

Solutions

Proximal Gradient Descent

The convex program above is also known as Principal Component Pursuit (PCP). Before introducing proximal gradient descent (PGD), let's recall how gradient descent works. Consider a convex and differentiable function $f: \mathbb{R}^m \rightarrow \mathbb{R}$; we want to solve the problem:

\begin{equation} \begin{split} \argmin_{\mathbf{x} \in \mathbb{R}^m} &\ f(\mathbf{x}) \end{split} \end{equation}

First, we expand the second-order Taylor series of $f(\mathbf{x})$ and replace the Hessian matrix $f''(\mathbf{x})$ with a scaled identity matrix $\frac{1}{t} \mathbf{I}$:

\begin{equation} \begin{split} f(\mathbf{z}) &\approx f(\mathbf{x}) + f'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T f''(\mathbf{x}) (\mathbf{z}-\mathbf{x})\\ &\approx f(\mathbf{x}) + f'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T \frac{1}{t}\mathbf{I} (\mathbf{z}-\mathbf{x}) \end{split} \end{equation}

and this is a quadratic approximation of $f(\mathbf{x})$ at the point $\mathbf{x}$ (here we use the numerator layout). The original problem can be reduced to the following problem:

\begin{equation} \begin{split} \argmin_{\mathbf{z} \in \mathbb{R}^m} &\ f(\mathbf{x}) + f'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T \frac{1}{t}\mathbf{I} (\mathbf{z}-\mathbf{x}) \end{split} \end{equation}
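
To see this, complete the square in $\mathbf{z}$ (a small derivation added for clarity):

\begin{equation} \begin{split} f(\mathbf{x}) + f'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2t}\|\mathbf{z}-\mathbf{x}\|_2^2 = \frac{1}{2t}\|\mathbf{z}-(\mathbf{x}-tf'(\mathbf{x})^T)\|_2^2 - \frac{t}{2}\|f'(\mathbf{x})^T\|_2^2 + f(\mathbf{x}) \end{split} \end{equation}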

Since the last two terms do not depend on $\mathbf{z}$, the problem further reduces to the form:

\begin{equation} \begin{split} \argmin_{\mathbf{z} \in \mathbb{R}^m} \ \frac{1}{2t}\|\mathbf{z}-(\mathbf{x}-tf'(\mathbf{x})^T)\|_2^2 \end{split} \end{equation}

So the gradient descent update is:

\begin{equation} \begin{split} \mathbf{x}^+ = \mathbf{x}-tf'(\mathbf{x})^T \end{split} \end{equation}
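
For instance, a minimal NumPy sketch of this update applied to a least-squares objective might look as follows (the matrix, step size, and iteration count are illustrative choices of mine):

```python
import numpy as np

def gradient_descent(grad, x0, t, num_iters=500):
    """Plain gradient descent: repeatedly apply x <- x - t * grad(x)."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - t * grad(x)
    return x

# Example: minimize f(x) = 0.5 * ||A x - b||_2^2, whose gradient is A^T (A x - b)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_hat = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(2), t=0.05)
print(x_hat, np.linalg.solve(A, b))  # the two should nearly coincide
```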

But what if $f(\mathbf{x})$ is not differentiable? Let's split $f(\mathbf{x})$ into two parts, $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$, such that $g(\mathbf{x})$ is convex and differentiable while $h(\mathbf{x})$ is convex but not differentiable. We can still approximate $f(\mathbf{x})$ at the point $\mathbf{x}$ using the Taylor series of $g(\mathbf{x})$:

\begin{equation} \begin{split} f(\mathbf{z}) &= g(\mathbf{z}) + h(\mathbf{z})\\ &\approx g(\mathbf{x}) + g'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T g''(\mathbf{x}) (\mathbf{z}-\mathbf{x}) + h(\mathbf{z})\\ &\approx g(\mathbf{x}) + g'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T \frac{1}{t}\mathbf{I} (\mathbf{z}-\mathbf{x}) + h(\mathbf{z}) \end{split} \end{equation}

The optimization problem can be reduced to:

\begin{equation} \begin{split} \argmin_{\mathbf{z} \in \mathbb{R}^m} \ \frac{1}{2t}\|\mathbf{z}-(\mathbf{x}-tg'(\mathbf{x})^T)\|_2^2 + h(\mathbf{z}) \end{split} \end{equation}

Here we define the proximal mapping $prox_t(\mathbf{x})$ of $h(\mathbf{x})$ as:

\begin{equation} \begin{split} prox_t(\mathbf{x}) = \argmin_{\mathbf{z} \in \mathbb{R}^m} \ \frac{1}{2t}\|\mathbf{z}-\mathbf{x}\|_2^2 + h(\mathbf{z}) \end{split} \end{equation}

If we know the proximal mapping of $h(\mathbf{x})$, we can easily obtain the solution of the above optimization problem, also known as the PGD update:

\begin{equation} \begin{split} \mathbf{x}^+ &= prox_t(\mathbf{x}-tg'(\mathbf{x})^T)\\ &= \mathbf{x} - t\frac{\mathbf{x}-prox_t(\mathbf{x}-tg'(\mathbf{x})^T)}{t}\\ &= \mathbf{x} - t G_t(\mathbf{x}) \end{split} \end{equation}

where $G_t(\mathbf{x})$ is the generalized gradient of $f(\mathbf{x})$.
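
To make the update concrete, here is a minimal sketch of a generic PGD loop (my own illustration, not from the original text), applied to a lasso problem where the proximal mapping of $\lambda\|\cdot\|_1$ is the soft-thresholding operator listed in the table below:

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal mapping of tau * ||.||_1 (see the table below)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proximal_gradient_descent(grad_g, prox_h, x0, t, num_iters=300):
    """Generic PGD: x <- prox_{t h}(x - t * grad_g(x))."""
    x = x0.copy()
    for _ in range(num_iters):
        x = prox_h(x - t * grad_g(x), t)
    return x

# Example (ISTA): g(x) = 0.5 * ||A x - b||_2^2, h(x) = lam * ||x||_1,
# so prox_{t h} is soft-thresholding with threshold lam * t.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.5
t = 1.0 / np.linalg.norm(A, 2) ** 2          # step size 1/L with L = ||A||_2^2
x_hat = proximal_gradient_descent(
    grad_g=lambda x: A.T @ (A @ x - b),
    prox_h=lambda x, step: soft_threshold(x, lam * step),
    x0=np.zeros(10), t=t)
print(x_hat)  # many entries are driven exactly to zero
```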

Here we list a few common proximal mappings.

| $h(\mathbf{x})$ | $prox_t(\mathbf{x})$ | note |
| --- | --- | --- |
| $\Vert\mathbf{x}\Vert_1$ | $S_t(\mathbf{x}) = sign(\mathbf{x}) \max(\lvert\mathbf{x}\rvert-t, 0)$ | $l_1$-norm of vectors; soft-thresholding operator |
| $\Vert\mathbf{X}\Vert_*$ | $SVT_t(\mathbf{X})=\mathbf{U}\, diag(S_t(diag(\mathbf{\Sigma})))\, \mathbf{V}^T$ | nuclear norm of matrices; singular value thresholding operator, with $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ the SVD |
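
In code, the two operators might be implemented as follows (a NumPy sketch under my own naming; `svt` simply soft-thresholds the singular values):

```python
import numpy as np

def soft_threshold(x, tau):
    """S_tau(x): the proximal mapping of tau * ||.||_1, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(X, tau):
    """SVT_tau(X): the proximal mapping of tau * ||.||_*, i.e. soft-thresholding
    of the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft_threshold(s, tau)) @ Vt

# Quick check: SVT shrinks each singular value by tau and clips at zero
X = np.diag([5.0, 2.0, 0.5])
print(np.linalg.svd(svt(X, 1.0), compute_uv=False))  # approximately [4., 1., 0.]
```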

Back to RPCA: we transform the original problem into an unconstrained one by using a slightly relaxed version of it:

\begin{equation} \begin{split} \argmin_{\mathbf{L}, \mathbf{S} \in \mathbb{R}^{m \times n}} &\ \mu\|\mathbf{L}\|_* + \mu\lambda \|vec(\mathbf{S})\|_1 + \frac{1}{2} \|\mathbf{X}-\mathbf{L}-\mathbf{S}\|_F^2 \end{split} \end{equation}

where, as $\mu$ approaches 0, any solution to the above problem approaches the solution set of the original problem. Here we let:

\begin{equation} \begin{split} h(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= \mu\|\mathbf{L}\|_* + \mu\lambda \|vec(\mathbf{S})\|_1\\ &= \mu\|\begin{bmatrix} \mathbf{I} , \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}\|_* + \mu\lambda \|vec\left(\begin{bmatrix} \mathbf{0} , \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}\right)\|_1\\ g(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= \frac{1}{2} \|\mathbf{X}-\mathbf{L}-\mathbf{S}\|_F^2\\ &= \frac{1}{2} \|\mathbf{X}-\begin{bmatrix} \mathbf{I} , \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}\|_F^2\\ \argmin_{\mathbf{L}, \mathbf{S} \in \mathbb{R}^{m \times n}} &\ h(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) + g(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix})\\ \end{split} \end{equation}

The gradient of $g(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix})$ in numerator layout is:

\begin{equation} \begin{split} g'(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= -\begin{bmatrix} \mathbf{X} - \mathbf{L} - \mathbf{S} \\ \mathbf{X} - \mathbf{L} - \mathbf{S} \end{bmatrix}^T \end{split} \end{equation}

The proximal mapping of $h(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix})$ can be decomposed into two separate proximal mappings:

\begin{equation} \begin{split} prox_t(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= \argmin_{\mathbf{Z_L}, \mathbf{Z_S} \in \mathbb{R}^{m \times n}} \ \frac{1}{2t}\|\begin{bmatrix} \mathbf{Z_L} \\ \mathbf{Z_S} \end{bmatrix}-\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}\|_F^2 + h(\begin{bmatrix} \mathbf{Z_L} \\ \mathbf{Z_S}\end{bmatrix})\\ &= \begin{bmatrix} \argmin_{\mathbf{Z_L} \in \mathbb{R}^{m \times n}} \ \frac{1}{2t}\|\mathbf{Z_L} -\mathbf{L}\|_F^2 + \mu\|\mathbf{Z_L}\|_* \\ \argmin_{\mathbf{Z_S} \in \mathbb{R}^{m \times n}} \ \frac{1}{2t}\|\mathbf{Z_S} -\mathbf{S}\|_F^2 + \mu\lambda\|vec(\mathbf{Z_S})\|_1 \end{bmatrix} \end{split} \end{equation}

so the solution would be:

\begin{equation} \begin{split} prox_t(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= \begin{bmatrix} SVT_{\mu t}\left(\mathbf{L}\right) \\ uvec\left(S_{\mu \lambda t}\left(vec(\mathbf{S})\right)\right) \end{bmatrix} \end{split} \end{equation}

where $uvec(\cdot)$ denotes the unvectorization operator, i.e. the inverse of $vec(\cdot)$ that reshapes a vector back into a matrix.

Finally, we get our proximal gradient descent update rules of the RPCA problem:

\begin{equation} \begin{split} \begin{bmatrix} \mathbf{L}^{(k)} \\ \mathbf{S}^{(k)} \end{bmatrix} &= prox_t\left(\begin{bmatrix} \mathbf{L}^{(k-1)}+t(\mathbf{X}-\mathbf{L}^{(k-1)} - \mathbf{S}^{(k-1)}) \\ \mathbf{S}^{(k-1)}+t(\mathbf{X}-\mathbf{L}^{(k-1)} - \mathbf{S}^{(k-1)}) \end{bmatrix} \right)\\ &= \begin{bmatrix} SVT_{\mu t}\left(\mathbf{L}^{(k-1)}+t(\mathbf{X}-\mathbf{L}^{(k-1)} - \mathbf{S}^{(k-1)})\right) \\ uvec\left(S_{\mu \lambda t}\left(vec(\mathbf{S}^{(k-1)}+t(\mathbf{X}-\mathbf{L}^{(k-1)} - \mathbf{S}^{(k-1)}))\right)\right) \end{bmatrix} \end{split} \end{equation}
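
Putting everything together, a minimal NumPy sketch of these update rules might look like this (the values of $\mu$, $t$, and the iteration count are illustrative choices of mine, not prescribed by the analysis; the $l_1$ proximal mapping acts elementwise, so the $vec$/$uvec$ pair is unnecessary in code):

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(X, tau):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft_threshold(s, tau)) @ Vt

def rpca_pgd(X, lam=None, mu=1e-3, t=0.5, num_iters=2000):
    """RPCA via proximal gradient descent on the relaxed objective.
    mu, t and num_iters are illustrative and may need tuning."""
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))               # lambda = 1 / sqrt(max(m, n))
    L, S = np.zeros((m, n)), np.zeros((m, n))
    for _ in range(num_iters):
        R = X - L - S                                # shared residual X - L - S
        L = svt(L + t * R, mu * t)                   # SVT_{mu t} on the low-rank part
        S = soft_threshold(S + t * R, mu * lam * t)  # elementwise shrinkage on the sparse part
    return L, S

# Example: recover a rank-2 matrix corrupted by a few large sparse errors
rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 50))
S0 = np.zeros_like(L0)
mask = rng.random(L0.shape) < 0.05
S0[mask] = 10.0 * rng.standard_normal(mask.sum())
L_hat, S_hat = rpca_pgd(L0 + S0)
print(np.linalg.norm(L_hat - L0, 'fro') / np.linalg.norm(L0, 'fro'))  # relative recovery error
```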

Practicalities of PGD

Convergence

For $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$, we assume:

  • $g$ is convex and differentiable, and $g'$ is Lipschitz continuous with constant $L > 0$
  • $h$ is convex and the proximal mapping of $h$ can be evaluated

Then proximal gradient descent with fixed step size $t \le \frac{1}{L}$ satisfies:

\begin{equation} \begin{split} f(\mathbf{x}^{(k)}) - f^{\ast} \le \frac{\|\mathbf{x}^{(0)} - \mathbf{x}^{\ast}\|_2^2}{2tk} \end{split} \end{equation}

The above theorem shows that PGD converges at rate $O(1/k)$, i.e. it needs $O(1/\epsilon)$ iterations to reach an $\epsilon$-suboptimal solution.
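
Indeed, requiring the right-hand side of the bound to be at most $\epsilon$ gives the iteration count directly:

\begin{equation} \begin{split} \frac{\|\mathbf{x}^{(0)} - \mathbf{x}^{\ast}\|_2^2}{2tk} \le \epsilon \iff k \ge \frac{\|\mathbf{x}^{(0)} - \mathbf{x}^{\ast}\|_2^2}{2t\epsilon} \end{split} \end{equation}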

Appendix

Proof of SVT

For each $\tau \ge 0$ and $\mathbf{Y} \in \mathbb{R}^{m \times n}$, the singular value shrinkage operator obeys

\begin{equation} \begin{split} \mathcal{D}_\tau(\mathbf{Y}) = \argmin_{\mathbf{X}} \frac{1}{2}\| \mathbf{X} - \mathbf{Y} \|_F^2 + \tau \| \mathbf{X} \|_* \end{split} \end{equation}

Proof. Since the function $h(\mathbf{X}) := \frac{1}{2}\| \mathbf{X} - \mathbf{Y} \|_F^2 + \tau \| \mathbf{X} \|_*$ is strictly convex, it is easy to see that there exists a unique minimizer, and we thus need to prove that it is equal to $\mathcal{D}_\tau(\mathbf{Y})$.

To do this, recall the definition of a subgradient of a convex function $f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$. We say that $\mathbf{Z}$ is a subgradient of $f$ at $\mathbf{X}_0$, denoted $\mathbf{Z} \in \partial f(\mathbf{X}_0)$, if $f(\mathbf{X}) \ge f(\mathbf{X}_0) + \langle \mathbf{Z}, \mathbf{X} - \mathbf{X}_0 \rangle$ for all $\mathbf{X}$. Now $\hat{\mathbf{X}}$ minimizes $h$ if and only if 0 is a subgradient of the functional $h$ at the point $\hat{\mathbf{X}}$, i.e. $0 \in \hat{\mathbf{X}} - \mathbf{Y} + \tau \partial \|\hat{\mathbf{X}}\|_*$, where $\partial \|\hat{\mathbf{X}}\|_*$ is the set of subgradients of the nuclear norm.

Let $\mathbf{X} \in \mathbb{R}^{m \times n}$ be an arbitrary matrix and $\mathbf{U} \mathbf{\Sigma} \mathbf{V}^*$ be its SVD. It is known that $\partial \|\mathbf{X}\|_* = \{ \mathbf{U} \mathbf{V}^* + \mathbf{W}: \mathbf{W} \in \mathbb{R}^{m \times n}, \mathbf{U}^*\mathbf{W}=0, \mathbf{W}\mathbf{V}=0, \|\mathbf{W}\|_2 \le 1\}$. Set $\hat{\mathbf{X}} = \mathcal{D}_\tau(\mathbf{Y})$ for short. In order to show that $\hat{\mathbf{X}}$ obeys the optimality condition, decompose the SVD of $\mathbf{Y}$ as $\mathbf{Y} = \mathbf{U}_0 \mathbf{\Sigma}_0 \mathbf{V}_0^* + \mathbf{U}_1 \mathbf{\Sigma}_1 \mathbf{V}_1^*$, where $\mathbf{U}_0, \mathbf{V}_0$ (resp. $\mathbf{U}_1, \mathbf{V}_1$) are the singular vectors associated with singular values greater than $\tau$ (resp. smaller than or equal to $\tau$). With this notation, we have $\hat{\mathbf{X}} = \mathbf{U}_0 \left(\mathbf{\Sigma}_0 - \tau \mathbf{I} \right)\mathbf{V}_0^*$ and, therefore, $\mathbf{Y} - \hat{\mathbf{X}} = \tau \left( \mathbf{U}_0 \mathbf{V}_0^* + \mathbf{W}\right)$, where $\mathbf{W} = \tau^{-1} \mathbf{U}_1 \mathbf{\Sigma}_1 \mathbf{V}_1^*$.

By definition, $\mathbf{U}_0^*\mathbf{W}=0$ and $\mathbf{W}\mathbf{V}_0=0$, and since the diagonal elements of $\mathbf{\Sigma}_1$ have magnitudes bounded by $\tau$, we also have $\|\mathbf{W}\|_2 \le 1$. Hence $\mathbf{Y} - \hat{\mathbf{X}} \in \tau \partial \|\hat{\mathbf{X}}\|_*$, which concludes the proof.
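
As a quick numerical sanity check (my own addition, not part of the proof), one can verify that $\mathcal{D}_\tau(\mathbf{Y})$ attains a lower objective value than random perturbations of itself:

```python
import numpy as np

def svt(Y, tau):
    """D_tau(Y): shrink the singular values of Y by tau and clip at zero."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def objective(X, Y, tau):
    """0.5 * ||X - Y||_F^2 + tau * ||X||_*"""
    return 0.5 * np.linalg.norm(X - Y, 'fro') ** 2 + tau * np.linalg.norm(X, 'nuc')

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 6))
tau = 0.7
X_hat = svt(Y, tau)
# The unique minimizer should beat random perturbations of itself
others = [objective(X_hat + 0.1 * rng.standard_normal(Y.shape), Y, tau) for _ in range(1000)]
print(objective(X_hat, Y, tau) <= min(others))  # expected: True
```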

References