Robust Principal Component Analysis
Principal component analysis (PCA) is one of the most widely used dimensionality reduction techniques. Given a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ whose columns are measurements and whose rank is $r$, PCA solves the following optimization problem:
\[\DeclareMathOperator*{\argmin}{arg\,min} \begin{equation} \begin{split} \argmin_{\mathbf{L} \in \mathbb{R}^{m \times n}} &\ \|\mathbf{X} - \mathbf{L}\|_F^2\\ \text{s.t.} &\ \ rank(\mathbf{L}) = l, l \ll r \end{split} \end{equation}\]which is also known as low-rank approximation. Mathematically, it can be expressed as $\mathbf{X}=\mathbf{L}+\mathbf{N}$, where each element of $\mathbf{N}$ follows an iid normal distribution.
The Eckart-Young-Mirsky theorem states that the best rank-$l$ approximation in terms of the Frobenius norm is $\mathbf{L} = \sum_{i=1}^{l} \sigma_i u_i v_i^T$, where $\mathbf{X} = \sum_{i=1}^{r} \sigma_i u_i v_i^T$ represents the compact SVD of $\mathbf{X}$. The proof of this theorem can be found in this note.
I also have a note on PCA in its decomposition form.
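To make the Eckart-Young-Mirsky theorem concrete, here is a minimal NumPy sketch (my own illustration, nothing beyond plain NumPy) of the best rank-$l$ approximation via the truncated SVD:

```python
import numpy as np

def low_rank_approx(X, l):
    """Best rank-l approximation of X in Frobenius norm (Eckart-Young-Mirsky)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # compact SVD
    return U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]      # keep the l largest singular values
```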
Introduction to RPCA
One drawback of PCA is that it is highly sensitive to corrupted data. “Corrupted” here means that some measurements are grossly altered, e.g. specularities in human face images or missing records in a database. This is where robust principal component analysis (RPCA) comes in. RPCA aims to recover a low-rank matrix $\mathbf{L}$ from highly corrupted measurements $\mathbf{X}=\mathbf{L}+\mathbf{S}$, where the entries of $\mathbf{S}$ can have arbitrarily large magnitude but are assumed to be sparse. The optimization problem is:
\[\begin{equation} \begin{split} \argmin_{\mathbf{L}, \mathbf{S} \in \mathbb{R}^{m \times n}} &\ \|\mathbf{L}\|_* + \lambda \|vec(\mathbf{S})\|_1\\ \text{s.t.} &\ \mathbf{L} + \mathbf{S} = \mathbf{X} \end{split} \end{equation}\]where $vec(\cdot)$ denotes the vectorization operator, and $\|\cdot\|_*$ and $\|\cdot\|_1$ denote the matrix nuclear norm and the vector $l_1$-norm, respectively. At first sight, the problem seems impossible to solve due to insufficient information. In fact, we need to assume that the low-rank component $\mathbf{L}$ is not sparse and that the sparsity pattern of $\mathbf{S}$ is selected uniformly at random. More formally, an incoherence condition is required; see Candès' paper for more information.
Here are two instances in which RPCA cannot disentangle the two matrices. Suppose the matrix $\mathbf{X}$ has a single one at the top-left corner and zeros everywhere else: how can we decide whether $\mathbf{X}$ is the low-rank component or the sparse one? Or suppose the first column of $\mathbf{S}$ is the negative of the first column of $\mathbf{L}$, with zeros everywhere else: we would not be able to recover $\mathbf{L}$ and $\mathbf{S}$, since the column space of $\mathbf{X}$ falls inside that of $\mathbf{L}$.
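To make the first instance concrete, here is a tiny check (purely illustrative) that such a matrix is simultaneously rank-one and one-sparse, so there is no way to tell which component it belongs to:

```python
import numpy as np

X = np.zeros((5, 5))
X[0, 0] = 1.0  # a single one at the top-left corner, zeros everywhere else

print(np.linalg.matrix_rank(X))  # 1 -> X could be the low-rank component
print(np.count_nonzero(X))       # 1 -> X could just as well be the sparse component
```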
The main result of Candès' paper says that, under the above assumptions, the RPCA problem recovers the true pair $(\mathbf{L}, \mathbf{S})$ with high probability, given $\lambda = 1/\sqrt{\max(m, n)}$. This choice of $\lambda$ comes purely from the mathematical analysis and works well in general; however, we may improve the performance of RPCA by choosing $\lambda$ in accordance with prior knowledge about the solution.
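To experiment with this, I will use a hypothetical synthetic test problem; the sizes, rank, and corruption level below are arbitrary choices of mine, and only the default $\lambda = 1/\sqrt{\max(m, n)}$ comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 100, 80, 5  # matrix size and true rank (arbitrary)

# Low-rank component: product of two thin Gaussian factors.
L_true = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))

# Sparse component: corrupt 5% of the entries with large random spikes.
S_true = np.zeros((m, n))
mask = rng.random((m, n)) < 0.05
S_true[mask] = rng.uniform(-10, 10, size=mask.sum())

X = L_true + S_true
lam = 1.0 / np.sqrt(max(m, n))  # the default choice analyzed in the paper
```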
Solutions
Proximal Gradient Descent
The convex program above is also known as Principal Component Pursuit (PCP). Before introducing proximal gradient descent (PGD), let's recall how gradient descent works. Consider a convex, differentiable function $f: \mathbb{R}^m \rightarrow \mathbb{R}$; we want to solve the problem:
\[\begin{equation} \begin{split} \argmin_{\mathbf{x} \in \mathbb{R}^m} &\ f(\mathbf{x}) \end{split} \end{equation}\]First, we take the second-order Taylor expansion of $f(\mathbf{x})$ and replace the Hessian matrix $f''(\mathbf{x})$ with the scaled identity matrix $\frac{1}{t} \mathbf{I}$:
\[\begin{equation} \begin{split} f(\mathbf{z}) &\approx f(\mathbf{x}) + f'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T f''(\mathbf{x}) (\mathbf{z}-\mathbf{x})\\ &\approx f(\mathbf{x}) + f'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T \frac{1}{t}\mathbf{I} (\mathbf{z}-\mathbf{x}) \end{split} \end{equation}\]and this is a quadratic approximation of $f(\mathbf{x})$ at point $\mathbf{x}$ (here we use the numerator layout). The original problem can be reduced to the following problem:
\[\begin{equation} \begin{split} \argmin_{\mathbf{z} \in \mathbb{R}^m} &\ f(\mathbf{x}) + f'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T \frac{1}{t}\mathbf{I} (\mathbf{z}-\mathbf{x}) \end{split} \end{equation}\]which, after completing the square and dropping terms that do not depend on $\mathbf{z}$, further reduces to:
\[\begin{equation} \begin{split} \argmin_{\mathbf{z} \in \mathbb{R}^m} \ \frac{1}{2t}\|\mathbf{z}-(\mathbf{x}-tf'(\mathbf{x})^T)\|_2^2 \end{split} \end{equation}\]So the gradient descent update is:
\[\begin{equation} \begin{split} \mathbf{x}^+ = \mathbf{x}-tf'(\mathbf{x})^T \end{split} \end{equation}\]But what if $f(\mathbf{x})$ is not differentiable? Let's split $f(\mathbf{x})$ into two parts, $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$, such that $g(\mathbf{x})$ is convex and differentiable and $h(\mathbf{x})$ is convex but not differentiable. We can still approximate $f(\mathbf{x})$ at point $\mathbf{x}$ using the Taylor expansion of $g(\mathbf{x})$:
\[\begin{equation} \begin{split} f(\mathbf{z}) &= g(\mathbf{z}) + h(\mathbf{z})\\ &\approx g(\mathbf{x}) + g'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T g''(\mathbf{x}) (\mathbf{z}-\mathbf{x}) + h(\mathbf{z})\\ &\approx g(\mathbf{x}) + g'(\mathbf{x})(\mathbf{z}-\mathbf{x}) + \frac{1}{2}(\mathbf{z}-\mathbf{x})^T \frac{1}{t}\mathbf{I} (\mathbf{z}-\mathbf{x}) + h(\mathbf{z}) \end{split} \end{equation}\]The optimization problem can be reduced to:
\[\begin{equation} \begin{split} \argmin_{\mathbf{z} \in \mathbb{R}^m} \ \frac{1}{2t}\|\mathbf{z}-(\mathbf{x}-tg'(\mathbf{x})^T)\|_2^2 + h(\mathbf{z}) \end{split} \end{equation}\]Here we define the proximal mapping $prox_t(\mathbf{x})$ of $h(\mathbf{x})$ as:
\[\begin{equation} \begin{split} prox_t(\mathbf{x}) = \argmin_{\mathbf{z} \in \mathbb{R}^m} \ \frac{1}{2t}\|\mathbf{z}-\mathbf{x}\|_2^2 + h(\mathbf{z}) \end{split} \end{equation}\]If we know the proximal mapping of $h(\mathbf{x})$, we can easily write down the solution of the above optimization problem, also known as the PGD update:
\[\begin{equation} \begin{split} \mathbf{x}^+ &= prox_t(\mathbf{x}-tg'(\mathbf{x})^T)\\ &= \mathbf{x} - t\frac{\mathbf{x}-prox_t(\mathbf{x}-tg'(\mathbf{x})^T)}{t}\\ &= \mathbf{x} - t G_t(\mathbf{x}) \end{split} \end{equation}\]where $G_t(\mathbf{x})$ is the generalized gradient of $f(\mathbf{x})$.
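As a sketch, the generic PGD iteration is only a few lines of Python; `grad_g` and `prox` are placeholders that the caller has to supply (the gradient of $g$ and the proximal mapping of $h$, respectively):

```python
def proximal_gradient_descent(x0, grad_g, prox, t, n_iter=500):
    """Generic PGD: repeat x <- prox_t(x - t * grad_g(x))."""
    x = x0
    for _ in range(n_iter):
        x = prox(x - t * grad_g(x), t)
    return x
```

When $h = 0$ the proximal mapping is the identity and this reduces to plain gradient descent.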
Here we list a few common proximal mappings.
$h(\mathbf{x})$ | $prox_t(\mathbf{x})$ | note |
---|---|---|
$\lVert\mathbf{x}\rVert_1$ | $S_t(\mathbf{x}) = sign(\mathbf{x}) max(\lvert\mathbf{x}\rvert-t, 0)$ | $l_1$-norm of vectors, soft-thresholding operator |
$\lVert\mathbf{X}\rVert_*$ | $SVT_t(\mathbf{X})=\mathbf{U} diag(S_t(diag(\mathbf{\Sigma}))) \mathbf{V}^T$ | nuclear norm of matrices, singular value thresholding operator |
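The two proximal mappings in the table translate directly into NumPy; this is just a sketch of the operators as defined above:

```python
import numpy as np

def soft_threshold(x, t):
    """Entrywise soft-thresholding S_t(x) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def svt(X, t):
    """Singular value thresholding: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft_threshold(s, t)) @ Vt
```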
Back to RPCA: we replace the original constrained problem with a slightly relaxed, unconstrained version:
\[\begin{equation} \begin{split} \argmin_{\mathbf{L}, \mathbf{S} \in \mathbb{R}^{m \times n}} &\ \mu\|\mathbf{L}\|_* + \mu\lambda \|vec(\mathbf{S})\|_1 + \frac{1}{2} \|\mathbf{X}-\mathbf{L}-\mathbf{S}\|_F^2 \end{split} \end{equation}\]where as $\mu$ approaches 0, any solution to the above problem approaches the solution set of the original problem. Here we let:
\[\begin{equation} \begin{split} h(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= \mu\|\mathbf{L}\|_* + \mu\lambda \|vec(\mathbf{S})\|_1\\ &= \mu\|\begin{bmatrix} \mathbf{I} , \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}\|_* + \mu\lambda \|vec\left(\begin{bmatrix} \mathbf{0} , \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}\right)\|_1\\ g(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= \frac{1}{2} \|\mathbf{X}-\mathbf{L}-\mathbf{S}\|_F^2\\ &= \frac{1}{2} \|\mathbf{X}-\begin{bmatrix} \mathbf{I} , \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}\|_F^2\\ \argmin_{\mathbf{L}, \mathbf{S} \in \mathbb{R}^{m \times n}} &\ h(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) + g(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix})\\ \end{split} \end{equation}\]The gradient of $g(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix})$ in numerator layout is:
\[\begin{equation} \begin{split} g'(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= -\begin{bmatrix} \mathbf{X} - \mathbf{L} - \mathbf{S} \\ \mathbf{X} - \mathbf{L} - \mathbf{S} \end{bmatrix}^T \end{split} \end{equation}\]The proximal mapping of $h(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix})$ can be decomposed into two separate proximal mappings:
\[\begin{equation} \begin{split} prox_t(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= \argmin_{\mathbf{Z_L}, \mathbf{Z_S} \in \mathbb{R}^{m \times n}} \ \frac{1}{2t}\|\begin{bmatrix} \mathbf{Z_L} \\ \mathbf{Z_S} \end{bmatrix}-\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}\|_F^2 + h(\begin{bmatrix} \mathbf{Z_L} \\ \mathbf{Z_S}\end{bmatrix})\\ &= \begin{bmatrix} \argmin_{\mathbf{Z_L} \in \mathbb{R}^{m \times n}} \ \frac{1}{2t}\|\mathbf{Z_L} -\mathbf{L}\|_F^2 + \mu\|\mathbf{Z_L}\|_* \\ \argmin_{\mathbf{Z_S} \in \mathbb{R}^{m \times n}} \ \frac{1}{2t}\|\mathbf{Z_S} -\mathbf{S}\|_F^2 + \mu\lambda\|vec(\mathbf{Z_S})\|_1 \end{bmatrix} \end{split} \end{equation}\]so the solution would be:
\[\begin{equation} \begin{split} prox_t(\begin{bmatrix} \mathbf{L} \\ \mathbf{S} \end{bmatrix}) &= \begin{bmatrix} SVT_{\mu t}\left(\mathbf{L}\right) \\ uvec\left(S_{\mu \lambda t}\left(vec(\mathbf{S})\right)\right) \end{bmatrix} \end{split} \end{equation}\]where $uvec(\cdot)$ represents unvectorize operator.
Finally, we get our proximal gradient descent update rules of the RPCA problem:
\[\begin{equation} \begin{split} \begin{bmatrix} \mathbf{L}^{(k)} \\ \mathbf{S}^{(k)} \end{bmatrix} &= prox_t\left(\begin{bmatrix} \mathbf{L}^{(k-1)}+t(\mathbf{X}-\mathbf{L}^{(k-1)} - \mathbf{S}^{(k-1)}) \\ \mathbf{S}^{(k-1)}+t(\mathbf{X}-\mathbf{L}^{(k-1)} - \mathbf{S}^{(k-1)}) \end{bmatrix} \right)\\ &= \begin{bmatrix} SVT_{\mu t}\left(\mathbf{L}^{(k-1)}+t(\mathbf{X}-\mathbf{L}^{(k-1)} - \mathbf{S}^{(k-1)})\right) \\ uvec\left(S_{\mu \lambda t}\left(vec(\mathbf{S}^{(k-1)}+t(\mathbf{X}-\mathbf{L}^{(k-1)} - \mathbf{S}^{(k-1)}))\right)\right) \end{bmatrix} \end{split} \end{equation}\]

Practicalities of PGD
Convergence
For $f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})$, we assume:
- $g$ is convex, differentiable, and $g'$ is Lipschitz continuous with constant $L > 0$
- $h$ is convex and the proximal mapping of $h$ can be evaluated

Then proximal gradient descent with fixed step size $t \le \frac{1}{L}$ satisfies:

\[\begin{equation} \begin{split} f(\mathbf{x}^{(k)}) - f^{\star} \le \frac{\|\mathbf{x}^{(0)}-\mathbf{x}^{\star}\|_2^2}{2tk} \end{split} \end{equation}\]where $\mathbf{x}^{\star}$ is a minimizer of $f$ and $f^{\star} = f(\mathbf{x}^{\star})$.
The above theorem says that PGD converges at rate $O(1/k)$, or equivalently that it needs $O(1/\epsilon)$ iterations to reach an $\epsilon$-suboptimal point.
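Putting the update rules together, here is a minimal sketch of the whole RPCA solver; it reuses `soft_threshold` and `svt` from above, and the hyperparameters ($\mu$, step size, number of iterations) are illustrative guesses rather than tuned values:

```python
import numpy as np

def rpca_pgd(X, lam, mu=1.0, t=0.5, n_iter=500):
    """RPCA via proximal gradient descent on the relaxed objective
    mu*||L||_* + mu*lam*||vec(S)||_1 + 0.5*||X - L - S||_F^2.
    The gradient of the smooth part is 2-Lipschitz, so t = 0.5 satisfies t <= 1/L."""
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(n_iter):
        R = X - L - S                                # residual from the previous iterate
        L = svt(L + t * R, mu * t)                   # prox of the nuclear norm term
        S = soft_threshold(S + t * R, mu * lam * t)  # prox of the l1 term (entrywise)
    return L, S

# Hypothetical usage with the synthetic data generated in the introduction:
# L_hat, S_hat = rpca_pgd(X, lam)
```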
Appendix
Proof of SVT
For each $\tau \ge 0$ and $\mathbf{Y} \in \mathbb{R}^{m \times n}$, the singular value shrinkage operator $\mathcal{D}_\tau$ (the $SVT_\tau$ operator used above) obeys
\[\begin{equation} \begin{split} \mathcal{D}_\tau(\mathbf{Y}) = \argmin_{\mathbf{X}} \frac{1}{2}\| \mathbf{X} - \mathbf{Y} \|_F^2 + \tau \| \mathbf{X} \|_* \end{split} \end{equation}\]Proof. Since the function $h(\mathbf{X}) := \frac{1}{2}\| \mathbf{X} - \mathbf{Y} \|_F^2 + \tau \| \mathbf{X} \|_*$ is strictly convex, it is easy to see that there exists a unique minimizer, and we thus need to prove that it is equal to $\mathcal{D}_\tau(\mathbf{Y})$.
To do this, recall the definition of a subgradient of a convex function $f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$. We say that $\mathbf{Z}$ is a subgradient of $f$ at $\mathbf{X}_0$, denoted $\mathbf{Z} \in \partial f(\mathbf{X}_0)$, if $f(\mathbf{X}) \ge f(\mathbf{X}_0) + \langle \mathbf{Z}, \mathbf{X} - \mathbf{X}_0 \rangle$ for all $\mathbf{X}$. Now $\hat{\mathbf{X}}$ minimizes $h$ if and only if $0$ is a subgradient of the functional $h$ at the point $\hat{\mathbf{X}}$, i.e. $0 \in \hat{\mathbf{X}} - \mathbf{Y} + \tau \partial \|\hat{\mathbf{X}}\|_*$, where $\partial \|\hat{\mathbf{X}}\|_*$ is the set of subgradients of the nuclear norm.
Let $\mathbf{X} \in \mathbb{R}^{m \times n}$ be an arbitrary matrix and $\mathbf{U} \mathbf{\Sigma} \mathbf{V}^*$ be its SVD. It is known that $\partial \|\mathbf{X}\|_* = \{ \mathbf{U} \mathbf{V}^* + \mathbf{W}: \mathbf{W} \in \mathbb{R}^{m \times n}, \mathbf{U}^*\mathbf{W}=0, \mathbf{W}\mathbf{V}=0, \|\mathbf{W}\|_2 \le 1\}$. Set $\hat{\mathbf{X}} = \mathcal{D}_\tau(\mathbf{Y})$ for short. In order to show that $\hat{\mathbf{X}}$ obeys the optimality condition, decompose the SVD of $\mathbf{Y}$ as $\mathbf{Y} = \mathbf{U}_0 \mathbf{\Sigma}_0 \mathbf{V}_0^* + \mathbf{U}_1 \mathbf{\Sigma}_1 \mathbf{V}_1^*$, where $\mathbf{U}_0,\mathbf{V}_0$ (resp. $\mathbf{U}_1,\mathbf{V}_1$) are the singular vectors associated with singular values greater than $\tau$ (resp. smaller than or equal to $\tau$). With these notations, we have $\hat{\mathbf{X}} = \mathbf{U}_0 \left(\mathbf{\Sigma}_0 - \tau \mathbf{I} \right)\mathbf{V}_0^*$ and, therefore, $\mathbf{Y} - \hat{\mathbf{X}} = \tau \left( \mathbf{U}_0 \mathbf{V}_0^* + \mathbf{W}\right), \mathbf{W} = \tau^{-1} \mathbf{U}_1 \mathbf{\Sigma}_1 \mathbf{V}_1^*$.
By construction, $\mathbf{U}_0^*\mathbf{W}=0$ and $\mathbf{W}\mathbf{V}_0=0$, and since the diagonal elements of $\mathbf{\Sigma}_1$ have magnitudes bounded by $\tau$, we also have $\|\mathbf{W}\|_2 \le 1$. Hence $\mathbf{Y} - \hat{\mathbf{X}} \in \tau \partial \|\hat{\mathbf{X}}\|_*$, which concludes the proof.
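As a quick sanity check (not part of the proof), we can also verify numerically that the shrinkage operator attains a lower objective value than random perturbations around it; this reuses the `svt` function from the PGD section:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(8, 6))
tau = 1.0

def objective(X):
    return 0.5 * np.linalg.norm(X - Y, 'fro')**2 + tau * np.linalg.norm(X, 'nuc')

X_hat = svt(Y, tau)  # singular value shrinkage D_tau(Y)
# The minimizer should beat any randomly perturbed point.
assert all(objective(X_hat) <= objective(X_hat + 0.1 * rng.normal(size=Y.shape))
           for _ in range(100))
```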