Principal component analysis (PCA) is the most widely used dimensionality reduction technique. Given a matrix $X \in \mathbb{R}^{m \times n}$ whose columns are measurements and whose rank is $r$, PCA solves the following optimization problem:
$$\underset{L \in \mathbb{R}^{m \times n}}{\arg\min} \ \|X - L\|_F^2 \quad \text{s.t.} \quad \operatorname{rank}(L) = l, \quad l \ll r$$
which is also known as low-rank approximation. Mathematically, the model can be expressed as $X = L + N$, where each element of $N$ follows an i.i.d. normal distribution.
The Eckart-Young-Mirsky theorem states that the best rank-$l$ approximation in terms of the Frobenius norm is $L = \sum_{i=1}^{l} \sigma_i u_i v_i^T$, where $X = \sum_{i=1}^{r} \sigma_i u_i v_i^T$ is the compact SVD of $X$. The proof of this theorem can be found in this note.
I also have a note discussing PCA in its decomposition form.
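As a quick illustration of the theorem, here is a minimal numpy sketch (the function name and the test sizes are my own choices) that forms the best rank-$l$ approximation by truncating the SVD:

```python
import numpy as np

def low_rank_approx(X, l):
    """Best rank-l approximation of X in Frobenius norm (Eckart-Young-Mirsky)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # compact SVD
    return U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]      # keep the l largest singular values

# example: approximate a random 50x30 matrix at rank 5
X = np.random.randn(50, 30)
L = low_rank_approx(X, 5)
print(np.linalg.matrix_rank(L), np.linalg.norm(X - L, "fro"))
```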
Introduction to RPCA
One drawback of PCA is that it is highly sensitive to corrupted data. "Corrupted" here means significant changes in the measurements, such as specularities on human face images or missing records in a database. This is where robust principal component analysis (RPCA) comes in. RPCA aims to recover a low-rank matrix $L$ from highly corrupted measurements $X = L + S$, where the elements of $S$ can have arbitrarily large magnitude and are assumed to be sparse. The optimization problem is:
$$\underset{L, S \in \mathbb{R}^{m \times n}}{\arg\min} \ \|L\|_* + \lambda \|\operatorname{vec}(S)\|_1 \quad \text{s.t.} \quad L + S = X$$
where $\operatorname{vec}(\cdot)$ denotes the vectorization operator, and $\|\cdot\|_*$ and $\|\cdot\|_1$ denote the matrix nuclear norm and the vector $\ell_1$-norm, respectively. At first sight, the problem seems impossible to solve due to insufficient information. In fact, we need to assume that the low-rank component $L$ is not sparse and that the sparsity pattern of $S$ is selected uniformly at random. More formally, an incoherence condition must be imposed; see Candès' paper for more information.
Here are two instances in which RPCA cannot disentangle the two matrices. Suppose the matrix $X$ has a single one at the top-left corner and zeros everywhere else; how can we decide whether $X$ is low-rank or sparse? Or suppose the first column of $S$ is the negative of the first column of $L$ and $S$ is zero everywhere else; then we would not be able to recover $L$ and $S$, since the column space of $X$ falls into that of $L$.
The main result of Candès' paper says that, under the above assumptions, the RPCA problem recovers the true solution with high probability, given $\lambda = 1/\sqrt{\max(m, n)}$. This choice of $\lambda$ comes from a purely mathematical analysis and works correctly in general. However, we may improve the performance of RPCA by choosing $\lambda$ according to prior knowledge about the solution.
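To make the assumptions concrete, here is a small numpy sketch of the synthetic setting the theory has in mind: a random low-rank $L$ (incoherent with high probability), a sparse $S$ with uniformly random support and large entries, and the default $\lambda$. The sizes, rank, and sparsity level below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, p = 200, 200, 5, 0.05           # dimensions, true rank, fraction of corrupted entries

# low-rank component: product of two thin Gaussian factors (incoherent with high probability)
L_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# sparse component: support chosen uniformly at random, arbitrarily large magnitudes
S_true = np.zeros((m, n))
mask = rng.random((m, n)) < p
S_true[mask] = rng.uniform(-50, 50, size=mask.sum())

X = L_true + S_true                       # observed corrupted measurements
lam = 1 / np.sqrt(max(m, n))              # default choice of lambda from the analysis
```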
Solutions
Proximal Gradient Descent
The convex program above is also known as Principal Component Pursuit (PCP). Before introducing proximal gradient descent (PGD), let's recall how gradient descent works. Consider $f(x) \in \mathbb{R}$, $x \in \mathbb{R}^m$, where $f(x)$ is convex and differentiable; we want to solve the problem:
$$\underset{x \in \mathbb{R}^m}{\arg\min} \ f(x)$$
First, we take the second-order Taylor expansion of $f$ around $x$ and replace the Hessian $f''(x)$ with the scaled identity $\frac{1}{t} I$:
$$f(z) \approx f(x) + f'(x)(z - x) + \frac{1}{2}(z - x)^T \frac{1}{t} I (z - x)$$
This is a quadratic approximation of $f(x)$ at the point $x$ (here we use the numerator layout). The original problem can then be reduced to the following problem:
$$\underset{z \in \mathbb{R}^m}{\arg\min} \ f(x) + f'(x)(z - x) + \frac{1}{2}(z - x)^T \frac{1}{t} I (z - x)$$
which further reduces to the form:
$$\underset{z \in \mathbb{R}^m}{\arg\min} \ \frac{1}{2t} \left\| z - \left( x - t f'(x)^T \right) \right\|_2^2$$
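For completeness, this reduction is nothing more than completing the square and discarding the terms that do not depend on $z$:

$$f(x) + f'(x)(z - x) + \frac{1}{2t}\|z - x\|_2^2 = \frac{1}{2t} \left\| z - \left( x - t f'(x)^T \right) \right\|_2^2 + f(x) - \frac{t}{2}\|f'(x)\|_2^2$$

and the last two terms are constant with respect to $z$, so they can be dropped inside the $\arg\min$.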
So the gradient descent update is:
$$x^+ = x - t f'(x)^T$$
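A minimal numpy sketch of this update on a toy least-squares objective (the matrix, the step size, and the iteration count are illustrative choices, not tuned):

```python
import numpy as np

# toy objective: f(x) = 0.5 * ||A x - b||_2^2, so f'(x)^T = A^T (A x - b)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A.T @ (A @ x - b)

x = np.zeros(2)
t = 0.05                          # fixed step size
for _ in range(500):
    x = x - t * grad(x)           # x+ = x - t f'(x)^T

print(x, np.linalg.solve(A, b))   # compare against the exact minimizer
```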
But what if $f(x)$ is not differentiable? Let's split $f(x)$ into two parts, $f(x) = g(x) + h(x)$, such that $g(x)$ is convex and differentiable and $h(x)$ is convex but not differentiable. We can still approximate $f(x)$ at the point $x$ with the Taylor expansion of $g(x)$:
$$\underset{z \in \mathbb{R}^m}{\arg\min} \ g(x) + g'(x)(z - x) + \frac{1}{2t} \|z - x\|_2^2 + h(z) = \underset{z \in \mathbb{R}^m}{\arg\min} \ \frac{1}{2t} \left\| z - \left( x - t g'(x)^T \right) \right\|_2^2 + h(z)$$
The minimizer of the right-hand side is the proximal mapping $\operatorname{prox}_{th}\left( x - t g'(x)^T \right)$, where $\operatorname{prox}_{th}(y) := \underset{z}{\arg\min} \ \frac{1}{2t} \|z - y\|_2^2 + h(z)$, so the PGD update is $x^+ = \operatorname{prox}_{th}\left( x - t g'(x)^T \right)$. The standard convergence result is as follows: if
$g$ is convex and differentiable, with $g'$ Lipschitz continuous with constant $L > 0$, and $h$ is convex with a proximal mapping that can be evaluated, then proximal gradient descent with fixed step size $t \le 1/L$ satisfies
$$f(x^{(k)}) - f^* \le \frac{\|x^{(0)} - x^*\|_2^2}{2tk}$$
The above theorem says that PGD converges at rate $O(1/k)$; equivalently, it needs $O(1/\epsilon)$ iterations to reach an $\epsilon$-accurate solution.
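To tie PGD back to RPCA: a common way to apply it is to relax the equality constraint and solve $\min_{L,S} \ \mu\|L\|_* + \mu\lambda\|S\|_1 + \frac{1}{2}\|X - L - S\|_F^2$, whose smooth part has a gradient that is Lipschitz with constant $2$, so a fixed step size $t = 1/2$ is safe. Below is a rough numpy sketch under these assumptions; the value of $\mu$, the iteration count, and the lack of a stopping criterion are simplifications on my part, not settings from the paper.

```python
import numpy as np

def soft(M, tau):
    """Elementwise soft-thresholding: the prox of tau * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding: the prox of tau * ||.||_* (see the appendix)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca_pgd(X, lam, mu=1.0, t=0.5, iters=500):
    """PGD on mu*||L||_* + mu*lam*||S||_1 + 0.5*||X - L - S||_F^2."""
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(iters):
        R = L + S - X                       # gradient of the smooth part w.r.t. both L and S
        L = svt(L - t * R, t * mu)          # prox step for the nuclear-norm term
        S = soft(S - t * R, t * mu * lam)   # prox step for the l1 term
    return L, S

# e.g. with the synthetic X and lam from the earlier sketch:
# L_hat, S_hat = rpca_pgd(X, lam)
```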
Appendix
Proof of SVT
For each $\tau \ge 0$ and $Y \in \mathbb{R}^{m \times n}$ with SVD $Y = U \Sigma V^*$, the singular value shrinkage operator $\mathcal{D}_\tau(Y) := U \operatorname{diag}\!\left( (\sigma_i - \tau)_+ \right) V^*$ obeys
$$\mathcal{D}_\tau(Y) = \underset{X}{\arg\min} \ \frac{1}{2} \|X - Y\|_F^2 + \tau \|X\|_*$$
Proof. Since the function $h(X) := \frac{1}{2}\|X - Y\|_F^2 + \tau \|X\|_*$ is strictly convex, it is easy to see that there exists a unique minimizer, and we thus need to prove that it is equal to $\mathcal{D}_\tau(Y)$.
To do this, recall the definition of a subgradient of a convex function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$. We say that $Z$ is a subgradient of $f$ at $X_0$, denoted $Z \in \partial f(X_0)$, if $f(X) \ge f(X_0) + \langle Z, X - X_0 \rangle$ for all $X$. Now $\hat{X}$ minimizes $h$ if and only if $0$ is a subgradient of the functional $h$ at the point $\hat{X}$, i.e. $0 \in \hat{X} - Y + \tau \partial \|\hat{X}\|_*$, where $\partial \|\hat{X}\|_*$ is the set of subgradients of the nuclear norm.
Let $X \in \mathbb{R}^{m \times n}$ be an arbitrary matrix and $U \Sigma V^*$ its SVD. It is known that $\partial \|X\|_* = \{ U V^* + W : W \in \mathbb{R}^{m \times n},\ U^* W = 0,\ W V = 0,\ \|W\|_2 \le 1 \}$. Set $\hat{X} = \mathcal{D}_\tau(Y)$ for short. In order to show that $\hat{X}$ obeys the optimality condition, decompose the SVD of $Y$ as $Y = U_0 \Sigma_0 V_0^* + U_1 \Sigma_1 V_1^*$, where $U_0, V_0$ (resp. $U_1, V_1$) are the singular vectors associated with singular values greater than $\tau$ (resp. smaller than or equal to $\tau$). With this notation, we have $\hat{X} = U_0 (\Sigma_0 - \tau I) V_0^*$ and, therefore, $Y - \hat{X} = \tau (U_0 V_0^* + W)$, where $W = \tau^{-1} U_1 \Sigma_1 V_1^*$.
By definition, $U_0^* W = 0$ and $W V_0 = 0$, and since the diagonal elements of $\Sigma_1$ have magnitudes bounded by $\tau$, we also have $\|W\|_2 \le 1$. Hence $Y - \hat{X} \in \tau \partial \|\hat{X}\|_*$, which concludes the proof.
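As an informal numerical sanity check of this theorem (not a substitute for the proof), one can evaluate $h$ at $\mathcal{D}_\tau(Y)$ and at random perturbations of it; the shrinkage output should always attain the smaller value. A small numpy sketch, with sizes and the perturbation scale chosen arbitrarily:

```python
import numpy as np

def svt(Y, tau):
    """Singular value shrinkage operator D_tau(Y)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def h(X, Y, tau):
    """Objective 0.5 * ||X - Y||_F^2 + tau * ||X||_*."""
    return 0.5 * np.linalg.norm(X - Y, "fro") ** 2 + tau * np.linalg.norm(X, "nuc")

rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 6))
tau = 1.0
X_hat = svt(Y, tau)

# the claimed minimizer should beat random perturbations of itself
vals = [h(X_hat + 0.1 * rng.standard_normal(Y.shape), Y, tau) for _ in range(1000)]
print(h(X_hat, Y, tau) <= min(vals))   # expected: True
```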