This weekend I learned the Preconditioned Conjugate Gradient method from Jonathan Richard Shewchuk’s lecture note “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain”. Here I document what I have learned from the note.
TL;DR
In short, all methods mentioned in this post iteratively solve the linear equations $Ax=b$, assuming $A$ is symmetric and positive-definite. The Steepest Gradient Method (SG), which moves along the direction of the residual (the negative gradient) at each step, converges slowly and therefore requires many matrix-vector multiplications. SG can be improved by moving along conjugate directions instead of the residual direction at each step, which converges faster and reduces the number of matrix-vector multiplications. However, it requires constructing the conjugate directions in advance, which can cost as much as running SG itself. The Conjugate Gradient Method (CG) is an efficient algorithm that computes conjugate directions on the fly from the residuals, making it a practical method for solving the equations. Finally, the Preconditioned Conjugate Gradient Method (PCG) further improves the convergence rate by reducing the condition number of $A$.
Steepest Gradient Method
Consider the following minimization problem:
$$ \operatorname*{arg\,min}_{x \in \mathbb{R}^n} \; \frac{1}{2} x^T A x - b^T x + c $$
where $A \in \mathbb{R}^{n \times n}$ is symmetric and positive-definite. Since this is a convex problem, the global minimizer exists and can be found by setting the gradient $f'(x) = x^T A - b^T \in \mathbb{R}^{1 \times n}$ to zero. Solving the minimization problem above is therefore equivalent to solving the linear equations $Ax=b$.
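To make this equivalence concrete, here is a tiny numpy check, using a made-up SPD matrix (the variable names are my own, purely for illustration), that the gradient of $f$ vanishes at the solution of $Ax=b$:

```python
import numpy as np

# A small check that the minimizer of f(x) = 1/2 x^T A x - b^T x + c solves Ax = b.
rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)        # symmetric positive-definite
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)     # solve Ax = b directly
gradient = A @ x_star - b          # gradient of f at x_star (A is symmetric)
print(np.allclose(gradient, 0.0))  # True: the gradient vanishes at x_star
```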
Recall that the residual $r_k$ and the error $e_k$ are defined as follows:
$$
\begin{aligned}
r_k &= b - A x_k \\
e_k &= x_k - x^\star
\end{aligned}
$$
where $x^\star$ is the solution of $Ax=b$. The residual and the error are related by:
$$ A e_k = A(x_k - x^\star) = A x_k - b + b - A x^\star = -r_k $$
With these notations, the Steepest Gradient Method updates each point as follows:
$$ x_{k+1} = x_k + \alpha_k r_k $$
where $\alpha_k$ is the step size at the $k$-th step and $r_k$ equals the negative gradient. SG uses line search to determine how big a step to take. By the chain rule, $\frac{d f(x_{k+1})}{d \alpha_k} = f'(x_{k+1}) \frac{d x_{k+1}}{d \alpha_k}$, and setting this derivative to zero yields the following $\alpha_k$:
$$ \alpha_k = \frac{r_k^T r_k}{r_k^T A r_k} $$
The SG iteration is:
$$
\begin{aligned}
r_0 &= b - A x_0 \\
\alpha_k &= \frac{r_k^T r_k}{r_k^T A r_k} \\
x_{k+1} &= x_k + \alpha_k r_k \\
r_{k+1} &= b - A x_{k+1}
\end{aligned}
$$
or, eliminating one matrix-vector multiplication ($A x_{k+1}$) by iterating on the residuals:
$$
\begin{aligned}
r_0 &= b - A x_0 \\
\alpha_k &= \frac{r_k^T r_k}{r_k^T A r_k} \\
x_{k+1} &= x_k + \alpha_k r_k \\
r_{k+1} &= r_k - \alpha_k A r_k
\end{aligned}
$$
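Here is a minimal numpy sketch of the residual-iteration variant above; the function name, tolerance, and iteration cap are my own choices rather than part of the note:

```python
import numpy as np

def steepest_gradient(A, b, x0, tol=1e-10, max_iter=10000):
    """A sketch of SG following the iteration formulas above."""
    x = x0.astype(float).copy()
    r = b - A @ x                   # r_0 = b - A x_0
    for _ in range(max_iter):
        Ar = A @ r                  # the single matrix-vector product per iteration
        alpha = (r @ r) / (r @ Ar)  # alpha_k = r_k^T r_k / (r_k^T A r_k)
        x = x + alpha * r           # x_{k+1} = x_k + alpha_k r_k
        r = r - alpha * Ar          # r_{k+1} = r_k - alpha_k A r_k
        if np.linalg.norm(r) < tol:
            break
    return x
```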
Conjugate Directions
SG often finds itself taking steps in similar directions. It would be better if we took a step and got it right the first time. An analogous example: a person stands at the top left corner of a 5 by 5 grid, aiming to reach the bottom right corner by moving either right or down a few cells in each step. We could move one cell at a time following a zig-zag path, or we could move to the bottom first and then to the end in just two steps. In short, we will take exactly one step for each direction ($n$ steps in total). Given $n$ directions $\{d_k\}$, the update formula is as follows:
$$ x_{k+1} = x_k + \alpha_k d_k $$
which is exactly the SG update formula with $r_k$ replaced by $d_k$.
These directions $\{d_k\}$ should be orthogonal to each other, and $e_{k+1}$ should be orthogonal to $d_k$; otherwise we would keep finding components aligned with previous directions in later updates. Notice that by subtracting $x^\star$ on both sides, Equation (8) becomes:
$$
\begin{aligned}
x_{k+1} - x^\star &= x_k - x^\star + \alpha_k d_k \\
e_{k+1} &= e_k + \alpha_k d_k
\end{aligned}
$$
With the orthogonality conditions and Equation (9), we have:
$$
\begin{aligned}
d_k^T e_{k+1} &= 0 \\
d_k^T (e_k + \alpha_k d_k) &= 0 \\
\alpha_k &= -\frac{d_k^T e_k}{d_k^T d_k}
\end{aligned}
$$
We do not know $e_k$ yet, so Equation (10) cannot be used to calculate $\alpha_k$. The solution is to use A-orthogonality instead of orthogonality. Two vectors $d_i$ and $d_j$ are A-orthogonal, or conjugate, if:
$$ d_i^T A d_j = 0 $$
Once again, to avoid stepping along previous directions, the new requirement is that $e_{k+1}$ be A-orthogonal to $d_k$. This condition is equivalent to finding the minimum point along the search direction $d_k$ with line search:
$$
\begin{aligned}
\frac{d f(x_{k+1})}{d \alpha_k} &= 0 \\
f'(x_{k+1}) \frac{d x_{k+1}}{d \alpha_k} &= 0 \\
-r_{k+1}^T d_k &= 0 \\
e_{k+1}^T A d_k &= 0
\end{aligned}
$$
By Equation (9), $\alpha_k$ is computed as:
$$
\begin{aligned}
e_{k+1}^T A d_k &= 0 \\
(e_k + \alpha_k d_k)^T A d_k &= 0 \\
\alpha_k &= -\frac{e_k^T A d_k}{d_k^T A d_k} = \frac{r_k^T d_k}{d_k^T A d_k}
\end{aligned}
$$
Similar to the SG method, the iterative formulas are as follows:
$$
\begin{aligned}
r_0 &= b - A x_0 \\
\alpha_k &= \frac{r_k^T d_k}{d_k^T A d_k} \\
x_{k+1} &= x_k + \alpha_k d_k \\
r_{k+1} &= r_k - \alpha_k A d_k
\end{aligned}
$$
We can prove that the error term converges exactly to zero after $n$ steps with these formulas. By Equation (9), we have $e_k = e_0 + \sum_{i=0}^{k-1} \alpha_i d_i$. Suppose $e_0 = \sum_{i=0}^{n-1} \delta_i d_i$ (the $\{d_k\}$ span the whole space); all the $\delta_i$ values can be found by multiplying the expression by $d_k^T A$:
$$
\begin{aligned}
d_k^T A e_0 &= \sum_{i=0}^{n-1} \delta_i \, d_k^T A d_i = \delta_k \, d_k^T A d_k \\
\delta_k &= \frac{d_k^T A e_0}{d_k^T A d_k} = \frac{d_k^T A \left(e_0 + \sum_{i=0}^{k-1} \alpha_i d_i\right)}{d_k^T A d_k} = \frac{d_k^T A e_k}{d_k^T A d_k} = -\frac{d_k^T r_k}{d_k^T A d_k} = -\alpha_k
\end{aligned}
$$
where $\delta_k$ is exactly equal to $-\alpha_k$; the $\sum_{i=0}^{k-1} \alpha_i d_i$ term can be added inside the numerator because each $d_i$ with $i < k$ is A-orthogonal to $d_k$.
This fact demonstrates that we eliminate one component of $e_0$ at each step. Consequently, the remaining error term $e_k$ is:
$$ e_k = e_0 + \sum_{i=0}^{k-1} \alpha_i d_i = \sum_{i=k}^{n-1} \delta_i d_i $$
Conjugate Gradient Method
The remaining question is how to compute a set of A-orthogonal directions $\{d_k\}$. Here we employ the Gram-Schmidt process. Suppose we have a set of $n$ linearly independent vectors $\{u_k\}$ and let $d_0 = u_0$; then $d_k$ is constructed as:
$$ d_k = u_k + \sum_{i=0}^{k-1} \beta_{k,i} d_i $$
where $\beta_{k,i}$ is computed as:
$$
\begin{aligned}
d_i^T A d_k &= d_i^T A \left(u_k + \sum_{j=0}^{k-1} \beta_{k,j} d_j\right) \\
0 &= d_i^T A u_k + \beta_{k,i} \, d_i^T A d_i \\
\beta_{k,i} &= -\frac{d_i^T A u_k}{d_i^T A d_i}
\end{aligned}
$$
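As a concrete illustration, here is a small numpy sketch of this Gram-Schmidt conjugation; the function name and the random test matrix are my own:

```python
import numpy as np

def gram_schmidt_conjugate(A, U):
    """Turn the linearly independent columns of U into A-orthogonal directions."""
    n, m = U.shape
    D = np.zeros((n, m))
    for k in range(m):
        d = U[:, k].astype(float).copy()
        for i in range(k):          # every previous direction is needed here
            beta = -(D[:, i] @ A @ U[:, k]) / (D[:, i] @ A @ D[:, i])
            d = d + beta * D[:, i]
        D[:, k] = d
    return D

# Quick check: D^T A D should be (numerically) diagonal, i.e. d_i^T A d_j = 0 for i != j.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)
D = gram_schmidt_conjugate(A, np.eye(4))
print(np.allclose(np.triu(D.T @ A @ D, k=1), 0.0))  # True
```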
The drawback of this construction is that it keeps all previous directions in memory to create a new one. CG employs a more efficient way of constructing these directions by initializing the vectors with the residuals, $u_k = r_k$. With this choice of starting vectors, we will see that most of the computation can be eliminated.
By Equations (17) and (18), $d_k$ is constructed as:
$$
\begin{aligned}
d_k &= r_k + \sum_{i=0}^{k-1} \beta_{k,i} d_i \\
\beta_{k,i} &= -\frac{d_i^T A r_k}{d_i^T A d_i}
\end{aligned}
$$
Let’s consider the numerator of $\beta_{k,i}$ with Equation (14):
$$ d_i^T A r_k = \frac{1}{\alpha_i} (r_i - r_{i+1})^T r_k = \frac{1}{\alpha_i} \left( r_i^T r_k - r_{i+1}^T r_k \right), \quad i = 0, 1, \ldots, k-1 $$
Luckily, we shall see that $r_k$ is orthogonal to all previous residuals. Referring to Equation (16), the error term is $e_k = \sum_{i=k}^{n-1} \delta_i d_i$; multiplying both sides by $-d_j^T A$ gives:
$$
\begin{aligned}
-d_j^T A e_k &= -\sum_{i=k}^{n-1} \delta_i \, d_j^T A d_i, \quad j < k \\
d_j^T r_k &= 0, \quad j < k
\end{aligned}
$$
Thus the residuals are orthogonal to all previous conjugate directions. If we choose the residuals as the starting vectors for constructing these directions, then by Equation (19) we can further conclude that the residuals are orthogonal to all previous residuals:
$$
\begin{aligned}
d_j^T r_k &= 0, \quad j < k \\
\left( r_j + \sum_{i=0}^{j-1} \beta_{j,i} d_i \right)^T r_k &= 0 \\
r_j^T r_k &= 0, \quad j < k
\end{aligned}
$$
By Equations (21) and (22), $\beta_{k,i}$ simplifies to:
$$
\beta_{k,i} =
\begin{cases}
\dfrac{1}{\alpha_{k-1}} \dfrac{r_k^T r_k}{d_{k-1}^T A d_{k-1}} & i = k-1 \\
0 & \text{otherwise}
\end{cases}
$$
so we only need to keep one previous direction in memory instead of all of them.
The method of CG is:
$$
\begin{aligned}
d_0 = r_0 &= b - A x_0 \\
\alpha_k &= \frac{r_k^T d_k}{d_k^T A d_k} = \frac{r_k^T r_k}{d_k^T A d_k} \\
x_{k+1} &= x_k + \alpha_k d_k \\
r_{k+1} &= r_k - \alpha_k A d_k \\
\beta_{k+1} &= \frac{1}{\alpha_k} \frac{r_{k+1}^T r_{k+1}}{d_k^T A d_k} = \frac{d_k^T A d_k}{r_k^T r_k} \cdot \frac{r_{k+1}^T r_{k+1}}{d_k^T A d_k} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} \\
d_{k+1} &= r_{k+1} + \beta_{k+1} d_k
\end{aligned}
$$
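The following numpy sketch follows these update formulas; the function name, the relative-residual stopping rule, and the iteration cap of $n$ are my own choices:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    """A sketch of CG following the update formulas above."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    r = b - A @ x                      # r_0 = b - A x_0
    d = r.copy()                       # d_0 = r_0
    rs_old = r @ r
    r0_norm = np.sqrt(rs_old)
    if r0_norm == 0.0:
        return x
    for _ in range(n):                 # at most n steps in exact arithmetic
        Ad = A @ d                     # one matrix-vector product per iteration
        alpha = rs_old / (d @ Ad)      # alpha_k = r_k^T r_k / (d_k^T A d_k)
        x = x + alpha * d              # x_{k+1} = x_k + alpha_k d_k
        r = r - alpha * Ad             # r_{k+1} = r_k - alpha_k A d_k
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * r0_norm:
            break
        beta = rs_new / rs_old         # beta_{k+1} = r_{k+1}^T r_{k+1} / (r_k^T r_k)
        d = r + beta * d               # d_{k+1} = r_{k+1} + beta_{k+1} d_k
        rs_old = rs_new
    return x
```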
Preconditioned Conjugate Gradient Method
The convergence analysis of CG shows that a small condition number $\kappa(A)$ improves the convergence rate. Preconditioning is a technique for reducing the condition number of a matrix. Consider a symmetric, positive-definite matrix $M$ that approximates $A$ but is easier to invert. Instead of solving $Ax=b$, we solve:
$$ M^{-1} A x = M^{-1} b $$
while ensuring that $\kappa(M^{-1} A) \ll \kappa(A)$. The problem is that $M^{-1} A$ is generally neither symmetric nor positive-definite, even though $M$ and $A$ are.
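A quick numpy check of this symmetry problem, with made-up matrices chosen purely for illustration:

```python
import numpy as np

# Even though both M and A below are symmetric positive-definite,
# their product M^{-1} A is generally not symmetric.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B @ B.T + 3 * np.eye(3)         # symmetric positive-definite
M = np.diag([1.0, 2.0, 3.0])        # symmetric positive-definite (diagonal)
MA = np.linalg.inv(M) @ A
print(np.allclose(MA, MA.T))        # False
```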
The solution is to decompose $M$ with the Cholesky decomposition. The Cholesky decomposition factors a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, denoted $M = L L^T$. We then solve the following equation:
$$
\begin{aligned}
L^{-1} A (L^{-1})^T L^T x &= L^{-1} b \\
L^{-1} A (L^{-1})^T \hat{x} &= \hat{b}
\end{aligned}
$$
where $\hat{x} = L^T x$ and $\hat{b} = L^{-1} b$. Notice that $L^{-1} A (L^{-1})^T$ is symmetric and positive-definite, allowing us to apply CG directly.
Now the method of PCG is:
$$
\begin{aligned}
\hat{d}_0 = \hat{r}_0 &= \hat{b} - L^{-1} A (L^{-1})^T \hat{x}_0 \\
\alpha_k &= \frac{\hat{r}_k^T \hat{r}_k}{\hat{d}_k^T L^{-1} A (L^{-1})^T \hat{d}_k} \\
\hat{x}_{k+1} &= \hat{x}_k + \alpha_k \hat{d}_k \\
\hat{r}_{k+1} &= \hat{r}_k - \alpha_k L^{-1} A (L^{-1})^T \hat{d}_k \\
\beta_{k+1} &= \frac{\hat{r}_{k+1}^T \hat{r}_{k+1}}{\hat{r}_k^T \hat{r}_k} \\
\hat{d}_{k+1} &= \hat{r}_{k+1} + \beta_{k+1} \hat{d}_k
\end{aligned}
$$
With some substitutions, we can further eliminate $L^{-1}$:
$$
\begin{aligned}
r_0 &= b - A x_0 \\
d_0 &= M^{-1} r_0 \\
\alpha_k &= \frac{r_k^T M^{-1} r_k}{d_k^T A d_k} \\
x_{k+1} &= x_k + \alpha_k d_k \\
r_{k+1} &= r_k - \alpha_k A d_k \\
\beta_{k+1} &= \frac{r_{k+1}^T M^{-1} r_{k+1}}{r_k^T M^{-1} r_k} \\
d_{k+1} &= M^{-1} r_{k+1} + \beta_{k+1} d_k
\end{aligned}
$$
$M$ is called a preconditioner, and the effectiveness of PCG depends on finding an $M$ that approximates $A$ well enough without making the computation of $M^{-1} r_k$ too costly. In fact, it is unnecessary to know the explicit form of $M^{-1}$. Instead, we directly compute $M^{-1} r_k$, which can be viewed as an operator that takes $r_k$ as input and achieves the same effect as the matrix-vector product. PCG implementations benefit from this operator view, which extends the method to more general settings.
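Here is a minimal numpy sketch of PCG written against such an operator; `apply_Minv`, the stopping rule, and the Jacobi example at the end are my own choices for illustration:

```python
import numpy as np

def preconditioned_cg(A, b, apply_Minv, x0=None, tol=1e-10):
    """A sketch of PCG; apply_Minv(r) returns M^{-1} r without forming M^{-1}."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    r = b - A @ x                      # r_0 = b - A x_0
    z = apply_Minv(r)                  # M^{-1} r_0
    d = z.copy()                       # d_0 = M^{-1} r_0
    rz_old = r @ z                     # r_k^T M^{-1} r_k
    r0_norm = np.linalg.norm(r)
    if r0_norm == 0.0:
        return x
    for _ in range(n):
        Ad = A @ d
        alpha = rz_old / (d @ Ad)      # alpha_k = r_k^T M^{-1} r_k / (d_k^T A d_k)
        x = x + alpha * d
        r = r - alpha * Ad
        if np.linalg.norm(r) < tol * r0_norm:
            break
        z = apply_Minv(r)
        rz_new = r @ z
        beta = rz_new / rz_old         # beta_{k+1} = r_{k+1}^T M^{-1} r_{k+1} / (r_k^T M^{-1} r_k)
        d = z + beta * d               # d_{k+1} = M^{-1} r_{k+1} + beta_{k+1} d_k
        rz_old = rz_new
    return x

# Example: a diagonal (Jacobi) preconditioner only needs element-wise division.
# x = preconditioned_cg(A, b, lambda r: r / np.diag(A))
```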
Implementation Considerations
Starting point
If you have an estimate of $x$, use it as the starting point $x_0$; otherwise use the zero vector.
Stopping criterion
Theoretically, CG/PCG finds the exact solution after at most $n$ steps, at which point the residual becomes zero. The number of iterations should therefore never exceed $n$; continuing past that point would cause a division-by-zero error. In practice, we stop the algorithm as soon as the residual falls below a specified threshold. A common criterion is $\|r_k\|_2 < \epsilon \|r_0\|_2$.
The accumulated roundoff error in the residual iteration formula may yield a false zero residual. To address this issue, once the iterated residual falls below the stopping value, we recalculate the residual using its definition (Equation (2)) and double-check whether it still falls below the stopping threshold. If not, we restart the iteration.
Matlab’s implementation does a little more than that. In each iteration it checks whether the current step is small enough, using the criterion $\|\alpha_k d_k\|_2 < \epsilon \|x_k\|_2$. If this holds three times, it concludes that the current solution is stuck and terminates the iteration with warning messages.
Choice of Preconditioner
$M = A$ is an ideal but impractical preconditioner: applying $M^{-1}$ amounts to solving $Mx = b$, which is the original problem. One simple preconditioner is a diagonal matrix whose diagonal elements are identical to those of $A$ (the Jacobi preconditioner).
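If you use scipy, its `scipy.sparse.linalg.cg` accepts a preconditioner through the `M` argument as a `LinearOperator`. Here is a sketch with a diagonal preconditioner; the test matrix is made up for illustration:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n = 100
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)             # symmetric positive-definite test matrix
b = rng.standard_normal(n)

diag = np.diag(A)
M = LinearOperator((n, n), matvec=lambda r: r / diag)  # applies M^{-1} r

x, info = cg(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))  # info == 0 indicates convergence
```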
Another preconditioner is the incomplete Cholesky factorization, a sparse approximation of the Cholesky decomposition that sets elements to zero wherever the corresponding elements of $A$ are zero. This preserves the sparsity of $A$, making it suitable for problems with sparse matrices. Yet I haven’t found such a function in the numpy and scipy packages. I’ll look into it when I have time, referring to the wiki and Julia’s code.