This weekend I learned the Preconditioned Conjugate Gradient method from Jonathan Richard Shewchuk’s lecture note “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain”. Here I document what I have learned from the note.

TL;DR

In short, all methods mentioned in this post iteratively solve the linear system $\mathbf{A} \mathbf{x} = \mathbf{b}$, assuming $\mathbf{A}$ is symmetric and positive-definite. The Steepest Gradient Method (SG), which moves along the direction of the residual (the negative gradient) at each step, converges slowly and requires many matrix-vector multiplications. SG can be improved by moving along a conjugate direction instead of the residual direction at each step, which converges faster and reduces the number of matrix-vector multiplications. However, this requires constructing the conjugate directions in advance, which can cost as much as running SG itself. The Conjugate Gradient Method (CG) is an efficient algorithm that computes conjugate directions on the fly from the residuals, making it a practical method for solving such systems. Finally, the Preconditioned Conjugate Gradient Method (PCG) further improves the convergence rate by reducing the condition number of $\mathbf{A}$.

Steepest Gradient Method

Consider the following minimization problem:

\begin{equation} \begin{split} \argmin_{\mathbf{x} \in \mathbb{R}^{n}} &\ \frac{1}{2} \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T\mathbf{x} + c \end{split} \end{equation}

where $\mathbf{A} \in \mathbb{R}^{n \times n}$ is symmetric and positive-definite. Since this is a convex problem, the global minimizer exists and can be found by setting the gradient $f'(\mathbf{x}) = \mathbf{x}^T\mathbf{A} - \mathbf{b}^T \in \mathbb{R}^{1 \times n}$ to zero. Solving the minimization problem above is therefore equivalent to solving the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$.

Recall that the residual $\mathbf{r}_k$ and the error $\mathbf{e}_k$ are defined as follows:

\begin{equation} \begin{split} \mathbf{r}_k &= \mathbf{b} - \mathbf{A} \mathbf{x}_k \\ \mathbf{e}_k &= \mathbf{x}_k - \mathbf{x}^{\star}\\ \end{split} \end{equation}

where $\mathbf{x}^{\star}$ is the solution of $\mathbf{A}\mathbf{x} = \mathbf{b}$. The residual and the error are related by:

\begin{equation} \begin{split} \mathbf{A} \mathbf{e}_k &= \mathbf{A} \left( \mathbf{x}_k - \mathbf{x}^{\star} \right)\\ &= \mathbf{A} \mathbf{x}_k - \mathbf{b} + \mathbf{b} - \mathbf{A} \mathbf{x}^{\star}\\ &= -\mathbf{r}_{k} \end{split} \end{equation}

With these notations, SG updates each point as follows:

\begin{equation} \begin{split} \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{r}_k \\ \end{split} \end{equation}

where $\alpha_k$ is the step size at step k and $\mathbf{r}_k$ equals the negative gradient. SG uses line search to determine how big a step to take. By the chain rule, $\frac{d}{d\alpha_k} f(\mathbf{x}_{k+1}) = f'(\mathbf{x}_{k+1}) \frac{d}{d\alpha_k}\mathbf{x}_{k+1} = -\mathbf{r}_{k+1}^T \mathbf{r}_k$; setting this derivative to zero, i.e. $(\mathbf{r}_k - \alpha_k \mathbf{A}\mathbf{r}_k)^T \mathbf{r}_k = 0$, gives the following $\alpha_k$:

\begin{equation} \begin{split} \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{r}_k^T \mathbf{A} \mathbf{r}_k} \\ \end{split} \end{equation}

The SG iteration is:

\begin{equation} \begin{split} \mathbf{r}_0 &= \mathbf{b} - \mathbf{A}\mathbf{x}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{r}_k^T \mathbf{A} \mathbf{r}_k} \\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{r}_k \\ \mathbf{r}_{k+1} &= \mathbf{b} - \mathbf{A} \mathbf{x}_{k+1} \end{split} \end{equation}

or, eliminating one matrix-vector multiplication ($\mathbf{A} \mathbf{x}_{k+1}$), by iterating the residual as follows:

\begin{equation} \begin{split} \mathbf{r}_0 &= \mathbf{b} - \mathbf{A}\mathbf{x}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{r}_k^T \mathbf{A} \mathbf{r}_k} \\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{r}_k \\ \mathbf{r}_{k+1} &= \mathbf{r}_k - \alpha_k \mathbf{A} \mathbf{r}_k \\ \end{split} \end{equation}
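
To make the recurrence concrete, here is a minimal NumPy sketch of SG (the function name and the relative-residual stopping rule are my own additions, not from the note); it assumes $\mathbf{A}$ is symmetric positive-definite:

```python
import numpy as np

def steepest_descent(A, b, x0=None, tol=1e-8, max_iter=10_000):
    """Solve A x = b for symmetric positive-definite A with the SG iteration (Equation (7))."""
    x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
    r = b - A @ x                      # r_0 = b - A x_0
    stop = tol * np.linalg.norm(r)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= stop:
            break
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)     # exact line search along the residual
        x = x + alpha * r
        r = r - alpha * Ar             # residual recurrence: avoids a second A @ x
    return x

# usage on a small SPD system
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(steepest_descent(A, b))          # approx. [2, -2]
```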

Conjugate Directions

SG often finds itself repeatedly taking steps in similar directions. It would be better if we took a step and got it right the first time. An analogous example: a person stands at the top-left corner of a 5-by-5 grid, aiming to reach the bottom-right corner by moving either right or down some number of cells in each step. We could move one cell at a time following a zig-zag path, or we could move all the way down first and then all the way right, reaching the goal in just two steps. In short, we will take exactly one step for each direction (n steps in total). Given n directions $\{\mathbf{d}_k\}$, the update formula is as follows:

\begin{equation} \begin{split} \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{d}_k \\ \end{split} \end{equation}

which is exactly the update formula of SG with $\mathbf{r}_k$ replaced by $\mathbf{d}_k$.

These directions $\{\mathbf{d}_k\}$ should be orthogonal to each other, and $\mathbf{e}_{k+1}$ should be orthogonal to $\mathbf{d}_{k}$; otherwise, later updates would keep revisiting components along previous directions. Notice that by subtracting $\mathbf{x}^{\star}$ from both sides, Equation (8) becomes:

\begin{equation} \begin{split} \mathbf{x}_{k+1} - \mathbf{x}^{\star} &= \mathbf{x}_k - \mathbf{x}^{\star} + \alpha_k \mathbf{d}_k \\ \mathbf{e}_{k+1} &= \mathbf{e}_{k} + \alpha_k \mathbf{d}_k \end{split} \end{equation}

With the orthogonality conditions and Equation (9), we have:

\begin{equation} \begin{split} \mathbf{d}_{k}^T \mathbf{e}_{k+1} &= 0\\ \mathbf{d}_{k}^T \left(\mathbf{e}_{k} + \alpha_k \mathbf{d}_k\right) &= 0\\ \alpha_k &= -\frac{\mathbf{d}_{k}^T \mathbf{e}_{k}}{\mathbf{d}_{k}^T\mathbf{d}_{k}} \end{split} \end{equation}

We do not know $\mathbf{e}_k$ yet, so Equation (10) cannot be used to calculate $\alpha_k$. The solution is to require A-orthogonality instead of orthogonality. Two vectors $\mathbf{d}_i$ and $\mathbf{d}_j$ are A-orthogonal, or conjugate, if:

\begin{equation} \begin{split} \mathbf{d}_{i}^T \mathbf{A} \mathbf{d}_{j} &= 0\\ \end{split} \end{equation}

Once again, to avoid stepping into previous directions, the new requirement is that $\mathbf{e}_{k+1}$ be A-orthogonal to $\mathbf{d}_k$. This condition is equivalent to finding the minimum point along the search direction $\mathbf{d}_{k}$ with a line search:

\begin{equation} \begin{split} \frac{d}{d\alpha_k} f(\mathbf{x}_{k+1}) &= 0\\ f'(\mathbf{x}_{k+1}) \frac{d}{d\alpha_k}\mathbf{x}_{k+1} &= 0\\ -\mathbf{r}_{k+1}^T \mathbf{d}_k &= 0\\ \mathbf{e}_{k+1}^T \mathbf{A} \mathbf{d}_k &= 0 \end{split} \end{equation}

By Equation (9), $\alpha_k$ is computed as:

\begin{equation} \begin{split} \mathbf{e}_{k+1}^T \mathbf{A} \mathbf{d}_k &= 0\\ \left( \mathbf{e}_k + \alpha_k \mathbf{d}_k \right)^T \mathbf{A} \mathbf{d}_k &= 0\\ \alpha_k &= - \frac{\mathbf{e}_k^T \mathbf{A} \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}\\ &= \frac{\mathbf{r}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k} \end{split} \end{equation}

Similar to the SG method, the iterative formulas are as follows:

\begin{equation} \begin{split} \mathbf{r}_0 &= \mathbf{b} - \mathbf{A}\mathbf{x}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}\\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{d}_k\\ \mathbf{r}_{k+1} &= \mathbf{r}_k - \alpha_k \mathbf{A} \mathbf{d}_k\\ \end{split} \end{equation}
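
As a NumPy sketch of this method of conjugate directions, assume a set of A-orthogonal directions is already available (how to construct one is the subject of the next sections); the function name and the hand-picked directions in the usage example are my own illustration:

```python
import numpy as np

def conjugate_directions_solve(A, b, D, x0=None):
    """Take exactly one step along each A-orthogonal column of D (the updates in Equation (14))."""
    x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    for k in range(D.shape[1]):
        d = D[:, k]
        Ad = A @ d
        alpha = (r @ d) / (d @ Ad)     # alpha_k = r_k^T d_k / d_k^T A d_k
        x = x + alpha * d
        r = r - alpha * Ad
    return x                           # exact solution after n steps, in exact arithmetic

# usage: two hand-picked A-orthogonal directions for a small SPD system
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
D = np.array([[1.0, -2.0 / 3.0],
              [0.0, 1.0]])             # columns satisfy d_0^T A d_1 = 0
print(conjugate_directions_solve(A, b, D))   # approx. [2, -2]
```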

We can prove that the error converges exactly to zero after n steps with these formulas. By Equation (9), we have $\mathbf{e}_k = \mathbf{e}_0 + \Sigma_{i=0}^{k-1} \alpha_i \mathbf{d}_i$. Suppose $\mathbf{e}_0 = \Sigma_{i=0}^{n-1} \delta_i \mathbf{d}_i$ (the directions $\{\mathbf{d}_k\}$ span the whole space); all the $\delta_i$ values can be found by multiplying the expression by $\mathbf{d}_k^T\mathbf{A}$:

\begin{equation} \begin{split} \mathbf{d}_k^T \mathbf{A} \mathbf{e}_0 &= \Sigma_{i=0}^{n-1} \delta_i \mathbf{d}_k^T\mathbf{A}\mathbf{d}_i\\ &= \delta_k \mathbf{d}_k^T\mathbf{A}\mathbf{d}_k\\ \delta_k &= \frac{\mathbf{d}_k^T \mathbf{A} \mathbf{e}_0}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}\\ &= \frac{\mathbf{d}_k^T \mathbf{A} \left(\mathbf{e}_0 + \Sigma_{i=0}^{k-1}\alpha_i \mathbf{d}_i \right)}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}\\ &= \frac{\mathbf{d}_k^T \mathbf{A} \mathbf{e}_k}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}\\ &= - \frac{\mathbf{d}_k^T \mathbf{r}_k}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}\\ &= - \alpha_k \end{split} \end{equation}

where the terms added to the numerator in the fourth line vanish by A-orthogonality, so $\delta_k$ is exactly $-\alpha_k$.

This fact demonstrates that we eliminate one component of $\mathbf{e}_0$ at each step. Consequently, the remaining error $\mathbf{e}_k$ is:

\begin{equation} \begin{split} \mathbf{e}_k &= \mathbf{e}_0 + \Sigma_{i=0}^{k-1} \alpha_i \mathbf{d}_i\\ &= \Sigma_{i=k}^{n-1} \delta_i \mathbf{d}_i \end{split} \end{equation}

Conjugate Gradient Method

The remaining question is how to compute a set of A-orthogonal directions $\{\mathbf{d}_k\}$. Here we employ the Gram-Schmidt process. Suppose we have a set of n linearly independent vectors $\{ \mathbf{u}_k \}$; let $\mathbf{d}_0 = \mathbf{u}_0$, and construct $\mathbf{d}_k$ as:

\begin{equation} \begin{split} \mathbf{d}_k = \mathbf{u}_k + \Sigma_{i=0}^{k-1} \beta_{k,i} \mathbf{d}_{i} \end{split} \end{equation}

where $\beta_{k,i}$ is computed as:

\begin{equation} \begin{split} \mathbf{d}_i^T\mathbf{A}\mathbf{d}_k &= \mathbf{d}_i^T\mathbf{A}\left(\mathbf{u}_k + \Sigma_{j=0}^{k-1} \beta_{k,j} \mathbf{d}_{j}\right)\\ 0 &= \mathbf{d}_i^T\mathbf{A}\mathbf{u}_k + \beta_{k,i} \mathbf{d}_i^T\mathbf{A}\mathbf{d}_i\\ \beta_{k,i} &= - \frac{\mathbf{d}_i^T\mathbf{A}\mathbf{u}_k}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i} \end{split} \end{equation}
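
A small NumPy sketch of this Gram-Schmidt conjugation (my own illustration of Equations (17) and (18)): it A-orthogonalizes the columns of a matrix of linearly independent vectors, and note that every previously computed direction is needed to build the next one.

```python
import numpy as np

def conjugate_directions(A, U):
    """Gram-Schmidt conjugation (Equations (17)-(18)): make the columns of U A-orthogonal."""
    n = U.shape[1]
    D = np.array(U, dtype=float)                   # D[:, k] starts as u_k
    for k in range(n):
        for i in range(k):                         # every previous direction is required
            beta = -(D[:, i] @ A @ U[:, k]) / (D[:, i] @ A @ D[:, i])
            D[:, k] += beta * D[:, i]
    return D

# usage: conjugate the standard basis with respect to a small SPD matrix
A = np.array([[3.0, 2.0], [2.0, 6.0]])
D = conjugate_directions(A, np.eye(2))
print(D[:, 0] @ A @ D[:, 1])                       # ~0: the columns are A-orthogonal
```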

The drawback of this construction process is that it must keep all previous directions in memory to create a new one. CG constructs these directions more efficiently by initializing the vectors with the residuals, $\mathbf{u}_k = \mathbf{r}_k$. With this choice of starting vectors, we will see that most of the computation can be eliminated.

By Equations (17) and (18), $\mathbf{d}_k$ is constructed as:

\begin{equation} \begin{split} \mathbf{d}_k &= \mathbf{r}_k + \Sigma_{i=0}^{k-1} \beta_{k,i} \mathbf{d}_{i}\\ \beta_{k,i} &= - \frac{\mathbf{d}_i^T\mathbf{A}\mathbf{r}_k}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i} \end{split} \end{equation}

Let’s consider the numerator of $\beta_{k,i}$ with Equation (14):

\begin{equation} \begin{split} \mathbf{d}_i^T\mathbf{A}\mathbf{r}_k &= \frac{1}{\alpha_i}(\mathbf{r}_i - \mathbf{r}_{i+1})^T \mathbf{r}_k\\ &= \frac{1}{\alpha_i}(\mathbf{r}_i^T\mathbf{r}_k - \mathbf{r}_{i+1}^T\mathbf{r}_k), \quad i=0,1,\dots,k-1 \end{split} \end{equation}

Luckily, we shall see that $\mathbf{r}_k$ is orthogonal to the previous residuals. Referring to Equation (16), the error is $\mathbf{e}_k = \Sigma_{i=k}^{n-1} \delta_i \mathbf{d}_i$; multiplying both sides by $-\mathbf{d}_j^T \mathbf{A}$ gives:

\begin{equation} \begin{split} -\mathbf{d}_j^T \mathbf{A} \mathbf{e}_k &= -\Sigma_{i=k}^{n-1} \delta_i \mathbf{d}_j^T \mathbf{A} \mathbf{d}_i, \quad j<k\\ \mathbf{d}_j^T \mathbf{r}_k &= 0, \quad j<k \end{split} \end{equation}

Thus the residuals are orthogonal to the previous conjugate directions. Since we chose the residuals as the starting vectors for constructing these directions, Equation (19) lets us further conclude that each residual is orthogonal to all previous residuals:

\begin{equation} \begin{split} \mathbf{d}_j^T \mathbf{r}_k &= 0, \quad j<k\\ \left(\mathbf{r}_j + \Sigma_{i=0}^{j-1} \beta_{j,i} \mathbf{d}_{i}\right)^T\mathbf{r}_k &= 0\\ \mathbf{r}_j^T \mathbf{r}_k &= 0, \quad j<k \end{split} \end{equation}

By Equations (21) and (22), $\beta_{k,i}$ simplifies to:

\begin{equation} \begin{split} \beta_{k,i} = \begin{cases} \frac{1}{\alpha_{k-1}} \frac{\mathbf{r}_k^T\mathbf{r}_k}{\mathbf{d}_{k-1}^T\mathbf{A}\mathbf{d}_{k-1}} & i = k-1\\ 0 & \text{otherwise}\\ \end{cases} \end{split} \end{equation}

so we only need to keep the most recent direction instead of all previous ones.

The method of CG is:

\begin{equation} \begin{split} \mathbf{d}_0 = \mathbf{r}_0 &= \mathbf{b} - \mathbf{A}\mathbf{x}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k} = \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}\\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{d}_k\\ \mathbf{r}_{k+1} &= \mathbf{r}_k - \alpha_k \mathbf{A} \mathbf{d}_k\\ \beta_{k+1} &= \frac{1}{\alpha_{k}} \frac{\mathbf{r}_{k+1}^T\mathbf{r}_{k+1}}{\mathbf{d}_{k}^T\mathbf{A}\mathbf{d}_{k}}\\ &= \frac{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}{\mathbf{r}_k^T \mathbf{r}_k}\frac{\mathbf{r}_{k+1}^T\mathbf{r}_{k+1}}{\mathbf{d}_{k}^T\mathbf{A}\mathbf{d}_{k}}\\ &= \frac{\mathbf{r}_{k+1}^T\mathbf{r}_{k+1}}{\mathbf{r}_k^T \mathbf{r}_k}\\ \mathbf{d}_{k+1} &= \mathbf{r}_{k+1} + \beta_{k+1} \mathbf{d}_{k}\\ \end{split} \end{equation}
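
Putting these update rules together, a compact NumPy sketch of CG might look as follows (my own naming; a relative-residual stopping test is added, as discussed in the implementation notes below):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A with CG (Equation (24))."""
    max_iter = b.shape[0] if max_iter is None else max_iter
    x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    d = r.copy()
    rs_old = r @ r
    stop = tol * np.sqrt(rs_old)
    for _ in range(max_iter):
        if np.sqrt(rs_old) <= stop:
            break
        Ad = A @ d
        alpha = rs_old / (d @ Ad)          # alpha_k = r_k^T r_k / d_k^T A d_k
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs_old) * d      # beta_{k+1} = r_{k+1}^T r_{k+1} / r_k^T r_k
        rs_old = rs_new
    return x

# usage on a small SPD system
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b))            # approx. [2, -2], in at most n = 2 iterations
```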

Preconditioned Conjugate Gradient Method

The complexity analysis of CG shows that a smaller condition number $\kappa(\mathbf{A})$ improves the convergence rate. Preconditioning is one technique for reducing the condition number of a matrix. Consider a symmetric, positive-definite matrix $\mathbf{M}$ that approximates $\mathbf{A}$ but is easier to invert. Instead of solving $\mathbf{A}\mathbf{x} = \mathbf{b}$, we solve:

\begin{equation} \begin{split} \mathbf{M}^{-1}\mathbf{A}\mathbf{x} = \mathbf{M}^{-1}\mathbf{b} \end{split} \end{equation}

by ensuring that $\kappa(\mathbf{M}^{-1}\mathbf{A}) \ll \kappa(\mathbf{A})$. The problem is that $\mathbf{M}^{-1}\mathbf{A}$ is generally neither symmetric nor positive-definite, even though $\mathbf{M}$ and $\mathbf{A}$ are.

The solution is to decompose $\mathbf{M}$ with the Cholesky decomposition, which factors a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, denoted $\mathbf{M}=\mathbf{L}\mathbf{L}^T$. We then solve the following equation:

\begin{equation} \begin{split} \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T \mathbf{L}^T \mathbf{x} &= \mathbf{L}^{-1}\mathbf{b}\\ \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T \hat{\mathbf{x}} &= \hat{\mathbf{b}}\\ \end{split} \end{equation}

where $\hat{\mathbf{x}} = \mathbf{L}^T \mathbf{x}$ and $\hat{\mathbf{b}} = \mathbf{L}^{-1}\mathbf{b}$. Notice that $\mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T$ is symmetric and positive-definite, allowing us to apply CG directly.

Now the method of PCG, in the transformed variables, is:

\begin{equation} \begin{split} \hat{\mathbf{d}}_0 = \hat{\mathbf{r}}_0 &= \hat{\mathbf{b}} - \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T\hat{\mathbf{x}}_0\\ \alpha_k &= \frac{\hat{\mathbf{r}}_k^T \hat{\mathbf{r}}_k}{\hat{\mathbf{d}}_k^T \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T \hat{\mathbf{d}}_k}\\ \hat{\mathbf{x}}_{k+1} &= \hat{\mathbf{x}}_k + \alpha_k \hat{\mathbf{d}}_k\\ \hat{\mathbf{r}}_{k+1} &= \hat{\mathbf{r}}_k - \alpha_k \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T \hat{\mathbf{d}}_k\\ \beta_{k+1} &= \frac{\hat{\mathbf{r}}_{k+1}^T\hat{\mathbf{r}}_{k+1}}{\hat{\mathbf{r}}_k^T \hat{\mathbf{r}}_k}\\ \hat{\mathbf{d}}_{k+1} &= \hat{\mathbf{r}}_{k+1} + \beta_{k+1} \hat{\mathbf{d}}_{k}\\ \end{split} \end{equation}

With the substitutions $\hat{\mathbf{r}}_k = \mathbf{L}^{-1}\mathbf{r}_k$ and $\hat{\mathbf{d}}_k = \mathbf{L}^{T}\mathbf{d}_k$, we can further eliminate $\mathbf{L}^{-1}$:

\begin{equation} \begin{split} \mathbf{r}_0 &= \mathbf{b} - \mathbf{A} \mathbf{x}_0\\ \mathbf{d}_0 &= \mathbf{M}^{-1} \mathbf{r}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{M}^{-1} \mathbf{r}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}\\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{d}_k\\ \mathbf{r}_{k+1} &= \mathbf{r}_k - \alpha_k \mathbf{A} \mathbf{d}_k\\ \beta_{k+1} &= \frac{\mathbf{r}_{k+1}^T \mathbf{M}^{-1} \mathbf{r}_{k+1}}{\mathbf{r}_k^T \mathbf{M}^{-1} \mathbf{r}_k}\\ \mathbf{d}_{k+1} &= \mathbf{M}^{-1} \mathbf{r}_{k+1} + \beta_{k+1} \mathbf{d}_{k}\\ \end{split} \end{equation}
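
A NumPy sketch of these untransformed PCG updates, where `apply_Minv` is a name I introduce for any routine that applies $\mathbf{M}^{-1}$ to a vector:

```python
import numpy as np

def pcg(A, b, apply_Minv, x0=None, tol=1e-10, max_iter=None):
    """Preconditioned CG (Equation (28)); apply_Minv(v) should return M^{-1} v."""
    max_iter = b.shape[0] if max_iter is None else max_iter
    x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    z = apply_Minv(r)                  # M^{-1} r_0
    d = z.copy()
    rz_old = r @ z
    stop = tol * np.linalg.norm(r)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= stop:
            break
        Ad = A @ d
        alpha = rz_old / (d @ Ad)      # r_k^T M^{-1} r_k / d_k^T A d_k
        x = x + alpha * d
        r = r - alpha * Ad
        z = apply_Minv(r)
        rz_new = r @ z
        d = z + (rz_new / rz_old) * d  # beta_{k+1} = r_{k+1}^T M^{-1} r_{k+1} / r_k^T M^{-1} r_k
        rz_old = rz_new
    return x

# usage with a Jacobi (diagonal) preconditioner
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(pcg(A, b, lambda v: v / np.diag(A)))    # approx. [2, -2]
```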

$\mathbf{M}$ is called a preconditioner, and the effectiveness of PCG depends on finding a preconditioner that approximates $\mathbf{A}$ well without making $\mathbf{M}^{-1}$ too costly to apply. In fact, it is unnecessary to know the explicit form of $\mathbf{M}^{-1}$; we only need to compute $\mathbf{M}^{-1}\mathbf{r}_k$, which can be viewed as applying an operator to $\mathbf{r}_k$ that achieves the same effect as the matrix-vector product (equivalently, solving $\mathbf{M}\mathbf{z} = \mathbf{r}_k$). Implementations of PCG benefit from this operator view, which extends to more general cases.
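
For instance, SciPy’s `scipy.sparse.linalg.cg` accepts a preconditioner `M` that only needs to provide a matrix-vector product, so $\mathbf{M}^{-1}$ can be passed as a `LinearOperator` wrapping whatever solve routine is available. A sketch, using the diagonal (Jacobi) preconditioner as a stand-in (details of the SciPy signature may vary between versions):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

# The preconditioner is passed as an operator: only M^{-1} v is ever needed.
d_inv = 1.0 / np.diag(A)
apply_Minv = LinearOperator(A.shape, matvec=lambda v: d_inv * np.ravel(v))

x, info = cg(A, b, M=apply_Minv)
print(x, info)                          # info == 0 means convergence
```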

Implementation Considerations

Starting point

If you have an estimate of $\mathbf{x}$, use it as the starting point $\mathbf{x}_0$; otherwise use the zero vector.

Stopping criterion

In exact arithmetic, CG/PCG finds the solution when the residual becomes zero within n steps. Therefore, the maximum number of iterations should never exceed n; otherwise a division-by-zero error would occur. In practice, we stop the algorithm as soon as the residual falls below a specified threshold. A common criterion is $\|\mathbf{r}_k\|_2 \lt \epsilon \|\mathbf{r}_0\|_2$.

The accumulated roundoff error in the residual recurrence may yield a falsely small residual. To address this issue, once the residual falls below the stopping threshold, we recalculate it from the definition (Equation (2)) and double-check that it still falls below the threshold. If not, we restart the iteration.
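
This safeguard can be packaged as a small helper, shown below as a sketch that follows the naming of the CG sketch above (`stop` is the precomputed stopping value $\epsilon\|\mathbf{r}_0\|_2$):

```python
import numpy as np

def check_converged(A, b, x, r, stop):
    """If the recurrence residual r looks converged, verify it against the true residual."""
    if np.linalg.norm(r) > stop:
        return False, r
    r_true = b - A @ x                 # recompute from the definition (Equation (2))
    return np.linalg.norm(r_true) <= stop, r_true
```

Inside the loop, the plain norm test would be replaced by `done, r = check_converged(A, b, x, r, stop)`, breaking only when `done` is true; returning the true residual also restarts the recurrence from an accurate value.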

Matlab’s implementation does a little more than this recalculation check: at each iteration it checks whether the current update is small with the criterion $\|\alpha_k \mathbf{d}_k\|_2 \lt \epsilon \|\mathbf{x}_k\|_2$. If this happens three times, it concludes that the iteration has stagnated and terminates with a warning message.

Choice of Preconditioner

$\mathbf{M} = \mathbf{A}$ is an ideal but impractical preconditioner: applying $\mathbf{M}^{-1}$ is then equivalent to solving the original system $\mathbf{M}\mathbf{x}=\mathbf{b}$. One simple preconditioner is a diagonal matrix whose diagonal elements are identical to those of $\mathbf{A}$ (the Jacobi preconditioner).

Another preconditioner is the incomplete Cholesky factorization, a sparse approximation of the Cholesky decomposition obtained by setting an element to zero whenever the corresponding element of $\mathbf{A}$ is zero. This preserves the sparsity of $\mathbf{A}$, making it suitable for problems with sparse matrices. Yet I haven’t found such a function in the numpy and scipy packages. I’ll look into it when I have time, referring to the wiki and Julia’s code.