Preconditioned Conjugate Gradient Method
This weekend I learned the Preconditioned Conjugate Gradient method from Jonathan Richard Shewchuk’s note “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain”. Here I document what I have learned from it.
TL;DR
In short, all methods mentioned in this post iteratively solve the linear system $\mathbf{A} \mathbf{x} = \mathbf{b}$, assuming $\mathbf{A}$ is symmetric and positive-definite. The Steepest Gradient method (SG), which moves along the direction of the residual, i.e. the negative gradient, at each step, converges slowly and requires many matrix-vector multiplications. SG can be improved by moving along conjugate directions instead of residual directions, which converges in at most n steps and reduces the number of matrix-vector multiplications. However, this requires constructing the conjugate directions in advance, which can cost as much as running SG itself. The Conjugate Gradient method (CG) computes the conjugate directions on the fly from the residuals, making it a practical way to solve such systems. Finally, the Preconditioned Conjugate Gradient method (PCG) further improves the convergence rate by reducing the condition number of $\mathbf{A}$.
Steepest Gradient Method
Considering the following minimization problem:
\[\DeclareMathOperator*{\argmin}{arg\,min} \begin{equation} \begin{split} \argmin_{\mathbf{x} \in \mathbb{R}^{n}} &\ \frac{1}{2} \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T\mathbf{x} + c \end{split} \end{equation}\]where $\mathbf{A} \in \mathbb{R}^{n \times n}$ is symmetric and positive-definite. Since this is a convex problem, the global minimizer exists and can be derived by setting the gradient $f'(\mathbf{x}) = \mathbf{x}^T\mathbf{A} - \mathbf{b}^T \in \mathbb{R}^{1 \times n}$ to zero. Solving the minimization problem above is therefore equivalent to solving the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$.
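To make the equivalence concrete, here is a small numpy sanity check (a sketch with a randomly generated SPD matrix; the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive-definite matrix A and a vector b.
n = 5
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)   # SPD by construction
b = rng.standard_normal(n)

def f(x):
    # The quadratic form f(x) = 1/2 x^T A x - b^T x (with c = 0).
    return 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)      # solution of A x = b
grad = A @ x_star - b               # gradient of f at x_star

assert np.allclose(grad, 0)         # gradient vanishes at the solution
# Any perturbation increases f, confirming x_star is the minimizer.
for _ in range(100):
    assert f(x_star + 0.1 * rng.standard_normal(n)) >= f(x_star)
```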
Recall that the residual $\mathbf{r}_k$ and the error $\mathbf{e}_k$ are defined as follows:
\[\begin{equation} \begin{split} \mathbf{r}_k &= \mathbf{b} - \mathbf{A} \mathbf{x}_k \\ \mathbf{e}_k &= \mathbf{x}_k - \mathbf{x}^{\star}\\ \end{split} \label{eq:2} \end{equation}\]where $\mathbf{x}^{\star}$ is the solution of $\mathbf{A}\mathbf{x} = \mathbf{b}$. The residual and the error term are related by:
\[\begin{equation} \begin{split} \mathbf{A} \mathbf{e}_k &= \mathbf{A} \left( \mathbf{x}_k - \mathbf{x}^{\star} \right)\\ &= \mathbf{A} \mathbf{x}_k - \mathbf{b} + \mathbf{b} - \mathbf{A} \mathbf{x}^{\star}\\ &= -\mathbf{r}_{k} \end{split} \end{equation}\]With these notations, the SG method updates each point as follows:
\[\begin{equation} \begin{split} \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{r}_k \\ \end{split} \end{equation}\]where $\alpha_k$ is the step size at step k and $\mathbf{r}_k$ is the negative gradient. SG uses **line search** to determine how big a step to take. By the chain rule $\frac{d}{d\alpha_k} f(\mathbf{x}_{k+1})= f'(\mathbf{x}_{k+1}) \frac{d}{d\alpha_k}\mathbf{x}_{k+1}$ and setting this derivative to zero, we obtain the following $\alpha_k$:
\[\begin{equation} \begin{split} \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{r}_k^T \mathbf{A} \mathbf{r}_k} \\ \end{split} \end{equation}\]The full SG iteration is:
\[\begin{equation} \begin{split} \mathbf{r}_0 &= \mathbf{b} - \mathbf{A}\mathbf{x}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{r}_k^T \mathbf{A} \mathbf{r}_k} \\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{r}_k \\ \mathbf{r}_{k+1} &= \mathbf{b} - \mathbf{A} \mathbf{x}_{k+1} \end{split} \end{equation}\]Alternatively, one matrix-vector multiply ($\mathbf{A} \mathbf{x}_{k+1}$) can be eliminated by iterating the residuals as follows:
\[\begin{equation} \begin{split} \mathbf{r}_0 &= \mathbf{b} - \mathbf{A}\mathbf{x}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{r}_k^T \mathbf{A} \mathbf{r}_k} \\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{r}_k \\ \mathbf{r}_{k+1} &= \mathbf{r}_k - \alpha_k \mathbf{A} \mathbf{r}_k \\ \end{split} \end{equation}\]
Conjugate Directions
SG often finds itself repeatedly taking steps in similar directions. It would be better if we took a step and got it right the first time. As an analogy, imagine a person standing at the top-left corner of a 5-by-5 grid, aiming to reach the bottom-right corner by moving either right or down several cells at each step. We could move one cell at a time along a zig-zag path, or we could move all the way down first and then all the way right, finishing in just two steps. In short, we will take exactly one step along each direction (n steps in total). Given n directions $\{\mathbf{d}_k\}$, the update formula is as follows:
\[\begin{equation} \begin{split} \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{d}_k \\ \end{split} \label{eq:gd} \end{equation}\]which is exactly the update formula of SG with $\mathbf{r}_k$ replaced by $\mathbf{d}_k$.
These directions $\{\mathbf{d}_k\}$ should be orthogonal to each other, and $\mathbf{e}_{k+1}$ should be orthogonal to $\mathbf{d}_{k}$; otherwise later updates would partially retrace previous directions. Notice that by subtracting $\mathbf{x}^{\star}$ on both sides, Equation (\ref{eq:gd}) becomes:
\[\begin{equation} \begin{split} \mathbf{x}_{k+1} - \mathbf{x}^{\star} &= \mathbf{x}_k - \mathbf{x}^{\star} + \alpha_k \mathbf{d}_k \\ \mathbf{e}_{k+1} &= \mathbf{e}_{k} + \alpha_k \mathbf{d}_k \end{split} \label{eq:9} \end{equation}\]With orthogonal conditions and Equation (\ref{eq:9}), we have:
\[\begin{equation} \begin{split} \mathbf{d}_{k}^T \mathbf{e}_{k+1} &= 0\\ \mathbf{d}_{k}^T \left(\mathbf{e}_{k} + \alpha_k \mathbf{d}_k\right) &= 0\\ \alpha_k &= -\frac{\mathbf{d}_{k}^T \mathbf{e}_{k}}{\mathbf{d}_{k}^T\mathbf{d}_{k}} \end{split} \label{eq:10} \end{equation}\]We don't know $\mathbf{e}_k$ yet, so Equation (\ref{eq:10}) cannot be used to calculate $\alpha_k$. The solution is to use A-orthogonality instead of orthogonality. Two vectors $\mathbf{d}_i$ and $\mathbf{d}_j$ are A-orthogonal, or conjugate, if:
\[\begin{equation} \begin{split} \mathbf{d}_{i}^T \mathbf{A} \mathbf{d}_{j} &= 0\\ \end{split} \end{equation}\]Once again, to avoid stepping along previous directions, the new requirement is that $\mathbf{e}_{k+1}$ be A-orthogonal to $\mathbf{d}_k$. This condition is equivalent to finding the minimum point along the search direction $\mathbf{d}_{k}$ with the line search method:
\[\begin{equation} \begin{split} \frac{d}{d\alpha_k} f(\mathbf{x}_{k+1}) &= 0\\ f'(\mathbf{x}_{k+1}) \frac{d}{d\alpha_k}\mathbf{x}_{k+1} &= 0\\ -\mathbf{r}_{k+1}^T \mathbf{d}_k &= 0\\ \mathbf{e}_{k+1}^T \mathbf{A} \mathbf{d}_k &= 0 \end{split} \end{equation}\]By Equation (\ref{eq:9}), $\alpha_k$ is computed as:
\[\begin{equation} \begin{split} \mathbf{e}_{k+1}^T \mathbf{A} \mathbf{d}_k &= 0\\ \left( \mathbf{e}_k + \alpha_k \mathbf{d}_k \right)^T \mathbf{A} \mathbf{d}_k &= 0\\ \alpha_k &= - \frac{\mathbf{e}_k^T \mathbf{A} \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}\\ &= \frac{\mathbf{r}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k} \end{split} \end{equation}\]Similar to the SG method, the iterative formulas are as follows:
\[\begin{equation} \begin{split} \mathbf{r}_0 &= \mathbf{b} - \mathbf{A}\mathbf{x}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}\\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{d}_k\\ \mathbf{r}_{k+1} &= \mathbf{r}_k - \alpha_k \mathbf{A} \mathbf{d}_k\\ \end{split} \label{eq:14} \end{equation}\]We can prove that the error converges exactly to zero after n steps with these formulas. By Equation (\ref{eq:9}), we have $\mathbf{e}_k = \mathbf{e}_0 + \Sigma_{i=0}^{k-1} \alpha_i \mathbf{d}_i$. Suppose $\mathbf{e}_0 = \Sigma_{i=0}^{n-1} \delta_i \mathbf{d}_i$ (the n directions $\{\mathbf{d}_k\}$ span the whole space); each $\delta_k$ can be found by multiplying the expression by $\mathbf{d}_k^T\mathbf{A}$:
\[\begin{equation} \begin{split} \mathbf{d}_k^T \mathbf{A} \mathbf{e}_0 &= \Sigma_{i=0}^{n-1} \delta_i \mathbf{d}_k^T\mathbf{A}\mathbf{d}_i\\ &= \delta_k \mathbf{d}_k^T\mathbf{A}\mathbf{d}_k\\ \delta_k &= \frac{\mathbf{d}_k^T \mathbf{A} \mathbf{e}_0}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}\\ &= \frac{\mathbf{d}_k^T \mathbf{A} \left(\mathbf{e}_0 + \Sigma_{i=0}^{k-1}\alpha_i \mathbf{d}_i \right)}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}\\ &= \frac{\mathbf{d}_k^T \mathbf{A} \mathbf{e}_k}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}\\ &= - \frac{\mathbf{d}_k^T \mathbf{r}_k}{\mathbf{d}_k^T\mathbf{A}\mathbf{d}_k}\\ &= - \alpha_k \end{split} \end{equation}\]where $\delta_k$ is exactly equal to $-\alpha_k$.
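As a numerical sanity check (a sketch of my own, with random data): the eigenvectors of a symmetric $\mathbf{A}$ are mutually orthogonal and hence also A-orthogonal, since $\mathbf{v}_i^T \mathbf{A} \mathbf{v}_j = \lambda_j \mathbf{v}_i^T \mathbf{v}_j = 0$. Using them as the directions $\{\mathbf{d}_k\}$ should drive the error to zero in exactly n steps:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)   # random SPD matrix
b = rng.standard_normal(n)

# Eigenvectors of a symmetric matrix are mutually orthogonal and
# therefore also A-orthogonal, so they form a valid conjugate set.
_, V = np.linalg.eigh(A)
directions = V.T              # each row is one conjugate direction

x = np.zeros(n)
r = b - A @ x
for d in directions:          # exactly one step per direction
    alpha = (r @ d) / (d @ A @ d)
    x = x + alpha * d
    r = r - alpha * (A @ d)

# After exactly n steps the error is (numerically) zero.
assert np.allclose(x, np.linalg.solve(A, b))
```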
This fact shows that each step eliminates exactly one component of $\mathbf{e}_0$. Consequently, the remaining error term $\mathbf{e}_k$ is:
\[\begin{equation} \begin{split} \mathbf{e}_k &= \mathbf{e}_0 + \Sigma_{i=0}^{k-1} \alpha_i \mathbf{d}_i\\ &= \Sigma_{i=k}^{n-1} \delta_i \mathbf{d}_i \end{split} \label{eq:16} \end{equation}\]
Conjugate Gradient Method
The remaining question is how to compute a set of A-orthogonal directions $\{\mathbf{d}_k\}$. Here we employ the Gram-Schmidt process. Suppose we have a set of n linearly independent vectors $\{\mathbf{u}_k\}$; let $\mathbf{d}_0 = \mathbf{u}_0$, and construct $\mathbf{d}_k$ as:
\[\begin{equation} \begin{split} \mathbf{d}_k = \mathbf{u}_k + \Sigma_{i=0}^{k-1} \beta_{k,i} \mathbf{d}_{i} \end{split} \label{eq:17} \end{equation}\]where $\beta_{k,i}$ is computed as:
\[\begin{equation} \begin{split} \mathbf{d}_i^T\mathbf{A}\mathbf{d}_k &= \mathbf{d}_i^T\mathbf{A}\left(\mathbf{u}_k + \Sigma_{j=0}^{k-1} \beta_{k,j} \mathbf{d}_{j}\right)\\ 0 &= \mathbf{d}_i^T\mathbf{A}\mathbf{u}_k + \beta_{k,i} \mathbf{d}_i^T\mathbf{A}\mathbf{d}_i\\ \beta_{k,i} &= - \frac{\mathbf{d}_i^T\mathbf{A}\mathbf{u}_k}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i} \end{split} \label{eq:18} \end{equation}\]The drawback of this construction is that all previous directions must be kept in memory to create a new one. CG constructs the directions far more efficiently by choosing the residuals as the starting vectors, $\mathbf{u}_k = \mathbf{r}_k$. With this choice, we will see that most of the computation can be eliminated.
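The Gram-Schmidt conjugation above can be sketched directly in numpy (random data; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)   # random SPD matrix

# Start from n linearly independent vectors u_k (random here) and
# A-orthogonalize them with the Gram-Schmidt process of Eq. (17)-(18).
U = rng.standard_normal((n, n))
D = []
for k in range(n):
    d = U[k].copy()
    for d_i in D:             # loop over ALL previous directions
        beta = -(d_i @ A @ U[k]) / (d_i @ A @ d_i)
        d += beta * d_i
    D.append(d)

# Every pair of constructed directions is A-orthogonal.
for i in range(n):
    for j in range(i):
        assert abs(D[i] @ A @ D[j]) < 1e-8
```

Note the inner loop over all previous directions: this is exactly the storage and compute burden that CG's choice $\mathbf{u}_k = \mathbf{r}_k$ removes.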
By Equation (\ref{eq:17}) and (\ref{eq:18}), $\mathbf{d}_k$ is constructed as:
\[\begin{equation} \begin{split} \mathbf{d}_k &= \mathbf{r}_k + \Sigma_{i=0}^{k-1} \beta_{k,i} \mathbf{d}_{i}\\ \beta_{k,i} &= - \frac{\mathbf{d}_i^T\mathbf{A}\mathbf{r}_k}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i} \end{split} \label{eq:19} \end{equation}\]Let’s consider the numerator of $\beta_{k,i}$ with Equation (\ref{eq:14}):
\[\begin{equation} \begin{split} \mathbf{d}_i^T\mathbf{A}\mathbf{r}_k &= \frac{1}{\alpha_i}(\mathbf{r}_i - \mathbf{r}_{i+1})^T \mathbf{r}_k\\ &= \frac{1}{\alpha_i}(\mathbf{r}_i^T\mathbf{r}_k - \mathbf{r}_{i+1}^T\mathbf{r}_k), i=0,1,...,k-1 \end{split} \end{equation}\]Luckily, we shall see that $\mathbf{r}_k$ is orthogonal to the previous residuals. Referring to Equation (\ref{eq:16}), where the error term $\mathbf{e}_k = \Sigma_{i=k}^{n-1} \delta_i \mathbf{d}_i$, we obtain the following by multiplying both sides by $-\mathbf{d}_j^T \mathbf{A}$:
\[\begin{equation} \begin{split} -\mathbf{d}_j^T \mathbf{A} \mathbf{e}_k &= -\Sigma_{i=k}^{n-1} \delta_i \mathbf{d}_j^T \mathbf{A} \mathbf{d}_i, j<k\\ \mathbf{d}_j^T \mathbf{r}_k &= 0, j<k \end{split} \label{eq:21} \end{equation}\]Thus the residuals are orthogonal to the previous conjugate directions, provided the residuals are chosen as the starting vectors for constructing those directions. By Equation (\ref{eq:19}), we further conclude that each residual is orthogonal to all previous residuals:
\[\begin{equation} \begin{split} \mathbf{d}_j^T \mathbf{r}_k &= 0, j<k\\ \left(\mathbf{r}_j + \Sigma_{i=0}^{j-1} \beta_{j,i} \mathbf{d}_{i}\right)^T\mathbf{r}_k &= 0\\ \mathbf{r}_j^T \mathbf{r}_k &= 0,j<k \end{split} \label{eq:22} \end{equation}\]By Equation (\ref{eq:21}) and (\ref{eq:22}), $\beta_{k,i}$ is simplified as:
\[\begin{equation} \begin{split} \beta_{k,i} = \begin{cases} \frac{1}{\alpha_{k-1}} \frac{\mathbf{r}_k^T\mathbf{r}_k}{\mathbf{d}_{k-1}^T\mathbf{A}\mathbf{d}_{k-1}} & i = k-1\\ 0 & \text{otherwise}\\ \end{cases} \end{split} \end{equation}\]so we only need to keep the most recent direction instead of all of them.
The method of CG is:
\[\begin{equation} \begin{split} \mathbf{d}_0 = \mathbf{r}_0 &= \mathbf{b} - \mathbf{A}\mathbf{x}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k} = \frac{\mathbf{r}_k^T \mathbf{r}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}\\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{d}_k\\ \mathbf{r}_{k+1} &= \mathbf{r}_k - \alpha_k \mathbf{A} \mathbf{d}_k\\ \beta_{k+1} &= \frac{1}{\alpha_{k}} \frac{\mathbf{r}_{k+1}^T\mathbf{r}_{k+1}}{\mathbf{d}_{k}^T\mathbf{A}\mathbf{d}_{k}}\\ &= \frac{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}{\mathbf{r}_k^T \mathbf{r}_k}\frac{\mathbf{r}_{k+1}^T\mathbf{r}_{k+1}}{\mathbf{d}_{k}^T\mathbf{A}\mathbf{d}_{k}}\\ &= \frac{\mathbf{r}_{k+1}^T\mathbf{r}_{k+1}}{\mathbf{r}_k^T \mathbf{r}_k}\\ \mathbf{d}_{k+1} &= \mathbf{r}_{k+1} + \beta_{k+1} \mathbf{d}_{k}\\ \end{split} \end{equation}\]
Preconditioned Conjugate Gradient Method
The convergence analysis of CG shows that a small condition number $\kappa(\mathbf{A})$ improves the convergence rate. Preconditioning is a technique for reducing the condition number of a matrix. Consider a symmetric, positive-definite matrix $\mathbf{M}$ that approximates $\mathbf{A}$ but is easier to invert. Instead of solving $\mathbf{A}\mathbf{x} = \mathbf{b}$, we solve:
\[\begin{equation} \begin{split} \mathbf{M}^{-1}\mathbf{A}\mathbf{x} = \mathbf{M}^{-1}\mathbf{b} \end{split} \end{equation}\]choosing $\mathbf{M}$ such that $\kappa(\mathbf{M}^{-1}\mathbf{A}) \ll \kappa(\mathbf{A})$. The problem is that $\mathbf{M}^{-1}\mathbf{A}$ is generally neither symmetric nor positive-definite, even though $\mathbf{M}$ and $\mathbf{A}$ are.
The solution is to decompose $\mathbf{M}$ with the Cholesky decomposition, which factors a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, $\mathbf{M}=\mathbf{L}\mathbf{L}^T$. We then solve the following equation:
\[\begin{equation} \begin{split} \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T \mathbf{L}^T \mathbf{x} &= \mathbf{L}^{-1}\mathbf{b}\\ \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T \hat{\mathbf{x}} &= \hat{\mathbf{b}}\\ \end{split} \end{equation}\]where $\hat{\mathbf{x}} = \mathbf{L}^T \mathbf{x}$ and $\hat{\mathbf{b}} = \mathbf{L}^{-1}\mathbf{b}$. Notice that $\mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T$ is symmetric and positive-definite, allowing us to apply CG directly.
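A quick numerical check of why this transformation is legitimate (a sketch of mine; the diagonal $\mathbf{M}$ below is just a placeholder preconditioner): since $\mathbf{M}^{-1}\mathbf{A} = \mathbf{L}^{-T}\left(\mathbf{L}^{-1}\mathbf{A}\mathbf{L}^{-T}\right)\mathbf{L}^{T}$, the two matrices are similar and share the same spectrum, hence the same condition number:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)       # random SPD matrix
M = np.diag(np.diag(A))           # a simple SPD preconditioner (placeholder)

L = np.linalg.cholesky(M)         # M = L L^T
Linv = np.linalg.inv(L)
A_hat = Linv @ A @ Linv.T         # symmetric positive-definite

# The transformed matrix has the same eigenvalues as M^{-1} A ...
ev1 = np.sort(np.linalg.eigvalsh(A_hat))
ev2 = np.sort(np.linalg.eigvals(np.linalg.inv(M) @ A).real)
assert np.allclose(ev1, ev2)
# ... so preconditioning the symmetrized system reduces kappa identically.
```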
Now the method of PCG is:
\[\begin{equation} \begin{split} \hat{\mathbf{d}}_0 = \hat{\mathbf{r}}_0 &= \hat{\mathbf{b}} - \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T\hat{\mathbf{x}}_0\\ \alpha_k &= \frac{\hat{\mathbf{r}}_k^T \hat{\mathbf{r}}_k}{\hat{\mathbf{d}}_k^T \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T \hat{\mathbf{d}}_k}\\ \hat{\mathbf{x}}_{k+1} &= \hat{\mathbf{x}}_k + \alpha_k \hat{\mathbf{d}}_k\\ \hat{\mathbf{r}}_{k+1} &= \hat{\mathbf{r}}_k - \alpha_k \mathbf{L}^{-1}\mathbf{A} \left(\mathbf{L}^{-1}\right)^T \hat{\mathbf{d}}_k\\ \beta_{k+1} &= \frac{\hat{\mathbf{r}}_{k+1}^T\hat{\mathbf{r}}_{k+1}}{\hat{\mathbf{r}}_k^T \hat{\mathbf{r}}_k}\\ \hat{\mathbf{d}}_{k+1} &= \hat{\mathbf{r}}_{k+1} + \beta_{k+1} \hat{\mathbf{d}}_{k}\\ \end{split} \end{equation}\]With some substitutions, we can further eliminate $\mathbf{L}^{-1}$:
\[\begin{equation} \begin{split} \mathbf{r}_0 &= \mathbf{b} - \mathbf{A} \mathbf{x}_0\\ \mathbf{d}_0 &= \mathbf{M}^{-1} \mathbf{r}_0\\ \alpha_k &= \frac{\mathbf{r}_k^T \mathbf{M}^{-1} \mathbf{r}_k}{\mathbf{d}_k^T \mathbf{A} \mathbf{d}_k}\\ \mathbf{x}_{k+1} &= \mathbf{x}_k + \alpha_k \mathbf{d}_k\\ \mathbf{r}_{k+1} &= \mathbf{r}_k - \alpha_k \mathbf{A} \mathbf{d}_k\\ \beta_{k+1} &= \frac{\mathbf{r}_{k+1}^T \mathbf{M}^{-1} \mathbf{r}_{k+1}}{\mathbf{r}_k^T \mathbf{M}^{-1} \mathbf{r}_k}\\ \mathbf{d}_{k+1} &= \mathbf{M}^{-1} \mathbf{r}_{k+1} + \beta_{k+1} \mathbf{d}_{k}\\ \end{split} \end{equation}\]$\mathbf{M}$ is called a preconditioner, and the effectiveness of PCG depends on finding an $\mathbf{M}$ that approximates $\mathbf{A}$ well without making $\mathbf{M}^{-1}$ too expensive to apply. In fact, the explicit form of $\mathbf{M}^{-1}$ is unnecessary: it suffices to compute the product $\mathbf{M}^{-1}\mathbf{r}_k$, which can be viewed as an operator that takes $\mathbf{r}_k$ as input and achieves the same effect as the matrix-vector product. PCG implementations benefit from this operator view, which extends to more general settings.
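A minimal PCG sketch following the formulas above, with $\mathbf{M}^{-1}$ passed as a callable operator rather than an explicit matrix (the function name and the diagonal preconditioner in the demo are my own choices, not from the note):

```python
import numpy as np

def pcg(A, b, M_inv, x0=None, tol=1e-10, max_iter=None):
    # Untransformed PCG: M_inv is a callable applying M^{-1} to a vector.
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    z = M_inv(r)                      # z_k = M^{-1} r_k
    d = z.copy()
    rz = r @ z
    for _ in range(max_iter or n):
        Ad = A @ d
        alpha = rz / (d @ Ad)         # alpha_k = r^T M^{-1} r / d^T A d
        x += alpha * d
        r -= alpha * Ad
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_inv(r)
        rz_next = r @ z
        beta = rz_next / rz           # beta_{k+1} via the M-inner product
        d = z + beta * d
        rz = rz_next
    return x

rng = np.random.default_rng(4)
n = 50
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)           # random SPD matrix
b = rng.standard_normal(n)

# Diagonal preconditioner applied as an operator: no explicit M^{-1}.
diag = np.diag(A)
x = pcg(A, b, lambda r: r / diag)
assert np.allclose(A @ x, b, atol=1e-6)
```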
Implementation Considerations
Starting point
If you have an estimate of $\mathbf{x}$, use it as the starting point $\mathbf{x}_0$; otherwise use the zero vector.
Stopping criterion
Theoretically, CG/PCG reaches a zero residual within n steps, so the number of iterations should never exceed n; iterating past that point would cause a division by zero. In practice, we stop the algorithm as soon as the residual falls below a specified threshold. A common criterion is $\|\mathbf{r}_k\|_2 \lt \epsilon \|\mathbf{r}_0\|_2$.
The accumulated roundoff error in the residual recurrence may yield a false zero residual. To address this issue, once the residual falls below the stopping value, we recalculate it from the definition (Equation (\ref{eq:2})) and double-check that it still falls below the stopping threshold. If not, we restart the iteration.
MATLAB's implementation does a little more than that. At each iteration it checks whether the current update is small enough, using the criterion $\|\alpha_k \mathbf{d}_k\|_2 \lt \epsilon \|\mathbf{x}_k\|_2$. If this happens three times, it concludes that the current solution is stuck and terminates the iteration with a warning message.
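Putting the stopping rules together, a minimal CG sketch with the relative-residual threshold and the true-residual double check might look like this (my own naming; dense numpy arrays assumed):

```python
import numpy as np

def cg(A, b, x0=None, eps=1e-8, max_iter=None):
    # CG with the stopping criterion ||r_k|| < eps * ||r_0||, re-checked
    # against the true residual b - A x to guard against roundoff drift
    # in the recurrence r_{k+1} = r_k - alpha_k A d_k.
    x = np.zeros(b.shape[0]) if x0 is None else x0.copy()
    r = b - A @ x
    d = r.copy()
    rr = r @ r
    stop = eps * np.sqrt(rr)          # eps * ||r_0||
    for _ in range(max_iter or b.shape[0]):
        Ad = A @ d
        alpha = rr / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad               # cheap recurrence, but it drifts
        rr_next = r @ r
        if np.sqrt(rr_next) < stop:
            r = b - A @ x             # recompute the true residual
            rr_next = r @ r
            if np.sqrt(rr_next) < stop:
                break                 # genuinely converged
        beta = rr_next / rr
        d = r + beta * d
        rr = rr_next
    return x

rng = np.random.default_rng(5)
n = 80
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)           # random SPD matrix
b = rng.standard_normal(n)
x = cg(A, b)
assert np.linalg.norm(b - A @ x) < 1e-8 * np.linalg.norm(b)
```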
Choice of Preconditioner
$\mathbf{M} = \mathbf{A}$ is an ideal but impractical preconditioner: applying $\mathbf{M}^{-1}$ amounts to solving $\mathbf{M}\mathbf{x}=\mathbf{b}$, which is the original problem. One simple preconditioner is the diagonal matrix whose diagonal elements are identical to those of $\mathbf{A}$, known as the Jacobi preconditioner.
Another preconditioner is the incomplete Cholesky factorization, a sparse approximation of the Cholesky decomposition obtained by keeping an element zero whenever the corresponding element of $\mathbf{A}$ is zero. This preserves the sparsity of $\mathbf{A}$, making it suitable for problems with sparse matrices. Yet I haven't found such a function in the numpy or scipy packages. I'll look into it when I have time, referring to the Wikipedia article and Julia's code.
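For illustration, here is a naive dense sketch of the zero-fill variant IC(0) (my own implementation, not a library routine; real implementations operate on sparse storage). On a matrix with no zero entries it reduces to the exact Cholesky factor, which makes it easy to sanity-check:

```python
import numpy as np

def ichol0(A):
    # Zero-fill incomplete Cholesky IC(0), dense sketch: identical to
    # outer-product Cholesky except entries where A[i, j] == 0 stay zero.
    n = A.shape[0]
    L = np.tril(A).astype(float)
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        for i in range(k + 1, n):
            if L[i, k] != 0.0:
                L[i, k] /= L[k, k]
        for j in range(k + 1, n):      # update the trailing submatrix,
            for i in range(j, n):      # skipping positions outside the
                if L[i, j] != 0.0:     # sparsity pattern of A
                    L[i, j] -= L[i, k] * L[j, k]
    return L

rng = np.random.default_rng(7)
n = 6
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)            # dense SPD: no zero pattern

# With no zeros in A, IC(0) coincides with the exact Cholesky factor.
L = ichol0(A)
assert np.allclose(L @ L.T, A)
```

The resulting $\mathbf{L}$ would be used in PCG by applying $\mathbf{M}^{-1}\mathbf{r} = (\mathbf{L}\mathbf{L}^T)^{-1}\mathbf{r}$ via two triangular solves.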