The Moore-Penrose inverse, or pseudoinverse, $\mathbf{A}^+ \in \mathbb{R}^{n \times m}$ of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a generalization of the inverse matrix to non-square or ill-conditioned matrices. The most confusing part of coding a pinv function is how to choose an appropriate tolerance for truncating (near-)zero singular values.

Basics of the Pseudoinverse

Mathematically, the pseudoinverse of $\mathbf{A}$ is defined as the unique matrix satisfying the four Moore-Penrose conditions. The pseudoinverse has many good properties: it exists for any matrix, and it equals the ordinary inverse $\mathbf{A}^{-1}$ whenever $\mathbf{A}$ is square and invertible. Its computation is quite intuitive: $\mathbf{A}^+ = \mathbf{V} \mathbf{S}^+ \mathbf{U}^T$, where $\mathbf{A} = \mathbf{U} \mathbf{S} \mathbf{V}^T$ is the SVD of $\mathbf{A}$ and $\mathbf{S}^+$ takes the reciprocal of each nonzero singular value on the diagonal (and transposes the rectangular diagonal matrix).
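To make this concrete, here is a minimal NumPy sketch (my own example, not taken from any library's source) that assembles $\mathbf{A}^+$ from the SVD and checks it against `np.linalg.pinv`:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])            # 3x2, full column rank

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# S+ takes reciprocals of the (here all safely nonzero) singular values
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

assert np.allclose(A_pinv, np.linalg.pinv(A))
```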

More importantly, the pseudoinverse relates to the least squares problem with Tikhonov regularization:

\begin{equation}
\argmin_{\mathbf{x} \in \mathbb{R}^n} \ \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 + \lambda \|\mathbf{x}\|_2^2
\end{equation}

where the solution is:

\begin{equation}
\mathbf{\hat{x}} = \left( \mathbf{A}^T\mathbf{A} + \lambda \mathbf{I}\right)^{-1} \mathbf{A}^T \mathbf{b}
\end{equation}

The pseudoinverse $\mathbf{A}^+$ is exactly the limit when $\lambda \to 0$:

\begin{equation}
\mathbf{A}^+ = \lim_{\lambda \to 0} \left( \mathbf{A}^T\mathbf{A} + \lambda \mathbf{I}\right)^{-1} \mathbf{A}^T
\end{equation}
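This limit is easy to check numerically. The sketch below (with a random, well-conditioned test matrix of my own choosing) shows the ridge solution operator converging to `np.linalg.pinv(A)` as $\lambda$ shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
I = np.eye(3)

for lam in (1e-2, 1e-6, 1e-12):
    # Solve (A^T A + lam*I) X = A^T, i.e. the ridge solution operator
    ridge = np.linalg.solve(A.T @ A + lam * I, A.T)
    gap = np.linalg.norm(ridge - np.linalg.pinv(A))
    print(f"lambda={lam:.0e}  gap={gap:.2e}")   # gap shrinks with lambda
```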

If $\mathbf{A}$ is square and ill-conditioned, it has a high condition number: some singular values in the diagonal matrix $\mathbf{S}$ decay toward, or are numerically equal to, 0. Computing $\mathbf{A}^{-1}$ directly then blows up in real applications because of the implied division by (near-)zero. A common workaround is to replace $\mathbf{A}^{-1}$ with the pseudoinverse $\mathbf{A}^+$. Recalling equation (3) and substituting the SVD, so that $\mathbf{A}^T\mathbf{A} + \lambda \mathbf{I} = \mathbf{V}\left(\mathbf{S}^T\mathbf{S} + \lambda \mathbf{I}\right)\mathbf{V}^T$, we have:

\begin{equation}
\begin{split}
\mathbf{A}^+ &= \lim_{\lambda \to 0} \left( \mathbf{A}^T\mathbf{A} + \lambda \mathbf{I}\right)^{-1} \mathbf{A}^T\\
&= \lim_{\lambda \to 0} \mathbf{V} \left( \mathbf{S}^T\mathbf{S} + \lambda \mathbf{I}\right)^{-1}\mathbf{S}^T \mathbf{U}^T\\
&= \lim_{\lambda \to 0} \mathbf{V} \begin{bmatrix} \frac{\sigma_1}{\sigma_1^2+\lambda} & & & \\ & \ddots & & \\ & & \frac{\sigma_k}{\sigma_k^2+\lambda} & \\ & & & \mathbf{0} \end{bmatrix} \mathbf{U}^T\\
&= \mathbf{V} \begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \ddots & & \\ & & \frac{1}{\sigma_k} & \\ & & & \mathbf{0} \end{bmatrix} \mathbf{U}^T
\end{split}
\end{equation}

So the pseudoinverse $\mathbf{A}^+$ is a truncated SVD: only the nonzero singular values are inverted, and the components corresponding to zero singular values are discarded. Theoretically, it never encounters a divide-by-zero issue.
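In code, the truncation is explicit. A minimal sketch (the function name `pinv_truncated` and its `tol` argument are mine, not from any library):

```python
import numpy as np

def pinv_truncated(A, tol):
    """Pseudoinverse via SVD; reciprocals of small singular values are zeroed."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    keep = s > tol                    # singular values treated as nonzero
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv[:, None] * U.T)

# A rank-1 matrix: the second singular value is zero up to roundoff,
# so inverting it blindly would blow up
A = np.outer([1.0, 2.0], [3.0, 4.0])
print(pinv_truncated(A, tol=1e-12))
```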

Engineering Considerations

However, determining what zero means in practical numerical computing is a little tricky. Typically, singular values that fall below some small tolerance are classified as zero, and the choice of the default tolerance varies among implementations. Here I list some implementations (a sketch reproducing these defaults follows the table):

| Software | Implementation | Note |
| --- | --- | --- |
| Matlab | `max(m,n)*eps(norm(s, inf))` | `eps(x)` returns the positive distance from `abs(x)` to the next larger-in-magnitude floating point number of the same precision as `x` |
| Scipy | `atol + max(m, n)*np.finfo(dtype).eps*max(s)` | `np.finfo(dtype).eps` returns the difference between 1.0 and the next smallest representable float larger than 1.0 of the precision `dtype`; `atol` is the absolute tolerance and defaults to 0 |
| Numpy | `1e-15*max(s)` | |
| Octave | `max(m, n)*max(s)*std::numeric_limits<T>::epsilon()` | `std::numeric_limits<T>::epsilon()` returns the difference between 1.0 and the next smallest representable float larger than 1.0 of the precision `T` (that is GCC's behavior) |
| Julia | `max(eps(T)*min(m, n)*maximum(s), atol)` | `eps(T)` returns the distance between 1.0 and the next larger representable floating-point value of the precision `T`; the Julia community also recommends `sqrt(eps(T))` for dense ill-conditioned matrices |
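Assuming a float64 matrix, the defaults above can be reproduced roughly as follows. This is a sketch of the formulas in the table, not the libraries' actual code paths; `np.spacing` stands in for Matlab's `eps(x)`:

```python
import numpy as np

A = np.vander(np.linspace(0.0, 1.0, 8))      # a classic ill-conditioned matrix
m, n = A.shape
s = np.linalg.svd(A, compute_uv=False)       # singular values only
eps = np.finfo(A.dtype).eps

tols = {
    "Matlab": max(m, n) * np.spacing(s.max()),       # eps(norm(s, inf))
    "Scipy":  0.0 + max(m, n) * eps * s.max(),       # atol defaults to 0
    "Numpy":  1e-15 * s.max(),                       # default rcond = 1e-15
    "Octave": max(m, n) * s.max() * eps,
    "Julia":  max(eps * min(m, n) * s.max(), 0.0),   # atol defaults to 0
}
for name, tol in tols.items():
    print(f"{name:6s} tol={tol:.3e}  numerical rank={int((s > tol).sum())}")
```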

I’m not an expert in numerical computing, so I won’t dig into the considerations behind these implementations. If you are interested, you can refer to the community discussions on this topic, such as those in Numpy and Julia, or to standard references such as Golub and Van Loan’s Matrix Computations and Vetterling’s Numerical Recipes.

Generally speaking, there is no one-size-fits-all tolerance for all kinds of problems, and Matlab’s implementation is considered the most conservative. Nearly all implementations consider two kinds of tolerance: absolute and relative. Here I just provide an excerpt from Stata’s manuals:

An absolute tolerance is a fixed number that is used to make direct comparisons. If the tolerance for a particular routine were 1e-14, then 8.99e-15 in some calculation would be considered to be close enough to zero to act as if it were, in fact, zero, and 1.000001e-14 would be considered a valid, nonzero number.

But is 1e-14 small? The number may look small to you, but whether 1e-14 is small depends on what is being measured and the units in which it is measured. If all the numbers in a certain problem were around 1e-12, you might suspect that 1e-14 is a reasonable number. That leads to relative measures of tolerance. Rather than treating, say, a predetermined quantity as being so small as to be zero, one specifies a value (for example, 1e-14) multiplied by something and uses that as the definition of small.

For the above matrix, the diagonal of U turns out to be (5.5e+14, 2.4e+13, 0.000087).

An absolutist would tell you that the matrix is of full rank; the smallest number along the diagonal of U is 0.000087 (8.7e-5), and that is still a respectable number, at least when compared with computer precision, which is about 2.22e-16.

Most Mata routines would tell you that the matrix has rank 2. Numbers such as 0.000087 may seem respectable when compared with machine precision, but 0.000087 is, relatively speaking, a very small number, being about 4.6e-19 relative to the average value of the diagonal elements.
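The excerpt’s arithmetic is easy to replay (the `1e-14` factors below are the manual’s example values, not any library’s defaults):

```python
import numpy as np

d = np.array([5.5e14, 2.4e13, 0.000087])   # the diagonal quoted above

atol = 1e-14                                # absolute: compare directly
print((np.abs(d) > atol).sum())             # -> 3, "full rank"

rtol = 1e-14 * np.abs(d).max()              # relative: scale by the problem
print((np.abs(d) > rtol).sum())             # -> 2, rank 2
```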

and from Vetterling’s comments in Numerical Recipes (p. 795):

Moreover, if a singular value $w_i$ is nonzero but very small, you should also define its reciprocal to be zero, since its apparent value is probably an artifact of roundoff error, not a meaningful number. A plausible answer to the question “how small is small?” is to edit in this fashion all singular values whose ratio to the largest singular value is less than $N$ times the machine precision $\epsilon$. (This is a more conservative recommendation than the default in section 2.6, which scales as $N^{1/2}$.)
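That recommendation translates to a one-line threshold. A sketch (my own code, not Numerical Recipes’ actual routine), where `N` is the matrix dimension:

```python
import numpy as np

def nr_small(s, N):
    """True for singular values NR would edit to zero: s / max(s) < N * eps."""
    return s < N * np.finfo(s.dtype).eps * s.max()

s = np.array([3.0, 1e-3, 1e-18])
w_inv = np.where(nr_small(s, N=3), 0.0, 1.0 / s)   # edited reciprocals
print(w_inv)                                       # last entry becomes 0.0
```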