This post collects some useful tensor notation and tricks that I have come across. The majority of the content is from Tamara Kolda and Brett Bader’s review, Tensor Decompositions and Applications.

What Is a Tensor

The mathematical definition of a tensor is hard for me to grasp, so I prefer to view a tensor as a multi-dimensional array. An order-$N$ tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is an $N$-dimensional array, where $I_k$ is the size of the $k$-th dimension. So matrices are order-2 tensors and vectors are order-1 tensors. Elements of $\mathcal{X}$ are denoted $x_{i_1,i_2,\cdots,i_N}$, where $i_k$ is the index running from 1 to $I_k$.

Norm, Inner Product, and Rank-One

The inner product of two same-sized tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is the sum of the products of their elements:

\begin{equation} \begin{split} \langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_N} x_{i_1,i_2,\cdots,i_N} y_{i_1,i_2,\cdots,i_N} \end{split} \end{equation}

By this definition, the norm of a tensor $\mathcal{X}$ is the square root of the inner product of $\mathcal{X}$ with itself:

\begin{equation} \begin{split} \|\mathcal{X}\| = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle} \end{split} \end{equation}
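As a quick sanity check, here is a minimal NumPy sketch of the inner product and norm; it simply assumes the flattened dot product and 2-norm match the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
Y = rng.standard_normal((3, 4, 5))

inner = np.sum(X * Y)            # sum over all indices of the elementwise product
norm = np.sqrt(np.sum(X * X))    # ||X|| = sqrt(<X, X>)

assert np.isclose(norm, np.linalg.norm(X))  # same as the 2-norm of the flattened array
```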

An order-$N$ tensor $\mathcal{X}$ is rank-one if it can be written as the outer product of $N$ vectors:

\begin{equation} \begin{split} \mathcal{X} = \mathbf{a}^1 \circ \cdots \circ \mathbf{a}^N \end{split} \end{equation}

where $x_{i_1,i_2,\cdots,i_N} = a_{i_1}^1 \cdots a_{i_N}^N$.
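For instance, a rank-one order-3 tensor can be built in NumPy with an outer product (a small illustrative sketch, not from the review):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal(3), rng.standard_normal(4), rng.standard_normal(5)

# x_{ijk} = a_i * b_j * c_k
X = np.einsum('i,j,k->ijk', a, b, c)

assert X.shape == (3, 4, 5)
assert np.isclose(X[1, 2, 3], a[1] * b[2] * c[3])
```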

K-mode Product

Analogous to the matrix product, the $k$-mode product of a tensor $\mathcal{X}$ with a matrix $\mathbf{A} \in \mathbb{R}^{J \times I_k}$ is defined as:

\begin{equation} \begin{split} \mathcal{Y} &= \mathcal{X} \times_k \mathbf{A} \end{split} \end{equation}

where the elements of the tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times \cdots \times I_{k-1} \times J \times I_{k+1} \times \cdots \times I_N}$ are computed as:

\begin{equation} \begin{split} y_{i_1,i_2,\cdots,j,\cdots,i_N} &= \sum_{i_k=1}^{I_k} x_{i_1,i_2,\cdots,i_k,\cdots,i_N} a_{j, i_k} \end{split} \end{equation}
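A small NumPy sketch of this definition (the helper name mode_k_product is mine, not a library function): contract the $k$-th axis of $\mathcal{X}$ with the columns of $\mathbf{A}$, then move the new axis of size $J$ back into position $k$.

```python
import numpy as np

def mode_k_product(X, A, k):
    # sum over i_k: tensordot contracts axis k of X with axis 1 (columns) of A,
    # leaving the new axis of size J last, so move it back to position k
    Y = np.tensordot(X, A, axes=(k, 1))
    return np.moveaxis(Y, -1, k)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
A = rng.standard_normal((7, 4))            # J x I_k with J = 7, acting on mode k = 1

Y = mode_k_product(X, A, 1)
assert Y.shape == (3, 7, 5)
# check one element against the definition above
assert np.isclose(Y[0, 2, 3], sum(X[0, i, 3] * A[2, i] for i in range(4)))
```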

Successive $k$-mode products with matrices $\mathbf{A}^1 \in \mathbb{R}^{J_1 \times I_1},\cdots,\mathbf{A}^N \in \mathbb{R}^{J_N \times I_N}$ are denoted as:

\begin{equation} \begin{split} \mathcal{Y} &= \mathcal{X} \times_1 \mathbf{A}^1 \times_2 \mathbf{A}^2 \cdots \times_N \mathbf{A}^N \end{split} \end{equation}

Intuitively, if no two matrices act on the same mode, then the order of multiplication does not matter.

If two matrices share the same mode, then

\begin{equation} \begin{split} \mathcal{X} \times_n \mathbf{A} \times_n \mathbf{B} = \mathcal{X} \times_n \left(\mathbf{B} \mathbf{A}\right) \end{split} \end{equation}
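A quick numerical check of this same-mode property, reusing the mode_k_product sketch from above:

```python
import numpy as np

def mode_k_product(X, A, k):
    return np.moveaxis(np.tensordot(X, A, axes=(k, 1)), -1, k)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
A = rng.standard_normal((6, 4))   # acts on mode 1
B = rng.standard_normal((7, 6))   # acts on mode 1 of the result

lhs = mode_k_product(mode_k_product(X, A, 1), B, 1)
rhs = mode_k_product(X, B @ A, 1)
assert np.allclose(lhs, rhs)
```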

Mode-K Unfolding

The mode-$k$ unfolding of a tensor is a matricization process, that is:

\begin{equation} \begin{split} \mathcal{X}_{(k)} = \begin{bmatrix}\cdots & \mathbf{v} & \cdots \end{bmatrix} \end{split} \end{equation}

where $\mathcal{X}_{(k)} \in \mathbb{R}^{I_k \times I_1 \cdots I_{k-1}I_{k+1} \cdots I_N}$. Each column $\mathbf{v}$ is called a fiber: the elements along the $k$-th dimension with all other indices fixed. For simplicity, let’s assume that this reshape operation follows the row-major (C-like) layout, with the index of the last axis changing fastest. The $k$-rank of $\mathcal{X}$ is defined as the rank of the mode-$k$ unfolding of $\mathcal{X}$.
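Under this row-major convention, the unfolding can be sketched in NumPy by moving axis $k$ to the front and reshaping, so that each column is a mode-$k$ fiber (unfold is my own helper name):

```python
import numpy as np

def unfold(X, k):
    # move axis k to the front, then flatten the remaining axes in C order
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))

X1 = unfold(X, 1)                         # shape (4, 3 * 5)
assert X1.shape == (4, 15)
assert np.allclose(X1[:, 0], X[0, :, 0])  # the first column is the fiber X[0, :, 0]
```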

With the mode-$k$ unfolding, the $k$-mode product can be expressed as a normal matrix product:

\begin{equation} \begin{split} \mathcal{Y}_{(k)} &= \mathbf{A} \mathcal{X}_{(k)} \end{split} \end{equation}

or, for successive $k$-mode products:

\begin{equation} \begin{split} \mathcal{Y}_{(k)} &= \mathbf{A}^k \mathcal{X}_{(k)} \left(\mathbf{A}^1 \otimes \cdots \otimes \mathbf{A}^{k-1} \otimes \mathbf{A}^{k+1} \otimes \cdots \otimes \mathbf{A}^{N} \right)^T \end{split} \end{equation}

where $\otimes$ is the Kronecker product and $\mathbf{A}^1 \otimes \cdots \otimes \mathbf{A}^{k-1} \otimes \mathbf{A}^{k+1} \otimes \cdots \otimes \mathbf{A}^{N} \in \mathbb{R}^{J_1 \cdots J_{k-1}J_{k+1}\cdots J_N \times I_1 \cdots I_{k-1}I_{k+1}\cdots I_N}$ is a very large matrix. (Note that Kolda & Bader matricize the tensor in column-major layout, so the order of the Kronecker products is reversed in their paper.)

The Kronecker product of matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$ is defined by:

\begin{equation} \begin{split} \mathbf{A} \otimes \mathbf{B} &= \begin{bmatrix} a_{1,1} \mathbf{B} & \cdots & a_{1, J} \mathbf{B}\\ \vdots & \ddots & \vdots\\ a_{I,1} \mathbf{B} & \cdots & a_{I,J} \mathbf{B} \end{bmatrix}\\ &= \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \cdots & \mathbf{a}_1\otimes \mathbf{b}_L & \mathbf{a}_2\otimes \mathbf{b}_1 & \cdots & \mathbf{a}_J \otimes \mathbf{b}_L \end{bmatrix} \end{split} \end{equation}
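With the row-major unfolding used here, both matricized identities can be verified numerically; this sketch reuses the unfold and mode_k_product helpers from above (illustrative names, not library calls).

```python
import numpy as np

def unfold(X, k):
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_k_product(X, A, k):
    return np.moveaxis(np.tensordot(X, A, axes=(k, 1)), -1, k)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
A1 = rng.standard_normal((2, 3))
A2 = rng.standard_normal((6, 4))
A3 = rng.standard_normal((7, 5))

# single product: Y_(k) = A X_(k)
Y = mode_k_product(X, A2, 1)
assert np.allclose(unfold(Y, 1), A2 @ unfold(X, 1))

# successive products: Y_(k) = A^k X_(k) (A^1 kron ... skip k ... kron A^N)^T
Y = mode_k_product(mode_k_product(mode_k_product(X, A1, 0), A2, 1), A3, 2)
assert np.allclose(unfold(Y, 1), A2 @ unfold(X, 1) @ np.kron(A1, A3).T)
```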

Tensor Decompositions

The two most common tensor decompositions are the CP and Tucker decompositions, which can be considered higher-order generalizations of the matrix SVD and PCA, respectively.

CP Decomposition

The CP decomposition (canonical decomposition / parallel factors, CANDECOMP/PARAFAC) expresses a tensor as the sum of a finite number of rank-one tensors. For example, given an order-3 tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, CP decomposes it as:

\begin{equation} \begin{split} \mathcal{X} &\approx \sum_{i=1}^{R} \mathbf{a}_i \circ \mathbf{b}_i \circ \mathbf{c}_i \end{split} \end{equation}

where the matrices formed by the column vectors are called factor matrices, i.e., $\mathbf{A} = \begin{bmatrix}\mathbf{a}_1 & \cdots & \mathbf{a}_R \end{bmatrix}$. Let’s assume that each vector is normalized to unit length, with the scales collected in a weight vector $\mathbf{\lambda} \in \mathbb{R}^R$, so that the decomposition becomes:

\begin{equation} \begin{split} \mathcal{X} &\approx \sum_{i=1}^{R} \lambda_i \mathbf{a}_i \circ \mathbf{b}_i \circ \mathbf{c}_i \end{split} \end{equation}

For a general order-$N$ tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with the weight vector $\mathbf{\lambda}$ and factor matrices $\mathbf{A}^k \in \mathbb{R}^{I_k \times R}$, the CP decomposition is:

\begin{equation} \begin{split} \mathcal{X} &\approx \sum_{i=1}^{R} \lambda_i \mathbf{a}_i^1 \circ \mathbf{a}_i^2 \circ \cdots \circ \mathbf{a}_i^N \end{split} \end{equation}

and the mode-$k$ matricized version is:

\begin{equation} \begin{split} \mathcal{X}_{(k)} &\approx \mathbf{A}^k \mathbf{\Lambda} \left( \mathbf{A}^1 \odot \cdots \odot \mathbf{A}^{k-1} \odot \mathbf{A}^{k+1} \odot \cdots \odot \mathbf{A}^{N} \right)^T \end{split} \end{equation}

where $\mathbf{\Lambda} \in \mathbb{R}^{R \times R}$ is the diagonal matrix of the weight vector $\mathbf{\lambda}$ and $\odot$ is the Khatri-Rao product.

The Khatri-Rao product of two matrices $\mathbf{A} \in \mathbb{R}^{I \times K}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$ is defined by:

\begin{equation} \begin{split} \mathbf{A} \odot \mathbf{B} &= \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix} \end{split} \end{equation}

where the resulting matrix is in $\mathbb{R}^{IJ \times K}$.
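The sketch below builds a small CP tensor from weights and factor matrices and checks the mode-$k$ matricized form against a hand-rolled Khatri-Rao product (khatri_rao and unfold are illustrative helpers, assuming the row-major unfolding from above):

```python
import numpy as np

def unfold(X, k):
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def khatri_rao(A, B):
    # column-wise Kronecker product: (I, R) and (J, R) -> (I * J, R)
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(0)
I, J, K, R = 3, 4, 5, 2
lam = rng.standard_normal(R)
A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))

# X = sum_r lam_r * a_r o b_r o c_r
X = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)

# mode-2 matricization (k = 1 in 0-based indexing): X_(1) = B Lambda (A kr C)^T
assert np.allclose(unfold(X, 1), B @ np.diag(lam) @ khatri_rao(A, C).T)
```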

The rank of a tensor is defined as the smallest number $R$ that achieves an exact CP decomposition in Equation (14). The tensor rank is an analogue of the matrix rank, but it has many different properties:

  1. the tensor rank may differ over $\mathbb{R}$ and $\mathbb{C}$;
  2. there is no straightforward algorithm to determine the rank of a given tensor yet;
  3. the rank decomposition is generally unique, up to permutation and scaling.

It is well known that the best rank-$k$ approximation of a rank-$R$ matrix $\mathbf{A}$ is given by the leading $k$ components of its SVD:

\begin{equation} \begin{split} \mathbf{A} &\approx \sum_{i=1}^{k} \lambda_i \mathbf{u}_i \circ \mathbf{v}_i, \quad \text{with } \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \end{split} \end{equation}

but this does not hold for higher-order tensors. In fact, the best rank-$k$ approximation of a tensor may not even exist.

Since there is no easy way to determine the rank of a tensor, most CP decomposition procedures fit multiple models until they find one that approximates the given tensor well enough. The alternating least squares (ALS) method computes a CP decomposition by solving for one factor matrix at a time while keeping the others fixed:

\begin{equation} \begin{split} \mathbf{A}^k = &\mathcal{X}_{(k)} \left(\mathbf{A}^1 \odot \cdots \odot \mathbf{A}^{k-1} \odot \mathbf{A}^{k+1} \odot \cdots \odot \mathbf{A}^N \right)\\ &\left( (\mathbf{A}^1)^T\mathbf{A}^1 \ast \cdots \ast (\mathbf{A}^{k-1})^T\mathbf{A}^{k-1} \ast (\mathbf{A}^{k+1})^T\mathbf{A}^{k+1} \ast \cdots \ast (\mathbf{A}^N)^T\mathbf{A}^N\right)^\dag \end{split} \end{equation}

where $\ast$ is the elementwise (Hadamard) product and $\dag$ denotes the pseudo-inverse. The weights $\mathbf{\lambda}$ can be computed by normalizing the columns of $\mathbf{A}^k$. The ALS method iterates until a stopping criterion is satisfied. However, it is not guaranteed to converge to a global minimum, or even to a stationary point.
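Below is a rough ALS sketch for an order-3 CP model with a known rank $R$, following the update above with the $k$-th term excluded from the Khatri-Rao and Hadamard products; for simplicity, only the last factor’s scale is pulled into $\mathbf{\lambda}$ at the end. The function names are mine, and no convergence check is included.

```python
import numpy as np

def unfold(X, k):
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def khatri_rao(A, B):
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def cp_als(X, R, n_iter=100):
    rng = np.random.default_rng(0)
    factors = [rng.standard_normal((dim, R)) for dim in X.shape]
    for _ in range(n_iter):
        for k in range(3):
            others = [factors[m] for m in range(3) if m != k]  # keep mode order
            kr = khatri_rao(others[0], others[1])
            gram = np.ones((R, R))
            for m in range(3):
                if m != k:
                    gram *= factors[m].T @ factors[m]          # Hadamard of Gram matrices
            factors[k] = unfold(X, k) @ kr @ np.linalg.pinv(gram)
    lam = np.linalg.norm(factors[-1], axis=0)                  # absorb scale into lambda
    factors[-1] = factors[-1] / lam
    return lam, factors

# quick check on an exactly rank-2 tensor
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((d, 2)) for d in (3, 4, 5))
X = np.einsum('ir,jr,kr->ijk', A, B, C)
lam, (Ah, Bh, Ch) = cp_als(X, R=2)
X_hat = np.einsum('r,ir,jr,kr->ijk', lam, Ah, Bh, Ch)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))           # should be near zero
```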

Tucker Decomposition

The Tucker decomposition decomposes a tensor into a core tensor multiplied by a factor matrix along each mode:

\begin{equation} \begin{split} \mathcal{X} &= \mathcal{G} \times_1 \mathbf{A}^1 \times_2 \mathbf{A}^2 \cdots \times_N \mathbf{A}^N \end{split} \end{equation}

where $\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$, $\mathbf{A}^k \in \mathbb{R}^{I_k \times R_k}$, and each $\mathbf{A}^k$ is column-wise orthogonal. The CP decomposition can be seen as a special case of the Tucker decomposition in which the core tensor is superdiagonal and $R_1 = R_2 = \cdots = R_N$.

An important concept in the Tucker decomposition is the $n$-rank of $\mathcal{X}$, denoted $\operatorname{rank}_n(\mathcal{X})$, which is the rank of $\mathcal{X}_{(n)}$. The Tucker decomposition requires a set of ranks $(R_1,R_2,\cdots, R_N)$, which is also the size of the core tensor $\mathcal{G}$.
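Given a core tensor and factor matrices, the reconstruction is just a sequence of mode products; a minimal sketch, assuming factors with orthonormal columns produced by QR (tucker_to_tensor is an illustrative name):

```python
import numpy as np

def mode_k_product(X, A, k):
    return np.moveaxis(np.tensordot(X, A, axes=(k, 1)), -1, k)

def tucker_to_tensor(G, factors):
    X = G
    for k, A in enumerate(factors):
        X = mode_k_product(X, A, k)   # multiply the core by each factor along its mode
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 2))                        # core of size (R1, R2, R3)
factors = [np.linalg.qr(rng.standard_normal((d, r)))[0]   # columns are orthonormal
           for d, r in zip((4, 5, 6), G.shape)]

X = tucker_to_tensor(G, factors)
assert X.shape == (4, 5, 6)
```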

Like the ALS method, higher-order orthogonal iteration (HOOI) solves for $\mathbf{A}^k$ with the others fixed:

\begin{equation} \begin{split} \mathcal{Y} &= \mathcal{X} \times_1 (\mathbf{A}^1)^T \cdots \times_{k-1} (\mathbf{A}^{k-1})^T \times_{k+1} (\mathbf{A}^{k+1})^T \cdots \times_{N} (\mathbf{A}^{N})^T\\ \mathcal{Y}_{(k)} &= \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\\ \mathbf{A}^k &= \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_{R_k}\end{bmatrix} \end{split} \end{equation}

and the core tensor is $\mathcal{G} = \mathcal{X} \times_1 (\mathbf{A}^{1})^T \times_2 \cdots \times_N (\mathbf{A}^{N})^T$. Note that the Tucker decomposition is not unique.
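To close, here is a rough HOOI sketch following the update above: initialize the factors with truncated SVDs of the unfoldings (HOSVD), then alternately recompute each factor from the SVD of the partially projected tensor. The ranks are assumed to be given, and the helper names are mine.

```python
import numpy as np

def unfold(X, k):
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_k_product(X, A, k):
    return np.moveaxis(np.tensordot(X, A, axes=(k, 1)), -1, k)

def hooi(X, ranks, n_iter=20):
    N = X.ndim
    # HOSVD initialization: leading left singular vectors of each unfolding
    factors = [np.linalg.svd(unfold(X, k))[0][:, :ranks[k]] for k in range(N)]
    for _ in range(n_iter):
        for k in range(N):
            Y = X
            for m in range(N):
                if m != k:
                    Y = mode_k_product(Y, factors[m].T, m)   # project all other modes
            factors[k] = np.linalg.svd(unfold(Y, k))[0][:, :ranks[k]]
    G = X
    for m in range(N):
        G = mode_k_product(G, factors[m].T, m)               # core tensor
    return G, factors

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))
G, factors = hooi(X, ranks=(2, 3, 2))
print(G.shape)   # (2, 3, 2)
```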