Tensor Decompositions
This post contains some userful tensor notation and tricks which I have seen and collected. The majority of the content is from Tamara Kolda and Brett Bader’s review, Tensor Decompositions and Applications.
What is A Tensor
The mathematical definition of tensor is hard to understand for me. I would prefer viewing a tensor as a multi-dimensional array. An order-$N$ tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots I_N}$ is a $N$ dimensional array and $I_k$ is the degree of the $k$-th dimension. So matrix are order-2 tensors and vectors are order-1 tensors. Elements in tensor $\mathcal{X}$ are denoted as $x_{i_1,i_2,\cdots,i_N}$, where $i_k$ is the index running from 1 to $I_k$.
Norm, Inner Product, and Rank-One
The inner product of two same-sized tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots I_N}$ is the sum of the product of their elements:
\[\begin{equation} \begin{split} \langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_N} x_{i_1,i_2,\cdots,i_N} y_{i_1,i_2,\cdots,i_N} \end{split} \end{equation}\]By this definition, the norm of a tensor $\mathcal{X}$ is the square root of the inner product of $\mathcal{X}$ and itself:
\[\begin{equation} \begin{split} \|\mathcal{X}\| = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle} \end{split} \end{equation}\]An order-$N$ tensor $\mathcal{X}$ is rank-one if it can be written as the outer product of $N$ vectors:
\[\begin{equation} \begin{split} \|\mathcal{X}\| = \mathbf{a}^1 \circ \cdots \circ \mathbf{a}^N \end{split} \end{equation}\]where $x_{i_1,i_2,\cdots,i_N} = a_{i_1}^1 \cdots a_{i_N}^N$.
K-mode Product
As matrix product, the $k$-mode product between a tensor $\mathcal{X}$ and a matrix $\mathbf{A} \in \mathbb{R}^{J \times I_k}$ is defined as:
\[\begin{equation} \begin{split} \mathcal{Y} &= \mathcal{X} \times_k \mathbf{A} \end{split} \end{equation}\]where the elements of tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times J \times \cdots I_N}$ can be computed as:
\[\begin{equation} \begin{split} y_{i_1,i_2,\cdots,j,\cdots,i_N} &= \sum_{i_k=1}^{I_k} x_{i_1,i_2,\cdots,i_k,\cdots,i_N} a_{j, i_k} \end{split} \end{equation}\]Continuous $k$-mode products with matrices $\mathbf{A}^1 \in \mathbb{R}^{J_1 \times I_1},\cdots,\mathbf{A}^N \in \mathbb{R}^{J_N \times I_N}$ are denoted as:
\[\begin{equation} \begin{split} \mathcal{Y} &= \mathcal{X} \times_1 \mathbf{A}^1 \times_2 \mathbf{A}^2 \cdots \times_N \mathbf{A}^N \end{split} \end{equation}\]Intuitively, if no two matrices act on the same mode, then the order of product between them is interchangeable.
If two matrices share the same mode, then
\[\begin{equation} \begin{split} \mathcal{X} \times_n \mathbf{A} \times_n \mathbf{B} = \mathcal{X} \times_n \left(\mathbf{B} \mathbf{A}\right) \end{split} \end{equation}\]Mode-K Unfolding
The mode-$k$ unfolding of a tensor is a matricization process, that is:
\[\begin{equation} \begin{split} \mathcal{X}_{(k)} = \begin{bmatrix}\cdots & \mathbf{v} & \cdots \end{bmatrix} \end{split} \end{equation}\]where $\mathcal{X}_{(k)} \in \mathbb{R}^{I_k \times I_1 \cdots I_{k-1}I_{k+1} \cdots I_N}$. **Each column $\mathbf{v}$ is called a fiber**, which is just elements along the $k$-th dimension given other indices. **For simplicity, let's assume that this reshape operation follows the row-major layout or C-like order, with the index of the last axis changing the fastest**. And the $k$-rank of $\mathcal{X}$ is defined as the rank of the mode-k unfolding of $\mathcal{X}$.
With the mode-$k$ unfolding, the $k$-mode product can be expressed as a normal matrix product:
\[\begin{equation} \begin{split} \mathcal{Y}_{(k)} &= \mathbf{A} \mathcal{X}_{(k)} \end{split} \end{equation}\]or for continuous $k$-mode products:
\[\begin{equation} \begin{split} \mathcal{Y}_{(k)} &= \mathbf{A}^k \mathcal{X}_{(k)} \left(\mathbf{A}^1 \otimes \cdots \otimes \mathbf{A}^{k-1} \otimes \mathbf{A}^{k+1} \otimes \cdots \otimes \mathbf{A}^{N} \right)^T \end{split} \end{equation}\]where $\otimes$ is the Kronecker product and $\mathbf{A}^1 \otimes \cdots \otimes \mathbf{A}^{k-1} \otimes \mathbf{A}^{k+1} \otimes \cdots \otimes \mathbf{A}^{N} \in \mathbb{R}^{J_1J_2\cdots J_N \times I_1I_2\cdots I_N }$ is a super big matrix (you would find that Kolda & Bader matricized the tensor with the column-major layout and the order of Kronecker products was reversed).
The Kronecker product of matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$ is defined by: \(\begin{equation} \begin{split} \mathbf{A} \otimes \mathbf{B} &= \begin{bmatrix} a_{1,1} \mathbf{B} & \cdots & a_{1, J} \mathbf{B}\\ \vdots & \ddots & \vdots\\ a_{I,1} \mathbf{B}& \cdots & a_{I,J} \mathbf{B} \end{bmatrix}\\ &= \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \cdots & \mathbf{a}_1\otimes \mathbf{b}_L & \mathbf{a}_2\otimes \mathbf{b}_1 & \cdots & \mathbf{a}_J \otimes \mathbf{b}_L \end{bmatrix} \end{split} \end{equation}\)
Tensor Decompositions
Two most common tensor decompositions are CP and Tucker decompositions, considered to be higher-order generalizations of the matrix SVD and PCA, separately.
CP Decomposition
CP decomposition (cnanonical decomposition/parallel factors, CANDECOMP/PARAFAC) is to express a tensor as the sum of a finite number of rank-one tensors. For example, given a 3-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, CP decomposes it as:
\[\begin{equation} \begin{split} \mathcal{X} &\approx \sum_{i=1}^{R} \mathbf{a}_i \circ \mathbf{b}_i \circ \mathbf{c}_i \end{split} \end{equation}\]where the combination of the column vectors are called factor matrices, i.e., $\mathbf{A} = \begin{bmatrix}\mathbf{a}_1 & \cdots & \mathbf{a}_R \end{bmatrix}$. Let’s assume that each vector is normalized to value one with a weight vector $\mathbf{\lambda} \in \mathbb{R}^R$ so that the decomposition is:
\[\begin{equation} \begin{split} \mathcal{X} &\approx \sum_{i=1}^{R} \lambda_i \mathbf{a}_i \circ \mathbf{b}_i \circ \mathbf{c}_i \end{split} \end{equation}\]For a general $N$-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with the weight vector $\mathbf{\lambda}$ and factor matrices $\mathbf{A}^k \in \mathbb{R}^{I_k \times R}$, the CP decomposition is:
\[\begin{equation} \begin{split} \mathcal{X} &\approx \sum_{i=1}^{R} \lambda_i \mathbf{a}_i^1 \circ \mathbf{a}_i^2 \circ \cdots \circ \mathbf{a}_i^N \end{split} \label{eq:14} \end{equation}\]and the mode-$k$ matricized version is:
\[\begin{equation} \begin{split} \mathcal{X}_{(k)} &\approx \mathbf{A}^k \mathbf{\Lambda} \left( \mathbf{A}^1 \odot \cdots \mathbf{A}^{k-1} \odot \mathbf{A}^{k+1} \odot \cdots \mathbf{A}^{N} \right)^T \end{split} \end{equation}\]where $\mathbf{\Lambda} \in \mathbb{R}^{R \times R}$ is a diagonal matrix of the weight vector $\mathbf{\lambda}$ and $\odot$ is the Khatri-Rao product.
The Khatri-Rao product of two matrices $\mathbf{A} \in \mathbb{R}^{I \times K}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$ is defined by: \(\begin{equation} \begin{split} \mathbf{A} \odot \mathbf{B} &= \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix} \end{split} \end{equation}\) where the resulting matrix is in $\mathbb{R}^{IJ \times K}$.
The rank of a tensor is defined as the smallest number $R$ to achieve an exact CP decomposition in Equation (\ref{eq:14}). The tensor rank is an analogue to the matrix rank, but it has many different properites:
- the tensor rank may be different over $\mathbb{R}$ and $\mathbf{C}$;
- there is no straightforward algorithm to determine the rank of a given tensor yet;
- the rank decomposition is generally unique with the exception of permuation and scaling operations.
It’s well-known that the best rank-$k$ approximation of a rank-$R$ matrix $\mathbf{A}$ is the leading $k$ components of its SVD:
\[\begin{equation} \begin{split} \mathbf{A} &\approx \sum_{i=1}^{k} \lambda_i \mathbf{u}_i \circ \mathbf{v}_i,\ with\ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \end{split} \end{equation}\]but this does not hold true for high-order tensors. In fact, the best rank-$k$ approximation of a tensor may not exist.
Since there is no easy way to determine the rank of a tensor, most CP decomposition procedures would fit multiple models until finding one that is good enough to approximate the given tensor. The Alternating least squares (ALS) method is one of them to compute a CP decomposition by solving one factor matrix each time with others fixed:
\[\begin{equation} \begin{split} \mathbf{A}^k = &\mathcal{X}_{(k)} \left(\mathbf{A}^1 \odot \cdots \mathbf{A}^{k-1} \odot \mathbf{A}^{k+1} \cdots \odot \mathbf{A}^N \right)\\ &\left( (\mathbf{A}^1)^T\mathbf{A}^1 \ast \cdots \ast (\mathbf{A}^N)^T\mathbf{A}^N\right)^{\dagger} \end{split} \end{equation}\]where $\ast$ is the elementwise product (Hadamard product) and $\dagger$ is the pseudo-inverse. $\mathbf{\lambda}$ can be computed by normalizing the columns of $\mathbf{A}^k$. The ALS method iterates until the stopping criteria satisfied. However, the ALS method is not guaranteed to converge to a global minimum or even a stationary point.
Tucker Decomposition
The Tucker decomposition decomposes a tensor into a core tensor multiplied by a factor matrix along each mode:
\[\begin{equation} \begin{split} \mathcal{X} &= \mathcal{G} \times_1 \mathbf{A}^1 \times_2 \mathbf{A}^2 \cdots \times_N \mathbf{A}^N \end{split} \end{equation}\]where $\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$, $\mathbf{A}^k \in \mathbb{R}^{I_k \times R_k}$ and $\mathbf{A}^k$ is orthogonal. It can be seen that the CP decomposition is a special case of the Tucker decomposition where the core tensor is superdiagonal and $R_1 = R_2 = \cdots = R_N$.
An important concept of Tucker Decomposition is the $n$-rank of $\mathcal{X}$, denoted $rank_n(\mathcal{X})$, which is the rank of $\mathcal{X}_{(n)}$. The Tucker decomposition requires a rank group $(R_1,R_2,\cdots, R_N)$, which is also the size of the core tensor $\mathcal{G}$.
Higher-order orthogonal iteration (HOOI) solves $\mathbf{A}^k$ with others fixed, as the ALS method:
\[\begin{equation} \begin{split} \mathcal{Y} &= \mathcal{X} \times_1 (\mathbf{A}^1)^T \cdots \times_{k-1}\\ &(\mathbf{A}^{k-1})^T \times_{k+1} (\mathbf{A}^{k+1})^T \cdots \times_{N} (\mathbf{A}^{N})^T\\ \mathcal{Y}_{(k)} &= \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\\ \mathbf{A}^k &= \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_{R_n}\end{bmatrix} \end{split} \end{equation}\]and the core tensor $\mathcal{G} = \mathcal{X} \times_1 (\mathbf{A}^{1})^T \times \cdots \times (\mathbf{A}^{N})^T$. Note that the Tucker decomposition is not unique.