Last week I was attempting to implement an in-place `fftshift` function in C++. I wanted this function to perform shifting along any given dimension of N-dimensional data, similar to the ones found in Matlab and NumPy. Interestingly, both Matlab and NumPy internally delegate to another function, known as `circshift` in Matlab and `roll` in NumPy. These functions circularly shift the elements of an array by K positions. Below are some illustrations of what `circshift` does.
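For instance, in Matlab, `circshift([1 2 3 4 5], 2)` yields `[4 5 1 2 3]`: every element moves two positions to the right, and the tail wraps around to the front. A minimal out-of-place sketch of the same idea in C++ (my own illustration, not the implementation discussed below):

```cpp
#include <cstdio>

int main() {
    // circular shift by K: element at index i moves to (i + K) % n
    const int n = 5, K = 2;
    int src[n] = {1, 2, 3, 4, 5};
    int dst[n];
    for (int i = 0; i < n; ++i)
        dst[(i + K) % n] = src[i];
    for (int i = 0; i < n; ++i)
        printf("%d ", dst[i]); // prints: 4 5 1 2 3
    return 0;
}
```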
In fact, `circshift` is an instance of the array rotation problem, which was thoroughly studied decades ago. Many problems in computer science eventually boil down to array rotation (That's a rotate!). The problem is: given an array made of two parts `[a, b]`, swap them to get `[b, a]`.
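The C++ standard library already ships this primitive as `std::rotate`, whose middle iterator marks the boundary between the two parts:

```cpp
#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v = {1, 2, 3, 4, 5}; // [a, b] with a = {1, 2}
    // make b the new front: v becomes {3, 4, 5, 1, 2}
    std::rotate(v.begin(), v.begin() + 2, v.end());
    return 0;
}
```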
Allocate and Copy
As you already know, the optimal time complexity for rotating an array is O(n), where n represents the number of elements in the array. Implementing an array rotation algorithm with a buffer is intuitive:
```cpp
#include <algorithm>
#include <vector>

// left: the size of left part a in [a, b]
// n:    the total number of elements
template <typename T>
void rotation_with_buffer(int left, int n, T *pSrcDst) {
    std::vector<T> buffer(pSrcDst, pSrcDst + left);              // stash a
    std::copy(pSrcDst + left, pSrcDst + n, pSrcDst);             // shift b to the front
    std::copy(buffer.begin(), buffer.end(), pSrcDst + n - left); // put a behind b
}
```
Bridge Rotation
`rotation_with_buffer` is quite simple, provided that you can pay the cost of memory allocation. In fact, for small arrays it may even outperform other algorithms. The memory requirement can be further reduced by considering the overlap of the two parts:
Here we divide the array into three parts, `a`, `bl`, and `br` (assuming `a` is smaller than `b`): `bl` represents the overlap of `a` and `b`, and `br` has the same size as `a`. If `bl` is smaller than `a`, we move `bl` to the buffer and sequentially shift `br` and `a` to their correct positions; finally, we move `bl` back to the beginning of the array. If `bl` is bigger than `a`, we revert to the previous algorithm, buffering `a` instead. The basic idea is that we always move the smallest part to the buffer. Hoven called it a bridge rotation algorithm.
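Here is a sketch of the buffered case in that description (my own simplification, not Hoven's exact code; it assumes `left` is the size of `a` and falls back to the `rotation_with_buffer` above otherwise):

```cpp
#include <algorithm>
#include <vector>

template <typename T>
void bridge_rotation(int left, int n, T *pSrcDst) {
    const int right = n - left;      // size of b
    const int bridge = right - left; // size of the overlap bl
    if (bridge <= 0 || bridge >= left) {
        rotation_with_buffer(left, n, pSrcDst); // buffer a instead
        return;
    }
    std::vector<T> buffer(pSrcDst + left, pSrcDst + right); // stash bl
    // Walk from right to left, dropping br and a into their final positions.
    for (int i = left - 1; i >= 0; --i) {
        pSrcDst[bridge + i] = pSrcDst[right + i]; // br[i] -> its target
        pSrcDst[right + i] = pSrcDst[i];          // a[i]  -> its target
    }
    std::copy(buffer.begin(), buffer.end(), pSrcDst); // bl back to the front
}
```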
Triple Reversal Rotation
Obviously, the above two are not truly in-place rotations. The triple reversal algorithm, as found in Programming Pearls, 2nd Edition, is a prime example of an in-place algorithm. It's remarkably elegant and easy to understand. Let me explain this algorithm with some notation: $ab$ denotes the array to be rotated, $ba$ is the result of the rotation, $a^r$ is the reversal of $a$, and $b^r$ is the reversal of $b$. It's straightforward to infer two theorems: $(a^r)^r = a$, and $(ab)^r = b^r a^r$ (it looks like the transpose in matrix multiplication: $(AB)^T = B^T A^T$!).
To get $ba$ with reverse operations, we have the following chain: $ba = ((ba)^r)^r = (a^r b^r)^r$.
We can rotate the array with only three reverse operations, ordered in more than one way. Here is one implementation:
```cpp
// reverse array pSrcDst with N elements
template <typename T>
void reverse(int N, T *pSrcDst) {
    for (int i = 0, j = N - 1; i < j; ++i, --j) {
        T tmp = pSrcDst[i];
        pSrcDst[i] = pSrcDst[j];
        pSrcDst[j] = tmp;
    }
}

// rotate [a, b] into [b, a]: (a^r b^r)^r = ba
template <typename T>
void triple_reversal_rotation(int left, int n, T *pSrcDst) {
    reverse(left, pSrcDst);            // a -> a^r
    reverse(n - left, pSrcDst + left); // b -> b^r
    reverse(n, pSrcDst);               // whole -> ba
}
```
Triple reversal rotation involves precisely 2N memory operations. Hoven proposed an enhanced version, named trinity rotation (Hi, Lara!), which improves locality and reduces the number of moves. Honestly, I don't really grasp the idea behind this algorithm, and it seems quite complicated. If you are curious, please explore Hoven's page. Additionally, the performance of triple reversal also depends on how efficiently the reverse function is implemented. This blog provides a detailed explanation of how to implement it by exploiting the characteristics of modern computers, achieving up to 22x speedups over `std::reverse` from the C++ standard library!
Performance Test
I compared the performance of bridge rotation, triple reversal rotation, and conjoint reversal rotation (aka trinity rotation) on my PC. It's not a rigorous test. The array size is 1000000, and I chose `left` to be 333334. I ran each rotation algorithm 100 times and computed the average running time. I expected bridge rotation to be the worst, since it has to allocate memory for one third of the array. Surprisingly, they achieved nearly the same performance (bridge rotation: 253.5us, triple reversal rotation: 255.6us, conjoint reversal rotation: 252.6us).
How the operating system optimizes memory management may explain this phenomenon. If I simply freed the allocated memory at the end of each iteration and then immediately executed the next one, the first iteration took much longer (1000us) and the subsequent iterations took less time (200-300us). If I deliberately didn't free the buffer in each rotation, as recommended in this blog, bridge rotation was worse than the others. I can imagine that the system reuses the same memory block in each iteration, making the time for memory allocation within subsequent iterations negligible. The blog I mentioned before also compared a juggling rotation with triple reversal rotation, and it found that most rotation algorithms perform at roughly the same level. The best way to optimize such an algorithm is to make it as predictable as possible for modern computers, so that they can leverage their hardware-level advantages to accelerate the program. That does make sense.
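For reference, a minimal timing harness in the spirit of this test (my own sketch with `std::chrono`, not the exact code behind the numbers above):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000, left = 333334, runs = 100;
    std::vector<float> data(n, 1.0f);
    double total_us = 0.0;
    for (int run = 0; run < runs; ++run) {
        auto t0 = std::chrono::steady_clock::now();
        triple_reversal_rotation(left, n, data.data());
        auto t1 = std::chrono::steady_clock::now();
        total_us += std::chrono::duration<double, std::micro>(t1 - t0).count();
    }
    printf("average: %.1f us\n", total_us / runs);
    return 0;
}
```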
Circshift
Here I use triple reversal rotation to implement an in-place `circshift` function capable of shifting along any given dimension of N-dimensional data. The code is here, and the interface looks like this:
```cpp
template <typename T>
void circshift(const std::vector<int> &vDims, int iAxis, int shift, T *pSrcDst);
```
It's not a fully optimized algorithm, since I use two loops to handle the data before and after the target axis separately, as sketched below. I compared its performance with Matlab's built-in `circshift` function, and Matlab is still faster for small and medium array sizes. However, as the size grows larger, Matlab takes much longer to allocate a buffer for the shifted elements. In such cases, this naive in-place `circshift` can outperform it.
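To give a flavor of the layout logic, here is a simplified sketch of one plausible implementation (not the exact linked code). Treating the row-major N-D array as outer × dim × inner blocks around `iAxis`, it applies the triple reversal to each strided 1-D slice:

```cpp
#include <vector>

// Reverse `count` elements starting at p, spaced `stride` apart.
template <typename T>
static void reverse_strided(int count, int stride, T *p) {
    for (int i = 0, j = count - 1; i < j; ++i, --j) {
        T tmp = p[i * stride];
        p[i * stride] = p[j * stride];
        p[j * stride] = tmp;
    }
}

template <typename T>
void circshift(const std::vector<int> &vDims, int iAxis, int shift, T *pSrcDst) {
    int outer = 1, inner = 1;
    const int dim = vDims[iAxis];
    for (int k = 0; k < iAxis; ++k) outer *= vDims[k];                     // dims before the axis
    for (int k = iAxis + 1; k < (int)vDims.size(); ++k) inner *= vDims[k]; // dims after the axis

    const int s = ((shift % dim) + dim) % dim; // normalize the shift into [0, dim)
    if (s == 0) return;
    const int left = dim - s; // shifting right by s == rotating [a, b] with |a| = dim - s

    for (int o = 0; o < outer; ++o) {
        for (int i = 0; i < inner; ++i) {
            T *p = pSrcDst + o * dim * inner + i;
            reverse_strided(left, inner, p);                      // a -> a^r
            reverse_strided(dim - left, inner, p + left * inner); // b -> b^r
            reverse_strided(dim, inner, p);                       // whole -> ba
        }
    }
}
```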
Fftshift
Once you get `circshift`, `fftshift` and `ifftshift` are quite easy to implement:
```cpp
template <typename T>
void fftshift(const std::vector<int> &vDims, int iAxis, T *pSrcDst) {
    // move the zero-frequency component to the center
    circshift(vDims, iAxis, vDims[iAxis] / 2, pSrcDst);
}

template <typename T>
void ifftshift(const std::vector<int> &vDims, int iAxis, T *pSrcDst) {
    // undo fftshift; the two differ only for odd-sized dimensions
    circshift(vDims, iAxis, (vDims[iAxis] + 1) / 2, pSrcDst);
}
```
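A quick sanity check of the expected behavior (assuming the functions above):

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> dims = {8};
    float x[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    fftshift(dims, 0, x);  // x becomes {4, 5, 6, 7, 0, 1, 2, 3}
    ifftshift(dims, 0, x); // back to {0, 1, 2, 3, 4, 5, 6, 7}
    for (float v : x) printf("%g ", v);
    return 0;
}
```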