Here is a comparison I made. np.argsort was timed on a float32 ndarray consisting of 1,000,000 elements.
In [1]: import numpy as np
In [2]: a = np.random.randn(1000000)
In [3]: a = a.astype(np.float32)
In [4]: %timeit np.argsort(a)
86.1 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And here is a C++ program that does the same procedure, but on a std::vector, following this answer.
#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>  // only used for cv::getTickCount / cv::getTickFrequency timing
#include <numeric>
#include <utility>
int main()
{
  // Fill a vector with 1,000,000 random floats in [0, 1].
  std::vector<float> numbers;
  for (int i = 0; i != 1000000; ++i) {
    numbers.push_back((float)rand() / (RAND_MAX));
  }
  double e1 = (double)cv::getTickCount();
  // argsort: sort the indices by the values they point to.
  std::vector<size_t> idx(numbers.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
                                               { return numbers[a] < numbers[b];});
  double e2 = (double)cv::getTickCount();
  std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
  return 0;
}
It prints Finished in 525.908 milliseconds. That is far slower than the numpy version. So could anyone explain what makes np.argsort so fast? Thanks.
Edit1: np.__version__ returns 1.15.0, running on Python 3.6.6 |Anaconda custom (64-bit), and g++ --version prints 8.2.0. The operating system is Manjaro Linux.
Edit2: I tried to compile with the -O2 and -O3 flags in g++ and got results of 216.515 milliseconds and 205.017 milliseconds. That is an improvement, but still slower than the numpy version. (Referring to this question.) This result was later deleted because I had mistakenly run the test with my laptop's DC adapter unplugged, which slowed it down. In a fair comparison, the C-array and vector versions perform equally (both take about 100 ms).
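(If you want to reproduce these numbers, a compile command along the lines of g++ -O3 main.cpp $(pkg-config --cflags --libs opencv) should work; the source file name and the exact pkg-config package name are placeholders here, so adjust them for your setup.)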
Edit3: Another approach is to replace the vector with a C-style array: float numbers[1000000];. With that change the running time is about 100 ms (+/- 5 ms). Full code here:
#include <iostream>
#include <vector>
#include <cstddef>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <numeric>
#include <utility>
int main()
{
  //std::vector<float> numbers;
  float numbers[1000000];  // 4 MB C-style array on the stack instead of std::vector
  for (int i = 0; i != 1000000; ++i) {
    numbers[i] = ((float)rand() / (RAND_MAX));
  }
  double e1 = (double)cv::getTickCount();
  std::vector<size_t> idx(1000000);
  std::iota(idx.begin(), idx.end(), 0);
  std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
                                               { return numbers[a] < numbers[b];});
  double e2 = (double)cv::getTickCount();
  std::cout << "Finished in " << 1000 * (e2 - e1) / cv::getTickFrequency() << " milliseconds." << std::endl;
  return 0;
}
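In case the OpenCV dependency gets in the way of reproducing the timings, here is an equivalent sketch of the vector version that times the same index sort with std::chrono instead of cv::getTickCount (a minimal sketch for reproduction, not the exact program I benchmarked above):
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <vector>
int main()
{
  // Fill a vector with 1,000,000 random floats in [0, 1], as above.
  std::vector<float> numbers(1000000);
  for (auto &x : numbers) {
    x = (float)rand() / (RAND_MAX);
  }
  auto t0 = std::chrono::steady_clock::now();
  // argsort: sort the indices by the values they point to.
  std::vector<size_t> idx(numbers.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::sort(idx.begin(), idx.end(), [&numbers](const size_t &a, const size_t &b)
                                               { return numbers[a] < numbers[b];});
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "Finished in "
            << std::chrono::duration<double, std::milli>(t1 - t0).count()
            << " milliseconds." << std::endl;
  return 0;
}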