With a more modestly sized matrix, my extractor code is slower, but only by about 2x.
In [1]: import numpy as np
In [2]: from scipy import sparse
In [4]: M = sparse.random(1000,1000,.1,'csr')
In [5]: M
Out[5]: 
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 100000 stored elements in Compressed Sparse Row format>
In [6]: idx = np.arange(100)
In [7]: M[idx,:]
Out[7]: 
<100x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 10116 stored elements in Compressed Sparse Row format>
In [8]: timeit M[idx,:]
173 µs ± 514 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
The extractor:
In [10]: def extractor(indices, N):
    ...:     # one nonzero per row: row i has a 1 at column indices[i]
    ...:     indptr = np.arange(len(indices)+1)
    ...:     data = np.ones(len(indices))
    ...:     shape = (len(indices), N)
    ...:     return sparse.csr_matrix((data, indices, indptr), shape=shape)
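To see what this builds: row i of the extractor has a single 1 at column indices[i], so left-multiplying M by it pulls out exactly those rows. A tiny illustration (not from the session above):

E = extractor(np.array([4, 0, 2]), 5)
print(E.toarray())
# [[0. 0. 0. 0. 1.]
#  [1. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]]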
The nnz counts match:
In [13]: extractor(idx, M.shape[0])*M
Out[13]: 
<100x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 10116 stored elements in Compressed Sparse Row format>
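Matching nnz is only a weak check; the two results can also be compared element for element:

# 0 nonzeros in the inequality matrix means the results are identical
(extractor(idx, M.shape[0])*M != M[idx,:]).nnz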
Somewhat slower, but not as much as in your example.
In [14]: timeit extractor(idx, M.shape[0])*M
302 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
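Part of that gap is presumably the cost of building the extractor matrix on each call; timing the construction alone (which I haven't done in this session) would separate it from the multiply:

# time just the construction, apart from the product
timeit extractor(idx, M.shape[0])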
This difference is larger than what I saw when I tested this 5 years ago.
Since I have other things to do, I'm not going to test the other cases - a larger M and a longer idx.
You are welcome to explore the sparse code to see whether they have changed the calculation, perhaps streamlining it in ways that my reconstruction doesn't capture.
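If someone does want to push on those cases, a sweep along these lines (using the timeit module instead of the magic; the sizes and density are arbitrary choices, and it assumes the extractor defined above) would map out how the gap scales:

import timeit
import numpy as np
from scipy import sparse

# Sweep the matrix size, keeping 10% density and selecting 10% of the rows.
for n in (1000, 2000, 4000):
    M = sparse.random(n, n, 0.1, 'csr')
    idx = np.arange(n // 10)
    t1 = timeit.timeit(lambda: M[idx, :], number=100)
    t2 = timeit.timeit(lambda: extractor(idx, M.shape[0]) * M, number=100)
    print(f'n={n}: direct {t1*1e4:.0f} µs/loop, extractor {t2*1e4:.0f} µs/loop')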