I have read that Quick sort is one of the most efficient sorting algorithms (if not THE most), but I don't understand why it is so slow.
Before you try to close as duplicate, of course I have read this question, but it was posted almost a decade ago, and none of the answers were optimized.
I tested code from many answers and found them all inefficient, and I implemented a version of mine own with rudimentary optimization:
import random
from collections import deque
def qsort(arr):
    if len(arr) <= 1:
        return arr
    else:
        return qsort([x for x in arr[1:] if x < arr[0]]) + [arr[0]] + qsort([x for x in arr[1:] if x >= arr[0]])
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    lt = deque()
    ge = deque()
    e0 = arr.popleft()
    for e in arr:
        if e < e0:
            lt.append(e)
        else:
            ge.append(e)
    return quick_sort(lt) + deque([e0]) + quick_sort(ge)
def quick_sort_helper(arr):
    return quick_sort(deque(arr))
order = list(range(256))
chaos = random.choices(range(666), k=256)
In [2]: %timeit qsort(chaos)
480 µs ± 2.83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [3]: %timeit qsort(order)
4.05 ms ± 45.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit quick_sort_helper(chaos)
394 µs ± 9.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [5]: %timeit quick_sort_helper(order)
3.06 ms ± 57.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
They are all extremely slow, but my optimization did help a little. (I have done many other tests, they are not included for brevity)
For comparison, here is the performance of a version of insertion sort I wrote in under 3 minutes:
def bisect_right(a, x):
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo
def insertion_sort(arr):
    out = deque()
    for e in arr:
        out.insert(bisect_right(out, e), e)
    return out
In [12]: %timeit insertion_sort(chaos)
311 µs ± 6.28 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [13]: %timeit insertion_sort(order)
267 µs ± 7.93 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
How to optimize the algorithm so as to reduce recursion and identify sorted sub sequences to eliminate unnecessary computation?
The code from the answer linked in a comment below is surprisingly NOT faster than mine:
In [22]: %timeit qsort(chaos, 0, 255)
391 µs ± 6.95 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
It seems to perform better in the worst case, but that's about it.
In [28]: %timeit qsort(order, 0, 255)
360 µs ± 6.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
 
    