A timing comparison of @Cedric Poulet's solution (all credit to him — see his answer), extended with array splitting so that it returns the result in the desired form, against another numpy approach I considered first: create an array of zeros and insert the data in place.
import functools
import math
import time

import numpy as np
def time_measure(func):
    """Decorator that prints the wall-clock duration of each call to *func*.

    The wrapped function's return value is passed through unchanged.
    """

    @functools.wraps(func)  # preserve __name__/__doc__ of the wrapped function
    def wrapper(*args, **kwargs):
        # perf_counter is monotonic and high-resolution — the right clock
        # for measuring elapsed time (time.time can jump on clock adjustment).
        start = time.perf_counter()
        result = func(*args, **kwargs)
        stop = time.perf_counter()
        print(f"Elapsed time: {stop-start}")
        return result

    return wrapper
@time_measure
def pad_and_chunk(array, chunk_size: int):
    """Zero-pad *array* up to a multiple of *chunk_size* and split it.

    Returns a list of 1-D float arrays, each of length ``chunk_size``.
    The input array is not modified.
    """
    # (-len) % chunk_size is 0 when the length already divides evenly —
    # the original `chunk_size - len % chunk_size` appended a whole
    # spurious chunk of zeros in that case.
    pad = -len(array) % chunk_size
    padded_array = np.zeros(len(array) + pad)
    padded_array[: len(array)] = array
    # Integer division: np.split expects an integral section count.
    return np.split(padded_array, len(padded_array) // chunk_size)
@time_measure
def resize(array, chunk_size: int):
    """Grow *array* in place to a multiple of *chunk_size*, then split it.

    NOTE: mutates the input (``ndarray.resize`` with ``refcheck=False``);
    the appended elements are zero-filled.
    Returns a list of 1-D arrays, each of length ``chunk_size``.
    """
    # 0 when len(array) already divides evenly — avoids growing the array
    # by a full extra chunk of zeros in that case.
    pad = -len(array) % chunk_size
    array.resize(len(array) + pad, refcheck=False)
    return np.split(array, len(array) // chunk_size)
@time_measure
def makechunk4(l, chunk):
    """Resize *l* in place and view it as rows of length *chunk*.

    Returns a 2-D array of shape ``(ceil(n / chunk), chunk)`` where each
    ROW is one chunk; trailing elements are zero-filled by ``resize``.
    NOTE: mutates the input array.
    """
    rows = math.ceil(l.shape[0] / chunk)
    l.resize((rows, chunk), refcheck=False)
    # reshape(-1, chunk) keeps each row a contiguous chunk of length
    # `chunk`. The original reshape(chunk, -1) transposed the layout,
    # scrambling the chunk contents.
    return l.reshape(-1, chunk)
if __name__ == "__main__":
    # 1M random floats; chunk size 3 deliberately does not divide evenly.
    # NOTE: resize/makechunk4 mutate the array in place, so the three
    # runs operate on the same (progressively resized) buffer, exactly
    # as in the original three sequential calls.
    data = np.random.rand(1_000_000)
    for chunker in (pad_and_chunk, resize, makechunk4):
        ret = chunker(data, 3)
EDIT 2:
Gathering all the answers together, it is indeed the case that np.split is horribly slow compared to reshape.
Elapsed time: 0.3276541233062744
Elapsed time: 0.3169224262237549
Elapsed time: 1.8835067749023438e-05
The way the data is padded is not the issue; it is the split that takes up most of the time.