I want to divide my images into smaller windows which will be sent to a neural net for training (e.g. for training face detectors). I found the tf.extract_image_patches method in TensorFlow, which seemed like exactly what I need. This question explains what it does.
The example there uses an input of shape (1x10x10x1) (the numbers 1 through 100 in order) with a ksize of (1, 3, 3, 1) and strides of (1, 5, 5, 1); the snippet after the output reproduces the call. The output is this:
    [[[[ 1  2  3 11 12 13 21 22 23]
       [ 6  7  8 16 17 18 26 27 28]]

      [[51 52 53 61 62 63 71 72 73]
       [56 57 58 66 67 68 76 77 78]]]]
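For reference, here is a minimal snippet that should reproduce that output (I'm assuming the TensorFlow 1.x session API here):

    import tensorflow as tf

    # 10x10 "image" holding the values 1..100, shaped (batch, height, width, channels)
    images = tf.reshape(tf.range(1, 101), [1, 10, 10, 1])

    # 3x3 windows, moving 5 pixels at a time in each direction
    patches = tf.extract_image_patches(images,
                                       ksizes=[1, 3, 3, 1],
                                       strides=[1, 5, 5, 1],
                                       rates=[1, 1, 1, 1],
                                       padding='VALID')

    with tf.Session() as sess:
        print(sess.run(patches))  # prints the (1, 2, 2, 9) array above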
But I'd expect windows like this (of shape (Nx3x3x1), i.e. N patches/windows of size 3x3):
    [[[ 1,  2,  3]
      [11, 12, 13]
      [21, 22, 23]]
     ...
So why are all the patch values flattened into 1D? Does this mean that the method is not meant for the purpose I described above, and that I can't use it to prepare batches for training? I also found another method for extracting patches, sklearn.feature_extraction.image.extract_patches_2d, and that one really does what I was expecting (see the snippet below). Should I conclude that these two methods don't do the same thing?
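For comparison, this is roughly how I call the scikit-learn function; it keeps every patch as a 2-D window, which is what I expected (note that it has no stride parameter, it simply extracts all overlapping patches):

    import numpy as np
    from sklearn.feature_extraction.image import extract_patches_2d

    # same 10x10 image holding the values 1..100
    image = np.arange(1, 101).reshape(10, 10)

    # all overlapping 3x3 patches: (10-3+1)**2 = 64 of them
    patches = extract_patches_2d(image, patch_size=(3, 3))
    print(patches.shape)  # (64, 3, 3)
    print(patches[0])     # [[ 1  2  3]
                          #  [11 12 13]
                          #  [21 22 23]]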