Convert a 2d matrix to a 3d one hot matrix numpy

Question

I have np matrix and I want to convert it to a 3d array with one hot encoding of the elements as third dimension. Is there a way to do with without looping over each row eg

a=[[1,3],
   [2,4]]

should be made into

b=[[1,0,0,0], [0,0,1,0],
   [0,1,0,0], [0,0,0,1]]

Divakar · Accepted Answer · 2017-10-19T06:45:06.327

Approach #1

Here's a cheeky one-liner that abuses broadcasted comparison -

(np.arange(a.max()) == a[...,None]-1).astype(int)

Sample run -

In [120]: a
Out[120]: 
array([[1, 7, 5, 3],
       [2, 4, 1, 4]])

In [121]: (np.arange(a.max()) == a[...,None]-1).astype(int)
Out[121]: 
array([[[1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 0]],

       [[0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0]]])

For 0-based indexing, it would be -

In [122]: (np.arange(a.max()+1) == a[...,None]).astype(int)
Out[122]: 
array([[[0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0]],

       [[0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0]]])

If the one-hot enconding is to cover for the range of values ranging from the minimum to the maximum values, then offset by the minimum value and then feed it to the proposed method for 0-based indexing. This would be applicable for rest of the approaches discussed later on in this post as well.

Here's a sample run on the same -

In [223]: a
Out[223]: 
array([[ 6, 12, 10,  8],
       [ 7,  9,  6,  9]])

In [224]: a_off = a - a.min() # feed a_off to proposed approaches

In [225]: (np.arange(a_off.max()+1) == a_off[...,None]).astype(int)
Out[225]: 
array([[[1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 0]],

       [[0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0]]])

If you are okay with a boolean array with True for 1's and False for 0's, you can skip the .astype(int) conversion.

Approach #2

We can also initialize a zeros arrays and index into the output with advanced-indexing. Thus, for 0-based indexing, we would have -

def onehot_initialization(a):
    ncols = a.max()+1
    out = np.zeros(a.shape + (ncols,), dtype=int)
    out[all_idx(a, axis=2)] = 1
    return out

Helper func -

# https://stackoverflow.com/a/46103129/ @Divakar
def all_idx(idx, axis):
    grid = np.ogrid[tuple(map(slice, idx.shape))]
    grid.insert(axis, idx)
    return tuple(grid)

This should be especially more performant when dealing with larger range of values.

For 1-based indexing, simply feed in a-1 as the input.

Approach #3 : Sparse matrix solution

Now, if you are looking for sparse array as output and AFAIK since scipy's inbuilt sparse matrices support only 2D formats, you can get a sparse output that is a reshaped version of the output shown earlier with the first two axes merging and the third axis being kept intact. The implementation for 0-based indexing would look something like this -

from scipy.sparse import coo_matrix
def onehot_sparse(a):
    N = a.size
    L = a.max()+1
    data = np.ones(N,dtype=int)
    return coo_matrix((data,(np.arange(N),a.ravel())), shape=(N,L))

Again, for 1-based indexing, simply feed in a-1 as the input.

Sample run -

In [157]: a
Out[157]: 
array([[1, 7, 5, 3],
       [2, 4, 1, 4]])

In [158]: onehot_sparse(a).toarray()
Out[158]: 
array([[0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0]])

In [159]: onehot_sparse(a-1).toarray()
Out[159]: 
array([[1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0]])

This would be much better than previous two approaches if you are okay with having sparse output.

Runtime comparison for 0-based indexing

Case #1 :

In [160]: a = np.random.randint(0,100,(100,100))

In [161]: %timeit (np.arange(a.max()+1) == a[...,None]).astype(int)
1000 loops, best of 3: 1.51 ms per loop

In [162]: %timeit onehot_initialization(a)
1000 loops, best of 3: 478 µs per loop

In [163]: %timeit onehot_sparse(a)
10000 loops, best of 3: 87.5 µs per loop

In [164]: %timeit onehot_sparse(a).toarray()
1000 loops, best of 3: 530 µs per loop

Case #2 :

In [166]: a = np.random.randint(0,500,(100,100))

In [167]: %timeit (np.arange(a.max()+1) == a[...,None]).astype(int)
100 loops, best of 3: 8.51 ms per loop

In [168]: %timeit onehot_initialization(a)
100 loops, best of 3: 2.52 ms per loop

In [169]: %timeit onehot_sparse(a)
10000 loops, best of 3: 87.1 µs per loop

In [170]: %timeit onehot_sparse(a).toarray()
100 loops, best of 3: 2.67 ms per loop

Squeezing out best performance

To squeeze out the best performance, we could modify approach #2 to use indexing on a 2D shaped output array and also use uint8 dtype for memory efficiency and that leading to much faster assignments, like so -

def onehot_initialization_v2(a):
    ncols = a.max()+1
    out = np.zeros( (a.size,ncols), dtype=np.uint8)
    out[np.arange(a.size),a.ravel()] = 1
    out.shape = a.shape + (ncols,)
    return out

Timings -

In [178]: a = np.random.randint(0,100,(100,100))

In [179]: %timeit onehot_initialization(a)
     ...: %timeit onehot_initialization_v2(a)
     ...: 
1000 loops, best of 3: 474 µs per loop
10000 loops, best of 3: 128 µs per loop

In [180]: a = np.random.randint(0,500,(100,100))

In [181]: %timeit onehot_initialization(a)
     ...: %timeit onehot_initialization_v2(a)
     ...: 
100 loops, best of 3: 2.38 ms per loop
1000 loops, best of 3: 213 µs per loop

you probably don't actually want the `.astype(int)` here, since that incurs an unnecessary copy — Eric, May 01 '16 at 05:19
@Eric Well, without it the output would be a boolean array and OP wants an int array it seems. Or am I missing something? — Divakar, May 01 '16 at 05:21
This is more a remark aimed at the OP, and how they probably want the boolean array in the first place. Viewing as an `np.int8` might elide the copy — Eric, May 01 '16 at 05:59
Good point, yes boolean arrays would do the job too, I will keep that in mind, @Divakar is there any way to make this sparse in ints or booleans? — Rahul, May 01 '16 at 13:24
I think this is truly an awesome trick, and I just wanted to share that you can even make it work with arrays or an arbitrary number of dimensions with this slight modification to the magic line: `(np.arange(a.max()) == a[..., None]-1).astype(int)` — jdehesa, Feb 14 '17 at 16:42
Just in case anyone needed onehot shape of [N, H, W] like I do, simply invert the shape of `out` in function `onehot_initialization_v2` like this: `out = np.zeros( (ncols, a.size), dtype=np.uint8)` and `out[a.ravel(), np.arange(a.size)] = 1`, `out.shape = (ncols,) + a.shape`. Notice that `.reshape` on original function won't do. — Toon Tran, Aug 25 '21 at 20:49

off99555 · Answer 2 · 2019-12-14T21:55:04.017

If you are trying to create one-hot tensor for your machine learning models (you have tensorflow or keras installed) then you can use one_hot function from https://www.tensorflow.org/api_docs/python/tf/keras/backend/one_hot or https://www.tensorflow.org/api_docs/python/tf/one_hot

It's what I'm using and is working well for high dimensional data.

Here's example usage:

>>> import tensorflow as tf

>>> tf.one_hot([[0,2],[1,3]], 4).numpy()
array([[[1., 0., 0., 0.],
        [0., 0., 1., 0.]],

       [[0., 1., 0., 0.],
        [0., 0., 0., 1.]]], dtype=float32)

simon · Answer 3 · 2018-10-25T12:28:12.420

Edit: I just realized that my answer is covered already in the accepted answer. Unfortunately, as an unregistered user, I cannot delete it any more.

As an addendum to the accepted answer: If you have a very small number of classes to encode and if you can accept np.bool arrays as output, I found the following to be even slightly faster:

def onehot_initialization_v3(a):
    ncols = a.max() + 1
    labels_one_hot = (a.ravel()[np.newaxis] == np.arange(ncols)[:, np.newaxis]).T
    labels_one_hot.shape = a.shape + (ncols,)
    return labels_one_hot

Timings (for 10 classes):

a = np.random.randint(0,10,(100,100))
assert np.all(onehot_initialization_v2(a) == onehot_initialization_v3(a))
%timeit onehot_initialization_v2(a)
%timeit onehot_initialization_v3(a)

# 102 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 79.3 µs ± 815 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This changes, however, if the number of classes increases (now 100 classes):

a = np.random.randint(0,100,(100,100))
assert np.all(onehot_initialization_v2(a) == one_hot_initialization_v3(a))
%timeit onehot_initialization_v2(a)
%timeit onehot_initialization_v3(a)

# 132 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 639 µs ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So, depending on your problem, either might be the faster version.

score 1 · Answer 4 · answered Jun 26 '21 at 21:27

1

Here is the simplest and most elegant solution using np.eye (identity matrix) and numpy powerful indexing :

labels_3d = np.eye(N_CLASSES)[labels_2d]

answered Jun 26 '21 at 21:27

Ismael EL ATIFI

1,939
20
16

Convert a 2d matrix to a 3d one hot matrix numpy

4 Answers4

Approach #1

Approach #2

Approach #3 : Sparse matrix solution

Squeezing out best performance

Linked