Say I have the following:
my_list = np.array(["abc", "def", "ghi"])
and I'd like to get:
np.array(["ef", "hi"])
I tried:
my_list[1:,1:]
But then I get:
IndexError: too many indices for array
Does Numpy support slicing strings?
No, you cannot do that. To numpy, np.array(["abc", "def", "ghi"]) is a 1D array of strings, so you cannot use 2D slicing on it.
You could either define your array as a 2D array of characters, or simply use a list comprehension for the slicing:
In [4]: np.asarray([el[1:] for el in my_list[1:]])
Out[4]:
array(['ef', 'hi'], dtype='|S2')
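The other option mentioned above, a 2D array of characters, might look like this (a sketch; on Python 3 the default string dtype is 'U' rather than 'S'):

```python
import numpy as np

# Build a 2D array of single characters instead of a 1D array of strings
chars = np.array([list(s) for s in ["abc", "def", "ghi"]])  # shape (3, 3), dtype '<U1'

# Ordinary 2D slicing now works; join each row back into a string
sliced = np.array(["".join(row) for row in chars[1:, 1:]])
print(sliced)  # ['ef' 'hi']
```

This only works cleanly when all strings have the same length, since the rows of a 2D array must match.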
Your array of strings stores the data as a contiguous block of characters, using the 'S3' dtype to divide it into strings of length 3.
In [116]: my_list
Out[116]:
array(['abc', 'def', 'ghi'],
dtype='|S3')
An 'S1,S2' dtype views each element as two strings, of 1 and 2 characters respectively:
In [115]: my_list.view('S1,S2')
Out[115]:
array([('a', 'bc'), ('d', 'ef'), ('g', 'hi')],
dtype=[('f0', 'S1'), ('f1', 'S2')])
Select the 2nd field to get an array with the desired characters:
In [114]: my_list.view('S1,S2')[1:]['f1']
Out[114]:
array(['ef', 'hi'],
dtype='|S2')
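On Python 3 the default string dtype is '<U3' (4 bytes per code point) rather than '|S3', so the same trick uses the Unicode field dtype. A sketch of the equivalent:

```python
import numpy as np

my_list = np.array(["abc", "def", "ghi"])  # dtype '<U3' on Python 3

# View each 3-char element as a 1-char field 'f0' and a 2-char field 'f1'
# (itemsizes match: 4 + 8 == 12 bytes), then pick the second field.
result = my_list.view('U1,U2')[1:]['f1']
print(result)  # ['ef' 'hi']
```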
My first attempt with view was to split the array into single byte strings, and play with the resulting 2d array:
In [48]: my_2dstrings = my_list.view(dtype='|S1').reshape(3,-1)
In [49]: my_2dstrings
Out[49]:
array([['a', 'b', 'c'],
['d', 'e', 'f'],
['g', 'h', 'i']],
dtype='|S1')
This array can then be sliced in both dimensions. I used flatten to remove a dimension, and to force a copy (to get a new contiguous buffer).
In [50]: my_2dstrings[1:,1:].flatten().view(dtype='|S2')
Out[50]:
array(['ef', 'hi'],
dtype='|S2')
If the strings are already in an array (as opposed to a list) then this approach is much faster than the list comprehension approaches.
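The view/reshape/slice/flatten/view steps above can be wrapped into a small helper. This is my own sketch (the name drop_first_char is hypothetical), written for Python 3's 'U' dtype and assuming a contiguous input array of fixed-width strings:

```python
import numpy as np

def drop_first_char(arr):
    """Hypothetical helper: strip the first character from each string in a
    contiguous fixed-width string array, using views plus one copy (flatten)."""
    n = arr.shape[0]
    # Number of characters per string: total itemsize / size of one character
    width = arr.dtype.itemsize // np.dtype(arr.dtype.char + '1').itemsize
    # View as a 2D array of single characters, drop the first column,
    # copy to a contiguous buffer, and view as shorter strings.
    chars = arr.view(arr.dtype.char + '1').reshape(n, -1)
    return chars[:, 1:].flatten().view(f"{arr.dtype.char}{width - 1}")

my_list = np.array(["abc", "def", "ghi"])
print(drop_first_char(my_list[1:]))  # ['ef' 'hi']
```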
Some timings with the 1000 x 64 list that wflynny tests:
In [98]: timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 173 us per loop # my computer is slower
In [99]: timeit np.array(my_list_64).view('S1').reshape(64,-1)[1:,1:].flatten().view('S63')
1000 loops, best of 3: 213 us per loop
In [100]: %%timeit arr = np.array(my_list_64)
   .....: arr.view('S1').reshape(64,-1)[1:,1:].flatten().view('S63')
   .....:
10000 loops, best of 3: 23.2 us per loop
Creating the array from the list is slow, but once created the view approach is much faster.
See my edit history for my earlier notes on np.char.
As Joe Kington points out here, Python is very good at string manipulation, and generator/list comprehensions are fast and flexible for string operations. Unless you need to use numpy later in your pipeline, I would advise against it.
[s[1:] for s in my_list[1:]]
is fast:
In [1]: from string import ascii_lowercase
In [2]: from random import randint, choice
In [3]: my_list_rand = [''.join([choice(ascii_lowercase)
for _ in range(randint(2, 64))])
for i in range(1000)]
In [4]: my_list_64 = [''.join([choice(ascii_lowercase) for _ in range(64)])
for i in range(1000)]
In [5]: %timeit [s[1:] for s in my_list_rand[1:]]
10000 loops, best of 3: 47.6 µs per loop
In [6]: %timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 45.3 µs per loop
Using numpy just adds overhead.
Starting with numpy 1.23.0, I added a mechanism to change the dtype of views of non-contiguous arrays. That means you can view your array as individual characters, slice it however you like, and then build it back together. Before this change, that would have required a copy, as @hpaulj's answer clearly shows.
>>> my_list = np.array(["abc", "def", "ghi"])
>>> my_list[:, None].view('U1')[1:, 1:].view('U2').squeeze()
array(['ef', 'hi'])
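To unpack that one-liner step by step (this requires numpy >= 1.23, since the final view is taken on a non-contiguous slice; the intermediate shapes are what I'd expect):

```python
import numpy as np

my_list = np.array(["abc", "def", "ghi"])  # shape (3,), dtype '<U3'
cols = my_list[:, None]                    # shape (3, 1): one string per row
chars = cols.view('U1')                    # shape (3, 3): one character per cell
tail = chars[1:, 1:]                       # shape (2, 2): drop first row and column
result = tail.view('U2').squeeze()         # rejoin each row into a string, shape (2,)
print(result)  # ['ef' 'hi']
```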
I'm working on another layer of abstraction, specifically for string arrays: np.char.slice_ (currently a work in progress in PR #20694, but the code is functional). If it gets accepted, you will be able to do
>>> np.char.slice_(my_list[1:], 1)
array(['ef', 'hi'])
Your slicing syntax is incorrect. You only need to do my_list[1:] to select those elements. If you want the elements copied twice into a list, note that extend returns None, so something = my_list[1:].extend(my_list[1:]) won't work; use something = list(my_list[1:]) + list(my_list[1:]) instead.