I'm almost positive that you're encountering memory issues because str.get_dummies returns an array of 1s and 0s with datatype np.int64. This is quite different from the behavior of pd.get_dummies, which returns values of datatype uint8.
This appears to be a known issue, but there has been no update or fix for the past year. Checking the source code for str.get_dummies will indeed confirm that it returns np.int64.
An 8-bit integer takes up 1 byte of memory, while a 64-bit integer takes up 8 bytes. I'm hopeful that the memory problems can be avoided by finding an alternative way to one-hot encode Col2 that ensures the output values are all 8-bit integers.
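To put that 8x difference in concrete terms, here's a quick sketch comparing the footprint of the two dtypes. The array shape is just an illustrative stand-in for a large one-hot matrix:

```python
import numpy as np

# a hypothetical one-hot matrix: 100,000 rows x 100 dummy columns
rows, cols = 100_000, 100

as_int64 = np.zeros((rows, cols), dtype=np.int64)
as_uint8 = np.zeros((rows, cols), dtype=np.uint8)

print(as_int64.nbytes)  # 80,000,000 bytes (~80 MB)
print(as_uint8.nbytes)  # 10,000,000 bytes (~10 MB)
```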
Here was my approach, beginning with your example:
df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})
df
  Col1   Col2
0    X  a,b,c
1    Y    a,b
2    X    b,d
- Since Col1 contains simple, non-delimited strings, we can easily one-hot encode it using pd.get_dummies:
df = pd.get_dummies(df, columns=['Col1'])
df
    Col2  Col1_X  Col1_Y
0  a,b,c       1       0
1    a,b       0       1
2    b,d       1       0
So far so good.
df['Col1_X'].values.dtype
dtype('uint8')
- Let's get a sorted list of all unique substrings contained inside the comma-delimited strings in Col2:
vals = list(df['Col2'].str.split(',').values)
vals = [i for l in vals for i in l]
vals = list(set(vals))
vals.sort()
vals
['a', 'b', 'c', 'd']
- Now we can loop through the above list of values and use str.contains to create a new column for each value, such as 'a'. Each row in a new column gets 1 if its Col2 string contains that column's value, and 0 otherwise. As we create each new column, we convert its datatype to uint8. (Note that str.contains does substring matching, so this assumes no value is a substring of another; passing regex=False also keeps values from being interpreted as regular expressions.)
col = 'Col2'
for v in vals:
    n = col + '_' + v
    df[n] = df[col].str.contains(v, regex=False)
    df[n] = df[n].astype('uint8')
df.drop(col, axis=1, inplace=True)
df
   Col1_X  Col1_Y  Col2_a  Col2_b  Col2_c  Col2_d
0       1       0       1       1       1       0
1       0       1       1       1       0       0
2       1       0       0       1       0       1
This results in a dataframe in your desired format. And thankfully, the integers in the four new columns one-hot encoded from Col2 take up only 1 byte each, as opposed to 8.
df['Col2_a'].dtype
dtype('uint8')
If, on the off chance, the above approach doesn't work, my advice would be to use str.get_dummies to one-hot encode Col2 in chunks of rows. For each chunk, convert its datatype from np.int64 to uint8 and transform the chunk into a sparse matrix; you can then concatenate all the chunks back together at the end.
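A minimal sketch of that chunked fallback, using your example dataframe. The chunk size of 2 is only for illustration (something like 10,000 would be more realistic on real data), scipy is assumed to be available, and note that each chunk must be reindexed against the full vocabulary, since a chunk only produces columns for the values it happens to contain:

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'Col2': ['a,b,c', 'a,b', 'b,d']})

# Build the full vocabulary first, so every chunk ends up with the same columns
all_vals = sorted(set(v for row in df['Col2'].str.split(',') for v in row))

chunk_size = 2  # illustrative only; use something much larger in practice
chunks = []
for start in range(0, len(df), chunk_size):
    part = df['Col2'].iloc[start:start + chunk_size].str.get_dummies(sep=',')
    # align columns, downcast int64 -> uint8, then store as a sparse matrix
    part = part.reindex(columns=all_vals, fill_value=0).astype('uint8')
    chunks.append(sparse.csr_matrix(part.values))

encoded = sparse.vstack(chunks)  # shape: (len(df), len(all_vals)), dtype uint8
```

The sparse representation stores only the nonzero entries, so the final concatenated matrix stays small even when the vocabulary is large.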