I know this can be done easily with pandas, but the data is too sparse and large (170,000 x 5,000), and since I need to feed the result into sklearn afterwards anyway, I'm wondering whether there is a way to do it with sklearn directly. I tried OneHotEncoder, but got stuck on associating the dummies with the 'id'.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3], 'item': ['a', 'a', 'c', 'b', 'a', 'b']})
   id item
0   1    a
1   1    a
2   2    c
3   2    b
4   3    a
5   3    b
# one-hot encode 'item', then sum the dummies within each 'id'
dummy = pd.get_dummies(df, prefix='item', columns=['item'])
dummy.groupby('id').sum().reset_index()
   id  item_a  item_b  item_c
0   1       2       0       0
1   2       0       1       1
2   3       1       1       0
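For reference, the same count table comes out of a single pd.crosstab call, though the result is just as dense as the get_dummies route:

counts = pd.crosstab(df['id'], df['item'])
counts

item  a  b  c
id
1     2  0  0
2     0  1  1
3     1  1  0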
Update:
This is where I am now; the 'id' is lost along the way, so how do I do the aggregation afterwards?
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# map the string items to integer codes, then one-hot encode them
lab = LabelEncoder()
labels = lab.fit_transform(np.array(df.item))
enc = OneHotEncoder()
dummy = enc.fit_transform(labels.reshape(-1, 1))
dummy.todense()
matrix([[ 1.,  0.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  0.,  1.],
        [ 0.,  1.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  1.,  0.]])
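Below is a sketch of one way I imagine the aggregation could work without leaving sparse land: build a sparse id-indicator matrix and multiply it against the one-hot matrix, so nothing is ever densified (id_codes, group, and counts are names I made up for illustration, not library API):

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import LabelEncoder

# encode the ids to 0..n_ids-1 so each id maps to a matrix row
id_codes = LabelEncoder().fit_transform(df.id)
n_rows, n_ids = len(df), id_codes.max() + 1

# sparse indicator: group[i, j] == 1 iff data row j belongs to id i
group = sp.csr_matrix((np.ones(n_rows), (id_codes, np.arange(n_rows))),
                      shape=(n_ids, n_rows))

# the sparse product sums the one-hot rows within each id,
# reproducing the pandas count table while staying sparse
counts = group @ dummy    # shape (n_ids, n_item_categories)
counts.todense()

matrix([[ 2.,  0.,  0.],
        [ 0.,  1.,  1.],
        [ 1.,  1.,  0.]])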