In my code I perform many subset operations on fairly large dataframes. Unfortunately, some of the columns contain lists; I am not insisting on storing the data with pandas, but I haven't found a better option yet. The dataframes follow this pattern, although they can get very large:
list_column_one        list_column_two           other_column_1   other_column_2
["apple", "orange"]    ["cucumber", "tomato"]    1                "bread"
I have tried subsetting them like this when the subset should contain rows with a certain value in a non-list column:
df[[d == some_value for d in df["other_column_1"]]]
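As far as I understand, for the non-list case this is just building a boolean mask in a Python-level loop; the same filter can be written as a vectorized comparison (using the same some_value as above):

# equivalent filter, letting pandas build the mask
df[df["other_column_1"] == some_value]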
Like this when the subset should contain a certain value in a list column:
df.loc[df["list_column_one"].map(lambda d: some_value in d)]
Or like this, when the list in a column should be a subset of another list:
from collections import Counter

# source: https://stackoverflow.com/a/15147825/7253302
def counterSubset(list1, list2):
    # True if every element of list1 occurs in list2 at least as often
    c1, c2 = Counter(list1), Counter(list2)
    for k, n in c1.items():
        if n > c2[k]:
            return False
    return True
important_list = ["apple", "orange", "bear"]
df[[counterSubset(d, important_list) for d in df["list_column_one"]]]
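On the single example row this keeps the row, since ["apple", "orange"] is a (multi)subset of important_list:

mask = [counterSubset(d, important_list) for d in df["list_column_one"]]
# mask == [True]: both "apple" and "orange" appear in
# ["apple", "orange", "bear"] at least as often as in the row's list
df[mask]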
But all of these still slow the code down massively because they are executed so often. Is there any way to use Cython, NumPy, or another package for data storage in order to speed up these lookups?