Demo CSV file:
      label1                              label2            m1
0   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0000_1  0.000000
1   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0001_1  1.000000
2   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0002_1  1.000000
3   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0003_1  1.414214
4   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0004_1  2.000000
5   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0005_1  2.000000
6   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0006_1  3.000000
7   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0007_1  3.162278
8   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0008_1  4.000000
9   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0009_1  5.000000
10  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0010_1  5.000000
11  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0011_1  6.000000
12  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0012_1  6.000000
13  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0013_1  6.000000
14  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0014_1  6.000000
From this CSV file I do some comparison operations: I have a function that compares rows and returns the minimum of two combinations.
There are 160,000 rows. Using pandas with a plain for loop takes a lot of time. Can I make it faster using dask? I tried a dask dataframe created from pandas, but when I call to_list (which works on a pandas column) it gives me an error. I have a Core i7 machine with 128 GB of RAM. Below is my code:
"""
#the purpose of this function is to calculate different rows...
#values for the m1 column of data frame. there could be two 
#combinations and inside combination it needs to get m1 value for the row
#suppose first comb1 will calucalte sum of m1 value of #row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_1) and 
#row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_2)
a more details of this function could be found here: 
(https://stackoverflow.com/questions/72663618/writing-a-python-function-to-get-desired-value-from-csv/72677299#72677299)
def compute(img1,img2):
   comb1=(img1_1,img2_1)+(img1_1,img2_2)
   comb2=(img1_2,img2_1)+(img1_2,img2_2)
   return minimum(comb1,comb2)
"""
def min_4line(key1, key2, df):
    # build the four per-line labels for each key: key + '_1' ... key + '_4'
    k1 = [f'{key1}_{i}' for i in range(1, 5)]
    k2 = [f'{key2}_{i}' for i in range(1, 5)]

    def row_m1(l1, l2):
        # m1 value of the single row whose labels match (l1, l2)
        return int(df.loc[(df['label1'] == l1) & (df['label2'] == l2), 'm1'].iloc[0])

    # comb1: line i of key1 paired with line i of key2
    comb1 = sum(row_m1(k1[i], k2[i]) for i in range(4))
    # comb2: line i+1 of key1 (wrapping around) paired with line i of key2
    comb2 = sum(row_m1(k1[(i + 1) % 4], k2[i]) for i in range(4))
    return min(comb1, comb2)
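For reference, I suspect most of the time goes into the eight full-column boolean scans per call; a sketch of the same logic with a dictionary built once up front (min_4line_fast is a name I made up) would be:

# build a (label1, label2) -> m1 lookup once, so each row lookup is O(1)
m1_lookup = dict(zip(zip(df3['label1'], df3['label2']), df3['m1']))

def min_4line_fast(key1, key2, lookup):
    k1 = [f'{key1}_{i}' for i in range(1, 5)]
    k2 = [f'{key2}_{i}' for i in range(1, 5)]
    comb1 = sum(int(lookup[(k1[i], k2[i])]) for i in range(4))
    comb2 = sum(int(lookup[(k1[(i + 1) % 4], k2[i])]) for i in range(4))
    return min(comb1, comb2)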
Now I have to create a unique list of keys to do the comparison:
list_line=list(df3['label1'].unique())
string_test=[a[:-2] for a in list_line]
#the [:-2] strips the trailing line suffix, e.g. 'KeyT1_L1_1_animebook0000_1' -> 'KeyT1_L1_1_animebook0000'
list_key=sorted(list(set(string_test)))
print(len(list_key))
#making lists of those two columns
label1_list = df3['label1'].to_list()
label2_list = df3['label2'].to_list()
Next, I write the output of the comparison function to a CSV file:
%%time
file = open("content\\dummy_metric.csv", "a")
file.write("label1,label2,m1\n")
c = 0
for i in range(len(list_key)):
    for j in range(i + 1, len(list_key)):
        a = min_4line(list_key[i], list_key[j], df3)
        file.write(str(list_key[i]) + "," + str(list_key[j]) + "," + str(a) + "\n")
        c += 1
        if c % 20000 == 0:  # progress marker; the original `if c>20000` printed on every later iteration
            print(c, 'pairs done')
file.close()
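Side note: the nested loop just enumerates unordered key pairs, so the same thing can be written with itertools.combinations, e.g.:

from itertools import combinations
for key1, key2 in combinations(list_key, 2):
    file.write(f"{key1},{key2},{min_4line(key1, key2, df3)}\n")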
my expected output:
     label1                          label2                m1
0   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0001    2
1   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0002    2
2   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0003    2
3   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0004    4
4   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0005    5
5   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0006    7
6   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0007    9
7   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0008    13
For dask I was proceeding like this:
import pandas as pd
import dask.dataframe as dd
csv_gb=pd.read_csv("content\\four_metric.csv")
dda = dd.from_pandas(csv_gb, npartitions=10)
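(As an aside, I understand dask can also read the CSV directly instead of going through a pandas frame first, something like:)

import dask.dataframe as dd
dda = dd.read_csv("content\\four_metric.csv")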
Up to that line everything is fine, but when I want to build the list of labels like this:
label1_list = dda['label1'].to_list()
it shows me this error:
    2022-07-05 16:31:17,530 - distributed.worker - WARNING - Compute Failed
Key:       ('unique-combine-5ce843b510d3da88b71287e6839d3aa3', 0, 1)
Function:  execute_task
args:      ((<function pipe at 0x0000022E39F18160>, [0     KeyT1_L1_1_animebook0000_1
.....
25    KeyT1_L1_1_animebook_002
kwargs:    {}
Exception: 'TypeError("\'Serialize\' object is not callable")'
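From the dask docs, my understanding is that a dask column is lazy, so it has to be materialized with .compute() before it can be turned into a plain list, something like this (untested):

label1_list = dda['label1'].compute().to_list()
list_line = dda['label1'].unique().compute().tolist()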
Is there a better way to run the above code with dask? I am also curious about using dask.distributed for my task, like this:
from dask.distributed import Client
client = Client(n_workers=3, threads_per_worker=1, processes=False, memory_limit='40GB')
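For instance, I imagine splitting the key pairs across workers with client.map, roughly like this (an untested sketch; compare_pair is a name I made up):

from itertools import combinations

pairs = list(combinations(list_key, 2))

def compare_pair(pair):
    # hypothetical wrapper; assumes the workers can see df3 (it may need client.scatter)
    return min_4line(pair[0], pair[1], df3)

futures = client.map(compare_pair, pairs)
results = client.gather(futures)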
