The main goal is to compute customer similarity based on Euclidean distance and find the 5 most similar customers for each customer.
I have data for 400,000 customers, each with 40 attributes. The DataFrame looks like:
          A1 A2 ... A40
0         xx xx ... xx
1         xx xx ... xx
2         xx xx ... xx
...       ...
399,999   xx xx ... xx
I first standardize the data with sklearn's StandardScaler, which gives the processed data X_data.
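For reference, the preprocessing step is roughly this (a sketch; df stands for my raw DataFrame of the 40 attribute columns):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_data = scaler.fit_transform(df)  # 400,000 x 40 array of standardized attribute values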
So now we have 400,000 customers (points/vectors), each with 40 dimensions. So far so good.
I then use dis = numpy.linalg.norm(a - b) to calculate the distance between each pair of points. The shorter the distance, the more similar the two customers are.
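For two customers a and b (each a row of X_data with 40 values), this is the ordinary Euclidean distance:

dis = numpy.linalg.norm(a - b)  # same as numpy.sqrt(numpy.sum((a - b) ** 2))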
My plan was to calculate the 5 most similar customers for each customer and then combine the results. I started with customer 0 as a test, but it is already too slow for just this one customer. Even if I reduce the 40 dimensions to 2 with PCA from sklearn.decomposition, it is still too slow.
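The PCA step I tried is roughly this (a sketch, reusing X_data from above; X_reduced is just my name for the reduced data):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # reduce the 40 standardized attributes to 2 components
X_reduced = pca.fit_transform(X_data)

Either way, here is my test for customer 0: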
import numpy
import pandas as pd

result = pd.DataFrame(columns=['index1', 'index2', 'distance'])
for i in range(len(X_data)):
    # distance from customer 0 to customer i
    dis = numpy.linalg.norm(X_data[0] - X_data[i])
    result.loc[len(result)] = [0, i, dis]
result = result.sort_values(by=['distance'])
result = result[1:6]  # pick the 5 nearest customers, skipping the first row because that is customer 0 itself with distance 0
The result looks like this; it shows the 5 most similar customers of customer 0:
  index1 index2 distance
0   0    206391  0.004
1   0    314234  0.006
2   0    89284   0.007
3   0    124826  0.012
4   0    234513  0.013
So to get the result for all 400,000 customers, I could just wrap this loop in another for loop (a sketch of what I have in mind is below). But the problem is that it is already this slow when calculating the 5 most similar customers for customer 0 alone, let alone all the customers. What should I do to make it faster? Any ideas?
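For clarity, this is roughly the full version I have in mind (only a sketch; I have not been able to run it to completion, and it would mean 400,000 * 400,000 = 160 billion distance calculations):

import numpy
import pandas as pd

all_results = []
for j in range(len(X_data)):                      # outer loop: every customer
    result = pd.DataFrame(columns=['index1', 'index2', 'distance'])
    for i in range(len(X_data)):                  # inner loop: distance to every other customer
        dis = numpy.linalg.norm(X_data[j] - X_data[i])
        result.loc[len(result)] = [j, i, dis]
    result = result.sort_values(by=['distance'])
    all_results.append(result[1:6])               # keep the 5 nearest, skipping the customer itself
final = pd.concat(all_results)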