I've been struggling for some time to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution would be to parallelize the code. The output could also be stored in memory and written to a file afterwards.
I am new to both Python and parallelism, so I find it difficult to apply the concepts explained here and here. I also found this question, but I couldn't figure out how to implement the same thing for my situation. I am working on a Windows platform, using Python 3.4.
for i in range(0, len(unique_words)):
    max_similarity = 0
    max_similarity_word = ""
    for j in range(0, len(unique_words)):
        if not i == j:
            similarity = calculate_similarity(global_map[unique_words[i]], global_map[unique_words[j]])
            if similarity > max_similarity:
                max_similarity = similarity
                max_similarity_word = unique_words[j]
    file_co_occurring.write(
        unique_words[i] + "\t" + max_similarity_word + "\t" + str(max_similarity) + "\n")
If you need an explanation for the code:
- unique_words is a list of words (strings)
- global_map is a dictionary whose keys are words (global_map.keys() contains the same elements as unique_words) and whose values are dictionaries of the format {word: value}, where the words are a subset of the elements of unique_words (see the small example after this list)
- for each word, I look for the most similar word based on its value in global_map; I'd rather not store every similarity in memory since the maps already take up too much
- calculate_similarity returns a value from 0 to 1
- the result should contain the most similar word for each of the words in unique_words (the most similar word should be different from the word itself, which is why I added the condition if not i == j, though this could also be done by checking whether max_similarity is different from 1)
- if max_similarity for a word is 0, it's OK if the most similar word is the empty string
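To make the structure concrete, here is roughly what the data looks like (the words and numbers below are made up, my real maps are much larger):

unique_words = ["apple", "banana", "cherry"]

# keys are the words from unique_words; each inner dictionary maps
# other words to values (the numbers here are made up)
global_map = {
    "apple":  {"banana": 3, "cherry": 1},
    "banana": {"apple": 3},
    "cherry": {"apple": 1},
}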
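What I have in mind is roughly the sketch below, using multiprocessing.Pool: compute all the results in memory first, then write the file once at the end. The best_match helper and the co_occurring.txt filename are just placeholders I made up for this sketch, and I'm not sure whether the worker processes would actually see unique_words, global_map and calculate_similarity on Windows, which is the part I can't figure out:

import multiprocessing

def best_match(word):
    # find the most similar other word for a single word
    max_similarity = 0
    max_similarity_word = ""
    for other in unique_words:
        if other != word:
            similarity = calculate_similarity(global_map[word], global_map[other])
            if similarity > max_similarity:
                max_similarity = similarity
                max_similarity_word = other
    return word, max_similarity_word, max_similarity

if __name__ == "__main__":
    # compute everything in memory in parallel ...
    with multiprocessing.Pool() as pool:
        results = pool.map(best_match, unique_words)
    # ... then write the results to the file in one go
    with open("co_occurring.txt", "w") as file_co_occurring:
        for word, match, similarity in results:
            file_co_occurring.write(word + "\t" + match + "\t" + str(similarity) + "\n")

Is this the right direction, and how should the data be shared with the workers on Windows?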