I have a CSV file containing user ratings for about 56,124 items (columns) from about 3,000 users (rows). Each rating is an integer less than 128. I have this function:
import numpy as np
import pandas as pd
from scipy import sparse

def sparse_to_npz(file, npz):
  print("Reading " + file + " ...")
  data_items = pd.read_csv(file)
  # Create a new dataframe without the user ids.
  data_items = data_items.drop(columns='u')
  # As a first step we normalize the user vectors to unit vectors.
  # magnitude = sqrt(x^2 + y^2 + z^2 + ...)
  magnitude = np.sqrt(np.square(data_items).sum(axis=1))
  # unit vector = (x / magnitude, y / magnitude, z / magnitude, ...)
  data_items = data_items.divide(magnitude, axis='index')
  del magnitude
  print("Saving to " + npz)
  data_sparse = sparse.csr_matrix(data_items)
  del data_items
  sparse.save_npz(npz, data_sparse)
  #np.save("columns", data_items.columns.values)
The function is passed two file names: the input CSV file (sparse, containing every user's ratings for all items) and the output NPZ file, which should save memory. After the file is read with pandas and stored in data_items, the function calculates the magnitude of each row, divides data_items by it, and finally saves the NPZ file. The problem is that I get a memory error at the step that calculates the magnitude with np.sqrt(np.square(...)), on a machine with 12 GB of RAM. How can I make this work?
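
Would converting to a sparse matrix before the normalization, with a smaller dtype, be the right direction? A rough sketch of what I mean (assuming 'u' is the only non-rating column and that float32 precision is acceptable for the normalized values):

import numpy as np
import pandas as pd
from scipy import sparse

def sparse_to_npz(file, npz):
  print("Reading " + file + " ...")
  # Downcast right after reading: the ratings are small integers and float32
  # is enough precision for the normalization, so this shrinks the dense
  # frame compared with the default int64/float64 dtypes.
  data_items = pd.read_csv(file).drop(columns='u').astype(np.float32)
  # Convert to a sparse matrix *before* any arithmetic so the zero entries
  # cost no memory in the later steps.
  data_sparse = sparse.csr_matrix(data_items.values)
  del data_items
  # Row magnitudes computed on the sparse matrix: sqrt of the row-wise sum of squares.
  magnitude = np.sqrt(np.asarray(data_sparse.multiply(data_sparse).sum(axis=1)).ravel())
  magnitude[magnitude == 0] = 1.0  # avoid dividing empty rows by zero
  # Scale each row by 1 / magnitude via a sparse diagonal matrix.
  data_sparse = sparse.diags(1.0 / magnitude).dot(data_sparse).tocsr()
  print("Saving to " + npz)
  sparse.save_npz(npz, data_sparse)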
