I am writing a module to train an ML model on a large dataset: it contains 0.6M datapoints, each with 0.15M dimensions (all numpy arrays). I am running into a problem just loading the dataset itself.
Below is a code snippet (this replicates the main behaviour of the actual code):
import numpy
import psutil

FV_length = 150000
X_List = []
Y_List = []
for i in range(0, 600000):
    # one dense feature vector per datapoint
    # (dtype was numpy.int, which newer numpy no longer provides; int64 matches the old default)
    feature_vector = numpy.zeros((FV_length,), dtype=numpy.int64)
    # using db data, mark the features to be activated
    class_label = 0
    X_List.append(feature_vector)
    Y_List.append(class_label)
    if i % 100 == 0:
        print(i)
        print("Virtual mem %s" % psutil.virtual_memory().percent)
        print("CPU usage %s" % psutil.cpu_percent())
X_Data = numpy.asarray(X_List)
Y_Data = numpy.asarray(Y_List)
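For scale, here is a quick back-of-the-envelope check (assuming 8 bytes per int64 element, as in the snippet above) of what the dense matrix alone would need:

# rough dense memory footprint, assuming 8 bytes per int64 element
n_samples = 600000
n_features = 150000
bytes_needed = n_samples * n_features * 8
print("%.0f GB" % (bytes_needed / 1e9))  # prints 720 GB

So even with no per-object overhead, the full dense X_Data would need on the order of 720 GB of RAM, which is far more than the machine has.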
The code results in ever-increasing memory allocation until the process gets killed. Is there a way to reduce this memory growth?
I have tried calling gc.collect(), but it always returns 0. I have also explicitly set variables to None, with no effect.
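For reference, this is roughly how I tried to free the memory (a minimal sketch; the exact place in the real code differs):

import gc

# drop the references and force a collection pass
X_List = None
Y_List = None
collected = gc.collect()
print("gc.collect() returned %d" % collected)  # always prints 0 for me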