How to speed up a Caffe classifier in Python

Question

I am using Python to run a Caffe classifier. I grab an image from my camera and predict its class with a model trained on my training set. It works well, but it is very slow: I think I get only about 4 frames/second. Could you suggest some way to improve the computation time of my code? The problem is as follows. I have to load a network model, age_net.caffemodel, which is about 80 MB, with the following code:

age_net_pretrained='./age_net.caffemodel'
age_net_model_file='./deploy_age.prototxt'
age_net = caffe.Classifier(age_net_model_file, age_net_pretrained,
           mean=mean,
           channel_swap=(2,1,0),
           raw_scale=255,
           image_dims=(256, 256))

For each input image (caffe_input), I then call the predict function:

prediction = age_net.predict([caffe_input])

I think the network is so large that the predict function takes a long time on each image, and that this is where the slowdown comes from.
This is my full code for reference, as modified by me.

from conv_net import *

import matplotlib.pyplot as plt
import numpy as np
import cv2
import glob
import os
caffe_root = './caffe'
import sys
# Join the path components; plain string concatenation gave './caffepython'
sys.path.insert(0, os.path.join(caffe_root, 'python'))
import caffe
DATA_PATH = './face/'
cnn_params = './params/gender_5x5_5_5x5_10.param'
face_params = './params/haarcascade_frontalface_alt.xml'
def format_frame(frame):
    img = frame.astype(np.float32)/255.
    img = img[...,::-1]
    return img   

if __name__ == '__main__':    
    files = glob.glob(os.path.join(DATA_PATH, '*.*'))

    # This is the configuration of the full convolutional part of the CNN
    # `d` is a list of dicts, where each dict represents a convolution-maxpooling
    # layer. 
    # Eg c1 - first layer, convolution window size
    # p1 - first layer pooling window size
    # f_in1 - first layer no. of input feature arrays
    # f_out1 - first layer no. of output feature arrays
    d = [{'c1':(5,5),
          'p1':(2,2),
          'f_in1':1, 'f_out1':5},
         {'c2':(5,5),
          'p2':(2,2),
          'f_in2':5, 'f_out2':10}]

    # This is the configuration of the mlp part of the CNN
    # first tuple has the fan_in and fan_out of the input layer
    # of the mlp and so on.
    nnet =  [(800,256),(256,2)]    
    c = ConvNet(d,nnet, (45,45))
    c.load_params(cnn_params)        
    face_cascade = cv2.CascadeClassifier(face_params)
    cap = cv2.VideoCapture(0)
    cv2.namedWindow("Image", cv2.WINDOW_NORMAL)

    plt.rcParams['figure.figsize'] = (10, 10)
    plt.rcParams['image.interpolation'] = 'nearest'
    plt.rcParams['image.cmap'] = 'gray'
    mean_filename='./mean.binaryproto'
    proto_data = open(mean_filename, "rb").read()
    a = caffe.io.caffe_pb2.BlobProto.FromString(proto_data)
    mean  = caffe.io.blobproto_to_array(a)[0]
    age_net_pretrained='./age_net.caffemodel'
    age_net_model_file='./deploy_age.prototxt'
    age_net = caffe.Classifier(age_net_model_file, age_net_pretrained,
               mean=mean,
               channel_swap=(2,1,0),
               raw_scale=255,
               image_dims=(256, 256))
    age_list=['(0, 2)','(4, 6)','(8, 12)','(15, 20)','(25, 32)','(38, 43)','(48, 53)','(60, 100)']
    while(True):

        val, image = cap.read()        
        if image is None:
            break
        image = cv2.resize(image, (320,240))
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.3, 5, minSize=(30,30))

        for f in faces:
            x,y,w,h = f
            cv2.rectangle(image, (x,y), (x+w,y+h), (0,255,255))            
            face_image_rgb = image[y:y+h, x:x+w]
            caffe_input = cv2.resize(face_image_rgb, (256, 256)).astype(np.float32)
            prediction = age_net.predict([caffe_input]) 
            print 'predicted age:', age_list[prediction[0].argmax()]       
        cv2.imshow('Image', image)
        ch = 0xFF & cv2.waitKey(1)
        if ch == 27:
            break
        #break

Why not try to profile your code, so you can pinpoint where your bottlenecks are? – boardrider, Jun 13 '15 at 18:21
@user8430 it's always advisable to find out which part of a long piece of code takes up the most resources/time, unless you're making a glaring error. – a-Jays, Jun 14 '15 at 15:35
@a-Jays: At boardrider's request, I have posted my full code. I will summarize my problem in a short update later. – Jame, Jun 14 '15 at 15:37
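
Following up on the profiling suggestion in the comments, here is a minimal sketch of per-stage timing using only the standard library, assuming Python 3. The `StageTimer` class and the stage names are hypothetical helpers, not part of Caffe or OpenCV, and the `time.sleep` calls merely stand in for the real `detectMultiScale` and `predict` calls in the loop.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer(object):
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return (name, mean seconds per call) pairs, slowest first."""
        means = [(n, self.totals[n] / self.counts[n]) for n in self.totals]
        return sorted(means, key=lambda item: item[1], reverse=True)


timer = StageTimer()
for _ in range(3):  # stands in for the `while True` capture loop
    with timer.stage('detect'):
        time.sleep(0.002)  # placeholder for face_cascade.detectMultiScale(...)
    with timer.stage('predict'):
        time.sleep(0.05)   # placeholder for age_net.predict(...)

report = timer.report()
for name, mean_seconds in report:
    print('%-8s %.1f ms per call' % (name, mean_seconds * 1000))
```

Wrapping each stage of the real loop this way shows whether capture, face detection, or `predict` dominates the time per frame before any optimization is attempted.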

score 3 · Accepted Answer · answered Sep 07 '15 at 14:19


Try calling age_net.predict([caffe_input]) with oversample=False:

prediction = age_net.predict([caffe_input], oversample=False)

By default, predict creates 10 slightly different crops of the input image (the four corners and the center, plus their mirrored versions) and feeds all of them to the network. Disabling this option should give you a speedup of up to 10x in the predict call.
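
To make the cost of oversampling concrete, here is a NumPy-only sketch of the ten-crop scheme. It is an illustration written for this answer, not Caffe's own code; `ten_crops` is a hypothetical helper, and the 256x256 input and 227x227 crop sizes are assumptions chosen for the example.

```python
import numpy as np


def ten_crops(image, crop_dims):
    """Return the four corner crops, the center crop, and their mirrors.

    image: H x W x C array; crop_dims: (crop_height, crop_width).
    """
    im_h, im_w = image.shape[:2]
    crop_h, crop_w = crop_dims
    # Top-left corners of the four corner crops and the center crop.
    starts = [(0, 0), (0, im_w - crop_w), (im_h - crop_h, 0),
              (im_h - crop_h, im_w - crop_w),
              ((im_h - crop_h) // 2, (im_w - crop_w) // 2)]
    crops = [image[y:y + crop_h, x:x + crop_w] for y, x in starts]
    # Mirror each crop horizontally to get the other five.
    crops += [crop[:, ::-1] for crop in crops]
    return np.stack(crops)


# Hypothetical sizes: a 256x256 input and a 227x227 network crop.
frame = np.random.rand(256, 256, 3).astype(np.float32)
batch = ten_crops(frame, (227, 227))
center_only = batch[4:5]
print(batch.shape)        # (10, 227, 227, 3)
print(center_only.shape)  # (1, 227, 227, 3)
```

With oversampling the network processes a batch of ten images per face; with only the center crop it processes one, which is where the expected speedup comes from.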

Shai


@Shai I'm doing something similar to OP but having an issue. I'm running prediction based on images from the webcam, but doing the prediction in a separate python process. The problem that arises is the FPS of the program slows down even though the prediction is occurring on the GPU in a separate process. I've tried your `oversample` solution and while my predictions occur faster, the FPS increase is minimal. I've placed multiple bounties on my question here: http://stackoverflow.com/questions/39522693/python-real-time-image-classification-problems-with-neural-networks and while it has (cont) – user3543300 Oct 15 '16 at 00:52
received a lot of attention, there hasn't been a concrete answer. Have tried reaching out to caffe devs to explain this weirdness but no luck there either. Would love to hear your input on my similar situation. Thanks! – user3543300 Oct 15 '16 at 00:53
@user3543300 I'm afraid I have little to contribute when it comes to CPU GPU multiprocess performance. sorry. – Shai Oct 15 '16 at 16:32
@Shai I think this is more a caffe and opencv issue than it is a multiprocessing one. Any ideas on how to reduce the load on the CPU when running caffe in GPU mode? – user3543300 Oct 15 '16 at 17:35

score 0 · Answer 2 · answered Dec 14 '22 at 14:48

For all of you who still use Caffe, I'd recommend trying OpenVINO to decrease inference time. OpenVINO optimizes your model by converting it to Intermediate Representation (IR), pruning the graph, and fusing some operations into others while preserving accuracy. At runtime it then uses vectorization. OpenVINO is optimized for Intel hardware, but it should work with any CPU.

Some snippets are below.

Install OpenVINO

The easiest way to do it is with pip. Alternatively, you can use this tool to find the best way in your case.

pip install openvino-dev[caffe]

Use Model Optimizer to convert Caffe model

The Model Optimizer is a command-line tool that ships with the OpenVINO Development Package. It converts the Caffe model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change the --data_type value). Run in the command line:

mo --input_model "age_net.caffemodel" --data_type FP32 --source_layout "[n,c,h,w]" --target_layout "[n,h,w,c]" --output_dir "model_ir"
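
To get a rough feel for why FP16 usually costs little accuracy, this sketch compares a small softmax layer computed with FP32 weights and with the same weights rounded to FP16. The layer sizes and random weights are made up for illustration, and the code does not call OpenVINO; the real effect on a given model should be measured on that model.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical final layer: 512 features -> 8 age classes.
weights_fp32 = rng.normal(scale=0.05, size=(512, 8)).astype(np.float32)
features = rng.normal(size=(1, 512)).astype(np.float32)


def softmax(logits):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Round the weights to half precision, then compute in FP32 so that only
# the storage precision of the weights differs between the two runs.
weights_fp16 = weights_fp32.astype(np.float16).astype(np.float32)

probs_fp32 = softmax(features.dot(weights_fp32))
probs_fp16 = softmax(features.dot(weights_fp16))

max_abs_diff = float(np.abs(probs_fp32 - probs_fp16).max())
same_class = int(probs_fp32.argmax()) == int(probs_fp16.argmax())
print('max probability difference:', max_abs_diff)
print('same predicted class:', same_class)
```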

Run the inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency or throughput, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.

import cv2
import numpy as np
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/age_net.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"}) # alternatively THROUGHPUT or CUMULATIVE_THROUGHPUT

# Get input and output layers
input_layer_ir = compiled_model_ir.input(0)
output_layer_ir = compiled_model_ir.output(0)

# Resize and reshape the input image
# (input_image is assumed to be a frame already loaded as an H x W x C array)
height, width = list(input_layer_ir.shape)[1:3]
input_image = cv2.resize(input_image, (width, height))[np.newaxis, ...]

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
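
The result still has to be mapped to a label. A minimal sketch, assuming the converted network returns one row of 8 class scores in the same order as the `age_list` from the question; `age_label` is a hypothetical helper and `fake_result` is made up so that the snippet runs without a model.

```python
import numpy as np

# Age buckets copied from the question's code.
age_list = ['(0, 2)', '(4, 6)', '(8, 12)', '(15, 20)',
            '(25, 32)', '(38, 43)', '(48, 53)', '(60, 100)']


def age_label(result):
    """Return the age bucket with the highest score for a single image."""
    scores = np.asarray(result).reshape(-1)
    if scores.size != len(age_list):
        raise ValueError('expected %d scores, got %d'
                         % (len(age_list), scores.size))
    return age_list[int(scores.argmax())]


# Made-up scores standing in for the inference result of one face image.
fake_result = np.array([[0.01, 0.02, 0.05, 0.10, 0.60, 0.15, 0.05, 0.02]])
print('predicted age:', age_label(fake_result))
```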

Disclaimer: I work on OpenVINO.
