I am comparing two training runs of a tf.estimator.Estimator model fed by a tf.data.Dataset iterator. Training is handled by tf.estimator.train_and_evaluate().
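For context, my setup looks roughly like the sketch below. This is a simplified stand-in, not my real pipeline: the in-memory dummy data, the toy model_fn, and all sizes are placeholders.

```python
import numpy as np
import tensorflow as tf

def input_fn():
    # Dummy in-memory data standing in for my real dataset (placeholder).
    features = np.random.rand(1000, 32).astype(np.float32)
    labels = np.random.randint(0, 2, size=(1000,)).astype(np.int32)
    dataset = tf.data.Dataset.from_tensor_slices(({"x": features}, labels))
    dataset = dataset.shuffle(1000).repeat().batch(128)
    return dataset

def model_fn(features, labels, mode):
    # Toy model standing in for my real model_fn (placeholder).
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = None
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn)
train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=10)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```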
When I look at the trace of a single training step, I notice that the GPU training step is dominated by the IteratorGetNext op, which takes about 4.5 seconds. The same op takes only about 100 µs when training on CPUs only. See the following screenshots of the traces:
CPU training trace: [screenshot]
GPU training trace: [screenshot]
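In case it is relevant, the traces were captured with a profiler hook roughly like the sketch below (the save interval and output directory are placeholders); the hook writes Chrome-trace JSON files that I open in chrome://tracing.

```python
# Sketch of how the timeline traces above were captured (placeholders).
profiler_hook = tf.train.ProfilerHook(
    save_steps=100, output_dir="/tmp/tf_profiles", show_dataflow=True)

# The hook is attached to training via the TrainSpec hooks argument.
train_spec = tf.estimator.TrainSpec(
    input_fn=input_fn, max_steps=1000, hooks=[profiler_hook])
```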
What could be causing this, and how can I improve the speed of IteratorGetNext on the GPU?

