I am comparing two training runs of a tf.estimator.Estimator model fed by a tf.data.Dataset iterator. Training is handled by tf.estimator.train_and_evaluate().
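For context, my setup looks roughly like the sketch below. This is a simplified stand-in, not my real pipeline: the in-memory dummy data, the toy model_fn, and all sizes are placeholders.

```python
import numpy as np
import tensorflow as tf

def input_fn():
    # Dummy in-memory data standing in for my real dataset (placeholder).
    features = np.random.rand(1000, 32).astype(np.float32)
    labels = np.random.randint(0, 2, size=(1000,)).astype(np.int32)
    dataset = tf.data.Dataset.from_tensor_slices(({"x": features}, labels))
    dataset = dataset.shuffle(1000).repeat().batch(128)
    return dataset

def model_fn(features, labels, mode):
    # Toy model standing in for my real model_fn (placeholder).
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = None
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn)
train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=10)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```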
When I look at the trace of a single training step, I notice that the GPU training step is dominated by the IteratorGetNext op, which takes about 4.5 seconds. The same op takes only about 100 µs when training on CPUs only. See the following screenshots of the traces:
CPU training trace: [screenshot]
GPU training trace: [screenshot]
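In case it is relevant, the traces were captured with a profiler hook roughly like the sketch below (the save interval and output directory are placeholders); the hook writes Chrome-trace JSON files that I open in chrome://tracing.

```python
# Sketch of how the timeline traces above were captured (placeholders).
profiler_hook = tf.train.ProfilerHook(
    save_steps=100, output_dir="/tmp/tf_profiles", show_dataflow=True)

# The hook is attached to training via the TrainSpec hooks argument.
train_spec = tf.estimator.TrainSpec(
    input_fn=input_fn, max_steps=1000, hooks=[profiler_hook])
```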
What could be causing this, and how can I improve the speed of IteratorGetNext on the GPU?

