There are pre-made tools for that; look in the TensorFlow models repository.
Their approach in essence is:
- Parse the XML annotation files and flatten the data structure within them.
- Produce a TFRecord file that combines the annotations and the images; this is arguably the best way.
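
The flattening step can be sketched with the standard library alone. This is only an illustration: the XML layout below is a hypothetical Pascal VOC-style annotation, and `flatten_annotation` and the dict keys are names made up for this example, not something taken from the models repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical Pascal VOC-style annotation; your labeling tool may use a different layout.
SAMPLE_XML = """
<annotation>
  <filename>img_001.jpg</filename>
  <size><width>640</width><height>480</height></size>
  <object>
    <name>cat</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>300</xmin><ymin>50</ymin><xmax>600</xmax><ymax>400</ymax></bndbox>
  </object>
</annotation>
"""

def flatten_annotation(xml_text):
    """Turn the nested annotation into a flat dict of scalars and parallel lists."""
    root = ET.fromstring(xml_text.strip())
    flat = {
        "filename": root.findtext("filename"),
        "width": int(root.findtext("size/width")),
        "height": int(root.findtext("size/height")),
        "labels": [], "xmin": [], "ymin": [], "xmax": [], "ymax": [],
    }
    # One entry per <object>; index i in every list refers to the same box.
    for obj in root.iter("object"):
        flat["labels"].append(obj.findtext("name"))
        for key in ("xmin", "ymin", "xmax", "ymax"):
            flat[key].append(int(obj.findtext("bndbox/" + key)))
    return flat

flat = flatten_annotation(SAMPLE_XML)
```

Storing the boxes as parallel lists (all `xmin` values together, all `ymin` values together, and so on) is what makes the data fit into flat key/value pairs later.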
For the sake of training, you can implement your own converter that takes an (XML, image) pair and saves it as a TFRecord example.
TFRecord is TensorFlow's format for storing data. Every TFRecord file is basically a list of examples, and every example is an object that holds data in key: value pairs, where the key is a string and the value is an array of a primitive type (int, string, float).
So, first you flatten your XML annotation to match the constraints of the TFRecord format, then you use TensorFlow's TFRecordWriter to save the data to a file.
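
A minimal sketch of the writing step, assuming TensorFlow 2.x with eager execution. The feature key names, the `flat` dict, and the placeholder image bytes are all assumptions made for illustration; the converters in the models repository use their own key names and conventions.

```python
import os
import tempfile

import tensorflow as tf

def bytes_feature(values):
    # Wrap a list of byte strings in a Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def int64_feature(values):
    # Wrap a list of integers in a Feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def make_example(image_bytes, flat):
    """Build one tf.train.Example from raw image bytes and a flattened annotation."""
    feature = {
        # Key names are hypothetical; pick whatever your reader code expects.
        "image/encoded": bytes_feature([image_bytes]),
        "image/filename": bytes_feature([flat["filename"].encode("utf-8")]),
        "image/width": int64_feature([flat["width"]]),
        "image/height": int64_feature([flat["height"]]),
        "image/object/label": bytes_feature([s.encode("utf-8") for s in flat["labels"]]),
        "image/object/xmin": int64_feature(flat["xmin"]),
        "image/object/ymin": int64_feature(flat["ymin"]),
        "image/object/xmax": int64_feature(flat["xmax"]),
        "image/object/ymax": int64_feature(flat["ymax"]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Stand-in data: an already flattened annotation and fake image bytes.
flat = {
    "filename": "img_001.jpg", "width": 640, "height": 480,
    "labels": ["cat", "dog"],
    "xmin": [10, 300], "ymin": [20, 50], "xmax": [110, 600], "ymax": [220, 400],
}
image_bytes = b"placeholder bytes, not a real JPEG"

out_path = os.path.join(tempfile.mkdtemp(), "train.tfrecord")
with tf.io.TFRecordWriter(out_path) as writer:
    writer.write(make_example(image_bytes, flat).SerializeToString())

# Read the record back to check what was stored.
parsed = [tf.train.Example.FromString(raw.numpy())
          for raw in tf.data.TFRecordDataset(out_path)]
```

In a real converter you would read the image file in binary mode for `image_bytes` and call `writer.write` once per (XML, image) pair.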
Check the TensorFlow API documentation; it will pay off.