Meaning of buffer_size in Dataset.map, Dataset.prefetch and Dataset.shuffle

Question

As per the TensorFlow documentation, the prefetch and map methods of the tf.contrib.data.Dataset class both have a buffer-size parameter.

For the prefetch method, the parameter is known as buffer_size and, according to the documentation:

buffer_size: A tf.int64 scalar tf.Tensor, representing the maximum number elements that will be buffered when prefetching.

For the map method, the parameter is known as output_buffer_size and, according to the documentation:

output_buffer_size: (Optional.) A tf.int64 scalar tf.Tensor, representing the maximum number of processed elements that will be buffered.

Similarly, the shuffle method has a parameter called buffer_size and, according to the documentation:

buffer_size: A tf.int64 scalar tf.Tensor, representing the number of elements from this dataset from which the new dataset will sample.

What is the relation between these parameters?

Suppose I create a Dataset object as follows:

tr_data = TFRecordDataset(trainfilenames)
tr_data = tr_data.map(providefortraining, output_buffer_size=10 * trainbatchsize, num_parallel_calls=5)
tr_data = tr_data.shuffle(buffer_size=100 * trainbatchsize)
tr_data = tr_data.prefetch(buffer_size=10 * trainbatchsize)
tr_data = tr_data.batch(trainbatchsize)

What role do the buffer parameters play in the above snippet?

The link to "documentation" returns a 404 (not found). — Pradeep Singh, Mar 23 '20 at 06:53

score 186 · Accepted Answer · edited Nov 13 '17 at 08:59

TL;DR Despite their similar names, these arguments have quite different meanings. The buffer_size in Dataset.shuffle() can affect the randomness of your dataset, and hence the order in which elements are produced. The buffer_size in Dataset.prefetch() only affects the time it takes to produce the next element.

The buffer_size argument in tf.data.Dataset.prefetch() and the output_buffer_size argument in tf.contrib.data.Dataset.map() provide a way to tune the performance of your input pipeline: both arguments tell TensorFlow to create a buffer of at most buffer_size elements, and a background thread to fill that buffer. (Note that we removed the output_buffer_size argument from Dataset.map() when it moved from tf.contrib.data to tf.data. New code should use Dataset.prefetch() after map() to get the same behavior.)

Adding a prefetch buffer can improve performance by overlapping the preprocessing of data with downstream computation. Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
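To make the idea of a bounded buffer filled by a background thread concrete, here is a minimal pure-Python sketch. It is only a model of the concept, not TensorFlow's implementation; the function name `prefetch` and the `_DONE` sentinel are hypothetical, and it assumes buffer_size is at least 1.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the input


def prefetch(iterable, buffer_size):
    """Yield items from `iterable`, produced ahead of time by a background thread."""
    buf = queue.Queue(maxsize=buffer_size)  # holds at most buffer_size items

    def producer():
        for item in iterable:
            buf.put(item)  # blocks while the buffer is full
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()  # blocks while the buffer is empty
        if item is _DONE:
            return
        yield item


# The order of elements is unchanged; only the time at which they are produced differs.
print(list(prefetch(range(5), buffer_size=2)))  # [0, 1, 2, 3, 4]
```

While the consumer works on one element, the producer thread can already prepare the next ones, which is the overlap described above. A larger buffer only helps when production time varies enough for the buffer to drain.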

By contrast, the buffer_size argument to tf.data.Dataset.shuffle() affects the randomness of the transformation. We designed the Dataset.shuffle() transformation (like the tf.train.shuffle_batch() function that it replaces) to handle datasets that are too large to fit in memory. Instead of shuffling the entire dataset, it maintains a buffer of buffer_size elements, and randomly selects the next element from that buffer (replacing it with the next input element, if one is available). Changing the value of buffer_size affects how uniform the shuffling is: if buffer_size is greater than the number of elements in the dataset, you get a uniform shuffle; if it is 1 then you get no shuffling at all. For very large datasets, a typical "good enough" approach is to randomly shard the data into multiple files once before training, then shuffle the filenames uniformly, and then use a smaller shuffle buffer. However, the appropriate choice will depend on the exact nature of your training job.
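The buffer mechanism described above can be sketched in plain Python. This is a simplified model under the assumption that the buffer is filled first and each emitted element is replaced by the next input element; the name `buffered_shuffle` is hypothetical and this is not TensorFlow's actual code.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Simplified model of a shuffle buffer holding at most buffer_size elements."""
    rng = random.Random(seed)
    it = iter(iterable)
    buffer = []
    # Fill the buffer with the first buffer_size elements.
    for item in it:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break
    # Emit a random buffered element and refill the freed slot from the input.
    for item in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


data = list(range(10))
print(list(buffered_shuffle(data, buffer_size=1)))           # same order as the input
print(list(buffered_shuffle(data, buffer_size=3, seed=0)))   # only local reordering
print(list(buffered_shuffle(data, buffer_size=10, seed=0)))  # a full permutation
```

In this model, the element emitted at position i can only come from the first i + buffer_size input elements, which is why a small buffer reorders elements only locally.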

For this explanation, I still have some confusions w.r.t `tf.data.Dataset.shuffle()`. I would like to know the exact shuffling process. Say, the first `batch_size` samples are randomly chosen from the first `buffer_size` elements, and so on. — Bs He, Jul 10 '18 at 21:20
@mrry IIUC shuffling filenames is important because otherwise each epoch will see the same element in batches 0...999; and in batches 1000...1999; etc., where I assume 1 file = 1000 batches. Even with filename shuffling, there's still some non-randomness: that's because the examples from file #k are all close to each other in every epoch. That might be not too bad since file #k itself is random; still in some cases, even that could mess up the training. The only way to obtain perfect shuffle would be to set `buffer_size` to equal the file size (and shuffle the files of course). — max, Jan 04 '19 at 06:45
Tensorflow rc 15.0. With `dataset.shuffle(buffer_size=1)` shuffling still occurs. Any thoughts? — Sergey Bushmanov, Sep 29 '19 at 18:44
@SergeyBushmanov it may depend on the transformation before your shuffle, e.g. list_files(), which shuffles the filenames at the beginning of every epoch by default. — Xiaolong, Oct 11 '19 at 06:49

Olivier Moindrot · Answer 2 · 2018-04-06T14:27:56.240

Importance of `buffer_size` in `shuffle()`

I wanted to follow up on the previous answer from @mrry to stress the importance of buffer_size in tf.data.Dataset.shuffle().

Having a low buffer_size will not just give you inferior shuffling in some cases: it can mess up your whole training.

A practical example: cat classifier

Suppose for instance that you are training a cat classifier on images, and your data is organized in the following way (with 10000 images in each category):

train/
    cat/
        filename_00001.jpg
        filename_00002.jpg
        ...
    not_cat/
        filename_10001.jpg
        filename_10002.jpg
        ...

A standard way to input data with tf.data can be to have a list of filenames and a list of corresponding labels, and use tf.data.Dataset.from_tensor_slices() to create the dataset:

filenames = ["filename_00001.jpg", "filename_00002.jpg", ..., 
             "filename_10001.jpg", "filename_10002.jpg", ...]
labels = [1, 1, ..., 0, 0...]  # 1 for cat, 0 for not_cat

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=1000)  # 1000 should be enough right?
dataset = dataset.map(...)  # transform to images, preprocess, repeat, batch...

The big issue with the code above is that the dataset will not actually be shuffled in the right way. For about the first half of an epoch, we will only see cat images, and for the second half only non-cat images. This will hurt training a lot.
At the beginning of training, the dataset will take the first 1000 filenames and put them in its buffer, then pick one at random among them. Since the first 1000 images are all cat images, we will only pick cat images at the beginning.

The fix here is to make sure that buffer_size is at least 20000 (the total number of elements in the dataset), or to shuffle the filenames and labels in advance (applying the same permutation to both, obviously).

Since storing all the filenames and labels in memory is not an issue, we can actually use buffer_size = len(filenames) to make sure that everything will be shuffled together. Make sure to call tf.data.Dataset.shuffle() before applying the heavy transformations (like reading the images, processing them, batching...).

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=len(filenames)) 
dataset = dataset.map(...)  # transform to images, preprocess, repeat, batch...

The takeaway is to always double-check what the shuffling will do. A good way to catch these errors might be to plot the distribution of batches over time (make sure that batches contain about the same distribution as the training set: half cat and half non-cat in our example).
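As a rough way to check the batch distribution without running TensorFlow, the sketch below simulates the cat example with a simplified shuffle-buffer model and reports the fraction of cat labels per batch. The helper names `buffered_shuffle` and `cat_fraction_per_batch` and the batch size of 100 are assumptions made for illustration.

```python
import random


def buffered_shuffle(items, buffer_size, rng):
    """Simplified model of a shuffle buffer (not TensorFlow's actual code)."""
    it = iter(items)
    buffer = []
    for item in it:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break
    for item in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]   # emit a random buffered element
        buffer[idx] = item  # refill the slot with the next input element
    rng.shuffle(buffer)
    yield from buffer


def cat_fraction_per_batch(labels, buffer_size, batch_size, seed=0):
    """Return the fraction of label 1 (cat) in each consecutive batch."""
    shuffled = list(buffered_shuffle(labels, buffer_size, random.Random(seed)))
    batches = [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]
    return [sum(batch) / len(batch) for batch in batches]


labels = [1] * 10000 + [0] * 10000  # 1 for cat, 0 for not_cat, sorted by class

small = cat_fraction_per_batch(labels, buffer_size=1000, batch_size=100)
full = cat_fraction_per_batch(labels, buffer_size=len(labels), batch_size=100)
print("buffer_size=1000, first 3 batches:", small[:3])
print("buffer_size=20000, first 3 batches:", full[:3])
```

Under this model, a buffer of 1000 yields only cats for the first 90 batches, because no non-cat label can enter the buffer until 9000 elements have been emitted. With the buffer as large as the dataset, each batch is close to half cats.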

Then, how is the second sample chosen? Is it randomly chosen from the array `[filename_01001, ...filename_02000]`, or in some other way? Also, I don't understand why seeing cat images at the very beginning is problematic, and why the very first sampling is so important. — Bs He, Jul 10 '18 at 21:32
The next sample is always chosen from the buffer (of size 1000 here). So the first sample is taken from the first 1000 filenames. The buffer decreases to size 999, so it takes the next input (`filename_01001`) and adds it. The second sample is taken randomly from these 1000 filenames (1001 first filenames minus the first sample). — Olivier Moindrot, Jul 11 '18 at 09:07
The issue with this low buffer size is that you will only have cats in your first batches. So the model will trivially learn to predict only "cat". The best way to train the network is to have batches with the same amount of "cat" and "non cat". — Olivier Moindrot, Jul 11 '18 at 09:08
Does TensorFlow have a direct way of plotting the distribution of batches? — Elona Mishmika, Sep 11 '18 at 07:05
You could use `tf.summary.histogram` to plot the distribution of labels over time. — Olivier Moindrot, Sep 11 '18 at 08:07
@OlivierMoindrot, you mean greater than 2000 not 20000. I think it was a typo — LearnToGrow, Feb 21 '19 at 15:18
Not a typo :) The dataset has 10k images of each class so the total buffer size should be above 20k. But in the example above, I took a buffer size of 1k which is too low. — Olivier Moindrot, Feb 22 '19 at 09:04
I get the point that the buffer_size should be large enough that the first batch consists not only of the first class but then shouldn't we just set our buffer_size to something greater than the dataset size? — FlyingZipper, Jun 20 '19 at 22:30
Yes setting the buffer size to the dataset size is generally fine. Anything above the dataset size would be useless anyway (and unless you repeat your dataset before shuffling, the buffer could not be bigger than the dataset). — Olivier Moindrot, Jun 21 '19 at 08:12

score 7 · Answer 3 · answered Feb 08 '19 at 15:06

Code

import tensorflow as tf  # TF 1.x API (tf.Session, make_initializable_iterator)

def shuffle():
    ds = list(range(0, 1000))
    dataset = tf.data.Dataset.from_tensor_slices(ds)
    dataset = dataset.shuffle(buffer_size=500)
    dataset = dataset.batch(batch_size=1)
    iterator = dataset.make_initializable_iterator()
    next_element = iterator.get_next()
    init_op = iterator.initializer
    with tf.Session() as sess:
        sess.run(init_op)
        # Print the first 100 shuffled elements (each a batch of size 1).
        for i in range(100):
            print(sess.run(next_element), end='')

shuffle()

Output

[298][326][2][351][92][398][72][134][404][378][238][131][369][324][35][182][441][370][372][144][77][11][199][65][346][418][493][343][444][470][222][83][61][81][366][49][295][399][177][507][288][524][401][386][89][371][181][489][172][159][195][232][160][352][495][241][435][127][268][429][382][479][519][116][395][165][233][37][486][553][111][525][170][571][215][530][47][291][558][21][245][514][103][45][545][219][468][338][392][54][139][339][448][471][589][321][223][311][234][314]

This indicates that for every element yielded by the iterator, the buffer is being filled up with the respective next element of the dataset that wasn't in the buffer before. — Alex, Feb 18 '19 at 21:32

score 3 · Answer 4 · answered Jan 17 '19 at 22:27

I found that @olivier-moindrot is indeed correct, I tried the code provided by @Houtarou Oreki, using the modifications pointed by @max. The code I used was the following:

import numpy as np
import tensorflow as tf  # TF 1.x API

# Values 1..499 in order, followed by 500 zeros.
fake_data = np.concatenate((np.arange(1, 500, 1), np.zeros(500)))

dataset = tf.data.Dataset.from_tensor_slices(fake_data)
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(batch_size=10)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

init_op = iterator.initializer

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(50):
        print(i)
        salida = np.array(sess.run(next_element))
        print(salida)
        print(salida.max())

The code output was indeed a number ranging from 1 to (buffer_size+(i*batch_size)), where i is the number of times you ran next_element. I think the way it is working is the following. First, buffer_size samples are picked in order from the fake_data. Then one by one the batch_size samples are picked from the buffer. Each time a batch sample is picked from the buffer it is replaced by a new one, taken in order from fake_data. I tested this last thing using the following code:

aux = 0
for j in range (10000):
    with tf.Session() as sess:
        sess.run(init_op)
        salida = np.array(sess.run(next_element))
        if salida.max() > aux:
            aux = salida.max()

print(aux)

The maximum value produced by the code was 109. So you need to ensure a balanced sample within your batch_size to get uniform sampling during training.
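The value 109 agrees with a simple count, assuming the buffer is filled first and each drawn element is replaced by the next input element. The helper name `highest_reachable_position` below is hypothetical; it is a sketch of that arithmetic, not a TensorFlow function.

```python
def highest_reachable_position(buffer_size, batch_size, batch_index):
    """1-based position of the last input element that can appear in a batch.

    batch_index is 0-based. The k-th element drawn overall (k = 1, 2, ...) comes
    from a buffer that has only seen the first buffer_size + k - 1 inputs.
    """
    last_draw = (batch_index + 1) * batch_size  # draws made once this batch is complete
    return buffer_size + last_draw - 1


# With buffer_size=100 and batch_size=10, the first batch can reach position 109.
# In fake_data the element at position 109 has the value 109.
print(highest_reachable_position(buffer_size=100, batch_size=10, batch_index=0))  # 109
```

The same count gives 119 for the second batch, which is consistent with the observed range growing by batch_size each time next_element is run.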

I also tested what @mrry said about performance. I found that the batch_size will prefetch that amount of samples into memory. I tested this using the following code:

dataset = dataset.shuffle(buffer_size=20)
dataset = dataset.prefetch(10)
dataset = dataset.batch(batch_size=5)

Changing the dataset.prefetch(10) amount resulted in no change in memory (RAM) used. This is important when your data does not fit into RAM. I think the best way is to shuffle your data/file_names before feeding them to tf.data, and then control the buffer size using buffer_size.

score 1 · Answer 5 · answered Nov 07 '18 at 16:49

Actually the answer by @olivier-moindrot is not correct.

You can verify this by creating the filenames and labels as described in that answer and printing the shuffled values.

You will see that each shuffle pass generates a random sample from the dataset whose size equals the buffer size.

dataset = dataset.shuffle(buffer_size=1000)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for i in range(1000):
        print(sess.run(next_element))

Sergey Bushmanov · Answer 6 · 2021-08-06T08:58:05.480

The following code snippet demonstrates the effect of buffer_size in ds.shuffle:

t = tf.range(10)
ds = tf.data.Dataset.from_tensor_slices(t)
for batch in ds.shuffle(buffer_size=2, seed=42).batch(5):
  print(batch)

tf.Tensor([1 2 0 3 5], shape=(5,), dtype=int32)
tf.Tensor([4 6 7 8 9], shape=(5,), dtype=int32)

Shuffle is an "action" (for those familiar with Spark) that reads buffer_size elements into memory and shuffles them in memory. The shuffled data is then cut into batches according to the batch size. Note how 5 has made it into the first batch (and nothing else from the second half of the data).

This brings up all the questions touched on in the other answers: whether you have enough memory to shuffle the whole dataset in memory, or whether you should instead shuffle filenames, shuffle the data on disk, or do both.
