What are the exhaustion characteristics of RDRAND on Ivy Bridge?

Question

After reviewing the Intel Digital Random Number Generator (DRNG) Software Implementation Guide, I have a few questions about what happens to the internal state of the generator when RDRAND is invoked. Unfortunately the answers don't seem to be in the guide.

According to the guide, inside the DRNG there are four 128-bit buffers that serve random bits for RDRAND to drain. RDRAND itself will provide either 16, 32, or 64 bits of random data depending on the width of the destination register:
```
rdrand ax   ; put 16 random bits in ax
rdrand eax  ; put 32 random bits in eax
rdrand rax  ; put 64 random bits in rax
```
Will the use of larger destination registers empty those 128-bit buffers more quickly? For example, if I need only 2 bits of randomness, should I go through the trouble of using a 16-bit register instead of a 64-bit register? Will that make any difference to the throughput of the DRNG? I'd like to avoid consuming more randomness than is necessary.

The guide says the carry flag will be set after RDRAND executes:
```
CF = 1   Destination register valid. Non-zero random value
         available at time of execution. Result placed in register.
CF = 0   Destination register all zeros. Random value not available
         at time of execution. May be retried.
```
What does "not available" mean? Can random data be unavailable because RDRAND invocations exhausted those 128-bit buffers too quickly? Or does unavailable mean the DRNG is failing its health checks and cannot generate any new data? Basically, I'm trying to understand if CF=0 can occur just because the buffers happen to be (transiently) empty when RDRAND is invoked.

Note: I have reviewed the answers to this question on throughput and latency of RDRAND, but I'm seeking different information.

Thanks!

Note that [`rdrand` throughput is one per ~110 cycles on IvB, one per ~460 cycles on Skylake](http://agner.org/optimize/). It's a good idea to get 64 bits and chop it up if you have a use for multiple smaller random numbers at the same time, or to use `rdseed` to seed a faster PRNG if you need a lot of random numbers. `rdrand` is only ~16 uops, but it has high latency, and David's answer on the linked question indicates that it tends to stall the pipeline when you use the result right away. People only seem to be measuring RNG throughput, not how much impact it has on computations that use the numbers. – Peter Cordes, May 20 '16 at 19:20

score 19 · Accepted Answer · edited Jul 03 '13 at 18:47

Part 1. Does it make a difference pulling 16, 32 or 64 bits?

No.

On Ivy Bridge, the CPU cores pull 64 bits over the internal communication links to the DRNG, regardless of the size of the destination register. So if you read 32 bits, it pulls 64 bits and throws away the top half. If you read 16 bits, it pulls 64 and throws away the top 3/4.

This is not described in the instruction documentation because it may not continue to be true in future products. A chip might be designed that stashes and uses the unused parts of the 64-bit word. However, there isn't a significant performance imperative to do this today.

For the highest throughput, the most effective strategy is to pull from parallel threads. This is because there is parallelism in the bus hierarchy on chip. Most of the time for the instruction is transit time across the buses. Performing that transit in parallel is going to yield a linear increase in throughput with the number of threads, up to the maximum of 800 MBytes/s. The second thing is to use 64-bit RdRands, because they get more data per instruction.
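As a rough illustration of why the operand width matters for throughput but not for DRNG load, here is a back-of-the-envelope sketch. It assumes that the 800 MBytes/s ceiling counts the 8-byte transfers on the bus and that 1 MByte means 10^6 bytes; neither is stated above, and the function names are made up for this example.

```c
#include <stdint.h>

/* Assumptions (not from the answer): the 800 MBytes/s ceiling is measured
   in bus bytes, 1 MByte = 1,000,000 bytes, and every RDRAND moves 8 bytes
   over the bus regardless of the destination register width. */
enum { BUS_BYTES_PER_SEC = 800000000, BUS_BYTES_PER_RDRAND = 8 };

/* Upper bound on RDRAND instructions per second, summed over all threads. */
uint64_t max_rdrand_per_sec(void) {
    return BUS_BYTES_PER_SEC / BUS_BYTES_PER_RDRAND;
}

/* Upper bound on bytes that actually reach software per second for a
   given operand width in bits (16, 32 or 64). */
uint64_t max_useful_bytes_per_sec(unsigned operand_bits) {
    return max_rdrand_per_sec() * (operand_bits / 8);
}
```

Under those assumptions the ceiling is 100 million instructions per second whatever the width, so 16-bit reads would deliver at most a quarter of the data that 64-bit reads do.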

Part 2. What does CF=0 mean really?

It means 'random data not available'. This is because the details of why it can't get a number are not available to the CPU core without it going off and reading more registers, which it isn't going to do because there is nothing it can do with the information.

If you sucked the output buffer of the DRNG dry, you would get an underflow (CF=0) but you could expect the next RdRand to succeed, because the DRNG is fast.

If the DRNG failed (e.g. a transistor popped in the entropy source and it no longer was random) then the online health tests would detect this and shut down the DRNG. Then all your RdRand invocations would yield CF=0.

However, on Ivy Bridge you will not be able to underflow the buffer. The DRNG is a little faster than the bus to which it is attached. The effect of pulling more data per unit time (with parallel threads) will be to increase the execution time of each individual RdRand, as contention on the bus causes the instructions to wait in line at the DRNG's local bus. You can never pull so fast that the DRNG will underflow. You will asymptotically reach 800 MBytes/s.

This also is not described in the documentation because it may not continue to be true in future products. We can envisage products where the buses are faster and the cores faster and the DRNG would be able to be underflowed. These things are not known yet, so we can't make claims about them.

What will remain true is that the basic loop (try up to 10 times, then report a failure up the stack) given in the Software Implementation Guide will continue to work in future products, because we've made the claim that it will and so we will engineer all future products to meet this.
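A minimal sketch of that loop, written against a callback so it can be exercised without RDRAND hardware. The callback type, the function names and the `flaky_source` stand-in are all hypothetical; this is not the guide's own listing. On real hardware the callback could wrap `_rdrand64_step()` from `<immintrin.h>`, which returns 1 when CF=1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: one RDRAND attempt. Returns true and stores a
   value when CF=1, returns false when CF=0. */
typedef bool (*rdrand_try_fn)(uint64_t *out, void *ctx);

enum { RDRAND_RETRIES = 10 };   /* retry limit quoted in the answer */

/* Try up to RDRAND_RETRIES times; on persistent failure zero the output
   and return false so the caller can report the error up the stack. */
bool rdrand64_retry(rdrand_try_fn try_once, void *ctx, uint64_t *out) {
    for (int i = 0; i < RDRAND_RETRIES; i++) {
        if (try_once(out, ctx)) {
            return true;
        }
    }
    *out = 0;
    return false;
}

/* Stand-in source for experiments: reports CF=0 a fixed number of times,
   then succeeds with a constant (obviously not random) value. */
struct flaky_source { unsigned failures_left; unsigned calls; };

bool flaky_try(uint64_t *out, void *ctx) {
    struct flaky_source *s = ctx;
    s->calls++;
    if (s->failures_left > 0) {
        s->failures_left--;
        return false;
    }
    *out = UINT64_C(0x0123456789abcdef);
    return true;
}
```

A transient underflow would show up as a few failed attempts followed by success; a DRNG that has been shut down by its health tests would exhaust all ten attempts, and the caller should treat that as a hard error.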

So no, CF=0 cannot occur because "the buffers happen to be (transiently) empty when RDRAND is invoked" on Ivy Bridge, but it might occur on future silicon, so design your software to cope.

score 4 · Answer 2 · edited Aug 08 '13 at 18:01

Don't read anything into the 4*128-bit FIFO in the DRNG output. It is certainly there (I put it there) but it isn't something that has a software-visible effect. The logic behind the DRNG doesn't produce data smoothly. It sometimes schedules other things, like reseeding or conditioning, as per the SP800-90 spec. So the flow of data under load is irregular.

The buffer length of 4 was chosen because at 800 MBytes/s (the speed of the locally attached bus) 4 is deep enough to prevent underflow when pulling at the maximum rate, given the worst-case scheduling excursion, so there is a constant, smooth 800 MBytes/s supply with no interruption in the output.
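To put a number on that, the following sketch computes how long a full FIFO would last at the maximum drain rate if nothing refilled it. It assumes 1 MByte = 10^6 bytes, and the names are made up for illustration; the actual worst-case scheduling excursion is not given here.

```c
enum {
    FIFO_ENTRIES = 4,            /* buffer depth from the answer */
    FIFO_ENTRY_BITS = 128,       /* width of each entry */
    DRAIN_MBYTES_PER_SEC = 800   /* assumption: 1 MByte = 10^6 bytes */
};

/* Total FIFO capacity in bytes. */
unsigned fifo_bytes(void) {
    return FIFO_ENTRIES * FIFO_ENTRY_BITS / 8;
}

/* Nanoseconds needed to empty a full FIFO at the maximum drain rate with
   no refill. bytes / (MBytes/s) gives microseconds, so scale by 1000. */
unsigned fifo_drain_ns(void) {
    return fifo_bytes() * 1000u / DRAIN_MBYTES_PER_SEC;
}
```

That works out to 64 bytes and 80 ns, so on this simplified model a pause in DRNG output would have to last longer than about 80 ns before a fully loaded bus could drain the buffer.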

If the attached bus were slower, the buffer would be shorter because a shorter buffer would be sufficient to prevent underflow.

score 2 · Answer 3 · answered Jan 20 '13 at 07:47

Regarding 2: see section 7.3.17 of http://download.intel.com/products/processor/manual/253665.pdf

CF=0 indicates that the demand for random data exceeded the throughput of the DRNG.

Regarding 1:

If it is performance you are concerned about, why not read a 64-bit random value from the DRNG? You can then take 2 bits from it 32 times before you need to execute the instruction again. You don't have to invoke a new rdrand every time you need two bits.
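A minimal sketch of such a bit pool, assuming a refill callback that returns one 64-bit random value per call (for example a wrapper around `rdrand rax` with a retry loop). The `bit_pool` type, the callback signature and the `counting_source` stand-in are hypothetical names for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical refill callback: stores 64 fresh random bits and returns
   true, or returns false if no random data could be obtained. */
typedef bool (*rand64_fn)(uint64_t *out, void *ctx);

struct bit_pool {
    uint64_t bits;      /* unused random bits, consumed from the low end */
    unsigned avail;     /* number of unused bits left in `bits` */
    rand64_fn refill;
    void *ctx;
};

/* Take n bits (1..32) from the pool. When fewer than n bits remain, the
   leftovers are discarded and the pool is refilled with 64 new bits.
   Returns false only if the refill fails. */
bool bit_pool_take(struct bit_pool *p, unsigned n, uint32_t *out) {
    assert(n >= 1 && n <= 32);
    if (p->avail < n) {
        if (!p->refill(&p->bits, p->ctx)) {
            return false;
        }
        p->avail = 64;
    }
    *out = (uint32_t)(p->bits & ((UINT64_C(1) << n) - 1));
    p->bits >>= n;
    p->avail -= n;
    return true;
}

/* Stand-in source that counts refills and returns a fixed pattern
   (not random) so the behaviour is easy to inspect. */
struct counting_source { unsigned calls; };

bool counting_fill(uint64_t *out, void *ctx) {
    struct counting_source *s = ctx;
    s->calls++;
    *out = UINT64_C(0xE4E4E4E4E4E4E4E4);
    return true;
}
```

With 2-bit draws, one refill serves 32 calls. The pool is ordinary per-caller state and is not thread-safe, so keeping one pool per thread is the simplest way to avoid locking.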


Vlad Krasnov


Thanks for the link! As for fetching a large result and chopping it up as needed, that would require maintaining my own state somewhere, which is complex and requires synchronization of some kind. I would like to instead rely solely on the DRNG's hardware-managed state while not over-consuming random bits. – cambecc Jan 21 '13 at 08:26
