The fact is that std::linear_congruential_engine<UIntType,a,c,m>::discard(unsigned long long z) can definitely be implemented very efficiently (and quite trivially, actually). It is essentially equivalent to raising a to the power z modulo m (for both zero and non-zero c), which means a quite simple software implementation executes in O(log(z % φ(m))) UIntType multiplication + modulo reduction operations (φ(m) = m-1 when m is prime, and less than m in the general case).
^^^NOTE that in a way O(log(z % φ(m))) is O(1), because log2(z % φ(m)) < log2(m) <= sizeof(UIntType)*CHAR_BIT - though in practice it behaves more like O(log(z)).
^^^ALSO NOTE that after generating some template-parameter-dependent pre-computed tables (either lazy or compile-time computation may be employed, whichever is suitable) the exponentiation may be reduced to just a few multiplication + modulo reduction operations (like 4 or 8) - i.e. O(1) in every practical sense.
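For illustration, here is a minimal sketch of such a skip-ahead - the classic "arbitrary stride" square-and-multiply that folds the additive constant c into the same loop that computes a^z mod m, so it covers non-zero c as well. The name lcg_skip and the UInt alias are mine, not anything from the standard library, and the sketch assumes the intermediate products fit into UInt (true, e.g., for std::minstd_rand parameters with 64-bit intermediates); a production version would need a double-width or otherwise overflow-safe modular multiplication.

#include <cstdint>

using UInt = std::uint64_t;

// Advance an LCG  x <- (a*x + c) % m  by z steps using O(log z) multiplications.
// Assumes m > 1 and that products like cur_a * cur_a do not overflow UInt.
UInt lcg_skip(UInt x, UInt a, UInt c, UInt m, unsigned long long z)
{
    UInt A = 1, C = 0;                       // accumulated transformation: x -> A*x + C
    UInt cur_a = a % m, cur_c = c % m;       // current power-of-two step: x -> cur_a*x + cur_c
    for (; z != 0; z >>= 1)
    {
        if (z & 1ULL)                        // fold this power-of-two step into the accumulator
        {
            C = (cur_a * C + cur_c) % m;
            A = (cur_a * A) % m;
        }
        cur_c = ((cur_a + 1) * cur_c) % m;   // square the current step: compose it with itself
        cur_a = (cur_a * cur_a) % m;
    }
    return (A * x + C) % m;                  // apply the accumulated z-step transformation
}

(A quick sanity check is to compare lcg_skip against z plain iterations of x = (a*x + c) % m for small z.)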
Also, there are probably efficient enough algorithms for most other engines' discard functions that would satisfy an O(P(sizeof(state))*log(z)) constraint (P(x) being some low-degree polynomial, most likely of degree 1+o(1) or at most 3+o(1), given that log(z) < sizeof(unsigned long long)*CHAR_BIT may be considered a constant).
DESPITE ALL THIS:
Somehow, for some unknown reason, the C++ standard (as of ISO/IEC 14882:2017) does not require discard to be implemented any more efficiently than just z operator()() calls - for any PRNG engine, including those that definitely allow it.
To me personally this is baffling and just MAKES NO SENSE - it violates (in a very brutal way) one of the fundamental C++ design principles: add to the standard only functionality that is reasonable in terms of performance and practical usefulness.
A notable and well-documented example: there is no such thing as access to a std::list element by index (std::list<T>::operator[](size_type n)), even though it would be as "easy" as calling operator++() n times on a begin() iterator. And naturally so - because O(n) execution time would make this function an unreasonable choice in any practical application (AKA a plain stupid idea). For this obvious reason a[n] and a.at(n) are not part of the mandatory sequence container requirements (ISO/IEC 14882:2017 26.2.3 Table 87) but instead are part of the optional sequence container operations (ISO/IEC 14882:2017 26.2.3 Table 88).
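(For comparison, if you really do want indexed access into a std::list, the standard makes you spell out the linear walk explicitly, e.g. with std::next - so the O(n) cost at least stays visible at the call site. A trivial illustration with a made-up list:)

#include <iterator>
#include <list>

std::list<int> lst = {10, 20, 30, 40};
int third = *std::next(lst.begin(), 2);   // explicit O(n) walk to the element at index 2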
SO why in the world is e.discard(z) part of the mandatory random number engine requirements (ISO/IEC 14882:2017 29.6.1.4 Table 104), with this ridiculous complexity requirement - no worse than the complexity of z consecutive calls of e() - instead of being an entry in some optional operations section with an adequate complexity requirement like O(P(sizeof(state))*log(z))?
Like WHAT the...? z consecutive calls of e() is exponential complexity (exponential in the bit length of the argument z) - since when is that OK?
Even more baffling was to actually find this real-world implementation in my GCC:
void discard(unsigned long long __z)
{
    for (; __z != 0ULL; --__z)
        (*this)();   // <-- Wait, what? Are you kidding me?
}
So once again, as before, we have no choice but to implement the necessary functionality ourselves... Not much help from the C++ standard library.
AMENDMENT:
Diving deeper into the design details of the Mersenne Twister, we discover that discard(z) of std::mersenne_twister_engine can also be implemented quite efficiently.
For
template<
    class UIntType,
    std::size_t w, std::size_t n, std::size_t m, std::size_t r, UIntType a,
    std::size_t u, UIntType d, std::size_t s, UIntType b, std::size_t t, UIntType c,
    std::size_t l, UIntType f
> class mersenne_twister_engine;
even a generic implementation of discard(z) (applicable to the whole class of linear PRNGs modulo 2 - not just the Mersenne Twister but also WELL and many others) would have complexity like O(n^ω*log(z)), where n is the template parameter above (the size of the state in w-bit words) and the exponent ω is a constant between 2 and 3 (depending on the chosen bit-matrix multiplication algorithm). That complexity may be trivially reduced, with a reasonable amount of template-parameter-dependent pre-computation, to O(n^ω) practical execution time. SIMD CPU instructions (vector XOR or vector AND) would improve practical performance by a constant factor. Parallel algorithms (e.g. specialized hardware solutions) could compute this in O(log(n)) time using O(n^3) simultaneous single-bit computations (XORs and ANDs).
Sure, you may notice that the n parameter above is typically not that small (624 for std::mt19937, 312 for std::mt19937_64) and n cubed is even bigger - so O(n^3) is not necessarily fast in actual execution. But a modern CPU (and especially a GPU) would still execute an optimized implementation quite fast - no comparison to the ridiculous exponential complexity of z consecutive calls of operator()().
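To make the idea concrete, here is a rough, untuned sketch of the core building block - bit-matrix exponentiation over GF(2) by square-and-multiply. The names (F2Matrix, matpow) are mine, not from any library. Given the pre-computed one-step transition matrix of an F2-linear engine, matpow(M, z) multiplied by the state bit vector jumps z steps ahead; a serious implementation would use a better matrix multiplication algorithm, SIMD and word-packing tricks.

#include <bitset>
#include <cstddef>
#include <vector>

// N x N bit matrix over GF(2); rows[i] is the i-th row stored as a bitset.
template<std::size_t N>
struct F2Matrix
{
    std::vector<std::bitset<N>> rows;

    F2Matrix() : rows(N) {}                          // zero matrix

    static F2Matrix identity()
    {
        F2Matrix I;
        for (std::size_t i = 0; i < N; ++i)
            I.rows[i][i] = true;
        return I;
    }

    // Matrix product over GF(2): XOR together the rows of rhs selected by the bits of our rows.
    F2Matrix operator*(const F2Matrix& rhs) const
    {
        F2Matrix out;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                if (rows[i][j])
                    out.rows[i] ^= rhs.rows[j];
        return out;
    }

    // Matrix-vector product over GF(2): applies the transformation to a state bit vector.
    std::bitset<N> operator*(const std::bitset<N>& v) const
    {
        std::bitset<N> out;
        for (std::size_t i = 0; i < N; ++i)
            out[i] = ((rows[i] & v).count() & 1u) != 0;   // parity of AND = dot product mod 2
        return out;
    }
};

// M^z with O(log z) GF(2) matrix multiplications (square-and-multiply).
template<std::size_t N>
F2Matrix<N> matpow(F2Matrix<N> base, unsigned long long z)
{
    F2Matrix<N> result = F2Matrix<N>::identity();
    for (; z != 0; z >>= 1)
    {
        if (z & 1ULL)
            result = result * base;
        base = base * base;
    }
    return result;
}

For std::mt19937, N would be 624*32 = 19968 stored bits (of which only 19937 are significant), so a single GF(2) matrix multiplication is far from free - but it is done O(log(z)) times (or once, with matrices pre-computed for fixed strides), not z times.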
SOME GENERAL OBSERVATIONS:
Each existing PRNG (and each one I can imagine) can be defined with the following iterative equations:
x[n+1] = T(x[n])
Output[n] = F(x[n])
where x[n] is the state (some W-bit sequence) after n iterations (so x[0] is the initial state, AKA the seed), T(x) is the state transition function (which converts the current state to the next state) and F(x) is the output transformation, which converts each W-bit state to the v-bit output sequence Output[n] - typically v < W.
Sure, both T(x) and F(x) are fast to compute - i.e. the worst-case time is polynomial, O(P(W)). (Typically those functions are designed to be even faster - like O(P(v)), which is essentially O(1) in most cases, because v is usually chosen to be the CPU register size, for which fast hardware-optimized operations are available.)
I mean literally - all existing (and future) PRNGs that make sense can be expressed in this way.
(The only further generalization I can think of is making the sizes W and v non-constant - i.e. dependent on n, changing from one iteration to the next. There is probably not much practical sense in this - I guess no one wants their PRNG's internal data to grow without bound and eventually consume all RAM or something like that - even though a very slowly growing W would allow non-periodic PRNG designs.)
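As a trivial concrete instance of this scheme (a toy sketch; the constants are those of std::minstd_rand0), an LCG with c = 0 fits the form above with W = v and F being the identity:

#include <cstdint>

using State = std::uint_fast32_t;

// x[n+1] = T(x[n]) - the state transition (one multiplicative LCG step)
State T(State x) { return static_cast<State>((16807ULL * x) % 2147483647ULL); }

// Output[n] = F(x[n]) - the output transformation (here just the state itself)
State F(State x) { return x; }

Roughly the same decomposition holds for std::mt19937: x is the 624-word state, T is the twist/recurrence step and F is the tempering.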
So the question is: what property of a PRNG makes the discard(z) operation fast - i.e. gives it polynomial, O(P(W)), worst-case running time?
IMO the quite obvious answer is: IF we can quickly compute the function T^z(x) = T(T(...T(x)...)) (T applied z times) for any z, THEN we can implement a fast discard(z).
Also it is not very hard to notice the following. Suppose T(x) = T_p(x) is a parametrized transformation with some fixed internal parameter p - one of a class of transformations with varying parameter values - such that for any allowed parameter value q the transformation T_q(x) can be computed fast, in O(P(W)) time. Suppose also that for any allowed parameter values p and q the composition T_p(T_q(x)) is again in this class for some allowed parameter r - i.e. T_p(T_q(x)) = T_r(x) - and that r can be computed fast from p and q. Say we define the notation r = p*q, where * is some binary operation computable fast (in polynomial time at most), so that T_{p*q}(x) = T_p(T_q(x)) by definition. (Notice that the binary operation * is not necessarily commutative - i.e. p*q need not be the same value as q*p. But this operation is associative by design, because T_p(T_q(T_r(x))) = T_p(T_{q*r}(x)) = T_{p*q}(T_r(x)), hence p*(q*r) = (p*q)*r.)
^^^This structure of the T(x) transformation obviously allows fast computation of T^z(x): if T(x) = T_p(x) and the parameter p is known, we just compute q = p^z = p*p*...*p (p combined with itself z times, which takes only O(log(z)) evaluations of * via square-and-multiply and can be optimized further by pre-computation and/or parallel execution) and then compute T_q(x).
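As a sketch, this whole "compute p^z" step is just generic square-and-multiply over the * operation. Assuming a hypothetical parameter type P that provides the parameter of the identity transformation and a combine(p, q) function implementing p*q (all names are mine, purely for illustration), it could look like this:

// Computes p^z = p*p*...*p (z factors) with O(log z) calls to combine().
// Powers of the same p commute with each other, so the accumulation order is safe.
template<class P>
P transform_power(P p, unsigned long long z)
{
    P result = P::identity();            // parameter of the identity transformation T_e(x) = x
    for (; z != 0; z >>= 1)
    {
        if (z & 1ULL)
            result = combine(result, p); // result = result * p
        p = combine(p, p);               // p = p * p
    }
    return result;
}

A fast discard(z) is then just: compute q = transform_power(p, z) once and apply T_q to the current state.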
^^^While this many preconditions may seem like a very special case, in fact all of this is quite common. E.g. for the class of linear transformations modulo 2 (like the Mersenne Twister, WELL, etc.) the iterative transformation can be represented as multiplication of a constant bit matrix by the state bit vector in modulo-2 arithmetic, so exponentiation of that constant matrix (in modulo-2 bitwise arithmetic) does the trick. With std::linear_congruential_engine it is even simpler - do the math yourself as an easy exercise. Elliptic-curve based PRNGs also meet these conditions. (Actually I'm wondering why anyone would design a PRNG without this very useful property - but that's just me.)