The fact is that std::linear_congruential_engine<UIntType,a,c,m>::discard(unsigned long long z) can definitely be implemented very efficiently (and quite trivially, actually). It is essentially equivalent to raising a to the power z modulo m (for both zero and non-zero c), which means a quite simple software implementation executes in O(log(z % φ(m))) UIntType multiplication + modulo reduction operations (φ(m) = m-1 when m is prime, and less than m in the general case).
^^^NOTE that in a way O(log(z % φ(m))) is O(1), because log2(z % φ(m)) < log2(m) <= sizeof(UIntType)*CHAR_BIT - though in practice it behaves more like O(log(z)).
^^^ALSO NOTE that after generating some template-parameter-dependent pre-computed tables (either lazy or compile-time computation may be employed, whichever is suitable) the exponentiation may be reduced to just a few multiplication + modulo reduction operations (like 4 or 8) - i.e. O(1) in every practical sense.
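For illustration, here is a minimal sketch of such a skip-ahead - the classic "arbitrary stride" square-and-multiply that folds the additive constant c into the same loop that computes a^z mod m, so it covers non-zero c as well. The name lcg_skip and the UInt alias are mine, not anything from the standard library, and the sketch assumes the intermediate products fit into UInt (true, e.g., for std::minstd_rand parameters with 64-bit intermediates); a production version would need a double-width or otherwise overflow-safe modular multiplication.

#include <cstdint>

using UInt = std::uint64_t;

// Advance an LCG  x <- (a*x + c) % m  by z steps using O(log z) multiplications.
// Assumes m > 1 and that products like cur_a * cur_a do not overflow UInt.
UInt lcg_skip(UInt x, UInt a, UInt c, UInt m, unsigned long long z)
{
    UInt A = 1, C = 0;                       // accumulated transformation: x -> A*x + C
    UInt cur_a = a % m, cur_c = c % m;       // current power-of-two step: x -> cur_a*x + cur_c
    for (; z != 0; z >>= 1)
    {
        if (z & 1ULL)                        // fold this power-of-two step into the accumulator
        {
            C = (cur_a * C + cur_c) % m;
            A = (cur_a * A) % m;
        }
        cur_c = ((cur_a + 1) * cur_c) % m;   // square the current step: compose it with itself
        cur_a = (cur_a * cur_a) % m;
    }
    return (A * x + C) % m;                  // apply the accumulated z-step transformation
}

(A quick sanity check is to compare lcg_skip against z plain iterations of x = (a*x + c) % m for small z.)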
Also, there are probably efficient enough algorithms for most other engines' discard functions that would satisfy an O(P(sizeof(state))*log(z)) constraint (P(x) being some low-degree polynomial, most likely of degree 1+o(1) or at most 3+o(1), given that log(z) < sizeof(unsigned long long)*CHAR_BIT may be considered a constant).
DESPITE ALL THIS:
Somehow, for some unknown reason, the C++ standard (as of ISO/IEC 14882:2017) does not require discard to be implemented any more efficiently than just z operator()() calls - for any PRNG engine, including those that definitely allow it.
To me personally this is baffling and just MAKES NO SENSE - it violates (in a very brutal way) one of the fundamental C++ design principles: add to the standard only functionality that is reasonable in terms of performance and practical usefulness.
A notable and well-documented example: there is no such thing as access to a std::list element by index (std::list<T>::operator[](size_type n)), even though it would be as "easy" as calling operator++() n times on a begin() iterator. And naturally so - because O(n) execution time would make this function an unreasonable choice in any practical application (AKA a plain stupid idea). For this obvious reason a[n] and a.at(n) are not part of the mandatory sequence container requirements (ISO/IEC 14882:2017 26.2.3 Table 87) but instead are part of the optional sequence container operations (ISO/IEC 14882:2017 26.2.3 Table 88).
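(For comparison, if you really do want indexed access into a std::list, the standard makes you spell out the linear walk explicitly, e.g. with std::next - so the O(n) cost at least stays visible at the call site. A trivial illustration with a made-up list:)

#include <iterator>
#include <list>

std::list<int> lst = {10, 20, 30, 40};
int third = *std::next(lst.begin(), 2);   // explicit O(n) walk to the element at index 2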
SO why in the world is e.discard(z) part of the mandatory random number engine requirements (ISO/IEC 14882:2017 29.6.1.4 Table 104), with this ridiculous complexity requirement - no worse than the complexity of z consecutive calls of e() - instead of being an entry in some optional operations section with an adequate complexity requirement like O(P(sizeof(state))*log(z))?
Like WHAT the...? z consecutive calls of e() is exponential complexity (exponential in the bit length of the argument z) - since when is that OK?
Even more baffling was to actually find this real-world implementation in my GCC:
void discard(unsigned long long __z)
{
    for (; __z != 0ULL; --__z)
        (*this)();   // <-- Wait, what? Are you kidding me?
}
So once again, as before, we have no choice but to implement the necessary functionality ourselves... Not much help from the C++ standard library.
AMENDMENT:
Diving deeper into the design details of the Mersenne Twister, we discover that discard(z) of std::mersenne_twister_engine can also be implemented quite efficiently.
For
template<
    class UIntType,
    std::size_t w, std::size_t n, std::size_t m, std::size_t r, UIntType a,
    std::size_t u, UIntType d, std::size_t s, UIntType b, std::size_t t, UIntType c,
    std::size_t l, UIntType f
> class mersenne_twister_engine;
even a generic implementation of discard(z) (applicable to the whole class of linear PRNGs modulo 2 - not just the Mersenne Twister but also WELL and many others) would have complexity like O(n^ω*log(z)), where n is the template parameter above (the size of the state in w-bit words) and the exponent ω is a constant between 2 and 3 (depending on the chosen bit-matrix multiplication algorithm). That complexity may be trivially reduced, with a reasonable amount of template-parameter-dependent pre-computation, to O(n^ω) practical execution time. SIMD CPU instructions (vector XOR or vector AND) would improve practical performance by a constant factor. Parallel algorithms (e.g. specialized hardware solutions) could compute this in O(log(n)) time using O(n^3) simultaneous single-bit computations (XORs and ANDs).
Sure, you may notice that the n parameter above is typically not that small (624 for std::mt19937, 312 for std::mt19937_64) and n cubed is even bigger - so O(n^3) is not necessarily fast in actual execution. But a modern CPU (and especially a GPU) would still execute an optimized implementation quite fast - no comparison to the ridiculous exponential complexity of z consecutive calls of operator()().
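To make the idea concrete, here is a rough, untuned sketch of the core building block - bit-matrix exponentiation over GF(2) by square-and-multiply. The names (F2Matrix, matpow) are mine, not from any library. Given the pre-computed one-step transition matrix of an F2-linear engine, matpow(M, z) multiplied by the state bit vector jumps z steps ahead; a serious implementation would use a better matrix multiplication algorithm, SIMD and word-packing tricks.

#include <bitset>
#include <cstddef>
#include <vector>

// N x N bit matrix over GF(2); rows[i] is the i-th row stored as a bitset.
template<std::size_t N>
struct F2Matrix
{
    std::vector<std::bitset<N>> rows;

    F2Matrix() : rows(N) {}                          // zero matrix

    static F2Matrix identity()
    {
        F2Matrix I;
        for (std::size_t i = 0; i < N; ++i)
            I.rows[i][i] = true;
        return I;
    }

    // Matrix product over GF(2): XOR together the rows of rhs selected by the bits of our rows.
    F2Matrix operator*(const F2Matrix& rhs) const
    {
        F2Matrix out;
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                if (rows[i][j])
                    out.rows[i] ^= rhs.rows[j];
        return out;
    }

    // Matrix-vector product over GF(2): applies the transformation to a state bit vector.
    std::bitset<N> operator*(const std::bitset<N>& v) const
    {
        std::bitset<N> out;
        for (std::size_t i = 0; i < N; ++i)
            out[i] = ((rows[i] & v).count() & 1u) != 0;   // parity of AND = dot product mod 2
        return out;
    }
};

// M^z with O(log z) GF(2) matrix multiplications (square-and-multiply).
template<std::size_t N>
F2Matrix<N> matpow(F2Matrix<N> base, unsigned long long z)
{
    F2Matrix<N> result = F2Matrix<N>::identity();
    for (; z != 0; z >>= 1)
    {
        if (z & 1ULL)
            result = result * base;
        base = base * base;
    }
    return result;
}

For std::mt19937, N would be 624*32 = 19968 stored bits (of which only 19937 are significant), so a single GF(2) matrix multiplication is far from free - but it is done O(log(z)) times (or once, with matrices pre-computed for fixed strides), not z times.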
SOME GENERAL OBSERVATIONS:
Each existing PRNG (and each one I can imagine) can be defined with the following iterative equations:
x[n+1] = T(x[n])
Output[n] = F(x[n])
where x[n] is the state (some W-bit sequence) after n iterations (so x[0] is the initial state, AKA the seed), T(x) is the state transition function (which converts the current state to the next state) and F(x) is the output transformation, which converts each W-bit state to the v-bit output sequence Output[n] - typically v < W.
Sure, both T(x) and F(x) are fast to compute - i.e. the worst-case time is polynomial, O(P(W)). (Typically those functions are designed to be even faster - like O(P(v)), which is essentially O(1) in most cases, because v is usually chosen to be the CPU register size, for which fast hardware-optimized operations are available.)
I mean literally - all existing (and future) PRNGs that make sense can be expressed in this way.
(The only further generalization I can think of is making the sizes W and v non-constant - i.e. dependent on n, changing from one iteration to the next. There is probably not much practical sense in this - I guess no one wants their PRNG's internal data to grow without bound and eventually consume all RAM or something like that - even though a very slowly growing W would allow non-periodic PRNG designs.)
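As a trivial concrete instance of this scheme (a toy sketch; the constants are those of std::minstd_rand0), an LCG with c = 0 fits the form above with W = v and F being the identity:

#include <cstdint>

using State = std::uint_fast32_t;

// x[n+1] = T(x[n]) - the state transition (one multiplicative LCG step)
State T(State x) { return static_cast<State>((16807ULL * x) % 2147483647ULL); }

// Output[n] = F(x[n]) - the output transformation (here just the state itself)
State F(State x) { return x; }

Roughly the same decomposition holds for std::mt19937: x is the 624-word state, T is the twist/recurrence step and F is the tempering.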
So the question is: what property of a PRNG makes the discard(z) operation fast - i.e. gives it polynomial, O(P(W)), worst-case running time?
IMO the quite obvious answer is: IF we can quickly compute the function T^z(x) = T(T(...T(x)...)) (T applied z times) for any z, THEN we can implement a fast discard(z).
Also it is not very hard to notice the following. Suppose T(x) = T_p(x) is a parametrized transformation with some fixed internal parameter p - one of a class of transformations with varying parameter values - such that for any allowed parameter value q the transformation T_q(x) can be computed fast, in O(P(W)) time. Suppose also that for any allowed parameter values p and q the composition T_p(T_q(x)) is again in this class for some allowed parameter r - i.e. T_p(T_q(x)) = T_r(x) - and that r can be computed fast from p and q. Say we define the notation r = p*q, where * is some binary operation computable fast (in polynomial time at most), so that T_{p*q}(x) = T_p(T_q(x)) by definition. (Notice that the binary operation * is not necessarily commutative - i.e. p*q need not be the same value as q*p. But this operation is associative by design, because T_p(T_q(T_r(x))) = T_p(T_{q*r}(x)) = T_{p*q}(T_r(x)), hence p*(q*r) = (p*q)*r.)
^^^This structure of the T(x) transformation obviously allows fast computation of T^z(x): if T(x) = T_p(x) and the parameter p is known, we just compute q = p^z = p*p*...*p (p combined with itself z times, which takes only O(log(z)) evaluations of * via square-and-multiply and can be optimized further by pre-computation and/or parallel execution) and then compute T_q(x).
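As a sketch, this whole "compute p^z" step is just generic square-and-multiply over the * operation. Assuming a hypothetical parameter type P that provides the parameter of the identity transformation and a combine(p, q) function implementing p*q (all names are mine, purely for illustration), it could look like this:

// Computes p^z = p*p*...*p (z factors) with O(log z) calls to combine().
// Powers of the same p commute with each other, so the accumulation order is safe.
template<class P>
P transform_power(P p, unsigned long long z)
{
    P result = P::identity();            // parameter of the identity transformation T_e(x) = x
    for (; z != 0; z >>= 1)
    {
        if (z & 1ULL)
            result = combine(result, p); // result = result * p
        p = combine(p, p);               // p = p * p
    }
    return result;
}

A fast discard(z) is then just: compute q = transform_power(p, z) once and apply T_q to the current state.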
^^^While this many preconditions may seem like a very special case, in fact all of this is quite common. E.g. for the class of linear transformations modulo 2 (like the Mersenne Twister, WELL, etc.) the iterative transformation can be represented as multiplication of a constant bit matrix by the state bit vector in modulo-2 arithmetic, so exponentiation of that constant matrix (in modulo-2 bitwise arithmetic) does the trick. With std::linear_congruential_engine it is even simpler - do the math yourself as an easy exercise. Elliptic-curve based PRNGs also meet these conditions. (Actually I'm wondering why anyone would design a PRNG without this very useful property - but that's just me.)