If our set of keys are integers then this means that the prime number
needs to be of the next bigger data type e.g. long.
That is not a problem. Sometimes it is necessary otherwise the hash family cannot be universal. See below for more information.
But eventually whatever we get as the hash would need to be
down-casted to an int to index the hash table.
Doesn't this down-casting affect the quality of the Universal Hashing
(I am referring to the distribution of the keys over the buckets)
somehow?
The answer is no. I will try to explain.
Whether p has another data type or not is not important for the hash family to be universal. Important is that p is equal or larger than u (the maximum integer of the universe of integers). It is important that p is big enough (i.e. >= u).
A hash family is universal when the collision probability is equal or
smaller than 1/m.
So the idea is to hold that constraint.
The value of p, in theory, can be as big as a long or more. It just needs to be an integer and prime.
u is the size of the domain/universe (or the number of keys). Given the universe U = {0, ..., u-1}, u denotes the size |U|.
m is the number of bins or buckets
p is a prime which must be equal or greater than n
- the hash family is defined as
H = {h(a,b)(x)} with h(a,b)(x) = ((a * x + b) mod p) mod m. Note that a and b are randomly chosen integers (from all possible integers, so theoretically can be larger than p) modulo a prime p (which can make them either smaller or larger than m, the number of bins/buckets); but here too the data type (domain of values does not matter). See Hashing integers on Wikipedia for notation.
- Follow the proof on Wikipedia and you conclude that the collision probability is
_p/m_ * 1/(p-1) (the underscores mean to truncate the decimals). For p >> m (p considerably bigger than m) the probability tends to 1/m (but this does not mean that the probability would be better the larger p is).
In other terms answering your question: p being a bigger data type is not a problem here and can be even required. p has to be equal or greater than u and a and b have to be randomly chosen integers modulo p, no matter the number of buckets m. With these constraints you can construct a universal hash family.
Maybe a mathematical example could help
- Let U be the universe of integers that correspond to
unsigned char (in C for example). Then U = {0, ..., 255}
Let p be (next possible) prime equal or greater than 256. Note that p can be any of these types (short, int, long be it signed or unsigned). The point is that the data type does not play a role (In programming the type mainly denotes a domain of values.). Whether 257 is short, int or long doesn't really matter here for the sake of correctness of the mathematical proof. Also we could have chosen a larger p (i.e. a bigger data type); this does not change the proof's correctness.
- The next possible prime number would be
257.
- We say we have
25 buckets, i.e. m = 25. This means a hash family would be universal if the collision probability is equal or less than 1/25, i.e. approximately 0.04.
- Put in the values for
_p/m_ * 1/(p-1): _257/25_ * 1/256 = 10/256 = 0.0390625 which is smaller than 0.04. It is a universal hash family with the chosen parameters.
We could have chosen m = u = 256 buckets. Then we would have a collision probability of 0.003891050584, which is smaller than 1/256 = 0,00390625. Hash family is still universal.
Let's try with m being bigger than p, e.g. m = 300. Collision probability is 0, which is smaller than 1/300 ~= 0.003333333333. Trivial, we had more buckets than keys. Still universal, no collisions.
Implementation detail example
We have the following: