Given an array of n word-frequency pairs:
[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]
where wi is a word, fi is an integer frequencey, and the sum of the frequencies ∑fi = m,
I want to use a pseudo-random number generator (pRNG) to select p words wj0, wj1, ..., wjp-1 such that
the probability of selecting any word is proportional to its frequency:
P(wi = wjk) = P(i = jk) = fi / m
(Note, this is selection with replacement, so the same word could be chosen every time).
I've come up with three algorithms so far:
- Create an array of size - m, and populate it so the first- f0entries are- w0, the next- f1entries are- w1, and so on, so the last- fp-1entries are- wp-1.- [ w0, ..., w0, w1,..., w1, ..., wp-1, ..., wp-1 ] Then use the pRNG to select- pindices in the range- 0...m-1, and report the words stored at those indices.
 This takes- O(n + m + p)work, which isn't great, since- mcan be much much larger than n.
- Step through the input array once, computing - mi = ∑h≤ifh = mi-1 + fi and after computing- mi, use the pRNG to generate a number- xkin the range- 0...mi-1for each- kin- 0...p-1and select- wifor- wjk(possibly replacing the current value of- wjk) if- xk < fi.
 This requires- O(n + np)work.
- Compute mias in algorithm 2, and generate the following array on n word-frequency-partial-sum triples:[ (w0, f0, m0), (w1, f1, m1), ..., (wn-1, fn-1, mn-1) ] and then, for each k in0...p-1, use the pRNG to generate a numberxkin the range0...m-1then do binary search on the array of triples to find theis.t.mi-fi ≤ xk < mi, and selectwiforwjk.
 This requiresO(n + p log n)work.
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
 
     
     
     
     
    
...blocks (for inline) or blocks (for fullline). – rampion May 16 '09 at 15:34