Efficient algorithm to randomly select items with frequency

Question

Given an array of n word-frequency pairs:

[ (w₀, f₀), (w₁, f₁), ..., (w_n-1, f_n-1) ]

where w_i is a word, f_i is an integer frequency, and the sum of the frequencies ∑f_i = m,

I want to use a pseudo-random number generator (pRNG) to select p words w_j₀, w_j₁, ..., w_{j_p-1} such that the probability of selecting any word is proportional to its frequency:

P(w_i = w_{j_k}) = P(i = j_k) = f_i / m

(Note, this is selection with replacement, so the same word could be chosen every time).

I've come up with three algorithms so far:

1. Create an array of size m, and populate it so the first f₀ entries are w₀, the next f₁ entries are w₁, and so on, so the last f_n-1 entries are w_n-1.
```
[ w₀, ..., w₀, w₁,..., w₁, ..., w_n-1, ..., w_n-1 ]
```
Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices.
This takes O(n + m + p) work, which isn't great, since m can be much much larger than n.

2. Step through the input array once, computing
```
m_i = ∑_{h≤i} f_h = m_{i-1} + f_i
```
and after computing each m_i, use the pRNG to generate a number x_k in the range 0...m_i-1 for each k in 0...p-1, and select w_i for w_{j_k} (possibly replacing the current value of w_{j_k}) if x_k < f_i.
This requires O(n + np) work.

3. Compute m_i as in algorithm 2, and generate the following array of n word-frequency-partial-sum triples:
```
[ (w₀, f₀, m₀), (w₁, f₁, m₁), ..., (w_n-1, f_n-1, m_n-1) ]
```
and then, for each k in 0...p-1, use the pRNG to generate a number x_k in the range 0...m-1, then do a binary search on the array of triples to find the i s.t. m_i-f_i ≤ x_k < m_i, and select w_i for w_{j_k}.
This requires O(n + p log n) work.
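
The third algorithm could be sketched in Ruby roughly as follows. This is only an illustration: the method name `sample_by_prefix_sums` and the optional `rng` parameter are made up for this example, and it assumes the total frequency m is positive.

```ruby
# Sketch of algorithm 3: partial sums plus binary search.
# `sample_by_prefix_sums` and `rng` are hypothetical names; assumes m > 0.
def sample_by_prefix_sums(pairs, p, rng = Random.new)
  # cumulative[i] holds m_i, the sum of the frequencies f_0..f_i
  running = 0
  cumulative = pairs.map { |_word, freq| running += freq }
  m = running

  Array.new(p) do
    x = rng.rand(m) # uniform integer in 0...m
    # first index i with x < m_i, which implies m_i - f_i <= x < m_i
    i = cumulative.bsearch_index { |m_i| m_i > x }
    pairs[i][0]
  end
end
```

For example, `sample_by_prefix_sums([["Hero", 80], ["Monkey", 4]], 5)` returns an array of five words. Building the partial sums is O(n) and each lookup is O(log n), matching the O(n + p log n) bound above.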

My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?

this is OT, and please don't kill me for this, but how did you get sub/super scripts, and the sum equation signs? — dassouki, May 16 '09 at 15:25
Just use _... inside ... blocks (for inline) or
...
blocks (for fullline). — rampion, May 16 '09 at 15:34
And for the sum sign, just use ∑ (see http://www.w3.org/TR/WD-entities-961125 for more html entities for math sigils) — rampion, May 16 '09 at 15:36
BTW when performance is irrelevant here's copy and paste code to save you typing http://stackoverflow.com/a/33991225/294884 — Fattie, Nov 30 '15 at 04:01
note that algo 1 is of course spectacularly more efficient, assuming you do not count the time to assemble the array to begin with (ie, if you do that only once at development time). — Fattie, Nov 30 '15 at 04:03

score 6 · Answer 1 · edited May 23 '17 at 12:07


This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.

Look at Roulette Selection in Genetic Algorithms
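
A minimal sketch of one roulette wheel spin, assuming the simple linear scan over the word-frequency pairs. The names `roulette_select` and `rng` are hypothetical, and the total frequency is assumed to be positive.

```ruby
# One roulette wheel spin by linear scan (hypothetical helper; assumes m > 0).
def roulette_select(pairs, rng = Random.new)
  m = pairs.sum { |_word, freq| freq }
  x = rng.rand(m) # spin: uniform integer in 0...m
  pairs.each do |word, freq|
    return word if x < freq # x landed inside this word's slice
    x -= freq               # otherwise skip past this slice
  end
end
```

Calling it p times, e.g. `Array.new(p) { roulette_select(pairs) }`, costs O(n) per selection, so it is no faster than the algorithms in the question.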

answered May 16 '09 at 15:06 by seb · edited May 23 '17 at 12:07 by Community

Yeah, this is exactly what algorithm is required. You're not going to get quicker than O(n) complexity for sure. – Noldorin May 16 '09 at 15:21
Ok. They're just using iterative search, which requires O(n log m) to select each, and a total work of O(n log m + pn log m), just like my algorithm 2. Thanks! – rampion May 16 '09 at 15:44
with binary search it's O(n + p * log n). Why do you have *m* there? It doesn't affect the algorithm complexity. – Karoly Horvath Jan 16 '15 at 13:04

score 2 · Answer 2 · answered May 16 '09 at 15:54


You could create the target array, then loop through the words, determining for each word the probability that it should be picked, and replace entries in the array according to a random number.

For the first word the probability would be f₀/m₀ (where m_n=f₀+..+f_n), i.e. 100%, so all positions in the target array would be filled with w₀.

For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words according to the frequency.
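
One way to check that this replacement scheme gives the right distribution, assuming each step uses an independent random draw and writing m_i = f₀ + ... + f_i as above: a slot ends up holding w_i exactly when w_i is written at step i and no later word overwrites it. Since m_h - f_h = m_{h-1}, the product telescopes:

```latex
\begin{aligned}
P(\text{slot ends as } w_i)
  &= \frac{f_i}{m_i} \prod_{h=i+1}^{n-1} \left(1 - \frac{f_h}{m_h}\right) \\
  &= \frac{f_i}{m_i} \prod_{h=i+1}^{n-1} \frac{m_{h-1}}{m_h} \\
  &= \frac{f_i}{m_i} \cdot \frac{m_i}{m_{n-1}} = \frac{f_i}{m}
\end{aligned}
```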

Example code in C#:

```csharp
using System;

public class WordFrequency {

    public string Word { get; private set; }
    public int Frequency { get; private set; }

    public WordFrequency(string word, int frequency) {
        Word = word;
        Frequency = frequency;
    }

}

public static class Program {

    public static void Main() {
        WordFrequency[] words = new WordFrequency[] {
            new WordFrequency("Hero", 80),
            new WordFrequency("Monkey", 4),
            new WordFrequency("Shoe", 13),
            new WordFrequency("Highway", 3),
        };

        int p = 7;                        // number of words to select
        string[] result = new string[p];
        int sum = 0;                      // running total of the frequencies seen so far
        Random rnd = new Random();
        foreach (WordFrequency wf in words) {
            sum += wf.Frequency;
            for (int i = 0; i < p; i++) {
                // overwrite slot i with the current word with probability Frequency / sum
                if (rnd.Next(sum) < wf.Frequency) {
                    result[i] = wf.Word;
                }
            }
        }

        Console.WriteLine(string.Join(", ", result));
    }

}
```

answered May 16 '09 at 15:54 by Guffa

Right. This is exactly algorithm 2. – rampion May 16 '09 at 16:34
Is that what you meant? I was thrown off by the O() calculation. The frequency values are irrelevant for how much work there is, so the m has no business in the O() value. It should simply be O(np). – Guffa May 16 '09 at 18:03
No, the frequency values matter - it takes O(log m) bits to store a frequency, and O(log m) work to add two frequencies or compare two. Usually this is just swallowed by a constant term when log m < 64 (you store it in a 64 bit int), but for larger numbers, it can matter. – rampion May 16 '09 at 18:54
If you want that kind of complexity, then you have to consider the data size for every operation... Looping through the pairs is not an O(n) operation, but an O(n log n) operation... Creating an array is not an O(p) operation, but an O(p log p) operation... – Guffa May 16 '09 at 19:21
good point. I'll adjust my complexity descriptions accordingly. – rampion May 16 '09 at 21:54

score 0 · Accepted Answer · edited May 23 '17 at 11:48

Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:

- There are n partitions, all of the same width r s.t. nr = m.
- Each partition contains at most two words in some ratio (which is stored with the partition).
- For each word w_i, f_i = ∑_{partitions t s.t. w_i ∈ t} r × ratio(t, w_i).

Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.

The reason that such a partitioning exists is that there exists a word w_i s.t. f_i < r, if and only if there exists a word w_i' s.t. f_i' > r, since r is the average of the frequencies.

Given such a pair w_i and w_i' we can replace them with a pseudo-word w'_i of frequency f'_i = r (that represents w_i with probability f_i/r and w_i' with probability 1 - f_i/r) and a new word w'_i' of adjusted frequency f'_i' = f_i' - (r - f_i) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words which are the desired partition.
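
As a small worked example of that pairing rule, the sketch below uses the word list from the C# example (m = 100, n = 4, so r = 25), with a partition built by hand. It then checks that each word's frequency can be recovered from the partition. The helper name `recover_frequencies` and the `[word, other_word, ratio]` layout are assumptions made for this illustration.

```ruby
# Hypothetical helper: recompute each word's frequency from an alias partition.
# Each partition is [word, other_word, ratio], where ratio is the share of the
# partition's width r that belongs to word (other_word is nil if there is none).
def recover_frequencies(partitions, r)
  recovered = Hash.new(0)
  partitions.each do |word, other_word, ratio|
    recovered[word] += r * ratio
    recovered[other_word] += r * (1 - ratio) if other_word
  end
  recovered
end

freqs = { "Hero" => 80, "Monkey" => 4, "Shoe" => 13, "Highway" => 3 }
r = Rational(freqs.values.sum, freqs.size) # 100 / 4 = 25

# Partition built by hand, pairing each small word with "Hero":
partitions = [
  ["Monkey",  "Hero", Rational(4, 25)],  # Hero supplies 21, leaving Hero with 59
  ["Shoe",    "Hero", Rational(13, 25)], # Hero supplies 12, leaving Hero with 47
  ["Highway", "Hero", Rational(3, 25)],  # Hero supplies 22, leaving Hero with 25
  ["Hero",    nil,    Rational(1)],      # Hero's remaining 25 fills a whole partition
]

recovered = recover_frequencies(partitions, r)
puts freqs.all? { |word, freq| recovered[word] == freq } # prints true
```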

To construct this partition in O(n) time:

1. Go through the list of the words once, constructing two lists:
   - one of words with frequency ≤ r
   - one of words with frequency > r
2. Pull a word from the first list:
   - if its frequency = r, then make it into a one-element partition
   - otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.
3. Repeat step 2 until n partitions have been built.

This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'_i = nf_i, which updates m' = mn and sets r' = m when q = n.
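
To spell out the padding step for q = n, with a made-up three-word input whose average frequency is not an integer:

```latex
\begin{aligned}
f'_i &= n f_i, \qquad m' = \sum_i f'_i = n m, \qquad r' = \frac{m'}{n} = m \\
(f_0, f_1, f_2) &= (1, 2, 4): \quad n = 3,\; m = 7,\; r = \tfrac{7}{3} \\
(f'_0, f'_1, f'_2) &= (3, 6, 12): \quad m' = 21,\; r' = 7 = m
\end{aligned}
```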

In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.

In ruby:

```ruby
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum,(word,freq)| sum + freq }

  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word,freq|
                        # pad the frequency so we can keep it integral
                        # when subdivided
                        [ word, freq*n ]
                      end.partition do |word,adj_freq|
                        adj_freq <= m
                      end

  partitions = Array.new(n) do
    word, adj_freq = lessers.shift

    other_word = if adj_freq < m
                   # use part of another word's frequency to pad
                   # out the partition
                   other_word, other_adj_freq = greaters.shift
                   other_adj_freq -= (m - adj_freq)
                   (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
                   other_word
                 end

    [ word, other_word , adj_freq ]
  end

  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[ rand(n) ]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
```
