gensim Doc2Vec vs tensorflow Doc2Vec

Question

I'm trying to compare my implementation of Doc2Vec (via tf) with gensim's implementation. At least visually, the gensim vectors seem to perform better.

I ran the following code to train the gensim model, and the code below it for the TensorFlow model. My questions are as follows:

Is my tf implementation of Doc2Vec correct? Basically, is it supposed to concatenate the word vectors and the document vector to predict the middle word in a given context?
Does the window=5 parameter in gensim mean that I am using two words on either side to predict the middle one, or is it 5 on either side? The thing is, quite a few documents are shorter than 10 words.
Any insights as to why gensim is performing better? Is my model any different from how they implement it?
Considering that this is effectively a matrix factorisation problem, why is the TF model even getting an answer? There are infinitely many solutions, since it's a rank-deficient problem. <- This last question is simply a bonus.

Gensim

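import multiprocessing
from gensim.models import Doc2Vec

cores = multiprocessing.cpu_count()  # assumed definition; not shown in the original post
# corpus is an iterable of tagged documents, defined elsewhere (not shown)
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10, hs=0, min_count=2, workers=cores)
model.build_vocab(corpus)
epochs = 100
for i in range(epochs):
    model.train(corpus)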

TF

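import numpy as np
import tensorflow as tf

# context_window, vocabulary_size and len_docs are defined elsewhere (not shown)
batch_size = 512
embedding_size = 100 # Dimension of the embedding vector.
num_sampled = 10 # Number of negative examples to sample.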


graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size // context_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // context_window, 1])

    # The variables   
    word_embeddings =  tf.Variable(tf.random_uniform([vocabulary_size,embedding_size],-1.0,1.0))
    doc_embeddings = tf.Variable(tf.random_uniform([len_docs,embedding_size],-1.0,1.0))
    softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, (context_window+1)*embedding_size],
                             stddev=1.0 / np.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    ###########################
    # Model.
    ###########################
    # Look up embeddings for inputs and stack words side by side
    embed_words = tf.reshape(tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
                            shape=[int(batch_size/context_window),-1])
    embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
    embed = tf.concat(1,[embed_words, embed_docs])
    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, embed,
                                   train_labels, num_sampled, vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

Update:

Check out the jupyter notebook here (I have both models working and tested in here). It still feels like the gensim model is performing better in this initial analysis.

A proper discussion on this can be found here: https://groups.google.com/forum/#!topic/gensim/0GVxA055yOU — sachinruk, Oct 11 '16 at 04:54
According to the documentation, "window is the maximum distance between the predicted word and context words used for prediction within a document". So it's 5 words on either side. Also, can you tell me what `negative` or `num_sampled` means? I couldn't quite get it. — Clock Slave, Mar 14 '17 at 10:53
The negative sampling approach is described in one of the Mikolov [papers](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). AFAIR it reduces the number of parameters that are updated in each learning step. — patrick, Mar 14 '17 at 17:53
Note that the `dm_concat` mode results in much-larger, slower-to-train models that probably require a lot more data (or training-passes) than the more-commonly-used PV-DBOW or PV-DM-with-context-window-averaging. I initially added `dm_concat` mode to gensim, to try to closely reproduce the 'Paragraph Vector' paper results said to use that mode. (I couldn't; nor has anyone else who's tried.) I haven't personally found any datasets/evaluations where `dm_concat` was worth the extra effort – but maybe they exist with really-big doc corpuses. — gojomo, Feb 02 '18 at 20:05

THN · Answer 1 · 2017-08-20T06:41:30.303

This is an old question, but an answer may be useful for future visitors, so here are some of my thoughts.

There are some problems in the tensorflow implementation:

window is the one-sided size, so window=5 spans 5*2+1 = 11 words (10 context words plus the target).
Note that with the PV-DM version of doc2vec, batch_size would be the number of documents. So the train_word_dataset shape would be batch_size * context_window, while the train_doc_dataset and train_labels shapes would be batch_size.
More importantly, sampled_softmax_loss is not negative_sampling_loss. They are two different approximations of softmax_loss.
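
To make the first two points concrete, here is a minimal pure-Python sketch of how one-sided windows translate into training examples and input widths. The helper names `pv_dm_examples` and `concat_input_width` are hypothetical, and truncating the context at document boundaries is a simplification chosen for illustration; it is not a claim about how gensim builds its batches (a concatenating model needs a fixed input width, so short contexts would have to be padded in practice).

```python
def pv_dm_examples(doc_tokens, doc_id, window=5):
    """Build (doc_id, context_words, target) triples for one document.

    `window` is one-sided: up to `window` words are taken on each side
    of the target, truncated at the document boundaries.
    """
    examples = []
    for pos, target in enumerate(doc_tokens):
        left = doc_tokens[max(0, pos - window):pos]
        right = doc_tokens[pos + 1:pos + 1 + window]
        examples.append((doc_id, left + right, target))
    return examples


def concat_input_width(window, embedding_size):
    """Width of the concatenated input: 2*window word vectors plus one doc vector."""
    return (2 * window + 1) * embedding_size


doc = ["the", "cat", "sat", "on", "the", "mat"]
examples = pv_dm_examples(doc, doc_id=0, window=2)

# One example per target word, each paired with a single document id.
print(len(examples))                # 6
print(examples[2])                  # (0, ['the', 'cat', 'on', 'the'], 'sat')
print(concat_input_width(5, 100))   # 1100
```

Each example carries one document id, one label, and up to 2*window context words, which is why the word placeholder is the only one whose size scales with the context width.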

So for the OP's listed questions:

This implementation of doc2vec in tensorflow works and is correct in its own way, but it differs from both the gensim implementation and the paper.
window is the one-sided size, as said above. If the document is shorter than the context size, the smaller of the two is used.
There are many reasons why the gensim implementation is faster. First, gensim is heavily optimized: all operations are faster than naive Python operations, especially data I/O. Second, some preprocessing steps in gensim, such as min_count filtering, reduce the dataset size. More importantly, gensim uses negative_sampling_loss, which is much faster than sampled_softmax_loss; I guess this is the main reason.
Is it easier to find something when there are many of them? Just kidding ;-)
It's true that there are many solutions to this non-convex optimization problem, so the model simply finds a local optimum. Interestingly, in neural networks, most local optima are "good enough". It has been observed that stochastic gradient descent seems to find better local optima than gradient descent with larger batches, although this is still an open question in current research.
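
As a rough illustration of why the two losses are not interchangeable, here is a minimal sketch that evaluates both on the same scores for a single training example. The function names are hypothetical and the inputs are plain floats rather than tensors. The negative-sampling form follows the objective in the Mikolov et al. paper linked in the comments; the sampled-softmax form here is a simplification that omits the sampling-probability correction that TensorFlow's `sampled_softmax_loss` applies to the logits.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_loss(pos_score, neg_scores):
    """Independent binary decisions: true word vs. each sampled noise word."""
    return -log_sigmoid(pos_score) - sum(log_sigmoid(-s) for s in neg_scores)


def sampled_softmax_loss(pos_score, neg_scores):
    """Softmax cross-entropy restricted to the true word plus the samples.

    Simplified: no correction for the sampling distribution.
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score


# With all scores at zero the two objectives already disagree:
ns = negative_sampling_loss(0.0, [0.0, 0.0])   # 3 * ln(2), about 2.079
ss = sampled_softmax_loss(0.0, [0.0, 0.0])     # ln(3), about 1.099

# Adding a constant to every score leaves the softmax loss unchanged,
# but changes the negative-sampling loss.
ns_shifted = negative_sampling_loss(1.0, [1.0, 1.0])
ss_shifted = sampled_softmax_loss(1.0, [1.0, 1.0])
```

Because the two objectives have different gradients, two models trained with them should not be expected to end up with the same embeddings, even with identical data and hyperparameters.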

«in neural network, most local optima are "good enough"» I think it's more correct to say that in high-dimensional problems, like in neural networks, most local minima are actually saddle points, so they are easy to cross, especially when using more stochastic steps. — Ricardo Magalhães Cruz, Nov 23 '17 at 17:11
That's right, in high-dimensional problems, most critical points are saddle points, but the stochastic dynamics drive the solutions to local optima instead of saddle points, except maybe very flat and wide saddle points. The point is, most found local optima are good enough, as it has been shown empirically that different found local optima usually have almost the same generalization performance, and that is very interesting. The answer to this may lie in the stochastic dynamics that also drive the solutions to flat and wide local optima instead of sharp and narrow local optima. — THN, Jan 09 '20 at 08:20
