There is a cleaner way to do this with index_table_from_file and Dataset API.
First, create your own tf.data.Dataset (I assume we have two sentences with some arbitrary labels):
sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
Second, create a vocab.txt file in which each line number maps to the same row index in the Glove embedding matrix. For example, if the first word in the Glove vocabulary is "absent", then the first line of vocab.txt should be "absent", and so on. For simplicity, assume our vocab.txt contains the following words:
first
is
test
this
second
sentence
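Since a Glove text file stores one word per line (the word followed by its vector), vocab.txt can be generated directly from it. A minimal sketch, assuming the standard Glove text format; `write_vocab` and the file paths are illustrative names, not part of any library:

```python
def write_vocab(glove_path, vocab_path):
    # Each line of a Glove text file is "<word> <dim floats...>".
    # Writing the first token of every line preserves the row order,
    # so line i of vocab.txt matches row i of the embedding matrix.
    with open(glove_path, encoding="utf-8") as src, \
         open(vocab_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.split(" ", 1)[0] + "\n")

# write_vocab("glove.6B.300d.txt", "vocab.txt")  # example paths
```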
Then, based on here, define a table whose job is to map each word to a specific id:
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))
dataset = dataset.batch(1)
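To see what the table does, here is a plain-Python analogue (a sketch only; the real op hashes out-of-vocabulary words into `num_oov_buckets` buckets, while this version sends every unknown word to a single bucket id):

```python
vocab = ["first", "is", "test", "this", "second", "sentence"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def lookup(word):
    # Known words map to their line number in vocab.txt;
    # unknown words fall into the OOV bucket appended after
    # the vocabulary (with num_oov_buckets=1: id == len(vocab)).
    return word_to_id.get(word, len(vocab))

print([lookup(w) for w in "this is first sentence".split()])  # [3, 1, 0, 5]
```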
Finally, based on this answer, use tf.nn.embedding_lookup() to convert each sentence to an embedding:
glove_weights = tf.get_variable('embed', shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)
iterator = dataset.make_initializable_iterator()
x, y = iterator.get_next()
embedding = tf.nn.embedding_lookup(glove_weights, x)
sentence = tf.reduce_mean(embedding, axis=1)
(In graph mode, remember to run tf.tables_initializer() and iterator.initializer in your session before fetching sentence.)
Complete code in eager mode:
import tensorflow as tf
tf.enable_eager_execution()
sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))
dataset = dataset.batch(1)
glove_weights = tf.get_variable('embed', shape=(10000, 300), initializer=tf.truncated_normal_initializer())  # random weights stand in for the real Glove matrix here
for x, y in dataset:
    embedding = tf.nn.embedding_lookup(glove_weights, x)
    sentence = tf.reduce_mean(embedding, axis=1)
    print(sentence.shape)