Here is an example in PySpark, which I expect is straightforward to port to Scala; the key is the use of model.transform.
First, we train the model as in the example:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
sc = SparkContext()
inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
k = 220 # vector dimensionality
word2vec = Word2Vec().setVectorSize(k)
model = word2vec.fit(inp)
k is the dimensionality of the word vectors (the default value is 100). Higher values generally give better results but need more memory; the highest I could go on my machine was 220. (EDIT: Typical values in the relevant publications are between 300 and 1000)
After we have trained the model, we can define a simple function as follows:
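def getAnalogy(s, model):
    # s = (a, b, c); after the sign flip below, the query vector is b - a + c
    qry = model.transform(s[0]) - model.transform(s[1]) - model.transform(s[2])
    res = model.findSynonyms((-1)*qry, 5)  # return 5 "synonyms"
    res = [x[0] for x in res]  # keep the words, drop the similarity scores
    for i in range(0, 3):  # remove the input terms from the candidates
        if s[i] in res:
            res.remove(s[i])
    return res[0]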
Now, here are some examples with countries and their capitals:
s = ('france', 'paris', 'portugal')
getAnalogy(s, model)
# u'lisbon'
s = ('china', 'beijing', 'russia')
getAnalogy(s, model)
# u'moscow'
s = ('spain', 'madrid', 'greece')
getAnalogy(s, model)
# u'athens'
s = ('germany', 'berlin', 'portugal')
getAnalogy(s, model)
# u'lisbon'
s = ('japan', 'tokyo', 'sweden')
getAnalogy(s, model)
# u'stockholm'
s = ('finland', 'helsinki', 'iran')
getAnalogy(s, model)
# u'tehran'
s = ('egypt', 'cairo', 'finland')
getAnalogy(s, model)
# u'helsinki'
The results are not always correct - I'll leave it to you to experiment, but they tend to get better with more training data and increased vector dimensionality k.
The for loop in the function removes entries that belong to the input query itself, as I noticed that frequently the correct answer was the second one in the returned list, with the first usually being one of the input terms.
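To make the effect of that filtering step concrete, here is a minimal pure-Python sketch that needs no Spark at all. Everything in it is an assumption for illustration: the 3-dimensional vectors are invented by hand (not taken from a trained model), and `nearest` is a hypothetical stand-in for model.findSynonyms that simply ranks words by cosine similarity to the query vector.

```python
import math

# Toy 3-d "embeddings", invented purely for illustration (not from a trained model).
vectors = {
    "france":   [1.0, 0.0, 0.0],
    "paris":    [0.9, 0.0, 0.3],
    "portugal": [0.0, 1.0, 0.0],
    "lisbon":   [0.0, 0.6, 0.6],
    "banana":   [0.2, 0.2, -1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, n):
    """Hypothetical stand-in for model.findSynonyms: top-n words by cosine similarity."""
    ranked = sorted(vectors, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:n]

def toy_analogy(s, n=3):
    a, b, c = (vectors[w] for w in s)
    # Same vector as (-1) * (a - b - c) in getAnalogy, i.e. b - a + c
    query = [bi - ai + ci for ai, bi, ci in zip(a, b, c)]
    candidates = nearest(query, n)
    # Drop the input terms, as the for loop in getAnalogy does
    filtered = [w for w in candidates if w not in s]
    return candidates, filtered[0]

candidates, answer = toy_analogy(("france", "paris", "portugal"))
print(candidates)  # unfiltered ranking: the input term 'portugal' comes first
print(answer)      # after filtering: 'lisbon'
```

With these made-up vectors the query vector stays closer to 'portugal' than to 'lisbon', so the unfiltered top hit is one of the input terms and the intended answer only surfaces once the inputs are removed. This is the same pattern described above for the real model.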