Consider the following example:
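library(sparklyr)
library(dplyr)

# Assumption: a local Spark connection; adjust `master` for your own cluster.
sc <- spark_connect(master = "local")

# tibble() replaces the deprecated data_frame()
dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "Chinese Macao",
                          "Tokyo Japan Chinese"),
                 doc_id = 1:4,
                 class = c(1, 1, 1, 0))
dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)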
> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0
This is the classic Naive Bayes example, where class = 1 marks the documents that fall into the China category.
I am able to fit a Naive Bayes classifier in sparklyr by doing the following:
dtrain_spark %>%
  ft_tokenizer(input_col = "text", output_col = "tokens") %>%
  ft_count_vectorizer(input_col = "tokens", output_col = "myvocab") %>%
  select(myvocab, class) %>%
  ml_naive_bayes(label_col = "class",
                 features_col = "myvocab",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0.6,
                 thresholds = c(0.2, 0.4))
which outputs:
NaiveBayesModel (Transformer)
<naive_bayes_5e946aec597e> 
 (Parameters -- Column Names)
  features_col: myvocab
  label_col: class
  prediction_col: pcol
  probability_col: prcol
  raw_prediction_col: rpcol
 (Transformer Info)
  num_classes:  int 2 
  num_features:  int 6 
  pi:  num [1:2] -1.179 -0.368 
  theta:  num [1:2, 1:6] -1.417 -0.728 -2.398 -1.981 -2.398 ... 
  thresholds:  num [1:2] 0.2 0.4 
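As a sanity check, the pi and theta values above can be reproduced by hand. The sketch below is plain base R and assumes that the fit uses additive smoothing with the smoothing value 0.6, that the classes are ordered 0 then 1, and that the first vocabulary term is "chinese". None of that is stated in the output, and the variable names are mine.

```r
# Sanity check under stated assumptions: additive smoothing with lambda = 0.6,
# class order 0 then 1, vocabulary size 6 (num_features in the output above).
lambda <- 0.6

# Documents per class in dtrain: one with class 0, three with class 1
n_docs <- c(`0` = 1, `1` = 3)

# Log class priors: log((n_c + lambda) / (N + K * lambda))
pi_hat <- log((n_docs + lambda) / (sum(n_docs) + length(n_docs) * lambda))
round(pi_hat, 3)

# Total tokens per class, and how often "chinese" occurs in each class
tokens_total  <- c(`0` = 3, `1` = 8)
chinese_count <- c(`0` = 1, `1` = 5)
vocab_size    <- 6

# Log conditional probability of "chinese" given the class
theta_chinese <- log((chinese_count + lambda) /
                     (tokens_total + vocab_size * lambda))
round(theta_chinese, 3)
```

Under those assumptions the priors round to -1.179 and -0.368, and the two values for "chinese" round to -1.417 and -0.728, which agrees with pi and with the first two theta entries printed above.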
However, I have two major questions:
How can I assess the performance of this classifier in-sample? Where are the accuracy metrics?
Even more importantly, how can I use this trained model to predict new values, say, for the following dtest_spark data frame?
Test data:
# tibble() replaces the deprecated data_frame()
dtest <- tibble(text = c("Chinese Chinese Chinese Tokyo Japan",
                         "random stuff"))
dtest_spark <- copy_to(sc, dtest, overwrite = TRUE)
> dtest_spark
# Source:   table<dtest> [?? x 1]
# Database: spark_connection
  text                               
  <chr>                              
1 Chinese Chinese Chinese Tokyo Japan
2 random stuff 
Thanks!