I am trying to process large volumes of text with Spark. As part of the pipeline I want to perform steps such as tokenization and stemming, but some of these steps require loading an external model (for example the OpenNLP tokenizers). This is the driver code I currently have:
    SparkConf sparkConf = new SparkConf().setAppName("Spark Tokenizer");
    JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
    SQLContext sqlContext = new SQLContext(sparkContext);
    // Load the corpus
    DataFrame corpus = sqlContext.read().text("/home/zezke/document.nl");
    // Create pipeline components
    Tokenizer tokenizer = new Tokenizer()
            .setInputCol("value")
            .setOutputCol("tokens");
    DataFrame tokenizedCorpus = tokenizer.transform(corpus);
    // Save the output
    tokenizedCorpus.write().mode(SaveMode.Overwrite).json("/home/zezke/experimentoutput");
For the tokenization step itself I am implementing Tokenizer as a custom UnaryTransformer:
public class Tokenizer extends UnaryTransformer<String, List<String>, Tokenizer> implements Serializable {
    private final static String uid = Tokenizer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
    private static Map<String, String> stringReplaceMap;
    @Override
    public void validateInputType(DataType inputType) {
        assert (inputType.equals(DataTypes.StringType)) :
                String.format("Input type must be %s, but got %s", DataTypes.StringType.simpleString(), inputType.simpleString());
    }
    @Override
    public Function1<String, List<String>> createTransformFunc() {
        return new TokenizerFunction();
    }
    @Override
    public DataType outputDataType() {
        return DataTypes.createArrayType(DataTypes.StringType, true);
    }
    @Override
    public String uid() {
        return uid;
    }
    private class TokenizerFunction extends AbstractFunction1<String, List<String>> implements Serializable {
        public List<String> apply(String sentence) {
            // ... tokenization code using the OpenNLP model goes here
        }
    }
}
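For reference, this is roughly what I intend the body of apply to look like once the OpenNLP model is wired in (the path nl-token.bin is just a placeholder for my actual model file; the snippet uses opennlp.tools.tokenize.TokenizerModel and TokenizerME). Note that, as written, it would reload the model on every call, which is exactly what I want to avoid:

    public List<String> apply(String sentence) {
        // Placeholder path for my model file; loaded from local disk on every call for now.
        try (InputStream modelIn = new FileInputStream("/home/zezke/nl-token.bin")) {
            TokenizerModel model = new TokenizerModel(modelIn);
            TokenizerME opennlpTokenizer = new TokenizerME(model);
            return Arrays.asList(opennlpTokenizer.tokenize(sentence));
        } catch (IOException e) {
            throw new RuntimeException("Could not load the OpenNLP tokenizer model", e);
        }
    }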
Now my questions are:
- When is the best time to load the model? I don't want to load the model multiple times.
- How do I distribute the model to the various nodes? (A rough sketch of the lazy-loading idea I am considering follows this list.)
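One idea I am considering, but am unsure about, is to ship the model file with sparkContext.addFile(...) and then load it lazily, at most once per executor JVM, from a static field inside the function class. A rough sketch of that idea (again with nl-token.bin as a placeholder file name):

    // Driver side: ship the model file to all nodes (placeholder path).
    sparkContext.addFile("/home/zezke/nl-token.bin");

    // Inside the function class: load the model at most once per executor JVM.
    private static TokenizerME opennlpTokenizer;

    private static synchronized TokenizerME getOrLoadTokenizer() {
        if (opennlpTokenizer == null) {
            try (InputStream modelIn = new FileInputStream(SparkFiles.get("nl-token.bin"))) {
                opennlpTokenizer = new TokenizerME(new TokenizerModel(modelIn));
            } catch (IOException e) {
                throw new RuntimeException("Could not load the OpenNLP tokenizer model", e);
            }
        }
        return opennlpTokenizer;
    }

Is this a sensible pattern, or is there a more idiomatic way (for example broadcast variables) to handle a model like this?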
Thanks in advance. Spark is a bit daunting to get into, but it looks promising.