I would like to build my own - I am not sure which one yet - tokenizer (from the Lucene point of view) or my own analyzer. I have already written code that tokenizes my documents into words (as a `List<String>`, or a `List<Word>` where `Word` is just a container class with three public `String` fields: word, pos and lemma - pos stands for the part-of-speech tag).
I am not sure yet what I am going to index: maybe only `Word.lemma`, or something like `Word.lemma + '#' + Word.pos`. I will probably also do some filtering against a stop list based on the part-of-speech tags.
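To make that concrete, here is a rough sketch of the container and of what one indexed term could look like (the POS stop list, the tag names and the '#' separator are just my own conventions, nothing Lucene-specific):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class Word {
    public String word;   // surface form
    public String pos;    // part-of-speech tag
    public String lemma;  // lemma

    public Word(String word, String pos, String lemma) {
        this.word = word;
        this.pos = pos;
        this.lemma = lemma;
    }
}

class TermComposer {
    // hypothetical stop list of POS tags; the real tag set depends on my tagger
    private static final Set<String> STOP_POS =
            new HashSet<String>(Arrays.asList("DET", "PREP", "CONJ"));

    // what would actually be indexed, e.g. "be#VERB", or null if the word is filtered out
    static String toIndexTerm(Word w) {
        if (STOP_POS.contains(w.pos)) {
            return null;
        }
        return w.lemma + "#" + w.pos;
    }
}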
By the way, here is my confusion: I am not sure where I should plug into the Lucene API.
Should I wrap my own tokenizer inside a new `Tokenizer`? Should I rewrite `TokenStream`? Should I consider that this is the job of the analyzer rather than the tokenizer? Or should I bypass everything and build my index directly, adding my words straight into the index using `IndexWriter`, `Fieldable` and so on? (If so, do you know of any documentation on how to create your own index from scratch when bypassing the analysis process?)
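For that last option, if I understand correctly I would not actually have to build the index format myself: in Lucene 3.x a `Field` can be constructed directly from a `TokenStream`, so `IndexWriter` consumes my pre-analyzed tokens and no analyzer ever sees the raw text. An untested sketch of that route (`MyLuceneTokenizer` is the class shown further down; `Version.LUCENE_36`, the directory and the paths are placeholders):

import java.io.StringReader;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PreAnalyzedIndexing {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // placeholder; FSDirectory.open(...) for a real index
        // the analyzer given to the config is only used for fields that are NOT pre-analyzed
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
                new WhitespaceAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        // the Field consumes my own TokenStream directly, bypassing the Lucene analyzer entirely
        doc.add(new Field("content",
                new MyLuceneTokenizer(new StringReader("some raw text"), "en", "/path/to/binary")));
        writer.addDocument(doc);
        writer.close();
    }
}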
Best regards
EDIT: maybe the simplest way would be to `org.apache.commons.lang.StringUtils.join` my `Word`s with a space at the exit of my own tokenizer/analyzer, and rely on `WhitespaceTokenizer` to feed Lucene (plus the other classical filters)?
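Something like this is what I have in mind for that variant (it assumes the `Word` container sketched above; the lemma#pos composition is just my convention):

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;

class PreJoiner {
    // turn my List<Word> into one whitespace-separated string of lemma#pos terms,
    // so that WhitespaceTokenizer only has to re-split on the spaces inserted here
    static String joinForWhitespaceTokenizer(List<Word> words) {
        List<String> terms = new ArrayList<String>();
        for (Word w : words) {
            terms.add(w.lemma + "#" + w.pos);
        }
        return StringUtils.join(terms, " ");
    }
}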
EDIT: so, I have read the EnglishLemmaTokenizer pointed out by Larsmans... but I am still confused by the fact that my own analysis/tokenization process ends with a complete `List<Word>` (the `Word` class wrapping .form/.pos/.lemma). The process relies on an external binary that I have wrapped in Java (this is a must-do, I cannot do otherwise - it does not work in a streaming/consumer fashion, I get the full list back as a result), and I still do not see how I should wrap it again to get back into the normal Lucene analysis process.
I will also be using the TermVector feature with TF.IDF-like scoring (maybe redefining my own), and I may be interested in proximity searching as well. So discarding some words based on their part-of-speech before handing the text to a Lucene built-in tokenizer or internal analyzer seems like a bad idea, since proximity relies on the original word positions. And I have difficulties thinking of a "proper" way to map a Word.form / Word.pos / Word.lemma (or any other interesting Word attribute) onto the Lucene way of doing things.
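For the proximity concern, as far as I understand it, dropping a word does not have to destroy the positions: the gap can be encoded in the position increment of the next kept token. A small standalone sketch of that idea (the POS stop list is made up, and it reuses the `Word` container from above); the same trick appears in the incrementToken() below:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PositionPreservingFilter {
    // hypothetical stop list of POS tags
    private static final Set<String> STOP_POS =
            new HashSet<String>(Arrays.asList("DET", "PREP"));

    /** A kept term together with the position increment Lucene should receive for it. */
    static class Emitted {
        final String term;
        final int positionIncrement;
        Emitted(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Dropped words leave a hole: the next kept term gets a larger increment,
    // so phrase and proximity queries still see the original distances.
    static List<Emitted> filter(List<Word> words) {
        List<Emitted> out = new ArrayList<Emitted>();
        int increment = 1;
        for (Word w : words) {
            if (STOP_POS.contains(w.pos)) {
                increment++;
            } else {
                out.add(new Emitted(w.lemma, increment));
                increment = 1;
            }
        }
        return out;
    }
}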
EDIT: BTW, here is a piece of code I wrote, inspired by the one from @Larsmans:
import java.io.IOException;
import java.io.Reader;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

import com.google.common.io.CharStreams;

class MyLuceneTokenizer extends TokenStream {
    private final PositionIncrementAttribute posIncrement;
    private final CharTermAttribute termAttribute;
    private final List<TaggedWord> tagged;
    private int position;

    public MyLuceneTokenizer(Reader input, String language, String pathToExternalBinary) {
        super();
        posIncrement = addAttribute(PositionIncrementAttribute.class);
        termAttribute = addAttribute(CharTermAttribute.class); // TermAttribute is deprecated!
        String text;
        try {
            // read the whole Reader into a String, see http://stackoverflow.com/questions/309424/
            text = CharStreams.toString(input);
        } catch (IOException e) {
            throw new RuntimeException("could not read the input", e);
        }
        // the external binary gives me the full tagged list in one shot
        tagged = MyTaggerWrapper.doTagging(text, language, pathToExternalBinary);
        position = 0;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        clearAttributes();
        int increment = 1; // grows when a word is filtered out, so positions/proximity are preserved
        while (position < tagged.size()) {
            TaggedWord current = tagged.get(position);
            position++;
            String form = current.word;   // what the filtering logic would look at,
            String pos = current.pos;     // together with the POS tag
            String lemma = current.lemma;
            // POS-based filtering logic should be here: set kept to null to drop the word...
            // BTW we have broken the idea behind the Lucene nested filters or analyzers!
            String kept = lemma;
            if (kept != null) {
                posIncrement.setPositionIncrement(increment);
                char[] asCharArray = kept.toCharArray();
                termAttribute.copyBuffer(asCharArray, 0, asCharArray.length);
                //termAttribute.setTermBuffer(kept); // old TermAttribute API
                return true;
            }
            increment++; // the word was dropped: remember the gap for the next emitted token
        }
        return false;
    }
}
class MyLuceneAnalyzer extends Analyzer {
    private String language;
    private String pathToExternalBinary;
    public MyLuceneAnalyzer(String language, String pathToExternalBinary) {
        this.language = language;
        this.pathToExternalBinary = pathToExternalBinary;
    }
    @Override
    public TokenStream tokenStream(String fieldname, Reader input) {
        return new MyLuceneTokenizer(input, language, pathToExternalBinary);
    }
}
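And this is how I imagine plugging it in on the indexing side, with term vectors and positions enabled so that TF.IDF-style re-scoring and proximity searching remain possible (untested; the directory, field name and paths are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexWithMyAnalyzer {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // placeholder
        MyLuceneAnalyzer analyzer = new MyLuceneAnalyzer("en", "/path/to/binary");
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, analyzer));

        Document doc = new Document();
        // the raw text goes in as a plain String; the analyzer builds MyLuceneTokenizer behind the scenes
        doc.add(new Field("content", "some raw text to tag, lemmatize and index",
                Field.Store.NO, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
        writer.close();
    }
}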