I would like to build my own - I am not sure which one yet - tokenizer (from the Lucene point of view) or my own analyzer. I have already written code that tokenizes my documents into words (as a `List<String>`, or a `List<Word>` where `Word` is just a container class with three public `String` fields: word, pos and lemma - pos stands for the part-of-speech tag).
I am not sure yet what I am going to index: maybe only `Word.lemma`, or something like `Word.lemma + '#' + Word.pos`. I will probably also do some filtering against a stop list based on the part-of-speech tags.
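To make that concrete, here is a rough sketch of the container and of what one indexed term could look like (the POS stop list, the tag names and the '#' separator are just my own conventions, nothing Lucene-specific):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class Word {
    public String word;   // surface form
    public String pos;    // part-of-speech tag
    public String lemma;  // lemma

    public Word(String word, String pos, String lemma) {
        this.word = word;
        this.pos = pos;
        this.lemma = lemma;
    }
}

class TermComposer {
    // hypothetical stop list of POS tags; the real tag set depends on my tagger
    private static final Set<String> STOP_POS =
            new HashSet<String>(Arrays.asList("DET", "PREP", "CONJ"));

    // what would actually be indexed, e.g. "be#VERB", or null if the word is filtered out
    static String toIndexTerm(Word w) {
        if (STOP_POS.contains(w.pos)) {
            return null;
        }
        return w.lemma + "#" + w.pos;
    }
}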
By the way, here is my confusion: I am not sure where I should plug into the Lucene API.
Should I wrap my own tokenizer inside a new `Tokenizer`? Should I rewrite `TokenStream`? Should I consider that this is the job of the analyzer rather than the tokenizer? Or should I bypass everything and build my index directly, adding my words straight into the index using `IndexWriter`, `Fieldable` and so on? (If so, do you know of any documentation on how to create your own index from scratch when bypassing the analysis process?)
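For that last option, if I understand correctly I would not actually have to build the index format myself: in Lucene 3.x a `Field` can be constructed directly from a `TokenStream`, so `IndexWriter` consumes my pre-analyzed tokens and no analyzer ever sees the raw text. An untested sketch of that route (`MyLuceneTokenizer` is the class shown further down; `Version.LUCENE_36`, the directory and the paths are placeholders):

import java.io.StringReader;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PreAnalyzedIndexing {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // placeholder; FSDirectory.open(...) for a real index
        // the analyzer given to the config is only used for fields that are NOT pre-analyzed
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
                new WhitespaceAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        // the Field consumes my own TokenStream directly, bypassing the Lucene analyzer entirely
        doc.add(new Field("content",
                new MyLuceneTokenizer(new StringReader("some raw text"), "en", "/path/to/binary")));
        writer.addDocument(doc);
        writer.close();
    }
}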
Best regards
EDIT: maybe the simplest way would be to `org.apache.commons.lang.StringUtils.join` my `Word`s with a space at the exit of my own tokenizer/analyzer, and rely on `WhitespaceTokenizer` to feed Lucene (plus the other classical filters)?
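Something like this is what I have in mind for that variant (it assumes the `Word` container sketched above; the lemma#pos composition is just my convention):

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;

class PreJoiner {
    // turn my List<Word> into one whitespace-separated string of lemma#pos terms,
    // so that WhitespaceTokenizer only has to re-split on the spaces inserted here
    static String joinForWhitespaceTokenizer(List<Word> words) {
        List<String> terms = new ArrayList<String>();
        for (Word w : words) {
            terms.add(w.lemma + "#" + w.pos);
        }
        return StringUtils.join(terms, " ");
    }
}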
EDIT: so, I have read the EnglishLemmaTokenizer pointed out by Larsmans... but I am still confused by the fact that my own analysis/tokenization process ends with a complete `List<Word>` (the `Word` class wrapping .form/.pos/.lemma). The process relies on an external binary that I have wrapped in Java (this is a must-do, I cannot do otherwise - it does not work in a streaming/consumer fashion, I get the full list back as a result), and I still do not see how I should wrap it again to get back into the normal Lucene analysis process.
I will also be using the TermVector feature with TF.IDF-like scoring (maybe redefining my own), and I may be interested in proximity searching as well. So discarding some words based on their part-of-speech before handing the text to a Lucene built-in tokenizer or internal analyzer seems like a bad idea, since proximity relies on the original word positions. And I have difficulties thinking of a "proper" way to map a Word.form / Word.pos / Word.lemma (or any other interesting Word attribute) onto the Lucene way of doing things.
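For the proximity concern, as far as I understand it, dropping a word does not have to destroy the positions: the gap can be encoded in the position increment of the next kept token. A small standalone sketch of that idea (the POS stop list is made up, and it reuses the `Word` container from above); the same trick appears in the incrementToken() below:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PositionPreservingFilter {
    // hypothetical stop list of POS tags
    private static final Set<String> STOP_POS =
            new HashSet<String>(Arrays.asList("DET", "PREP"));

    /** A kept term together with the position increment Lucene should receive for it. */
    static class Emitted {
        final String term;
        final int positionIncrement;
        Emitted(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Dropped words leave a hole: the next kept term gets a larger increment,
    // so phrase and proximity queries still see the original distances.
    static List<Emitted> filter(List<Word> words) {
        List<Emitted> out = new ArrayList<Emitted>();
        int increment = 1;
        for (Word w : words) {
            if (STOP_POS.contains(w.pos)) {
                increment++;
            } else {
                out.add(new Emitted(w.lemma, increment));
                increment = 1;
            }
        }
        return out;
    }
}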
EDIT: BTW, here is a piece of code I wrote, inspired by the one from @Larsmans:
import java.io.IOException;
import java.io.Reader;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

import com.google.common.io.CharStreams;

class MyLuceneTokenizer extends TokenStream {
    private final PositionIncrementAttribute posIncrement;
    private final CharTermAttribute termAttribute;
    private final List<TaggedWord> tagged;
    private int position;

    public MyLuceneTokenizer(Reader input, String language, String pathToExternalBinary) {
        super();
        posIncrement = addAttribute(PositionIncrementAttribute.class);
        termAttribute = addAttribute(CharTermAttribute.class); // TermAttribute is deprecated!
        String text;
        try {
            // read the whole Reader into a String, see http://stackoverflow.com/questions/309424/
            text = CharStreams.toString(input);
        } catch (IOException e) {
            throw new RuntimeException("could not read the input", e);
        }
        // the external binary gives me the full tagged list in one shot
        tagged = MyTaggerWrapper.doTagging(text, language, pathToExternalBinary);
        position = 0;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        clearAttributes();
        int increment = 1; // grows when a word is filtered out, so positions/proximity are preserved
        while (position < tagged.size()) {
            TaggedWord current = tagged.get(position);
            position++;
            String form = current.word;   // what the filtering logic would look at,
            String pos = current.pos;     // together with the POS tag
            String lemma = current.lemma;
            // POS-based filtering logic should be here: set kept to null to drop the word...
            // BTW we have broken the idea behind the Lucene nested filters or analyzers!
            String kept = lemma;
            if (kept != null) {
                posIncrement.setPositionIncrement(increment);
                char[] asCharArray = kept.toCharArray();
                termAttribute.copyBuffer(asCharArray, 0, asCharArray.length);
                //termAttribute.setTermBuffer(kept); // old TermAttribute API
                return true;
            }
            increment++; // the word was dropped: remember the gap for the next emitted token
        }
        return false;
    }
}
class MyLuceneAnalyzer extends Analyzer {
    private String language;
    private String pathToExternalBinary;
    public MyLuceneAnalyzer(String language, String pathToExternalBinary) {
        this.language = language;
        this.pathToExternalBinary = pathToExternalBinary;
    }
    @Override
    public TokenStream tokenStream(String fieldname, Reader input) {
        return new MyLuceneTokenizer(input, language, pathToExternalBinary);
    }
}
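And this is how I imagine plugging it in on the indexing side, with term vectors and positions enabled so that TF.IDF-style re-scoring and proximity searching remain possible (untested; the directory, field name and paths are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexWithMyAnalyzer {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // placeholder
        MyLuceneAnalyzer analyzer = new MyLuceneAnalyzer("en", "/path/to/binary");
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, analyzer));

        Document doc = new Document();
        // the raw text goes in as a plain String; the analyzer builds MyLuceneTokenizer behind the scenes
        doc.add(new Field("content", "some raw text to tag, lemmatize and index",
                Field.Store.NO, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
        writer.close();
    }
}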