The input list of sentences:
sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]
The desired output:
How Doth the Little Busy Bee,
I'll try again.
Is there a way to extract the quotations (which can appear in either single or double quotes) with nltk, using built-in or third-party tokenizers?
I've tried the SExprTokenizer, passing the quote characters as the parens value, but the result was far from the desired one, e.g.:
In [1]: from nltk import SExprTokenizer
    ...: 
    ...: 
    ...: sentences = [
    ...:     """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    ...:     """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
    ...: ]
    ...: 
    ...: tokenizer = SExprTokenizer(parens='""', strict=False)
    ...: for sentence in sentences:
    ...:     for item in tokenizer.tokenize(sentence):
    ...:         print(item)
    ...:     print("----")
    ...:     
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
 but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'
There are similar threads like this and this, but all of them suggest a regex-based approach. I'm curious whether this can be solved with nltk only, since it sounds like a common task in Natural Language Processing.
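For context, the kind of regex-based approach those threads suggest can be sketched like this (the pattern and the `extract_quotes` helper below are my own illustration, not taken from the linked threads; the negative lookbehind/lookahead are there so apostrophes inside words like "I've" aren't mistaken for single quotes):

```python
import re

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'""",
]

# Match a double-quoted span, or a single-quoted span whose quote marks
# are not flanked by word characters (so "I've" / "I'll" are skipped).
QUOTE_RE = re.compile(r'"([^"]*)"|(?<!\w)\'(.+?)\'(?!\w)')

def extract_quotes(text):
    # findall returns (double, single) tuples; keep whichever group matched
    return [double or single for double, single in QUOTE_RE.findall(text)]

for sentence in sentences:
    for quote in extract_quotes(sentence):
        print(quote)
# prints:
# How Doth the Little Busy Bee,
# I'll try again.
```

This produces the desired output for the two sample sentences, but it is exactly the regex workaround I'd like to avoid in favor of an nltk-native solution.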