Having an issue with understanding why NLTK's word_tokenizer looks at the string "this's" and splits it into "this" "'" "s" instead of keeping them together. I've tested with "test's" and this works fine. When I tested with "results'" it split the apostrophe again. Is this just a particular thing that will happen with apostrophes?
- I think this's (heh!) relevant: https://ell.stackexchange.com/q/145503 – Fred Larson Nov 21 '17 at 16:14
- Have you tried adding \ before, i.e. `'this\'s'`? – Xantium Nov 21 '17 at 17:05
- @Simon, I tried and it didn't work – Darpan Ganatra Nov 22 '17 at 21:59
- All good, just parsed it myself – Darpan Ganatra Nov 22 '17 at 22:01
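A side note on the escaping suggestion above: a backslash in a Python string literal only affects how the source code is parsed, not the contents of the resulting string, so the tokenizer receives exactly the same input either way. A quick check:

```python
# The escaped literal and the double-quoted literal produce identical strings,
# so escaping the apostrophe cannot change how NLTK tokenizes it.
escaped = 'this\'s'
plain = "this's"
print(escaped == plain)  # True
```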
1 Answer
It is normal behavior for NLTK and tokenizers in general to split this's into this + 's, because 's is a clitic and the two are separate syntactic units.
>>> from nltk import word_tokenize
>>> word_tokenize("this's")
['this', "'s"]
For the case of results' it's the same:
>>> word_tokenize("results'")
['results', "'"]
Why are 's and ' separate entities from their host?
In the case of this's, 's is an abbreviated form of is, which denotes the copula. In some cases it's ambiguous, and it can also denote a possessive.
And in the second case, results', the ' denotes a possessive.
So if we POS tag the tokenized forms we get:
>>> from nltk import word_tokenize, pos_tag
>>> pos_tag(word_tokenize("results'"))
[('results', 'NNS'), ("'", 'POS')]
For the case of this's, the POS tagger thinks it's a possessive because people seldom use this's in written text:
>>> from nltk import word_tokenize, pos_tag
>>> pos_tag(word_tokenize("this's"))
[('this', 'DT'), ("'s", 'POS')]
But if we look at He's -> He + 's, it's clearer that 's is denoting the copula:
>>> pos_tag(word_tokenize("He's good."))
[('He', 'PRP'), ("'s", 'VBZ'), ('good', 'JJ'), ('.', '.')]
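If you actually want to keep such contractions together as single tokens, one possible alternative (whether it fits depends on your task) is NLTK's TweetTokenizer, which matches words with internal apostrophes as single tokens instead of splitting off the clitic:

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer's word pattern allows apostrophes inside a word,
# so the clitic 's stays attached to its host.
tokenizer = TweetTokenizer()
print(tokenizer.tokenize("this's"))  # ["this's"]
```

Note that this trades away the Treebank-style behavior shown above, so downstream tools that expect split clitics (like the POS tagger) may perform worse on such tokens.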
Related question: https://stackoverflow.com/a/47384013/610569
alvas