I'd like to find the string outlook.com in my Elasticsearch index, inside a text field, using a match_phrase query. However, I don't want results like something...@outlook.com, which are returned by this query:
GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0
            }
          }
        }
      ]
    }
  }
}
I think these results are returned because the standard analyzer's tokenizer splits something...@outlook.com into [something...], [outlook.com], using @ as a separator.
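This can be checked with the _analyze API (the address below is just a made-up example):

POST /_analyze
{
  "analyzer": "standard",
  "text": "something@outlook.com"
}

As far as I can tell, this returns the two tokens [something] and [outlook.com], so a phrase query for outlook.com matches documents that contain the full email address.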
I tried using the whitespace analyzer so that the whole address is kept as a single token, [something...@outlook.com], and full email addresses are no longer matched (see the _analyze check further below). But this query:
GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0,
              "analyzer": "whitespace"
            }
          }
        }
      ]
    }
  }
}
still finds results like something...@outlook.com. How can I do this?
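For reference, this is the _analyze check I mentioned above (again with a made-up address); with the whitespace analyzer the whole address seems to be kept as a single token [something@outlook.com], which is why I expected the query to stop matching full email addresses:

POST /_analyze
{
  "analyzer": "whitespace",
  "text": "something@outlook.com"
}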
UPDATE:
In my mapping, I set the standard analyzer a while ago. So my intuition is that even if I use a whitespace analyzer at search time, the documents have already been tokenized with the standard analyzer at index time, and that tokenization can no longer be changed after indexing.
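For context, the relevant part of my mapping looks roughly like this (only the message field is shown, the rest is omitted):

PUT /my_index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}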
I tried using a Painless script to match a certain pattern, but my field is of type text, so the search takes too long.
Alternatively, a regexp query can do something similar:
GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "regexp": {
            "message": ".*[^A-Za-z0-9\\@]outlook.com[^A-Za-z0-9\\@].*"
          }
        }
      ]
    }
  }
}
Unfortunately, reading the regexp syntax documentation, only a limited set of operators is supported. With [^A-Za-z0-9\\@] I mean any character that is neither a @ nor alphanumeric, before and after outlook.com (this simulates the word boundary that the match_phrase query gives me). My problem is that if the field starts or ends with outlook.com, the document is not retrieved, because the regex requires a character before and after it ([^A-Za-z0-9\\@] does not match the empty string).
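Assuming the regexp keeps behaving as in the query above (i.e. it is effectively evaluated against the whole value), the only workaround I can think of is to make the boundary groups optional and escape the dot so that it matches a literal dot, but I'm not sure this covers all cases or performs well:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "regexp": {
            "message": "(.*[^A-Za-z0-9\\@])?outlook\\.com([^A-Za-z0-9\\@].*)?"
          }
        }
      ]
    }
  }
}

Is there a better way to achieve this?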