Elasticsearch index with synonym and stopword filters not working as expected

Issue

This content is from Stack Overflow. Question asked by sudheesh.

I have created the following index, in which an analyzer applies the lowercase, synonym, and stopword filters, and I use that analyzer on the field test_field.

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "test_field": {
        "type": "text",
        "analyzer": "synonym_words_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "routing": {
        "allocation": {
          "include": {
            "_tier_preference": "data_content"
          }
        }
      },
      "analysis": {
        "filter": {
          "english_stopwords": {
            "type": "stop",
            "language": "english",
            "stopwords": ["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not",
              "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they",
              "this", "to", "was", "will", "with"]
          },
          "english_synonyms": {
            "type": "synonym",
            "lenient": true,
            "synonyms": ["International Cricket Council => ICC, I.C.C"]
          }
        },
        "analyzer": {
          "synonym_words_analyzer": {
            "filter": [
              "lowercase",
              "english_synonyms",
              "english_stopwords"
            ],
            "tokenizer": "standard"
          }
        }
      },
      "number_of_replicas": "0"
    }
  }
}

Then I indexed the following documents:

POST wordforms_example/_doc
{
  "test_field": "International Cricket Council vs bcci"
}
POST wordforms_example/_doc
{
  "test_field": "I.C.C vs bcci"
}
POST /wordforms_example/_doc
{
  "test_field": "ICC vs bcci"
}
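
One way to see what these documents look like after analysis is the _analyze API, which returns the tokens an analyzer produces for a given text. The request below is a minimal sketch: it assumes the index is named wordforms_example, as in the POST requests above, and reuses the analyzer name from the index settings.

```
GET wordforms_example/_analyze
{
  "analyzer": "synonym_words_analyzer",
  "text": "International Cricket Council vs bcci"
}
```

The returned token list should show whether international, cricket, and council are kept as separate terms or replaced by the right-hand side of the synonym rule. Repeating the request with "I.C.C vs bcci" and "ICC vs bcci" allows the three documents to be compared.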

Then I executed a match_phrase query

{
   "track_total_hits":true,
   "highlight":{
      "require_field_match":true,
      "fields":{
         "*":{
            
         }
      },
      "pre_tags":[
         "<b>"
      ],
      "post_tags":[
         "</b>"
      ]
   },
   "timeout":"5s",
   "query":{
      "bool":{
         "must":[
            {
               "match_phrase":{
                  "test_field":{
                     "query":"icc"
                  }
               }
            }
         ]
      }
   },
   "from":0,
   "size":100,
   "sort":{
      "_score":"desc"
   }
}

I got the expected result:

      {
        "_index": "wordforms_example",
        "_id": "PVb4VIMBajDdOcAZZ2om",
        "_score": 0.3788134,
        "_source": {
          "test_field": "International Cricket Council vs et"
        },
        "highlight": {
          "test_field": [
            "<b>International Cricket Council</b> vs et"
          ]
        }
      },
      {
        "_index": "wordforms_example",
        "_id": "Plb5VIMBajDdOcAZA2o9",
        "_score": 0.3788134,
        "_source": {
          "test_field": "International Cricket Council vs bcci"
        },
        "highlight": {
          "test_field": [
            "<b>International Cricket Council</b> vs bcci"
          ]
        }
      },
      {
        "_index": "wordforms_example",
        "_id": "QFb5VIMBajDdOcAZZ2qA",
        "_score": 0.3788134,
        "_source": {
          "test_field": "ICC vs bcci"
        },
        "highlight": {
          "test_field": [
            "<b>ICC</b> vs bcci"
          ]
        }
      }
    ]

But when I run the following match query, it returns no results:

{
   "track_total_hits":true,
   "highlight":{
      "require_field_match":true,
      "fields":{
         "*":{
            
         }
      },
      "pre_tags":[
         "<b>"
      ],
      "post_tags":[
         "</b>"
      ]
   },
   "timeout":"5s",
   "query":{
      "bool":{
         "must":[
            {
               "match":{
                  "test_field":{
                     "query":"international"
                  }
               }
            },
            {
               "match":{
                  "test_field":{
                     "query":"Cricket"
                  }
               }
            },
            {
               "match":{
                  "test_field":{
                     "query":"Council"
                  }
               }
            }
         ]
      }
   },
   "from":0,
   "size":100,
   "sort":{
      "_score":"desc"
   }
}
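
To check how Elasticsearch interprets this query, the validate API can be called with explain=true, which returns a description of the low-level query that would be run. This is only a diagnostic sketch. It assumes the index name wordforms_example and leaves out the highlight, sort, and paging options because they are not needed for the check.

```
GET wordforms_example/_validate/query?explain=true
{
  "query": {
    "bool": {
      "must": [
        { "match": { "test_field": { "query": "international" } } },
        { "match": { "test_field": { "query": "Cricket" } } },
        { "match": { "test_field": { "query": "Council" } } }
      ]
    }
  }
}
```

The explanation string lists the terms each clause searches for, which can then be compared with the terms actually present in the index for test_field.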

However, if I replace the synonym rule ["International Cricket Council => ICC, I.C.C"] with the following rule:
["International Cricket Council, ICC, I.C.C => International Cricket Council, ICC, I.C.C"]

then unnecessary highlighting occurs for all of the above queries. Sample result:

     {
       "_index": "wordforms_example",
       "_id": "QlYFVYMBajDdOcAZjGqs",
       "_score": 1.1955718,
       "_source": {
         "test_field": "I.C.C vs bcci"
       },
       "highlight": {
         "test_field": [
           "<b>I.C.C</b> <b>vs</b> <b>bcci</b>"
         ]
       }
     },
     {
       "_index": "wordforms_example",
       "_id": "Q1YFVYMBajDdOcAZlGq8",
       "_score": 1.1955718,
       "_source": {
         "test_field": "ICC vs bcci"
       },
       "highlight": {
         "test_field": [
           "<b>ICC</b> <b>vs</b> <b>bcci</b>"
         ]
       }
     },
     {
       "_index": "wordforms_example",
       "_id": "QVYFVYMBajDdOcAZhWpI",
       "_score": 1.1175997,
       "_source": {
         "test_field": "International Cricket Council vs bcci"
       },
       "highlight": {
         "test_field": [
           "<b>International Cricket</b> <b>Council</b> vs bcci"
         ]
       }
     }
   ]
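
Highlighting depends on where the indexed terms sit in the text, so it may help to look at the positions and offsets stored for one of the affected documents. The term vectors API can return them. The sketch below uses the _id of the "I.C.C vs bcci" hit from the sample result and assumes the index was recreated with the second synonym rule.

```
GET wordforms_example/_termvectors/QlYFVYMBajDdOcAZjGqs
{
  "fields": ["test_field"],
  "positions": true,
  "offsets": true
}
```

If terms that come from the synonym expansion share positions or offsets with vs and bcci, that would be one possible explanation for the extra <b> tags, though this is a guess rather than something the question confirms.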

Also, when there is a large amount of data and I run a match search for International Cricket Council, there is a count mismatch in the results.
Does anyone know why this is happening? Thank you.



Solution

This question has not been answered yet.

This question and answer were collected from Stack Overflow and tested by the JTuto community, and are licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
