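I'm using ES for search. When a user types 柠檬 (lemon), results such as 柠檬汽水 (lemon soda) and 柠檬味牙膏 (lemon-flavored toothpaste) rank first, while the plain fruit 柠檬 that the user actually wants ranks last. I have already added 柠檬 to the Chinese tokenizer's dictionary, but it doesn't help.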

kennywu76 - Wood

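This is an interesting question!

The default scoring takes tf, idf, and field norms into account. Judging from the sample data, the term "柠檬" should have the same tf and idf in all 3 documents, so only the field norms should affect the score. In theory the document "柠檬" has the shortest field length, so it should score highest and be returned first.

To verify this, I ran an actual test as follows:

Create an empty index that uses the ik_max_word analyzer and write 3 documents: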

PUT testindex/
{
  "mappings": {
    "logs": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

PUT testindex/logs/1
{"product":"柠檬"}

PUT testindex/logs/2
{"product":"柠檬汽水"}

PUT testindex/logs/3
{"product":"柠檬味牙膏"}

Search for the keyword "柠檬":

POST testindex/_search
{
  "query": {
    "match": {
      "product": "柠檬"
    }
  }
}

Search result:

"hits": {
    "total": 3,
    "max_score": 0.85747814,
    "hits": [
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "3",
        "_score": 0.85747814,
        "_source": {
          "product": "柠檬味牙膏"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "2",
        "_score": 0.80226827,
        "_source": {
          "product": "柠檬汽水"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "1",
        "_score": 0.7594807,
        "_source": {
          "product": "柠檬"
        }
      }
    ]
  }

"柠檬" actually got the lowest score, which really surprised me.

So I turned on the "explain": true option in the query to see how the scores were calculated, and found that neither doc frequency nor avgFieldLength looked right.
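For anyone who wants to repeat this check, here is a minimal Python sketch that is not part of the original test. It builds the same match query with "explain": true and flattens the nested _explanation tree (value / description / details) that ES returns for each hit, so the statistics are easier to scan. The helper names and the sample fragment are hypothetical, and real description strings vary by ES version.

```python
def search_body_with_explain(field, text):
    """Build a match query body with the "explain" option switched on."""
    return {"explain": True, "query": {"match": {field: text}}}


def flatten_explanation(node, depth=0):
    """Walk an _explanation tree and yield (depth, value, description) rows."""
    yield depth, node["value"], node["description"]
    for child in node.get("details", []):
        yield from flatten_explanation(child, depth + 1)


# Hypothetical, hand-written fragment shaped like an _explanation object.
# It is NOT copied from a real response.
sample = {
    "value": 0.2876821,
    "description": "idf (hypothetical description text)",
    "details": [
        {"value": 1.0, "description": "docFreq", "details": []},
        {"value": 1.0, "description": "docCount", "details": []},
    ],
}

body = search_body_with_explain("product", "柠檬")
rows = list(flatten_explanation(sample))
for depth, value, description in rows:
    # Indent child entries so the nesting stays visible.
    print("  " * depth + f"{value} <- {description}")
```

Comparing the docFreq and docCount rows of hits that come from different shards is what makes the per-shard statistics visible.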

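After puzzling over it without success, I googled and found the answer here:
https://github.com/elastic/ela ... 24429

In short, ES computes relevance scores independently on each shard. An index has 5 shards by default (in the 5.x versions discussed here). If, as in this example, only a few documents are written, they may land on different shards. Each shard then scores with its own statistics, which are never computed over all of these documents together, so the scores from different shards are not comparable.

So I ran another test: I deleted the index, recreated it with the shard count set to 1, wrote the same 3 documents, and searched again. This time "柠檬" was returned first.

How should we look at this problem? ES is a distributed search system: each shard searches independently and scores only the documents it holds. When the data volume is large, the differences described above are statistically smoothed out and usually cause no trouble. But when an index holds few documents, the statistics for the same search term can differ considerably between shards, and in that case the only fix is to use a single shard.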

kennywu76 - Wood


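@napoay In my original reply I said "the default scoring takes tf, idf, and field norms into account" because I had not realized that the default scoring changed to BM25 from 5.x onward. Thanks for the correction.

But let me say it again: the incorrect scoring caused by multiple shards has nothing to do with which scoring model is used. Whichever model is used, scoring happens at the shard level, and the statistics it relies on are not global.

Since you insist that with BM25 + ik_smart the scores are the same regardless of the number of shards, I'll run the test for you with 5 shards and with 1 shard, using the original poster's documents, to see whether the scores really are the same: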

DELETE testindex

# 5 shards
PUT testindex/
{
  "settings": {
    "number_of_shards": 5
  },
  "mappings": {
    "logs": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "ik_smart"
        }
      }
    }
  }
}

PUT testindex/logs/1
{"product":"柠檬"}

PUT testindex/logs/2
{"product":"柠檬汽水"}

PUT testindex/logs/3
{"product":"柠檬味牙膏"}

POST testindex/_search
{
  "query": {
    "match": {
      "product": "柠檬"
    }
  }
}

Test result with 5 shards:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "product": "柠檬"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "2",
        "_score": 0.25811607,
        "_source": {
          "product": "柠檬汽水"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "3",
        "_score": 0.25316024,
        "_source": {
          "product": "柠檬味牙膏"
        }
      }
    ]
  }
}

Now with 1 shard:

DELETE testindex

# 1 shard
PUT testindex/
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "logs": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "ik_smart"
        }
      }
    }
  }
}

PUT testindex/logs/1
{"product":"柠檬"}

PUT testindex/logs/2
{"product":"柠檬汽水"}

PUT testindex/logs/3
{"product":"柠檬味牙膏"}

POST testindex/_search
{
  "query": {
    "match": {
      "product": "柠檬"
    }
  }
}

Test result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.16786803,
    "hits": [
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "1",
        "_score": 0.16786803,
        "_source": {
          "product": "柠檬"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "2",
        "_score": 0.11980793,
        "_source": {
          "product": "柠檬汽水"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "3",
        "_score": 0.09476421,
        "_source": {
          "product": "柠檬味牙膏"
        }
      }
    ]
  }
}

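Please look closely at the results for 5 shards and for 1 shard, and don't be fooled by the identical document order: the scores are completely different. If you analyze the scoring with explain:true, you will see that the doc freq and avgFieldLength used in the calculation differ completely between the two shard setups. This is the "relevance is broken" phenomenon explained in the official documentation.

I don't understand why you keep missing the point of this question and insist on stressing that BM25 is more stable than TF-IDF.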
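The two sets of scores above can be roughly checked by hand. The Python sketch below is a simplified model, not what ES literally executes. It applies the standard BM25 formula with the default k1 = 1.2 and b = 0.75 to shard-local statistics. Several things here are assumptions and not stated in the thread: that ik_smart produces 1, 2 and 3 tokens for the three documents, that with 5 shards each document ended up alone on its own shard, and that Lucene 6.x's lossy one-byte length encoding decodes lengths 2 and 3 as 2.56 and 4.0. Under those assumptions the computed values come out very close to the reported ones.

```python
import math

K1, B = 1.2, 0.75  # BM25 defaults (assumed unchanged in this test)


def bm25_idf(doc_count, doc_freq):
    """BM25 idf computed from shard-local docCount and docFreq."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(doc_count, doc_freq, field_len, avg_field_len, freq=1):
    """Score of one query term in one document, using shard-local statistics."""
    norm = 1 - B + B * field_len / avg_field_len
    return bm25_idf(doc_count, doc_freq) * freq * (K1 + 1) / (freq + K1 * norm)


# Assumption: ik_smart yields 1, 2 and 3 tokens for docs 1, 2 and 3.
TOKENS = {"1": 1, "2": 2, "3": 3}
# Assumption: field length is stored lossily; lengths 2 and 3 are taken to
# decode to 2.56 and 4.0, which happens to reproduce the reported scores.
DECODED = {"1": 1.0, "2": 2.56, "3": 4.0}

# Scores copied from the two test results in this reply.
REPORTED_5 = {"1": 0.2876821, "2": 0.25811607, "3": 0.25316024}
REPORTED_1 = {"1": 0.16786803, "2": 0.11980793, "3": 0.09476421}


def scores_one_shard():
    """All 3 docs on one shard: docCount = docFreq = 3, shared average length."""
    avg = sum(TOKENS.values()) / len(TOKENS)
    return {d: bm25_score(3, 3, DECODED[d], avg) for d in TOKENS}


def scores_isolated():
    """Assumed 5-shard layout: each doc alone, so docCount = docFreq = 1."""
    return {d: bm25_score(1, 1, DECODED[d], TOKENS[d]) for d in TOKENS}


for label, got, want in (("5 shards", scores_isolated(), REPORTED_5),
                         ("1 shard", scores_one_shard(), REPORTED_1)):
    for d in sorted(got):
        print(f"{label} doc {d}: computed {got[d]:.6f}, reported {want[d]}")
```

The only inputs that change between the two runs are docCount, docFreq and the average field length, which are exactly the shard-level statistics under discussion.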

novia - 1&0


This article also has a related explanation, worth a look: http://www.jianshu.com/p/c6780752685f

kennywu76 - Wood


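@napoay The test for my original reply used ES 5.3.2 + ik_max_word + the default similarity (BM25). You can try it; it does produce the "wrong" ranking.

I downloaded the latest ES 5.6.0 + the IK plugin to check, and the results match the ones in your answer. But this seemingly "correct" result comes from differences between IK versions. In my testing, the ik_max_word analyzer in versions after 5.3.2 appears to have a bit of a bug: for "柠檬味牙膏", for example, it produces the same tokens as ik_smart (perhaps it is not a bug but a normal algorithm improvement). That is what causes the scoring differences between versions.

I understand that different analyzers lead to different scores, and the example in the original question is just one special case where the scoring feels "incorrect". But the point of this discussion is not the differences introduced by the analyzer; it is the inaccurate scoring caused by searching across multiple shards. As I explained in my original answer, when you inspect the scoring through the explain API, statistics such as doc frequency and avgFieldLength look "incorrect". They looked "incorrect" to me only because I had not yet realized that ES scores on each shard separately by default rather than using global statistics.

As for theoretical support, see:
https://www.elastic.co/guide/e ... .html

Similar issues have been raised on GitHub, and the cause is the same:
https://github.com/elastic/ela ... 24519
https://github.com/elastic/ela ... 24429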

jerryhouse - Search engineer, tech site: www.dcharm.com

Solving this requires query intent mining.

napoay

The default scoring algorithm is BM25, not tf-idf.

Test on ES 5.4:

PUT testindex/
{
  "mappings": {
    "logs": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "ik_max_word",
          "similarity":"classic"
        }
      }
    }
  }
}

PUT testindex/logs/1
{"product":"柠檬"}

PUT testindex/logs/2
{"product":"柠檬汽水"}

PUT testindex/logs/3
{"product":"柠檬味牙膏"}

POST testindex/_search
{
  "query": {
    "match": {
      "product": "柠檬"
    }
  }
}

Result:

{
  "took": 245,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.5,
    "hits": [
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "1",
        "_score": 1.5,
        "_source": {
          "product": "柠檬"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "2",
        "_score": 1.125,
        "_source": {
          "product": "柠檬汽水"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "3",
        "_score": 1.125,
        "_source": {
          "product": "柠檬味牙膏"
        }
      }
    ]
  }
}

napoay

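The first reply says the number of shards affects scoring; I'm not sure whether there is any theory behind that.

With the scoring model switched to tf-idf and 3 documents, I set the shard count to 1, 2, 3, 4, and 5 in turn, and the document scores were identical every time. The test code: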

DELETE testindex

PUT testindex/
{
  "settings":{
    "number_of_shards": 4
  },
  "mappings": {
    "logs": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "ik_max_word",
          "similarity":"classic"
        }
      }
    }
  }
}

PUT testindex/logs/1
{"product":"柠檬"}

PUT testindex/logs/2
{"product":"柠檬汽水"}

PUT testindex/logs/3
{"product":"柠檬味牙膏"}

POST testindex/_search
{
  "query": {
    "match": {
      "product": "柠檬"
    }
  }
}

Scoring result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.5,
    "hits": [
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "1",
        "_score": 1.5,
        "_source": {
          "product": "柠檬"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "3",
        "_score": 1.125,
        "_source": {
          "product": "柠檬味牙膏"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "2",
        "_score": 1.125,
        "_source": {
          "product": "柠檬汽水"
        }
      }
    ]
  }
}

napoay

Also, you can't simply assume that "柠檬水果" = "柠檬" + "水果". What ES stores is the terms produced by the analyzer:

{
  "tokens": [
    {
      "token": "柠檬水",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "柠檬",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "柠",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "檬",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "水果",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}

Scoring is closely tied to the tokenization result.
Switching the analyzer to ik_smart will fit your needs better:

DELETE testindex

PUT testindex/
{
  "settings":{
    "number_of_shards": 5
  },
  "mappings": {
    "logs": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "ik_smart",
          "similarity":"classic"
        }
      }
    }
  }
}

PUT testindex/logs/1
{"product":"柠檬"}

PUT testindex/logs/2
{"product":"柠檬水果"}

PUT testindex/logs/3
{"product":"柠檬可乐"}

POST testindex/_search
{
  "query": {
    "match": {
      "product": "柠檬"
    }
  }
}

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "1",
        "_score": 1,
        "_source": {
          "product": "柠檬"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "2",
        "_score": 0.625,
        "_source": {
          "product": "柠檬水果"
        }
      },
      {
        "_index": "testindex",
        "_type": "logs",
        "_id": "3",
        "_score": 0.625,
        "_source": {
          "product": "柠檬可乐"
        }
      }
    ]
  }
}

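private_void - just a zha (猹)

I've run into this kind of problem before.
My understanding is that ES trades some precision for fast search responses (the tf-idf implementation also gives up precision in several places for the sake of fast computation, for example in docFreq and maxDocs). Sharding is likewise a strategy for fast search, so it has a cost too. With a large data volume, data is distributed more evenly across shards and the error becomes negligible. dfs_query_then_fetch gathers the statistics from all shards, so it does not have this error, but it costs more to compute.
You have to pick whichever of the two strategies suits you and make up for its shortcomings by other means.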
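To make the dfs_query_then_fetch option concrete, here is a small Python sketch that only builds the HTTP request and does not send it. The search_type=dfs_query_then_fetch URL parameter is the real ES option; the host http://localhost:9200, the helper name build_dfs_search, and the reuse of testindex / product from this thread are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_dfs_search(base_url, index, field, text):
    """Build (but do not send) a match query that asks ES to collect
    term statistics from all shards before scoring."""
    params = urllib.parse.urlencode({"search_type": "dfs_query_then_fetch"})
    body = json.dumps({"query": {"match": {field: text}}}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search?{params}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local single-node address; adjust for a real cluster.
req = build_dfs_search("http://localhost:9200", "testindex", "product", "柠檬")
print(req.get_method(), req.full_url)
# Sending it requires a running cluster:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["hits"]["hits"])
```

Running the same search with and without the parameter on a small multi-shard index is a quick way to see whether shard-local statistics are what is distorting the ranking.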

hufuman

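I agree with @wood. We ran into a similar problem: because there was not enough data, multi-shard search results did not match expectations.

We eventually fixed it by changing the shard count to 1.

As for the tests above that produced the correct result, the real question is why they came out correct; that is what is unreasonable.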


kennywu76 - Wood

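If you really can't see it, try inserting the sample documents below and compare the search results with 1 shard and with 5 shards. The deviation will be more obvious to the naked eye: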

PUT testindex/logs/1
{"product":"柠檬"}

PUT testindex/logs/2
{"product":"柠檬汽水"}

PUT testindex/logs/3
{"product":"柠檬味牙膏"}

PUT testindex/logs/4
{"product":"蛋糕"}
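One possible reading of why the fourth document helps, sketched in Python. This is my own illustration, not something stated in the thread. It assumes the classic (tf-idf) similarity in Lucene 6.x computes idf as 1 + ln((docCount + 1) / (docFreq + 1)). Under that formula idf is exactly 1 whenever every document on a shard contains the term, whatever the shard count, so a data set where all documents match 柠檬 hides the effect. A non-matching document such as 蛋糕 breaks that symmetry. The shard layouts below are hypothetical.

```python
import math


def classic_idf(doc_count, doc_freq):
    """Assumed classic-similarity idf from shard-local docCount and docFreq."""
    return 1 + math.log((doc_count + 1) / (doc_freq + 1))


# Hypothetical shard layouts: (docCount, docFreq) for the term 柠檬.
layouts = {
    "1 shard, 3 docs, all match": (3, 3),
    "shard with a single matching doc": (1, 1),
    "1 shard, 4 docs including 蛋糕": (4, 3),
    "shard with one matching doc plus 蛋糕": (2, 1),
}

idfs = {label: classic_idf(n, df) for label, (n, df) in layouts.items()}
for label, value in idfs.items():
    print(f"{label}: idf = {value:.4f}")
```

In the first two layouts idf is identical, which would be consistent with scores not changing across shard counts. In the last two, idf depends on which shard the 蛋糕 document lands on.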

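Jinyang Zhou - a noob like me

I agree with @kennywu76.
The sharding problem does exist. With a sizable data volume the scoring difference can be ignored, and performance has to be considered anyway. BM25 and tf-idf differ only in how relevance is computed; neither they nor any other algorithm can eliminate the differences caused by each shard scoring independently.

So my view is: the larger the data volume, the more shards you will usually use for performance, and the smaller the differences caused by this cross-shard scoring inconsistency; and vice versa.

So if the data volume is very small, I think you might as well use a single shard. Or is there a better way?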

vearne - stay foolish stay hungry

When inserting this kind of doc, increase its weight.

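YuLiGod - Little Yu, the kid

Many thanks to @kennywu76 for the explanation.
In short, in this case tf is the same for all 3 documents, so the ranking comes down to field length and the index statistics.
Those statistics (doc frequency, doc count, average field length) depend on which documents a shard holds, not on the document being scored. So when the set of documents on a shard changes, the score of every document on that shard changes with it.
Because each shard scores independently, a shard that holds very few documents cannot produce statistics that represent the whole index, so the per-shard scores do not correctly reflect index-wide scoring.
The case in this thread is exactly that: a handful of documents spread across separate shards.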
