
Why does the top-scoring result of this search match worse than the second one?

Elasticsearch | Author: kilik52 | Posted 2017-03-03 | Views: 5450

I built a database for the Fate game. It contains the following two characters:
 
 
{
"name": "阿尔托莉雅・潘德拉贡",
"class": "Saber",
"description": "不列颠传说中的王,被称为骑士王。阿尔托莉雅是幼名,从成为王的那一天起就被称为亚瑟王。在那个骑士道如花般凋零的时代,用手中的圣剑为不列颠带来了短暂的和平与最后的繁荣。虽然史实上是男性,但在这个世界似乎是男装的丽人。"
}
 
{
"name": "吉尔伽美什",
"class": "Archer",
"description": "公元以前统治着苏美尔的都市国家乌鲁克的半神半人的王者。不仅仅是传说而是真实存在的人物,记述于人类最古的叙事诗《吉尔伽美什叙事诗》中的王。"
}

 
I am using the official smartcn analyzer plugin for ES. My search phrase is “不列颠传说中的王” ("the king of British legend"), which the analyzer tokenizes as “不列颠/传说/中/的/王”.
 
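I expected the “阿尔托莉雅・潘德拉贡” (Artoria Pendragon) document to score higher than the “吉尔伽美什” (Gilgamesh) document. However, the search returned:
 
Score for “吉尔伽美什”: 2.447756
Score for “阿尔托莉雅・潘德拉贡”: 1.9885943
 
Why is that? Thanks!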
 
 
Update: this is the result of an explain request against the 阿尔托莉雅 document:
 
{
"_index": "fgo",
"_type": "servant",
"_id": "1",
"matched": true,
"explanation": {
"value": 1.9885945,
"description": "sum of:",
"details": [
{
"value": 1.9885945,
"description": "sum of:",
"details": [
{
"value": 0.39341488,
"description": "weight(_all:不列颠 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.39341488,
"description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 1.3675334,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 2,
"description": "termFreq=2.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 82,
"description": "avgFieldLength",
"details": []
},
{
"value": 83.591835,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 0.28541544,
"description": "weight(_all:传说 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.28541544,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.992121,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 82,
"description": "avgFieldLength",
"details": []
},
{
"value": 83.591835,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 0.28541544,
"description": "weight(_all:中 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.28541544,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.992121,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 82,
"description": "avgFieldLength",
"details": []
},
{
"value": 83.591835,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 0.53913236,
"description": "weight(_all:的 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.53913236,
"description": "score(doc=0,freq=7.0 = termFreq=7.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 1.874056,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 7,
"description": "termFreq=7.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 82,
"description": "avgFieldLength",
"details": []
},
{
"value": 83.591835,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 0.48521632,
"description": "weight(_all:王 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.48521632,
"description": "score(doc=0,freq=4.0 = termFreq=4.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 1.6866407,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 4,
"description": "termFreq=4.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 82,
"description": "avgFieldLength",
"details": []
},
{
"value": 83.591835,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "*:*, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
}

I think the problem is here:
"description": "weight(_all:不列颠 in 0) [PerFieldSimilarity], result of:",

But why does it say "in 0"? The description field of my original document clearly contains 不列颠.
 

medcl


Add explain to your query and you will see how each score is computed. The default scoring model is now BM25.
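
To make the BM25 breakdown concrete, here is a minimal Python sketch that recomputes the score from the numbers in the explain output posted in the question. The two formulas are copied from the "idf" and "tfNorm" description strings in that output. The function and variable names are my own, and the term frequencies, field lengths and docFreq/docCount values are simply read off the explain JSON.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """idf as printed in the explain output:
    log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))"""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """tfNorm as printed in the explain output:
    (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))"""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


# Values read off the explain output in the question.
TERM_FREQS = {"不列颠": 2, "传说": 1, "中": 1, "的": 7, "王": 4}
FIELD_LENGTH = 83.591835
AVG_FIELD_LENGTH = 82
DOC_FREQ = 1   # the shard reports docFreq = 1 for every term
DOC_COUNT = 1  # and docCount = 1, i.e. it only sees one document

idf = bm25_idf(DOC_FREQ, DOC_COUNT)
per_term = {
    term: idf * bm25_tf_norm(freq, FIELD_LENGTH, AVG_FIELD_LENGTH)
    for term, freq in TERM_FREQS.items()
}
total = sum(per_term.values())

for term, weight in per_term.items():
    print(term, round(weight, 6))
print("total", round(total, 6))
```

The total should come out very close to the 1.9885945 reported for this document. Note that every term gets the same idf, because docCount is 1: the statistics come from a shard that holds a single document.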

kilik52


I found the cause myself. For a very small dataset, the number of shards should be set to 1. The original answer on SO reads:
 


The problem lies within the distributed score calculation.

You create a new index with default settings, that is, 5 shards. Each shard is its own Lucene index. When you index your data, Elasticsearch needs to decide to which shard the document should go and it does so by hashing on the _id (in absence of the routing parameter).

So, by shifting the IDs, you eventually distributed the documents to different shards. As written above, each shard is its own Lucene index and when you search across multiple shards, you have to combine the different scores of each separate shard and due to the different routing, the individual scores are different.

You can verify this by adding explain to your query. For Sand Roger, the idf is calculated as idf(docFreq=1, maxDocs=1) = 0.30685282 and idf(docFreq=1, maxDocs=2) = 1 respectively, which yields the different results.
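
(Note added to the quote: the two idf values above appear consistent with the classic Lucene TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)), which was the default before BM25. The sketch below, with function names of my own choosing, reproduces them and also shows the BM25 idf from the explain output in the question, where the shard likewise sees only one document.)

```python
import math


def classic_idf(doc_freq, max_docs):
    """Classic Lucene TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def bm25_idf(doc_freq, doc_count):
    """BM25 idf as printed in the question's explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Same term, same docFreq, but the shard holds a different number of documents.
alone_on_shard = classic_idf(doc_freq=1, max_docs=1)
with_neighbour = classic_idf(doc_freq=1, max_docs=2)

# BM25 case from the question: docFreq = 1, docCount = 1.
question_idf = bm25_idf(doc_freq=1, doc_count=1)

print(alone_on_shard, with_neighbour, question_idf)
```

Whichever similarity is in use, idf is computed from per-shard statistics, so the same term can be weighted differently depending on which other documents happen to share its shard.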

You can change either the shard size to 1 or the query type to a dfs type. Searching against http://localhost:9200/test/use ... fetch will give you correct scores, because of its
 


 
http://stackoverflow.com/quest ... ethod
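
The two fixes mentioned in the quote can be sketched as request payloads. This is only an illustration: it builds the URL and JSON bodies without sending anything. The index name `fgo` comes from the explain output above and the host from the quoted answer. The `match` query on `description` is my assumption (the explain output in the question shows the original query ran against `_all`), and the helper names are hypothetical. `number_of_shards` can only be set when an index is created, so the first option means recreating the index and reindexing the data.

```python
import json
from urllib.parse import urlencode

ES_HOST = "http://localhost:9200"  # host as in the quoted answer
INDEX = "fgo"                      # index name from the explain output


def single_shard_index_body():
    """Body for PUT /<index> when (re)creating the index with one shard."""
    return {"settings": {"index": {"number_of_shards": 1}}}


def dfs_search_request(index, text, field="description"):
    """URL and body for a search that first gathers term statistics
    from all shards (search_type=dfs_query_then_fetch)."""
    params = urlencode({"search_type": "dfs_query_then_fetch"})
    url = f"{ES_HOST}/{index}/_search?{params}"
    body = {
        "explain": True,  # include the scoring breakdown for each hit
        "query": {"match": {field: text}},  # assumed query shape
    }
    return url, body


create_body = single_shard_index_body()
url, body = dfs_search_request(INDEX, "不列颠传说中的王")

print("PUT", f"{ES_HOST}/{INDEX}")
print(json.dumps(create_body, ensure_ascii=False))
print("GET", url)
print(json.dumps(body, ensure_ascii=False))
```

For a dataset of a few documents, a single shard is the simpler choice. `dfs_query_then_fetch` keeps the shard layout but adds an extra round trip to collect global term statistics before scoring.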
