How should I understand the query normalization factor (queryNorm) in the ES scoring formula?

As the title says.
 
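The official documentation describes it like this:


The query normalization factor (queryNorm) is an attempt to normalize a query so that the results from one query may be compared with the results of another.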


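But I don't find this explanation helpful, and it doesn't clear up my confusion. Two search results are compared by their scores, and the core of a score is TF, IDF, and the field-length norm (fieldNorm). So doesn't this amount to no explanation at all?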
 
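I also don't really understand the query normalization formula: queryNorm = 1 / √sumOfSquaredWeights, where sumOfSquaredWeights is the sum of the squared IDF of each term in the query. So the query normalization factor depends on IDF, but I still can't see what the factor is for.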
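To make the formula above concrete, here is a minimal Java sketch of it. The class name, method name, and the IDF values 3.0 and 4.0 are made up for illustration, and query and term boosts are assumed to be 1; this is not Lucene's own code.

```java
// Illustrative sketch of: queryNorm = 1 / sqrt(sumOfSquaredWeights)
final class QueryNormSketch {

    /** Sums the squared IDF of each query term, then takes 1 / sqrt of that sum. */
    static double queryNorm(double[] termIdfs) {
        double sumOfSquaredWeights = 0.0;
        for (double idf : termIdfs) {
            sumOfSquaredWeights += idf * idf;
        }
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        double[] termIdfs = {3.0, 4.0}; // made-up IDF values for a two-term query
        double norm = queryNorm(termIdfs);
        System.out.println("queryNorm = " + norm);
        for (double idf : termIdfs) {
            // Every term of the query is multiplied by the same factor.
            System.out.println("idf " + idf + " -> normalized weight " + idf * norm);
        }
    }
}
```

After scaling, the squared normalized weights add up to 1, which is the sense in which the query's IDF vector is "normalized".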
 
Any guidance would be appreciated.
 
 
------------------------ divider ----------------------------
 
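A follow-up question: the idf and fieldNorm values I compute from the formulas found online never match what my cluster computes. Some are off by quite a lot, which is frustrating.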
 
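maxDocs also looks wrong. I have 700,000 documents in total, yet the value shown in each explain output can be off by half. Does maxDocs not mean the total number of documents? But the idf formula for the inverted index says: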
public float idf(long docFreq, long numDocs) {
    // docFreq: number of documents containing the term; numDocs: total number of documents
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}

numDocs should correspond to maxDocs in the explain output, right?
 
Wouldn't that contradict the idf formula?
 
The cluster is on version 2.2.
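As a rough illustration of how much the document count matters, the sketch below evaluates the idf formula quoted above for the full 700,000 documents and for half that number. The class name and the docFreq of 1,000 are made up. One possibility worth checking, stated here as an assumption and not something established in this thread: explain may be reporting statistics for a single shard instead of the whole index, and Lucene's maxDoc also counts deleted documents that have not yet been merged away.

```java
// Illustrative sketch: the same idf formula evaluated with two different document counts.
final class IdfSketch {

    /** idf as quoted in this thread: 1 + ln(numDocs / (docFreq + 1)). */
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long docFreq = 1_000L;      // made-up document frequency
        long allDocs = 700_000L;    // total document count mentioned in the question
        long halfDocs = allDocs / 2; // what explain would show if it saw only half of them

        System.out.println("idf with all docs:  " + idf(docFreq, allDocs));
        System.out.println("idf with half docs: " + idf(docFreq, halfDocs));
    }
}
```

Halving numDocs with docFreq held fixed lowers the idf by ln 2, so a hand calculation based on the whole index will not match an explain output based on a smaller document count.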
 
 

yayg2008


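The purpose of queryNorm is to make computed relevance scores comparable; it does not affect ranking. Within a single query it is the same constant for every document, so it has no effect on the final relevance order. It is simply one of the devices used in Lucene's implementation.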
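A small sketch of why a per-query constant cannot change the ranking: multiplying every document's score by the same positive factor leaves the order unchanged. The class name, the raw scores, and the queryNorm value are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch: scaling all scores of one query by a constant keeps the ranking.
final class RankingInvarianceSketch {

    /** Returns document indices ordered by descending score. */
    static Integer[] rank(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    /** Multiplies every score by the same factor, as queryNorm does within one query. */
    static double[] scale(double[] scores, double factor) {
        double[] scaled = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            scaled[i] = scores[i] * factor;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] rawScores = {2.4, 7.1, 0.9, 5.3}; // made-up scores for four documents
        double queryNorm = 0.13;                   // made-up constant for this query
        System.out.println(Arrays.toString(rank(rawScores)));
        System.out.println(Arrays.toString(rank(scale(rawScores, queryNorm))));
    }
}
```

Both lines print the same order of document indices; only the absolute score values differ.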

kennywu76 - wood@Ctrip


I found a fairly detailed explanation: Notes on Lucene query normalization


When you're trying to understand the nature of your query scores, you've probably heard of TFIDF. TFIDF determines how much weight a given term should be given in a particular field by multiplying the term freq * inverse document frequency.

What you may not realize is TFIDF is actually a reflection of the strength of a term in a field. Lucene also employs a method for determining the weight of a term in the query.

To compute this weight, Lucene uses a query normalization process. Query norms scale every query term's IDF down toward the unit vector. It's a single multiplier applied to every IDF. The ultimate impact is to punish proportionally common terms beyond even their low IDF. This works to the point that:

IDF of proportionally rare terms approaches IDF
IDF of proportionally common terms approach as low as ~0.1 IDF

Important to remember this is entirely contextual. Given VA state laws, think of these terms with their associated doc freqs:

deer: 20
hog: 20
permit: 2000

A search for "deer hog" allows deer and hog to both receive equally scaled IDF. In this case IDF * sqrt(2). In the case of "deer permit" permit gets 1/3 IDF while deer gets its full IDF.

In one context the score for rare term deer might be driven by IDF if paired with an equally rare term, say "hog". In a second context, the score for deer might be 0.1 IDF if a much, much rarer term shows up.
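The deer / hog / permit example can be worked through numerically. The sketch below is only an illustration: the article gives document frequencies but no total document count, so the 10,000 used here is an assumption, as are the class and method names and the choice of the classic idf formula 1 + ln(numDocs / (docFreq + 1)).

```java
import java.util.Arrays;

// Illustrative sketch: per-term weights after query normalization for the article's example.
final class QueryWeightSketch {

    /** Classic idf: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Returns idf * queryNorm for every term, where queryNorm = 1 / sqrt(sum of squared idfs). */
    static double[] normalizedWeights(long[] docFreqs, long numDocs) {
        double[] weights = new double[docFreqs.length];
        double sumOfSquares = 0.0;
        for (int i = 0; i < docFreqs.length; i++) {
            weights[i] = idf(docFreqs[i], numDocs);
            sumOfSquares += weights[i] * weights[i];
        }
        double queryNorm = 1.0 / Math.sqrt(sumOfSquares);
        for (int i = 0; i < weights.length; i++) {
            weights[i] *= queryNorm;
        }
        return weights;
    }

    public static void main(String[] args) {
        long numDocs = 10_000L; // assumed corpus size, not given in the article
        System.out.println("deer hog:    " + Arrays.toString(normalizedWeights(new long[] {20, 20}, numDocs)));
        System.out.println("deer permit: " + Arrays.toString(normalizedWeights(new long[] {20, 2000}, numDocs)));
    }
}
```

Under that assumption the two terms of "deer hog" end up with equal weights of 1/√2 each, while in "deer permit" deer keeps most of the unit length and permit comes out at roughly a third. The same term gets a different weight depending on what it is paired with, which is the contextual effect the article describes.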
 


 
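Put simply, TF/IDF reflects how important a term is to a field. Lucene introduces queryNorm to measure how important a term is to the whole query: the IDFs of all the terms in the query are normalized together, so that IDF becomes a relative value (a rare term gets a higher IDF relative to a common one). The intent is to make scores from different queries comparable with each other.

Judging from the explanation in the official ES documentation, queryNorm often fails to achieve that goal in practice, so scores from two different queries are usually not compared. In other words, queryNorm is a factor that can be ignored.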
