Analysis of the match-query matching process in Elasticsearch tokenized search

Elasticsearch | Author: 夏李俊 | Published 2018-03-14 | Views: 4586

1. Simulating how a string is stored

localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=全能片(前)---TRW-GDB7891AT刹车片自带报警线，无单独报警线号码,卡仕欧,卡仕欧,乘用车,刹车片

The URL above means:

The index is `yigo-redist.1`
The analyzer (`analyzer`) used is `default`, as defined in the index `yigo-redist.1`
The string being analyzed (`text`) is "全能片(前)---TRW-GDB7891AT刹车片自带报警线，无单独报警线号码,卡仕欧,卡仕欧,乘用车,刹车片"
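
The three parts of the request described above can be assembled programmatically. The following is a minimal sketch, not the author's tooling: the helper names `build_analyze_url` and `build_analyze_body` are hypothetical, and the host is taken from the URL in this article. The second helper shows the same two parameters as a JSON body, which some Elasticsearch versions expect instead of query-string parameters.

```python
from urllib.parse import urlencode


def build_analyze_url(host: str, index: str, analyzer: str, text: str) -> str:
    """Build the query-string form of an _analyze request (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII text such as Chinese characters.
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"http://{host}/{index}/_analyze?{query}"


def build_analyze_body(analyzer: str, text: str) -> dict:
    """Build the JSON-body form of the same request (hypothetical helper)."""
    return {"analyzer": analyzer, "text": text}


url = build_analyze_url("localhost:9200", "yigo-redist.1", "default", "gdb7891")
print(url)
print(build_analyze_body("default", "gdb7891"))
```

Building the URL through `urlencode` avoids sending raw Chinese text or punctuation in the query string, which some HTTP clients reject.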

Suppose the result is:

{
  "tokens" : [
    { "token" : "全能", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 1 },
    { "token" : "片", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "前", "start_offset" : 4, "end_offset" : 5, "type" : "CN_CHAR", "position" : 3 },
    { "token" : "trw-gdb7891at", "start_offset" : 9, "end_offset" : 22, "type" : "LETTER", "position" : 4 },
    { "token" : "刹车片", "start_offset" : 22, "end_offset" : 25, "type" : "CN_WORD", "position" : 5 },
    { "token" : "自带", "start_offset" : 25, "end_offset" : 27, "type" : "CN_WORD", "position" : 6 },
    { "token" : "报警", "start_offset" : 27, "end_offset" : 29, "type" : "CN_WORD", "position" : 7 },
    { "token" : "线", "start_offset" : 29, "end_offset" : 30, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "无", "start_offset" : 31, "end_offset" : 32, "type" : "CN_WORD", "position" : 9 },
    { "token" : "单独", "start_offset" : 32, "end_offset" : 34, "type" : "CN_WORD", "position" : 10 },
    { "token" : "报警", "start_offset" : 34, "end_offset" : 36, "type" : "CN_WORD", "position" : 11 },
    { "token" : "线", "start_offset" : 36, "end_offset" : 37, "type" : "CN_CHAR", "position" : 12 },
    { "token" : "号码", "start_offset" : 37, "end_offset" : 39, "type" : "CN_WORD", "position" : 13 },
    { "token" : "卡", "start_offset" : 40, "end_offset" : 41, "type" : "CN_CHAR", "position" : 14 },
    { "token" : "仕", "start_offset" : 41, "end_offset" : 42, "type" : "CN_WORD", "position" : 15 },
    { "token" : "欧", "start_offset" : 42, "end_offset" : 43, "type" : "CN_WORD", "position" : 16 },
    { "token" : "卡", "start_offset" : 44, "end_offset" : 45, "type" : "CN_CHAR", "position" : 17 },
    { "token" : "仕", "start_offset" : 45, "end_offset" : 46, "type" : "CN_WORD", "position" : 18 },
    { "token" : "欧", "start_offset" : 46, "end_offset" : 47, "type" : "CN_WORD", "position" : 19 },
    { "token" : "乘用车", "start_offset" : 48, "end_offset" : 51, "type" : "CN_WORD", "position" : 20 },
    { "token" : "刹车片", "start_offset" : 52, "end_offset" : 55, "type" : "CN_WORD", "position" : 21 }
  ]
}
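
To read the `start_offset` and `end_offset` fields in the result above, it can help to slice the original string with them. This is a small sketch under one assumption: every character in this sample sits in the Basic Multilingual Plane, so Python string indices line up with the offsets reported here. The helper name `token_source` is hypothetical.

```python
# The sample text from step 1, copied verbatim.
TEXT = "全能片(前)---TRW-GDB7891AT刹车片自带报警线，无单独报警线号码,卡仕欧,卡仕欧,乘用车,刹车片"


def token_source(text: str, start_offset: int, end_offset: int) -> str:
    """Return the slice of the original text that a token was produced from."""
    return text[start_offset:end_offset]


raw = token_source(TEXT, 9, 22)
print(raw)          # TRW-GDB7891AT
print(raw.lower())  # trw-gdb7891at, the token as listed in the result
```

The slice shows that the whole part number, including the "trw-" prefix and the "at" suffix, ended up as one lowercased term.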

2. Keyword query

localhost:9200/yigo-redist.1/_analyze?analyzer=default_search&text=gdb7891

The index is `yigo-redist.1`
The analyzer (`analyzer`) used is `default_search`, as defined in the index `yigo-redist.1`
The string being analyzed (`text`) is "gdb7891"

Result:

{
  "tokens" : [
    { "token" : "gdb7891", "start_offset" : 0, "end_offset" : 7, "type" : "LETTER", "position" : 1 }
  ]
}

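3. Analyzing the keyword with the index-time analyzer

localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=gdb7891

The index is `yigo-redist.1`
The analyzer (`analyzer`) used is `default`, as defined in the index `yigo-redist.1`
The string being analyzed (`text`) is "gdb7891"

Result: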

{
  "tokens" : [
    { "token" : "gdb7891", "start_offset" : 0, "end_offset" : 7, "type" : "LETTER", "position" : 1 },
    { "token" : "", "start_offset" : 0, "end_offset" : 7, "type" : "LETTER", "position" : 1 },
    { "token" : "gdb7891", "start_offset" : 0, "end_offset" : 7, "type" : "LETTER", "position" : 1 },
    { "token" : "", "start_offset" : 0, "end_offset" : 3, "type" : "ENGLISH", "position" : 2 },
    { "token" : "gdb", "start_offset" : 0, "end_offset" : 3, "type" : "ENGLISH", "position" : 2 },
    { "token" : "gdb", "start_offset" : 0, "end_offset" : 3, "type" : "ENGLISH", "position" : 2 },
    { "token" : "7891", "start_offset" : 3, "end_offset" : 7, "type" : "ARABIC", "position" : 3 },
    { "token" : "7891", "start_offset" : 3, "end_offset" : 7, "type" : "ARABIC", "position" : 3 },
    { "token" : "", "start_offset" : 3, "end_offset" : 7, "type" : "ARABIC", "position" : 3 }
  ]
}
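
The response above contains repeated and empty tokens, so the set of usable terms is easier to see after filtering. The sketch below is only an illustration of how one might read such a response: `distinct_tokens` is a hypothetical helper, and the dictionary is an abbreviated copy of the result above that keeps only the `token` and `position` fields.

```python
def distinct_tokens(response: dict) -> list:
    """Return non-empty tokens in first-seen order, without duplicates."""
    seen = []
    for entry in response["tokens"]:
        token = entry["token"]
        if token and token not in seen:
            seen.append(token)
    return seen


# Abbreviated copy of the step 3 response (offsets and types omitted).
response = {
    "tokens": [
        {"token": "gdb7891", "position": 1},
        {"token": "", "position": 1},
        {"token": "gdb7891", "position": 1},
        {"token": "", "position": 2},
        {"token": "gdb", "position": 2},
        {"token": "gdb", "position": 2},
        {"token": "7891", "position": 3},
        {"token": "7891", "position": 3},
        {"token": "", "position": 3},
    ]
}

print(distinct_tokens(response))  # ['gdb7891', 'gdb', '7891']
```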

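Summary

Step 1 shows that the stored data "全能片(前)---TRW-GDB7891AT刹车片自带报警线，无单独报警线号码,卡仕欧,卡仕欧,乘用车,刹车片" is split into many term fragments, which are then stored in the index.
Step 2 shows that the keyword "gdb7891" is not split by the search analyzer (`default_search`): the only term available for querying is "gdb7891". The fragments produced in step 1, however, do not include "gdb7891"; the closest one is "trw-gdb7891at". An ordinary match query therefore cannot match the data indexed in step 1.
Step 3 shows that if the same analyzer is used, "gdb7891" is split into "gdb", "7891", and so on, and both of these fragments can find the data indexed in step 1. Because the keyword has been split, though, the query returns more matching data, for example documents matching "gdb", documents matching "7891", and documents matching "gdb7891".
To retrieve the step 1 data while searching with the `default_search` analyzer, use a wildcard query; the pattern "*gdb7891*" matches: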
```
{ "query": { "wildcard": { "description": "*gdb7891*" } } }
```
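
The difference between the match query in step 2 and the wildcard query above can be simulated on the term list from step 1. This is a rough sketch, not how Elasticsearch is implemented: it assumes a match query succeeds when any analyzed query term equals an indexed term exactly, and it uses Python's `fnmatchcase` as a stand-in for wildcard matching against each indexed term. The function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Distinct terms produced for the step 1 document.
INDEXED_TERMS = {
    "全能", "片", "前", "trw-gdb7891at", "刹车片", "自带", "报警", "线",
    "无", "单独", "号码", "卡", "仕", "欧", "乘用车",
}


def match_any_term(query_terms, indexed_terms) -> bool:
    """Simplified match query: true if any query term is an indexed term."""
    return any(term in indexed_terms for term in query_terms)


def wildcard_matches(pattern: str, indexed_terms) -> list:
    """Simplified wildcard query: indexed terms that fit the pattern."""
    return sorted(t for t in indexed_terms if fnmatchcase(t, pattern))


# Step 2: `default_search` leaves the keyword as the single term "gdb7891".
print(match_any_term(["gdb7891"], INDEXED_TERMS))    # False
# The wildcard pattern is compared against whole indexed terms.
print(wildcard_matches("*gdb7891*", INDEXED_TERMS))  # ['trw-gdb7891at']
```

Exact term comparison misses "trw-gdb7891at" because "gdb7891" is only a substring of it, while the leading and trailing `*` let the pattern cover the whole term.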

[Please respect original community content; when reposting, keep or cite the source]
Article URL: http://elasticsearch.cn/article/533


4 comments

orange

Why can't I run your analyzer test example, and how should I change it?
{
"error": {
"root_cause": [
{
"type": "parse_exception",
"reason": "request body or source parameter is required"
}
],
"type": "parse_exception",
"reason": "request body or source parameter is required"
},
"status": 400
}

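orange replying to orange

GET /doctortest4/_analyze?analyzer=default&text=111乘用 is the request I ran; I am using Kibana.

夏李俊

I accessed the Elasticsearch service directly and have not used Kibana. Please send the request directly to the Elasticsearch service.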

tygcs

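"Step 3 shows that if the same analyzer is used, "gdb7891" is split into "gdb", "7891", and so on, and both of these fragments can find the data indexed in step 1. Because the keyword has been split, though, the query returns more matching data, for example documents matching "gdb", documents matching "7891", and documents matching "gdb7891"."

I don't quite understand. Isn't this the tokenization result from step 1?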
{
"token" : "trw-gdb7891at",
"start_offset" : 9,
"end_offset" : 22,
"type" : "LETTER",
"position" : 4
},

Here trw-gdb7891at was not split, but in step 3 the keyword was split, so how can they match?
