elasticsearch-analysis-pinyin updated for ES 2.4.1 and 5.0.0-rc1

Elasticsearch | Author: medcl | Published October 13, 2016 | Views: 4787

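The plugin now supports the latest releases, ES v2.4.1 and ES v5.0.0-rc1.
This update adds several features: multiple configuration options, and splitting of pinyin strings, which is more accurate than the earlier approach that had to be combined with ngram.
For example: liudehuaalibaba13zhuanghan -> liu,de,hua,a,li,ba,ba,13,zhuang,han
See the documentation for configuration details: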
https://github.com/medcl/elast ... inyin

Download:
https://github.com/medcl/elast ... eases

Feel free to test it:

curl -XPUT http://localhost:9200/medcl/ -d '
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true
                }
            }
        }
    }
}'
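For anyone scripting the test instead of pasting the curl command, the same settings can be assembled in Python. This is only a minimal sketch: the option names and values are copied from the request above, while the helper names `build_pinyin_settings` and `create_index` are made up for illustration, and `create_index` assumes a node at http://localhost:9200 with the plugin installed.

```python
import json
import urllib.request


def build_pinyin_settings(tokenizer_name="my_pinyin", analyzer_name="pinyin_analyzer"):
    """Return the index settings from the curl example as a Python dict."""
    return {
        "index": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {"tokenizer": tokenizer_name},
                },
                "tokenizer": {
                    tokenizer_name: {
                        "type": "pinyin",
                        "keep_separate_first_letter": False,
                        "keep_full_pinyin": True,
                        "keep_original": False,
                        "limit_first_letter_length": 16,
                        "lowercase": True,
                    },
                },
            },
        },
    }


def create_index(base_url="http://localhost:9200", index="medcl"):
    """PUT the settings to create the index (hypothetical helper; needs a running node)."""
    body = json.dumps(build_pinyin_settings()).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}/",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `build_pinyin_settings()` alone is safe without a cluster; `create_index()` sends the request.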



curl http://localhost:9200/medcl/_a ... lyzer

{
  "tokens" : [ {
    "token" : "liu",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "de",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "hua",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "a",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "b",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "c",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "d",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "liu",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "de",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "hua",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "wo",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "bu",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "zhi",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 12
  }, {
    "token" : "dao",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 13
  }, {
    "token" : "shi",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 14
  }, {
    "token" : "shui",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 15
  }, {
    "token" : "ldhabcdliudehuaw",
    "start_offset" : 0,
    "end_offset" : 16,
    "type" : "word",
    "position" : 16
  } ]
}

[Please respect the community's original work; keep or credit the source when reposting]
Article URL: http://elasticsearch.cn/article/105


3 comments

chennanlcy

Giving this a manual upvote.

chennanlcy

I tested medcl's pinyin analyzer and found a small problem. For example, searching for "zhanguang", the pinyin of "沾光", is tokenized as follows:
http://172.19.22.124:9200/_analyze?analyzer=pinyin&pretty&text=zhanguang
{
  "tokens" : [
    {
      "token" : "zhang",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "u",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ang",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "zhanguang",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    }
  ]
}
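There are two problems here:
1. It appears to use forward maximum matching only, without looking for the optimal pinyin split, so the split is not very accurate.
2. The token offsets look slightly off. For example, the first token "zhang" should not span 0 to 8.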
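The offset issue in point 2 can be checked mechanically. The sketch below is a hypothetical helper, not part of either plugin: it assumes the input text is already pinyin (Latin letters), so each token should equal the slice of the input that its offsets claim to cover. That assumption does not hold for Chinese-character input or for the first-letter token.

```python
def find_offset_mismatches(text, tokens):
    """Return tokens whose [start_offset, end_offset) slice of text differs from the token."""
    mismatches = []
    for tok in tokens:
        covered = text[tok["start_offset"]:tok["end_offset"]]
        if covered != tok["token"]:
            mismatches.append((tok["token"], tok["start_offset"], tok["end_offset"], covered))
    return mismatches


# Tokens copied from the _analyze output above for the input "zhanguang".
reported = [
    {"token": "zhang", "start_offset": 0, "end_offset": 8},
    {"token": "u", "start_offset": 0, "end_offset": 8},
    {"token": "ang", "start_offset": 0, "end_offset": 8},
    {"token": "zhanguang", "start_offset": 0, "end_offset": 9},
]

for token, start, end, covered in find_offset_mismatches("zhanguang", reported):
    print(f"{token!r} claims offsets {start}-{end}, which cover {covered!r}")
```

Only the last token, which spans the whole input, passes the check; the three split tokens are flagged.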

The lc-pinyin plugin that I wrote uses a "shortest pinyin split is the optimal split" approach, which avoids this problem:
http://172.19.22.124:9200/_analyze?analyzer=lc_search&pretty&text=zhanguang
{
  "tokens" : [
    {
      "token" : "zhan",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "guang",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    }
  ]
}
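The difference between the two strategies can be illustrated with a short Python sketch. It is not the code of either plugin: `SYLLABLES` is a tiny hand-picked subset of pinyin syllables, both function names are made up, and "shortest split" is interpreted here as "fewest syllables covering the whole input", which is an assumption about what lc-pinyin does.

```python
# Tiny illustrative subset; a real pinyin syllable table is far larger.
SYLLABLES = {"a", "an", "ang", "gu", "guang", "zhan", "zhang"}
MAX_LEN = max(len(s) for s in SYLLABLES)


def greedy_split(text):
    """Forward maximum matching: always take the longest syllable at the cursor."""
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(MAX_LEN, len(text) - pos), 0, -1):
            if text[pos:pos + size] in SYLLABLES:
                break
        else:
            size = 1  # no syllable starts here: emit the single character
        tokens.append((text[pos:pos + size], pos, pos + size))
        pos += size
    return tokens


def fewest_syllables_split(text):
    """Dynamic programming: cover the whole text with as few syllables as possible."""
    # best[i] holds the shortest list of (token, start, end) covering text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            piece = text[start:end]
            if best[start] is not None and piece in SYLLABLES:
                candidate = best[start] + [(piece, start, end)]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]  # None when no full cover exists


print(greedy_split("zhanguang"))
print(fewest_syllables_split("zhanguang"))
```

With this table the greedy pass consumes "zhang" first and is left with "u" and "ang", while the dynamic-programming pass finds the two-syllable cover "zhan" + "guang". Each tuple also carries per-token start and end offsets.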

medcl replied to chennanlcy

PRs are welcome.
