
Custom analysis analyzer tokenizes differently from the default tokenizer

Elasticsearch | Author: guoyanbiao520 | Posted: 2020-12-01 | Views: 1614


{
  "settings": {
    "index": {
      "number_of_shards": "5",
      "max_result_window": "1000000000",
      "number_of_replicas": "0"
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_hanlp",
          "char_filter": [
            "html_strip"
          ]
        }
      },
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp_index",
          "enable_stop_dictionary": true,
          "enable_custom_config": true
        }
      }
    }
  },
  "mappings": {
    "repo-index-type": {
      "properties": {
        "resName": {
          "analyzer": "hanlp_index",
          "search_analyzer": "hanlp_index",
          "term_vector": "with_positions_offsets",
          "type": "text",
          "fields": {
            "ag": {
              "ignore_above": 10922,
              "type": "keyword"
            }
          }
        },
        "metadata_Content": {
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer",
          "type": "text"
        }
      }
    }
  }
}
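As shown above, the two fields analyze "国务院常务会议" with a custom analyzer built on hanlp_index (my_analyzer) and with the default hanlp_index, respectively. my_analyzer actually tokenizes as if it were hanlp rather than hanlp_index: it returns '国务院', '常务会议', whereas the stock hanlp_index returns '国务院', '国务', '常务会议', '常务会', '常务', '会议'. Can anyone explain why, once I define a custom tokenizer with type hanlp_index, it behaves like hanlp?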
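
One way to narrow this down is to run the same text through Elasticsearch's _analyze API with each tokenizer option toggled on its own, then compare the token lists. The following is a minimal Python sketch under some assumptions: the helper names (analyze_body, extract_tokens, missing_tokens, fetch_tokens) are hypothetical, the base URL and index name in the usage comment are placeholders, and whether enable_stop_dictionary or enable_custom_config is actually responsible for the difference is not confirmed here. The sketch only helps isolate which setting changes the output.

```python
import json
import urllib.request


def analyze_body(text, analyzer=None, tokenizer=None):
    """Build an _analyze request body.

    Pass exactly one of: an analyzer name, or a tokenizer
    (either a registered name or an inline definition dict).
    """
    if (analyzer is None) == (tokenizer is None):
        raise ValueError("pass exactly one of analyzer or tokenizer")
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    else:
        body["tokenizer"] = tokenizer
    return body


def extract_tokens(response):
    """Pull the token strings out of a parsed _analyze response."""
    return [t["token"] for t in response.get("tokens", [])]


def missing_tokens(expected, actual):
    """Return tokens present in `expected` but absent from `actual`, in order."""
    seen = set(actual)
    return [t for t in expected if t not in seen]


def fetch_tokens(base_url, index, body):
    """POST the body to <base_url>/<index>/_analyze and return the token strings."""
    req = urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return extract_tokens(json.load(resp))


TEXT = "国务院常务会议"

# Each variant changes one thing, so a differing result points at that setting.
REQUESTS = {
    "my_analyzer": analyze_body(TEXT, analyzer="my_analyzer"),
    "builtin hanlp_index": analyze_body(TEXT, tokenizer="hanlp_index"),
    "hanlp_index + stop dictionary": analyze_body(
        TEXT, tokenizer={"type": "hanlp_index", "enable_stop_dictionary": True}
    ),
    "hanlp_index + custom config": analyze_body(
        TEXT, tokenizer={"type": "hanlp_index", "enable_custom_config": True}
    ),
}

# Usage against a live cluster (URL and index name are placeholders):
#   for name, body in REQUESTS.items():
#       print(name, fetch_tokens("http://localhost:9200", "my_index", body))
```

If one of the inline variants reproduces the coarse '国务院', '常务会议' output while the built-in hanlp_index does not, that option is the likely place to look in the plugin's configuration handling.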