How do I configure synonyms in Elasticsearch when using the IK analyzer?

Elasticsearch | Author: dicom | Posted November 24, 2014 | Views: 13931

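I'm using the IK analyzer in ES and want to apply synonyms at index time. How should this be configured? What is the name of IK's tokenizer? Does the use_smart setting still take effect in this setup?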

laigood

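As far as I know, IK does not implement synonyms itself, so you would have to add that yourself. use_smart is just a parameter that you can set as needed. For details see: https://github.com/medcl/elasticsearch-analysis-ik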

dicom

Setting tokenizer: ik and then adding a synonym token filter does give me synonyms, but the use_smart setting then has no effect. I don't know why.
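
One arrangement that may be worth trying here, sketched as a Python dict that mirrors the JSON index settings: put use_smart on a named tokenizer of type ik instead of on the custom analyzer, since a custom analyzer is generally built only from its tokenizer, filter and char_filter entries. That the installed IK plugin version reads use_smart from tokenizer settings is an assumption, and the names my_ik, my_synonym, ik_syno and the synonym file path are hypothetical.

```python
import json

def build_settings(use_smart: bool) -> dict:
    """Build index analysis settings: IK tokenizer followed by a synonym filter."""
    return {
        "index": {
            "analysis": {
                "tokenizer": {
                    # Assumption: this IK plugin version accepts use_smart
                    # as a tokenizer-level setting.
                    "my_ik": {"type": "ik", "use_smart": use_smart}
                },
                "filter": {
                    "my_synonym": {
                        "type": "synonym",
                        # Hypothetical path, relative to the config directory.
                        "synonyms_path": "analysis/synonym.txt",
                    }
                },
                "analyzer": {
                    "ik_syno": {
                        "type": "custom",
                        "tokenizer": "my_ik",       # refers to the tokenizer above
                        "filter": ["my_synonym"],
                    }
                },
            }
        }
    }

settings = build_settings(use_smart=True)
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

The printed JSON could be used as the settings section when creating an index. Whether it changes the segmentation would still need to be checked with the _analyze API.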

defineconst

English terms are analyzed correctly, but Chinese terms are not. For example:
GET /my_index2/_analyze?analyzer=ik_max_word_syno&text=cosmos
{
"tokens": [
{
"token": "universe",
"start_offset": 0,
"end_offset": 6,
"type": "SYNONYM",
"position": 1
},
{
"token": "cosmos",
"start_offset": 0,
"end_offset": 6,
"type": "SYNONYM",
"position": 1
}
]
}

GET /my_index2/_analyze?analyzer=ik_max_word_syno&text=foosball
{
"tokens": [
{
"token": "foozball",
"start_offset": 0,
"end_offset": 8,
"type": "SYNONYM",
"position": 1
},
{
"token": "foosball",
"start_offset": 0,
"end_offset": 8,
"type": "SYNONYM",
"position": 1
}
]
}

GET /index/_analyze?analyzer=ik
{
"text": "西红柿"
}
{
"tokens": [
{
"token": "text",
"start_offset": 5,
"end_offset": 9,
"type": "ENGLISH",
"position": 1
},
{
"token": "西红柿",
"start_offset": 13,
"end_offset": 16,
"type": "CN_WORD",
"position": 2
}
]
}

GET /my_index2/_analyze?analyzer=ik_max_word_syno&text=西红柿
{
"tokens": []
}
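
The thread does not establish why the last request returns no tokens. One thing worth ruling out, purely as an assumption, is that the Chinese text in the query string does not reach Elasticsearch as UTF-8. A minimal Python sketch that percent-encodes the text before building the same URL (the host and port are assumed defaults, and analyze_url is a made-up helper name):

```python
from urllib.parse import quote, urlencode

def analyze_url(index: str, analyzer: str, text: str,
                host: str = "http://localhost:9200") -> str:
    """Return an _analyze URL whose query parameters are UTF-8 percent-encoded."""
    params = urlencode({"analyzer": analyzer, "text": text}, quote_via=quote)
    return f"{host}/{quote(index)}/_analyze?{params}"

url = analyze_url("my_index2", "ik_max_word_syno", "西红柿")
print(url)
# http://localhost:9200/my_index2/_analyze?analyzer=ik_max_word_syno&text=%E8%A5%BF%E7%BA%A2%E6%9F%BF
```

Separately, in the third request above the JSON key text comes back as a token of its own, which suggests that this Elasticsearch version analyzes the whole request body as raw text.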

defineconst

For details, see medcl's write-up: http://www.ifunit.com/29/elasticsearch配置同义词
Step 1: put the synonym dictionary in place
elasticsearch-1.6.0-self\config\analysis\synonym.txt
Contents:
#Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos
西红柿, 番茄
马铃薯, 土豆
aa,bb
Step 2: configure elasticsearch.yml. I copied it from version 1.0.0 of medcl's rtf distribution (I already have the basic and management plugins set up for 1.6.0; contact me if you want them).
Contents (excerpt):
ik_syno:
    type: custom
    tokenizer: ik
    filter: my_synonym
ik_max_word_syno:
    type: custom
    tokenizer: ik
    filter: my_synonym
    use_smart: false
#index.analysis.analyzer.default.type: mmseg
index.analysis.analyzer.default.type: ik
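
To make the dictionary format above concrete, here is a rough Python sketch of how comma-separated lines like these can be read as groups of equivalent terms and applied to a token stream. It only illustrates the file format: it is not the parser Elasticsearch uses, it ignores multi-word entries such as "i pod" and explicit "=>" rules, and parse_synonyms and expand are names made up for this example.

```python
def parse_synonyms(text: str) -> dict:
    """Map each term to the full group of terms listed on the same line."""
    lookup = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blank lines and comments
            continue
        group = [term.strip() for term in line.split(",") if term.strip()]
        for term in group:
            lookup[term] = group
    return lookup

def expand(tokens, lookup):
    """Emit every synonym of each token at that token's own position."""
    return [(pos, syn)
            for pos, tok in enumerate(tokens, start=1)
            for syn in lookup.get(tok, [tok])]

SAMPLE = """#Examples:
universe , cosmos
西红柿, 番茄
"""

lookup = parse_synonyms(SAMPLE)
print(expand(["西红柿"], lookup))   # [(1, '西红柿'), (1, '番茄')]
```

Both terms of a group are emitted at the same position, which is the behavior a synonym token filter is expected to show in _analyze output.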

defineconst

Step 3: test the analyzer
GET /index/_analyze?analyzer=ik
{
"text": "西红柿"
}
Result:
{
"tokens": [
{
"token": "text",
"start_offset": 5,
"end_offset": 9,
"type": "ENGLISH",
"position": 1
},
{
"token": "西红柿",
"start_offset": 13,
"end_offset": 16,
"type": "CN_WORD",
"position": 2
}
]
}

defineconst

GET /index/_analyze?analyzer=ik_max_word_syno
{
"text": "西红柿"
}
Result:
{
"tokens": [
{
"token": "text",
"start_offset": 5,
"end_offset": 9,
"type": "ENGLISH",
"position": 1
},
{
"token": "西红柿",
"start_offset": 13,
"end_offset": 16,
"type": "SYNONYM",
"position": 2
},
{
"token": "番茄",
"start_offset": 13,
"end_offset": 16,
"type": "SYNONYM",
"position": 2
}
]
}

defineconst

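So it appears that the analyzer has to be specified in the query itself (I'm still working this out and am only a beginner), like this: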
POST my_index2/fulltext/_search
{
"query": {
"query_string": {
"text": {
"query": "西红柿",
"analyzer": "ik_max_word_syno"
}
}
},
"highlight": {
"pre_tags": [
"<tag1>",
"<tag2>"
],
"post_tags": [
"</tag1>",
"</tag2>"
],
"fields": {
"content": {}
}
}
}

Result:

{
"took": 11,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.19070336,
"hits": [
{
"_index": "my_index2",
"_type": "fulltext",
"_id": "1",
"_score": 0.19070336,
"_source": {
"content": "USA 西红柿Elizabeth is the English queen of united states"
}
},
{
"_index": "my_index2",
"_type": "fulltext",
"_id": "3",
"_score": 0.19070336,
"_source": {
"content": "西红柿蛋汤The United States is wealthy"
}
},
{
"_index": "my_index2",
"_type": "fulltext",
"_id": "4",
"_score": 0.02250402,
"_source": {
"content": "The United States is wealthy番茄炒鸡蛋"
}
}
]
}
}
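
For comparison, the query_string query is usually documented with query, default_field and analyzer as direct keys rather than nested under a field name. A hedged Python sketch that builds such a request body for the same search follows. Using content as the default field is an assumption based on the highlighted field, and whether this form returns exactly the hits above has not been checked; synonym_search_body is a made-up helper name.

```python
import json

def synonym_search_body(field: str, text: str, analyzer: str) -> dict:
    """Build a query_string search body that analyzes the query with `analyzer`."""
    return {
        "query": {
            "query_string": {
                "default_field": field,   # field searched when the query names none
                "query": text,
                "analyzer": analyzer,     # query-time analyzer with the synonym filter
            }
        },
        "highlight": {
            "pre_tags": ["<tag1>", "<tag2>"],
            "post_tags": ["</tag1>", "</tag2>"],
            "fields": {field: {}},
        },
    }

body = synonym_search_body("content", "西红柿", "ik_max_word_syno")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be sent as the body of the POST my_index2/fulltext/_search request shown earlier.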

tianzhaixing - IT guy born in the 80s

## Synonym configuration

### step 1

Add this as the last line of elasticsearch.yml:
index.analysis.analyzer.default.type: ik

### step 2

Place synonyms.txt in the elasticsearch-2.3.1/config directory.

synonyms.txt must be UTF-8 encoded, with the following contents:

    #Example:
    ipod, i-pod, i pod
    foozball , foosball
    universe , cosmos
    西红柿, 番茄
    马铃薯, 土豆
    aa, bb

### step 3

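Create the index together with its analysis settings and type mapping (the body carries both settings and mappings, so the request goes to the index itself rather than to _mapping):
curl -XPUT localhost:9200/test?pretty -d '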
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"jt_cn": {
"type": "custom",
"use_smart": "true",
"tokenizer": "ik_smart",
"filter": ["jt_tfr","jt_sfr"],
"char_filter": ["jt_cfr"]
},
"ik_smart": {
"type": "ik",
"use_smart": "true"
},
"ik_max_word": {
"type": "ik",
"use_smart": "false"
}
},
"filter": {
"jt_tfr": {
"type": "stop",
"stopwords": [" "]
},
"jt_sfr": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
},
"char_filter": {
"jt_cfr": {
"type": "mapping",
"mappings": [
"| => \\|"
]
}
}
}
}
},
"mappings": {
"solution": {
"properties": {
"title": {
"include_in_all": true,
"analyzer": "jt_cn",
"term_vector": "with_positions_offsets",
"boost": 8,
"store": true,
"type": "string"
}
}
}
}
}
'

### step 4
curl -XPUT localhost:9200/test/solution/1 -d '
{
"title": "番茄"
}
'

curl -XPUT localhost:9200/test/solution/2 -d '
{
"title": "西红柿"
}
'



### step 5
curl -XPOST 'localhost:9200/test/solution/_search?pretty' -d '
{
"query": {
"query_string": {
"title": {
"query": "西红柿",
"analyzer": "jt_cn"
}
}
},
"highlight": {
"pre_tags": [
"<tag1>",
"<tag2>"
],
"post_tags": [
"</tag1>",
"</tag2>"
],
"fields": {
"title": {}
}
}
}
'


Result:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.4500804,
"hits": [
{
"_index": "test",
"_type": "solution",
"_id": "1",
"_score": 0.4500804,
"_source": {
"title": "西红柿"
}
},
{
"_index": "test",
"_type": "solution",
"_id": "2",
"_score": 0.36006433,
"_source": {
"title": "番茄"
}
}
]
}
}

wuyh

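I followed the instructions I found online:
I cloned the ik project from git, built it with Maven, found the zip package, and unpacked it into the /plugins/ik/ folder.
Then I added index.analysis.analyzer.default.type: ik as the last line of elasticsearch.yml.
After that I restarted elasticsearch, and now it won't start. The console shows an exception, but it flashes by too quickly to read.
How do I fix this?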
