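Elasticsearch term vectors hold per-term information such as positions and term frequencies. This information is useful for statistics or for building other features. This article shows how to use term vectors and how to retrieve them through the Java client.
To use term vectors, first set the "term_vector" attribute on the field in the mapping. By default Elasticsearch does not store term vectors, because the extra metadata increases the size of the index.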
PUT qa_test_02
{
  "mappings": {
    "qa_test": {
      "dynamic": "strict",
      "_all": {
        "enabled": false
      },
      "properties": {
        "question": {
          "properties": {
            "cate": {
              "type": "keyword"
            },
            "desc": {
              "type": "text",
              "store": true,
              "term_vector": "with_positions_offsets_payloads",
              "analyzer": "ik_smart"
            },
            "time": {
              "type": "date",
              "store": true,
              "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
            },
            "title": {
              "type": "text",
              "store": true,
              "term_vector": "with_positions_offsets_payloads",
              "analyzer": "ik_smart"
            }
          }
        },
        "updatetime": {
          "type": "date",
          "store": true,
          "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "requests": {
        "cache": {
          "enable": "true"
        }
      },
      "number_of_replicas": "1"
    }
  }
}
Note the "term_vector" attribute on the "title" field in the example.
Next, index a document:
PUT qa_test_02/qa_test/1
{
  "question": {
    "cate": [
      "装修流程",
      "其它"
    ],
    "desc": "筒灯,大洋和索正这两个牌子,哪个好?希望内行的朋友告知一下,谢谢!",
    "time": "2016-07-02 19:59:00",
    "title": "筒灯大洋和索正这两个牌子哪个好"
  },
  "updatetime": 1467503940000
}
Now let's look at the term vector information for the question.title field of this document:
GET qa_test_02/qa_test/1/_termvectors
{
  "fields": [
    "question.title"
  ],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}
The result looks roughly like this:
{
  "_index": "qa_test_02",
  "_type": "qa_test",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "question.title": {
      "field_statistics": {
        "sum_doc_freq": 9,
        "doc_count": 1,
        "sum_ttf": 9
      },
      "terms": {
        "和": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 4,
              "end_offset": 5
            }
          ]
        },
        "哪个": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 7,
              "start_offset": 12,
              "end_offset": 14
            }
          ]
        },
        "大洋": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 2,
              "end_offset": 4
            }
          ]
        },
        "好": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 8,
              "start_offset": 14,
              "end_offset": 15
            }
          ]
        },
        "正": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "start_offset": 6,
              "end_offset": 7
            }
          ]
        },
        "牌子": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 6,
              "start_offset": 10,
              "end_offset": 12
            }
          ]
        },
        "筒灯": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "索": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "start_offset": 5,
              "end_offset": 6
            }
          ]
        },
        "这两个": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 5,
              "start_offset": 7,
              "end_offset": 10
            }
          ]
        }
      }
    }
  }
}
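To make the position and offset numbers in this response concrete, here is a minimal plain-JDK sketch that recomputes them for the sample title. It does not call Elasticsearch or the ik_smart analyzer: the token list is copied by hand from the response, and the names TermVectorOffsets, locate and termFreq are made up for illustration. With a single document in the index, sum_ttf equals the number of token occurrences and sum_doc_freq equals the number of distinct terms, which is why both are 9.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TermVectorOffsets {

    /** One token occurrence: its position in the token stream plus character offsets. */
    static final class Token {
        final String term;
        final int position;
        final int startOffset;
        final int endOffset;

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + " position=" + position
                    + " start_offset=" + startOffset + " end_offset=" + endOffset;
        }
    }

    /**
     * Assigns positions and offsets to terms that occur in the text in the given order.
     * Assumption: the terms appear left to right, as they do for the sample title.
     */
    static List<Token> locate(String text, String[] terms) {
        List<Token> tokens = new ArrayList<>();
        int cursor = 0;
        for (int position = 0; position < terms.length; position++) {
            int start = text.indexOf(terms[position], cursor);
            if (start < 0) {
                throw new IllegalArgumentException("term not found: " + terms[position]);
            }
            int end = start + terms[position].length();
            tokens.add(new Token(terms[position], position, start, end));
            cursor = end;
        }
        return tokens;
    }

    /** term_freq: how many times each term occurs within this one document. */
    static Map<String, Integer> termFreq(List<Token> tokens) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (Token token : tokens) {
            freq.merge(token.term, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String title = "筒灯大洋和索正这两个牌子哪个好";
        // Token list copied from the response above (not produced by a real analyzer here).
        String[] terms = {"筒灯", "大洋", "和", "索", "正", "这两个", "牌子", "哪个", "好"};

        List<Token> tokens = locate(title, terms);
        for (Token token : tokens) {
            System.out.println(token);
        }

        // Single-document index: sum_ttf = token occurrences, sum_doc_freq = distinct terms.
        System.out.println("sum_ttf=" + tokens.size());
        System.out.println("sum_doc_freq=" + termFreq(tokens).size());
    }
}
```

The offsets are counted in Java char units, which matches the response for this title because every character in it is a single char.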
Next, let's see how to fetch term vectors from Java code. Here is the code:
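// Imports (package names as in the Elasticsearch 5.x transport client and Lucene):
import java.util.Iterator;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.action.termvectors.TermVectorsResponse;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentType;

// Approach 1: request the term vectors of one document and print them as JSON.
TermVectorsResponse termVectorResponse = client.prepareTermVectors()
        .setIndex(sourceindexname)
        .setType(sourceindextype)
        .setId(id)
        .setSelectedFields(fieldname)
        .setTermStatistics(true) // also return ttf and doc_freq
        .execute()
        .actionGet();
XContentBuilder builder = XContentFactory.contentBuilder(XContentType.JSON);
termVectorResponse.toXContent(builder, null);
System.out.println(builder.string());

// Approach 2: walk the fields and their terms with the Lucene API.
Fields fields = termVectorResponse.getFields();
Iterator<String> iterator = fields.iterator();
while (iterator.hasNext()) {
    String field = iterator.next();
    Terms terms = fields.terms(field);
    TermsEnum termsEnum = terms.iterator();
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // Print each term followed by its total term frequency.
        System.out.println(term.utf8ToString() + " " + termsEnum.totalTermFreq());
    }
}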
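The code that obtains the TermVectorsResponse is straightforward: it sets the index name, the type, the document id, and the options that control what is returned.
What remains is reading the term vector data for each term, and there are two approaches. The first calls toXContent on the TermVectorsResponse to fill an XContentBuilder, which yields the same JSON as the DSL query above. The second iterates over the fields through the Fields iterator and obtains a TermsEnum for each one; readers who know Lucene will probably find the second approach more familiar.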
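As an illustration of the statistics use case mentioned at the start, the term and frequency pairs printed by the second approach can be collected into a map and ranked. The sketch below is plain JDK code under stated assumptions: the class TermFrequencyRanking and the method topTerms are hypothetical names, and the frequencies in main are invented sample values, not output from a real index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TermFrequencyRanking {

    /** Returns up to n entries with the highest frequency; ties are ordered by term. */
    static List<Map.Entry<String, Long>> topTerms(Map<String, Long> freqs, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(freqs.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Long.compare(b.getValue(), a.getValue()); // descending frequency
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Invented sample values standing in for term -> totalTermFreq pairs.
        Map<String, Long> freqs = new LinkedHashMap<>();
        freqs.put("和", 1L);
        freqs.put("牌子", 3L);
        freqs.put("筒灯", 5L);
        freqs.put("好", 2L);

        for (Map.Entry<String, Long> entry : topTerms(freqs, 2)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

In real code the map would be filled inside the TermsEnum loop, with term.utf8ToString() as the key and termsEnum.totalTermFreq() as the value.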
[Please respect the community's original work; when reposting, keep or cite the source.]
Article URL: http://elasticsearch.cn/article/461