Getting termvectors with the Java client

Author: JiaShiwen   Published: 2018-01-19

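Elasticsearch termvectors record information about each term, such as its positions and frequency. This information can be used for statistics or as a building block for other features. This article shows how to use termvectors and how to fetch them through the Java client.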

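To use termvectors, first set the "term_vector" property on the field in the mapping. By default Elasticsearch does not store term vectors, because the extra metadata increases the size of the index.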

PUT qa_test_02
{
  "mappings": {
    "qa_test": {
      "dynamic": "strict",
      "_all": {
        "enabled": false
      },
      "properties": {
        "question": {
          "properties": {
            "cate": {
              "type": "keyword"
            },
            "desc": {
              "type": "text",
              "store": true,
              "term_vector": "with_positions_offsets_payloads",
              "analyzer": "ik_smart"
            },
            "time": {
              "type": "date",
              "store": true,
              "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
            },
            "title": {
              "type": "text",
              "store": true,
              "term_vector": "with_positions_offsets_payloads",
              "analyzer": "ik_smart"
            }
          }
        },
        "updatetime": {
          "type": "date",
          "store": true,
          "format": "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "requests": {
        "cache": {
          "enable": "true"
        }
      },
      "number_of_replicas": "1"
    }
  }
}

Note the "term_vector" property on "title" in the example.

Next, index a document:

PUT qa_test_02/qa_test/1
{
  "question": {
    "cate": [
      "装修流程",
      "其它"
    ],
    "desc": "筒灯,大洋和索正这两个牌子,哪个好?希望内行的朋友告知一下,谢谢!",
    "time": "2016-07-02 19:59:00",
    "title": "筒灯大洋和索正这两个牌子哪个好"
  },
  "updatetime": 1467503940000
}

Now let's look at the termvector information for the question.title field of this document:

GET qa_test_02/qa_test/1/_termvectors
{
  "fields": [
    "question.title"
  ],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

The result looks roughly like this:

{
  "_index": "qa_test_02",
  "_type": "qa_test",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "question.title": {
      "field_statistics": {
        "sum_doc_freq": 9,
        "doc_count": 1,
        "sum_ttf": 9
      },
      "terms": {
        "和": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 4,
              "end_offset": 5
            }
          ]
        },
        "哪个": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 7,
              "start_offset": 12,
              "end_offset": 14
            }
          ]
        },
        "大洋": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 2,
              "end_offset": 4
            }
          ]
        },
        "好": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 8,
              "start_offset": 14,
              "end_offset": 15
            }
          ]
        },
        "正": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "start_offset": 6,
              "end_offset": 7
            }
          ]
        },
        "牌子": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 6,
              "start_offset": 10,
              "end_offset": 12
            }
          ]
        },
        "筒灯": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "索": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "start_offset": 5,
              "end_offset": 6
            }
          ]
        },
        "这两个": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 5,
              "start_offset": 7,
              "end_offset": 10
            }
          ]
        }
      }
    }
  }
}
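
To make the offsets in the response concrete, here is a small standalone Java sketch (plain JDK, no Elasticsearch dependency). It assumes that start_offset and end_offset are character offsets into the original title string, with the end offset exclusive, which matches the numbers shown above. The class name TermOffsetDemo and the helper tokenText are made up for this illustration and are not part of any Elasticsearch API.

```java
class TermOffsetDemo {

    // Returns the part of the text covered by [startOffset, endOffset).
    static String tokenText(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        // The title that was indexed in the example document.
        String title = "筒灯大洋和索正这两个牌子哪个好";

        // {start_offset, end_offset} pairs copied from the response above,
        // ordered by "position" (0 to 8).
        int[][] offsets = {
            {0, 2}, {2, 4}, {4, 5}, {5, 6}, {6, 7},
            {7, 10}, {10, 12}, {12, 14}, {14, 15}
        };

        // Cutting the title at each pair of offsets gives back the term
        // that ik_smart produced at that position.
        for (int position = 0; position < offsets.length; position++) {
            String term = tokenText(title, offsets[position][0], offsets[position][1]);
            System.out.println(position + " " + term);
        }
    }
}
```

Running it prints one line per position, starting with "0 筒灯" and ending with "8 好", so the nine slices are exactly the nine keys under "terms" in the response. All characters in this title fit in a single Java char; text containing characters outside the Basic Multilingual Plane would need more care when slicing.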

Next, here is how to fetch termvectors from Java code. Straight to the code:

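// Imports this snippet needs:
import java.util.Iterator;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.action.termvectors.TermVectorsResponse;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentType;

// client, sourceindexname, sourceindextype, id and fieldname are defined elsewhere.
TermVectorsResponse termVectorResponse = client.prepareTermVectors()
        .setIndex(sourceindexname).setType(sourceindextype).setId(id)
        .setSelectedFields(fieldname).setTermStatistics(true)
        .execute().actionGet();

// Option 1: render the whole response as JSON.
XContentBuilder builder = XContentFactory.contentBuilder(XContentType.JSON);
termVectorResponse.toXContent(builder, null);
System.out.println(builder.string());

// Option 2: walk the fields and their terms with the Lucene API.
Fields fields = termVectorResponse.getFields();
Iterator<String> iterator = fields.iterator();
while (iterator.hasNext()) {
    String field = iterator.next();
    Terms terms = fields.terms(field);
    TermsEnum termsEnum = terms.iterator();
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // Print each term with its total term frequency (ttf).
        System.out.println(term.utf8ToString() + " " + termsEnum.totalTermFreq());
    }
}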

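The code that obtains the TermVectorsResponse is easy to follow: it sets the index name, the type, the document id, and the options that should be included in the response.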

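Next comes reading the termvector of individual terms, and there are two options. The first calls toXContent on the TermVectorsResponse to fill an XContentBuilder, which yields the same JSON as the DSL query above. The second iterates over the field names through the Fields iterator and obtains a TermsEnum for each field; readers who know Lucene will likely find the second approach more familiar.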
