How can Elasticsearch indexing performance be improved for PDF and similar file types?

Elasticsearch | Author: ggchangan | Posted 2016-05-20 | Views: 11878

1. Environment
elasticsearch: 2.1.1
Files are indexed with the elasticsearch-mapper-attachments plugin
File type: PDF
File size: 35 MB
2. Performance
Time spent in the core indexing call: 112924 ms = 112.9 s
The core indexing call is: client.prepareIndex(index, ATTACHMENT_TYPE, id.toString()).setSource(source).execute().actionGet();

Put simply, indexing a 35 MB PDF this way took 112.9 s, which is unacceptable in a near-real-time setting.

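Questions:
1. Is this the expected order of magnitude for indexing files, i.e. is this behavior normal? If it is, I should optimize the import strategy instead.
2. If it is not normal, is there a better approach?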

6 replies

kennywu76 - Wood

Upvoted by: ouyangDD, cxy

What does your mapping look like? I ran a test on my own machine (MacBook Pro): indexing an English PDF of about 42 MB from Python took 3.775 s.

The test used ES 2.2.2 with the Mapper Attachments Plugin.
The mapping of the test index:

POST /trying-out-mapper-attachments
{
   "mappings": {
      "ebook": {
         "properties": {
            "file" : {
                "type" : "attachment",
                "fields" : {
                    "content" : {"index" : "analyzed"},
                    "title" : {"store" : "yes"},
                    "date" : {"store" : "yes"},
                    "author" : {"analyzer" : "standard"},
                    "keywords" : {"store" : "yes"},
                    "content_type" : {"store" : "yes"},
                    "content_length" : {"store" : "yes"},
                    "language" : {"store" : "yes"}
                }
            }
         }
      }
   }
}

Size of the test PDF:

-rwxr-xr-x  1 xgwu  wheel  42471315  6 22 17:09 hfdp.pdf

Python code used for the test:

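import base64
from elasticsearch import Elasticsearch

es = Elasticsearch()  # connects to localhost:9200 by default

# Read the PDF and base64-encode it; the attachment field takes base64 text.
with open('/tmp/hfdp.pdf', 'rb') as f:
    # decode() turns the bytes from b64encode into a JSON-serializable str on Python 3
    buff = base64.b64encode(f.read()).decode('ascii')

es.index(index="trying-out-mapper-attachments", doc_type="ebook", body={"file": buff})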

The measured time (the 3.775 s quoted above) covers the whole run: reading the file, base64 encoding, and the call to the ES index API.
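Since that figure lumps three steps together, it can help to time each step on its own when comparing against a slow run. The following is only a minimal sketch: `time_index_phases` and its `index_fn` parameter are hypothetical names introduced here, not part of the original test, and `index_fn` is assumed to be any callable that sends the document body to ES.

```python
import base64
import time


def time_index_phases(path, index_fn):
    """Time file read, base64 encoding, and indexing separately.

    index_fn is a hypothetical callable that receives the document body
    and performs the actual index request. Returns seconds per phase.
    """
    timings = {}

    start = time.perf_counter()
    with open(path, "rb") as f:
        raw = f.read()
    timings["read"] = time.perf_counter() - start

    start = time.perf_counter()
    encoded = base64.b64encode(raw).decode("ascii")
    timings["encode"] = time.perf_counter() - start

    start = time.perf_counter()
    index_fn({"file": encoded})
    timings["index"] = time.perf_counter() - start

    return timings
```

With the client from the test above, `index_fn` could be something like `lambda body: es.index(index="trying-out-mapper-attachments", doc_type="ebook", body=body)`. If nearly all the time lands in the "index" phase, the cost is on the ES side (text extraction and analysis) rather than in reading or encoding the file.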

A full-text search on the content field took 61 ms:

POST /trying-out-mapper-attachments/_search
{
   "query": {
      "match": {
         "file.content": "java"
      }
   },
   "fields": [
      "file.title",
      "file.content_type",
      "file.content_length",
      "file.date"
   ]
}

The query matches correctly:

{
   "took": 61,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.014383726,
      "hits": [
         {
            "_index": "trying-out-mapper-attachments",
            "_type": "ebook",
            "_id": "AVzPE-97uN82ZvxDdc3x",
            "_score": 0.014383726,
            "fields": {
               "file.content_length": [
                  "42471315"
               ],
               "file.date": [
                  "2011-10-13T07:37:00Z"
               ],
               "file.content_type": [
                  "application/pdf"
               ],
               "file.title": [
                  "Head First Design Patterns"
               ]
            }
         }
      ]
   }
}
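Note that each requested field comes back as a list, even when it holds a single value. A small sketch of unpacking a hit on the Python side follows; `flatten_fields` is a hypothetical helper introduced for illustration, and the `response` literal is a trimmed copy of the result above.

```python
def flatten_fields(hit):
    """Map each returned field name to its first value (ES returns lists)."""
    return {
        name: values[0]
        for name, values in hit.get("fields", {}).items()
        if values
    }


# Trimmed copy of the search response shown above.
response = {
    "hits": {
        "hits": [
            {
                "_id": "AVzPE-97uN82ZvxDdc3x",
                "fields": {
                    "file.content_length": ["42471315"],
                    "file.content_type": ["application/pdf"],
                    "file.title": ["Head First Design Patterns"],
                },
            }
        ]
    }
}

for hit in response["hits"]["hits"]:
    print(hit["_id"], flatten_fields(hit))
```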