We use ES to store user profiles; the basic requirement is querying and aggregating by tag. There are currently 30 tags in total, and we expect that to grow to at least a hundred.
The raw data is 280 million records, about 1.2 TB; the final index comes to 15 billion documents, and all the tags are nested type.
The current cluster machines have 128 GB RAM and 40 cores each.
A query plus aggregations takes roughly 10+ seconds.
Experts, does this look normal? To be honest, we are still in the technology-evaluation stage and don't know ES very well yet.
4 replies
kennywu76 - Wood
Upvoted by: cheese, youryida, machao
For this use case the tags may not need nested at all; a plain object type should do. The effect is equivalent to a series of flat attributes:
tag_a.code
tag_b.code
....
nested documents are mainly for scenarios where the outer document and the inner documents need a join, and the independence of each inner object must be preserved. In the index, every embedded object actually becomes its own document, so the number of documents actually indexed is far larger than the number of outer documents. With a plain object type, the embedded objects are simply flattened into attributes of the parent document, so far fewer documents are generated, and both queries and aggregations are much faster.
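For illustration, a minimal sketch of one tag remapped as a plain object, reusing the 2.x field definition from the mapping below (an inner object without "type": "nested" defaults to object):

    "tag_a": {
      "properties": {
        "code": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }

Queries can then address tag_a.code directly with an ordinary term filter, with no nested wrapper.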
dixingxing
Upvoted by: AlixMu
Our scenario is to analyze, among female users interested in the BMW 3 Series, the distribution of the tags "owns a car", "province", and "brands of interest".
The corresponding mapping:
{
"index1": {
"mappings": {
"tag_type": {
"dynamic": "false",
"_all": {
"enabled": false
},
"properties": {
"tab_a": {
"type": "nested",
"properties": {
"code": {
"type": "string",
"index": "not_analyzed"
}
}
},
"tab_b": {
"type": "nested",
"properties": {
"code": {
"type": "string",
"index": "not_analyzed"
}
}
},
......
}
}
}
}
}
查询:
GET /index1/_search/
{
"size": 0,
"query": {
"filtered": {
"filter": {
"and": {
"filters": [
{
"nested": {
"path": "tag_a",
"filter": {
"term": {
"tag_a.code": "1"
}
}
}
},
{
"nested": {
"path": "tag_b",
"filter": {
"term": {
"tag_b.code": "01"
}
}
}
}
]
}
}
}
},
"aggs": {
"name1": {
"nested": {
"path": "tag_c"
},
"aggs": {
"name1_1": {
"terms": {
"field": "tag_c.code",
"size": "2"
}
}
}
},
"name2": {
"nested": {
"path": "tag_d"
},
"aggs": {
"name2_1": {
"terms": {
"field": "tag_d.code",
"size": "2"
}
}
}
},
"name3": {
"nested": {
"path": "tag_e"
},
"aggs": {
"name3_1": {
"terms": {
"field": "tag_e.code",
"size": "2"
}
}
}
}
}
}
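(A side note: filtered queries and and filters are deprecated as of ES 2.0. Under that assumption, an equivalent bool form would look roughly like the sketch below, with the aggs block unchanged; the response that follows was produced by the original query above.)

GET /index1/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "nested": {
            "path": "tag_a",
            "query": {
              "term": {
                "tag_a.code": "1"
              }
            }
          }
        },
        {
          "nested": {
            "path": "tag_b",
            "query": {
              "term": {
                "tag_b.code": "01"
              }
            }
          }
        }
      ]
    }
  }
}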
Response:
{
"took": 5315,
"timed_out": false,
"_shards": {
"total": 8,
"successful": 8,
"failed": 0
},
"hits": {
"total": 66023052,
"max_score": 0,
"hits":
},
"aggregations": {
"name3": {
"doc_count": 35795694,
"name3_1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 22021428
},
{
"key": "0",
"doc_count": 13774266
}
]
}
},
"name2": {
"doc_count": 82357236,
"name2_1": {
"doc_count_error_upper_bound": 130740,
"sum_other_doc_count": 39859740,
"buckets": [
{
"key": "3",
"doc_count": 24093930
},
{
"key": "17",
"doc_count": 15408540
}
]
}
},
"name1": {
"doc_count": 12053064,
"name1_1": {
"doc_count_error_upper_bound": 14758,
"sum_other_doc_count": 12006666,
"buckets": [
{
"key": "4403002868",
"doc_count": 24936
},
{
"key": "4403002857",
"doc_count": 21462
}
]
}
}
}
}
We found that the more hits there are, the longer the query takes, and the more tags we aggregate on, the longer it takes.
For example, if we drop the tag_b filter from the query above, hits grows to 246,596,466 and the response time rises to 10 seconds.
A load test with 10 threads was not encouraging either: 90% of responses completed within 16 seconds.
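A related accuracy note: name1_1 and name2_1 above report a non-zero doc_count_error_upper_bound, so their top buckets are only approximate. Raising shard_size on the terms aggregation is the usual way to trade some speed for accuracy; a sketch for one of them (the value 1000 is an arbitrary assumption):

"name2": {
  "nested": {
    "path": "tag_d"
  },
  "aggs": {
    "name2_1": {
      "terms": {
        "field": "tag_d.code",
        "size": 2,
        "shard_size": 1000
      }
    }
  }
}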
Possible next test directions:
1. Reduce the size of the raw data
2. Disable _source (a sketch follows below)
We would appreciate any optimization ideas for our scenario.
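For direction 2, disabling _source is a type-level mapping setting in 2.x; a minimal sketch (note the trade-off: without _source there is no reindexing, no update API, and no highlighting, since the original JSON is no longer stored):

"mappings": {
  "tag_type": {
    "_source": {
      "enabled": false
    },
    ......
  }
}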
Finally, some configuration details.
Server specs:
8 servers, 128 GB RAM and 40 cores each (no SSD)
Index size:
748 GB (1.46 TB including replicas)
"settings": {
"index": {
"creation_date": "1489748059778",
"refresh_interval": "-1",
"number_of_shards": "8",
"number_of_replicas": "1",
"uuid": "bRhFkrHJQdCZrV5fF9ilmQ",
"version": {
"created": "2040499"
}
}
}
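The "refresh_interval": "-1" disables periodic refresh, which is typical while bulk loading; assuming it should be turned back on once the index serves live queries, the dynamic settings update would be roughly:

PUT /index1/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}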
The segments were merged down to 80 segments (counting replicas, 10 segments per server); the merge call is sketched below.
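Presumably the merge was done with the force-merge API (named _optimize before 2.1); assuming a target of 5 segments per shard (80 total across 8 primaries plus 8 replicas), the call would be roughly:

POST /index1/_forcemerge?max_num_segments=5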
Cluster configuration:
cluster.name: es
node.name: es1
bootstrap.memory_lock: true
network.host: xxx.xx.xx.x
http.port: 9200
transport.tcp.port: 9300
network.bind_host: xxx.xx.xx.x
discovery.zen.ping.unicast.hosts: ["xxx.xx.xx.x","xxx.xx.xx.x","xxx.xx.xx.x","xxx.xx.xx.x","xxx.xx.xx.x","xxx.xx.xx.x","xxx.xx.xx.x"]
discovery.zen.minimum_master_nodes: 4
gateway.recover_after_data_nodes: 6
gateway.expected_nodes: 8
node.max_local_storage_nodes: 1
action.destructive_requires_name: true
action.auto_create_index: false
indices.store.throttle.max_bytes_per_sec: "100mb"
index.merge.scheduler.max_thread_count: 1
indices.breaker.total.limit: 70%
indices.breaker.fielddata.limit: 20%
indices.breaker.request.limit: 40%
indices.fielddata.cache.size: 20%
indices.queries.cache.size: 40%
indices.memory.index_buffer_size: 1024m
indices.memory.min_shard_index_buffer_size: 512m
indices.requests.cache.size: 2%
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 0ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 0ms
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms
script.engine.groovy.inline.aggs: true
script.engine.groovy.inline.search: true
monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s