Elastic Daily, Issue 287 (2018-05-30)

  1. Text annotation with ES. http://t.cn/R1Idxmd

  2. Image similarity search with ES. http://t.cn/Rq9AvuD

  3. The difference between refresh and flush. http://t.cn/R1Idxmg

Editor: bsll

Archive: https://elasticsearch.cn/article/645

Subscribe: https://tinyletter.com/elastic-daily


Elastic Daily, Issue 286 (2018-05-29)

1. A deep dive into Elasticsearch bulk operations.
http://t.cn/R1tuJMq
2. Integrating Elasticsearch with Spring Boot across multiple versions.
http://t.cn/R3VlVu7
3. A detailed getting-started guide to Elasticsearch Learning to Rank.
http://t.cn/R1tu9Nw

Editor: 叮咚光军
Archive: https://elasticsearch.cn/article/644
Subscribe: https://tinyletter.com/elastic-daily
 

A simple way to install logstash-filter-elasticsearch

Different Logstash releases bundle different plugins; version 5.6 does not ship with the logstash-filter-elasticsearch plugin, so it has to be installed manually.

The official installation method requires internet access and reconfiguring the plugin source, which is cumbersome. For logstash-filter-elasticsearch, the following approach works instead.

Installing the logstash-filter-elasticsearch plugin

1. Download the logstash-filter-elasticsearch archive (logstash-filter-elasticsearch.zip) from GitHub.

2. Create a plugins directory under the Logstash home directory and unzip logstash-filter-elasticsearch.zip into it.

3. Add one line to the Gemfile in the Logstash home directory:
gem "logstash-filter-elasticsearch", :path => "./plugins/logstash-filter-elasticsearch"

4. Restart Logstash.

This method works for logstash-filter-elasticsearch, but not for every Logstash plugin.
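The four steps above can be sketched as shell commands. This is only an illustration: it assumes the current directory is the Logstash home directory and that logstash-filter-elasticsearch.zip has already been downloaded from GitHub.

```shell
# Step 2: create the plugins directory and unpack the plugin into it.
mkdir -p plugins
# unzip logstash-filter-elasticsearch.zip -d plugins/   # requires the downloaded zip

# Step 3: point the Gemfile at the unpacked plugin directory.
echo 'gem "logstash-filter-elasticsearch", :path => "./plugins/logstash-filter-elasticsearch"' >> Gemfile

# Step 4: restart Logstash through whatever service manager you use.
```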

Reposting an excellent article on choosing the number of shards

How many shards should I have in my Elasticsearch cluster?
 
Elasticsearch is a very versatile platform that supports a variety of use cases and provides great flexibility around data organisation and replication strategies. This flexibility can, however, sometimes make it hard to determine up-front how best to organize your data into indices and shards, especially if you are new to the Elastic Stack. While suboptimal choices will not necessarily cause problems when first starting out, they have the potential to cause performance problems as data volumes grow over time. The more data the cluster holds, the more difficult it also becomes to correct the problem, as reindexing large amounts of data can sometimes be required.

When we come across users who are experiencing performance problems, it is not uncommon that this can be traced back to issues around how data is indexed and the number of shards in the cluster. This is especially true for use cases involving multi-tenancy and/or time-based indices. When discussing this with users, either in person at events or meetings or via our forum, some of the most common questions are “How many shards should I have?” and “How large should my shards be?”.

This blog post aims to answer these questions and provide, in a single place, practical guidelines for use cases that involve time-based indices, e.g. logging or security analytics.

What is a shard?

Before we start, we need to establish some facts and terminology that we will need in later sections.

Data in Elasticsearch is organized into indices. Each index is made up of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster.

As data is written to a shard, it is periodically published into new immutable Lucene segments on disk, and it is at this time it becomes available for querying. This is referred to as a refresh. How this works is described in greater detail in Elasticsearch: the Definitive Guide.

As the number of segments grows, they are periodically consolidated into larger segments. This process is referred to as merging. As all segments are immutable, disk usage will typically fluctuate during indexing, since new, merged segments need to be created before the ones they replace can be deleted. Merging can be quite resource intensive, especially with respect to disk I/O.

The shard is the unit at which Elasticsearch distributes data around the cluster. The speed at which Elasticsearch can move shards around when rebalancing data, e.g. following a failure, will depend on the size and number of shards as well as network and disk performance.

TIP: Avoid having very large shards as this can negatively affect the cluster's ability to recover from failure. There is no fixed limit on how large shards can be, but a shard size of 50GB is often quoted as a limit that has been seen to work for a variety of use-cases.

Index by retention period

As segments are immutable, updating a document requires Elasticsearch to first find the existing document, then mark it as deleted and add the updated version. Deleting a document also requires the document to be found and marked as deleted. For this reason, deleted documents will continue to tie up disk space and some system resources until they are merged out, which can consume a lot of system resources.

Elasticsearch allows complete indices to be deleted very efficiently directly from the file system, without explicitly having to delete all records individually. This is by far the most efficient way to delete data from Elasticsearch.

TIP: Try to use time-based indices for managing data retention whenever possible. Group data into indices based on the retention period. Time-based indices also make it easy to vary the number of primary shards and replicas over time, as this can be changed for the next index to be generated. This simplifies adapting to changing data volumes and requirements.
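As a sketch in Kibana console format (index names and settings are illustrative): a whole day of expired data can be dropped with a single call, and the next period's index can be created with a different shard count.

```
DELETE /logs-2018.04.01

PUT /logs-2018.06.01
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}
```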

Are indices and shards not free?

For each Elasticsearch index, information about mappings and state is stored in the cluster state. This is kept in memory for fast access. Having a large number of indices in a cluster can therefore result in a large cluster state, especially if mappings are large. This can become slow to update as all updates need to be done through a single thread in order to guarantee consistency before the changes are distributed across the cluster.

TIP: In order to reduce the number of indices and avoid large and sprawling mappings, consider storing data with similar structure in the same index rather than splitting into separate indices based on where the data comes from. It is important to find a good balance between the number of indices and the mapping size for each individual index.

Each shard has data that needs to be kept in memory and uses heap space. This includes data structures holding information at the shard level, but also at the segment level in order to define where data resides on disk. The size of these data structures is not fixed and will vary depending on the use case.

One important characteristic of the segment related overhead is however that it is not strictly proportional to the size of the segment. This means that larger segments have less overhead per data volume compared to smaller segments. The difference can be substantial.

In order to be able to store as much data as possible per node, it becomes important to manage heap usage and reduce the amount of overhead as much as possible. The more heap space a node has, the more data and shards it can handle.

Indices and shards are therefore not free from a cluster perspective, as there is some level of resource overhead for each index and shard.

TIP: Small shards result in small segments, which increases overhead. Aim to keep the average shard size between a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.

TIP: As the overhead per shard depends on the segment count and size, forcing smaller segments to merge into larger ones through a forcemerge operation can reduce overhead and improve query performance. This should ideally be done once no more data is written to the index. Be aware that this is an expensive operation that should ideally be performed during off-peak hours.
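As a sketch (index name illustrative), once writes to an index have stopped, its segments can be merged down with the force merge API; max_num_segments=1 merges each shard down to a single segment:

```
POST /logs-2018.04.01/_forcemerge?max_num_segments=1
```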

TIP: The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
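To check a cluster against this rule of thumb, the _cat APIs can show the configured heap per node and the current shard distribution (the column selection here is illustrative):

```
GET /_cat/nodes?v&h=name,heap.max
GET /_cat/shards?v
```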

How does shard size affect performance?

In Elasticsearch, each query is executed in a single thread per shard. Multiple shards can however be processed in parallel, as can multiple queries and aggregations against the same shard.

This means that the minimum query latency, when no caching is involved, will depend on the data, the type of query, as well as the size of the shard. Querying lots of small shards will make the processing per shard faster, but as many more tasks need to be queued up and processed in sequence, it is not necessarily going to be faster than querying a smaller number of larger shards. Having lots of small shards can also reduce the query throughput if there are multiple concurrent queries.

TIP: The best way to determine the maximum shard size from a query performance perspective is to benchmark using realistic data and queries. Always benchmark with a query and indexing load representative of what the node would need to handle in production, as optimizing for a single query might give misleading results.

How do I manage shard size?

When using time-based indices, each index has traditionally been associated with a fixed time period. Daily indices are very common, and are often used for holding data with a short retention period or large daily volumes. They allow the retention period to be managed with good granularity and make it easy to adjust for changing volumes on a daily basis. Data with a longer retention period, especially if the daily volumes do not warrant the use of daily indices, often uses weekly or monthly indices in order to keep the shard size up. This reduces the number of indices and shards that need to be stored in the cluster over time.

TIP: If using time-based indices covering a fixed period, adjust the period each index covers based on the retention period and expected data volumes in order to reach the target shard size.

Time-based indices covering a fixed time interval work well when data volumes are reasonably predictable and change slowly. If the indexing rate can vary quickly, it is very difficult to maintain a uniform target shard size.

In order to better handle this type of scenario, the Rollover and Shrink APIs were introduced. These add a lot of flexibility to how indices and shards are managed, specifically for time-based indices.

The rollover index API makes it possible to specify the number of documents an index should contain and/or the maximum period documents should be written to it. Once one of these criteria has been exceeded, Elasticsearch can trigger a new index to be created for writing without downtime. Instead of having each index cover a specific time period, it becomes possible to switch to a new index at a specific size, which makes it easier to achieve an even shard size for all indices.

In cases where data might be updated, there is no longer a distinct link between the timestamp of an event and the index it resides in when using this API, which may make updates significantly less efficient, as each update may need to be preceded by a search.

TIP: If you have time-based, immutable data where volumes can vary significantly over time, consider using the rollover index API to achieve an optimal target shard size by dynamically varying the time-period each index covers. This gives great flexibility and can help avoid having too large or too small shards when volumes are unpredictable.
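A minimal rollover setup might look like the following sketch in Kibana console format (the alias name, index name, and thresholds are all illustrative): the first index is created behind a write alias, and the rollover request switches the alias to a newly created index once either condition has been exceeded.

```
PUT /logs-000001
{
  "aliases": { "logs_write": {} }
}

POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 100000000
  }
}
```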


The shrink index API allows you to shrink an existing index into a new index with fewer primary shards. If an even spread of shards across nodes is desired during indexing, but this will result in too small shards, this API can be used to reduce the number of primary shards once the index is no longer indexed into. This will result in larger shards, better suited for longer term storage of data.


TIP: If you need each index to cover a specific time period but still want to be able to spread indexing out across a large number of nodes, consider using the shrink API to reduce the number of primary shards once the index is no longer indexed into. This API can also be used to reduce the number of shards if you have initially configured too many.
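As a sketch (index and node names illustrative): shrinking first requires the index to be made read-only with a copy of every shard allocated to a single node, after which the shrink call creates a new index with fewer primary shards.

```
PUT /logs-000001/_settings
{
  "settings": {
    "index.blocks.write": true,
    "index.routing.allocation.require._name": "shrink-node-1"
  }
}

POST /logs-000001/_shrink/logs-000001-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}
```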


Conclusions

This blog post has provided tips and practical guidelines around how to best manage data in Elasticsearch. If you are interested in learning more, "Elasticsearch: the definitive guide" contains a section about designing for scale, which is well worth reading even though it is a bit old.

A lot of the decisions around how to best distribute your data across indices and shards will however depend on the use-case specifics, and it can sometimes be hard to determine how to best apply the advice available. For more in-depth and personal advice you can engage with us commercially through a subscription and let our Support and Consulting teams help accelerate your project. If you are happy to discuss your use-case in the open, you can also get help from our community and through our public forum.

Elastic Daily, Issue 285 (2018-05-28)

1. Searching Chinese, Korean, and Japanese with ES 5.4, part 1: analyzers.
http://t.cn/R1G3z3Q

2. Qunar's ELK security monitoring center: pitfalls and practice.
http://t.cn/R1qhAYL

3. Elasticsearch internals: the write path.
http://t.cn/R1q5Y5u


Editor: cyberdak
Archive: https://elasticsearch.cn/article/641
Subscribe: https://tinyletter.com/elastic-daily

Elastic Daily, Issue 284 (2018-05-27)

1. Geocoding objects in Elasticsearch.
http://t.cn/R12Q3zm
2. An open-source X/MIT-licensed converter library for raster and vector geospatial data formats.
http://t.cn/R128rZU
3. (VPN required) On data.
http://t.cn/R12ETtD

Editor: 至尊宝
Archive: https://elasticsearch.cn/article/640
Subscribe: https://tinyletter.com/elastic-daily

Elastic Daily, Issue 283 (2018-05-26)

  1. Postmark's experience using Curator. http://t.cn/R1wYJxL

  2. kreuzwerker's experience migrating data from SQL Server to ES. http://t.cn/R1wYJxZ

  3. Drawing regions and coordinates in Kibana with custom base maps. http://t.cn/R1wYJx2

Elastic Daily, Issue 282 (2018-05-25)

1. Elasticsearch disaster recovery using HDFS snapshots
http://t.cn/R17PZJv
2. An overview of Elasticsearch architecture and source code
http://t.cn/R17PGhf
3. Image retrieval with Elasticsearch in practice
http://t.cn/R17PVoX

Editor: 铭毅天下
Archive: https://elasticsearch.cn/article/638
Subscribe: https://tinyletter.com/elastic-daily
 

[Tech Exchange] Elasticsearch on the rise: a comparison of open-source data engines

The 2018 database engine rankings are out, in case you haven't seen them yet.
[Image: 排名.png — 2018 database engine rankings]

Elasticsearch has moved strongly into the top ten and is a rising star.

Many people face the question of which data engine to choose. Here we compare several popular open-source data engines for reference.
[Image: 对比.png — feature comparison table]


From the table above:
MySQL: stores data in separate tables and is queried with SQL; currently a very popular relational database management system. If you need a traditional database, MySQL is a good choice.
MongoDB: a NoSQL database based on distributed file storage, with key-value document structures and good scalability. If you need to store and query unstructured data, with little need for analytics or full-text search, MongoDB is a good choice.
Redis: a high-performance key-value store with limited transaction support, mainly used as a cache. It is a good choice where read/write performance demands are very high, but because it relies entirely on memory, hardware requirements grow steep at large data volumes.
Elasticsearch: a distributed search engine based on Lucene, offering full-text search, synonym handling, relevance ranking, and complex data analytics. If you need text search with relevance ranking, or metric analytics, Elasticsearch is a very good fit. It has distinct advantages in full-text document search (website and app search) and log analysis (operations and DevOps).

This post is meant to start a discussion; feel free to comment on which data engines fit which scenarios.

If you would like to try Elasticsearch quickly, see the link below.
Huawei Cloud Search Service is Elasticsearch in the cloud: easy to use, worry-free in operations, elastic, and reliable.
 

Elastic Daily, Issue 281 (2018-05-24)

  1. Big news! The Kibana manual in Chinese is out. http://t.cn/R3eoVvc

  2. Implementing SQL's GROUP BY and LIMIT in Elasticsearch. http://t.cn/R3k85NN

  3. Using Elasticsearch with Laravel. http://t.cn/R3k8V48

Elastic Daily, Issue 280 (2018-05-23)

1. Elasticsearch clusters
http://t.cn/R3eBpR0
2. Setting up Elasticsearch and computing statistics with it
http://t.cn/R3eBB2S
3. Filebeat source code analysis
http://t.cn/Rtxs35p

Editor: 江水
Archive: https://elasticsearch.cn/article/635
Subscribe: https://tinyletter.com/elastic-daily
 

In Depth | Elastic evangelist Medcl in conversation with Ctrip's Wood: core notes

Episode 2 of the Elastic Podcast is here. This time we visit Ctrip (携程旅行网) in Shanghai. Ctrip makes heavy use of Elasticsearch for centralized operations log management and as a unified search platform for its business units. It currently runs 94 Elasticsearch clusters with more than 700 nodes in production, ingesting 160 billion new log entries per day with peaks of 3 million per second; the documents indexed in Elasticsearch total 2.5 trillion, and disk storage is at the petabyte level. To learn how Ctrip meets the challenges of data at this scale, and its best practices, listen to this episode of the podcast with Ctrip's two technical leads, 吴晓刚 and 胡航.

Audio: http://m.ximalaya.com/111156131/sound/89571047

Host: Elastic developer evangelist 曾勇 (Medcl). Guests: 1. 吴晓刚 (Wood), Director of Systems R&D in Ctrip's Technical Assurance department, an early Elasticsearch practitioner in China and an active member of the Chinese community. He previously worked in systems software development, systems integration, and technical support at companies including eBay, Morgan Stanley, and PPTV, has a strong interest in operations automation, visualization, and performance optimization for large-scale IT systems, and always aims to understand not just how things work but why.

2. 胡航, senior technical manager at Ctrip, responsible for search implementation and SOA service development. He previously worked at Tencent and Shanda, stays curious about new technology, and currently focuses on Elasticsearch business implementations and JVM performance tuning.

1. History of Elasticsearch at Ctrip

1.1 Operations team (Wood):

Started in 2014 on ES 0.9. They first evaluated MongoDB, which hit performance bottlenecks as data volumes grew; after further research they chose ELK (Elasticsearch, Logstash, Kibana). The result: real-time visibility, queries, and aggregations.

1.2 胡航's business team:

Use case: hotel pricing. Selection criteria: ES is distributed and easy to debug and tune. They started on ES 2.3 and, from mid-2017, moved progressively to ES 5.3. The results were significant: the platform team can focus on backend development while business teams build their own features.

2. Elasticsearch scale at Ctrip

2.1 Operations team (Wood):

Clusters: 94, from 3 nodes at the smallest to 360+ at the largest. Nodes: 700+. Daily ingest: 160 billion entries, peaking at 3M/s. Total data: 2.5 trillion documents, at the petabyte level. Challenges: 1) real-time ingestion; 2) business workflows that require keeping several months to two years of history.

2.2 胡航's business team:

Setup: 3 clusters, 6 nodes each. Largest single index: 10-20 million documents.

Focus: the ES platform layer, helping business units with ingestion, queries, and DSL tuning. Query load: 3,000-4,000/s.

Ctrip's ES footprint is among the largest in China, which brings significant challenges.

3. Pitfalls Ctrip has run into

3.1 Operations team (Wood):

3.1.1 Pain point 1: out-of-memory errors.

Cause: early versions placed few limits on queries; once data reached scale, queries and aggregations could consume enormous amounts of memory.

After upgrading, ES added many safeguards and cluster-level limits, and has become increasingly stable.

3.1.2 Pain point 2: clusters unable to recover after failures.

3.1.3 Pain point 3: translog recovery never completing.

3.1.4 Pain point 4: platform-level management of many clusters.

Remedy: study the underlying mechanics and find ways to work around the issues. As experience accumulated, operational efficiency improved.

3.2 胡航's business team:

3.2.1 Pain point 1: problems caused by weak ES fundamentals;

3.2.2 Pain point 2: a performance problem eventually traced to the keyword field type in ES 5.x.

4. Architecture

4.1 Operations team (Wood):

1. Early days: ELK + Redis (as an intermediate buffer)

Challenges: 1) Redis could not absorb the load;

2) Redis is single-threaded.

Improvements: 1) Redis was replaced with Kafka (disk-backed), and data flowed smoothly;

2) Logstash consumed too much memory, so it was replaced with logstash-forwarder (today the official Beats are recommended);

3) at scale, many servers were needed and Logstash remained very memory-hungry.

Optimization: they wrote gohangout (https://github.com/childe/gohangout) in Go, whose memory footprint is far lower than that of the Java-based hangout (https://github.com/childe/hangout).

4.2 Hu Hang's search team:

1) A single cluster became a pressure bottleneck.

Change: business data is imported into ES, and a custom client is provided.

2) The search platform keeps onboarding new business requirements.

Custom development inside Kibana was inconvenient, so they built a simple internal site for querying data and monitoring, tailored closely to business needs.
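
A custom import client like the one described in 4.2 ultimately drives Elasticsearch's _bulk API, whose body is newline-delimited JSON: one action line followed by one source line per document. A minimal sketch (the index name, `_type`, and field names here are hypothetical; a `_type` is still required in the 5.x/6.x versions discussed):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build an NDJSON _bulk body: an action line, then a source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

# Hypothetical hotel-price documents; Ctrip's real schema is not public.
body = build_bulk_body("hotel-price", "doc", [
    {"hotel_id": 101, "price": 532.0},
    {"hotel_id": 102, "price": 418.5},
])
print(body)
```

On 7.x and later, mapping types are removed, so the `_type` entry in the action line would simply be dropped.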

5. A first look at what's new in ES 6.3

5.1 ES 6.3 adds a SQL interface

Wood: today users inspect the DSL in Kibana, copy it, and modify it, which is awkward for newcomers.

BI teams also want SQL-style queries.

Benefit: simpler, and useful to more people.

Ctrip's BI department use cases: search keywords, hot-term statistics, destination information, and so on.

Where Kibana falls short, code has to be written; with SQL, such development would be much faster.

Hu Hang's search team: writing DSL is still somewhat complex, so they rely on the NLPChina elasticsearch-sql plugin.

In practice the plugin still has problems, and they look forward to official SQL support from Elastic.
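
For context, the SQL interface shipped in 6.3 is exposed as a REST endpoint. A minimal sketch of the request it accepts (the index and field names are hypothetical, and nothing is actually sent, since that would require a live cluster):

```python
import json

# ES 6.3 exposes SQL over REST as POST /_xpack/sql (renamed to /_sql in 7.x).
method, path = "POST", "/_xpack/sql?format=txt"
body = {
    # Hypothetical index and fields: a BI-style "top search keywords" query.
    "query": "SELECT keyword, COUNT(*) AS cnt FROM search_logs GROUP BY keyword LIMIT 10"
}
print(method, path, json.dumps(body))
```

The `format` parameter selects the response shape (tabular text, JSON, CSV, etc.), which is what makes it convenient for BI-style use.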

5.2 Richer visualizations in Kibana

5.3 Faster indexing

Refresh optimizations improve write throughput.
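
Related to indexing speed: a common client-side lever is `index.refresh_interval`, which trades search freshness for write throughput during heavy loads. A small sketch of the settings bodies involved (values illustrative, not recommendations):

```python
import json

# Applied via PUT /<index>/_settings on a live cluster; only the bodies are built here.
bulk_load_settings = {"index": {"refresh_interval": "30s"}}  # or "-1" to disable refresh entirely
restore_settings = {"index": {"refresh_interval": "1s"}}     # the default

print(json.dumps(bulk_load_settings))
print(json.dumps(restore_settings))
```

A typical pattern is to lengthen the interval before a bulk load and restore the default afterwards.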

6. Favorite features of the Elastic Stack

Wood: rich scalability; users need not care about the underlying implementation, and adding nodes makes queries over large data volumes easy.

Hu Hang: ES is visual and debuggable. For example: 1) when a problem appears, you can check whether the DSL or the mapping is appropriate;

2) trust the ES community, stop worrying about internals, and spend more time on the business (hands freed);

3) with a good data model in ES, business requirements can be met.

7. Where the Elastic Stack most needs improvement

Wood: 1) Cluster protection needs more work. What should happen after data loss?

After a node is destroyed?

Goal: lighten the operational burden.

2) Identifying bad queries: the slow log has shortcomings. It is hard to tell which slow query actually caused a failure.

When a cluster gets into trouble, a real-time analysis API would serve better than reading the slow log.

Hu Hang: 1) ES still has many pitfalls, which cost a lot of time.

2) Hoping the community will curate the common problems.

3) Hoping the vendor will publish a thorough guide, something like a cookbook,

that beginners can learn from (most of us lack experience).

8. Advice for beginners

1) Required reading: Elasticsearch: The Definitive Guide (English or Chinese). Wood has read it at least three times.

2) Practice constantly.

3) Bring questions back to the documentation and build up a knowledge system.

4) Take part in the community; try to understand and answer community questions, keep learning, and improve yourself. Help each other and grow together!

5) A small suggestion for the Chinese community: curate the best questions into a digest that newcomers can read through to learn from others' experience.

9. How do you see Elasticsearch developing in China?

1) Participation and contribution from China are still lacking;

2) Chinese analysis plugins and the like, e.g. demands for high-quality segmentation, domain-specific semantic search (better relevance), sentiment annotation, NLP;

3) applications in Chinese-language scenarios will keep getting richer;

4) community discussion is scattered; the community needs opinion leaders to deepen discussion within specific areas;

5) Medcl: ElasticTips has been launched; everyone is welcome to contribute and share in image form;

6) the community will take on more; please share and exchange.

10. Wrap-up

It is humbling that Wood has read Elasticsearch: The Definitive Guide three times; the rest of us have no excuse not to work hard.

Let's encourage each other and keep going!


Elastic Daily No. 279 (2018-05-22)

1. Implementing a simple tagging design pattern with Elasticsearch and Spring Data.
http://t.cn/R33Mt28
2. A small utility for converting Elasticsearch query builders.
http://t.cn/R33MfPM
3. The Elasticsearch DSL Python documentation (worth bookmarking).
http://t.cn/R8xuJC1

Editor: 叮咚光军
Archive: https://elasticsearch.cn/article/633
Subscribe: https://tinyletter.com/elastic-daily
 

Elastic Podcast Episode 2, Guests: Wu Xiaogang / Hu Hang @ Ctrip


banner.jpg

Episode 2 of the Elastic Podcast is here. This time we visited Ctrip (携程旅行网) in Shanghai. Ctrip uses Elasticsearch heavily, both for centralized operations log management and as a unified search service platform for its business units. It currently runs 94 Elasticsearch clusters in production with more than 700 Elasticsearch nodes, ingesting 160 billion new log entries per day at a peak of 3 million per second; the indices stored in Elasticsearch hold 2.5 trillion documents and occupy petabytes of disk. To learn how Ctrip handles the challenges of data at this scale, and its best practices, listen to this episode of the podcast with Ctrip's two technical leads, Wu Xiaogang and Hu Hang.


Host:

Zeng Yong (Medcl), Elastic developer evangelist.


Guests:

Wu Xiaogang, Director of Systems R&D in Ctrip's Technical Assurance department, an early Elasticsearch practitioner in China and an active member of the Chinese community. He previously worked on systems software development, systems integration, and technical support at eBay, Morgan Stanley, PPTV, and other companies at home and abroad. He has a strong interest in operations automation, visualization, and performance optimization for large-scale IT systems, and always wants to understand not just that something works but why.

Hu Hang, Senior Technical Manager at Ctrip, responsible for search implementations and SOA service development. He previously worked at Tencent, Shanda, and other companies, has a strong curiosity about new technology, and currently focuses on business implementations on Elasticsearch and JVM performance tuning.


Click any of the links below to listen (about 50 minutes):


Previous episode: How Elastic is used at DerbySoft (德比软件)


About the Elastic Podcast

The Elastic Podcast is a talk-show-style podcast series launched by the Elastic Chinese community. It regularly invites users of Elastic's open-source software to talk about topics around their use of it: industry applications, architecture case studies, experience sharing, and more.

ctrip_podcast_pic.jpg

[Hu Hang / Wu Xiaogang / Zeng Yong]