Elasticsearch

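ES data gone? Who touched my data?

yangmf2040 published an article • 0 comments • 2861 views • 2023-05-13 08:46 • from related topics

Background

When using Elasticsearch, we may run into cases where data appears to be "lost". The data may never have been written to the ES cluster successfully, or it may have been deleted by mistake.

Is there a good way to guard against accidental deletion?

One option is to put the delete operation itself under management. When a delete command is sent to the ES cluster, it is not executed immediately; instead, a record of the delete request is created, and the command runs only after an administrator approves it. This both controls deletions and makes them traceable: who sent which delete command, when, from which IP address, and under which login identity.

INFINI Gateway and Console can be used to implement this approach.

Solution architecture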

![](https://www.infinilabs.com/img ... p1.png)

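What the solution does

INFINI Gateway acts as a proxy for the ES cluster and receives all requests.
INFINI Gateway intercepts delete operations and generates a record for each one in the Console UI.
An administrator reviews the records in the Console UI; once a record is approved, the operation is executed.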
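The intercept-then-approve idea can be illustrated with a small in-memory sketch. This is not how INFINI Gateway or Console are implemented; the class names, field names, and the `execute` callback are all hypothetical and only model the workflow described above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple


@dataclass
class DeleteRecord:
    """One intercepted delete command, kept for audit and approval."""
    record_id: int
    user: str
    client_ip: str
    method: str
    path: str
    received_at: datetime
    status: str = "pending"


class DeleteApprovalQueue:
    """Holds delete commands until an administrator approves them (hypothetical model)."""

    def __init__(self, execute: Callable[[str, str], None]) -> None:
        self._execute = execute  # callback that actually performs the delete
        self._records: Dict[int, DeleteRecord] = {}

    def intercept(self, user: str, client_ip: str, method: str, path: str) -> DeleteRecord:
        # Record who asked for what, from where, and when; do NOT execute yet.
        record = DeleteRecord(
            record_id=len(self._records) + 1,
            user=user,
            client_ip=client_ip,
            method=method,
            path=path,
            received_at=datetime.now(timezone.utc),
        )
        self._records[record.record_id] = record
        return record

    def pending(self) -> List[DeleteRecord]:
        return [r for r in self._records.values() if r.status == "pending"]

    def approve(self, record_id: int) -> None:
        record = self._records[record_id]
        if record.status != "pending":
            raise ValueError(f"record {record_id} is already {record.status}")
        self._execute(record.method, record.path)  # only now does the delete run
        record.status = "approved"


executed: List[Tuple[str, str]] = []
queue = DeleteApprovalQueue(lambda method, path: executed.append((method, path)))

record = queue.intercept("alice", "10.0.0.5", "DELETE", "/test1/_doc/1")
executed_before = list(executed)      # nothing has been executed yet
pending_before = len(queue.pending())

queue.approve(record.record_id)       # the administrator approves the record
```

The design point is that `intercept` never touches the data; the only path to execution is `approve`, which also leaves a record behind as an audit trail.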

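Demonstration

Preparing test data

The test index test1 contains 3 documents, whose message values are "line 1", "line 2" and "line 3".

![](https://www.infinilabs.com/img ... p2.png)

Starting INFINI Gateway and Console

Additions to the gateway configuration

Add rules that capture DELETE operations. Instead of being executed directly, they are written to a queue, and records are later generated from that queue.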

```
router:
  - name: my_router
    default_flow: default_flow
    tracing_flow: logging_flow
    rules:
      - method:
          - "DELETE"
        pattern:
          - "/{any_index}"
          - "/{any_index}/{any_type}"
          - "/{any_index}/{any_type}/{any_docid}"
        flow:
          - audit_flow
      - method:
          - "*"
        pattern:
          - "/{any_index}/_delete_by_query"
          - "/_delete_by_query"
        flow:
          - audit_flow
flow:
  - name: audit_flow
    filter:
      - logging:
          queue_name: del_queue
pipeline:
  - name: del_queue_ingest
    auto_start: true
    keep_running: true
    processor:
      - json_indexing:
          input_queue: "del_queue"
          idle_timeout_in_seconds: 1
          elasticsearch: "logging-server"
          index_name: "del_requests"
          worker_size: 1
          bulk_size_in_kb: 1
```
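To make the effect of the two router rules concrete, here is a rough Python sketch of which requests would be sent to audit_flow and which would fall through to default_flow. It assumes that each `{placeholder}` matches exactly one path segment; the gateway's real matcher may behave differently, and the function names are hypothetical.

```python
import re
from typing import Dict, List

# The two rules from the configuration above, expressed as plain data.
AUDIT_RULES: List[Dict[str, List[str]]] = [
    {
        "methods": ["DELETE"],
        "patterns": [
            "/{any_index}",
            "/{any_index}/{any_type}",
            "/{any_index}/{any_type}/{any_docid}",
        ],
    },
    {
        "methods": ["*"],
        "patterns": ["/{any_index}/_delete_by_query", "/_delete_by_query"],
    },
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    # Assumption: every {name} placeholder stands for one non-empty path segment.
    parts = re.split(r"(\{[^}]+\})", pattern)
    body = "".join("[^/]+" if p.startswith("{") else re.escape(p) for p in parts)
    return re.compile(f"^{body}$")


def choose_flow(method: str, path: str) -> str:
    path = path.split("?", 1)[0]  # ignore the query string
    for rule in AUDIT_RULES:
        method_ok = "*" in rule["methods"] or method.upper() in rule["methods"]
        if method_ok and any(pattern_to_regex(p).match(path) for p in rule["patterns"]):
            return "audit_flow"
    return "default_flow"


flows = {
    "delete index": choose_flow("DELETE", "/test1"),
    "delete doc": choose_flow("DELETE", "/test1/_doc/1"),
    "delete by query": choose_flow("POST", "/test1/_delete_by_query"),
    "search": choose_flow("GET", "/test1/_search"),
}
```

Under these assumptions, the three delete variants are captured while ordinary reads and writes pass through untouched.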

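Running delete operations

ES supports several kinds of delete operations, which can be summarized as follows:
1. Delete a document by its ID
2. Delete an index
3. Delete data matching a query (_delete_by_query)

Before running any delete, access the ES cluster through INFINI Gateway to confirm that the data can be read normally.
![](https://www.infinilabs.com/img ... p3.png)
Then run each of the delete commands above, making sure to send them to port 8000 of INFINI Gateway.
![](https://www.infinilabs.com/img ... p4.png)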
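The three kinds of delete request can be sketched with the Python standard library. Only port 8000 and the index name test1 come from the article; the host name, the document ID and the query body are assumptions chosen for illustration. The requests are built here but not sent.

```python
import json
from typing import Any, Dict, Optional
from urllib.request import Request

# Assumption: the gateway runs on the local machine; port 8000 is from the article.
GATEWAY = "http://localhost:8000"


def build_request(method: str, path: str, body: Optional[Dict[str, Any]] = None) -> Request:
    """Build (but do not send) an HTTP request addressed to the gateway."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(GATEWAY + path, data=data, headers=headers, method=method)


# 1. Delete a document by ID (the ID "1" is hypothetical).
delete_doc = build_request("DELETE", "/test1/_doc/1")

# 2. Delete documents matching a query (the query is an illustrative guess).
delete_by_query = build_request(
    "POST",
    "/test1/_delete_by_query",
    {"query": {"match": {"message": "line 2"}}},
)

# 3. Delete the whole index.
delete_index = build_request("DELETE", "/test1")
```

Passing any of these objects to `urllib.request.urlopen` would send it. Because the target is the gateway rather than ES itself, the request should end up as a pending record instead of being executed.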
  
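Querying to verify the data is still there

![](https://www.infinilabs.com/img ... p5.png)

Viewing unapproved delete records in Console

![](https://www.infinilabs.com/img ... p6.png)

All delete operations have been recorded and are awaiting approval.

Approving a record in Console

![](https://www.infinilabs.com/img ... p7.jpg)

Select a record and approve it for execution (Operation-approve).

Querying to verify the data

The document with "message": "line 2" has been deleted.

![](https://www.infinilabs.com/img ... p8.png)

Viewing the history in Console

![](https://www.infinilabs.com/img ... p9.png)

Further approval tests

Approve the deletion of a single document

![](https://www.infinilabs.com/img ... 10.png)

![](https://www.infinilabs.com/img ... 11.png)

The document with "message": "line 1" is gone.

Approve the deletion of the index

![](https://www.infinilabs.com/img ... 12.png)

![](https://www.infinilabs.com/img ... 13.png)

The index is gone.

This concludes the demonstration of how to use INFINI Gateway and Console to control delete operations on an ES cluster. This article is only a starting point; there are surely many more interesting scenarios waiting to be explored.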

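A question for the experts: search results are grouped by a field. The first result of each group should be shown first, and the other results in the same group should be down-weighted (not filtered out). How can this be done?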


YuLiGod replied to the question • 6 followers • 3 replies • 4221 views • 2023-05-06 12:02 • from related topics

Elasticsearch: How to implement image similarity search in Elastic

liuxg published an article • 0 comments • 3159 views • 2023-05-04 16:17 • from related topics

Original: [Elasticsearch: How to implement image similarity search in Elastic](https://elasticstack.blog.csdn ... 312757)

Author: [Radovan Ondas](https://www.elastic.co/blog/author/radovan-ondas "Radovan Ondas")

![](https://img-blog.csdnimg.cn/d1 ... f2.png)

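In this article, we'll look at how to implement similar image search in [Elastic](https://so.csdn.net/so/search% ... 1.7020) in a few steps: first set up the application environment, then import the NLP model, and finally generate embeddings for your image set.

[Overview of image similarity search in Elastic >>](https://elasticstack.blog.csdn ... 93311 "Overview of image similarity search in Elastic >>")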

![](https://img-blog.csdnimg.cn/b1 ... 8.jpeg)


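How to set up the environment
=============================

The first step is to set up the environment for your application. The general requirements are:

Git
Python 3.9
Docker
Several hundred images

Using several hundred images is important for getting good results.

Go to your working folder and check out the repository code, then navigate into the repository folder.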

```
git clone https://github.com/radoondas/f ... h.git
cd flask-elastic-image-search
```

```
$ git clone https://github.com/radoondas/f ... h.git
Cloning into 'flask-elastic-image-search'...
remote: Enumerating objects: 105, done.
remote: Counting objects: 100% (105/105), done.
remote: Compressing objects: 100% (72/72), done.
remote: Total 105 (delta 37), reused 94 (delta 27), pack-reused 0
Receiving objects: 100% (105/105), 20.72 MiB | 9.75 MiB/s, done.
Resolving deltas: 100% (37/37), done.
$ cd flask-elastic-image-search/
$ pwd
/Users/liuxg/python/flask-elastic-image-search
```

Because you'll use Python to run the code, you need to make sure all requirements are met and the environment is ready. Now [create a virtual environment](https://so.csdn.net/so/search?q=%E5%88%9B%E5%BB%BA%E8%99%9A%E6%8B%9F%E7%8E%AF%E5%A2%83&spm=1001.2101.3001.7020) and install all dependencies.

```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

![](https://img-blog.csdnimg.cn/5e ... f9.png)

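Installation
============

If you haven't installed Elasticsearch and Kibana yet, refer to the following articles:
[How to install Elasticsearch on Linux, MacOS and Windows](https://elasticstack.blog.csdn ... 13578 "How to install Elasticsearch on Linux, MacOS and Windows")
[Kibana: How to install Kibana from the Elastic Stack on Linux, MacOS and Windows](https://elasticstack.blog.csdn ... 33732 "Kibana: How to install Kibana from the Elastic Stack on Linux, MacOS and Windows")

Note in particular that this demonstration uses Elastic Stack 8.6.1, the latest release at the time of writing. Please follow the articles for Elastic Stack 8.x when installing.

Starting the Platinum trial
===========================

Since uploading a model is a Platinum feature, we need to start a trial. For more on subscriptions, see: [Subscriptions | Elastic Stack Products & Support | Elastic](https://www.elastic.co/cn/subscriptions "Subscriptions | Elastic Stack Products & Support | Elastic").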

![](https://img-blog.csdnimg.cn/e9 ... 84.png)

![](https://img-blog.csdnimg.cn/c8 ... 05.png)

![](https://img-blog.csdnimg.cn/e2 ... 11.png)

![](https://img-blog.csdnimg.cn/60 ... 30.png)

With that, the Platinum trial has been started successfully.

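Elasticsearch cluster and embedding model
=========================================

Log in to your account to spin up an Elasticsearch cluster. Set up a small cluster with:
One HOT node with 2GB of memory
One ML (machine learning) node with 4GB of memory (the size of this node matters, because the NLP model you'll import into Elasticsearch consumes about 1.5GB of memory)

Once the deployment is ready, go to Kibana and check the capacity of the machine learning nodes. You'll see one machine learning node in the view. No model is loaded yet.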

![](https://img-blog.csdnimg.cn/6e ... 11.png)

![](https://img-blog.csdnimg.cn/19 ... 62.png)

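Upload the CLIP embedding model from OpenAI using the Eland library. Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch, and it can handle both text and images. You'll use this model to generate embeddings from text input and query for matching images. See the Eland [documentation](https://www.elastic.co/guide/e ... .html "documentation") for more details.

For the next step, you'll need the Elasticsearch endpoint. You can get it from the Elasticsearch cloud console, in the deployment details section.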

![](https://img-blog.csdnimg.cn/7d ... 0.jpeg)

In this example we use a local deployment, so the cloud steps above are not required.

Eland
-----

Eland can be installed from [PyPI](https://pypi.org/project/eland "PyPI") with pip. Python itself must be installed first.

```
$ python --version
Python 3.10.2
```

Eland can be installed from PyPI with pip:

```
python -m pip install eland
```

Eland can also be installed from Conda Forge with Conda:

```
conda install -c conda-forge eland
```

Users who only want to run the available scripts, without installing Eland, can build the Docker container:

```
git clone https://github.com/elastic/eland
cd eland
docker build -t elastic/eland .
```

![](https://img-blog.csdnimg.cn/ed ... de.png)

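Eland wraps the conversion of a Hugging Face transformer model to its TorchScript representation, together with the chunking process, in a single Python method; it is therefore the recommended way to import models.
1. [Install the Eland Python client](https://github.com/elastic/eland#getting-started "Install the Eland Python client").
2. Run the eland_import_hub_model script. For example:

```
eland_import_hub_model --url <cluster_url> \
--hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
--task-type ner
```
`--url`: the URL used to access your cluster, for example `https://<user>: ... >`.
`--hub-model-id`: the identifier of the model in the Hugging Face model hub.
`--task-type`: the type of NLP task. Supported values are fill_mask, ner, text_classification, text_embedding, question_answering and zero_shot_classification.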
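If you drive the import from a script, it can help to validate the task type before calling the command. The following is a minimal sketch that only assembles the argument list from the three options described above; the helper name and the example URL are hypothetical, and nothing is executed.

```python
import shlex
from typing import List

# Task types listed in the text above.
SUPPORTED_TASK_TYPES = {
    "fill_mask",
    "ner",
    "text_classification",
    "text_embedding",
    "question_answering",
    "zero_shot_classification",
}


def build_import_command(url: str, hub_model_id: str, task_type: str) -> List[str]:
    """Assemble the eland_import_hub_model command line as an argument list."""
    if task_type not in SUPPORTED_TASK_TYPES:
        raise ValueError(f"unsupported task type: {task_type!r}")
    return [
        "eland_import_hub_model",
        "--url", url,
        "--hub-model-id", hub_model_id,
        "--task-type", task_type,
    ]


command = build_import_command(
    "https://localhost:9200",  # hypothetical endpoint without credentials
    "elastic/distilbert-base-cased-finetuned-conll03-english",
    "ner",
)
printable = shlex.join(command)  # a shell-safe string for logging or copy and paste
```

The list form can be handed to `subprocess.run(command, check=True)` once Eland is installed.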

![](https://img-blog.csdnimg.cn/88 ... 71.png)

Uploading the model
-------------------

We use the following command to upload the model:

```
eland_import_hub_model --url https://<user>: ... > \
--hub-model-id sentence-transformers/clip-ViT-B-32-multilingual-v1 \
--task-type text_embedding \
--ca-certs <path/to/http_ca.crt> \
--start
```

In my case:

```
eland_import_hub_model --url https://elastic:ZgzSt2vHNwA6yP ... :9200 \
--hub-model-id sentence-transformers/clip-ViT-B-32-multilingual-v1 \
--task-type text_embedding \
--ca-certs /Users/liuxg/elastic/elasticsearch-8.6.1/config/certs/http_ca.crt \
--start
```

**Please note**: you need to change the settings above to match your own Elasticsearch endpoint, username and password, and adjust the certificate path to match your own configuration.

Run the command above:

![](https://img-blog.csdnimg.cn/055d0f4a24c74ca683a2b04daa972bb5.png)

The output shows that the model has been uploaded successfully. We can check it in Kibana:

![](https://img-blog.csdnimg.cn/72a4ac019a13423a8655b4337ef23755.png)

![](https://img-blog.csdnimg.cn/67300625d1c247cdb33fcfe48461b90d.png)

![](https://img-blog.csdnimg.cn/7da05828849e4db3a2f5f04356a1545c.png)

![](https://img-blog.csdnimg.cn/808df0af71604dc1931e6999826884f4.png)

The screenshots show that the required CLIP model has been uploaded and that its state is started.

How to create image embeddings
==============================

After setting up the Elasticsearch cluster and importing the embedding model, you need to vectorize your image data and create an image embedding for every image in your dataset.

![](https://img-blog.csdnimg.cn/6e793f36c0ae4f44ab476788abe4807f.png)

![](https://img-blog.csdnimg.cn/63048a39526a4f47b1468c12d73fa93a.png)

To create the image embeddings, use a simple Python script. You can find it here: create-image-embeddings.py. The script traverses your image directory and generates an embedding for each image. It creates a document with the image name and relative path, and saves it into the Elasticsearch index my-image-embeddings using the supplied mapping.

Put all your images (photos) into the folder app/static/images. Use a directory structure with subfolders to keep the images organized. Once all images are ready, run the script with a few parameters.

It is crucial to have at least a few hundred images to get reasonable results. With too few images the results will not be as expected, because the space you are searching is very small and the distances to the search vector are all very similar. I tried downloading many photos from the web, but downloading them one at a time was tedious. You can add the [Image downloader - Imageye](https://chrome.google.com/webstore/detail/image-downloader-imageye/agionbommeaifngbhincahgmoflcikhm "Image downloader - Imageye") extension to Chrome, which makes it easy to download many photos at once.

In the image_embeddings folder, run the script, substituting your own values for the variables.

```
cd image_embeddings
python3 create-image-embeddings.py \
--es_host='https://localhost:9200' \
--es_user='elastic' --es_password='ZgzSt2vHNwA6yPn-fllr' \
--ca_certs='/Users/liuxg/elastic/elasticsearch-8.6.1/config/certs/http_ca.crt'
```

Depending on the number of images, their size, your CPU and your network connection, this task will take some time. Experiment with a small number of images before trying to process the full dataset. After the script completes, you can use Kibana Dev Tools to verify that the index my-image-embeddings exists and contains the corresponding documents.

![](https://img-blog.csdnimg.cn/18101e42476a4d8f95a0c2962b0d998f.png)

We check it in Kibana:

```
GET _cat/indices/my-image-embeddings?v
```

The response to the command above is:

```
health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   my-image-embeddings h6oUBdHCScWmXOZaf57oWg   1   1        145            0      1.4mb          1.4mb
```

Looking at the documents, you'll see JSON objects very similar to the example below. Each holds the image name, the image ID, and the relative path inside the images folder. The front-end application uses this path to display the image correctly in search results. The most important part of the JSON document is image_embedding, which contains the dense vector produced by the CLIP model. The application uses this vector when searching for an image or for similar images.

```
GET my-image-embeddings/_search
```

```
{
  "_index": "my-image-embeddings",
  "_id": "_g9ACIUBMEjlQge4tztV",
  "_score": 6.703597,
  "_source": {
    "image_id": "IMG_4032",
    "image_name": "IMG_4032.jpeg",
    "image_embedding": [
      -0.3415695130825043,
      0.1906963288784027,
      .....
      -0.10289803147315979,
      -0.15871885418891907
    ],
    "relative_path": "phone/IMG_4032.jpeg"
  }
}
```

![](https://img-blog.csdnimg.cn/f64311fb558c46b5bf7f17aa2a836311.png)

Searching images with the Flask application
===========================================

Now that your environment is fully set up, you can take the next step and actually search for images using natural language, and find similar images, with the Flask application we provide as a proof of concept. The web application has a simple UI that makes image search straightforward. You can access the prototype Flask application in this GitHub [repository](https://github.com/radoondas/flask-elastic-image-search "repository").

Behind the scenes, the application performs two tasks. After you enter a search string in the search box, the text is vectorized using the machine learning _infer endpoint. Then a query with the resulting dense vector is run against the index my-image-embeddings, which holds the image vectors.

You can see both queries in the examples. The first API call uses the _infer endpoint, and its result is a dense vector.

```
POST _ml/trained_models/sentence-transformers__clip-vit-b-32-multilingual-v1/_infer
{
  "docs" : [
    {"text_field": "Yellow mountain is the most beautiful mountain in China"}
  ]
}
```

The response looks like this:

![](https://img-blog.csdnimg.cn/6b4f423d3ed94d1cbbe5e1b7ed4cf8e2.png)

In the second task, the search query, we use the dense vector and get back images sorted by score.

```
GET my-image-embeddings/_search
{
  "fields": [
    "image_id",
    "image_name",
    "relative_path"
  ],
  "_source": false,
  "knn": {
    "field": "image_embedding",
    "k": 5,
    "num_candidates": 10,
    "query_vector": [
      0.03395160660147667,
      0.007704082876443863,
      0.14996188879013062,
      -0.10693030804395676,
      ...
      0.05140634626150131,
      0.07114913314580917
    ]
  }
}
```

![](https://img-blog.csdnimg.cn/01734d0151524e9f9f08bd2868808ab5.png)

To get the Flask application up and running, navigate to the root folder of the repository and configure the .env file. The values in this file are used to connect to the Elasticsearch cluster. You need to fill in values for the following variables; they are the same values used when generating the image embeddings.

**.env**

```
ES_HOST='URL:PORT'
ES_USER='elastic'
ES_PWD='password'
```

So that our self-managed Elasticsearch cluster can be accessed correctly, we must copy the Elasticsearch root certificate into the corresponding directory of the Flask application:

**flask-elastic-image-search/app/conf/ca.crt**

```
(.venv) $ pwd
/Users/liuxg/python/flask-elastic-image-search/app/conf
(.venv) $ cp ~/elastic/elasticsearch-8.6.1/config/certs/http_ca.crt ca.crt
overwrite ca.crt? (y/n [n]) y
```

Above, we replaced the certificate file ca.crt that shipped with the repository. When everything is ready, run the flask application from the main folder and wait for it to start.

```
# In the main directory
$ flask run --port=5001
```

If the application starts, you'll see output similar to the following, which ends by telling you which URL to visit to reach the application.

![](https://img-blog.csdnimg.cn/f9 ... 41.png)

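Congratulations! Your application should now be up and running, and reachable in a web browser at [http://127.0.0.1:5001](http://127.0.0.1:5001 "http://127.0.0.1:5001").

Navigate to the Image Search tab and enter text that best describes the image you're looking for. Try to use descriptive, non-keyword text.

In the example below, the text entered was "Yellow mountain is the most beautiful mountain in China". The results come from our dataset. If the user likes a particular image in the result set, clicking the button next to it displays similar images. The user can do this any number of times, building their own path through the image dataset.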

![](https://img-blog.csdnimg.cn/c6 ... a8.png)

Let's try another example. This time we enter: I love beautiful girls.

![](https://img-blog.csdnimg.cn/6e ... a1.png)

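You can also search by simply uploading an image. The application converts the image into a vector and searches the dataset for similar images. To do this, navigate to the third tab, "Similar Image", upload an image from disk, and click "Search".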

![](https://img-blog.csdnimg.cn/d7 ... 95.png)

![](https://img-blog.csdnimg.cn/32 ... 84.png)

We can see similar images. Let's try again with a photo of a girl:

![](https://img-blog.csdnimg.cn/79 ... f0.png)

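Because the [NLP](https://so.csdn.net/so/search% ... 1.7020) model we use in Elasticsearch (sentence-transformers/clip-ViT-B-32-multilingual-v1) is multilingual and supports inference in many languages, try searching for images in your own language, then verify the results with English text as well. Here we try the Chinese query "黄山是中国最漂亮的山" ("Huangshan is the most beautiful mountain in China"):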

![](https://img-blog.csdnimg.cn/1e ... 0d.png)

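Note that the models used here are general-purpose models. They are quite accurate, but the results you get will vary with the use case and other factors. If you need higher accuracy, you'll have to adapt a general model or develop your own; the CLIP model is just a starting point.

Code summary
============

You can find the complete code in the GitHub [repository](https://github.com/radoondas/f ... earch "repository"). You'll probably want to look at the code in [routes.py](https://github.com/radoondas/f ... es.py "routes.py"), which implements the application's main logic. Besides the obvious route definitions, pay attention to the methods that call the _infer and _search endpoints (infer_trained_model and knn_search_images). The code that generates image embeddings is in the [create-image-embeddings.py](https://github.com/radoondas/f ... gs.py "create-image-embeddings.py") file.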
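To show how the two calls fit together, here is a rough sketch that builds the request path and body for each step. It is not the code from routes.py; the function names and the stand-in vector are hypothetical, while the model ID, index name and field names are taken from the queries shown earlier in the article.

```python
from typing import Any, Dict, List

MODEL_ID = "sentence-transformers__clip-vit-b-32-multilingual-v1"
INDEX = "my-image-embeddings"


def infer_request(text: str) -> Dict[str, Any]:
    """Step 1: ask the trained model to turn the search text into a dense vector."""
    return {
        "method": "POST",
        "path": f"/_ml/trained_models/{MODEL_ID}/_infer",
        "body": {"docs": [{"text_field": text}]},
    }


def knn_search_request(query_vector: List[float], k: int = 5,
                       num_candidates: int = 10) -> Dict[str, Any]:
    """Step 2: find the k images whose embeddings are closest to the vector."""
    if num_candidates < k:
        # Elasticsearch requires num_candidates to be at least k.
        raise ValueError("num_candidates must be greater than or equal to k")
    return {
        "method": "GET",
        "path": f"/{INDEX}/_search",
        "body": {
            "fields": ["image_id", "image_name", "relative_path"],
            "_source": False,
            "knn": {
                "field": "image_embedding",
                "k": k,
                "num_candidates": num_candidates,
                "query_vector": query_vector,
            },
        },
    }


step1 = infer_request("Yellow mountain is the most beautiful mountain in China")
# Stand-in for the vector returned by step 1; real vectors are much longer.
step2 = knn_search_request([0.03, 0.01, 0.15, -0.11])
```

Keeping the request builders separate from the HTTP client makes each step easy to test without a running cluster.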

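Summary
=======

Now that you've set up the Flask application, you can easily search your own image set! Elastic provides native integration of vector search within the platform, avoiding communication with external processes. You also have the flexibility to develop and use custom embedding models, for example ones you've built with PyTorch.

Semantic image search has the following advantages over traditional image search methods:
Higher accuracy: vector similarity captures context and associations without relying on textual meta-descriptions of the images.
Enhanced user experience: describe what you're looking for, or provide a sample image, instead of guessing which keywords might be relevant.
Categorization of image databases: there is no need to worry about cataloging your images; similarity search finds relevant images in a pile of images without them having to be organized.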
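As a toy illustration of what "vector similarity" means here, the sketch below ranks three made-up image vectors against a query vector by cosine similarity. The file names and 3-dimensional vectors are invented for the example, and cosine is only an assumption: the article does not say which similarity metric the index mapping uses.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented embeddings; real CLIP embeddings have hundreds of dimensions.
image_vectors: Dict[str, List[float]] = {
    "mountain.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.1, 0.9, 0.2],
    "hill.jpg": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.2, 0.0]  # stand-in for the embedding of the search text

# Rank images from most to least similar to the query.
ranked = sorted(
    image_vectors,
    key=lambda name: cosine_similarity(query_vector, image_vectors[name]),
    reverse=True,
)
```

No keywords or tags are involved: the ranking depends only on how close the vectors are, which is why untagged images can still be found.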

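If your use case relies more on text data, you can learn more about implementing semantic search and applying natural language processing to text in [previous blog posts](https://www.elastic.co/blog/ho ... arted "previous blog posts"). For text data, combining vector similarity with traditional keyword scoring offers the best of both worlds.

Ready to get started? Sign up for the hands-on vector search workshop in our [virtual event hub](https://www.elastic.co/events/ ... kshop "virtual event hub") and engage with the community in our [online forum](https://discuss.elastic.co/tag/vector-search "online forum").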

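Web Scraper + Elasticsearch + Kibana + SearchKit: a search demo for the Douban Movie Top 250

Author: 小森同学

Disclaimer: the movie data comes from Douban Movie (豆瓣电影). If it infringes any rights, please get in touch to have it removed.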

Web Scraper

```json
{
  "_id": "top250",
  "startUrl": ["https://movie.douban.com/top250?start=[0-225:25]&filter="],
  "selectors": [
    { "id": "container", "multiple": true, "parentSelectors": ["_root"], "selector": ".grid_view li", "type": "SelectorElement" },
    { "id": "name", "multiple": false, "parentSelectors": ["container"], "regex": "", "selector": "span.title:nth-of-type(1)", "type": "SelectorText" },
    { "id": "number", "multiple": false, "parentSelectors": ["container"], "regex": "", "selector": "em", "type": "SelectorText" },
    { "id": "score", "multiple": false, "parentSelectors": ["container"], "regex": "", "selector": "span.rating_num", "type": "SelectorText" },
    { "id": "review", "multiple": false, "parentSelectors": ["container"], "regex": "", "selector": "span.inq", "type": "SelectorText" },
    { "id": "year", "multiple": false, "parentSelectors": ["container"], "regex": "\\d{4}", "selector": "p:nth-of-type(1)", "type": "SelectorText" },
    { "id": "tour_guide", "multiple": false, "parentSelectors": ["container"], "regex": "^导演: \\S*", "selector": "p:nth-of-type(1)", "type": "SelectorText" },
    { "id": "type", "multiple": false, "parentSelectors": ["container"], "regex": "[^/]+$", "selector": "p:nth-of-type(1)", "type": "SelectorText" },
    { "id": "area", "multiple": false, "parentSelectors": ["container"], "regex": "[^\\/]+(?=\\/[^\\/]*$)", "selector": "p:nth-of-type(1)", "type": "SelectorText" },
    { "id": "detail_link", "multiple": false, "parentSelectors": ["container"], "selector": ".hd a", "type": "SelectorLink" },
    { "id": "director", "multiple": false, "parentSelectors": ["detail_link"], "regex": "", "selector": "span:nth-of-type(1) .attrs a", "type": "SelectorText" },
    { "id": "screenwriter", "multiple": false, "parentSelectors": ["detail_link"], "regex": "(?<=编剧: )[\\u4e00-\\u9fa5A-Za-z0-9/()\\·\\s]+(?=主演)", "selector": "div#info", "type": "SelectorText" },
    { "id": "film_length", "multiple": false, "parentSelectors": ["detail_link"], "regex": "\\d+", "selector": "span[property='v:runtime']", "type": "SelectorText" },
    { "id": "IMDb", "multiple": false, "parentSelectors": ["detail_link"], "regex": "(?<=[IMDb:\\s+])\\S*(?=\\d*$)", "selector": "div#info", "type": "SelectorText" },
    { "id": "language", "multiple": false, "parentSelectors": ["detail_link"], "regex": "(?<=语言: )\\S+", "selector": "div#info", "type": "SelectorText" },
    { "id": "alias", "multiple": false, "parentSelectors": ["detail_link"], "regex": "(?<=又名: )[\\u4e00-\\u9fa5A-Za-z0-9/()\\s]+(?=IMDb)", "selector": "div#info", "type": "SelectorText" },
    { "id": "pic", "multiple": false, "parentSelectors": ["container"], "selector": "img", "type": "SelectorImage" }
  ]
}
```
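Two parts of this sitemap are easy to misread: the `[0-225:25]` range in startUrl and the regexes applied to the first paragraph of each list item. The sketch below expands the range the way Web Scraper's `[start-end:step]` syntax is documented to work, and runs three of the regexes against a sample string. The helper name and the sample text are hypothetical, chosen only for illustration.

```python
import re
from typing import List


def expand_start_urls(template: str) -> List[str]:
    """Expand one [start-end:step] range in a start URL into concrete URLs."""
    match = re.search(r"\[(\d+)-(\d+):(\d+)\]", template)
    if match is None:
        return [template]
    start, end, step = (int(group) for group in match.groups())
    prefix, suffix = template[:match.start()], template[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1, step)]


# 250 movies at 25 per page means ten list pages: start=0, 25, ..., 225.
urls = expand_start_urls("https://movie.douban.com/top250?start=[0-225:25]&filter=")

# Hypothetical text in the "year / area / type" shape the selectors expect.
sample = "1994 / 美国 / 犯罪 剧情"
year = re.search(r"\d{4}", sample).group()                        # first 4-digit run
movie_type = re.search(r"[^/]+$", sample).group().strip()         # text after the last "/"
area = re.search(r"[^\/]+(?=\/[^\/]*$)", sample).group().strip()  # text before the last "/"
```

The lookahead in the area regex is what selects the second-to-last segment: it requires exactly one more "/" between the match and the end of the string.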

elasticsearch

```json
{
  "mappings": {
    "properties": {
      "IMDb": { "type": "keyword", "copy_to": [ "all" ] },
      "alias": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }, "copy_to": [ "all" ], "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "all": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "area": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }, "copy_to": [ "all" ], "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "director": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }, "copy_to": [ "all" ], "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "film_length": { "type": "long" },
      "id": { "type": "keyword" },
      "language": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }, "copy_to": [ "all" ], "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "link": { "type": "keyword" },
      "name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }, "copy_to": [ "all" ], "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "number": { "type": "long" },
      "photo": { "type": "keyword" },
      "review": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }, "copy_to": [ "all" ], "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "score": { "type": "double" },
      "screenwriter": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }, "copy_to": [ "all" ], "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "type": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }, "copy_to": [ "all" ], "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "year": { "type": "long" }
    }
  }
}
```

kibana

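A pipeline is needed to process some of the index fields, for example splitting type on spaces into an array; see the official documentation or other blog posts for details.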
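As one possible shape for such a pipeline, the sketch below shows an ingest pipeline definition that uses the Elasticsearch split processor on the type field, next to a plain Python function that does roughly the same thing locally. The author's actual pipeline is not shown in the article, so the description text, the function name and the sample document are assumptions.

```python
import re
from typing import Any, Dict

# Body for an ingest pipeline (e.g. PUT _ingest/pipeline/<name>); the split
# processor turns the space-separated string into an array of values.
SPLIT_TYPE_PIPELINE: Dict[str, Any] = {
    "description": "Split the space-separated type field into an array",
    "processors": [
        {"split": {"field": "type", "separator": "\\s+"}},
    ],
}


def split_type(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Local, roughly equivalent transformation, handy for checking scraped data."""
    result = dict(doc)  # leave the input document untouched
    result["type"] = re.split(r"\s+", doc["type"].strip())
    return result


# Hypothetical scraped document.
scraped = {"name": "example movie", "type": "犯罪 剧情"}
indexed = split_type(scraped)
```

With type stored as an array, each genre can be aggregated and filtered on individually through the type.keyword sub-field defined in the mapping.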

Building the dashboard is omitted here; please look it up yourself.

SearchKit

See https://github.com/searchkit/searchkit-starter-app
