Elasticsearch

failed to find analyzer type [mmseg_maxword] or tokenizer

zplzpl replied to the question • 5 followers • 4 replies • 10123 views • 2016-06-27 17:03 • from related topics

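Crawling the Elastic Chinese community with a Java crawler to use as ES test data

kl published an article • 1 comment • 8923 views • 2016-03-29 23:10 • from related topics

Preface
To try out ES's features, I crawled a large amount of data from the Elastic Chinese community and CSDN to use as test data. Below is a brief account of how it went.
Getting to know WebCollector
WebCollector is a Java crawler framework (kernel) that needs no configuration and is easy to build on. It offers a compact API, so a capable crawler takes only a small amount of code. WebCollector-Hadoop is the Hadoop version of WebCollector and supports distributed crawling.
WebCollector aims to maintain a stable, extensible crawler kernel that developers can customize flexibly. The kernel is highly extensible, so users can develop whatever crawler they want on top of it. The source bundles Jsoup for precise HTML parsing, and the 2.x versions bundle selenium, which can handle data generated by JavaScript.
Official site: http://crawlscript.github.io/WebCollector/
Usage steps
Import the jar dependency. Mine is a Maven project, so I added the WebCollector dependency to pom.xml.
PS: I am using the latest version. The newest version in the Maven repository is currently 2.09, so to use the latest you have to download the source and package it yourself.
Once the environment is ready, create a class that extends BreadthCrawler and override the visit method. All of your processing logic goes in visit. My code is below.
Crawling the Elastic Chinese community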

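import cn.edu.hfut.dmic.contentextractor.ContentExtractor;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

/**
 * Created by 小陈 on 2016/3/29.
 */
@Component
public class ElasticCrawler extends BreadthCrawler {

    @Autowired
    IpaDao ipaDao;

    public ElasticCrawler() {
        super("crawl", true);
        /* start page (redacted by the author) */
        this.addSeed("xxx");
        /* URL pattern of the pages to fetch (redacted by the author) */
        this.addRegex("xxx");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs that contain # */
//        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        String content = "";
        try {
            // Extract the main body text of the page
            content = ContentExtractor.getContentByUrl(url);
        } catch (Exception e) {
            e.printStackTrace();
        }
        /* extract the title */
        String title = page.getDoc().title();
        System.out.println("-------------------->" + title);
        if (!title.isEmpty() && !content.isEmpty()) {
            Pa pa = new Pa(title, content);
            ipaDao.save(pa); // persist to the database
        }
    }
}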

Crawling CSDN

import cn.edu.hfut.dmic.contentextractor.ContentExtractor;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

/**
 * @author kl by 2016/3/29
 * @boke www.kailing.pub
 */
@Component
public class CSDNCrawler extends BreadthCrawler {

    @Autowired
    IpaDao ipaDao;

    public CSDNCrawler() {
        super("crawl", true);
        /* start page: a seed is a literal URL, not a regex */
        this.addSeed("http://blog.csdn.net/");
        /* fetch URLs like http://blog.csdn.net/<user>/article/details/<id> */
        this.addRegex("http://blog.csdn.net/.*/article/details/.*");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs that contain # */
//        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        String content = "";
        try {
            // Extract the main body text of the page
            content = ContentExtractor.getContentByUrl(url);
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Only article pages carry a title and an author block
        if (page.matchUrl("http://blog.csdn.net/.*/article/details/.*")) {
            String title = page.select("div[class=article_title]").first().text();
            String author = page.select("div[id=blog_userface]").first().text(); // author name
            System.out.println("title:" + title + "\tauthor:" + author);
            if (!title.isEmpty() && !content.isEmpty()) {
                Pa pa = new Pa(title, content);
                ipaDao.save(pa); // persist to the database
            }
        }
    }
}

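PS: I have redacted the crawl rules for the Elastic Chinese community, because I care about the community. Feel free to crawl CSDN instead. WebCollector is very capable. The key to writing a crawler is knowing the target site's URL rules, so have a look if you are interested. Elastic does not have much data and took about a minute. CSDN took 5 or 6 minutes. I did not crawl deeply and collected roughly 200,000 to 300,000 records, keeping only the title and the body text.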
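To make the point about URL rules concrete, here is a small sketch of include/exclude filtering in the style of the `addRegex` calls above, where a leading "-" marks an exclusion. It uses only the JDK. `UrlRuleFilter` is a hypothetical helper written for illustration and is not WebCollector's own implementation; the sample URLs are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of WebCollector: include/exclude URL rules. */
final class UrlRuleFilter {
    private final List<Pattern> include = new ArrayList<>();
    private final List<Pattern> exclude = new ArrayList<>();

    /** A leading "-" marks an exclusion rule; a leading "+" is optional for inclusion. */
    UrlRuleFilter addRegex(String rule) {
        if (rule.startsWith("-")) {
            exclude.add(Pattern.compile(rule.substring(1)));
        } else if (rule.startsWith("+")) {
            include.add(Pattern.compile(rule.substring(1)));
        } else {
            include.add(Pattern.compile(rule));
        }
        return this;
    }

    /** Exclusions win; otherwise the URL must fully match at least one inclusion rule. */
    boolean accepts(String url) {
        for (Pattern p : exclude) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : include) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlRuleFilter filter = new UrlRuleFilter()
                .addRegex("http://blog.csdn.net/.*/article/details/.*")
                .addRegex("-.*\\.(jpg|png|gif).*");
        // An article page, an image under an article path, and a profile page
        System.out.println(filter.accepts("http://blog.csdn.net/someuser/article/details/123"));
        System.out.println(filter.accepts("http://blog.csdn.net/someuser/article/details/1.png"));
        System.out.println(filter.accepts("http://blog.csdn.net/someuser"));
    }
}
```

Checking exclusions first keeps the image rule effective even when an image URL also happens to match the article pattern.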

View the original post on my blog: http://www.kailing.pub/article/index/arcid/86.html
Below is a screenshot of the imported data.

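How to shut down the Elasticsearch service

guoyiqin replied to the question • 5 followers • 4 replies • 10184 views • 2016-04-06 17:06 • from related topics

Cannot shut down the server through the elasticsearch head plugin

owen replied to the question • 3 followers • 3 replies • 6484 views • 2016-03-30 15:37 • from related topics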

Connecting to ES from Java with Jest, an HTTP REST client that is very capable

kl published an article • 6 comments • 27857 views • 2016-03-28 23:30 • from related topics

Preface

Before I came across Jest, I kept trying to connect to ES with the official Elasticsearch Java API, but for some reason it always threw the exception below. I searched Google for a long time, and every answer blamed mismatched JVM versions. I was testing locally, so the JVM is certainly the same. The problem is still unsolved, but that was not going to stop me from exploring ES, and that is how I found Jest.

org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream
My test code follows the example in the official API documentation (Elasticsearch Java API). The code is:

Client client = new TransportClient()
        .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));
QueryBuilder queryBuilder = QueryBuilders.termQuery("content", "搜");
SearchResponse searchResponse = client.prepareSearch("indexdata")
        .setTypes("fulltext")
        .setQuery(queryBuilder)
        .execute()
        .actionGet();
SearchHits hits = searchResponse.getHits();
System.out.println("Number of records found: " + hits.getTotalHits());
SearchHit[] searchHits = hits.getHits();
for (SearchHit sh : searchHits) {
    System.out.println("content:" + sh.getSource().get("content"));
}
client.close();
If anyone knows what is going on, please tell me so that I at least understand the pit I fell into. I would be very grateful. My ES version is 2.2.0.
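A possible lead, offered as a guess rather than a diagnosis: this exception is often reported when the client library and the server run different Elasticsearch versions, and the no-argument `new TransportClient()` constructor above is the 1.x style of creating a client. The sketch below uses only the JDK to pull the version number out of the JSON that a node returns for `GET http://127.0.0.1:9200/`, so it can be compared with the version of the elasticsearch jar on the classpath. The class name `VersionCheckSketch`, its methods, and the sample JSON are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper (hypothetical name); uses only the JDK. */
final class VersionCheckSketch {

    // Matches the first "number" field, which sits inside the "version" object
    // of the root endpoint response.
    private static final Pattern NUMBER =
            Pattern.compile("\"number\"\\s*:\\s*\"([^\"]+)\"");

    /** Returns the version string found in the root endpoint JSON, or null if absent. */
    static String serverVersion(String rootJson) {
        Matcher m = NUMBER.matcher(rootJson);
        return m.find() ? m.group(1) : null;
    }

    /** True when both versions share the same major number, e.g. 2.2.0 and 2.1.1. */
    static boolean sameMajor(String clientVersion, String serverVersion) {
        return major(clientVersion).equals(major(serverVersion));
    }

    private static String major(String version) {
        int dot = version.indexOf('.');
        return dot < 0 ? version : version.substring(0, dot);
    }

    public static void main(String[] args) {
        // Sample shaped like the root endpoint response; the values are made up.
        String sample = "{ \"name\" : \"node-1\", \"version\" : { \"number\" : \"2.2.0\" } }";
        String server = serverVersion(sample);
        System.out.println("server version: " + server);
        // "1.7.5" stands in for the version of the client jar on the classpath.
        System.out.println("same major as 1.7.5: " + sameMajor("1.7.5", server));
    }
}
```

If the two major versions differ, aligning the client dependency with the server version is a cheap thing to try first.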

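Getting to the point

About Jest

Jest is an HTTP REST based API toolkit for connecting to ES. It is capable, it can reuse the query builders from the ES Java API, and it is open source. GitHub: https://github.com/searchbox-io/Jest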
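Because Jest talks to ES over HTTP, a search boils down to a JSON body posted to the `_search` endpoint. The sketch below shows that idea with only the JDK. It assumes a node at http://127.0.0.1:9200 and the `indexdata` index used in this post. `RestSearchSketch` and its methods are hypothetical names, and this is not a description of how Jest is implemented internally.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

/** Hypothetical sketch: a match query sent to the _search endpoint over plain HTTP. */
final class RestSearchSketch {

    /** Escapes characters that must not appear raw inside a JSON string. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body of a match query on a single field. */
    static String matchQuery(String field, String text) {
        return "{\"query\":{\"match\":{\"" + jsonEscape(field) + "\":\""
                + jsonEscape(text) + "\"}}}";
    }

    /** Posts the body to {baseUrl}/{index}/_search and returns the raw JSON response. */
    static String search(String baseUrl, String index, String body) throws IOException {
        URL url = new URL(baseUrl + "/" + index + "/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        // The two-character Chinese query term from this post, written as escapes.
        String body = matchQuery("content", "\u641c\u7d22");
        System.out.println(body);
        // Only contacts a node when a base URL is passed, e.g. http://127.0.0.1:9200
        if (args.length > 0) {
            System.out.println(search(args[0], "indexdata", body));
        }
    }
}
```

A client library such as Jest adds connection pooling, typed builders, and result mapping on top of this kind of request.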

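My test case

Analyzer: ik. Analyzer repository: https://github.com/medcl/elasticsearch-analysis-ik . Many ES features are provided through plugins, and since the upgrade to 2.2.0 the way plugins are installed has changed. If you have trouble installing the ik analysis plugin, click the QQ link in the top-right corner to contact the author.

Create the index

curl -XPUT http://localhost:9200/indexdata

Create the mapping for the index and specify the analyzer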

curl -XPOST http://localhost:9200/indexdata/fulltext/_mapping -d '
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word",
      "term_vector": "no",
      "store": "false"
    },
    "properties": {
      "content": {
        "type": "string",
        "store": "no",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": 8
      },
      "description": {
        "type": "string",
        "store": "no",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": 8
      },
      "title": {
        "type": "string",
        "store": "no",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": 8
      },
      "keyword": {
        "type": "string",
        "store": "no",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": 8
      }
    }
  }
}'

The mapping can be inspected with the head plugin, as shown below.

For importing data and querying, see the code.

import io.searchbox.client.JestClient;
import io.searchbox.client.JestClientFactory;
import io.searchbox.client.config.HttpClientConfig;
import io.searchbox.core.Bulk;
import io.searchbox.core.Index;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import java.util.List;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.SpringApplicationConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

@RunWith(SpringJUnit4ClassRunner.class)
@SpringApplicationConfiguration(classes = ElasticSearchTestApplication.class)
public class JestTestApplicationTests {

    @Autowired
    private KlarticleDao klarticleDao;

    // Build a JestClient instance
    public JestClient getClient() throws Exception {
        JestClientFactory factory = new JestClientFactory();
        factory.setHttpClientConfig(new HttpClientConfig
                .Builder("http://127.0.0.1:9200")
                .multiThreaded(true)
                .build());
        return factory.getObject();
    }

    /**
     * Import the database records into ES.
     * @throws Exception
     */
    @Test
    public void contextLoads() throws Exception {
        JestClient client = getClient();
        List<Klarticle> lists = klarticleDao.findAll();
        // Option 1: index the documents one at a time
        for (Klarticle k : lists) {
            Index index = new Index.Builder(k).index("indexdata").type("fulltext").id(k.getArcid() + "").build();
            System.out.println("Indexing ---->" + k.getTitle());
            client.execute(index);
        }
        // Option 2: bulk indexing, which is more efficient
        Bulk.Builder bulkBuilder = new Bulk.Builder();
        for (Klarticle k : lists) {
            Index index = new Index.Builder(k).index("indexdata").type("fulltext").id(k.getArcid() + "").build();
            bulkBuilder.addAction(index);
        }
        client.execute(bulkBuilder.build());
        client.shutdownClient();
    }

    // Search test
    @Test
    public void JestSearchTest() throws Exception {
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.matchQuery("content", "搜索"));
        Search search = new Search.Builder(searchSourceBuilder.toString())
                // multiple index or types can be added.
                .addIndex("indexdata")
                .build();
        JestClient client = getClient();
        SearchResult result = client.execute(search);
        // List<SearchResult.Hit<Klarticle, Void>> hits = result.getHits(Klarticle.class);
        List<Klarticle> articles = result.getSourceAsObjectList(Klarticle.class);
        for (Klarticle k : articles) {
            System.out.println("------->：" + k.getTitle());
        }
        client.shutdownClient();
    }
}

Below are the jar dependencies (Maven project):

<dependencies>
    <dependency>
        <groupId>io.searchbox</groupId>
        <artifactId>jest</artifactId>
        <version>2.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>2.2.0</version>
    </dependency>
</dependencies>
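The bulk import in the test code goes through Jest's `Bulk.Builder`. On the wire, the ES bulk API expects newline-delimited JSON: one action line followed by one source line per document, ending with a newline. The sketch below builds such a body with only the JDK. It assumes that index names, type names, and ids need no JSON escaping and that each document is already serialized as single-line JSON. `BulkBodySketch` is a hypothetical name, and the sample documents are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the newline-delimited body accepted by the _bulk endpoint. */
final class BulkBodySketch {

    /**
     * Builds one "index" action line plus one source line per document.
     * Keys are document ids; values are single-line JSON documents.
     */
    static String bulkBody(String index, String type, Map<String, String> docsById) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> doc : docsById.entrySet()) {
            sb.append("{\"index\":{\"_index\":\"").append(index)
              .append("\",\"_type\":\"").append(type)
              .append("\",\"_id\":\"").append(doc.getKey())
              .append("\"}}\n");
            sb.append(doc.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is predictable.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "{\"title\":\"first\"}");
        docs.put("2", "{\"title\":\"second\"}");
        System.out.print(bulkBody("indexdata", "fulltext", docs));
    }
}
```

Sending many documents in one request avoids a round trip per document, which is why the bulk variant is the faster of the two import loops.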
View the original post on my blog: http://www.kailing.pub/article/index/arcid/84.html

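Let's talk about which Elasticsearch version everyone uses and look at how Elasticsearch has changed across versions

kl published an article • 4 comments • 7852 views • 2016-03-28 19:12 • from related topics

I recently moved from Lucene to Elasticsearch and went straight to the latest version, 2.2.0. I found that installing plugins offline works differently than before and that some configuration has changed. The biggest problem is that connecting through the Java client API throws the exception below. I followed the official API documentation: https://www.elastic.co/guide/e ... .html

org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream

Every Google result says the server and client JVMs do not match, but I am testing on my local machine, so the problem is still unsolved. Has anyone run into this? Or is it related to the version after all?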

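ES shows performance problems after several hours of large-volume inserts

helloes replied to the question • 2 followers • 1 reply • 12516 views • 2016-03-28 17:09 • from related topics

Pinyin search + Chinese search

niweiyi1314 replied to the question • 11 followers • 5 replies • 23688 views • 2018-04-26 14:08 • from related topics

Discussion: ways to sync data from MySQL to ES

sunyizhen replied to the question • 5 followers • 3 replies • 14110 views • 2016-08-04 17:23 • from related topics

ElasticsearchTimeoutException problem

helloes replied to the question • 2 followers • 1 reply • 10042 views • 2016-03-24 02:10 • from related topics

How to use a custom Similarity plugin

smile_sunshine replied to the question • 2 followers • 2 replies • 6734 views • 2016-03-24 21:04 • from related topics

How does Elasticsearch compute values such as TP50, TP95, and TP99?

medcl replied to the question • 3 followers • 2 replies • 9332 views • 2016-03-24 10:38 • from related topics

Exception when writing data to Elasticsearch with Spark

joe23_2006 replied to the question • 6 followers • 6 replies • 21794 views • 2018-01-31 13:36 • from related topics

Elasticsearch 2.2 cluster configuration

Jea replied to the question • 6 followers • 9 replies • 5645 views • 2017-04-14 13:45 • from related topics

Aggregation: listing the users who logged in more than n times in one day. How should the aggregation JSON be written?

ggchangan replied to the question • 4 followers • 2 replies • 6576 views • 2016-03-22 15:32 • from related topics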
