Crawling the Elastic Chinese Community with a Java Crawler to Build Elasticsearch Test Data

Elasticsearch | Author: kl | Published 2016-03-29 | Views: 9251

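Preface
To put Elasticsearch (es) through its paces, I used a crawler to collect a large amount of data from the Elastic Chinese community and CSDN to use as test data. Below is a brief account of how it went.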
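Getting to know WebCollector
WebCollector is a Java crawler framework (kernel) that needs no configuration and is easy to build on. It provides a compact API, so a capable crawler takes only a small amount of code. WebCollector-Hadoop is the Hadoop version of WebCollector and supports distributed crawling.
WebCollector aims to maintain a stable, extensible crawler kernel that developers can flexibly build on. Because the kernel is highly extensible, users can develop the crawler they want on top of it. The source integrates Jsoup for precise HTML parsing, and the 2.x versions integrate selenium to handle JavaScript-generated data.
Official site: http://crawlscript.github.io/WebCollector/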
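Usage
Add the jar dependency. Mine is a Maven project, so I added the WebCollector dependency to pom.xml.
PS: I'm using the latest version. The newest release in the Maven repository is currently 2.09, so if you want the latest you'll need to download the source and build it yourself.
With the environment in place, create a class that extends BreadthCrawler and overrides the visit method. All of your processing logic goes in visit. My code is below.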
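To make the "extend the crawler, override visit" idea concrete, here is a minimal, library-free sketch of a breadth-first crawl with a visit callback. It is not WebCollector's implementation: `BreadthFirstSketch`, the in-memory `links` map, and the `maxDepth` parameter are all hypothetical names made up for illustration, and the real framework passes a `Page` and a `CrawlDatums` to `visit` rather than a plain URL string.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical illustration only; not part of WebCollector. */
class BreadthFirstSketch {

    /**
     * Walks an in-memory link graph layer by layer, starting at the seed.
     * The seed is depth 0; pages more than maxDepth hops away are never visited.
     * Returns the URLs in the order they were visited.
     */
    static List<String> crawl(String seed,
                              Map<String, List<String>> links,
                              int maxDepth,
                              Consumer<String> visit) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();   // avoids visiting a URL twice
        Queue<String> current = new ArrayDeque<>();
        current.add(seed);
        seen.add(seed);

        for (int depth = 0; depth <= maxDepth && !current.isEmpty(); depth++) {
            Queue<String> nextLayer = new ArrayDeque<>();
            for (String url : current) {
                visit.accept(url);            // user logic, like an overridden visit()
                order.add(url);
                for (String out : links.getOrDefault(url, Collections.emptyList())) {
                    if (seen.add(out)) {      // add() is false if already seen
                        nextLayer.add(out);
                    }
                }
            }
            current = nextLayer;              // move on to the next layer
        }
        return order;
    }
}
```

The point of the layered loop is that the crawl depth bounds how far from the seed the crawler wanders, which is why a shallow run can finish in a few minutes.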
Crawling the Elastic Chinese community

/**
 * Created by 小陈 on 2016/3/29.
 */
@Component
public class ElasticCrawler extends BreadthCrawler {

    @Autowired
    IpaDao ipaDao;

    public ElasticCrawler() {
        super("crawl", true);
        /* start page (redacted) */
        this.addSeed("xxx");
        /* fetch URLs matching this pattern (redacted) */
        this.addRegex("xxx");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs containing # */
//        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        String content = "";
        try {
            /* extract the main body text of the page */
            content = ContentExtractor.getContentByUrl(url);
        } catch (Exception e) {
            e.printStackTrace();
        }
        /* extract the title */
        String title = page.getDoc().title();
        System.out.println("-------------------->" + title);
        /* keep only pages that have both a title and body text */
        if (!title.isEmpty() && !content.isEmpty()) {
            Pa pa = new Pa(title, content);
            ipaDao.save(pa); // persist to the database
        }
    }
}

Crawling CSDN

/**
 * @author kl by 2016/3/29
 * @boke www.kailing.pub
 */
@Component
public class CSDNCrawler extends BreadthCrawler {

    @Autowired
    IpaDao ipaDao;

    public CSDNCrawler() {
        super("crawl", true);
        /* start page: a seed is a concrete URL, not a pattern */
        this.addSeed("http://blog.csdn.net/");
        /* fetch URLs like http://blog.csdn.net/<user>/article/details/<id> */
        this.addRegex("http://blog.csdn.net/.*/article/details/.*");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs containing # */
//        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        /* only article pages are stored; other pages just supply links */
        if (page.matchUrl("http://blog.csdn.net/.*/article/details/.*")) {
            String content = "";
            try {
                /* extract the main body text of the article */
                content = ContentExtractor.getContentByUrl(url);
            } catch (Exception e) {
                e.printStackTrace();
            }
            /* note: first() returns null if the selector matches nothing */
            String title = page.select("div[class=article_title]").first().text();
            String author = page.select("div[id=blog_userface]").first().text(); // author name
            System.out.println("title:" + title + "\tauthor:" + author);
            if (!title.isEmpty() && !content.isEmpty()) {
                Pa pa = new Pa(title, content);
                ipaDao.save(pa); // persist to the database
            }
        }
    }
}

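PS: The crawl rules for the Elastic Chinese community have been redacted, because I care about the community; feel free to crawl CSDN instead. WebCollector is very capable. A key part of writing a crawler is knowing the target site's URL patterns, so look into that if you're interested. The Elastic community doesn't have much data, so about a minute was enough. The CSDN crawl ran for 5 or 6 minutes without going deep and collected roughly 200,000 to 300,000 records, keeping only the title and body text.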
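Since the URL rules are the crux, here is a small self-contained sketch of how include and exclude patterns like the ones passed to addRegex above can decide whether a link is followed. It assumes, as my reading of the rules rather than anything confirmed by the library's documentation, that a leading "-" marks an exclude rule, that a leading "+" or no prefix marks an include rule, and that a pattern must match the whole URL. `UrlRuleFilter` is a hypothetical class, and the user name `kl` in the sample URLs is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not WebCollector's own rule class. */
class UrlRuleFilter {
    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    /** "-regex" excludes matching URLs; "+regex" or plain "regex" includes them. */
    UrlRuleFilter addRule(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else if (rule.startsWith("+")) {
            positive.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
        return this;
    }

    /** Accepts a URL only if no exclude rule matches and some include rule does. */
    boolean accept(String url) {
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) {
                return false;   // exclude rules win over include rules
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;           // nothing included this URL
    }

    public static void main(String[] args) {
        UrlRuleFilter filter = new UrlRuleFilter()
                .addRule("http://blog.csdn.net/.*/article/details/.*")
                .addRule("-.*\\.(jpg|png|gif).*");
        System.out.println(filter.accept("http://blog.csdn.net/kl/article/details/123"));
        System.out.println(filter.accept("http://blog.csdn.net/kl/article/details/123.png"));
    }
}
```

Checking the exclude rules first means an image link is dropped even when it sits under an article path, which matches the intent of the jpg|png|gif rule in both crawlers.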

Read the original post on my blog: http://www.kailing.pub/article/index/arcid/86.html
Below is a screenshot of the imported data.