
How can Logstash read PDF, DOCX, and similar files?

Logstash | Author: jiping | Posted October 28, 2020 | Views: 1923

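I have recently been working on a new requirement:
       how to read binary files such as PDF and DOCX through Logstash.
       After several rounds of tweaking and testing, I figured the ruby filter should be able to handle it,
       but I ran into a problem while using Ruby. Ruby does have a gem for reading PDF files, docsplit, used as follows.
       First I define a function pdf_to_text(pdf_filename):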
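       require 'docsplit'
       require 'tmpdir'   # provides Dir.tmpdir
            # Extract the text of a PDF with Docsplit and return it as a string.
            def pdf_to_text(pdf_filename)
                # Docsplit writes <basename>.txt into the output directory
                Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)
                txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
                txt_filename = File.join(Dir.tmpdir, txt_file)
                extracted_text = File.read(txt_filename)
                File.delete(txt_filename)   # clean up the temporary file
                extracted_text
            end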
       Everything looked fine, but when I passed in the absolute path of a local PDF, pdf_to_text('C:\Ruby27-x64\sample2.pdf'),
       it failed with the following error:
       Traceback (most recent call last):
       12: from C:/Ruby27-x64/bin/irb.cmd:31:in `<main>'
       11: from C:/Ruby27-x64/bin/irb.cmd:31:in `load'
       10: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/irb-1.2.6/exe/irb:11:in `<top (required)>'
        9: from (irb):14
        8: from (irb):3:in `pdf_to_text'
        7: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit.rb:52:in `extract_text'
        6: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:32:in `extract'
        5: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:32:in `each'
        4: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:38:in `block in extract'
        3: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:54:in `extract_from_pdf'
        2: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:108:in `extract_full'
        1: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:101:in `run'
Traceback (most recent call last):
        17: from C:/Ruby27-x64/bin/irb.cmd:31:in `<main>'
        16: from C:/Ruby27-x64/bin/irb.cmd:31:in `load'
        15: from C:/Ruby2
I hope someone more experienced can help me take a look at this problem.
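
Docsplit relies on external command-line tools for PDF text extraction, so one way to narrow down a failure like the one above is to call the extraction tool directly and read its error output. The following is only a minimal sketch, assuming Poppler's pdftotext is installed and on PATH; the helper names pdftotext_command and pdf_to_text_via_cli are hypothetical and not part of Docsplit, and the truncated traceback above does not confirm that this is where the problem lies.

```ruby
require 'open3'

# Argument list for Poppler's pdftotext; the trailing "-" sends the text to stdout.
# Passing arguments as an array avoids shell quoting issues with file paths.
def pdftotext_command(pdf_filename)
  ['pdftotext', '-enc', 'UTF-8', pdf_filename, '-']
end

# Hypothetical helper (not part of Docsplit): extract text by running
# pdftotext directly and surfacing its stderr if it fails.
def pdf_to_text_via_cli(pdf_filename)
  raise ArgumentError, "no such file: #{pdf_filename}" unless File.file?(pdf_filename)

  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_filename))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?

  stdout
rescue Errno::ENOENT
  # Raised when the pdftotext executable itself cannot be found.
  raise 'pdftotext was not found on PATH'
end
```

Calling pdf_to_text_via_cli('C:/Ruby27-x64/sample2.pdf') (Ruby on Windows accepts forward slashes in paths) would then either return the text or raise an error that says whether the tool is missing or what it complained about.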

jiping - post-90s IT guy

Upvoted by: liuxg KnightOfCoder

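Problem solved. I got it working by writing a custom Java filter. If you need it, contact me privately and I can give you my compiled plugin.
What it does: reads the text content of pdf, docx, and doc files, then adds a content field to the event.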
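
For readers who want to stay with the ruby filter, the behaviour described above (read a file's text, then add a content field to the event) could be sketched as a script file for the ruby filter's path option. This is not the Java plugin mentioned above, only an illustration under assumptions: the file name, the source_field and target_field parameters, the failure tag, and the assumption that the event's path field holds the file name are all hypothetical, and extract_text is a placeholder that only reads plain text.

```ruby
# extract_content.rb (hypothetical file name), loaded via the ruby filter's `path` option.

# Called once at pipeline startup with the `script_params` hash from the config.
def register(params)
  @source_field = params['source_field'] || 'path'     # hypothetical parameter
  @target_field = params['target_field'] || 'content'  # hypothetical parameter
end

# Placeholder extractor: reads the file as-is. A real PDF/DOCX extractor
# would be swapped in here; this stub only handles plain text.
def extract_text(filename)
  File.read(filename, encoding: 'UTF-8')
end

# Called once per event; the ruby filter expects an array of events back.
def filter(event)
  filename = event.get(@source_field)
  if filename && File.file?(filename)
    event.set(@target_field, extract_text(filename))
  else
    event.tag('_content_extraction_failed')  # hypothetical tag name
  end
  [event]
rescue StandardError
  # Keep the event flowing even if extraction raises.
  event.tag('_content_extraction_failed')
  [event]
end
```

In the pipeline config, a ruby filter block would point at this file with path and could override the field names through script_params.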

liuxg - Elastic


I previously wrote an article about reading PDFs; see "Elasticsearch:如何对 PDF 文件进行搜索" (Elasticsearch: how to search PDF files) https://elasticstack.blog.csdn ... 71230
