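I have recently been working on a new requirement:
How can logstash read binary files such as pdf and docx?
After a lot of adjusting and testing, I think the ruby filter should be able to solve this.
However, I ran into a problem on the Ruby side. Ruby does have a gem, docsplit, that can read pdf files, as shown below.
I first define a function pdf_to_text(pdf_filename):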
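require 'docsplit'
require 'tmpdir' # provides Dir.tmpdir

def pdf_to_text(pdf_filename)
  # Docsplit writes <basename>.txt into the output directory.
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)
  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)
  extracted_text = File.read(txt_filename)
  # Remove the temporary text file once its content is in memory.
  File.delete(txt_filename)
  extracted_text
end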
Everything looked fine, but when I passed in the absolute path of a local pdf, pdf_to_text('C:\Ruby27-x64\sample2.pdf'),
it failed with the following error:
Traceback (most recent call last):
12: from C:/Ruby27-x64/bin/irb.cmd:31:in `<main>'
11: from C:/Ruby27-x64/bin/irb.cmd:31:in `load'
10: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/irb-1.2.6/exe/irb:11:in `<top (required)>'
9: from (irb):14
8: from (irb):3:in `pdf_to_text'
7: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit.rb:52:in `extract_text'
6: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:32:in `extract'
5: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:32:in `each'
4: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:38:in `block in extract'
3: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:54:in `extract_from_pdf'
2: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:108:in `extract_full'
1: from C:/Ruby27-x64/lib/ruby/gems/2.7.0/gems/docsplit-0.7.6/lib/docsplit/text_extractor.rb:101:in `run'
Traceback (most recent call last):
17: from C:/Ruby27-x64/bin/irb.cmd:31:in `<main>'
16: from C:/Ruby27-x64/bin/irb.cmd:31:in `load'
15: from C:/Ruby2
I hope someone experienced can help me take a look at this problem.
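The traceback above ends inside docsplit's `run` method in text_extractor.rb, which is where docsplit shells out to an external command. The error message itself is cut off, so the following is only a guess at one thing worth ruling out: as far as I know docsplit relies on Poppler's `pdftotext` for PDF text extraction, and that tool may not be on PATH on a Windows machine. A minimal sketch for checking this (the helper name `which` is my own, not part of docsplit):

```ruby
# Return the full path of an executable found on PATH, or nil if absent.
# On Windows, PATHEXT lists the extensions (.EXE, .CMD, ...) to try.
def which(cmd)
  exts = [''] + (ENV['PATHEXT'] ? ENV['PATHEXT'].split(';') : [])
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).each do |dir|
    exts.each do |ext|
      candidate = File.join(dir, "#{cmd}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

# Assumption: these are the Poppler tools docsplit needs for PDF text.
%w[pdftotext pdfinfo].each do |tool|
  location = which(tool)
  puts(location ? "#{tool}: #{location}" : "#{tool}: not found on PATH")
end
```

If either tool is reported as missing, installing Poppler and adding its bin directory to PATH would be the first thing to try before looking deeper into docsplit itself.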
2 replies
jiping - post-90s IT guy
Upvoted by: liuxg, KnightOfCoder
Function: read the text content of pdf, docx and doc files, then add a content field to the event
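The script behind that description is not shown here, so what follows is only a rough sketch of how such a function could look as a script file for Logstash's ruby filter, which calls `register(params)` once and `filter(event)` per event. The field names `path` and `content`, the parameter names `source_field` and `target_field`, and the failure tag are my own assumptions; it also assumes docsplit and its external tools are installed for the Ruby runtime that Logstash uses.

```ruby
require 'tmpdir'

# Called once at startup with the filter's script_params.
# "source_field" and "target_field" are hypothetical parameter names.
def register(params)
  @source_field = params['source_field'] || 'path'
  @target_field = params['target_field'] || 'content'
end

# Extract plain text with docsplit into a throwaway directory.
def extract_text(filename)
  require 'docsplit' # loaded lazily, only when a document is processed
  Dir.mktmpdir do |dir|
    Docsplit.extract_text([filename], ocr: false, output: dir)
    File.read(File.join(dir, File.basename(filename, '.*') + '.txt'))
  end
end

# Called for every event; must return an array of events.
def filter(event)
  filename = event.get(@source_field)
  begin
    event.set(@target_field, extract_text(filename))
  rescue StandardError => e
    # Keep the event but mark it, so failed documents can be found later.
    event.tag('_extract_text_failure')
    event.set('extract_error', e.message)
  end
  [event]
end
```

The script would be referenced from the pipeline with the ruby filter's `path` option. Tagging failures instead of raising keeps one unreadable document from stopping the pipeline.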
liuxg - Elastic
Upvoted by: