How to use ChupaText as a Ruby library

You can use ChupaText as a Ruby library. If you want to extract text from many inputs, the chupa-text command may be inefficient. Each invocation of the chupa-text command processes one input file, so processing N input files requires running the command N times. That means ChupaText is initialized N times.

You can reduce the number of ChupaText initializations by using ChupaText as a Ruby library.

Here is a simple usage example:

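require "chupa-text"
# Activate the HTML decomposer plugin gem so that it can be loaded.
gem "chupa-text-decomposer-html"

# Load the available decomposer plugins.
ChupaText::Decomposers.load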

extractor = ChupaText::Extractor.new
extractor.apply_configuration(ChupaText::Configuration.default)

extractor.extract("http://ranguba.org/") do |text_data|
  puts(text_data.body)
end
extractor.extract("http://ranguba.org/ja/") do |text_data|
  puts(text_data.body)
end
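
The saving described above comes from creating the extractor once and reusing it for every input. A minimal sketch of that pattern follows. The helper name `extract_bodies` is hypothetical and not part of ChupaText. It only assumes that the object passed in responds to `extract` and yields data objects that respond to `body`, as in the example above.

```ruby
# Hypothetical helper (not part of ChupaText): runs one already
# configured extractor over many inputs and collects the extracted
# bodies for each input.
def extract_bodies(extractor, inputs)
  bodies = {}
  inputs.each do |input|
    bodies[input] = []
    # The same extractor instance is reused for every input, so no
    # new extractor is created or configured inside the loop.
    extractor.extract(input) do |text_data|
      bodies[input] << text_data.body
    end
  end
  bodies
end

# Usage with an extractor built as in the example above:
#   bodies = extract_bodies(extractor,
#                           ["http://ranguba.org/",
#                            "http://ranguba.org/ja/"])
```

Returning a hash keyed by input keeps the bodies of each input together, because one input may yield more than one text data object.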

It is better to use Bundler to manage decomposer plugins:

# Gemfile
source "https://rubygems.org"

gem "chupa-text-decomposer-html"
gem "chupa-text-decomposer-XXX"
# ...

Here is a usage example that uses the Gemfile:

require "bundler/setup"
require "chupa-text"

ChupaText::Decomposers.load

extractor = ChupaText::Extractor.new
extractor.apply_configuration(ChupaText::Configuration.default)

extractor.extract("http://ranguba.org/") do |text_data|
  puts(text_data.body)
end
extractor.extract("http://ranguba.org/ja/") do |text_data|
  puts(text_data.body)
end

Use ChupaText::Data#[] to get meta-data from the extracted text data. For example, you can get the title of an input HTML document:

extractor.extract("http://ranguba.org/") do |text_data|
  puts(text_data["title"])
end

Which meta-data is available depends on the decomposer. See each decomposer's documentation for details.
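
Because the available meta-data varies, calling code may want to tolerate keys that are absent. The sketch below assumes that `#[]` returns `nil` for a key that is not set, which you should confirm against the ChupaText version you use. The helper name `pick_meta_data` and the `"author"` key are hypothetical illustrations, not part of ChupaText.

```ruby
# Hypothetical helper (not part of ChupaText): reads the given
# meta-data keys through #[] and keeps only those that have a value.
# Assumption: a key that is not set is returned as nil.
def pick_meta_data(text_data, keys)
  picked = {}
  keys.each do |key|
    value = text_data[key]
    picked[key] = value unless value.nil?
  end
  picked
end

# Usage inside an extract block such as the one above:
#   extractor.extract("http://ranguba.org/") do |text_data|
#     p(pick_meta_data(text_data, ["title", "author"]))
#   end
```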