How to use ChupaText as a Ruby library
You can use ChupaText as a Ruby library. If you want to extract text from many inputs, the chupa-text command may be inefficient. Each run of the chupa-text command processes one input file, so processing N input files means running the command N times and initializing ChupaText N times.
You can reduce the number of ChupaText initializations by using ChupaText as a Ruby library.
Here is a simple example:
require "chupa-text"
# Activate the HTML decomposer plugin gem.
gem "chupa-text-decomposer-html"

# Load the available decomposers.
ChupaText::Decomposers.load

# Create one extractor and reuse it for every input.
extractor = ChupaText::Extractor.new
extractor.apply_configuration(ChupaText::Configuration.default)

extractor.extract("http://ranguba.org/") do |text_data|
  puts(text_data.body)
end

extractor.extract("http://ranguba.org/ja/") do |text_data|
  puts(text_data.body)
end
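The idea of reusing one extractor for many inputs can be sketched as a small helper. This is only an illustration and not part of ChupaText: `extract_bodies` is a hypothetical name, and the sketch assumes only what the example above shows, namely that the extractor responds to `extract(input)` with a block and that each yielded object responds to `body`.

```ruby
# Hypothetical helper, not part of ChupaText.
# Runs one already-configured extractor over many inputs and
# collects the extracted bodies for each input.
#
# Assumes only the calls shown in this document:
#   extractor.extract(input) { |text_data| ... } and text_data.body
def extract_bodies(extractor, inputs)
  inputs.each_with_object({}) do |input, bodies|
    bodies[input] = []
    extractor.extract(input) do |text_data|
      bodies[input] << text_data.body
    end
  end
end
```

With the `extractor` built above, `extract_bodies(extractor, ["http://ranguba.org/", "http://ranguba.org/ja/"])` would return a Hash keyed by input, while the extractor itself is still created and configured only once.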
It is better to use Bundler to manage decomposer plugins:
# Gemfile
source "https://rubygems.org"
gem "chupa-text-decomposer-html"
gem "chupa-text-decomposer-XXX"
# ...
Here is an example that uses the Gemfile:
# Set up load paths from the Gemfile, then load ChupaText itself.
require "bundler/setup"
require "chupa-text"

ChupaText::Decomposers.load

extractor = ChupaText::Extractor.new
extractor.apply_configuration(ChupaText::Configuration.default)

extractor.extract("http://ranguba.org/") do |text_data|
  puts(text_data.body)
end

extractor.extract("http://ranguba.org/ja/") do |text_data|
  puts(text_data.body)
end
Use ChupaText::Data#[] to get meta-data from extracted text data. For example, you can get the title from input HTML:
extractor.extract("http://ranguba.org/") do |text_data|
  puts(text_data["title"])
end
Which meta-data is available depends on the decomposer. See each decomposer's documentation for details.
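Because the available meta-data differs by decomposer, code that reads it may want a fallback value. The following is a minimal sketch under stated assumptions: `title_or` is a hypothetical helper, not part of ChupaText, and it assumes that `[]` returns nil (or an empty value) when the decomposer did not provide the key. Check the decomposer's documentation before relying on that behavior.

```ruby
# Hypothetical helper, not part of ChupaText.
# Returns the "title" meta-data if present, otherwise a fallback.
# Assumption: text_data["title"] returns nil when the key is missing.
def title_or(text_data, fallback = "(untitled)")
  title = text_data["title"]
  # Treat nil and blank titles the same way.
  return fallback if title.nil? || title.to_s.strip.empty?
  title
end
```

Inside an `extract` block, this could be called as `puts(title_or(text_data))`.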