How to create a decomposer
You can extend ChupaText with Ruby. You can add a supported input type by writing a decomposer module.
Overview
A decomposer is a Ruby class. It needs the following two methods:
target?
decompose
Both of them accept only one argument, data. data is the input data.
First, ChupaText calls the target? method of your decomposer. If your decomposer can decompose the input data, your target? method should return true.
If your decomposer's target? method returns true, ChupaText calls the decompose method of your decomposer. Your decomposer needs to decompose the input data and yield extracted text data or data in another format that will be decomposed by other decomposers. Your decomposer can yield multiple times.
If your decomposer decomposes an archive file such as a tar or zip archive, your decompose method will yield data in other formats. If your decomposer extracts text and meta-data from an input such as HTML, your decompose method will yield text data.
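The flow described above can be sketched in plain Ruby. This is only an illustration of the target?/decompose protocol, not ChupaText's actual implementation: SketchData, the two decomposer classes and sketch_extract are hypothetical stand-ins invented for this example.

```ruby
# Stand-in for ChupaText's data objects (hypothetical, for illustration only).
SketchData = Struct.new(:kind, :body)

# Pretends to unpack an archive: yields each entry as new, non-text data.
class SketchArchiveDecomposer
  def target?(data)
    data.kind == :archive
  end

  def decompose(data)
    data.body.each do |entry_body|
      yield(SketchData.new(:markup, entry_body))
    end
  end
end

# Pretends to extract text from markup: yields text data.
class SketchMarkupDecomposer
  def target?(data)
    data.kind == :markup
  end

  def decompose(data)
    yield(SketchData.new(:text, data.body.gsub(/<.+?>/m, "")))
  end
end

# Passes data to the first decomposer whose target? returns true and
# recurses into whatever it yields until only text data remains.
def sketch_extract(data, decomposers, &block)
  return yield(data) if data.kind == :text
  decomposer = decomposers.find {|candidate| candidate.target?(data)}
  return if decomposer.nil?
  decomposer.decompose(data) do |decomposed|
    sketch_extract(decomposed, decomposers, &block)
  end
end

decomposers = [SketchArchiveDecomposer.new, SketchMarkupDecomposer.new]
archive = SketchData.new(:archive, ["<p>Hello</p>", "<p>World</p>"])
texts = []
sketch_extract(archive, decomposers) {|text_data| texts << text_data.body}
p texts  # => ["Hello", "World"]
```

The archive decomposer yields twice, and each yielded entry is handed to the markup decomposer, which yields the final text.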
Example
Let's create a simple XML decomposer as an example. It extracts text data from input XML.
For example, here is an input XML:
<root>
  Hello <em>&amp;</em> World!
</root>
The XML decomposer extracts the following text:
Hello & World!
ChupaText provides the chupa-text-generate-decomposer command. It generates skeleton code for a new decomposer. Let's use it.
chupa-text-generate-decomposer accepts the required information through command line options or from standard input. You can check the required information with the --help option:
% chupa-text-generate-decomposer --help
Usage: chupa-text-generate-decomposer [options]
--name=NAME Decomposer name
(e.g.: html)
--extensions=EXTENSION1,EXTENSION2,...
Target file extensions
(e.g.: htm,html,xhtml)
--mime-types=TYPE1,TYPE2,... Target MIME types
(e.g.: text/html,application/xhtml+xml)
--author=AUTHOR Author
(e.g.: 'Your Name')
(default: Kouhei Sutou)
--email=EMAIL Author E-mail
(e.g.: [email protected])
(default: [email protected])
--license=LICENSE License
(e.g.: MIT)
(default: LGPLv2.1 or later)
Some pieces of information have default values. In the above case, --author, --email and --license have default values.
The XML decomposer uses the following information:
--name: xml
--extensions: xml
--mime-types: text/xml
Run with the above information:
% chupa-text-generate-decomposer --name xml --extensions xml --mime-types text/xml
Creating directory: chupa-text-decomposer-xml
Creating file: chupa-text-decomposer-xml/chupa-text-decomposer-xml.gemspec
Creating file: chupa-text-decomposer-xml/Gemfile
Creating file: chupa-text-decomposer-xml/Rakefile
Creating file: chupa-text-decomposer-xml/LICENSE.txt
Creating directory: chupa-text-decomposer-xml/lib/chupa-text/decomposers
Creating file: chupa-text-decomposer-xml/lib/chupa-text/decomposers/xml.rb
Creating directory: chupa-text-decomposer-xml/test
Creating file: chupa-text-decomposer-xml/test/test-xml.rb
Creating file: chupa-text-decomposer-xml/test/helper.rb
Creating file: chupa-text-decomposer-xml/test/run-test.rb
chupa-text-generate-decomposer generates a directory named chupa-text-decomposer-#{name}/.
Look at lib/chupa-text/decomposers/xml.rb:
module ChupaText
  module Decomposers
    class Xml < Decomposer
      def target?(data)
        ["xml"].include?(data.extension) or
          ["text/xml"].include?(data.mime_type)
      end

      def decompose(data)
        raise NotImplementedError, "#{self.class}##{__method__} isn't implemented yet."
        text = "IMPLEMENTED ME"
        text_data = TextData.new(text)
        yield(text_data)
      end
    end
  end
end
The generated code implements the target? method but doesn't implement the decompose method completely. Let's implement the decompose method:
require "cgi"

# ...
      def decompose(data)
        text = CGI.unescapeHTML(untag(data.body).strip)
        text_data = TextData.new(text)
        yield(text_data)
      end

      private
      def untag(xml)
        xml.gsub(/<.+?>/m, "")
      end
# ...
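To see what this implementation does without installing anything, the same two steps can be tried on a plain String. The sketch below is a minimal standalone check. extract_text is a hypothetical helper name used only here, and the input is assumed to be the sample XML with the ampersand written as the entity &amp;amp;.

```ruby
require "cgi"

# Same helper as in the decomposer: drops anything that looks like a tag.
def untag(xml)
  xml.gsub(/<.+?>/m, "")
end

# Same pipeline as decompose, applied to a String instead of data.body
# (extract_text is a hypothetical name for this illustration).
def extract_text(xml)
  CGI.unescapeHTML(untag(xml).strip)
end

xml = "<root>\n  Hello <em>&amp;</em> World!\n</root>\n"
puts extract_text(xml)  # prints "Hello & World!"
```

The regular expression is enough for this simple example, but it does not understand comments or CDATA sections. A real decomposer would probably be better served by an XML parser.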
chupa-text-generate-decomposer also generates a test. Run the test:
% bundle install
% rake
/usr/bin/ruby2.0 test/run-test.rb
Loaded suite .
Started
F
===============================================================================
Failure:
test_body(decompose)
/tmp/chupa-text-decomposer-xml/test/test-xml.rb:24:in `test_body'
21: def test_body
22: input_body = "TODO (input)"
23: expected_text = "TODO (extracted)"
=> 24: assert_equal([expected_text],
25: decompose(input_body).collect(&:body))
26: end
27: end
<["TODO (extracted)"]> expected but was
<["TODO (input)"]>
diff:
? ["TODO (ex tracted)"]
? inpu
===============================================================================
Finished in 0.013355116 seconds.
1 tests, 1 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
0% passed
74.88 tests/s, 74.88 assertions/s
rake aborted!
Command failed with status (1): [/usr/bin/ruby2.0 test/run-test.rb...]
/tmp/chupa-text-decomposer-xml/Rakefile:9:in `block in <top (required)>'
The generated test fails because the test has placeholders. Look at the generated test:
class TestXml < Test::Unit::TestCase
  include Helper

  def setup
    @decomposer = ChupaText::Decomposers::Xml.new({})
  end

  sub_test_case("decompose") do
    def decompose(input_body)
      data = ChupaText::Data.new
      data.mime_type = "text/xml"
      data.body = input_body

      decomposed = []
      @decomposer.decompose(data) do |decomposed_data|
        decomposed << decomposed_data
      end
      decomposed
    end

    def test_body
      input_body = "TODO (input)"
      expected_text = "TODO (extracted)"
      assert_equal([expected_text],
                   decompose(input_body).collect(&:body))
    end
  end
end
test_body has TODO markers as placeholders:
# ...
    def test_body
      input_body = "TODO (input)"
      expected_text = "TODO (extracted)"
      assert_equal([expected_text],
                   decompose(input_body).collect(&:body))
    end
# ...
Replace the TODO markers with test XML and the expected result:
# ...
    def test_body
      input_body = <<-XML
<root>
  Hello <em>&amp;</em> World!
</root>
      XML
      expected_text = "Hello & World!"
      assert_equal([expected_text],
                   decompose(input_body).collect(&:body))
    end
# ...
Run the test again:
% rake
/usr/bin/ruby2.0 test/run-test.rb
Loaded suite .
Started
.
Finished in 0.000915172 seconds.
1 tests, 1 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed
1092.69 tests/s, 1092.69 assertions/s
The test passes!
You can release the decomposer with the following command. It requires an account on https://rubygems.org/.
% rake release
Now you know how to create a new decomposer.
API reference
data
Both target? and decompose receive one argument, data. It is a ChupaText::Data instance or an instance of one of its subclasses. You only need to consult the API reference manual for ChupaText::Data. Don't use subclass-specific API because it is not portable.
target?
target? should return true or false. The decomposer should return true if it can decompose the received data, false otherwise.
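As a minimal sketch of this contract, the following standalone code mirrors the extension and MIME type check from the generated skeleton. SketchInput and SketchXmlDecomposer are hypothetical stand-ins so that the example runs without ChupaText. In a real decomposer, data is a ChupaText::Data and the class inherits from Decomposer.

```ruby
# Stand-in for ChupaText::Data with only the two attributes used here
# (hypothetical, for illustration only).
SketchInput = Struct.new(:extension, :mime_type)

class SketchXmlDecomposer
  EXTENSIONS = ["xml"]
  MIME_TYPES = ["text/xml"]

  # Returns true when either the extension or the MIME type matches.
  def target?(data)
    EXTENSIONS.include?(data.extension) ||
      MIME_TYPES.include?(data.mime_type)
  end
end

decomposer = SketchXmlDecomposer.new
p decomposer.target?(SketchInput.new("xml", nil))           # => true
p decomposer.target?(SketchInput.new(nil, "text/xml"))      # => true
p decomposer.target?(SketchInput.new("txt", "text/plain"))  # => false
```

Checking both attributes lets the decomposer accept input that has only a file name or only a MIME type.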
decompose
decompose decomposes the input data and yields extracted text data or decomposed data of another type. decompose can yield zero or more times.
Here is template code that yields extracted text data:
def decompose(data)
  text = extract_text(data)
  text_data = ChupaText::TextData.new(text)
  # text_data["meta-data1"] = meta_data_value1
  # text_data["meta-data2"] = meta_data_value2
  # ...
  yield(text_data)
end
See lib/chupa-text/decomposers/csv.rb as an example of extracting text data.
Here is template code that yields data of another type:
def decompose(data)
  entries = decompose_archive(data)
  entries.each do |entry|
    path = entry.path
    if entry.respond_to?(:read)
      # The input must have "read" method.
      input = entry
    else
      # If the entry doesn't have "read" method, wrap String data
      # by StringIO.
      input = StringIO.new(entry.data)
    end
    decomposed_data = ChupaText::VirtualFileData.new(path, input)
    decomposed_data.source = data
    yield(decomposed_data)
  end
end
See lib/chupa-text/decomposers/tar.rb as an example of decomposing to other type data.
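The "read" check in the archive template can be tried in isolation. In the sketch below, StreamEntry, StringEntry and readable_input are hypothetical names invented for this illustration. Real archive libraries provide their own entry classes, which may or may not respond to read.

```ruby
require "stringio"

# Hypothetical archive entries: one is already readable, the other only
# holds its content as a String.
StreamEntry = Struct.new(:path, :io) do
  def read(*args)
    io.read(*args)
  end
end
StringEntry = Struct.new(:path, :data)

# Returns an object that responds to "read", wrapping String data in
# StringIO when the entry itself is not readable.
def readable_input(entry)
  if entry.respond_to?(:read)
    entry
  else
    StringIO.new(entry.data)
  end
end

entries = [
  StreamEntry.new("a.txt", StringIO.new("from stream")),
  StringEntry.new("b.txt", "from string"),
]
contents = entries.collect {|entry| readable_input(entry).read}
p contents  # => ["from stream", "from string"]
```

Either way, the caller ends up with an object it can read from, which is the kind of input the template passes to ChupaText::VirtualFileData.new.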