Module: Ariel
Defined in:
lib/ariel.rb,
lib/ariel/log.rb,
lib/ariel/node.rb,
lib/ariel/rule.rb,
lib/ariel/token.rb,
lib/ariel/learner.rb,
lib/ariel/rule_set.rb,
lib/ariel/wildcards.rb,
lib/ariel/label_utils.rb,
lib/ariel/token_stream.rb,
lib/ariel/node/extracted.rb,
lib/ariel/node/structure.rb,
lib/ariel/candidate_refiner.rb,
lib/ariel/labeled_document_loader.rb
Overview
Ariel - A Ruby Information Extraction Library
Ariel intends to assist in extracting information from semi-structured documents including (but not in any way limited to) web pages. Although you may use libraries such as Hpricot or Rubyful Soup, or even plain Regular Expressions to achieve the same goal, Ariel approaches the problem very differently. Ariel relies on the user labeling examples of the data they want to extract, and then finds patterns across several such labeled examples in order to produce a set of general rules for extracting this information from any similar document.
When working with Ariel, your workflow might look something like this:
- Define a structure for the data you wish to extract. For example:

    @structure = Ariel::StructureNode.new do |r|
      r.item :article do |a|
        a.item :title
        a.item :author
        a.item :date
        a.item :body
      end
      r.list :comments do |c|
        c.list_item :comment do |c|
          c.item :author
          c.item :date
          c.item :body
        end
      end
    end

- Label these fields in a few example documents (normally at least 3). Labels take the form <l:label_name>...</l:label_name>
- Ariel will read these examples and try to generate suitable rules that can be used to extract this data from other similarly structured documents. Use Ariel.learn to start rule learning.
- A wrapper has now been generated: you can load documents with the same structure (normally documents generated by the same rules, such as different pages from a single site) and query the extracted data. See Ariel.extract.
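To make the label format from the second step concrete, here is a small sketch of what a labeled example might look like, together with two plain regular-expression helpers that inspect it. The sample HTML and the `label_names` and `labeled_text` helpers are illustrative assumptions; they are not part of Ariel and do not reflect how Ariel parses labels internally.

```ruby
# Hypothetical labeled example: the <l:...> tags wrap the fields to extract.
LABELED_EXAMPLE = <<~HTML
  <html><body>
  <h1><l:title>Ariel released</l:title></h1>
  <p>by <l:author>A. Writer</l:author></p>
  </body></html>
HTML

# Illustrative helper (not part of Ariel): label names in document order.
def label_names(text)
  text.scan(/<l:(\w+)>/).flatten
end

# Illustrative helper (not part of Ariel): the text wrapped by one label,
# or nil when the label does not occur.
def labeled_text(text, name)
  escaped = Regexp.escape(name)
  match = text.match(%r{<l:#{escaped}>(.*?)</l:#{escaped}>}m)
  match && match[1]
end
```

A quick check such as `label_names(LABELED_EXAMPLE)` can help confirm that every field in the structure has been labeled before the examples are handed to the learner.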
Defined Under Namespace
Modules: LabelUtils
Classes: CandidateRefiner, LabeledDocumentLoader, Learner, Log, Node, Rule, RuleSet, Token, TokenStream, Wildcards
Class Method Summary
- .extract(structure, *files_to_extract) ⇒ Object
  Uses the given root Node::Structure to extract information from each of the given files. Each file can be any object responding to #read; a string is treated as a path and read using File.read.
- .learn(structure, *labeled_files) ⇒ Object
  Given a root Node::Structure and a list of labeled_files (either IO objects or strings representing files that can be opened with File.read), learns rules using the labeled examples. The passed Node::Structure tree is returned with new RuleSets added containing the learnt rules. This structure can then be used with Ariel.extract on unlabeled documents.
Class Method Details
.extract(structure, *files_to_extract) ⇒ Object
Uses the given root Node::Structure to extract information from each of the given files. Each file can be any object responding to #read; a string is treated as a path and read using File.read. If a block is given, each root Node::Extracted is yielded. Returns an array containing every root extracted node.
    Ariel.extract structure, 'file1.txt', fileobj, 'file2.html' # => an array of 3 Node::Extracted objects
    # File 'lib/ariel.rb', line 75
    def extract(structure, *files_to_extract)
      raise ArgumentError, "Passed structure is not the parent of the document tree" unless structure.parent.nil?
      extractions=[]
      collect_strings(files_to_extract).each do |string|
        tokenstream = TokenStream.new
        tokenstream.tokenize string
        root_node=Ariel::Node::Extracted.new :root, tokenstream, structure
        structure.apply_extraction_tree_on root_node
        extractions << root_node
        yield root_node if block_given?
      end
      return extractions
    end
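The calling convention described for extract (accept paths or IO-like objects, yield each result to an optional block, and return all results as an array) can be sketched without Ariel itself. The `read_sources` and `each_document` names below are hypothetical; the real `collect_strings` helper is not shown on this page, so this is only a guess at its general shape.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical stand-in for the helper that extract calls; the actual
# collect_strings implementation is not shown here.
def read_sources(sources)
  sources.map do |source|
    # IO-like objects are read directly; anything else is treated as a path.
    source.respond_to?(:read) ? source.read : File.read(source)
  end
end

# Mirrors the documented convention: yield each document to an optional
# block and also return every document in an array.
def each_document(*sources)
  read_sources(sources).map do |text|
    yield text if block_given?
    text
  end
end

Tempfile.create('ariel_sketch') do |file|
  file.write('from a path')
  file.flush
  seen = []
  docs = each_document(file.path, StringIO.new('from an IO')) { |text| seen << text }
  # docs and seen now both hold the two document strings, in argument order.
end
```

Returning the array while also yielding lets a caller either process each document as it is produced or collect everything first, which matches the behavior described for extract.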
.learn(structure, *labeled_files) ⇒ Object
Given a root Node::Structure and a list of labeled_files (either IO objects or strings representing files that can be opened with File.read), learns rules using the labeled examples. The passed Node::Structure tree is returned with new RuleSets added containing the learnt rules. This structure can then be used with Ariel.extract on unlabeled documents.

    Ariel.learn structure, 'file1.html', fileobj, 'file2.html'
    # File 'lib/ariel.rb', line 61
    def learn(structure, *labeled_files)
      raise ArgumentError, "Passed structure is not the parent of the document tree" unless structure.parent.nil?
      labeled_strings=collect_strings(labeled_files)
      return LabeledDocumentLoader.supervise_learning(structure, *labeled_strings)
    end
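Both learn and extract begin with the same guard: the structure passed in must be the root of the tree, meaning its parent is nil. The sketch below isolates that check using a hypothetical `SketchNode` struct and an `ensure_root!` helper; neither name exists in Ariel, and the stand-in node models only the parent link.

```ruby
# Hypothetical minimal node: only the parent link matters for this check.
SketchNode = Struct.new(:name, :parent)

# Same condition and message as the guard in learn and extract, applied to
# the stand-in node. Returns the node when it is a root.
def ensure_root!(structure)
  return structure if structure.parent.nil?

  raise ArgumentError, 'Passed structure is not the parent of the document tree'
end

root  = SketchNode.new(:root, nil)
child = SketchNode.new(:article, root)

ensure_root!(root) # a root node passes through unchanged

begin
  ensure_root!(child)
rescue ArgumentError
  # A non-root node is rejected before any learning or extraction starts.
end
```

In practice this means a subtree such as the :article node cannot be passed on its own; the whole structure tree should be handed to Ariel.learn or Ariel.extract.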