Class: Undress::Grammar
Overview
Grammars give you a DSL to declare how to convert an HTML document into a different markup language.
Direct Known Subclasses
Instance Attribute Summary
-
#post_processing_rules ⇒ Object
readonly
:nodoc:.
-
#pre_processing_rules ⇒ Object
readonly
:nodoc:.
Class Method Summary
-
.default(&handler) ⇒ Object
Set a default rule for unrecognized tags.
-
.inherited(base) ⇒ Object
:nodoc:.
-
.post_processing(regexp, replacement = nil, &handler) ⇒ Object
Add a post-processing rule to your parser.
-
.post_processing_rules ⇒ Object
:nodoc:.
-
.pre_processing(selector, &handler) ⇒ Object
Add a pre-processing rule to your parser.
-
.pre_processing_rules ⇒ Object
:nodoc:.
-
.process!(node) ⇒ Object
:nodoc:.
-
.rule_for(*tags, &handler) ⇒ Object
Add a parsing rule for a group of HTML tags.
Instance Method Summary
-
#complete_word?(node) ⇒ Boolean
Helper to determine whether a node contains a whole word. Useful, for example, when converting a single italic letter inside a word.
-
#content_of(node) ⇒ Object
Get the result of parsing the contents of a node.
-
#initialize ⇒ Grammar
constructor
:nodoc:.
-
#method_missing(tag, node, *args) ⇒ Object
:nodoc:.
-
#process(nodes) ⇒ Object
Process a DOM node, converting it to your markup language according to your defined rules.
-
#process!(node) ⇒ Object
:nodoc:.
-
#surrounded_by_whitespace?(node) ⇒ Boolean
Helper method that tells you if the given DOM node is immediately surrounded by whitespace.
Constructor Details
#initialize ⇒ Grammar
:nodoc:
# File 'lib/undress/grammar.rb', line 79
def initialize #:nodoc:
  @pre_processing_rules = self.class.pre_processing_rules.dup
  @post_processing_rules = self.class.post_processing_rules.dup
end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(tag, node, *args) ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 142
def method_missing(tag, node, *args) #:nodoc:
  process(node.children)
end
Instance Attribute Details
#post_processing_rules ⇒ Object (readonly)
:nodoc:
# File 'lib/undress/grammar.rb', line 77
def post_processing_rules
  @post_processing_rules
end
#pre_processing_rules ⇒ Object (readonly)
:nodoc:
# File 'lib/undress/grammar.rb', line 76
def pre_processing_rules
  @pre_processing_rules
end
Class Method Details
.default(&handler) ⇒ Object
Set a default rule for unrecognized tags.
Unless you set one, unrecognized tags are ignored and only their contents are output.
# File 'lib/undress/grammar.rb', line 30
def self.default(&handler) # :yields: element
  define_method :method_missing do |tag, node, *args|
    handler.call(node)
  end
end
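The fallback mechanism can be tried without Undress or Hpricot. The following is a minimal sketch, not the library's code: FakeTag, FallbackGrammar and convert are hypothetical names, and the struct only mimics the two accessors the handler reads.

```ruby
# Hypothetical stand-in for an Hpricot element; only name and text are modelled.
FakeTag = Struct.new(:name, :text)

class FallbackGrammar
  # Same technique as Grammar.default: redefine method_missing so that
  # any tag without its own method is passed to the handler.
  def self.default(&handler)
    define_method(:method_missing) do |tag, node, *args|
      handler.call(node)
    end
  end

  default { |element| "[unknown #{element.name}: #{element.text}]" }

  # A tag with an explicit rule never reaches the default handler.
  def b(element)
    "*#{element.text}*"
  end

  # Dispatch on the tag name (hypothetical helper for this sketch).
  def convert(element)
    send(element.name.to_sym, element)
  end
end

grammar = FallbackGrammar.new
puts grammar.convert(FakeTag.new("b", "bold"))      # prints *bold*
puts grammar.convert(FakeTag.new("blink", "retro")) # prints [unknown blink: retro]
```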
.inherited(base) ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 5
def self.inherited(base) # :nodoc:
  base.instance_variable_set(:@post_processing_rules, post_processing_rules)
  base.instance_variable_set(:@pre_processing_rules, pre_processing_rules)
end
.post_processing(regexp, replacement = nil, &handler) ⇒ Object
Add a post-processing rule to your parser.
This takes a regular expression that will be applied to the output after processing any nodes. It can take a string as a replacement, or a block that will be passed to String#gsub.
post_processing(/\n\n+/, "\n\n") # compress more than two newlines
post_processing(/whatever/) { ... }
# File 'lib/undress/grammar.rb', line 44
def self.post_processing(regexp, replacement = nil, &handler) #:yields: matched_string
  post_processing_rules[regexp] = replacement || handler
end
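The two rule forms correspond to the two call styles of String#gsub!. A minimal sketch in plain Ruby, assuming a rules hash shaped like the one post_processing fills. The sample text and the second rule are made up for illustration.

```ruby
# Stand-in for the rules hash: regexp => String replacement or callable.
rules = {
  /\n\n+/   => "\n\n",                          # string replacement
  /\bfoo\b/ => lambda { |match| match.upcase }  # block replacement (hypothetical rule)
}

text = "foo\n\n\n\nbar".dup # dup so the string can be modified in place

# Same dispatch as Grammar#process!: a String is passed as the replacement,
# anything else is passed to gsub! as its block.
rules.each do |regexp, handler|
  handler.is_a?(String) ? text.gsub!(regexp, handler) : text.gsub!(regexp, &handler)
end

puts text.inspect # prints "FOO\n\nbar"
```

Rules are applied in the order they were added, because Ruby hashes preserve insertion order.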
.post_processing_rules ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 64
def self.post_processing_rules #:nodoc:
  @post_processing_rules ||= {}
end
.pre_processing(selector, &handler) ⇒ Object
Add a pre-processing rule to your parser.
This lets you mutate the DOM before applying any rule defined with rule_for. You need to pass a CSS/XPath selector and a block that receives each matching Hpricot element.
pre_processing "ul.toc" do |element|
  element.swap("<p>[[toc]]</p>")
end
This would replace any unordered list with the class toc with a paragraph containing the code [[toc]].
# File 'lib/undress/grammar.rb', line 60
def self.pre_processing(selector, &handler) # :yields: element
  pre_processing_rules[selector] = handler
end
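How a stored rule is later applied can be sketched without Hpricot. Everything named Fake below is a hypothetical stand-in: FakeDoc#search only looks up a selector string in a prepared table and does no real CSS or XPath matching.

```ruby
# Hypothetical stand-in for an Hpricot element that supports swap.
class FakeItem
  attr_reader :html

  def initialize(html)
    @html = html
  end

  def swap(new_html)
    @html = new_html
  end
end

# Hypothetical stand-in for a document; search returns the elements
# registered under a selector string.
class FakeDoc
  def initialize(index)
    @index = index
  end

  def search(selector)
    @index.fetch(selector, [])
  end
end

# Same shape as the hash filled by pre_processing: selector => handler.
rules = {}
rules["ul.toc"] = lambda { |element| element.swap("<p>[[toc]]</p>") }

toc = FakeItem.new("<ul class='toc'><li>Intro</li></ul>")
doc = FakeDoc.new({ "ul.toc" => [toc] })

# Same loop as Grammar#process!: every match is handed to the handler.
rules.each do |selector, handler|
  doc.search(selector).each(&handler)
end

puts toc.html # prints <p>[[toc]]</p>
```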
.pre_processing_rules ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 68
def self.pre_processing_rules #:nodoc:
  @pre_processing_rules ||= {}
end
.process!(node) ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 72
def self.process!(node) #:nodoc:
  new.process!(node)
end
.rule_for(*tags, &handler) ⇒ Object
Add a parsing rule for a group of HTML tags.
rule_for :p do |element|
  "<this was a paragraph>#{content_of(element)}</this was a paragraph>"
end
This will replace your <p> tags with <this was a paragraph> tags, without altering the contents.
The element yielded to the block is an Hpricot element for the given tag.
# File 'lib/undress/grammar.rb', line 20
def self.rule_for(*tags, &handler) # :yields: element
  tags.each do |tag|
    define_method tag.to_sym, &handler
  end
end
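Because rule_for defines one instance method per tag name, a single block can serve several tags. A minimal standalone sketch of that technique follows. FakeElement, MiniGrammar and convert are hypothetical names, and no Undress or Hpricot code is loaded.

```ruby
# Hypothetical stand-in for an Hpricot element.
FakeElement = Struct.new(:name, :text)

class MiniGrammar
  # Same technique as Grammar.rule_for: one instance method per tag name,
  # all sharing the same handler block.
  def self.rule_for(*tags, &handler)
    tags.each { |tag| define_method(tag.to_sym, &handler) }
  end

  rule_for :em, :i do |element|
    "_#{element.text}_"
  end

  # Dispatch on the tag name, as Grammar#process does with send.
  def convert(element)
    send(element.name.to_sym, element)
  end
end

grammar = MiniGrammar.new
puts grammar.convert(FakeElement.new("em", "hi")) # prints _hi_
puts grammar.convert(FakeElement.new("i", "yo"))  # prints _yo_
```

The handler runs as an instance method, which is why rule blocks can call instance helpers such as content_of.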
Instance Method Details
#complete_word?(node) ⇒ Boolean
Helper to determine whether a node contains a whole word. Useful, for example, when converting a single italic letter inside a word.
# File 'lib/undress/grammar.rb', line 125
def complete_word?(node)
  return true if ! node.previous_node || ! node.next_node

  p, n = node.previous_node, node.next_node

  if p.respond_to?(:content)
    return false if p.content !~ /\s$/
  elsif p.respond_to?(:inner_html)
    return false if p.inner_html !~ /\s$/
  elsif n.respond_to?(:content)
    return false if n.content !~ /^\s/
  elsif n.respond_to?(:inner_html)
    return false if n.content !~ /^\s/
  end
  true
end
#content_of(node) ⇒ Object
Get the result of parsing the contents of a node.
# File 'lib/undress/grammar.rb', line 112
def content_of(node)
  process(node.respond_to?(:children) ? node.children : node)
end
#process(nodes) ⇒ Object
Process a DOM node, converting it to your markup language according to your defined rules. If the node is a Text node, it will return its string representation. Otherwise it will call the rule defined for it.
# File 'lib/undress/grammar.rb', line 87
def process(nodes)
  Array(nodes).map do |node|
    if node.text?
      node.to_html
    elsif node.elem?
      send node.name.to_sym, node if ! defined?(ALLOWED_TAGS) || ALLOWED_TAGS.empty? || ALLOWED_TAGS.include?(node.name)
    else
      ""
    end
  end.join("")
end
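The recursion between process and the per-tag rules can be followed on a tiny hand-built tree. This is a simplified sketch that leaves out the ALLOWED_TAGS check. TextNode, ElemNode and TinyGrammar are hypothetical stand-ins that implement only the methods process calls.

```ruby
# Hypothetical stand-in for an Hpricot text node.
class TextNode
  def initialize(text)
    @text = text
  end

  def text?
    true
  end

  def elem?
    false
  end

  def to_html
    @text
  end
end

# Hypothetical stand-in for an Hpricot element node.
class ElemNode
  attr_reader :name, :children

  def initialize(name, children)
    @name = name
    @children = children
  end

  def text?
    false
  end

  def elem?
    true
  end
end

class TinyGrammar
  # Rule for <strong>: wrap the converted contents in asterisks.
  def strong(node)
    "**#{process(node.children)}**"
  end

  # Tags without a rule fall through to their contents.
  def method_missing(tag, node, *args)
    process(node.children)
  end

  # Text nodes yield their text; elements are dispatched by tag name.
  def process(nodes)
    Array(nodes).map do |node|
      if node.text?
        node.to_html
      elsif node.elem?
        send(node.name.to_sym, node)
      else
        ""
      end
    end.join
  end
end

tree = ElemNode.new("div", [
  TextNode.new("a "),
  ElemNode.new("strong", [TextNode.new("b")])
])

puts TinyGrammar.new.process([tree]) # prints a **b**
```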
#process!(node) ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 99
def process!(node) #:nodoc:
  pre_processing_rules.each do |selector, handler|
    node.search(selector).each(&handler)
  end
  process(node.children).tap do |text|
    post_processing_rules.each do |rule, handler|
      handler.is_a?(String) ? text.gsub!(rule, handler) : text.gsub!(rule, &handler)
    end
  end
end
#surrounded_by_whitespace?(node) ⇒ Boolean
Helper method that tells you if the given DOM node is immediately surrounded by whitespace.
# File 'lib/undress/grammar.rb', line 118
def surrounded_by_whitespace?(node)
  (node.previous && node.previous.text? && node.previous.to_s =~ /\s+$/) ||
    (node.next && node.next.text? && node.next.to_s =~ /^\s+/)
end
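Going by the listing above, the check succeeds when a neighbouring text node supplies whitespace on either side, and the result is a truthy or falsy value rather than strictly true or false. A small sketch with a hypothetical Sibling class, which stands in for Hpricot nodes and models only previous, next, text? and to_s:

```ruby
# Hypothetical stand-in for a DOM node with linked siblings.
class Sibling
  attr_accessor :previous, :next

  def initialize(text = nil)
    @text = text
  end

  # Treated as a text node only when it carries text.
  def text?
    !@text.nil?
  end

  def to_s
    @text.to_s
  end
end

# Copy of the method body shown above, defined at top level for the sketch.
def surrounded_by_whitespace?(node)
  (node.previous && node.previous.text? && node.previous.to_s =~ /\s+$/) ||
    (node.next && node.next.text? && node.next.to_s =~ /^\s+/)
end

# "foo <em>..</em>bar": whitespace only before the element.
spaced = Sibling.new
spaced.previous = Sibling.new("foo ")
spaced.next = Sibling.new("bar")

# "foo<em>..</em>bar": no whitespace on either side.
tight = Sibling.new
tight.previous = Sibling.new("foo")
tight.next = Sibling.new("bar")

puts surrounded_by_whitespace?(spaced) ? "spaced: yes" : "spaced: no" # prints spaced: yes
puts surrounded_by_whitespace?(tight) ? "tight: yes" : "tight: no"    # prints tight: no
```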