Class: Undress::Grammar
Overview
Grammars give you a DSL to declare how to convert an HTML document into a different markup language.
Direct Known Subclasses
Instance Attribute Summary
-
#post_processing_rules ⇒ Object
readonly
:nodoc:.
-
#pre_processing_rules ⇒ Object
readonly
:nodoc:.
-
#whitelisted_attributes ⇒ Object
readonly
:nodoc:.
Class Method Summary
-
.default(&handler) ⇒ Object
Set a default rule for unrecognized tags.
-
.inherited(base) ⇒ Object
:nodoc:.
-
.post_processing(regexp, replacement = nil, &handler) ⇒ Object
Add a post-processing rule to your parser.
-
.post_processing_rules ⇒ Object
:nodoc:.
-
.pre_processing(selector, &handler) ⇒ Object
Add a pre-processing rule to your parser.
-
.pre_processing_rules ⇒ Object
:nodoc:.
-
.process!(node) ⇒ Object
:nodoc:.
-
.rule_for(*tags, &handler) ⇒ Object
Add a parsing rule for a group of html tags.
-
.whitelist_attributes(*attrs) ⇒ Object
Set a list of attributes you wish to whitelist.
-
.whitelisted_attributes ⇒ Object
:nodoc:.
Instance Method Summary
-
#attributes(node) ⇒ Object
Hash of attributes, according to the white list.
-
#complete_word?(node) ⇒ Boolean
Helper to determine whether a node contains a whole word. Useful, for example, when converting a single italic letter inside a word.
-
#content_of(node) ⇒ Object
Get the result of parsing the contents of a node.
-
#initialize ⇒ Grammar
constructor
:nodoc:.
-
#method_missing(tag, node, *args) ⇒ Object
:nodoc:.
-
#process(nodes) ⇒ Object
Process a DOM node, converting it to your markup language according to your defined rules.
-
#process!(node) ⇒ Object
:nodoc:.
-
#surrounded_by_whitespace?(node) ⇒ Boolean
Helper method that tells you if the given DOM node is immediately surrounded by whitespace.
Constructor Details
#initialize ⇒ Grammar
:nodoc:
# File 'lib/undress/grammar.rb', line 95
def initialize #:nodoc:
  @pre_processing_rules = self.class.pre_processing_rules.dup
  @post_processing_rules = self.class.post_processing_rules.dup
  @whitelisted_attributes = self.class.whitelisted_attributes.dup
end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(tag, node, *args) ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 184
def method_missing(tag, node, *args) #:nodoc:
  process(node.children)
end
Instance Attribute Details
#post_processing_rules ⇒ Object (readonly)
:nodoc:
# File 'lib/undress/grammar.rb', line 92
def post_processing_rules
  @post_processing_rules
end
#pre_processing_rules ⇒ Object (readonly)
:nodoc:
# File 'lib/undress/grammar.rb', line 91
def pre_processing_rules
  @pre_processing_rules
end
#whitelisted_attributes ⇒ Object (readonly)
:nodoc:
# File 'lib/undress/grammar.rb', line 93
def whitelisted_attributes
  @whitelisted_attributes
end
Class Method Details
.default(&handler) ⇒ Object
Set a default rule for unrecognized tags.
Unless you set one, unrecognized tags are ignored and only their contents are output.
# File 'lib/undress/grammar.rb', line 30
def self.default(&handler) # :yields: element
  define_method :method_missing do |tag, node, *args|
    handler.call(node)
  end
end
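The mechanism in the listing above can be tried without Undress or Hpricot. What follows is a minimal standalone sketch, not the library itself: `MiniGrammar`, `Shouting` and `DefaultNode` are hypothetical stand-ins, and only the `define_method :method_missing` trick is taken from the source.

```ruby
# Hypothetical stand-in for an Hpricot element: a tag name and its inner text.
DefaultNode = Struct.new(:name, :text)

class MiniGrammar
  # Same trick as Grammar.default: redefine method_missing with the handler.
  def self.default(&handler)
    define_method :method_missing do |tag, node, *args|
      handler.call(node)
    end
  end

  # Fallback when no default is set: output the contents only.
  # (Simplified: the real method processes the node's children.)
  def method_missing(tag, node, *args)
    node.text
  end
end

class Shouting < MiniGrammar
  default { |node| "[#{node.name}]#{node.text.upcase}[/#{node.name}]" }
end

puts MiniGrammar.new.blink(DefaultNode.new("blink", "hi"))  # prints: hi
puts Shouting.new.blink(DefaultNode.new("blink", "hi"))     # prints: [blink]HI[/blink]
```

Because the handler is invoked with `handler.call(node)`, it runs in the scope where the block was written (the class body), not on the grammar instance.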
.inherited(base) ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 5
def self.inherited(base) # :nodoc:
  base.instance_variable_set(:@post_processing_rules, post_processing_rules)
  base.instance_variable_set(:@pre_processing_rules, pre_processing_rules)
end
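Read literally, the listing above hands the parent's Hash object itself to the subclass rather than a copy. The following standalone sketch illustrates what that implies; `BaseRules` and `ChildRules` are hypothetical classes that only mirror the pattern, so treat it as a reading of the listing rather than a statement about how Undress behaves in practice.

```ruby
class BaseRules
  def self.rules
    @rules ||= {}
  end

  # Mirrors Grammar.inherited: pass the parent's Hash to the subclass.
  def self.inherited(base)
    base.instance_variable_set(:@rules, rules)
  end
end

class ChildRules < BaseRules; end

# A rule registered on the subclass...
ChildRules.rules[/x/] = "y"

# ...is visible on the parent too, because both names point at one Hash.
puts BaseRules.rules.equal?(ChildRules.rules)  # prints: true
puts BaseRules.rules.size                      # prints: 1
```

Instances are not affected in the same way, since the constructor calls `dup` on each rule set.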
.post_processing(regexp, replacement = nil, &handler) ⇒ Object
Add a post-processing rule to your parser.
This takes a regular expression that will be applied to the output after processing any nodes. It can take a string as a replacement, or a block that will be passed to String#gsub.
post_processing(/\n\n+/, "\n\n") # compress more than two newlines
post_processing(/whatever/) { ... }
# File 'lib/undress/grammar.rb', line 44
def self.post_processing(regexp, replacement = nil, &handler) #:yields: matched_string
  post_processing_rules[regexp] = replacement || handler
end
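The two rule forms correspond to the two calling conventions of String#gsub. As a rough standalone sketch (the `POST_RULES` table and `apply_post_rules` helper are hypothetical names used for illustration, not part of Undress), applying a mixed set of rules could look like this:

```ruby
# Hypothetical stand-in for the table built by post_processing:
# each regexp maps to either a replacement String or a block (Proc).
POST_RULES = {
  /\n\n+/     => "\n\n",                        # string replacement
  /\bnote\b/i => proc { |match| match.upcase }  # block replacement
}

# Apply every rule in insertion order to a copy of the text.
def apply_post_rules(text, rules)
  result = text.dup
  rules.each do |regexp, handler|
    if handler.is_a?(String)
      result.gsub!(regexp, handler)
    else
      result.gsub!(regexp, &handler)
    end
  end
  result
end

puts apply_post_rules("note: a\n\n\n\nb", POST_RULES)
```

The first rule collapses the four newlines into two, and the second upcases the matched word, so the result is "NOTE: a", a blank line, then "b".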
.post_processing_rules ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 79
def self.post_processing_rules #:nodoc:
  @post_processing_rules ||= {}
end
.pre_processing(selector, &handler) ⇒ Object
Add a pre-processing rule to your parser.
This lets you mutate the DOM before applying any rule defined with rule_for. You need to pass a CSS/XPath selector, and a block that takes an Hpricot element to parse it.
pre_processing "ul.toc" do |element|
  element.swap("<p>[[toc]]</p>")
end
This would replace any unordered list with the class toc with a paragraph containing the code [[toc]].
# File 'lib/undress/grammar.rb', line 60
def self.pre_processing(selector, &handler) # :yields: element
  pre_processing_rules[selector] = handler
end
.pre_processing_rules ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 83
def self.pre_processing_rules #:nodoc:
  @pre_processing_rules ||= {}
end
.process!(node) ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 87
def self.process!(node) #:nodoc:
  new.process!(node)
end
.rule_for(*tags, &handler) ⇒ Object
Add a parsing rule for a group of html tags.
rule_for :p do |element|
  "<this was a paragraph>#{content_of(element)}</this was a paragraph>"
end
This will replace your <p> tags with <this was a paragraph> tags, without altering the contents.
The element yielded to the block is an Hpricot element for the given tag.
# File 'lib/undress/grammar.rb', line 20
def self.rule_for(*tags, &handler) # :yields: element
  tags.each do |tag|
    define_method tag.to_sym, &handler
  end
end
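Since each rule becomes an instance method named after the tag, the handler runs on the grammar instance and can call helpers such as content_of. A minimal standalone sketch of that idea follows; `TinyGrammar`, `TinyMarkdown`, `TagElement` and `convert` are hypothetical names, and `content_of` is simplified to return plain text instead of recursing through child nodes.

```ruby
# Hypothetical stand-in for an Hpricot element.
TagElement = Struct.new(:name, :text)

class TinyGrammar
  # Same idea as Grammar.rule_for: one instance method per tag name.
  def self.rule_for(*tags, &handler)
    tags.each { |tag| define_method(tag.to_sym, &handler) }
  end

  # Simplified content_of: the real one parses the node's children.
  def content_of(element)
    element.text
  end

  # Dispatch on the tag name with send, as Grammar#process does.
  def convert(element)
    send(element.name.to_sym, element)
  end
end

class TinyMarkdown < TinyGrammar
  rule_for(:em, :i)     { |e| "_#{content_of(e)}_" }
  rule_for(:strong, :b) { |e| "**#{content_of(e)}**" }
end

grammar = TinyMarkdown.new
puts grammar.convert(TagElement.new("em", "soft"))  # prints: _soft_
puts grammar.convert(TagElement.new("b", "loud"))   # prints: **loud**
```

Passing several tags to one rule_for call simply defines the same handler under each name.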
.whitelist_attributes(*attrs) ⇒ Object
Set a list of attributes you wish to whitelist.
Any attribute not in this list at the moment of parsing will be ignored by the parser. The method Grammar#attributes(node) will return a hash of the filtered attributes. Read its documentation for more details.
whitelist_attributes :id, :class, :lang
# File 'lib/undress/grammar.rb', line 71
def self.whitelist_attributes(*attrs)
  @whitelisted_attributes = attrs
end
.whitelisted_attributes ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 75
def self.whitelisted_attributes #:nodoc:
  @whitelisted_attributes || []
end
Instance Method Details
#attributes(node) ⇒ Object
Hash of attributes, according to the white list. By default, no attributes are whitelisted, so you must set which ones to whitelist on each grammar.
Supposing you set :id and :class as your whitelisted_attributes, and you have a node representing this HTML:
<p lang="en" class="greeting">Hello World</p>
Then the method would return:
{ :class => "greeting" }
You can override this method in each grammar and call super if you will represent your attributes consistently across all nodes (for example, Textile always shows class and id inside parentheses).
# File 'lib/undress/grammar.rb', line 177
def attributes(node)
  node.attributes.to_hash.inject({}) do |attrs,(key,value)|
    attrs[key.to_sym] = value if whitelisted_attributes.include?(key.to_sym)
    attrs
  end
end
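The filtering step can be checked in isolation on a plain Hash. This is only a sketch under assumptions: `filter_attributes`, `WHITELIST` and `RAW_ATTRS` are hypothetical names, and the raw attributes are assumed to arrive with String keys.

```ruby
# Hypothetical stand-ins: a whitelist and a node's raw attribute hash,
# matching the <p lang="en" class="greeting"> example.
WHITELIST = [:id, :class]
RAW_ATTRS = { "lang" => "en", "class" => "greeting" }

# Same filtering logic as the listing, applied to a plain Hash:
# keep a pair only when its symbolized key is whitelisted.
def filter_attributes(raw, whitelist)
  raw.each_with_object({}) do |(key, value), attrs|
    attrs[key.to_sym] = value if whitelist.include?(key.to_sym)
  end
end

p filter_attributes(RAW_ATTRS, WHITELIST)
```

Only the class entry survives: lang is not whitelisted, and id is whitelisted but absent from the node.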
#complete_word?(node) ⇒ Boolean
Helper to determine whether a node contains a whole word. Useful, for example, when converting a single italic letter inside a word.
# File 'lib/undress/grammar.rb', line 142
def complete_word?(node)
  p, n = node.previous_node, node.next_node

  return true if !p && !n

  if p.respond_to?(:content)
    return false if p.content !~ /\s$/
  elsif p.respond_to?(:inner_html)
    return false if p.inner_html !~ /\s$/
  end

  if n.respond_to?(:content)
    return false if n.content !~ /^\s/
  elsif n.respond_to?(:inner_html)
    return false if n.inner_html !~ /^\s/
  end

  true
end
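To see what the check distinguishes, here is a standalone approximation that uses stub objects instead of Hpricot nodes. `WordSibling` and `WordNode` are hypothetical structs, and the inner_html branch is left out for brevity.

```ruby
# Hypothetical stand-ins for a node and its neighbouring text nodes.
WordSibling = Struct.new(:content)
WordNode    = Struct.new(:previous_node, :next_node)

# Same checks as the listing, minus the inner_html branch:
# the text before must end in whitespace and the text after must start with it.
def complete_word?(node)
  before, after = node.previous_node, node.next_node
  return true if before.nil? && after.nil?
  return false if before.respond_to?(:content) && before.content !~ /\s$/
  return false if after.respond_to?(:content) && after.content !~ /^\s/
  true
end

# "un<em>believ</em>able": the <em> sits inside a word.
inside = WordNode.new(WordSibling.new("un"), WordSibling.new("able"))
# "so <em>very</em> good": the <em> wraps a whole word.
whole = WordNode.new(WordSibling.new("so "), WordSibling.new(" good"))

puts complete_word?(inside)  # prints: false
puts complete_word?(whole)   # prints: true
```

A grammar might use this to choose between a markup form that is valid mid-word and the usual whitespace-delimited one.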
#content_of(node) ⇒ Object
Get the result of parsing the contents of a node.
# File 'lib/undress/grammar.rb', line 129
def content_of(node)
  process(node.respond_to?(:children) ? node.children : node)
end
#process(nodes) ⇒ Object
Process a DOM node, converting it to your markup language according to your defined rules. If the node is a Text node, it will return its string representation. Otherwise it will call the rule defined for it.
# File 'lib/undress/grammar.rb', line 104
def process(nodes)
  Array(nodes).map do |node|
    if node.text?
      node.to_html
    elsif node.elem?
      send node.name.to_sym, node if ! defined?(ALLOWED_TAGS) || ALLOWED_TAGS.empty? || ALLOWED_TAGS.include?(node.name)
    else
      ""
    end
  end.join("")
end
#process!(node) ⇒ Object
:nodoc:
# File 'lib/undress/grammar.rb', line 116
def process!(node) #:nodoc:
  pre_processing_rules.each do |selector, handler|
    node.search(selector).each(&handler)
  end

  process(node.children).tap do |text|
    post_processing_rules.each do |rule, handler|
      handler.is_a?(String) ? text.gsub!(rule, handler) : text.gsub!(rule, &handler)
    end
  end
end
#surrounded_by_whitespace?(node) ⇒ Boolean
Helper method that tells you if the given DOM node is immediately surrounded by whitespace.
# File 'lib/undress/grammar.rb', line 135
def surrounded_by_whitespace?(node)
  (node.previous && node.previous.text? && node.previous.to_s =~ /\s+$/) ||
    (node.next && node.next.text? && node.next.to_s =~ /^\s+/)
end
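Two details of the listing above are easy to miss: the two checks are joined with `||`, so whitespace on either side is enough, and the result is whatever `=~` returns (a match index or nil) rather than a strict true or false. The sketch below reproduces the expression against stub objects; `TextStub` and `Neighbours` are hypothetical stand-ins for Hpricot nodes.

```ruby
# Hypothetical stand-in for a text node sitting next to the element.
TextStub = Struct.new(:string) do
  def text?
    true
  end

  def to_s
    string
  end
end

# Hypothetical stand-in for the element itself, exposing its two neighbours.
Neighbours = Struct.new(:previous, :next)

# Same expression as in the listing above.
def surrounded_by_whitespace?(node)
  (node.previous && node.previous.text? && node.previous.to_s =~ /\s+$/) ||
    (node.next && node.next.text? && node.next.to_s =~ /^\s+/)
end

both = Neighbours.new(TextStub.new("a "), TextStub.new(" b"))
left = Neighbours.new(TextStub.new("a "), TextStub.new("b"))
none = Neighbours.new(TextStub.new("a"), TextStub.new("b"))

puts !!surrounded_by_whitespace?(both)  # prints: true
puts !!surrounded_by_whitespace?(left)  # prints: true
puts !!surrounded_by_whitespace?(none)  # prints: false
```

The double negation is only there to print a plain boolean; in a condition the raw return value works as is.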