Class: Undress::Grammar

Inherits:
Object
Defined in:
lib/undress/grammar.rb

Overview

Grammars give you a DSL to declare how to convert an HTML document into a different markup language.

Direct Known Subclasses

Textile

Constructor Details

#initialize ⇒ Grammar

:nodoc:



# File 'lib/undress/grammar.rb', line 95

def initialize #:nodoc:
  @pre_processing_rules = self.class.pre_processing_rules.dup
  @post_processing_rules = self.class.post_processing_rules.dup
  @whitelisted_attributes = self.class.whitelisted_attributes.dup
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(tag, node, *args) ⇒ Object

:nodoc:



# File 'lib/undress/grammar.rb', line 184

def method_missing(tag, node, *args) #:nodoc:
  process(node.children)
end

Instance Attribute Details

#post_processing_rules ⇒ Object (readonly)

:nodoc:



# File 'lib/undress/grammar.rb', line 92

def post_processing_rules
  @post_processing_rules
end

#pre_processing_rules ⇒ Object (readonly)

:nodoc:



# File 'lib/undress/grammar.rb', line 91

def pre_processing_rules
  @pre_processing_rules
end

#whitelisted_attributes ⇒ Object (readonly)

:nodoc:



# File 'lib/undress/grammar.rb', line 93

def whitelisted_attributes
  @whitelisted_attributes
end

Class Method Details

.default(&handler) ⇒ Object

Set a default rule for unrecognized tags.

Unless you define one, the grammar drops an unrecognized tag and outputs only its contents.



# File 'lib/undress/grammar.rb', line 30

def self.default(&handler) # :yields: element
  define_method :method_missing do |tag, node, *args|
    handler.call(node)
  end
end
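
As a rough illustration of the mechanism above, the sketch below rebuilds only the fallback logic in plain Ruby so it runs without Undress or Hpricot. FallbackGrammar, LoudGrammar, and FakeElement are hypothetical names introduced for this example, and the built-in fallback is reduced to returning the element text.

```ruby
# Hypothetical stand-in for an Hpricot element; not part of Undress.
FakeElement = Struct.new(:name, :text)

# Hypothetical miniature of Undress::Grammar that keeps only the fallback logic.
class FallbackGrammar
  # Same shape as Grammar.default: redefine method_missing around the block.
  def self.default(&handler)
    define_method(:method_missing) do |tag, node, *args|
      handler.call(node)
    end
  end

  # Stand-in for the built-in fallback, which outputs only the contents.
  def method_missing(tag, node, *args)
    node.text
  end
end

# A subclass that wraps unknown tags instead of dropping them.
class LoudGrammar < FallbackGrammar
  default { |element| "[#{element.name}: #{element.text}]" }
end

blink = FakeElement.new("blink", "hi")
plain = FallbackGrammar.new.blink(blink) # handled by the stand-in fallback
loud  = LoudGrammar.new.blink(blink)     # handled by the block given to default
```

Because the block is installed as method_missing, it only runs for tags that have no rule of their own.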

.inherited(base) ⇒ Object

:nodoc:



# File 'lib/undress/grammar.rb', line 5

def self.inherited(base) # :nodoc:
  base.instance_variable_set(:@post_processing_rules, post_processing_rules)
  base.instance_variable_set(:@pre_processing_rules, pre_processing_rules)
end

.post_processing(regexp, replacement = nil, &handler) ⇒ Object

Add a post-processing rule to your parser.

This takes a regular expression that is applied to the output after all nodes have been processed. It accepts either a replacement string or a block that is passed to String#gsub.

post_processing(/\n\n+/, "\n\n") # compress more than two newlines
post_processing(/whatever/) { ... }


# File 'lib/undress/grammar.rb', line 44

def self.post_processing(regexp, replacement = nil, &handler) #:yields: matched_string
  post_processing_rules[regexp] = replacement || handler
end
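
The following sketch shows, in plain Ruby, how rules stored this way can be applied with gsub!, mirroring the String-or-block branch in Grammar#process!. The rules hash and the sample text are made up for illustration.

```ruby
# Rules in the shape the class stores them: regexp => String or block.
rules = {
  /\n\n+/ => "\n\n",                          # collapse runs of blank lines
  /todo/  => proc { |matched| matched.upcase } # block receives the matched text
}

# dup so the string is mutable even under frozen string literals.
text = "first\n\n\n\nsecond todo".dup

rules.each do |regexp, handler|
  if handler.is_a?(String)
    text.gsub!(regexp, handler)   # plain string replacement
  else
    text.gsub!(regexp, &handler)  # block replacement
  end
end
```

After the loop, text is "first\n\nsecond TODO".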

.post_processing_rules ⇒ Object

:nodoc:



# File 'lib/undress/grammar.rb', line 79

def self.post_processing_rules #:nodoc:
  @post_processing_rules ||= {}
end

.pre_processing(selector, &handler) ⇒ Object

Add a pre-processing rule to your parser.

This lets you mutate the DOM before applying any rule defined with rule_for. You need to pass a CSS/XPath selector and a block that receives each matching Hpricot element.

pre_processing "ul.toc" do |element|
  element.swap("<p>[[toc]]</p>")
end

This replaces every unordered list that has the class toc with a paragraph containing the text [[toc]].



# File 'lib/undress/grammar.rb', line 60

def self.pre_processing(selector, &handler) # :yields: element
  pre_processing_rules[selector] = handler
end

.pre_processing_rules ⇒ Object

:nodoc:



# File 'lib/undress/grammar.rb', line 83

def self.pre_processing_rules #:nodoc:
  @pre_processing_rules ||= {}
end

.process!(node) ⇒ Object

:nodoc:



# File 'lib/undress/grammar.rb', line 87

def self.process!(node) #:nodoc:
  new.process!(node)
end

.rule_for(*tags, &handler) ⇒ Object

Add a parsing rule for a group of HTML tags.

rule_for :p do |element|
  "<this was a paragraph>#{content_of(element)}</this was a paragraph>"
end

This replaces your <p> tags with <this was a paragraph> tags, without altering the contents.

The element yielded to the block is an Hpricot element for the given tag.



# File 'lib/undress/grammar.rb', line 20

def self.rule_for(*tags, &handler) # :yields: element
  tags.each do |tag|
    define_method tag.to_sym, &handler
  end
end
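
A minimal sketch of what this definition does: each tag name becomes an instance method whose body is the block. MiniGrammar and TagNode are hypothetical stand-ins so the example runs without Undress or Hpricot; a real grammar would receive Hpricot elements.

```ruby
# Hypothetical stand-in for an Hpricot element; not part of Undress.
TagNode = Struct.new(:name, :text)

# Hypothetical miniature of Undress::Grammar that keeps only rule_for.
class MiniGrammar
  def self.rule_for(*tags, &handler)
    tags.each do |tag|
      define_method(tag.to_sym, &handler) # one instance method per tag
    end
  end

  # One block shared by two tags.
  rule_for :b, :strong do |element|
    "*#{element.text}*"
  end
end

grammar = MiniGrammar.new
bold    = grammar.b(TagNode.new("b", "hi"))
strong  = grammar.strong(TagNode.new("strong", "there"))
```

Since the rules are ordinary instance methods, the block can call other helpers of the grammar, such as content_of.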

.whitelist_attributes(*attrs) ⇒ Object

Set the list of attributes you wish to whitelist.

Any attribute not in this list at the moment of parsing will be ignored by the parser. The method Grammar#attributes(node) will return a hash of the filtered attributes. Read its documentation for more details.

whitelist_attributes :id, :class, :lang


# File 'lib/undress/grammar.rb', line 71

def self.whitelist_attributes(*attrs)
  @whitelisted_attributes = attrs
end

.whitelisted_attributes ⇒ Object

:nodoc:



# File 'lib/undress/grammar.rb', line 75

def self.whitelisted_attributes #:nodoc:
  @whitelisted_attributes || []
end

Instance Method Details

#attributes(node) ⇒ Object

Returns a hash of the node's attributes, filtered by the whitelist. By default, no attributes are whitelisted, so you must set which ones to whitelist on each grammar.

Supposing you set :id and :class as your whitelisted_attributes, and you have a node representing this HTML:

<p lang="en" class="greeting">Hello World</p>

Then the method would return:

{ :class => "greeting" }

You can override this method in each grammar and call super if you represent attributes consistently across all nodes (for example, Textile always shows class and id inside parentheses).



# File 'lib/undress/grammar.rb', line 177

def attributes(node)
  node.attributes.to_hash.inject({}) do |attrs,(key,value)|
    attrs[key.to_sym] = value if whitelisted_attributes.include?(key.to_sym)
    attrs
  end
end
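
The filtering step can be tried on its own with a plain hash. In this sketch, raw stands in for node.attributes.to_hash and is assumed to use string keys; whitelisted stands in for the grammar's whitelisted_attributes. Both names are made up for the example.

```ruby
# Stand-ins for the grammar's whitelist and for node.attributes.to_hash
# (string keys are an assumption made for this example).
whitelisted = [:id, :class]
raw = { "lang" => "en", "class" => "greeting" }

# Same inject pattern as Grammar#attributes: keep whitelisted keys as symbols.
filtered = raw.inject({}) do |attrs, (key, value)|
  attrs[key.to_sym] = value if whitelisted.include?(key.to_sym)
  attrs
end
```

Here filtered ends up as { :class => "greeting" }, matching the example above: lang is dropped because it is not whitelisted, and id is absent because the element has none.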

#complete_word?(node) ⇒ Boolean

Helper to determine whether a node contains a whole word. Useful, for example, when converting a single italic letter inside a word.

Returns:

  • (Boolean)


# File 'lib/undress/grammar.rb', line 142

def complete_word?(node)
  p, n = node.previous_node, node.next_node

  return true if !p && !n 

  if p.respond_to?(:content)
    return false if p.content       !~ /\s$/
  elsif p.respond_to?(:inner_html)
    return false if p.inner_html    !~ /\s$/
  end
  
  if n.respond_to?(:content)
    return false if n.content       !~ /^\s/
  elsif n.respond_to?(:inner_html)
    return false if n.inner_html    !~ /^\s/
  end
  true
end
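
To see the helper's behavior on concrete cases, the sketch below copies its logic into a standalone method and feeds it stub nodes. WordNode is a hypothetical stand-in that exposes only content, previous_node, and next_node; real grammars receive Hpricot nodes.

```ruby
# Hypothetical stand-in for a DOM node; not part of Undress or Hpricot.
WordNode = Struct.new(:content, :previous_node, :next_node)

# Copy of the helper's logic so it can run outside a grammar.
def complete_word?(node)
  p, n = node.previous_node, node.next_node
  return true if !p && !n

  if p.respond_to?(:content)
    return false if p.content !~ /\s$/
  elsif p.respond_to?(:inner_html)
    return false if p.inner_html !~ /\s$/
  end

  if n.respond_to?(:content)
    return false if n.content !~ /^\s/
  elsif n.respond_to?(:inner_html)
    return false if n.inner_html !~ /^\s/
  end
  true
end

# "a <em>whole</em> word": whitespace on both sides of the node.
whole = WordNode.new("whole", WordNode.new("a "), WordNode.new(" word"))
# "un<em>b</em>elievable": the node sits inside a word.
inner = WordNode.new("b", WordNode.new("un"), WordNode.new("elievable"))
# A node with no siblings at all.
alone = WordNode.new("solo")

whole_result = complete_word?(whole)
inner_result = complete_word?(inner)
alone_result = complete_word?(alone)
```

The first and third cases return true and the second returns false, which is the signal a grammar can use to pick a different markup for emphasis inside a word.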

#content_of(node) ⇒ Object

Get the result of parsing the contents of a node.



# File 'lib/undress/grammar.rb', line 129

def content_of(node)
  process(node.respond_to?(:children) ? node.children : node)
end

#process(nodes) ⇒ Object

Process a DOM node or a list of nodes, converting it to your markup language according to your defined rules. If a node is a text node, it returns its string representation. Otherwise it calls the rule defined for it.



# File 'lib/undress/grammar.rb', line 104

def process(nodes)
  Array(nodes).map do |node|
    if node.text?
      node.to_html
    elsif node.elem? 
      send node.name.to_sym, node if ! defined?(ALLOWED_TAGS) || ALLOWED_TAGS.empty? || ALLOWED_TAGS.include?(node.name)
    else
      ""
    end
  end.join("")
end
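
The dispatch can be sketched without Hpricot as follows. TextStub, ElemStub, and SketchGrammar are hypothetical names for this example, and the ALLOWED_TAGS check is left out to keep it short.

```ruby
# Hypothetical stand-ins for Hpricot text and element nodes; not part of Undress.
class TextStub
  attr_reader :to_html

  def initialize(text)
    @to_html = text
  end

  def text?; true; end
  def elem?; false; end
end

class ElemStub
  attr_reader :name, :children

  def initialize(name, children)
    @name = name
    @children = children
  end

  def text?; false; end
  def elem?; true; end
end

# Hypothetical miniature grammar: same dispatch idea as Grammar#process.
class SketchGrammar
  def process(nodes)
    Array(nodes).map do |node|
      if node.text?
        node.to_html                 # text nodes pass through
      elsif node.elem?
        send(node.name.to_sym, node) # elements dispatch to the rule named after the tag
      else
        ""
      end
    end.join
  end

  # A rule for <em>, written as a plain method.
  def em(node)
    "_#{process(node.children)}_"
  end

  # Fallback for tags without a rule: keep only the contents.
  def method_missing(tag, node, *args)
    process(node.children)
  end
end

tree = [
  TextStub.new("Hello "),
  ElemStub.new("span", [ElemStub.new("em", [TextStub.new("big")]), TextStub.new(" world")])
]
output = SketchGrammar.new.process(tree)
```

The span has no rule, so it falls through to method_missing and only its contents survive, giving "Hello _big_ world".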

#process!(node) ⇒ Object

:nodoc:



# File 'lib/undress/grammar.rb', line 116

def process!(node) #:nodoc:
  pre_processing_rules.each do |selector, handler|
    node.search(selector).each(&handler)
  end

  process(node.children).tap do |text|
    post_processing_rules.each do |rule, handler|
      handler.is_a?(String) ?  text.gsub!(rule, handler) : text.gsub!(rule, &handler)
    end
  end
end

#surrounded_by_whitespace?(node) ⇒ Boolean

Helper method that tells you whether the given DOM node has whitespace immediately before or after it.

Returns:

  • (Boolean)


# File 'lib/undress/grammar.rb', line 135

def surrounded_by_whitespace?(node)
  (node.previous && node.previous.text? && node.previous.to_s =~ /\s+$/) ||
    (node.next && node.next.text? && node.next.to_s =~ /^\s+/)
end