Class: Undress::Grammar

Inherits: Object

Defined in: lib/undress/grammar.rb

Overview

Grammars give you a DSL to declare how to convert an HTML document into a different markup language.

Direct Known Subclasses

Textile

Instance Attribute Summary

#post_processing_rules ⇒ Object readonly

:nodoc:.
#pre_processing_rules ⇒ Object readonly

:nodoc:.
#whitelisted_attributes ⇒ Object readonly

:nodoc:.

Class Method Summary

.default(&handler) ⇒ Object

Set a default rule for unrecognized tags.
.inherited(base) ⇒ Object

:nodoc:.
.post_processing(regexp, replacement = nil, &handler) ⇒ Object

Add a post-processing rule to your parser.
.post_processing_rules ⇒ Object

:nodoc:.
.pre_processing(selector, &handler) ⇒ Object

Add a pre-processing rule to your parser.
.pre_processing_rules ⇒ Object

:nodoc:.
.process!(node) ⇒ Object

:nodoc:.
.rule_for(*tags, &handler) ⇒ Object

Add a parsing rule for a group of html tags.
.whitelist_attributes(*attrs) ⇒ Object

Set a list of attributes you wish to whitelist.
.whitelisted_attributes ⇒ Object

:nodoc:.

Instance Method Summary

#attributes(node) ⇒ Object

Hash of attributes, according to the white list.
#complete_word?(node) ⇒ Boolean

Helper to determine whether a node contains a whole word. Useful, for example, when converting a single italic letter inside a word.
#content_of(node) ⇒ Object

Get the result of parsing the contents of a node.
#initialize ⇒ Grammar constructor

:nodoc:.
#method_missing(tag, node, *args) ⇒ Object

:nodoc:.
#process(nodes) ⇒ Object

Process a DOM node, converting it to your markup language according to your defined rules.
#process!(node) ⇒ Object

:nodoc:.
#surrounded_by_whitespace?(node) ⇒ Boolean

Helper method that tells you if the given DOM node is immediately surrounded by whitespace.

Constructor Details

#initialize ⇒ `Grammar`

:nodoc:

# File 'lib/undress/grammar.rb', line 95

def initialize #:nodoc:
  @pre_processing_rules = self.class.pre_processing_rules.dup
  @post_processing_rules = self.class.post_processing_rules.dup
  @whitelisted_attributes = self.class.whitelisted_attributes.dup
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(tag, node, *args) ⇒ `Object`

:nodoc:



# File 'lib/undress/grammar.rb', line 184

def method_missing(tag, node, *args) #:nodoc:
  process(node.children)
end

Instance Attribute Details

#post_processing_rules ⇒ `Object` (readonly)

:nodoc:



# File 'lib/undress/grammar.rb', line 92

def post_processing_rules
  @post_processing_rules
end

#pre_processing_rules ⇒ `Object` (readonly)

:nodoc:



# File 'lib/undress/grammar.rb', line 91

def pre_processing_rules
  @pre_processing_rules
end

#whitelisted_attributes ⇒ `Object` (readonly)

:nodoc:



# File 'lib/undress/grammar.rb', line 93

def whitelisted_attributes
  @whitelisted_attributes
end

Class Method Details

.default(&handler) ⇒ `Object`

Set a default rule for unrecognized tags.

Unless you define a default rule, the grammar ignores unrecognized tags and outputs only their contents.

# File 'lib/undress/grammar.rb', line 30

def self.default(&handler) # :yields: element
  define_method :method_missing do |tag, node, *args|
    handler.call(node)
  end
end
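
As a rough illustration of the mechanism above, the sketch below reproduces only the default hook in a hypothetical stand-in class. MiniGrammar, BracketGrammar, and SketchNode are names invented for this example, not part of Undress, and the node is a plain Struct rather than an Hpricot element.

```ruby
# Hypothetical stand-in for Undress::Grammar; only `default` is reproduced.
class MiniGrammar
  def self.default(&handler)
    # Any call to an undefined tag method falls through to the handler.
    define_method :method_missing do |tag, node, *args|
      handler.call(node)
    end
  end
end

# Hypothetical node with just a tag name and some text.
SketchNode = Struct.new(:name, :text)

class BracketGrammar < MiniGrammar
  # Render every tag without its own rule as "[tag: text]".
  default { |node| "[#{node.name}: #{node.text}]" }
end

default_result = BracketGrammar.new.blink(SketchNode.new("blink", "hi"))
puts default_result
```

Because the handler is invoked with handler.call, it runs in the context where the block was written (the class body), not on the grammar instance.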

.inherited(base) ⇒ `Object`

:nodoc:

# File 'lib/undress/grammar.rb', line 5

def self.inherited(base) # :nodoc:
  base.instance_variable_set(:@post_processing_rules, post_processing_rules)
  base.instance_variable_set(:@pre_processing_rules, pre_processing_rules)
end
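
One consequence of the hook as quoted is that a subclass receives the same Hash objects as its parent rather than copies. The sketch below shows this with hypothetical stand-in classes (BaseRules and ChildRules are invented for the example) that copy only the post-processing parts of the source above.

```ruby
# Hypothetical stand-in that mirrors the quoted class methods.
class BaseRules
  def self.inherited(base)
    # The subclass gets a reference to the parent's Hash, not a duplicate.
    base.instance_variable_set(:@post_processing_rules, post_processing_rules)
  end

  def self.post_processing_rules
    @post_processing_rules ||= {}
  end

  def self.post_processing(regexp, replacement = nil, &handler)
    post_processing_rules[regexp] = replacement || handler
  end
end

class ChildRules < BaseRules
  post_processing(/\n\n+/, "\n\n")
end

same_hash    = BaseRules.post_processing_rules.equal?(ChildRules.post_processing_rules)
parent_count = BaseRules.post_processing_rules.size
puts same_hash, parent_count
```

Under this reading, a rule declared on a subclass is also visible through the parent class; instances are isolated only because the constructor duplicates the rules.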

.post_processing(regexp, replacement = nil, &handler) ⇒ `Object`

Add a post-processing rule to your parser.

This takes a regular expression that will be applied to the output after all nodes have been processed. It takes either a string as a replacement or a block; whichever is given is passed to String#gsub!.

post_processing(/\n\n+/, "\n\n") # compress more than two newlines
post_processing(/whatever/) { ... }



# File 'lib/undress/grammar.rb', line 44

def self.post_processing(regexp, replacement = nil, &handler) #:yields: matched_string
  post_processing_rules[regexp] = replacement || handler
end
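
To make the two replacement forms concrete, here is a small standalone sketch of how such a rule table could be applied to a string. It does not use Undress itself; the rules hash and the sample text are invented for illustration.

```ruby
# Rules keyed by regexp; each value is a replacement String or a Proc,
# which is the shape post_processing stores.
rules = {
  /\n\n+/   => "\n\n",                      # string replacement
  /\bfoo\b/ => proc { |match| match.upcase } # block replacement
}

text = "foo\n\n\n\nbar".dup
rules.each do |pattern, handler|
  if handler.is_a?(String)
    text.gsub!(pattern, handler)
  else
    text.gsub!(pattern, &handler)
  end
end

puts text.inspect
```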

.post_processing_rules ⇒ `Object`

:nodoc:



# File 'lib/undress/grammar.rb', line 79

def self.post_processing_rules #:nodoc:
  @post_processing_rules ||= {}
end

.pre_processing(selector, &handler) ⇒ `Object`

Add a pre-processing rule to your parser.

This lets you mutate the DOM before applying any rule defined with rule_for. You need to pass a CSS/XPath selector, and a block that takes an Hpricot element to parse it.

pre_processing "ul.toc" do |element|
  element.swap("<p>[[toc]]</p>")
end

This would replace every unordered list with the class toc with a paragraph containing the code [[toc]].



# File 'lib/undress/grammar.rb', line 60

def self.pre_processing(selector, &handler) # :yields: element
  pre_processing_rules[selector] = handler
end

.pre_processing_rules ⇒ `Object`

:nodoc:



# File 'lib/undress/grammar.rb', line 83

def self.pre_processing_rules #:nodoc:
  @pre_processing_rules ||= {}
end

.process!(node) ⇒ `Object`

:nodoc:



# File 'lib/undress/grammar.rb', line 87

def self.process!(node) #:nodoc:
  new.process!(node)
end

.rule_for(*tags, &handler) ⇒ `Object`

Add a parsing rule for a group of html tags.

rule_for :p do |element|
  "<this was a paragraph>#{content_of(element)}</this was a paragraph>"
end

This will replace your <p> tags with <this was a paragraph> tags, without altering their contents.

The element yielded to the block is an Hpricot element for the given tag.

# File 'lib/undress/grammar.rb', line 20

def self.rule_for(*tags, &handler) # :yields: element
  tags.each do |tag|
    define_method tag.to_sym, &handler
  end
end
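
Since rule_for simply defines one instance method per tag, a single block can serve several tags. The following is a minimal sketch under that reading; RuleGrammar, StarGrammar, and RuleNode are hypothetical stand-ins, and content_of is simplified to return plain text instead of processing child nodes.

```ruby
# Hypothetical stand-in for Undress::Grammar; only `rule_for` is reproduced.
class RuleGrammar
  def self.rule_for(*tags, &handler)
    tags.each { |tag| define_method(tag.to_sym, &handler) }
  end

  # Simplified stand-in for content_of: the sketch node carries plain text.
  def content_of(node)
    node.text
  end
end

RuleNode = Struct.new(:name, :text)

class StarGrammar < RuleGrammar
  # One rule shared by two tags; the block runs on the grammar instance,
  # so helpers such as content_of are available inside it.
  rule_for(:em, :i)  { |element| "_#{content_of(element)}_" }
  rule_for(:strong)  { |element| "*#{content_of(element)}*" }
end

grammar  = StarGrammar.new
emphasis = grammar.em(RuleNode.new("em", "hello"))
italic   = grammar.i(RuleNode.new("i", "world"))
strong   = grammar.strong(RuleNode.new("strong", "loud"))
puts emphasis, italic, strong
```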

.whitelist_attributes(*attrs) ⇒ `Object`

Set a list of attributes you wish to whitelist.

Any attribute not in this list at the moment of parsing will be ignored by the parser. The method Grammar#attributes(node) will return a hash of the filtered attributes. Read its documentation for more details.

whitelist_attributes :id, :class, :lang



# File 'lib/undress/grammar.rb', line 71

def self.whitelist_attributes(*attrs)
  @whitelisted_attributes = attrs
end

.whitelisted_attributes ⇒ `Object`

:nodoc:



# File 'lib/undress/grammar.rb', line 75

def self.whitelisted_attributes #:nodoc:
  @whitelisted_attributes || []
end

Instance Method Details

#attributes(node) ⇒ `Object`

Hash of attributes, according to the white list. By default, no attributes are whitelisted, so you must set which ones to whitelist on each grammar.

Supposing you set :id and :class as your whitelisted_attributes, and you have a node representing this HTML:

<p lang="en" class="greeting">Hello World</p>

Then the method would return:

{ :class => "greeting" }

You can override this method in each grammar and call super if you will represent your attributes consistently across all nodes (for example, Textile always shows class and id inside parentheses).

# File 'lib/undress/grammar.rb', line 177

def attributes(node)
  node.attributes.to_hash.inject({}) do |attrs,(key,value)|
    attrs[key.to_sym] = value if whitelisted_attributes.include?(key.to_sym)
    attrs
  end
end
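
The filtering and the override pattern described above might look like the sketch below. AttrGrammar, ParenGrammar, and AttrNode are hypothetical names; the node's attributes are a plain Hash with String keys rather than an Hpricot attribute set, and the "(class#id)" rendering is an assumption made for the example.

```ruby
# Hypothetical node whose attributes are a plain Hash with String keys.
AttrNode = Struct.new(:attributes)

class AttrGrammar
  def whitelisted_attributes
    [:id, :class]
  end

  # Keep only whitelisted attributes, keyed by Symbol.
  def attributes(node)
    node.attributes.each_with_object({}) do |(key, value), attrs|
      attrs[key.to_sym] = value if whitelisted_attributes.include?(key.to_sym)
    end
  end
end

class ParenGrammar < AttrGrammar
  # Hypothetical override: render the filtered attributes as "(class#id)".
  def attributes(node)
    filtered = super
    return "" if filtered.empty?
    id = filtered[:id] ? "##{filtered[:id]}" : ""
    "(#{filtered[:class]}#{id})"
  end
end

attr_node     = AttrNode.new({ "lang" => "en", "class" => "greeting" })
filtered_hash = AttrGrammar.new.attributes(attr_node)
rendered      = ParenGrammar.new.attributes(attr_node)
puts filtered_hash.inspect, rendered
```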

#complete_word?(node) ⇒ `Boolean`

Helper to determine whether a node contains a whole word. Useful, for example, when converting a single italic letter inside a word.

Returns:

(Boolean)

# File 'lib/undress/grammar.rb', line 142

def complete_word?(node)
  p, n = node.previous_node, node.next_node

  return true if !p && !n 

  if p.respond_to?(:content)
    return false if p.content       !~ /\s$/
  elsif p.respond_to?(:inner_html)
    return false if p.inner_html    !~ /\s$/
  end
  
  if n.respond_to?(:content)
    return false if n.content       !~ /^\s/
  elsif n.respond_to?(:inner_html)
    return false if n.inner_html    !~ /^\s/
  end
  true
end
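
To see what the check distinguishes, the sketch below reimplements a simplified version of it against hypothetical stub nodes (TextStub and ElemStub are invented here). It covers only neighbours that respond to content and omits the inner_html branch of the source above.

```ruby
# Hypothetical text-node stand-in exposing only `content`.
TextStub = Struct.new(:content)
# Hypothetical element stand-in exposing its neighbours.
ElemStub = Struct.new(:previous_node, :next_node)

# Simplified version of the check: a node is a complete word when the text
# before it ends with whitespace and the text after it starts with whitespace.
def complete_word?(node)
  before, after = node.previous_node, node.next_node
  return true if before.nil? && after.nil?
  return false if before.respond_to?(:content) && before.content !~ /\s$/
  return false if after.respond_to?(:content) && after.content !~ /^\s/
  true
end

# "un<em>believ</em>able": the <em> sits inside a word.
inside_word = ElemStub.new(TextStub.new("un"), TextStub.new("able"))
# "so <em>very</em> nice": the <em> is a whole word.
whole_word  = ElemStub.new(TextStub.new("so "), TextStub.new(" nice"))
# An element with no siblings at all.
only_child  = ElemStub.new(nil, nil)

puts complete_word?(inside_word), complete_word?(whole_word), complete_word?(only_child)
```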

#content_of(node) ⇒ `Object`

Get the result of parsing the contents of a node.



# File 'lib/undress/grammar.rb', line 129

def content_of(node)
  process(node.respond_to?(:children) ? node.children : node)
end

#process(nodes) ⇒ `Object`

Process a DOM node, converting it to your markup language according to your defined rules. If the node is a Text node, it will return its string representation. Otherwise it will call the rule defined for it.

# File 'lib/undress/grammar.rb', line 104

def process(nodes)
  Array(nodes).map do |node|
    if node.text?
      node.to_html
    elsif node.elem? 
      send node.name.to_sym, node if ! defined?(ALLOWED_TAGS) || ALLOWED_TAGS.empty? || ALLOWED_TAGS.include?(node.name)
    else
      ""
    end
  end.join("")
end
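
The dispatch described above can be sketched without Hpricot by using hypothetical node classes that expose the same predicates. TextNode, ElemNode, and DispatchGrammar are invented for this example, and the ALLOWED_TAGS check from the source is left out for brevity.

```ruby
# Hypothetical text node: returns its text as-is.
class TextNode
  def initialize(text)
    @text = text
  end

  def text?; true; end
  def elem?; false; end
  def to_html; @text; end
end

# Hypothetical element node with a tag name and child nodes.
class ElemNode
  attr_reader :name, :children

  def initialize(name, children)
    @name, @children = name, children
  end

  def text?; false; end
  def elem?; true; end
end

class DispatchGrammar
  # Text nodes are emitted directly; elements are sent to the method
  # named after their tag.
  def process(nodes)
    Array(nodes).map do |node|
      if node.text?
        node.to_html
      elsif node.elem?
        send(node.name.to_sym, node)
      else
        ""
      end
    end.join
  end

  # A rule for <strong>.
  def strong(node)
    "*#{process(node.children)}*"
  end

  # Tags without a rule fall back to their processed contents.
  def method_missing(tag, node, *args)
    process(node.children)
  end
end

tree = [
  TextNode.new("a "),
  ElemNode.new("span", [ElemNode.new("strong", [TextNode.new("bold")])]),
  TextNode.new(" move")
]
output = DispatchGrammar.new.process(tree)
puts output
```

Here the span has no rule of its own, so only its contents survive, while the nested strong is rendered by its rule.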

#process!(node) ⇒ `Object`

:nodoc:

# File 'lib/undress/grammar.rb', line 116

def process!(node) #:nodoc:
  pre_processing_rules.each do |selector, handler|
    node.search(selector).each(&handler)
  end

  process(node.children).tap do |text|
    post_processing_rules.each do |rule, handler|
      handler.is_a?(String) ?  text.gsub!(rule, handler) : text.gsub!(rule, &handler)
    end
  end
end

#surrounded_by_whitespace?(node) ⇒ `Boolean`

Helper method that tells you if the given DOM node has whitespace immediately before or after it (either side is enough).

Returns:

(Boolean)

# File 'lib/undress/grammar.rb', line 135

def surrounded_by_whitespace?(node)
  (node.previous && node.previous.text? && node.previous.to_s =~ /\s+$/) ||
    (node.next && node.next.text? && node.next.to_s =~ /^\s+/)
end
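
The sketch below runs the same expression against hypothetical stub nodes (WsNode is invented for this example) to show that whitespace on one side is sufficient and that the result is truthy or falsy rather than strictly true or false.

```ruby
# Hypothetical node: a text node when given a string, an element otherwise.
class WsNode
  attr_accessor :previous, :next

  def initialize(text = nil)
    @text = text
  end

  def text?; !@text.nil?; end
  def to_s; @text.to_s; end
end

# Same expression as the source above, as a standalone method.
def surrounded_by_whitespace?(node)
  (node.previous && node.previous.text? && node.previous.to_s =~ /\s+$/) ||
    (node.next && node.next.text? && node.next.to_s =~ /^\s+/)
end

# "word<em>x</em> after": whitespace only on the right-hand side.
one_sided = WsNode.new
one_sided.previous = WsNode.new("word")
one_sided.next     = WsNode.new(" after")

# "un<em>x</em>able": no whitespace on either side.
tight = WsNode.new
tight.previous = WsNode.new("un")
tight.next     = WsNode.new("able")

one_sided_result = surrounded_by_whitespace?(one_sided)
tight_result     = surrounded_by_whitespace?(tight)
puts one_sided_result.inspect, tight_result.inspect
```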
