Class: WebPageParser::BaseRegexpParser

Inherits:
BaseParser
  • Object
Includes:
Oniguruma
Defined in:
lib/web-page-parser/base_parser.rb

Overview

BaseRegexpParser is designed to be subclassed to write new parsers that use regular expressions. It provides some basic help, but most of the work needs to be done by the subclass.

Simple pages could be implemented by just defining new regular expression constants, but more advanced parsing can be achieved with the *_processor methods.
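
The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not the library code: SketchRegexpParser is a reduced stand-in for BaseRegexpParser, its class_const helper is an assumed implementation of the lookup the real methods call, and ExampleNewsParser and its page markup are hypothetical.

```ruby
# Stand-in for the library base class (NOT WebPageParser::BaseRegexpParser);
# it only imitates the title lookup documented on this page.
class SketchRegexpParser
  TITLE_RE = //

  def initialize(options = {})
    @page = options[:page]
  end

  # Assumed helper: resolve the constant on the concrete class so that
  # a subclass definition overrides the base one.
  def class_const(name)
    self.class.const_get(name)
  end

  def title
    return @title if @title
    if (matches = class_const(:TITLE_RE).match(@page))
      @title = matches[1].to_s.strip
      title_processor
      @title
    end
  end

  # Hook for subclasses; does nothing by default.
  def title_processor; end
end

# Hypothetical parser for an imaginary site.
class ExampleNewsParser < SketchRegexpParser
  # Simple pages: redefine the constant only.
  TITLE_RE = /<h1[^>]*>(.*?)<\/h1>/m

  # More advanced parsing: adjust @title in the *_processor hook.
  def title_processor
    @title = @title.sub(/\s*\|\s*Example News\z/, '')
  end
end

parser = ExampleNewsParser.new(page: "<html><h1> Storm hits coast | Example News </h1></html>")
puts parser.title
```

The constant does the extraction and the processor hook does the site-specific cleanup, so a parser for a simple page may need no methods at all.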

Constant Summary

TITLE_RE =

The regular expression to extract the title

//
DATE_RE =

The regular expression to extract the date

//
CONTENT_RE =

The regular expression to extract the content

//
KILL_CHARS_RE =

The regular expression to find all characters that should be removed from any content.

ORegexp.new('[\n\r]+')
HTML_ENTITIES_DECODER =

The object used to turn HTML entities into real characters

HTMLEntities.new

Instance Attribute Summary

Attributes inherited from BaseParser

#guid, #url

Instance Method Summary

Methods inherited from BaseParser

#hash

Constructor Details

#initialize(options = { }) ⇒ BaseRegexpParser

Returns a new instance of BaseRegexpParser.



# File 'lib/web-page-parser/base_parser.rb', line 87

def initialize(options = { })
  super(options)
  @page = encode(@page)
end

Instance Method Details

#content ⇒ Object

The content method returns the important body text of the web page.

It does basic extraction and pre-processing of the page content and then calls the content_processor method for any custom processing that is needed. Lastly, it does some basic post-processing (stripping whitespace, decoding HTML entities, and dropping empty entries) and returns the content as an array of strings, or an empty array if CONTENT_RE does not match.

When writing a new parser, the CONTENT_RE constant should be defined in the subclass. The KILL_CHARS_RE constant can be overridden if necessary.



# File 'lib/web-page-parser/base_parser.rb', line 155

def content
  return @content if @content
  matches = class_const(:CONTENT_RE).match(page)
  if matches
    @content = class_const(:KILL_CHARS_RE).gsub(matches[1].to_s, '')
    content_processor
    @content.collect! { |p| decode_entities(p.strip) }
    @content.delete_if { |p| p == '' or p.nil? }
  end
  @content = [] if @content.nil?
  @content
end
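
The steps of the content method can be followed in isolation with a standalone sketch. Several things here are assumptions made for illustration: the page markup and CONTENT_RE are hypothetical, plain Regexp replaces ORegexp, splitting on <p> stands in for whatever content_processor does in a real subclass, and the small ENTITIES table stands in for HTML_ENTITIES_DECODER.

```ruby
CONTENT_RE    = /<div id="story">(.*?)<\/div>/m  # hypothetical page layout
KILL_CHARS_RE = /[\n\r]+/                         # same pattern as the default above

# Tiny stand-in for the HTML entity decoder; covers four entities only.
ENTITIES = { '&amp;' => '&', '&lt;' => '<', '&gt;' => '>', '&quot;' => '"' }.freeze

def extract_content(page)
  matches = CONTENT_RE.match(page)
  return [] unless matches

  # 1. Remove unwanted characters from the captured text.
  text = matches[1].to_s.gsub(KILL_CHARS_RE, '')
  # 2. Assumed stand-in for content_processor: split into paragraphs.
  paragraphs = text.split(/<p>/i).map { |p| p.gsub(/<\/p>/i, '') }
  # 3. Post-process: strip whitespace and decode entities.
  paragraphs = paragraphs.map { |p| p.strip.gsub(/&(?:amp|lt|gt|quot);/, ENTITIES) }
  # 4. Drop empty entries.
  paragraphs.reject(&:empty?)
end

page = %(<div id="story">\n<p>Fish &amp; chips</p>\n<p> </p><p>Second para</p></div>)
p extract_content(page)
```

Note that the result is an array of paragraphs and that a page with no match yields an empty array rather than nil.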

#date ⇒ Object

The date method returns the timestamp of the web page, as a DateTime object.

It does the basic extraction using the DATE_RE regular expression but the work of converting the text into a DateTime object needs to be done by the date_processor method.



# File 'lib/web-page-parser/base_parser.rb', line 136

def date
  return @date if @date
  if matches = class_const(:DATE_RE).match(page)
    @date = matches[1].to_s.strip
    date_processor
    @date
  end
end
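
As a rough illustration of the conversion that date_processor is expected to perform, the sketch below extracts a date string and parses it with DateTime.parse from the Ruby standard library. The DATE_RE pattern, the meta tag, and the extract_date function are hypothetical; a real subclass would instead assign the converted value to @date inside date_processor.

```ruby
require 'date'

DATE_RE = /<meta name="pubdate" content="([^"]+)"/  # hypothetical markup

def extract_date(page)
  matches = DATE_RE.match(page)
  return nil unless matches

  raw = matches[1].to_s.strip
  begin
    # The part a date_processor would do: text to DateTime.
    DateTime.parse(raw)
  rescue ArgumentError
    # Unparseable text: give up rather than return a string.
    nil
  end
end

d = extract_date('<meta name="pubdate" content=" 2009-01-16 14:30:00 +0000 ">')
puts d.iso8601
```

Sites that use an unusual date layout would likely need DateTime.strptime with an explicit format string instead of the lenient DateTime.parse.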

#decode_entities(s) ⇒ Object

Converts HTML entities to Unicode characters.



# File 'lib/web-page-parser/base_parser.rb', line 169

def decode_entities(s)
  HTML_ENTITIES_DECODER.decode(s)
end

#encode(s) ⇒ Object

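Handles string encoding. A nil or validly encoded string is returned unchanged; otherwise the string is treated as ISO-8859-1 and transcoded to UTF-8.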



# File 'lib/web-page-parser/base_parser.rb', line 93

def encode(s)
  return s if s.nil?
  return s if s.valid_encoding?
  if s.force_encoding("iso-8859-1").valid_encoding?
    return s.encode('utf-8', 'iso-8859-1')
  end
  s
end
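
The behaviour of encode can be tried outside the parser, since the method depends only on core String methods. The sketch below copies the method body shown above into a top-level function; the sample bytes are an assumption chosen to represent a Latin-1 page that Ruby has tagged as UTF-8.

```ruby
def encode(s)
  return s if s.nil?
  return s if s.valid_encoding?
  if s.force_encoding("iso-8859-1").valid_encoding?
    return s.encode('utf-8', 'iso-8859-1')
  end
  s
end

# "café" as ISO-8859-1 bytes. The literal is tagged UTF-8, where 0xE9
# followed by nothing is an invalid sequence. dup keeps it unfrozen.
latin1 = "caf\xE9".dup
puts latin1.valid_encoding?   # invalid as UTF-8

fixed = encode(latin1)
puts fixed.encoding
puts fixed.valid_encoding?
```

One side effect worth knowing: force_encoding changes the encoding tag of the string that was passed in, so the caller's original object is relabelled as ISO-8859-1 even though the transcoded result is a new string.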

#page ⇒ Object

Returns the page contents, retrieving them from the server if necessary.



# File 'lib/web-page-parser/base_parser.rb', line 103

def page
  @page ||= retrieve_page
end
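
The `||=` in page means the fetch happens at most once per successful retrieval. A minimal sketch of that memoization, where LazyPage, its fetch_count counter, and the canned response are all hypothetical stand-ins for the parser and its HTTP request:

```ruby
class LazyPage
  attr_reader :fetch_count

  def initialize(page = nil)
    @page = page
    @fetch_count = 0
  end

  # Same shape as the page method above.
  def page
    @page ||= retrieve_page
  end

  # Stand-in for the network request; counts how often it runs.
  def retrieve_page
    @fetch_count += 1
    "<html>fetched</html>"
  end
end

lazy = LazyPage.new
lazy.page
lazy.page
puts lazy.fetch_count

preloaded = LazyPage.new("<html>supplied</html>")
preloaded.page
puts preloaded.fetch_count
```

Because `||=` assigns whenever the current value is nil or false, a retrieval that returns nil is not cached and will be attempted again on the next call.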

#retrieve_page(rurl = nil) ⇒ Object

Requests the page from the server and returns the raw contents, or nil if no URL is available.



# File 'lib/web-page-parser/base_parser.rb', line 108

def retrieve_page(rurl = nil)
  durl = rurl || url
  return nil unless durl
  durl = filter_url(durl) if self.respond_to?(:filter_url)
  self.class.retrieve_session ||= WebPageParser::HTTP::Session.new
  encode(self.class.retrieve_session.get(durl))
end

#title ⇒ Object

The title method returns the title of the web page.

It does the basic extraction using the TITLE_RE regular expression, strips surrounding whitespace, and decodes HTML entities. More advanced parsing can be done by overriding the title_processor method.



# File 'lib/web-page-parser/base_parser.rb', line 121

def title
  return @title if @title
  if matches = class_const(:TITLE_RE).match(page)
    @title = matches[1].to_s.strip
    title_processor
    @title = decode_entities(@title)
  end
end