Class: WebPageParser::BaseRegexpParser
- Inherits:
- BaseParser (hierarchy: WebPageParser::BaseRegexpParser < BaseParser < Object)
- Includes:
- Oniguruma
- Defined in:
- lib/web-page-parser/base_parser.rb
Overview
BaseRegexpParser is designed to be sub-classed to write new parsers that use regular expressions. It provides some basic help, but most of the work needs to be done by the sub-class.
Simple pages could be implemented by just defining new regular expression constants, but more advanced parsing can be achieved with the *_processor methods.
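The sub-classing pattern described above can be sketched without the gem itself. This is a minimal stand-in under stated assumptions, not the library's implementation: `MiniRegexpParser`, `ExamplePageParser`, the "Example News" suffix and the sample HTML are all hypothetical, and plain Ruby `Regexp` is used in place of Oniguruma's `ORegexp`.

```ruby
# Hypothetical stand-in for BaseRegexpParser, reduced to the title path only.
class MiniRegexpParser
  TITLE_RE = //

  def initialize(options = {})
    @page = options[:page]
  end

  # Look the constant up on the concrete class so sub-classes can override it.
  def class_const(name)
    self.class.const_get(name)
  end

  def title
    return @title if @title
    if (matches = class_const(:TITLE_RE).match(@page))
      @title = matches[1].to_s.strip
      title_processor
      @title
    end
  end

  # Hook for sub-classes; does nothing by default.
  def title_processor; end
end

# Hypothetical sub-class: one regular expression constant plus one processor hook.
class ExamplePageParser < MiniRegexpParser
  TITLE_RE = /<h1[^>]*>(.*?)<\/h1>/m

  # Strip a made-up site suffix from the extracted title.
  def title_processor
    @title = @title.sub(/\s*\|\s*Example News\z/, '')
  end
end

parser = ExamplePageParser.new(page: "<html><h1> Rain expected | Example News </h1></html>")
puts parser.title
```

A simple page would need only the constant; the `title_processor` override is there to show where site-specific clean-up would go.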
Direct Known Subclasses
BbcNewsPageParserV1, BbcNewsPageParserV2, GuardianPageParserV1, NewYorkTimesPageParserV1
Constant Summary
- TITLE_RE =
The regular expression to extract the title
//
- DATE_RE =
The regular expression to extract the date
//
- CONTENT_RE =
The regular expression to extract the content
//
- KILL_CHARS_RE =
The regular expression to find all characters that should be removed from any content.
ORegexp.new('[\n\r]+')
- HTML_ENTITIES_DECODER =
The object used to turn HTML entities into real characters
HTMLEntities.new
Instance Attribute Summary
Attributes inherited from BaseParser
Instance Method Summary
-
#content ⇒ Object
The content method returns the important body text of the web page.
-
#date ⇒ Object
The date method returns the timestamp of the web page, as a DateTime object.
-
#decode_entities(s) ⇒ Object
Convert HTML entities to Unicode.
-
#encode(s) ⇒ Object
Handle any string encoding.
-
#initialize(options = { }) ⇒ BaseRegexpParser
constructor
A new instance of BaseRegexpParser.
-
#page ⇒ Object
Return the page contents, retrieving them from the server if necessary.
-
#retrieve_page(rurl = nil) ⇒ Object
Request the page from the server and return the raw contents.
-
#title ⇒ Object
The title method returns the title of the web page.
Methods inherited from BaseParser
Constructor Details
#initialize(options = { }) ⇒ BaseRegexpParser
Returns a new instance of BaseRegexpParser.
# File 'lib/web-page-parser/base_parser.rb', line 87
def initialize(options = { })
  super(options)
  @page = encode(@page)
end
Instance Method Details
#content ⇒ Object
The content method returns the important body text of the web page.
It does basic extraction and pre-processing of the page content and then calls the content_processor method for any further custom processing. Lastly, it does some basic post-processing and returns the content as an array of strings (an empty array if CONTENT_RE does not match).
When writing a new parser, the CONTENT_RE constant should be defined in the subclass. The KILL_CHARS_RE constant can be overridden if necessary.
# File 'lib/web-page-parser/base_parser.rb', line 155
def content
  return @content if @content
  matches = class_const(:CONTENT_RE).match(page)
  if matches
    @content = class_const(:KILL_CHARS_RE).gsub(matches[1].to_s, '')
    content_processor
    @content.collect! { |p| decode_entities(p.strip) }
    @content.delete_if { |p| p == '' or p.nil? }
  end
  @content = [] if @content.nil?
  @content
end
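The extract, clean, process and post-process sequence can be illustrated on its own. The sketch below is an approximation under stated assumptions: `extract_content`, the regular expressions, the four-entry `ENTITIES` table and the paragraph-splitting step are hypothetical stand-ins for `CONTENT_RE`, `content_processor` and `HTMLEntities`, and plain `Regexp` replaces `ORegexp`.

```ruby
# Stand-ins for the class constants; plain Regexp instead of ORegexp.
CONTENT_RE    = /<div class="body">(.*?)<\/div>/m
KILL_CHARS_RE = /[\n\r]+/
ENTITIES      = { '&amp;' => '&', '&lt;' => '<', '&gt;' => '>', '&quot;' => '"' }.freeze

# Tiny stand-in for HTMLEntities#decode covering four entities only.
def decode_entities(s)
  s.gsub(/&(?:amp|lt|gt|quot);/) { |e| ENTITIES[e] }
end

def extract_content(page)
  matches = CONTENT_RE.match(page)
  return [] unless matches
  # Pre-processing: remove newlines and carriage returns.
  content = matches[1].to_s.gsub(KILL_CHARS_RE, '')
  # Hypothetical content_processor step: split on paragraph tags.
  content = content.split(/<\/?p[^>]*>/)
  # Post-processing: trim, decode entities, drop empty paragraphs.
  content.map! { |para| decode_entities(para.strip) }
  content.delete_if { |para| para.empty? }
  content
end

page = %(<div class="body">\n<p>Fish &amp; chips</p>\n<p> Tea </p>\n<p></p></div>)
p extract_content(page)
```

In this sketch the splitting step is what turns the cleaned string into an array, which is why the post-processing can then work paragraph by paragraph.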
#date ⇒ Object
The date method returns the timestamp of the web page, as a DateTime object.
It does the basic extraction using the DATE_RE regular expression but the work of converting the text into a DateTime object needs to be done by the date_processor method.
# File 'lib/web-page-parser/base_parser.rb', line 136
def date
  return @date if @date
  if matches = class_const(:DATE_RE).match(page)
    @date = matches[1].to_s.strip
    date_processor
    @date
  end
end
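As an illustration of the split between extraction and conversion, the following sketch captures the date text with a regular expression and then converts it with the standard library's `DateTime.parse`. The `extract_date` helper, the `pubdate` meta tag pattern and the sample markup are assumptions made for the example; a real parser would put the conversion in its own date_processor.

```ruby
require 'date'

# Hypothetical DATE_RE for a made-up meta tag.
DATE_RE = /<meta name="pubdate" content="([^"]+)"/

def extract_date(page)
  matches = DATE_RE.match(page)
  return nil unless matches
  text = matches[1].to_s.strip
  # Hypothetical date_processor step: turn the captured text into a DateTime.
  begin
    DateTime.parse(text)
  rescue ArgumentError
    nil # leave the date unset if the text cannot be parsed
  end
end

page = '<meta name="pubdate" content=" 2010-03-14T09:30:00+00:00 ">'
date = extract_date(page)
puts date.iso8601
```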
#decode_entities(s) ⇒ Object
Convert HTML entities to Unicode.
# File 'lib/web-page-parser/base_parser.rb', line 169
def decode_entities(s)
  HTML_ENTITIES_DECODER.decode(s)
end
#encode(s) ⇒ Object
Handle any string encoding
# File 'lib/web-page-parser/base_parser.rb', line 93
def encode(s)
  return s if s.nil?
  return s if s.valid_encoding?
  if s.force_encoding("iso-8859-1").valid_encoding?
    return s.encode('utf-8', 'iso-8859-1')
  end
  s
end
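The fallback can be exercised in isolation. The sketch below copies the method body into a standalone function and feeds it a made-up byte string that is invalid as UTF-8 but meaningful as ISO-8859-1; the sample data and the added comments are not part of the library.

```ruby
def encode(s)
  return s if s.nil?
  return s if s.valid_encoding?
  # Every byte sequence is valid ISO-8859-1, so this branch acts as the fallback.
  if s.force_encoding("iso-8859-1").valid_encoding?
    return s.encode('utf-8', 'iso-8859-1')
  end
  s
end

# "caf\xE9" is "café" in ISO-8859-1; labelled as UTF-8, the lone 0xE9 byte is invalid.
raw = "caf\xE9".b.force_encoding(Encoding::UTF_8)
fixed = encode(raw)
puts fixed.encoding
puts fixed.valid_encoding?
```

Note that `force_encoding` relabels the argument in place, so after the call `raw` itself is tagged as ISO-8859-1; only the returned string is UTF-8.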
#page ⇒ Object
Return the page contents, retrieving them from the server if necessary.
# File 'lib/web-page-parser/base_parser.rb', line 103
def page
  @page ||= retrieve_page
end
#retrieve_page(rurl = nil) ⇒ Object
Request the page from the server and return the raw contents.
# File 'lib/web-page-parser/base_parser.rb', line 108
def retrieve_page(rurl = nil)
  durl = rurl || url
  return nil unless durl
  durl = filter_url(durl) if self.respond_to?(:filter_url)
  self.class.retrieve_session ||= WebPageParser::HTTP::Session.new
  encode(self.class.retrieve_session.get(durl))
end
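How page and retrieve_page cooperate (fetch lazily, cache the result, share one session per class) can be sketched without any network access. `FakeSession` and `LazyPage` are hypothetical stand-ins for `WebPageParser::HTTP::Session` and the parser class, and the encoding step is left out for brevity.

```ruby
# Hypothetical stand-in for WebPageParser::HTTP::Session; returns canned text
# and counts requests instead of touching the network.
class FakeSession
  attr_reader :requests

  def initialize
    @requests = 0
  end

  def get(url)
    @requests += 1
    "<html>fetched #{url}</html>"
  end
end

class LazyPage
  class << self
    attr_accessor :retrieve_session # one session shared by all instances
  end

  attr_reader :url

  def initialize(options = {})
    @url = options[:url]
    @page = options[:page]
  end

  # Fetch on first use only; later calls return the cached contents.
  def page
    @page ||= retrieve_page
  end

  def retrieve_page(rurl = nil)
    durl = rurl || url
    return nil unless durl
    durl = filter_url(durl) if respond_to?(:filter_url)
    self.class.retrieve_session ||= FakeSession.new
    self.class.retrieve_session.get(durl)
  end
end

parser = LazyPage.new(url: "http://example.com/story")
parser.page
parser.page
puts LazyPage.retrieve_session.requests
```

Because the result is memoised with `||=`, a page passed in through the options is never re-fetched, and a parser with neither a page nor a URL simply returns nil.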
#title ⇒ Object
The title method returns the title of the web page.
It does the basic extraction using the TITLE_RE regular expression and handles text encoding. More advanced parsing can be done by overriding the title_processor method.
# File 'lib/web-page-parser/base_parser.rb', line 121
def title
  return @title if @title
  if matches = class_const(:TITLE_RE).match(page)
    @title = matches[1].to_s.strip
    title_processor
    @title = decode_entities(@title)
  end
end