Class: WebPageParser::BaseRegexpParser

Inherits:

Object
BaseParser
WebPageParser::BaseRegexpParser

Includes: Oniguruma

Defined in: lib/web-page-parser/base_parser.rb

Overview

BaseRegexpParser is designed to be subclassed to write new parsers that use regular expressions. It provides some basic helpers, but most of the work needs to be done by the subclass.

Simple pages could be implemented by just defining new regular expression constants, but more advanced parsing can be achieved with the *_processor methods.
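The subclassing pattern described above can be sketched without the gem installed. The following is a minimal, self-contained stand-in, not the library's implementation: it uses Ruby's built-in Regexp instead of Oniguruma's ORegexp, and the names MiniRegexpParser, ExamplePageParser, and the :page option are assumptions made for illustration.

```ruby
# Hypothetical stand-in for the base class; it only mimics the
# "regex constants plus *_processor hooks" pattern described above.
class MiniRegexpParser
  TITLE_RE = //
  CONTENT_RE = //

  def initialize(options = {})
    @page = options[:page] # assumed option name, for illustration only
  end

  def title
    return @title if @title
    if (matches = self.class::TITLE_RE.match(@page))
      @title = matches[1].to_s.strip
      title_processor
    end
    @title
  end

  def content
    return @content if @content
    if (matches = self.class::CONTENT_RE.match(@page))
      @content = matches[1].to_s.gsub(/[\n\r]+/, '')
      content_processor
      @content = @content.map(&:strip).reject(&:empty?)
    end
    @content ||= []
  end

  private

  # Hooks that subclasses may override for custom processing.
  def title_processor; end

  def content_processor
    @content = [@content]
  end
end

# A "simple page" parser: two constants and one processor override.
class ExamplePageParser < MiniRegexpParser
  TITLE_RE = %r{<h1>(.*?)</h1>}m
  CONTENT_RE = %r{<div id="story">(.*?)</div>}m

  private

  # Split the extracted body into paragraphs.
  def content_processor
    @content = @content.split(%r{</?p>})
  end
end

html = "<h1> Hello </h1>\n<div id=\"story\"><p>One</p>\n<p>Two</p></div>"
parser = ExamplePageParser.new(page: html)
title = parser.title        # "Hello"
paragraphs = parser.content # ["One", "Two"]
```

The base class looks constants up through self.class so that a subclass's TITLE_RE and CONTENT_RE take precedence, which plays the same role as the class_const helper seen in the method sources below.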

Direct Known Subclasses

BbcNewsPageParserV1, BbcNewsPageParserV2, GuardianPageParserV1, NewYorkTimesPageParserV1

Constant Summary

TITLE_RE = The regular expression to extract the title

//

DATE_RE = The regular expression to extract the date

//

CONTENT_RE = The regular expression to extract the content

//

KILL_CHARS_RE = The regular expression to find all characters that should be removed from any content.

ORegexp.new('[\n\r]+')

HTML_ENTITIES_DECODER = The object used to turn HTML entities into real characters

HTMLEntities.new

Instance Attribute Summary

Attributes inherited from BaseParser

#guid, #url

Instance Method Summary

#content ⇒ Object

The content method returns the important body text of the web page.

#date ⇒ Object

The date method returns the timestamp of the web page as a DateTime object.

#decode_entities(s) ⇒ Object

Converts HTML entities to Unicode characters.

#encode(s) ⇒ Object

Handles string encoding.

#initialize(options = { }) ⇒ BaseRegexpParser

Constructor. Returns a new instance of BaseRegexpParser.

#page ⇒ Object

Returns the page contents, retrieving them from the server if necessary.

#retrieve_page(rurl = nil) ⇒ Object

Requests the page from the server and returns the raw contents.

#title ⇒ Object

The title method returns the title of the web page.

Methods inherited from BaseParser

#hash

Constructor Details

#initialize(options = { }) ⇒ `BaseRegexpParser`

Returns a new instance of BaseRegexpParser.

# File 'lib/web-page-parser/base_parser.rb', line 87

def initialize(options = { })
  super(options)
  @page = encode(@page)
end

Instance Method Details

#content ⇒ `Object`

The content method returns the important body text of the web page.

It does basic extraction and pre-processing of the page content and then calls the content_processor method for any further custom processing. Lastly, it does some basic post-processing (stripping whitespace, decoding HTML entities, and dropping empty entries) and returns the content as an array of strings. If CONTENT_RE does not match, it returns an empty array.

When writing a new parser, the CONTENT_RE constant should be defined in the subclass. The KILL_CHARS_RE constant can be overridden if necessary.

# File 'lib/web-page-parser/base_parser.rb', line 155

def content
  return @content if @content
  matches = class_const(:CONTENT_RE).match(page)
  if matches
    @content = class_const(:KILL_CHARS_RE).gsub(matches[1].to_s, '')
    content_processor
    @content.collect! { |p| decode_entities(p.strip) }
    @content.delete_if { |p| p == '' or p.nil? }
  end
  @content = [] if @content.nil?
  @content
end

#date ⇒ `Object`

The date method returns the timestamp of the web page as a DateTime object.

It does the basic extraction using the DATE_RE regular expression but the work of converting the text into a DateTime object needs to be done by the date_processor method.

# File 'lib/web-page-parser/base_parser.rb', line 136

def date
  return @date if @date
  if matches = class_const(:DATE_RE).match(page)
    @date = matches[1].to_s.strip
    date_processor
    @date
  end
end
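Since the conversion to a DateTime is left to date_processor, a subclass has to supply it. The following is a minimal self-contained sketch of one way to do that; ExampleDateParser, its DATE_RE pattern, and the sample markup are hypothetical, and the class stands alone rather than inheriting from the gem.

```ruby
require 'date'

# Hypothetical parser showing the date/date_processor split.
class ExampleDateParser
  # Assumed markup: a <meta name="date" content="..."> tag.
  DATE_RE = /<meta name="date" content="(.*?)"/

  def initialize(page)
    @page = page
  end

  def date
    return @date if @date
    if (matches = DATE_RE.match(@page))
      @date = matches[1].to_s.strip # raw text captured by the regexp
      date_processor                # converts @date in place
      @date
    end
  end

  private

  # Turn the extracted text into a DateTime; fall back to nil
  # when the text cannot be parsed.
  def date_processor
    @date = DateTime.parse(@date)
  rescue ArgumentError
    @date = nil
  end
end

page = '<meta name="date" content="2009-03-12T10:30:00+00:00">'
stamp = ExampleDateParser.new(page).date
```

Rescuing ArgumentError is a design choice of this sketch: it makes an unparseable date behave like a missing one instead of raising from the accessor.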

#decode_entities(s) ⇒ `Object`

Converts HTML entities to Unicode characters.

# File 'lib/web-page-parser/base_parser.rb', line 169

def decode_entities(s)
  HTML_ENTITIES_DECODER.decode(s)
end

#encode(s) ⇒ `Object`

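Handles string encoding. Returns nil and validly encoded strings unchanged; otherwise reinterprets the bytes as ISO-8859-1 and transcodes them to UTF-8.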

# File 'lib/web-page-parser/base_parser.rb', line 93

def encode(s)
  return s if s.nil?
  return s if s.valid_encoding?
  if s.force_encoding("iso-8859-1").valid_encoding?
    return s.encode('utf-8', 'iso-8859-1')
  end
  s
end
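To illustrate the fallback, here is the same logic copied into a standalone method and applied to a string whose bytes are not valid UTF-8. The sample input is an assumption chosen for the demonstration.

```ruby
# Standalone copy of the method above, for illustration.
def encode(s)
  return s if s.nil?
  return s if s.valid_encoding?
  if s.force_encoding("iso-8859-1").valid_encoding?
    return s.encode('utf-8', 'iso-8859-1')
  end
  s
end

# "caf" followed by byte 0xE9, which is e-acute in ISO-8859-1 but an
# incomplete sequence in UTF-8. dup keeps the literal itself untouched.
raw = "caf\xE9".dup.force_encoding('utf-8')
was_valid = raw.valid_encoding?

fixed = encode(raw)
```

Two properties are worth knowing. Every byte sequence is valid ISO-8859-1, so the final `s` fallback is effectively unreachable for non-nil input. Also, force_encoding changes the encoding tag of the string passed in, so the caller's original object is relabelled as ISO-8859-1 as a side effect.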

#page ⇒ `Object`

Returns the page contents, retrieving them from the server if necessary.

# File 'lib/web-page-parser/base_parser.rb', line 103

def page
  @page ||= retrieve_page
end
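The `||=` memoization means the server is contacted at most once, and not at all if @page is already populated. The sketch below demonstrates this with a stubbed fetch; LazyPageExample, the :page option, and the fetch counter are hypothetical and only assume that the constructor can preload @page.

```ruby
# Hypothetical class demonstrating the lazy page lookup.
class LazyPageExample
  attr_reader :fetch_count

  def initialize(options = {})
    @page = options[:page] # assumed way of preloading the page
    @fetch_count = 0
  end

  def page
    @page ||= retrieve_page
  end

  # Stub standing in for a real HTTP request.
  def retrieve_page
    @fetch_count += 1
    "<html>fetched</html>"
  end
end

preloaded = LazyPageExample.new(page: "<html>cached</html>")
preloaded.page

lazy = LazyPageExample.new
lazy.page
lazy.page
```

Note that `||=` retries whenever @page is nil, so a retrieve_page that returns nil (for example when no URL is set) will be called again on the next access.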

#retrieve_page(rurl = nil) ⇒ `Object`

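Requests the page from the server and returns the raw contents after passing them through encode. Uses rurl if given, otherwise the parser's url, and returns nil if neither is set. If the parser responds to filter_url, the URL is filtered through it before the request is made.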

# File 'lib/web-page-parser/base_parser.rb', line 108

def retrieve_page(rurl = nil)
  durl = rurl || url
  return nil unless durl
  durl = filter_url(durl) if self.respond_to?(:filter_url)
  self.class.retrieve_session ||= WebPageParser::HTTP::Session.new
  encode(self.class.retrieve_session.get(durl))
end

#title ⇒ `Object`

The title method returns the title of the web page.

It does the basic extraction using the TITLE_RE regular expression and decodes HTML entities. More advanced parsing can be done by overriding the title_processor method.

# File 'lib/web-page-parser/base_parser.rb', line 121

def title
  return @title if @title
  if matches = class_const(:TITLE_RE).match(page)
    @title = matches[1].to_s.strip
    title_processor
    @title = decode_entities(@title)
  end
end
