Class: WebPageParser::BbcNewsPageParserV2

Inherits:
BaseParser
  • Object
Defined in:
lib/web-page-parser/parsers/bbc_news_page_parser.rb

Overview

BbcNewsPageParserV2 parses BBC News web pages.

Direct Known Subclasses

BbcNewsPageParserV3

Constant Summary

TITLE_RE =
ORegexp.new('<meta name="Headline" content="(.*)"', 'i')
DATE_RE =
ORegexp.new('<meta name="OriginalPublicationDate" content="(.*)"', 'i')
CONTENT_RE =
ORegexp.new('S BO -->(.*?)<!-- E BO', 'm')
STRIP_BLOCKS_RE =
ORegexp.new('<(table|noscript|script|object|form)[^>]*>.*?</\1>', 'i')
STRIP_TAGS_RE =
ORegexp.new('</?(b|div|img|tr|td|br|font|span)[^>]*>', 'i')
STRIP_COMMENTS_RE =
ORegexp.new('<!--.*?-->')
STRIP_CAPTIONS_RE =
ORegexp.new('<!-- caption .+?<!-- END - caption -->')
WHITESPACE_RE =
ORegexp.new('[\t ]+')
PARA_RE =
Regexp.new('</?p[^>]*>', Regexp::IGNORECASE)
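
The constants above suggest a pipeline: pull the headline and date from meta tags, cut out the body between the "S BO" and "E BO" markers, strip unwanted markup, and split on paragraph tags. The following is a minimal sketch of that idea, not the library's implementation. It assumes Ruby's core Regexp as a stand-in for ORegexp, treats the 'm' option as equivalent to Regexp::MULTILINE, and picks its own order for the clean-up steps. The module BbcV2Sketch, the extract helper, and SAMPLE_PAGE are hypothetical names used only for illustration.

```ruby
# Illustrative sketch only: core Regexp stands in for ORegexp (Oniguruma),
# and the order of the clean-up steps is an assumption, not the library's code.
module BbcV2Sketch
  TITLE_RE = Regexp.new('<meta name="Headline" content="(.*)"', Regexp::IGNORECASE)
  DATE_RE = Regexp.new('<meta name="OriginalPublicationDate" content="(.*)"', Regexp::IGNORECASE)
  # Regexp::MULTILINE lets "." match newlines, so the body may span lines.
  CONTENT_RE = Regexp.new('S BO -->(.*?)<!-- E BO', Regexp::MULTILINE)
  STRIP_COMMENTS_RE = Regexp.new('<!--.*?-->')
  STRIP_TAGS_RE = Regexp.new('</?(b|div|img|tr|td|br|font|span)[^>]*>', Regexp::IGNORECASE)
  WHITESPACE_RE = Regexp.new('[\t ]+')
  PARA_RE = Regexp.new('</?p[^>]*>', Regexp::IGNORECASE)

  # Made-up page fragment containing the markers the patterns look for.
  SAMPLE_PAGE = <<~HTML
    <html><head>
    <meta name="Headline" content="Example headline"/>
    <meta name="OriginalPublicationDate" content="2009/01/02 03:04:05"/>
    </head><body>
    <!-- S BO -->
    <p><b>First paragraph.</b></p>
    <!-- ad slot -->
    <p>Second   paragraph.</p>
    <!-- E BO -->
    </body></html>
  HTML

  # Hypothetical helper: returns the title, the raw date string, and the
  # body text as an array of paragraphs.
  def self.extract(html)
    body = html[CONTENT_RE, 1].to_s            # text between the S BO / E BO markers
    body = body.gsub(STRIP_COMMENTS_RE, '')    # drop HTML comments
               .gsub(STRIP_TAGS_RE, '')        # drop inline and layout tags, keep their text
               .gsub(WHITESPACE_RE, ' ')       # collapse runs of tabs and spaces
    paragraphs = body.split(PARA_RE).map(&:strip).reject(&:empty?)
    { title: html[TITLE_RE, 1], date: html[DATE_RE, 1], paragraphs: paragraphs }
  end
end

fields = BbcV2Sketch.extract(BbcV2Sketch::SAMPLE_PAGE)
puts fields[:title]
puts fields[:date]
puts fields[:paragraphs].inspect
```

Note that TITLE_RE and DATE_RE use a greedy (.*) without the multiline option, so the capture runs to the last double quote on the same line as the meta tag. STRIP_BLOCKS_RE and STRIP_CAPTIONS_RE are left out of the sketch; they would slot in as further gsub steps before the tags are stripped.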

Constants inherited from BaseParser

WebPageParser::BaseParser::HTML_ENTITIES_DECODER, WebPageParser::BaseParser::ICONV, WebPageParser::BaseParser::KILL_CHARS_RE

Instance Attribute Summary

Attributes inherited from BaseParser

#guid, #page, #url

Method Summary

Methods inherited from BaseParser

#content, #date, #decode_entities, #hash, #initialize, #title

Constructor Details

This class inherits a constructor from WebPageParser::BaseParser