Class: WebPageParser::GuardianPageParserV1

Inherits:
BaseParser
  • Object
Defined in:
lib/web-page-parser/parsers/guardian_page_parser.rb

Overview

BbcNewsPageParserV1 parses BBC News web pages exactly like the old News Sniffer BbcNewsPage class did. This should only ever be used for backwards compatibility with News Sniffer and is never supplied for use by a factory.

Constant Summary

ICONV =
nil
TITLE_RE =
ORegexp.new('<meta property="og:title" content="(.*)"', 'i')
DATE_RE =
ORegexp.new('<meta property="article:published_time" content="(.*)"', 'i')
CONTENT_RE =
ORegexp.new('article-body-blocks">(.*?)<div id="related"', 'm')
STRIP_TAGS_RE =
ORegexp.new('</?(a|span|div|img|tr|td|!--|table)[^>]*>','i')
PARA_RE =
Regexp.new(/<(p|h2)[^>]*>(.*?)<\/\1>/i)
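The constants above suggest a regex-driven extraction: pull the title and date from meta tags, cut out the article body, strip inline markup, then split into paragraphs. The following is a minimal sketch of that idea, not the parser's actual implementation, since this page does not show how the inherited methods use the constants. It assumes Ruby's built-in Regexp as a stand-in for ORegexp, and the module name GuardianRegexSketch, the SAMPLE_PAGE fragment, and the parse helper are all hypothetical.

```ruby
# Illustrative sketch only: mirrors the documented patterns with Ruby's
# built-in Regexp (an assumption; the class itself uses ORegexp).
module GuardianRegexSketch
  TITLE_RE      = Regexp.new('<meta property="og:title" content="(.*)"', Regexp::IGNORECASE)
  DATE_RE       = Regexp.new('<meta property="article:published_time" content="(.*)"', Regexp::IGNORECASE)
  # MULTILINE lets "." match newlines, so the body can span several lines.
  CONTENT_RE    = Regexp.new('article-body-blocks">(.*?)<div id="related"', Regexp::MULTILINE)
  STRIP_TAGS_RE = Regexp.new('</?(a|span|div|img|tr|td|!--|table)[^>]*>', Regexp::IGNORECASE)
  PARA_RE       = Regexp.new(/<(p|h2)[^>]*>(.*?)<\/\1>/i)

  # Hypothetical page fragment, invented for this example.
  SAMPLE_PAGE = <<~HTML
    <meta property="og:title" content="Example headline"/>
    <meta property="article:published_time" content="2012-01-02T10:00:00Z"/>
    <div id="article-body-blocks"><h2>Intro</h2>
    <p>First <a href="/x">linked</a> paragraph.</p>
    <p>Second <span class="b">paragraph</span>.</p>
    <div id="related"></div>
  HTML

  # Hypothetical helper: applies the patterns in a plausible order.
  def self.parse(page)
    title = page[TITLE_RE, 1]                           # first capture group
    date  = page[DATE_RE, 1]
    body  = page[CONTENT_RE, 1].to_s.gsub(STRIP_TAGS_RE, '')
    paragraphs = body.scan(PARA_RE).map { |_tag, text| text.strip }
    { title: title, date: date, content: paragraphs }
  end
end

result = GuardianRegexSketch.parse(GuardianRegexSketch::SAMPLE_PAGE)
puts result[:title]
puts result[:date]
puts result[:content].inspect
```

Note that TITLE_RE and DATE_RE use a greedy `(.*)`, so they rely on each meta tag sitting on its own line with no further double quote after the content attribute; PARA_RE uses the `\1` backreference to make sure an opening `<p>` or `<h2>` is closed by the same tag.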

Constants inherited from BaseParser

BaseParser::HTML_ENTITIES_DECODER, BaseParser::KILL_CHARS_RE

Instance Attribute Summary

Attributes inherited from BaseParser

#guid, #page, #url

Method Summary

Methods inherited from BaseParser

#content, #date, #decode_entities, #hash, #initialize, #title

Constructor Details

This class inherits a constructor from WebPageParser::BaseParser