Class: WebPageParser::BbcNewsPageParserV2
- Inherits:
- BaseParser (full chain: Object, BaseParser, WebPageParser::BbcNewsPageParserV2)
- Defined in:
- lib/web-page-parser/parsers/bbc_news_page_parser.rb
Overview
BbcNewsPageParserV2 parses BBC News web pages, extracting the title, publication date and content.
Direct Known Subclasses
Constant Summary
- TITLE_RE =
ORegexp.new('<meta name="Headline" content="(.*)"', 'i')
- DATE_RE =
ORegexp.new('<meta name="OriginalPublicationDate" content="(.*)"', 'i')
- CONTENT_RE =
ORegexp.new('S BO -->(.*?)<!-- E BO', 'm')
- STRIP_BLOCKS_RE =
ORegexp.new('<(table|noscript|script|object|form)[^>]*>.*?</\1>', 'i')
- STRIP_TAGS_RE =
ORegexp.new('</?(b|div|img|tr|td|br|font|span)[^>]*>','i')
- STRIP_COMMENTS_RE =
ORegexp.new('<!--.*?-->')
- STRIP_CAPTIONS_RE =
ORegexp.new('<!-- caption .+?<!-- END - caption -->')
- WHITESPACE_RE =
ORegexp.new('[\t ]+')
- PARA_RE =
Regexp.new('</?p[^>]*>', Regexp::IGNORECASE)
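The constants above suggest a two-stage approach: pull the headline and date out of meta tags, then cut the article body out from between the "S BO" and "E BO" comment markers and strip unwanted markup from it. The following is a minimal sketch of that idea, not the parser's actual implementation. It assumes that Ruby's built-in Regexp can stand in for ORegexp (with the 'i' and 'm' options mapped to /i and /m), and the helper names, the sample page and the order of the stripping steps are all hypothetical.

```ruby
# Built-in Regexp stand-ins for the ORegexp constants listed above.
# Assumption: ORegexp's 'i' and 'm' options behave like Regexp's /i and /m.
TITLE_RE          = /<meta name="Headline" content="(.*)"/i
DATE_RE           = /<meta name="OriginalPublicationDate" content="(.*)"/i
CONTENT_RE        = /S BO -->(.*?)<!-- E BO/m
STRIP_BLOCKS_RE   = %r{<(table|noscript|script|object|form)[^>]*>.*?</\1>}i
STRIP_TAGS_RE     = %r{</?(b|div|img|tr|td|br|font|span)[^>]*>}i
STRIP_COMMENTS_RE = /<!--.*?-->/
STRIP_CAPTIONS_RE = /<!-- caption .+?<!-- END - caption -->/
WHITESPACE_RE     = /[\t ]+/
PARA_RE           = %r{</?p[^>]*>}i

# Hypothetical sample page, made up for illustration only.
SAMPLE_PAGE = <<~HTML
  <html><head>
  <meta name="Headline" content="Sample headline" />
  <meta name="OriginalPublicationDate" content="2009/05/11 10:00:00" />
  </head><body>
  <!-- S BO -->
  <p><b>First   paragraph.</b></p>
  <script type="text/javascript">ignored();</script>
  <!-- a comment -->
  <p>Second <span class="x">paragraph</span>.</p>
  <!-- E BO -->
  </body></html>
HTML

# Hypothetical helper: returns the headline string, or nil if absent.
def extract_title(page)
  page[TITLE_RE, 1]
end

# Hypothetical helper: returns the raw date string, or nil if absent.
def extract_date(page)
  page[DATE_RE, 1]
end

# Hypothetical helper: returns the article body as an array of paragraphs.
# The order of the stripping steps is an assumption.
def extract_paragraphs(page)
  body = page[CONTENT_RE, 1]
  return [] if body.nil?

  body = body.gsub(STRIP_CAPTIONS_RE, '')  # drop caption sections first
  body = body.gsub(STRIP_COMMENTS_RE, '')  # then any remaining comments
  body = body.gsub(STRIP_BLOCKS_RE, '')    # whole script/table/form blocks
  body = body.gsub(STRIP_TAGS_RE, '')      # inline tags, keeping their text
  body = body.gsub(WHITESPACE_RE, ' ')     # collapse runs of tabs and spaces
  body.split(PARA_RE).map(&:strip).reject(&:empty?)
end
```

Calling extract_title(SAMPLE_PAGE), extract_date(SAMPLE_PAGE) and extract_paragraphs(SAMPLE_PAGE) exercises each stage in turn. Note that the (.*) capture in TITLE_RE and DATE_RE is greedy, so on a line containing further quoted attributes it would run to the last double quote on that line.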
Constants inherited from BaseParser
WebPageParser::BaseParser::HTML_ENTITIES_DECODER, WebPageParser::BaseParser::ICONV, WebPageParser::BaseParser::KILL_CHARS_RE
Instance Attribute Summary
Attributes inherited from BaseParser
Method Summary
Methods inherited from BaseParser
#content, #date, #decode_entities, #hash, #initialize, #title
Constructor Details
This class inherits a constructor from WebPageParser::BaseParser