Class: WebPageParser::BbcNewsPageParserV4
- Inherits:
- BbcNewsPageParserV3
- Inheritance chain:
- Object > BaseParser > BbcNewsPageParserV2 > BbcNewsPageParserV3 > WebPageParser::BbcNewsPageParserV4
- Defined in:
- lib/web-page-parser/parsers/bbc_news_page_parser.rb
Constant Summary
- CONTENT_RE =
ORegexp.new('<div class=.story-body.>(.*?)<!-- / story\-body', 'm')
- STRIP_PAGE_BOOKMARKS =
ORegexp.new('<div id="page-bookmark-links-head".+?</div>', 'm')
- STRIP_STORY_DATE =
ORegexp.new('<span class="date".+?</span>', 'm')
- STRIP_STORY_LASTUPDATED =
ORegexp.new('<span class="time\-text".+?</span>', 'm')
- STRIP_STORY_TIME =
ORegexp.new('<span class="time".+?</span>', 'm')
- TITLE_RE =
ORegexp.new('<h1 class="story\-header">(.+?)</h1>', 'm')
- STRIP_CAPTIONS_RE2 =
ORegexp.new('<div class=.caption.+?</div>','m')
- STRIP_HIDDEN_A =
ORegexp.new('<a class=.hidden.+?</a>','m')
- STRIP_STORY_FEATURE =
ORegexp.new('<div class=.story\-feature.+?</div>', 'm')
- STRIP_HYPERPUFF_RE =
ORegexp.new('<div class=.embedded-hyper.+?<div class=.hyperpuff.+?</div>.+?</div>', 'm')
- STRIP_MARKETDATA_RE =
ORegexp.new('<div class=.market\-data.+?</div>', 'm')
- STRIP_EMBEDDEDHYPER_RE =
ORegexp.new('<div class=.embedded\-hyper.+?</div>', 'm')
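The two extraction patterns, CONTENT_RE and TITLE_RE, can be tried without the Oniguruma bindings. The following is a minimal sketch that rebuilds them with Ruby's built-in Regexp, assuming that ORegexp's 'm' option corresponds to Regexp::MULTILINE (so that '.' also matches newlines). The names CONTENT_RE_SKETCH, TITLE_RE_SKETCH, SAMPLE_PAGE and extract_sketch are hypothetical and are not part of the library; the page fragment is invented for illustration.

```ruby
# Built-in Regexp stand-ins for two of the constants above.
# Assumption: ORegexp's 'm' option behaves like Regexp::MULTILINE.
CONTENT_RE_SKETCH = Regexp.new('<div class=.story-body.>(.*?)<!-- / story\-body', Regexp::MULTILINE)
TITLE_RE_SKETCH   = Regexp.new('<h1 class="story\-header">(.+?)</h1>', Regexp::MULTILINE)

# Hypothetical page fragment, invented for illustration only.
SAMPLE_PAGE = <<~HTML
  <div class="story-body">
  <h1 class="story-header">Example
  headline</h1>
  <p>First paragraph.</p>
  </div><!-- / story-body -->
HTML

# Returns the first capture group of each pattern, or nil when it does not match.
def extract_sketch(page)
  {
    content: page[CONTENT_RE_SKETCH, 1],
    title:   page[TITLE_RE_SKETCH, 1]
  }
end

result = extract_sketch(SAMPLE_PAGE)
puts result[:title]
```

Both patterns are non-greedy, so each capture stops at the first closing marker rather than the last one on the page. The '.' around the class name in CONTENT_RE matches either quote character.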
Constants inherited from BbcNewsPageParserV3
WebPageParser::BbcNewsPageParserV3::ICONV, WebPageParser::BbcNewsPageParserV3::STRIP_FEATURES_RE, WebPageParser::BbcNewsPageParserV3::STRIP_MARKET_DATA_WIDGET_RE
Constants inherited from BbcNewsPageParserV2
WebPageParser::BbcNewsPageParserV2::DATE_RE, WebPageParser::BbcNewsPageParserV2::PARA_RE, WebPageParser::BbcNewsPageParserV2::STRIP_BLOCKS_RE, WebPageParser::BbcNewsPageParserV2::STRIP_CAPTIONS_RE, WebPageParser::BbcNewsPageParserV2::STRIP_COMMENTS_RE, WebPageParser::BbcNewsPageParserV2::STRIP_TAGS_RE, WebPageParser::BbcNewsPageParserV2::WHITESPACE_RE
Constants inherited from BaseParser
WebPageParser::BaseParser::DATE_RE, WebPageParser::BaseParser::HTML_ENTITIES_DECODER, WebPageParser::BaseParser::ICONV, WebPageParser::BaseParser::KILL_CHARS_RE
Instance Attribute Summary
Attributes inherited from BaseParser
Instance Method Summary
Methods inherited from BaseParser
#content, #date, #decode_entities, #hash, #initialize, #title
Constructor Details
This class inherits a constructor from WebPageParser::BaseParser
Instance Method Details
#content_processor ⇒ Object
# File 'lib/web-page-parser/parsers/bbc_news_page_parser.rb', line 122
def content_processor
  @content = STRIP_PAGE_BOOKMARKS.gsub(@content, '')
  @content = STRIP_STORY_DATE.gsub(@content, '')
  @content = STRIP_STORY_LASTUPDATED.gsub(@content, '')
  @content = STRIP_STORY_TIME.gsub(@content, '')
  @content = TITLE_RE.gsub(@content, '')
  @content = STRIP_CAPTIONS_RE2.gsub(@content, '')
  @content = STRIP_HIDDEN_A.gsub(@content, '')
  @content = STRIP_STORY_FEATURE.gsub(@content, '')
  @content = STRIP_HYPERPUFF_RE.gsub(@content, '')
  @content = STRIP_MARKETDATA_RE.gsub(@content, '')
  @content = STRIP_EMBEDDEDHYPER_RE.gsub(@content, '')
  super
end
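The method removes each unwanted fragment in turn, replacing every match with an empty string. A rough standalone sketch of that pattern follows. It uses Ruby's built-in Regexp and String#gsub in place of ORegexp#gsub, covers only a subset of the patterns, and omits the super call that hands the result to the inherited processing. STRIP_SKETCH, strip_sketch and the sample markup are hypothetical names and data, not part of the library.

```ruby
# A subset of the strip patterns, in the order content_processor applies them.
# Assumption: Regexp::MULTILINE stands in for ORegexp's 'm' option.
STRIP_SKETCH = [
  '<div id="page-bookmark-links-head".+?</div>',
  '<span class="date".+?</span>',
  '<span class="time\-text".+?</span>',
  '<span class="time".+?</span>',
  '<h1 class="story\-header">(.+?)</h1>',
  '<div class=.caption.+?</div>'
].map { |source| Regexp.new(source, Regexp::MULTILINE) }.freeze

# Applies every pattern in order, deleting each match from the text.
def strip_sketch(content)
  STRIP_SKETCH.reduce(content) { |text, pattern| text.gsub(pattern, '') }
end

# Hypothetical story markup, invented for illustration only.
sample = '<span class="date">1 January 2011</span>' \
         '<h1 class="story-header">Headline</h1>' \
         '<p>Body text.</p>' \
         '<div class="caption">Photo</div>'

cleaned = strip_sketch(sample)
puts cleaned
```

Because every pattern ends in a non-greedy match up to the first closing tag, a block that contains a nested element of the same type is only removed up to the inner closing tag. This is presumably why STRIP_HYPERPUFF_RE spells out the nested div explicitly and runs before STRIP_EMBEDDEDHYPER_RE.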