Class: NHKore::SearchScraper
- Defined in: lib/nhkore/search_scraper.rb
Overview
Direct Known Subclasses
Constant Summary
- DEFAULT_RESULT_COUNT = 100
- FUTSUU_SITE = 'nhk.or.jp/news/html/'
- YASASHII_SITE = 'nhk.or.jp/news/easy/'
- FUTSUU_REGEX = /\A[^.]+\.#{Regexp.quote(FUTSUU_SITE)}.+\.html?/i.freeze
- YASASHII_REGEX = /\A[^.]+\.#{Regexp.quote(YASASHII_SITE)}.+\.html?/i.freeze
- IGNORE_LINK_REGEX =
%r{
  /about\.html?             # https://www3.nhk.or.jp/news/easy/about.html
  |/movieplayer\.html?      # https://www3.nhk.or.jp/news/easy/movieplayer.html?id=k10038422811_1207251719_1207251728.mp4&teacuprbbs=4feb73432045dbb97c283d64d459f7cf
  |/audio\.html?            # https://www3.nhk.or.jp/news/easy/player/audio.html?id=k10011555691000
  |/news/easy/index\.html?  # http://www3.nhk.or.jp/news/easy/index.html

  # https://cgi2.nhk.or.jp/news/easy/easy_enq/bin/form/enqform.html?id=k10011916321000&title=日本の会社が作った鉄道の車両「あずま」がイギリスで走る
  # https://www3.nhk.or.jp/news/easy/easy_enq/bin/form/enqform.html?id=k10012689671000&title=「鬼滅の刃」の映画が台湾でも始まって大勢の人が見に行く
  |/enqform\.html?
}x.freeze
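As a rough illustration of what the two site regexes accept, the sketch below rebuilds FUTSUU_REGEX and YASASHII_REGEX from the values listed above and tries them on a few links. The article paths are made up for the example and are not taken from the library.

```ruby
# Constants copied from the summary above so the sketch runs without the gem.
FUTSUU_SITE = 'nhk.or.jp/news/html/'
YASASHII_SITE = 'nhk.or.jp/news/easy/'
FUTSUU_REGEX = /\A[^.]+\.#{Regexp.quote(FUTSUU_SITE)}.+\.html?/i.freeze
YASASHII_REGEX = /\A[^.]+\.#{Regexp.quote(YASASHII_SITE)}.+\.html?/i.freeze

# Illustrative links only (hypothetical article paths).
futsuu_link = 'www3.nhk.or.jp/news/html/20200220/k10012294001000.html'
yasashii_link = 'www3.nhk.or.jp/news/easy/k10011555691000/k10011555691000.html'

# Each regex accepts links under its own site path only.
puts FUTSUU_REGEX.match?(futsuu_link)
puts YASASHII_REGEX.match?(yasashii_link)
puts FUTSUU_REGEX.match?(yasashii_link)

# The trailing "l" is optional, so ".htm" is accepted as well.
puts FUTSUU_REGEX.match?('www3.nhk.or.jp/news/html/20200220/example.htm')

# A bare site path with no page name is rejected.
puts FUTSUU_REGEX.match?('www3.nhk.or.jp/news/html/')
```

Both patterns are anchored at the start with \A but not at the end, so a link with a trailing query string still matches.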
Constants inherited from Scraper
NHKore::Scraper::DEFAULT_HEADER
Instance Attribute Summary
Attributes inherited from Scraper
#kargs, #max_redirects, #max_retries, #redirect_rule, #str_or_io, #url
Instance Method Summary
- #ignore_link?(link, cleaned: true) ⇒ Boolean
- #initialize(url, eat_cookie: true, header: {}, **kargs) ⇒ SearchScraper (constructor)
Search engines are strict, so this constructor defaults to header: {} to trigger the default HTTP header fields and to eat_cookie: true to fetch and set the cookie.
Methods inherited from Scraper
#fetch_cookie, #html_doc, #join_url, #open, #open_file, #open_url, #read, #reopen, #rss_doc
Constructor Details
#initialize(url, eat_cookie: true, header: {}, **kargs) ⇒ SearchScraper
Search engines are strict, so this constructor defaults to header: {} to trigger the default HTTP header fields and to eat_cookie: true to fetch and set the cookie.
# File 'lib/nhkore/search_scraper.rb', line 49
def initialize(url,eat_cookie: true,header: {},**kargs)
  super(url,eat_cookie: eat_cookie,header: header,**kargs)
end
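The forwarding of the two defaults can be sketched with stand-in classes. StubScraper and StubSearchScraper below are hypothetical and are not part of NHKore. The header-merging rule and the placeholder DEFAULT_HEADER value are assumptions made for illustration, since this page does not show what NHKore::Scraper does with these arguments.

```ruby
# Hypothetical stand-in for NHKore::Scraper; it only records what it receives.
class StubScraper
  # Placeholder value; the real DEFAULT_HEADER is not shown on this page.
  DEFAULT_HEADER = { 'user-agent' => 'example-agent' }.freeze

  attr_reader :url, :kargs

  def initialize(url, eat_cookie: false, header: nil, **kargs)
    @url = url
    @eat_cookie = eat_cookie
    # Assumption: a non-nil header is merged over the default header fields,
    # while nil means "send no extra header fields".
    @kargs = header.nil? ? kargs : kargs.merge(DEFAULT_HEADER.merge(header))
  end

  def eat_cookie?
    @eat_cookie
  end
end

# Mirrors the constructor above: stricter defaults, everything else forwarded.
class StubSearchScraper < StubScraper
  def initialize(url, eat_cookie: true, header: {}, **kargs)
    super(url, eat_cookie: eat_cookie, header: header, **kargs)
  end
end

# Hypothetical search URL, used only to construct the objects.
plain = StubScraper.new('https://example.com/search?q=nhk')
search = StubSearchScraper.new('https://example.com/search?q=nhk')

p plain.eat_cookie?   # base default
p search.eat_cookie?  # search default
p plain.kargs         # no header fields without header:
p search.kargs        # default header fields via header: {}
```

Under these assumptions, a caller can still override either default, for example by passing eat_cookie: false or a header hash with extra fields.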
Instance Method Details
#ignore_link?(link, cleaned: true) ⇒ Boolean
# File 'lib/nhkore/search_scraper.rb', line 53
def ignore_link?(link,cleaned: true)
  return true if link.nil?

  link = Util.unspace_web_str(link).downcase unless cleaned

  return true if link.empty?
  return true if IGNORE_LINK_REGEX.match?(link)

  return false
end
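A self-contained sketch of the same checks may help show how the cleaned: flag changes the result. The unspace_web_str helper here is a hypothetical stand-in for Util.unspace_web_str and is assumed only to strip whitespace. The regex is copied from the constant summary with its comments removed.

```ruby
# Copied from IGNORE_LINK_REGEX above, without the URL comments.
IGNORE_LINK_REGEX = %r{
  /about\.html?
  |/movieplayer\.html?
  |/audio\.html?
  |/news/easy/index\.html?
  |/enqform\.html?
}x.freeze

# Hypothetical stand-in for Util.unspace_web_str (assumed to remove whitespace).
def unspace_web_str(str)
  str.gsub(/[[:space:]]+/, '')
end

# Same decision order as the method above: nil, clean-up, empty, regex.
def ignore_link?(link, cleaned: true)
  return true if link.nil?

  link = unspace_web_str(link).downcase unless cleaned

  return true if link.empty?

  IGNORE_LINK_REGEX.match?(link)
end

puts ignore_link?(nil)
puts ignore_link?('https://www3.nhk.or.jp/news/easy/about.html')

# The regex has no /i flag, so an uppercase link is only caught
# when the method is asked to clean (and downcase) it first.
puts ignore_link?('https://www3.nhk.or.jp/news/easy/ABOUT.HTML')
puts ignore_link?(' https://www3.nhk.or.jp/news/easy/ABOUT.HTML ', cleaned: false)
```

With the default cleaned: true, the caller is expected to pass a link that is already stripped and lowercased.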