Class: NHKore::SearchScraper

Inherits:

Scraper

Object
Scraper
NHKore::SearchScraper

show all

Defined in:: lib/nhkore/search_scraper.rb

Direct Known Subclasses

BingScraper

Constant Summary collapse

DEFAULT_RESULT_COUNT =

FUTSUU_SITE =

'nhk.or.jp/news/html/'

YASASHII_SITE =

'nhk.or.jp/news/easy/'

FUTSUU_REGEX = www3.nhk.or.jp/news/html/20200220/k10012294001000.html

/\A[^.]+\.#{Regexp.quote(FUTSUU_SITE)}.+\.html?/i.freeze

YASASHII_REGEX = www3.nhk.or.jp/news/easy/k10012294001000/k10012294001000.html www3.nhk.or.jp/news/easy/article/disaster_heat.html

/\A[^.]+\.#{Regexp.quote(YASASHII_SITE)}.+\.html?/i.freeze

IGNORE_LINK_REGEX =

%r{
  /about\.html?               # https://www3.nhk.or.jp/news/easy/about.html
  |/movieplayer\.html?        # https://www3.nhk.or.jp/news/easy/movieplayer.html?id=k10038422811_1207251719_1207251728.mp4&teacuprbbs=4feb73432045dbb97c283d64d459f7cf
  |/audio\.html?              # https://www3.nhk.or.jp/news/easy/player/audio.html?id=k10011555691000
  |/news/easy/index\.html?    # https://www3.nhk.or.jp/news/easy/index.html
  |/disaster_earthquake.html  # https://www3.nhk.or.jp/news/easy/article/disaster_earthquake.html

  # https://cgi2.nhk.or.jp/news/easy/easy_enq/bin/form/enqform.html?id=k10011916321000&title=日本の会社が作った鉄道の車両「あずま」がイギリスで走る
  # https://www3.nhk.or.jp/news/easy/easy_enq/bin/form/enqform.html?id=k10012689671000&title=「鬼滅の刃」の映画が台湾でも始まって大勢の人が見に行く
  |/enqform\.html?
}x.freeze

Constants inherited from Scraper

NHKore::Scraper::DEFAULT_HEADER

Instance Attribute Summary

Attributes inherited from Scraper

#kargs, #max_redirects, #max_retries, #redirect_rule, #str_or_io, #url

Instance Method Summary collapse

#fetch_valid_link?(link) ⇒ Boolean

Example: www3.nhk.or.jp/news/easy/k10014150691000/k10014150691000.html.
#ignore_link?(link, cleaned: true) ⇒ Boolean
#initialize(url, eat_cookie: true, header: {}, **kargs) ⇒ SearchScraper constructor

Search Engines are strict, so trigger using the default HTTP header fields with header: {} and fetch/set the cookie using eat_cookie: true.

Methods inherited from Scraper

#fetch_cookie, #html_doc, #join_url, #open, #open_file, #open_url, #read, #reopen, #rss_doc

Constructor Details

#initialize(url, eat_cookie: true, header: {}, **kargs) ⇒ `SearchScraper`

Search Engines are strict, so trigger using the default HTTP header fields with header: {} and fetch/set the cookie using eat_cookie: true.



47
48
49

# File 'lib/nhkore/search_scraper.rb', line 47

def initialize(url,eat_cookie: true,header: {},**kargs)
  super
end

Instance Method Details

#fetch_valid_link?(link) ⇒ `Boolean`

Example: www3.nhk.or.jp/news/easy/k10014150691000/k10014150691000.html

Returns:

(Boolean)

# File 'lib/nhkore/search_scraper.rb', line 62

def fetch_valid_link?(link)
  uri = begin
    URI(link)
  rescue StandardError
    return false # Bad URL.
  end

  begin
    ssl = uri.scheme.to_s.strip.downcase.include?('https')

    Net::HTTP.start(uri.host,uri.port,use_ssl: ssl) do |http|
      resp = http.head(uri.request_uri)
      code = resp.code

      return code != '404'
    end
  rescue StandardError
    # Ignore; try actually scraping the article anyway.
  end

  return true
end

#ignore_link?(link, cleaned: true) ⇒ `Boolean`