Module: RelatonGb::TScrapper

Extended by:
Scrapper
Defined in:
lib/relaton_gb/t_scrapper.rb

Overview

Social standard scraper.

Constant Summary

Constants included from Scrapper

Scrapper::STAGES

Class Method Summary

Methods included from Scrapper

fetch_structuredidentifier, get_contributors, get_docid, get_status, get_titles, get_type, org, scrapped_data

Class Method Details

.scrape_doc(hit) ⇒ RelatonGb::GbBibliographicItem

Parameters:

  • hit (RelatonGb::Hit)

Returns:

  • (RelatonGb::GbBibliographicItem)

# File 'lib/relaton_gb/t_scrapper.rb', line 43

def scrape_doc(hit)
  src = "http://www.ttbz.org.cn#{hit.pid}"
  doc = Nokogiri::HTML OpenURI.open_uri(src), nil, Encoding::UTF_8.to_s
  GbBibliographicItem.new(**scrapped_data(doc, src, hit))
rescue OpenURI::HTTPError, SocketError, OpenSSL::SSL::SSLError, Net::OpenTimeout
  raise RelatonBib::RequestError, "Cannot access #{src}"
end
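
The sketch below illustrates how the document URL is derived from a hit. It is a minimal, offline illustration: `FakeHit`, `pid_from_href`, `doc_url`, and the sample path are hypothetical stand-ins and not part of RelatonGb; only the base URL, the string interpolation, and the trailing-slash removal are taken from the methods documented here.

```ruby
# Stand-in for RelatonGb::Hit holding only the field used here (assumption).
FakeHit = Struct.new(:pid)

# Mirrors `h[:href].sub(%r{/$}, "")` in scrape_page: drop one trailing slash.
def pid_from_href(href)
  href.sub(%r{/$}, "")
end

# Mirrors the first line of scrape_doc: the pid is appended to the site root.
def doc_url(hit)
  "http://www.ttbz.org.cn#{hit.pid}"
end

# "/Example/Detail/123/" is a made-up path used only for illustration.
hit = FakeHit.new(pid_from_href("/Example/Detail/123/"))
puts doc_url(hit)
```

Because the pid is interpolated directly after the host, it is expected to begin with a slash, as an href taken from the search results page would.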

.scrape_page(text) ⇒ RelatonGb::HitCollection

Parameters:

  • text (String)

Returns:

  • (RelatonGb::HitCollection)

# File 'lib/relaton_gb/t_scrapper.rb', line 21

def scrape_page(text)
  search_html = OpenURI.open_uri(
    "http://www.ttbz.org.cn/Home/Standard?searchType=2&key=" \
    "#{CGI.escape(text.tr('-', [8212].pack('U')))}",
  ).read
  header = Nokogiri::HTML search_html
  xpath = '//table[contains(@class, "standard_list_table")]/tr/td/a'
  t_xpath = "../preceding-sibling::td[4]"
  hits = header.xpath(xpath).map do |h|
    # The pattern matches an em dash whose UTF-8 bytes (E2 80 94) were decoded as Latin-1.
    docref = h.at(t_xpath).text.gsub(/â\u0080\u0094/, "-")
    status = h.at("../preceding-sibling::td[1]").text.delete "\r\n"
    pid = h[:href].sub(%r{/$}, "")
    Hit.new pid: pid, docref: docref, status: status, scrapper: self
  end
  HitCollection.new hits
rescue OpenURI::HTTPError, SocketError, OpenSSL::SSL::SSLError, Net::OpenTimeout
  raise RelatonBib::RequestError, "Cannot access http://www.ttbz.org.cn/Home/Standard"
end
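
The two string transformations in `scrape_page` can be tried without network access. The following is a sketch under stated assumptions: `search_url` and `normalize_docref` are hypothetical helper names, the reference `T/AAA 001-2019` is an invented sample, and the mis-decoding is simulated by reinterpreting UTF-8 bytes as ISO-8859-1, which is one plausible way the garbled sequence could arise.

```ruby
require "cgi"

EM_DASH = [8212].pack("U") # U+2014, the same value scrape_page builds

# Hypothetical helper: builds the search URL the way scrape_page does,
# swapping hyphens for em dashes before escaping the key.
def search_url(text)
  key = CGI.escape(text.tr("-", EM_DASH))
  "http://www.ttbz.org.cn/Home/Standard?searchType=2&key=#{key}"
end

# Hypothetical helper: turns a mis-decoded em dash back into a hyphen,
# using the same three code points as the regexp in scrape_page.
def normalize_docref(raw)
  raw.gsub(/\u00E2\u0080\u0094/, "-")
end

# Simulate the mis-decoding: take the UTF-8 bytes and read them as ISO-8859-1.
garbled = "T/AAA 001#{EM_DASH}2019".b.force_encoding("ISO-8859-1").encode("UTF-8")

puts search_url("T/AAA 001-2019")
puts normalize_docref(garbled)
```

The search key therefore carries an em dash, while the document reference stored on each hit uses a plain hyphen.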