Module: RelatonGb::TScrapper
- Extended by: Scrapper
- Defined in: lib/relaton_gb/t_scrapper.rb
Overview
Social standard scraper.
Constant Summary
Constants included from Scrapper
Class Method Summary
- .scrape_doc(hit) ⇒ RelatonGb::GbBibliographicItem
- .scrape_page(text) ⇒ RelatonGb::HitCollection
Methods included from Scrapper
fetch_structuredidentifier, get_contributors, get_docid, get_status, get_titles, get_type, org, scrapped_data
Class Method Details
.scrape_doc(hit) ⇒ RelatonGb::GbBibliographicItem
# File 'lib/relaton_gb/t_scrapper.rb', line 43
def scrape_doc(hit)
  src = "http://www.ttbz.org.cn#{hit.pid}"
  doc = Nokogiri::HTML OpenURI.open_uri(src), nil, Encoding::UTF_8.to_s
  GbBibliographicItem.new(**scrapped_data(doc, src, hit))
rescue OpenURI::HTTPError, SocketError, OpenSSL::SSL::SSLError, Net::OpenTimeout
  raise RelatonBib::RequestError, "Cannot access #{src}"
end
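The rescue clause in scrape_doc collapses several low-level network failures into a single error type, so callers only need to handle one exception. The pattern can be sketched in isolation roughly as follows. This is a minimal illustration, not the gem's code: `RequestError` here is a local stand-in for `RelatonBib::RequestError`, and `NETWORK_ERRORS` and `with_request_error` are hypothetical names introduced for the example.

```ruby
require "open-uri"
require "openssl"
require "net/http"

# Local stand-in for RelatonBib::RequestError (assumption for this sketch).
class RequestError < StandardError; end

# The same exception classes that scrape_doc rescues.
NETWORK_ERRORS = [
  OpenURI::HTTPError, SocketError, OpenSSL::SSL::SSLError, Net::OpenTimeout
].freeze

# Hypothetical helper: run the block and re-raise any network failure
# as a RequestError that names the URL being accessed.
def with_request_error(src)
  yield
rescue *NETWORK_ERRORS
  raise RequestError, "Cannot access #{src}"
end

begin
  # Simulate a DNS failure instead of performing a real request.
  with_request_error("http://www.ttbz.org.cn/some/path") do
    raise SocketError, "simulated lookup failure"
  end
rescue RequestError => e
  puts e.message
end
```

Errors outside the listed classes are not rescued and propagate unchanged, which matches the behaviour of the method above.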
.scrape_page(text) ⇒ RelatonGb::HitCollection
# File 'lib/relaton_gb/t_scrapper.rb', line 21
def scrape_page(text)
  search_html = OpenURI.open_uri(
    "http://www.ttbz.org.cn/Home/Standard?searchType=2&key=" \
    "#{CGI.escape(text.tr('-', [8212].pack('U')))}",
  ).read
  header = Nokogiri::HTML search_html
  xpath = '//table[contains(@class, "standard_list_table")]/tr/td/a'
  t_xpath = "../preceding-sibling::td[4]"
  hits = header.xpath(xpath).map do |h|
    docref = h.at(t_xpath).text.gsub(/â\u0080\u0094/, "-")
    status = h.at("../preceding-sibling::td[1]").text.delete "\r\n"
    pid = h[:href].sub(%r{/$}, "")
    Hit.new pid: pid, docref: docref, status: status, scrapper: self
  end
  HitCollection.new hits
rescue OpenURI::HTTPError, SocketError, OpenSSL::SSL::SSLError, Net::OpenTimeout
  raise RelatonBib::RequestError, "Cannot access http://www.ttbz.org.cn/Home/Standard"
end
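The string handling in scrape_page can be tried without network access or Nokogiri. The sketch below isolates two steps: building the search URL (hyphens become U+2014 before escaping) and cleaning the values read from each result row. The helper names `search_url` and `clean_hit`, the sample reference "T/AAA 001-2020", the status text, and the href are all made up for illustration. The pattern for the mis-decoded dash is written with explicit escapes, on the assumption that it targets the UTF-8 bytes E2 80 94 read as Latin-1.

```ruby
require "cgi"

SEARCH_URL = "http://www.ttbz.org.cn/Home/Standard?searchType=2&key="
EM_DASH = [8212].pack("U") # code point 8212 is U+2014

# Hypothetical helper: replace ASCII hyphens with U+2014, then
# percent-encode the result for use as the query value.
def search_url(text)
  SEARCH_URL + CGI.escape(text.tr("-", EM_DASH))
end

# Hypothetical helper: apply the same clean-up steps scrape_page
# applies to the raw cell text and link target of one result row.
def clean_hit(docref:, status:, href:)
  {
    # A mis-decoded U+2014 (three characters) becomes a plain hyphen.
    docref: docref.gsub(/\u00E2\u0080\u0094/, "-"),
    # Remove every carriage return and line feed character.
    status: status.delete("\r\n"),
    # Drop a single trailing slash from the link target.
    pid: href.sub(%r{/$}, ""),
  }
end

puts search_url("T/AAA 001-2020")
p clean_hit(
  docref: "T/AAA 001\u00E2\u0080\u00942020",
  status: "\r\nActive\r\n",
  href: "/Home/Show/12345/",
)
```

Note that `String#delete` treats its argument as a set of characters, so `"\r\n"` removes every carriage return and line feed rather than only the two-character sequence.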