Class: Birdsong::Scraper

Inherits: Object

Includes: Capybara::DSL

Defined in: lib/birdsong/scrapers/scraper.rb

Overview

Base class for Birdsong scrapers. It mixes in Capybara::DSL and drives a Selenium-backed browser session that subclasses (such as TweetScraper) use to load pages and intercept network responses.

Direct Known Subclasses

TweetScraper

Constant Summary

@@logger = Logger.new(STDOUT)
@@session_id = nil

Instance Method Summary

  #initialize ⇒ Scraper (constructor)
  #get_content_of_subpage_from_url(url, subpage_search, additional_search_parameters = nil) ⇒ Object

Constructor Details

#initialize ⇒ Scraper

Returns a new instance of Scraper.



# File 'lib/birdsong/scrapers/scraper.rb', line 44

def initialize
  # Use the Selenium driver registered under :selenium_birdsong for all Capybara sessions
  Capybara.default_driver = :selenium_birdsong
end
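
The :selenium_birdsong driver is assumed to be registered elsewhere in the gem before a scraper is constructed. A minimal sketch of what such a Capybara driver registration typically looks like is shown below; the browser choice and options are illustrative assumptions, not taken from the gem's source. (Request interception as used further down generally also requires the selenium-devtools gem and a Chromium-based browser.)

require "capybara"
require "selenium-webdriver"

Capybara.register_driver :selenium_birdsong do |app|
  # Hypothetical options; the gem's actual registration may configure the browser differently
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument("--headless=new")
  Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
end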

Instance Method Details

#get_content_of_subpage_from_url(url, subpage_search, additional_search_parameters = nil) ⇒ Object

Instagram uses GraphQL (like most Facebook properties) and returns an object that is used to seed the page, so for most things we can simply parse that object.

additional_search_parameters is a comma-separated list of keys, for example: `data,xdt_api_v1_media__shortcode__web_info,items`
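
A hedged usage sketch of this method: the URL and subpage_search values below are illustrative placeholders, and only the additional_search_parameters string comes from the example above. The method raises Birdsong::NoTweetFoundError when no matching response is intercepted before the timeout.

scraper = Birdsong::TweetScraper.new  # assumes the subclass can be built with no arguments
begin
  data = scraper.get_content_of_subpage_from_url(
    "https://example.com/some/post",                    # placeholder URL
    "web_info",                                         # placeholder substring matched against request URLs
    "data,xdt_api_v1_media__shortcode__web_info,items"
  )
rescue Birdsong::NoTweetFoundError
  data = nil  # no matching response was intercepted before the timeout
end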



# File 'lib/birdsong/scrapers/scraper.rb', line 55

def get_content_of_subpage_from_url(url, subpage_search, additional_search_parameters = nil)
  # So this is fun:
  # For pages marked as misinformation we have to use one method (interception of requests) and
  # for pages that are not, we can just pull the data straight from the page.
  #
  # How do we figure out which is which?... for now we'll just run through both and see where we
  # go with it.

  # Our user data no longer lives in the graphql object passed initially with the page.
  # Instead it comes in as part of a subsequent call. This intercepts all calls, checks if it's
  # the one we want, and then moves on.
  response_body = nil

  page.driver.browser.intercept do |request, &continue|
    # Pass the request forward unmodified unless it's the one we're looking for,
    # since we only care about the response
    unless request.url.include?(subpage_search)
      continue.call(request)
      next
    end

    continue.call(request) do |response|
      # Check if not a CORS prefetch and finish up if not
      if response.body && !response.body.empty?
        check_passed = true

        unless additional_search_parameters.nil?
          # Walk down the JSON body following each comma-separated key in turn
          body_to_check = Oj.load(response.body)

          search_parameters = additional_search_parameters.split(",")
          search_parameters.each do |key|
            break if body_to_check.nil?

            unless body_to_check.is_a?(Hash) && body_to_check.key?(key)
              check_passed = false
              break
            end

            body_to_check = body_to_check[key]
          end
        end

        response_body = response.body if check_passed
      end
    end
  rescue Selenium::WebDriver::Error::WebDriverError
    # Eat them
  rescue Birdsong::WebDriverError
    # Also swallowed: a failed interception shouldn't abort the whole scrape
  end

  # Now that the intercept is set up, we visit the page we want
  page.driver.browser.navigate.to(url)

  # We wait until the correct intercept is processed or we've waited 60 seconds
  start_time = Time.now
  sleep(rand(1...10))
  while response_body.nil? && (Time.now - start_time) < 60
    sleep(0.1)
  end

  page.driver.execute_script("window.stop();")

  raise Birdsong::NoTweetFoundError if response_body.nil?
  Oj.load(response_body)
rescue Birdsong::WebDriverError
  # Driver errors bubble up here; swallow them and return nil to the caller
end
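
Note on the design: the intercept is installed before navigation because the target data arrives in a follow-up request rather than in the initially rendered page. After navigating, the method sleeps a random 1 to 9 seconds, then polls every 0.1 seconds until a matching response is captured or roughly 60 seconds have elapsed since the wait began, and finally stops page loading with window.stop(). Because WebDriver errors raised during this flow are swallowed, callers may receive nil instead of an exception.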