Class: SnapSearch::Detector
- Inherits:
- Object
- Defined in:
- lib/snap_search/detector.rb
Overview
Detects whether an incoming request to an HTTP server comes from a robot.
Instance Attribute Summary
-
#check_file_extensions ⇒ Object
readonly
Returns the value of attribute check_file_extensions.
-
#extensions ⇒ Object
readonly
Returns the value of attribute extensions.
-
#extensions_json ⇒ Object
Returns the value of attribute extensions_json.
-
#ignored_routes ⇒ Object
readonly
Returns the value of attribute ignored_routes.
-
#matched_routes ⇒ Object
readonly
Returns the value of attribute matched_routes.
-
#robots ⇒ Object
Returns the value of attribute robots.
-
#robots_json ⇒ Object
Returns the value of attribute robots_json.
Instance Method Summary
-
#detect(options = {}) ⇒ true, false
Detects if the request came from a search engine robot.
-
#get_decoded_path(params, uri) ⇒ String
Gets the decoded URL path relevant for detecting matched or ignored routes during detection.
-
#get_encoded_url(params, uri) ⇒ String
Gets the encoded URL that is passed to SnapSearch so that SnapSearch can scrape the encoded URL.
-
#get_real_qs_and_hash_fragment(params, escape) ⇒ Hash
Gets the real query string and hash fragment by reversing Google's _escaped_fragment_ protocol back to hash bang form.
-
#initialize(options = {}) ⇒ Detector
constructor
Create a new Detector instance.
Constructor Details
#initialize(options = {}) ⇒ Detector
Create a new Detector instance.
# File 'lib/snap_search/detector.rb', line 23

def initialize(options={})
    raise TypeError, 'options must be a Hash or respond to #to_h or #to_hash' unless options.is_a?(Hash) || options.respond_to?(:to_h) || options.respond_to?(:to_hash)
    options = options.to_h rescue options.to_hash

    @options = {
        matched_routes: [],
        ignored_routes: [],
        robots_json: SnapSearch.root.join('resources', 'robots.json'),
        extensions_json: SnapSearch.root.join('resources', 'extensions.json'),
        check_file_extensions: false
    }.merge(options) # Reverse merge: the hash `merge` is called on supplies the defaults, and the options argument is merged into it

    @matched_routes, @ignored_routes, @check_file_extensions = options.values_at(:matched_routes, :ignored_routes, :check_file_extensions)

    self.robots_json = @options[:robots_json] # Use the setter, which sets @robots_json to the path, then sets @robots to the parsed JSON of the file's contents.
    self.extensions_json = @options[:extensions_json] # Use the setter, which sets @extensions_json to the path, then sets @extensions to the parsed JSON of the file's contents.
end
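The "reverse merge" used by the constructor can be tried in isolation. The following is a minimal sketch, not the library's code: `DEFAULTS` and `with_defaults` are hypothetical names, and the JSON path options are left out so the snippet runs without the gem.

```ruby
# Hypothetical stand-in for the defaults set up in Detector#initialize.
DEFAULTS = {
  matched_routes: [],
  ignored_routes: [],
  check_file_extensions: false
}.freeze

# Reverse merge: start from the defaults, then let caller-supplied keys win.
def with_defaults(options = {})
  DEFAULTS.merge(options.to_h)
end

settings = with_defaults(ignored_routes: [%r{^/admin}])
settings[:ignored_routes]        # the caller's value
settings[:check_file_extensions] # falls back to the default
```

Calling `merge` on the defaults hash, rather than on the caller's hash, means any key the caller omits keeps its default value.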
Instance Attribute Details
#check_file_extensions ⇒ Object (readonly)
Returns the value of attribute check_file_extensions.
# File 'lib/snap_search/detector.rb', line 11

def check_file_extensions
    @check_file_extensions
end
#extensions ⇒ Object (readonly)
Returns the value of attribute extensions.
# File 'lib/snap_search/detector.rb', line 13

def extensions
    @extensions
end
#extensions_json ⇒ Object
Returns the value of attribute extensions_json.
# File 'lib/snap_search/detector.rb', line 13

def extensions_json
    @extensions_json
end
#ignored_routes ⇒ Object (readonly)
Returns the value of attribute ignored_routes.
# File 'lib/snap_search/detector.rb', line 10

def ignored_routes
    @ignored_routes
end
#matched_routes ⇒ Object (readonly)
Returns the value of attribute matched_routes.
# File 'lib/snap_search/detector.rb', line 10

def matched_routes
    @matched_routes
end
#robots ⇒ Object
Returns the value of attribute robots.
# File 'lib/snap_search/detector.rb', line 12

def robots
    @robots
end
#robots_json ⇒ Object
Returns the value of attribute robots_json.
# File 'lib/snap_search/detector.rb', line 12

def robots_json
    @robots_json
end
Instance Method Details
#detect(options = {}) ⇒ true, false
Detects if the request came from a search engine robot. The checks run in cascading order, and the request is intercepted:
1. only on a GET request
2. only on the HTTP or HTTPS protocol
3. not for any ignored robot user agent
4. not for any route outside the whitelist, when a whitelist is set
5. not for any route matching the blacklist
6. not for an invalid file extension, if the path has a file extension
7. on requests with the _escaped_fragment_ query parameter
8. on any matched robot user agent
# File 'lib/snap_search/detector.rb', line 92

def detect(options={})
    options = {
        matched_routes: @matched_routes,
        ignored_routes: @ignored_routes,
        robots_json: @robots_json,
        check_file_extensions: false
    }.merge(options)

    raise ArgumentError, 'options[:request] must be an instance of Rack::Request' unless options[:request].is_a?(Rack::Request)

    self.robots_json = options[:robots_json] if options[:robots_json] != @robots_json # If a new robots_json path is given, use the custom setter, which sets @robots to that parsed JSON file

    uri = Addressable::URI.parse( options[:request].url )
    params = options[:request].params

    real_path = get_decoded_path(params, uri)
    document_root = options[:request]['DOCUMENT_ROOT']

    # only intercept on GET requests; the SnapSearch robot cannot submit a POST, PUT or DELETE request
    return false unless options[:request].get?

    # only intercept on http or https protocols
    return false unless %W[http https].include?(uri.scheme)

    # detect ignored user agents; if one matches, return false
    return false if options[:request].user_agent =~ /#{ @robots['ignore'].collect { |user_agent| Regexp.escape(user_agent) }.join(?|) }/i

    # if the requested route doesn't match the whitelisted routes, the request is ignored
    # this only runs if there are any routes on the whitelist
    # (as written, `all?` requires the path to match every whitelisted route)
    return false if !options[:matched_routes].nil? && !options[:matched_routes].empty? && !options[:matched_routes].all? { |route| real_path =~ route }

    # detect ignored routes
    return false if !options[:ignored_routes].nil? && options[:ignored_routes].any? { |route| real_path =~ route }

    # detect extensions in order to prevent direct requests to static files
    if options[:check_file_extensions]
        extensions['generic'] = [] unless extensions['generic'].is_a?(Array)
        extensions['ruby'] = [] unless extensions['ruby'].is_a?(Array)

        valid_extensions = extensions['generic'] + extensions['ruby']
        valid_extensions.collect! { |value| value.to_s.downcase.strip } # Convert every extension to a String if it isn't one already, then downcase it and strip leading and trailing whitespace/newlines.

        real_path_uri = Addressable::URI.parse(real_path)
        extension = real_path_uri.extname
        extension = extension[1..-1].downcase unless extension.empty?

        return false if !extension.empty? && !valid_extensions.include?(extension)
    end

    # detect escaped fragment (since the ignored user agents have already been detected, SnapSearch won't continue the interception loop)
    return true if !uri.query_values.nil? && uri.query_values.has_key?('_escaped_fragment_')

    # detect matched robots; if one matches, return true
    return true if options[:request].user_agent =~ /#{ @robots['match'].collect { |user_agent| Regexp.escape(user_agent) }.join(?|) }/i

    # if no match at all, return false
    false
end
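The user agent checks in `detect` build a single case-insensitive alternation from a list of literal strings. Below is a minimal standalone sketch of that idea. The `ROBOTS` data, `agent_pattern` and `robot?` are hypothetical names for illustration (the real lists are loaded from robots.json), and the guard for an empty list is an addition of this sketch, not something the source does.

```ruby
# Hypothetical robot lists; the library loads the real ones from robots.json.
ROBOTS = {
  'ignore' => ['SnapSearch'],
  'match'  => ['Googlebot', 'bingbot', 'Yahoo! Slurp']
}.freeze

# Build one case-insensitive alternation from literal user-agent fragments.
def agent_pattern(names)
  return nil if names.empty? # an empty alternation would match every string
  /#{names.map { |name| Regexp.escape(name) }.join('|')}/i
end

# Ignored agents are checked first, so they can never count as a match.
def robot?(user_agent, robots = ROBOTS)
  ignore = agent_pattern(robots['ignore'])
  match  = agent_pattern(robots['match'])
  return false if ignore&.match?(user_agent)
  match ? match.match?(user_agent) : false
end

robot?('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')
```

`Regexp.escape` keeps characters such as `!` or `.` in an agent name from being read as regular expression syntax.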
#get_decoded_path(params, uri) ⇒ String
Gets the decoded URL path relevant for detecting matched or ignored routes during detection. It is also used for static file detection.
# File 'lib/snap_search/detector.rb', line 182

def get_decoded_path(params, uri)
    raise TypeError, 'params must be a Hash or respond to #to_h or #to_hash' unless params.is_a?(Hash) || params.respond_to?(:to_h) || params.respond_to?(:to_hash)
    params = params.to_h rescue params.to_hash

    raise TypeError, 'uri must be an instance of Addressable::URI ' unless uri.is_a?(Addressable::URI)

    # NOTE: Have to pass the Rack::Request instance and use the `request.params` method to retrieve the parameters because:
    #     uri.to_s         # => "http://localhost/snapsearch/path1?key1=value1&_escaped_fragment_=%2Fpath2%3Fkey2=value2"
    #     uri.query_values # => {"key1"=>"value1", "_escaped_fragment_"=>"/path2?key2"}
    #     request.params   # => {"key1"=>"value1", "_escaped_fragment_"=>"/path2?key2=value2"}
    # It seems Addressable mishandles the splitting of params into a Hash, but Rack does not.
    if !uri.query_values.nil? && uri.query_values.has_key?('_escaped_fragment_')
        qs_and_hash = get_real_qs_and_hash_fragment(params, false)

        Addressable::URI.unescape(uri.path) + qs_and_hash['qs'] + qs_and_hash['hash']
    else
        Addressable::URI.unescape("#{ uri.path }?#{ uri.query }")
    end
end
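To illustrate why the path is decoded before route matching, here is a rough sketch that uses only Ruby's standard `uri` library in place of Addressable. `IGNORED_ROUTES`, `decoded_path` and `ignored?` are hypothetical names, and `URI.decode_www_form_component` is a swapped-in decoder that also turns `+` into a space, so its results can differ from `Addressable::URI.unescape`.

```ruby
require 'uri'

# Hypothetical blacklist; in the library these come from the :ignored_routes option.
IGNORED_ROUTES = [%r{^/admin}, /\.json(\?|$)/].freeze

# Decode the path and query so route patterns can be written against readable text.
# Assumption: the stdlib form decoder stands in for Addressable::URI.unescape.
def decoded_path(url)
  uri = URI.parse(url)
  path = URI.decode_www_form_component(uri.path)
  query = uri.query ? "?#{URI.decode_www_form_component(uri.query)}" : ''
  path + query
end

# A request is ignored when its decoded path matches any blacklisted route.
def ignored?(url)
  path = decoded_path(url)
  IGNORED_ROUTES.any? { |route| path =~ route }
end

decoded_path('http://localhost/my%20page?q=a%26b')
ignored?('http://localhost/data%2Ejson')
```

Matching against the decoded form means a route pattern such as `/\.json(\?|$)/` still applies when a client percent-encodes part of the path.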
#get_encoded_url(params, uri) ⇒ String
Gets the encoded URL that is passed to SnapSearch so that SnapSearch can scrape the encoded URL. If the _escaped_fragment_ query parameter is used, it is converted back to a hash fragment URL.
# File 'lib/snap_search/detector.rb', line 157

def get_encoded_url(params, uri)
    raise TypeError, 'params must be a Hash or respond to #to_h' unless params.is_a?(Hash) || params.respond_to?(:to_h)
    raise TypeError, 'uri must be an instance of Addressable::URI ' unless uri.is_a?(Addressable::URI)

    # NOTE: Have to pass the Rack::Request instance and use the `request.params` method to retrieve the parameters because:
    #     uri.to_s         # => "http://localhost/snapsearch/path1?key1=value1&_escaped_fragment_=%2Fpath2%3Fkey2=value2"
    #     uri.query_values # => {"key1"=>"value1", "_escaped_fragment_"=>"/path2?key2"}
    #     request.params   # => {"key1"=>"value1", "_escaped_fragment_"=>"/path2?key2=value2"}
    # It seems Addressable mishandles the splitting of params into a Hash, but Rack does not.
    if !uri.query_values.nil? && uri.query_values.has_key?('_escaped_fragment_')
        qs_and_hash = get_real_qs_and_hash_fragment(params, true)
        url = "#{uri.scheme}://#{uri.authority}#{uri.path}" # Remove the query and fragment (SCHEME + AUTHORITY + PATH)... Addressable::URI encodes the uri

        url.to_s + qs_and_hash['qs'] + qs_and_hash['hash']
    else
        uri.to_s
    end
end
#get_real_qs_and_hash_fragment(params, escape) ⇒ Hash
Gets the real query string and hash fragment by reversing Google's _escaped_fragment_ protocol back to hash bang form. This is used both for getting the encoded URL for scraping and for getting the decoded path for detection, and it is only called when the URI has a QUERY section.
Google converts URLs like so:
Original URL: example.com/path1?key1=value1#!/path2?key2=value2
Original Structure: DOMAIN - PATH - QS - HASH BANG - HASH PATH - HASH QS
Search Engine URL: example.com/path1?key1=value1&_escaped_fragment_=%2Fpath2%3Fkey2=value2
Search Engine Structure: DOMAIN - PATH - QS - ESCAPED FRAGMENT
Everything after the hash bang is stored in _escaped_fragment_, even if it contains a query string. Therefore this process has to be reversed to recover the original URL, which is used for snapshotting.
This means the original URL can have 2 query string components. The QS before the HASH BANG is received by both the server and the client, but not all client side frameworks process it. The HASH QS is only received by the client, because anything after the hash is not sent to the server. Most client side frameworks process this HASH QS.
See this for more information: developers.google.com/webmasters/ajax-crawling/docs/specification
# File 'lib/snap_search/detector.rb', line 220

def get_real_qs_and_hash_fragment(params, escape)
    raise TypeError, 'params must be a Hash or respond to #to_h or #to_hash' unless params.is_a?(Hash) || params.respond_to?(:to_h) || params.respond_to?(:to_hash)
    params = params.to_h rescue params.to_hash

    query_params = params.dup
    query_params.delete('_escaped_fragment_')
    query_params = query_params.to_a

    query_string = ''
    unless query_params.empty?
        query_params.collect! { |key, value| [ Addressable::URI.escape(key), Addressable::URI.escape(value) ] } if escape
        query_params.collect! { |key, value| "#{key}=#{value}" }

        query_string = "?#{ query_params.join(?&) }"
    end

    hash_fragment = params['_escaped_fragment_']
    hash_fragment_string = ''
    hash_fragment_string = "#!#{hash_fragment}" unless hash_fragment.nil?

    {
        'qs' => query_string,
        'hash' => hash_fragment_string
    }
end
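The reversal can be followed end to end with the example URL from the description. This is a simplified sketch under stated assumptions: `real_qs_and_hash` is a hypothetical helper that mirrors the unescaped branch only (`escape` false), and `params` is written out by hand in the shape the source comments show for `request.params`.

```ruby
# Rebuild "?qs" and "#!fragment" from already-parsed request parameters.
# Simplified: no escaping is applied to keys or values.
def real_qs_and_hash(params)
  query = params.reject { |key, _| key == '_escaped_fragment_' }
  qs = query.empty? ? '' : "?#{query.map { |key, value| "#{key}=#{value}" }.join('&')}"
  fragment = params['_escaped_fragment_']
  { 'qs' => qs, 'hash' => fragment.nil? ? '' : "#!#{fragment}" }
end

# Parameters for example.com/path1?key1=value1&_escaped_fragment_=%2Fpath2%3Fkey2=value2
params = { 'key1' => 'value1', '_escaped_fragment_' => '/path2?key2=value2' }
parts = real_qs_and_hash(params)
original = "example.com/path1#{parts['qs']}#{parts['hash']}"
# original is "example.com/path1?key1=value1#!/path2?key2=value2"
```

The ordinary query string stays in front of the hash bang, and the contents of `_escaped_fragment_` move back behind it, which restores the DOMAIN - PATH - QS - HASH BANG - HASH PATH - HASH QS structure.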