Class: Robotstxt::Parser
- Inherits: Object
- Defined in: lib/robotstxt/parser.rb
Instance Attribute Summary

- #body ⇒ Object (readonly): Returns the value of attribute body.
- #found ⇒ Object (readonly): Returns the value of attribute found.
- #robot_id ⇒ Object: Returns the value of attribute robot_id.
- #rules ⇒ Object (readonly): Returns the value of attribute rules.
- #sitemaps ⇒ Object (readonly): Analyzes the robots.txt file and returns an Array containing the URLs of the XML Sitemaps.
Instance Method Summary

- #allowed?(var) ⇒ Boolean: Checks whether the URL is allowed to be crawled by the current robot_id.
- #found? ⇒ Boolean: Returns true if the robots.txt file was successfully fetched and parsed.
- #get(hostname) ⇒ Object: Requests and parses the robots.txt file for the hostname.
- #initialize(robot_id = nil) ⇒ Parser (constructor): Initializes a new Robotstxt::Parser instance with the robot_id option.
Constructor Details
#initialize(robot_id = nil) ⇒ Parser
Initializes a new Robotstxt::Parser instance with the robot_id option.

client = Robotstxt::Parser.new('my_robot_id')
# File 'lib/robotstxt/parser.rb', line 29

def initialize(robot_id = nil)
  @robot_id = '*'
  @rules = []
  @sitemaps = []
  @robot_id = robot_id.downcase if !robot_id.nil?
end
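If robot_id is omitted it stays '*', so only the generic User-agent: * groups of the robots.txt file are matched by #allowed?. A minimal sketch of both forms, assuming the gem is loaded with require 'robotstxt':

require 'robotstxt'

# Rules addressed to "my_robot_id" apply, and the "*" group always applies too.
specific = Robotstxt::Parser.new('my_robot_id')

# No argument: robot_id remains "*", i.e. only the catch-all group is matched.
generic = Robotstxt::Parser.new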
Instance Attribute Details
#body ⇒ Object (readonly)
Returns the value of attribute body.
# File 'lib/robotstxt/parser.rb', line 23

def body
  @body
end
#found ⇒ Object (readonly)
Returns the value of attribute found.
# File 'lib/robotstxt/parser.rb', line 23

def found
  @found
end
#robot_id ⇒ Object
Returns the value of attribute robot_id.
# File 'lib/robotstxt/parser.rb', line 22

def robot_id
  @robot_id
end
#rules ⇒ Object (readonly)
Returns the value of attribute rules.
# File 'lib/robotstxt/parser.rb', line 23

def rules
  @rules
end
#sitemaps ⇒ Object (readonly)
Analyzes the robots.txt file and returns an Array containing the URLs of the XML Sitemaps.

client = Robotstxt::Parser.new('my_robot_id')
if client.get('http://www.simonerinzivillo.it')
  client.sitemaps.each { |url|
    puts url
  }
end
# File 'lib/robotstxt/parser.rb', line 125

def sitemaps
  @sitemaps
end
Instance Method Details
#allowed?(var) ⇒ Boolean
Checks whether the URL is allowed to be crawled by the current robot_id.

client = Robotstxt::Parser.new('my_robot_id')
if client.get('http://www.simonerinzivillo.it')
  client.allowed?('http://www.simonerinzivillo.it/no-dir/')
end

This method returns true if the robots.txt file does not block access to the URL.
# File 'lib/robotstxt/parser.rb', line 94

def allowed?(var)
  is_allow = true
  url = URI.parse(var)
  querystring = (!url.query.nil?) ? '?' + url.query : ''
  url_path = url.path + querystring

  @rules.each { |ua|
    if @robot_id == ua[0] || ua[0] == '*'
      ua[1].each { |d|
        is_allow = false if url_path.match('^' + d) || d == '/'
      }
    end
  }
  is_allow
end
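As the source above shows, each Disallow value stored in @rules is applied as a regular expression anchored at the start of the request path ('^' + d), so a rule such as /no-dir also blocks deeper paths like /no-dir/page.html, and a bare / blocks everything. A standalone illustration of that matching step, with a made-up rule and paths:

rule = '/no-dir' # an illustrative Disallow value as it would appear in @rules

['/no-dir/', '/no-dir/page.html', '/other/'].each do |path|
  blocked = !path.match('^' + rule).nil? # the same anchored match used by #allowed?
  puts "#{path} -> #{blocked ? 'disallowed' : 'allowed'}"
end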
#found? ⇒ Boolean
This method returns true if the robots.txt file was successfully downloaded and parsed.
# File 'lib/robotstxt/parser.rb', line 131

def found?
  !!@found
end
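Because #get returns false when the file is missing and nil when the request keeps failing, #found? is a convenient guard before reading rules or sitemaps. A short sketch, with the host purely illustrative:

client = Robotstxt::Parser.new('my_robot_id')
client.get('http://www.simonerinzivillo.it')

if client.found?
  puts "robots.txt parsed: #{client.rules.length} User-agent group(s) found"
else
  puts 'no robots.txt available for this host'
end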
#get(hostname) ⇒ Object
Requests and parses the robots.txt file for the hostname.

client = Robotstxt::Parser.new('my_robot_id')
client.get('http://www.simonerinzivillo.it')

This method returns true if the file was successfully fetched and parsed.
# File 'lib/robotstxt/parser.rb', line 47

def get(hostname)
  @ehttp = true
  url = URI.parse(hostname)

  begin
    http = Net::HTTP.new(url.host, url.port)
    if url.scheme == 'https'
      http.verify_mode = OpenSSL::SSL::VERIFY_NONE
      http.use_ssl = true
    end

    response = http.request(Net::HTTP::Get.new('/robots.txt'))

    case response
    when Net::HTTPSuccess then
      @found = true
      @body = response.body
      parse()
    else
      @found = false
    end

    return @found

  rescue Timeout::Error, Errno::EINVAL, Errno::ECONNRESET => e
    if @ehttp
      @ehttp = false # reset the flag so the request is retried only once
      retry
    else
      return nil
    end
  end
end
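Putting the methods on this page together, a typical session fetches robots.txt once and then queries it for individual URLs and sitemaps. A short end-to-end sketch, with the host and path purely illustrative:

require 'robotstxt'

client = Robotstxt::Parser.new('my_robot_id')

if client.get('http://www.simonerinzivillo.it')
  puts client.allowed?('http://www.simonerinzivillo.it/no-dir/')
  client.sitemaps.each { |url| puts url }
else
  puts 'robots.txt could not be retrieved'
end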