Class: Wgit::RobotsParser

Inherits:
Object
  • Object
Includes:
Assertable
Defined in:
lib/wgit/robots_parser.rb

Overview

The RobotsParser class handles parsing and processing of a web server's robots.txt file.

Constant Summary

KEY_COMMENT =

Key representing the start of a comment.

"#"
KEY_SEPARATOR =

Key value separator used in robots.txt files.

":"
KEY_USER_AGENT =

Key representing a user agent.

"User-agent"
KEY_ALLOW =

Key representing an allow URL rule.

"Allow"
KEY_DISALLOW =

Key representing a disallow URL rule.

"Disallow"
USER_AGENT_WGIT =

Value representing the Wgit user agent.

:wgit
USER_AGENT_ANY =

Value representing any user agent including Wgit.

:*
PATHS_ALL =

Value representing any and all paths.

%w[/ *].freeze
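
The parsing code itself is not reproduced on this page, so the following is only a sketch of how the key constants above might be used to split a single robots.txt line. The `split_robots_line` helper is hypothetical and not part of Wgit; the constants are redefined locally so the snippet runs on its own.

```ruby
# Local copies of the documented constants, so the sketch runs without Wgit.
KEY_COMMENT    = "#"
KEY_SEPARATOR  = ":"
KEY_USER_AGENT = "User-agent"
KEY_ALLOW      = "Allow"
KEY_DISALLOW   = "Disallow"

# Hypothetical helper (not part of Wgit): splits one robots.txt line into a
# [key, value] pair, or returns nil for blank lines, comments and lines
# without a separator.
def split_robots_line(line)
  line = line.strip
  return nil if line.empty? || line.start_with?(KEY_COMMENT)

  # Limit the split to 2 parts so values containing ":" (e.g. URLs) stay whole.
  key, value = line.split(KEY_SEPARATOR, 2)
  return nil if value.nil?

  [key.strip, value.strip]
end

split_robots_line("Disallow: /accounts") # => ["Disallow", "/accounts"]
split_robots_line("# just a comment")    # => nil
```

Limiting the split to two parts is a design choice made for this sketch; it keeps values such as full URLs intact even though they contain the separator character.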

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG

Instance Attribute Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(contents) ⇒ RobotsParser

Initializes and returns a Wgit::RobotsParser instance having parsed the robots.txt contents.

Parameters:

  • contents (String, #to_s)

    The contents of the robots.txt file to be parsed.



# File 'lib/wgit/robots_parser.rb', line 38

def initialize(contents)
  @rules = {
    allow_paths: Set.new,
    disallow_paths: Set.new
  }

  assert_respond_to(contents, :to_s)
  parse(contents.to_s)
end

Instance Attribute Details

#rules ⇒ Object (readonly) Also known as: paths

Hash containing the user-agent allow/disallow URL rules. Looks like:

  allow_paths:    ["/"]
  disallow_paths: ["/accounts", ...]



# File 'lib/wgit/robots_parser.rb', line 31

def rules
  @rules
end
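
As the constructor source shows, the values stored in this Hash are `Set` instances rather than Arrays. A minimal standalone sketch of that shape, using only the Ruby standard library and sample paths chosen for illustration:

```ruby
require "set"

# Same shape as the @rules Hash built in #initialize.
rules = {
  allow_paths: Set.new,
  disallow_paths: Set.new
}

# A Set silently ignores duplicates, so a path repeated in robots.txt
# is stored only once.
rules[:disallow_paths] << "/accounts"
rules[:disallow_paths] << "/accounts"
rules[:allow_paths] << "/"

# Converting with #to_a yields plain Arrays of Strings.
allow    = rules[:allow_paths].to_a    # => ["/"]
disallow = rules[:disallow_paths].to_a # => ["/accounts"]
```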

Instance Method Details

#allow_paths ⇒ Array<String>

Returns the allow paths/rules for this parser's robots.txt contents.

Returns:

  • (Array<String>)

    The allow paths/rules to follow.



# File 'lib/wgit/robots_parser.rb', line 58

def allow_paths
  @rules[:allow_paths].to_a
end

#allow_rules? ⇒ Boolean

Returns whether or not there are allow rules applying to Wgit.

Returns:

  • (Boolean)

    True if there are allow rules for Wgit to follow, false otherwise.



# File 'lib/wgit/robots_parser.rb', line 81

def allow_rules?
  @rules[:allow_paths].any?
end

#disallow_paths ⇒ Array<String>

Returns the disallow paths/rules for this parser's robots.txt contents.

Returns:

  • (Array<String>)

    The disallow paths/rules to follow.



# File 'lib/wgit/robots_parser.rb', line 65

def disallow_paths
  @rules[:disallow_paths].to_a
end

#disallow_rules? ⇒ Boolean

Returns whether or not there are disallow rules applying to Wgit.

Returns:

  • (Boolean)

    True if there are disallow rules for Wgit to follow, false otherwise.



# File 'lib/wgit/robots_parser.rb', line 89

def disallow_rules?
  @rules[:disallow_paths].any?
end

#inspect ⇒ String

Overrides Object#inspect to shorten the printed output of a RobotsParser.

Returns:

  • (String)

    A short textual representation of this RobotsParser.



# File 'lib/wgit/robots_parser.rb', line 51

def inspect
  "#<Wgit::RobotsParser has_rules=#{rules?} no_index=#{no_index?}>"
end

#no_index? ⇒ Boolean Also known as: banned?

Returns whether or not Wgit is banned from indexing this site.

Returns:

  • (Boolean)

    True if Wgit should not index this site, false otherwise.



# File 'lib/wgit/robots_parser.rb', line 97

def no_index?
  @rules[:disallow_paths].any? { |path| PATHS_ALL.include?(path) }
end
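
The check above matches only disallow paths that are exactly "/" or "*", so a site that disallows a few specific paths is not treated as banned. A standalone sketch of the same predicate, where `banned_by?` is a hypothetical free-standing function and `PATHS_ALL` is copied locally so the snippet runs without Wgit:

```ruby
require "set"

# Local copy of the documented constant.
PATHS_ALL = %w[/ *].freeze

# Hypothetical free-standing version of the #no_index? check.
def banned_by?(disallow_paths)
  disallow_paths.any? { |path| PATHS_ALL.include?(path) }
end

partial = Set["/accounts", "/admin"] # only specific paths are disallowed
full    = Set["/accounts", "/"]      # the whole site is disallowed

banned_by?(partial) # => false
banned_by?(full)    # => true
```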

#rules? ⇒ Boolean

Returns whether or not there are rules applying to Wgit.

Returns:

  • (Boolean)

    True if there are rules for Wgit to follow, false otherwise.



# File 'lib/wgit/robots_parser.rb', line 73

def rules?
  allow_rules? || disallow_rules?
end