Class: Ronin::Web::CLI::Commands::Wordlist Private
- Inherits:
-
Ronin::Web::CLI::Command
- Object
- Core::CLI::Command
- Ronin::Web::CLI::Command
- Ronin::Web::CLI::Commands::Wordlist
- Includes:
- Core::CLI::Logging, SpiderOptions
- Defined in:
- lib/ronin/web/cli/commands/wordlist.rb
Overview
This class is part of a private API. You should avoid using this class if possible, as it may be removed or be changed in the future.
Builds a wordlist by spidering a website.
Usage
ronin-web wordlist [options] {--host HOST | --domain DOMAIN | --site URL}
Options
--open-timeout SECS Sets the connection open timeout
--read-timeout SECS Sets the read timeout
--ssl-timeout SECS Sets the SSL connection timeout
--continue-timeout SECS Sets the continue timeout
--keep-alive-timeout SECS Sets the connection keep alive timeout
-P, --proxy PROXY Sets the proxy to use
-H, --header NAME: VALUE Sets a default header
--host-header NAME=VALUE Sets a default header
-u chrome-linux|chrome-macos|chrome-windows|chrome-iphone|chrome-ipad|chrome-android|firefox-linux|firefox-macos|firefox-windows|firefox-iphone|firefox-ipad|firefox-android|safari-macos|safari-iphone|safari-ipad|edge,
--user-agent The User-Agent to use
-U, --user-agent-string STRING The User-Agent string to use
-R, --referer URL Sets the Referer URL
--delay SECS Sets the delay in seconds between each request
-l, --limit COUNT Only spiders up to COUNT pages
-d, --max-depth DEPTH Only spiders up to max depth
--enqueue URL Adds the URL to the queue
--visited URL Marks the URL as previously visited
--strip-fragments Enables/disables stripping the fragment component of every URL
--strip-query Enables/disables stripping the query component of every URL
--visit-host HOST Visit URLs with the matching host name
--visit-hosts-like /REGEX/ Visit URLs with hostnames that match the REGEX
--ignore-host HOST Ignore the host name
--ignore-hosts-like /REGEX/ Ignore the host names matching the REGEX
--visit-port PORT Visit URLs with the matching port number
--visit-ports-like /REGEX/ Visit URLs with port numbers that match the REGEX
--ignore-port PORT Ignore the port number
--ignore-ports-like /REGEX/ Ignore the port numbers matching the REGEX
--visit-link URL Visit the URL
--visit-links-like /REGEX/ Visit URLs that match the REGEX
--ignore-link URL Ignore the URL
--ignore-links-like /REGEX/ Ignore URLs matching the REGEX
--visit-ext FILE_EXT Visit URLs with the matching file ext
--visit-exts-like /REGEX/ Visit URLs with file exts that match the REGEX
--ignore-ext FILE_EXT Ignore the URLs with the file ext
--ignore-exts-like /REGEX/ Ignore URLs with file exts matching the REGEX
-r, --robots Specifies whether to honor robots.txt
--host HOST Spiders the specific HOST
--domain DOMAIN Spiders the whole domain
--site URL Spiders the website, starting at the URL
-o, --output PATH The wordlist to write to
-X, --content-xpath XPATH The XPath for the content (Default: //body)
-C, --content-css-path CSS-PATH The CSS-path for the content
--meta-tags Parse certain meta-tags (Default: enabled)
--no-meta-tags Ignore meta-tags
--alt-tags Parse alt-tags on images (Default: enabled)
--no-alt-tags Ignore alt-tags on images
--paths Also parse URL paths
--query-param-names Also parse URL query param names
--query-param-values Also parse URL query param values
--only-paths Only build a wordlist based on the paths
--only-query-param-names Only build a wordlist based on the query param names
--only-query-param-values Only build a wordlist based on the query param values
-f, --format txt|gz|bzip2|xz Specifies the format of the wordlist file
-A, --append Append new words to the wordlist file instead of overwriting the file
-L, --lang LANG The language of the text to parse
--stop-word WORD A stop-word to ignore
--ignore-word WORD Ignores the word
--digits Accepts words containing digits (Default: enabled)
--no-digits Ignores words containing digits
--special-char CHAR Allows a special character within a word (Default: _, -, ')
--numbers Accepts numbers as words (Default: disabled)
--no-numbers Ignores numbers
--acronyms Treats acronyms as words (Default: enabled)
--no-acronyms Ignores acronyms
--normalize-case Converts all words to lowercase
--no-normalize-case Preserve the case of words and letters (Default: enabled)
--normalize-apostrophes Removes apostrophes from words
--no-normalize-apostrophes Preserve apostrophes from words (Default: enabled)
--normalize-acronyms Removes '.' characters from acronyms
--no-normalize-acronyms Preserve '.' characters in acronyms (Default: enabled)
-h, --help Print help information
Constant Summary
- META_TAGS_XPATH =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
XPath to find description and keywords meta-tags.
'/head/meta[@name="description" or @name="keywords"]/@content'
- TEXT_XPATH =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
XPath to find all text elements.
'//text()[not (ancestor-or-self::script or ancestor-or-self::style)]'
- COMMENT_XPATH =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
XPath to find all HTML comments.
'//comment()'
- ALT_TAGS_XPATH =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
XPath which finds all image alt-tags, SVG desc elements, and a title attributes.
'//img/@alt|//area/@alt|//input/@alt|//a/@title'
- WORDLIST_BUILDER_OPTIONS =
This constant is part of a private API. You should avoid using this constant if possible, as it may be removed or be changed in the future.
List of command options that directly map to the keyword arguments of Wordlist::Builder.new.
[ :format, :append, :lang, :digits, :numbers, :acronyms, :normalize_case, :normalize_apostrophes, :normalize_acronyms ]
Instance Attribute Summary
-
#content_xpath ⇒ String
readonly
private
The XPath or CSS-path for the page's content.
-
#ignore_words ⇒ Array<String>
readonly
private
List of words to ignore.
-
#special_chars ⇒ Array<String>
readonly
private
The list of special characters to allow in words.
-
#stop_words ⇒ Array<String>
readonly
private
List of stop-words to ignore.
Attributes included from SpiderOptions
Instance Method Summary
-
#infer_wordlist_path ⇒ String
private
Generates the wordlist output path based on the --host, --domain, or --site options.
-
#initialize(**kwargs) ⇒ Wordlist
constructor
private
Initializes the ronin-web wordlist command.
-
#parse_html(page) ⇒ Object
private
Parses the spidered page's HTML and adds the words to the wordlist.
-
#parse_page(page) ⇒ Object
private
Parses the spidered page's content and adds the words to the wordlist.
-
#parse_url_path(url) ⇒ Object
private
Parses the URL's directory names of a spidered page and adds them to the wordlist.
-
#parse_url_query_param_names(url) ⇒ Object
private
Parses the URL's query param names of a spidered page and adds them to the wordlist.
-
#parse_url_query_param_values(url) ⇒ Object
private
Parses the URL's query param values of a spidered page and adds them to the wordlist.
-
#run ⇒ Object
private
Runs the ronin-web wordlist command.
-
#wordlist_builder_kwargs ⇒ Object
private
Creates a keyword arguments Hash of all command options that will be directly passed to Wordlist::Builder.new.
-
#wordlist_path ⇒ String
private
The wordlist output path.
Methods included from SpiderOptions
#continue_timeout, #continue_timeout=, #default_headers, #delay, #delay=, #history, #host_headers, #ignore_exts, #ignore_hosts, #ignore_links, #ignore_ports, #ignore_schemes, included, #keep_alive_timeout, #keep_alive_timeout=, #limit, #limit=, #max_depth, #max_depth=, #new_agent, #open_timeout, #open_timeout=, #proxy, #proxy=, #queue, #read_timeout, #read_timeout=, #referer, #referer=, #robots, #robots=, #ssl_timeout, #ssl_timeout=, #strip_fragments, #strip_fragments=, #strip_query, #strip_query=, #user_agent, #user_agent=, #visit_exts, #visit_hosts, #visit_links, #visit_ports, #visit_schemes
Constructor Details
#initialize(**kwargs) ⇒ Wordlist
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Initializes the ronin-web wordlist command.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 279
def initialize(**kwargs)
  super(**kwargs)

  @content_xpath = nil

  @parse_meta_tags = true
  @parse_comments  = true
  @parse_alt_tags  = true

  @stop_words    = []
  @ignore_words  = []
  @special_chars = []
end
Instance Attribute Details
#content_xpath ⇒ String (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
The XPath or CSS-path for the page's content.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 256
def content_xpath
  @content_xpath
end
#ignore_words ⇒ Array<String> (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
List of words to ignore.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 266
def ignore_words
  @ignore_words
end
#special_chars ⇒ Array<String> (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
The list of special characters to allow in words.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 271
def special_chars
  @special_chars
end
#stop_words ⇒ Array<String> (readonly)
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
List of stop-words to ignore.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 261
def stop_words
  @stop_words
end
Instance Method Details
#infer_wordlist_path ⇒ String
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Generates the wordlist output path based on the --host, --domain, or --site options.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 355
def infer_wordlist_path
  if    options[:host]   then "#{options[:host]}.txt"
  elsif options[:domain] then "#{options[:domain]}.txt"
  elsif options[:site]
    uri = URI.parse(options[:site])

    unless uri.port == uri.default_port
      "#{uri.host}:#{uri.port}.txt"
    else
      "#{uri.host}.txt"
    end
  else
    print_error "must specify --host, --domain, or --site"
    exit(1)
  end
end
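The precedence above (host, then domain, then site) can be tried outside the command class. The following is a minimal sketch only: infer_wordlist_name is a hypothetical standalone helper, it takes a plain Hash instead of the command's options, and it raises ArgumentError where the real method prints an error and exits.

```ruby
require 'uri'

# Hypothetical helper (not part of ronin-web): derives a wordlist file name
# from a plain options Hash, using the same host > domain > site precedence.
def infer_wordlist_name(options)
  if    options[:host]   then "#{options[:host]}.txt"
  elsif options[:domain] then "#{options[:domain]}.txt"
  elsif options[:site]
    uri = URI.parse(options[:site])

    # Only include the port when it differs from the scheme's default port.
    if uri.port == uri.default_port
      "#{uri.host}.txt"
    else
      "#{uri.host}:#{uri.port}.txt"
    end
  else
    raise ArgumentError, "must specify :host, :domain, or :site"
  end
end

puts infer_wordlist_name(host: 'www.example.com')
puts infer_wordlist_name(site: 'https://example.com:8443/login')
```

A non-default port ends up in the file name with a colon, which some filesystems do not accept in file names.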
#parse_html(page) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Parses the spidered page's HTML and adds the words to the wordlist.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 471
def parse_html(page)
  page.search(@xpath).each do |node|
    text = node.inner_text
    text.strip!

    @wordlist.parse(text) unless text.empty?
  end
end
#parse_page(page) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Parses the spidered page's content and adds the words to the wordlist.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 457
def parse_page(page)
  if page.html?
    log_info "Parsing HTML on #{page.url} ..."
    parse_html(page)
  end
end
#parse_url_path(url) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Parses the URL's directory names of a spidered page and adds them to the wordlist.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 411
def parse_url_path(url)
  log_info "Parsing #{url} ..."

  url.path.split('/').each do |dirname|
    @wordlist.add(dirname) unless dirname.empty?
  end
end
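To see which words this splitting yields for a given URL, the same split-and-skip-empty logic can be run with only the standard library. This is a sketch under assumptions: path_words is a hypothetical helper, and it returns an Array instead of adding to a wordlist builder.

```ruby
require 'uri'

# Hypothetical helper (not part of ronin-web): returns the non-empty
# path segments of a URL, in order.
def path_words(url)
  # The leading "/" produces an empty first element, which is dropped.
  URI.parse(url).path.split('/').reject(&:empty?)
end

p path_words('https://example.com/blog/2024/post.html')
```

The final segment (here a file name) is kept as well, since the split treats every segment the same way.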
#parse_url_query_param_names(url) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Parses the URL's query param names of a spidered page and adds them to the wordlist.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 426
def parse_url_query_param_names(url)
  unless url.query_params.empty?
    log_info "Parsing query param for #{url} ..."
    @wordlist.append(url.query_params.keys)
  end
end
#parse_url_query_param_values(url) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Parses the URL's query param values of a spidered page and adds them to the wordlist.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 440
def parse_url_query_param_values(url)
  unless url.query_params.empty?
    log_info "Parsing query param values for #{url} ..."

    url.query_params.each_value do |value|
      @wordlist.add(value)
    end
  end
end
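The two query param methods rely on a query_params Hash on the URL object. As a rough stand-in that needs only the standard library, the sketch below uses URI.decode_www_form to separate names from values; query_param_names_and_values is a hypothetical helper, not the mechanism ronin-web itself uses.

```ruby
require 'uri'

# Hypothetical helper (not part of ronin-web): splits a URL's query string
# into a list of param names and a list of param values.
def query_param_names_and_values(url)
  query = URI.parse(url).query
  return [[], []] if query.nil? || query.empty?

  pairs = URI.decode_www_form(query) # e.g. [["q", "ruby"], ["page", "2"]]
  [pairs.map(&:first), pairs.map(&:last)]
end

names, values = query_param_names_and_values('https://example.com/search?q=ruby&page=2')
p names
p values
```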
#run ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Runs the ronin-web wordlist command.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 309
def run
  @wordlist = ::Wordlist::Builder.new(wordlist_path,**wordlist_builder_kwargs)

  @xpath = "#{@content_xpath}#{TEXT_XPATH}"
  @xpath << "|#{META_TAGS_XPATH}" if @parse_meta_tags
  @xpath << "|#{@content_xpath}#{COMMENT_XPATH}" if @parse_comments
  @xpath << "|#{@content_xpath}#{ALT_TAGS_XPATH}" if @parse_alt_tags

  begin
    new_agent do |agent|
      if options[:only_paths]
        agent.every_url(&method(:parse_url_path))
      elsif options[:only_query_param_names]
        agent.every_url(&method(:parse_url_query_param_names))
      elsif options[:only_query_param_values]
        agent.every_url(&method(:parse_url_query_param_values))
      else
        agent.every_url(&method(:parse_url_path)) if options[:paths]
        agent.every_url(&method(:parse_url_query_param_names)) if options[:query_param_names]
        agent.every_url(&method(:parse_url_query_param_values)) if options[:query_param_values]

        agent.every_ok_page(&method(:parse_page))
      end
    end
  ensure
    @wordlist.close
  end
end
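The XPath that run assembles is a union of several expressions joined with "|". The sketch below reproduces that string building in isolation so the result can be inspected. The constants are copied from the Constant Summary; build_xpath and its keyword arguments are hypothetical names standing in for the instance variables.

```ruby
# Constants copied from the Constant Summary of this class.
TEXT_XPATH      = '//text()[not (ancestor-or-self::script or ancestor-or-self::style)]'
META_TAGS_XPATH = '/head/meta[@name="description" or @name="keywords"]/@content'
COMMENT_XPATH   = '//comment()'
ALT_TAGS_XPATH  = '//img/@alt|//area/@alt|//input/@alt|//a/@title'

# Hypothetical helper (not part of ronin-web): joins the enabled XPath
# expressions with "|", the XPath union operator.
def build_xpath(content_xpath, meta_tags: true, comments: true, alt_tags: true)
  parts = ["#{content_xpath}#{TEXT_XPATH}"]
  parts << META_TAGS_XPATH                     if meta_tags
  parts << "#{content_xpath}#{COMMENT_XPATH}"  if comments
  parts << "#{content_xpath}#{ALT_TAGS_XPATH}" if alt_tags
  parts.join('|')
end

puts build_xpath('//div[@id="content"]', meta_tags: false, alt_tags: false)
```

Because ALT_TAGS_XPATH itself contains "|" alternatives, plain string concatenation attaches the content prefix only to its first alternative (//img/@alt); the remaining alternatives still match against the whole document.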
#wordlist_builder_kwargs ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Creates a keyword arguments Hash of all command options that will be directly passed to Wordlist::Builder.new.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 390
def wordlist_builder_kwargs
  kwargs = {}

  WORDLIST_BUILDER_OPTIONS.each do |key|
    kwargs[key] = options[key] if options.has_key?(key)
  end

  kwargs[:stop_words]    = @stop_words    unless @stop_words.empty?
  kwargs[:ignore_words]  = @ignore_words  unless @ignore_words.empty?
  kwargs[:special_chars] = @special_chars unless @special_chars.empty?

  return kwargs
end
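The filtering above amounts to keeping only the whitelisted option keys and adding the list-valued settings when they are non-empty. A minimal sketch of the same idea with Hash#slice follows; BUILDER_OPTION_KEYS and builder_kwargs are hypothetical names, and the lists are passed as keyword arguments rather than read from instance variables.

```ruby
# Same keys as WORDLIST_BUILDER_OPTIONS in the Constant Summary.
BUILDER_OPTION_KEYS = %i[
  format append lang digits numbers acronyms
  normalize_case normalize_apostrophes normalize_acronyms
].freeze

# Hypothetical helper (not part of ronin-web): builds the keyword arguments
# Hash from parsed options plus the optional word and character lists.
def builder_kwargs(options, stop_words: [], ignore_words: [], special_chars: [])
  # Hash#slice keeps only the listed keys that are present in options.
  kwargs = options.slice(*BUILDER_OPTION_KEYS)

  kwargs[:stop_words]    = stop_words    unless stop_words.empty?
  kwargs[:ignore_words]  = ignore_words  unless ignore_words.empty?
  kwargs[:special_chars] = special_chars unless special_chars.empty?
  kwargs
end

p builder_kwargs({ format: :gz, output: 'words.gz', digits: false }, stop_words: ['the'])
```

Keys that are present with a false value (such as digits: false here) are kept, which matches the has_key? check in the original method.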
#wordlist_path ⇒ String
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
The wordlist output path.
# File 'lib/ronin/web/cli/commands/wordlist.rb', line 344
def wordlist_path
  options.fetch(:output) { infer_wordlist_path }
end
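The block form of Hash#fetch means the inferred path is computed only when no output option was given. The short sketch below illustrates that lazy fallback with a counter; the infer lambda and the file names are made up for illustration.

```ruby
calls = 0
# Stand-in for infer_wordlist_path; counts how often it is invoked.
infer = -> { calls += 1; 'example.com.txt' }

# Key present: the block is not evaluated.
explicit = { output: 'custom.txt' }.fetch(:output) { infer.call }
# Key missing: the block supplies the fallback value.
inferred = {}.fetch(:output) { infer.call }

puts explicit
puts inferred
puts calls
```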