Class: Wordlist::Builder

Inherits:
Object
Defined in:
lib/wordlist/builder.rb

Overview

Parses text and builds a wordlist file.

Since:

  • 1.0.0

Constructor Details

#initialize(path, format: Format.infer(path), append: false, **kwargs) ⇒ Builder

Creates a new word-list Builder object.

Parameters:

  • path (String)

    The path of the wordlist file.

  • format (:txt, :gzip, :bzip2, :xz, :zip, :7zip, nil) (defaults to: Format.infer(path))

    The format of the wordlist. If not given the format will be inferred from the file extension.

  • append (Boolean) (defaults to: false)

    Indicates whether new words will be appended to the wordlist or overwrite the wordlist.

  • kwargs (Hash{Symbol => Object})

    Additional keyword arguments for Lexer#initialize.

Options Hash (**kwargs):

  • :lang (Symbol)

    The language to use. Defaults to Lexer::Lang.default.

  • :stop_words (Array<String>)

    The explicit stop-words to ignore. If not given, default stop words will be loaded based on lang or Lexer::Lang.default.

  • :ignore_words (Array<String, Regexp>)

    Optional list of words to ignore. Can contain Strings or Regexps.

  • :digits (Boolean) — default: true

    Controls whether parsed words may contain digits or not.

  • :special_chars (Array<String>) — default: Lexer::SPECIAL_CHARS

    The additional special characters allowed within words.

  • :numbers (Boolean) — default: false

    Controls whether whole numbers will be parsed as words.

  • :acronyms (Boolean) — default: true

    Controls whether acronyms will be parsed as words.

  • :normalize_case (Boolean) — default: false

    Controls whether to convert all words to lowercase.

  • :normalize_apostrophes (Boolean) — default: false

    Controls whether apostrophes will be removed from the end of words.

  • :normalize_acronyms (Boolean) — default: false

    Controls whether acronyms will have . characters removed.

Raises:

  • (ArgumentError)

    The format could not be inferred from the file extension, or the ignore_words keyword contained a value other than a String or Regexp.

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 90

def initialize(path, format: Format.infer(path), append: false, **kwargs)
  @path   = ::File.expand_path(path)
  @format = format
  @append = append

  @lexer = Lexer.new(**kwargs)
  @unique_filter = UniqueFilter.new

  load! if append? && ::File.file?(@path)
  open!
end
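
The constructor relies on the format being inferred from the file extension, and raises ArgumentError when that fails. The following is a minimal sketch of that idea, not the library's implementation: the `SKETCH_FORMATS` table and the `infer_format` helper are hypothetical names, and the extension-to-format mapping is an assumption based on the format symbols listed above.

```ruby
# Hypothetical stand-in for Format.infer; the extension table is an assumption.
SKETCH_FORMATS = {
  '.txt' => :txt,
  '.gz'  => :gzip,
  '.bz2' => :bzip2,
  '.xz'  => :xz,
  '.zip' => :zip,
  '.7z'  => :'7zip'
}.freeze

# Looks up the format by the last file extension, raising ArgumentError
# when the extension is not in the table.
def infer_format(path)
  ext = File.extname(path)

  SKETCH_FORMATS.fetch(ext) do
    raise ArgumentError, "could not infer the format of file: #{path.inspect}"
  end
end

infer_format('rockyou.txt')    # => :txt
infer_format('rockyou.txt.gz') # => :gzip
```

Passing `format:` explicitly sidesteps the lookup, which is useful when the wordlist file has a non-standard extension.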

Instance Attribute Details

#format ⇒ :txt, ... (readonly)

The format of the wordlist file.

Returns:

  • (:txt, :gzip, :bzip2, :xz, :zip, :7zip)

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 26

def format
  @format
end

#lexer ⇒ Lexer (readonly)

The word lexer.

Returns:

  • (Lexer)

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 31

def lexer
  @lexer
end

#path ⇒ String (readonly)

The path of the wordlist file.

Returns:

  • (String)

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 21

def path
  @path
end

#unique_filter ⇒ UniqueFilter (readonly)

The unique filter.

Returns:

  • (UniqueFilter)

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 36

def unique_filter
  @unique_filter
end

Class Method Details

.open(path, **kwargs) {|builder| ... } ⇒ Builder

Creates a new Builder object with the given arguments and opens the word-list file. If a block is given, the builder is passed to the block and the word-list file is closed when the block returns or raises.

Examples:

Builder.open('path/to/file.txt') do |builder|
  builder.parse(text)
end

Parameters:

  • path (String)

    The path of the wordlist file.

  • kwargs (Hash{Symbol => Object})

    Additional keyword arguments for #initialize.

Yields:

  • (builder)

    If a block is given, it will be passed the new builder.

Yield Parameters:

  • builder (Builder)

    The newly created builder object.

Returns:

  • (Builder)

    The newly created builder object.

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 124

def self.open(path,**kwargs)
  builder = new(path,**kwargs)

  if block_given?
    begin
      yield builder
    ensure
      builder.close
    end
  end

  return builder
end
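
The difference between the block and non-block forms can be shown with a small stand-in class. This is a sketch under stated assumptions: `SketchBuilder` is a hypothetical class that only wraps a plain file handle, and it mirrors the yield-then-ensure shape of the method above rather than the real Builder.

```ruby
require 'tempfile'

# Hypothetical stand-in for Builder: wraps a writable file handle.
class SketchBuilder
  def initialize(path)
    @io = File.open(path, 'w')
  end

  # Same shape as Builder.open: yield the new object, then always close it.
  def self.open(path)
    builder = new(path)

    if block_given?
      begin
        yield builder
      ensure
        builder.close
      end
    end

    builder
  end

  def write(line)
    @io.puts(line)
  end

  def close
    @io.close unless @io.closed?
  end

  def closed?
    @io.closed?
  end
end

tmp = Tempfile.new(['words', '.txt'])
tmp.close

# Non-block form: the caller is responsible for calling close.
manual = SketchBuilder.open(tmp.path)
manual.write('first')
manual.close

# Block form: the file is closed when the block returns or raises.
scoped = SketchBuilder.open(tmp.path) { |b| b.write('second') }
scoped.closed? # => true
```

The block form is usually the safer choice, because the `ensure` clause closes the file even if parsing raises an exception partway through.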

Instance Method Details

#add(word) ⇒ self Also known as: <<, push

Appends the given word to the wordlist file, only if it has not been previously added.

Parameters:

  • word (String)

    The word to append.

Returns:

  • (self)

    The builder object.

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 168

def add(word)
  if @unique_filter.add?(word)
    write(word)
  end

  return self
end
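
The write-once behaviour can be sketched with Ruby's standard `Set#add?`, which returns `nil` when the element is already present. This is only an illustration: `SketchWordlist` is a hypothetical class, a plain `Set` stands in for UniqueFilter (whose internals are not shown here), and a `StringIO` stands in for the wordlist file.

```ruby
require 'set'
require 'stringio'

# Hypothetical stand-in for Builder#add; a Set plays the role of UniqueFilter.
class SketchWordlist
  attr_reader :io

  def initialize(io = StringIO.new)
    @io   = io
    @seen = Set.new
  end

  # Set#add? returns nil for an element that is already present,
  # so each word is written only the first time it is seen.
  def add(word)
    @io.puts(word) if @seen.add?(word)
    self
  end

  alias_method :<<, :add
  alias_method :push, :add
end

list = SketchWordlist.new
list << 'foo' << 'bar' << 'foo'
list.push('baz').add('bar')

list.io.string # => "foo\nbar\nbaz\n"
```

Returning `self` is what makes the chained `<<` calls possible.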

#append(words) ⇒ self Also known as: concat

Adds the given words to the word-list, skipping any word that has already been added.

Parameters:

  • words (Array<String>)

    The words to add to the list.

Returns:

  • (self)

    The builder object.

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 188

def append(words)
  words.each { |word| add(word) }
  return self
end

#append? ⇒ Boolean

Determines if the builder will append new words to the existing wordlist or overwrite it.

Returns:

  • (Boolean)

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 144

def append?
  @append
end

#close ⇒ Object

Closes the word-list file.

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 225

def close
  unless @io.closed?
    @io.close
    @unique_filter.clear
  end
end

#closed? ⇒ Boolean

Indicates whether the wordlist builder has been closed.

Returns:

  • (Boolean)

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 237

def closed?
  @io.closed?
end

#comment(message) ⇒ Object

Writes a comment line to the wordlist file.

Parameters:

  • message (String)

    The comment message to write.

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 154

def comment(message)
  write("# #{message}")
end

#parse(text) ⇒ Object

Parses the given text, adding each unique word to the word-list file.

Parameters:

  • text (String)

    The text to parse.

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 201

def parse(text)
  @lexer.parse(text) do |word|
    add(word)
  end
end

#parse_file(path) ⇒ Object

Parses the contents of the file at the given path, adding each unique word to the word-list file.

Parameters:

  • path (String)

    The path of the file to parse.

Since:

  • 1.0.0



# File 'lib/wordlist/builder.rb', line 214

def parse_file(path)
  ::File.open(path) do |file|
    file.each_line do |line|
      parse(line)
    end
  end
end
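
The line-by-line approach above can be illustrated without the library. In this sketch, `sketch_words` is a hypothetical and deliberately naive tokenizer standing in for the Lexer (it only matches runs of letters), and `sketch_parse_file` collects unique words into an array instead of writing them to a wordlist file.

```ruby
require 'set'
require 'tempfile'

# Hypothetical stand-in for the Lexer: matches runs of letters only.
def sketch_words(line)
  line.scan(/[[:alpha:]]+/)
end

# Reads the input one line at a time and collects each unique word
# in the order it first appears.
def sketch_parse_file(path)
  seen  = Set.new
  words = []

  File.open(path) do |file|
    file.each_line do |line|
      sketch_words(line).each do |word|
        words << word if seen.add?(word)
      end
    end
  end

  words
end

Tempfile.create(['input', '.txt']) do |file|
  file.write("the quick brown fox\njumps over the lazy dog\n")
  file.flush

  p sketch_parse_file(file.path)
  # prints ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
end
```

Reading with `each_line` keeps only one line in memory at a time, so large input files do not have to be loaded whole.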