Module: Squish
Defined in: lib/squish.rb, lib/squish/version.rb
Defined Under Namespace
Modules: SQUISH_VERSION
Classes: Bucket, Internal, Leaf, Node
Class Method Summary
- .all_bytes ⇒ Object
  Returns a string containing all possible bytes.
- .classify(document, buckets) ⇒ Object
  Classifies a document, based on an array of supplied buckets.
- .classify!(document, buckets) ⇒ Object
  Classifies a document, based on an array of supplied buckets, then adds the document to the matching bucket.
- .filter_document(document) ⇒ Object
  Filters an entire document (Hash).
- .filter_value(value) ⇒ Object
  Does a visual reduction of the characters contained within the value.
Class Method Details
.all_bytes ⇒ Object
Returns a string containing all possible bytes. This is appended to the raw bucket dump to ensure that all bytes can be handled by the tree, since incoming documents may contain bytes not previously encountered within training data.
    # File 'lib/squish.rb', line 356
    def self.all_bytes #:nodoc:
      if !defined?(@all_bytes) || @all_bytes == nil
        all_bytes = ""
        for i in 0...256
          all_bytes << i.chr
        end
        @all_bytes = all_bytes
      end
      return @all_bytes
    end
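The same byte table can be built in a single expression. The following is a minimal sketch, not the library's code: `ByteTable` is a hypothetical module name, and freezing the cached string is an added assumption (the original returns a mutable string).

```ruby
# Hypothetical stand-in for Squish.all_bytes; not part of the library.
module ByteTable
  # Returns a 256-byte binary string holding every byte value exactly once.
  # The result is memoized, so the table is only built on the first call.
  def self.all_bytes
    @all_bytes ||= (0..255).to_a.pack("C*").freeze
  end
end

table = ByteTable.all_bytes
puts table.bytesize  # prints 256
```

Appending such a table to the training data means every byte value has been seen at least once, which is the property the description above relies on.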
.classify(document, buckets) ⇒ Object
Classifies a document, based on an array of supplied buckets.
    # File 'lib/squish.rb', line 37
    def self.classify(document, buckets)
      best_result = nil
      best_score = nil
      for bucket in buckets
        score = bucket.compress(document)
        if best_score == nil || (score < best_score)
          best_score = score
          best_result = bucket.name
        end
      end
      return best_result
    end
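The selection rule above (the bucket with the lowest compression score wins) can be tried out without the library. This is only a sketch under stated assumptions: `SketchBucket` and `classify_sketch` are hypothetical names, and the score is computed with Ruby's standard Zlib rather than with whatever Squish::Bucket does internally.

```ruby
require "zlib"

# Hypothetical stand-in for Squish::Bucket. Zlib is used purely for
# illustration; the real class is not assumed to work this way.
class SketchBucket
  attr_reader :name

  def initialize(name, training = "")
    @name = name
    @training = training.dup
  end

  # Adds a document to this bucket's training data.
  def <<(document)
    @training << document.to_s
    self
  end

  # Score: extra compressed bytes the document costs on top of the
  # training data. Lower means the document resembles the bucket more.
  def compress(document)
    deflated_size(@training + document.to_s) - deflated_size(@training)
  end

  private

  def deflated_size(text)
    Zlib::Deflate.deflate(text).bytesize
  end
end

# Same rule as Squish.classify: return the name of the lowest-scoring
# bucket, or nil when no buckets are supplied.
def classify_sketch(document, buckets)
  best = buckets.min_by { |bucket| bucket.compress(document) }
  best && best.name
end

spam = SketchBucket.new("spam", "buy cheap pills now buy cheap watches now limited offer click here ")
ham  = SketchBucket.new("ham", "meeting notes attached please review the quarterly report before friday ")
puts classify_sketch("buy cheap pills now limited offer click here", [ham, spam])
```

A document that repeats phrases already present in a bucket compresses to very few additional bytes there, so that bucket receives the lowest score.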
.classify!(document, buckets) ⇒ Object
Classifies a document, based on an array of supplied buckets. The document is automatically added to the bucket after classification.
    # File 'lib/squish.rb', line 52
    def self.classify!(document, buckets)
      result = self.classify(document, buckets)
      for bucket in buckets
        bucket << document if bucket.name == result
      end
      return result
    end
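The side effect of the bang variant is easiest to see with stub buckets. In this sketch `StubBucket` and `classify_and_learn` are hypothetical names, and the fixed scores are made up for illustration; only the `name`, `compress`, and `<<` calls mirror what the method above uses.

```ruby
# Hypothetical stub: returns a fixed score and records what it is fed.
StubBucket = Struct.new(:name, :score, :documents) do
  def compress(_document)
    score
  end

  def <<(document)
    documents << document
    self
  end
end

# Mirrors Squish.classify!: classify first, then append the document to
# every bucket whose name equals the result.
def classify_and_learn(document, buckets)
  winner = buckets.min_by { |bucket| bucket.compress(document) }
  result = winner && winner.name
  buckets.each { |bucket| bucket << document if bucket.name == result }
  result
end

spam = StubBucket.new("spam", 3, [])
ham  = StubBucket.new("ham", 7, [])
puts classify_and_learn("some document", [spam, ham])  # prints spam
puts spam.documents.length                             # prints 1
puts ham.documents.length                              # prints 0
```

Because the document is fed back into the winning bucket, a misclassification also becomes training data, so callers who cannot tolerate that may prefer the non-bang method.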
.filter_document(document) ⇒ Object
Filters an entire document (Hash) by applying .filter_value to each value. Keys are left unchanged.
    # File 'lib/squish.rb', line 368
    def self.filter_document(document) #:nodoc:
      filtered_document = {}
      for key in document.keys
        filtered_document[key] = filter_value(document[key])
      end
      return filtered_document
    end
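A short sketch of the same key-preserving transformation follows. `tiny_filter` is a hypothetical, much-reduced stand-in for Squish.filter_value (it only strips whitespace and downcases), and `filter_document_sketch` is likewise an illustrative name.

```ruby
# Hypothetical stand-in for Squish.filter_value: strips whitespace and
# downcases, nothing more.
def tiny_filter(value)
  value.to_s.gsub(/\s/, "").downcase
end

# Builds a new Hash with the same keys and filtered values. The input
# Hash is not modified.
def filter_document_sketch(document)
  document.transform_values { |value| tiny_filter(value) }
end

doc = { "author" => "Bob  Smith", "views" => 42 }
filtered = filter_document_sketch(doc)
puts filtered["author"]  # prints bobsmith
puts filtered["views"]   # prints 42 (now a String)
```

Note that non-String values come back as Strings, since the filter calls `to_s` on each value first.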
.filter_value(value) ⇒ Object
Performs a visual reduction of the characters contained within the value: look-alike characters and character sequences are collapsed to a single letter, whitespace is removed, and the result is downcased. This keeps “1337” speak (leetspeak) from degrading the effectiveness of the algorithm. The reduction is intentionally VERY lossy and isn’t particularly efficient, but it works. While some information may be lost from legitimate documents, more patterns are revealed in illegitimate documents, so on balance more critical information is revealed than is lost.
    # File 'lib/squish.rb', line 383
    def self.filter_value(value) #:nodoc:
      filtered_value = value.to_s.dup
      # Remove whitespace because spammers sometimes insert extraneous
      # whitespace, and the main algorithm shouldn't give false positives due
      # to a lack of whitespace, but it may give false positives due to extra
      # whitespace.
      filtered_value.gsub!(/\s/, "")
      filtered_value.gsub!(/~/, "-")
      filtered_value.gsub!(/\|/, "I")
      filtered_value.gsub!(/!/, "I")
      filtered_value.gsub!(/1/, "I")
      filtered_value.gsub!(/l/, "I")
      filtered_value.gsub!(/\+/, "t")
      filtered_value.gsub!(/3/, "e")
      filtered_value.gsub!(/7/, "T")
      filtered_value.gsub!(/@/, "a")
      filtered_value.gsub!(/4/, "A")
      filtered_value.gsub!(/8/, "B")
      filtered_value.gsub!(/6/, "G")
      filtered_value.gsub!(/\$/, "S")
      filtered_value.gsub!(/0/, "O")
      filtered_value.gsub!(/\(\)/, "O")
      filtered_value.gsub!(/I\)/, "D")
      filtered_value.gsub!(/\]\)/, "D")
      filtered_value.gsub!(/\[\)/, "D")
      filtered_value.gsub!(/I\*/, "P")
      filtered_value.gsub!(/\]\*/, "P")
      filtered_value.gsub!(/\*/, "a")
      filtered_value.gsub!(/I2/, "R")
      filtered_value.gsub!(/I=/, "F")
      filtered_value.gsub!(/I\\I/, "N")
      filtered_value.gsub!(/\`\//, "Y")
      filtered_value.gsub!(/\/\\\/\\/, "M")
      filtered_value.gsub!(/\\\/\\\//, "W")
      filtered_value.gsub!(/\\\/\\\//, "W")
      filtered_value.gsub!(/I\\\/I/, "M")
      filtered_value.gsub!(/IVI/i, "M")
      filtered_value.gsub!(/VV/, "W")
      filtered_value.gsub!(/\\X\//, "W")
      filtered_value.gsub!(/\/\\\//, "N")
      filtered_value.gsub!(/\\\/\\/, "N")
      filtered_value.gsub!(/\/V\\/i, "M")
      filtered_value.gsub!(/\/V/i, "N")
      filtered_value.gsub!(/\\N/, "W")
      filtered_value.gsub!(/\\\//, "V")
      filtered_value.gsub!(/\>\</, "X")
      filtered_value.gsub!(/I-I/, "H")
      filtered_value.gsub!(/\]-\[/, "H")
      filtered_value.gsub!(/\}\{/, "H")
      filtered_value.gsub!(/I_I/, "U")
      filtered_value.gsub!(/I\</, "K")
      filtered_value.gsub!(/\]\</, "K")
      filtered_value.gsub!(/\(/, "C")
      filtered_value.gsub!(/\//, "I")
      filtered_value.gsub!(/\\/, "I")
      filtered_value.downcase!
      return filtered_value
    end
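To see the reduction at work, the sketch below applies a small, ordered subset of the substitutions listed above. `SKETCH_RULES` and `filter_value_sketch` are hypothetical names, and the subset is chosen for illustration only; it is not a replacement for the full rule list.

```ruby
# Ordered subset of the substitutions used by Squish.filter_value.
# Order matters: later patterns match the output of earlier ones.
SKETCH_RULES = [
  [/\s/, ""],
  [/\|/, "I"], [/!/, "I"], [/1/, "I"], [/l/, "I"],
  [/3/, "e"], [/7/, "T"], [/@/, "a"], [/4/, "A"],
  [/\$/, "S"], [/0/, "O"],
  [/I-I/, "H"] # only matches because "|" has already become "I"
].freeze

# Applies each rule in turn to a copy of the value, then downcases.
def filter_value_sketch(value)
  text = value.to_s.dup
  SKETCH_RULES.each { |pattern, replacement| text.gsub!(pattern, replacement) }
  text.downcase
end

puts filter_value_sketch("Fr33 V!4GR4 $AL3")  # prints freeviagrasale
puts filter_value_sketch("|-|e11o")           # prints heiio
puts filter_value_sketch("hello")             # prints heiio
```

The last two calls show the lossiness described above: a lowercase "l" is itself treated as a look-alike for "I", so the plain word and its obfuscated spelling collapse to the same string.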