Module: Squish

Defined in:
lib/squish.rb,
lib/squish/version.rb

Defined Under Namespace

Modules: SQUISH_VERSION
Classes: Bucket, Internal, Leaf, Node

Class Method Details

.all_bytes ⇒ Object

Returns a string containing all 256 possible byte values. The string is appended to the raw bucket dump so that the tree can handle every byte, since incoming documents may contain bytes that never appeared in the training data.



# File 'lib/squish.rb', line 356

def self.all_bytes #:nodoc:
  if !defined?(@all_bytes) || @all_bytes == nil
    all_bytes = ""
    for i in 0...256
      all_bytes << i.chr
    end
    @all_bytes = all_bytes
  end
  return @all_bytes
end
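
As a rough illustration of what this method returns, the sketch below builds an equivalent 256-byte string in a more compact way. The helper name `all_bytes_sketch` is hypothetical and is not part of Squish.

```ruby
# Hypothetical stand-alone helper, not part of Squish: builds a binary
# string holding every byte value from 0 to 255 exactly once, in order.
def all_bytes_sketch
  (0...256).to_a.pack("C*")
end

bytes = all_bytes_sketch
puts bytes.bytesize          # number of bytes in the string
puts bytes.bytes.uniq.size   # number of distinct byte values
```

Appending such a string to the training data means every byte value has been seen at least once before classification starts.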

.classify(document, buckets) ⇒ Object

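Classifies a document against an array of supplied buckets. Each bucket scores the document with its compress method, and the name of the bucket with the lowest score is returned (nil if no buckets are supplied).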



# File 'lib/squish.rb', line 37

def self.classify(document, buckets)
  best_result = nil
  best_score = nil
  for bucket in buckets
    score = bucket.compress(document)
    if best_score == nil || (score < best_score)
      best_score = score
      best_result = bucket.name
    end
  end
  return best_result
end
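
The selection rule in the listing above (lowest score wins) can be sketched without Squish itself. Everything named here is an assumption made for illustration: `ToyBucket`, `classify_sketch`, and the Zlib-based score are hypothetical, and the way Squish's own buckets compute a score is not shown in this section.

```ruby
require "zlib"

# Hypothetical stand-in for a bucket: it only needs `name` and `compress`.
# The score is the number of extra compressed bytes the document costs
# when appended to the bucket's training text (lower means a better fit).
class ToyBucket
  attr_reader :name

  def initialize(name, training_text)
    @name = name
    @training_text = training_text
  end

  def compress(document)
    base = Zlib::Deflate.deflate(@training_text).bytesize
    combined = Zlib::Deflate.deflate(@training_text + document).bytesize
    combined - base
  end
end

# Same selection rule as the listing above: keep the lowest score.
def classify_sketch(document, buckets)
  best = buckets.min_by { |bucket| bucket.compress(document) }
  best && best.name
end

buckets = [
  ToyBucket.new("spam", "buy cheap pills now " * 20),
  ToyBucket.new("ham", "meeting notes for the project review " * 20)
]
puts classify_sketch("cheap pills, buy now", buckets)
```

Any object that responds to `name` and `compress` can stand in for a bucket here, which is all the listing above relies on.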

.classify!(document, buckets) ⇒ Object

Classifies a document against an array of supplied buckets, then automatically adds the document to the bucket it was classified into.



# File 'lib/squish.rb', line 52

def self.classify!(document, buckets)
  result = self.classify(document, buckets)
  for bucket in buckets
    bucket << document if bucket.name == result
  end
  return result
end
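
The classify-then-train behaviour can be sketched with a deterministic stand-in bucket. `CountingBucket`, its character-counting score, and `classify_and_train` are all hypothetical names chosen for this illustration. One difference from the listing above: the original matches buckets by name, so several buckets sharing the winning name would each receive the document, whereas this sketch updates only the single winning bucket.

```ruby
# Hypothetical stand-in bucket: scores a document by how many of its
# characters have not yet been seen in the bucket (lower is better).
class CountingBucket
  attr_reader :name, :documents

  def initialize(name, seed)
    @name = name
    @documents = [seed]
  end

  def compress(document)
    seen = @documents.join.chars.uniq
    document.chars.count { |char| !seen.include?(char) }
  end

  # Training step: remember the document for later scoring.
  def <<(document)
    @documents << document
    self
  end
end

# Classify, then feed the document to the winning bucket only.
def classify_and_train(document, buckets)
  winner = buckets.min_by { |bucket| bucket.compress(document) }
  return nil if winner.nil?

  winner << document
  winner.name
end

letters = CountingBucket.new("letters", "abcdef")
digits = CountingBucket.new("digits", "012345")
puts classify_and_train("fade", [letters, digits])  # prints "letters"
puts letters.documents.size                         # prints 2
puts digits.documents.size                          # prints 1
```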

.filter_document(document) ⇒ Object

Filters every value in a document (a Hash) with filter_value and returns the results in a new Hash with the same keys.



# File 'lib/squish.rb', line 368

def self.filter_document(document) #:nodoc:
  filtered_document = {}
  for key in document.keys
    filtered_document[key] = filter_value(document[key])
  end
  return filtered_document
end
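
A minimal sketch of the same key-preserving transformation, assuming Ruby 2.4 or later for `Hash#transform_values`. The names `toy_filter_value` and `filter_document_sketch` are hypothetical, and the toy filter is far simpler than the real filter_value.

```ruby
# Hypothetical stand-in for the value filter: strips whitespace and
# lowercases. The real filter_value applies many more substitutions.
def toy_filter_value(value)
  value.to_s.gsub(/\s/, "").downcase
end

# Builds a new Hash with the same keys and filtered values, leaving
# the input Hash unmodified.
def filter_document_sketch(document)
  document.transform_values { |value| toy_filter_value(value) }
end

comment = { "author" => "Some One", "body" => "Hello  World", "votes" => 3 }
p filter_document_sketch(comment)
```

Because every value passes through `to_s`, non-string values such as the integer above come back as strings.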

.filter_value(value) ⇒ Object

Performs a visual reduction of the characters in the value, which prevents “1337” speak (leetspeak) from degrading the effectiveness of the algorithm. This is intentionally a VERY lossy algorithm, and it isn’t particularly efficient, but it works. Its main advantage is that, although some information may be lost from legitimate documents, more patterns are revealed in illegitimate documents, so more critical information is gained than lost.



# File 'lib/squish.rb', line 383

def self.filter_value(value) #:nodoc:
  filtered_value = value.to_s.dup
  
  # Remove whitespace because spammers sometimes insert extraneous
  # whitespace, and the main algorithm shouldn't give false positives due
  # to a lack of whitespace, but it may give false positives due to extra
  # whitespace.
  filtered_value.gsub!(/\s/, "")
  
  filtered_value.gsub!(/~/, "-")
  filtered_value.gsub!(/\|/, "I")
  filtered_value.gsub!(/!/, "I")
  filtered_value.gsub!(/1/, "I")
  filtered_value.gsub!(/l/, "I")
  filtered_value.gsub!(/\+/, "t")
  filtered_value.gsub!(/3/, "e")
  filtered_value.gsub!(/7/, "T")
  filtered_value.gsub!(/@/, "a")
  filtered_value.gsub!(/4/, "A")
  filtered_value.gsub!(/8/, "B")
  filtered_value.gsub!(/6/, "G")
  filtered_value.gsub!(/\$/, "S")
  filtered_value.gsub!(/0/, "O")
  filtered_value.gsub!(/\(\)/, "O")
  filtered_value.gsub!(/I\)/, "D")
  filtered_value.gsub!(/\]\)/, "D")
  filtered_value.gsub!(/\[\)/, "D")
  filtered_value.gsub!(/I\*/, "P")
  filtered_value.gsub!(/\]\*/, "P")
  filtered_value.gsub!(/\*/, "a")
  filtered_value.gsub!(/I2/, "R")
  filtered_value.gsub!(/I=/, "F")
  filtered_value.gsub!(/I\\I/, "N")
  filtered_value.gsub!(/\`\//, "Y")
  filtered_value.gsub!(/\/\\\/\\/, "M")
  filtered_value.gsub!(/\\\/\\\//, "W")
  filtered_value.gsub!(/I\\\/I/, "M")
  filtered_value.gsub!(/IVI/i, "M")
  filtered_value.gsub!(/VV/, "W")
  filtered_value.gsub!(/\\X\//, "W")
  filtered_value.gsub!(/\/\\\//, "N")
  filtered_value.gsub!(/\\\/\\/, "N")
  filtered_value.gsub!(/\/V\\/i, "M")
  filtered_value.gsub!(/\/V/i, "N")
  filtered_value.gsub!(/\\N/, "W")
  filtered_value.gsub!(/\\\//, "V")
  filtered_value.gsub!(/\>\</, "X")
  filtered_value.gsub!(/I-I/, "H")
  filtered_value.gsub!(/\]-\[/, "H")
  filtered_value.gsub!(/\}\{/, "H")
  filtered_value.gsub!(/I_I/, "U")
  filtered_value.gsub!(/I\</, "K")
  filtered_value.gsub!(/\]\</, "K")
  filtered_value.gsub!(/\(/, "C")
  filtered_value.gsub!(/\//, "I")
  filtered_value.gsub!(/\\/, "I")
  filtered_value.downcase!
  
  return filtered_value
end
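
The effect of the substitutions is easier to see on a few inputs. The sketch below applies a small, ordered subset of the rules from the listing above in table-driven form. `LEET_RULES` and `filter_value_sketch` are hypothetical names, and because only some of the rules are included, results for other inputs may differ from the full method.

```ruby
# A small, ordered subset of the substitutions from the listing above.
# Order matters: "1" must become "I" before the two-character shape
# "I)" can be recognised as "D".
LEET_RULES = [
  [/\s/, ""],
  [/1/, "I"],
  [/3/, "e"],
  [/@/, "a"],
  [/\$/, "S"],
  [/0/, "O"],
  [/I\)/, "D"],
  [/\\\//, "V"]
].freeze

def filter_value_sketch(value)
  filtered = value.to_s.dup
  LEET_RULES.each { |pattern, replacement| filtered.gsub!(pattern, replacement) }
  filtered.downcase
end

puts filter_value_sketch("FR33 M0N3Y")   # prints "freemoney"
puts filter_value_sketch("\\/0TE")       # prints "vote"
puts filter_value_sketch("1)ISCOUNT$")   # prints "discounts"
```

Keeping the rules in an ordered table makes the dependency between them explicit, which the long run of gsub! calls leaves implicit.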