Module: DeBiLinguifier

Extended by:
DeBiLinguifier
Included in:
DeBiLinguifier
Defined in:
lib/debilinguifier.rb

Overview

This module only works with capital, accent-free (de-toned) characters from the Latin and Greek charsets.
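Since the module expects capital, de-toned input, callers may want to normalize free-form text first. The following is a minimal sketch of one way to do that; `normalize_for_dbl` is a hypothetical helper that is not part of DeBiLinguifier, and it assumes Ruby 2.4 or later for Unicode-aware `upcase`.

```ruby
# Hypothetical helper, not part of DeBiLinguifier: brings free-form text into the
# capital, accent-free form that the module's regular expressions expect.
def normalize_for_dbl(text)
  decomposed = text.unicode_normalize(:nfd) # split accented letters into base letter + combining mark
  stripped   = decomposed.gsub(/\p{Mn}/, '') # drop the combining marks (tonos, dialytika, ...)
  stripped.upcase                            # capitalize the remaining letters
end

puts normalize_for_dbl("Καλημέρα, Athens") # prints "ΚΑΛΗΜΕΡΑ, ATHENS"
```

The result can then be passed to `dbl`.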

Constant Summary

SYMBOLS =

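The non-letter characters (whitespace, digits, and common punctuation), written as an escaped character-class fragment. Every pattern below accepts them, and return_in_mixed_charset uses them as word delimiters.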

'\s\.\,\@\d\-\(\)\:\/\&\''.freeze
GREEK_LOOKING_CHARS =

A regular expression to check whether every character of the input phrase is a Greek capital, a Latin capital that looks like a Greek one, or a symbol

Regexp.new("(^[Α-ΩABEHIKMNOPTXYZ#{SYMBOLS}]+)+$").freeze
LATIN_LOOKING_CHARS =

A regular expression to check whether every character of the input phrase is a Latin capital, a Greek capital that looks like a Latin one, or a symbol

Regexp.new("(^[A-ZΑΒΕΗΙΚΜΝΟΡΤΥΧΖ#{SYMBOLS}]+)+$").freeze
LATIN_ALPHABET_PLUS_SYMBOLS =

A regular expression to match strings already written using only the Latin charset (plus symbols)

Regexp.new("(^[A-Z#{SYMBOLS}]+)+$").freeze
GREEK_ALPHABET_PLUS_SYMBOLS =

A regular expression to match strings already written using only the Greek charset (plus symbols)

Regexp.new("(^[Α-Ω#{SYMBOLS}]+)+$").freeze
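One detail of these patterns is worth illustrating: in Ruby, `^` and `$` anchor to lines rather than to the whole string, and `\s` in SYMBOLS also matches a newline. The sketch below copies two of the constants from above and compares them with `STRICT_LATIN`, a hypothetical variant (not in the library) that uses `\A` and `\z`.

```ruby
# Copied from the constant listing above.
SYMBOLS = '\s\.\,\@\d\-\(\)\:\/\&\''.freeze
LATIN_ALPHABET_PLUS_SYMBOLS = Regexp.new("(^[A-Z#{SYMBOLS}]+)+$").freeze

# Hypothetical stricter variant (an assumption, not library code):
# \A and \z anchor to the whole string instead of to individual lines.
STRICT_LATIN = Regexp.new("\\A[A-Z#{SYMBOLS}]+\\z").freeze

single_line = "ACME & CO."
two_lines   = "ACME\n\u03A6" # the second line is a Greek capital phi

p LATIN_ALPHABET_PLUS_SYMBOLS.match?(single_line) # => true
p LATIN_ALPHABET_PLUS_SYMBOLS.match?(two_lines)   # => true, the first line alone satisfies ^...$
p STRICT_LATIN.match?(single_line)                # => true
p STRICT_LATIN.match?(two_lines)                  # => false
```

For single-line input, which is what the module appears to target, both forms behave the same.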

Instance Method Summary

Instance Method Details

#can_write_only_greek?(input) ⇒ Boolean

Determine if the whole phrase can be written using only the Greek charset

Returns:

  • (Boolean)


# File 'lib/debilinguifier.rb', line 61

def can_write_only_greek?(input)
  !!(input.match(GREEK_LOOKING_CHARS))
end
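As a rough illustration of this predicate, the sketch below rebuilds GREEK_LOOKING_CHARS with the Greek range written as `\u` escapes (an equivalent spelling chosen here so the Greek letters cannot be confused with their Latin lookalikes) and tries it on two inputs.

```ruby
SYMBOLS = '\s\.\,\@\d\-\(\)\:\/\&\''.freeze
# Same pattern as GREEK_LOOKING_CHARS above; \u0391-\u03A9 is the Greek range Α-Ω.
GREEK_LOOKING_CHARS = Regexp.new("(^[\u0391-\u03A9ABEHIKMNOPTXYZ#{SYMBOLS}]+)+$").freeze

def can_write_only_greek?(input)
  !!(input.match(GREEK_LOOKING_CHARS))
end

mixed   = "K\u0391\u039B\u0397"        # Latin K followed by Greek ΑΛΗ
blocked = "\u039A\u0391\u039B\u0397 C" # Greek ΚΑΛΗ, then a Latin C, which has no Greek lookalike

p can_write_only_greek?(mixed)   # => true
p can_write_only_greek?(blocked) # => false
```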

#can_write_only_latin?(input) ⇒ Boolean

Determine if the whole phrase can be written using only the Latin charset

Returns:

  • (Boolean)


# File 'lib/debilinguifier.rb', line 66

def can_write_only_latin?(input)
  !!(input.match(LATIN_LOOKING_CHARS))
end
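The two `can_write_only_*?` predicates are not mutually exclusive: a word made up entirely of lookalike letters satisfies both. The sketch below shows this under the same constants as above; `GREEK_LIKE` is a hypothetical name introduced here for the Greek lookalike capitals, written as `\u` escapes.

```ruby
SYMBOLS = '\s\.\,\@\d\-\(\)\:\/\&\''.freeze
# Hypothetical constant name: the Greek capitals ΑΒΕΗΙΚΜΝΟΡΤΧΥΖ as escapes.
GREEK_LIKE = "\u0391\u0392\u0395\u0397\u0399\u039A\u039C\u039D\u039F\u03A1\u03A4\u03A7\u03A5\u0396".freeze
GREEK_LOOKING_CHARS = Regexp.new("(^[\u0391-\u03A9ABEHIKMNOPTXYZ#{SYMBOLS}]+)+$").freeze
LATIN_LOOKING_CHARS = Regexp.new("(^[A-Z#{GREEK_LIKE}#{SYMBOLS}]+)+$").freeze

def can_write_only_greek?(input)
  !!(input.match(GREEK_LOOKING_CHARS))
end

def can_write_only_latin?(input)
  !!(input.match(LATIN_LOOKING_CHARS))
end

tomato = "T\u039FM\u0391T\u039F" # Latin T, M, T mixed with Greek Ο, Α, Ο
sale   = "S\u0391LE"             # Latin S, L, E mixed with Greek Α

p [can_write_only_latin?(tomato), can_write_only_greek?(tomato)] # => [true, true]
p [can_write_only_latin?(sale),   can_write_only_greek?(sale)]   # => [true, false]
```

Because `dbl` tests the Greek predicate first, an input like the first one ends up in Greek.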

#dbl(input, bias = 'greek') ⇒ String

Only works with the Latin and Greek charsets. An input phrase can only be one of five things:

1) Already written only in the Greek or only in the Latin charset.
2) Written in a mixed charset, but can be written with just the Greek charset.
3) Written in a mixed charset, but can be written with just the Latin charset.
4) Written in a mixed charset, and cannot be written with only one of the [Greek, Latin] charsets.

   In this case we split the phrase into words and apply the above rules to each word separately.
   If case 4 applies to a single word, then we have to return it greek-ified or latin-ified.
   This way we will be able to produce SQL queries in a more deterministic way.
   (When searching for a phrase that was processed by dbl before being written to the db,
   the phrase we are looking for must also be processed through dbl before querying the db.)

5) Written in a mixed charset, but can be written either with just the Greek charset or with just the Latin charset
   (Greek bias is the default and only behavior in this case).

Note: We are deliberately ignoring case 5, as it is of no use at the moment as a separate case. It is actually the intersection of cases 2 and 3, so case 2 is used instead.

Returns:

  • (String)

    the de-bi-linguized string



# File 'lib/debilinguifier.rb', line 38

def dbl(input, bias='greek')
  if(is_greek_only?(input) || is_latin_only?(input)) # Case 1
    input
  elsif(can_write_only_greek?(input))                # Case 2
    return_in_greek(input)
  elsif(can_write_only_latin?(input))                # Case 3
    return_in_latin(input)
  else                                               # Case 4
    return_in_mixed_charset(input, bias)
  end
end
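To see the cases end to end, the listings on this page can be assembled into one runnable sketch. The module name `DeBiLinguifierSketch`, the use of `extend self`, and the constants `LATIN_LIKE` and `GREEK_LIKE` are assumptions made so the example is self-contained; the predicates are inlined and Greek letters are written as `\u` escapes to keep them distinct from Latin lookalikes.

```ruby
# Sketch assembled from the method listings on this page (names below that do
# not appear in the listings are assumptions made for illustration).
module DeBiLinguifierSketch
  extend self

  SYMBOLS    = '\s\.\,\@\d\-\(\)\:\/\&\''.freeze
  LATIN_LIKE = "ABEHIKMNOPTXYZ".freeze # Latin capitals that have a Greek lookalike
  GREEK_LIKE = "\u0391\u0392\u0395\u0397\u0399\u039A\u039C\u039D\u039F\u03A1\u03A4\u03A7\u03A5\u0396".freeze # same order

  GREEK_LOOKING_CHARS         = Regexp.new("(^[\u0391-\u03A9#{LATIN_LIKE}#{SYMBOLS}]+)+$").freeze
  LATIN_LOOKING_CHARS         = Regexp.new("(^[A-Z#{GREEK_LIKE}#{SYMBOLS}]+)+$").freeze
  LATIN_ALPHABET_PLUS_SYMBOLS = Regexp.new("(^[A-Z#{SYMBOLS}]+)+$").freeze
  GREEK_ALPHABET_PLUS_SYMBOLS = Regexp.new("(^[\u0391-\u03A9#{SYMBOLS}]+)+$").freeze

  def dbl(input, bias = 'greek')
    if input.match?(GREEK_ALPHABET_PLUS_SYMBOLS) || input.match?(LATIN_ALPHABET_PLUS_SYMBOLS)
      input                               # Case 1: already in a single charset
    elsif input.match?(GREEK_LOOKING_CHARS)
      input.tr(LATIN_LIKE, GREEK_LIKE)    # Case 2 (also absorbs case 5)
    elsif input.match?(LATIN_LOOKING_CHARS)
      input.tr(GREEK_LIKE, LATIN_LIKE)    # Case 3
    else
      return_in_mixed_charset(input, bias) # Case 4
    end
  end

  def return_in_mixed_charset(input, bias)
    words = input.split(/(?<=[#{SYMBOLS}])/)
    if words.length == 1
      case bias
      when 'greek' then input.tr(LATIN_LIKE, GREEK_LIKE)
      when 'latin' then input.tr(GREEK_LIKE, LATIN_LIKE)
      else input
      end
    else
      words.map { |word| dbl(word) }.join # as in the listing: each word uses the default bias
    end
  end
end

p DeBiLinguifierSketch.dbl("ACME 42")               # case 1: returned unchanged
p DeBiLinguifierSketch.dbl("K\u0391\u039B\u0397")   # case 2: Latin K becomes Greek Κ
p DeBiLinguifierSketch.dbl("S\u0391LE")             # case 3: => "SALE"
p DeBiLinguifierSketch.dbl("\u03A6IX S\u0391LE")    # case 4: first word goes Greek, second goes Latin
```

Searching stored values then amounts to passing the search term through the same function before building the query, as the description above suggests.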

#is_greek_only?(input) ⇒ Boolean

Determine if the input phrase is already written only in the Greek charset

Returns:

  • (Boolean)


# File 'lib/debilinguifier.rb', line 51

def is_greek_only?(input)
  !!(input.match(GREEK_ALPHABET_PLUS_SYMBOLS))
end

#is_latin_only?(input) ⇒ Boolean

Determine if the input phrase is already written only in the Latin charset

Returns:

  • (Boolean)


# File 'lib/debilinguifier.rb', line 56

def is_latin_only?(input)
  !!(input.match(LATIN_ALPHABET_PLUS_SYMBOLS))
end
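Because SYMBOLS is part of both patterns, an input that contains only digits and punctuation counts as "Greek only" and as "Latin only" at the same time, while lowercase input satisfies neither. A small sketch, with the constants copied from above and the Greek range written as `\u` escapes:

```ruby
SYMBOLS = '\s\.\,\@\d\-\(\)\:\/\&\''.freeze
LATIN_ALPHABET_PLUS_SYMBOLS = Regexp.new("(^[A-Z#{SYMBOLS}]+)+$").freeze
GREEK_ALPHABET_PLUS_SYMBOLS = Regexp.new("(^[\u0391-\u03A9#{SYMBOLS}]+)+$").freeze

def is_greek_only?(input)
  !!(input.match(GREEK_ALPHABET_PLUS_SYMBOLS))
end

def is_latin_only?(input)
  !!(input.match(LATIN_ALPHABET_PLUS_SYMBOLS))
end

p is_latin_only?("12:30")   # => true
p is_greek_only?("12:30")   # => true
p is_latin_only?("room 12") # => false, lowercase letters are outside the pattern
```

In `dbl` this overlap is harmless, since either result leads to case 1 and the input is returned unchanged.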

#return_in_greek(input) ⇒ Object

Return the phrase using Greek characters only



# File 'lib/debilinguifier.rb', line 71

def return_in_greek(input)
  input.tr('abehikmnoptxyz'.upcase, 'αβεηικμνορτχυζ'.upcase)
end
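The `tr` call pairs each Latin capital with the Greek capital in the same position. The sketch below prints that pairing with code points; `LATIN_LIKE` and `GREEK_LIKE` are hypothetical constant names holding the same fourteen letters as the listing, with the Greek ones spelled as `\u` escapes.

```ruby
# Same letters and order as in return_in_greek above (constant names are assumptions).
LATIN_LIKE = "ABEHIKMNOPTXYZ".freeze
GREEK_LIKE = "\u0391\u0392\u0395\u0397\u0399\u039A\u039C\u039D\u039F\u03A1\u03A4\u03A7\u03A5\u0396".freeze

def return_in_greek(input)
  input.tr(LATIN_LIKE, GREEK_LIKE)
end

# Print each Latin capital next to its Greek twin, with both code points.
LATIN_LIKE.chars.zip(GREEK_LIKE.chars).each do |latin, greek|
  puts format("%s U+%04X -> %s U+%04X", latin, latin.ord, greek, greek.ord)
end

p return_in_greek("XAPTI")                    # every letter has a Greek twin
p return_in_greek("CAFE") == "C\u0391F\u0395" # => true; C and F have no twin and stay Latin
```

Letters without a lookalike pass through untouched, which is why `dbl` checks `can_write_only_greek?` before calling this method.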

#return_in_latin(input) ⇒ Object

Return the phrase using Latin characters only



# File 'lib/debilinguifier.rb', line 76

def return_in_latin(input)
  input.tr('αβεηικμνορτχυζ'.upcase, 'abehikmnoptxyz'.upcase) 
end

#return_in_mixed_charset(input, bias) ⇒ Object

Return the phrase using both charsets



# File 'lib/debilinguifier.rb', line 81

def return_in_mixed_charset(input, bias)
  # Split the phrase into words and recursively try to return each word in the "correct" charset.
  # If that is not possible (e.g. a word contains both "Φ" and "C"), the word must be either greek-ified (default)
  # or latin-ified. This keeps SQL queries workable, as long as the word or phrase
  # we are looking for has also been passed through dbl.
  # We first split the input phrase after each of the SYMBOLS delimiters.
  words_arr = input.split(/(?<=[#{SYMBOLS}])/)
  if words_arr.length == 1            # If it was only one word, return it according to the bias.
    if bias == 'greek'
      return return_in_greek(words_arr.join.to_s)  # If the bias is 'greek', return the word greek-ified.
    elsif bias == 'latin'
      return return_in_latin(words_arr.join.to_s)  # Else if the bias is 'latin', return it latin-ified.
    else
      return (words_arr.join.to_s)                 # Else return it as-is (not advisable!)
    end
  else                                # Else apply dbl to each word we got after splitting the input.
    words_arr2 = []
    words_arr.each do |word|
      words_arr2 << dbl(word)         # Note: bias is not forwarded, so each word uses dbl's default ('greek').
    end
    return words_arr2.join.to_s
  end
end
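The split in this method relies on a zero-width lookbehind, so every delimiter stays attached to the end of the piece before it and the pieces can be joined back without loss. The sketch below shows the effect on a sample phrase; note that digits belong to SYMBOLS, so each digit becomes its own piece.

```ruby
# Copied from the constant listing above.
SYMBOLS = '\s\.\,\@\d\-\(\)\:\/\&\''.freeze

phrase = "AB-12 CD"
# Split at every position that is preceded by one of the SYMBOLS characters.
parts = phrase.split(/(?<=[#{SYMBOLS}])/)

p parts                 # => ["AB-", "1", "2", " ", "CD"]
p parts.join == phrase  # => true, nothing is lost by splitting and rejoining
```

This is why the method can simply `join` the processed words at the end without re-inserting separators.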