Class: FriendlyId::SlugString

Inherits:
ActiveSupport::Multibyte::Chars
  • Object
show all
Defined in:
lib/friendly_id/slug_string.rb

Overview

This class provides some string-manipulation methods specific to slugs. Its Unicode support is provided by ActiveSupport::Multibyte::Chars; this is needed primarily for Unicode encoding normalization and proper calculation of string lengths.

Note that this class includes many “bang methods” such as #clean! and #normalize! that perform actions on the string in-place. Each of these methods has a corresponding “bangless” method (i.e., SlugString#clean! and SlugString#clean) which does not appear in the documentation because it is generated dynamically.

All of the bang methods return an instance of String, while the bangless versions return an instance of FriendlyId::SlugString, so that calls to methods specific to this class can be chained:

string = SlugString.new("hello world")
string.with_dashes! # => "hello-world"
string.with_dashes  # => <FriendlyId::SlugString:0x000001013e1590 @wrapped_string="hello-world">

See Also:

Constant Summary collapse

APPROXIMATIONS =

All values are Unicode decimal characters or character arrays.

{
  :common => Hash[
    192, 65, 193, 65, 194, 65, 195, 65, 196, 65, 197, 65, 198, [65, 69],
    199, 67, 200, 69, 201, 69, 202, 69, 203, 69, 204, 73, 205, 73, 206,
    73, 207, 73, 208, 68, 209, 78, 210, 79, 211, 79, 212, 79, 213, 79,
    214, 79, 215, 120, 216, 79, 217, 85, 218, 85, 219, 85, 220, 85, 221,
    89, 222, [84, 104], 223, [115, 115], 224, 97, 225, 97, 226, 97, 227,
    97, 228, 97, 229, 97, 230, [97, 101], 231, 99, 232, 101, 233, 101,
    234, 101, 235, 101, 236, 105, 237, 105, 238, 105, 239, 105, 240, 100,
    241, 110, 242, 111, 243, 111, 244, 111, 245, 111, 246, 111, 248, 111,
    249, 117, 250, 117, 251, 117, 252, 117, 253, 121, 254, [116, 104],
    255, 121, 256, 65, 257, 97, 258, 65, 259, 97, 260, 65, 261, 97, 262,
    67, 263, 99, 264, 67, 265, 99, 266, 67, 267, 99, 268, 67, 269, 99,
    270, 68, 271, 100, 272, 68, 273, 100, 274, 69, 275, 101, 276, 69, 277,
    101, 278, 69, 279, 101, 280, 69, 281, 101, 282, 69, 283, 101, 284, 71,
    285, 103, 286, 71, 287, 103, 288, 71, 289, 103, 290, 71, 291, 103,
    292, 72, 293, 104, 294, 72, 295, 104, 296, 73, 297, 105, 298, 73, 299,
    105, 300, 73, 301, 105, 302, 73, 303, 105, 304, 73, 305, 105, 306,
    [73, 74], 307, [105, 106], 308, 74, 309, 106, 310, 75, 311, 107, 312,
    107, 313, 76, 314, 108, 315, 76, 316, 108, 317, 76, 318, 108, 319, 76,
    320, 108, 321, 76, 322, 108, 323, 78, 324, 110, 325, 78, 326, 110,
    327, 78, 328, 110, 329, [39, 110], 330, [78, 71], 331, [110, 103],
    332, 79, 333, 111, 334, 79, 335, 111, 336, 79, 337, 111, 338, [79,
    69], 339, [111, 101], 340, 82, 341, 114, 342, 82, 343, 114, 344, 82,
    345, 114, 346, 83, 347, 115, 348, 83, 349, 115, 350, 83, 351, 115,
    352, 83, 353, 115, 354, 84, 355, 116, 356, 84, 357, 116, 358, 84, 359,
    116, 360, 85, 361, 117, 362, 85, 363, 117, 364, 85, 365, 117, 366, 85,
    367, 117, 368, 85, 369, 117, 370, 85, 371, 117, 372, 87, 373, 119,
    374, 89, 375, 121, 376, 89, 377, 90, 378, 122, 379, 90, 380, 122, 381,
    90, 382, 122
  ].freeze,
  :german => Hash[252, [117, 101], 246, [111, 101], 228, [97, 101]],
  :spanish => Hash[209, [78, 110], 241, [110, 110]]
}
CP1252 =

CP-1252 decimal byte => UTF-8 approximation as an array of bytes

{
  128 => [226, 130, 172],
  129 => nil,
  130 => [226, 128, 154],
  131 => [198, 146],
  132 => [226, 128, 158],
  133 => [226, 128, 166],
  134 => [226, 128, 160],
  135 => [226, 128, 161],
  136 => [203, 134],
  137 => [226, 128, 176],
  138 => [197, 160],
  139 => [226, 128, 185],
  140 => [197, 146],
  141 => nil,
  142 => [197, 189],
  143 => nil,
  144 => nil,
  145 => [226, 128, 152],
  146 => [226, 128, 153],
  147 => [226, 128, 156],
  148 => [226, 128, 157],
  149 => [226, 128, 162],
  150 => [226, 128, 147],
  151 => [226, 128, 148],
  152 => [203, 156],
  153 => [226, 132, 162],
  154 => [197, 161],
  155 => [226, 128, 186],
  156 => [197, 147],
  157 => nil,
  158 => [197, 190],
  159 => [197, 184]
}

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(string) ⇒ SlugString

Returns a new instance of SlugString.

Parameters:

  • string (String)

    The string to use as the basis of the SlugString.



147
148
149
150
# File 'lib/friendly_id/slug_string.rb', line 147

def initialize(string)
  super string.to_s
  tidy_bytes!
end

Class Method Details

.dump_approximationsObject

This method can be used by developers wishing to debug the APPROXIMATIONS hashes, which are written in a hard-to-read format. => {“À”=>“A”, “Á”=>“A”, “”=>“A”, “Ô=>“A”, “Ä”=>“A”, “Å”=>“A”, “Æ”=>“AE”, “Ç”=>“C”, “È”=>“E”, “É”=>“E”, “Ê”=>“E”, “Ë”=>“E”, “Ì”=>“I”, “Í”=>“I”, “Δ=>“I”, “Ï”=>“I”, “Д=>“D”, “Ñ”=>“N”, “Ò”=>“O”, “Ó”=>“O”, “Ô”=>“O”, “Õ”=>“O”, “Ö”=>“O”, “×”=>“x”, “Ø”=>“O”, “Ù”=>“U”, “Ú”=>“U”, “Û”=>“U”, “Ü”=>“U”, “Ý”=>“Y”, “Þ”=>“Th”, “ß”=>“ss”, “à”=>“a”, “á”=>“a”, “â”=>“a”, “ã”=>“a”, “ä”=>“a”, “å”=>“a”, “æ”=>“ae”, “ç”=>“c”, “è”=>“e”, “é”=>“e”, “ê”=>“e”, “ë”=>“e”, “ì”=>“i”, “í”=>“i”, “î”=>“i”, “ï”=>“i”, “ð”=>“d”, “ñ”=>“n”, “ò”=>“o”, “ó”=>“o”, “ô”=>“o”, “õ”=>“o”, “ö”=>“o”, “ø”=>“o”, “ù”=>“u”, “ú”=>“u”, “û”=>“u”, “ü”=>“u”, “ý”=>“y”, “þ”=>“th”, “ÿ”=>“y”, “Ā”=>“A”, “ā”=>“a”, “Ă”=>“A”, “ă”=>“a”, “Ą”=>“A”, “ą”=>“a”, “Ć”=>“C”, “ć”=>“c”, “Ĉ”=>“C”, “ĉ”=>“c”, “Ċ”=>“C”, “ċ”=>“c”, “Č”=>“C”, “č”=>“c”, “Ď”=>“D”, “ď”=>“d”, “Đ”=>“D”, “đ”=>“d”, “Ē”=>“E”, “ē”=>“e”, “Ĕ”=>“E”, “ĕ”=>“e”, “Ė”=>“E”, “ė”=>“e”, “Ę”=>“E”, “ę”=>“e”, “Ě”=>“E”, “ě”=>“e”, “Ĝ”=>“G”, “ĝ”=>“g”, “Ğ”=>“G”, “ğ”=>“g”, “Ġ”=>“G”, “ġ”=>“g”, “Ģ”=>“G”, “ģ”=>“g”, “Ĥ”=>“H”, “ĥ”=>“h”, “Ħ”=>“H”, “ħ”=>“h”, “Ĩ”=>“I”, “ĩ”=>“i”, “Ī”=>“I”, “ī”=>“i”, “Ĭ”=>“I”, “ĭ”=>“i”, “Į”=>“I”, “į”=>“i”, “İ”=>“I”, “ı”=>“i”, “IJ”=>“IJ”, “ij”=>“ij”, “Ĵ”=>“J”, “ĵ”=>“j”, “Ķ”=>“K”, “ķ”=>“k”, “ĸ”=>“k”, “Ĺ”=>“L”, “ĺ”=>“l”, “Ļ”=>“L”, “ļ”=>“l”, “Ľ”=>“L”, “ľ”=>“l”, “Ŀ”=>“L”, “ŀ”=>“l”, “Ł”=>“L”, “ł”=>“l”, “Ń”=>“N”, “ń”=>“n”, “Ņ”=>“N”, “ņ”=>“n”, “Ň”=>“N”, “ň”=>“n”, “ʼn”=>“‘n”, “Ŋ”=>“NG”, “ŋ”=>“ng”, “Ō”=>“O”, “ō”=>“o”, “Ŏ”=>“O”, “ŏ”=>“o”, “Ő”=>“O”, “ő”=>“o”, “Œ”=>“OE”, “œ”=>“oe”, “Ŕ”=>“R”, “ŕ”=>“r”, “Ŗ”=>“R”, “ŗ”=>“r”, “Ř”=>“R”, “ř”=>“r”, “Ś”=>“S”, “ś”=>“s”, “Ŝ”=>“S”, “ŝ”=>“s”, “Ş”=>“S”, “ş”=>“s”, “Š”=>“S”, “š”=>“s”, “Ţ”=>“T”, “ţ”=>“t”, “Ť”=>“T”, “ť”=>“t”, “Ŧ”=>“T”, “ŧ”=>“t”, “Ũ”=>“U”, “ũ”=>“u”, “Ū”=>“U”, “ū”=>“u”, “Ŭ”=>“U”, “ŭ”=>“u”, “Ů”=>“U”, “ů”=>“u”, “Ű”=>“U”, “ű”=>“u”, “Ų”=>“U”, “ų”=>“u”, “Ŵ”=>“W”, “ŵ”=>“w”, “Ŷ”=>“Y”, “ŷ”=>“y”, “Ÿ”=>“Y”, “Ź”=>“Z”, “ź”=>“z”, “Ż”=>“Z”, “ż”=>“z”, “Ž”=>“Z”, “ž”=>“z”, :german => “ö”=>“oe”, “ä”=>“ae”, :spanish => “ñ”=>“nn”}

Examples:


> ruby -rrubygems -rlib/friendly_id -e 'p FriendlyId::SlugString.dump_approximations'

Returns:

  • Hash



139
140
141
142
143
# File 'lib/friendly_id/slug_string.rb', line 139

def self.dump_approximations
  Hash[APPROXIMATIONS.map do |name, approx|
    [name, Hash[approx.map {|key, value| [[key].pack("U*"), [value].flatten.pack("U*")]}]]
  end]
end

Instance Method Details

#approximate_ascii!(*args) ⇒ Object

Approximate an ASCII string. This works only for Western strings using characters that are Roman-alphabet characters + diacritics. Non-letter characters are left unmodified.

string = SlugString.new "Łódź, Poland"
string.approximate_ascii                 # => "Lodz, Poland"
string = SlugString.new "日本"
string.approximate_ascii                 # => "日本"

You can pass any key(s) from APPROXIMATIONS as arguments. This allows for contextual approximations. By default; :spanish and :german are provided:

string = SlugString.new "Jürgen Müller"
string.approximate_ascii                 # => "Jurgen Muller"
string.approximate_ascii :german        # => "Juergen Mueller"
string = SlugString.new "¡Feliz año!"
string.approximate_ascii                 # => "¡Feliz ano!"
string.approximate_ascii :spanish       # => "¡Feliz anno!"

You can modify the built-in approximations, or add your own:

# Make Spanish use "nh" rather than "nn"
FriendlyId::SlugString::APPROXIMATIONS[:spanish] = {
  # Ñ => "Nh"
  209 => [78, 104],
  # ñ => "nh"
  241 => [110, 104]
}

It’s also possible to use a custom approximation for all strings:

FriendlyId::SlugString.approximations << :german

Notice that this method does not simply convert to ASCII; if you want to remove non-ASCII characters such as “¡” and “¿”, use #to_ascii!:

string.approximate_ascii!(:spanish)       # => "¡Feliz anno!"
string.to_ascii!                          # => "Feliz anno!"

Parameters:

  • *args (Symbol)

Returns:

  • String



193
194
195
196
# File 'lib/friendly_id/slug_string.rb', line 193

def approximate_ascii!(*args)
  @maps = (self.class.approximations + args + [:common]).flatten.uniq
  @wrapped_string = normalize_utf8(:c).unpack("U*").map { |char| approx_char(char) }.flatten.pack("U*")
end

#clean!Object

Removes leading and trailing spaces or dashses, and replaces multiple whitespace characters with a single space.

Returns:

  • String



201
202
203
# File 'lib/friendly_id/slug_string.rb', line 201

def clean!
  @wrapped_string = @wrapped_string.gsub(/\A\-|\-\z/, "").gsub(/\s+/u, " ").strip
end

#downcase!Object

Lowercases the string. Note that this works for Unicode strings, though your milage may vary with Greek and Turkic strings.

Returns:

  • String



208
209
210
# File 'lib/friendly_id/slug_string.rb', line 208

def downcase!
  @wrapped_string = apply_mapping :lowercase_mapping
end

#normalize!Object

Normalize the string for use as a FriendlyId. Note that in this context, normalize means, strip, remove non-letters/numbers, downcasing and converting whitespace to dashes. ActiveSupport::Multibyte::Chars#normalize is aliased to normalize_utf8 in this subclass.

Returns:

  • String



251
252
253
254
255
256
# File 'lib/friendly_id/slug_string.rb', line 251

def normalize!
  clean!
  word_chars!
  downcase!
  with_dashes!
end

#normalize_for!(config) ⇒ Object

Normalize the string for a given Configuration.

Parameters:

Returns:

  • String



237
238
239
240
241
# File 'lib/friendly_id/slug_string.rb', line 237

def normalize_for!(config)
  approximate_ascii! if config.approximate_ascii?
  to_ascii! if config.strip_non_ascii?
  normalize!
end

#tidy_bytes!(force = false) ⇒ Object

Attempt to replace invalid UTF-8 bytes with valid ones. This method naively assumes if you have invalid UTF8 bytes, they are either Windows CP-1252 or ISO8859-1. In practice this isn’t a bad assumption, but may not always work.

Passing true will forcibly tidy all bytes, assuming that the string’s encoding is CP-1252 or ISO-8859-1.



265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
# File 'lib/friendly_id/slug_string.rb', line 265

def tidy_bytes!(force = false)

  if force
    @wrapped_string = @wrapped_string.unpack("C*").map do |b|
      tidy_byte(b)
    end.flatten.compact.pack("C*").unpack("U*").pack("U*")
  end

  bytes = @wrapped_string.unpack("C*")
  conts_expected = 0
  last_lead = 0

  bytes.each_index do |i|

    byte          = bytes[i]
    is_ascii      = byte < 128
    is_cont       = byte > 127 && byte < 192
    is_lead       = byte > 191 && byte < 245
    is_unused     = byte > 240
    is_restricted = byte > 244

    # Impossible or highly unlikely byte? Clean it.
    if is_unused || is_restricted
      bytes[i] = tidy_byte(byte)
    elsif is_cont
      # Not expecting contination byte? Clean up. Otherwise, now expect one less.
      conts_expected == 0 ? bytes[i] = tidy_byte(byte) : conts_expected -= 1
    else
      if conts_expected > 0
        # Expected continuation, but got ASCII or leading? Clean backwards up to
        # the leading byte.
        (1..(i - last_lead)).each {|j| bytes[i - j] = tidy_byte(bytes[i - j])}
        conts_expected = 0
      end
      if is_lead
        # Final byte is leading? Clean it.
        if i == bytes.length - 1
          bytes[i] = tidy_byte(bytes.last)
        else
          # Valid leading byte? Expect continuations determined by position of
          # first zero bit, with max of 3.
          conts_expected = byte < 224 ? 1 : byte < 240 ? 2 : 3
          last_lead = i
        end
      end
    end
  end
  @wrapped_string = bytes.empty? ? "" : bytes.flatten.compact.pack("C*").unpack("U*").pack("U*")
end

#to_ascii!Object

Delete any non-ascii characters.

Returns:

  • String



317
318
319
320
321
322
323
324
# File 'lib/friendly_id/slug_string.rb', line 317

def to_ascii!
  if ">= 1.9".respond_to?(:force_encoding)
    @wrapped_string.encode!("ASCII", :invalid => :replace, :undef => :replace,
      :replace => "")
  else
    @wrapped_string = tidy_bytes.normalize_utf8(:c).unpack("U*").reject {|char| char > 127}.pack("U*")
  end
end

#truncate!(max) ⇒ Object

Truncate the string to max length.

Returns:

  • String



328
329
330
# File 'lib/friendly_id/slug_string.rb', line 328

def truncate!(max)
  @wrapped_string = self[0...max].to_s if length > max
end

#upcase!Object

Upper-cases the string. Note that this works for Unicode strings, though your milage may vary with Greek and Turkic strings.

Returns:

  • String



335
336
337
# File 'lib/friendly_id/slug_string.rb', line 335

def upcase!
  @wrapped_string = apply_mapping :uppercase_mapping
end

#validate_for!(config) ⇒ Object

Validate that the slug string is not blank or reserved, and truncate it to the max length if necessary.

Parameters:

Returns:

  • String

Raises:

  • FriendlyId::BlankError

  • FriendlyId::ReservedError



345
346
347
348
349
350
# File 'lib/friendly_id/slug_string.rb', line 345

def validate_for!(config)
  truncate!(config.max_length)
  raise FriendlyId::BlankError if blank?
  raise FriendlyId::ReservedError if config.reserved?(self)
  self
end

#with_dashes!Object

Replaces whitespace with dashes (“-”).

Returns:

  • String



354
355
356
# File 'lib/friendly_id/slug_string.rb', line 354

def with_dashes!
  @wrapped_string = @wrapped_string.gsub(/[\s\-]+/u, "-")
end

#word_chars!Object

Remove any non-word characters.

Returns:

  • String



214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
# File 'lib/friendly_id/slug_string.rb', line 214

def word_chars!
  @wrapped_string = normalize_utf8(:c).unpack("U*").map { |char|
    case char
    # control chars
    when 0..31
    # punctuation; 45 is "-" (HYPHEN-MINUS) and allowed
    when 33..44
    # more puncuation
    when 46..47
    # more puncuation and other symbols
    when 58..64
    # brackets and other symbols
    when 91..96
    # braces, pipe, tilde, etc.
    when 123..191
    else char
    end
  }.compact.pack("U*")
end