Class: U::String

Inherits:
Data
  • Object
Includes:
Comparable
Defined in:
ext/u/rb_u_string.c,
lib/u-1.0/string.rb

Overview

A U::String is a sequence of zero or more Unicode characters encoded as UTF-8. Its interface is an extension of that of Ruby’s built-in String class that provides better Unicode support, as it handles things such as casing, width, collation, and various other Unicode properties that Ruby’s built-in String class simply doesn’t bother itself with. It also provides “backwards compatibility” with Ruby 1.8.7 so that you can use Unicode without upgrading to Ruby 2.0 (which you probably should do, though).

It differs from Ruby’s built-in String class in one other very important way: it doesn’t provide any way to change an existing object. That is, a U::String is a value object.
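A rough illustration of what being a value object means in practice. This sketch uses a frozen built-in String as a stand-in (an assumption; U::String itself isn’t loaded here), and the helper name `mutation_rejected?` is made up for the example.

```ruby
# Stand-in for a value object: a frozen built-in String, not a U::String.
ORIGINAL = "hello".freeze

# Non-destructive methods return a new object and leave the receiver alone.
SHOUTED = ORIGINAL.upcase

# Hypothetical helper: reports whether an in-place change is refused.
def mutation_rejected?(string)
  string.upcase!
  false
rescue RuntimeError # FrozenError is a RuntimeError subclass on newer Rubies
  true
end

mutation_rejected?(ORIGINAL) # → true
```

With a value object, every “modifying” operation hands back a new object, so a reference you hold never changes underneath you.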

A U::String is most easily created from a String by calling #u. Most U::String methods that return a stringy result will return a U::String, so you only have to do that once. You can get back a String by calling #to_str.

Validation of a U::String’s content isn’t performed until any access to it is made, at which time an ArgumentError will be raised if it isn’t valid.
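The deferred validation described above could look roughly like the following. This is a sketch only: the class `LazyUtf8` is hypothetical and built on Ruby’s own String, and it doesn’t show how U::String is implemented.

```ruby
# Hypothetical wrapper, not part of U: it stores the bytes untouched and only
# checks that they are valid UTF-8 the first time the content is read.
class LazyUtf8
  def initialize(bytes)
    @content = bytes.dup.force_encoding(Encoding::UTF_8).freeze
    @checked = false
  end

  def content
    unless @checked
      raise ArgumentError, "invalid UTF-8" unless @content.valid_encoding?
      @checked = true
    end
    @content
  end
end

broken = LazyUtf8.new("\xff") # construction succeeds
# broken.content              # would raise ArgumentError on first access
```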

U::String has a lot of methods defined upon it, so let’s break them up into categories to get a proper overview of what’s possible to do with one. Let’s begin with the interrogators. There are three kinds of interrogators: validity-checking ones, property-checking ones, and content-matching ones.

The validity-checking interrogator is #valid_encoding?, which makes sure that the UTF-8 sequence itself is valid.

The property-checking interrogators are #alnum?, #alpha?, #ascii_only?, #assigned?, #case_ignorable?, #cased?, #cntrl?, #defined?, #digit?, #graph?, #newline?, #print?, #punct?, #soft_dotted?, #space?, #title?, #valid?, #wide?, #wide_cjk?, #xdigit?, and #zero_width?. These interrogators check the corresponding Unicode property of each character in the U::String and if all characters have this property, they’ll return true.
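The “all characters must have the property” rule can be sketched with Ruby’s own regular-expression property classes. The helper `all_chars?` is hypothetical and works on built-in Strings; it only illustrates the rule, not U’s property tables.

```ruby
# Hypothetical helper: true only if every character matches the property.
# Note that Enumerable#all? is true for an empty string; whether U::String
# agrees for empty receivers isn't stated in this overview.
def all_chars?(string, property)
  string.each_char.all? { |char| property.match?(char) }
end

all_chars?("abc", /\p{Alpha}/)  # → true
all_chars?("abc1", /\p{Alpha}/) # → false
```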

Very close relatives to the property-checking interrogators are #folded?, #lower?, and #upper?, which check whether a string has been cased in a given way, and #normalized?, which checks whether the receiver has been normalized, optionally to a specific normalization form.

The content-matching interrogators are #==, #===, #=~, #match, #empty?, #end_with?, #eql?, #include?, #index, #rindex, and #start_with?. These interrogators check that a substring of the U::String matches another string or Regexp and either return a Boolean result, an index into the U::String where the match begins, or MatchData for full matching information.

Related to the content-matching interrogators are #<=>, #casecmp, and #collation_key, all of which compare a U::String against another for ordering.

Related to the property-checking interrogators are #canonical_combining_class, #general_category, #grapheme_break, #line_break, #script, and #word_break, which return the value of the Unicode property in question, the general category being the one often interrogated.

There are a couple of other “interrogators” in #bytesize, #length, #size, and #width that return integer properties of the U::String as a whole, where #length and #width are probably the most useful.

Beyond interrogators there are quite a few methods for iterating over the content of a U::String, each viewing it in its own way: #each_byte, #each_char, #each_codepoint, #each_grapheme_cluster, #each_line, and #each_word. They all have respective methods (#bytes, #chars, #codepoints, #grapheme_clusters, #lines, #words) that return an Array instead of yielding each result.
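To make the different views concrete, here is a sketch using the built-in String methods of the same names as a stand-in (an assumption: U::String isn’t loaded, and #grapheme_clusters needs Ruby 2.5 or later on a built-in String). The helper `views` is made up for the example.

```ruby
# Hypothetical helper collecting several views of the same content.
def views(text)
  {
    bytes: text.bytes,
    chars: text.each_char.to_a, # same result as text.chars
    codepoints: text.codepoints,
    graphemes: text.grapheme_clusters,
  }
end

# "e" followed by a combining acute accent, then "!".
result = views("e\u0301!")
result[:bytes]            # → [101, 204, 129, 33]
result[:codepoints]       # → [101, 769, 33]
result[:chars].length     # → 3
result[:graphemes].length # → 2
```

The same three code points are four bytes but only two user-perceived characters, which is why each iterator exists.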

Quite a few methods are devoted to extracting a substring of a U::String, namely #[], #slice, #byteslice, #chomp, #chop, #chr, #getbyte, #lstrip, #ord, #rstrip, and #strip.

There are a few methods for case-shifting: #downcase, #foldcase, #titlecase, and #upcase. Then there are #mirror, #normalize, and #reverse, which alter the string in other ways.

The methods #center, #ljust, and #rjust pad a U::String to make it a certain number of cells wide.

Then there’s a couple of methods that are more related in the arguments they take than in function: #count, #delete, #squeeze, #tr, and #tr_s. These methods all take specifications of character/code point ranges that should be counted, deleted, squeezed, and translated (plus squeezed).
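As an illustration of such range specifications, the following uses the built-in String methods of the same names as a stand-in (an assumption; U::String’s versions aren’t exercised here). The helper `range_spec_demo` is hypothetical.

```ruby
# Hypothetical helper applying each range-taking method to the same text.
def range_spec_demo(text)
  {
    count: text.count("a-m"),   # characters in the range a..m
    delete: text.delete("l-o"), # drop characters in the range l..o
    squeeze: text.squeeze("l"), # collapse runs of "l"
    tr: text.tr("a-y", "b-z"),  # shift each letter by one
    tr_s: text.tr_s("l", "r"),  # translate, then squeeze the result
  }
end

demo = range_spec_demo("hello")
demo[:count]   # → 4
demo[:delete]  # → "he"
demo[:squeeze] # → "helo"
demo[:tr]      # → "ifmmp"
demo[:tr_s]    # → "hero"
```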

Deconstructing a U::String can be done with #partition and #rpartition, which split it around a divider, #scan, which extracts matches to a pattern, and #split, which splits it on a divider.

Substitution of all matches to a pattern can be made with #gsub and of the first match to a pattern with #sub.

Creating larger U::Strings from smaller ones is done with #+, which concatenates two of them, and #*, which concatenates a U::String to itself a number of times.

A U::String can also be used as a specification as to how to format a number of values via #% (and its alias #format) into a new U::String, much like snprintf(3) in C.

The content of a U::String can be #dumped and #inspected to make it reader-friendly, but also debugger-friendly.

Finally, a U::String has a few methods to turn its content into other values: #hash, which turns it into a hash value to be used for hashing, #hex, #oct, #to_i, which turn it into an Integer, #to_str, #to_s, #b, which turn it into a String, and #to_sym (and its alias #intern), which turns it into a Symbol.

Note that some methods defined on String are missing. #capitalize doesn’t exist, as capitalization isn’t a Unicode concept. #sum doesn’t exist, as a U::String generally doesn’t contain content that you need a checksum of. #crypt doesn’t exist for similar reasons. #swapcase isn’t useful on a String and it certainly isn’t useful in a Unicode context. As a U::String doesn’t contain arbitrary data, #unpack is left to String. #next/#succ would perhaps be implementable, but haven’t been implemented, as a satisfactory implementation hasn’t been thought of.

Instance Method Summary

Constructor Details

#new(string = nil) ⇒ Object

Sets up a U::String wrapping STRING after encoding it as UTF-8 and freezing it.

Parameters:

  • string (String, nil) (defaults to: nil)


# File 'ext/u/rb_u_string.c', line 161

static VALUE
rb_u_string_initialize(int argc, VALUE *argv, VALUE self)
{
        VALUE rb;

        rb_scan_args(argc, argv, "01", &rb);
        if (!NIL_P(rb)) {
                StringValue(rb);
                rb_u_string_set_rb(self, rb);
        }

        return Qnil;
}

Instance Method Details

#%(value) ⇒ U::String Also known as: format

Returns a formatted string of the values in Array(VALUE) by treating the receiver as a format specification of this formatted string.

A format specification is a string consisting of sequences of normal characters that are copied verbatim and field specifiers. A field specifier consists of a ‘%’, followed by any optional flags, an optional width, an optional precision, and a directive:

%[flags][width][.[precision]]directive

Note that this means that a lone ‘%’ at the end of the string is simply copied verbatim as it, by this definition, isn’t a field directive.
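The shape of a field specifier can be sketched as a regular expression. This is a hypothetical parser written for illustration (the names `FIELD` and `fields` are made up), not the parser U uses; it only splits a specification into its parts and doesn’t validate flag combinations.

```ruby
# Hypothetical pattern for %[flags][width][.[precision]]directive.
# The directive is any single character that isn't a flag, digit, or ".".
FIELD = /%(?<flags>[-+ 0#]*)(?<width>\d+)?(?:\.(?<precision>\d*))?(?<directive>[^-+ 0#\d.])/

# Returns one hash per field specifier found in the format string.
def fields(format)
  format.scan(FIELD).map do |flags, width, precision, directive|
    { flags: flags, width: width, precision: precision, directive: directive }
  end
end

fields("%-08.3f")
# → [{ flags: "-0", width: "8", precision: "3", directive: "f" }]
fields("100%") # → [] since a lone "%" at the end has no directive
```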

The directive determines how this field should be formatted. The flags, width, and precision modify this interpretation.

The field often takes a value from VALUE and formats it according to a given set of rules, which depend on the flags, width, and precision, but can also output other, hardwired, values.

The directives that don’t take a value are

<table>

<thead>
  <tr><th>Directive</th><th>Description</th></tr>
</thead>
<tbody>
  <tr>
    <td>%</td>
    <td>Outputs ‘%’.</td>
  </tr>
  <tr>
    <td>\n</td>
    <td>Outputs “%\n”.</td>
  </tr>
  <tr>
    <td>\0</td>
    <td>Outputs “%\0”.</td>
  </tr>
</tbody>

</table>

None of these directives take any flags, width, or precision.

All of the following directives allow you to specify a width. The width only ever limits the minimum width of the field, that is, at least width cells will be filled by the field, but perhaps more will actually be required in the end.

<dl>

<dt>c</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding]character[right-padding]</code></pre>

  <p>If a width <em>w</em> has been specified and the
  ‘<code>-</code>’ flag hasn’t been given, <em>left-padding</em>
  consists of enough spaces to make the whole field at least <em>w</em>
  cells wide, otherwise it’s empty.</p>

  <p><em>Character</em> is the result of calling #to_str and then #chr on
  the argument, if it responds to #to_str, otherwise it’s the result of
  #to_int turned into a string containing the character at that code
  point.  A precision isn’t allowed.  The #width of the character is
  used in any width calculations.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>-</code>’
  flag has been given, <em>right-padding</em> consists of enough spaces
  to make the whole field at least <em>w</em> cells wide, otherwise it’s
  empty.</p>
</dd>
<dt>s</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding]string[right-padding]</code></pre>

  <p><em>Left-padding</em> and <em>right-padding</em> are the same as
  for the ‘c’ directive described above.</p>

  <p><em>String</em> is a substring of the result of #to_s on the
  argument that is <em>w</em> cells wide, where <em>w</em> = precision,
  if a precision has been specified, <em>w</em> = #width
  otherwise.</p>
</dd>
<dt>p</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding]inspect[right-padding]</code></pre>

  <p><em>Left-padding</em> and <em>right-padding</em> are the same as
  for the ‘c’ directive described above.</p>

  <p><em>Inspect</em> is a substring of the result of #inspect on the
  argument that is <em>w</em> cells wide, where <em>w</em> = precision,
  if a precision has been specified, <em>w</em> = #width
  otherwise.</p>
</dd>
<dt>d</dt>
<dt>i</dt>
<dt>u</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding][prefix/sign][zeroes]
  [precision-filler]digits[right-padding]</code></pre>

  <p>If a width <em>w</em> has been specified and neither the
  ‘<code>-</code>’ nor the ‘<code>0</code>’ flag has been given,
  <em>left-padding</em> consists of enough spaces to make the whole
  field at least <em>w</em> cells wide, otherwise it’s empty.</p>

  <p><em>Prefix/sign</em> is “-” if the argument is negative, “+” if the
  ‘<code>+</code>’ flag was given, and “ ” if the ‘<code> </code>’ flag
  was given, otherwise it’s empty.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>0</code>’
  flag has been given and neither the ‘<code>-</code>’ flag has been
  given nor a precision has been specified, <em>zeroes</em> consists of
  enough zeroes to make the whole field at least <em>w</em> cells wide,
  otherwise it’s empty.</p>

  <p>If a precision <em>p</em> has been specified,
  <em>precision-filler</em> consists of enough zeroes to make for
  <em>p</em> digits of output, otherwise it’s empty.</p>

  <p><em>Digits</em> consists of the digits in base 10 that represent
  the result of calling Integer with the argument as its argument.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>-</code>’
  flag has been given, <em>right-padding</em> consists of enough spaces
  to make the whole field at least <em>w</em> cells wide, otherwise it’s
  empty.</p>

  <table>
    <thead><tr><th>Flag</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>(Space)</td>
        <td>Add a “ ” prefix to non-negative numbers</td>
      </tr>
      <tr>
        <td><code>+</code></td>
        <td>Add a “+” sign to non-negative numbers; overrides the
        ‘<code> </code>’ flag</td>
      </tr>
      <tr>
        <td><code>0</code></td>
        <td>Use ‘0’ for any width padding; ignored when a precision has
        been specified</td>
      </tr>
      <tr>
        <td><code>-</code></td>
        <td>Left justify the output with ‘ ’ as padding; overrides the
        ‘<code>0</code>’ flag</td>
      </tr>
    </tbody>
  </table>
</dd>
<dt>o</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding][prefix/sign][zeroes/sevens]
  [precision-filler]octal-digits[right-padding]</code></pre>

  <p>If a width <em>w</em> has been specified and neither the
  ‘<code>-</code>’ nor the ‘<code>0</code>’ flag has been given,
  <em>left-padding</em> consists of enough spaces to make the whole
  field at least <em>w</em> cells wide, otherwise it’s empty.</p>

  <p><em>Prefix/sign</em> is “-” if the argument is negative and the
  ‘<code>+</code>’ or ‘<code> </code>’ flag was given, “..” if the
  argument is negative, “+” if the ‘<code>+</code>’ flag was given, and
  “ ” if the ‘<code> </code>’ flag was given, otherwise it’s empty.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>0</code>’
  flag has been given and neither the ‘<code>-</code>’ flag has been
  given nor a precision has been specified, <em>zeroes/sevens</em>
  consists of enough zeroes, if the argument is non-negative or if the
  ‘<code>+</code>’ or ‘<code> </code>’ flag has been specified, sevens
  otherwise, to make the whole field at least <em>w</em> cells wide,
  otherwise it’s empty.</p>

  <p>If a precision <em>p</em> has been specified,
  <em>precision-filler</em> consists of enough zeroes, if the argument
  is non-negative or if the ‘<code>+</code>’ or ‘<code> </code>’ flag
  has been specified, sevens otherwise, to make for <em>p</em> digits of
  output, otherwise it’s empty.</p>

  <p><em>Octal-digits</em> consists of the digits in base 8 that
  represent the result of #to_int on the argument, using ‘0’ through
  ‘7’.  A negative value will be output as a two’s complement value.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>-</code>’
  flag has been given, <em>right-padding</em> consists of enough spaces
  to make the whole field at least <em>w</em> cells wide, otherwise it’s
  empty.</p>

  <table>
    <thead><tr><th>Flag</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>(Space)</td>
        <td>Add a “ ” prefix to non-negative numbers and don’t output
        negative numbers as two’s complement values</td>
      </tr>
      <tr>
        <td><code>+</code></td>
        <td>Add a “+” sign to non-negative numbers and don’t output
        negative numbers as two’s complement values; overrides the
        ‘<code> </code>’ flag</td>
      </tr>
      <tr>
        <td><code>0</code></td>
        <td>Use ‘0’ for any width padding; ignored when a precision has
        been specified</td>
      </tr>
      <tr>
        <td><code>-</code></td>
        <td>Left justify the output with ‘ ’ as padding; overrides the
        ‘<code>0</code>’ flag</td>
      </tr>
      <tr>
        <td><code>#</code></td>
        <td>Increase precision to include as many digits as necessary to
        make the first digit ‘0’, but don’t include the ‘0’ itself</td>
      </tr>
    </tbody>
  </table>
</dd>
<dt>x</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding][sign][base-prefix][prefix][zeroes/fs]
  [precision-filler]hexadecimal-digits[right-padding]</code></pre>

  <p><em>Left-padding</em> and <em>right-padding</em> are the same as
  for the ‘o’ directive described above.  <em>Zeroes/fs</em> is the same
  as <em>zeroes/sevens</em> for the ‘o’ directive, except that it uses
  ‘f’ characters instead of sevens.  The same goes for
  <em>precision-filler</em>.</p>

  <p><em>Sign</em> is “-” if the argument is negative and the
  ‘<code>+</code>’ or ‘<code> </code>’ flag was given, “+” if the
  argument is non-negative and the ‘<code>+</code>’ flag was given, and
  “ ” if the argument is non-negative and the ‘<code> </code>’ flag was
  given, otherwise it’s empty.</p>

  <p><em>Base-prefix</em> is “0x” if the ‘<code>#</code>’ flag was given
  and the result of #to_int on the argument is non-zero.</p>

  <p><em>Prefix</em> is “..” if the argument is negative and neither the
  ‘<code>+</code>’ nor the ‘<code> </code>’ flag was given.</p>

  <p><em>Hexadecimal-digits</em> consists of the digits in base 16 that
  represent the result of #to_int on the argument, using ‘0’ through ‘9’
  and ‘a’ through ‘f’.  A negative value will be output as a two’s
  complement value.</p>

  <table>
    <thead><tr><th>Flag</th><th>Description</th></tr></thead>
    <tbody>
      <tr><td>(Space)</td><td>Same as for ‘o’</td></tr>
      <tr><td><code>+</code></td><td>Same as for ‘o’</td></tr>
      <tr><td><code>0</code></td><td>Same as for ‘o’</td></tr>
      <tr><td><code>-</code></td><td>Same as for ‘o’</td></tr>
      <tr><td><code>#</code></td><td>Prefix non-zero values with “0x”</td></tr>
    </tbody>
  </table>
</dd>
<dt>X</dt>
<dd>
  <p>Same as ‘x’, except that it uses uppercase letters instead.</p>
</dd>
<dt>b</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding][sign][base-prefix][prefix][zeroes/ones]
  [precision-filler]binary-digits[right-padding]</code></pre>

  <p><em>Left-padding</em> and <em>right-padding</em> are the same as
  for the ‘o’ directive described above.  <em>Base-prefix</em> and
  <em>prefix</em> are the same as for the ‘x’ directive, except that
  <em>base-prefix</em> outputs “0b”.  <em>Zeroes/ones</em> is the same
  as <em>zeroes/fs</em> for the ‘x’ directive, except that it uses ones
  instead of ‘f’ characters.  The same goes for
  <em>precision-filler</em>.</p>

  <p><em>Binary-digits</em> consists of the digits in base 2 that
  represent the result of #to_int on the argument, using ‘0’ and ‘1’.  A
  negative value will be output as a two’s complement value.</p>

  <table>
    <thead><tr><th>Flag</th><th>Description</th></tr></thead>
    <tbody>
      <tr><td>(Space)</td><td>Same as for ‘o’</td></tr>
      <tr><td><code>+</code></td><td>Same as for ‘o’</td></tr>
      <tr><td><code>0</code></td><td>Same as for ‘o’</td></tr>
      <tr><td><code>-</code></td><td>Same as for ‘o’</td></tr>
      <tr><td><code>#</code></td><td>Prefix non-zero values with “0b”</td></tr>
    </tbody>
  </table>
</dd>
<dt>B</dt>
<dd>
  <p>Same as ‘b’, except that it uses a “0B” prefix for the
  ‘<code>#</code>’ flag.</p>
</dd>
<dt>f</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding][prefix/sign][zeroes]
  integer-part[decimal-point][fractional-part][right-padding]</code></pre>

  <p>If a width <em>w</em> has been specified and neither the
  ‘<code>-</code>’ nor the ‘<code>0</code>’ flag has been given,
  <em>left-padding</em> consists of enough spaces to make the whole
  field at least <em>w</em> cells wide, otherwise it’s empty.</p>

  <p><em>Prefix/sign</em> is “-” if the argument is negative, “+” if the
  ‘<code>+</code>’ flag was given, and “ ” if the ‘<code> </code>’ flag
  was given, otherwise it’s empty.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>0</code>’
  flag has been given and the ‘<code>-</code>’ flag has not been given,
  <em>zeroes</em> consists of enough zeroes to make the whole field
  at least <em>w</em> cells wide, otherwise it’s empty.</p>

  <p><em>Integer-part</em> consists of the digits in base 10 that
  represent the integer part of the result of calling Float with the
  argument as its argument.</p>

  <p><em>Decimal-point</em> is “.” if the precision isn’t 0 or if the
  ‘<code>#</code>’ flag has been given.</p>

  <p><em>Fractional-part</em> consists of <em>p</em> digits in base 10
  that represent the fractional part of the result of calling Float with
  the argument as its argument, where <em>p</em> = precision, if one has
  been specified, <em>p</em> = 6 otherwise.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>-</code>’
  flag has been given, <em>right-padding</em> consists of enough spaces
  to make the whole field at least <em>w</em> cells wide, otherwise it’s
  empty.</p>

  <table>
    <thead><tr><th>Flag</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>(Space)</td>
        <td>Add a “ ” prefix to non-negative numbers</td>
      </tr>
      <tr>
        <td><code>+</code></td>
        <td>Add a “+” sign to non-negative numbers; overrides the
        ‘<code> </code>’ flag</td>
      </tr>
      <tr>
        <td><code>0</code></td>
        <td>Use ‘0’ for any width padding; ignored when a precision has
        been specified</td>
      </tr>
      <tr>
        <td><code>-</code></td>
        <td>Left justify the output with ‘ ’ as padding; overrides the
        ‘<code>0</code>’ flag</td>
      </tr>
      <tr>
        <td>#</td>
        <td>Output a decimal point, even if no fractional part
        follows</td>
      </tr>
    </tbody>
  </table>
</dd>
<dt>e</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding][prefix/sign][zeroes]
  digit[decimal-point][fractional-part]exponent[right-padding]</code></pre>

  <p>If a width <em>w</em> has been specified and neither the
  ‘<code>-</code>’ nor the ‘<code>0</code>’ flag has been given,
  <em>left-padding</em> consists of enough spaces to make the whole
  field at least <em>w</em> + <em>e</em> cells wide, where <em>e</em> ≥
  4 is the width of the exponent, otherwise it’s empty.</p>

  <p><em>Prefix/sign</em> is “-” if the argument is negative, “+” if the
  ‘<code>+</code>’ flag was given, and “ ” if the ‘<code> </code>’ flag
  was given, otherwise it’s empty.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>0</code>’
  flag has been given and the ‘<code>-</code>’ flag has not been given,
  <em>zeroes</em> consists of enough zeroes to make the whole field
  <em>w</em> + <em>e</em> cells wide, where <em>e</em> ≥ 4 is the width
  of the exponent, otherwise it’s empty.</p>

  <p><em>Digit</em> consists of one digit in base 10 that represents the
  most significant digit of the result of calling Float with the
  argument as its argument.</p>

  <p><em>Decimal-point</em> is “.” if the precision isn’t 0 or if the
  ‘<code>#</code>’ flag has been given.</p>

  <p><em>Fractional-part</em> consists of <em>p</em> digits in base 10
  that represent all but the most significant digit of the result of
  calling Float with the argument as its argument, where <em>p</em> =
  precision, if one has been specified, <em>p</em> = 6 otherwise.</p>

  <p><em>Exponent</em> consists of “e” followed by the exponent in base
  10 required to turn the result of calling Float with the argument as
  its argument into a decimal fraction with one non-zero digit in the
  integer part.  If the exponent is 0, “+00” will be output.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>-</code>’
  flag has been given, <em>right-padding</em> consists of enough spaces
  to make the whole field at least <em>w</em> + <em>e</em> cells wide,
  where <em>e</em> ≥ 4 is the width of the exponent, otherwise it’s
  empty.</p>

  <table>
    <thead><tr><th>Flag</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>(Space)</td>
        <td>Add a “ ” prefix to non-negative numbers</td>
      </tr>
      <tr>
        <td><code>+</code></td>
        <td>Add a “+” sign to non-negative numbers; overrides the
        ‘<code> </code>’ flag</td>
      </tr>
      <tr>
        <td><code>0</code></td>
        <td>Use ‘0’ for any width padding; ignored when a precision has
        been specified</td>
      </tr>
      <tr>
        <td><code>-</code></td>
        <td>Left justify the output with ‘ ’ as padding; overrides the
        ‘<code>0</code>’ flag</td>
      </tr>
      <tr>
        <td>#</td>
        <td>Output a decimal point, even if no fractional part
        follows</td>
      </tr>
    </tbody>
  </table>
</dd>
<dt>E</dt>
<dd>
  <p>Same as ‘e’, except that it uses an uppercase ‘E’ for the exponent
  separator.</p>
</dd>
<dt>g</dt>
<dd>
  <p>Same as ‘e’ if the exponent is less than -4 or if the exponent is
  greater than or equal to the precision, otherwise ‘f’ is used.  The
  precision defaults to 6 and a precision of 0 is treated as a precision
  of 1.  Trailing zeros are removed from the fractional part of the
  result.</p>
</dd>
<dt>G</dt>
<dd>
  <p>Same as ‘g’, except that it uses an uppercase ‘E’ for the exponent
  separator.</p>
</dd>
<dt>a</dt>
<dd>
  <p>Outputs</p>

  <pre><code>[left-padding][prefix/sign][zeroes]
  digit[hexadecimal-point][fractional-part]exponent[right-padding]</code></pre>

  <p>If a width <em>w</em> has been specified and neither the
  ‘<code>-</code>’ nor the ‘<code>0</code>’ flag has been given,
  <em>left-padding</em> consists of enough spaces to make the whole
  field at least <em>w</em> + <em>e</em> cells wide, where <em>e</em> ≥
  3 is the width of the exponent, otherwise it’s empty.</p>

  <p><em>Prefix/sign</em> is “-” if the argument is negative, “+” if the
  ‘<code>+</code>’ flag was given, and “ ” if the ‘<code> </code>’ flag
  was given, otherwise it’s empty.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>0</code>’
  flag has been given and the ‘<code>-</code>’ flag has not been given,
  <em>zeroes</em> consists of enough zeroes to make the whole field
  <em>w</em> + <em>e</em> cells wide, where <em>e</em> ≥ 3 is the width
  of the exponent, otherwise it’s empty.</p>

  <p><em>Digit</em> consists of one digit in base 16 that represents the
  most significant digit of the result of calling Float with the
  argument as its argument, using ‘0’ through ‘9’ and ‘a’ through ‘f’.</p>

  <p><em>Hexadecimal-point</em> is “.” if the precision isn’t 0 or if the
  ‘<code>#</code>’ flag has been given.</p>

  <p><em>Fractional-part</em> consists of <em>p</em> digits in base 16
  that represent all but the most significant digit of the result of
  calling Float with the argument as its argument, where <em>p</em> =
  precision, if one has been specified, <em>p</em> = <em>q</em>, where
  <em>q</em> is the number of digits required to represent the number
  exactly, otherwise.  Digits are output using ‘0’ through ‘9’ and ‘a’
  through ‘f’.</p>

  <p><em>Exponent</em> consists of “p” followed by the exponent of 2 in
  base 10 required to turn the result of calling Float with the argument
  as its argument into a decimal fraction with one non-zero digit in the
  integer part.  If the exponent is 0, “+0” will be output.</p>

  <p>If a width <em>w</em> has been specified and the ‘<code>-</code>’
  flag has been given, <em>right-padding</em> consists of enough spaces
  to make the whole field at least <em>w</em> + <em>e</em> cells
  wide, where <em>e</em> ≥ 3 is the width of the exponent, otherwise
  it’s empty.</p>

  <table>
    <thead><tr><th>Flag</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>(Space)</td>
        <td>Add a “ ” prefix to non-negative numbers</td>
      </tr>
      <tr>
        <td><code>+</code></td>
        <td>Add a “+” sign to non-negative numbers; overrides the
        ‘<code> </code>’ flag</td>
      </tr>
      <tr>
        <td><code>0</code></td>
        <td>Use ‘0’ for any width padding; ignored when a precision has
        been specified</td>
      </tr>
      <tr>
        <td><code>-</code></td>
        <td>Left justify the output with ‘ ’ as padding; overrides the
        ‘<code>0</code>’ flag</td>
      </tr>
      <tr>
        <td>#</td>
        <td>Output a decimal point, even if no fractional part
        follows</td>
      </tr>
    </tbody>
  </table>
</dd>
<dt>A</dt>
<dd>
  <p>Same as ‘a’, except that it uses uppercase letters instead.</p>
</dd>

</dl>
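A few of the rules above, shown with Ruby’s built-in Kernel#format as a stand-in. This assumes that U::String#% follows the same conventions for plain ASCII input, which is what the descriptions above suggest; U::String#% itself isn’t exercised here, and its cell-based width handling for wide characters isn’t shown.

```ruby
# Built-in Kernel#format is used here as a stand-in for U::String#%.

# Width, justification, zero padding, sign, and precision for 'd'.
format("%5d|", 42)  # → "   42|"
format("%-5d|", 42) # → "42   |"
format("%05d", 42)  # → "00042"
format("%+d", 42)   # → "+42"
format("%.3d", 7)   # → "007"

# Negative arguments to 'o' and 'x' use a ".." prefix and two's complement,
# unless the '+' (or ' ') flag asks for a plain sign instead.
format("%o", -123)  # → "..7605"
format("%+o", -123) # → "-173"
format("%x", -123)  # → "..f85"
format("%#x", 123)  # → "0x7b"

# 's' pads to a width and truncates to a precision.
format("%5s|", "ab")     # → "   ab|"
format("%.2s", "abcdef") # → "ab"

# 'f', 'e', and the switch that 'g' makes between them.
format("%.2f", 3.14159) # → "3.14"
format("%e", 12345.678) # → "1.234568e+04"
format("%g", 0.0001)    # → "0.0001"
format("%g", 0.00001)   # → "1e-05"
```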

A warning is issued if the ‘0’ flag is given when the ‘-’ flag has also been given to the ‘d’, ‘i’, ‘u’, ‘o’, ‘x’, ‘X’, ‘b’, or ‘B’ directives.

A warning is issued if the ‘0’ flag is given when a precision has been specified for the ‘d’, ‘i’, ‘u’, ‘o’, ‘x’, ‘X’, ‘b’, or ‘B’ directives.

A warning is issued if the ‘ ’ flag is given when the ‘+’ flag has also been given to the ‘d’, ‘i’, ‘u’, ‘o’, ‘x’, ‘X’, ‘b’, or ‘B’ directives.

A warning is issued if the ‘0’ flag is given when the ‘o’, ‘x’, ‘X’, ‘b’, or ‘B’ directives have been given a negative argument.

A warning is issued if the ‘#’ flag is given when the ‘o’ directive has been given a negative argument.

Any taint on the receiver and any taint on arguments to any ‘s’ and ‘p’ directives are inherited by the result.

Returns:

  • (U::String)

Raises:

  • (ArgumentError)

    If the receiver isn’t a valid format specification

  • (ArgumentError)

    If any flags are given to the ‘%’, ‘\n’, or ‘\0’ directives

  • (ArgumentError)

    If an argument is given to the ‘%’, ‘\n’, or ‘\0’ directives

  • (ArgumentError)

    If a width is specified for the ‘%’, ‘\n’, or ‘\0’ directives

  • (ArgumentError)

    If a precision is specified for the ‘%’, ‘\n’, ‘\0’, or ‘c’ directives

  • (ArgumentError)

    If any of the flags ‘ ’, ‘+’, ‘0’, or ‘#’ are given to the ‘c’, ‘s’, or ‘p’ directives

  • (ArgumentError)

    If the ‘#’ flag is given to the ‘d’, ‘i’, or ‘u’ directives

  • (ArgumentError)

    If the argument to the ‘c’ directive doesn’t respond to #to_str or #to_int



# File 'ext/u/rb_u_string_format.c', line 1736

VALUE
rb_u_string_format_m(VALUE self, VALUE argument)
{
        volatile VALUE tmp = rb_check_array_type(argument);

        if (!NIL_P(tmp))
                return rb_u_string_format(RARRAY_LENINT(tmp), RARRAY_PTR(tmp), self);

        return rb_u_string_format(1, &argument, self);
}
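The dispatch in the C function above can be approximated in Ruby as follows. The method name `format_arguments` is hypothetical; the sketch assumes that `rb_check_array_type` behaves like `Array.try_convert`, that is, it yields an Array for Arrays and for objects responding to #to_ary, and nothing otherwise.

```ruby
# Hypothetical Ruby rendering of the argument dispatch: an Array (or
# anything convertible with #to_ary) supplies all the values, anything
# else is treated as a single value.
def format_arguments(argument)
  array = Array.try_convert(argument)
  array.nil? ? [argument] : array
end

format_arguments([1, 2]) # → [1, 2]
format_arguments(3)      # → [3]
format_arguments(nil)    # → [nil]
```

Note that under this reading a lone `nil` is passed on as one value rather than as an empty list.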

#*(n) ⇒ U::String

Returns the concatenation of N copies of the receiver, inheriting any taint and untrust.

Parameters:

  • n (#to_int)

Returns:

  • (U::String)

    The concatenation of N copies of the receiver, inheriting any taint and untrust

Raises:

  • (ArgumentError)

    If N < 0

  • (ArgumentError)

    If N > 0 and N × #bytesize > LONG_MAX



# File 'ext/u/rb_u_string_times.c', line 9

VALUE
rb_u_string_times(VALUE self, VALUE rbtimes)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        long times = NUM2LONG(rbtimes);
        if (times < 0)
                rb_u_raise(rb_eArgError, "negative argument: %ld", times);

        /* TODO: Isn’t this off by one, as we add one to length for the
         * ALLOC_N() call? */
        if (times > 0 && LONG_MAX / times < USTRING_LENGTH(string))
                rb_u_raise(rb_eArgError, "argument too big: %ld", times);
        long length = times * USTRING_LENGTH(string);

        char *product = ALLOC_N(char, length + 1);
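        long i = USTRING_LENGTH(string);
        /* Check the total length rather than the unit length, so that a
         * repeat count of 0 copies nothing into the one-byte buffer. */
        if (length > 0) {
                memcpy(product, USTRING_STR(string), i);
                /* Double the filled prefix while it fits, then copy the
                 * remaining bytes; both bounds are byte counts. */
                for ( ; i <= length / 2; i *= 2)
                        memcpy(product + i, product, i);
                memcpy(product + i, product, length - i);
        }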
        product[length] = '\0';

        return rb_u_string_new_c_own(self, product, length);
}
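The repeated-doubling copy that this kind of repetition relies on can be sketched in Ruby. The method `repeat_by_doubling` is hypothetical and works on built-in Strings; it’s meant to show the technique (copy the unit once, double the filled prefix while it fits, then copy the remainder), not to mirror the C code line by line.

```ruby
# Hypothetical sketch of repetition by doubling, measured in bytes.
def repeat_by_doubling(string, times)
  raise ArgumentError, "negative argument: #{times}" if times < 0

  unit = string.bytesize
  total = unit * times
  return "".b.force_encoding(string.encoding) if total.zero?

  buffer = string.b # mutable binary copy holding one unit
  filled = unit
  # Double the filled prefix until more than half of the result exists.
  while filled <= total / 2
    buffer << buffer.byteslice(0, filled)
    filled *= 2
  end
  # Copy whatever is still missing from the start of the buffer.
  buffer << buffer.byteslice(0, total - filled)
  buffer.force_encoding(string.encoding)
end

repeat_by_doubling("ab", 5) # → "ababababab"
```

Doubling needs only about log2(N) copy operations instead of N, which is the reason for the loop’s shape.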

#+(other) ⇒ U::String

Returns the concatenation of OTHER to the receiver, inheriting any taint on either.

Parameters:

Returns:

  • (U::String)

    The concatenation of OTHER to the receiver, inheriting any taint on either

Raises:



# File 'ext/u/rb_u_string_plus.c', line 8

VALUE
rb_u_string_plus(VALUE self, VALUE rbother)
{
        const struct rb_u_string *string = RVAL2USTRING(self);
        const struct rb_u_string *other = RVAL2USTRING_ANY(rbother);

        long string_length = USTRING_LENGTH(string);
        long other_length = USTRING_LENGTH(other);

        /* TODO: Isn’t this off by one, as we add one to length for the
         * ALLOC_N() call? */
        if (string_length > LONG_MAX - other_length)
                rb_u_raise(rb_eArgError, "length of resulting string would be too big");
        long length = string_length + other_length;

        char *sum = ALLOC_N(char, length + 1);
        memcpy(sum, USTRING_STR(string), string_length);
        memcpy(sum + string_length, USTRING_STR(other), other_length);
        sum[length] = '\0';

        VALUE result = rb_u_string_new_uninfected_own(sum, length);
        if (OBJ_TAINTED(self) || OBJ_TAINTED(rbother))
                OBJ_TAINT(result);

        return result;
}

#<=>(other, locale = ENV['LC_COLLATE']) ⇒ Fixnum

Returns the comparison of the receiver and OTHER using the linguistically correct rules of LOCALE. The LOCALE must be given as a language, region, and encoding, for example, “en_US.UTF-8”.

This operation is known as “collation” and you can find more information about the collation algorithm employed in the Unicode Technical Standard #10, see unicode.org/reports/tr10/.

Parameters:

Returns:

  • (Fixnum)

Raises:

  • (Errno::EILSEQ)

    If a character in the receiver can’t be converted into the encoding of the locale

See Also:



# File 'ext/u/rb_u_string_collate.c', line 21

VALUE
rb_u_string_collate(int argc, VALUE *argv, VALUE self)
{
        const char *locale = NULL;

        VALUE rbother, rblocale;
        if (rb_scan_args(argc, argv, "11", &rbother, &rblocale) == 2)
                locale = StringValuePtr(rblocale);
        else {
                const char * const env[] = { "LC_ALL", "LC_COLLATE", "LANG", NULL };
                for (const char * const *p = env; *p != NULL; p++)
                        if ((locale = getenv(*p)) != NULL)
                                break;
        }

        const struct rb_u_string *string = RVAL2USTRING(self);
        const struct rb_u_string *other = RVAL2USTRING_ANY(rbother);

        errno = 0;
        int r = u_collate(USTRING_STR(string), USTRING_LENGTH(string),
                          USTRING_STR(other), USTRING_LENGTH(other),
                          locale);
        if (errno != 0)
                rb_u_raise_errno(errno, "can’t collate strings");
        return INT2FIX(r);
}
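
When no LOCALE argument is given, the listing consults LC_ALL, then LC_COLLATE, then LANG, and uses the first one that is set. A minimal sketch of that precedence, using the hypothetical helper names `first_set_locale` and `collation_locale_from_environment`, might look like this:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns the first non-NULL candidate, or NULL if
 * every candidate is NULL. */
const char *
first_set_locale(const char * const *candidates, size_t n)
{
        assert(candidates != NULL || n == 0);
        for (size_t i = 0; i < n; i++)
                if (candidates[i] != NULL)
                        return candidates[i];
        return NULL;
}

/* Same lookup order as the listing: LC_ALL, then LC_COLLATE, then LANG. */
const char *
collation_locale_from_environment(void)
{
        const char * const candidates[] = {
                getenv("LC_ALL"), getenv("LC_COLLATE"), getenv("LANG")
        };
        return first_set_locale(candidates, 3);
}
```

Note that this order means a set LC_ALL takes precedence over LC_COLLATE, even though the method signature shows only `ENV['LC_COLLATE']` as the default.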

#==(other) ⇒ Boolean Also known as: ===

Returns true if the receiver’s bytes equal those of OTHER.

Parameters:

Returns:

  • (Boolean)

    True if the receiver’s bytes equal those of OTHER

See Also:



# File 'ext/u/rb_u_string_equal.c', line 8

VALUE
rb_u_string_equal(VALUE self, VALUE rbother)
{
        if (self == rbother)
                return Qtrue;

        if (RTEST(rb_obj_is_kind_of(rbother, rb_cUString)))
                return rb_u_string_eql(self, rbother);

        if (!rb_respond_to(rbother, rb_intern("to_str")))
                return Qfalse;

        const struct rb_u_string *string = RVAL2USTRING(self);
        const struct rb_u_string *other = RVAL2USTRING_ANY(rbother);

        const char *p = USTRING_STR(string);
        const char *q = USTRING_STR(other);

        if (p == q)
                return Qtrue;

        long p_length = USTRING_LENGTH(string);
        long q_length = USTRING_LENGTH(other);

        return p_length == q_length && memcmp(p, q, q_length) == 0 ?  Qtrue : Qfalse;
}

#=~(other) ⇒ Numeric?

Returns the result of OTHER#=~(self), that is, the index of the first character of the match of OTHER in the receiver, if one exists.

Parameters:

  • other (Regexp, #=~)

Returns:

  • (Numeric, nil)

    The result of OTHER#=~(self), that is, the index of the first character of the match of OTHER in the receiver, if one exists

Raises:

  • (TypeError)

    If OTHER is a U::String or String



# File 'ext/u/rb_u_string_match.c', line 10

VALUE
rb_u_string_match(VALUE self, VALUE other)
{
        if (RTEST(rb_obj_is_kind_of(other, rb_cUString)))
                rb_u_raise(rb_eTypeError, "type mismatch: U::String given");

        switch (TYPE(other)) {
        case T_STRING:
                rb_u_raise(rb_eTypeError, "type mismatch: String given");
                break;
        case T_REGEXP: {
                const struct rb_u_string *string = RVAL2USTRING(self);

                long index = rb_reg_search(other, rb_str_to_str(self), 0, 0);
                if (index < 0)
                        return Qnil;

                return LONG2NUM(u_pointer_to_offset(USTRING_STR(string),
                                                    USTRING_STR(string) + index));
        }
        default:
                return rb_funcall(other, rb_intern("=~"), 1, self);
        }
}
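
The regexp search reports a byte index, which the listing converts to a character index with `u_pointer_to_offset`. For valid UTF-8 that conversion amounts to counting the bytes that are not continuation bytes. The sketch below is a hypothetical stand-in (`char_offset` is not a library function) and assumes well-formed input:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for u_pointer_to_offset(): counts the characters
 * that start within the first byte_index bytes of the valid UTF-8 string s. */
long
char_offset(const char *s, long byte_index)
{
        assert(s != NULL && byte_index >= 0);
        long offset = 0;
        for (long i = 0; i < byte_index; i++)
                /* Continuation bytes have the form 10xxxxxx; skip them. */
                if (((unsigned char)s[i] & 0xC0) != 0x80)
                        offset++;
        return offset;
}
```

For example, in the three-byte string "äb" the letter "b" sits at byte index 2 but character index 1.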

#[](index) ⇒ U::String?
#[](index, length) ⇒ U::String?
#[](range) ⇒ U::String?
#[](regexp, reference = 0) ⇒ U::String?
#[](string) ⇒ U::String?
#[](object) ⇒ nil

Also known as: slice

Overloads:

  • #[](index) ⇒ U::String?

    Returns the substring [max(i, 0), min(#length, i + 1)], where i = INDEX if INDEX ≥ 0, i = #length - abs(INDEX) otherwise, inheriting any taint and untrust, or nil if this substring is empty.

    Parameters:

    • index (#to_int)

    Returns:

    • (U::String, nil)

      The substring [max(i, 0), min(#length, i + 1)], where i = INDEX if INDEX ≥ 0, i = #length - abs(INDEX) otherwise, inheriting any taint and untrust, or nil if this substring is empty

  • #[](index, length) ⇒ U::String?

    Returns the substring [max(i, 0), min(#length, i + LENGTH)], where i = INDEX if INDEX ≥ 0, i = #length - abs(INDEX) otherwise, inheriting any taint and untrust, or nil if LENGTH < 0.

    Parameters:

    • index (#to_int)
    • length (#to_int)

    Returns:

    • (U::String, nil)

      The substring [max(i, 0), min(#length, i + LENGTH)], where i = INDEX if INDEX ≥ 0, i = #length - abs(INDEX) otherwise, inheriting any taint or untrust, or nil if LENGTH < 0

  • #[](range) ⇒ U::String?

    Returns the result of #[i, j - k], where i = RANGE#begin if RANGE#begin ≥ 0, i = #length - abs(RANGE#begin) otherwise, j = RANGE#end if RANGE#end ≥ 0, j = #length - abs(RANGE#end) otherwise, and k = 1 if RANGE#exclude_end?, k = 0 otherwise, or nil if j - k < 0.

    Parameters:

    • range (Range)

    Returns:

    • (U::String, nil)

      The result of #[i, j - k], where i = RANGE#begin if RANGE#begin ≥ 0, i = #length - abs(RANGE#begin) otherwise, j = RANGE#end if RANGE#end ≥ 0, j = #length - abs(RANGE#end) otherwise, and k = 1 if RANGE#exclude_end?, k = 0 otherwise, or nil if j - k < 0

  • #[](regexp, reference = 0) ⇒ U::String?

    Returns the submatch REFERENCE from the first match of REGEXP in the receiver, inheriting any taint and untrust from both the receiver and from REGEXP, or nil if there is no match or if the submatch isn’t part of the overall match.

    Parameters:

    • regexp (Regexp)
    • reference (#to_int, #to_str, Symbol) (defaults to: 0)

    Returns:

    • (U::String, nil)

      The submatch REFERENCE from the first match of REGEXP in the receiver, inheriting any taint and untrust from both the receiver and from REGEXP, or nil if there is no match or if the submatch isn’t part of the overall match

    Raises:

    • (IndexError)

      If REFERENCE doesn’t refer to a submatch

  • #[](string) ⇒ U::String?

    Returns the substring STRING, inheriting any taint and untrust from STRING, if STRING is a substring of the receiver.

    Parameters:

    Returns:

    • (U::String, nil)

      The substring STRING, inheriting any taint and untrust from STRING, if STRING is a substring of the receiver

  • #[](object) ⇒ nil

    Returns nil for any object that doesn’t satisfy the other cases.

    Parameters:

    • object (Object)

    Returns:

    • (nil)

      Nil for any object that doesn’t satisfy the other cases



# File 'ext/u/rb_u_string_aref.c', line 130

VALUE
rb_u_string_aref_m(int argc, VALUE *argv, VALUE self)
{
        need_m_to_n_arguments(argc, 1, 2);

        if (argc == 1)
                return rb_u_string_aref(self, argv[0]);

        if (TYPE(argv[0]) == T_REGEXP)
                return rb_u_string_subpat(self, argv[0], argv[1]);

        return rb_u_string_substr(self, NUM2LONG(argv[0]), NUM2LONG(argv[1]));
}

#alnum? ⇒ Boolean

Returns true if the receiver contains only characters in the general categories Letter and Number.

Returns:

  • (Boolean)

    True if the receiver contains only characters in the general categories Letter and Number



# File 'ext/u/rb_u_string_alnum.c', line 6

VALUE
rb_u_string_alnum(VALUE self)
{
        return _rb_u_character_test(self, u_char_isalnum);
}

#alpha? ⇒ Boolean

Returns true if the receiver contains only characters in the general category Alpha.

Returns:

  • (Boolean)

    True if the receiver contains only characters in the general category Alpha



# File 'ext/u/rb_u_string_alpha.c', line 6

VALUE
rb_u_string_alpha(VALUE self)
{
        return _rb_u_character_test(self, u_char_isalpha);
}

#ascii_only? ⇒ Boolean

Returns true if the receiver contains only characters in the ASCII region, that is, U+0000 through U+007F.

Returns:

  • (Boolean)

    True if the receiver contains only characters in the ASCII region, that is, U+0000 through U+007F



# File 'ext/u/rb_u_string_ascii_only.c', line 6

VALUE
rb_u_string_ascii_only(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        return u_is_ascii_only_n(USTRING_STR(string), USTRING_LENGTH(string)) ?
                Qtrue : Qfalse;
}
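
Because every code point from U+0000 through U+007F is encoded in UTF-8 as a single byte with the high bit clear, the check reduces to a scan over bytes. A minimal sketch, with `is_ascii_only` as a hypothetical stand-in for `u_is_ascii_only_n`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for u_is_ascii_only_n(): true if none of the n
 * bytes at s has its high bit set. */
bool
is_ascii_only(const char *s, size_t n)
{
        assert(s != NULL || n == 0);
        for (size_t i = 0; i < n; i++)
                if ((unsigned char)s[i] > 0x7F)
                        return false;
        return true;
}
```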

#assigned? ⇒ Boolean

Returns true if the receiver contains only code points that have been assigned a code value.

Returns:

  • (Boolean)

    True if the receiver contains only code points that have been assigned a code value



# File 'ext/u/rb_u_string_assigned.c', line 6

VALUE
rb_u_string_assigned(VALUE self)
{
        return _rb_u_character_test(self, u_char_isassigned);
}

#b ⇒ String

Returns the String representation of the receiver, inheriting any taint and untrust, encoded as ASCII-8BIT.

Returns:

  • (String)

    The String representation of the receiver, inheriting any taint and untrust, encoded as ASCII-8BIT.



# File 'ext/u/rb_u_string_b.c', line 8

VALUE
rb_u_string_b(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);
        VALUE result = rb_str_new(USTRING_STR(string), USTRING_LENGTH(string));
#ifdef HAVE_RUBY_ENCODING_H
        rb_enc_associate(result, rb_ascii8bit_encoding());
#endif
        OBJ_INFECT(result, self);
        return result;
}

#bytes ⇒ Array<Fixnum>

Returns the bytes of the receiver.

Returns:

  • (Array<Fixnum>)

    The bytes of the receiver.



# File 'ext/u/rb_u_string_each_byte.c', line 40

VALUE
rb_u_string_bytes(VALUE self)
{
        struct yield_array y = YIELD_ARRAY_INIT;
        each(self, &y.yield);
        return y.array;
}

#bytesize ⇒ Integer

Returns the number of bytes required to represent the receiver.

Returns:

  • (Integer)

    The number of bytes required to represent the receiver



# File 'ext/u/rb_u_string_bytesize.c', line 4

VALUE
rb_u_string_bytesize(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        return LONG2NUM(USTRING_LENGTH(string));
}

#byteslice(index) ⇒ U::String?
#byteslice(index, length) ⇒ U::String?
#byteslice(range) ⇒ U::String?
#byteslice(object) ⇒ nil

Overloads:

  • #byteslice(index) ⇒ U::String?

    Returns the byte-index-based substring [max(i, 0), min(#bytesize, i + 1)], where i = INDEX if INDEX ≥ 0, i = #bytesize - abs(INDEX) otherwise, inheriting any taint and untrust, or nil if this substring is empty.

    Parameters:

    • index (#to_int)

    Returns:

    • (U::String, nil)

      The byte-index-based substring [max(i, 0), min(#bytesize, i + 1)], where i = INDEX if INDEX ≥ 0, i = #bytesize - abs(INDEX) otherwise, inheriting any taint and untrust, or nil if this substring is empty

  • #byteslice(index, length) ⇒ U::String?

    Returns the byte-index-based substring [max(i, 0), min(#bytesize, i + LENGTH)], where i = INDEX if INDEX ≥ 0, i = #bytesize - abs(INDEX) otherwise, inheriting any taint and untrust, or nil if LENGTH < 0.

    Parameters:

    • index (#to_int)
    • length (#to_int)

    Returns:

    • (U::String, nil)

      The byte-index-based substring [max(i, 0), min(#bytesize, i + LENGTH)], where i = INDEX if INDEX ≥ 0, i = #bytesize - abs(INDEX) otherwise, inheriting any taint and untrust, or nil if LENGTH < 0

  • #byteslice(range) ⇒ U::String?

    Returns the result of #[i, j - k], where i = RANGE#begin if RANGE#begin ≥ 0, i = #bytesize - abs(RANGE#begin) otherwise, j = RANGE#end if RANGE#end ≥ 0, j = #bytesize - abs(RANGE#end) otherwise, and k = 1 if RANGE#exclude_end?, k = 0 otherwise, or nil if j - k < 0.

    Parameters:

    • range (Range)

    Returns:

    • (U::String, nil)

      The result of #[i, j - k], where i = RANGE#begin if RANGE#begin ≥ 0, i = #bytesize - abs(RANGE#begin) otherwise, j = RANGE#end if RANGE#end ≥ 0, j = #bytesize - abs(RANGE#end) otherwise, and k = 1 if RANGE#exclude_end?, k = 0 otherwise, or nil if j - k < 0

  • #byteslice(object) ⇒ nil

    Returns nil for any object that doesn’t satisfy the other cases.

    Parameters:

    • object (Object)

    Returns:

    • (nil)

      Nil for any object that doesn’t satisfy the other cases



# File 'ext/u/rb_u_string_byteslice.c', line 92

VALUE
rb_u_string_byteslice_m(int argc, VALUE *argv, VALUE self)
{
        need_m_to_n_arguments(argc, 1, 2);

        if (argc == 1)
                return rb_u_string_byteslice(self, argv[0]);

        return rb_u_string_byte_substr(self,
                                       NUM2LONG(argv[0]),
                                       NUM2LONG(argv[1]));
}

#canonical_combining_class ⇒ Fixnum

Returns the canonical combining class of the characters of the receiver.

The canonical combining class of a character is a number in the range [0, 254]. The canonical combining class is used when generating a canonical ordering of the characters in a string.

The empty string has a canonical combining class of 0.

Returns:

  • (Fixnum)

Raises:

  • (ArgumentError)

    If the receiver contains two characters belonging to different combining classes

  • (ArgumentError)

    If the receiver contains an incomplete UTF-8 sequence

  • (ArgumentError)

    If the receiver contains an invalid UTF-8 sequence



# File 'ext/u/rb_u_string_canonical_combining_class.c', line 16

VALUE
rb_u_string_canonical_combining_class(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);
        const char *p = USTRING_STR(string);
        const char *end = USTRING_END(string);
        if (p == end)
                return INT2FIX(0); /* a bare 0 would be Qfalse, not the Fixnum 0 */
        int first = u_char_canonical_combining_class(u_decode(&p, p, end));
        while (p < end) {
                int value = u_char_canonical_combining_class(u_decode(&p, p, end));
                if (value != first)
                        rb_u_raise(rb_eArgError,
                                   "string consists of characters with different canonical combining class values: %d+, %d",
                                   first, value);
        }
        return INT2FIX(first);
}

#case_ignorable? ⇒ Boolean

Returns true if the receiver contains only “case ignorable” characters, that is, characters in the general categories

  • Other, format (Cf)

  • Letter, modifier (Lm)

  • Mark, enclosing (Me)

  • Mark, nonspacing (Mn)

  • Symbol, modifier (Sk)

and the characters

  • U+0027 APOSTROPHE

  • U+00AD SOFT HYPHEN

  • U+2019 RIGHT SINGLE QUOTATION MARK.

Returns:

  • (Boolean)

    True if the receiver contains only “case ignorable” characters, that is, characters in the general categories

    • Other, format (Cf)

    • Letter, modifier (Lm)

    • Mark, enclosing (Me)

    • Mark, nonspacing (Mn)

    • Symbol, modifier (Sk)

    and the characters

    • U+0027 APOSTROPHE

    • U+00AD SOFT HYPHEN

    • U+2019 RIGHT SINGLE QUOTATION MARK

See Also:



# File 'ext/u/rb_u_string_case_ignorable.c', line 21

VALUE
rb_u_string_case_ignorable(VALUE self)
{
        return _rb_u_character_test(self, u_char_iscaseignorable);
}

#casecmp(other, locale = ENV['LC_COLLATE']) ⇒ Fixnum

Returns the comparison of #foldcase to other#foldcase using the linguistically correct rules of LOCALE. This is, however, only an approximation of a case-insensitive comparison. The LOCALE must be given as a language, region, and encoding, for example, “en_US.UTF-8”.

This operation is known as “collation” and you can find more information about the collation algorithm employed in the Unicode Technical Standard #10, see unicode.org/reports/tr10/.

Parameters:

Returns:

  • (Fixnum)


# File 'ext/u/rb_u_string_casecmp.c', line 31

VALUE
rb_u_string_casecmp(int argc, VALUE *argv, VALUE self)
{
        const char *locale = NULL;

        VALUE rbother, rblocale;
        if (rb_scan_args(argc, argv, "11", &rbother, &rblocale) == 2)
                locale = StringValuePtr(rblocale);

        const struct rb_u_string *string = RVAL2USTRING(self);
        const struct rb_u_string *other = RVAL2USTRING_ANY(rbother);

        char *folded;
        size_t folded_n = foldcase(&folded, string, locale, NULL);

        char *folded_other;
        size_t folded_other_n = foldcase(&folded_other, other, locale, folded);

        errno = 0;
        int r = u_collate(folded, folded_n,
                          folded_other, folded_other_n,
                          locale);

        free(folded_other);
        free(folded);

        if (errno != 0)
                rb_u_raise_errno(errno, "can’t collate strings");

        return INT2FIX(r);
}

#cased? ⇒ Boolean

Returns true if the receiver only contains characters in the general categories

  • Letter, uppercase (Lu)

  • Letter, lowercase (Ll)

  • Letter, titlecase (Lt)

or that have the derived properties Other_Uppercase or Other_Lowercase.

Returns:

  • (Boolean)

    True if the receiver only contains characters in the general categories

    • Letter, uppercase (Lu)

    • Letter, lowercase (Ll)

    • Letter, titlecase (Lt)

    or that have the derived properties Other_Uppercase or Other_Lowercase



# File 'ext/u/rb_u_string_cased.c', line 13

VALUE
rb_u_string_cased(VALUE self)
{
        return _rb_u_character_test(self, u_char_iscased);
}

#center(width, padding = ' ') ⇒ U::String

Returns the receiver padded as evenly as possible on both sides with PADDING to make it max(#length, WIDTH) wide, inheriting any taint and untrust from the receiver and also from PADDING if PADDING is used.

Parameters:

Returns:

  • (U::String)

    The receiver padded as evenly as possible on both sides with PADDING to make it max(#length, WIDTH) wide, inheriting any taint and untrust from the receiver and also from PADDING if PADDING is used

Raises:

  • (ArgumentError)

    If PADDING#width = 0

  • (ArgumentError)

    If characters inside PADDING that should be used for round-off padding are too wide

See Also:



# File 'ext/u/rb_u_string_justify.c', line 131

VALUE
rb_u_string_center(int argc, VALUE *argv, VALUE self)
{
        return rb_u_string_justify(argc, argv, self, 'c');
}

#chars ⇒ Array<U::String>

Returns the characters of the receiver, each inheriting any taint and untrust.

Returns:

  • (Array<U::String>)

    The characters of the receiver, each inheriting any taint and untrust.



# File 'ext/u/rb_u_string_each_char.c', line 43

VALUE
rb_u_string_chars(VALUE self)
{
        struct yield_array y = YIELD_ARRAY_INIT;
        each(self, &y.yield);
        return y.array;
}

#chomp(separator = $/) ⇒ U::String, ...

Returns the receiver, minus any SEPARATOR suffix, inheriting any taint and untrust, unless #length = 0, in which case nil is returned. If SEPARATOR is nil or invalidly encoded, the receiver is returned.

If SEPARATOR is $/ and $/ has its default value, or if SEPARATOR is U+000A LINE FEED, the longest suffix consisting of any of

  • U+000A LINE FEED

  • U+000D CARRIAGE RETURN

  • U+000D CARRIAGE RETURN, U+000A LINE FEED

will be removed. If no such suffix exists and the last character is a #newline?, it will be removed instead.

If SEPARATOR is #empty?, remove the longest #newline? suffix.

Parameters:

Returns:

See Also:



# File 'ext/u/rb_u_string_chomp.c', line 65

VALUE
rb_u_string_chomp(int argc, VALUE *argv, VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        long length = USTRING_LENGTH(string);
        if (length == 0)
                return Qnil;

        VALUE rs;
        if (argc == 0) {
                rs = rb_rs;
                if (rs == rb_default_rs)
                        return rb_u_string_chomp_default(self);
        } else {
                rb_scan_args(argc, argv, "01", &rs);
        }
        if (NIL_P(rs))
                return self;

        const struct rb_u_string *separator = RVAL2USTRING_ANY(rs);

        long separator_length = USTRING_LENGTH(separator);
        if (separator_length == 0)
                return rb_u_string_chomp_newlines(self);

        if (separator_length > length)
                return self;

        char last_char = USTRING_STR(separator)[separator_length - 1];
        if (separator_length == 1 && last_char == '\n')
                return rb_u_string_chomp_default(self);

        if (!u_valid(USTRING_STR(separator), separator_length, NULL) ||
            USTRING_STR(string)[length - 1] != last_char ||
            (separator_length > 1 &&
             rb_memcmp(USTRING_STR(separator),
                       USTRING_END(string) - separator_length,
                       separator_length) != 0))
                return self;

        return rb_u_string_new_c(self, USTRING_STR(string), length - separator_length);
}
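
The default-separator case described above can be sketched on plain bytes as follows. `chomp_default_length` is a hypothetical helper, not the library's `rb_u_string_chomp_default`. It assumes that at most one line ending is removed, as Ruby's own String#chomp does, and it leaves out the fallback for other #newline? characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the n bytes at s once a single
 * trailing CR LF, LF, or CR has been removed. */
size_t
chomp_default_length(const char *s, size_t n)
{
        assert(s != NULL || n == 0);
        /* Prefer the two-byte CR LF so that it is removed as a unit. */
        if (n >= 2 && s[n - 2] == '\r' && s[n - 1] == '\n')
                return n - 2;
        if (n >= 1 && (s[n - 1] == '\n' || s[n - 1] == '\r'))
                return n - 1;
        return n;
}
```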

#chop ⇒ U::String

Returns the receiver, minus its last character, inheriting any taint and untrust, unless the receiver is #empty? or if the last character is invalidly encoded, in which case the receiver is returned.

If the last character is U+000A LINE FEED and the second-to-last character is U+000D CARRIAGE RETURN, both characters are removed.

Returns:

See Also:



# File 'ext/u/rb_u_string_chop.c', line 15

VALUE
rb_u_string_chop(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        if (USTRING_LENGTH(string) == 0)
                return self;

        const char *begin = USTRING_STR(string);
        const char *end = USTRING_END(string);

        const char *last;
        uint32_t c = u_decode_r(&last, begin, end);
        if (c == '\n')
                /* Guard against reading before the start of a lone "\n". */
                if (last > begin && *(last - 1) == '\r')
                        last--;

        return rb_u_string_new_c(self, begin, last - begin);
}
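
Finding where the last character starts is the core of this method. For valid UTF-8 it can be done by stepping back over continuation bytes, as in the sketch below. `chop_length` is a hypothetical helper and does not reproduce the validation that `u_decode_r` performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the n bytes of valid UTF-8 at
 * s once the last character, or a final CR LF pair, has been removed. */
size_t
chop_length(const char *s, size_t n)
{
        assert(s != NULL || n == 0);
        if (n == 0)
                return 0;
        size_t last = n - 1;
        /* Step back over continuation bytes (10xxxxxx) to the lead byte. */
        while (last > 0 && ((unsigned char)s[last] & 0xC0) == 0x80)
                last--;
        /* Remove CR LF as a unit; last > 0 guards the read of s[last - 1]. */
        if (s[last] == '\n' && last > 0 && s[last - 1] == '\r')
                last--;
        return last;
}
```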

#chr ⇒ U::String

Returns the substring [0, min(#length, 1)], inheriting any taint and untrust.

Returns:

  • (U::String)

    The substring [0, min(#length, 1)], inheriting any taint and untrust



# File 'ext/u/rb_u_string_chr.c', line 5

VALUE
rb_u_string_chr(VALUE self)
{
        return rb_u_string_substr(self, 0, 1);
}

#cntrl? ⇒ Boolean

Returns true if the receiver contains only characters in the general category Other, control (Cc).

Returns:

  • (Boolean)

    True if the receiver contains only characters in the general category Other, control (Cc)



# File 'ext/u/rb_u_string_cntrl.c', line 6

VALUE
rb_u_string_cntrl(VALUE self)
{
        return _rb_u_character_test(self, u_char_iscntrl);
}

#codepoints ⇒ Array<Integer>

Returns the code points of the receiver.

Returns:

  • (Array<Integer>)

    The code points of the receiver.



# File 'ext/u/rb_u_string_each_codepoint.c', line 39

VALUE
rb_u_string_codepoints(VALUE self)
{
        struct yield_array y = YIELD_ARRAY_INIT;
        each(self, &y.yield);
        return y.array;
}

#collation_key(locale = ENV['LC_COLLATE']) ⇒ U::String

Note:

Use the collation key when comparing U::Strings to each other repeatedly, as occurs when, for example, sorting a list of U::Strings.

Note:

The LOCALE must be given as a language, region, and encoding, for example, “en_US.UTF-8”.

Returns the locale-dependent collation key of the receiver in LOCALE, inheriting any taint and untrust.

Returns:

  • (U::String)

    The locale-dependent collation key of the receiver in LOCALE, inheriting any taint and untrust

Raises:

  • (Errno::EILSEQ)

    If a character in the receiver can’t be converted into the encoding of the locale



# File 'ext/u/rb_u_string_collation_key.c', line 14

VALUE
rb_u_string_collation_key(int argc, VALUE *argv, VALUE self)
{
        return _rb_u_string_convert_locale(argc, argv, self, u_collation_key, "LC_COLLATE");
}

#count(set, *sets) ⇒ Integer

Returns the number of characters in the receiver that are included in the intersection of SET and any additional SETS of characters.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.

Parameters:

Returns:

  • (Integer)


# File 'ext/u/rb_u_string_count.c', line 19

VALUE
rb_u_string_count(int argc, VALUE *argv, VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        need_at_least_n_arguments(argc, 1);

        if (USTRING_LENGTH(string) == 0)
                return INT2FIX(0);

        struct tr_table table;
        tr_table_initialize_from_strings(&table, argc, argv);

        long count = 0;
        for (const char *p = USTRING_STR(string), *end = USTRING_END(string); p < end; )
                if (tr_table_lookup(&table, u_decode(&p, p, end)))
                        count++;

        return LONG2NUM(count);
}
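
The set notation (a leading ‘^’ for the complement, a-b for ranges, and intersection across several sets) can be illustrated with a lookup table. The sketch below is restricted to single-byte characters for brevity, whereas the library's `tr_table` works on Unicode code points; `parse_set` and `count_in_sets` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Marks in table[] the bytes selected by one set specification. */
static void
parse_set(const char *set, bool table[256])
{
        size_t n = strlen(set);
        /* '^' only negates when something follows it. */
        bool negate = n > 1 && set[0] == '^';
        size_t i = negate ? 1 : 0;

        memset(table, 0, 256 * sizeof(bool));
        while (i < n) {
                unsigned char a = (unsigned char)set[i];
                if (i + 2 < n && set[i + 1] == '-') {
                        /* Range a-b: select every byte from a through b. */
                        unsigned char b = (unsigned char)set[i + 2];
                        for (unsigned c = a; c <= b; c++)
                                table[c] = true;
                        i += 3;
                } else {
                        table[a] = true;
                        i++;
                }
        }
        if (negate)
                for (int c = 0; c < 256; c++)
                        table[c] = !table[c];
}

/* Counts the bytes of s that are selected by every one of the sets. */
long
count_in_sets(const char *s, size_t n, const char * const *sets, size_t nsets)
{
        assert(s != NULL && sets != NULL && nsets >= 1);
        bool selected[256];
        for (int c = 0; c < 256; c++)
                selected[c] = true;
        for (size_t k = 0; k < nsets; k++) {
                bool table[256];
                parse_set(sets[k], table);
                for (int c = 0; c < 256; c++)
                        selected[c] = selected[c] && table[c];
        }
        long count = 0;
        for (size_t i = 0; i < n; i++)
                if (selected[(unsigned char)s[i]])
                        count++;
        return count;
}
```

With the input "hello world", the set "lo" selects five characters, while the sets "lo" and "o" together select only the two occurrences of "o".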

#defined? ⇒ Boolean

Returns true if the receiver contains only characters not in the general categories Other, not assigned (Cn) and Other, surrogate (Cs).

Returns:

  • (Boolean)

    True if the receiver contains only characters not in the general categories Other, not assigned (Cn) and Other, surrogate (Cs)



# File 'ext/u/rb_u_string_defined.c', line 6

VALUE
rb_u_string_defined(VALUE self)
{
        return _rb_u_character_test(self, u_char_isdefined);
}

#delete(set, *sets) ⇒ U::String

Returns the receiver, minus any characters that are included in the intersection of SET and any additional SETS of characters, inheriting any taint and untrust.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.

Parameters:

Returns:



# File 'ext/u/rb_u_string_delete.c', line 40

VALUE
rb_u_string_delete(int argc, VALUE *argv, VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        need_at_least_n_arguments(argc, 1);

        if (USTRING_LENGTH(string) == 0)
                return self;

        struct tr_table table;
        tr_table_initialize_from_strings(&table, argc, argv);

        long count = rb_u_string_delete_loop(string, &table, NULL);
        if (count == 0)
                return self;

        char *remaining = ALLOC_N(char, count + 1);
        rb_u_string_delete_loop(string, &table, remaining);
        remaining[count] = '\0';

        return rb_u_string_new_c_own(self, remaining, count);
}
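
The listing calls the same loop twice, first with a NULL buffer to measure the result and then with an allocated buffer to fill it. A minimal sketch of that count-then-fill pattern follows. It assumes that the unwanted characters are given as a plain NUL-terminated string of single bytes; `delete_loop` and `delete_bytes` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copies the bytes of s that do not occur in `unwanted` to out, or only
 * counts them when out is NULL.  Returns the number of bytes kept. */
static size_t
delete_loop(const char *s, size_t n, const char *unwanted, char *out)
{
        size_t kept = 0;
        for (size_t i = 0; i < n; i++) {
                /* strchr() also matches the terminator, so exclude NUL. */
                if (s[i] != '\0' && strchr(unwanted, s[i]) != NULL)
                        continue;
                if (out != NULL)
                        out[kept] = s[i];
                kept++;
        }
        return kept;
}

/* Returns a NUL-terminated malloc()ed copy of s without the unwanted bytes
 * and stores its length in *out_length, or returns NULL if malloc() fails. */
char *
delete_bytes(const char *s, size_t n, const char *unwanted, size_t *out_length)
{
        assert(s != NULL && unwanted != NULL && out_length != NULL);
        size_t kept = delete_loop(s, n, unwanted, NULL); /* pass 1: measure */
        char *remaining = malloc(kept + 1);
        if (remaining == NULL)
                return NULL;
        delete_loop(s, n, unwanted, remaining);          /* pass 2: fill */
        remaining[kept] = '\0';
        *out_length = kept;
        return remaining;
}
```

Measuring first lets the buffer be allocated at its exact final size, at the cost of scanning the input twice.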

#digit? ⇒ Boolean

Returns true if the receiver contains only characters in the general category Number, decimal digit (Nd).

Returns:

  • (Boolean)

    True if the receiver contains only characters in the general category Number, decimal digit (Nd)



# File 'ext/u/rb_u_string_digit.c', line 6

VALUE
rb_u_string_digit(VALUE self)
{
        return _rb_u_character_test(self, u_char_isdigit);
}

#downcase(locale = ENV['LC_CTYPE']) ⇒ U::String

Returns the downcasing of the receiver according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, inheriting any taint and untrust.

Parameters:

  • locale (#to_str) (defaults to: ENV['LC_CTYPE'])

Returns:

  • (U::String)

    The downcasing of the receiver according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, inheriting any taint and untrust



# File 'ext/u/rb_u_string_downcase.c', line 9

VALUE
rb_u_string_downcase(int argc, VALUE *argv, VALUE self)
{
        return _rb_u_string_convert_locale(argc, argv, self, u_downcase, NULL);
}

#dump ⇒ U::String

Returns the receiver in a reader-friendly format, inheriting any taint and untrust.

The reader-friendly format looks like “"…".u”. Inside the “…”, any #print? characters in the ASCII range are output as-is, except that the following special characters are escaped:

  Character                      Dumped Sequence
  U+0022 QUOTATION MARK          \"
  U+005C REVERSE SOLIDUS         \\
  U+000A LINE FEED (LF)          \n
  U+000D CARRIAGE RETURN (CR)    \r
  U+0009 CHARACTER TABULATION    \t
  U+000C FORM FEED (FF)          \f
  U+000B LINE TABULATION         \v
  U+0008 BACKSPACE               \b
  U+0007 BELL                    \a
  U+001B ESCAPE                  \e

The following special sequences are also escaped:

  Sequence    Dumped Sequence
  #$          \#$
  #@          \#@
  #{          \#{

Any other valid UTF-8 byte sequence is output as “\u{n}”, where n is the lowercase hexadecimal representation of the code point encoded by the UTF-8 sequence, and any other byte is output as “\xn”, where n is the two-digit uppercase hexadecimal representation of the byte’s value.

Returns:



# File 'ext/u/rb_u_string_dump.c', line 125

VALUE
rb_u_string_dump(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);
        const char *p = USTRING_STR(string);
        const char *end = USTRING_END(string);

        VALUE buffer = rb_u_buffer_new_sized(7);

        rb_u_buffer_append(buffer, "\"", 1);
        while (p < end) {
                unsigned char c = *p;

                if (!rb_u_string_dump_escape(buffer, c) &&
                    !rb_u_string_dump_hash(buffer, c, p, end) &&
                    !rb_u_string_dump_ascii_printable(buffer, c) &&
                    !rb_u_string_dump_codepoint(buffer, &p, end))
                        rb_u_string_dump_hex(buffer, c);

                p++;
        }
        rb_u_buffer_append(buffer, "\".u", 3);

        VALUE result = rb_u_buffer_to_u_bang(buffer);

        OBJ_INFECT(result, self);

        return result;
}

#each_byte {|byte| ... } ⇒ self
#each_byte ⇒ Enumerator

Overloads:

  • #each_byte {|byte| ... } ⇒ self

    Enumerates the bytes in the receiver.

    Yield Parameters:

    • byte (Fixnum)

    Returns:

    • (self)
  • #each_byte ⇒ Enumerator

    Returns an Enumerator over the bytes in the receiver.

    Returns:

    • (Enumerator)

      An Enumerator over the bytes in the receiver



# File 'ext/u/rb_u_string_each_byte.c', line 30

VALUE
rb_u_string_each_byte(VALUE self)
{
        RETURN_SIZED_ENUMERATOR(self, 0, NULL, size);
        struct yield y = YIELD_INIT;
        each(self, &y);
        return self;
}

#each_char {|char| ... } ⇒ self #each_char ⇒ Enumerator

Overloads:

  • #each_char {|char| ... } ⇒ self

    Enumerates the characters in the receiver, each inheriting any taint and untrust.

    Yield Parameters:

    Returns:

    • (self)
  • #each_char ⇒ Enumerator

    Returns an Enumerator over the characters in the receiver.

    Returns:

    • (Enumerator)

      An Enumerator over the characters in the receiver



# File 'ext/u/rb_u_string_each_char.c', line 32

VALUE
rb_u_string_each_char(VALUE self)
{
        RETURN_SIZED_ENUMERATOR(self, 0, NULL, size);
        struct yield y = YIELD_INIT;
        each(self, &y);
        return self;
}

#each_codepoint {|codepoint| ... } ⇒ self #each_codepoint ⇒ Enumerator

Overloads:

  • #each_codepoint {|codepoint| ... } ⇒ self

    Enumerates the code points of the receiver.

    Yield Parameters:

    • codepoint (Integer)

    Returns:

    • (self)
  • #each_codepoint ⇒ Enumerator

    Returns an Enumerator over the code points of the receiver.

    Returns:

    • (Enumerator)

      An Enumerator over the code points of the receiver



# File 'ext/u/rb_u_string_each_codepoint.c', line 29

VALUE
rb_u_string_each_codepoint(VALUE self)
{
        RETURN_SIZED_ENUMERATOR(self, 0, NULL, size);
        struct yield y = YIELD_INIT;
        each(self, &y);
        return self;
}
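
The byte, character, and code point enumerators documented above differ only in the unit they yield. As a rough illustration, assuming U::String follows the built-in String for these methods, the sketch below uses a plain String as a stand-in so that it runs without the extension:

```ruby
text = "n\u00E9" # "né"; U+00E9 occupies two bytes in UTF-8

bytes      = text.each_byte.to_a      # one Integer per byte
chars      = text.each_char.to_a      # one single-character string per character
codepoints = text.each_codepoint.to_a # one Integer per code point

# With a block, each of these methods returns the receiver itself.
returned = text.each_char { |char| char }
```

The same two-character string therefore produces three bytes but only two characters and two code points.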

#each_grapheme_cluster {|cluster| ... } ⇒ self #each_grapheme_cluster ⇒ Enumerator Also known as: grapheme_clusters

Overloads:



# File 'ext/u/rb_u_string_each_grapheme_cluster.c', line 25

VALUE
rb_u_string_each_grapheme_cluster(VALUE self)
{
        RETURN_ENUMERATOR(self, 0, NULL);

        const struct rb_u_string *string = RVAL2USTRING(self);
        const char *p = USTRING_STR(string);
        const char *end = USTRING_END(string);
        size_t length = end - p;
        u_grapheme_clusters(p, length, (u_substring_fn)each, &self);
        return self;
}
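
A grapheme cluster can span several code points. The following sketch uses the built-in String#each_grapheme_cluster (available in Ruby 2.5 and later) as a stand-in, on the assumption that U::String segments this simple case the same way:

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x"
text = "e\u0301x"

clusters = text.each_grapheme_cluster.to_a

# Three code points, but only two user-perceived characters.
code_point_count = text.length
cluster_count    = clusters.length
```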

#each_line(separator = $/) {|lp| ... } ⇒ self #each_line(separator = $/) ⇒ Enumerator

Overloads:

  • #each_line(separator = $/) {|lp| ... } ⇒ self

    Enumerates the lines of the receiver, inheriting any taint and untrust.

    If SEPARATOR is nil, yields self. If SEPARATOR is #empty?, separates each line (paragraph) by two or more U+000A LINE FEED characters.

    Parameters:

    Yield Parameters:

    Returns:

    • (self)
  • #each_line(separator = $/) ⇒ Enumerator

    Returns an Enumerator over the lines of the receiver.

    If SEPARATOR is nil, self will be yielded. If SEPARATOR is #empty?, separates each line (paragraph) by two or more U+000A LINE FEED characters.

    Parameters:

    Returns:

    • (Enumerator)


# File 'ext/u/rb_u_string_each_line.c', line 118

VALUE
rb_u_string_each_line(int argc, VALUE *argv, VALUE self)
{
        RETURN_ENUMERATOR(self, argc, argv);
        struct yield y = YIELD_INIT;
        each(argc, argv, self, &y);
        return self;
}
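
The three separator modes described above (a regular separator, nil, and the empty string) can be illustrated with the built-in String#each_line. This is only a stand-in, assuming U::String splits lines the same way as the built-in class:

```ruby
text = "one\ntwo\n\nthree\n"

lines      = text.each_line("\n").to_a # split after every line feed
paragraphs = text.each_line('').to_a   # paragraph mode: split on runs of line feeds
whole      = text.each_line(nil).to_a  # nil separator: the whole text at once
```

In paragraph mode the first element keeps the blank line that terminates it, so `paragraphs` holds two elements rather than four.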

#each_word {|word| ... } ⇒ self #each_word ⇒ Enumerator Also known as: words

Overloads:



# File 'ext/u/rb_u_string_each_word.c', line 24

VALUE
rb_u_string_each_word(VALUE self)
{
        RETURN_ENUMERATOR(self, 0, NULL);

        const struct rb_u_string *string = RVAL2USTRING(self);
        const char *p = USTRING_STR(string);
        size_t length = USTRING_LENGTH(string);
        u_words(p, length, (u_substring_fn)each, &self);
        return self;
}

#empty? ⇒ Boolean

Returns true if #bytesize = 0.

Returns:



# File 'ext/u/rb_u_string_empty.c', line 5

VALUE
rb_u_string_empty(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        return (USTRING_LENGTH(string) == 0) ? Qtrue : Qfalse;
}

#end_with?(*suffixes) ⇒ Boolean

Returns true if any element of SUFFIXES that responds to #to_str is a byte-level suffix of the receiver.

Parameters:

  • suffixes (Array)

Returns:

  • (Boolean)

    True if any element of SUFFIXES that responds to #to_str is a byte-level suffix of the receiver



# File 'ext/u/rb_u_string_end_with.c', line 7

VALUE
rb_u_string_end_with(int argc, VALUE *argv, VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);
        const char *end = USTRING_END(string);
        long p_length = USTRING_LENGTH(string);

        for (int i = 0; i < argc; i++) {
                VALUE tmp = rb_u_string_check_type(argv[i]);
                if (NIL_P(tmp))
                        continue;

                const struct rb_u_string *other = RVAL2USTRING_ANY(tmp);
                const char *q = USTRING_STR(other);
                long q_length = USTRING_LENGTH(other);

                if (p_length < q_length)
                        continue;

                if (memcmp(end - q_length, q, q_length) == 0)
                        return Qtrue;
        }

        return Qfalse;
}
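
“Byte-level suffix” means the comparison ignores character boundaries, as the `memcmp` call in the listing above suggests. A minimal sketch of that logic in plain Ruby, where `byte_suffix?` is a hypothetical helper and not part of the library:

```ruby
# Hypothetical helper mirroring the loop above: skip arguments that are not
# string-like, skip suffixes longer than the receiver, compare trailing bytes.
def byte_suffix?(string, *suffixes)
  bytes = string.bytes
  suffixes.any? do |suffix|
    next false unless suffix.respond_to?(:to_str)

    tail = suffix.to_str.bytes
    tail.length <= bytes.length && bytes.last(tail.length) == tail
  end
end

ends_with_llo   = byte_suffix?("h\u00E9llo", 'x', 'llo')
# The last byte of "é" (0xC3 0xA9) is 0xA9, so a lone 0xA9 byte counts.
ends_with_byte  = byte_suffix?("\u00E9", "\xA9")
ignores_integer = byte_suffix?('abc', 42, 'abcd')
```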

#eql?(other) ⇒ Boolean

Returns true if the receiver’s bytes equal those of OTHER.

Parameters:

Returns:

  • (Boolean)

    True if the receiver’s bytes equal those of OTHER

See Also:



# File 'ext/u/rb_u_string_eql.c', line 8

VALUE
rb_u_string_eql(VALUE self, VALUE rbother)
{
        if (self == rbother)
                return Qtrue;

        if (!RTEST(rb_obj_is_kind_of(rbother, rb_cUString)))
                return Qfalse;

        const struct rb_u_string *string = RVAL2USTRING(self);
        const struct rb_u_string *other = RVAL2USTRING(rbother);

        const char *p = USTRING_STR(string);
        const char *q = USTRING_STR(other);

        if (p == q)
                return Qtrue;

        long p_length = USTRING_LENGTH(string);
        long q_length = USTRING_LENGTH(other);

        return p_length == q_length && memcmp(p, q, q_length) == 0 ? Qtrue : Qfalse;
}
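
Because the comparison is made on bytes, two strings that render identically may still be unequal. The sketch below illustrates this with built-in Strings and String#unicode_normalize, which are stand-ins here and not the library's own API:

```ruby
composed   = "\u00E9"  # "é" as a single code point
decomposed = "e\u0301" # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Byte-for-byte comparison, as in the listing above.
bytewise_equal = composed.bytes == decomposed.bytes

# Comparison after bringing both sides to the same normalization form.
canonically_equal =
  composed.unicode_normalize(:nfd) == decomposed.unicode_normalize(:nfd)
```

Normalizing both operands before comparing is one way to obtain equality for canonically equivalent text.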

#foldcase(locale = ENV['LC_CTYPE']) ⇒ U::String

Returns the case-folding of the receiver according to the rules of the language of LOCALE, which may be empty to specifically use the default rules, inheriting any taint and untrust.

Parameters:

  • locale (#to_str) (defaults to: ENV['LC_CTYPE'])

Returns:

  • (U::String)

    The case-folding of the receiver according to the rules of the language of LOCALE, which may be empty to specifically use the default rules, inheriting any taint and untrust



# File 'ext/u/rb_u_string_foldcase.c', line 8

VALUE
rb_u_string_foldcase(int argc, VALUE *argv, VALUE self)
{
        return _rb_u_string_convert_locale(argc, argv, self, u_foldcase, NULL);
}

#folded?(locale = ENV['LC_CTYPE']) ⇒ Boolean

Returns true if the receiver has been case-folded according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, that is, if a = a#foldcase(LOCALE), where a = #normalize(:nfd).

Parameters:

  • locale (#to_str) (defaults to: ENV['LC_CTYPE'])

Returns:

  • (Boolean)

    True if the receiver has been case-folded according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, that is, if a = a#foldcase(LOCALE), where a = #normalize(:nfd)



# File 'ext/u/rb_u_string_folded.c', line 9

VALUE
rb_u_string_folded(int argc, VALUE *argv, VALUE self)
{
        return _rb_u_string_test_locale(argc, argv, self, u_foldcase);
}
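
The definition above (normalize to NFD, fold, and compare) can be sketched with the built-in String methods. `sketch_folded?` is a hypothetical helper; it ignores LOCALE and uses Ruby's `downcase(:fold)` (Ruby 2.4 and later) as a stand-in for #foldcase with the default rules:

```ruby
# Hypothetical helper: true if folding the NFD form of the string changes nothing.
def sketch_folded?(string)
  a = string.unicode_normalize(:nfd)
  a == a.downcase(:fold)
end

already_folded = sketch_folded?('strasse')
has_uppercase  = sketch_folded?('Strasse')
has_sharp_s    = sketch_folded?("stra\u00DFe") # "ß" folds to "ss"
```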

#general_category ⇒ Symbol

Returns the general category of the characters of the receiver.

The general category identifies what kind of symbol the character is.

<table>

<thead>
  <tr>
    <th>Category Major, minor</th>
    <th>Unicode Value</th>
    <th>Ruby Value</th>
  </tr>
</thead>
<tbody>
  <tr><td>Other, control</td><td>Cc</td><td>:other_control</td></tr>
  <tr><td>Other, format</td><td>Cf</td><td>:other_format</td></tr>
  <tr><td>Other, not assigned</td><td>Cn</td><td>:other_not_assigned</td></tr>
  <tr><td>Other, private use</td><td>Co</td><td>:other_private_use</td></tr>
  <tr><td>Other, surrogate</td><td>Cs</td><td>:other_surrogate</td></tr>
  <tr><td>Letter, lowercase</td><td>Ll</td><td>:letter_lowercase</td></tr>
  <tr><td>Letter, modifier</td><td>Lm</td><td>:letter_modifier</td></tr>
  <tr><td>Letter, other</td><td>Lo</td><td>:letter_other</td></tr>
  <tr><td>Letter, titlecase</td><td>Lt</td><td>:letter_titlecase</td></tr>
  <tr><td>Letter, uppercase</td><td>Lu</td><td>:letter_uppercase</td></tr>
  <tr><td>Mark, spacing combining</td><td>Mc</td><td>:mark_spacing_combining</td></tr>
  <tr><td>Mark, enclosing</td><td>Me</td><td>:mark_enclosing</td></tr>
  <tr><td>Mark, nonspacing</td><td>Mn</td><td>:mark_non_spacing</td></tr>
  <tr><td>Number, decimal digit</td><td>Nd</td><td>:number_decimal</td></tr>
  <tr><td>Number, letter</td><td>Nl</td><td>:number_letter</td></tr>
  <tr><td>Number, other</td><td>No</td><td>:number_other</td></tr>
  <tr><td>Punctuation, connector</td><td>Pc</td><td>:punctuation_connector</td></tr>
  <tr><td>Punctuation, dash</td><td>Pd</td><td>:punctuation_dash</td></tr>
  <tr><td>Punctuation, close</td><td>Pe</td><td>:punctuation_close</td></tr>
  <tr><td>Punctuation, final quote</td><td>Pf</td><td>:punctuation_final_quote</td></tr>
  <tr><td>Punctuation, initial quote</td><td>Pi</td><td>:punctuation_initial_quote</td></tr>
  <tr><td>Punctuation, other</td><td>Po</td><td>:punctuation_other</td></tr>
  <tr><td>Punctuation, open</td><td>Ps</td><td>:punctuation_open</td></tr>
  <tr><td>Symbol, currency</td><td>Sc</td><td>:symbol_currency</td></tr>
  <tr><td>Symbol, modifier</td><td>Sk</td><td>:symbol_modifier</td></tr>
  <tr><td>Symbol, math</td><td>Sm</td><td>:symbol_math</td></tr>
  <tr><td>Symbol, other</td><td>So</td><td>:symbol_other</td></tr>
  <tr><td>Separator, line</td><td>Zl</td><td>:separator_line</td></tr>
  <tr><td>Separator, paragraph</td><td>Zp</td><td>:separator_paragraph</td></tr>
  <tr><td>Separator, space</td><td>Zs</td><td>:separator_space</td></tr>
</tbody>

</table>

Returns:

  • (Symbol)

Raises:

  • (ArgumentError)

    If the receiver contains two characters belonging to different general categories

  • (ArgumentError)

    If the receiver contains an incomplete UTF-8 sequence

  • (ArgumentError)

    If the receiver contains an invalid UTF-8 sequence

See Also:



# File 'ext/u/rb_u_string_general_category.c', line 103

VALUE
rb_u_string_general_category(VALUE self)
{
        return _rb_u_string_property(self, "general category", U_GENERAL_CATEGORY_OTHER_NOT_ASSIGNED,
                                     (int (*)(uint32_t))u_char_general_category,
                                     (VALUE (*)(int))category_to_symbol);
}
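
To make the table and the “different general categories” error concrete, here is a minimal sketch that derives the category from Ruby's regular-expression Unicode properties. `SKETCH_CATEGORIES`, `SKETCH_CATEGORY_PATTERNS`, and `sketch_general_category` are hypothetical names, and returning nil for an empty string is an assumption of this sketch, not documented behavior:

```ruby
# Unicode abbreviation => Ruby symbol, copied from the table above.
SKETCH_CATEGORIES = {
  'Cc' => :other_control, 'Cf' => :other_format, 'Cn' => :other_not_assigned,
  'Co' => :other_private_use, 'Cs' => :other_surrogate,
  'Ll' => :letter_lowercase, 'Lm' => :letter_modifier, 'Lo' => :letter_other,
  'Lt' => :letter_titlecase, 'Lu' => :letter_uppercase,
  'Mc' => :mark_spacing_combining, 'Me' => :mark_enclosing, 'Mn' => :mark_non_spacing,
  'Nd' => :number_decimal, 'Nl' => :number_letter, 'No' => :number_other,
  'Pc' => :punctuation_connector, 'Pd' => :punctuation_dash,
  'Pe' => :punctuation_close, 'Pf' => :punctuation_final_quote,
  'Pi' => :punctuation_initial_quote, 'Po' => :punctuation_other,
  'Ps' => :punctuation_open,
  'Sc' => :symbol_currency, 'Sk' => :symbol_modifier, 'Sm' => :symbol_math,
  'So' => :symbol_other,
  'Zl' => :separator_line, 'Zp' => :separator_paragraph, 'Zs' => :separator_space
}.freeze

# One anchored pattern per category, e.g. /\A\p{Lu}\z/ for :letter_uppercase.
SKETCH_CATEGORY_PATTERNS = SKETCH_CATEGORIES.map do |abbr, symbol|
  [symbol, Regexp.new("\\A\\p{#{abbr}}\\z")]
end.to_h.freeze

# Hypothetical helper: the single category shared by all characters.
def sketch_general_category(string)
  categories = string.each_char.map { |char|
    SKETCH_CATEGORY_PATTERNS.find { |_symbol, pattern| pattern.match?(char) }&.first
  }.uniq
  if categories.length > 1
    raise ArgumentError, 'characters belong to different general categories'
  end

  categories.first # nil for an empty string (assumption of this sketch)
end
```

For instance, `sketch_general_category('ABC')` yields `:letter_uppercase`, while a mixed string such as `'a1'` raises ArgumentError.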

#getbyte(index) ⇒ Fixnum?

Returns the byte at byte-index i, where i = INDEX if INDEX ≥ 0, i = #bytesize - abs(INDEX) otherwise, or nil if i lies outside of [0, #bytesize).

Parameters:

  • index (#to_int)

Returns:

  • (Fixnum, nil)

    The byte at byte-index i, where i = INDEX if INDEX ≥ 0, i = #bytesize - abs(INDEX) otherwise, or nil if i lies outside of [0, #bytesize)



# File 'ext/u/rb_u_string_getbyte.c', line 8

VALUE
rb_u_string_getbyte(VALUE self, VALUE rbindex)
{
        const struct rb_u_string *string = RVAL2USTRING(self);
        long index = NUM2LONG(rbindex);

        if (index < 0)
                index += USTRING_LENGTH(string);

        if (index < 0 || USTRING_LENGTH(string) <= index)
                return Qnil;

        return INT2FIX((unsigned char)USTRING_STR(string)[index]);
}
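
The index arithmetic in the listing above translates directly to Ruby. `sketch_getbyte` is a hypothetical helper written against a built-in String, shown only to make the handling of negative and out-of-range indexes explicit:

```ruby
# Hypothetical helper following the C listing step by step.
def sketch_getbyte(string, index)
  size = string.bytesize
  index += size if index.negative?               # count from the end
  return nil if index.negative? || index >= size # outside the valid range

  string.bytes[index]
end

text = "a\u00E9" # bytes: 0x61, 0xC3, 0xA9

first_byte  = sketch_getbyte(text, 0)
last_byte   = sketch_getbyte(text, -1)
past_end    = sketch_getbyte(text, 3)
before_zero = sketch_getbyte(text, -4)
```

Note that an index equal to the byte size already yields nil, since valid indexes stop at the byte size minus one.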

#graph? ⇒ Boolean

Returns true if the receiver contains only non-space “printable” characters.

Non-space “printable” characters are those that don’t belong to any of the following general categories:

  • Other, control (Cc)

  • Other, format (Cf)

  • Other, not assigned (Cn)

  • Other, surrogate (Cs)

  • Space, separator (Zs)

Returns:

  • (Boolean)


# File 'ext/u/rb_u_string_graph.c', line 17

VALUE
rb_u_string_graph(VALUE self)
{
        return _rb_u_character_test(self, u_char_isgraph);
}
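
The category list above maps naturally onto a regular-expression character class. The following sketch is an approximation built on Ruby's own Unicode tables, which may differ from the library's; `SKETCH_NON_GRAPHIC` and `sketch_graph?` are hypothetical names, and treating the empty string as true is an assumption:

```ruby
# Any character in one of the excluded general categories listed above.
SKETCH_NON_GRAPHIC = /[\p{Cc}\p{Cf}\p{Cn}\p{Cs}\p{Zs}]/.freeze

# Hypothetical helper: true if no character falls in an excluded category.
def sketch_graph?(string)
  !string.match?(SKETCH_NON_GRAPHIC)
end

only_graphic = sketch_graph?('abc!')
has_space    = sketch_graph?('a b')  # U+0020 is Zs
has_tab      = sketch_graph?("a\tb") # U+0009 is Cc
```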

#grapheme_break ⇒ Symbol

Returns the grapheme break property value of the characters of the receiver.

The possible break values are

  • :control

  • :cr

  • :extend

  • :l

  • :lf

  • :lv

  • :lvt

  • :other

  • :prepend

  • :regional_indicator

  • :spacingmark

  • :t

  • :v

Returns:

  • (Symbol)

Raises:

  • (ArgumentError)

    If the string consists of more than one break type

See Also:



# File 'ext/u/rb_u_string_grapheme_break.c', line 55

VALUE
rb_u_string_grapheme_break(VALUE self)
{
        return _rb_u_string_property(self, "grapheme break", U_GRAPHEME_BREAK_OTHER,
                                     (int (*)(uint32_t))u_char_grapheme_break,
                                     (VALUE (*)(int))break_to_symbol);
}

#gsub(pattern, replacement) ⇒ U::String #gsub(pattern, replacements) ⇒ U::String #gsub(pattern) {|match| ... } ⇒ U::String #gsub(pattern) ⇒ Enumerator

Overloads:

  • #gsub(pattern, replacement) ⇒ U::String

    Returns the receiver with all matches of PATTERN replaced by REPLACEMENT, inheriting any taint and untrust from the receiver and from REPLACEMENT.

    The REPLACEMENT is used as a specification for what to replace matches with:

    <table>

    <thead>
      <tr><th>Specification</th><th>Replacement</th></tr>
    </thead>
    <tbody>
      <tr>
        <td><code>\1</code>, <code>\2</code>, …, <code>\</code><em>n</em></td>
        <td>Numbered sub-match <em>n</em></td>
      </tr>
      <tr>
        <td><code>\k&lt;</code><em>name</em><code>></code></td>
        <td>Named sub-match <em>name</em></td>
      </tr>
    </tbody>
    

    </table>

    The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

    Parameters:

    Returns:

  • #gsub(pattern, replacements) ⇒ U::String

    Returns the receiver with all matches of PATTERN replaced by REPLACEMENTS#[match], where match is the matched substring, inheriting any taint and untrust from the receiver and from the REPLACEMENTS#[match]es, as well as any taint on REPLACEMENTS.

    The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

    Parameters:

    • pattern (Regexp, #to_str)
    • replacements (#to_hash)

    Returns:

    Raises:

    • (RuntimeError)

      If any replacement is the result being constructed

    • (Exception)

      Any error raised by REPLACEMENTS#default, if it gets called

  • #gsub(pattern) {|match| ... } ⇒ U::String

    Returns the receiver with all matches of PATTERN replaced by the results of the given block, inheriting any taint and untrust from the receiver and from the results of the given block.

    The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

    Parameters:

    Yield Parameters:

    Yield Returns:

    Returns:

  • #gsub(pattern) ⇒ Enumerator

    Returns an Enumerator over the matches of PATTERN in the receiver.

    The Regexp special variables $&, $', $`, $1, $2, …, $n will be updated accordingly.

    Parameters:

    Returns:

    • (Enumerator)


# File 'ext/u/rb_u_string_gsub.c', line 75

VALUE
rb_u_string_gsub(int argc, VALUE *argv, VALUE self)
{
        VALUE pattern, replacement;
        VALUE replacements = Qnil;
        bool use_block = false;
        bool tainted = false;

        if (argc == 1) {
                RETURN_ENUMERATOR(self, argc, argv);
                use_block = true;
        }

        if (rb_scan_args(argc, argv, "11", &pattern, &replacement) == 2) {
                replacements = rb_check_convert_type(replacement, T_HASH,
                                                     "Hash", "to_hash");
                if (NIL_P(replacements))
                        StringValue(replacement);
                if (OBJ_TAINTED(replacement))
                        tainted = true;
        }

        pattern = rb_u_pattern_argument(pattern, true);

        VALUE str = rb_str_to_str(self);
        long begin = rb_reg_search(pattern, str, 0, 0);
        if (begin < 0)
                return self;

        const char *base = RSTRING_PTR(str);
        const char *p = base;
        const char *end = RSTRING_END(str);
        VALUE substituted = rb_u_str_buf_new(RSTRING_LEN(str) + 30);
        do {
                VALUE match = rb_backref_get();
                struct re_registers *registers = RMATCH_REGS(match);
                VALUE result;

                if (use_block || !NIL_P(replacements)) {
                        if (use_block) {
                                VALUE ustr = rb_u_string_new_rb(rb_reg_nth_match(0, match));
                                result = rb_u_string_object_as_string(rb_yield(ustr));
                        } else {
                                VALUE ustr = rb_u_string_new_c(self,
                                                               base + registers->beg[0],
                                                               registers->end[0] - registers->beg[0]);
                                result = rb_u_string_object_as_string(rb_hash_aref(replacements, ustr));
                        }

                        if (result == substituted)
                                rb_u_raise(rb_eRuntimeError,
                                           "result of block is string being built; please try not to cheat");
                } else
                        result =
#ifdef HAVE_RB_REG_REGSUB4
                        rb_reg_regsub(replacement, str, registers, pattern);
#else
                        rb_reg_regsub(replacement, str, registers);
#endif

                if (OBJ_TAINTED(result))
                        tainted = true;

                const struct rb_u_string *value = RVAL2USTRING_ANY(result);

                rb_str_buf_cat(substituted, p, registers->beg[0] - (p - base));
                rb_str_buf_cat(substituted, USTRING_STR(value), USTRING_LENGTH(value));
                OBJ_INFECT(substituted, result);

                p = base + registers->end[0];
                if (registers->beg[0] == registers->end[0])
                        p = u_next(p);
                if (p >= end)
                        break;

                begin = rb_reg_search(pattern, str, registers->end[0], 0);
        } while (begin >= 0);

        if (p < end)
                rb_str_buf_cat(substituted, p, end - p);

        rb_reg_search(pattern, str, end - p, 0);

        RBASIC(substituted)->klass = rb_obj_class(str);
        OBJ_INFECT(substituted, str);
        if (tainted)
                OBJ_TAINT(substituted);

        return rb_u_string_new_rb(substituted);
}
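
The four overloads above correspond to the four call styles of the built-in String#gsub. The sketch below uses plain Strings as a stand-in, assuming U::String accepts the same arguments; with the library itself, the results would be U::Strings:

```ruby
name = 'John Smith'

# Replacement string with numbered and named sub-matches.
numbered = name.gsub(/(\w+) (\w+)/, '\2 \1')
named    = name.gsub(/(?<first>\w+) (?<last>\w+)/, '\k<last>, \k<first>')

# Hash of replacements, looked up by the matched substring.
mapped = 'cat'.gsub(/[ac]/, { 'a' => '4', 'c' => 'k' })

# Block: each match is replaced by the block's result.
doubled = 'a1b22'.gsub(/\d+/) { |match| (match.to_i * 2).to_s }

# No replacement and no block: an Enumerator over the matches.
matches = 'a1b22'.gsub(/\d+/).to_a
```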

#hash ⇒ Fixnum

Returns the hash value of the receiver’s content.

Returns:

  • (Fixnum)

    The hash value of the receiver’s content



# File 'ext/u/rb_u_string_hash.c', line 4

VALUE
rb_u_string_hash(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        return INT2FIX(rb_memhash(USTRING_STR(string), USTRING_LENGTH(string)));
}

#hex ⇒ Integer

Returns the result of #to_i(16).

Returns:

  • (Integer)

    The result of #to_i(16)



# File 'ext/u/rb_u_string_hex.c', line 5

VALUE
rb_u_string_hex(VALUE self)
{
        return rb_u_string_to_inum(self, 16, false);
}
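
For reference, this is how the built-in String#hex treats prefixes, signs, and invalid input. It is shown as a stand-in only; whether U::String handles every edge case identically is an assumption:

```ruby
plain    = '1f'.hex   # hexadecimal digits only
prefixed = '0x1F'.hex # an optional "0x" prefix is accepted
negative = '-ff'.hex  # an optional sign is accepted
invalid  = 'xyz'.hex  # no leading hexadecimal digits
```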

#include?(substring) ⇒ Boolean

Returns true if #index(SUBSTRING) ≠ nil.

Parameters:

Returns:

  • (Boolean)

    True if #index(SUBSTRING) ≠ nil



# File 'ext/u/rb_u_string_include.c', line 6

VALUE
rb_u_string_include(VALUE self, VALUE substring)
{
        return rb_u_string_index(self, substring, 0) != -1 ? Qtrue : Qfalse;
}

#index(pattern, offset = 0) ⇒ Integer?

Returns the minimal index of the receiver where PATTERN matches, equal to or greater than i, where i = OFFSET if OFFSET ≥ 0, i = #length - abs(OFFSET) otherwise, or nil if there is no match.

If PATTERN is a Regexp, the Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

If PATTERN responds to #to_str, the matching is performed by byte comparison.

Parameters:

  • pattern (Regexp, #to_str)
  • offset (#to_int) (defaults to: 0)

Returns:

  • (Integer, nil)

See Also:



# File 'ext/u/rb_u_string_index.c', line 70

VALUE
rb_u_string_index_m(int argc, VALUE *argv, VALUE self)
{
        VALUE sub, rboffset;
        long offset = 0;
        if (rb_scan_args(argc, argv, "11", &sub, &rboffset) == 2)
                offset = NUM2LONG(rboffset);

        const struct rb_u_string *string = RVAL2USTRING(self);

        const char *begin = rb_u_string_begin_from_offset(string, offset);
        if (begin == NULL) {
                if (TYPE(sub) == T_REGEXP)
                        rb_backref_set(Qnil);

                return Qnil;
        }

        switch (TYPE(sub)) {
        case T_REGEXP:
                offset = rb_u_string_index_regexp(self, begin, sub, false);
                break;
        default: {
                VALUE tmp = rb_check_string_type(sub);
                if (NIL_P(tmp))
                        rb_u_raise(rb_eTypeError, "type mismatch: %s given",
                                   rb_obj_classname(sub));

                sub = tmp;
        }
                /* fall through */
        case T_STRING:
                offset = rb_u_string_index(self, sub, offset);
                break;
        }

        if (offset < 0)
                return Qnil;

        return LONG2NUM(offset);
}
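
The returned index and the OFFSET are counted in characters, even though string patterns are compared byte by byte. A short illustration with the built-in String#index, used here as a stand-in on the assumption that U::String behaves the same for these inputs:

```ruby
text = "h\u00E9llo w\u00F6rld" # "héllo wörld", 11 characters, 13 bytes

char_index  = text.index('l')    # counted in characters
byte_index  = text.b.index('l')  # the same match counted in bytes
from_offset = text.index('l', 4) # search starts at character 4
from_end    = text.index('l', -2) # offset -2 means 11 - 2 = 9
missing     = text.index('z')

# A Regexp pattern also updates the match variables.
regexp_index = text.index(/w(.)/)
captured     = $1
```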

#inspect ⇒ String

Returns the receiver in a reader-friendly inspectable format, inheriting any taint and untrust, encoded using UTF-8.

The reader-friendly inspectable format looks like “"…".u”. Inside the “…”, any #print? characters are output as-is, while the following special characters are escaped according to the following table:

<table>

<thead><tr><th>Character</th><th>Dumped Sequence</th></tr></thead>
<tbody>
  <tr><td>U+0022 QUOTATION MARK</td><td><code>\"</code></td></tr>
  <tr><td>U+005C REVERSE SOLIDUS</td><td><code>\\</code></td></tr>
  <tr><td>U+000A LINE FEED (LF)</td><td><code>\n</code></td></tr>
  <tr><td>U+000D CARRIAGE RETURN (CR)</td><td><code>\r</code></td></tr>
  <tr><td>U+0009 CHARACTER TABULATION</td><td><code>\t</code></td></tr>
  <tr><td>U+000C FORM FEED (FF)</td><td><code>\f</code></td></tr>
  <tr><td>U+000B LINE TABULATION</td><td><code>\v</code></td></tr>
  <tr><td>U+0008 BACKSPACE</td><td><code>\b</code></td></tr>
  <tr><td>U+0007 BELL</td><td><code>\a</code></td></tr>
  <tr><td>U+001B ESCAPE</td><td><code>\e</code></td></tr>
</tbody>

</table>

The following special sequences are also escaped:

<table>

<thead><tr><th>Character</th><th>Dumped Sequence</th></tr></thead>
<tbody>
  <tr><td><code>#$</code></td><td><code>\#$</code></td></tr>
  <tr><td><code>#@</code></td><td><code>\#@</code></td></tr>
  <tr><td><code>#{</code></td><td><code>\#{</code></td></tr>
</tbody>

</table>

Valid UTF-8 byte sequences representing code points < 0x10000 are output as “\un”, where n is the four-digit uppercase hexadecimal representation of the code point.

Valid UTF-8 byte sequences representing code points ≥ 0x10000 are output as “\u{n}”, where n is the uppercase hexadecimal representation of the code point.

Any other byte is output as “\xn”, where n is the two-digit uppercase hexadecimal representation of the byte’s value.

Returns:



# File 'ext/u/rb_u_string_inspect.c', line 127

VALUE
rb_u_string_inspect(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        VALUE result = rb_u_str_buf_new(0);
        rb_str_buf_cat2(result, "\"");
        const char *p = USTRING_STR(string);
        const char *end = USTRING_END(string);
        while (p < end) {
                const char *q;
                uint32_t c = u_decode(&q, p, end);
                switch (c) {
                case '"':
                case '\\':
                        rb_u_string_inspect_special_char(c, result);
                        break;
                case '#':
                        p = rb_u_string_inspect_hash_char(q, end, result);
                        continue;
                case '\n':
                        rb_str_buf_cat2(result, "\\n");
                        break;
                case '\r':
                        rb_str_buf_cat2(result, "\\r");
                        break;
                case '\t':
                        rb_str_buf_cat2(result, "\\t");
                        break;
                case '\f':
                        rb_str_buf_cat2(result, "\\f");
                        break;
                case '\013':
                        rb_str_buf_cat2(result, "\\v");
                        break;
                case '\010':
                        rb_str_buf_cat2(result, "\\b");
                        break;
                case '\007':
                        rb_str_buf_cat2(result, "\\a");
                        break;
                case '\033':
                        rb_str_buf_cat2(result, "\\e");
                        break;
                case REPLACEMENT_CHARACTER:
                        if (!u_valid(p, q - p, NULL)) {
                                rb_u_string_inspect_bad_input(p, q, result);
                                break;
                        }
                        /* fall through */
                default:
                        rb_u_string_inspect_default(c, result);
                        break;
                }
                p = q;
        }

        rb_str_buf_cat2(result, "\".u");

        OBJ_INFECT(result, self);

        return result;
}

#length ⇒ Integer Also known as: size

Returns the number of characters in the receiver.

Returns:

  • (Integer)

    The number of characters in the receiver



# File 'ext/u/rb_u_string_length.c', line 4

VALUE
rb_u_string_length(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        return UINT2NUM(u_n_chars_n(USTRING_STR(string), USTRING_LENGTH(string)));
}

#line_break ⇒ Symbol

Returns the line break property value of the characters of the receiver.

The possible break values are

  • :after

  • :alphabetic

  • :ambiguous

  • :before

  • :before_and_after

  • :carriage_return

  • :close_parenthesis

  • :close_punctuation

  • :combining_mark

  • :complex_context

  • :conditional_japanese_starter

  • :contingent

  • :exclamation

  • :hangul_l_jamo

  • :hangul_lv_syllable

  • :hangul_lvt_syllable

  • :hangul_t_jamo

  • :hangul_v_jamo

  • :hebrew_letter

  • :hyphen

  • :ideographic

  • :infix_separator

  • :inseparable

  • :line_feed

  • :mandatory

  • :next_line

  • :non_breaking_glue

  • :non_starter

  • :numeric

  • :open_punctuation

  • :postfix

  • :prefix

  • :quotation

  • :regional_indicator

  • :space

  • :surrogate

  • :symbol

  • :unknown

  • :word_joiner

  • :zero_width_space

Returns:

  • (Symbol)

Raises:

  • (ArgumentError)

    If the string consists of more than one break type

See Also:



# File 'ext/u/rb_u_string_line_break.c', line 109

VALUE
rb_u_string_line_break(VALUE self)
{
        return _rb_u_string_property(self, "line break", U_LINE_BREAK_UNKNOWN,
                                     (int (*)(uint32_t))u_char_line_break,
                                     (VALUE (*)(int))break_to_symbol);
}

#lines(separator = $/) ⇒ Array<U::String>

Returns the lines of the receiver, inheriting any taint and untrust.

If SEPARATOR is nil, yields self. If SEPARATOR is #empty?, separates each line (paragraph) by two or more U+000A LINE FEED characters.

Parameters:

Returns:



# File 'ext/u/rb_u_string_each_line.c', line 136

VALUE
rb_u_string_lines(int argc, VALUE *argv, VALUE self)
{
        struct yield_array y = YIELD_ARRAY_INIT;
        each(argc, argv, self, &y.yield);
        return y.array;
}

#ljust(width, padding = ' ') ⇒ U::String

Returns the receiver padded on the right with PADDING to make it max(#length, WIDTH) wide, inheriting any taint and untrust from the receiver and also from PADDING if PADDING is used.

Parameters:

Returns:

  • (U::String)

    The receiver padded on the right with PADDING to make it max(#length, WIDTH) wide, inheriting any taint and untrust from the receiver and also from PADDING if PADDING is used

Raises:

  • (ArgumentError)

    If PADDING#width = 0

  • (ArgumentError)

    If characters inside PADDING that should be used for round-off padding are too wide

See Also:



# File 'ext/u/rb_u_string_justify.c', line 148

VALUE
rb_u_string_ljust(int argc, VALUE *argv, VALUE self)
{
        return rb_u_string_justify(argc, argv, self, 'l');
}
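
The padding behavior can be illustrated with the built-in String#ljust. This is only a stand-in: the built-in method counts characters, whereas the description above suggests U::String also accounts for character width, so the two may differ for wide characters.

```ruby
padded    = 'abc'.ljust(6, '12') # padding is repeated and cut to fit
unchanged = 'abcdef'.ljust(3)    # WIDTH smaller than the length: no padding
defaulted = 'abc'.ljust(5)       # default padding is a single space

# Empty padding is rejected by the built-in method as well.
empty_padding_rejected =
  begin
    'abc'.ljust(5, '')
    false
  rescue ArgumentError
    true
  end
```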

#lower?(locale = ENV['LC_CTYPE']) ⇒ Boolean

Returns true if the receiver has been downcased according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, that is, if a = a#downcase(LOCALE), where a = #normalize(:nfd).

Parameters:

  • locale (#to_str) (defaults to: ENV['LC_CTYPE'])

Returns:

  • (Boolean)

    True if the receiver has been downcased according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, that is, if a = a#downcase(LOCALE), where a = #normalize(:nfd)



# File 'ext/u/rb_u_string_lower.c', line 9

VALUE
rb_u_string_lower(int argc, VALUE *argv, VALUE self)
{
        return _rb_u_string_test_locale(argc, argv, self, u_downcase);
}

#lstrip ⇒ U::String

Returns the receiver with its maximum #space? prefix removed, inheriting any taint and untrust.

Returns:

  • (U::String)

    The receiver with its maximum #space? prefix removed, inheriting any taint and untrust

See Also:



# File 'ext/u/rb_u_string_lstrip.c', line 7

VALUE
rb_u_string_lstrip(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        const char *begin = USTRING_STR(string);
        if (begin == NULL)
                return self;

        const char *p = begin, *end = USTRING_END(string);
        for (const char *q; p < end; p = q)
                if (!u_char_isspace(u_decode(&q, p, end)))
                        break;
        if (p == begin)
                return self;

        return rb_u_string_new_c(self, p, end - p);
}
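
Unlike the built-in String#lstrip, which removes only ASCII whitespace and null characters, the method above removes any leading #space? characters. A rough plain-Ruby approximation, where `sketch_lstrip` is a hypothetical helper and `[[:space:]]` is assumed to be close to, but not necessarily identical to, the library's notion of a space character:

```ruby
# Hypothetical helper: drop the longest prefix of Unicode whitespace.
def sketch_lstrip(string)
  string.sub(/\A[[:space:]]+/, '')
end

# U+3000 IDEOGRAPHIC SPACE, U+00A0 NO-BREAK SPACE, then an ASCII space.
text = "\u3000\u00A0 x "

unicode_stripped = sketch_lstrip(text)
builtin_stripped = text.lstrip # leaves the non-ASCII spaces in place
```

Only the prefix is affected; the trailing space after "x" is kept in both cases.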

#match(pattern, index = 0) ⇒ MatchData? #match(pattern, index = 0) {|matchdata| ... } ⇒ Object?

Overloads:

  • #match(pattern, index = 0) ⇒ MatchData?

    Returns the result of r#match(self, index), that is, the match data of the first match of r in the receiver, inheriting any taint and untrust from both the receiver and from PATTERN, if one exists, where r = PATTERN, if PATTERN is a Regexp, r = Regexp.new(PATTERN) otherwise.

    Parameters:

    • pattern (Regexp, #to_str)
    • index (#to_int) (defaults to: 0)

    Returns:

    • (MatchData, nil)

      The result of r#match(self, index), that is, the match data of the first match of r in the receiver, inheriting any taint and untrust from both the receiver and from PATTERN, if one exists, where r = PATTERN, if PATTERN is a Regexp, r = Regexp.new(PATTERN) otherwise

  • #match(pattern, index = 0) {|matchdata| ... } ⇒ Object?

    Returns the result of calling the given block with the result of r#match(self, index), that is, the match data of the first match of r in the receiver, inheriting any taint and untrust from both the receiver and from PATTERN, if one exists, where r = PATTERN, if PATTERN is a Regexp, r = Regexp.new(PATTERN) otherwise.

    Parameters:

    • pattern (Regexp, #to_str)
    • index (#to_int) (defaults to: 0)

    Yield Parameters:

    • matchdata (MatchData)

    Returns:

    • (Object, nil)

      The result of calling the given block with the result of r#match(self, index), that is, the match data of the first match of r in the receiver, inheriting any taint and untrust from both the receiver and from PATTERN, if one exists, where r = PATTERN, if PATTERN is a Regexp, r = Regexp.new(PATTERN) otherwise



# File 'ext/u/rb_u_string_match.c', line 52

VALUE
rb_u_string_match_m(int argc, VALUE *argv, VALUE self)
{
        VALUE re;
        /* The pattern is required: argv[0] is read unconditionally below. */
        if (argc < 1)
                need_m_to_n_arguments(argc, 1, 2);
        re = argv[0];
        argv[0] = self;
        VALUE result = rb_funcall2(rb_u_pattern_argument(re, false),
                                   rb_intern("match"), argc, argv);
        if (!NIL_P(result) && rb_block_given_p())
                return rb_yield(result);
        return result;
}
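
The two overloads can be sketched on a built-in String as follows. `match_sketch` is a hypothetical helper that follows the description above (r = PATTERN if it is a Regexp, Regexp.new(PATTERN) otherwise); taint handling is not modeled.

```ruby
# Hypothetical helper following the documented behaviour of #match.
def match_sketch(string, pattern, index = 0)
  r = pattern.is_a?(Regexp) ? pattern : Regexp.new(pattern)
  result = r.match(string, index)
  # The block is only called when there actually is a match.
  return yield(result) if result && block_given?

  result
end

p match_sketch("hello", "l+")
p match_sketch("hello", /l/, 3)
p match_sketch("hello", /l+/) { |m| m[0].upcase }
```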

#mirror ⇒ U::String

Returns the mirroring of the receiver, inheriting any taint and untrust.

Mirroring is done by replacing characters in the string with their horizontal mirror image, if any, in text that is laid out from right to left. For example, ‘(’ becomes ‘)’ and ‘)’ becomes ‘(’.



# File 'ext/u/rb_u_string_mirror.c', line 12

VALUE
rb_u_string_mirror(VALUE self)
{
        return _rb_u_string_convert(self, u_mirror);
}
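
As a rough illustration of the idea, mirroring can be imitated on a built-in String with a small translation table. The table below is a hand-picked, deliberately incomplete assumption covering a few ASCII pairs; the names `MIRROR_FROM`, `MIRROR_TO`, and `mirror_sketch` are hypothetical.

```ruby
# A tiny, incomplete set of mirrored pairs, chosen for illustration only.
MIRROR_FROM = "()[]{}<>"
MIRROR_TO   = ")(][}{><"

# Hypothetical helper: replace each character by its mirror image, if any.
def mirror_sketch(string)
  string.tr(MIRROR_FROM, MIRROR_TO)
end

p mirror_sketch("(a) [b]")
```

Applying the sketch twice gives back the original string, since each pair maps in both directions.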

#newline? ⇒ Boolean

Returns true if the receiver contains only “newline” characters. A character is a “newline” character if it is any of the following characters:

  • U+000A (LINE FEED (LF))

  • U+000C (FORM FEED (FF))

  • U+000D (CARRIAGE RETURN (CR))

  • U+0085 (NEXT LINE)

  • U+2028 (LINE SEPARATOR)

  • U+2029 (PARAGRAPH SEPARATOR)

Returns:

  • (Boolean)


# File 'ext/u/rb_u_string_newline.c', line 17

VALUE
rb_u_string_newline(VALUE self)
{
        return _rb_u_character_test(self, u_char_isnewline);
}
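
The list above translates directly into a membership test. The following is a sketch on a built-in String; `NEWLINE_SKETCH` and `newline_sketch?` are hypothetical names, and the result for an empty string is not something the text above specifies.

```ruby
# The six "newline" code points listed above.
NEWLINE_SKETCH = [0x000A, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029].freeze

# Hypothetical helper: true if every character is one of the newline characters.
def newline_sketch?(string)
  string.each_char.all? { |char| NEWLINE_SKETCH.include?(char.ord) }
end

p newline_sketch?("\n\r\u2028")
p newline_sketch?("\n\t")
```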

#normalize(form = :nfd) ⇒ U::String

Returns the receiver normalized into FORM, inheriting any taint and untrust.

Normalization is the process of converting characters and sequences of characters in a string into a canonical form. This process includes dealing with whether characters are represented by a composed character or a base character and combining marks, such as accents.

The possible normalization forms are

<table>

<thead>
  <tr><th>Form</th><th>Description</th></tr>
</thead>
<tbody>
  <tr>
    <td><code>:nfd</code></td>
    <td>Normalizes characters to their maximally decomposed form,
    ordering accents and so on according to their combining class</td>
  </tr>
  <tr>
    <td><code>:nfc</code></td>
    <td>Normalizes according to <code>:nfd</code>, then composes any
    decomposed characters</td>
  </tr>
  <tr>
    <td><code>:nfkd</code></td>
    <td>Normalizes according to <code>:nfd</code> and also normalizes
    “compatibility” characters, such as replacing U+00B3 SUPERSCRIPT
    THREE with U+0033 DIGIT THREE</td>
  </tr>
  <tr>
    <td><code>:nfkc</code></td>
    <td>Normalizes according to <code>:nfkd</code>, then composes any
    decomposed characters</td>
  </tr>
</tbody>

</table>
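
The four forms can be tried out with Ruby's built-in String#unicode_normalize, which takes the same form names. This is shown as an analogy only; it is an assumption that U::String#normalize produces the same code points for these inputs.

```ruby
composed   = "\u00E9"  # é as a single code point
decomposed = "e\u0301" # e followed by U+0301 COMBINING ACUTE ACCENT

# :nfd decomposes, :nfc recomposes.
p composed.unicode_normalize(:nfd).codepoints
p decomposed.unicode_normalize(:nfc).codepoints

# The compatibility forms also replace U+00B3 SUPERSCRIPT THREE with U+0033.
p "x\u00B3".unicode_normalize(:nfkd)
p "x\u00B3".unicode_normalize(:nfkc)
```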

Parameters:

  • form (#to_sym) (defaults to: :nfd)

Returns:

  • (U::String)



# File 'ext/u/rb_u_string_normalize.c', line 48

VALUE
rb_u_string_normalize(int argc, VALUE *argv, VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        VALUE rbform;
        enum u_normalization_form form = U_NORMALIZATION_FORM_D;
        if (rb_scan_args(argc, argv, "01", &rbform) == 1)
                form = _rb_u_symbol_to_normalization_form(rbform);

        size_t n = u_normalize(NULL, 0,
                               USTRING_STR(string), USTRING_LENGTH(string),
                               form);
        char *normalized = ALLOC_N(char, n + 1);
        n = u_normalize(normalized, n + 1,
                        USTRING_STR(string), USTRING_LENGTH(string),
                        form);
        char *t = REALLOC_N(normalized, char, n + 1);
        if (t != NULL)
                normalized = t;

        return rb_u_string_new_c_own(self, normalized, n);
}

#normalize?(mode = :default) ⇒ Boolean

Returns true if it can be determined that the receiver is normalized according to MODE.

See #normalize for a discussion on normalization and a list of the possible normalization modes.

Parameters:

  • mode (#to_sym) (defaults to: :default)

Returns:

  • (Boolean)

# File 'ext/u/rb_u_string_normalized.c', line 15

VALUE
rb_u_string_normalized(int argc, VALUE *argv, VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        VALUE rbform;
        enum u_normalization_form form = U_NORMALIZATION_FORM_D;
        if (rb_scan_args(argc, argv, "01", &rbform) == 1)
                form = _rb_u_symbol_to_normalization_form(rbform);

        return u_normalized(USTRING_STR(string),
                            USTRING_LENGTH(string),
                            form) == U_NORMALIZED_YES ? Qtrue : Qfalse;
}
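
A comparable check exists on Ruby's built-in String as #unicode_normalized?. The sketch below wraps it in a hypothetical helper, `normalized_sketch?`, with the same NFD default as the C source above. One difference to keep in mind: the C code returns true only when the answer is a definite yes, whereas the built-in method always gives a definite answer.

```ruby
# Hypothetical helper using the built-in check, defaulting to NFD as above.
def normalized_sketch?(string, form = :nfd)
  string.unicode_normalized?(form)
end

p normalized_sketch?("e\u0301")
p normalized_sketch?("\u00E9")
p normalized_sketch?("\u00E9", :nfc)
```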

#oct ⇒ Integer

Returns the result of #to_i(8), but with the added provision that any leading base specification in the receiver will override the suggested octal (8) base, that is, '0b11'.u#oct = 3, not 9.

Returns:

  • (Integer)

    The result of #to_i(8), but with the added provision that any leading base specification in the receiver will override the suggested octal (8) base, that is, '0b11'.u#oct = 3, not 9.



# File 'ext/u/rb_u_string_oct.c', line 7

VALUE
rb_u_string_oct(VALUE self)
{
        return rb_u_string_to_inum(self, -8, false);
}
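
Ruby's built-in String#oct follows the same rule, so it can serve as a quick illustration of how a leading base specification overrides the octal default. Whether U::String#oct agrees on every input is an assumption here.

```ruby
p "11".oct   # => 9  (no prefix, read as octal)
p "0b11".oct # => 3  (binary prefix wins)
p "0x1f".oct # => 31 (hexadecimal prefix wins)
```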

#ord ⇒ Integer

Returns the code point of the first character of the receiver.

Returns:

  • (Integer)

    The code point of the first character of the receiver



# File 'ext/u/rb_u_string_ord.c', line 4

VALUE
rb_u_string_ord(VALUE self)
{
        const struct rb_u_string *s = RVAL2USTRING(self);
        const char *p = USTRING_STR(s);
        const char *end = USTRING_END(s);
        if (p == end)
                rb_u_raise(rb_eArgError, "empty string");
        const char *q;
        return UINT2NUM(u_decode(&q, p, end));
}
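
The built-in String#ord behaves the same way on these inputs, including the ArgumentError for an empty string that the C source raises. It is used here as a stand-in, and `ord_error_sketch` is a hypothetical helper name.

```ruby
# The first character is U+00E9, which occupies two bytes in UTF-8.
p "\u00E9t\u00E9".ord

# Hypothetical helper: report which error class an empty string produces.
def ord_error_sketch
  "".ord
  nil
rescue ArgumentError => e
  e.class
end

p ord_error_sketch
```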

#partition(separator) ⇒ Array<U::String>

Returns the receiver split into s₁ = #slice(0, i), s₂ = #slice(i, n), s₃ = #slice(i + n, -1), where i = j if j ≠ nil, i = #length otherwise, j = #index(SEPARATOR), n = SEPARATOR#length, where s₁ and s₃ inherit any taint and untrust from the receiver and s₂ inherits any taint and untrust from SEPARATOR and also from the receiver if SEPARATOR is a Regexp.

Parameters:

Returns:

  • (Array<U::String>)

    The receiver split into s₁ = #slice(0, i), s₂ = #slice(i, n), s₃ = #slice(i + n, -1), where i = j if j ≠ nil, i = #length otherwise, j = #index(SEPARATOR), n = SEPARATOR#length, where s₁ and s₃ inherit any taint and untrust from the receiver and s₂ inherits any taint and untrust from SEPARATOR and also from the receiver if SEPARATOR is a Regexp

# File 'ext/u/rb_u_string_partition.c', line 73

VALUE
rb_u_string_partition(VALUE self, VALUE separator)
{
        if (TYPE(separator) == T_REGEXP)
                return rb_u_string_partition_regex(self, separator);

        return rb_u_string_partition_string(self, separator);
}
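
For a string separator, the slicing formula in the description can be written out on a built-in String as below. `partition_sketch` is a hypothetical helper; Regexp separators and taint handling are left out, and slices that fall past the end are treated as empty, which is an assumption.

```ruby
# Hypothetical helper applying the documented formula for string separators.
def partition_sketch(string, separator)
  j = string.index(separator)
  i = j || string.length # i = j if j is not nil, #length otherwise
  n = separator.length
  [string[0, i], string[i, n] || "", string[(i + n)..-1] || ""]
end

p partition_sketch("key=value=x", "=")
p partition_sketch("abc", "=")
```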

#print? ⇒ Boolean

Returns true if the receiver contains only characters not in the general category Other.

Returns:

  • (Boolean)

    True if the receiver contains only characters not in the general category Other



# File 'ext/u/rb_u_string_print.c', line 6

VALUE
rb_u_string_print(VALUE self)
{
        return _rb_u_character_test(self, u_char_isprint);
}

#punct? ⇒ Boolean

Returns true if the receiver contains only characters in the general categories Punctuation and Symbol.

Returns:

  • (Boolean)

    True if the receiver contains only characters in the general categories Punctuation and Symbol



# File 'ext/u/rb_u_string_punct.c', line 6

VALUE
rb_u_string_punct(VALUE self)
{
        return _rb_u_character_test(self, u_char_ispunct);
}
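
The general-category tests described for #print? and #punct? can be approximated on a built-in String with Unicode property classes in a Regexp. `print_sketch?` and `punct_sketch?` are hypothetical helpers, and their result for an empty string is an artifact of the sketch rather than documented behaviour.

```ruby
# Hypothetical helper: no character is in the general category Other (C).
def print_sketch?(string)
  string.match?(/\A\P{C}*\z/)
end

# Hypothetical helper: every character is Punctuation (P) or Symbol (S).
def punct_sketch?(string)
  string.match?(/\A[\p{P}\p{S}]*\z/)
end

p print_sketch?("abc d\u00E9")
p print_sketch?("a\tb")
p punct_sketch?("!?+$")
p punct_sketch?("a!")
```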

#recode(codeset) ⇒ Object



# File 'ext/u/rb_u_string.c', line 205

static VALUE
rb_u_string_recode(VALUE self, VALUE codeset)
{
        const struct rb_u_string *string = RVAL2USTRING(self);
        const char *cs = StringValuePtr(codeset);
        errno = 0;
        size_t n = u_recode(NULL, 0, USTRING_STR(string), USTRING_LENGTH(string), cs);
        if (errno != 0)
                rb_u_raise_errno(errno, "can’t recode");
        char *recoded = ALLOC_N(char, n + 1);
        u_recode(recoded, n + 1, USTRING_STR(string), USTRING_LENGTH(string), cs);
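        /* rb_str_new copies the bytes, so the temporary buffer can be released. */
        VALUE result = rb_str_new(recoded, n);
        xfree(recoded);
        return result;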
}

#reverse ⇒ U::String

Note:

This doesn’t take into account proper handling of combining marks, direction indicators, and similarly relevant characters, so this method is mostly useful when you know the contents of the string are simple and the result isn’t intended for display.

Returns the reversal of the receiver, inheriting any taint and untrust from the receiver.

Returns:

  • (U::String)

    The reversal of the receiver, inheriting any taint and untrust from the receiver



# File 'ext/u/rb_u_string_reverse.c', line 9

VALUE
rb_u_string_reverse(VALUE self)
{
        return _rb_u_string_convert(self, u_reverse);
}
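
The caveat in the note can be seen with a built-in String, whose #reverse also works character by character. The grapheme-based alternative, `grapheme_reverse_sketch`, is a hypothetical helper shown for contrast only; U::String is not claimed to offer it.

```ruby
text = "e\u0301a" # "e" + COMBINING ACUTE ACCENT, then "a"

# Character-wise reversal moves the accent onto the "a".
p text.reverse.codepoints

# Hypothetical alternative: reverse whole grapheme clusters instead.
def grapheme_reverse_sketch(string)
  string.each_grapheme_cluster.to_a.reverse.join
end

p grapheme_reverse_sketch(text).codepoints
```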

#rindex(pattern, offset = -1) ⇒ Integer?

Returns the maximal index of the receiver where PATTERN matches, equal to or less than i, where i = OFFSET if OFFSET ≥ 0, i = #length - abs(OFFSET) otherwise, or nil if there is no match.

If PATTERN is a Regexp, the Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

If PATTERN responds to #to_str, the matching is performed by a byte comparison.

Parameters:

  • pattern (Regexp, #to_str)
  • offset (#to_int) (defaults to: -1)

Returns:

  • (Integer, nil)

# File 'ext/u/rb_u_string_rindex.c', line 47

VALUE
rb_u_string_rindex_m(int argc, VALUE *argv, VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        VALUE sub, rboffset;
        long offset;
        if (rb_scan_args(argc, argv, "11", &sub, &rboffset) == 2)
                offset = NUM2LONG(rboffset);
        else
                /* TODO: Why not simply use -1?  Benchmark which is faster. */
                offset = u_n_chars_n(USTRING_STR(string), USTRING_LENGTH(string));

        const char *begin = rb_u_string_begin_from_offset(string, offset);
        const char *end = USTRING_END(string);
        if (begin == NULL) {
                if (offset <= 0) {
                        if (TYPE(sub) == T_REGEXP)
                                rb_backref_set(Qnil);

                        return Qnil;
                }

                begin = end;
                /* TODO: this converting back and forward can be optimized away
                 * if rb_u_string_index_regexp() and rb_u_string_rindex() were split up
                 * into two additional functions, adding
                 * rb_u_string_index_regexp_pointer() and rb_u_string_rindex_pointer(),
                 * so that one can pass a pointer to start at immediately
                 * instead of an offset that gets calculated into a pointer. */
                offset = u_n_chars_n(USTRING_STR(string), USTRING_LENGTH(string));
        }

        switch (TYPE(sub)) {
        case T_REGEXP:
                /* TODO: What’s this first test for, exactly? */
                if (RREGEXP(sub)->ptr == NULL || RREGEXP_SRC_LEN(sub) > 0)
                        offset = rb_u_string_index_regexp(self, begin, sub, true);
                break;
        default: {
                VALUE tmp = rb_check_string_type(sub);
                if (NIL_P(tmp))
                        rb_u_raise(rb_eTypeError, "type mismatch: %s given",
                                   rb_obj_classname(sub));

                sub = tmp;
        }
                /* fall through */
        case T_STRING:
                offset = rb_u_string_rindex(self, sub, offset);
                break;
        }

        if (offset < 0)
                return Qnil;

        return LONG2NUM(offset);
}
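
The offset rule (i = OFFSET if OFFSET ≥ 0, i = #length - abs(OFFSET) otherwise) matches the built-in String#rindex, which is used below as a stand-in. It is an assumption that U::String#rindex returns the same indices for these inputs.

```ruby
s = "hello"

p s.rindex("l")     # => 3, the last "l"
p s.rindex("l", 2)  # => 2, search starts at index 2
p s.rindex("l", -3) # => 2, since 5 - 3 = 2
p s.rindex("l", -4) # => nil, nothing at or before index 1
p s.rindex(/l+/)    # => 3, a Regexp works as well
```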

#rjust(width, padding = ' ') ⇒ U::String

Returns The receiver padded on the left with PADDING to make it max(#length, WIDTH) wide, inheriting any taint and untrust from the receiver and also from PADDING if PADDING is used.

Parameters:

Returns:

  • (U::String)

    The receiver padded on the left with PADDING to make it max(#length, WIDTH) wide, inheriting any taint and untrust from the receiver and also from PADDING if PADDING is used

Raises:

  • (ArgumentError)

    If PADDING#width = 0

  • (ArgumentError)

    If characters inside PADDING that should be used for round-off padding are too wide

# File 'ext/u/rb_u_string_justify.c', line 165

VALUE
rb_u_string_rjust(int argc, VALUE *argv, VALUE self)
{
        return rb_u_string_justify(argc, argv, self, 'r');
}
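
The padding behaviour can be illustrated with the built-in String#rjust. Note the difference in units: the description above is in terms of display width, while the built-in method counts characters, so the two only coincide for narrow characters such as the ASCII ones used here. `rjust_error_sketch` is a hypothetical helper name.

```ruby
p "abc".rjust(6, "12") # padding is repeated and cut off as needed
p "abc".rjust(2)       # already wide enough, so nothing is added

# Hypothetical helper: report which error class empty padding produces.
def rjust_error_sketch
  "abc".rjust(6, "")
  nil
rescue ArgumentError => e
  e.class
end

p rjust_error_sketch
```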

#rpartition(separator) ⇒ Array<U::String>

Returns The receiver split into s₁ = #slice(0, i), s₂ = #slice(i, n), s₃ = #slice(i + n, -1), where i = j if j ≠ nil, i = 0 otherwise, j = #rindex(SEPARATOR), n = SEPARATOR#length, where s₁ and s₃ inherit any taint and untrust from the receiver and s₂ inherits any taint and untrust from SEPARATOR and also from the receiver if SEPARATOR is a Regexp.

Parameters:

Returns:

  • (Array<U::String>)

    The receiver split into s₁ = #slice(0, i), s₂ = #slice(i, n), s₃ = #slice(i + n, -1), where i = j if j ≠ nil, i = 0 otherwise, j = #rindex(SEPARATOR), n = SEPARATOR#length, where s₁ and s₃ inherit any taint and untrust from the receiver and s₂ inherits any taint and untrust from SEPARATOR and also from the receiver if SEPARATOR is a Regexp

# File 'ext/u/rb_u_string_rpartition.c', line 74

VALUE
rb_u_string_rpartition(VALUE self, VALUE separator)
{
        if (TYPE(separator) == T_REGEXP)
                return rb_u_string_rpartition_regex(self, separator);

        return rb_u_string_rpartition_string(self, separator);
}

#rstrip ⇒ U::String

Returns the receiver with its maximum #space? suffix removed, inheriting any taint and untrust from the receiver.

Returns:

  • (U::String)

    The receiver with its maximum #space? suffix removed, inheriting any taint and untrust from the receiver

# File 'ext/u/rb_u_string_rstrip.c', line 7

VALUE
rb_u_string_rstrip(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        const char *begin = USTRING_STR(string);
        if (begin == NULL)
                return self;

        const char *end = USTRING_END(string);
        const char *q = end;
        while (begin < q) {
                const char *p;
                uint32_t c = u_decode_r(&p, begin, q);
                if (c != '\0' && !u_char_isspace(c))
                        break;
                q = p;
        }
        if (q == end)
                return self;

        return rb_u_string_new_c(self, begin, q - begin);
}

#scan(pattern) ⇒ Array<U::String>+ #scan(pattern) ⇒ Array<U::String> #scan(pattern) {|submatches| ... } ⇒ self #scan(pattern) {|match| ... } ⇒ self

Overloads:

  • #scan(pattern) ⇒ Array<U::String>+
    Note:

    The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

    Returns All matches – or sub-matches, if they exist – of matches of PATTERN in the receiver, each inheriting any taint and untrust from both the receiver and from PATTERN.

    Parameters:

    • pattern (Regexp)

    Returns:

    • (Array<U::String>, Array<Array<U::String>>)

      All matches – or sub-matches, if they exist – of matches of PATTERN in the receiver, each inheriting any taint and untrust from both the receiver and from PATTERN

  • #scan(pattern) ⇒ Array<U::String>

    Returns All matches of PATTERN in the receiver, each inheriting any taint and untrust from the receiver.

    Parameters:

    Returns:

    • (Array<U::String>)

      All matches of PATTERN in the receiver, each inheriting any taint and untrust from the receiver

  • #scan(pattern) {|submatches| ... } ⇒ self
    Note:

    The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

    Enumerates the sub-matches of matches of PATTERN in the receiver, each inheriting any taint and untrust from both the receiver and from PATTERN.

    Parameters:

    • pattern (Regexp)

    Yield Parameters:

    Returns:

    • (self)
  • #scan(pattern) {|match| ... } ⇒ self

    Enumerates the matches of PATTERN in the receiver, each inheriting any taint and untrust from the receiver.

    Parameters:

    Yield Parameters:

    Returns:

    • (self)


# File 'ext/u/rb_u_string_scan.c', line 98

VALUE
rb_u_string_scan(VALUE self, VALUE pattern)
{
        pattern = rb_u_pattern_argument(pattern, true);

        VALUE string = rb_str_to_str(self);

        if (rb_block_given_p())
                return rb_u_string_scan_block(self, string, pattern);

        return rb_u_string_scan_array(string, pattern);
}
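
The four overloads correspond to the shapes of the built-in String#scan, shown here on a plain String as an illustration. With U::String the elements would be U::String instances, per the signatures above.

```ruby
s = "a1b2"

p s.scan(/[a-z]\d/)       # matches, no groups
p s.scan(/([a-z])(\d)/)   # sub-matches, one array per match

# Block forms yield each match (or array of sub-matches) and return the receiver.
pairs = []
result = s.scan(/([a-z])(\d)/) { |letter, digit| pairs << "#{digit}#{letter}" }
p pairs
p result.equal?(s)
```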

#script ⇒ Symbol

Returns the script of the characters of the receiver.

The script of a character identifies the primary writing system that uses the character.

<table>

<thead><tr><th>Script</th><th>Description</th></tr></thead>
<tbody>
  <tr><td>:arabic</td><td>Arabic</td></tr>
  <tr><td>:armenian</td><td>Armenian</td></tr>
  <tr><td>:avestan</td><td>Avestan</td></tr>
  <tr><td>:balinese</td><td>Balinese</td></tr>
  <tr><td>:bamum</td><td>Bamum</td></tr>
  <tr><td>:batak</td><td>Batak</td></tr>
  <tr><td>:bengali</td><td>Bengali</td></tr>
  <tr><td>:bopomofo</td><td>Bopomofo</td></tr>
  <tr><td>:brahmi</td><td>Brahmi</td></tr>
  <tr><td>:braille</td><td>Braille</td></tr>
  <tr><td>:buginese</td><td>Buginese</td></tr>
  <tr><td>:buhid</td><td>Buhid</td></tr>
  <tr><td>:canadian_aboriginal</td><td>Canadian Aboriginal</td></tr>
  <tr><td>:carian</td><td>Carian</td></tr>
  <tr><td>:chakma</td><td>Chakma</td></tr>
  <tr><td>:cham</td><td>Cham</td></tr>
  <tr><td>:cherokee</td><td>Cherokee</td></tr>
  <tr><td>:common</td><td>For other characters that may be used with multiple scripts</td></tr>
  <tr><td>:coptic</td><td>Coptic</td></tr>
  <tr><td>:cuneiform</td><td>Cuneiform</td></tr>
  <tr><td>:cypriot</td><td>Cypriot</td></tr>
  <tr><td>:cyrillic</td><td>Cyrillic</td></tr>
  <tr><td>:deseret</td><td>Deseret</td></tr>
  <tr><td>:devanagari</td><td>Devanagari</td></tr>
  <tr><td>:egyptian_hieroglyphs</td><td>Egyptian Hieroglyphs</td></tr>
  <tr><td>:ethiopic</td><td>Ethiopic</td></tr>
  <tr><td>:georgian</td><td>Georgian</td></tr>
  <tr><td>:glagolitic</td><td>Glagolitic</td></tr>
  <tr><td>:gothic</td><td>Gothic</td></tr>
  <tr><td>:greek</td><td>Greek</td></tr>
  <tr><td>:gujarati</td><td>Gujarati</td></tr>
  <tr><td>:gurmukhi</td><td>Gurmukhi</td></tr>
  <tr><td>:han</td><td>Han</td></tr>
  <tr><td>:hangul</td><td>Hangul</td></tr>
  <tr><td>:hanunoo</td><td>Hanunoo</td></tr>
  <tr><td>:hebrew</td><td>Hebrew</td></tr>
  <tr><td>:hiragana</td><td>Hiragana</td></tr>
  <tr><td>:imperial_aramaic</td><td>Imperial Aramaic</td></tr>
  <tr><td>:inherited</td><td>For characters that may be used with multiple
  scripts, and that inherit their script from the preceding characters;
  these include nonspacing marks, enclosing marks, and the zero-width
  joiner/non-joiner characters</td></tr>
  <tr><td>:inscriptional_pahlavi</td><td>Inscriptional Pahlavi</td></tr>
  <tr><td>:inscriptional_parthian</td><td>Inscriptional Parthian</td></tr>
  <tr><td>:javanese</td><td>Javanese</td></tr>
  <tr><td>:kaithi</td><td>Kaithi</td></tr>
  <tr><td>:kannada</td><td>Kannada</td></tr>
  <tr><td>:katakana</td><td>Katakana</td></tr>
  <tr><td>:kayah_li</td><td>Kayah Li</td></tr>
  <tr><td>:kharoshthi</td><td>Kharoshthi</td></tr>
  <tr><td>:khmer</td><td>Khmer</td></tr>
  <tr><td>:lao</td><td>Lao</td></tr>
  <tr><td>:latin</td><td>Latin</td></tr>
  <tr><td>:lepcha</td><td>Lepcha</td></tr>
  <tr><td>:limbu</td><td>Limbu</td></tr>
  <tr><td>:linear_b</td><td>Linear B</td></tr>
  <tr><td>:lisu</td><td>Lisu</td></tr>
  <tr><td>:lycian</td><td>Lycian</td></tr>
  <tr><td>:lydian</td><td>Lydian</td></tr>
  <tr><td>:malayalam</td><td>Malayalam</td></tr>
  <tr><td>:mandaic</td><td>Mandaic</td></tr>
  <tr><td>:meetei_mayek</td><td>Meetei Mayek</td></tr>
  <tr><td>:meroitic_hieroglyphs</td><td>Meroitic Hieroglyphs</td></tr>
  <tr><td>:meroitic_cursive</td><td>Meroitic Cursive</td></tr>
  <tr><td>:miao</td><td>Miao</td></tr>
  <tr><td>:mongolian</td><td>Mongolian</td></tr>
  <tr><td>:myanmar</td><td>Myanmar</td></tr>
  <tr><td>:new_tai_lue</td><td>New Tai Lue</td></tr>
  <tr><td>:nko</td><td>N'Ko</td></tr>
  <tr><td>:ogham</td><td>Ogham</td></tr>
  <tr><td>:old_italic</td><td>Old Italic</td></tr>
  <tr><td>:old_persian</td><td>Old Persian</td></tr>
  <tr><td>:old_south_arabian</td><td>Old South Arabian</td></tr>
  <tr><td>:old_turkic</td><td>Old Turkic</td></tr>
  <tr><td>:ol_chiki</td><td>Ol Chiki</td></tr>
  <tr><td>:oriya</td><td>Oriya</td></tr>
  <tr><td>:osmanya</td><td>Osmanya</td></tr>
  <tr><td>:phags_pa</td><td>Phags-pa</td></tr>
  <tr><td>:phoenician</td><td>Phoenician</td></tr>
  <tr><td>:rejang</td><td>Rejang</td></tr>
  <tr><td>:runic</td><td>Runic</td></tr>
  <tr><td>:samaritan</td><td>Samaritan</td></tr>
  <tr><td>:saurashtra</td><td>Saurashtra</td></tr>
  <tr><td>:sharada</td><td>Sharada</td></tr>
  <tr><td>:shavian</td><td>Shavian</td></tr>
  <tr><td>:sinhala</td><td>Sinhala</td></tr>
  <tr><td>:sora_sompeng</td><td>Sora Sompeng</td></tr>
  <tr><td>:sundanese</td><td>Sundanese</td></tr>
  <tr><td>:syloti_nagri</td><td>Syloti Nagri</td></tr>
  <tr><td>:syriac</td><td>Syriac</td></tr>
  <tr><td>:tagalog</td><td>Tagalog</td></tr>
  <tr><td>:tagbanwa</td><td>Tagbanwa</td></tr>
  <tr><td>:tai_le</td><td>Tai Le</td></tr>
  <tr><td>:tai_tham</td><td>Tai Tham</td></tr>
  <tr><td>:tai_viet</td><td>Tai Viet</td></tr>
  <tr><td>:takri</td><td>Takri</td></tr>
  <tr><td>:tamil</td><td>Tamil</td></tr>
  <tr><td>:telugu</td><td>Telugu</td></tr>
  <tr><td>:thaana</td><td>Thaana</td></tr>
  <tr><td>:thai</td><td>Thai</td></tr>
  <tr><td>:tibetan</td><td>Tibetan</td></tr>
  <tr><td>:tifinagh</td><td>Tifinagh</td></tr>
  <tr><td>:ugaritic</td><td>Ugaritic</td></tr>
  <tr><td>:unknown</td><td>For not assigned, private-use, non-character, and surrogate code points</td></tr>
  <tr><td>:vai</td><td>Vai</td></tr>
  <tr><td>:yi</td><td>Yi</td></tr>
</tbody>

</table>

Returns:

  • (Symbol)

Raises:

  • (ArgumentError)

    If the receiver contains two characters belonging to different scripts

  • (ArgumentError)

    If the receiver contains an incomplete UTF-8 sequence

  • (ArgumentError)

    If the receiver contains an invalid UTF-8 sequence

# File 'ext/u/rb_u_string_script.c', line 247

VALUE
rb_u_string_script(VALUE self)
{
        return _rb_u_string_property(self, "script", U_SCRIPT_UNKNOWN,
                                     (int (*)(uint32_t))u_char_script,
                                     (VALUE (*)(int))script_to_symbol);
}
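
The lookup and the "different scripts" error can be imitated on a built-in String with Regexp script properties. This sketch is an assumption-laden illustration: `SCRIPT_SKETCH` covers only five of the scripts in the table, `script_sketch` is a hypothetical helper, and :unknown here also stands for any script missing from the sketch's table.

```ruby
# A small, incomplete subset of the scripts listed in the table above.
SCRIPT_SKETCH = {
  latin: /\p{Latin}/,
  greek: /\p{Greek}/,
  cyrillic: /\p{Cyrillic}/,
  han: /\p{Han}/,
  common: /\p{Common}/
}.freeze

# Hypothetical helper: the single script shared by all characters.
def script_sketch(string)
  scripts = string.each_char.map do |char|
    entry = SCRIPT_SKETCH.find { |_name, pattern| char.match?(pattern) }
    entry ? entry.first : :unknown
  end.uniq
  raise ArgumentError, "characters belong to different scripts" if scripts.length > 1

  scripts.first
end

p script_sketch("\u03B1\u03B2\u03B3")
p script_sketch("abc")
```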

#soft_dotted? ⇒ Boolean

Note:

Soft-dotted characters have the soft-dotted property and thus lose their dot if an accent is applied to them, for example, ‘i’ and ‘j’.

Returns True if this U::String only contains soft-dotted characters.

Returns:

  • (Boolean)

    True if this U::String only contains soft-dotted characters

# File 'ext/u/rb_u_string_soft_dotted.c', line 9

VALUE
rb_u_string_soft_dotted(VALUE self)
{
        return _rb_u_character_test(self, u_char_issoftdotted);
}

#space? ⇒ Boolean

Returns true if the receiver contains only “space” characters. Space characters are those in the general category Separator:

  • Separator, space (Zs)

  • Separator, line (Zl)

  • Separator, paragraph (Zp)

such as ‘ ’, or a control character acting as such, namely

  • U+0009 CHARACTER TABULATION (HT)

  • U+000A LINE FEED (LF)

  • U+000C FORM FEED (FF)

  • U+000D CARRIAGE RETURN (CR)

Returns:

  • (Boolean)


# File 'ext/u/rb_u_string_space.c', line 20

VALUE
rb_u_string_space(VALUE self)
{
        return _rb_u_character_test(self, u_char_isspace);
}

#split(pattern = $;, limit = 0) ⇒ Array<U::String>

Returns the receiver split into LIMIT substrings separated by PATTERN, each inheriting any taint and untrust.

If PATTERN = $; = nil or PATTERN = ' ', splits according to AWK rules, that is, any #space? prefix is skipped, then substrings are separated by non-empty #space? substrings.

If LIMIT < 0, then no limit is imposed and trailing #empty? substrings aren’t removed.

If LIMIT = 0, then no limit is imposed and trailing #empty? substrings are removed.

If LIMIT = 1, then, if #length = 0, the result will be empty, otherwise it will consist of the receiver only.

If LIMIT > 1, then the receiver is split into at most LIMIT substrings.

Parameters:

  • pattern (Regexp, #to_str) (defaults to: $;)
  • limit (#to_int) (defaults to: 0)

Returns:

  • (Array<U::String>)



# File 'ext/u/rb_u_string_split.c', line 200

VALUE
rb_u_string_split_m(int argc, VALUE *argv, VALUE self)
{
        VALUE rbpattern, rblimit;
        int limit = 0;
        bool limit_given;

        if (rb_scan_args(argc, argv, "02", &rbpattern, &rblimit) == 2)
                limit = NUM2INT(rblimit);

        const struct rb_u_string *string = RVAL2USTRING(self);

        if (limit == 1) {
                if (USTRING_LENGTH(string) == 0)
                        return rb_ary_new2(0);

                return rb_ary_new3(1, self);
        }

        limit_given = !NIL_P(rblimit) && limit >= 0;

        if (NIL_P(rbpattern) && NIL_P(rb_fs))
                return rb_u_string_split_awk(self, limit_given, limit);
        else if (NIL_P(rbpattern))
                rbpattern = rb_fs;

        if (TYPE(rbpattern) != T_STRING && !RTEST(rb_obj_is_kind_of(rbpattern, rb_cUString)))
                return rb_u_string_split_pattern(self,
                                                 rb_u_pattern_argument(rbpattern, true),
                                                 limit_given,
                                                 limit);

        const struct rb_u_string *pattern = RVAL2USTRING_ANY(rbpattern);
        const char *p = USTRING_STR(pattern);
        long length = USTRING_LENGTH(pattern);

        if (length == 0)
                return rb_u_string_split_pattern(self,
                                                 rb_reg_regcomp(rb_str_to_str(rbpattern)),
                                                 limit_given,
                                                 limit);
        else if (length == 1 && *p == ' ')
                return rb_u_string_split_awk(self, limit_given, limit);
        else
                return rb_u_string_split_string(self, rbpattern, limit_given, limit);
}
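
The LIMIT rules above are the same as those of the built-in String#split, which is used here as a stand-in to show each case. It is an assumption that U::String#split gives the same results for these inputs, apart from returning U::String elements.

```ruby
p " a  b ".split(" ")     # AWK rules: leading space skipped, runs collapsed
p "a,b,,".split(",")      # LIMIT = 0: trailing empty substrings removed
p "a,b,,".split(",", -1)  # LIMIT < 0: trailing empty substrings kept
p "a,b,c".split(",", 2)   # LIMIT > 1: at most two substrings
p "abc".split(",", 1)     # LIMIT = 1: the receiver only
p "".split(",", 1)        # LIMIT = 1 and #length = 0: empty result
```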

#squeeze(*sets) ⇒ U::String

Returns the receiver, replacing any substrings of #length > 1 consisting of the same character c with c, where c is a member of the intersection of the character sets in SETS, inheriting any taint and untrust.

If SETS is empty, then the set of all Unicode characters is used.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.

Parameters:

Returns:

  • (U::String)



# File 'ext/u/rb_u_string_squeeze.c', line 52

VALUE
rb_u_string_squeeze(int argc, VALUE *argv, VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        if (USTRING_LENGTH(string) == 0)
                return Qnil;

        struct tr_table table;
        if (argc > 0)
                tr_table_initialize_from_strings(&table, argc, argv);

        struct tr_table *table_pointer = (argc > 0) ? &table : NULL;

        long count = rb_u_string_squeeze_loop(string, table_pointer, NULL);
        if (count == 0)
                return self;

        char *remaining = ALLOC_N(char, count + 1);
        rb_u_string_squeeze_loop(string, table_pointer, remaining);
        remaining[count] = '\0';

        return rb_u_string_new_c_own(self, remaining, count);
}
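
The set syntax (ranges, complement, and intersection of several sets) is shared with the built-in String#squeeze, used below as an illustration on a plain String. Whether U::String#squeeze matches it on every input is an assumption.

```ruby
s = "aaabbbccc"

p s.squeeze               # no sets: every run is squeezed
p s.squeeze("a-b")        # range: only runs of "a" and "b"
p s.squeeze("^a")         # complement: everything except "a"
p s.squeeze("a-c", "^b")  # intersection: "a" and "c", but not "b"
```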

#start_with?(*prefixes) ⇒ Boolean

Returns True if any element of PREFIXES that responds to #to_str is a byte-level prefix of the receiver.

Parameters:

  • prefixes (Array)

Returns:

  • (Boolean)

    True if any element of PREFIXES that responds to #to_str is a byte-level prefix of the receiver



# File 'ext/u/rb_u_string_start_with.c', line 7

VALUE
rb_u_string_start_with(int argc, VALUE *argv, VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);
        const char *p = USTRING_STR(string);
        long p_length = USTRING_LENGTH(string);

        for (int i = 0; i < argc; i++) {
                VALUE tmp = rb_u_string_check_type(argv[i]);
                if (NIL_P(tmp))
                        continue;

                const struct rb_u_string *other = RVAL2USTRING_ANY(tmp);
                const char *q = USTRING_STR(other);
                long q_length = USTRING_LENGTH(other);

                if (p_length < q_length)
                        continue;

                if (memcmp(p, q, q_length) == 0)
                        return Qtrue;
        }

        return Qfalse;
}
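
"Byte-level prefix" has a visible consequence: the check does not know about normalization or grapheme boundaries. The sketch below mirrors the loop in the C source on a built-in String; `start_with_sketch?` is a hypothetical helper.

```ruby
# Hypothetical helper mirroring rb_u_string_start_with.
def start_with_sketch?(string, *prefixes)
  prefixes.any? do |prefix|
    # Elements that don't respond to #to_str are skipped, not rejected.
    next false unless prefix.respond_to?(:to_str)

    q = prefix.to_str
    string.bytesize >= q.bytesize &&
      string.byteslice(0, q.bytesize).bytes == q.bytes
  end
end

p start_with_sketch?("e\u0301x", "e") # decomposed: "e" is a byte prefix
p start_with_sketch?("\u00E9x", "e")  # composed: it is not
p start_with_sketch?("abc", 1, nil, "ab")
```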

#strip ⇒ U::String

Returns the receiver with its maximum #space? prefix and suffix removed, inheriting any taint and untrust.

Returns:

  • (U::String)

    The receiver with its maximum #space? prefix and suffix removed, inheriting any taint and untrust

# File 'ext/u/rb_u_string_strip.c', line 7

VALUE
rb_u_string_strip(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        const char *begin = USTRING_STR(string);
        if (begin == NULL)
                return self;

        const char *end = USTRING_END(string);
        const char *s = begin;
        uint32_t c;
        const char *t;
        while (s < end && u_char_isspace(u_decode(&t, s, end)))
                s = t;

        t = end;
        while (begin < t) {
                const char *p;
                c = u_decode_r(&p, begin, t);
                if (c != '\0' && !u_char_isspace(c))
                        break;
                t = p;
        }

        if (s == begin && t == end)
                return self;

        return rb_u_string_new_c(self, s, t - s);
}
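
A rough model of this behavior in plain Ruby, assuming that the `[[:space:]]` bracket expression is a reasonable approximation of #space? (that equivalence is an assumption, and `unicode_strip` is a hypothetical name):

```ruby
# Hypothetical helper: strips Unicode whitespace from both ends.
# Like the source above, it also drops trailing NUL characters.
def unicode_strip(str)
  str.sub(/\A[[:space:]]+/, '').sub(/[[:space:]\x00]+\z/, '')
end

unicode_strip("\u3000 hi\u00A0\n") # "hi"
unicode_strip("hi")                # "hi"
```

Ruby's built-in String#strip only removes ASCII whitespace and NUL, which is why a Unicode-aware variant is useful for input that contains characters such as U+3000 IDEOGRAPHIC SPACE.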

#sub(pattern, replacement) ⇒ U::String? #sub(pattern, replacements) ⇒ U::String? #sub(pattern) {|match| ... } ⇒ U::String?

Overloads:

  • #sub(pattern, replacement) ⇒ U::String?

    Returns the receiver with the first match of PATTERN replaced by REPLACEMENT, inheriting any taint and untrust from the receiver and from REPLACEMENT, or nil if there’s no match.

    The REPLACEMENT is used as a specification for what to replace matches with:

    <table>

    <thead>
      <tr><th>Specification</th><th>Replacement</th></tr>
    </thead>
    <tbody>
      <tr>
        <td><code>\1</code>, <code>\2</code>, …, <code>\</code><em>n</em></td>
        <td>Numbered sub-match <em>n</em></td>
      </tr>
      <tr>
        <td><code>\k&lt;</code><em>name</em><code>></code></td>
        <td>Named sub-match <em>name</em></td>
      </tr>
    </tbody>
    

    </table>

    The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

    Parameters:

    Returns:

  • #sub(pattern, replacements) ⇒ U::String?

    Returns the receiver with the first match of PATTERN replaced by REPLACEMENTS#[match], where match is the matched substring, inheriting any taint and untrust from the receiver, REPLACEMENTS, and REPLACEMENTS#[match], or nil if there’s no match.

    The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

    Parameters:

    • pattern (Regexp, #to_str)
    • replacements (#to_hash)

    Returns:

    Raises:

    • (Exception)

      Any error raised by REPLACEMENTS#default, if it gets called

  • #sub(pattern) {|match| ... } ⇒ U::String?

    Returns the receiver with the first match of PATTERN replaced by the result of the given block, inheriting any taint and untrust from the receiver and from the result of the given block, or nil if there’s no match.

    The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

    Parameters:

    Yield Parameters:

    Yield Returns:

    Returns:



# File 'ext/u/rb_u_string_sub.c', line 65

VALUE
rb_u_string_sub(int argc, VALUE *argv, VALUE self)
{
        VALUE pattern, replacement;
        VALUE replacements = Qnil;
        bool use_block = false;
        bool tainted = false;
        bool untrusted = false;

        if (argc == 1)
                use_block = true;

        if (rb_scan_args(argc, argv, "11", &pattern, &replacement) == 2) {
                replacements = rb_check_convert_type(replacement, T_HASH,
                                                     "Hash", "to_hash");
                if (NIL_P(replacements))
                        StringValue(replacement);
                if (OBJ_TAINTED(replacement))
                        tainted = true;
                if (OBJ_UNTRUSTED(replacement))
                        untrusted = true;
        }

        pattern = rb_u_pattern_argument(pattern, true);

        VALUE str = rb_str_to_str(self);
        long begin = rb_reg_search(pattern, str, 0, 0);
        if (begin < 0)
                return Qnil;

        VALUE match = rb_backref_get();
        struct re_registers *registers = RMATCH_REGS(match);
        VALUE result;
        if (use_block || !NIL_P(replacements)) {
                if (use_block) {
                        VALUE ustr = rb_u_string_new_rb(rb_reg_nth_match(0, match));
                        result = rb_u_string_object_as_string(rb_yield(ustr));
                } else {
                        VALUE ustr = rb_u_string_new_c(self,
                                                       RSTRING_PTR(str) + registers->beg[0],
                                                       registers->end[0] - registers->beg[0]);
                        result = rb_u_string_object_as_string(rb_hash_aref(replacements, ustr));
                }
        } else
                result =
#ifdef HAVE_RB_REG_REGSUB4
                        rb_reg_regsub(replacement, str, registers, pattern);
#else
                        rb_reg_regsub(replacement, str, registers);
#endif

        if (OBJ_TAINTED(result))
                tainted = true;
        if (OBJ_UNTRUSTED(result))
                untrusted = true;

        const struct rb_u_string *value = RVAL2USTRING_ANY(result);

        size_t length = registers->beg[0] +
                USTRING_LENGTH(value) +
                (RSTRING_LEN(str) - registers->end[0]);
        char *base = ALLOC_N(char, length + 1);
        MEMCPY(base,
               RSTRING_PTR(str),
               char,
               registers->beg[0]);
        MEMCPY(base + registers->beg[0],
               USTRING_STR(value),
               char,
               USTRING_LENGTH(value));
        MEMCPY(base + registers->beg[0] + USTRING_LENGTH(value),
               RSTRING_PTR(str) + registers->end[0],
               char,
               RSTRING_LEN(str) - registers->end[0]);
        base[length] = '\0';

        VALUE substituted = rb_u_string_new_c_own(self, base, length);
        if (tainted)
                OBJ_TAINT(substituted);
        if (untrusted)
                OBJ_UNTRUST(substituted);
        return substituted;
}
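
The three overloads mirror those of Ruby's built-in String#sub, so they can be illustrated with plain String objects. This is a sketch using the built-in class as a stand-in; the variable names are only for illustration.

```ruby
date = '2024-05'

# Replacement specification with numbered and named sub-matches.
numbered = date.sub(/(\d+)-(\d+)/, '\2/\1')                             # "05/2024"
named    = date.sub(/(?<year>\d+)-(?<month>\d+)/, '\k<month>/\k<year>') # "05/2024"

# Hash of replacements, looked up by the matched substring.
hashed = 'cat'.sub(/[aeiou]/, 'a' => '4')   # "c4t"

# Block form; only the first match is replaced.
blocked = 'hello'.sub(/l/) { |m| m.upcase } # "heLlo"
```

One difference to keep in mind: the built-in String#sub returns a copy of the receiver when nothing matches, whereas U::String#sub is documented above to return nil.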

#title? ⇒ Boolean

Returns true if the receiver contains only characters in the general category Letter, Titlecase (Lt).

Returns:

  • (Boolean)

    True if the receiver contains only characters in the general category Letter, Titlecase (Lt)



# File 'ext/u/rb_u_string_title.c', line 6

VALUE
rb_u_string_title(VALUE self)
{
        return _rb_u_character_test(self, u_char_istitle);
}

#titlecase(locale = ENV['LC_CTYPE']) ⇒ U::String

Returns the title-casing of the receiver according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, inheriting any taint and untrust.

Parameters:

  • locale (#to_str) (defaults to: ENV['LC_CTYPE'])

Returns:

  • (U::String)

    The title-casing of the receiver according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, inheriting any taint and untrust



# File 'ext/u/rb_u_string_titlecase.c', line 9

VALUE
rb_u_string_titlecase(int argc, VALUE *argv, VALUE self)
{
        return _rb_u_string_convert_locale(argc, argv, self, u_titlecase, NULL);
}

#to_i(base = 10) ⇒ Integer

Returns the Integer value that results from treating the receiver as a string of digits in BASE.

The conversion algorithm is

  1. Skip any leading #space?s

  2. Check for an optional sign, ‘+’ or ‘-’

  3. If base is 2, skip an optional “0b” or “0B” prefix

  4. If base is 8, skip an optional “0o” or “0O” prefix

  5. If base is 10, skip an optional “0d” or “0D” prefix

  6. If base is 16, skip an optional “0x” or “0X” prefix

  7. Skip any ‘0’s

  8. Read as long a sequence of digits in BASE as possible, separated by optional U+005F LOW LINE characters, using letters in the following ranges of characters for digits, or the character’s digit value, if any

     * U+0041 LATIN CAPITAL LETTER A through U+005A LATIN CAPITAL LETTER Z
     * U+0061 LATIN SMALL LETTER A through U+007A LATIN SMALL LETTER Z
     * U+FF21 FULLWIDTH LATIN CAPITAL LETTER A through U+FF3A FULLWIDTH LATIN CAPITAL LETTER Z
     * U+FF41 FULLWIDTH LATIN SMALL LETTER A through U+FF5A FULLWIDTH LATIN SMALL LETTER Z

Note that only one separator is allowed in a row.

Parameters:

  • base (#to_int) (defaults to: 10)

Returns:

  • (Integer)

Raises:

  • (ArgumentError)

    Unless 2 ≤ BASE ≤ 36



# File 'ext/u/rb_u_string_to_i.c', line 32

VALUE
rb_u_string_to_i(int argc, VALUE *argv, VALUE self)
{
        int base = 10;

        VALUE rbbase;
        if (rb_scan_args(argc, argv, "01", &rbbase) == 1)
                base = NUM2INT(rbbase);

        if (base < 0)
                rb_u_raise(rb_eArgError, "illegal radix %d", base);

        return rb_u_string_to_inum(self, base, false);
}
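
The conversion steps documented for this method can be modeled in plain Ruby. The following is a simplified sketch, not the library's implementation: `sketch_to_i`, `digit_value`, and `RADIX_MARKERS` are hypothetical names, `[[:space:]]` stands in for #space?, and only ASCII digits plus the four letter ranges listed above are recognized as digits.

```ruby
# Hypothetical table: the optional prefix letter for each base that has one.
RADIX_MARKERS = { 2 => 'b', 8 => 'o', 10 => 'd', 16 => 'x' }.freeze

# Digit value of a character, or nil if it is not a digit in this sketch.
def digit_value(ch)
  cp = ch.ord
  case cp
  when 0x30..0x39     then cp - 0x30
  when 0x41..0x5A     then cp - 0x41 + 10
  when 0x61..0x7A     then cp - 0x61 + 10
  when 0xFF21..0xFF3A then cp - 0xFF21 + 10
  when 0xFF41..0xFF5A then cp - 0xFF41 + 10
  end
end

def sketch_to_i(str, base = 10)
  raise ArgumentError, "illegal radix #{base}" unless (2..36).cover?(base)

  chars = str.chars
  i = 0
  i += 1 while i < chars.size && chars[i] =~ /[[:space:]]/ # step 1
  sign = 1
  if i < chars.size && ['+', '-'].include?(chars[i])       # step 2
    sign = -1 if chars[i] == '-'
    i += 1
  end
  marker = RADIX_MARKERS[base]                             # steps 3 to 6
  if marker && chars[i] == '0' && chars[i + 1] && chars[i + 1].downcase == marker
    i += 2
  end
  i += 1 while chars[i] == '0'                             # step 7
  value = 0
  last_was_separator = false
  while i < chars.size                                     # step 8
    if chars[i] == '_'
      break if last_was_separator # only one separator in a row
      last_was_separator = true
    else
      digit = digit_value(chars[i])
      break if digit.nil? || digit >= base
      value = value * base + digit
      last_was_separator = false
    end
    i += 1
  end
  sign * value
end

sketch_to_i("  -0x1A", 16)    # -26
sketch_to_i("1_000")          # 1000
sketch_to_i("1__000")         # 1
sketch_to_i("\uFF46\uFF46", 16) # 255, fullwidth "ｆｆ"
```

Reading stops at the first character that is not a digit in BASE, so trailing text is ignored rather than treated as an error.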

#to_str ⇒ Object Also known as: to_s

Returns the String representation of the receiver, inheriting any taint and untrust, encoded as UTF-8.

Returns:

  • The String representation of the receiver, inheriting any taint and untrust, encoded as UTF-8



# File 'ext/u/rb_u_string_to_str.c', line 5

VALUE
rb_u_string_to_str(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        VALUE result = NIL_P(string->rb) ?
                rb_u_str_new(USTRING_STR(string), USTRING_LENGTH(string)) :
                string->rb;

        OBJ_INFECT(result, self);

        return result;
}

#to_sym ⇒ Symbol Also known as: intern

Returns the Symbol representation of the receiver.

Returns:

  • (Symbol)

    The Symbol representation of the receiver

Raises:

  • (EncodingError)

    If the receiver contains an invalid UTF-8 sequence

  • (RuntimeError)

    If there’s no more room for a new Symbol in Ruby’s Symbol table



# File 'ext/u/rb_u_string_to_sym.c', line 7

VALUE
rb_u_string_to_sym(VALUE self)
{
        /* NOTE: Lazy, but MRI makes it hard to implement this method. */
        return rb_str_intern(StringValue(self));
}

#tr(from, to) ⇒ U::String

Returns the receiver, translating characters in FROM to their equivalent character, by index, in TO, inheriting any taint and untrust. If TO#length < FROM#length, the last character of TO will be used for any index i ≥ TO#length.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.

Parameters:

Returns:



# File 'ext/u/rb_u_string_tr.c', line 262

VALUE
rb_u_string_tr(VALUE self, VALUE from, VALUE to)
{
        return tr_trans(self, from, to, false);
}
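
The set syntax described for this method is the same as that of Ruby's built-in String#tr, so a few calls on plain String objects can serve as a sketch of the expected behavior. The built-in class is used here as a stand-in; the variable names are only for illustration.

```ruby
by_index   = 'hello'.tr('el', 'ip')     # "hippo", e -> i and l -> p
padded     = 'hello'.tr('aeiou', '*')   # "h*ll*", the short TO is extended with its last character
ranged     = 'hello'.tr('a-y', 'b-z')   # "ifmmp", a-b ranges expand by code point
complement = 'hello'.tr('^aeiou', '*')  # "*e**o", a leading ^ selects everything not in the set
```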

#tr_s(from, to) ⇒ U::String

Returns the receiver, translating characters in FROM to their equivalent character, by index, in TO and then squeezing any substrings of #length > 1 consisting of the same character c with c, inheriting any taint and untrust. If TO#length < FROM#length, the last character of TO will be used for any index i ≥ TO#length.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.

Parameters:

Returns:



# File 'ext/u/rb_u_string_tr.c', line 286

VALUE
rb_u_string_tr_s(VALUE self, VALUE from, VALUE to)
{
        return tr_trans(self, from, to, true);
}
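
The squeezing step can be seen with Ruby's built-in String#tr_s, used here as a stand-in for the U::String method. Only runs of characters produced by the translation are squeezed.

```ruby
squeezed = 'hello'.tr_s('l', 'r')   # "hero", "ll" becomes "rr" and is squeezed to "r"
starred  = 'hello'.tr_s('el', '*')  # "h*o", "ell" becomes "***" and is squeezed to "*"
mixed    = 'hello'.tr_s('el', 'hx') # "hhxo", the original "h" is not part of the squeezed run
```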

#u ⇒ self

Returns the receiver; mostly for completeness, but allows you to always call #u on something that’s either a String or a U::String.

Returns:

  • (self)

    The receiver; mostly for completeness, but allows you to always call #u on something that’s either a String or a U::String



# File 'lib/u-1.0/string.rb', line 6

def u
  self
end

#upcase(locale = ENV['LC_CTYPE']) ⇒ U::String

Returns the upcasing of the receiver according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, inheriting any taint and untrust.

Parameters:

  • locale (#to_str) (defaults to: ENV['LC_CTYPE'])

Returns:

  • (U::String)

    The upcasing of the receiver according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, inheriting any taint and untrust



# File 'ext/u/rb_u_string_upcase.c', line 8

VALUE
rb_u_string_upcase(int argc, VALUE *argv, VALUE self)
{
        return _rb_u_string_convert_locale(argc, argv, self, u_upcase, NULL);
}

#upper?(locale = ENV['LC_CTYPE']) ⇒ Boolean

Returns true if the receiver has been upcased according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, that is, if a = a#upcase(LOCALE), where a = #normalize(:nfd).

Parameters:

  • locale (#to_str) (defaults to: ENV['LC_CTYPE'])

Returns:

  • (Boolean)

    True if the receiver has been upcased according to the rules of the language of LOCALE, which may be empty to specifically use the default, language-independent, rules, that is, if a = a#upcase(LOCALE), where a = #normalize(:nfd)



# File 'ext/u/rb_u_string_upper.c', line 9

VALUE
rb_u_string_upper(int argc, VALUE *argv, VALUE self)
{
        return _rb_u_string_test_locale(argc, argv, self, u_upcase);
}
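
The test a = a#upcase(LOCALE) with a = #normalize(:nfd) can be sketched with built-in String methods. This is a simplified model that ignores LOCALE entirely and relies on Ruby's own String#upcase and String#unicode_normalize; `upper_sketch?` is a hypothetical name.

```ruby
# Hypothetical helper: normalize to NFD first, then check that
# upcasing leaves the normalized string unchanged.
def upper_sketch?(str)
  a = str.unicode_normalize(:nfd)
  a == a.upcase
end

upper_sketch?("ÉCOLE") # true
upper_sketch?("École") # false
upper_sketch?("123")   # true, characters without case are unaffected by upcasing
```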

#valid? ⇒ Boolean

Returns true if the receiver contains only valid Unicode characters.

Returns:

  • (Boolean)

    True if the receiver contains only valid Unicode characters



# File 'ext/u/rb_u_string_valid.c', line 6

VALUE
rb_u_string_valid(VALUE self)
{
        return _rb_u_character_test(self, u_char_isvalid);
}

#valid_encoding? ⇒ Boolean

Returns true if the receiver contains only valid UTF-8 sequences.

Returns:

  • (Boolean)

    True if the receiver contains only valid UTF-8 sequences



# File 'ext/u/rb_u_string_valid_encoding.c', line 6

VALUE
rb_u_string_valid_encoding(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        return u_valid(USTRING_STR(string), USTRING_LENGTH(string), NULL) ? Qtrue : Qfalse;
}
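
Ruby's built-in String has a method of the same name, which makes for a convenient illustration of what a check on the UTF-8 sequence itself means. This sketch uses plain String objects as a stand-in; the variable names are only for illustration.

```ruby
complete  = "caf\u00E9"                                  # "é" encoded as the two bytes C3 A9
truncated = "caf\xC3".dup.force_encoding(Encoding::UTF_8) # lead byte C3 without its continuation byte

complete.valid_encoding?  # true
truncated.valid_encoding? # false
```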

#wide? ⇒ Boolean

Returns true if the receiver contains only “wide” characters. Wide characters are those that have their East_Asian_Width property set to Wide or Fullwidth.

This is mostly useful for determining how many “cells” a character will take up on a terminal or similar cell-based display.



# File 'ext/u/rb_u_string_wide.c', line 17

VALUE
rb_u_string_wide(VALUE self)
{
        return _rb_u_character_test(self, u_char_iswide);
}

#wide_cjk? ⇒ Boolean

Returns true if the receiver contains only “wide” and “ambiguously wide” characters. Wide and ambiguously wide characters are those that have their East_Asian_Width property set to Ambiguous, Wide or Fullwidth.

This is mostly useful for determining how many “cells” a character will take up on a terminal or similar cell-based display.



# File 'ext/u/rb_u_string_wide_cjk.c', line 17

VALUE
rb_u_string_wide_cjk(VALUE self)
{
        return _rb_u_character_test(self, u_char_iswide_cjk);
}

#width ⇒ Integer

Returns the width of the receiver. The width is defined as the sum of the number of “cells” on a terminal or similar cell-based display that the characters in the string will require.

Characters that are #wide? have a width of 2. Characters that are #zero_width? have a width of 0. Other characters have a width of 1.

Returns:

  • (Integer)

See Also:



# File 'ext/u/rb_u_string_width.c', line 13

VALUE
rb_u_string_width(VALUE self)
{
        const struct rb_u_string *string = RVAL2USTRING(self);

        return UINT2NUM(u_width_n(USTRING_STR(string), USTRING_LENGTH(string)));
}
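
The width rule (2 for wide, 0 for zero-width, 1 otherwise) can be sketched in plain Ruby. This is a rough model under stated assumptions: `WIDE_RANGES` is a small, deliberately incomplete stand-in for the East_Asian_Width data, the zero-width test follows the description given for #zero_width?, and all names are hypothetical.

```ruby
# Hypothetical and incomplete: a few well-known Wide/Fullwidth blocks.
WIDE_RANGES = [
  0x3040..0x30FF, # Hiragana and Katakana
  0x4E00..0x9FFF, # CJK Unified Ideographs
  0xAC00..0xD7A3, # Hangul Syllables
  0xFF01..0xFF60  # Fullwidth ASCII variants
].freeze

def zero_width_sketch?(ch)
  cp = ch.ord
  return false if cp == 0x00AD # SOFT HYPHEN is excluded
  return true if (0x1160...0x1200).cover?(cp) || cp == 0x200B
  ch.match?(/\A[\p{Mn}\p{Me}\p{Cf}]\z/)
end

def width_sketch(str)
  str.each_char.sum do |ch|
    if zero_width_sketch?(ch)
      0
    elsif WIDE_RANGES.any? { |range| range.cover?(ch.ord) }
      2
    else
      1
    end
  end
end

width_sketch("abc")      # 3
width_sketch("日本語")    # 6
width_sketch("e\u0301")  # 1, the combining accent takes no cell
```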

#word_break ⇒ Symbol

Returns the word break property value of the characters of the receiver.

The possible word break values are

  • :aletter

  • :cr

  • :extend

  • :extendnumlet

  • :format

  • :katakana

  • :lf

  • :midletter

  • :midnum

  • :midnumlet

  • :newline

  • :numeric

  • :other

  • :regional_indicator

Returns:

  • (Symbol)

Raises:

  • (ArgumentError)

    If the string consists of more than one break type

See Also:



# File 'ext/u/rb_u_string_word_break.c', line 57

VALUE
rb_u_string_word_break(VALUE self)
{
        return _rb_u_string_property(self, "word break", U_WORD_BREAK_OTHER,
                                     (int (*)(uint32_t))u_char_word_break,
                                     (VALUE (*)(int))break_to_symbol);
}

#xdigit? ⇒ Boolean

Returns true if the receiver contains only characters that are in the general category Number, decimal digit (Nd) or that are lower- or uppercase letters between ‘a’ and ‘f’. Specifically, any character that

  • Belongs to the general category Number, decimal digit (Nd)

  • Falls in the range U+0041 (LATIN CAPITAL LETTER A) through U+0046 (LATIN CAPITAL LETTER F)

  • Falls in the range U+0061 (LATIN SMALL LETTER A) through U+0066 (LATIN SMALL LETTER F)

  • Falls in the range U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A) through U+FF26 (FULLWIDTH LATIN CAPITAL LETTER F)

  • Falls in the range U+FF41 (FULLWIDTH LATIN SMALL LETTER A) through U+FF46 (FULLWIDTH LATIN SMALL LETTER F)

will do.

Returns:

  • (Boolean)


# File 'ext/u/rb_u_string_xdigit.c', line 18

VALUE
rb_u_string_xdigit(VALUE self)
{
        return _rb_u_character_test(self, u_char_isxdigit);
}
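
The character classes listed for this method translate fairly directly into a regular expression. A minimal sketch in plain Ruby follows; `XDIGIT_SKETCH` and `xdigit_sketch?` are hypothetical names, and treating the empty string as false is an assumption of this sketch, not something the documentation states.

```ruby
# Nd digits, ASCII A-F and a-f, and their fullwidth counterparts.
XDIGIT_SKETCH = /\A[\p{Nd}A-Fa-f\uFF21-\uFF26\uFF41-\uFF46]+\z/

def xdigit_sketch?(str)
  str.match?(XDIGIT_SKETCH)
end

xdigit_sketch?("1aF")          # true
xdigit_sketch?("\uFF26\uFF19") # true, FULLWIDTH F and FULLWIDTH DIGIT NINE
xdigit_sketch?("g1")           # false
```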

#zero_width? ⇒ Boolean

Returns true if the receiver contains only “zero-width” characters. A zero-width character is defined as a character in the general categories Mark, nonspacing (Mn), Mark, enclosing (Me) or Other, format (Cf), excluding the character U+00AD (SOFT HYPHEN), or is a Hangul character between U+1160 and U+1200 or U+200B (ZERO WIDTH SPACE).

Returns:

  • (Boolean)


# File 'ext/u/rb_u_string_zero_width.c', line 12

VALUE
rb_u_string_zero_width(VALUE self)
{
        return _rb_u_character_test(self, u_char_iszerowidth);
}