Class: U::String
- Inherits: Data
- Ancestry: Object, Data, U::String
- Includes: Comparable
- Defined in: ext/u/rb_u_string.c, lib/u-1.0/string.rb
Overview
A U::String is a sequence of zero or more Unicode characters encoded as UTF-8. Its interface is an extension of that of Ruby’s built-in String class that provides better Unicode support, as it handles things such as casing, width, collation, and various other Unicode properties that Ruby’s built-in String class simply doesn’t bother itself with. It also provides “backwards compatibility” with Ruby 1.8.7 so that you can use Unicode without upgrading to Ruby 2.0 (which you probably should do, though).
It differs from Ruby’s built-in String class in one other very important way: it doesn’t provide any way to change an existing object. That is, a U::String is a value object.
A U::String is most easily created from a String by calling #u. Most U::String methods that return a stringy result will return a U::String, so you only have to do that once. You can get back a String by calling #to_str.
Validation of a U::String’s content is deferred until the content is first accessed, at which time an ArgumentError will be raised if it isn’t valid.
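Ruby’s built-in String defers validation in a comparable way, which gives a rough feel for the behavior described above. The sketch below uses only String (the u library is not loaded), so it is an analogy under that assumption and not a demonstration of U::String itself.

```ruby
# Built-in String only; an analogy for deferred validation, not U::String.
bad = "abc\xFF".dup.force_encoding(Encoding::UTF_8)  # construction succeeds

valid = bad.valid_encoding?  # the byte 0xFF can never occur in UTF-8
raised = false

begin
  bad =~ /c/                 # matching has to decode the content
rescue ArgumentError => e
  raised = true
  puts e.message
end

p valid
p raised
```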
U::String has a lot of methods defined upon it, so let’s break them up into categories to get a proper overview of what’s possible to do with one. Let’s begin with the interrogators. There are three kinds of interrogators: validity-checking ones, property-checking ones, and content-matching ones.
The validity-checking interrogator is #valid_encoding?, which makes sure that the UTF-8 sequence itself is valid.
The property-checking interrogators are #alnum?, #alpha?, #ascii_only?, #assigned?, #case_ignorable?, #cased?, #cntrl?, #defined?, #digit?, #graph?, #newline?, #print?, #punct?, #soft_dotted?, #space?, #title?, #valid?, #wide?, #wide_cjk?, #xdigit?, and #zero_width?. These interrogators check the corresponding Unicode property of each character in the U::String and if all characters have this property, they’ll return true.
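The “all characters have the property” rule can be sketched with built-in String and Regexp property classes. The helper name all_chars? is hypothetical, and the regular expressions only approximate what the library tests; what U::String returns for an empty receiver is not stated here, whereas this sketch returns true.

```ruby
# Hypothetical helper: true when every character of str matches property.
def all_chars?(str, property)
  str.each_char.all? { |c| c.match?(property) }
end

p all_chars?("\u00C6r\u00F8sk\u00F8bing", /\p{Alpha}/)  # only letters
p all_chars?("\u00C6r\u00F8 42", /\p{Alpha}/)           # space and digits too
p all_chars?("\u0664\u0662", /\p{Nd}/)                  # Arabic-Indic digits
```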
Very close relatives to the property-checking interrogators are #folded?, #lower?, and #upper?, which check whether a string has been cased in a given way, and #normalized?, which checks whether the receiver has been normalized, optionally to a specific normalization form.
The content-matching interrogators are #==, #===, #=~, #match, #empty?, #end_with?, #eql?, #include?, #index, #rindex, and #start_with?. These interrogators check that a substring of the U::String matches another string or Regexp and either return a Boolean result, an index into the U::String where the match begins, or MatchData for full matching information.
Related to the content-matching interrogators are #<=>, #casecmp, and #collation_key, all of which compare a U::String against another for ordering.
Related to the property-checking interrogators are #canonical_combining_class, #general_category, #grapheme_break, #line_break, #script, and #word_break, which return the value of the Unicode property in question, the general category being the one often interrogated.
There are a couple of other “interrogators” in #bytesize, #length, #size, and #width that return integer properties of the U::String as a whole, where #length and #width are probably the most useful.
Beyond interrogators there are quite a few methods for iterating over the content of a U::String, each viewing it in its own way: #each_byte, #each_char, #each_codepoint, #each_grapheme_cluster, #each_line, and #each_word. They all have respective methods (#bytes, #chars, #codepoints, #grapheme_clusters, #lines, #words) that return an Array instead of yielding each result.
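The different views can be compared on a single string. This sketch uses the built-in String methods of the same names (grapheme_clusters needs Ruby 2.5 or later) and assumes U::String counts the same way.

```ruby
# Built-in String only: "resume" with two combining acute accents.
s = "re\u0301sume\u0301"

p s.bytes.length              # 10
p s.codepoints.length         # 8
p s.chars.length              # 8
p s.grapheme_clusters.length  # 6
p s.each_char.class           # Enumerator
```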
Quite a few methods are devoted to extracting a substring of a U::String, namely #[], #slice, #byteslice, #chomp, #chop, #chr, #getbyte, #lstrip, #ord, #rstrip, and #strip.
There are a few methods for case-shifting: #downcase, #foldcase, #titlecase, and #upcase. Then there’s #mirror, #normalize, and #reverse that alter the string in other ways.
The methods #center, #ljust, and #rjust pad a U::String to make it a certain number of cells wide.
Then there’s a couple of methods that are more related in the arguments they take than in function: #count, #delete, #squeeze, #tr, and #tr_s. These methods all take specifications of character/code point ranges that should be counted, deleted, squeezed, and translated (plus squeezed).
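Built-in String accepts the same kind of set specifications, so the semantics can be sketched as follows, assuming U::String interprets ranges, negation, and intersection of several sets the same way.

```ruby
# Built-in String only; set specifications as used by count/delete/squeeze/tr.
s = "hello world"

p s.count("a-z", "^aeiou")   # 7, letters that are not vowels
p s.delete("l", "lo")        # "heo word", only "l" is in both sets
p s.squeeze("l")             # "helo world"
p s.tr("lo", "01")           # "he001 w1r0d"
p "hello".tr_s("l", "r")     # "hero"
```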
Deconstructing a U::String can be done with #partition and #rpartition, which split it around a divider, #scan, which extracts matches to a pattern, and #split, which splits it on a divider.
Substitution of all matches to a pattern can be made with #gsub and of the first match to a pattern with #sub.
Creating larger U::Strings from smaller ones is done with #+, which concatenates two of them, and #*, which concatenates a U::String to itself a number of times.
A U::String can also be used as a specification as to how to format a number of values via #% (and its alias #format) into a new U::String, much like snprintf(3) in C.
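A short sketch of formatting with built-in String#%, which treats an Array argument as a list of values and anything else as a single value. It assumes U::String accepts the same directives.

```ruby
# Built-in String only.
p "%05.1f|%-4s|" % [3.14159, "ab"]   # "003.1|ab  |"
p "%x" % 255                          # "ff"
p "%s and %s" % %w[left right]        # "left and right"
```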
The content of a U::String can be rendered in a reader-friendly, and debugger-friendly, form with #dump and #inspect.
Finally, a U::String has a few methods to turn its content into other values: #hash, which turns it into a hash value to be used for hashing, #hex, #oct, and #to_i, which turn it into an Integer, #to_str, #to_s, and #b, which turn it into a String, and #to_sym (and its alias #intern), which turns it into a Symbol.
Note that some methods defined on String are missing. #capitalize doesn’t exist, as capitalization isn’t a Unicode concept. #sum doesn’t exist, as a U::String generally doesn’t contain content that you need a checksum of. #crypt doesn’t exist for similar reasons. #swapcase isn’t useful on a String and it certainly isn’t useful in a Unicode context. As a U::String doesn’t contain arbitrary data, #unpack is left to String. #next/#succ would perhaps be implementable, but haven’t been implemented, as a satisfactory implementation hasn’t been thought of.
Instance Method Summary
-
#%(value) ⇒ U::String
(also: #format)
Returns a formatted string of the values in Array(VALUE) by treating the receiver as a format specification of this formatted string.
- #*(n) ⇒ U::String
- #+(other) ⇒ U::String
-
#<=>(other, locale = ENV['LC_COLLATE']) ⇒ Fixnum
Returns the comparison of the receiver and OTHER using the linguistically correct rules of LOCALE.
- #==(other) ⇒ Boolean (also: #===)
- #=~(other) ⇒ Numeric?
- #[](*args) ⇒ Object (also: #slice)
- #alnum? ⇒ Boolean
- #alpha? ⇒ Boolean
- #ascii_only? ⇒ Boolean
- #assigned? ⇒ Boolean
-
#b ⇒ String
The String representation of the receiver, inheriting any taint and untrust, encoded as ASCII-8BIT.
-
#bytes ⇒ Array<Fixnum>
The bytes of the receiver.
-
#bytesize ⇒ Integer
The number of bytes required to represent the receiver.
- #byteslice(*args) ⇒ Object
-
#canonical_combining_class ⇒ Fixnum
Returns the canonical combining class of the characters of the receiver.
- #case_ignorable? ⇒ Boolean
- #casecmp(other, locale = ENV['LC_COLLATE']) ⇒ Fixnum
- #cased? ⇒ Boolean
- #center(width, padding = ' ') ⇒ U::String
-
#chars ⇒ Array<U::String>
The characters of the receiver, each inheriting any taint and untrust.
-
#chomp(separator = $/) ⇒ U::String, ...
Returns the receiver, minus any SEPARATOR suffix, inheriting any taint and untrust, unless #length = 0, in which case nil is returned.
-
#chop ⇒ U::String
Returns the receiver, minus its last character, inheriting any taint and untrust, unless the receiver is #empty? or if the last character is invalidly encoded, in which case the receiver is returned.
-
#chr ⇒ U::String
The substring [0, min(#length, 1)], inheriting any taint and untrust.
- #cntrl? ⇒ Boolean
-
#codepoints ⇒ Array<Integer>
The code points of the receiver.
- #collation_key(locale = ENV['LC_COLLATE']) ⇒ U::String
-
#count(set, *sets) ⇒ Integer
Returns the number of characters in the receiver that are included in the intersection of SET and any additional SETS of characters.
- #defined? ⇒ Boolean
-
#delete(set, *sets) ⇒ U::String
Returns the receiver, minus any characters that are included in the intersection of SET and any additional SETS of characters, inheriting any taint and untrust.
- #digit? ⇒ Boolean
- #downcase(locale = ENV['LC_CTYPE']) ⇒ U::String
-
#dump ⇒ U::String
Returns the receiver in a reader-friendly format, inheriting any taint and untrust.
- #each_byte ⇒ Object
- #each_char ⇒ Object
- #each_codepoint ⇒ Object
- #each_grapheme_cluster ⇒ Object (also: #grapheme_clusters)
- #each_line(*args) ⇒ Object
- #each_word ⇒ Object (also: #words)
- #empty? ⇒ Boolean
- #end_with?(*suffixes) ⇒ Boolean
- #eql?(other) ⇒ Boolean
- #foldcase(locale = ENV['LC_CTYPE']) ⇒ U::String
- #folded?(locale = ENV['LC_CTYPE']) ⇒ Boolean
-
#general_category ⇒ Symbol
Returns the general category of the characters of the receiver.
- #getbyte(index) ⇒ Fixnum?
-
#graph? ⇒ Boolean
Returns true if the receiver contains only non-space “printable” characters.
-
#grapheme_break ⇒ Symbol
Returns the grapheme break property value of the characters of the receiver.
- #gsub(*args) ⇒ Object
-
#hash ⇒ Fixnum
The hash value of the receiver’s content.
-
#hex ⇒ Integer
The result of #to_i(16).
- #include?(substring) ⇒ Boolean
-
#index(pattern, offset = 0) ⇒ Integer?
Returns the minimal index of the receiver where PATTERN matches, equal to or greater than i, where i = OFFSET if OFFSET ≥ 0, i = #length - abs(OFFSET) otherwise, or nil if there is no match.
-
#new(string = nil) ⇒ Object
constructor
Sets up a U::String wrapping STRING after encoding it as UTF-8 and freezing it.
-
#inspect ⇒ String
Returns the receiver in a reader-friendly inspectable format, inheriting any taint and untrust, encoded using UTF-8.
-
#length ⇒ Integer
(also: #size)
The number of characters in the receiver.
-
#line_break ⇒ Symbol
Returns the line break property value of the characters of the receiver.
-
#lines(separator = $/) ⇒ Array<U::String>
Returns the lines of the receiver, inheriting any taint and untrust.
- #ljust(width, padding = ' ') ⇒ U::String
- #lower?(locale = ENV['LC_CTYPE']) ⇒ Boolean
-
#lstrip ⇒ U::String
The receiver with its maximum #space? prefix removed, inheriting any taint and untrust.
- #match(*args) ⇒ Object
-
#mirror ⇒ U::String
Returns the mirroring of the receiver, inheriting any taint and untrust.
-
#newline? ⇒ Boolean
Returns true if the receiver contains only “newline” characters.
-
#normalize(form = :nfd) ⇒ U::String
Returns the receiver normalized into FORM, inheriting any taint and untrust.
-
#normalize?(mode = :default) ⇒ Boolean
Returns true if it can be determined that the receiver is normalized according to MODE.
- #oct ⇒ Integer
-
#ord ⇒ Integer
The code point of the first character of the receiver.
- #partition(separator) ⇒ Array<U::String>
- #print? ⇒ Boolean
- #punct? ⇒ Boolean
- #recode(codeset) ⇒ Object
-
#reverse ⇒ U::String
The reversal of the receiver, inheriting any taint and untrust from the receiver.
-
#rindex(pattern, offset = -1) ⇒ Integer?
Returns the maximal index of the receiver where PATTERN matches, equal to or less than i, where i = OFFSET if OFFSET ≥ 0, i = #length - abs(OFFSET) otherwise, or nil if there is no match.
- #rjust(width, padding = ' ') ⇒ U::String
- #rpartition(separator) ⇒ Array<U::String>
-
#rstrip ⇒ U::String
The receiver with its maximum #space? suffix removed, inheriting any taint and untrust from the receiver.
- #scan(pattern) ⇒ Object
-
#script ⇒ Symbol
Returns the script of the characters of the receiver.
- #soft_dotted? ⇒ Boolean
-
#space? ⇒ Boolean
Returns true if the receiver contains only “space” characters.
-
#split(pattern = $;, limit = 0) ⇒ Array<U::String>
Returns the receiver split into LIMIT substrings separated by PATTERN, each inheriting any taint and untrust.
-
#squeeze(*sets) ⇒ U::String
Returns the receiver, replacing any substrings of #length > 1 consisting of the same character c with c, where c is a member of the intersection of the character sets in SETS, inheriting any taint and untrust.
- #start_with?(*prefixes) ⇒ Boolean
-
#strip ⇒ U::String
The receiver with its maximum #space? prefix and suffix removed, inheriting any taint and untrust.
- #sub(*args) ⇒ Object
- #title? ⇒ Boolean
- #titlecase(locale = ENV['LC_CTYPE']) ⇒ U::String
-
#to_i(base = 16) ⇒ Integer
Returns the Integer value that results from treating the receiver as a string of digits in BASE.
-
#to_str ⇒ Object
(also: #to_s)
The String representation of the receiver, inheriting any taint and untrust, encoded as UTF-8.
-
#to_sym ⇒ Symbol
(also: #intern)
The Symbol representation of the receiver.
-
#tr(from, to) ⇒ U::String
Returns the receiver, translating characters in FROM to their equivalent character, by index, in TO, inheriting any taint and untrust.
-
#tr_s(from, to) ⇒ U::String
Returns the receiver, translating characters in FROM to their equivalent character, by index, in TO and then squeezing any substrings of #length > 1 consisting of the same character c with c, inheriting any taint and untrust.
-
#u ⇒ self
The receiver; mostly for completeness, but allows you to always call #u on something that’s either a String or a U::String.
- #upcase(locale = ENV['LC_CTYPE']) ⇒ U::String
- #upper?(locale = ENV['LC_CTYPE']) ⇒ Boolean
- #valid? ⇒ Boolean
- #valid_encoding? ⇒ Boolean
-
#wide? ⇒ Boolean
Returns true if the receiver contains only “wide” characters.
-
#wide_cjk? ⇒ Boolean
Returns true if the receiver contains only “wide” and “ambiguously wide” characters.
-
#width ⇒ Integer
Returns the width of the receiver.
-
#word_break ⇒ Symbol
Returns the word break property value of the characters of the receiver.
-
#xdigit? ⇒ Boolean
Returns true if the receiver contains only characters that are in the general category Number, decimal digit (Nd) or are lower- or uppercase letters between ‘a’ and ‘f’.
-
#zero_width? ⇒ Boolean
Returns true if the receiver contains only “zero-width” characters.
Constructor Details
#new(string = nil) ⇒ Object
# File 'ext/u/rb_u_string.c', line 161
static VALUE
rb_u_string_initialize(int argc, VALUE *argv, VALUE self)
{
    VALUE rb;
    rb_scan_args(argc, argv, "01", &rb);
    if (!NIL_P(rb)) {
        StringValue(rb);
        rb_u_string_set_rb(self, rb);
    }
    return Qnil;
}
Instance Method Details
#%(value) ⇒ U::String Also known as: format
# File 'ext/u/rb_u_string_format.c', line 1736
VALUE
rb_u_string_format_m(VALUE self, VALUE argument)
{
    volatile VALUE tmp = rb_check_array_type(argument);
    if (!NIL_P(tmp))
        return rb_u_string_format(RARRAY_LENINT(tmp), RARRAY_PTR(tmp), self);
    return rb_u_string_format(1, &argument, self);
}
#*(n) ⇒ U::String
# File 'ext/u/rb_u_string_times.c', line 9
VALUE
rb_u_string_times(VALUE self, VALUE rbtimes)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    long times = NUM2LONG(rbtimes);
    if (times < 0)
        rb_u_raise(rb_eArgError, "negative argument: %ld", times);
    /* TODO: Isn’t this off by one, as we add one to length for the
     * ALLOC_N() call? */
    if (times > 0 && LONG_MAX / times < USTRING_LENGTH(string))
        rb_u_raise(rb_eArgError, "argument too big: %ld", times);
    long length = times * USTRING_LENGTH(string);
    char *product = ALLOC_N(char, length + 1);
    long i = USTRING_LENGTH(string);
    /* Copy by doubling.  The bounds are byte counts, so they are
     * compared against length (total bytes), not times. */
    if (length > 0) {
        memcpy(product, USTRING_STR(string), i);
        for ( ; i <= length / 2; i *= 2)
            memcpy(product + i, product, i);
        memcpy(product + i, product, length - i);
    }
    product[length] = '\0';
    return rb_u_string_new_c_own(self, product, length);
}
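The copy-by-doubling idea in the listing above can be sketched in Ruby. The method name repeat_by_doubling is hypothetical; in this sketch the loop bound and the final partial copy are both measured against the total byte length of the result, and the outcome is checked against built-in String#*.

```ruby
# Hypothetical sketch of repetition by doubling, on built-in String.
def repeat_by_doubling(str, times)
  raise ArgumentError, "negative argument: #{times}" if times < 0

  total = str.bytesize * times          # size of the finished result in bytes
  return "" if total.zero?

  out = str.dup
  out += out while out.bytesize <= total / 2     # double while it still fits
  out + out.byteslice(0, total - out.bytesize)   # copy the remaining tail
end

p repeat_by_doubling("ab", 3)      # "ababab"
p repeat_by_doubling("\u00E9", 5)  # five copies of a two-byte character
```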
#+(other) ⇒ U::String
# File 'ext/u/rb_u_string_plus.c', line 8
VALUE
rb_u_string_plus(VALUE self, VALUE rbother)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    const struct rb_u_string *other = RVAL2USTRING_ANY(rbother);
    long string_length = USTRING_LENGTH(string);
    long other_length = USTRING_LENGTH(other);
    /* TODO: Isn’t this off by one, as we add one to length for the
     * ALLOC_N() call? */
    if (string_length > LONG_MAX - other_length)
        rb_u_raise(rb_eArgError, "length of resulting string would be too big");
    long length = string_length + other_length;
    char *sum = ALLOC_N(char, length + 1);
    memcpy(sum, USTRING_STR(string), string_length);
    memcpy(sum + string_length, USTRING_STR(other), other_length);
    sum[length] = '\0';
    VALUE result = rb_u_string_new_uninfected_own(sum, length);
    if (OBJ_TAINTED(self) || OBJ_TAINTED(rbother))
        OBJ_TAINT(result);
    return result;
}
#<=>(other, locale = ENV['LC_COLLATE']) ⇒ Fixnum
# File 'ext/u/rb_u_string_collate.c', line 21
VALUE
rb_u_string_collate(int argc, VALUE *argv, VALUE self)
{
    const char *locale = NULL;
    VALUE rbother, rblocale;
    if (rb_scan_args(argc, argv, "11", &rbother, &rblocale) == 2)
        locale = StringValuePtr(rblocale);
    else {
        const char * const env[] = { "LC_ALL", "LC_COLLATE", "LANG", NULL };
        for (const char * const *p = env; *p != NULL; p++)
            if ((locale = getenv(*p)) != NULL)
                break;
    }
    const struct rb_u_string *string = RVAL2USTRING(self);
    const struct rb_u_string *other = RVAL2USTRING_ANY(rbother);
    errno = 0;
    int r = u_collate(USTRING_STR(string), USTRING_LENGTH(string),
                      USTRING_STR(other), USTRING_LENGTH(other),
                      locale);
    if (errno != 0)
        rb_u_raise_errno(errno, "can’t collate strings");
    return INT2FIX(r);
}
#==(other) ⇒ Boolean Also known as: ===
# File 'ext/u/rb_u_string_equal.c', line 8
VALUE
rb_u_string_equal(VALUE self, VALUE rbother)
{
    if (self == rbother)
        return Qtrue;
    if (RTEST(rb_obj_is_kind_of(rbother, rb_cUString)))
        return rb_u_string_eql(self, rbother);
    if (!rb_respond_to(rbother, rb_intern("to_str")))
        return Qfalse;
    const struct rb_u_string *string = RVAL2USTRING(self);
    const struct rb_u_string *other = RVAL2USTRING_ANY(rbother);
    const char *p = USTRING_STR(string);
    const char *q = USTRING_STR(other);
    if (p == q)
        return Qtrue;
    long p_length = USTRING_LENGTH(string);
    long q_length = USTRING_LENGTH(other);
    return p_length == q_length && memcmp(p, q, q_length) == 0 ? Qtrue : Qfalse;
}
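The listing above compares lengths and then bytes, which suggests that canonically equivalent strings in different normalization forms compare as unequal until they are normalized. Built-in String behaves that way too, so the effect can be sketched without the u library; String#unicode_normalize stands in here for U::String#normalize, which is an assumption of this sketch.

```ruby
# Built-in String only; bytewise equality versus canonical equivalence.
composed   = "\u00E9"    # e-acute as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

p composed == decomposed                          # false
p composed.unicode_normalize(:nfd) == decomposed  # true
```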
#=~(other) ⇒ Numeric?
# File 'ext/u/rb_u_string_match.c', line 10
VALUE
rb_u_string_match(VALUE self, VALUE other)
{
    if (RTEST(rb_obj_is_kind_of(other, rb_cUString)))
        rb_u_raise(rb_eTypeError, "type mismatch: U::String given");
    switch (TYPE(other)) {
    case T_STRING:
        rb_u_raise(rb_eTypeError, "type mismatch: String given");
        break;
    case T_REGEXP: {
        const struct rb_u_string *string = RVAL2USTRING(self);
        long index = rb_reg_search(other, rb_str_to_str(self), 0, 0);
        if (index < 0)
            return Qnil;
        return LONG2NUM(u_pointer_to_offset(USTRING_STR(string),
                                            USTRING_STR(string) + index));
    }
    default:
        return rb_funcall(other, rb_intern("=~"), 1, self);
    }
}
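The listing above converts the match position into a character offset and rejects string arguments. Built-in String#=~ has the same two traits, so they can be sketched as follows, assuming U::String reports offsets in characters as the call to u_pointer_to_offset suggests.

```ruby
# Built-in String only; character offsets versus byte offsets.
s = "na\u00EFve caf\u00E9"
m = s.match(/caf/)

p s =~ /caf/             # 6, counted in characters
p m.pre_match.bytesize   # 7, the two-byte character before the match adds one

type_error = false
begin
  s =~ "caf"             # a String is not accepted as a pattern
rescue TypeError
  type_error = true
end
p type_error
```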
#[](index) ⇒ U::String?
#[](index, length) ⇒ U::String?
#[](range) ⇒ U::String?
#[](regexp, reference = 0) ⇒ U::String?
#[](string) ⇒ U::String?
#[](object) ⇒ nil
Also known as: slice
# File 'ext/u/rb_u_string_aref.c', line 130
VALUE
rb_u_string_aref_m(int argc, VALUE *argv, VALUE self)
{
    need_m_to_n_arguments(argc, 1, 2);
    if (argc == 1)
        return rb_u_string_aref(self, argv[0]);
    if (TYPE(argv[0]) == T_REGEXP)
        return rb_u_string_subpat(self, argv[0], argv[1]);
    return rb_u_string_substr(self, NUM2LONG(argv[0]), NUM2LONG(argv[1]));
}
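The overloads listed above mirror those of built-in String#[], so the dispatch can be sketched with String. Indexes are character indexes; that U::String returns the same substrings is assumed here, not verified.

```ruby
# Built-in String only; one call per overload.
s = "\u00C6r\u00F8sk\u00F8bing"

p s[1]                 # "r"
p s[1, 3]              # three characters starting at index 1
p s[2..4]              # characters 2 through 4
p s[/k(\u00F8)b/, 1]   # first capture group of the match
p s["bing"]            # "bing"
p s["xyz"]             # nil
p s[42]                # nil
```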
#alnum? ⇒ Boolean
# File 'ext/u/rb_u_string_alnum.c', line 6
VALUE
rb_u_string_alnum(VALUE self)
{
    return _rb_u_character_test(self, u_char_isalnum);
}
#alpha? ⇒ Boolean
# File 'ext/u/rb_u_string_alpha.c', line 6
VALUE
rb_u_string_alpha(VALUE self)
{
    return _rb_u_character_test(self, u_char_isalpha);
}
#ascii_only? ⇒ Boolean
# File 'ext/u/rb_u_string_ascii_only.c', line 6
VALUE
rb_u_string_ascii_only(VALUE self)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    return u_is_ascii_only_n(USTRING_STR(string), USTRING_LENGTH(string)) ?
        Qtrue : Qfalse;
}
#assigned? ⇒ Boolean
# File 'ext/u/rb_u_string_assigned.c', line 6
VALUE
rb_u_string_assigned(VALUE self)
{
    return _rb_u_character_test(self, u_char_isassigned);
}
#b ⇒ String
Returns the String representation of the receiver, inheriting any taint and untrust, encoded as ASCII-8BIT.
# File 'ext/u/rb_u_string_b.c', line 8
VALUE
rb_u_string_b(VALUE self)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    VALUE result = rb_str_new(USTRING_STR(string), USTRING_LENGTH(string));
#ifdef HAVE_RUBY_ENCODING_H
    rb_enc_associate(result, rb_ascii8bit_encoding());
#endif
    OBJ_INFECT(result, self);
    return result;
}
#bytes ⇒ Array<Fixnum>
Returns the bytes of the receiver.
# File 'ext/u/rb_u_string_each_byte.c', line 40
VALUE
rb_u_string_bytes(VALUE self)
{
    struct yield_array y = YIELD_ARRAY_INIT;
    each(self, &y.yield);
    return y.array;
}
#bytesize ⇒ Integer
Returns the number of bytes required to represent the receiver.
# File 'ext/u/rb_u_string_bytesize.c', line 4
VALUE
rb_u_string_bytesize(VALUE self)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    return LONG2NUM(USTRING_LENGTH(string));
}
#byteslice(index) ⇒ U::String?
#byteslice(index, length) ⇒ U::String?
#byteslice(range) ⇒ U::String?
#byteslice(object) ⇒ nil
# File 'ext/u/rb_u_string_byteslice.c', line 92
VALUE
rb_u_string_byteslice_m(int argc, VALUE *argv, VALUE self)
{
    need_m_to_n_arguments(argc, 1, 2);
    if (argc == 1)
        return rb_u_string_byteslice(self, argv[0]);
    return rb_u_string_byte_substr(self,
                                   NUM2LONG(argv[0]),
                                   NUM2LONG(argv[1]));
}
#canonical_combining_class ⇒ Fixnum
Returns the canonical combining class of the characters of the receiver.
The canonical combining class of a character is a number in the range [0, 254]. The canonical combining class is used when generating a canonical ordering of the characters in a string.
The empty string has a canonical combining class of 0.
# File 'ext/u/rb_u_string_canonical_combining_class.c', line 16
VALUE
rb_u_string_canonical_combining_class(VALUE self)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    const char *p = USTRING_STR(string);
    const char *end = USTRING_END(string);
    /* The empty string has class 0; a bare 0 would be Qfalse, not a Fixnum. */
    if (p == end)
        return INT2FIX(0);
    int first = u_char_canonical_combining_class(u_decode(&p, p, end));
    while (p < end) {
        int value = u_char_canonical_combining_class(u_decode(&p, p, end));
        if (value != first)
            rb_u_raise(rb_eArgError,
                       "string consists of characters with different canonical combining class values: %d+, %d",
                       first, value);
    }
    return INT2FIX(first);
}
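The pattern in the listing above, where a property value is returned only when every character shares it and an ArgumentError is raised otherwise, can be sketched in Ruby. The helper uniform_value and the toy classifier are hypothetical; the classifier merely separates nonspacing marks from everything else and is not the canonical combining class. Unlike the documented method, this sketch returns nil for an empty string.

```ruby
# Hypothetical helper: the value shared by all characters, or ArgumentError.
def uniform_value(str)
  values = str.each_char.map { |c| yield c }.uniq
  return nil if values.empty?
  unless values.length == 1
    raise ArgumentError,
          "string consists of characters with different values: " \
          "#{values[0]}, #{values[1]}"
  end
  values.first
end

# Toy classifier, for illustration only.
kind = ->(c) { c.match?(/\p{Mn}/) ? :nonspacing_mark : :other }

p uniform_value("\u0301\u0308", &kind)   # :nonspacing_mark
begin
  uniform_value("e\u0301", &kind)        # a letter followed by a mark
rescue ArgumentError => e
  puts e.message
end
```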
#case_ignorable? ⇒ Boolean
# File 'ext/u/rb_u_string_case_ignorable.c', line 21
VALUE
rb_u_string_case_ignorable(VALUE self)
{
    return _rb_u_character_test(self, u_char_iscaseignorable);
}
#casecmp(other, locale = ENV['LC_COLLATE']) ⇒ Fixnum
# File 'ext/u/rb_u_string_casecmp.c', line 31
VALUE
rb_u_string_casecmp(int argc, VALUE *argv, VALUE self)
{
    const char *locale = NULL;
    VALUE rbother, rblocale;
    if (rb_scan_args(argc, argv, "11", &rbother, &rblocale) == 2)
        locale = StringValuePtr(rblocale);
    const struct rb_u_string *string = RVAL2USTRING(self);
    const struct rb_u_string *other = RVAL2USTRING_ANY(rbother);
    char *folded;
    size_t folded_n = foldcase(&folded, string, locale, NULL);
    char *folded_other;
    size_t folded_other_n = foldcase(&folded_other, other, locale, folded);
    errno = 0;
    int r = u_collate(folded, folded_n,
                      folded_other, folded_other_n,
                      locale);
    free(folded_other);
    free(folded);
    if (errno != 0)
        rb_u_raise_errno(errno, "can’t collate strings");
    return INT2FIX(r);
}
#cased? ⇒ Boolean
# File 'ext/u/rb_u_string_cased.c', line 13
VALUE
rb_u_string_cased(VALUE self)
{
    return _rb_u_character_test(self, u_char_iscased);
}
#center(width, padding = ' ') ⇒ U::String
# File 'ext/u/rb_u_string_justify.c', line 131
VALUE
rb_u_string_center(int argc, VALUE *argv, VALUE self)
{
    return rb_u_string_justify(argc, argv, self, 'c');
}
#chars ⇒ Array<U::String>
Returns the characters of the receiver, each inheriting any taint and untrust.
# File 'ext/u/rb_u_string_each_char.c', line 43
VALUE
rb_u_string_chars(VALUE self)
{
    struct yield_array y = YIELD_ARRAY_INIT;
    each(self, &y.yield);
    return y.array;
}
#chomp(separator = $/) ⇒ U::String, ...
# File 'ext/u/rb_u_string_chomp.c', line 65
VALUE
rb_u_string_chomp(int argc, VALUE *argv, VALUE self)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    long length = USTRING_LENGTH(string);
    if (length == 0)
        return Qnil;
    VALUE rs;
    if (argc == 0) {
        rs = rb_rs;
        if (rs == rb_default_rs)
            return rb_u_string_chomp_default(self);
    } else {
        rb_scan_args(argc, argv, "01", &rs);
    }
    if (NIL_P(rs))
        return self;
    const struct rb_u_string *separator = RVAL2USTRING_ANY(rs);
    long separator_length = USTRING_LENGTH(separator);
    if (separator_length == 0)
        return rb_u_string_chomp_newlines(self);
    if (separator_length > length)
        return self;
    char last_char = USTRING_STR(separator)[separator_length - 1];
    if (separator_length == 1 && last_char == '\n')
        return rb_u_string_chomp_default(self);
    if (!u_valid(USTRING_STR(separator), separator_length, NULL) ||
        USTRING_STR(string)[length - 1] != last_char ||
        (separator_length > 1 &&
         rb_memcmp(USTRING_STR(separator),
                   USTRING_END(string) - separator_length,
                   separator_length) != 0))
        return self;
    return rb_u_string_new_c(self, USTRING_STR(string), length - separator_length);
}
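The branches in the listing above (default separator, empty separator, nil separator, explicit suffix) correspond to cases that built-in String#chomp also distinguishes. The sketch below uses String and assumes U::String treats non-empty receivers the same way; one visible difference is that the listing returns nil for an empty receiver, whereas String#chomp returns an empty string.

```ruby
# Built-in String only; one call per separator case.
p "line\r\n".chomp         # "line", default separator removes CR LF
p "line\n\n".chomp         # "line\n", only one trailing newline goes
p "line\n\n\n".chomp("")   # "line", empty separator removes all newlines
p "hello".chomp("llo")     # "he"
p "hello".chomp("xyz")     # "hello", suffix does not match
p "hello".chomp(nil)       # "hello", nil separator removes nothing
```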
#chop ⇒ U::String
Returns the receiver, minus its last character, inheriting any taint and untrust, unless the receiver is #empty? or if the last character is invalidly encoded, in which case the receiver is returned.
If the last character is U+000A LINE FEED and the second-to-last character is U+000D CARRIAGE RETURN, both characters are removed.
# File 'ext/u/rb_u_string_chop.c', line 15
VALUE
rb_u_string_chop(VALUE self)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    if (USTRING_LENGTH(string) == 0)
        return self;
    const char *begin = USTRING_STR(string);
    const char *end = USTRING_END(string);
    const char *last;
    uint32_t c = u_decode_r(&last, begin, end);
    /* Only look at the preceding byte if there is one. */
    if (c == '\n' && last > begin && *(last - 1) == '\r')
        last--;
    return rb_u_string_new_c(self, begin, last - begin);
}
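The behavior described above, including the special handling of CARRIAGE RETURN followed by LINE FEED, matches built-in String#chop, so it can be sketched with String under the assumption that U::String agrees for validly encoded input.

```ruby
# Built-in String only.
p "text\r\n".chop   # "text", CR LF is removed as a pair
p "text\n".chop     # "text"
p "text".chop       # "tex"
p "\u00E9".chop     # "", a whole character is removed, not a byte
p "".chop           # ""
```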
#chr ⇒ U::String
Returns the substring [0, min(#length, 1)], inheriting any taint and untrust.
# File 'ext/u/rb_u_string_chr.c', line 5
VALUE
rb_u_string_chr(VALUE self)
{
    return rb_u_string_substr(self, 0, 1);
}
#cntrl? ⇒ Boolean
# File 'ext/u/rb_u_string_cntrl.c', line 6
VALUE
rb_u_string_cntrl(VALUE self)
{
    return _rb_u_character_test(self, u_char_iscntrl);
}
#codepoints ⇒ Array<Integer>
Returns the code points of the receiver.
# File 'ext/u/rb_u_string_each_codepoint.c', line 39
VALUE
rb_u_string_codepoints(VALUE self)
{
    struct yield_array y = YIELD_ARRAY_INIT;
    each(self, &y.yield);
    return y.array;
}
#collation_key(locale = ENV['LC_COLLATE']) ⇒ U::String
# File 'ext/u/rb_u_string_collation_key.c', line 14
VALUE
rb_u_string_collation_key(int argc, VALUE *argv, VALUE self)
{
    return _rb_u_string_convert_locale(argc, argv, self, u_collation_key, "LC_COLLATE");
}
#count(set, *sets) ⇒ Integer
# File 'ext/u/rb_u_string_count.c', line 19
VALUE
rb_u_string_count(int argc, VALUE *argv, VALUE self)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    need_at_least_n_arguments(argc, 1);
    if (USTRING_LENGTH(string) == 0)
        return INT2FIX(0);
    struct tr_table table;
    tr_table_initialize_from_strings(&table, argc, argv);
    long count = 0;
    for (const char *p = USTRING_STR(string), *end = USTRING_END(string); p < end; )
        if (tr_table_lookup(&table, u_decode(&p, p, end)))
            count++;
    return LONG2NUM(count);
}
#defined? ⇒ Boolean
# File 'ext/u/rb_u_string_defined.c', line 6
VALUE
rb_u_string_defined(VALUE self)
{
    return _rb_u_character_test(self, u_char_isdefined);
}
#delete(set, *sets) ⇒ U::String
# File 'ext/u/rb_u_string_delete.c', line 40
VALUE
rb_u_string_delete(int argc, VALUE *argv, VALUE self)
{
    const struct rb_u_string *string = RVAL2USTRING(self);
    need_at_least_n_arguments(argc, 1);
    if (USTRING_LENGTH(string) == 0)
        return self;
    struct tr_table table;
    tr_table_initialize_from_strings(&table, argc, argv);
    long count = rb_u_string_delete_loop(string, &table, NULL);
    if (count == 0)
        return self;
    char *remaining = ALLOC_N(char, count + 1);
    rb_u_string_delete_loop(string, &table, remaining);
    remaining[count] = '\0';
    return rb_u_string_new_c_own(self, remaining, count);
}
#digit? ⇒ Boolean
# File 'ext/u/rb_u_string_digit.c', line 6
VALUE
rb_u_string_digit(VALUE self)
{
    return _rb_u_character_test(self, u_char_isdigit);
}
#downcase(locale = ENV['LC_CTYPE']) ⇒ U::String
# File 'ext/u/rb_u_string_downcase.c', line 9
VALUE
rb_u_string_downcase(int argc, VALUE *argv, VALUE self)
{
    return _rb_u_string_convert_locale(argc, argv, self, u_downcase, NULL);
}
#dump ⇒ U::String
Returns the receiver in a reader-friendly format, inheriting any taint and untrust.
The reader-friendly format looks like “"…".u”. Inside the “…”, any #print? characters in the ASCII range are output as-is, the following special characters are escaped according to the following table:
<table>
<thead><tr><th>Character</th><th>Dumped Sequence</th></tr></thead>
<tbody>
<tr><td>U+0022 QUOTATION MARK</td><td><code>\"</code></td></tr>
<tr><td>U+005C REVERSE SOLIDUS</td><td><code>\\</code></td></tr>
<tr><td>U+000A LINE FEED (LF)</td><td><code>\n</code></td></tr>
<tr><td>U+000D CARRIAGE RETURN (CR)</td><td><code>\r</code></td></tr>
<tr><td>U+0009 CHARACTER TABULATION</td><td><code>\t</code></td></tr>
<tr><td>U+000C FORM FEED (FF)</td><td><code>\f</code></td></tr>
<tr><td>U+000B LINE TABULATION</td><td><code>\v</code></td></tr>
<tr><td>U+0008 BACKSPACE</td><td><code>\b</code></td></tr>
<tr><td>U+0007 BELL</td><td><code>\a</code></td></tr>
<tr><td>U+001B ESCAPE</td><td><code>\e</code></td></tr>
</tbody>
</table>
the following special sequences are also escaped:
<table>
<thead><tr><th>Character</th><th>Dumped Sequence</th></tr></thead>
<tbody>
<tr><td><code>#$</code></td><td><code>\#$</code></td></tr>
<tr><td><code>#@</code></td><td><code>\#@</code></td></tr>
<tr><td><code>#{</code></td><td><code>\#{</code></td></tr>
</tbody>
</table>
any valid UTF-8 byte sequences are output as “\u{n}”, where n is the lowercase hexadecimal representation of the code point encoded by the UTF-8 sequence, and any other byte is output as “\xn”, where n is the two-digit uppercase hexadecimal representation of the byte’s value.
125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 |
# File 'ext/u/rb_u_string_dump.c', line 125
VALUE
rb_u_string_dump(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
const char *p = USTRING_STR(string);
const char *end = USTRING_END(string);
VALUE buffer = rb_u_buffer_new_sized(7);
rb_u_buffer_append(buffer, "\"", 1);
while (p < end) {
unsigned char c = *p;
if (!rb_u_string_dump_escape(buffer, c) &&
!rb_u_string_dump_hash(buffer, c, p, end) &&
!rb_u_string_dump_ascii_printable(buffer, c) &&
!rb_u_string_dump_codepoint(buffer, &p, end))
rb_u_string_dump_hex(buffer, c);
p++;
}
rb_u_buffer_append(buffer, "\".u", 3);
VALUE result = rb_u_buffer_to_u_bang(buffer);
OBJ_INFECT(result, self);
return result;
}
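The escaping rules above can be sketched in plain Ruby. This is only an approximation built on Ruby's built-in String, not the extension's code: `dump_like` and `ESCAPES` are hypothetical names, and the treatment of control characters that are not in the table (emitted here as “\u{n}”) is an assumption.

```ruby
# Hypothetical helper: approximates the documented dump format using only
# Ruby's built-in String. It is not the extension's actual code.
ESCAPES = {
  '"' => '\"', '\\' => '\\\\', "\n" => '\n', "\r" => '\r', "\t" => '\t',
  "\f" => '\f', "\v" => '\v', "\b" => '\b', "\a" => '\a', "\e" => '\e'
}.freeze

def dump_like(string)
  chars = string.dup.force_encoding(Encoding::UTF_8).each_char.to_a
  out = String.new('"')
  chars.each_with_index do |char, i|
    piece =
      if !char.valid_encoding?
        # Bytes that are not valid UTF-8: two-digit uppercase hex.
        char.bytes.map { |byte| format('\x%02X', byte) }.join
      elsif ESCAPES.key?(char)
        ESCAPES[char]
      elsif char == '#' && ['$', '@', '{'].include?(chars[i + 1])
        '\#' # only the "#" is escaped; the next character follows as-is
      elsif char.ord.between?(0x20, 0x7E)
        char # printable ASCII is kept as-is
      else
        format('\u{%x}', char.ord) # lowercase hexadecimal code point
      end
    out << piece
  end
  out << '".u'
end

puts dump_like("a\tb")
puts dump_like("\u00E9")
```

The trailing `.u` mirrors the documented format, so the dumped text reads as a Ruby expression that builds a U::String.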
#each_byte {|byte| ... } ⇒ self
#each_byte ⇒ Enumerator
# File 'ext/u/rb_u_string_each_byte.c', line 30
VALUE
rb_u_string_each_byte(VALUE self)
{
RETURN_SIZED_ENUMERATOR(self, 0, NULL, size);
struct yield y = YIELD_INIT;
each(self, &y);
return self;
}
#each_char {|char| ... } ⇒ self
#each_char ⇒ Enumerator
# File 'ext/u/rb_u_string_each_char.c', line 32
VALUE
rb_u_string_each_char(VALUE self)
{
RETURN_SIZED_ENUMERATOR(self, 0, NULL, size);
struct yield y = YIELD_INIT;
each(self, &y);
return self;
}
#each_codepoint {|codepoint| ... } ⇒ self
#each_codepoint ⇒ Enumerator
# File 'ext/u/rb_u_string_each_codepoint.c', line 29
VALUE
rb_u_string_each_codepoint(VALUE self)
{
RETURN_SIZED_ENUMERATOR(self, 0, NULL, size);
struct yield y = YIELD_INIT;
each(self, &y);
return self;
}
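The three iterators above differ only in the unit they yield. A minimal sketch, using Ruby's built-in String as a stand-in on the assumption that U::String yields the same units (as U::String objects where the built-in yields Strings):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "a\u00E9" # "a" followed by U+00E9, which takes two bytes in UTF-8

s.each_byte { |byte| puts byte }            # yields 97, 195, 169
s.each_char { |char| puts char.bytesize }   # yields 1, then 2
s.each_codepoint { |cp| puts cp }           # yields 97, 233

# Without a block, each method returns an Enumerator instead.
puts s.each_char.class                      # Enumerator
```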
#each_grapheme_cluster {|cluster| ... } ⇒ self
#each_grapheme_cluster ⇒ Enumerator
Also known as: grapheme_clusters
# File 'ext/u/rb_u_string_each_grapheme_cluster.c', line 25
VALUE
rb_u_string_each_grapheme_cluster(VALUE self)
{
RETURN_ENUMERATOR(self, 0, NULL);
const struct rb_u_string *string = RVAL2USTRING(self);
const char *p = USTRING_STR(string);
const char *end = USTRING_END(string);
size_t length = end - p;
u_grapheme_clusters(p, length, (u_substring_fn)each, &self);
return self;
}
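A grapheme cluster is what a reader perceives as one character, even when it spans several code points. The sketch below uses the built-in String#each_grapheme_cluster (Ruby 2.5 or later) as a stand-in; that U::String segments text the same way is an assumption.

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "e\u0301x" # "e", COMBINING ACUTE ACCENT, "x"

puts s.each_codepoint.count         # 3 code points
puts s.each_grapheme_cluster.count  # 2 grapheme clusters

# CR followed by LF also forms a single cluster.
puts "\r\n".each_grapheme_cluster.count # 1
```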
#each_line(separator = $/) {|lp| ... } ⇒ self
#each_line(separator = $/) ⇒ Enumerator
# File 'ext/u/rb_u_string_each_line.c', line 118
VALUE
rb_u_string_each_line(int argc, VALUE *argv, VALUE self)
{
RETURN_ENUMERATOR(self, argc, argv);
struct yield y = YIELD_INIT;
each(argc, argv, self, &y);
return self;
}
#each_word {|word| ... } ⇒ self
#each_word ⇒ Enumerator
Also known as: words
# File 'ext/u/rb_u_string_each_word.c', line 24
VALUE
rb_u_string_each_word(VALUE self)
{
RETURN_ENUMERATOR(self, 0, NULL);
const struct rb_u_string *string = RVAL2USTRING(self);
const char *p = USTRING_STR(string);
size_t length = USTRING_LENGTH(string);
u_words(p, length, (u_substring_fn)each, &self);
return self;
}
#empty? ⇒ Boolean
# File 'ext/u/rb_u_string_empty.c', line 5
VALUE
rb_u_string_empty(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
return (USTRING_LENGTH(string) == 0) ? Qtrue : Qfalse;
}
#end_with?(*suffixes) ⇒ Boolean
# File 'ext/u/rb_u_string_end_with.c', line 7
VALUE
rb_u_string_end_with(int argc, VALUE *argv, VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
const char *end = USTRING_END(string);
long p_length = USTRING_LENGTH(string);
for (int i = 0; i < argc; i++) {
VALUE tmp = rb_u_string_check_type(argv[i]);
if (NIL_P(tmp))
continue;
const struct rb_u_string *other = RVAL2USTRING_ANY(tmp);
const char *q = USTRING_STR(other);
long q_length = USTRING_LENGTH(other);
if (p_length < q_length)
continue;
if (memcmp(end - q_length, q, q_length) == 0)
return Qtrue;
}
return Qfalse;
}
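The C listing above compares the trailing bytes of the receiver against each candidate and skips arguments that are not strings. The same logic can be sketched in plain Ruby; `end_with_any?` is a hypothetical name, and the sketch works on built-in Strings rather than U::String objects.

```ruby
# Hypothetical helper mirroring the byte-wise suffix test in the C listing.
def end_with_any?(string, *suffixes)
  bytes = string.b # binary copy, so the comparison is purely byte-wise
  suffixes.any? do |suffix|
    next false unless suffix.respond_to?(:to_str) # skip non-strings
    tail = suffix.to_str.b
    tail.bytesize <= bytes.bytesize &&
      bytes.byteslice(bytes.bytesize - tail.bytesize, tail.bytesize) == tail
  end
end

puts end_with_any?("hello", "x", "llo") # true
puts end_with_any?("hello", "hellooo")  # false
```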
#eql?(other) ⇒ Boolean
# File 'ext/u/rb_u_string_eql.c', line 8
VALUE
rb_u_string_eql(VALUE self, VALUE rbother)
{
if (self == rbother)
return Qtrue;
if (!RTEST(rb_obj_is_kind_of(rbother, rb_cUString)))
return Qfalse;
const struct rb_u_string *string = RVAL2USTRING(self);
const struct rb_u_string *other = RVAL2USTRING(rbother);
const char *p = USTRING_STR(string);
const char *q = USTRING_STR(other);
if (p == q)
return Qtrue;
long p_length = USTRING_LENGTH(string);
long q_length = USTRING_LENGTH(other);
return p_length == q_length && memcmp(p, q, q_length) == 0 ? Qtrue : Qfalse;
}
#foldcase(locale = ENV['LC_CTYPE']) ⇒ U::String
# File 'ext/u/rb_u_string_foldcase.c', line 8
VALUE
rb_u_string_foldcase(int argc, VALUE *argv, VALUE self)
{
return _rb_u_string_convert_locale(argc, argv, self, u_foldcase, NULL);
}
#folded?(locale = ENV['LC_CTYPE']) ⇒ Boolean
# File 'ext/u/rb_u_string_folded.c', line 9
VALUE
rb_u_string_folded(int argc, VALUE *argv, VALUE self)
{
return _rb_u_string_test_locale(argc, argv, self, u_foldcase);
}
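Case folding maps text to a form suited to caseless comparison. The sketch below uses the built-in String#downcase(:fold) (Ruby 2.4 or later) as a stand-in. The `folded?` helper is hypothetical: it assumes that #folded? is true exactly when folding leaves the string unchanged, and it ignores the locale argument.

```ruby
# Hypothetical helper: true when case folding leaves the string unchanged.
def folded?(string)
  string == string.downcase(:fold)
end

puts "Stra\u00DFe".downcase(:fold) # strasse
puts folded?("strasse")            # true
puts folded?("Stra\u00DFe")        # false
```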
#general_category ⇒ Symbol
Returns the general category of the characters of the receiver.
The general category identifies what kind of symbol the character is.
<table>
<thead>
<tr>
<th>Category (major, minor)</th>
<th>Unicode Value</th>
<th>Ruby Value</th>
</tr>
</thead>
<tbody>
<tr><td>Other, control</td><td>Cc</td><td>:other_control</td></tr>
<tr><td>Other, format</td><td>Cf</td><td>:other_format</td></tr>
<tr><td>Other, not assigned</td><td>Cn</td><td>:other_not_assigned</td></tr>
<tr><td>Other, private use</td><td>Co</td><td>:other_private_use</td></tr>
<tr><td>Other, surrogate</td><td>Cs</td><td>:other_surrogate</td></tr>
<tr><td>Letter, lowercase</td><td>Ll</td><td>:letter_lowercase</td></tr>
<tr><td>Letter, modifier</td><td>Lm</td><td>:letter_modifier</td></tr>
<tr><td>Letter, other</td><td>Lo</td><td>:letter_other</td></tr>
<tr><td>Letter, titlecase</td><td>Lt</td><td>:letter_titlecase</td></tr>
<tr><td>Letter, uppercase</td><td>Lu</td><td>:letter_uppercase</td></tr>
<tr><td>Mark, spacing combining</td><td>Mc</td><td>:mark_spacing_combining</td></tr>
<tr><td>Mark, enclosing</td><td>Me</td><td>:mark_enclosing</td></tr>
<tr><td>Mark, nonspacing</td><td>Mn</td><td>:mark_non_spacing</td></tr>
<tr><td>Number, decimal digit</td><td>Nd</td><td>:number_decimal</td></tr>
<tr><td>Number, letter</td><td>Nl</td><td>:number_letter</td></tr>
<tr><td>Number, other</td><td>No</td><td>:number_other</td></tr>
<tr><td>Punctuation, connector</td><td>Pc</td><td>:punctuation_connector</td></tr>
<tr><td>Punctuation, dash</td><td>Pd</td><td>:punctuation_dash</td></tr>
<tr><td>Punctuation, close</td><td>Pe</td><td>:punctuation_close</td></tr>
<tr><td>Punctuation, final quote</td><td>Pf</td><td>:punctuation_final_quote</td></tr>
<tr><td>Punctuation, initial quote</td><td>Pi</td><td>:punctuation_initial_quote</td></tr>
<tr><td>Punctuation, other</td><td>Po</td><td>:punctuation_other</td></tr>
<tr><td>Punctuation, open</td><td>Ps</td><td>:punctuation_open</td></tr>
<tr><td>Symbol, currency</td><td>Sc</td><td>:symbol_currency</td></tr>
<tr><td>Symbol, modifier</td><td>Sk</td><td>:symbol_modifier</td></tr>
<tr><td>Symbol, math</td><td>Sm</td><td>:symbol_math</td></tr>
<tr><td>Symbol, other</td><td>So</td><td>:symbol_other</td></tr>
<tr><td>Separator, line</td><td>Zl</td><td>:separator_line</td></tr>
<tr><td>Separator, paragraph</td><td>Zp</td><td>:separator_paragraph</td></tr>
<tr><td>Separator, space</td><td>Zs</td><td>:separator_space</td></tr>
</tbody>
</table>
# File 'ext/u/rb_u_string_general_category.c', line 103
VALUE
rb_u_string_general_category(VALUE self)
{
return _rb_u_string_property(self, "general category", U_GENERAL_CATEGORY_OTHER_NOT_ASSIGNED,
(int (*)(uint32_t))u_char_general_category,
(VALUE (*)(int))category_to_symbol);
}
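The mapping from Unicode category abbreviations to Ruby symbols can be illustrated with Ruby's regular-expression properties. This is a sketch for a single character covering only a few rows of the table; `CATEGORY_SYMBOLS` and `general_category_of` are hypothetical names, and the lookup is not how the extension computes the property.

```ruby
# A few rows of the table above, keyed by the Unicode abbreviation.
CATEGORY_SYMBOLS = {
  'Lu' => :letter_uppercase,
  'Ll' => :letter_lowercase,
  'Nd' => :number_decimal,
  'Zs' => :separator_space,
  'Sc' => :symbol_currency,
  'Sm' => :symbol_math,
  'Po' => :punctuation_other
}.freeze

# Hypothetical helper: returns the symbol for a one-character string,
# or nil when the category is not among the rows listed above.
def general_category_of(char)
  CATEGORY_SYMBOLS.each do |abbreviation, symbol|
    return symbol if char.match?(Regexp.new("\\A\\p{#{abbreviation}}\\z"))
  end
  nil
end

puts general_category_of("A") # letter_uppercase
puts general_category_of("$") # symbol_currency
```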
#getbyte(index) ⇒ Fixnum?
# File 'ext/u/rb_u_string_getbyte.c', line 8
VALUE
rb_u_string_getbyte(VALUE self, VALUE rbindex)
{
const struct rb_u_string *string = RVAL2USTRING(self);
long index = NUM2LONG(rbindex);
if (index < 0)
index += USTRING_LENGTH(string);
if (index < 0 || USTRING_LENGTH(string) <= index)
return Qnil;
return INT2FIX((unsigned char)USTRING_STR(string)[index]);
}
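The index handling in the listing above (negative indexes count from the end, out-of-range indexes give nil) can be restated in Ruby. `byte_at` is a hypothetical name; the built-in String#getbyte follows the same rules and is used here as the reference.

```ruby
# Hypothetical helper restating the index normalization in the C listing.
def byte_at(string, index)
  index += string.bytesize if index.negative?
  return nil if index.negative? || index >= string.bytesize
  string.bytes[index]
end

puts byte_at("abc", 0)          # 97
puts byte_at("abc", -1)         # 99
puts byte_at("abc", 3).inspect  # nil
```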
#graph? ⇒ Boolean
# File 'ext/u/rb_u_string_graph.c', line 17
VALUE
rb_u_string_graph(VALUE self)
{
return _rb_u_character_test(self, u_char_isgraph);
}
#grapheme_break ⇒ Symbol
Returns the grapheme break property value of the characters of the receiver.
The possible break values are
- :control
- :cr
- :extend
- :l
- :lf
- :lv
- :lvt
- :other
- :prepend
- :regional_indicator
- :spacingmark
- :t
- :v
# File 'ext/u/rb_u_string_grapheme_break.c', line 55
VALUE
rb_u_string_grapheme_break(VALUE self)
{
return _rb_u_string_property(self, "grapheme break", U_GRAPHEME_BREAK_OTHER,
(int (*)(uint32_t))u_char_grapheme_break,
(VALUE (*)(int))break_to_symbol);
}
#gsub(pattern, replacement) ⇒ U::String
#gsub(pattern, replacements) ⇒ U::String
#gsub(pattern) {|match| ... } ⇒ U::String
#gsub(pattern) ⇒ Enumerator
# File 'ext/u/rb_u_string_gsub.c', line 75
VALUE
rb_u_string_gsub(int argc, VALUE *argv, VALUE self)
{
VALUE pattern, replacement;
VALUE replacements = Qnil;
bool use_block = false;
bool tainted = false;
if (argc == 1) {
RETURN_ENUMERATOR(self, argc, argv);
use_block = true;
}
if (rb_scan_args(argc, argv, "11", &pattern, &replacement) == 2) {
replacements = rb_check_convert_type(replacement, T_HASH,
"Hash", "to_hash");
if (NIL_P(replacements))
StringValue(replacement);
if (OBJ_TAINTED(replacement))
tainted = true;
}
pattern = rb_u_pattern_argument(pattern, true);
VALUE str = rb_str_to_str(self);
long begin = rb_reg_search(pattern, str, 0, 0);
if (begin < 0)
return self;
const char *base = RSTRING_PTR(str);
const char *p = base;
const char *end = RSTRING_END(str);
VALUE substituted = rb_u_str_buf_new(RSTRING_LEN(str) + 30);
do {
VALUE match = rb_backref_get();
struct re_registers *registers = RMATCH_REGS(match);
VALUE result;
if (use_block || !NIL_P(replacements)) {
if (use_block) {
VALUE ustr = rb_u_string_new_rb(rb_reg_nth_match(0, match));
result = rb_u_string_object_as_string(rb_yield(ustr));
} else {
VALUE ustr = rb_u_string_new_c(self,
base + registers->beg[0],
registers->end[0] - registers->beg[0]);
result = rb_u_string_object_as_string(rb_hash_aref(replacements, ustr));
}
if (result == substituted)
rb_u_raise(rb_eRuntimeError,
"result of block is string being built; please try not to cheat");
} else
result =
#ifdef HAVE_RB_REG_REGSUB4
rb_reg_regsub(replacement, str, registers, pattern);
#else
rb_reg_regsub(replacement, str, registers);
#endif
if (OBJ_TAINTED(result))
tainted = true;
const struct rb_u_string *value = RVAL2USTRING_ANY(result);
rb_str_buf_cat(substituted, p, registers->beg[0] - (p - base));
rb_str_buf_cat(substituted, USTRING_STR(value), USTRING_LENGTH(value));
OBJ_INFECT(substituted, result);
p = base + registers->end[0];
if (registers->beg[0] == registers->end[0])
p = u_next(p);
if (p >= end)
break;
begin = rb_reg_search(pattern, str, registers->end[0], 0);
} while (begin >= 0);
if (p < end)
rb_str_buf_cat(substituted, p, end - p);
rb_reg_search(pattern, str, end - p, 0);
RBASIC(substituted)->klass = rb_obj_class(str);
OBJ_INFECT(substituted, str);
if (tainted)
OBJ_TAINT(substituted);
return rb_u_string_new_rb(substituted);
}
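The four call forms in the signature correspond to a replacement string, a replacement Hash, a block, and no replacement at all. A minimal sketch using the built-in String#gsub as a stand-in; that U::String returns U::String results in the first three forms is taken from the signature, the rest is assumed to match.

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "hello"

puts s.gsub(/l/, "L")                            # heLLo
puts s.gsub(/[el]/, { "e" => "3", "l" => "1" })  # h311o
puts s.gsub(/l/) { |match| match.upcase }        # heLLo
puts s.gsub(/l/).class                           # Enumerator
puts s.gsub(/z/, "Z")                            # hello (no match)
```

In the C listing, the no-match case returns the receiver itself, which is safe because a U::String cannot be modified.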
#hash ⇒ Fixnum
Returns the hash value of the receiver’s content.
# File 'ext/u/rb_u_string_hash.c', line 4
VALUE
rb_u_string_hash(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
return INT2FIX(rb_memhash(USTRING_STR(string), USTRING_LENGTH(string)));
}
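Because #hash depends only on content and #eql? compares content, equal strings act as the same Hash key. A minimal sketch with built-in Strings as a stand-in; the assumption is that U::String keys behave the same way.

```ruby
# Built-in String used as a stand-in for U::String (assumption).
a = "caf\u00E9"
b = "caf\u00E9"

puts a.eql?(b)         # true: same content
puts a.hash == b.hash  # true: the hash follows the content
puts a.eql?(:cafe)     # false: other classes never compare eql

counts = Hash.new(0)
counts[a] += 1
counts[b] += 1
puts counts.size       # 1: both strings land on the same key
```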
#hex ⇒ Integer
Returns the result of #to_i(16).
# File 'ext/u/rb_u_string_hex.c', line 5
VALUE
rb_u_string_hex(VALUE self)
{
return rb_u_string_to_inum(self, 16, false);
}
#include?(substring) ⇒ Boolean
# File 'ext/u/rb_u_string_include.c', line 6
VALUE
rb_u_string_include(VALUE self, VALUE substring)
{
return rb_u_string_index(self, substring, 0) != -1 ? Qtrue : Qfalse;
}
#index(pattern, offset = 0) ⇒ Integer?
# File 'ext/u/rb_u_string_index.c', line 70
VALUE
rb_u_string_index_m(int argc, VALUE *argv, VALUE self)
{
VALUE sub, rboffset;
long offset = 0;
if (rb_scan_args(argc, argv, "11", &sub, &rboffset) == 2)
offset = NUM2LONG(rboffset);
const struct rb_u_string *string = RVAL2USTRING(self);
const char *begin = rb_u_string_begin_from_offset(string, offset);
if (begin == NULL) {
if (TYPE(sub) == T_REGEXP)
rb_backref_set(Qnil);
return Qnil;
}
switch (TYPE(sub)) {
case T_REGEXP:
offset = rb_u_string_index_regexp(self, begin, sub, false);
break;
default: {
VALUE tmp = rb_check_string_type(sub);
if (NIL_P(tmp))
rb_u_raise(rb_eTypeError, "type mismatch: %s given",
rb_obj_classname(sub));
sub = tmp;
}
/* fall through */
case T_STRING:
offset = rb_u_string_index(self, sub, offset);
break;
}
if (offset < 0)
return Qnil;
return LONG2NUM(offset);
}
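Offsets passed to and returned from #index count characters, not bytes, and a negative offset counts from the end. A minimal sketch using the built-in String#index as a stand-in, assuming U::String follows the same conventions:

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "h\u00E9llo" # the second character takes two bytes in UTF-8

puts s.index("l")              # 2 (a character offset)
puts s.index("l", 3)           # 3
puts s.index(/l+/)             # 2
puts s.index("l", -1).inspect  # nil: only "o" is left to search
puts s.index("z").inspect      # nil
puts s.index("l", 10).inspect  # nil: offset is past the end
```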
#inspect ⇒ String
Returns the receiver in a reader-friendly inspectable format, inheriting any taint and untrust, encoded using UTF-8.
The reader-friendly inspectable format looks like “"…".u”. Inside the “…”, any #print? characters are output as-is, while the following special characters are escaped according to this table:
<table>
<thead><tr><th>Character</th><th>Dumped Sequence</th></tr></thead>
<tbody>
<tr><td>U+0022 QUOTATION MARK</td><td><code>\"</code></td></tr>
<tr><td>U+005C REVERSE SOLIDUS</td><td><code>\\</code></td></tr>
<tr><td>U+000A LINE FEED (LF)</td><td><code>\n</code></td></tr>
<tr><td>U+000D CARRIAGE RETURN (CR)</td><td><code>\r</code></td></tr>
<tr><td>U+0009 CHARACTER TABULATION</td><td><code>\t</code></td></tr>
<tr><td>U+000C FORM FEED (FF)</td><td><code>\f</code></td></tr>
<tr><td>U+000B LINE TABULATION</td><td><code>\v</code></td></tr>
<tr><td>U+0008 BACKSPACE</td><td><code>\b</code></td></tr>
<tr><td>U+0007 BELL</td><td><code>\a</code></td></tr>
<tr><td>U+001B ESCAPE</td><td><code>\e</code></td></tr>
</tbody>
</table>
The following special sequences are also escaped:
<table>
<thead><tr><th>Character</th><th>Dumped Sequence</th></tr></thead>
<tbody>
<tr><td><code>#$</code></td><td><code>\#$</code></td></tr>
<tr><td><code>#@</code></td><td><code>\#@</code></td></tr>
<tr><td><code>#{</code></td><td><code>\#{</code></td></tr>
</tbody>
</table>
Valid UTF-8 byte sequences representing code points < 0x10000 are output as \un, where n is the four-digit uppercase hexadecimal representation of the code point.
Valid UTF-8 byte sequences representing code points ≥ 0x10000 are output as \u{n}, where n is the uppercase hexadecimal representation of the code point.
Any other byte is output as \xn, where n is the two-digit uppercase hexadecimal representation of the byte’s value.
# File 'ext/u/rb_u_string_inspect.c', line 127
VALUE
rb_u_string_inspect(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
VALUE result = rb_u_str_buf_new(0);
rb_str_buf_cat2(result, "\"");
const char *p = USTRING_STR(string);
const char *end = USTRING_END(string);
while (p < end) {
const char *q;
uint32_t c = u_decode(&q, p, end);
switch (c) {
case '"':
case '\\':
rb_u_string_inspect_special_char(c, result);
break;
case '#':
p = rb_u_string_inspect_hash_char(q, end, result);
continue;
case '\n':
rb_str_buf_cat2(result, "\\n");
break;
case '\r':
rb_str_buf_cat2(result, "\\r");
break;
case '\t':
rb_str_buf_cat2(result, "\\t");
break;
case '\f':
rb_str_buf_cat2(result, "\\f");
break;
case '\013':
rb_str_buf_cat2(result, "\\v");
break;
case '\010':
rb_str_buf_cat2(result, "\\b");
break;
case '\007':
rb_str_buf_cat2(result, "\\a");
break;
case '\033':
rb_str_buf_cat2(result, "\\e");
break;
case REPLACEMENT_CHARACTER:
if (!u_valid(p, q - p, NULL)) {
rb_u_string_inspect_bad_input(p, q, result);
break;
}
/* fall through */
default:
rb_u_string_inspect_default(c, result);
break;
}
p = q;
}
rb_str_buf_cat2(result, "\".u");
OBJ_INFECT(result, self);
return result;
}
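The three numeric escape forms described above differ in width and bracketing. They can be sketched as two small formatters; `escape_code_point` and `escape_byte` are hypothetical names, not functions of the extension.

```ruby
# Hypothetical helpers reproducing the documented escape formats.
def escape_code_point(code_point)
  if code_point < 0x10000
    format('\u%04X', code_point)  # four uppercase hex digits
  else
    format('\u{%X}', code_point)  # braces, no padding
  end
end

def escape_byte(byte)
  format('\x%02X', byte)          # two uppercase hex digits
end

puts escape_code_point(0x200B)   # \u200B
puts escape_code_point(0x1F600)  # \u{1F600}
puts escape_byte(0xFF)           # \xFF
```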
#length ⇒ Integer
Also known as: size
Returns the number of characters in the receiver.
# File 'ext/u/rb_u_string_length.c', line 4
VALUE
rb_u_string_length(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
return UINT2NUM(u_n_chars_n(USTRING_STR(string), USTRING_LENGTH(string)));
}
#line_break ⇒ Symbol
Returns the line break property value of the characters of the receiver.
The possible break values are
- :after
- :alphabetic
- :ambiguous
- :before
- :before_and_after
- :carriage_return
- :close_parenthesis
- :close_punctuation
- :combining_mark
- :complex_context
- :conditional_japanese_starter
- :contingent
- :exclamation
- :hangul_l_jamo
- :hangul_lv_syllable
- :hangul_lvt_syllable
- :hangul_t_jamo
- :hangul_v_jamo
- :hebrew_letter
- :hyphen
- :ideographic
- :infix_separator
- :inseparable
- :line_feed
- :mandatory
- :next_line
- :non_breaking_glue
- :non_starter
- :numeric
- :open_punctuation
- :postfix
- :prefix
- :quotation
- :regional_indicator
- :space
- :surrogate
- :symbol
- :unknown
- :word_joiner
- :zero_width_space
# File 'ext/u/rb_u_string_line_break.c', line 109
VALUE
rb_u_string_line_break(VALUE self)
{
return _rb_u_string_property(self, "line break", U_LINE_BREAK_UNKNOWN,
(int (*)(uint32_t))u_char_line_break,
(VALUE (*)(int))break_to_symbol);
}
#lines(separator = $/) ⇒ Array<U::String>
# File 'ext/u/rb_u_string_each_line.c', line 136
VALUE
rb_u_string_lines(int argc, VALUE *argv, VALUE self)
{
struct yield_array y = YIELD_ARRAY_INIT;
each(argc, argv, self, &y.yield);
return y.array;
}
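The listing shows that #lines runs the same iteration as #each_line and collects the pieces into an Array. A minimal sketch using the built-in String#lines as a stand-in, assuming U::String keeps the separator at the end of each piece in the same way:

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p "one\ntwo\nthree".lines  # ["one\n", "two\n", "three"]
p "a,b,c".lines(",")       # ["a,", "b,", "c"]
p "".lines                 # []
```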
#ljust(width, padding = ' ') ⇒ U::String
# File 'ext/u/rb_u_string_justify.c', line 148
VALUE
rb_u_string_ljust(int argc, VALUE *argv, VALUE self)
{
return rb_u_string_justify(argc, argv, self, 'l');
}
#lower?(locale = ENV['LC_CTYPE']) ⇒ Boolean
# File 'ext/u/rb_u_string_lower.c', line 9
VALUE
rb_u_string_lower(int argc, VALUE *argv, VALUE self)
{
return _rb_u_string_test_locale(argc, argv, self, u_downcase);
}
#lstrip ⇒ U::String
Returns the receiver with its maximum #space? prefix removed, inheriting any taint and untrust.
# File 'ext/u/rb_u_string_lstrip.c', line 7
VALUE
rb_u_string_lstrip(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
const char *begin = USTRING_STR(string);
if (begin == NULL)
return self;
const char *p = begin, *end = USTRING_END(string);
for (const char *q; p < end; p = q)
if (!u_char_isspace(u_decode(&q, p, end)))
break;
if (p == begin)
return self;
return rb_u_string_new_c(self, p, end - p);
}
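Stripping by #space? differs from the built-in String#lstrip, which only removes ASCII whitespace and NUL. The sketch below approximates the Unicode-aware behavior with the POSIX bracket class `[[:space:]]`; `unicode_lstrip` is a hypothetical name, and treating `[[:space:]]` as equivalent to #space? is an assumption.

```ruby
# Hypothetical helper: strips any leading Unicode whitespace.
def unicode_lstrip(string)
  string.sub(/\A[[:space:]]+/, "")
end

padded = "\u00A0\u2003 abc " # NO-BREAK SPACE, EM SPACE, SPACE, "abc", SPACE

p unicode_lstrip(padded)      # "abc "
p padded.lstrip == padded     # true: the built-in leaves U+00A0 in place
```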
#match(pattern, index = 0) ⇒ MatchData?
#match(pattern, index = 0) {|matchdata| ... } ⇒ Object?
# File 'ext/u/rb_u_string_match.c', line 52
VALUE
rb_u_string_match_m(int argc, VALUE *argv, VALUE self)
{
VALUE re;
if (argc < 1)
need_m_to_n_arguments(argc, 1, 2);
re = argv[0];
argv[0] = self;
VALUE result = rb_funcall2(rb_u_pattern_argument(re, false),
rb_intern("match"), argc, argv);
if (!NIL_P(result) && rb_block_given_p())
return rb_yield(result);
return result;
}
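With a block, #match yields the MatchData and returns the block's result. Without a match it returns nil and the block is not called. A minimal sketch using the built-in String#match as a stand-in, assuming U::String behaves the same way:

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "hello"

m = s.match(/l+/)
puts m[0]                                 # ll
puts m.begin(0)                           # 2
puts s.match(/l/, 3).begin(0)             # 3: search starts at index 3
puts s.match(/l+/) { |md| md[0].upcase }  # LL
p s.match(/z/) { |md| md[0] }             # nil
```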
#mirror ⇒ U::String
Returns the mirroring of the receiver, inheriting any taint and untrust.
Mirroring is done by replacing characters in the string with their horizontal mirror image, if any, in text that is laid out from right to left. For example, ‘(’ becomes ‘)’ and ‘)’ becomes ‘(’.
# File 'ext/u/rb_u_string_mirror.c', line 12
VALUE
rb_u_string_mirror(VALUE self)
{
return _rb_u_string_convert(self, u_mirror);
}
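Mirroring is a per-character substitution. The sketch below covers only the ASCII bracket pairs, a small subset of the mirrored characters Unicode defines. `MIRROR_PAIRS` and `mirror_ascii` are hypothetical names.

```ruby
# A small, ASCII-only subset of mirrored pairs (assumption: U::String
# uses the full Unicode mirroring data instead).
MIRROR_PAIRS = {
  "(" => ")", ")" => "(",
  "[" => "]", "]" => "[",
  "<" => ">", ">" => "<",
  "{" => "}", "}" => "{"
}.freeze

# Hypothetical helper: replaces each bracket with its mirror image.
def mirror_ascii(string)
  string.gsub(/[()\[\]<>{}]/, MIRROR_PAIRS)
end

puts mirror_ascii("(a[b]<c>)") # )a]b[>c<(
```

Only the characters are swapped; their order in the string stays the same.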
#newline? ⇒ Boolean
# File 'ext/u/rb_u_string_newline.c', line 17
VALUE
rb_u_string_newline(VALUE self)
{
return _rb_u_character_test(self, u_char_isnewline);
}
#normalize(form = :nfd) ⇒ U::String
# File 'ext/u/rb_u_string_normalize.c', line 48
VALUE
rb_u_string_normalize(int argc, VALUE *argv, VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
VALUE rbform;
enum u_normalization_form form = U_NORMALIZATION_FORM_D;
if (rb_scan_args(argc, argv, "01", &rbform) == 1)
form = _rb_u_symbol_to_normalization_form(rbform);
size_t n = u_normalize(NULL, 0,
USTRING_STR(string), USTRING_LENGTH(string),
form);
char *normalized = ALLOC_N(char, n + 1);
n = u_normalize(normalized, n + 1,
USTRING_STR(string), USTRING_LENGTH(string),
form);
char *t = REALLOC_N(normalized, char, n + 1);
if (t != NULL)
normalized = t;
return rb_u_string_new_c_own(self, normalized, n);
}
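Normalization form D splits precomposed characters into a base character followed by combining marks. A minimal sketch using the built-in String#unicode_normalize (Ruby 2.2 or later) as a stand-in for #normalize, assuming both produce the same result for this input:

```ruby
# Built-in String used as a stand-in for U::String (assumption).
composed = "\u00E9" # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = composed.unicode_normalize(:nfd)

p decomposed.codepoints                           # [101, 769]
p decomposed.unicode_normalize(:nfc) == composed  # true
p composed.unicode_normalized?(:nfd)              # false
p decomposed.unicode_normalized?(:nfd)            # true
```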
#normalized?(mode = :default) ⇒ Boolean
# File 'ext/u/rb_u_string_normalized.c', line 15
VALUE
rb_u_string_normalized(int argc, VALUE *argv, VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
VALUE rbform;
enum u_normalization_form form = U_NORMALIZATION_FORM_D;
if (rb_scan_args(argc, argv, "01", &rbform) == 1)
form = _rb_u_symbol_to_normalization_form(rbform);
return u_normalized(USTRING_STR(string),
USTRING_LENGTH(string),
form) == U_NORMALIZED_YES ? Qtrue : Qfalse;
}
#oct ⇒ Integer
# File 'ext/u/rb_u_string_oct.c', line 7
VALUE
rb_u_string_oct(VALUE self)
{
return rb_u_string_to_inum(self, -8, false);
}
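The listing passes -8 rather than 8 as the base. In Ruby's own String#oct, a negative base means that octal is only the default and a radix prefix in the text takes precedence. That U::String's conversion treats the negative base the same way is an assumption. A sketch with the built-in methods:

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p "17".oct        # 15
p "-17".oct       # -15
p "bad".oct       # 0: no leading digits
p "0x1f".oct      # the "0x" prefix switches the conversion to hexadecimal
p "1f".hex        # 31
```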
#ord ⇒ Integer
Returns the code point of the first character of the receiver.
# File 'ext/u/rb_u_string_ord.c', line 4
VALUE
rb_u_string_ord(VALUE self)
{
const struct rb_u_string *s = RVAL2USTRING(self);
const char *p = USTRING_STR(s);
const char *end = USTRING_END(s);
if (p == end)
rb_u_raise(rb_eArgError, "empty string");
const char *q;
return UINT2NUM(u_decode(&q, p, end));
}
#partition(separator) ⇒ Array<U::String>
# File 'ext/u/rb_u_string_partition.c', line 73
VALUE
rb_u_string_partition(VALUE self, VALUE separator)
{
if (TYPE(separator) == T_REGEXP)
return rb_u_string_partition_regex(self, separator);
return rb_u_string_partition_string(self, separator);
}
#print? ⇒ Boolean
# File 'ext/u/rb_u_string_print.c', line 6
VALUE
rb_u_string_print(VALUE self)
{
return _rb_u_character_test(self, u_char_isprint);
}
#punct? ⇒ Boolean
# File 'ext/u/rb_u_string_punct.c', line 6
VALUE
rb_u_string_punct(VALUE self)
{
return _rb_u_character_test(self, u_char_ispunct);
}
#recode(codeset) ⇒ Object
# File 'ext/u/rb_u_string.c', line 205
static VALUE
rb_u_string_recode(VALUE self, VALUE codeset)
{
const struct rb_u_string *string = RVAL2USTRING(self);
const char *cs = StringValuePtr(codeset);
errno = 0;
size_t n = u_recode(NULL, 0, USTRING_STR(string), USTRING_LENGTH(string), cs);
if (errno != 0)
rb_u_raise_errno(errno, "can’t recode");
char *recoded = ALLOC_N(char, n + 1);
u_recode(recoded, n + 1, USTRING_STR(string), USTRING_LENGTH(string), cs);
return rb_str_new(recoded, n);
}
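Recoding converts the UTF-8 content into another character set and, per the listing, hands back a plain String. A rough equivalent using the built-in String#encode; that #recode accepts the same codeset names is an assumption.

```ruby
# Built-in String#encode used as a stand-in for #recode (assumption).
utf8 = "caf\u00E9"
latin1 = utf8.encode("ISO-8859-1")

p utf8.bytes                       # [99, 97, 102, 195, 169]
p latin1.bytes                     # [99, 97, 102, 233]
p latin1.encode("UTF-8") == utf8   # true

# Characters without a counterpart in the target set raise an error.
begin
  "\u4E2D".encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  puts e.class
end
```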
#reverse ⇒ U::String
This doesn’t take into account proper handling of combining marks, direction indicators, and similarly relevant characters, so this method is mostly useful when you know that the contents of the string are simple and the result isn’t intended for display.
Returns the reversal of the receiver, inheriting any taint and untrust from the receiver.
# File 'ext/u/rb_u_string_reverse.c', line 9
VALUE
rb_u_string_reverse(VALUE self)
{
return _rb_u_string_convert(self, u_reverse);
}
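The caveat about combining marks is easy to see with a decomposed accent. The sketch uses the built-in String#reverse, which also reverses code point by code point, as a stand-in. The cluster-aware variant is an illustration, not something the listing above does.

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "e\u0301x" # "e", COMBINING ACUTE ACCENT, "x"

# Code point reversal moves the accent so that it follows "x".
p s.reverse == "x\u0301e"                                  # true

# Reversing whole grapheme clusters keeps the accent on "e".
p s.each_grapheme_cluster.to_a.reverse.join == "xe\u0301"  # true
```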
#rindex(pattern, offset = -1) ⇒ Integer?
# File 'ext/u/rb_u_string_rindex.c', line 47
VALUE
rb_u_string_rindex_m(int argc, VALUE *argv, VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
VALUE sub, rboffset;
long offset;
if (rb_scan_args(argc, argv, "11", &sub, &rboffset) == 2)
offset = NUM2LONG(rboffset);
else
/* TODO: Why not simply use -1? Benchmark which is faster. */
offset = u_n_chars_n(USTRING_STR(string), USTRING_LENGTH(string));
const char *begin = rb_u_string_begin_from_offset(string, offset);
const char *end = USTRING_END(string);
if (begin == NULL) {
if (offset <= 0) {
if (TYPE(sub) == T_REGEXP)
rb_backref_set(Qnil);
return Qnil;
}
begin = end;
/* TODO: this converting back and forward can be optimized away
* if rb_u_string_index_regexp() and rb_u_string_rindex() were split up
* into two additional functions, adding
* rb_u_string_index_regexp_pointer() and rb_u_string_rindex_pointer(),
* so that one can pass a pointer to start at immediately
* instead of an offset that gets calculated into a pointer. */
offset = u_n_chars_n(USTRING_STR(string), USTRING_LENGTH(string));
}
switch (TYPE(sub)) {
case T_REGEXP:
/* TODO: What’s this first test for, exactly? */
if (RREGEXP(sub)->ptr == NULL || RREGEXP_SRC_LEN(sub) > 0)
offset = rb_u_string_index_regexp(self, begin, sub, true);
break;
default: {
VALUE tmp = rb_check_string_type(sub);
if (NIL_P(tmp))
rb_u_raise(rb_eTypeError, "type mismatch: %s given",
rb_obj_classname(sub));
sub = tmp;
}
/* fall through */
case T_STRING:
offset = rb_u_string_rindex(self, sub, offset);
break;
}
if (offset < 0)
return Qnil;
return LONG2NUM(offset);
}
#rjust(width, padding = ' ') ⇒ U::String
# File 'ext/u/rb_u_string_justify.c', line 165
VALUE
rb_u_string_rjust(int argc, VALUE *argv, VALUE self)
{
return rb_u_string_justify(argc, argv, self, 'r');
}
#rpartition(separator) ⇒ Array<U::String>
# File 'ext/u/rb_u_string_rpartition.c', line 74
VALUE
rb_u_string_rpartition(VALUE self, VALUE separator)
{
if (TYPE(separator) == T_REGEXP)
return rb_u_string_rpartition_regex(self, separator);
return rb_u_string_rpartition_string(self, separator);
}
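#partition splits around the first occurrence of the separator and #rpartition around the last. A minimal sketch using the built-in methods as a stand-in, assuming U::String returns the same three parts (as U::String objects):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p "key=value=more".partition("=")    # ["key", "=", "value=more"]
p "key=value=more".rpartition("=")   # ["key=value", "=", "more"]
p "key = value".partition(/\s*=\s*/) # ["key", " = ", "value"]

# Without a match, the whole string ends up at one end.
p "novalue".partition("=")           # ["novalue", "", ""]
p "novalue".rpartition("=")          # ["", "", "novalue"]
```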
#rstrip ⇒ U::String
Returns the receiver with its maximum #space? suffix removed, inheriting any taint and untrust from the receiver.
# File 'ext/u/rb_u_string_rstrip.c', line 7
VALUE
rb_u_string_rstrip(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
const char *begin = USTRING_STR(string);
if (begin == NULL)
return self;
const char *end = USTRING_END(string);
const char *q = end;
while (begin < q) {
const char *p;
uint32_t c = u_decode_r(&p, begin, q);
if (c != '\0' && !u_char_isspace(c))
break;
q = p;
}
if (q == end)
return self;
return rb_u_string_new_c(self, begin, q - begin);
}
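The loop in the listing walks backwards and stops at the first character that is neither whitespace nor NUL, so trailing NUL characters are removed as well. A regular-expression sketch of the same idea; `unicode_rstrip` is a hypothetical name, and treating `[[:space:]]` as equivalent to #space? is an assumption.

```ruby
# Hypothetical helper: strips trailing Unicode whitespace and NUL characters.
def unicode_rstrip(string)
  string.sub(/[[:space:]\x00]+\z/, "")
end

p unicode_rstrip("abc\u00A0\u0000 ") # "abc"
p unicode_rstrip(" abc")             # " abc": leading space is untouched
```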
#scan(pattern) ⇒ Array<U::String>+
#scan(pattern) ⇒ Array<U::String>
#scan(pattern) {|submatches| ... } ⇒ self
#scan(pattern) {|match| ... } ⇒ self
# File 'ext/u/rb_u_string_scan.c', line 98
VALUE
rb_u_string_scan(VALUE self, VALUE pattern)
{
pattern = rb_u_pattern_argument(pattern, true);
VALUE string = rb_str_to_str(self);
if (rb_block_given_p())
return rb_u_string_scan_block(self, string, pattern);
return rb_u_string_scan_array(string, pattern);
}
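The result shape of #scan depends on whether the pattern has groups and whether a block is given. A minimal sketch using the built-in String#scan as a stand-in, assuming U::String differs only in returning U::String elements:

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "a1 b22 c333"

p s.scan(/\d+/)           # ["1", "22", "333"]
p s.scan(/([a-z])(\d+)/)  # [["a", "1"], ["b", "22"], ["c", "333"]]

# With a block, each match is yielded and the receiver is returned.
lengths = []
result = s.scan(/\d+/) { |match| lengths << match.length }
p lengths                 # [1, 2, 3]
p result.equal?(s)        # true
```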
#script ⇒ Symbol
Returns the script of the characters of the receiver.
The script of a character identifies the primary writing system that uses the character.
<table>
<thead><tr><th>Script</th><th>Description</th></tr></thead>
<tbody>
<tr><td>:arabic</td><td>Arabic</td></tr>
<tr><td>:armenian</td><td>Armenian</td></tr>
<tr><td>:avestan</td><td>Avestan</td></tr>
<tr><td>:balinese</td><td>Balinese</td></tr>
<tr><td>:bamum</td><td>Bamum</td></tr>
<tr><td>:batak</td><td>Batak</td></tr>
<tr><td>:bengali</td><td>Bengali</td></tr>
<tr><td>:bopomofo</td><td>Bopomofo</td></tr>
<tr><td>:brahmi</td><td>Brahmi</td></tr>
<tr><td>:braille</td><td>Braille</td></tr>
<tr><td>:buginese</td><td>Buginese</td></tr>
<tr><td>:buhid</td><td>Buhid</td></tr>
<tr><td>:canadian_aboriginal</td><td>Canadian Aboriginal</td></tr>
<tr><td>:carian</td><td>Carian</td></tr>
<tr><td>:chakma</td><td>Chakma</td></tr>
<tr><td>:cham</td><td>Cham</td></tr>
<tr><td>:cherokee</td><td>Cherokee</td></tr>
<tr><td>:common</td><td>For other characters that may be used with multiple scripts</td></tr>
<tr><td>:coptic</td><td>Coptic</td></tr>
<tr><td>:cuneiform</td><td>Cuneiform</td></tr>
<tr><td>:cypriot</td><td>Cypriot</td></tr>
<tr><td>:cyrillic</td><td>Cyrillic</td></tr>
<tr><td>:deseret</td><td>Deseret</td></tr>
<tr><td>:devanagari</td><td>Devanagari</td></tr>
<tr><td>:egyptian_hieroglyphs</td><td>Egyptian Hieroglyphs</td></tr>
<tr><td>:ethiopic</td><td>Ethiopic</td></tr>
<tr><td>:georgian</td><td>Georgian</td></tr>
<tr><td>:glagolitic</td><td>Glagolitic</td></tr>
<tr><td>:gothic</td><td>Gothic</td></tr>
<tr><td>:greek</td><td>Greek</td></tr>
<tr><td>:gujarati</td><td>Gujarati</td></tr>
<tr><td>:gurmukhi</td><td>Gurmukhi</td></tr>
<tr><td>:han</td><td>Han</td></tr>
<tr><td>:hangul</td><td>Hangul</td></tr>
<tr><td>:hanunoo</td><td>Hanunoo</td></tr>
<tr><td>:hebrew</td><td>Hebrew</td></tr>
<tr><td>:hiragana</td><td>Hiragana</td></tr>
<tr><td>:imperial_aramaic</td><td>Imperial Aramaic</td></tr>
<tr><td>:inherited</td><td>For characters that may be used with multiple
scripts, and that inherit their script from the preceding characters;
these include nonspacing marks, enclosing marks, and the zero-width
joiner/non-joiner characters</td></tr>
<tr><td>:inscriptional_pahlavi</td><td>Inscriptional Pahlavi</td></tr>
<tr><td>:inscriptional_parthian</td><td>Inscriptional Parthian</td></tr>
<tr><td>:javanese</td><td>Javanese</td></tr>
<tr><td>:kaithi</td><td>Kaithi</td></tr>
<tr><td>:kannada</td><td>Kannada</td></tr>
<tr><td>:katakana</td><td>Katakana</td></tr>
<tr><td>:kayah_li</td><td>Kayah Li</td></tr>
<tr><td>:kharoshthi</td><td>Kharoshthi</td></tr>
<tr><td>:khmer</td><td>Khmer</td></tr>
<tr><td>:lao</td><td>Lao</td></tr>
<tr><td>:latin</td><td>Latin</td></tr>
<tr><td>:lepcha</td><td>Lepcha</td></tr>
<tr><td>:limbu</td><td>Limbu</td></tr>
<tr><td>:linear_b</td><td>Linear B</td></tr>
<tr><td>:lisu</td><td>Lisu</td></tr>
<tr><td>:lycian</td><td>Lycian</td></tr>
<tr><td>:lydian</td><td>Lydian</td></tr>
<tr><td>:malayalam</td><td>Malayalam</td></tr>
<tr><td>:mandaic</td><td>Mandaic</td></tr>
<tr><td>:meetei_mayek</td><td>Meetei Mayek</td></tr>
<tr><td>:meroitic_hieroglyphs</td><td>Meroitic Hieroglyphs</td></tr>
<tr><td>:meroitic_cursive</td><td>Meroitic Cursive</td></tr>
<tr><td>:miao</td><td>Miao</td></tr>
<tr><td>:mongolian</td><td>Mongolian</td></tr>
<tr><td>:myanmar</td><td>Myanmar</td></tr>
<tr><td>:new_tai_lue</td><td>New Tai Lue</td></tr>
<tr><td>:nko</td><td>N'Ko</td></tr>
<tr><td>:ogham</td><td>Ogham</td></tr>
<tr><td>:old_italic</td><td>Old Italic</td></tr>
<tr><td>:old_persian</td><td>Old Persian</td></tr>
<tr><td>:old_south_arabian</td><td>Old South Arabian</td></tr>
<tr><td>:old_turkic</td><td>Old Turkic</td></tr>
<tr><td>:ol_chiki</td><td>Ol Chiki</td></tr>
<tr><td>:oriya</td><td>Oriya</td></tr>
<tr><td>:osmanya</td><td>Osmanya</td></tr>
<tr><td>:phags_pa</td><td>Phags-pa</td></tr>
<tr><td>:phoenician</td><td>Phoenician</td></tr>
<tr><td>:rejang</td><td>Rejang</td></tr>
<tr><td>:runic</td><td>Runic</td></tr>
<tr><td>:samaritan</td><td>Samaritan</td></tr>
<tr><td>:saurashtra</td><td>Saurashtra</td></tr>
<tr><td>:sharada</td><td>Sharada</td></tr>
<tr><td>:shavian</td><td>Shavian</td></tr>
<tr><td>:sinhala</td><td>Sinhala</td></tr>
<tr><td>:sora_sompeng</td><td>Sora Sompeng</td></tr>
<tr><td>:sundanese</td><td>Sundanese</td></tr>
<tr><td>:syloti_nagri</td><td>Syloti Nagri</td></tr>
<tr><td>:syriac</td><td>Syriac</td></tr>
<tr><td>:tagalog</td><td>Tagalog</td></tr>
<tr><td>:tagbanwa</td><td>Tagbanwa</td></tr>
<tr><td>:tai_le</td><td>Tai Le</td></tr>
<tr><td>:tai_tham</td><td>Tai Tham</td></tr>
<tr><td>:tai_viet</td><td>Tai Viet</td></tr>
<tr><td>:takri</td><td>Takri</td></tr>
<tr><td>:tamil</td><td>Tamil</td></tr>
<tr><td>:telugu</td><td>Telugu</td></tr>
<tr><td>:thaana</td><td>Thaana</td></tr>
<tr><td>:thai</td><td>Thai</td></tr>
<tr><td>:tibetan</td><td>Tibetan</td></tr>
<tr><td>:tifinagh</td><td>Tifinagh</td></tr>
<tr><td>:ugaritic</td><td>Ugaritic</td></tr>
<tr><td>:unknown</td><td>For unassigned, private-use, non-character, and surrogate code points</td></tr>
<tr><td>:vai</td><td>Vai</td></tr>
<tr><td>:yi</td><td>Yi</td></tr>
</tbody>
</table>
# File 'ext/u/rb_u_string_script.c', line 247
VALUE
rb_u_string_script(VALUE self)
{
return _rb_u_string_property(self, "script", U_SCRIPT_UNKNOWN,
(int (*)(uint32_t))u_char_script,
(VALUE (*)(int))script_to_symbol);
}
#soft_dotted? ⇒ Boolean
# File 'ext/u/rb_u_string_soft_dotted.c', line 9
VALUE
rb_u_string_soft_dotted(VALUE self)
{
return _rb_u_character_test(self, u_char_issoftdotted);
}
#space? ⇒ Boolean
# File 'ext/u/rb_u_string_space.c', line 20
VALUE
rb_u_string_space(VALUE self)
{
return _rb_u_character_test(self, u_char_isspace);
}
#split(pattern = $;, limit = 0) ⇒ Array<U::String>
# File 'ext/u/rb_u_string_split.c', line 200
VALUE
rb_u_string_split_m(int argc, VALUE *argv, VALUE self)
{
VALUE rbpattern, rblimit;
int limit = 0;
bool limit_given;
if (rb_scan_args(argc, argv, "02", &rbpattern, &rblimit) == 2)
limit = NUM2INT(rblimit);
const struct rb_u_string *string = RVAL2USTRING(self);
if (limit == 1) {
if (USTRING_LENGTH(string) == 0)
return rb_ary_new2(0);
return rb_ary_new3(1, self);
}
limit_given = !NIL_P(rblimit) && limit >= 0;
if (NIL_P(rbpattern) && NIL_P(rb_fs))
return rb_u_string_split_awk(self, limit_given, limit);
else if (NIL_P(rbpattern))
rbpattern = rb_fs;
if (TYPE(rbpattern) != T_STRING && !RTEST(rb_obj_is_kind_of(rbpattern, rb_cUString)))
return rb_u_string_split_pattern(self,
rb_u_pattern_argument(rbpattern, true),
limit_given,
limit);
const struct rb_u_string *pattern = RVAL2USTRING_ANY(rbpattern);
const char *p = USTRING_STR(pattern);
long length = USTRING_LENGTH(pattern);
if (length == 0)
return rb_u_string_split_pattern(self,
rb_reg_regcomp(rb_str_to_str(rbpattern)),
limit_given,
limit);
else if (length == 1 && *p == ' ')
return rb_u_string_split_awk(self, limit_given, limit);
else
return rb_u_string_split_string(self, rbpattern, limit_given, limit);
}
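The order in which the listing above picks a splitting strategy can be sketched in Ruby as follows. This is only an illustration: `split_strategy`, its `field_separator` keyword (standing in for `$;`), and the returned symbols are hypothetical names, and the sketch reports which branch would be taken rather than performing the split. It also only recognizes `String` patterns, whereas the C code accepts a U::String as well.

```ruby
# Hypothetical helper mirroring the branch order of rb_u_string_split_m.
# field_separator stands in for Ruby's $; (rb_fs in the C source).
def split_strategy(pattern = nil, limit = 0, field_separator: nil)
  # A limit of 1 never splits anything.
  return :no_split if limit == 1

  if pattern.nil?
    # No pattern and no $; means awk-style whitespace splitting.
    return :awk if field_separator.nil?
    pattern = field_separator
  end

  return :pattern unless pattern.is_a?(String) # e.g. a Regexp
  return :pattern if pattern.empty?            # "" is compiled to a regexp
  return :awk if pattern == ' '                # a single space means awk-style
  :string
end
```

For example, `split_strategy(' ')` reports `:awk`, while `split_strategy(', ')` reports `:string`.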
#squeeze(*sets) ⇒ U::String
# File 'ext/u/rb_u_string_squeeze.c', line 52
VALUE
rb_u_string_squeeze(int argc, VALUE *argv, VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
if (USTRING_LENGTH(string) == 0)
return Qnil;
struct tr_table table;
if (argc > 0)
tr_table_initialize_from_strings(&table, argc, argv);
struct tr_table *table_pointer = (argc > 0) ? &table : NULL;
long count = rb_u_string_squeeze_loop(string, table_pointer, NULL);
if (count == 0)
return self;
char *remaining = ALLOC_N(char, count + 1);
rb_u_string_squeeze_loop(string, table_pointer, remaining);
remaining[count] = '\0';
return rb_u_string_new_c_own(self, remaining, count);
}
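As a rough illustration of what squeezing means, here is a minimal plain-Ruby sketch that collapses runs of identical characters. `squeeze_sketch` and its `only` parameter are hypothetical names; `only` is a simplified stand-in for the `*sets` arguments and does not model set syntax such as ranges or negation. How the C code treats an empty receiver is not modelled either.

```ruby
# Collapse each run of identical characters to a single character.
# When `only` is given, only characters contained in it are squeezed.
def squeeze_sketch(string, only = nil)
  result = +''
  previous = nil
  string.each_char do |char|
    squeezable = only.nil? || only.include?(char)
    # Skip a character only if it repeats the previous one and may be squeezed.
    result << char unless squeezable && char == previous
    previous = char
  end
  result
end
```

The loop works on characters rather than bytes, so a run of multi-byte characters is squeezed as one unit.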
#start_with?(*prefixes) ⇒ Boolean
# File 'ext/u/rb_u_string_start_with.c', line 7
VALUE
rb_u_string_start_with(int argc, VALUE *argv, VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
const char *p = USTRING_STR(string);
long p_length = USTRING_LENGTH(string);
for (int i = 0; i < argc; i++) {
VALUE tmp = rb_u_string_check_type(argv[i]);
if (NIL_P(tmp))
continue;
const struct rb_u_string *other = RVAL2USTRING_ANY(tmp);
const char *q = USTRING_STR(other);
long q_length = USTRING_LENGTH(other);
if (p_length < q_length)
continue;
if (memcmp(p, q, q_length) == 0)
return Qtrue;
}
return Qfalse;
}
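The listing compares prefixes byte by byte with `memcmp`. A minimal plain-Ruby sketch of that idea follows; `start_with_sketch?` is a hypothetical name, and treating every non-`String` argument as "skip" is a simplification of whatever conversions `rb_u_string_check_type` performs.

```ruby
# Byte-wise prefix test: true if any prefix matches the start of string.
def start_with_sketch?(string, *prefixes)
  prefixes.any? do |prefix|
    next false unless prefix.is_a?(String)             # skip non-strings
    next false if string.bytesize < prefix.bytesize    # prefix is too long
    # Compare raw bytes, ignoring encoding tags.
    string.byteslice(0, prefix.bytesize).b == prefix.b
  end
end
```

Because valid UTF-8 is self-synchronizing, a byte-wise match of one well-formed string against the start of another is also a character-wise match.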
#strip ⇒ U::String
Returns the receiver with its maximum #space? prefix and suffix removed, inheriting any taint and untrust.
# File 'ext/u/rb_u_string_strip.c', line 7
VALUE
rb_u_string_strip(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
const char *begin = USTRING_STR(string);
if (begin == NULL)
return self;
const char *end = USTRING_END(string);
const char *s = begin;
uint32_t c;
const char *t;
while (s < end && u_char_isspace(u_decode(&t, s, end)))
s = t;
t = end;
while (begin < t) {
const char *p;
c = u_decode_r(&p, begin, t);
if (c != '\0' && !u_char_isspace(c))
break;
t = p;
}
if (s == begin && t == end)
return self;
return rb_u_string_new_c(self, s, t - s);
}
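To show what removing the maximum whitespace prefix and suffix looks like for non-ASCII whitespace, here is a small plain-Ruby sketch. `strip_sketch` is a hypothetical name, and the `[[:space:]]` class is used as an approximation of #space?; the two may not agree on every code point. Taint and untrust are not modelled.

```ruby
# Remove the longest leading and trailing runs of Unicode whitespace.
def strip_sketch(string)
  string.sub(/\A[[:space:]]+/, '').sub(/[[:space:]]+\z/, '')
end
```

Ruby's built-in `String#strip` leaves characters such as U+3000 IDEOGRAPHIC SPACE in place, which is the kind of gap this method is meant to close.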
#sub(pattern, replacement) ⇒ U::String?
#sub(pattern, replacements) ⇒ U::String?
#sub(pattern) {|match| ... } ⇒ U::String?
# File 'ext/u/rb_u_string_sub.c', line 65
VALUE
rb_u_string_sub(int argc, VALUE *argv, VALUE self)
{
VALUE pattern, replacement;
VALUE replacements = Qnil;
bool use_block = false;
bool tainted = false;
bool untrusted = false;
if (argc == 1)
use_block = true;
if (rb_scan_args(argc, argv, "11", &pattern, &replacement) == 2) {
replacements = rb_check_convert_type(replacement, T_HASH,
"Hash", "to_hash");
if (NIL_P(replacements))
StringValue(replacement);
if (OBJ_TAINTED(replacement))
tainted = true;
if (OBJ_UNTRUSTED(replacement))
untrusted = true;
}
pattern = rb_u_pattern_argument(pattern, true);
VALUE str = rb_str_to_str(self);
long begin = rb_reg_search(pattern, str, 0, 0);
if (begin < 0)
return Qnil;
VALUE match = rb_backref_get();
struct re_registers *registers = RMATCH_REGS(match);
VALUE result;
if (use_block || !NIL_P(replacements)) {
if (use_block) {
VALUE ustr = rb_u_string_new_rb(rb_reg_nth_match(0, match));
result = rb_u_string_object_as_string(rb_yield(ustr));
} else {
VALUE ustr = rb_u_string_new_c(self,
RSTRING_PTR(str) + registers->beg[0],
registers->end[0] - registers->beg[0]);
result = rb_u_string_object_as_string(rb_hash_aref(replacements, ustr));
}
} else
result =
#ifdef HAVE_RB_REG_REGSUB4
rb_reg_regsub(replacement, str, registers, pattern);
#else
rb_reg_regsub(replacement, str, registers);
#endif
if (OBJ_TAINTED(result))
tainted = true;
if (OBJ_UNTRUSTED(result))
untrusted = true;
const struct rb_u_string *value = RVAL2USTRING_ANY(result);
size_t length = registers->beg[0] +
USTRING_LENGTH(value) +
(RSTRING_LEN(str) - registers->end[0]);
char *base = ALLOC_N(char, length + 1);
MEMCPY(base,
RSTRING_PTR(str),
char,
registers->beg[0]);
MEMCPY(base + registers->beg[0],
USTRING_STR(value),
char,
USTRING_LENGTH(value));
MEMCPY(base + registers->beg[0] + USTRING_LENGTH(value),
RSTRING_PTR(str) + registers->end[0],
char,
RSTRING_LEN(str) - registers->end[0]);
base[length] = '\0';
VALUE substituted = rb_u_string_new_c_own(self, base, length);
if (tainted)
OBJ_TAINT(substituted);
if (untrusted)
OBJ_UNTRUST(substituted);
return substituted;
}
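The three call forms and the final splice (text before the match, then the substitute, then text after the match) can be sketched in plain Ruby as follows. `sub_sketch` is a hypothetical name; the sketch only accepts a `Regexp` pattern, ignores taint and untrust, and does not expand backreferences such as `\1` in a String replacement, which the C code delegates to `rb_reg_regsub`.

```ruby
# Replace the first match only. Returns nil when nothing matches,
# as the listing above does.
def sub_sketch(string, pattern, replacement = nil)
  match = pattern.match(string)
  return nil if match.nil?

  substitute =
    if replacement.nil?
      yield(match[0]).to_s              # block form
    elsif replacement.is_a?(Hash)
      replacement[match[0]].to_s        # hash lookup keyed by the matched text
    else
      replacement.to_s                  # plain string, no backreferences
    end

  # Splice: prefix + substitute + suffix.
  match.pre_match + substitute + match.post_match
end
```

As in the listing, the block form is selected by the absence of a replacement argument rather than by the presence of a block.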
#title? ⇒ Boolean
# File 'ext/u/rb_u_string_title.c', line 6
VALUE
rb_u_string_title(VALUE self)
{
return _rb_u_character_test(self, u_char_istitle);
}
#titlecase(locale = ENV['LC_CTYPE']) ⇒ U::String
# File 'ext/u/rb_u_string_titlecase.c', line 9
VALUE
rb_u_string_titlecase(int argc, VALUE *argv, VALUE self)
{
return _rb_u_string_convert_locale(argc, argv, self, u_titlecase, NULL);
}
#to_i(base = 10) ⇒ Integer
# File 'ext/u/rb_u_string_to_i.c', line 32
VALUE
rb_u_string_to_i(int argc, VALUE *argv, VALUE self)
{
int base = 10;
VALUE rbbase;
if (rb_scan_args(argc, argv, "01", &rbbase) == 1)
base = NUM2INT(rbbase);
if (base < 0)
rb_u_raise(rb_eArgError, "illegal radix %d", base);
return rb_u_string_to_inum(self, base, false);
}
#to_str ⇒ Object
Also known as: to_s
Returns the String representation of the receiver, inheriting any taint and untrust, encoded as UTF-8.
# File 'ext/u/rb_u_string_to_str.c', line 5
VALUE
rb_u_string_to_str(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
VALUE result = NIL_P(string->rb) ?
rb_u_str_new(USTRING_STR(string), USTRING_LENGTH(string)) :
string->rb;
OBJ_INFECT(result, self);
return result;
}
#to_sym ⇒ Symbol
Also known as: intern
Returns the Symbol representation of the receiver.
# File 'ext/u/rb_u_string_to_sym.c', line 7
VALUE
rb_u_string_to_sym(VALUE self)
{
/* NOTE: Lazy, but MRI makes it hard to implement this method. */
return rb_str_intern(StringValue(self));
}
#tr(from, to) ⇒ U::String
# File 'ext/u/rb_u_string_tr.c', line 262
VALUE
rb_u_string_tr(VALUE self, VALUE from, VALUE to)
{
return tr_trans(self, from, to, false);
}
#tr_s(from, to) ⇒ U::String
# File 'ext/u/rb_u_string_tr.c', line 286
VALUE
rb_u_string_tr_s(VALUE self, VALUE from, VALUE to)
{
return tr_trans(self, from, to, true);
}
#u ⇒ self
Returns the receiver; mostly for completeness, but allows you to always call #u on something that’s either a String or a U::String.
# File 'lib/u-1.0/string.rb', line 6
def u
  self
end
#upcase(locale = ENV['LC_CTYPE']) ⇒ U::String
# File 'ext/u/rb_u_string_upcase.c', line 8
VALUE
rb_u_string_upcase(int argc, VALUE *argv, VALUE self)
{
return _rb_u_string_convert_locale(argc, argv, self, u_upcase, NULL);
}
#upper?(locale = ENV['LC_CTYPE']) ⇒ Boolean
# File 'ext/u/rb_u_string_upper.c', line 9
VALUE
rb_u_string_upper(int argc, VALUE *argv, VALUE self)
{
return _rb_u_string_test_locale(argc, argv, self, u_upcase);
}
#valid? ⇒ Boolean
# File 'ext/u/rb_u_string_valid.c', line 6
VALUE
rb_u_string_valid(VALUE self)
{
return _rb_u_character_test(self, u_char_isvalid);
}
#valid_encoding? ⇒ Boolean
# File 'ext/u/rb_u_string_valid_encoding.c', line 6
VALUE
rb_u_string_valid_encoding(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
return u_valid(USTRING_STR(string), USTRING_LENGTH(string), NULL) ? Qtrue : Qfalse;
}
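For comparison, a similar well-formedness check can be written with core Ruby alone. This is only an analogy and not how `u_valid` is implemented; `utf8_well_formed?` is a hypothetical name.

```ruby
# Check whether a byte sequence is well-formed UTF-8 using only core Ruby.
def utf8_well_formed?(bytes)
  # dup so the caller's string keeps its original encoding tag.
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end
```

A truncated multi-byte sequence such as a lone `0xC3` byte fails this check, as does a byte like `0xFF` that never occurs in UTF-8.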
#wide? ⇒ Boolean
# File 'ext/u/rb_u_string_wide.c', line 17
VALUE
rb_u_string_wide(VALUE self)
{
return _rb_u_character_test(self, u_char_iswide);
}
#wide_cjk? ⇒ Boolean
# File 'ext/u/rb_u_string_wide_cjk.c', line 17
VALUE
rb_u_string_wide_cjk(VALUE self)
{
return _rb_u_character_test(self, u_char_iswide_cjk);
}
#width ⇒ Integer
Returns the width of the receiver. The width is defined as the sum of the number of “cells” on a terminal or similar cell-based display that the characters in the string will require.
Characters that are #wide? have a width of 2. Characters that are #zero_width? have a width of 0. Other characters have a width of 1.
# File 'ext/u/rb_u_string_width.c', line 13
VALUE
rb_u_string_width(VALUE self)
{
const struct rb_u_string *string = RVAL2USTRING(self);
return UINT2NUM(u_width_n(USTRING_STR(string), USTRING_LENGTH(string)));
}
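The width rule described above (2 for wide characters, 0 for zero-width characters, 1 otherwise) can be sketched in plain Ruby. `width_sketch` is a hypothetical name, and the two patterns are rough stand-ins for #wide? and #zero_width?, which cover more code points than these script and category classes do. The order in which the two properties are checked is also an assumption.

```ruby
# Sum per-character cell widths: 0 for zero-width, 2 for wide, 1 otherwise.
def width_sketch(string)
  string.each_char.sum do |char|
    case char
    when /\p{Mn}|\p{Me}|\p{Cf}/ then 0                             # approximates #zero_width?
    when /\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/ then 2     # approximates #wide?
    else 1
    end
  end
end
```

Under these assumptions, a base letter followed by a combining accent still occupies a single cell, while each Han character occupies two.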
#word_break ⇒ Symbol
Returns the word break property value of the characters of the receiver.
The possible word break values are
- :aletter
- :cr
- :extend
- :extendnumlet
- :format
- :katakana
- :lf
- :midletter
- :midnum
- :midnumlet
- :newline
- :numeric
- :other
- :regional_indicator
# File 'ext/u/rb_u_string_word_break.c', line 57
VALUE
rb_u_string_word_break(VALUE self)
{
return _rb_u_string_property(self, "word break", U_WORD_BREAK_OTHER,
(int (*)(uint32_t))u_char_word_break,
(VALUE (*)(int))break_to_symbol);
}
#xdigit? ⇒ Boolean
# File 'ext/u/rb_u_string_xdigit.c', line 18
VALUE
rb_u_string_xdigit(VALUE self)
{
return _rb_u_character_test(self, u_char_isxdigit);
}
#zero_width? ⇒ Boolean
# File 'ext/u/rb_u_string_zero_width.c', line 12
VALUE
rb_u_string_zero_width(VALUE self)
{
return _rb_u_character_test(self, u_char_iszerowidth);
}