Class: Regex::Character

Inherits:
AtomicExpression
Defined in:
lib/regex/character.rb

Overview

A regular expression that matches a specific character in a given character set.

Constant Summary

DigramSequences =

Constant listing all special two-character escape sequences

{
  '\a' => 0x7, # alarm
  '\n' => 0xA, # newline
  '\r' => 0xD, # carriage return
  '\t' => 0x9, # tab
  '\e' => 0x1B, # escape
  '\f' => 0xC, # form feed
  '\v' => 0xB, # vertical tab
  # Single octal digit literals
  '\0' => 0,
  '\1' => 1,
  '\2' => 2,
  '\3' => 3,
  '\4' => 4,
  '\5' => 5,
  '\6' => 6,
  '\7' => 7
}.freeze
MetaChars =
'\^$.|+?*()[]{}'.freeze
MetaCharsInClass =

Characters with special meaning inside a character class

'\^[]-'.freeze

Instance Attribute Summary

Attributes inherited from Expression

#begin_anchor, #end_anchor

Class Method Summary

Instance Method Summary

Methods inherited from AtomicExpression

#atomic?, #done!, #lazy!

Methods inherited from Expression

#atomic?, #options, #to_str

Constructor Details

#initialize(aValue) ⇒ Character

Constructor.

[aValue] Initialize the character with either a String literal or a codepoint value.

Examples:

Initializing with a codepoint value:
  Regex::Character.new(0x3a3) # Represents: Σ (Unicode GREEK CAPITAL LETTER SIGMA)
  Regex::Character.new(931) # Also represents: Σ (931 dec == 3a3 hex)

Initializing with a single-character string:
  Regex::Character.new(?\u03a3) # Also represents: Σ
  Regex::Character.new('Σ') # Obviously, represents a Σ

Initializing with an escape sequence string. Recognized escaped characters are:
  \a (alarm, 0x07), \n (newline, 0xA), \r (carriage return, 0xD), \t (tab, 0x9), \e (escape, 0x1B), \f (form feed, 0xC),
  \uXXXX where XXXX is a 4 hex digits integer value, \uX..., \ooo (octal), \xXX (hex).
Any other escaped character will be treated as a literal character.
  Regex::Character.new('\n') # Represents a newline
  Regex::Character.new('\u03a3') # Represents a Σ



# File 'lib/regex/character.rb', line 59

def initialize(aValue)
  case aValue
    when String
      if aValue.size == 1
        # Literal single character case...
        @codepoint = self.class.char2codepoint(aValue)
      else
        # Should be an escape sequence...
        @codepoint = self.class.esc2codepoint(aValue)
      end
      @lexeme = aValue

    when Integer
      @codepoint = aValue
    else
      raise StandardError, "Cannot initialize a Character with a '#{aValue}'."
  end
end
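
Because the constructor branches on aValue.size, the Ruby quoting style of the argument decides which path is taken. The sketch below uses only core String behaviour (no Regex classes are involved) to show what each literal actually contains; the constant names are purely illustrative.

```ruby
# Single-quoted literals keep the backslash, so they reach the constructor
# as multi-character escape sequences (the esc2codepoint branch).
ESCAPED_NEWLINE = '\n'
ESCAPED_SIGMA   = '\u03a3'

# Double-quoted and ?-literals are interpreted by Ruby first, so they reach
# the constructor as a single character (the char2codepoint branch).
REAL_NEWLINE = "\n"
REAL_SIGMA   = ?\u03a3

puts ESCAPED_NEWLINE.size # 2
puts ESCAPED_SIGMA.size   # 6
puts REAL_NEWLINE.size    # 1
puts REAL_SIGMA.size      # 1
```

Either form of the newline should end up with the same codepoint, but only the single-quoted form exercises the escape-sequence parsing.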

Instance Attribute Details

#codepoint ⇒ Object (readonly)

The integer value that uniquely identifies the character.



# File 'lib/regex/character.rb', line 32

def codepoint
  @codepoint
end

#lexeme ⇒ Object (readonly)

The initial text representation of the character (if any).



# File 'lib/regex/character.rb', line 35

def lexeme
  @lexeme
end

Class Method Details

.char2codepoint(aChar) ⇒ Object

Conversion method that returns the codepoint for the given single character.

Example:
  Regex::Character::char2codepoint('Σ') # Returns: 0x3a3



# File 'lib/regex/character.rb', line 89

def self.char2codepoint(aChar)
  aChar.ord
end
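
As a quick illustration of the String#ord call this method delegates to, here is a small sketch using core Ruby only. The SIGMA constant is an illustrative name, not part of the library.

```ruby
# Σ written with an escape so the source file encoding does not matter.
SIGMA = "\u03a3"

puts SIGMA.ord          # 931
puts SIGMA.ord.to_s(16) # 3a3

# ord only looks at the first character of a longer string.
puts 'AB'.ord           # 65
```

Since ord ignores everything after the first character, the size check in the constructor is what keeps multi-character strings away from this method.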

.codepoint2char(aCodepoint) ⇒ Object

Conversion method that returns a character given a codepoint (integer) value.

Example:
  Regex::Character::codepoint2char(0x3a3) # Returns: Σ (the Unicode GREEK CAPITAL LETTER SIGMA)



# File 'lib/regex/character.rb', line 82

def self.codepoint2char(aCodepoint)
  [aCodepoint].pack('U') # Remark: chr() without an encoding argument fails with codepoints > 255
end
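
The remark about chr can be checked with a short sketch in core Ruby. The helper name sketch_codepoint2char is hypothetical and simply repeats the pack('U') call shown above; passing an explicit encoding to Integer#chr is shown as an alternative, not as what the library does.

```ruby
# Hypothetical stand-in that repeats the documented pack('U') conversion.
def sketch_codepoint2char(codepoint)
  [codepoint].pack('U')
end

sigma = sketch_codepoint2char(0x3a3)
puts sigma
puts sigma.encoding # UTF-8

# Integer#chr also works once it is told which encoding to use.
puts(0x3a3.chr(Encoding::UTF_8) == sigma) # true

# Without an encoding, chr typically raises for values above 255
# (this depends on Encoding.default_internal being unset).
begin
  0x3a3.chr
rescue RangeError => e
  puts "chr without an encoding failed: #{e.message}"
end
```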

.esc2codepoint(esc_seq) ⇒ Object

Conversion method that returns the codepoint for the given escape sequence (a String).

Recognized escaped characters are:
  \a (alarm, 0x07), \n (newline, 0xA), \r (carriage return, 0xD), \t (tab, 0x9), \e (escape, 0x1B), \f (form feed, 0xC), \v (vertical tab, 0xB),
  \uXXXX where XXXX is a 4 hex digits integer value, \uX..., \ooo (octal), \xXX (hex).
Any other escaped character will be treated as a literal character.

Example:
  Regex::Character::esc2codepoint('\n') # Returns: 0xa

Raises:

  • (StandardError)


# File 'lib/regex/character.rb', line 103

def self.esc2codepoint(esc_seq)
  # The backslash must be doubled: in a double-quoted string "\)" is just ")".
  msg = "Escape sequence #{esc_seq} does not begin with a backslash (\\)."
  raise StandardError, msg unless esc_seq[0] == '\\'
  result = (esc_seq.length == 2) ? digram2codepoint(esc_seq) : esc_number2codepoint(esc_seq)

  return result
end
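
The two helpers called above, digram2codepoint and esc_number2codepoint, are not shown on this page. The following is only a sketch of how the documented rules could be implemented, under stated assumptions: DIGRAM_SKETCH is a hypothetical subset of DigramSequences, sketch_esc2codepoint is a hypothetical name, and only the exact forms \uXXXX, \xXX and \ooo are accepted for numeric escapes.

```ruby
# Hypothetical subset of the documented DigramSequences table.
DIGRAM_SKETCH = {
  '\a' => 0x7, '\n' => 0xA, '\r' => 0xD, '\t' => 0x9,
  '\e' => 0x1B, '\f' => 0xC, '\v' => 0xB
}.freeze

def sketch_esc2codepoint(esc_seq)
  unless esc_seq[0] == '\\'
    raise StandardError,
          "Escape sequence #{esc_seq} does not begin with a backslash (\\)."
  end

  if esc_seq.length == 2
    # Known digram, otherwise the escaped character is taken literally.
    DIGRAM_SKETCH.fetch(esc_seq) { esc_seq[1].ord }
  else
    case esc_seq
    when /\A\\u(\h{4})\z/     then Regexp.last_match(1).hex # \uXXXX
    when /\A\\x(\h{1,2})\z/   then Regexp.last_match(1).hex # \xXX
    when /\A\\([0-7]{1,3})\z/ then Regexp.last_match(1).oct # \ooo
    else raise StandardError, "Unsupported escape sequence #{esc_seq}."
    end
  end
end

puts sketch_esc2codepoint('\n')     # 10
puts sketch_esc2codepoint('\u03a3') # 931
puts sketch_esc2codepoint('\x41')   # 65
puts sketch_esc2codepoint('\101')   # 65
puts sketch_esc2codepoint('\.')     # 46
```

All arguments are single-quoted on purpose, so the backslash reaches the method unchanged.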

Instance Method Details

#==(other) ⇒ Object

Returns true iff this Character and the parameter 'other' represent the same character.

[other] any Object. The way equality is tested depends on the class of 'other'.

Example:
  newOne = Character.new(?\u03a3)
  newOne == newOne # true. Identity
  newOne == Character.new(?\u03a3) # true. Both have the same codepoint
  newOne == ?\u03a3 # true. The single-character String matches the char attribute exactly.
  newOne == 0x03a3 # true. The Integer is compared to the codepoint value.

Equality can be tested with any Object that responds to to_s.



# File 'lib/regex/character.rb', line 125

def ==(other)
  result = case other
    when Character
      to_str == other.to_str

    when Integer
      codepoint == other

    when String
      other.size > 1 ? false : to_str == other

    else
      # Unknown type: try with a conversion
      self == other.to_s # Recursive call
  end

  return result
end
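
The type-dependent comparison can be tried in isolation with a small stand-in class. SketchChar is hypothetical and is not the library class; in particular, its to_str is an assumption about what the inherited to_str returns (the character itself).

```ruby
# Hypothetical stand-in that copies the comparison logic shown above.
class SketchChar
  attr_reader :codepoint

  def initialize(codepoint)
    @codepoint = codepoint
  end

  # Assumption: the inherited to_str yields the character as a String.
  def to_str
    [codepoint].pack('U')
  end

  def ==(other)
    case other
    when SketchChar then to_str == other.to_str
    when Integer    then codepoint == other
    when String     then other.size > 1 ? false : to_str == other
    else self == other.to_s # unknown type: retry with its String form
    end
  end
end

sigma = SketchChar.new(0x3a3)
puts(sigma == SketchChar.new(931)) # true
puts(sigma == 0x3a3)               # true
puts(sigma == "\u03a3")            # true
puts(sigma == "\u03a3\u03a3")      # false
puts(sigma == :"\u03a3")           # true, via Symbol#to_s
```

Note that the fallback branch means any object whose to_s happens to be the same single character compares equal, which may or may not be desirable for a given caller.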

#char ⇒ Object

Returns the character as a String object.



# File 'lib/regex/character.rb', line 112

def char
  self.class.codepoint2char(@codepoint)
end

#explain ⇒ Object

Returns a plain English description of the character.



# File 'lib/regex/character.rb', line 145

def explain
  "the character '#{to_str}'"
end