Class: FormatParser::UTF8Reader

Inherits:
Object
Defined in:
lib/utf8_reader.rb

Overview

This class reads individual characters from files using UTF-8 encoding. It deals with two main concerns:

- Variable byte length of characters
- Reducing the number of read operations by loading bytes in chunks
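
To illustrate the first concern, here is a minimal sketch of how a character's byte length can be inferred from its first byte. The helper name `utf8_char_length` is hypothetical; it is not the library's `assess_char_length`, whose source is not shown on this page. It only inspects the lead-byte bit pattern, so overlong or out-of-range sequences would still need a later validity check.

```ruby
# Hypothetical helper (not the library's code): infer how many bytes a
# UTF-8 character occupies from the bit pattern of its first byte.
def utf8_char_length(first_byte)
  case first_byte
  when 0x00..0x7F then 1 # 0xxxxxxx: single-byte (ASCII) character
  when 0xC0..0xDF then 2 # 110xxxxx: lead byte of a 2-byte character
  when 0xE0..0xEF then 3 # 1110xxxx: lead byte of a 3-byte character
  when 0xF0..0xF7 then 4 # 11110xxx: lead byte of a 4-byte character
  end                    # anything else (e.g. a 10xxxxxx continuation byte) yields nil
end

utf8_char_length(0x61) # "a" is a single byte
utf8_char_length(0xE2) # first byte of "€", a 3-byte character
```

A continuation byte in lead position yields nil here, which a caller could treat as an invalid character.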

Defined Under Namespace

Classes: UTF8CharReaderError

Constant Summary

READ_CHUNK_SIZE = 128
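
The chunk size is what keeps the number of read operations low: bytes are fetched from the IO in blocks and then handed out one at a time. The following is a rough sketch of that idea, not the library's `read_byte` (which is not shown on this page). The class name `ChunkedByteReader` and its constant `CHUNK_SIZE` are hypothetical.

```ruby
require 'stringio'

# Hypothetical stand-in for chunked byte reading (not the library's code).
class ChunkedByteReader
  CHUNK_SIZE = 128 # assumed to play the role of READ_CHUNK_SIZE

  def initialize(io)
    @io = io
    @chunk = ""
    @index = 0
  end

  # Returns the next byte as an Integer, or nil once the IO is exhausted.
  def read_byte
    if @index >= @chunk.bytesize
      # Buffer consumed: fetch the next block with a single read call.
      # IO#read(length) returns nil at end of input, hence the to_s.
      @chunk = @io.read(CHUNK_SIZE).to_s
      @index = 0
      return nil if @chunk.empty?
    end
    byte = @chunk.getbyte(@index)
    @index += 1
    byte
  end
end

reader = ChunkedByteReader.new(StringIO.new("hi"))
reader.read_byte # 104, the byte value of "h"
reader.read_byte # 105, the byte value of "i"
reader.read_byte # nil, the input is exhausted
```

With a 128-byte chunk, reading a 300-byte input this way takes three reads that return data plus one that signals the end of input, instead of one read per byte.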

Instance Method Summary

- #initialize(io) ⇒ UTF8Reader (constructor)
- #read_char ⇒ Object

Constructor Details

#initialize(io) ⇒ UTF8Reader

Returns a new instance of UTF8Reader.



# File 'lib/utf8_reader.rb', line 13

def initialize(io)
  @io = io
  @chunk = ""
  @index = 0
  @eof = false
end

Instance Method Details

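#read_char ⇒ Object

Reads the next character from the IO and returns it as a UTF-8 string. Returns nil when no more bytes are available, and raises UTF8CharReaderError when the bytes read do not form a valid UTF-8 character.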



# File 'lib/utf8_reader.rb', line 20

def read_char
  first_byte = read_byte
  return if first_byte.nil?

  char_length = assess_char_length(first_byte)
  as_bytes = Array.new(char_length) do |i|
    next first_byte if i == 0
    read_byte
  end

  char = as_bytes.pack('c*').force_encoding('UTF-8')
  raise UTF8CharReaderError, "Invalid UTF-8 character" unless char.valid_encoding?

  char
rescue TypeError
  raise UTF8CharReaderError, "Invalid UTF-8 character"
end
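
The pack-then-validate step in `read_char` can be tried in isolation. The sketch below uses a hypothetical helper, `bytes_to_utf8_char`, that returns nil instead of raising, and it uses the unsigned `'C*'` directive rather than the `'c*'` shown in the source above. It is meant only to show why both the `valid_encoding?` check and the `TypeError` rescue are needed.

```ruby
# Hypothetical helper mirroring the pack-then-validate step (not the library's code).
# Takes an array of byte values and returns a UTF-8 string, or nil if the
# bytes do not form valid UTF-8.
def bytes_to_utf8_char(bytes)
  char = bytes.pack('C*').force_encoding('UTF-8')
  char.valid_encoding? ? char : nil
end

bytes_to_utf8_char([0xE2, 0x82, 0xAC]) # the three bytes of "€"
bytes_to_utf8_char([0xE2, 0x82])       # truncated sequence, so nil

begin
  # A nil in the byte array (input ended mid-character) cannot be packed.
  bytes_to_utf8_char([0xE2, nil])
rescue TypeError
  # read_char rescues this case and raises UTF8CharReaderError instead.
end
```

A truncated sequence therefore fails in one of two ways: the remaining bytes pack successfully but fail `valid_encoding?`, or a missing byte arrives as nil and `pack` raises `TypeError`.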