Class: FormatParser::UTF8Reader
- Inherits:
- Object
- Object
- FormatParser::UTF8Reader
- Defined in:
- lib/utf8_reader.rb
Overview
Reads individual characters from files using UTF-8 encoding. It addresses two main concerns:
- Variable byte length of characters
- Reducing the number of read operations by loading bytes in chunks
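The first concern, variable byte length, comes from the UTF-8 encoding itself: the leading byte of every character announces how many bytes the character occupies. The sketch below illustrates that rule. The helper name `utf8_sequence_length` is hypothetical and is not part of this class; the library's own private `assess_char_length` is not shown on this page and may differ.

```ruby
# Hypothetical helper (not the library's implementation): returns the number
# of bytes in a UTF-8 character given its leading byte, or nil when the byte
# cannot start a character (for example, a continuation byte).
def utf8_sequence_length(first_byte)
  if first_byte < 0x80                # 0xxxxxxx: single-byte (ASCII)
    1
  elsif first_byte >> 5 == 0b110      # 110xxxxx: two-byte sequence
    2
  elsif first_byte >> 4 == 0b1110     # 1110xxxx: three-byte sequence
    3
  elsif first_byte >> 3 == 0b11110    # 11110xxx: four-byte sequence
    4
  end
end

%w[A é € 😀].each do |char|
  lead = char.bytes.first
  puts format('%s lead=0x%02X length=%d', char, lead, utf8_sequence_length(lead))
end
```

A reader that knows this length can fetch exactly the remaining bytes of the character before decoding it.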
Defined Under Namespace
Classes: UTF8CharReaderError
Constant Summary
- READ_CHUNK_SIZE = 128
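As a rough illustration of how a chunk size like this can reduce read operations, the sketch below buffers bytes from an IO in 128-byte chunks and hands them out one at a time. `ChunkedByteSource`, `next_byte`, and `read_calls` are hypothetical names used only for this example; the class's actual private byte-reading logic is not shown on this page.

```ruby
require 'stringio'

# Hypothetical stand-in, not the library's implementation: serves single
# bytes from an in-memory chunk and refills it only when it is exhausted.
class ChunkedByteSource
  CHUNK_SIZE = 128 # same value as the documented READ_CHUNK_SIZE

  attr_reader :read_calls

  def initialize(io)
    @io = io
    @chunk = ''.b
    @index = 0
    @eof = false
    @read_calls = 0 # counts calls to io.read, for demonstration only
  end

  # Returns the next byte as an Integer, or nil at end of input.
  def next_byte
    refill if @index >= @chunk.bytesize && !@eof
    return nil if @index >= @chunk.bytesize

    byte = @chunk.getbyte(@index)
    @index += 1
    byte
  end

  private

  # Loads the next chunk; marks end of input when nothing more is available.
  def refill
    @read_calls += 1
    data = @io.read(CHUNK_SIZE)
    @eof = data.nil? || data.empty?
    @chunk = @eof ? ''.b : data
    @index = 0
  end
end

source = ChunkedByteSource.new(StringIO.new('a' * 300))
count = 0
count += 1 while source.next_byte
puts "bytes served: #{count}, read calls: #{source.read_calls}"
```

In this sketch, 300 bytes are served with four calls to `read` (three that return data and one that detects the end of input) instead of one call per byte.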
Instance Method Summary
- #initialize(io) ⇒ UTF8Reader (constructor)
  A new instance of UTF8Reader.
- #read_char ⇒ Object
Constructor Details
#initialize(io) ⇒ UTF8Reader
Returns a new instance of UTF8Reader.
# File 'lib/utf8_reader.rb', line 13
def initialize(io)
  @io = io
  @chunk = ""
  @index = 0
  @eof = false
end
Instance Method Details
#read_char ⇒ Object
# File 'lib/utf8_reader.rb', line 20
def read_char
  first_byte = read_byte
  return if first_byte.nil?

  char_length = assess_char_length(first_byte)
  as_bytes = Array.new(char_length) do |i|
    next first_byte if i == 0

    read_byte
  end
  char = as_bytes.pack('c*').force_encoding('UTF-8')
  raise UTF8CharReaderError, "Invalid UTF-8 character" unless char.valid_encoding?

  char
rescue TypeError
  raise UTF8CharReaderError, "Invalid UTF-8 character"
end
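The decoding step in `read_char` can be tried in isolation. The sketch below packs a list of byte values into a string, tags it as UTF-8, and validates it. It also shows one plausible reason for the `rescue TypeError` clause: `Array#pack` raises `TypeError` when an element is `nil`, which would happen if the input ended partway through a character. `bytes_to_utf8_char` and `InvalidCharError` are hypothetical names standing in for the method body and `UTF8CharReaderError`.

```ruby
# Hypothetical stand-in for UTF8CharReaderError.
class InvalidCharError < StandardError; end

# Hypothetical helper mirroring the decode step of read_char: turns an array
# of byte values into a one-character UTF-8 string, or raises on bad input.
def bytes_to_utf8_char(bytes)
  # pack('c*') emits one byte per element; force_encoding relabels the
  # resulting binary string as UTF-8 without changing its bytes.
  char = bytes.pack('c*').force_encoding('UTF-8')
  raise InvalidCharError, 'Invalid UTF-8 character' unless char.valid_encoding?

  char
rescue TypeError
  # A nil element (a byte missing at end of input) makes pack raise TypeError.
  raise InvalidCharError, 'Invalid UTF-8 character'
end

puts bytes_to_utf8_char([226, 130, 172]) # the three bytes of the euro sign

[[0xC3, 0x28], [226, nil, nil]].each do |bad_bytes|
  begin
    bytes_to_utf8_char(bad_bytes)
  rescue InvalidCharError => e
    puts "#{bad_bytes.inspect}: #{e.message}"
  end
end
```

Both failure modes, a malformed byte sequence and a truncated one, surface as the same error type, which matches the single error message used in `read_char`.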