Class: MARC::Marc8::ToUnicode
- Inherits:
- Object
  - MARC::Marc8::ToUnicode
- Defined in:
- lib/marc/marc8/to_unicode.rb
Overview
Class to convert Marc8 to UTF-8. NOTE: Requires ruby 1.9+ (this could be changed without too much trouble, but we just don’t care to support 1.8.7 anymore.)
www.loc.gov/marc/specifications/speccharmarc8.html
NOT thread-safe: a converter keeps state as it goes through a string, so do not share an instance between threads.
Uses 4 spaces per indent, rather than the usual Ruby 2 spaces, to stay closer to the Python code it was ported from.
Returns UTF-8 encoded string! Encode to something else if you want something else.
III proprietary code points?
Constant Summary
- BASIC_LATIN =
0x42
- ANSEL =
0x45
- G0_SET =
["(", ",", "$"]
- G1_SET =
[")", "-", "$"]
- CODESETS =
MARC::Marc8::MapToUnicode::CODESETS
Instance Attribute Summary
-
#g0 ⇒ Object
State flag holding the character set currently designated as G0. MARC-8 requires tracking the current character sets, which escape sequences in the data can change.
-
#g1 ⇒ Object
State flag holding the character set currently designated as G1. MARC-8 requires tracking the current character sets, which escape sequences in the data can change.
Instance Method Summary
-
#initialize ⇒ ToUnicode
constructor
A new instance of ToUnicode.
-
#is_multibyte(charset) ⇒ Object
Carried over from the original Python: only one character set (0x31) is treated as multibyte.
-
#transcode(marc8_string, options = {}) ⇒ Object
Returns UTF-8 encoded string equivalent of marc8_string passed in.
-
#unichr(code_point) ⇒ Object
Takes a single Unicode code point as an integer and returns it encoded as a UTF-8 string. Python has unichr built in; it is defined here for convenience.
Constructor Details
#initialize ⇒ ToUnicode
Returns a new instance of ToUnicode.
# File 'lib/marc/marc8/to_unicode.rb', line 34
def initialize
    self.g0 = BASIC_LATIN
    self.g1 = ANSEL
end
Instance Attribute Details
#g0 ⇒ Object
State flag holding the character set currently designated as G0 (BASIC_LATIN on a new instance). MARC-8 requires tracking the current character sets, which escape sequences in the data can change.
# File 'lib/marc/marc8/to_unicode.rb', line 32
def g0
    @g0
end
#g1 ⇒ Object
State flag holding the character set currently designated as G1 (ANSEL on a new instance). MARC-8 requires tracking the current character sets, which escape sequences in the data can change.
# File 'lib/marc/marc8/to_unicode.rb', line 32
def g1
    @g1
end
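The way g0 and g1 change can be illustrated with a much-reduced scanner. This is only a sketch, not the library's code: `final_sets` is a hypothetical helper that handles single-byte designations only (ESC followed by "(" or "," for G0, ")" or "-" for G1) and ignores the "$" multibyte forms and the two-byte escapes that transcode also understands.

```ruby
BASIC_LATIN = 0x42
ANSEL = 0x45

# Hypothetical helper: scan a binary string and report which character
# sets are designated as G0 and G1 once the end is reached.
def final_sets(bytes)
  g0 = BASIC_LATIN
  g1 = ANSEL
  pos = 0
  while pos < bytes.bytesize
    # An escape sequence needs ESC, an intermediate byte and a final byte.
    if bytes.getbyte(pos) == 0x1b && pos + 2 < bytes.bytesize
      intermediate = bytes.getbyte(pos + 1)
      final = bytes.getbyte(pos + 2)
      if [0x28, 0x2c].include?(intermediate)    # "(" or "," designates G0
        g0 = final
        pos += 3
        next
      elsif [0x29, 0x2d].include?(intermediate) # ")" or "-" designates G1
        g1 = final
        pos += 3
        next
      end
    end
    pos += 1
  end
  [g0, g1]
end

# ESC ( N selects 0x4E as G0, ESC ) Q selects 0x51 as G1.
g0, g1 = final_sets("\x1b(N\x1b)Q".b)
puts format("g0=0x%02X g1=0x%02X", g0, g1)
```

Because this state lives on the converter instance, two threads feeding strings to the same instance could see each other's designations, which is why an instance should not be shared.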
Instance Method Details
#is_multibyte(charset) ⇒ Object
Carried over from the original Python: only one character set (0x31) is treated as multibyte.
# File 'lib/marc/marc8/to_unicode.rb', line 180
def is_multibyte(charset)
    charset == 0x31
end
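When the multibyte set is active, transcode reads three bytes and combines them into one lookup key. A minimal sketch of that arithmetic, assuming a hypothetical helper name (`multibyte_code_point`) and arbitrary sample bytes chosen only to show the calculation:

```ruby
# Hypothetical helper mirroring the arithmetic used for the multibyte set:
# three consecutive bytes become one integer key, most significant first.
def multibyte_code_point(bytes, pos)
  bytes.getbyte(pos) * 65536 +
    bytes.getbyte(pos + 1) * 256 +
    bytes.getbyte(pos + 2)
end

raw = "\x21\x30\x21".b # sample bytes, tagged binary
key = multibyte_code_point(raw, 0)
puts format("0x%06X", key)
```

The resulting integer is what would be looked up in the code set table; in every other set a single byte is the key.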
#transcode(marc8_string, options = {}) ⇒ Object
Returns UTF-8 encoded string equivalent of marc8_string passed in.
Bad MARC-8 bytes raise an Encoding::InvalidByteSequenceError by default. The exception will not have its full metadata filled out, but it carries a descriptive error message.
Set option :invalid => :replace to instead silently replace bad bytes with a replacement char – by default the Unicode Replacement Char, but option :replace can be set to something else, including the empty string.
converter.transcode(bad_marc8, :invalid => :replace, :replace => "")
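In replace mode, the source listing further down pushes a replacement only when the previous output element is not already one, so a run of bad bytes collapses to a single replacement. A standalone sketch of that rule, using a hypothetical helper name (`push_replacement`) rather than the library's code:

```ruby
REPLACEMENT = "\uFFFD"

# Hypothetical helper: append a replacement character unless the output
# already ends with one, so consecutive bad bytes are coalesced.
def push_replacement(out, replacement = REPLACEMENT)
  out.push(replacement) unless out.last == replacement
  out
end

out = ["a"]
3.times { push_replacement(out) } # three bad bytes in a row
out.push("b")
puts out.join.inspect
```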
The result is NFC-normalized by default. Set the :normalization option to
:nfd, :nfkd, :nfkc, :nfc, or nil. With nil, no normalization is done and
the string is returned as it comes out of the transcode algorithm, which
is faster. That output will generally NOT be composed.
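The difference between composed and uncomposed output can be seen with Ruby's own String#unicode_normalize, the method the source listing calls. The sample string here is an assumption standing in for transcoder output: a base letter followed by a combining accent.

```ruby
# A base letter followed by a combining acute accent (two code points).
decomposed = "e\u0301"

# NFC composes the pair into a single precomposed code point.
composed = decomposed.unicode_normalize(:nfc)

puts decomposed.length # number of code points before normalization
puts composed.length   # number of code points after NFC
puts composed == "\u00E9"
```

Both strings render the same, but they are not equal under ==, which matters when comparing or indexing transcoded values.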
By default, escaped Unicode numeric character references (NCRs) in MARC-8, of the form "&#xXXXX;" (e.g. "&#x200F;"), are translated to the actual UTF-8 characters. Pass :expand_ncr => false to disable this. See www.loc.gov/marc/specifications/speccharconversion.html#lossless
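A minimal sketch of the expansion step, reusing the regular expression from the source listing. The method name `expand_ncr` and the sample strings are assumptions for illustration:

```ruby
# Replace each hexadecimal character reference (4 to 6 uppercase hex
# digits) with the UTF-8 encoding of that code point.
def expand_ncr(str)
  str.gsub(/&#x([0-9A-F]{4,6});/) { [$1.hex].pack("U") }
end

puts expand_ncr("caf&#x00E9;")
# The pattern only accepts uppercase hex digits, so this stays untouched:
puts expand_ncr("caf&#x00e9;")
```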
The string passed in WILL have its encoding tagged 'binary' if it is not already. If it is MARC-8, there is no good reason for it not to be tagged binary already.
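Since the argument's encoding tag is changed in place, a caller who wants to keep the original tag can pass a copy. The sketch below shows only the effect of force_encoding on a plain Ruby string; the sample text is an assumption and no converter is involved:

```ruby
original = +"caf\u00E9" # mutable UTF-8 string, 4 characters

# Work on a copy so the caller's string keeps its UTF-8 tag.
copy = original.dup
copy.force_encoding("binary")

puts original.encoding # unchanged
puts copy.encoding     # binary, reported by Ruby as ASCII-8BIT
puts copy.length       # length is now counted in bytes
```

Retagging does not change the bytes, only how Ruby slices them: indexing a binary string yields one byte at a time, which the transcode loop relies on.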
# File 'lib/marc/marc8/to_unicode.rb', line 62
def transcode(marc8_string, options = {})
    invalid_replacement = options.fetch(:replace, "\uFFFD")
    expand_ncr = options.fetch(:expand_ncr, true)
    normalization = options.fetch(:normalization, :nfc)

    # don't choke on empty marc8_string
    return "" if marc8_string.nil? || marc8_string.empty?

    # Make sure to call it 'binary', so we can slice it
    # byte by byte, and so ruby doesn't complain about bad
    # bytes for some other encoding. Yeah, we're changing
    # encoding on input! If it's Marc8, it ought to be tagged
    # binary already.
    marc8_string.force_encoding("binary")

    uni_list = []
    combinings = []
    pos = 0
    while pos < marc8_string.length
        if marc8_string[pos] == "\x1b"
            next_byte = marc8_string[pos + 1]
            if G0_SET.include? next_byte
                if marc8_string.length >= pos + 3
                    if (marc8_string[pos + 2] == ",") && (next_byte == "$")
                        pos += 1
                    end
                    self.g0 = marc8_string[pos + 2].ord
                    pos += 3
                else
                    # if there aren't enough remaining characters, re-add
                    # the escape character so it doesn't get lost; may
                    # help users diagnose problem records
                    uni_list.push marc8_string[pos]
                    pos += 1
                end
                next
            elsif G1_SET.include? next_byte
                if (marc8_string[pos + 2] == "-") && (next_byte == "$")
                    pos += 1
                end
                self.g1 = marc8_string[pos + 2].ord
                pos += 3
                next
            else
                charset = next_byte.ord
                if CODESETS.has_key? charset
                    self.g0 = charset
                    pos += 2
                elsif charset == 0x73
                    self.g0 = BASIC_LATIN
                    pos += 2
                    if pos == marc8_string.length
                        break
                    end
                end
            end
        end

        mb_flag = is_multibyte(g0)

        if mb_flag
            code_point = (marc8_string[pos].ord * 65536 +
                marc8_string[pos + 1].ord * 256 +
                marc8_string[pos + 2].ord)
            pos += 3
        else
            code_point = marc8_string[pos].ord
            pos += 1
        end

        if (code_point < 0x20) ||
                ((code_point > 0x80) && (code_point < 0xa0))
            uni = unichr(code_point)
            next
        end

        begin
            code_set = (code_point > 0x80) && !mb_flag ? g1 : g0
            (uni, cflag) = CODESETS.fetch(code_set).fetch(code_point)

            if cflag
                combinings.push unichr(uni)
            else
                uni_list.push unichr(uni)
                if combinings.length > 0
                    uni_list.concat combinings
                    combinings = []
                end
            end
        rescue KeyError
            if options[:invalid] == :replace
                # Let's coalesce multiple replacements
                uni_list.push invalid_replacement unless uni_list.last == invalid_replacement
                pos += 1
            else
                raise Encoding::InvalidByteSequenceError.new("MARC8, input byte offset #{pos}, code set: 0x#{code_set.to_s(16)}, code point: 0x#{code_point.to_s(16)}, value: #{transcode(marc8_string, invalid: :replace, replace: "�")}")
            end
        end
    end

    # what to do if combining chars left over?

    uni_str = uni_list.join("")

    if expand_ncr
        uni_str.gsub!(/&#x([0-9A-F]{4,6});/) do
            [$1.hex].pack("U")
        end
    end

    if normalization
        uni_str = uni_str.unicode_normalize(normalization)
    end

    uni_str
end
#unichr(code_point) ⇒ Object
Takes a single Unicode code point as an integer and returns it encoded as a UTF-8 string. Python has unichr built in; it is defined here for convenience.
# File 'lib/marc/marc8/to_unicode.rb', line 186
def unichr(code_point)
    [code_point].pack("U")
end
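A short standalone check of what this one-liner produces. The method is redefined here outside the class purely for illustration, and the sample code point is an assumption:

```ruby
# Same body as the library method, defined at top level for the example.
def unichr(code_point)
  [code_point].pack("U")
end

e_acute = unichr(0x00E9)
puts e_acute.encoding       # pack("U") yields a UTF-8 tagged string
puts e_acute.bytes.inspect  # the two UTF-8 bytes of U+00E9
```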