Class: MARC::Marc8::ToUnicode

Inherits:
Object
  • Object
show all
Defined in:
lib/marc/marc8/to_unicode.rb

Overview

Class to convert Marc8 to UTF-8. NOTE: Requires ruby 1.9+ (this could be changed without too much trouble, but we just don’t care to support 1.8.7 anymore.)

www.loc.gov/marc/specifications/speccharmarc8.html

NOT thread-safe, it needs to keep state as it goes through a string, do not re-use between threads.

Uses 4 spaces per indent, rather than usual ruby 2 space, just to change the python less.

Returns UTF-8 encoded string! Encode to something else if you want something else.

III proprietary code points?

Constant Summary collapse

BASIC_LATIN =
0x42
ANSEL =
0x45
G0_SET =
["(", ",", "$"]
G1_SET =
[")", "-", "$"]
CODESETS =
MARC::Marc8::MapToUnicode::CODESETS

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initializeToUnicode

Returns a new instance of ToUnicode.



34
35
36
37
# File 'lib/marc/marc8/to_unicode.rb', line 34

def initialize
  self.g0 = BASIC_LATIN
  self.g1 = ANSEL
end

Instance Attribute Details

#g0Object

These are state flags, MARC8 requires you to keep track of ‘current char sets’ or something like that, which are changed with escape codes, or something like that.



32
33
34
# File 'lib/marc/marc8/to_unicode.rb', line 32

def g0
  @g0
end

#g1Object

These are state flags, MARC8 requires you to keep track of ‘current char sets’ or something like that, which are changed with escape codes, or something like that.



32
33
34
# File 'lib/marc/marc8/to_unicode.rb', line 32

def g1
  @g1
end

Instance Method Details

#is_multibyte(charset) ⇒ Object

from the original python, yeah, apparently only one charset is considered multibyte



180
181
182
# File 'lib/marc/marc8/to_unicode.rb', line 180

def is_multibyte(charset)
  charset == 0x31
end

#transcode(marc8_string, options = {}) ⇒ Object

Returns UTF-8 encoded string equivalent of marc8_string passed in.

Bad Marc8 bytes? By default will raise an Encoding::InvalidByteSequenceError (will not have full metadata filled out, but will have a decent error message)

Set option :invalid => :replace to instead silently replace bad bytes with a replacement char – by default Unicode Replacement Char, but can set option :replace to something else, including empty string.

converter.transcode(bad_marc8, :invalid => :replace, :replace => “”)

By default returns NFC normalized, but set :normalization option to:

:nfd, :nfkd, :nfkc, :nfc, or nil. Set to nil for higher performance,
we won't do any normalization just take it as it comes out of the
transcode algorithm. This will generally NOT be composed.

By default, escaped unicode ‘named character references’ in Marc8 will be translated to actual UTF8. Eg. “‏” But pass :expand_ncr => false to disable. www.loc.gov/marc/specifications/speccharconversion.html#lossless

String arg passed in WILL have it’s encoding tagged ‘binary’ if it’s not already, if it’s Marc8 there’s no good reason for it not to be already.



62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
# File 'lib/marc/marc8/to_unicode.rb', line 62

def transcode(marc8_string, options = {})
  invalid_replacement = options.fetch(:replace, "\uFFFD")
  expand_ncr = options.fetch(:expand_ncr, true)
  normalization = options.fetch(:normalization, :nfc)

  # don't choke on empty marc8_string
  return "" if marc8_string.nil? || marc8_string.empty?

  # Make sure to call it 'binary', so we can slice it
  # byte by byte, and so ruby doesn't complain about bad
  # bytes for some other encoding. Yeah, we're changing
  # encoding on input! If it's Marc8, it ought to be tagged
  # binary already.
  marc8_string.force_encoding("binary")

  uni_list = []
  combinings = []
  pos = 0
  while pos < marc8_string.length
    if marc8_string[pos] == "\x1b"
      next_byte = marc8_string[pos + 1]
      if G0_SET.include? next_byte
        if marc8_string.length >= pos + 3
          if (marc8_string[pos + 2] == ",") && (next_byte == "$")
            pos += 1
          end
          self.g0 = marc8_string[pos + 2].ord
          pos += 3
        else
          # if there aren't enough remaining characters, readd
          # the escape character so it doesn't get lost; may
          # help users diagnose problem records
          uni_list.push marc8_string[pos]
          pos += 1
        end
        next
      elsif G1_SET.include? next_byte
        if (marc8_string[pos + 2] == "-") && (next_byte == "$")
          pos += 1
        end
        self.g1 = marc8_string[pos + 2].ord
        pos += 3
        next
      else
        charset = next_byte.ord
        if CODESETS.has_key? charset
          self.g0 = charset
          pos += 2
        elsif charset == 0x73
          self.g0 = BASIC_LATIN
          pos += 2
          if pos == marc8_string.length
            break
          end
        end
      end
    end

    mb_flag = is_multibyte(g0)

    if mb_flag
      code_point = (marc8_string[pos].ord * 65536 +
        marc8_string[pos + 1].ord * 256 +
        marc8_string[pos + 2].ord)
      pos += 3
    else
      code_point = marc8_string[pos].ord
      pos += 1
    end

    if (code_point < 0x20) ||
        ((code_point > 0x80) && (code_point < 0xa0))
      uni = unichr(code_point)
      next
    end

    begin
      code_set = (code_point > 0x80) && !mb_flag ? g1 : g0
      (uni, cflag) = CODESETS.fetch(code_set).fetch(code_point)

      if cflag
        combinings.push unichr(uni)
      else
        uni_list.push unichr(uni)
        if combinings.length > 0
          uni_list.concat combinings
          combinings = []
        end
      end
    rescue KeyError
      if options[:invalid] == :replace
        # Let's coallesece multiple replacements
        uni_list.push invalid_replacement unless uni_list.last == invalid_replacement
        pos += 1
      else
        raise Encoding::InvalidByteSequenceError.new("MARC8, input byte offset #{pos}, code set: 0x#{code_set.to_s(16)}, code point: 0x#{code_point.to_s(16)}, value: #{transcode(marc8_string, invalid: :replace, replace: "")}")
      end
    end
  end

  # what to do if combining chars left over?
  uni_str = uni_list.join("")

  if expand_ncr
    uni_str.gsub!(/&#x([0-9A-F]{4,6});/) do
      [$1.hex].pack("U")
    end
  end

  if normalization
    uni_str = uni_str.unicode_normalize(normalization)
  end

  uni_str
end

#unichr(code_point) ⇒ Object

input single unicode codepoint as integer; output encoded as a UTF-8 string python has unichr built-in, we just define it for convenience no problem.



186
187
188
# File 'lib/marc/marc8/to_unicode.rb', line 186

def unichr(code_point)
  [code_point].pack("U")
end