Module: HexaPDF::Layout::TextLayouter::SimpleTextSegmentation

Defined in:
lib/hexapdf/layout/text_layouter.rb

Overview

Implementation of a simple text segmentation algorithm.

The algorithm breaks TextFragment objects into objects wrapped by Box, Glue or Penalty items, and inserts additional Penalty items when needed:

  • Any valid Unicode newline separator inserts a Penalty object describing a mandatory break.

    See www.unicode.org/reports/tr18/#Line_Boundaries

  • Spaces and tabs are wrapped by Glue objects, allowing breaks.

  • Non-breaking spaces are wrapped into Penalty objects that prohibit line breaking.

  • Hyphens are attached to the preceding text fragment (or are a standalone text fragment) and followed by a Penalty object to allow a break.

  • If a soft hyphen is encountered, a hyphen wrapped by a Penalty object is inserted to allow a break.

  • If a zero-width-space is encountered, a Penalty object is inserted to allow a break.
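
The rules above can be sketched on plain strings. This is a simplified illustration, not HexaPDF's implementation: `SketchBox`, `SketchGlue`, `SketchPenalty` and `sketch_segment` are hypothetical names, the sketch works on characters instead of glyph objects, and it ignores kerning values, fonts and widths.

```ruby
# Hypothetical stand-ins for the Box, Glue and Penalty items; not HexaPDF's classes.
SketchBox = Struct.new(:text)
SketchGlue = Struct.new(:text)
SketchPenalty = Struct.new(:kind, :text)

# Newline separators that force a break ("\r" is handled separately because of CR+LF).
SKETCH_NEWLINES = ["\n", "\v", "\f", "\u{85}", "\u{2028}", "\u{2029}"].freeze
# All characters at which the sketch stops collecting ordinary text.
SKETCH_BREAKS = ([" ", "\t", "-", "\u{00AD}", "\u{00A0}", "\u{200B}", "\r"] +
                 SKETCH_NEWLINES).freeze

def sketch_segment(str)
  result = []
  chars = str.chars
  i = 0
  while i < chars.size
    # Collect ordinary characters until a break character (or the end) is reached.
    run = +""
    while (ch = chars[i]) && !SKETCH_BREAKS.include?(ch)
      run << ch
      i += 1
    end
    run << ch if ch == "-" # a hyphen stays attached to the preceding text
    result << SketchBox.new(run) unless run.empty?

    case ch
    when " ", "\t" then result << SketchGlue.new(ch)
    when *SKETCH_NEWLINES then result << SketchPenalty.new(:mandatory, nil)
    when "\r"
      # For CR+LF, the following "\n" produces the single mandatory break.
      result << SketchPenalty.new(:mandatory, nil) unless chars[i + 1] == "\n"
    when "-" then result << SketchPenalty.new(:allowed, nil)
    when "\u{00AD}" then result << SketchPenalty.new(:allowed, "-")
    when "\u{00A0}" then result << SketchPenalty.new(:prohibited, " ")
    when "\u{200B}" then result << SketchPenalty.new(:allowed, nil)
    end
    i += 1
  end
  result
end

sketch_segment("co-op store").each { |item| p item }
```

For the input "co-op store" the sketch yields a box for "co-", a penalty that allows a break, a box for "op", a glue for the space, and a box for "store".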

Constant Summary

BREAK_CHARS =

Breaks are detected at: space, tab, zero-width-space, non-breaking space, hyphen, soft-hyphen and any valid Unicode newline separator.

{}

Class Method Summary

Class Method Details

.call(items) ⇒ Object

Breaks the items (an array of InlineBox and TextFragment objects) into atomic pieces wrapped by Box, Glue or Penalty items, and returns those as an array.



# File 'lib/hexapdf/layout/text_layouter.rb', line 228

def self.call(items)
  result = []
  glues = {}
  penalties = {}
  items.each do |item|
    if item.kind_of?(InlineBox)
      result << Box.new(item)
    else
      i = 0
      while i < item.items.size
        # Collect characters and kerning values until break character is encountered
        box_items = []
        while (glyph = item.items[i]) &&
            (glyph.kind_of?(Numeric) || !BREAK_CHARS.key?(glyph.str))
          box_items << glyph
          i += 1
        end

        # A hyphen belongs to the text fragment
        box_items << glyph if glyph && !glyph.kind_of?(Numeric) && glyph.str == '-'

        unless box_items.empty?
          result << Box.new(item.dup_attributes(box_items.freeze))
        end

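        if glyph
          case glyph.str
          when ' '
            # Space: breakable glue, cached per attributes_hash
            result << (glues[item.attributes_hash] ||=
                       Glue.new(item.dup_attributes([glyph].freeze)))
          when "\n", "\v", "\f", "\u{85}", "\u{2029}"
            # Newline separators: mandatory paragraph break, cached per attributes_hash
            result << (penalties[item.attributes_hash] ||=
                       Penalty.new(Penalty::PARAGRAPH_BREAK, 0))
          when "\u{2028}"
            # Line separator: mandatory line break
            result << Penalty.new(Penalty::LINE_BREAK, 0)
          when "\r"
            # Lone CR: paragraph break; for CR+LF the following "\n" emits the break
            if !item.items[i + 1] || item.items[i + 1].kind_of?(Numeric) ||
                item.items[i + 1].str != "\n"
              result << (penalties[item.attributes_hash] ||=
                         Penalty.new(Penalty::PARAGRAPH_BREAK, 0))
            end
          when '-'
            # Hyphen (already added to the box above): allow a break after it
            result << Penalty::Standard
          when "\t"
            # Tab: glue made of eight space glyphs
            spaces = [item.style.font.decode_utf8(" ").first] * 8
            result << Glue.new(item.dup_attributes(spaces.freeze))
          when "\u{00AD}"
            # Soft hyphen: penalty carrying a hyphen fragment, allowing a break
            frag = item.dup_attributes([item.style.font.decode_utf8("-").first].freeze)
            result << Penalty.new(Penalty::Standard.penalty, frag.width, item: frag)
          when "\u{00A0}"
            # Non-breaking space: penalty carrying a space fragment, prohibiting a break
            frag = item.dup_attributes([item.style.font.decode_utf8(" ").first].freeze)
            result << Penalty.new(Penalty::ProhibitedBreak.penalty, frag.width, item: frag)
          when "\u{200B}"
            # Zero-width space: zero-cost penalty allowing a break
            result << Penalty.new(0)
          end
        end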
        i += 1
      end
    end
  end
  result
end
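
The `glues[item.attributes_hash] ||= ...` and `penalties[item.attributes_hash] ||= ...` expressions in the listing reuse one object per distinct key instead of allocating a new one for every space or newline. The following is a minimal sketch of that caching pattern only. `SketchSharedGlue` and `sketch_glue_for` are hypothetical names, and using `Hash#hash` as the key is an assumption made for illustration; what `attributes_hash` actually computes is not shown in this section.

```ruby
# Hypothetical stand-in for a glue item; not HexaPDF's Glue class.
SketchSharedGlue = Struct.new(:attributes)

# Returns the cached glue for the given attributes, creating it on first use.
def sketch_glue_for(cache, attributes)
  # Assumption: equal attribute hashes produce equal Hash#hash values,
  # so fragments with the same attributes map to the same cache key.
  cache[attributes.hash] ||= SketchSharedGlue.new(attributes)
end

cache = {}
a = sketch_glue_for(cache, {font: "Helvetica", size: 10})
b = sketch_glue_for(cache, {font: "Helvetica", size: 10})
c = sketch_glue_for(cache, {font: "Helvetica", size: 12})

puts a.equal?(b) # same attributes: the same object is reused
puts a.equal?(c) # different attributes: a separate object
```

Sharing items this way keeps the result array small for long texts, at the cost of every shared item being the same object, so callers should treat the items as immutable.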