Module: HexaPDF::Content::SmartTextExtractor

Defined in:
lib/hexapdf/content/smart_text_extractor.rb

Overview

This module converts the glyphs on a page to a single text string while preserving the layout.

The general algorithm is:

  1. Collect all individual glyphs with their user space coordinates in TextRunCollector::TextRun objects.

  2. Sort text runs top to bottom and then left to right.

  3. Group those text runs into lines based on a "baseline" while also combining neighboring text runs into larger runs.

  4. Render each line into a string, using the page size and the median glyph width to map text runs to character columns.

  5. Add blank lines between text lines based on the page's normal line spacing.
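
Steps 2 and 3 can be illustrated with a minimal sketch. It is not the module's implementation: `Run`, `sort_runs` and `group_by_baseline` are hypothetical names introduced only for this example, and the combining of neighboring runs into larger runs is left out.

```ruby
# Hypothetical stand-in for a text run; NOT the library's TextRun class.
Run = Struct.new(:string, :left, :baseline)

# Sort top to bottom, then left to right. PDF user space has its origin at the
# bottom left, so a larger baseline value means "further up" on the page.
def sort_runs(runs)
  runs.sort_by {|run| [-run.baseline, run.left] }
end

# Assumed grouping rule: a run joins the current line if its baseline is within
# +tolerance+ of the baseline of the first run of that line.
def group_by_baseline(sorted_runs, tolerance)
  sorted_runs.each_with_object([]) do |run, lines|
    current = lines.last
    if current && (current.first.baseline - run.baseline).abs <= tolerance
      current << run
    else
      lines << [run]
    end
  end
end

runs = [
  Run.new('world', 60, 700),
  Run.new('Second line', 10, 686),
  Run.new('Hello', 10, 700.5),
]
lines = group_by_baseline(sort_runs(runs), 2)
lines.each {|line| puts line.sort_by(&:left).map(&:string).join(' ') }
```

The two runs with baselines 700.5 and 700 end up on the same line because their difference is below the tolerance of 2, while the run at 686 starts a new line.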

Defined Under Namespace

Modules: TextRunCollector

Classes: Line, TextRunProcessor

Class Method Summary

Class Method Details

.layout_text_runs(text_runs, page_width, page_height, line_tolerance_factor: 0.4, paragraph_distance_threshold: 1.35, large_distance_threshold: 3.0) ⇒ Object

Converts an array of TextRun objects into a single string representation, preserving the visual layout.

The page_width and page_height arguments specify the width and height of the page from which the text runs were extracted.

The remaining keyword arguments can be used to fine-tune the algorithm for one's needs:

line_tolerance_factor

The tolerance factor is applied to the median text run height to determine the range within which two text runs are considered to be on the same line. This ensures that small differences in the baseline due to, for example, subscript or superscript parts don't result in multiple lines.

The factor should not be too large, to avoid merging separate visual lines into one line, but also not too small, to avoid placing subscript/superscript parts on separate lines. The default of 0.4 seems to work quite well.

paragraph_distance_threshold

If the number of normal line spacings between two adjacent baselines is at least this large (but smaller than large_distance_threshold), the gap is interpreted as a paragraph break and a single blank line is inserted.

large_distance_threshold

Works like paragraph_distance_threshold: if the number of normal line spacings between two adjacent baselines is at least this large, the gap is considered too large for a paragraph break. In this case a proportional number of blank lines is inserted.

This represents large regions of non-text content, such as images.
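
The interplay of the two thresholds can be sketched as a small helper. `blank_lines_for` is a hypothetical name used only here; the branch logic and the cap of 40 follow the method source shown on this page, and `ratio` is the baseline distance divided by the normal line spacing.

```ruby
# Hypothetical helper: returns how many blank lines would be inserted for a
# given ratio of baseline distance to normal line spacing.
def blank_lines_for(ratio, paragraph_distance_threshold: 1.35, large_distance_threshold: 3.0)
  if ratio >= large_distance_threshold
    # The newline after the text line already counts as one line, so subtract 1.
    # The result is capped at 40 blank lines.
    [ratio.round - 1, 40].min
  elsif ratio >= paragraph_distance_threshold
    1
  else
    0
  end
end

[1.0, 1.5, 3.2, 100.0].each do |ratio|
  puts "ratio #{ratio}: #{blank_lines_for(ratio)} blank line(s)"
end
```

With the default thresholds, a ratio of 1.0 yields no blank line, 1.5 yields one, 3.2 yields two, and 100.0 is capped at 40.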


# File 'lib/hexapdf/content/smart_text_extractor.rb', line 163

def self.layout_text_runs(text_runs, page_width, page_height,
                          line_tolerance_factor: 0.4, paragraph_distance_threshold: 1.35,
                          large_distance_threshold: 3.0)
  return '' if text_runs.empty?

  # Use the median height of all text runs as an approximation of the main font size used on
  # the page. The line tolerance uses a hard floor for small fonts.
  median_height = median(text_runs.map(&:height).sort)
  line_tolerance = [median_height * line_tolerance_factor, 2].max

  # Group the text runs into lines which are sorted top to bottom. Text runs are pre-sorted by
  # baseline from top to bottom and left to right (the latter is done so that consecutive text
  # runs can be combined).
  sorted = text_runs.sort_by {|run| [-run.baseline, run.left] }
  lines = group_into_lines(sorted, line_tolerance)

  # Calculate the normal line spacing, excluding anything too small/big.
  line_distances = lines.map {|l| l.baseline }.each_cons(2).map {|a, b| a - b }.
    select {|d| d >= median_height * 0.5 && d <= median_height * 2 }.sort
  normal_line_spacing = line_distances.empty? ? median_height * 1.2 : median(line_distances)

  # Convert the lines into actual text strings. Blank lines are inserted between the lines
  # based on the normal line spacing.
  output_lines = []
  left_margin = lines.map {|line| line.text_runs[0].left }.min
  glyph_widths = lines.flat_map do |line|
    line.text_runs.flat_map {|run| [run.width.to_f / run.string.length] * run.string.length }
  end.sort
  median_glyph_width = median(glyph_widths)

  lines.each_with_index do |line, index|
    output_lines << text_runs_to_string(line.text_runs, median_glyph_width, left_margin)
    next if index == lines.length - 1

    # Add blank lines as needed.
    ratio = (line.baseline - lines[index + 1].baseline) / normal_line_spacing
    if ratio >= large_distance_threshold
      # Subtract 1 because the newline after the output line already counts as one
      # newline. Also cap at a maximum of 40 to avoid huge gaps.
      [ratio.round - 1, 40].min.times { output_lines << '' }
    elsif ratio >= paragraph_distance_threshold
      output_lines << ''
    end
  end

  output_lines.join("\n")
end
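
The estimation of the normal line spacing in the method above can be tried in isolation with the following sketch. The `median` helper is not shown on this page, so the version here is an assumption (it expects a sorted, non-empty array), and `normal_line_spacing` is a hypothetical wrapper around the same filtering logic.

```ruby
# Assumed median helper; the library's own implementation may differ.
# Expects an already sorted, non-empty array of numbers.
def median(sorted)
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Hypothetical wrapper: takes baselines ordered top to bottom and the median
# text run height, and returns the estimated normal line spacing.
def normal_line_spacing(baselines, median_height)
  distances = baselines.each_cons(2).map {|a, b| a - b }.
    select {|d| d >= median_height * 0.5 && d <= median_height * 2 }.sort
  # Fall back to 1.2 times the median height if no usable distance remains.
  distances.empty? ? median_height * 1.2 : median(distances)
end

baselines = [700, 688, 676, 600, 588]
puts normal_line_spacing(baselines, 10)
```

In this example the distance of 76 between the third and fourth baseline is discarded as too large, so only the three distances of 12 contribute to the median.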