Module: HexaPDF::Content::SmartTextExtractor
- Defined in: lib/hexapdf/content/smart_text_extractor.rb
Overview
This module converts the glyphs on a page to a single text string while preserving the layout.
The general algorithm is:
- Collect all individual glyphs with their user space coordinates in TextRunCollector::TextRun objects.
- Sort text runs top to bottom and then left to right.
- Group those text runs into lines based on a "baseline" while also combining neighboring text runs into larger runs.
- Render each line into a string, taking into account the page size and the median glyph width to map text runs to columns.
- Add blank lines between text lines based on the page's normal line spacing.
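The sorting and grouping steps can be illustrated with a minimal, self-contained sketch. The Run struct, the median helper and group_runs are hypothetical stand-ins written for this example and are not HexaPDF's actual classes or private methods; in particular, comparing each run against the first run of the current line and re-sorting each line by its left edge are assumptions.

```ruby
# Hypothetical stand-in for a text run; NOT HexaPDF's TextRun class.
Run = Struct.new(:string, :left, :baseline, :height)

# Median of an already sorted array (mean of the two middle values for even sizes).
def median(sorted)
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Groups runs that are pre-sorted top to bottom into lines. A run joins the current
# line if its baseline is within +tolerance+ of that line's first run (an assumption
# made for this sketch).
def group_runs(sorted_runs, tolerance)
  sorted_runs.each_with_object([]) do |run, lines|
    if lines.last && (lines.last.first.baseline - run.baseline).abs <= tolerance
      lines.last << run
    else
      lines << [run]
    end
  end
end

runs = [
  Run.new('world', 60, 700, 10),
  Run.new('Hello', 10, 700, 10),
  Run.new('2', 95, 703, 6),          # superscript with a slightly raised baseline
  Run.new('Next line', 10, 686, 10),
]

# Tolerance derived from the median run height, with a floor of 2.
median_height = median(runs.map(&:height).sort)
tolerance = [median_height * 0.4, 2].max

# Sort top to bottom (PDF y coordinates grow upwards), then left to right.
sorted = runs.sort_by {|run| [-run.baseline, run.left] }

# Order each line by left edge after grouping (also an assumption of this sketch).
lines = group_runs(sorted, tolerance).map {|line| line.sort_by(&:left).map(&:string) }
```

With these sample values the superscript stays on the first line because its baseline differs by 3 units, which is below the tolerance of 4.0, while the run 14 units further down starts a new line.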
Defined Under Namespace
Modules: TextRunCollector
Classes: Line, TextRunProcessor
Class Method Summary
- .layout_text_runs(text_runs, page_width, page_height, line_tolerance_factor: 0.4, paragraph_distance_threshold: 1.35, large_distance_threshold: 3.0) ⇒ Object
Converts an array of TextRun objects into a single string representation, preserving the visual layout.
Class Method Details
.layout_text_runs(text_runs, page_width, page_height, line_tolerance_factor: 0.4, paragraph_distance_threshold: 1.35, large_distance_threshold: 3.0) ⇒ Object
Converts an array of TextRun objects into a single string representation, preserving the visual layout.
The page_width and page_height arguments specify the width and height of the page from
which the text runs were extracted.
The remaining keyword arguments can be used to fine-tune the algorithm for one's needs:
line_tolerance_factor
The tolerance factor is applied to the median text run height to determine the range within which two text runs are considered to be on the same line. This ensures that small differences in the baseline due to, for example, subscript or superscript parts don't result in multiple lines.
The factor should not be too large, to avoid forcing separate visual lines into one line,
but also not too small, to avoid subscript/superscript parts being placed on separate lines. The
default seems to work quite well.
paragraph_distance_threshold
If the number of normal line spacings between two adjacent baselines is at least this large (but smaller than large_distance_threshold), the gap is interpreted as a paragraph break and a single blank line is inserted.
large_distance_threshold
Works like paragraph_distance_threshold and indicates whether the number of normal line spacings is too large to be a paragraph break. In this case a proportional number of blank lines is inserted.
This is used to represent large areas with non-text content such as images.
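How the two distance thresholds interact can be shown with a small sketch. The helper name blank_lines_for is hypothetical and exists only for this example; the branch order and the cap of 40 blank lines follow the method source listed on this page.

```ruby
# Returns the number of blank lines to insert for a gap of +ratio+ normal line
# spacings between two adjacent baselines. Hypothetical helper, not part of HexaPDF.
def blank_lines_for(ratio, paragraph_distance_threshold: 1.35, large_distance_threshold: 3.0)
  if ratio >= large_distance_threshold
    # One newline is already produced by joining the lines, hence the "- 1".
    # The result is capped at 40 to avoid huge gaps.
    [ratio.round - 1, 40].min
  elsif ratio >= paragraph_distance_threshold
    1
  else
    0
  end
end

blank_lines_for(1.0)    # => 0, normal line spacing
blank_lines_for(1.5)    # => 1, paragraph break
blank_lines_for(5.2)    # => 4, large gap, e.g. an image
blank_lines_for(100.0)  # => 40, capped
```

Lowering paragraph_distance_threshold makes the output more eager to insert paragraph breaks, while raising large_distance_threshold treats more of the larger gaps as plain paragraph breaks.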
# File 'lib/hexapdf/content/smart_text_extractor.rb', line 163

def self.layout_text_runs(text_runs, page_width, page_height,
                          line_tolerance_factor: 0.4, paragraph_distance_threshold: 1.35,
                          large_distance_threshold: 3.0)
  return '' if text_runs.empty?

  # Use the median height of all text runs as an approximation of the main font size used on
  # the page. The line tolerance uses a hard floor for small fonts.
  median_height = median(text_runs.map(&:height).sort)
  line_tolerance = [median_height * line_tolerance_factor, 2].max

  # Group the text runs into lines which are sorted top to bottom. Text runs are pre-sorted by
  # baseline from top to bottom and left to right (the latter is done so that consecutive text
  # runs can be combined).
  sorted = text_runs.sort_by {|run| [-run.baseline, run.left] }
  lines = group_into_lines(sorted, line_tolerance)

  # Calculate the normal line spacing, excluding anything too small/big.
  line_distances = lines.map {|l| l.baseline }.each_cons(2).map {|a, b| a - b }.
    select {|d| d >= median_height * 0.5 && d <= median_height * 2 }.sort
  normal_line_spacing = line_distances.empty? ? median_height * 1.2 : median(line_distances)

  # Convert the lines into actual text strings. Blank lines are inserted between the lines
  # based on the normal line spacing.
  output_lines = []
  left_margin = lines.map {|line| line.text_runs[0].left }.min
  glyph_widths = lines.flat_map do |line|
    line.text_runs.flat_map {|run| [run.width.to_f / run.string.length] * run.string.length }
  end.sort
  median_glyph_width = median(glyph_widths)
  lines.each_with_index do |line, index|
    output_lines << text_runs_to_string(line.text_runs, median_glyph_width, left_margin)
    next if index == lines.length - 1

    # Add blank lines as needed.
    ratio = (line.baseline - lines[index + 1].baseline) / normal_line_spacing
    if ratio >= large_distance_threshold
      # Subtract 1 because the newline after the output line already counts as one
      # newline. Also cap at a maximum of 40 to avoid huge gaps.
      [ratio.round - 1, 40].min.times { output_lines << '' }
    elsif ratio >= paragraph_distance_threshold
      output_lines << ''
    end
  end

  output_lines.join("\n")
end
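The normal line spacing estimate used in the method can be tried in isolation. The following is a sketch under stated assumptions: the median helper and the normal_line_spacing method are hypothetical re-creations for illustration (the real median helper is not shown on this page), and plain baseline numbers stand in for Line objects.

```ruby
# Median of an already sorted array. Assumed behavior: the mean of the two middle
# values for even sizes; the helper used by HexaPDF may differ.
def median(sorted)
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Estimates the normal line spacing from baselines sorted top to bottom. Distances
# below 0.5x or above 2x the median run height are ignored as outliers; without any
# usable distance, 1.2x the median height is used as fallback.
def normal_line_spacing(baselines, median_height)
  distances = baselines.each_cons(2).map {|a, b| a - b }.
    select {|d| d >= median_height * 0.5 && d <= median_height * 2 }.sort
  distances.empty? ? median_height * 1.2 : median(distances)
end

# Four lines spaced 12 units apart, with one large gap of 36 units in between.
spacing = normal_line_spacing([700, 688, 676, 640, 628], 10)

# A single line has no distances, so the fallback applies.
fallback = normal_line_spacing([700], 10)
```

Here the gap of 36 units is discarded because it exceeds twice the median height, so the estimate is 12; the single-line case falls back to 12.0. Using the median instead of the mean keeps a few unusual gaps from skewing the estimate.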