Class: Sqed::Parser::OcrParser
- Inherits: Sqed::Parser
  - Object
  - Sqed::Parser
  - Sqed::Parser::OcrParser
- Defined in: lib/sqed/parser/ocr_parser.rb
Overview
Given a single image return all text in that image.
For reference
http://misteroleg.wordpress.com/2012/12/19/ocr-using-tesseract-and-imagemagick-as-pre-processing-task/
https://code.google.com/p/tesseract-ocr/wiki/FAQ
http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
“There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be ‘noise removed’.”
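The rule of thumb quoted above can be checked with a little arithmetic. The following is a minimal sketch, not part of Sqed: the helper name `estimated_x_height_px` and the ratio `X_HEIGHT_RATIO` (x-height as a fraction of the font size) are assumptions chosen for illustration, and real fonts vary considerably.

```ruby
# Rough x-height estimate in pixels for a given point size and resolution.
# One point is 1/72 inch. X_HEIGHT_RATIO is an assumed "typical" ratio of
# x-height to font size; it is not taken from Tesseract or Sqed.
POINTS_PER_INCH = 72.0
X_HEIGHT_RATIO = 0.48

def estimated_x_height_px(point_size, dpi, ratio: X_HEIGHT_RATIO)
  (point_size / POINTS_PER_INCH * dpi * ratio).round
end

puts estimated_x_height_px(10, 300) # => 20, matches the quoted figure
puts estimated_x_height_px(8, 300)  # => 16
puts estimated_x_height_px(10, 150) # => 10, at the edge of usable accuracy
```

Under these assumptions, halving the resolution of 10pt text to 150dpi already puts the x-height at the 10-pixel mark below which the quote expects poor results.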
Constant Summary
- TYPE = :text
- SECTION_PARAMS =
  Tesseract parameters, both default and specific to a section type. The default parameters are merged with those of the requested section type, whose values take precedence.

    {
      default: {
        psm: 3
      },
      annotated_specimen: {
        # was 45, significantly improves annotated_specimen for odontates
        edges_children_count_limit: 3000
      },
      identifier: {
        psm: 1,
        # tessedit_char_whitelist: '0123456789'
        # edges_children_count_limit: 4000
      },
      curator_metadata: {
        psm: 3
      },
      labels: {
        psm: 3, # may need to be 6
      },
      determination_labels: {
        psm: 3
      },
      other_labels: {
        psm: 3
      },
      collecting_event_labels: {
        psm: 3
      }
    }.freeze
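How the defaults combine with a section type can be sketched as follows. This is an illustration only: `SECTION_PARAMS_SAMPLE` is a trimmed copy of the table, and `params_for` is a hypothetical helper that does not exist in Sqed. The fallback to an empty hash for unknown section types is an addition of this sketch.

```ruby
# A trimmed copy of the parameter table, for illustration only.
SECTION_PARAMS_SAMPLE = {
  default: { psm: 3 },
  annotated_specimen: { edges_children_count_limit: 3000 },
  identifier: { psm: 1 }
}.freeze

# Hypothetical helper: returns the defaults overlaid with the
# section-specific values. Hash#merge returns a new hash, so the
# frozen table is never mutated.
def params_for(section_type, table = SECTION_PARAMS_SAMPLE)
  table[:default].merge(table.fetch(section_type, {}))
end

p params_for(:annotated_specimen) # default psm plus the extra limit
p params_for(:identifier)         # psm overridden by the section value
p params_for(:not_a_section)      # falls back to the defaults alone
```

Section-specific keys win on conflict (`identifier` ends up with `psm: 1`), while keys present only in the defaults carry through unchanged.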
Instance Attribute Summary
Attributes inherited from Sqed::Parser
Instance Method Summary
- #get_text(section_type: :default) ⇒ String
  TODO: very kludgy.
Methods inherited from Sqed::Parser
Constructor Details
This class inherits a constructor from Sqed::Parser
Instance Method Details
#get_text(section_type: :default) ⇒ String
TODO: very kludgy.
    # File 'lib/sqed/parser/ocr_parser.rb', line 112

    def get_text(section_type: :default)
      img = image

      # resample if an image 4"x4" is less than 300dpi
      if img.columns * img.rows < 144000
        img = img.resample(300)
      end

      params = SECTION_PARAMS[:default].dup
      params.merge!(SECTION_PARAMS[section_type])

      # May be able to overcome this hacky kludge mess by providing `processor:` to new
      file = Tempfile.new('foo1', encoding: 'utf-8')
      begin
        file.write(image.to_blob.force_encoding('utf-8'))
        file.rewind
        @extracted_text = RTesseract.new(file.path, params).to_s&.strip
        file.close
      ensure
        file.close
        file.unlink # deletes the temp file
      end

      if @extracted_text == ''
        file = Tempfile.new('foo2', encoding: 'utf-8')
        begin
          file.write(img.dup.white_threshold(245).to_blob.force_encoding('utf-8'))
          file.rewind
          @extracted_text = RTesseract.new(file.path, params).to_s&.strip
          file.close
        ensure
          file.close
          file.unlink
        end
      end

      if @extracted_text == ''
        file = Tempfile.new('foo3', encoding: 'utf-8')
        begin
          file.write(img.dup.quantize(256, Magick::GRAYColorspace).to_blob.force_encoding('utf-8'))
          file.rewind
          @extracted_text = RTesseract.new(file.path, params).to_s&.strip
          file.close
        ensure
          file.close
          file.unlink
        end
      end

      @extracted_text
    end
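The method repeats one pattern three times: write a variant of the image to a temp file, run OCR on it, and stop at the first non-empty result. A minimal sketch of that pattern, factored into a loop, might look like the code below. It is not the Sqed implementation. `with_tempfile` and `first_text` are hypothetical helpers, and the lambdas are stand-ins for the RMagick preprocessing steps and for `RTesseract.new(path, params).to_s`, so the example runs without either library.

```ruby
require 'tempfile'

# Hypothetical helper: writes binary data to a temp file, yields its path,
# and relies on the block form of Tempfile.create to close and delete it.
def with_tempfile(data, prefix = 'ocr')
  Tempfile.create(prefix) do |file|
    file.binmode
    file.write(data)
    file.flush
    yield file.path
  end
end

# Hypothetical helper: applies each preprocessing step in order and returns
# the first non-empty OCR result, or '' when every attempt comes back empty.
# `ocr` is any callable taking a file path and returning text.
def first_text(blob, steps, ocr)
  steps.each do |step|
    text = with_tempfile(step.call(blob)) { |path| ocr.call(path).to_s.strip }
    return text unless text.empty?
  end
  ''
end

# Stand-ins for demonstration: '#' plays the role of image noise.
steps = [
  ->(blob) { blob },             # untouched image
  ->(blob) { blob.delete('#') }  # stand-in for a cleanup such as white_threshold
]
fake_ocr = lambda do |path|
  data = File.binread(path)
  data.include?('#') ? '' : data # "fails" whenever noise is present
end

puts first_text('#label 42#', steps, fake_ocr) # => label 42
```

Each step is only evaluated when the previous attempts returned nothing, which mirrors the `if @extracted_text == ''` guards in the method, and the temp file cleanup lives in one place instead of three `ensure` blocks.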