Class: Tabula::Extraction::SpreadsheetExtractor
- Inherits: ObjectExtractor
  - Object
  - ObjectExtractor
  - Tabula::Extraction::SpreadsheetExtractor
- Defined in: lib/tabula/spreadsheet_extractor.rb
Constant Summary
Constants inherited from ObjectExtractor
ObjectExtractor::DEFAULT_OPTIONS, ObjectExtractor::PRINTABLE_RE
Instance Attribute Summary
Attributes inherited from ObjectExtractor
#characters, #clipping_paths, #debug_clipping_paths, #debug_text, #options
Instance Method Summary
- #extract(options = {}) ⇒ Object
  TODO: lots of code is repeated from the parent class; refactor.
Methods inherited from ObjectExtractor
#clear!, #currentClippingPath, #drawImage, #drawPage, #ensurePageSize!, #fillPath, #getStroke, #initialize, #pageTransform, #page_count, #processTextPosition, #rulings, #setStroke, #strokePath, #transformPath
Constructor Details
This class inherits a constructor from Tabula::Extraction::ObjectExtractor
Instance Method Details
#extract(options = {}) ⇒ Object
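TODO: lots of code is repeated from the parent class; refactor.
Returns an Enumerator that, for each selected page, yields the page together with each spreadsheet found on it. The PDF file is closed once iteration ends.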
# File 'lib/tabula/spreadsheet_extractor.rb', line 15

def extract(options={})
  Enumerator.new do |y|
    begin
      @pages.each do |i|
        # TODO: this can error out ungracefully if you try to extract a page
        # that doesn't exist (e.g. page 5 of a 4 page doc). we should catch and handle.
        pdfbox_page = @all_pages.get(i-1)
        contents = pdfbox_page.getContents
        next if contents.nil?

        self.clear!
        self.drawPage pdfbox_page

        page = Tabula::Page.new(@pdf_filename,
                                pdfbox_page.findCropBox.width,
                                pdfbox_page.findCropBox.height,
                                pdfbox_page.getRotation.to_i,
                                i, # one-indexed, just like `i` is.
                                self.characters,
                                self.rulings)

        # Attach the text of each cell before handing the spreadsheet to the caller.
        page.spreadsheets().each do |spreadsheet|
          spreadsheet.cells.each do |cell|
            cell.text_elements = page.get_cell_text(cell)
          end
          y.yield page, spreadsheet
        end
      end
    ensure
      @pdf_file.close
    end # begin
  end
end
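The control flow of #extract (a lazy Enumerator whose body closes the file in an ensure clause) can be tried without a PDF. The sketch below is not tabula code: FakeFile and SketchExtractor are hypothetical stand-ins, pages are plain arrays of table names, and the range check is only one assumed way of handling the TODO about nonexistent pages.

```ruby
# Hypothetical stand-in for the PDF handle; it only records whether close ran.
class FakeFile
  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end
end

# Hypothetical miniature of the extractor. Each "page" is an array of table names.
class SketchExtractor
  def initialize(all_pages, pages, file)
    @all_pages = all_pages # every page in the document
    @pages = pages         # one-indexed page numbers to visit
    @file = file
  end

  def extract
    Enumerator.new do |y|
      begin
        @pages.each do |i|
          # Assumption, not tabula behavior: skip page numbers outside the document.
          next unless i.between?(1, @all_pages.size)

          @all_pages[i - 1].each { |table| y.yield i, table }
        end
      ensure
        @file.close # runs when the enumeration finishes or raises
      end
    end
  end
end

file = FakeFile.new
extractor = SketchExtractor.new([%w[a b], [], %w[c]], [1, 2, 3, 5], file)

results = extractor.extract
file.closed   # => false, nothing has run yet
results.to_a  # => [[1, "a"], [1, "b"], [3, "c"]]
file.closed   # => true
```

Because the block runs only when the Enumerator is consumed, the file stays open until the caller iterates. In this sketch, page 2 has no tables and page 5 does not exist, so neither contributes a pair.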