act_as_page_extractor

Library for extracting plain text from documents(files) for further processing (indexing and searching).

Installation

Install appropriate tools before using:

sudo apt-get install zlib1g zlib1g-dev zip rar p7zip-full

Add this line to your application's Gemfile:

gem 'act_as_page_extractor'
bundle

Usage

For example, for model Document in the Rails framework we need run:

rails g act_as_page_extractor:migration Document category_id user_id

As a result we get two migration files:

class AddPageExtractorFields < ActiveRecord::Migration
  def change
    add_column :documents, :page_extraction_state, :string, default: ''
    add_column :documents, :page_extraction_pages, :integer, default: 0
    add_column :documents, :page_extraction_doctype, :string, default: ''
    add_column :documents, :page_extraction_filesize, :string, default: ''
  end
end

class CreateExtractedPages < ActiveRecord::Migration
  def change
    create_table :extracted_pages do |t|
      t.text :page
      t.integer :document_id
      t.integer :category_id
      t.integer :user_id
      t.integer :page_number

      t.timestamps null: false
    end

    add_index :extracted_pages, :document_id
    add_index :extracted_pages, :category_id
    add_index :extracted_pages, [:document_id, :category_id]
    add_index :extracted_pages, [:document_id, :page_number]
  end
end

Model Document must have field which contains path to file(supports different archive types that contains txt, pdf, doc/x, txt, html, rtf, ...)

Add to model next parameters for initializing:

  class Document < ActiveRecord::Base
    include ActAsPageExtractor

    act_as_page_extractor options: {
      document_class:    'Document',
      save_as_pdf:       true,
      filename:          :filename,
      document_id:       :document_id,
      additional_fields: [:category_id, :user_id],
      #file_storage:      "/full/path/to/documents/storage",
      #pdf_storage:       "/full/path/to/extracted/pdf/storage"
    }

    has_many :extracted_pages, dependent: :destroy
end

Now our instance has few new methods:

document = Document.first
document.page_extract!
document.extracted_pages
document.pdf_path # if option 'save_as_pdf' is 'true'

# Access to pages
ExtractedPage.count

# Importing whole directory of documents
ActAsPageExtractor.import_files('/path/to/foler/with/documents')

# We can use cron for run the processing of all the new documents
ActAsPageExtractor.start_extraction

# Getting statistics information of all documents
ActAsPageExtractor.statistics

Parameters of initializing act_as_page_extractor:

  • document_class - name of model (e.g. Document)
  • save_as_pdf - boolean [true, false] when we want save temporary pdf
  • filename - name of field which contains access to the file and it should be an object with 'url' method that returns path to file (e.g. CarrierWave object with 'filename.url')
  • document_id - name for saving id
  • additional_fields - additional fields that added to extracted page (e.g. for indexing, etc.)
  • file_storage - path for saving tmp files (by default it is "public")
  • pdf_storage - path for saving pdf (by default it is "public/uploads/extracted/pdf")

Run tests

rspec

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

Contacts

https://github.com/phlowerteam [email protected]

License

Copyright (c) 2024 PhlowerTeam MIT License