rTika

A JRuby wrapper around the excellent Apache Tika content extraction library. Feed rTika your files and get extracted text and metadata in return.

Usage

Make sure you’re on JRuby first.

require 'rubygems'
require 'rtika'

result = RTika::FileParser.parse("mywordfile.doc")
puts result.content # prints out the document's contents
puts result.title   # fetches title from the doc's metadata
puts result.author  # fetches author from the doc's metadata

result = RTika::StringParser.parse("<html>

<head><title>MYTITLE</title></head> <body>this is my very … long … string</body></html>“)

puts result.content # returns <body> contents
puts result.title   # returns <title> contents

Options

:remove_boilerplate => true
  # uses the Boilerpipe library that ships with Tika to remove headers & footers

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)

  • Send me a pull request. Bonus points for topic branches.

Copyright © 2010 Pradeep Elankumaran. See LICENSE for details.