PostgreSQLCursor for handling large Result Sets

PostgreSQLCursor extends ActiveRecord to allow for efficient processing of queries returning a large number of rows.

Why use this?

ActiveRecord is designed and optimized for web performance. In a web transaction, only a “page” of around 20 rows is returned to the user. When you do this

Model.find(:all, :conditions=>["id>0"]

The database returns all matching result set rows to ActiveRecord, which instantiates each row with the data returned. This function returns an array of all these rows to the caller.

Asyncronous, Background, or Offline processing may require processing a large amount of data. When there is a very large number of rows, this requires a lot more memory to hold the data. Ruby does not return that memory after processing the array, and the causes your process to “bloat”. If you don’t have enough memory, it will cause an exception.

Enter find_each

To solve this problem, ActiveRecord gives us two alternative methods that work in “chunks” of your data:

Model.where("id>0").find_each { |model| model.process! }

Model.where("id>0").find_in_batches do |batch|
  batch.each { |model| model.process! }
end

Optionally, you can specify a :batch_size option as the size of the “chunk”, and defaults to 1000.

There are drawbacks with these methods:

You cannot specify the order, it will be ordered by the primary key (usually id)
The primary key must be numeric
The query is rerun for each chunk (1000 rows), starting at the next id sequence.
You cannot use overly complex queries as that will be rerun and incur more overhead.

PostgreSQLCursor FTW!

PostgreSQLCursor was developed to take advantage of PostgreSQL’s cursors. Cursors allow the program to declare a cursor to run a given query returning “chunks” of rows to the application program while retaining the position of the full result set in the database. This overcomes all the disadvantages of using find_each and find_in_batches.

Also, with PostgreSQL, you have on option to have raw hashes of the row returned instead of the instantiated models. An informal benchmark showed that returning instances is a factor of 4 times slower than returning hashes. If you are can work with the data in this form, you will find better performance.

With PostgreSQL, you can work with cursors as follows:

Model.where("id>0").each_row { |hash| Model.process(hash) }

Model.where("id>0").each_instance { |model| model.process! }
Model.where("id>0").each_instance(buffer_size:100000) { |model| model.process! }

Model.each_row_by_sql("select * from models") { |hash| Model.process(hash) }

Model.each_instance_by_sql("select * from models") { |model| model.process }

All these methods take an options hash to control things more:

buffer_size:n     The number of rows to fetch from the database each time (default 1000)
while:value       Continue looping as long as the block returns this value
until:value       Continue looping until the block returns this value
connection:conn   Use this connection instead of the current model connection
fraction:float    A value to set for the cursor_tuple_fraction variable.
                  PostgreSQL uses 0.1 (optimize for 10% of result set)
                  This library uses 1.0 (Optimize for 100% of the result set)
                  Do not override this value unless you understand it.

Authors

Allen Fair, [email protected], github.com/afair

Thank you to:

Iulian Dogariu, github.com/iulianu (Fixes)
Julian Mehnle, [email protected] (Suggestions)

Note on Patches/Pull Requests