EachBatch
Improved batch processing in Rails.
This gem provides a new method called each_batch
to ActiveRecord relations, similar to the built-in in_batches
.
There are two main issues this gem attempts to tackle:
- No custom ordering. Rails' default and only behaviour is to order the results by the primary key.
- No proper use of indexes, because of the inability to set custom ordering.
- No efficient pluck solution in batches.
Example
Suppose we have a huge products table:
ActiveRecord::Schema.define(version: 1) do
create_table :products do |t|
t.datetime :enabled_at, index: true
t.integer :sales
t.
end
end
And suppose we want to process a subset of the data based on the enabled_at value: In Rails, one can do this:
Product.
where(enabled_at: a_date_range).
find_each { |product| product.do_something }
This would generate SQL similar to this:
SELECT `products`.*
FROM `products`
WHERE `products`.`enabled_at` BETWEEN '2023-05-28 00:00:00' AND '2023-06-04 23:59:59'
ORDER BY `products`.`id` ASC LIMIT 1000
And for subsequent batches something like this:
SELECT `products`.*
FROM `products`
WHERE `products`.`enabled_at` BETWEEN '2023-05-28 00:00:00' AND '2023-06-04 23:59:59'
AND `products`.`id` > 123456
ORDER BY `products`.`id` ASC LIMIT 1000
The order clause here is what can kill performance! It doesn't utilize the index properly because of that.
With this gem, one can write this instead:
Product.
where(enabled_at: a_date_range).
each_batch(keys: [:enabled_at, :id]).
each_record { |product| product.do_something }
Which would generate something like the following:
SELECT `products`.*
FROM `products`
WHERE `products`.`enabled_at` BETWEEN '2023-05-28 00:00:00' AND '2023-06-04 23:59:59'
ORDER BY `products`.`enabled_at` ASC, `products`.`id` ASC LIMIT 1000
This order matches the index one and the index will be utilized properly.
For subsequent batches:
SELECT `products`.*
FROM `products`
WHERE `products`.`enabled_at` BETWEEN '2023-05-28 00:00:00' AND '2023-06-04 23:59:59'
AND (`products`.`enabled_at`, `products`.`id`) > ('2023-05-29 00:00:00', 123456)
ORDER BY `products`.`enabled_at` ASC, `products`.`id` ASC LIMIT 1000
which again, utilizes the index properly.
Note: the generated query is not exactly like the above, see where_row for more info.
Installation
Add this line to your application's Gemfile:
gem 'each_batch'
And then execute:
$ bundle
Or install it yourself as:
$ gem install each_batch
Usage
Iterating in batches
To simply iterate a relation in batches:
Product.each_batch do |batch|
# do something useful
end
By default the batch size is 1000. To override, use the of
option:
Product.each_batch(of: 500) do |batch|
# do something useful
end
Naturally, any relation can be batched, like so:
Product.where(enabled_at: a_date_range).each_batch(of: 500) do |batch|
# do something useful
end
Assuming that products
is a huge table with an index on enabled_at
, it would make more sense to order the results by this date. And in order to have a deterministic order (many products could be updated on the same time), ordering by enabled_at
and id
could prove optimal. To do that, use the keys
option:
Product.where(enabled_at: a_date_range).each_batch(of: 500, keys: [:enabled_at, :id]) do |batch|
# do something useful
end
To change the order, use the order
option (by default is accending):
Product.where(enabled_at: a_date_range).each_batch(of: 500, order: :desc, keys: [:enabled_at, :id]) do |batch|
# do something useful
end
To access each record instead of the whole relation (this preloads the relation), use each_record
:
Product.where(enabled_at: a_date_range).each_batch(of: 500, keys: [:enabled_at, :id]).each_record do |record|
# do something useful
end
Plucking in batches
To simply iterate over pluck results in batches, use the pluck
method:
Product.where(enabled_at: a_date_range).each_batch(of: 500, keys: [:enabled_at, :id]).pluck(:id, :enabled_at) do |pluck_batch|
# do something useful
end
To iterate over each row instead, use the each_row
method:
Product.where(enabled_at: a_date_range).each_batch(of: 500, keys: [:enabled_at, :id]).pluck(:id, :enabled_at).each_row do |(id, enabled_at)|
# do something useful
end
Enumerator usage
Since these methods return an enumerator, they can be chained with regular enumerator methods:
Product.
where(enabled_at: a_date_range).
each_batch(of: 500, keys: [:enabled_at, :id]).
pluck(:id, :enabled_at, :sales).
each_row.
sum(&:first)
Caveats
Empty results
By default, each_batch
does not preload any records, it just build the necessary queries and yields the relation. This means that it can not know in advance
whether there are any records for the specified conditions. In practice this means that it will always yield a relation, even if it's empty.
This also applies if the result set is a multiple of the batch size. There is no way to deduce that no more results are present, so it will return an empty relation.
each_record
, pluck
and each_row
do not suffer from this since they preload the records/values necessary to deduce that.
Missing keys for select or pluck
For the algorithm to work, we need the last values for each of the keys specified. This means that there must a exist a select clause with those columns:
Product.each_batch(of: 500, keys: [:enabled_at, :id]) # ok
Product.select(:id, :sales).each_batch(of: 500, keys: [:enabled_at, :id]) # ArgumentError
Product.select(:id, :enabled_at, :sales).each_batch(of: 500, keys: [:enabled_at, :id]) # Ok
Product.each_batch(of: 500, keys: [:enabled_at, :id]).pluck # Ok, plucks everything
Product.each_batch(of: 500, keys: [:enabled_at, :id]).pluck(:id, :sales) # ArgumentError
Product.each_batch(of: 500, keys: [:enabled_at, :id]).pluck(:id, :enabled_at) # Ok
Product.each_batch(of: 500, keys: [:enabled_at, :id]).pluck(:enabled_at, :id) # Ok
Can't omit primary key
To make this method safer, we can not specify an ordering that doesn't have the primary key as the last order key. This is in order to always guarantee deterministic ordering. This could be relaxed, possible with an extra option unsafe
, to make it explicit.
Race conditions
This is inherent to batch processing in general.
Alternatives
I am probably missing a lot here..
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/odydoum/each_batch.
License
The gem is available as open source under the terms of the MIT License.