parquet-ruby


This project is a Ruby library that wraps the parquet-rs Rust crate.

At the moment, it only supports reading: rows or column batches can be iterated as either hashes or arrays.

Usage

This library provides high-level bindings to parquet-rs with two primary APIs for reading Parquet files: row-wise and column-wise iteration. The column-wise API generally offers better performance, especially when working with a subset of columns.

Row-wise Iteration

The each_row method provides sequential access to individual rows:

require "parquet"

# Basic usage with default hash output
Parquet.each_row("data.parquet") do |row|
  puts row.inspect  # {"id"=>1, "name"=>"name_1"}
end

# Array output for more efficient memory usage
Parquet.each_row("data.parquet", result_type: :array) do |row|
  puts row.inspect  # [1, "name_1"]
end

# Select specific columns to reduce I/O
Parquet.each_row("data.parquet", columns: ["id", "name"]) do |row|
  puts row.inspect
end

# Reading from IO objects
File.open("data.parquet", "rb") do |file|
  Parquet.each_row(file) do |row|
    puts row.inspect
  end
end

Column-wise Iteration

The each_column method reads data in column-oriented batches, which is typically more efficient for analytical queries:

require "parquet"

# Process columns in batches of 1024 rows
Parquet.each_column("data.parquet", batch_size: 1024) do |batch|
  # With result_type: :hash (default)
  puts batch.inspect
  # {
  #   "id" => [1, 2, ..., 1024],
  #   "name" => ["name_1", "name_2", ..., "name_1024"]
  # }
end

# Array output with specific columns
Parquet.each_column("data.parquet",
                    columns: ["id", "name"],
                    result_type: :array,
                    batch_size: 1024) do |batch|
  puts batch.inspect
  # [
  #   [1, 2, ..., 1024],           # id column
  #   ["name_1", "name_2", ...]    # name column
  # ]
end
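
As a quick sketch of how batches can be combined, the example below totals a numeric column across all batches. The "id" column name and the aggregation are only illustrative assumptions, not part of the API.

require "parquet"

# Hypothetical: sum the "id" column across every batch
total = 0
Parquet.each_column("data.parquet", columns: ["id"], batch_size: 1024) do |batch|
  # With result_type: :hash (the default), batch maps column name => array of values
  total += batch["id"].sum
end
puts total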

Arguments

Both methods accept these common arguments:

  • input: Path string or IO-like object containing Parquet data
  • result_type: Output format (:hash or :array, defaults to :hash)
  • columns: Optional array of column names to read (improves performance)

Additional arguments for each_column:

  • batch_size: Number of rows per batch (defaults to an implementation-defined value)

When no block is given, both methods return an Enumerator.
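
For instance, the returned Enumerator can be combined with ordinary Ruby enumerable methods. This is a minimal sketch assuming standard Enumerator behavior:

require "parquet"

# Take the first ten rows without passing a block
rows = Parquet.each_row("data.parquet")
puts rows.first(10).inspect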