Rover
Simple, powerful data frames for Ruby
:mountain: Designed for data exploration and machine learning, and powered by Numo
:evergreen_tree: Uses Vega for visualization
Installation
Add this line to your application’s Gemfile:
gem "rover-df"
Intro
A data frame is an in-memory table, a useful data structure for data analysis and machine learning. It uses columnar storage, which makes operations on whole columns fast.
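The difference between a row-oriented and a column-oriented layout can be sketched in plain Ruby. This is only an illustration of the idea, not Rover's internals (Rover is powered by Numo). `rows_to_columns` is a hypothetical helper, and it assumes every row has the same keys.

```ruby
# Plain-Ruby illustration only; not Rover's implementation.

# Row-oriented layout: one hash per row.
rows = [
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
]

# Hypothetical helper: convert rows to a column-oriented layout
# (one array per column). Assumes all rows share the same keys.
def rows_to_columns(rows)
  columns = {}
  rows.each do |row|
    row.each { |key, value| (columns[key] ||= []) << value }
  end
  columns
end

columns = rows_to_columns(rows)
p columns.keys      # [:a, :b]
p columns[:a].sum   # 6
```

With the columnar layout, summing `a` touches a single array instead of walking every row hash.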
Try it out for forecasting by clicking the button below (it can take a few minutes to start):
Use the Run button (or SHIFT + ENTER) to run each line.
Creating Data Frames
From an array
Rover::DataFrame.new([
{a: 1, b: "one"},
{a: 2, b: "two"},
{a: 3, b: "three"}
])
From a hash
Rover::DataFrame.new({
a: [1, 2, 3],
b: ["one", "two", "three"]
})
From Active Record
Rover::DataFrame.new(User.all)
From a CSV
Rover.read_csv("file.csv")
# or
Rover.parse_csv("CSV,data,string")
From Parquet (requires the red-parquet gem)
Rover.read_parquet("file.parquet")
# or
Rover.parse_parquet("PAR1...")
Attributes
Get number of rows
df.count
Get column names
df.keys
Check if a column exists
df.include?(name)
Selecting Data
Select a column
df[:a]
Note that strings and symbols are different keys, just as they are in a hash. Data frames created from Active Record, a CSV, or Parquet use string keys.
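The hash behavior this note refers to can be seen with a plain Ruby hash. The sketch below uses no Rover calls, and `symbolize_keys` is a hypothetical helper for illustration.

```ruby
# A plain hash with a string key, standing in for data loaded from a CSV.
columns = {"a" => [1, 2, 3]}

p columns["a"]   # [1, 2, 3]
p columns[:a]    # nil, because :a and "a" are different keys

# Hypothetical helper: return a copy of the hash with symbol keys.
def symbolize_keys(hash)
  hash.transform_keys(&:to_sym)
end

p symbolize_keys(columns)[:a]   # [1, 2, 3]
```

By the same logic, a data frame read from a CSV would be indexed with `df["a"]` rather than `df[:a]`.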
Select multiple columns
df[[:a, :b]]
Select first rows
df.head
# or
df.first(5)
Select last rows
df.tail
# or
df.last(5)
Select rows by index
df[1]
# or
df[1..3]
# or
df[[1, 4, 5]]
Iterate over rows
df.each_row { |row| ... }
Iterate over a column
df[:a].each { |item| ... }
# or
df[:a].each_with_index { |item, index| ... }
Filtering
Filter on a condition
df[df[:a] == 100]
df[df[:a] != 100]
df[df[:a] > 100]
df[df[:a] >= 100]
df[df[:a] < 100]
df[df[:a] <= 100]
In
df[df[:a].in?([1, 2, 3])]
df[df[:a].in?(1..3)]
df[df[:a].in?(["a", "b", "c"])]
Not in
df[!df[:a].in?([1, 2, 3])]
And, or, and exclusive or
df[(df[:a] > 100) & (df[:b] == "one")] # and
df[(df[:a] > 100) | (df[:b] == "one")] # or
df[(df[:a] > 100) ^ (df[:b] == "one")] # xor
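The parentheses in these expressions matter: in Ruby, `&`, `|`, and `^` bind more tightly than comparison operators such as `>` and `==`. The idea behind combining conditions can be sketched with plain arrays of booleans. This is an illustration under that assumption, not Rover's implementation; `combine` and `selected_rows` are hypothetical helpers.

```ruby
a = [50, 150, 200]
b = ["one", "one", "two"]

# Each comparison yields one boolean per row.
mask_a = a.map { |v| v > 100 }      # [false, true, true]
mask_b = b.map { |v| v == "one" }   # [true, true, false]

# Hypothetical helper: combine two masks elementwise with :&, :|, or :^.
def combine(left, right, op)
  left.zip(right).map { |l, r| l.public_send(op, r) }
end

# Hypothetical helper: indexes of the rows where the mask is true.
def selected_rows(mask)
  mask.each_index.select { |i| mask[i] }
end

p selected_rows(combine(mask_a, mask_b, :&))   # [1]
p selected_rows(combine(mask_a, mask_b, :|))   # [0, 1, 2]
p selected_rows(combine(mask_a, mask_b, :^))   # [0, 2]
```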
Operations
Basic operations
df[:a] + 5
df[:a] - 5
df[:a] * 5
df[:a] / 5
df[:a] % 5
df[:a] ** 2
df[:a].sqrt
df[:a].cbrt
df[:a].abs
Rounding
df[:a].round
df[:a].ceil
df[:a].floor
Logarithm
df[:a].ln # or log
df[:a].log(5)
df[:a].log10
df[:a].log2
Exponentiation
df[:a].exp
df[:a].exp2
Trigonometric functions
df[:a].sin
df[:a].cos
df[:a].tan
df[:a].asin
df[:a].acos
df[:a].atan
Hyperbolic functions
df[:a].sinh
df[:a].cosh
df[:a].tanh
df[:a].asinh
df[:a].acosh
df[:a].atanh
Error function
df[:a].erf
df[:a].erfc
Summary statistics
df[:a].count
df[:a].sum
df[:a].mean
df[:a].median
df[:a].percentile(90)
df[:a].min
df[:a].max
df[:a].std
df[:a].var
Count occurrences
df[:a].tally
Cross tabulation
df[:a].crosstab(df[:b])
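To show what these two operations count, here is a plain-Ruby sketch using `Array#tally` (Ruby 2.7+). It only computes the counts; how Rover lays out the result of `crosstab` is not shown here, and `pair_counts` is a hypothetical helper.

```ruby
a = ["x", "y", "x", "x"]
b = ["u", "u", "v", "u"]

# Counting occurrences: how often each value appears.
p a.tally["x"]   # 3
p a.tally["y"]   # 1

# Hypothetical helper: a cross tabulation counts each (a, b) pair.
def pair_counts(left, right)
  left.zip(right).tally
end

counts = pair_counts(a, b)
p counts[["x", "u"]]   # 2
p counts[["x", "v"]]   # 1
```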
Grouping
Group
df.group(:a).count
Works with all summary statistics
df.group(:a).max(:b)
Multiple groups
df.group(:a, :b).count
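Conceptually, grouping partitions the rows by the values of the group columns and then applies a summary statistic to each partition. A minimal plain-Ruby sketch of that idea follows; the helper names are hypothetical and this is not Rover's implementation.

```ruby
rows = [
  {a: "x", b: 1},
  {a: "y", b: 5},
  {a: "x", b: 3}
]

# Hypothetical helper: partition rows by the values of one or more keys.
def group_rows(rows, *keys)
  rows.group_by { |row| row.values_at(*keys) }
end

# Number of rows in each group.
def group_count(rows, *keys)
  group_rows(rows, *keys).transform_values(&:size)
end

# Largest value of `column` in each group.
def group_max(rows, column, *keys)
  group_rows(rows, *keys).transform_values do |group|
    group.map { |row| row[column] }.max
  end
end

p group_count(rows, :a)[["x"]]          # 2
p group_max(rows, :b, :a)[["x"]]        # 3
p group_count(rows, :a, :b)[["x", 1]]   # 1
```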
Visualization
Add Vega to your application’s Gemfile:
gem "vega"
And use:
df.plot(:a, :b)
Specify the chart type (line, pie, column, bar, area, or scatter)
df.plot(:a, :b, type: "pie")
Group data
df.plot(:a, :b, group: :c)
Stacked columns or bars
df.plot(:a, :b, group: :c, stacked: true)
Updating Data
Add a new column
df[:a] = 1
# or
df[:a] = [1, 2, 3]
Update a single element
df[:a][0] = 100
Update multiple elements
df[:a][0..2] = 1
# or
df[:a][0..2] = [1, 2, 3]
Update all elements
df[:a] = df[:a].map { |v| v.gsub("a", "b") }
# or
df[:a].map! { |v| v.gsub("a", "b") }
Update elements matching a condition
df[:a][df[:a] > 100] = 0
Clamp
df[:a].clamp!(0, 100)
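Clamping limits every value to a range: values below the minimum become the minimum, and values above the maximum become the maximum. The same idea for a plain array, using Ruby's built-in `Comparable#clamp` (`clamp_all` is a hypothetical helper):

```ruby
# Hypothetical helper: clamp every element of a plain array to [low, high].
def clamp_all(values, low, high)
  values.map { |v| v.clamp(low, high) }
end

p clamp_all([-5, 50, 150], 0, 100)   # [0, 50, 100]
```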
Delete columns
df.delete(:a)
# or
df.except!(:a, :b)
Rename columns
df.rename(a: :new_a, b: :new_b)
# or
df[:new_a] = df.delete(:a)
Sort rows
df.sort_by! { |r| r[:a] }
Clear all data
df.clear
Combining Data Frames
Add rows
df.concat(other_df)
Add columns
df.merge!(other_df)
Inner join
df.inner_join(other_df)
# or
df.inner_join(other_df, on: :a)
# or
df.inner_join(other_df, on: [:a, :b])
# or
df.inner_join(other_df, on: {df_col: :other_df_col})
Left join
df.left_join(other_df)
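The difference between the two joins can be sketched with arrays of hashes. An inner join keeps only rows whose key appears on both sides, while a left join keeps every row from the left side. This is a plain-Ruby illustration, not Rover's implementation: `join_rows` is a hypothetical helper, it assumes the two sides share no column names other than the join key, and filling unmatched rows with `nil` is a choice made for this sketch (Rover's missing-value representation may differ).

```ruby
# Hypothetical helper: join two arrays of row hashes on a single key.
# Assumes the only column name shared by both sides is the join key.
def join_rows(left, right, on:, keep_unmatched: false)
  index = right.group_by { |row| row[on] }
  # Right-side columns, filled with nil for unmatched left rows.
  blank = (right.flat_map(&:keys).uniq - [on]).to_h { |key| [key, nil] }

  left.flat_map do |l|
    matches = index[l[on]]
    if matches
      matches.map { |r| l.merge(r) }
    elsif keep_unmatched
      [l.merge(blank)]
    else
      []
    end
  end
end

users  = [{id: 1, name: "a"}, {id: 2, name: "b"}]
orders = [{id: 1, total: 10}]

inner = join_rows(users, orders, on: :id)
left  = join_rows(users, orders, on: :id, keep_unmatched: true)

p inner.length         # 1
p left.length          # 2
p left.last[:total]    # nil
```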
Encoding
One-hot encoding
df.one_hot
Drop a variable in each category to avoid the dummy variable trap
df.one_hot(drop: true)
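One-hot encoding turns a categorical column into one 0/1 column per category. Because those columns always sum to 1 in every row, they are perfectly collinear with an intercept term, which is the dummy variable trap; dropping one category removes the redundancy. A plain-Ruby sketch of the idea follows. It is not Rover's implementation, `one_hot_columns` is a hypothetical helper, and dropping the first category is an arbitrary choice made for this illustration.

```ruby
# Hypothetical helper: one 0/1 array per category of a plain array.
# With drop: true, the first category is left out (arbitrary choice here).
def one_hot_columns(values, drop: false)
  categories = values.uniq
  categories = categories.drop(1) if drop
  categories.to_h do |category|
    [category, values.map { |v| v == category ? 1 : 0 }]
  end
end

colors = ["red", "green", "blue", "red"]

p one_hot_columns(colors).keys               # ["red", "green", "blue"]
p one_hot_columns(colors)["red"]             # [1, 0, 0, 1]
p one_hot_columns(colors, drop: true).keys   # ["green", "blue"]
```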
Conversion
Array of hashes
df.to_a
Hash of arrays
df.to_h
Numo array
df.to_numo
CSV
df.to_csv
Parquet (requires the red-parquet gem)
df.to_parquet
Types
You can specify column types when creating a data frame
Rover::DataFrame.new(data, types: {"a" => :int64, "b" => :float64})
Or
Rover.read_csv("data.csv", types: {"a" => :int64, "b" => :float64})
Supported types are:
- boolean - :bool
- float - :float64, :float32
- integer - :int64, :int32, :int16, :int8
- unsigned integer - :uint64, :uint32, :uint16, :uint8
- object - :object
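When choosing among the integer types, it can help to know the range each width can hold. The sketch below computes the usual fixed-width ranges in plain Ruby, assuming the standard two's-complement layout for signed types. Both helpers are hypothetical, and the sketch says nothing about how Rover handles values outside a column's range.

```ruby
# Value range of a fixed-width integer with the given number of bits.
def integer_range(bits, signed: true)
  if signed
    -(2**(bits - 1))..(2**(bits - 1) - 1)
  else
    0..(2**bits - 1)
  end
end

# Hypothetical helper: the narrowest signed type name that fits all values,
# or nil if even 64 bits are not enough.
def smallest_signed_type(values)
  [8, 16, 32, 64].each do |bits|
    return :"int#{bits}" if values.all? { |v| integer_range(bits).cover?(v) }
  end
  nil
end

p integer_range(8)                  # -128..127
p integer_range(8, signed: false)   # 0..255
p smallest_signed_type([1, 300])    # :int16
```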
Get column types
df.types
For a specific column
df[:a].type
Change the type of a column
df[:a].to!(:int32)
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/rover.git
cd rover
bundle install
bundle exec rake test