RedAmber

A simple dataframe library for Ruby (experimental)

Requirements

gem 'red-arrow',   '>= 7.0.0'
gem 'red-parquet', '>= 7.0.0' # if you use IO from/to parquet
gem 'rover-df',    '~> 0.3.0' # if you use IO from/to Rover::DataFrame

Installation

Add this line to your Gemfile:

gem 'red_amber'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install red_amber

RedAmber::DataFrame

Constructors and saving

  • [x] new from a columnar Hash

    • RedAmber::DataFrame.new(x: [1, 2, 3])
  • [x] new from a schema (by Hash) and rows (by Array)

    • RedAmber::DataFrame.new({:x=>:uint8}, [[1], [2], [3]])
  • [x] new from an Arrow::Table

    • RedAmber::DataFrame.new(Arrow::Table.new(x: [1, 2, 3]))
  • [x] new from a Rover::DataFrame

    • RedAmber::DataFrame.new(Rover::DataFrame.new(x: [1, 2, 3]))
  • [ ] load (class method)

    • [x] from a [.arrow, .arrows, .csv, .csv.gz, .tsv] file
      • RedAmber::DataFrame.load("test/entity/with_header.csv")
    • [x] from a string buffer
    • [x] from a URI
      • RedAmber::DataFrame.load(URI("https://github.com/heronshoes/red_amber/blob/master/test/entity/with_header.csv"))
    • [ ] from a parquet file
  • [ ] save (instance method)

    • [x] to a [.arrow, .arrows, .csv, .csv.gz, .tsv] file
    • [x] to a string buffer
    • [x] to a URI
    • [ ] to a parquet file
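
The Hash constructor above takes column-oriented input, while the schema-plus-rows constructor takes row-oriented input. As a rough illustration of how the two layouts relate, here is a plain-Ruby sketch that does not call RedAmber at all. `rows_to_columns` is a hypothetical helper name chosen for this example, and the second column `y` is made up for illustration.

```ruby
# Plain-Ruby sketch; RedAmber itself is not required here.
# rows_to_columns is a hypothetical helper, not part of the RedAmber API.
def rows_to_columns(schema, rows)
  names = schema.keys
  unless rows.all? { |row| row.size == names.size }
    raise ArgumentError, "every row must have #{names.size} values"
  end

  # Array#transpose flips rows into columns; empty input yields empty columns.
  columns = rows.empty? ? names.map { [] } : rows.transpose
  names.zip(columns).to_h
end

schema = { x: :uint8, y: :string }
rows = [[1, "A"], [2, "B"], [3, "C"]]
columns = rows_to_columns(schema, rows)
# columns[:x] is [1, 2, 3] and columns[:y] is ["A", "B", "C"]
```

The resulting Hash has the same shape as the argument of the first constructor form, which is why both forms can describe the same data.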

Properties

  • [x] table

Reader for the Arrow::Table object held inside.

  • [x] n_rows, nrow, size, length

Returns the number of rows (data size).

  • [x] n_columns, ncol, width

Returns the number of columns (number of vectors).

  • [x] shape

Returns the shape as an Array [n_rows, n_cols].

  • [x] column_names, keys

Returns the column names in an Array.

  • [x] types

Returns types of columns by an Array of Symbols.

  • [x] data_types

Returns types of columns by an Array of Arrow::DataType.

  • [x] vectors

Returns an Array of Vectors.

  • [x] to_h

Returns column-oriented data in a Hash.

  • [x] to_a, raw_records

Returns an Array of row-oriented data without a header. If you need a column-oriented full Array, use .to_h.to_a.

  • [x] schema

Returns column names and data types in a Hash.

  • [x] ==

  • [x] empty?
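
To make the difference between the row-oriented and column-oriented views concrete, the following plain-Ruby sketch mimics the described properties on an ordinary Hash. It does not call RedAmber; the variable names are only illustrative, and the results show the data layouts the descriptions above imply rather than output captured from the library.

```ruby
# Plain-Ruby sketch of the property semantics; it does not call RedAmber.
columns = { a: [1, 2, 3], b: %w[A B C] }

n_rows = columns.values.first.size   # 3
n_columns = columns.size             # 2
shape = [n_rows, n_columns]          # [3, 2]

# Row-oriented data without a header, as described for to_a / raw_records.
row_oriented = columns.values.transpose
# Column-oriented pairs of [key, values], as Hash#to_a produces.
column_oriented = columns.to_a
```

Here `row_oriented` holds one inner Array per row, while `column_oriented` holds one `[key, values]` pair per column.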

Output

  • [x] to_s

  • [ ] summary, describe

  • [x] to_rover

Returns a Rover::DataFrame.

  • [x] inspect(tally_level: 5, max_element: 5)

Shows some information about self.

hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
RedAmber::DataFrame.new(hash)
# =>
RedAmber::DataFrame : 3 observations(rows) of 3 variables(columns)
Variables : 2 numeric, 1 string
# key type   level data_preview
1 :a  uint8      3 [1, 2, 3]
2 :b  string     3 [A, B, C]
3 :c  double     3 [1.0, 2.0, 3.0]
  • tally_level: maximum level at which tally mode is used
  • max_element: maximum number of elements shown in each row
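
The terms "level" and "tally" can be illustrated in plain Ruby. This sketch assumes that "level" means the number of distinct values in a column (which is consistent with the example output above, where each three-row column of distinct values shows level 3) and that tally mode shows a count per distinct value. Both readings are assumptions, and the code does not call RedAmber.

```ruby
# Plain-Ruby sketch; the interpretation of "level" is an assumption.
values = %w[A B A C A B]

level = values.uniq.size    # number of distinct values: 3
tally = values.tally        # occurrences of each distinct value
preview = values.first(5)   # at most max_element (5 here) leading values
```

With these values, `tally` maps "A" to 3, "B" to 2 and "C" to 1, which is more compact than listing every element when a column has few distinct values.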

Selecting

  • [x] Select columns by [] as [key], [keys], [keys[index]]

    • Key in a Symbol: df[:symbol]
    • Key in a String: df["string"]
    • Keys in an Array: df[:symbol1, "string", :symbol2]
    • Keys by indices: df[df.keys[0]], df[df.keys[1, 2]], df[df.keys[1..]]
    • Keys in a Range: An endless Range can be used to represent keys.

hash = {a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3]}
df = RedAmber::DataFrame.new(hash)
df[:b..:c, "a"]
# =>
RedAmber::DataFrame : 3 observations(rows) of 3 variables(columns)
Variables : 2 numeric, 1 string
# key type   level data_preview
1 :b  string     3 [A, B, C]
2 :c  double     3 [1.0, 2.0, 3.0]
3 :a  uint8      3 [1, 2, 3]
  • [x] Select rows by [] as [index], [range], [array]

    • Select a row by index: df[0]
    • Select rows by indices in a Range: df[1..2]
    • Select rows by indices in an Array: df[1, 2]
    • Mixed case: df[2, 0..]
  • [x] Select rows from top or bottom

head(n=5), tail(n=5), first(n=1), last(n=1)

  • [ ] slice
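
One way to picture how Range arguments in [] could be resolved is the plain-Ruby sketch below. It is not RedAmber's implementation: `resolve_keys` and `resolve_rows` are hypothetical helpers, and the ordering and duplicate handling shown for mixed arguments are assumptions.

```ruby
# Plain-Ruby sketch; resolve_keys and resolve_rows are hypothetical helpers.

# Expand a Range of keys (possibly beginless or endless) against known keys.
def resolve_keys(keys, range)
  first = range.begin.nil? ? 0 : keys.index(range.begin.to_sym)
  last = range.end.nil? ? keys.size - 1 : keys.index(range.end.to_sym)
  raise KeyError, "unknown key in #{range}" if first.nil? || last.nil?

  last -= 1 if range.end && range.exclude_end?
  length = last - first + 1
  length.positive? ? keys[first, length] : []
end

# Expand a mix of Integers and Ranges into a flat list of row indices.
def resolve_rows(args, n_rows)
  all = (0...n_rows).to_a
  args.flat_map do |arg|
    case arg
    when Integer then [all.fetch(arg)]   # raises IndexError when out of range
    when Range then all[arg] || []       # Array#[] accepts endless Ranges
    else raise ArgumentError, "unsupported index: #{arg.inspect}"
    end
  end
end

resolve_keys(%i[a b c], :b..:c)   # [:b, :c]
resolve_rows([2, 0..], 3)         # [2, 0, 1, 2]
```

In this sketch a mixed call such as `[2, 0..]` keeps the argument order and repeats row 2; whether RedAmber does the same is not stated in this document.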

Updating

  • [ ] Add a new column

  • [ ] Update a single element

  • [ ] Update multiple elements

  • [ ] Update all elements

  • [ ] Update elements matching a condition

  • [ ] Clamp

  • [ ] Delete columns

  • [ ] Rename a column

  • [ ] Sort rows

  • [ ] Clear data

Treat na data

  • [ ] Drop na (NaN, nil)

  • [ ] Replace na with value

  • [ ] Interpolate na with convolution array

Combining DataFrames

  • [ ] Add rows

  • [ ] Add columns

  • [ ] Inner join

  • [ ] Left join

Encoding

  • [ ] One-hot encoding

Iteration (not implemented)

Filtering (not implemented)

RedAmber::Vector

Constructor

  • [x] Create from a column in a DataFrame

  • [x] New from an Array

Properties

  • [x] to_s

  • [x] values, to_a, entries

  • [x] size, length, n_rows, nrow

  • [x] type

  • [x] data_type

  • [ ] each

  • [ ] chunked?

  • [ ] n_chunks

  • [ ] each_chunk

  • [x] tally

  • [ ] n_nulls

Functions

Unary aggregations: vector.func => Scalar

Method                  Boolean  Numeric  String  Remarks
[x] all                 [x]
[x] any                 [x]
[x] approximate_median           [x]
[x] count               [x]      [x]      [x]
[x] count_distinct      [x]      [x]      [x]
[x] count_uniq          [x]      [x]      [x]     an alias of count_distinct
[ ] index
[x] max                 [x]      [x]      [x]
[x] mean                [x]      [x]
[x] min                 [x]      [x]      [x]
[ ] min_max
[ ] mode
[x] product             [x]      [x]
[ ] quantile
[x] stddev                       [x]
[x] sum                 [x]      [x]
[ ] tdigest
[x] variance                     [x]
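
For readers who want the arithmetic behind a few of these aggregations, here is a plain-Ruby sketch. It does not call RedAmber, and the helper names merely mirror the method names in the table. It assumes the population definitions of variance and standard deviation (dividing by n), which is the default in Apache Arrow's compute functions; whether RedAmber keeps that default is an assumption here.

```ruby
# Plain-Ruby sketch; these helpers are illustrative, not RedAmber methods.
def mean(values)
  values.sum.to_f / values.size
end

# Population variance (divides by n, i.e. ddof = 0); an assumed default.
def variance(values)
  m = mean(values)
  values.sum { |v| (v - m)**2 } / values.size
end

def stddev(values)
  Math.sqrt(variance(values))
end

def count_distinct(values)
  values.uniq.size
end

numbers = [1, 2, 3, 4]
mean(numbers)                 # 2.5
variance(numbers)             # 1.25
count_distinct([1, 1, 2, 3])  # 3
```

Each helper reduces a whole Array to a single value, matching the "vector.func => Scalar" shape of this group.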

Unary element-wise: vector.func => Vector

Method      Boolean  Numeric  String  Remarks
[x] -@               [x]              as -vector
[x] negate           [x]              -@
[x] abs              [x]
[ ] acos             [ ]
[ ] asin             [ ]
[x] atan             [x]
[ ] ceil             [x]
[x] cos              [x]
[ ] floor            [x]
[ ] ln               [ ]
[ ] log10            [ ]
[ ] log1p            [ ]
[ ] log2             [ ]
[x] sign             [x]
[x] sin              [x]
[x] tan              [x]
[ ] trunc            [x]
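
In Ruby, the unary minus operator is an ordinary method named `-@`, which is how `-vector` can map to `negate`. The sketch below shows the mechanism on a tiny stand-in class. `TinyVector` is hypothetical and is not RedAmber::Vector; it only demonstrates the element-wise "vector.func => Vector" pattern.

```ruby
# TinyVector is a hypothetical stand-in class, not RedAmber::Vector.
class TinyVector
  attr_reader :values

  def initialize(values)
    @values = values
  end

  # Element-wise negation returning a new vector.
  def negate
    TinyVector.new(values.map { |v| -v })
  end
  # Defining -@ makes the expression `-vector` call negate.
  alias_method :-@, :negate

  def abs
    TinyVector.new(values.map(&:abs))
  end

  # -1, 0 or 1 for each element, via the comparison operator.
  def sign
    TinyVector.new(values.map { |v| v <=> 0 })
  end
end

v = TinyVector.new([1, -2, 0])
(-v).values    # [-1, 2, 0]
v.abs.values   # [1, 2, 0]
v.sign.values  # [1, -1, 0]
```

Every method returns a new object instead of mutating the receiver, so calls can be chained.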

Binary element-wise: vector.func(vector) => Vector

Method              Boolean  Numeric  String  Remarks
[x] add                      [x]              +
[x] atan2                    [x]
[x] and             [x]
[x] and_kleene      [x]
[x] and_not         [x]
[x] and_not_kleene  [x]
[x] bit_wise_and             ([x])            &, integer only
[ ] bit_wise_not             ([x])            !, integer only
[x] bit_wise_or              ([x])            |, integer only
[x] bit_wise_xor             ([x])            ^, integer only
[x] divide                   [x]              /
[x] equal           [x]      [x]      [x]     ==, alias eq
[x] greater         [x]      [x]      [x]     >, alias gt
[x] greater_equal   [x]      [x]      [x]     >=, alias ge
[x] less            [x]      [x]      [x]     <, alias lt
[x] less_equal      [x]      [x]      [x]     <=, alias le
[ ] logb                     [ ]
[ ] mod                      [ ]
[x] multiply                 [x]              *
[x] not_equal       [x]      [x]      [x]     !=, alias ne
[x] or              [x]
[x] or_kleene       [x]
[x] power                    [x]              **
[x] subtract                 [x]              -
[x] shift_left               ([x])            <<, integer only
[x] shift_right              ([x])            >>, integer only
[x] xor             [x]
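
The `_kleene` variants differ from the plain logical functions only in how they treat missing values. In Apache Arrow's compute functions, the Kleene forms treat a null as "unknown" and still return a definite answer when the other operand decides the result, while the plain forms return null whenever either input is null. The plain-Ruby sketch below models null as `nil`; the helper names are illustrative and the code does not call RedAmber.

```ruby
# Plain-Ruby sketch of three-valued logic, with nil standing in for null.
def and_kleene(a, b)
  return false if a == false || b == false   # false decides the result
  return nil if a.nil? || b.nil?             # otherwise unknown stays unknown
  true
end

def or_kleene(a, b)
  return true if a == true || b == true      # true decides the result
  return nil if a.nil? || b.nil?
  false
end

# Non-Kleene "and": any nil input gives nil.
def and_strict(a, b)
  return nil if a.nil? || b.nil?
  a && b
end

left = [true, false, nil]
right = [nil, nil, nil]

kleene = left.zip(right).map { |a, b| and_kleene(a, b) }  # [nil, false, nil]
strict = left.zip(right).map { |a, b| and_strict(a, b) }  # [nil, nil, nil]
```

The only position where the two results differ is the `false`/`nil` pair, where the Kleene form can already conclude `false`.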
(Not implemented)
  • [ ] invert, round, round_to_multiple
  • [ ] sort, sort_index
  • [ ] minmax, var, median, quantile
  • [ ] argmin, argmax

Coerce (not implemented)

Updating (not implemented)

DSL in a block for faster calculation ?

Development

git clone https://github.com/heronshoes/red_amber.git
cd red_amber
bundle install
bundle exec rake test

License

The gem is available as open source under the terms of the MIT License.