daru
Data Analysis in RUby
Introduction
daru (Data Analysis in RUby) is a library for storage, analysis and manipulation of data.
Development of daru was started to address the fragmentation of Dataframe-like classes which were created in many ruby gems as per their own needs. daru offers a uniform interface for all sorts of data analysis and manipulation operations and aims to be compatible with all ruby gems involved in any way with data.
daru is inspired by Statsample::Dataset
and pandas, a very mature solution in Python.
daru works with CRuby (1.9.3+) and JRuby.
Features
- Data structures:
- Vector - A basic 1-D vector.
- DataFrame - A 2-D matrix-like structure which is internally composed of named
Vector
classes.
- Compatible with IRuby notebook.
- Indexed and named data structures.
- Flexible and intuitive API for manipulation and analysis of data.
Notebooks
Blog Posts
Documentation
Docs can be found here.
Basic Usage
daru has been created with keeping extreme ease of use in mind.
The gem consists of two data structures, Vector and DataFrame. Any data in a serial format is a Vector and a table is a DataFrame.
Initialization of DataFrame
A data frame can be initialized from the following sources:
- Hash of indexed order:
{ b: Daru::Vector.new(:b, [11,12,13,14,15], [:two, :one, :four, :five, :three]), a: Daru::Vector.new(:a, [1,2,3,4,5], [:two,:one,:three, :four, :five])}
. - Array of hashes:
[{a: 1, b: 11}, {a: 2, b: 12}, {a: 3, b: 13},{a: 4, b: 14}, {a: 5, b: 15}]
. - Hash of names and Arrays:
{b: [11,12,13,14,15], a: [1,2,3,4,5]}
The DataFrame constructor takes 4 arguments: source, vectors, indexes and name in that order. The last 3 are optional while the first is mandatory.
A basic DataFrame can be initialized like this:
df = Daru::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5]}, order: [:a, :b], index: [:one, :two, :three, :four, :five])
df
# =>
# # <Daru::DataFrame:87274040 @name = 7308c587-4073-4e7d-b3ca-3679d1dcc946 # @size = 5>
# a b
# one 1 11
# two 2 12
# three 3 13
# four 4 14
# five 5 15
Daru will automatically align the vectors correctly according to the specified index and then create the DataFrame. Thus, elements having the same index will show up in the same row. The indexes will be arranged alphabetically if vectors with unaligned indexes are supplied.
The vectors of the DataFrame will be arranged according to the array specified in the (optional) second argument. Otherwise the vectors are ordered alphabetically.
df = Daru::DataFrame.new({
b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
},
order: [:a, :b]
)
df
# =>
# #<Daru::DataFrame:87363700 @name = 75ba0a14-8291-48ac-ac30-35017e4d6c5f # @size = 5>
# a b
# five 5 14
# four 4 13
# one 2 12
# three 3 15
# two 1 11
If an index for the DataFrame is supplied (third argument), then the indexes of the individual vectors will be matched to the DataFrame index. If any of the indexes do not match, nils will be inserted instead:
df = Daru::DataFrame.new({
b: [11] .dv(nil, [:one]),
a: [1,2,3] .dv(nil, [:one, :two, :three]),
c: [11,22,33,44,55] .dv(nil, [:one, :two, :three, :four, :five]),
d: [49,69,89,99,108,44].dv(nil, [:one, :two, :three, :four, :five, :six])
}, order: [:a, :b, :c, :d], index: [:one, :two, :three, :four, :five, :six])
df
# =>
# #<Daru::DataFrame:87523270 @name = bda4eb68-afdd-4404-9981-708edab14201 #@size = 6>
# a b c d
# one 1 11 11 49
# two 2 nil 22 69
# three 3 nil 33 89
# four nil nil 44 99
# five nil nil 55 108
# six nil nil nil 44
If some of the supplied vectors do not contain certain indexes that are contained in other vectors, they are added to those vectors and the correspoding elements are set to nil
.
df = Daru::DataFrame.new({
b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
a: [1,2,3] .dv(:a, [:two,:one,:three])
},
order: [:a, :b]
)
df
# =>
# #<Daru::DataFrame:87612510 @name = 1e904c15-e095-4dce-bfdf-c07ee4d6e4a4 # @size = 5>
# a b
# five nil 14
# four nil 13
# one 2 12
# three 3 15
# two 1 11
Initialization of Vector
The Vector
data structure is also named and indexed. It accepts arguments name, source, index (in that order).
In the simplest case it can be constructed like this:
dv = Daru::Vector.new [1,2,3,4,5], name: ravan, index: [:ek, :don, :teen, :char, :pach]
dv
# =>
# #<Daru::Vector:87630270 @name = ravan @size = 5 >
# ravan
# ek 1
# don 2
# teen 3
# char 4
# pach 5
Initializing a vector with indexes will insert nils in places where elements dont exist:
dv = Daru::Vector.new [1,2,3], name: yoga, index: [0,1,2,3,4]
dv
# =>
# #<Daru::Vector:87890840 @name = yoga @size = 5 >
# y
# 0 1
# 1 2
# 2 3
# 3 nil
# 4 nil
Basic Selection Operations
Initialize a dataframe:
df = Daru::DataFrame.new({
b: [11,12,13,14,15].dv(:b, [:two, :one, :four, :five, :three]),
a: [1,2,3,4,5].dv(:a, [:two,:one,:three, :four, :five])
},
order: [:a, :b]
)
# =>
# #<Daru::DataFrame:87455010 @name = b3d14e23-98c2-4741-a563-92e8f1fd0f13 # @size = 5>
# a b
# five 5 14
# four 4 13
# one 2 12
# three 3 15
# two 1 11
Select a row from a DataFrame:
df.row[:one]
# =>
# #<Daru::Vector:87432070 @name = one @size = 2 >
# one
# a 2
# b 12
A row or a vector is returned as a Daru::Vector
object, so any manipulations supported by Daru::Vector
can be performed on the chosen row as well.
Select multiple rows with a Range and get a DataFrame in return:
df.row[1..3] # OR df.row[:four..:three]
# =>
#<Daru::DataFrame:85361520 @name = d6582f66-5a55-473e-ba57-cb2ba974da6a @size #= 3>
# a b
# four 4 13
# one 2 12
# three 3 15
Select a single vector:
df.vector[:a] # or simply df.a
# =>
# #<Daru::Vector:87454270 @name = a @size = 5 >
# a
# five 5
# four 4
# one 2
# three 3
# two 1
Select multiple vectors and return a DataFrame in the specified order:
df.vector[:b, :a]
# =>
# #<Daru::DataFrame:87835960 @name = e80902cc-cff9-4b23-9eca-5da36ebc88a8 # @size = 5>
# b a
# five 14 5
# four 13 4
# one 12 2
# three 15 3
# two 11 1
Keep/remove row according to a specified condition:
df = df.filter_rows do |row|
row[:a] == 5
end
df
# =>
# #<Daru::DataFrame:87455010 @name = b3d14e23-98c2-4741-a563-92e8f1fd0f13 # @size = 1>
# a b
# five 5 14
The same can be applied to vectors using filter_vectors
.
To iterate over a DataFrame and perform operations on rows or vectors, use #each_row
or #each_vector
.
To change the values of a row/vector while iterating through the DataFrame, use map_rows
or map_vectors
:
df.map_rows do |row|
row = row * row
end
df
# =>
# #<Daru::DataFrame:86826830 @name = b092ca5b-7b83-4dbe-a469-124f7f25a568 # @size = 5>
# a b
# five 25 196
# four 16 169
# one 4 144
# three 9 225
# two 1 121
Rows/vectors can be deleted using delete_row
or delete_vector
.
Basic Maths Operations
Performing a binary arithmetic operation on two Daru::Vector
objects will return a Vector
object in which the operation will be performed on elements of the same index.
dv1 = Daru::Vector.new [1,2,3,4], name: :boozy, index: [:a, :b, :c, :d]
dv2 = Daru::Vector.new [1,2,3,4], name: :mayer, index: [:e, :f, :b, :d]
dv1 * dv2
# #<Daru::Vector:80924700 @name = boozy @size = 2 >
# boozy
# b 6
# d 16
Arithmetic operators applied on a single Numeric will perform the operation with that number against the entire vector.
Statistics Operations
Daru::Vector has a whole lot of statistics operations to maintain compatibility with Statsample::Vector. Check the docs for details.
Plotting
daru uses Nyaplot for plotting and an example of this can be found in the notebook or blog post.
Head over to the tutorials and notebooks listed above for more examples.
Roadmap
- Automate testing for both MRI and JRuby.
- Enable creation of DataFrame by only specifying an NMatrix/MDArray in initialize. Vector naming happens automatically (alphabetic) or is specified in an Array.
- Destructive map iterators for DataFrame.
- Completely test all functionality for NMatrix and MDArray.
- Basic Data manipulation and analysis operations:
- Different kinds of join operations
- Dataframe/vector merge
- Creation of correlation, covariance matrices
- Verification of data in a vector
- Transpose a dataframe.
- Option to express a DataFrame as an NMatrix or MDArray so as to use more efficient storage techniques.
- Assignment of a column to a single number should set the entire column to that number.
- == between daru_vector and string/number.
- Multiple column assignment with []=
- Creation of DataFrame from Array of Arrays.
- Multiple value assignment for vectors with []=.
- Load DataFrame from multiple sources (excel, SQL, etc.).
- Deletion of elements from Vector should only modify the index and leave the vector as it is so that compacting is not needed and things are faster.
- Add a #sync method which will sync the modified index with the unmodified vector.
- Ability to reorder the index of a dataframe.
- head/tail for DV.
- #find_max function which will evaluate a block and return the row for the value of the block is max.
- Function to check if a value of a row/vector is within a specified range.
- Create a new vector in map_rows if any of the already present rows dont match the one assigned in the block.
- Direct functions to answer something like 'number of something per thousand of something else'.
- Tests for checking NMatrix resizing
- Sort while preserving index.
Contributing
Pick a feature from the Roadmap above or think of your own and send me a Pull Request!
Acknowledgements
- Thank you last.fm for making user data accessible to the public.
Copyright (c) 2014, Sameer Deshmukh All rights reserved