Class: Polars::DataFrame

Inherits: Object

Includes: Plot

Defined in: lib/polars/data_frame.rb

Overview

Two-dimensional data structure representing data as a table with rows and columns.

Instance Method Summary

#!=(other) ⇒ DataFrame
Not equal.
#%(other) ⇒ DataFrame
Returns the modulo.
#*(other) ⇒ DataFrame
Performs multiplication.
#+(other) ⇒ DataFrame
Performs addition.
#-(other) ⇒ DataFrame
Performs subtraction.
#/(other) ⇒ DataFrame
Performs division.
#<(other) ⇒ DataFrame
Less than.
#<=(other) ⇒ DataFrame
Less than or equal.
#==(other) ⇒ DataFrame
Equal.
#>(other) ⇒ DataFrame
Greater than.
#>=(other) ⇒ DataFrame
Greater than or equal.
#[](*args) ⇒ Object
Returns subset of the DataFrame.
#[]=(*key, value) ⇒ Object
Set item.
#cast(dtypes, strict: true) ⇒ DataFrame
Cast DataFrame column(s) to the specified dtype(s).
#clear(n = 0) ⇒ DataFrame (also: #cleared)
Create an empty copy of the current DataFrame.
#collect_schema ⇒ Schema
Get an ordered mapping of column names to their data type.
#columns ⇒ Array
Get column names.
#columns=(columns) ⇒ Object
Change the column names of the DataFrame.
#delete(name) ⇒ Series
Drop in place if exists.
#describe ⇒ DataFrame
Summary statistics for a DataFrame.
#drop(*columns) ⇒ DataFrame
Remove columns from the DataFrame and return the result as a new DataFrame.
#drop_in_place(name) ⇒ Series
Drop in place.
#drop_nulls(subset: nil) ⇒ DataFrame
Return a new DataFrame where the null values are dropped.
#dtypes ⇒ Array
Get dtypes of columns in DataFrame.
#each(&block) ⇒ Object
Iterates over the columns of the DataFrame; returns an enumerator when no block is given.
#each_row(named: true, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
#equals(other, null_equal: true) ⇒ Boolean (also: #frame_equal)
Check if DataFrame is equal to other.
#estimated_size(unit = "b") ⇒ Numeric
Return an estimation of the total (heap) allocated size of the DataFrame.
#explode(columns) ⇒ DataFrame
Explode DataFrame to long format by exploding a column with Lists.
#extend(other) ⇒ DataFrame
Extend the memory backed by this DataFrame with the values from other.
#fill_nan(fill_value) ⇒ DataFrame
Fill floating point NaN values by an Expression evaluation.
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame
Fill null values using the specified value or strategy.
#filter(predicate) ⇒ DataFrame
Filter the rows in the DataFrame based on a predicate expression.
#flags ⇒ Hash
Get flags that are set on the columns of this DataFrame.
#fold ⇒ Series
Apply a horizontal reduction on a DataFrame.
#gather_every(n, offset = 0) ⇒ DataFrame (also: #take_every)
Take every nth row in the DataFrame and return as a new DataFrame.
#get_column(name) ⇒ Series
Get a single column as Series by name.
#get_column_index(name) ⇒ Series (also: #find_idx_by_name)
Find the index of a column by name.
#get_columns ⇒ Array
Get the DataFrame as an Array of Series.
#group_by(by, maintain_order: false) ⇒ GroupBy (also: #groupby, #group)
Start a group by operation.
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: true, include_boundaries: false, closed: "left", by: nil, start_by: "window") ⇒ DataFrame (also: #groupby_dynamic)
Group based on a time value (or index value of type :i32, :i64).
#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ Series
Hash and combine the rows in this DataFrame.
#head(n = 5) ⇒ DataFrame
Get the first n rows.
#height ⇒ Integer (also: #count, #length, #size)
Get the height of the DataFrame.
#hstack(columns, in_place: false) ⇒ DataFrame
Return a new DataFrame grown horizontally by stacking multiple Series to it.
#include?(name) ⇒ Boolean
Check if DataFrame includes column.
#initialize(data = nil, schema: nil, columns: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ DataFrame constructor
Create a new DataFrame.
#insert_column(index, series) ⇒ DataFrame (also: #insert_at_idx)
Insert a Series at a certain column index.
#interpolate ⇒ DataFrame
Interpolate intermediate values.
#is_duplicated ⇒ Series
Get a mask of all duplicated rows in this DataFrame.
#is_empty ⇒ Boolean (also: #empty?)
Check if the dataframe is empty.
#is_unique ⇒ Series
Get a mask of all unique rows in this DataFrame.
#item ⇒ Object
Return the dataframe as a scalar.
#iter_columns ⇒ Object
Returns an iterator over the columns of this DataFrame.
#iter_rows(named: false, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
#iter_slices(n_rows: 10_000) ⇒ Object
Returns a non-copying iterator of slices over the underlying DataFrame.
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, coalesce: nil, maintain_order: nil) ⇒ DataFrame
Join in SQL-like fashion.
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ DataFrame
Perform an asof join.
#lazy ⇒ LazyFrame
Start a lazy query from this point.
#limit(n = 5) ⇒ DataFrame
Get the first n rows.
#map_rows(return_dtype: nil, inference_size: 256, &f) ⇒ Object (also: #apply)
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
#max ⇒ DataFrame
Aggregate the columns of this DataFrame to their maximum value.
#max_horizontal ⇒ Series
Get the maximum value horizontally across columns.
#mean ⇒ DataFrame
Aggregate the columns of this DataFrame to their mean value.
#mean_horizontal(ignore_nulls: true) ⇒ Series
Take the mean of all values horizontally across columns.
#median ⇒ DataFrame
Aggregate the columns of this DataFrame to their median value.
#merge_sorted(other, key) ⇒ DataFrame
Take two sorted DataFrames and merge them by the sorted key.
#min ⇒ DataFrame
Aggregate the columns of this DataFrame to their minimum value.
#min_horizontal ⇒ Series
Get the minimum value horizontally across columns.
#n_chunks(strategy: "first") ⇒ Object
Get number of chunks used by the ChunkedArrays of this DataFrame.
#n_unique(subset: nil) ⇒ DataFrame
Return the number of unique rows, or the number of unique row-subsets.
#null_count ⇒ DataFrame
Create a new DataFrame that shows the null counts per column.
#partition_by(groups, maintain_order: true, include_key: true, as_dict: false) ⇒ Object
Split into multiple DataFrames partitioned by groups.
#pipe(func, *args, **kwargs, &block) ⇒ Object
Offers a structured way to apply a sequence of user-defined functions (UDFs).
#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ Vega::LiteChart included from Plot
Plot data.
#product ⇒ DataFrame
Aggregate the columns of this DataFrame to their product values.
#quantile(quantile, interpolation: "nearest") ⇒ DataFrame
Aggregate the columns of this DataFrame to their quantile value.
#rechunk ⇒ DataFrame
Rechunk the data into contiguous memory so that all subsequent operations have optimal and predictable performance.
#rename(mapping, strict: true) ⇒ DataFrame
Rename column names.
#replace(column, new_col) ⇒ DataFrame
Replace a column by a new Series.
#replace_column(index, series) ⇒ DataFrame (also: #replace_at_idx)
Replace a column at an index location.
#reverse ⇒ DataFrame
Reverse the DataFrame.
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ RollingGroupBy (also: #groupby_rolling, #group_by_rolling)
Create rolling groups based on a time column.
#row(index = nil, by_predicate: nil, named: false) ⇒ Object
Get a row as an array, either by index or by predicate.
#rows(named: false) ⇒ Array
Convert columnar data to rows as Ruby arrays.
#sample(n: nil, frac: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ DataFrame
Sample from this DataFrame.
#schema ⇒ Hash
Get the schema.
#select(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
#set_sorted(column, descending: false) ⇒ DataFrame
Flag a column as sorted.
#shape ⇒ Array
Get the shape of the DataFrame.
#shift(n, fill_value: nil) ⇒ DataFrame
Shift values by the given period.
#shift_and_fill(periods, fill_value) ⇒ DataFrame
Shift the values by a given period and fill the resulting null values.
#shrink_to_fit(in_place: false) ⇒ DataFrame
Shrink DataFrame memory usage.
#slice(offset, length = nil) ⇒ DataFrame
Get a slice of this DataFrame.
#sort(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column.
#sort!(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column in-place.
#std(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their standard deviation value.
#sum ⇒ DataFrame
Aggregate the columns of this DataFrame to their sum value.
#sum_horizontal(ignore_nulls: true) ⇒ Series
Sum all values horizontally across columns.
#tail(n = 5) ⇒ DataFrame
Get the last n rows.
#to_a ⇒ Array
Returns an array representing the DataFrame.
#to_csv(**options) ⇒ String
Write to comma-separated values (CSV) string.
#to_dummies(columns: nil, separator: "_", drop_first: false) ⇒ DataFrame
Get one hot encoded dummy variables.
#to_h(as_series: true) ⇒ Hash
Convert DataFrame to a hash mapping column name to values.
#to_hashes ⇒ Array
Convert every row to a hash.
#to_numo ⇒ Numo::NArray
Convert DataFrame to a 2D Numo array.
#to_s ⇒ String (also: #inspect)
Returns a string representing the DataFrame.
#to_series(index = 0) ⇒ Series
Select column as Series at index location.
#to_struct(name) ⇒ Series
Convert a DataFrame to a Series of type Struct.
#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ DataFrame
Transpose a DataFrame over the diagonal.
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ DataFrame
Drop duplicate rows from this DataFrame.
#unnest(names) ⇒ DataFrame
Decompose a struct into its fields.
#unpivot(on, index: nil, variable_name: nil, value_name: nil) ⇒ DataFrame (also: #melt)
Unpivot a DataFrame from wide to long format.
#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ DataFrame
Unstack a long table to a wide form without doing an aggregation.
#upsample(time_column:, every:, by: nil, maintain_order: false) ⇒ DataFrame
Upsample a DataFrame at a regular frequency.
#var(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their variance value.
#vstack(df, in_place: false) ⇒ DataFrame
Grow this DataFrame vertically by stacking a DataFrame to it.
#width ⇒ Integer
Get the width of the DataFrame.
#with_column(column) ⇒ DataFrame
Return a new DataFrame with the column added or replaced.
#with_columns(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
#with_row_index(name: "index", offset: 0) ⇒ DataFrame (also: #with_row_count)
Add a column at index 0 that counts the rows.
#write_avro(file, compression = "uncompressed", name: "") ⇒ nil
Write to Apache Avro file.
#write_csv(file = nil, has_header: true, include_header: nil, sep: ",", quote: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_precision: nil, null_value: nil) ⇒ String?
Write to comma-separated values (CSV) file.
#write_database(table_name, connection = nil, if_table_exists: "fail") ⇒ Integer
Write the data in a Polars DataFrame to a database.
#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ nil
Write DataFrame as delta table.
#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, retries: 2) ⇒ nil
Write to Arrow IPC binary stream or Feather file.
#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ Object
Write to Arrow IPC record batch stream.
#write_json(file = nil) ⇒ nil
Serialize to JSON representation.
#write_ndjson(file = nil) ⇒ nil
Serialize to newline delimited JSON representation.
#write_parquet(file, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_page_size: nil) ⇒ nil
Write to Apache Parquet file.

Constructor Details

#initialize(data = nil, schema: nil, columns: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ `DataFrame`

Create a new DataFrame.

# File 'lib/polars/data_frame.rb', line 50

def initialize(data = nil, schema: nil, columns: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false)
  if schema && columns
    warn "columns is ignored when schema is passed"
  end
  schema ||= columns

  if defined?(ActiveRecord) && (data.is_a?(ActiveRecord::Relation) || data.is_a?(ActiveRecord::Result))
    raise ArgumentError, "Use read_database instead"
  end

  if data.nil?
    self._df = self.class.hash_to_rbdf({}, schema: schema, schema_overrides: schema_overrides)
  elsif data.is_a?(Hash)
    data = data.transform_keys { |v| v.is_a?(Symbol) ? v.to_s : v }
    self._df = self.class.hash_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, nan_to_null: nan_to_null)
  elsif data.is_a?(::Array)
    self._df = self.class.sequence_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, orient: orient, infer_schema_length: infer_schema_length)
  elsif data.is_a?(Series)
    self._df = self.class.series_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict)
  elsif data.respond_to?(:arrow_c_stream)
    # This uses the fact that RbSeries.from_arrow_c_stream will create a
    # struct-typed Series. Then we unpack that to a DataFrame.
    tmp_col_name = ""
    s = Utils.wrap_s(RbSeries.from_arrow_c_stream(data))
    self._df = s.to_frame(tmp_col_name).unnest(tmp_col_name)._df
  else
    raise ArgumentError, "DataFrame constructor called with unsupported type; got #{data.class.name}"
  end
end
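
In the Hash branch above, Symbol keys are converted to strings before the frame is built, so `{foo: [...]}` and `{"foo" => [...]}` yield the same column name. That step is plain Ruby and can be tried in isolation. `normalize_keys` is a hypothetical helper name used only for this sketch; it is not part of the library.

```ruby
# Hypothetical helper mirroring the transform_keys call in the Hash branch.
# Symbol keys become strings; every other key is left untouched.
def normalize_keys(data)
  data.transform_keys { |k| k.is_a?(Symbol) ? k.to_s : k }
end

normalized = normalize_keys({foo: [1, 2, 3], "bar" => [4.0, 5.0, 6.0]})
p normalized.keys
# => ["foo", "bar"]
```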

Instance Method Details

#!=(other) ⇒ `DataFrame`

Not equal.



# File 'lib/polars/data_frame.rb', line 230

def !=(other)
  _comp(other, "neq")
end

#%(other) ⇒ `DataFrame`

Returns the modulo.

# File 'lib/polars/data_frame.rb', line 313

def %(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.rem_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.rem(other._s))
end

#*(other) ⇒ `DataFrame`

Performs multiplication.

# File 'lib/polars/data_frame.rb', line 265

def *(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.mul_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.mul(other._s))
end

#+(other) ⇒ `DataFrame`

Performs addition.

# File 'lib/polars/data_frame.rb', line 289

def +(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.add_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.add(other._s))
end

#-(other) ⇒ `DataFrame`

Performs subtraction.

# File 'lib/polars/data_frame.rb', line 301

def -(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.sub_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.sub(other._s))
end

#/(other) ⇒ `DataFrame`

Performs division.

# File 'lib/polars/data_frame.rb', line 277

def /(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.div_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.div(other._s))
end

#<(other) ⇒ `DataFrame`

Less than.



# File 'lib/polars/data_frame.rb', line 244

def <(other)
  _comp(other, "lt")
end

#<=(other) ⇒ `DataFrame`

Less than or equal.



# File 'lib/polars/data_frame.rb', line 258

def <=(other)
  _comp(other, "lt_eq")
end

#==(other) ⇒ `DataFrame`

Equal.



# File 'lib/polars/data_frame.rb', line 223

def ==(other)
  _comp(other, "eq")
end

#>(other) ⇒ `DataFrame`

Greater than.



# File 'lib/polars/data_frame.rb', line 237

def >(other)
  _comp(other, "gt")
end

#>=(other) ⇒ `DataFrame`

Greater than or equal.



# File 'lib/polars/data_frame.rb', line 251

def >=(other)
  _comp(other, "gt_eq")
end

#[](*args) ⇒ `Object`

Returns subset of the DataFrame.

Raises:

(ArgumentError)

# File 'lib/polars/data_frame.rb', line 354

def [](*args)
  if args.size == 2
    row_selection, col_selection = args

    # df[.., unknown]
    if row_selection.is_a?(Range)

      # multiple slices
      # df[.., ..]
      if col_selection.is_a?(Range)
        raise Todo
      end
    end

    # df[2, ..] (select row as df)
    if row_selection.is_a?(Integer)
      if col_selection.is_a?(::Array)
        df = self[0.., col_selection]
        return df.slice(row_selection, 1)
      end
      # df[2, "a"]
      if col_selection.is_a?(::String) || col_selection.is_a?(Symbol)
        return self[col_selection][row_selection]
      end
    end

    # column selection can be "a" and ["a", "b"]
    if col_selection.is_a?(::String) || col_selection.is_a?(Symbol)
      col_selection = [col_selection]
    end

    # df[.., 1]
    if col_selection.is_a?(Integer)
      series = to_series(col_selection)
      return series[row_selection]
    end

    if col_selection.is_a?(::Array)
      # df[.., [1, 2]]
      if Utils.is_int_sequence(col_selection)
        series_list = col_selection.map { |i| to_series(i) }
        df = self.class.new(series_list)
        return df[row_selection]
      end
    end

    df = self[col_selection]
    return df[row_selection]
  elsif args.size == 1
    item = args[0]

    # select single column
    # df["foo"]
    if item.is_a?(::String) || item.is_a?(Symbol)
      return Utils.wrap_s(_df.get_column(item.to_s))
    end

    # df[idx]
    if item.is_a?(Integer)
      return slice(_pos_idx(item, 0), 1)
    end

    # df[..]
    if item.is_a?(Range)
      return Slice.new(self).apply(item)
    end

    if item.is_a?(::Array) && item.all? { |v| Utils.strlike?(v) }
      # select multiple columns
      # df[["foo", "bar"]]
      return _from_rbdf(_df.select(item.map(&:to_s)))
    end

    if Utils.is_int_sequence(item)
      item = Series.new("", item)
    end

    if item.is_a?(Series)
      dtype = item.dtype
      if dtype == String
        return _from_rbdf(_df.select(item))
      elsif dtype == UInt32
        return _from_rbdf(_df.take_with_series(item._s))
      elsif [UInt8, UInt16, UInt64, Int8, Int16, Int32, Int64].include?(dtype)
        return _from_rbdf(
          _df.take_with_series(_pos_idxs(item, 0)._s)
        )
      end
    end
  end

  # Ruby-specific
  if item.is_a?(Expr) || item.is_a?(Series)
    return filter(item)
  end

  raise ArgumentError, "Cannot get item of type: #{item.class.name}"
end
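
The single-argument branch above dispatches on the kind of selector it receives. The following plain-Ruby sketch mirrors that order of checks for the built-in Ruby types only. `classify_selector` and its return symbols are hypothetical names for illustration, and the sketch assumes that "string-like" means String or Symbol and that an integer sequence is an Array of Integers. Series and Expr selectors are not modeled.

```ruby
# Hypothetical classifier following the order of checks in the
# one-argument branch of #[] (built-in Ruby types only).
def classify_selector(item)
  case item
  when String, Symbol
    :single_column          # df["foo"]
  when Integer
    :single_row             # df[2]
  when Range
    :row_slice              # df[1..3]
  when Array
    if item.all? { |v| v.is_a?(String) || v.is_a?(Symbol) }
      :multiple_columns     # df[["foo", "bar"]]
    elsif item.all? { |v| v.is_a?(Integer) }
      :rows_by_position     # df[[0, 2]]
    else
      :unsupported
    end
  else
    :unsupported
  end
end

p classify_selector("foo")
# => :single_column
p classify_selector([0, 2])
# => :rows_by_position
```

Because the column check runs first and `all?` is true for an empty collection, an empty Array is classified as a column selection in this sketch.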

#[]=(*key, value) ⇒ `Object`

Set item.

# File 'lib/polars/data_frame.rb', line 456

def []=(*key, value)
  if key.length == 1
    key = key.first
  elsif key.length != 2
    raise ArgumentError, "wrong number of arguments (given #{key.length + 1}, expected 2..3)"
  end

  if Utils.strlike?(key)
    if value.is_a?(::Array) || (defined?(Numo::NArray) && value.is_a?(Numo::NArray))
      value = Series.new(value)
    elsif !value.is_a?(Series)
      value = Polars.lit(value)
    end
    self._df = with_column(value.alias(key.to_s))._df
  elsif key.is_a?(::Array)
    row_selection, col_selection = key

    if Utils.strlike?(col_selection)
      s = self[col_selection]
    elsif col_selection.is_a?(Integer)
      raise Todo
    else
      raise ArgumentError, "column selection not understood: #{col_selection}"
    end

    s[row_selection] = value

    if col_selection.is_a?(Integer)
      replace_column(col_selection, s)
    elsif Utils.strlike?(col_selection)
      replace(col_selection, s)
    end
  else
    raise Todo
  end
end
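
The `*key, value` signature above means Ruby collects everything inside the brackets into `key` and passes the right-hand side as `value`. The sketch below reproduces only the arity handling on a stand-in class; `KeyRecorder` is a hypothetical name and does not touch any DataFrame.

```ruby
# Hypothetical stand-in that records how Ruby maps obj[...] = value
# onto a []= method declared with (*key, value).
class KeyRecorder
  attr_reader :calls

  def initialize
    @calls = []
  end

  def []=(*key, value)
    if key.length == 1
      key = key.first            # obj["a"] = v      -> key is "a"
    elsif key.length != 2        # obj[0, "a"] = v   -> key is [0, "a"]
      raise ArgumentError, "wrong number of arguments (given #{key.length + 1}, expected 2..3)"
    end
    @calls << [key, value]
  end
end

recorder = KeyRecorder.new
recorder["a"] = 1
recorder[0, "a"] = 2
p recorder.calls
# => [["a", 1], [[0, "a"], 2]]
```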

#cast(dtypes, strict: true) ⇒ `DataFrame`

Cast DataFrame column(s) to the specified dtype(s).

Examples:

Cast specific frame columns to the specified dtypes:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => [Date.new(2020, 1, 2), Date.new(2021, 3, 4), Date.new(2022, 5, 6)]
  }
)
df.cast({"foo" => Polars::Float32, "bar" => Polars::UInt8})
# =>
# shape: (3, 3)
# ┌─────┬─────┬────────────┐
# │ foo ┆ bar ┆ ham        │
# │ --- ┆ --- ┆ ---        │
# │ f32 ┆ u8  ┆ date       │
# ╞═════╪═════╪════════════╡
# │ 1.0 ┆ 6   ┆ 2020-01-02 │
# │ 2.0 ┆ 7   ┆ 2021-03-04 │
# │ 3.0 ┆ 8   ┆ 2022-05-06 │
# └─────┴─────┴────────────┘

Cast all frame columns matching one dtype (or dtype group) to another dtype:

df.cast({Polars::Date => Polars::Datetime})
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────────────────────┐
# │ foo ┆ bar ┆ ham                 │
# │ --- ┆ --- ┆ ---                 │
# │ i64 ┆ f64 ┆ datetime[μs]        │
# ╞═════╪═════╪═════════════════════╡
# │ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
# │ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
# │ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
# └─────┴─────┴─────────────────────┘

Cast all frame columns to the specified dtype:

df.cast(Polars::String).to_h(as_series: false)
# => {"foo"=>["1", "2", "3"], "bar"=>["6.0", "7.0", "8.0"], "ham"=>["2020-01-02", "2021-03-04", "2022-05-06"]}



# File 'lib/polars/data_frame.rb', line 3144

def cast(dtypes, strict: true)
  lazy.cast(dtypes, strict: strict).collect(_eager: true)
end

#clear(n = 0) ⇒ `DataFrame` Also known as: cleared

Create an empty copy of the current DataFrame.

Returns a DataFrame with identical schema but no data.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil]
  }
)
df.clear
# =>
# shape: (0, 3)
# ┌─────┬─────┬──────┐
# │ a   ┆ b   ┆ c    │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ f64 ┆ bool │
# ╞═════╪═════╪══════╡
# └─────┴─────┴──────┘

df.clear(2)
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ a    ┆ b    ┆ c    │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ f64  ┆ bool │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

# File 'lib/polars/data_frame.rb', line 3184

def clear(n = 0)
  if n == 0
    _from_rbdf(_df.clear)
  elsif n > 0 || len > 0
    self.class.new(
      schema.to_h { |nm, tp| [nm, Series.new(nm, [], dtype: tp).extend_constant(nil, n)] }
    )
  else
    clone
  end
end
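
For `n > 0`, the method above builds one null-filled column per schema entry. A rough plain-Ruby model of that idea follows, with Arrays standing in for Series and `nil` standing in for null. `null_filled` and the dtype labels are hypothetical and used only for illustration.

```ruby
# Hypothetical model: the schema maps column names to dtype labels,
# and each resulting column is an Array of n nils.
def null_filled(schema, n)
  schema.to_h { |name, _dtype| [name, Array.new(n)] }
end

schema = {"a" => :i64, "b" => :f64, "c" => :bool}
cleared = null_filled(schema, 2)
p cleared.keys
# => ["a", "b", "c"]
p cleared["a"]
# => [nil, nil]
```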

#collect_schema ⇒ `Schema`

Note:

This method is included to facilitate writing code that is generic for both DataFrame and LazyFrame.

Get an ordered mapping of column names to their data type.

Examples:

Determine the schema.

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.collect_schema
# => Polars::Schema({"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String})

Access various properties of the schema using the `Schema` object.

schema = df.collect_schema
schema["bar"]
# => Polars::Float64

schema.names
# => ["foo", "bar", "ham"]

schema.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]

schema.length
# => 3



# File 'lib/polars/data_frame.rb', line 533

def collect_schema
  Schema.new(columns.zip(dtypes), check_dtypes: false)
end

#columns ⇒ `Array`

Get column names.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.columns
# => ["foo", "bar", "ham"]



# File 'lib/polars/data_frame.rb', line 140

def columns
  _df.columns
end

#columns=(columns) ⇒ `Object`

Change the column names of the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.columns = ["apple", "banana", "orange"]
df
# =>
# shape: (3, 3)
# ┌───────┬────────┬────────┐
# │ apple ┆ banana ┆ orange │
# │ ---   ┆ ---    ┆ ---    │
# │ i64   ┆ i64    ┆ str    │
# ╞═══════╪════════╪════════╡
# │ 1     ┆ 6      ┆ a      │
# │ 2     ┆ 7      ┆ b      │
# │ 3     ┆ 8      ┆ c      │
# └───────┴────────┴────────┘



# File 'lib/polars/data_frame.rb', line 173

def columns=(columns)
  _df.set_column_names(columns)
end

#delete(name) ⇒ `Series`

Drop in place if exists.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.delete("ham")
# =>
# shape: (3,)
# Series: 'ham' [str]
# [
#         "a"
#         "b"
#         "c"
# ]

df.delete("missing")
# => nil



# File 'lib/polars/data_frame.rb', line 3091

def delete(name)
  drop_in_place(name) if include?(name)
end

#describe ⇒ `DataFrame`

Summary statistics for a DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1.0, 2.8, 3.0],
    "b" => [4, 5, nil],
    "c" => [true, false, true],
    "d" => [nil, "b", "c"],
    "e" => ["usd", "eur", nil]
  }
)
df.describe
# =>
# shape: (7, 6)
# ┌────────────┬──────────┬──────────┬──────────┬──────┬──────┐
# │ describe   ┆ a        ┆ b        ┆ c        ┆ d    ┆ e    │
# │ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---  │
# │ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str  │
# ╞════════════╪══════════╪══════════╪══════════╪══════╪══════╡
# │ count      ┆ 3.0      ┆ 3.0      ┆ 3.0      ┆ 3    ┆ 3    │
# │ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 1    ┆ 1    │
# │ mean       ┆ 2.266667 ┆ 4.5      ┆ 0.666667 ┆ null ┆ null │
# │ std        ┆ 1.101514 ┆ 0.707107 ┆ 0.57735  ┆ null ┆ null │
# │ min        ┆ 1.0      ┆ 4.0      ┆ 0.0      ┆ b    ┆ eur  │
# │ max        ┆ 3.0      ┆ 5.0      ┆ 1.0      ┆ c    ┆ usd  │
# │ median     ┆ 2.8      ┆ 4.5      ┆ 1.0      ┆ null ┆ null │
# └────────────┴──────────┴──────────┴──────────┴──────┴──────┘

# File 'lib/polars/data_frame.rb', line 1481

def describe
  describe_cast = lambda do |stat|
    columns = []
    self.columns.each_with_index do |s, i|
      if self[s].is_numeric || self[s].is_boolean
        columns << stat[0.., i].cast(:f64)
      else
        # for dates, strings, etc, we cast to string so that all
        # statistics can be shown
        columns << stat[0.., i].cast(:str)
      end
    end
    self.class.new(columns)
  end

  summary = _from_rbdf(
    Polars.concat(
      [
        describe_cast.(
          self.class.new(columns.to_h { |c| [c, [height]] })
        ),
        describe_cast.(null_count),
        describe_cast.(mean),
        describe_cast.(std),
        describe_cast.(min),
        describe_cast.(max),
        describe_cast.(median)
      ]
    )._df
  )
  summary.insert_column(
    0,
    Polars::Series.new(
      "describe",
      ["count", "null_count", "mean", "std", "min", "max", "median"],
    )
  )
  summary
end
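
The figures in the example table can be checked by hand. The sketch below recomputes mean, standard deviation, and median for columns a, b, and c in plain Ruby. It assumes that nulls are skipped, that booleans are treated as 0 and 1, and that the standard deviation uses one delta degree of freedom (matching the `ddof: 1` default of `#std`). The helper names are hypothetical.

```ruby
# Hypothetical helpers for checking the describe example by hand.
def mean(xs)
  xs.sum / xs.size.to_f
end

# Sample standard deviation (divides by n - 1).
def sample_std(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / (xs.size - 1))
end

def median(xs)
  sorted = xs.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

a = [1.0, 2.8, 3.0]
b = [4, 5, nil].compact                          # nulls skipped (assumption)
c = [true, false, true].map { |v| v ? 1 : 0 }    # booleans as 0/1 (assumption)

[["a", a], ["b", b], ["c", c]].each do |name, xs|
  puts format("%s mean=%.6f std=%.6f median=%.1f", name, mean(xs), sample_std(xs), median(xs))
end
```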

#drop(*columns) ⇒ `DataFrame`

Remove columns from the DataFrame and return the result as a new DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.drop("ham")
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ f64 │
# ╞═════╪═════╡
# │ 1   ┆ 6.0 │
# │ 2   ┆ 7.0 │
# │ 3   ┆ 8.0 │
# └─────┴─────┘

Drop multiple columns by passing a list of column names.

df.drop(["bar", "ham"])
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘

Use positional arguments to drop multiple columns.

df.drop("foo", "ham")
# =>
# shape: (3, 1)
# ┌─────┐
# │ bar │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 6.0 │
# │ 7.0 │
# │ 8.0 │
# └─────┘



# File 'lib/polars/data_frame.rb', line 3031

def drop(*columns)
  lazy.drop(*columns).collect(_eager: true)
end

#drop_in_place(name) ⇒ `Series`

Drop in place.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.drop_in_place("ham")
# =>
# shape: (3,)
# Series: 'ham' [str]
# [
#         "a"
#         "b"
#         "c"
# ]



# File 'lib/polars/data_frame.rb', line 3059

def drop_in_place(name)
  Utils.wrap_s(_df.drop_in_place(name))
end

#drop_nulls(subset: nil) ⇒ `DataFrame`

Return a new DataFrame where the null values are dropped.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, nil, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.drop_nulls
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘



# File 'lib/polars/data_frame.rb', line 1862

def drop_nulls(subset: nil)
  lazy.drop_nulls(subset: subset).collect(_eager: true)
end
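
The example above does not exercise `subset`. As a rough row-wise model in plain Ruby (rows as hashes, `nil` as null), the sketch below assumes that `subset` restricts the null check to the named columns. `drop_null_rows` is a hypothetical helper, not a library method.

```ruby
# Hypothetical row-wise model of drop_nulls.
# With subset: nil every column is checked; otherwise only the listed ones.
def drop_null_rows(rows, subset: nil)
  rows.reject do |row|
    keys = subset || row.keys
    keys.any? { |k| row[k].nil? }
  end
end

rows = [
  {"foo" => 1, "bar" => 6, "ham" => "a"},
  {"foo" => 2, "bar" => nil, "ham" => "b"},
  {"foo" => 3, "bar" => 8, "ham" => "c"}
]

p drop_null_rows(rows).map { |r| r["foo"] }
# => [1, 3]
p drop_null_rows(rows, subset: ["foo", "ham"]).size
# => 3
```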

#dtypes ⇒ `Array`

Get dtypes of columns in DataFrame. Dtypes can also be found in column headers when printing the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]



# File 'lib/polars/data_frame.rb', line 191

def dtypes
  _df.dtypes
end

#each(&block) ⇒ `Object`

Iterates over the columns of the DataFrame; returns an enumerator when no block is given.



# File 'lib/polars/data_frame.rb', line 347

def each(&block)
  get_columns.each(&block)
end

#each_row(named: true, buffer_size: 500, &block) ⇒ `Object`

Returns an iterator over the DataFrame of rows of Ruby-native values.



# File 'lib/polars/data_frame.rb', line 5057

def each_row(named: true, buffer_size: 500, &block)
  iter_rows(named: named, buffer_size: buffer_size, &block)
end

#equals(other, null_equal: true) ⇒ `Boolean` Also known as: frame_equal

Check if DataFrame is equal to other.

Examples:

df1 = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df2 = Polars::DataFrame.new(
  {
    "foo" => [3, 2, 1],
    "bar" => [8.0, 7.0, 6.0],
    "ham" => ["c", "b", "a"]
  }
)
df1.equals(df1)
# => true
df1.equals(df2)
# => false



# File 'lib/polars/data_frame.rb', line 1674

def equals(other, null_equal: true)
  _df.equals(other._df, null_equal)
end

#estimated_size(unit = "b") ⇒ `Numeric`

Return an estimation of the total (heap) allocated size of the DataFrame.

Estimated size is given in the specified unit (bytes by default).

This estimation is the sum of the sizes of its buffers and validity bitmaps, including those of nested arrays. Multiple arrays may share buffers and bitmaps. Therefore, the size of 2 arrays is not the sum of the sizes computed from this function. In particular, StructArray's size is an upper bound.

When an array is sliced, its allocated size remains constant because the buffer is unchanged. However, this function will yield a smaller number. This is because this function returns the visible size of the buffer, not its total capacity.

FFI buffers are included in this estimation.

Examples:

df = Polars::DataFrame.new(
  {
    "x" => 1_000_000.times.to_a.reverse,
    "y" => 1_000_000.times.map { |v| v / 1000.0 },
    "z" => 1_000_000.times.map(&:to_s)
  },
  columns: {"x" => :u32, "y" => :f64, "z" => :str}
)
df.estimated_size
# => 25888898
df.estimated_size("mb")
# => 17.0601749420166

# File 'lib/polars/data_frame.rb', line 1224

def estimated_size(unit = "b")
  sz = _df.estimated_size
  Utils.scale_bytes(sz, to: unit)
end
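
The unit conversion is delegated to `Utils.scale_bytes`, whose body is not shown here. A minimal sketch of such a conversion follows, assuming 1024-based units and the unit names listed below; both are assumptions, and `scale_bytes_sketch` is a hypothetical stand-in rather than the library's implementation.

```ruby
# Hypothetical stand-in for a byte-scaling helper (1024-based units assumed).
BYTES_PER_UNIT = {
  "b" => 1,
  "kb" => 1024,
  "mb" => 1024**2,
  "gb" => 1024**3,
  "tb" => 1024**4
}.freeze

def scale_bytes_sketch(size, to: "b")
  factor = BYTES_PER_UNIT.fetch(to) { raise ArgumentError, "unknown unit: #{to}" }
  # Bytes are returned unchanged; larger units yield a Float.
  to == "b" ? size : size / factor.to_f
end

puts scale_bytes_sketch(25_888_898, to: "mb")
```

Under this assumption, 17.0601749420166 mb corresponds to roughly 17.9 million bytes, so the byte and megabyte figures in the example above may have been captured from different runs.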

#explode(columns) ⇒ `DataFrame`

Explode DataFrame to long format by exploding a column with Lists.

Examples:

df = Polars::DataFrame.new(
  {
    "letters" => ["a", "a", "b", "c"],
    "numbers" => [[1], [2, 3], [4, 5], [6, 7, 8]]
  }
)
df.explode("numbers")
# =>
# shape: (8, 2)
# ┌─────────┬─────────┐
# │ letters ┆ numbers │
# │ ---     ┆ ---     │
# │ str     ┆ i64     │
# ╞═════════╪═════════╡
# │ a       ┆ 1       │
# │ a       ┆ 2       │
# │ a       ┆ 3       │
# │ b       ┆ 4       │
# │ b       ┆ 5       │
# │ c       ┆ 6       │
# │ c       ┆ 7       │
# │ c       ┆ 8       │
# └─────────┴─────────┘



# File 'lib/polars/data_frame.rb', line 3433

def explode(columns)
  lazy.explode(columns).collect(no_optimization: true)
end
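
Conceptually, exploding a list column repeats the other values of a row once per list element. A plain-Ruby model of that reshaping, using an Array of hashes instead of a DataFrame, is sketched below. `explode_rows` is a hypothetical helper, and emitting a single `nil` row for an empty or missing list is an assumption about the intended behavior.

```ruby
# Hypothetical row-wise model of explode: one output row per list element.
def explode_rows(rows, column)
  rows.flat_map do |row|
    values = row[column]
    # Assumption: an empty or missing list yields one row holding nil.
    values = [nil] if values.nil? || values.empty?
    values.map { |v| row.merge(column => v) }
  end
end

rows = [
  {"letters" => "a", "numbers" => [1]},
  {"letters" => "a", "numbers" => [2, 3]},
  {"letters" => "b", "numbers" => [4, 5]},
  {"letters" => "c", "numbers" => [6, 7, 8]}
]

exploded = explode_rows(rows, "numbers")
p exploded.size
# => 8
p exploded.map { |r| r["numbers"] }
# => [1, 2, 3, 4, 5, 6, 7, 8]
```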

#extend(other) ⇒ `DataFrame`

Extend the memory backed by this DataFrame with the values from other.

Unlike vstack, which adds the chunks from other to the chunks of this DataFrame, extend appends the data from other to the underlying memory locations and thus may cause a reallocation.

If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.

Prefer extend over vstack when you want to do a query after a single append. For instance during online operations where you add n rows and rerun a query.

Prefer vstack over extend when you want to append many times before doing a query. For instance, when you read in multiple files and want to store them in a single DataFrame. In the latter case, finish the sequence of vstack operations with a rechunk.

Examples:

df1 = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df2 = Polars::DataFrame.new({"foo" => [10, 20, 30], "bar" => [40, 50, 60]})
df1.extend(df2)
# =>
# shape: (6, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 4   │
# │ 2   ┆ 5   │
# │ 3   ┆ 6   │
# │ 10  ┆ 40  │
# │ 20  ┆ 50  │
# │ 30  ┆ 60  │
# └─────┴─────┘

# File 'lib/polars/data_frame.rb', line 2971

def extend(other)
  _df.extend(other._df)
  self
end

#fill_nan(fill_value) ⇒ `DataFrame`

Note:

Note that floating point NaNs (Not a Number) are not missing values! To replace missing values, use fill_null.

Fill floating point NaN values by an Expression evaluation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1.5, 2, Float::NAN, 4],
    "b" => [0.5, 4, Float::NAN, 13]
  }
)
df.fill_nan(99)
# =>
# shape: (4, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ f64  ┆ f64  │
# ╞══════╪══════╡
# │ 1.5  ┆ 0.5  │
# │ 2.0  ┆ 4.0  │
# │ 99.0 ┆ 99.0 │
# │ 4.0  ┆ 13.0 │
# └──────┴──────┘



# File 'lib/polars/data_frame.rb', line 3398

def fill_nan(fill_value)
  lazy.fill_nan(fill_value).collect(no_optimization: true)
end

#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ `DataFrame`

Fill null values using the specified value or strategy.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, nil, 4],
    "b" => [0.5, 4, nil, 13]
  }
)
df.fill_null(99)
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 99  ┆ 99.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘

df.fill_null(strategy: "forward")
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘

df.fill_null(strategy: "max")
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘

df.fill_null(strategy: "zero")
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 0   ┆ 0.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘

# File 'lib/polars/data_frame.rb', line 3358

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true)
  _from_rbdf(
    lazy
      .fill_null(value, strategy: strategy, limit: limit, matches_supertype: matches_supertype)
      .collect(no_optimization: true)
      ._df
  )
end
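
The strategies shown in the examples can be pictured on a single column with plain Ruby arrays. This is only a sketch of the semantics for "forward", "max", and "zero"; `fill_nulls_sketch` is a hypothetical helper and ignores `limit` and type handling.

```ruby
# Illustration on a plain Array with nil for null; not the library's implementation.
def fill_nulls_sketch(values, strategy)
  case strategy
  when "forward"
    last = nil
    # Carry the most recent non-nil value forward.
    values.map { |v| v.nil? ? last : (last = v) }
  when "max"
    max = values.compact.max
    values.map { |v| v.nil? ? max : v }
  when "zero"
    values.map { |v| v.nil? ? 0 : v }
  else
    raise ArgumentError, "unsupported strategy: #{strategy}"
  end
end

fill_nulls_sketch([1, 2, nil, 4], "forward")  # => [1, 2, 2, 4]
fill_nulls_sketch([1, 2, nil, 4], "max")      # => [1, 2, 4, 4]
fill_nulls_sketch([1, 2, nil, 4], "zero")     # => [1, 2, 0, 4]
```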

#filter(predicate) ⇒ `DataFrame`

Filter the rows in the DataFrame based on a predicate expression.

Examples:

Filter on one condition:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.filter(Polars.col("foo") < 3)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘

Filter on multiple conditions:

df.filter((Polars.col("foo") < 3) & (Polars.col("ham") == "a"))
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘



# File 'lib/polars/data_frame.rb', line 1447

def filter(predicate)
  lazy.filter(predicate).collect
end

#flags ⇒ `Hash`

Get flags that are set on the columns of this DataFrame.



# File 'lib/polars/data_frame.rb', line 198

def flags
  columns.to_h { |name| [name, self[name].flags] }
end

#fold ⇒ `Series`

Apply a horizontal reduction on a DataFrame.

This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercast (cast to a common parent type).

Examples of the supercast rules when applying an arithmetic operation on two DataTypes:

i8 + str = str
f32 + i64 = f32
f32 + f64 = f64

Examples:

A horizontal sum operation:

df = Polars::DataFrame.new(
  {
    "a" => [2, 1, 3],
    "b" => [1, 2, 3],
    "c" => [1.0, 2.0, 3.0]
  }
)
df.fold { |s1, s2| s1 + s2 }
# =>
# shape: (3,)
# Series: 'a' [f64]
# [
#         4.0
#         5.0
#         9.0
# ]

A horizontal minimum operation:

df = Polars::DataFrame.new({"a" => [2, 1, 3], "b" => [1, 2, 3], "c" => [1.0, 2.0, 3.0]})
df.fold { |s1, s2| s1.zip_with(s1 < s2, s2) }
# =>
# shape: (3,)
# Series: 'a' [f64]
# [
#         1.0
#         1.0
#         3.0
# ]

A horizontal string concatenation:

df = Polars::DataFrame.new(
  {
    "a" => ["foo", "bar", nil],
    "b" => [1, 2, 3],
    "c" => [1.0, 2.0, 3.0]
  }
)
df.fold { |s1, s2| s1 + s2 }
# =>
# shape: (3,)
# Series: 'a' [str]
# [
#         "foo11.0"
#         "bar22.0"
#         null
# ]

A horizontal boolean or, similar to a row-wise .any:

df = Polars::DataFrame.new(
  {
    "a" => [false, false, true],
    "b" => [false, true, false]
  }
)
df.fold { |s1, s2| s1 | s2 }
# =>
# shape: (3,)
# Series: 'a' [bool]
# [
#         false
#         true
#         true
# ]

# File 'lib/polars/data_frame.rb', line 4866

def fold
  acc = to_series(0)

  1.upto(width - 1) do |i|
    acc = yield(acc, to_series(i))
  end
  acc
end
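
The reduction pattern in the method body (start with the first column, then combine it with each following column) can be sketched on plain Ruby arrays. `fold_columns` is a hypothetical helper for illustration; unlike the real method it combines element by element rather than Series by Series, and it applies no supercasting beyond Ruby's own numeric coercion.

```ruby
# columns: an Array of equal-length Arrays, one per column.
# The block receives one accumulated value and one value from the next column.
def fold_columns(columns)
  columns.reduce do |acc, col|
    acc.zip(col).map { |left, right| yield(left, right) }
  end
end

fold_columns([[2, 1, 3], [1, 2, 3], [1.0, 2.0, 3.0]]) { |a, b| a + b }
# => [4.0, 5.0, 9.0]
fold_columns([[false, false, true], [false, true, false]]) { |a, b| a | b }
# => [false, true, true]
```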

#gather_every(n, offset = 0) ⇒ `DataFrame` Also known as: take_every

Take every nth row in the DataFrame and return as a new DataFrame.

Examples:

s = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [5, 6, 7, 8]})
s.gather_every(2)
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 5   │
# │ 3   ┆ 7   │
# └─────┴─────┘



# File 'lib/polars/data_frame.rb', line 5178

def gather_every(n, offset = 0)
  select(F.col("*").gather_every(n, offset))
end
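
The row selection can be pictured with plain Ruby: keep the rows at positions `offset`, `offset + n`, `offset + 2n`, and so on. The helper below is a hypothetical illustration on an array of rows, not the library's implementation.

```ruby
# rows: an Array of row Arrays; returns every nth row starting at offset.
def gather_every_sketch(rows, n, offset = 0)
  (offset...rows.length).step(n).map { |i| rows[i] }
end

rows = [[1, 5], [2, 6], [3, 7], [4, 8]]
gather_every_sketch(rows, 2)     # => [[1, 5], [3, 7]]
gather_every_sketch(rows, 2, 1)  # => [[2, 6], [4, 8]]
```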

#get_column(name) ⇒ `Series`

Get a single column as Series by name.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.get_column("foo")
# =>
# shape: (3,)
# Series: 'foo' [i64]
# [
#         1
#         2
#         3
# ]



# File 'lib/polars/data_frame.rb', line 3275

def get_column(name)
  self[name]
end

#get_column_index(name) ⇒ `Integer` Also known as: find_idx_by_name

Find the index of a column by name.

Examples:

df = Polars::DataFrame.new(
  {"foo" => [1, 2, 3], "bar" => [6, 7, 8], "ham" => ["a", "b", "c"]}
)
df.get_column_index("ham")
# => 2



# File 'lib/polars/data_frame.rb', line 1534

def get_column_index(name)
  _df.get_column_index(name)
end

#get_columns ⇒ `Array`

Get the DataFrame as an Array of Series.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.get_columns
# =>
# [shape: (3,)
# Series: 'foo' [i64]
# [
#         1
#         2
#         3
# ], shape: (3,)
# Series: 'bar' [i64]
# [
#         4
#         5
#         6
# ]]

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
df.get_columns
# =>
# [shape: (4,)
# Series: 'a' [i64]
# [
#         1
#         2
#         3
#         4
# ], shape: (4,)
# Series: 'b' [f64]
# [
#         0.5
#         4.0
#         10.0
#         13.0
# ], shape: (4,)
# Series: 'c' [bool]
# [
#         true
#         true
#         false
#         true
# ]]



# File 'lib/polars/data_frame.rb', line 3253

def get_columns
  _df.get_columns.map { |s| Utils.wrap_s(s) }
end

#group_by(by, maintain_order: false) ⇒ `GroupBy` Also known as: groupby, group

Start a group by operation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
)
df.group_by("a").agg(Polars.col("b").sum).sort("a")
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 4   │
# │ b   ┆ 11  │
# │ c   ┆ 6   │
# └─────┴─────┘

# File 'lib/polars/data_frame.rb', line 1970

def group_by(by, maintain_order: false)
  if !Utils.bool?(maintain_order)
    raise TypeError, "invalid input for group_by arg `maintain_order`: #{maintain_order}."
  end
  GroupBy.new(
    self,
    by,
    maintain_order: maintain_order
  )
end

#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: true, include_boundaries: false, closed: "left", by: nil, start_by: "window") ⇒ `DataFrame` Also known as: groupby_dynamic

Group based on a time value (or index value of type :i32, :i64).

Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window can be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

every: interval of the window
period: length of the window
offset: offset of the window

The every, period and offset arguments are created with the following string language:

1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
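
To show how such a combined string decomposes, the sketch below converts the fixed-length units into nanoseconds. It is not how the library parses durations: `duration_to_ns` is a hypothetical helper, it assumes a day is exactly 24 hours and a week exactly 7 days, and it rejects the calendar units (`mo`, `y`) and the index unit (`i`), whose lengths cannot be expressed as a fixed number of nanoseconds.

```ruby
# Hypothetical parser for illustration; handles only fixed-length units.
def duration_to_ns(text)
  second = 1_000_000_000
  units = {
    "ns" => 1, "us" => 1_000, "ms" => 1_000_000, "s" => second,
    "m" => 60 * second, "h" => 3_600 * second,
    "d" => 86_400 * second, "w" => 604_800 * second
  }
  # Longer unit names come first so "ms" is not read as "m" followed by "s".
  pattern = /(\d+)(ns|us|ms|s|m|h|d|w)/
  unless text.match?(/\A(?:#{pattern.source})+\z/)
    raise ArgumentError, "unsupported duration: #{text}"
  end
  text.scan(pattern).sum { |count, unit| count.to_i * units.fetch(unit) }
end

duration_to_ns("3d12h4m25s")  # => 302665000000000 (302,665 seconds)
```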

In case of a group_by_dynamic on an integer column, the windows are defined by:

"1i" # length 1
"10i" # length 10

Examples:

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "n" => 0..6
  }
)
# =>
# shape: (7, 2)
# ┌─────────────────────┬─────┐
# │ time                ┆ n   │
# │ ---                 ┆ --- │
# │ datetime[μs]        ┆ i64 │
# ╞═════════════════════╪═════╡
# │ 2021-12-16 00:00:00 ┆ 0   │
# │ 2021-12-16 00:30:00 ┆ 1   │
# │ 2021-12-16 01:00:00 ┆ 2   │
# │ 2021-12-16 01:30:00 ┆ 3   │
# │ 2021-12-16 02:00:00 ┆ 4   │
# │ 2021-12-16 02:30:00 ┆ 5   │
# │ 2021-12-16 03:00:00 ┆ 6   │
# └─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

df.group_by_dynamic("time", every: "1h", closed: "right").agg(
  [
    Polars.col("time").min.alias("time_min"),
    Polars.col("time").max.alias("time_max")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬─────────────────────┬─────────────────────┐
# │ time                ┆ time_min            ┆ time_max            │
# │ ---                 ┆ ---                 ┆ ---                 │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
# └─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result.

df.group_by_dynamic(
  "time", every: "1h", include_boundaries: true, closed: "right"
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (4, 4)
# ┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# └─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

When closed="left", the right end of the interval is not included.

df.group_by_dynamic("time", every: "1h", closed: "left").agg(
  [
    Polars.col("time").count.alias("time_count"),
    Polars.col("time").alias("time_agg_list")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬─────────────────────────────────┐
# │ time                ┆ time_count ┆ time_agg_list                   │
# │ ---                 ┆ ---        ┆ ---                             │
# │ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]              │
# ╞═════════════════════╪════════════╪═════════════════════════════════╡
# │ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-… │
# │ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-… │
# │ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-… │
# │ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]           │
# └─────────────────────┴────────────┴─────────────────────────────────┘

When closed="both" the time values at the window boundaries belong to 2 groups.

df.group_by_dynamic("time", every: "1h", closed: "both").agg(
  [Polars.col("time").count.alias("time_count")]
)
# =>
# shape: (5, 2)
# ┌─────────────────────┬────────────┐
# │ time                ┆ time_count │
# │ ---                 ┆ ---        │
# │ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 3          │
# │ 2021-12-16 01:00:00 ┆ 3          │
# │ 2021-12-16 02:00:00 ┆ 3          │
# │ 2021-12-16 03:00:00 ┆ 1          │
# └─────────────────────┴────────────┘

Dynamic group bys can also be combined with grouping on normal keys.

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "groups" => ["a", "a", "a", "b", "b", "a", "a"]
  }
)
df.group_by_dynamic(
  "time",
  every: "1h",
  closed: "both",
  by: "groups",
  include_boundaries: true
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (7, 5)
# ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
# │ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# │ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
# │ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
# └────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic group by on an index column.

df = Polars::DataFrame.new(
  {
    "idx" => Polars.arange(0, 6, eager: true),
    "A" => ["A", "A", "B", "B", "B", "C"]
  }
)
df.group_by_dynamic(
  "idx",
  every: "2i",
  period: "3i",
  include_boundaries: true,
  closed: "right"
).agg(Polars.col("A").alias("A_agg_list"))
# =>
# shape: (4, 4)
# ┌─────────────────┬─────────────────┬─────┬─────────────────┐
# │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
# │ ---             ┆ ---             ┆ --- ┆ ---             │
# │ i64             ┆ i64             ┆ i64 ┆ list[str]       │
# ╞═════════════════╪═════════════════╪═════╪═════════════════╡
# │ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
# │ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
# │ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
# │ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
# └─────────────────┴─────────────────┴─────┴─────────────────┘

# File 'lib/polars/data_frame.rb', line 2310

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: true,
  include_boundaries: false,
  closed: "left",
  by: nil,
  start_by: "window"
)
  DynamicGroupBy.new(
    self,
    index_column,
    every,
    period,
    offset,
    truncate,
    include_boundaries,
    closed,
    by,
    start_by
  )
end
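
The index-column example above can be checked by hand with a plain-Ruby sketch of window membership. The helper `index_windows` is hypothetical: it takes the window start positions as an explicit argument (how the library chooses the first window is not modeled here) and only decides which values fall in each window for a given `period` and `closed` setting.

```ruby
# Returns [lower, upper, members] for each window start.
def index_windows(values, starts:, period:, closed: "left")
  starts.map do |lower|
    upper = lower + period
    members = values.select do |v|
      case closed
      when "left" then v >= lower && v < upper
      when "right" then v > lower && v <= upper
      when "both" then v >= lower && v <= upper
      else raise ArgumentError, "unsupported closed: #{closed}"
      end
    end
    [lower, upper, members]
  end
end

# every: "2i" gives starts two apart; period: "3i" gives windows of length 3.
index_windows((0..5).to_a, starts: [-2, 0, 2, 4], period: 3, closed: "right")
# => [[-2, 1, [0, 1]], [0, 3, [1, 2, 3]], [2, 5, [3, 4, 5]], [4, 7, [5]]]
```

Because the period (3) is longer than the step between starts (2), index 1, 3, and 5 each land in two windows, which is why a row can belong to more than one group.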

#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ `Series`

Hash and combine the rows in this DataFrame.

The hash value is of type :u64.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 3, 4],
    "ham" => ["a", "b", nil, "d"]
  }
)
df.hash_rows(seed: 42)
# =>
# shape: (4,)
# Series: '' [u64]
# [
#         4238614331852490969
#         17976148875586754089
#         4702262519505526977
#         18144177983981041107
# ]

# File 'lib/polars/data_frame.rb', line 5215

def hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil)
  k0 = seed
  k1 = seed_1.nil? ? seed : seed_1
  k2 = seed_2.nil? ? seed : seed_2
  k3 = seed_3.nil? ? seed : seed_3
  Utils.wrap_s(_df.hash_rows(k0, k1, k2, k3))
end

#head(n = 5) ⇒ `DataFrame`

Get the first n rows.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.head(3)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘



# File 'lib/polars/data_frame.rb', line 1801

def head(n = 5)
  _from_rbdf(_df.head(n))
end

#height ⇒ `Integer` Also known as: count, length, size

Get the height of the DataFrame.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3, 4, 5]})
df.height
# => 5



# File 'lib/polars/data_frame.rb', line 107

def height
  _df.height
end

#hstack(columns, in_place: false) ⇒ `DataFrame`

Return a new DataFrame grown horizontally by stacking multiple Series to it.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
x = Polars::Series.new("apple", [10, 20, 30])
df.hstack([x])
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ i64 ┆ str ┆ i64   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6   ┆ a   ┆ 10    │
# │ 2   ┆ 7   ┆ b   ┆ 20    │
# │ 3   ┆ 8   ┆ c   ┆ 30    │
# └─────┴─────┴─────┴───────┘

# File 'lib/polars/data_frame.rb', line 2873

def hstack(columns, in_place: false)
  if !columns.is_a?(::Array)
    columns = columns.get_columns
  end
  if in_place
    _df.hstack_mut(columns.map(&:_s))
    self
  else
    _from_rbdf(_df.hstack(columns.map(&:_s)))
  end
end

#include?(name) ⇒ `Boolean`

Check if DataFrame includes column.



# File 'lib/polars/data_frame.rb', line 340

def include?(name)
  columns.include?(name)
end

#insert_column(index, series) ⇒ `DataFrame` Also known as: insert_at_idx

Insert a Series at a certain column index. This operation is in place.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
s = Polars::Series.new("baz", [97, 98, 99])
df.insert_column(1, s)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ baz ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 97  ┆ 4   │
# │ 2   ┆ 98  ┆ 5   │
# │ 3   ┆ 99  ┆ 6   │
# └─────┴─────┴─────┘

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
s = Polars::Series.new("d", [-2.5, 15, 20.5, 0])
df.insert_column(3, s)
# =>
# shape: (4, 4)
# ┌─────┬──────┬───────┬──────┐
# │ a   ┆ b    ┆ c     ┆ d    │
# │ --- ┆ ---  ┆ ---   ┆ ---  │
# │ i64 ┆ f64  ┆ bool  ┆ f64  │
# ╞═════╪══════╪═══════╪══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ -2.5 │
# │ 2   ┆ 4.0  ┆ true  ┆ 15.0 │
# │ 3   ┆ 10.0 ┆ false ┆ 20.5 │
# │ 4   ┆ 13.0 ┆ true  ┆ 0.0  │
# └─────┴──────┴───────┴──────┘

# File 'lib/polars/data_frame.rb', line 1400

def insert_column(index, series)
  if index < 0
    index = columns.length + index
  end
  _df.insert_column(index, series._s)
  self
end

#interpolate ⇒ `DataFrame`

Interpolate intermediate values. The interpolation method is linear.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 9, 10],
    "bar" => [6, 7, 9, nil],
    "baz" => [1, nil, nil, 9]
  }
)
df.interpolate
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────────┐
# │ foo  ┆ bar  ┆ baz      │
# │ ---  ┆ ---  ┆ ---      │
# │ f64  ┆ f64  ┆ f64      │
# ╞══════╪══════╪══════════╡
# │ 1.0  ┆ 6.0  ┆ 1.0      │
# │ 5.0  ┆ 7.0  ┆ 3.666667 │
# │ 9.0  ┆ 9.0  ┆ 6.333333 │
# │ 10.0 ┆ null ┆ 9.0      │
# └──────┴──────┴──────────┘



# File 'lib/polars/data_frame.rb', line 5248

def interpolate
  select(F.col("*").interpolate)
end
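
The values in the example output follow from straight-line interpolation between the nearest known neighbors. The plain-Ruby sketch below reproduces that arithmetic for one column; `interpolate_linear` is a hypothetical helper, and it leaves leading and trailing nils untouched, matching the `null` that remains in the `bar` column above.

```ruby
# values: an Array of numbers with nil for missing entries.
def interpolate_linear(values)
  result = values.map { |v| v&.to_f }
  known = result.each_index.select { |i| result[i] }
  # Fill each gap between two consecutive known positions.
  known.each_cons(2) do |lo, hi|
    step = (result[hi] - result[lo]) / (hi - lo)
    ((lo + 1)...hi).each { |i| result[i] = result[lo] + step * (i - lo) }
  end
  result
end

interpolate_linear([1, nil, 9, 10])  # => [1.0, 5.0, 9.0, 10.0]
interpolate_linear([6, 7, 9, nil])   # => [6.0, 7.0, 9.0, nil]
```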

#is_duplicated ⇒ `Series`

Get a mask of all duplicated rows in this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 1],
    "b" => ["x", "y", "z", "x"],
  }
)
df.is_duplicated
# =>
# shape: (4,)
# Series: '' [bool]
# [
#         true
#         false
#         false
#         true
# ]



# File 'lib/polars/data_frame.rb', line 3910

def is_duplicated
  Utils.wrap_s(_df.is_duplicated)
end

#is_empty ⇒ `Boolean` Also known as: empty?

Check if the dataframe is empty.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.is_empty
# => false
df.filter(Polars.col("foo") > 99).is_empty
# => true



# File 'lib/polars/data_frame.rb', line 5262

def is_empty
  height == 0
end

#is_unique ⇒ `Series`

Get a mask of all unique rows in this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 1],
    "b" => ["x", "y", "z", "x"]
  }
)
df.is_unique
# =>
# shape: (4,)
# Series: '' [bool]
# [
#         false
#         true
#         true
#         false
# ]



# File 'lib/polars/data_frame.rb', line 3935

def is_unique
  Utils.wrap_s(_df.is_unique)
end
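
Both masks (this one and the one from is_duplicated) can be understood as a count of identical rows. The plain-Ruby sketch below shows the idea on rows stored as arrays; the helper names are hypothetical and this is not the library's implementation.

```ruby
# A row is "duplicated" if it occurs more than once, "unique" if exactly once.
def duplicate_mask(rows)
  counts = rows.tally
  rows.map { |row| counts[row] > 1 }
end

def unique_mask(rows)
  counts = rows.tally
  rows.map { |row| counts[row] == 1 }
end

rows = [[1, "x"], [2, "y"], [3, "z"], [1, "x"]]
duplicate_mask(rows)  # => [true, false, false, true]
unique_mask(rows)     # => [false, true, true, false]
```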

#item ⇒ `Object`

Return the dataframe as a scalar.

Equivalent to df[0,0], with a check that the shape is (1,1).

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => [4, 5, 6]})
result = df.select((Polars.col("a") * Polars.col("b")).sum)
result.item
# => 32

# File 'lib/polars/data_frame.rb', line 548

def item
  if shape != [1, 1]
    raise ArgumentError, "Can only call .item if the dataframe is of shape (1,1), dataframe is of shape #{shape}"
  end
  self[0, 0]
end

#iter_columns ⇒ `Object`

Note:

Consider whether you can use all instead. If you can, it will be more efficient.

Returns an iterator over the columns of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.iter_columns.map { |s| s.name }
# => ["a", "b"]

If you're using this to modify a dataframe's columns, e.g.

# Do NOT do this
Polars::DataFrame.new(df.iter_columns.map { |column| column * 2 })
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 4   │
# │ 6   ┆ 8   │
# │ 10  ┆ 12  │
# └─────┴─────┘

then consider whether you can use `all` instead:

df.select(Polars.all * 2)
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 4   │
# │ 6   ┆ 8   │
# │ 10  ┆ 12  │
# └─────┴─────┘

# File 'lib/polars/data_frame.rb', line 5107

def iter_columns
  return to_enum(:iter_columns) unless block_given?

  _df.get_columns.each do |s|
    yield Utils.wrap_s(s)
  end
end

#iter_rows(named: false, buffer_size: 500, &block) ⇒ `Object`

Returns an iterator over the DataFrame of rows of Ruby-native values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.iter_rows.map { |row| row[0] }
# => [1, 3, 5]

df.iter_rows(named: true).map { |row| row["b"] }
# => [2, 4, 6]

# File 'lib/polars/data_frame.rb', line 5010

def iter_rows(named: false, buffer_size: 500, &block)
  return to_enum(:iter_rows, named: named, buffer_size: buffer_size) unless block_given?

  # load into the local namespace for a modest performance boost in the hot loops
  columns = self.columns

  # note: buffering rows results in a 2-4x speedup over individual calls
  # to ".row(i)", so it should only be disabled in extremely specific cases.
  if buffer_size
    offset = 0
    while offset < height
      zerocopy_slice = slice(offset, buffer_size)
      rows_chunk = zerocopy_slice.rows(named: false)
      if named
        rows_chunk.each do |row|
          yield columns.zip(row).to_h
        end
      else
        rows_chunk.each(&block)
      end
      offset += buffer_size
    end
  elsif named
    height.times do |i|
      yield columns.zip(row(i)).to_h
    end
  else
    height.times do |i|
      yield row(i)
    end
  end
end

#iter_slices(n_rows: 10_000) ⇒ `Object`

Returns a non-copying iterator of slices over the underlying DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => 0...17_500,
    "b" => Date.new(2023, 1, 1),
    "c" => "klmnoopqrstuvwxyz"
  },
  schema_overrides: {"a" => Polars::Int32}
)
df.iter_slices.map.with_index do |frame, idx|
  "#{frame.class.name}:[#{idx}]:#{frame.length}"
end
# => ["Polars::DataFrame:[0]:10000", "Polars::DataFrame:[1]:7500"]

# File 'lib/polars/data_frame.rb', line 5135

def iter_slices(n_rows: 10_000)
  return to_enum(:iter_slices, n_rows: n_rows) unless block_given?

  offset = 0
  while offset < height
    yield slice(offset, n_rows)
    offset += n_rows
  end
end
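
The loop above advances an offset by `n_rows` until it reaches the height, so the last slice may be shorter than the rest. A plain-Ruby sketch of the resulting `[offset, length]` pairs follows; `slice_bounds` is a hypothetical helper that does not touch a DataFrame.

```ruby
# Returns the [offset, length] pair of each slice for a frame of the given height.
def slice_bounds(height, n_rows)
  (0...height).step(n_rows).map do |offset|
    [offset, [n_rows, height - offset].min]
  end
end

slice_bounds(17_500, 10_000)
# => [[0, 10000], [10000, 7500]]
```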

#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, coalesce: nil, maintain_order: nil) ⇒ `DataFrame`

Join in SQL-like fashion.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
other_df = Polars::DataFrame.new(
  {
    "apple" => ["x", "y", "z"],
    "ham" => ["a", "b", "d"]
  }
)
df.join(other_df, on: "ham")
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# └─────┴─────┴─────┴───────┘

df.join(other_df, on: "ham", how: "full")
# =>
# shape: (4, 5)
# ┌──────┬──────┬──────┬───────┬───────────┐
# │ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
# │ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
# │ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
# ╞══════╪══════╪══════╪═══════╪═══════════╡
# │ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
# │ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
# │ null ┆ null ┆ null ┆ z     ┆ d         │
# │ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
# └──────┴──────┴──────┴───────┴───────────┘

df.join(other_df, on: "ham", how: "left")
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# │ 3   ┆ 8.0 ┆ c   ┆ null  │
# └─────┴─────┴─────┴───────┘

df.join(other_df, on: "ham", how: "semi")
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6.0 ┆ a   │
# │ 2   ┆ 7.0 ┆ b   │
# └─────┴─────┴─────┘

df.join(other_df, on: "ham", how: "anti")
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘

# File 'lib/polars/data_frame.rb', line 2699

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  join_nulls: false,
  coalesce: nil,
  maintain_order: nil
)
  lazy
    .join(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      how: how,
      suffix: suffix,
      validate: validate,
      join_nulls: join_nulls,
      coalesce: coalesce,
      maintain_order: maintain_order
    )
    .collect(no_optimization: true)
end
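
The difference between the "inner", "semi", and "anti" results above can be sketched with plain Ruby hashes. `join_rows` is a hypothetical helper for illustration: it supports a single key column and ignores null handling, `suffix`, and `validate`.

```ruby
# left/right: Arrays of row Hashes; on: the shared key column name.
def join_rows(left, right, on:, how: "inner")
  index = right.group_by { |row| row[on] }
  case how
  when "inner"
    # One output row per matching pair, with the columns of both sides.
    left.flat_map { |l| index.fetch(l[on], []).map { |r| l.merge(r) } }
  when "semi"
    # Left rows that have at least one match; no columns from the right.
    left.select { |l| index.key?(l[on]) }
  when "anti"
    # Left rows without any match.
    left.reject { |l| index.key?(l[on]) }
  else
    raise ArgumentError, "unsupported how: #{how}"
  end
end

left = [{"foo" => 1, "ham" => "a"}, {"foo" => 2, "ham" => "b"}, {"foo" => 3, "ham" => "c"}]
right = [{"apple" => "x", "ham" => "a"}, {"apple" => "y", "ham" => "b"}, {"apple" => "z", "ham" => "d"}]
join_rows(left, right, on: "ham", how: "anti")
# => [{"foo" => 3, "ham" => "c"}]
```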

#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ `DataFrame`

Perform an asof join.

This is similar to a left join, except that rows are matched on the nearest key rather than on equal keys.

Both DataFrames must be sorted by the asof_join key.

For each row in the left DataFrame:

A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.

The default is "backward".

Examples:

gdp = Polars::DataFrame.new(
  {
    "date" => [
      DateTime.new(2016, 1, 1),
      DateTime.new(2017, 1, 1),
      DateTime.new(2018, 1, 1),
      DateTime.new(2019, 1, 1),
    ],  # note record date: Jan 1st (sorted!)
    "gdp" => [4164, 4411, 4566, 4696]
  }
).set_sorted("date")
population = Polars::DataFrame.new(
  {
    "date" => [
      DateTime.new(2016, 5, 12),
      DateTime.new(2017, 5, 12),
      DateTime.new(2018, 5, 12),
      DateTime.new(2019, 5, 12),
    ],  # note record date: May 12th (sorted!)
    "population" => [82.19, 82.66, 83.12, 83.52]
  }
).set_sorted("date")
population.join_asof(
  gdp, left_on: "date", right_on: "date", strategy: "backward"
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬──────┐
# │ date                ┆ population ┆ gdp  │
# │ ---                 ┆ ---        ┆ ---  │
# │ datetime[ns]        ┆ f64        ┆ i64  │
# ╞═════════════════════╪════════════╪══════╡
# │ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
# │ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
# │ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
# │ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
# └─────────────────────┴────────────┴──────┘

# File 'lib/polars/data_frame.rb', line 2533

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true,
  allow_exact_matches: true,
  check_sortedness: true
)
  lazy
    .join_asof(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      by_left: by_left,
      by_right: by_right,
      by: by,
      strategy: strategy,
      suffix: suffix,
      tolerance: tolerance,
      allow_parallel: allow_parallel,
      force_parallel: force_parallel,
      coalesce: coalesce,
      allow_exact_matches: allow_exact_matches,
      check_sortedness: check_sortedness
    )
    .collect(no_optimization: true)
end

#lazy ⇒ `LazyFrame`

Start a lazy query from this point.




# File 'lib/polars/data_frame.rb', line 3942

def lazy
  wrap_ldf(_df.lazy)
end

#limit(n = 5) ⇒ `DataFrame`

Get the first n rows.

Alias for #head.

Examples:

df = Polars::DataFrame.new(
  {"foo" => [1, 2, 3, 4, 5, 6], "bar" => ["a", "b", "c", "d", "e", "f"]}
)
df.limit(4)
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# │ 4   ┆ d   │
# └─────┴─────┘




# File 'lib/polars/data_frame.rb', line 1770

def limit(n = 5)
  head(n)
end

#map_rows(return_dtype: nil, inference_size: 256, &f) ⇒ `Object` Also known as: apply

Note:

The frame-level apply cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level apply syntax instead.

Apply a custom/user-defined function (UDF) over the rows of the DataFrame.

The UDF will receive each row as a tuple of values: udf(row).

Implementing logic using a Ruby function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:

The native expression engine runs in Rust; UDFs run in Ruby.
Use of Ruby UDFs forces the DataFrame to be materialized in memory.
Polars-native expressions can be parallelised (UDFs cannot).
Polars-native expressions can be logically optimised (UDFs cannot).

Wherever possible you should strongly prefer the native expression API to achieve the best performance.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [-1, 5, 8]})

Return a DataFrame by mapping each row to a tuple:

df.map_rows { |t| [t[0] * 2, t[1] * 3] }
# =>
# shape: (3, 2)
# ┌──────────┬──────────┐
# │ column_0 ┆ column_1 │
# │ ---      ┆ ---      │
# │ i64      ┆ i64      │
# ╞══════════╪══════════╡
# │ 2        ┆ -3       │
# │ 4        ┆ 15       │
# │ 6        ┆ 24       │
# └──────────┴──────────┘

Return a Series by mapping each row to a scalar:

df.map_rows { |t| t[0] * 2 + t[1] }
# =>
# shape: (3, 1)
# ┌─────┐
# │ map │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 9   │
# │ 14  │
# └─────┘
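
The two return shapes shown above can be modelled without Polars. This is only a sketch of what the block receives and returns; `map_rows_model` and `MAP_ROWS_DATA` are hypothetical names, and the real method additionally infers dtypes and builds a DataFrame.

```ruby
# Plain-Ruby model; no Polars calls are made.
# Each row is an Array of values in column order (foo, bar).
MAP_ROWS_DATA = [[1, -1], [2, 5], [3, 8]].freeze

# Hypothetical helper: apply the block to every row.
def map_rows_model(rows, &udf)
  rows.map(&udf)
end

# The block returns an Array: one output column per element.
p map_rows_model(MAP_ROWS_DATA) { |t| [t[0] * 2, t[1] * 3] }  # => [[2, -3], [4, 15], [6, 24]]

# The block returns a scalar: a single output column.
p map_rows_model(MAP_ROWS_DATA) { |t| t[0] * 2 + t[1] }       # => [1, 9, 14]
```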

# File 'lib/polars/data_frame.rb', line 2787

def map_rows(return_dtype: nil, inference_size: 256, &f)
  out, is_df = _df.map_rows(f, return_dtype, inference_size)
  if is_df
    _from_rbdf(out)
  else
    _from_rbdf(Utils.wrap_s(out).to_frame._df)
  end
end

#max ⇒ `DataFrame`

Aggregate the columns of this DataFrame to their maximum value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.max
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘




# File 'lib/polars/data_frame.rb', line 4202

def max
  lazy.max.collect(_eager: true)
end

#max_horizontal ⇒ `Series`

Get the maximum value horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.max_horizontal
# =>
# shape: (3,)
# Series: 'max' [f64]
# [
#         4.0
#         5.0
#         6.0
# ]




# File 'lib/polars/data_frame.rb', line 4226

def max_horizontal
  select(max: F.max_horizontal(F.all)).to_series
end

#mean ⇒ `DataFrame`

Aggregate the columns of this DataFrame to their mean value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.mean
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 2.0 ┆ 7.0 ┆ null │
# └─────┴─────┴──────┘




# File 'lib/polars/data_frame.rb', line 4358

def mean
  lazy.mean.collect(_eager: true)
end

#mean_horizontal(ignore_nulls: true) ⇒ `Series`

Take the mean of all values horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.mean_horizontal
# =>
# shape: (3,)
# Series: 'mean' [f64]
# [
#         2.5
#         3.5
#         4.5
# ]
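
The row-wise arithmetic behind this result can be sketched in plain Ruby. `mean_horizontal_model` is a hypothetical helper, and its handling of `ignore_nulls` (skip nils when true, return nil for any row containing a nil when false) is an assumption about the intended semantics rather than something stated on this page.

```ruby
# Plain-Ruby model of a horizontal mean; not the Polars implementation.
# `columns` is an Array of equally long column Arrays.
def mean_horizontal_model(columns, ignore_nulls: true)
  columns.transpose.map do |row|
    # Assumed semantics: with ignore_nulls: false, any nil makes the row nil.
    next nil if !ignore_nulls && row.include?(nil)

    present = row.compact
    present.empty? ? nil : present.sum.to_f / present.length
  end
end

p mean_horizontal_model([[1, 2, 3], [4.0, 5.0, 6.0]])  # => [2.5, 3.5, 4.5]
```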

# File 'lib/polars/data_frame.rb', line 4386

def mean_horizontal(ignore_nulls: true)
  select(
    mean: F.mean_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end

#median ⇒ `DataFrame`

Aggregate the columns of this DataFrame to their median value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.median
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 2.0 ┆ 7.0 ┆ null │
# └─────┴─────┴──────┘




# File 'lib/polars/data_frame.rb', line 4496

def median
  lazy.median.collect(_eager: true)
end

#merge_sorted(other, key) ⇒ `DataFrame`

Take two sorted DataFrames and merge them by the sorted key.

The output of this operation will also be sorted. It is the caller's responsibility to ensure that both frames are sorted by that key; otherwise the output will not make sense.

The schemas of both DataFrames must be equal.

Examples:

df0 = Polars::DataFrame.new(
  {"name" => ["steve", "elise", "bob"], "age" => [42, 44, 18]}
).sort("age")
df1 = Polars::DataFrame.new(
  {"name" => ["anna", "megan", "steve", "thomas"], "age" => [21, 33, 42, 20]}
).sort("age")
df0.merge_sorted(df1, "age")
# =>
# shape: (7, 2)
# ┌────────┬─────┐
# │ name   ┆ age │
# │ ---    ┆ --- │
# │ str    ┆ i64 │
# ╞════════╪═════╡
# │ bob    ┆ 18  │
# │ thomas ┆ 20  │
# │ anna   ┆ 21  │
# │ megan  ┆ 33  │
# │ steve  ┆ 42  │
# │ steve  ┆ 42  │
# │ elise  ┆ 44  │
# └────────┴─────┘
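
Why both inputs must already be sorted becomes clearer from a plain-Ruby model of the merge. This is a sketch of a standard two-pointer merge, not the Polars implementation; `merge_sorted_model` is a hypothetical helper, and taking the left row first on ties is an assumption.

```ruby
# Plain-Ruby model of merging two key-sorted row lists.
# `merge_sorted_model` is a hypothetical helper, not a Polars method.
def merge_sorted_model(left, right, key)
  out = []
  i = j = 0
  while i < left.length && j < right.length
    # Only the current heads are compared, so unsorted input yields
    # unsorted output. On ties the left row goes first (an assumption).
    if left[i][key] <= right[j][key]
      out << left[i]
      i += 1
    else
      out << right[j]
      j += 1
    end
  end
  # Append whatever remains of either input.
  out.concat(left[i..]).concat(right[j..])
end

df0 = [{"name" => "bob", "age" => 18}, {"name" => "steve", "age" => 42}, {"name" => "elise", "age" => 44}]
df1 = [{"name" => "thomas", "age" => 20}, {"name" => "anna", "age" => 21},
       {"name" => "megan", "age" => 33}, {"name" => "steve", "age" => 42}]

p merge_sorted_model(df0, df1, "age").map { |r| r["name"] }
# => ["bob", "thomas", "anna", "megan", "steve", "steve", "elise"]
```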




# File 'lib/polars/data_frame.rb', line 5377

def merge_sorted(other, key)
  lazy.merge_sorted(other.lazy, key).collect(_eager: true)
end

#min ⇒ `DataFrame`

Aggregate the columns of this DataFrame to their minimum value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.min
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘




# File 'lib/polars/data_frame.rb', line 4252

def min
  lazy.min.collect(_eager: true)
end

#min_horizontal ⇒ `Series`

Get the minimum value horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.min_horizontal
# =>
# shape: (3,)
# Series: 'min' [f64]
# [
#         1.0
#         2.0
#         3.0
# ]




# File 'lib/polars/data_frame.rb', line 4276

def min_horizontal
  select(min: F.min_horizontal(F.all)).to_series
end

#n_chunks(strategy: "first") ⇒ `Object`

Get number of chunks used by the ChunkedArrays of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
df.n_chunks
# => 1
df.n_chunks(strategy: "all")
# => [1, 1, 1]

# File 'lib/polars/data_frame.rb', line 4170

def n_chunks(strategy: "first")
  if strategy == "first"
    _df.n_chunks
  elsif strategy == "all"
    get_columns.map(&:n_chunks)
  else
    raise ArgumentError, "Strategy: '#{strategy}' not understood. Choose one of {'first', 'all'}"
  end
end

#n_unique(subset: nil) ⇒ `Integer`

Return the number of unique rows, or the number of unique row-subsets.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 1, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
    "c" => [true, true, true, false, true, true]
  }
)
df.n_unique
# => 5

Simple columns subset

df.n_unique(subset: ["b", "c"])
# => 4

Expression subset

df.n_unique(
  subset: [
    (Polars.col("a").floordiv(2)),
    (Polars.col("c") | (Polars.col("b") >= 2))
  ]
)
# => 3
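
The three counts above can be reproduced in plain Ruby, which may help when reasoning about what a subset does. This is a sketch only; `n_unique_model` and `N_UNIQUE_DATA` are hypothetical names, and the block plays the role of the column or expression subset.

```ruby
# Plain-Ruby model: count distinct rows, optionally after projecting each row.
# `n_unique_model` is a hypothetical helper, not a Polars method.
def n_unique_model(rows, &project)
  project ? rows.map(&project).uniq.length : rows.uniq.length
end

N_UNIQUE_DATA = [
  {a: 1, b: 0.5, c: true},
  {a: 1, b: 0.5, c: true},
  {a: 2, b: 1.0, c: true},
  {a: 3, b: 2.0, c: false},
  {a: 4, b: 3.0, c: true},
  {a: 5, b: 3.0, c: true}
].freeze

# Whole rows.
p n_unique_model(N_UNIQUE_DATA)                                             # => 5
# Column subset: only b and c are compared.
p n_unique_model(N_UNIQUE_DATA) { |r| [r[:b], r[:c]] }                      # => 4
# Derived values: a floor-divided by 2, and c OR (b >= 2).
p n_unique_model(N_UNIQUE_DATA) { |r| [r[:a] / 2, r[:c] || r[:b] >= 2] }    # => 3
```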

# File 'lib/polars/data_frame.rb', line 4669

def n_unique(subset: nil)
  if subset.is_a?(::String)
    subset = [Polars.col(subset)]
  elsif subset.is_a?(Expr)
    subset = [subset]
  end

  if subset.is_a?(::Array) && subset.length == 1
    expr = Utils.wrap_expr(Utils.parse_into_expression(subset[0], str_as_lit: false))
  else
    struct_fields = subset.nil? ? Polars.all : subset
    expr = Polars.struct(struct_fields)
  end

  df = lazy.select(expr.n_unique).collect
  df.is_empty ? 0 : df.row(0)[0]
end

#null_count ⇒ `DataFrame`

Create a new DataFrame that shows the null counts per column.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 3],
    "bar" => [6, 7, nil],
    "ham" => ["a", "b", "c"]
  }
)
df.null_count
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 1   ┆ 0   │
# └─────┴─────┴─────┘




# File 'lib/polars/data_frame.rb', line 4719

def null_count
  _from_rbdf(_df.null_count)
end

#partition_by(groups, maintain_order: true, include_key: true, as_dict: false) ⇒ `Object`

Split into multiple DataFrames partitioned by groups.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => ["A", "A", "B", "B", "C"],
    "N" => [1, 2, 2, 4, 2],
    "bar" => ["k", "l", "m", "m", "l"]
  }
)
df.partition_by("foo", maintain_order: true)
# =>
# [shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ A   ┆ 1   ┆ k   │
# │ A   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘, shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ B   ┆ 2   ┆ m   │
# │ B   ┆ 4   ┆ m   │
# └─────┴─────┴─────┘, shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ C   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘]

df.partition_by("foo", maintain_order: true, as_dict: true)
# =>
# {"A"=>shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ A   ┆ 1   ┆ k   │
# │ A   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘, "B"=>shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ B   ┆ 2   ┆ m   │
# │ B   ┆ 4   ┆ m   │
# └─────┴─────┴─────┘, "C"=>shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ C   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘}
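
The shape of the return value, including the key format used with `as_dict: true`, can be modelled in plain Ruby. This sketch mirrors the branching in the source listing (a scalar key for one grouping column, an Array key for several); `partition_by_model` is a hypothetical helper and only models `maintain_order: true`.

```ruby
# Plain-Ruby model of partitioning rows; not the Polars implementation.
def partition_by_model(rows, groups, as_dict: false)
  groups = Array(groups)
  # Enumerable#group_by keeps groups in order of first appearance,
  # which models maintain_order: true.
  parts = rows.group_by { |r| r.values_at(*groups) }
  return parts.values unless as_dict

  # One grouping column gives scalar keys; several give Array keys.
  groups.length == 1 ? parts.transform_keys(&:first) : parts
end

rows = [
  {"foo" => "A", "N" => 1, "bar" => "k"},
  {"foo" => "A", "N" => 2, "bar" => "l"},
  {"foo" => "B", "N" => 2, "bar" => "m"},
  {"foo" => "B", "N" => 4, "bar" => "m"},
  {"foo" => "C", "N" => 2, "bar" => "l"}
]

p partition_by_model(rows, "foo").map(&:length)                  # => [2, 2, 1]
p partition_by_model(rows, "foo", as_dict: true).keys            # => ["A", "B", "C"]
p partition_by_model(rows, ["foo", "bar"], as_dict: true).keys
# => [["A", "k"], ["A", "l"], ["B", "m"], ["C", "l"]]
```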

# File 'lib/polars/data_frame.rb', line 3783

def partition_by(groups, maintain_order: true, include_key: true, as_dict: false)
  if groups.is_a?(::String)
    groups = [groups]
  elsif !groups.is_a?(::Array)
    groups = Array(groups)
  end

  if as_dict
    out = {}
    if groups.length == 1
      _df.partition_by(groups, maintain_order, include_key).each do |df|
        df = _from_rbdf(df)
        out[df[groups][0, 0]] = df
      end
    else
      _df.partition_by(groups, maintain_order, include_key).each do |df|
        df = _from_rbdf(df)
        out[df[groups].row(0)] = df
      end
    end
    out
  else
    _df.partition_by(groups, maintain_order, include_key).map { |df| _from_rbdf(df) }
  end
end

#pipe(func, *args, **kwargs, &block) ⇒ `Object`

Note:

It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See #lazy.

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Examples:

cast_str_to_int = lambda do |data, col_name:|
  data.with_column(Polars.col(col_name).cast(:i64))
end

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => ["10", "20", "30", "40"]})
df.pipe(cast_str_to_int, col_name: "b")
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 10  │
# │ 2   ┆ 20  │
# │ 3   ┆ 30  │
# │ 4   ┆ 40  │
# └─────┴─────┘
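
Because `pipe` simply calls the given callable with the receiver as its first argument, several steps can be chained. The sketch below uses a stand-in class with the same one-line `pipe` body as the source listing; `Table`, `scale`, and `offset` are hypothetical names introduced for illustration and are not Polars objects.

```ruby
# Minimal stand-in class; `Table` is hypothetical, not a Polars class.
class Table
  attr_reader :values

  def initialize(values)
    @values = values
  end

  # Same body as the listing: forward self plus any arguments to the callable.
  def pipe(func, *args, **kwargs, &block)
    func.call(self, *args, **kwargs, &block)
  end
end

# One step takes a keyword argument, the other a positional one.
scale = ->(t, by:) { Table.new(t.values.map { |v| v * by }) }
offset = ->(t, amount) { Table.new(t.values.map { |v| v + amount }) }

# Each step returns a new Table, so calls chain left to right.
result = Table.new([1, 2, 3]).pipe(scale, by: 10).pipe(offset, 1)
p result.values  # => [11, 21, 31]
```

Any object that responds to `call` works here, including lambdas, procs, and `Method` objects.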




# File 'lib/polars/data_frame.rb', line 1902

def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end

#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ `DataFrame`

Create a spreadsheet-style pivot table as a DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => ["one", "one", "two", "two", "one", "two"],
    "bar" => ["y", "y", "y", "x", "x", "x"],
    "baz" => [1, 2, 3, 4, 5, 6]
  }
)
df.pivot("bar", index: "foo", values: "baz", aggregate_function: "sum")
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ y   ┆ x   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ one ┆ 3   ┆ 5   │
# │ two ┆ 3   ┆ 10  │
# └─────┴─────┴─────┘
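
The reshaping in the example above can be traced by hand with a plain-Ruby model of a sum pivot. This is a sketch only: `pivot_sum_model` is a hypothetical helper that supports a single `on`, `index`, and `values` column and hard-codes the "sum" aggregation.

```ruby
# Plain-Ruby model of pivot with aggregate_function: "sum".
# Returns {index_value => {on_value => summed_values}}.
def pivot_sum_model(rows, on:, index:, values:)
  out = {}
  rows.each do |r|
    # One inner hash per index value; missing cells start at 0.
    cell = (out[r[index]] ||= Hash.new(0))
    cell[r[on]] += r[values]
  end
  out
end

rows = [
  {"foo" => "one", "bar" => "y", "baz" => 1},
  {"foo" => "one", "bar" => "y", "baz" => 2},
  {"foo" => "two", "bar" => "y", "baz" => 3},
  {"foo" => "two", "bar" => "x", "baz" => 4},
  {"foo" => "one", "bar" => "x", "baz" => 5},
  {"foo" => "two", "bar" => "x", "baz" => 6}
]

table = pivot_sum_model(rows, on: "bar", index: "foo", values: "baz")
p table["one"].values_at("y", "x")  # => [3, 5]
p table["two"].values_at("y", "x")  # => [3, 10]
```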

# File 'lib/polars/data_frame.rb', line 3474

def pivot(
  on,
  index: nil,
  values: nil,
  aggregate_function: nil,
  maintain_order: true,
  sort_columns: false,
  separator: "_"
)
  index = Utils._expand_selectors(self, index)
  on = Utils._expand_selectors(self, on)
  if !values.nil?
    values = Utils._expand_selectors(self, values)
  end

  if aggregate_function.is_a?(::String)
    case aggregate_function
    when "first"
      aggregate_expr = F.element.first._rbexpr
    when "sum"
      aggregate_expr = F.element.sum._rbexpr
    when "max"
      aggregate_expr = F.element.max._rbexpr
    when "min"
      aggregate_expr = F.element.min._rbexpr
    when "mean"
      aggregate_expr = F.element.mean._rbexpr
    when "median"
      aggregate_expr = F.element.median._rbexpr
    when "last"
      aggregate_expr = F.element.last._rbexpr
    when "len"
      aggregate_expr = F.len._rbexpr
    when "count"
      warn "`aggregate_function: \"count\"` input for `pivot` is deprecated. Use `aggregate_function: \"len\"` instead."
      aggregate_expr = F.len._rbexpr
    else
      raise ArgumentError, "Argument aggregate fn: '#{aggregate_function}' was not expected."
    end
  elsif aggregate_function.nil?
    aggregate_expr = nil
  else
    aggregate_expr = aggregate_function._rbexpr
  end

  _from_rbdf(
    _df.pivot_expr(
      on,
      index,
      values,
      maintain_order,
      sort_columns,
      aggregate_expr,
      separator
    )
  )
end

#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ `Vega::LiteChart` Originally defined in module Plot

Plot data.

Raises:

(ArgumentError)

#product ⇒ `DataFrame`

Aggregate the columns of this DataFrame to their product values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3],
    "b" => [0.5, 4, 10],
    "c" => [true, true, false]
  }
)
df.product
# =>
# shape: (1, 3)
# ┌─────┬──────┬─────┐
# │ a   ┆ b    ┆ c   │
# │ --- ┆ ---  ┆ --- │
# │ i64 ┆ f64  ┆ i64 │
# ╞═════╪══════╪═════╡
# │ 6   ┆ 20.0 ┆ 0   │
# └─────┴──────┴─────┘




# File 'lib/polars/data_frame.rb', line 4522

def product
  select(Polars.all.product)
end

#quantile(quantile, interpolation: "nearest") ⇒ `DataFrame`

Aggregate the columns of this DataFrame to their quantile value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.quantile(0.5, interpolation: "nearest")
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 2.0 ┆ 7.0 ┆ null │
# └─────┴─────┴──────┘




# File 'lib/polars/data_frame.rb', line 4553

def quantile(quantile, interpolation: "nearest")
  lazy.quantile(quantile, interpolation: interpolation).collect(_eager: true)
end

#rechunk ⇒ `DataFrame`

Rechunk the data in this DataFrame into a contiguous allocation, so that subsequent operations have optimal and predictable performance.




# File 'lib/polars/data_frame.rb', line 4693

def rechunk
  _from_rbdf(_df.rechunk)
end

#rename(mapping, strict: true) ⇒ `DataFrame`

Rename column names.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.rename({"foo" => "apple"})
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ apple ┆ bar ┆ ham │
# │ ---   ┆ --- ┆ --- │
# │ i64   ┆ i64 ┆ str │
# ╞═══════╪═════╪═════╡
# │ 1     ┆ 6   ┆ a   │
# │ 2     ┆ 7   ┆ b   │
# │ 3     ┆ 8   ┆ c   │
# └───────┴─────┴─────┘




# File 'lib/polars/data_frame.rb', line 1349

def rename(mapping, strict: true)
  lazy.rename(mapping, strict: strict).collect(no_optimization: true)
end

#replace(column, new_col) ⇒ `DataFrame`

Replace a column by a new Series.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
s = Polars::Series.new([10, 20, 30])
df.replace("foo", s)
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 10  ┆ 4   │
# │ 20  ┆ 5   │
# │ 30  ┆ 6   │
# └─────┴─────┘

# File 'lib/polars/data_frame.rb', line 1703

def replace(column, new_col)
  _df.replace(column.to_s, new_col._s)
  self
end

#replace_column(index, series) ⇒ `DataFrame` Also known as: replace_at_idx

Replace a column at an index location.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
s = Polars::Series.new("apple", [10, 20, 30])
df.replace_column(0, s)
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ apple ┆ bar ┆ ham │
# │ ---   ┆ --- ┆ --- │
# │ i64   ┆ i64 ┆ str │
# ╞═══════╪═════╪═════╡
# │ 10    ┆ 6   ┆ a   │
# │ 20    ┆ 7   ┆ b   │
# │ 30    ┆ 8   ┆ c   │
# └───────┴─────┴─────┘

# File 'lib/polars/data_frame.rb', line 1569

def replace_column(index, series)
  if index < 0
    index = columns.length + index
  end
  _df.replace_column(index, series._s)
  self
end

#reverse ⇒ `DataFrame`

Reverse the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "key" => ["a", "b", "c"],
    "val" => [1, 2, 3]
  }
)
df.reverse
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ key ┆ val │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ c   ┆ 3   │
# │ b   ┆ 2   │
# │ a   ┆ 1   │
# └─────┴─────┘




# File 'lib/polars/data_frame.rb', line 1314

def reverse
  select(Polars.col("*").reverse)
end

#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ `RollingGroupBy` Also known as: groupby_rolling, group_by_rolling

Create rolling groups based on a time column.

Also works for index values of type :i32 or :i64.

Unlike group_by_dynamic, the windows here are determined by the individual values rather than by constant intervals. For constant intervals, use group_by_dynamic.

The period and offset arguments are created either from a timedelta, or by using the following string language:

1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a group_by_rolling on an integer column, the windows are defined by:

"1i" # length 1
"10i" # length 10

Examples:

dates = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
]
df = Polars::DataFrame.new({"dt" => dates, "a" => [3, 7, 5, 9, 2, 1]}).with_column(
  Polars.col("dt").str.strptime(Polars::Datetime).set_sorted
)
df.rolling(index_column: "dt", period: "2d").agg(
  [
    Polars.sum("a").alias("sum_a"),
    Polars.min("a").alias("min_a"),
    Polars.max("a").alias("max_a")
  ]
)
# =>
# shape: (6, 4)
# ┌─────────────────────┬───────┬───────┬───────┐
# │ dt                  ┆ sum_a ┆ min_a ┆ max_a │
# │ ---                 ┆ ---   ┆ ---   ┆ ---   │
# │ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
# ╞═════════════════════╪═══════╪═══════╪═══════╡
# │ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
# │ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
# │ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
# │ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
# │ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
# │ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
# └─────────────────────┴───────┴───────┴───────┘
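
The `sum_a` column above can be checked with a plain-Ruby model of the windows. This sketch assumes that `period: "2d"` with `closed: "right"` and no explicit offset gives each row at time t the window (t - 2 days, t]; `rolling_sum_model` and `TWO_DAYS` are hypothetical names, and the real method returns a `RollingGroupBy` rather than computing sums directly.

```ruby
# Plain-Ruby model of per-row rolling windows; not the Polars implementation.
TWO_DAYS = 2 * 24 * 60 * 60 # seconds

def rolling_sum_model(times, values, period)
  times.map do |t|
    lower = t - period
    # Left edge open, right edge closed: lower < u <= t.
    times.zip(values).sum { |u, v| u > lower && u <= t ? v : 0 }
  end
end

times = [
  Time.utc(2020, 1, 1, 13, 45, 48),
  Time.utc(2020, 1, 1, 16, 42, 13),
  Time.utc(2020, 1, 1, 16, 45, 9),
  Time.utc(2020, 1, 2, 18, 12, 48),
  Time.utc(2020, 1, 3, 19, 45, 32),
  Time.utc(2020, 1, 8, 23, 16, 43)
]

p rolling_sum_model(times, [3, 7, 5, 9, 2, 1], TWO_DAYS)
# => [3, 10, 15, 24, 11, 1]
```

Each row gets its own window anchored at its own timestamp, which is why the number of output rows equals the number of input rows.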

# File 'lib/polars/data_frame.rb', line 2067

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  by: nil
)
  RollingGroupBy.new(self, index_column, period, offset, closed, by)
end

#row(index = nil, by_predicate: nil, named: false) ⇒ `Object`

Note:

The index and by_predicate params are mutually exclusive. Additionally, to ensure clarity, the by_predicate parameter must be supplied by keyword.

When using by_predicate it is an error condition if anything other than one row is returned; more than one row raises TooManyRowsReturned, and zero rows will raise NoRowsReturned (both inherit from RowsException).

Get a row as a tuple, either by index or by predicate.

Examples:

Return the row at the given index

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.row(2)
# => [3, 8, "c"]

Get a hash instead with a mapping of column names to row values

df.row(2, named: true)
# => {"foo"=>3, "bar"=>8, "ham"=>"c"}

Return the row that matches the given predicate

df.row(by_predicate: Polars.col("ham") == "b")
# => [2, 7, "b"]
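
The "exactly one row" rule for `by_predicate` can be modelled in plain Ruby. This is a sketch of the control flow only; `row_by_predicate_model`, `TooManyRowsModel`, and `NoRowsModel` are hypothetical stand-ins for the method and the exception classes named in the note above.

```ruby
# Hypothetical stand-ins for TooManyRowsReturned and NoRowsReturned.
class TooManyRowsModel < StandardError; end
class NoRowsModel < StandardError; end

# Plain-Ruby model: return the single matching row, raise otherwise.
def row_by_predicate_model(rows, &predicate)
  matches = rows.select(&predicate)
  raise TooManyRowsModel, "predicate returned #{matches.length} rows" if matches.length > 1
  raise NoRowsModel, "predicate returned no rows" if matches.empty?

  matches.first
end

rows = [[1, 6, "a"], [2, 7, "b"], [3, 8, "c"]]

p row_by_predicate_model(rows) { |r| r[2] == "b" }  # => [2, 7, "b"]

begin
  row_by_predicate_model(rows) { |r| r[0] > 1 }     # two rows match
rescue TooManyRowsModel => e
  puts e.message                                    # prints: predicate returned 2 rows
end
```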

# File 'lib/polars/data_frame.rb', line 4914

def row(index = nil, by_predicate: nil, named: false)
  if !index.nil? && !by_predicate.nil?
    raise ArgumentError, "Cannot set both 'index' and 'by_predicate'; mutually exclusive"
  elsif index.is_a?(Expr)
    raise TypeError, "Expressions should be passed to the 'by_predicate' param"
  end

  if !index.nil?
    row = _df.row_tuple(index)
    if named
      columns.zip(row).to_h
    else
      row
    end
  elsif !by_predicate.nil?
    if !by_predicate.is_a?(Expr)
      raise TypeError, "Expected by_predicate to be an expression; found #{by_predicate.class.name}"
    end
    rows = filter(by_predicate).rows
    n_rows = rows.length
    if n_rows > 1
      raise TooManyRowsReturned, "Predicate #{by_predicate} returned #{n_rows} rows"
    elsif n_rows == 0
      raise NoRowsReturned, "Predicate #{by_predicate} returned no rows"
    end
    row = rows[0]
    if named
      columns.zip(row).to_h
    else
      row
    end
  else
    raise ArgumentError, "One of 'index' or 'by_predicate' must be set"
  end
end

#rows(named: false) ⇒ `Array`

Convert columnar data to rows as Ruby arrays.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.rows
# => [[1, 2], [3, 4], [5, 6]]

df.rows(named: true)
# => [{"a"=>1, "b"=>2}, {"a"=>3, "b"=>4}, {"a"=>5, "b"=>6}]

# File 'lib/polars/data_frame.rb', line 4971

def rows(named: false)
  if named
    columns = self.columns
    _df.row_tuples.map do |v|
      columns.zip(v).to_h
    end
  else
    _df.row_tuples
  end
end

#sample(n: nil, frac: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ `DataFrame`

Sample from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.sample(n: 2, seed: 0)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8   ┆ c   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘

# File 'lib/polars/data_frame.rb', line 4759

def sample(
  n: nil,
  frac: nil,
  with_replacement: false,
  shuffle: false,
  seed: nil
)
  if !n.nil? && !frac.nil?
    raise ArgumentError, "cannot specify both `n` and `frac`"
  end

  if n.nil? && !frac.nil?
    frac = Series.new("frac", [frac]) unless frac.is_a?(Series)

    return _from_rbdf(
      _df.sample_frac(frac._s, with_replacement, shuffle, seed)
    )
  end

  if n.nil?
    n = 1
  end

  n = Series.new("", [n]) unless n.is_a?(Series)

  _from_rbdf(_df.sample_n(n._s, with_replacement, shuffle, seed))
end

#schema ⇒ `Hash`

Get the schema.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.schema
# => {"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String}




# File 'lib/polars/data_frame.rb', line 216

def schema
  columns.zip(dtypes).to_h
end

#select(*exprs, **named_exprs) ⇒ `DataFrame`

Select columns from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.select("foo")
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘

df.select(["foo", "bar"])
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 6   │
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# └─────┴─────┘

df.select(Polars.col("foo") + 1)
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 2   │
# │ 3   │
# │ 4   │
# └─────┘

df.select([Polars.col("foo") + 1, Polars.col("bar") + 1])
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# │ 4   ┆ 9   │
# └─────┴─────┘

df.select(Polars.when(Polars.col("foo") > 2).then(10).otherwise(0))
# =>
# shape: (3, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ i32     │
# ╞═════════╡
# │ 0       │
# │ 0       │
# │ 10      │
# └─────────┘




# File 'lib/polars/data_frame.rb', line 4034

def select(*exprs, **named_exprs)
  lazy.select(*exprs, **named_exprs).collect(_eager: true)
end

#set_sorted(column, descending: false) ⇒ `DataFrame`

Note:

This can lead to incorrect results if the data is NOT sorted! Use with care!

Flag a column as sorted.

This can speed up future operations.

# File 'lib/polars/data_frame.rb', line 5394

def set_sorted(
  column,
  descending: false
)
  lazy
    .set_sorted(column, descending: descending)
    .collect(no_optimization: true)
end

#shape ⇒ `Array`

Get the shape of the DataFrame.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3, 4, 5]})
df.shape
# => [5, 1]




# File 'lib/polars/data_frame.rb', line 95

def shape
  _df.shape
end

#shift(n, fill_value: nil) ⇒ `DataFrame`

Shift values by the given period.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.shift(1)
# =>
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 1    ┆ 6    ┆ a    │
# │ 2    ┆ 7    ┆ b    │
# └──────┴──────┴──────┘

df.shift(-1)
# =>
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ 2    ┆ 7    ┆ b    │
# │ 3    ┆ 8    ┆ c    │
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘
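
The effect of the sign of `n` and of `fill_value` on a single column can be modelled in plain Ruby. `shift_model` is a hypothetical helper used only to illustrate the semantics shown in the two examples above.

```ruby
# Plain-Ruby model of shifting one column; not the Polars implementation.
def shift_model(column, n, fill_value: nil)
  return column.dup if n.zero?

  # Never pad with more values than the column holds.
  pad = [fill_value] * [n.abs, column.length].min
  if n > 0
    # Positive n: values move down, padding appears at the top.
    pad + column[0, column.length - pad.length]
  else
    # Negative n: values move up, padding appears at the bottom.
    column[pad.length..] + pad
  end
end

p shift_model([1, 2, 3], 1)                 # => [nil, 1, 2]
p shift_model([1, 2, 3], -1)                # => [2, 3, nil]
p shift_model([1, 2, 3], 1, fill_value: 0)  # => [0, 1, 2]
```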




# File 'lib/polars/data_frame.rb', line 3852

def shift(n, fill_value: nil)
  lazy.shift(n, fill_value: fill_value).collect(_eager: true)
end

#shift_and_fill(periods, fill_value) ⇒ `DataFrame`

Shift the values by a given period and fill the resulting null values.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.shift_and_fill(1, 0)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 0   ┆ 0   ┆ 0   │
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘




# File 'lib/polars/data_frame.rb', line 3885

def shift_and_fill(periods, fill_value)
  shift(periods, fill_value: fill_value)
end

#shrink_to_fit(in_place: false) ⇒ `DataFrame`

Shrink DataFrame memory usage.

Shrinks to fit the exact capacity needed to hold the data.

# File 'lib/polars/data_frame.rb', line 5150

def shrink_to_fit(in_place: false)
  if in_place
    _df.shrink_to_fit
    self
  else
    df = clone
    df._df.shrink_to_fit
    df
  end
end

#slice(offset, length = nil) ⇒ `DataFrame`

Get a slice of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.slice(1, 2)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 7.0 ┆ b   │
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘
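
The source listing also accepts a negative `length`, which it converts with `height - offset + length`. A plain-Ruby model of that adjustment, assuming a non-negative `offset`, is sketched below; `slice_model` is a hypothetical helper.

```ruby
# Plain-Ruby model of slice on an Array of rows; assumes offset >= 0.
def slice_model(rows, offset, length = nil)
  # Same adjustment as the source listing: a negative length counts
  # back from the end of the frame.
  length = rows.length - offset + length if !length.nil? && length < 0
  length.nil? ? rows[offset..] : rows[offset, length]
end

rows = [[1, 6.0, "a"], [2, 7.0, "b"], [3, 8.0, "c"]]

p slice_model(rows, 1, 2)   # => [[2, 7.0, "b"], [3, 8.0, "c"]]
p slice_model(rows, 0, -1)  # => [[1, 6.0, "a"], [2, 7.0, "b"]]
p slice_model(rows, 2)      # => [[3, 8.0, "c"]]
```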

# File 'lib/polars/data_frame.rb', line 1737

def slice(offset, length = nil)
  if !length.nil? && length < 0
    length = height - offset + length
  end
  _from_rbdf(_df.slice(offset, length))
end

#sort(by, reverse: false, nulls_last: false) ⇒ `DataFrame`

Sort the DataFrame by column.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.sort("foo", reverse: true)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘

Sort by multiple columns.

df.sort(
  [Polars.col("foo"), Polars.col("bar")**2],
  reverse: [true, false]
)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘

# File 'lib/polars/data_frame.rb', line 1626

def sort(by, reverse: false, nulls_last: false)
  lazy
    .sort(by, reverse: reverse, nulls_last: nulls_last)
    .collect(no_optimization: true)
end

#sort!(by, reverse: false, nulls_last: false) ⇒ `DataFrame`

Sort the DataFrame by column in-place.



1642
1643
1644

# File 'lib/polars/data_frame.rb', line 1642

def sort!(by, reverse: false, nulls_last: false)
  self._df = sort(by, reverse: reverse, nulls_last: nulls_last)._df
end

#std(ddof: 1) ⇒ `DataFrame`

Aggregate the columns of this DataFrame to their standard deviation value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.std
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 1.0 ┆ 1.0 ┆ null │
# └─────┴─────┴──────┘

df.std(ddof: 0)
# =>
# shape: (1, 3)
# ┌──────────┬──────────┬──────┐
# │ foo      ┆ bar      ┆ ham  │
# │ ---      ┆ ---      ┆ ---  │
# │ f64      ┆ f64      ┆ str  │
# ╞══════════╪══════════╪══════╡
# │ 0.816497 ┆ 0.816497 ┆ null │
# └──────────┴──────────┴──────┘
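
The two results differ only in the divisor, n - ddof. A plain-Ruby version of the standard formula reproduces both values for the "foo" column; `std_model` is a hypothetical helper, not the Polars implementation.

```ruby
# Standard deviation with a configurable delta degrees of freedom:
# sqrt(sum((x - mean)^2) / (n - ddof)).
def std_model(values, ddof: 1)
  n = values.length
  mean = values.sum.to_f / n
  sum_sq = values.sum { |v| (v - mean)**2 }
  Math.sqrt(sum_sq / (n - ddof))
end

p std_model([1, 2, 3])                    # => 1.0
p std_model([1, 2, 3], ddof: 0).round(6)  # => 0.816497
```

With ddof: 1 the sum of squared deviations (2.0) is divided by 2, and with ddof: 0 it is divided by 3.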




# File 'lib/polars/data_frame.rb', line 4429

def std(ddof: 1)
  lazy.std(ddof: ddof).collect(_eager: true)
end

#sum ⇒ `DataFrame`

Aggregate the columns of this DataFrame to their sum value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"],
  }
)
df.sum
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ i64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 6   ┆ 21  ┆ null │
# └─────┴─────┴──────┘




# File 'lib/polars/data_frame.rb', line 4302

def sum
  lazy.sum.collect(_eager: true)
end

#sum_horizontal(ignore_nulls: true) ⇒ `Series`

Sum all values horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.sum_horizontal
# =>
# shape: (3,)
# Series: 'sum' [f64]
# [
#         5.0
#         7.0
#         9.0
# ]

# File 'lib/polars/data_frame.rb', line 4330

def sum_horizontal(ignore_nulls: true)
  select(
    sum: F.sum_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end

#tail(n = 5) ⇒ `DataFrame`

Get the last n rows.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.tail(3)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8   ┆ c   │
# │ 4   ┆ 9   ┆ d   │
# │ 5   ┆ 10  ┆ e   │
# └─────┴─────┴─────┘



# File 'lib/polars/data_frame.rb', line 1832

def tail(n = 5)
  _from_rbdf(_df.tail(n))
end

#to_a ⇒ `Array`

Returns an array of hashes, one per row, representing the DataFrame.



# File 'lib/polars/data_frame.rb', line 333

def to_a
  rows(named: true)
end

#to_csv(**options) ⇒ `String`

Write the DataFrame to a comma-separated values (CSV) string.



# File 'lib/polars/data_frame.rb', line 827

def to_csv(**options)
  write_csv(**options)
end

#to_dummies(columns: nil, separator: "_", drop_first: false) ⇒ `DataFrame`

Get one hot encoded dummy variables.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2],
    "bar" => [3, 4],
    "ham" => ["a", "b"]
  }
)
df.to_dummies
# =>
# shape: (2, 6)
# ┌───────┬───────┬───────┬───────┬───────┬───────┐
# │ foo_1 ┆ foo_2 ┆ bar_3 ┆ bar_4 ┆ ham_a ┆ ham_b │
# │ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
# │ u8    ┆ u8    ┆ u8    ┆ u8    ┆ u8    ┆ u8    │
# ╞═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
# │ 1     ┆ 0     ┆ 1     ┆ 0     ┆ 1     ┆ 0     │
# │ 0     ┆ 1     ┆ 0     ┆ 1     ┆ 0     ┆ 1     │
# └───────┴───────┴───────┴───────┴───────┴───────┘

# File 'lib/polars/data_frame.rb', line 4584

def to_dummies(columns: nil, separator: "_", drop_first: false)
  if columns.is_a?(::String)
    columns = [columns]
  end
  _from_rbdf(_df.to_dummies(columns, separator, drop_first))
end
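To make the naming scheme and `drop_first` concrete, here is a plain-Ruby sketch of one-hot encoding a single column. `one_hot` is a hypothetical helper; it orders categories by first appearance, which may differ from the ordering Polars uses.

```ruby
# Hypothetical helper: one-hot encode one column into a hash of 0/1 arrays.
def one_hot(name, values, separator: "_", drop_first: false)
  categories = values.uniq                       # assumption: first-appearance order
  categories = categories.drop(1) if drop_first  # omit the first category
  categories.to_h do |category|
    ["#{name}#{separator}#{category}", values.map { |v| v == category ? 1 : 0 }]
  end
end

one_hot("ham", ["a", "b"])
# => {"ham_a"=>[1, 0], "ham_b"=>[0, 1]}
one_hot("ham", ["a", "b"], drop_first: true)
# => {"ham_b"=>[0, 1]}
```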

#to_h(as_series: true) ⇒ `Hash`

Convert DataFrame to a hash mapping column name to values.

# File 'lib/polars/data_frame.rb', line 560

def to_h(as_series: true)
  if as_series
    get_columns.to_h { |s| [s.name, s] }
  else
    get_columns.to_h { |s| [s.name, s.to_a] }
  end
end

#to_hashes ⇒ `Array`

Convert every row to a hash.

Note that this is slow.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.to_hashes
# =>
# [{"foo"=>1, "bar"=>4}, {"foo"=>2, "bar"=>5}, {"foo"=>3, "bar"=>6}]

# File 'lib/polars/data_frame.rb', line 579

def to_hashes
  rbdf = _df
  names = columns

  height.times.map do |i|
    names.zip(rbdf.row_tuple(i)).to_h
  end
end
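The row-by-row conversion above relies on `zip` and `to_h`. A minimal plain-Ruby sketch of that step, where `row_tuples` is a stand-in for the values returned by `rbdf.row_tuple(i)`:

```ruby
names = ["foo", "bar"]
# Stand-in for the per-row tuples fetched from the underlying frame.
row_tuples = [[1, 4], [2, 5], [3, 6]]

# Pair each column name with the value at the same position, then build a hash.
hashes = row_tuples.map { |tuple| names.zip(tuple).to_h }
hashes.first # => {"foo"=>1, "bar"=>4}
```

Because one hash is built per row, the cost grows with the number of rows, which is consistent with the note that this method is slow.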

#to_numo ⇒ `Numo::NArray`

Convert DataFrame to a 2D Numo array.

This operation clones data.

Examples:

df = Polars::DataFrame.new(
  {"foo" => [1, 2, 3], "bar" => [6, 7, 8], "ham" => ["a", "b", "c"]}
)
df.to_numo.class
# => Numo::RObject

# File 'lib/polars/data_frame.rb', line 600

def to_numo
  out = _df.to_numo
  if out.nil?
    Numo::NArray.vstack(width.times.map { |i| to_series(i).to_numo }).transpose
  else
    out
  end
end

#to_s ⇒ `String` Also known as: inspect

Returns a string representing the DataFrame.



# File 'lib/polars/data_frame.rb', line 325

def to_s
  _df.to_s
end

#to_series(index = 0) ⇒ `Series`

Select column as Series at index location.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.to_series(1)
# =>
# shape: (3,)
# Series: 'bar' [i64]
# [
#         6
#         7
#         8
# ]

# File 'lib/polars/data_frame.rb', line 635

def to_series(index = 0)
  if index < 0
    index = columns.length + index
  end
  Utils.wrap_s(_df.select_at_idx(index))
end
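Negative indexes count from the last column, as the source above shows. A small sketch of that arithmetic, using a hypothetical `resolve_index` helper and a plain array in place of the column list:

```ruby
# Hypothetical helper mirroring the index arithmetic in to_series.
def resolve_index(index, length)
  index < 0 ? length + index : index
end

columns = ["foo", "bar", "ham"]
columns[resolve_index(1, columns.length)]  # => "bar"
columns[resolve_index(-1, columns.length)] # => "ham"
```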

#to_struct(name) ⇒ `Series`

Convert a DataFrame to a Series of type Struct.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4, 5],
    "b" => ["one", "two", "three", "four", "five"]
  }
)
df.to_struct("nums")
# =>
# shape: (5,)
# Series: 'nums' [struct[2]]
# [
#         {1,"one"}
#         {2,"two"}
#         {3,"three"}
#         {4,"four"}
#         {5,"five"}
# ]



# File 'lib/polars/data_frame.rb', line 5292

def to_struct(name)
  Utils.wrap_s(_df.to_struct(name))
end

#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ `DataFrame`

Note:

This is a very expensive operation. Perhaps you can do it differently.

Transpose a DataFrame over the diagonal.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => [1, 2, 3]})
df.transpose(include_header: true)
# =>
# shape: (2, 4)
# ┌────────┬──────────┬──────────┬──────────┐
# │ column ┆ column_0 ┆ column_1 ┆ column_2 │
# │ ---    ┆ ---      ┆ ---      ┆ ---      │
# │ str    ┆ i64      ┆ i64      ┆ i64      │
# ╞════════╪══════════╪══════════╪══════════╡
# │ a      ┆ 1        ┆ 2        ┆ 3        │
# │ b      ┆ 1        ┆ 2        ┆ 3        │
# └────────┴──────────┴──────────┴──────────┘

Replace the auto-generated column names with a list.

df.transpose(include_header: false, column_names: ["a", "b", "c"])
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 2   ┆ 3   │
# │ 1   ┆ 2   ┆ 3   │
# └─────┴─────┴─────┘

Include the header as a separate column.

df.transpose(
  include_header: true, header_name: "foo", column_names: ["a", "b", "c"]
)
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬─────┐
# │ foo ┆ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╡
# │ a   ┆ 1   ┆ 2   ┆ 3   │
# │ b   ┆ 1   ┆ 2   ┆ 3   │
# └─────┴─────┴─────┴─────┘

# File 'lib/polars/data_frame.rb', line 1286

def transpose(include_header: false, header_name: "column", column_names: nil)
  keep_names_as = include_header ? header_name : nil
  _from_rbdf(_df.transpose(keep_names_as, column_names))
end
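The interaction of `include_header`, `header_name`, and `column_names` can be sketched on a plain hash of columns. `transpose_table` is a hypothetical helper that only imitates the naming behavior shown in the examples above.

```ruby
# Hypothetical helper: `table` maps column names to equal-length arrays.
def transpose_table(table, include_header: false, header_name: "column", column_names: nil)
  height = table.values.first.length
  # Default names follow the column_0, column_1, ... pattern from the examples.
  column_names ||= Array.new(height) { |i| "column_#{i}" }

  result = {}
  result[header_name] = table.keys if include_header
  column_names.each_with_index do |name, i|
    result[name] = table.values.map { |column| column[i] }
  end
  result
end

table = { "a" => [1, 2, 3], "b" => [4, 5, 6] }
transpose_table(table, include_header: true)
# => {"column"=>["a", "b"], "column_0"=>[1, 4], "column_1"=>[2, 5], "column_2"=>[3, 6]}
```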

#unique(maintain_order: true, subset: nil, keep: "first") ⇒ `DataFrame`

Note:

This fails if there is a column of type List in the DataFrame or subset.

Drop duplicate rows from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 1, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
    "c" => [true, true, true, false, true, true]
  }
)
df.unique
# =>
# shape: (5, 3)
# ┌─────┬─────┬───────┐
# │ a   ┆ b   ┆ c     │
# │ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ bool  │
# ╞═════╪═════╪═══════╡
# │ 1   ┆ 0.5 ┆ true  │
# │ 2   ┆ 1.0 ┆ true  │
# │ 3   ┆ 2.0 ┆ false │
# │ 4   ┆ 3.0 ┆ true  │
# │ 5   ┆ 3.0 ┆ true  │
# └─────┴─────┴───────┘

# File 'lib/polars/data_frame.rb', line 4629

def unique(maintain_order: true, subset: nil, keep: "first")
  self._from_rbdf(
    lazy
      .unique(maintain_order: maintain_order, subset: subset, keep: keep)
      .collect(no_optimization: true)
      ._df
  )
end
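A plain-Ruby sketch of how `subset` and `keep` interact, using an array of row hashes. `unique_rows` is a hypothetical helper; the `"last"` branch reflects my understanding of the upstream Polars option and is not shown in the source above.

```ruby
# Hypothetical helper: rows are hashes, subset is an optional list of keys.
def unique_rows(rows, subset: nil, keep: "first")
  # Rows are compared on the subset columns only, or on the whole row.
  key = ->(row) { subset ? row.values_at(*subset) : row }
  case keep
  when "first" then rows.uniq(&key)
  when "last" then rows.reverse.uniq(&key).reverse
  else raise ArgumentError, "unsupported keep: #{keep.inspect}"
  end
end

rows = [
  { "a" => 1, "b" => 0.5 },
  { "a" => 1, "b" => 0.5 },
  { "a" => 4, "b" => 3.0 },
  { "a" => 5, "b" => 3.0 }
]
unique_rows(rows).length                                     # => 3
unique_rows(rows, subset: ["b"]).map { |r| r["a"] }          # => [1, 4]
unique_rows(rows, subset: ["b"], keep: "last").map { |r| r["a"] } # => [1, 5]
```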

#unnest(names) ⇒ `DataFrame`

Decompose a struct into its fields.

The fields are inserted into the DataFrame at the position of the struct column.

Examples:

df = Polars::DataFrame.new(
  {
    "before" => ["foo", "bar"],
    "t_a" => [1, 2],
    "t_b" => ["a", "b"],
    "t_c" => [true, nil],
    "t_d" => [[1, 2], [3]],
    "after" => ["baz", "womp"]
  }
).select(["before", Polars.struct(Polars.col("^t_.$")).alias("t_struct"), "after"])
df.unnest("t_struct")
# =>
# shape: (2, 6)
# ┌────────┬─────┬─────┬──────┬───────────┬───────┐
# │ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
# │ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
# │ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
# ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
# │ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
# │ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
# └────────┴─────┴─────┴──────┴───────────┴───────┘

# File 'lib/polars/data_frame.rb', line 5328

def unnest(names)
  if names.is_a?(::String)
    names = [names]
  end
  _from_rbdf(_df.unnest(names))
end

#unpivot(on, index: nil, variable_name: nil, value_name: nil) ⇒ `DataFrame` Also known as: melt

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This is useful for reshaping a DataFrame so that one or more columns act as identifier variables (index), while the remaining columns, treated as measured variables (on), are "unpivoted" onto the row axis. The result has just two non-identifier columns, 'variable' and 'value'.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
)
df.unpivot(Polars.cs.numeric, index: "a")
# =>
# shape: (6, 3)
# ┌─────┬──────────┬───────┐
# │ a   ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ x   ┆ b        ┆ 1     │
# │ y   ┆ b        ┆ 3     │
# │ z   ┆ b        ┆ 5     │
# │ x   ┆ c        ┆ 2     │
# │ y   ┆ c        ┆ 4     │
# │ z   ┆ c        ┆ 6     │
# └─────┴──────────┴───────┘

# File 'lib/polars/data_frame.rb', line 3576

def unpivot(on, index: nil, variable_name: nil, value_name: nil)
  on = on.nil? ? [] : Utils._expand_selectors(self, on)
  index = index.nil? ? [] : Utils._expand_selectors(self, index)

  _from_rbdf(_df.unpivot(on, index, value_name, variable_name))
end
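The wide-to-long reshaping can be sketched in plain Ruby. `unpivot_rows` is a hypothetical helper that works on a hash of columns and returns one hash per output row, in the same order as the example output above.

```ruby
# Hypothetical helper: `table` maps column names to equal-length arrays.
def unpivot_rows(table, on:, index:, variable_name: "variable", value_name: "value")
  height = table.values.first.length
  # One block of rows per measured column, each carrying the identifier values.
  on.flat_map do |column|
    (0...height).map do |i|
      row = index.to_h { |id| [id, table[id][i]] }
      row.merge(variable_name => column, value_name => table[column][i])
    end
  end
end

table = { "a" => ["x", "y", "z"], "b" => [1, 3, 5], "c" => [2, 4, 6] }
long = unpivot_rows(table, on: ["b", "c"], index: ["a"])
long.length # => 6
long.first  # => {"a"=>"x", "variable"=>"b", "value"=>1}
long.last   # => {"a"=>"z", "variable"=>"c", "value"=>6}
```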

#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ `DataFrame`

Note:

This functionality is experimental and may be subject to changes without it being considered a breaking change.

Unstack a long table to a wide form without doing an aggregation.

This can be much faster than a pivot, because it can skip the grouping phase.

Examples:

df = Polars::DataFrame.new(
  {
    "col1" => "A".."I",
    "col2" => Polars.arange(0, 9, eager: true)
  }
)
# =>
# shape: (9, 2)
# ┌──────┬──────┐
# │ col1 ┆ col2 │
# │ ---  ┆ ---  │
# │ str  ┆ i64  │
# ╞══════╪══════╡
# │ A    ┆ 0    │
# │ B    ┆ 1    │
# │ C    ┆ 2    │
# │ D    ┆ 3    │
# │ E    ┆ 4    │
# │ F    ┆ 5    │
# │ G    ┆ 6    │
# │ H    ┆ 7    │
# │ I    ┆ 8    │
# └──────┴──────┘

df.unstack(step: 3, how: "vertical")
# =>
# shape: (3, 6)
# ┌────────┬────────┬────────┬────────┬────────┬────────┐
# │ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │
# │ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
# │ str    ┆ str    ┆ str    ┆ i64    ┆ i64    ┆ i64    │
# ╞════════╪════════╪════════╪════════╪════════╪════════╡
# │ A      ┆ D      ┆ G      ┆ 0      ┆ 3      ┆ 6      │
# │ B      ┆ E      ┆ H      ┆ 1      ┆ 4      ┆ 7      │
# │ C      ┆ F      ┆ I      ┆ 2      ┆ 5      ┆ 8      │
# └────────┴────────┴────────┴────────┴────────┴────────┘

df.unstack(step: 3, how: "horizontal")
# =>
# shape: (3, 6)
# ┌────────┬────────┬────────┬────────┬────────┬────────┐
# │ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │
# │ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
# │ str    ┆ str    ┆ str    ┆ i64    ┆ i64    ┆ i64    │
# ╞════════╪════════╪════════╪════════╪════════╪════════╡
# │ A      ┆ B      ┆ C      ┆ 0      ┆ 1      ┆ 2      │
# │ D      ┆ E      ┆ F      ┆ 3      ┆ 4      ┆ 5      │
# │ G      ┆ H      ┆ I      ┆ 6      ┆ 7      ┆ 8      │
# └────────┴────────┴────────┴────────┴────────┴────────┘

# File 'lib/polars/data_frame.rb', line 3655

def unstack(step:, how: "vertical", columns: nil, fill_values: nil)
  if !columns.nil?
    df = select(columns)
  else
    df = self
  end

  height = df.height
  if how == "vertical"
    n_rows = step
    n_cols = (height / n_rows.to_f).ceil
  else
    n_cols = step
    n_rows = (height / n_cols.to_f).ceil
  end

  n_fill = n_cols * n_rows - height

  if n_fill > 0
    if !fill_values.is_a?(::Array)
      fill_values = [fill_values] * df.width
    end

    df = df.select(
      df.get_columns.zip(fill_values).map do |s, next_fill|
        s.extend_constant(next_fill, n_fill)
      end
    )
  end

  if how == "horizontal"
    df = (
      df.with_column(
        (Polars.arange(0, n_cols * n_rows, eager: true) % n_cols).alias(
          "__sort_order"
        )
      )
      .sort("__sort_order")
      .drop("__sort_order")
    )
  end

  zfill_val = Math.log10(n_cols).floor + 1
  slices =
    df.get_columns.flat_map do |s|
      n_cols.times.map do |slice_nbr|
        s.slice(slice_nbr * n_rows, n_rows).alias("%s_%0#{zfill_val}d" % [s.name, slice_nbr])
      end
    end

  _from_rbdf(DataFrame.new(slices)._df)
end
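The shape arithmetic in the source above can be isolated into a small sketch. `unstack_shape` and `slice_name` are hypothetical helpers that copy the `n_rows`, `n_cols`, `n_fill`, and zero-padded suffix calculations without touching any data.

```ruby
# Hypothetical helper: compute the output grid and the number of fill values.
def unstack_shape(height, step:, how: "vertical")
  if how == "vertical"
    n_rows = step
    n_cols = (height / n_rows.to_f).ceil
  else
    n_cols = step
    n_rows = (height / n_cols.to_f).ceil
  end
  { n_rows: n_rows, n_cols: n_cols, n_fill: n_cols * n_rows - height }
end

# Hypothetical helper: suffix width grows with the number of output columns.
def slice_name(name, slice_nbr, n_cols)
  zfill_val = Math.log10(n_cols).floor + 1
  "%s_%0#{zfill_val}d" % [name, slice_nbr]
end

unstack_shape(9, step: 3)[:n_fill]  # => 0
unstack_shape(10, step: 3)[:n_cols] # => 4
unstack_shape(10, step: 3)[:n_fill] # => 2
slice_name("col1", 2, 3)            # => "col1_2"
slice_name("col1", 2, 12)           # => "col1_02"
```

When `n_fill` is positive, the source pads each column with `fill_values` before slicing, so every output column has the same length.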

#upsample(time_column:, every:, by: nil, maintain_order: false) ⇒ `DataFrame`

Upsample a DataFrame at a regular frequency.

The every argument is created with the following string language:

1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
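To show how a combined string breaks down into count and unit pairs, here is a plain-Ruby sketch. `parse_duration` is a hypothetical helper for illustration only; it is not the parser Polars uses, and it performs no conversion to actual time spans.

```ruby
# Units from the list above. Longer suffixes come first so that
# "ms" and "mo" are not mistaken for "m" followed by another character.
DURATION_UNIT = /ns|us|ms|mo|s|m|h|d|w|y|i/

# Hypothetical helper: split a duration string into [count, unit] pairs.
def parse_duration(text)
  unless text.match?(/\A(?:\d+(?:#{DURATION_UNIT}))+\z/)
    raise ArgumentError, "invalid duration string: #{text.inspect}"
  end
  text.scan(/(\d+)(#{DURATION_UNIT})/).map { |count, unit| [count.to_i, unit] }
end

parse_duration("3d12h4m25s") # => [[3, "d"], [12, "h"], [4, "m"], [25, "s"]]
parse_duration("1mo")        # => [[1, "mo"]]
```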

Examples:

Upsample a DataFrame by a certain interval.

df = Polars::DataFrame.new(
  {
    "time" => [
      DateTime.new(2021, 2, 1),
      DateTime.new(2021, 4, 1),
      DateTime.new(2021, 5, 1),
      DateTime.new(2021, 6, 1)
    ],
    "groups" => ["A", "B", "A", "B"],
    "values" => [0, 1, 2, 3]
  }
).set_sorted("time")
df.upsample(
  time_column: "time", every: "1mo", by: "groups", maintain_order: true
).select(Polars.all.forward_fill)
# =>
# shape: (7, 3)
# ┌─────────────────────┬────────┬────────┐
# │ time                ┆ groups ┆ values │
# │ ---                 ┆ ---    ┆ ---    │
# │ datetime[ns]        ┆ str    ┆ i64    │
# ╞═════════════════════╪════════╪════════╡
# │ 2021-02-01 00:00:00 ┆ A      ┆ 0      │
# │ 2021-03-01 00:00:00 ┆ A      ┆ 0      │
# │ 2021-04-01 00:00:00 ┆ A      ┆ 0      │
# │ 2021-05-01 00:00:00 ┆ A      ┆ 2      │
# │ 2021-04-01 00:00:00 ┆ B      ┆ 1      │
# │ 2021-05-01 00:00:00 ┆ B      ┆ 1      │
# │ 2021-06-01 00:00:00 ┆ B      ┆ 3      │
# └─────────────────────┴────────┴────────┘

# File 'lib/polars/data_frame.rb', line 2399

def upsample(
  time_column:,
  every:,
  by: nil,
  maintain_order: false
)
  if by.nil?
    by = []
  end
  if by.is_a?(::String)
    by = [by]
  end

  every = Utils.parse_as_duration_string(every)

  _from_rbdf(
    _df.upsample(by, time_column, every, maintain_order)
  )
end

#var(ddof: 1) ⇒ `DataFrame`

Aggregate the columns of this DataFrame to their variance value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.var
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 1.0 ┆ 1.0 ┆ null │
# └─────┴─────┴──────┘

df.var(ddof: 0)
# =>
# shape: (1, 3)
# ┌──────────┬──────────┬──────┐
# │ foo      ┆ bar      ┆ ham  │
# │ ---      ┆ ---      ┆ ---  │
# │ f64      ┆ f64      ┆ str  │
# ╞══════════╪══════════╪══════╡
# │ 0.666667 ┆ 0.666667 ┆ null │
# └──────────┴──────────┴──────┘



# File 'lib/polars/data_frame.rb', line 4470

def var(ddof: 1)
  lazy.var(ddof: ddof).collect(_eager: true)
end

#vstack(df, in_place: false) ⇒ `DataFrame`

Grow this DataFrame vertically by stacking a DataFrame to it.

Examples:

df1 = Polars::DataFrame.new(
  {
    "foo" => [1, 2],
    "bar" => [6, 7],
    "ham" => ["a", "b"]
  }
)
df2 = Polars::DataFrame.new(
  {
    "foo" => [3, 4],
    "bar" => [8, 9],
    "ham" => ["c", "d"]
  }
)
df1.vstack(df2)
# =>
# shape: (4, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# │ 3   ┆ 8   ┆ c   │
# │ 4   ┆ 9   ┆ d   │
# └─────┴─────┴─────┘

# File 'lib/polars/data_frame.rb', line 2922

def vstack(df, in_place: false)
  if in_place
    _df.vstack_mut(df._df)
    self
  else
    _from_rbdf(_df.vstack(df._df))
  end
end

#width ⇒ `Integer`

Get the width of the DataFrame.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3, 4, 5]})
df.width
# => 1



# File 'lib/polars/data_frame.rb', line 122

def width
  _df.width
end

#with_column(column) ⇒ `DataFrame`

Return a new DataFrame with the column added or replaced.

Examples:

Added

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.with_column((Polars.col("b") ** 2).alias("b_squared"))
# =>
# shape: (3, 3)
# ┌─────┬─────┬───────────┐
# │ a   ┆ b   ┆ b_squared │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ i64 ┆ i64       │
# ╞═════╪═════╪═══════════╡
# │ 1   ┆ 2   ┆ 4         │
# │ 3   ┆ 4   ┆ 16        │
# │ 5   ┆ 6   ┆ 36        │
# └─────┴─────┴───────────┘

Replaced

df.with_column(Polars.col("a") ** 2)
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# │ 9   ┆ 4   │
# │ 25  ┆ 6   │
# └─────┴─────┘

# File 'lib/polars/data_frame.rb', line 2837

def with_column(column)
  lazy
    .with_column(column)
    .collect(no_optimization: true, string_cache: false)
end

#with_columns(*exprs, **named_exprs) ⇒ `DataFrame`

Add columns to this DataFrame.

Added columns will replace existing columns with the same name.

Examples:

Pass an expression to add it as a new column.

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
df.with_columns((Polars.col("a") ** 2).alias("a^2"))
# =>
# shape: (4, 4)
# ┌─────┬──────┬───────┬─────┐
# │ a   ┆ b    ┆ c     ┆ a^2 │
# │ --- ┆ ---  ┆ ---   ┆ --- │
# │ i64 ┆ f64  ┆ bool  ┆ i64 │
# ╞═════╪══════╪═══════╪═════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   │
# │ 3   ┆ 10.0 ┆ false ┆ 9   │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  │
# └─────┴──────┴───────┴─────┘

Added columns will replace existing columns with the same name.

df.with_columns(Polars.col("a").cast(Polars::Float64))
# =>
# shape: (4, 3)
# ┌─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     │
# │ --- ┆ ---  ┆ ---   │
# │ f64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╡
# │ 1.0 ┆ 0.5  ┆ true  │
# │ 2.0 ┆ 4.0  ┆ true  │
# │ 3.0 ┆ 10.0 ┆ false │
# │ 4.0 ┆ 13.0 ┆ true  │
# └─────┴──────┴───────┘

Multiple columns can be added by passing a list of expressions.

df.with_columns(
  [
    (Polars.col("a") ** 2).alias("a^2"),
    (Polars.col("b") / 2).alias("b/2"),
    (Polars.col("c").not_).alias("not c"),
  ]
)
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪═════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴─────┴──────┴───────┘

Multiple columns can also be added using positional arguments instead of a list.

df.with_columns(
  (Polars.col("a") ** 2).alias("a^2"),
  (Polars.col("b") / 2).alias("b/2"),
  (Polars.col("c").not_).alias("not c"),
)
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪═════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴─────┴──────┴───────┘

Use keyword arguments to easily name your expression inputs.

df.with_columns(
  ab: Polars.col("a") * Polars.col("b"),
  not_c: Polars.col("c").not_
)
# =>
# shape: (4, 5)
# ┌─────┬──────┬───────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
# │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
# └─────┴──────┴───────┴──────┴───────┘



# File 'lib/polars/data_frame.rb', line 4146

def with_columns(*exprs, **named_exprs)
  lazy.with_columns(*exprs, **named_exprs).collect(_eager: true)
end

#with_row_index(name: "index", offset: 0) ⇒ `DataFrame` Also known as: with_row_count

Add a column at index 0 that counts the rows.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.with_row_index
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ index ┆ a   ┆ b   │
# │ ---   ┆ --- ┆ --- │
# │ u32   ┆ i64 ┆ i64 │
# ╞═══════╪═════╪═════╡
# │ 0     ┆ 1   ┆ 2   │
# │ 1     ┆ 3   ┆ 4   │
# │ 2     ┆ 5   ┆ 6   │
# └───────┴─────┴─────┘



# File 'lib/polars/data_frame.rb', line 1934

def with_row_index(name: "index", offset: 0)
  _from_rbdf(_df.with_row_index(name, offset))
end

#write_avro(file, compression = "uncompressed", name: "") ⇒ `nil`

Write to Apache Avro file.

# File 'lib/polars/data_frame.rb', line 839

def write_avro(file, compression = "uncompressed", name: "")
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  if name.nil?
    name = ""
  end

  _df.write_avro(file, compression, name)
end

#write_csv(file = nil, has_header: true, include_header: nil, sep: ",", quote: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_precision: nil, null_value: nil) ⇒ `String`, `nil`

Write to a comma-separated values (CSV) file. If no file is given, the CSV is returned as a string.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.write_csv("file.csv")

# File 'lib/polars/data_frame.rb', line 764

def write_csv(
  file = nil,
  has_header: true,
  include_header: nil,
  sep: ",",
  quote: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_precision: nil,
  null_value: nil
)
  include_header = has_header if include_header.nil?

  if sep.length > 1
    raise ArgumentError, "only single byte separator is allowed"
  elsif quote.length > 1
    raise ArgumentError, "only single byte quote char is allowed"
  elsif null_value == ""
    null_value = nil
  end

  if file.nil?
    buffer = StringIO.new
    buffer.set_encoding(Encoding::BINARY)
    _df.write_csv(
      buffer,
      include_header,
      sep.ord,
      quote.ord,
      batch_size,
      datetime_format,
      date_format,
      time_format,
      float_precision,
      null_value
    )
    return buffer.string.force_encoding(Encoding::UTF_8)
  end

  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  _df.write_csv(
    file,
    include_header,
    sep.ord,
    quote.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_precision,
    null_value,
  )
  nil
end
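The return-value convention above (a string when `file` is nil, otherwise `nil` after writing) can be sketched in plain Ruby. `write_rows` is a hypothetical, deliberately naive helper: it does no quoting or escaping and only mirrors the separator check and the dispatch on `file`.

```ruby
require "stringio"

# Hypothetical helper: rows are arrays of values, file is any object with #write.
def write_rows(rows, file = nil, sep: ",")
  raise ArgumentError, "only single byte separator is allowed" if sep.length > 1

  text = rows.map { |row| row.join(sep) }.join("\n") + "\n"
  return text if file.nil? # no target given: hand the CSV text back

  file.write(text)
  nil                      # a target was given: nothing to return
end

rows = [["foo", "bar"], [1, 6], [2, 7]]
write_rows(rows)     # => "foo,bar\n1,6\n2,7\n"

io = StringIO.new
write_rows(rows, io) # => nil
io.string            # => "foo,bar\n1,6\n2,7\n"
```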

#write_database(table_name, connection = nil, if_table_exists: "fail") ⇒ `Integer`

Note:

This functionality is experimental. It may be changed at any point without it being considered a breaking change.

Write the data in a Polars DataFrame to a database.

# File 'lib/polars/data_frame.rb', line 1021

def write_database(table_name, connection = nil, if_table_exists: "fail")
  if !defined?(ActiveRecord)
    raise Error, "Active Record not available"
  elsif ActiveRecord::VERSION::MAJOR < 7
    raise Error, "Requires Active Record 7+"
  end

  valid_write_modes = ["append", "replace", "fail"]
  if !valid_write_modes.include?(if_table_exists)
    msg = "write_database `if_table_exists` must be one of #{valid_write_modes.inspect}, got #{if_table_exists.inspect}"
    raise ArgumentError, msg
  end

  with_connection(connection) do |connection|
    table_exists = connection.table_exists?(table_name)
    if table_exists && if_table_exists == "fail"
      raise ArgumentError, "Table already exists"
    end

    create_table = !table_exists || if_table_exists == "replace"
    maybe_transaction(connection, create_table) do
      if create_table
        mysql = connection.adapter_name.match?(/mysql|trilogy/i)
        force = if_table_exists == "replace"
        connection.create_table(table_name, id: false, force: force) do |t|
          schema.each do |c, dtype|
            options = {}
            column_type =
              case dtype
              when Binary
                :binary
              when Boolean
                :boolean
              when Date
                :date
              when Datetime
                :datetime
              when Decimal
                if mysql
                  options[:precision] = dtype.precision || 65
                  options[:scale] = dtype.scale || 30
                end
                :decimal
              when Float32
                options[:limit] = 24
                :float
              when Float64
                options[:limit] = 53
                :float
              when Int8
                options[:limit] = 1
                :integer
              when Int16
                options[:limit] = 2
                :integer
              when Int32
                options[:limit] = 4
                :integer
              when Int64
                options[:limit] = 8
                :integer
              when UInt8
                if mysql
                  options[:limit] = 1
                  options[:unsigned] = true
                else
                  options[:limit] = 2
                end
                :integer
              when UInt16
                if mysql
                  options[:limit] = 2
                  options[:unsigned] = true
                else
                  options[:limit] = 4
                end
                :integer
              when UInt32
                if mysql
                  options[:limit] = 4
                  options[:unsigned] = true
                else
                  options[:limit] = 8
                end
                :integer
              when UInt64
                if mysql
                  options[:limit] = 8
                  options[:unsigned] = true
                  :integer
                else
                  options[:precision] = 20
                  options[:scale] = 0
                  :decimal
                end
              when String
                :text
              when Time
                :time
              else
                raise ArgumentError, "column type not supported yet: #{dtype}"
              end
            t.column c, column_type, **options
          end
        end
      end

      quoted_table = connection.quote_table_name(table_name)
      quoted_columns = columns.map { |c| connection.quote_column_name(c) }
      rows = cast({Polars::UInt64 => Polars::String}).rows(named: false).map { |row| "(#{row.map { |v| connection.quote(v) }.join(", ")})" }
      connection.exec_update("INSERT INTO #{quoted_table} (#{quoted_columns.join(", ")}) VALUES #{rows.join(", ")}")
    end
  end
end
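The unsigned integer branches in the source above follow one pattern, condensed here into a hypothetical `unsigned_column` helper. On MySQL-like adapters the column keeps its byte width and is marked unsigned. Elsewhere the next wider signed integer is used, and `UInt64` falls back to a 20-digit decimal because its maximum value, 18446744073709551615, has 20 digits.

```ruby
# Hypothetical helper: `bytes` is the width of the unsigned dtype (1, 2, 4 or 8).
# Returns the column type and options that the source passes to t.column.
def unsigned_column(bytes, mysql:)
  if mysql
    [:integer, { limit: bytes, unsigned: true }]
  elsif bytes < 8
    [:integer, { limit: bytes * 2 }]          # widen to the next signed size
  else
    [:decimal, { precision: 20, scale: 0 }]   # no wider signed integer available
  end
end

unsigned_column(1, mysql: true)  # => [:integer, {limit: 1, unsigned: true}]
unsigned_column(4, mysql: false) # => [:integer, {limit: 8}]
unsigned_column(8, mysql: false) # => [:decimal, {precision: 20, scale: 0}]
```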

#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ `nil`

Write DataFrame as delta table.

# File 'lib/polars/data_frame.rb', line 1150

def write_delta(
  target,
  mode: "error",
  storage_options: nil,
  delta_write_options: nil,
  delta_merge_options: nil
)
  Polars.send(:_check_if_delta_available)

  if Utils.pathlike?(target)
    target = Polars.send(:_resolve_delta_lake_uri, target.to_s, strict: false)
  end

  data = self

  if mode == "merge"
    if delta_merge_options.nil?
      msg = "You need to pass delta_merge_options with at least a given predicate for `MERGE` to work."
      raise ArgumentError, msg
    end
    if target.is_a?(::String)
      dt = DeltaLake::Table.new(target, storage_options: storage_options)
    else
      dt = target
    end

    predicate = delta_merge_options.delete(:predicate)
    dt.merge(data, predicate, **delta_merge_options)
  else
    delta_write_options ||= {}

    DeltaLake.write(
      target,
      data,
      mode: mode,
      storage_options: storage_options,
      **delta_write_options
    )
  end
end

#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, retries: 2) ⇒ `nil`

Write to Arrow IPC binary stream or Feather file.

# File 'lib/polars/data_frame.rb', line 861

def write_ipc(
  file,
  compression: "uncompressed",
  compat_level: nil,
  storage_options: nil,
  retries: 2
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  if storage_options&.any?
    storage_options = storage_options.to_a
  else
    storage_options = nil
  end

  _df.write_ipc(file, compression, compat_level, storage_options, retries)
  return_bytes ? file.string : nil
end

#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ `Object`

Write to Arrow IPC record batch stream.

See "Streaming format" in https://arrow.apache.org/docs/python/ipc.html.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.write_ipc_stream("new_file.arrow")

# File 'lib/polars/data_frame.rb', line 916

def write_ipc_stream(
  file,
  compression: "uncompressed",
  compat_level: nil
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  elsif Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  _df.write_ipc_stream(file, compression, compat_level)
  return_bytes ? file.string : nil
end

#write_json(file = nil) ⇒ `nil`

Serialize to JSON representation.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8]
  }
)
df.write_json
# => "[{\"foo\":1,\"bar\":6},{\"foo\":2,\"bar\":7},{\"foo\":3,\"bar\":8}]"

# File 'lib/polars/data_frame.rb', line 658

def write_json(file = nil)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_json(buf)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_json(file)
  end
  nil
end

#write_ndjson(file = nil) ⇒ `nil`

Serialize to newline delimited JSON representation.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8]
  }
)
df.write_ndjson
# => "{\"foo\":1,\"bar\":6}\n{\"foo\":2,\"bar\":7}\n{\"foo\":3,\"bar\":8}\n"

# File 'lib/polars/data_frame.rb', line 697

def write_ndjson(file = nil)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_ndjson(buf)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_ndjson(file)
  end
  nil
end
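The difference between the two JSON layouts can be reproduced with Ruby's standard `json` library. This sketch starts from an array of row hashes rather than a DataFrame, and matches the strings shown in the `write_json` and `write_ndjson` examples.

```ruby
require "json"

rows = [
  { "foo" => 1, "bar" => 6 },
  { "foo" => 2, "bar" => 7 },
  { "foo" => 3, "bar" => 8 }
]

# JSON: a single array containing every row.
json = JSON.generate(rows)
# => "[{\"foo\":1,\"bar\":6},{\"foo\":2,\"bar\":7},{\"foo\":3,\"bar\":8}]"

# NDJSON: one JSON object per line, each terminated by a newline.
ndjson = rows.map { |row| JSON.generate(row) + "\n" }.join
# => "{\"foo\":1,\"bar\":6}\n{\"foo\":2,\"bar\":7}\n{\"foo\":3,\"bar\":8}\n"
```

NDJSON can be read or appended one line at a time, whereas the single-array form has to be parsed as a whole.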

#write_parquet(file, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_page_size: nil) ⇒ `nil`

Write to Apache Parquet file.

# File 'lib/polars/data_frame.rb', line 965

def write_parquet(
  file,
  compression: "zstd",
  compression_level: nil,
  statistics: false,
  row_group_size: nil,
  data_page_size: nil
)
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if statistics == true
    statistics = {
      min: true,
      max: true,
      distinct_count: false,
      null_count: true
    }
  elsif statistics == false
    statistics = {}
  elsif statistics == "full"
    statistics = {
      min: true,
      max: true,
      distinct_count: true,
      null_count: true
    }
  end

  _df.write_parquet(
    file, compression, compression_level, statistics, row_group_size, data_page_size
  )
end
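
The `statistics` handling above can be read as a small normalization step. The sketch below restates it as a standalone helper in plain Ruby; `normalize_statistics` is a hypothetical name used for illustration and is not part of the Polars API.

```ruby
# Hypothetical helper mirroring the `statistics` branches in write_parquet.
def normalize_statistics(statistics)
  case statistics
  when true
    # Default set: distinct counts are skipped.
    {min: true, max: true, distinct_count: false, null_count: true}
  when false
    # No statistics requested.
    {}
  when "full"
    # Everything, including distinct counts.
    {min: true, max: true, distinct_count: true, null_count: true}
  else
    # Any other value (for example a custom hash) is passed through unchanged.
    statistics
  end
end

default_stats = normalize_statistics(true)
full_stats = normalize_statistics("full")
custom_stats = normalize_statistics({min: true})
```

The only difference between `true` and `"full"` is whether `distinct_count` is enabled.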

Constructor Details

#initialize(data = nil, schema: nil, columns: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ DataFrame

Instance Method Details

#!=(other) ⇒ DataFrame

#%(other) ⇒ DataFrame

#*(other) ⇒ DataFrame

#+(other) ⇒ DataFrame

#-(other) ⇒ DataFrame

#/(other) ⇒ DataFrame

#<(other) ⇒ DataFrame

#<=(other) ⇒ DataFrame

#==(other) ⇒ DataFrame

#>(other) ⇒ DataFrame

#>=(other) ⇒ DataFrame

#[](*args) ⇒ Object

#[]=(*key, value) ⇒ Object

#cast(dtypes, strict: true) ⇒ DataFrame

Examples:

Cast specific frame columns to the specified dtypes:

Cast all frame columns matching one dtype (or dtype group) to another dtype:

Cast all frame columns to the specified dtype:

#clear(n = 0) ⇒ DataFrame Also known as: cleared

Examples:

#collect_schema ⇒ Schema

Examples:

Determine the schema.

Access various properties of the schema using the Schema object.

#columns ⇒ Array

Examples:

#columns=(columns) ⇒ Object

Examples:

#delete(name) ⇒ Series

Examples:

#describe ⇒ DataFrame

Examples:

#drop(*columns) ⇒ DataFrame

Examples:

Drop multiple columns by passing a list of column names.

Use positional arguments to drop multiple columns.

#drop_in_place(name) ⇒ Series

Examples:

#drop_nulls(subset: nil) ⇒ DataFrame

Examples:

#dtypes ⇒ Array

Examples:

#each(&block) ⇒ Object

#each_row(named: true, buffer_size: 500, &block) ⇒ Object

#equals(other, null_equal: true) ⇒ Boolean Also known as: frame_equal

Examples:

#estimated_size(unit = "b") ⇒ Numeric

Examples:

#explode(columns) ⇒ DataFrame

Examples:

#extend(other) ⇒ DataFrame

Examples:

#fill_nan(fill_value) ⇒ DataFrame

Examples:

#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame

Examples:

#filter(predicate) ⇒ DataFrame

Examples:

Filter on one condition:

Filter on multiple conditions:

#flags ⇒ Hash

#fold ⇒ Series

Examples:

A horizontal sum operation:

A horizontal minimum operation:

A horizontal string concatenation:

A horizontal boolean or, similar to a row-wise .any:

#gather_every(n, offset = 0) ⇒ DataFrame Also known as: take_every

Examples:

#get_column(name) ⇒ Series

Examples:

#get_column_index(name) ⇒ Integer Also known as: find_idx_by_name

Examples:

#get_columns ⇒ Array

#group_by(by, maintain_order: false) ⇒ `GroupBy` Also known as: groupby, group

#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: true, include_boundaries: false, closed: "left", by: nil, start_by: "window") ⇒ `DataFrame` Also known as: groupby_dynamic

#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ `Series`

#head(n = 5) ⇒ `DataFrame`

#height ⇒ `Integer` Also known as: count, length, size

#hstack(columns, in_place: false) ⇒ `DataFrame`

#include?(name) ⇒ `Boolean`

#insert_column(index, series) ⇒ `DataFrame` Also known as: insert_at_idx

#interpolate ⇒ `DataFrame`

#is_duplicated ⇒ `Series`

#is_empty ⇒ `Boolean` Also known as: empty?

#is_unique ⇒ `Series`

#item ⇒ `Object`

#iter_columns ⇒ `Object`

then consider whether you can use `all` instead:

#iter_rows(named: false, buffer_size: 500, &block) ⇒ `Object`

#iter_slices(n_rows: 10_000) ⇒ `Object`

#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, coalesce: nil, maintain_order: nil) ⇒ `DataFrame`

#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ `DataFrame`

#lazy ⇒ `LazyFrame`

#limit(n = 5) ⇒ `DataFrame`

#map_rows(return_dtype: nil, inference_size: 256, &f) ⇒ `Object` Also known as: apply

#max ⇒ `DataFrame`

#max_horizontal ⇒ `Series`

#mean ⇒ `DataFrame`

#mean_horizontal(ignore_nulls: true) ⇒ `Series`

#median ⇒ `DataFrame`

#merge_sorted(other, key) ⇒ `DataFrame`

#min ⇒ `DataFrame`

#min_horizontal ⇒ `Series`

#n_chunks(strategy: "first") ⇒ `Object`

#n_unique(subset: nil) ⇒ `DataFrame`