Class: Polars::DataFrame

Inherits:
Object
Includes:
Plot
Defined in:
lib/polars/data_frame.rb

Overview

Two-dimensional data structure representing data as a table with rows and columns.

Constructor Details

#initialize(data = nil, schema: nil, columns: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ DataFrame

Create a new DataFrame.

Parameters:

  • data (Object) (defaults to: nil)

    Two-dimensional data in various forms; hash input must contain arrays or a range. Arrays may contain Series or other arrays.

  • schema (Object) (defaults to: nil)

    The schema of the resulting DataFrame. The schema may be declared in several ways:

    • As a hash of name:type pairs; if type is nil, it will be auto-inferred.
    • As an array of column names; in this case types are automatically inferred.
    • As an array of (name, type) pairs; this is equivalent to the hash form.

    If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

    If set to nil (default), the schema is inferred from the data.

  • schema_overrides (Hash) (defaults to: nil)

    Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden.

    The number of entries in the schema should match the underlying data dimensions, unless an array of hashes is being passed, in which case a partial schema can be declared to prevent specific fields from being loaded.

  • strict (Boolean) (defaults to: true)

    Raise an error if any data value does not exactly match the given or inferred data type for that column. If set to false, values that do not match the data type are cast to that data type or, if casting is not possible, set to null instead.

  • orient ("col", "row") (defaults to: nil)

    Whether to interpret two-dimensional data as columns or as rows. If nil, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.

  • infer_schema_length (Integer) (defaults to: 100)

    The maximum number of rows to scan for schema inference. If set to nil, the full data may be scanned (this can be slow). This parameter only applies if the input data is a sequence or generator of rows; other input is read as-is.

  • nan_to_null (Boolean) (defaults to: false)

    If the data comes from one or more Numo arrays, optionally convert NaN values in the input to null. This is a no-op for all other input data.
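
The three schema forms listed for the schema parameter carry the same information. As a minimal sketch in plain Ruby (not the library's code), the hypothetical helper `normalize_schema` below reduces each form to an array of `[name, type]` pairs, with `nil` standing for a type that is left to inference. The helper name and the pair representation are assumptions made for illustration only.

```ruby
# Hypothetical helper (not part of Polars): reduce the accepted schema
# forms to one canonical list of [name, type] pairs. A nil type means
# "infer this column's type from the data".
def normalize_schema(schema)
  case schema
  when nil
    nil # no schema given: everything is inferred
  when Hash
    schema.map { |name, type| [name.to_s, type] }
  when Array
    schema.map do |entry|
      # ["a", :i64] is a (name, type) pair; "a" alone is just a name
      entry.is_a?(Array) ? [entry[0].to_s, entry[1]] : [entry.to_s, nil]
    end
  else
    raise ArgumentError, "unsupported schema: #{schema.class.name}"
  end
end

as_hash  = normalize_schema({"a" => :i64, "b" => nil})
as_names = normalize_schema(["a", "b"])
as_pairs = normalize_schema([["a", :i64], ["b", nil]])
```

Here `as_hash` and `as_pairs` come out identical, which is the equivalence the parameter description states; `as_names` has the same names with every type left open.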



# File 'lib/polars/data_frame.rb', line 50

def initialize(data = nil, schema: nil, columns: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false)
  if schema && columns
    warn "columns is ignored when schema is passed"
  end
  schema ||= columns

  if defined?(ActiveRecord) && (data.is_a?(ActiveRecord::Relation) || data.is_a?(ActiveRecord::Result))
    raise ArgumentError, "Use read_database instead"
  end

  if data.nil?
    self._df = self.class.hash_to_rbdf({}, schema: schema, schema_overrides: schema_overrides)
  elsif data.is_a?(Hash)
    data = data.transform_keys { |v| v.is_a?(Symbol) ? v.to_s : v }
    self._df = self.class.hash_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, nan_to_null: nan_to_null)
  elsif data.is_a?(::Array)
    self._df = self.class.sequence_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, orient: orient, infer_schema_length: infer_schema_length)
  elsif data.is_a?(Series)
    self._df = self.class.series_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict)
  elsif data.respond_to?(:arrow_c_stream)
    # This uses the fact that RbSeries.from_arrow_c_stream will create a
    # struct-typed Series. Then we unpack that to a DataFrame.
    tmp_col_name = ""
    s = Utils.wrap_s(RbSeries.from_arrow_c_stream(data))
    self._df = s.to_frame(tmp_col_name).unnest(tmp_col_name)._df
  else
    raise ArgumentError, "DataFrame constructor called with unsupported type; got #{data.class.name}"
  end
end

Instance Method Details

#!=(other) ⇒ DataFrame

Not equal.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 230

def !=(other)
  _comp(other, "neq")
end

#%(other) ⇒ DataFrame

Returns the modulo.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 313

def %(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.rem_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.rem(other._s))
end

#*(other) ⇒ DataFrame

Performs multiplication.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 265

def *(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.mul_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.mul(other._s))
end

#+(other) ⇒ DataFrame

Performs addition.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 289

def +(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.add_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.add(other._s))
end

#-(other) ⇒ DataFrame

Performs subtraction.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 301

def -(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.sub_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.sub(other._s))
end

#/(other) ⇒ DataFrame

Performs division.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 277

def /(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.div_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.div(other._s))
end
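
The arithmetic operators above share one shape: a DataFrame operand is combined with the receiver directly, and any other operand is prepared first and then applied to the whole frame. The following is a toy model of those two branches in plain Ruby, with a frame represented as a Hash of arrays. `apply_arith` is a hypothetical helper, and pairing columns by position is an assumption of this sketch rather than documented library behavior.

```ruby
# Toy model: a "frame" is a Hash of column name => Array of numbers.
# apply_arith is a hypothetical helper mirroring the two branches of the
# operator methods: frame-with-frame, or frame-with-scalar.
def apply_arith(frame, other, op)
  if other.is_a?(Hash)
    # frame operand: combine columns pairwise by position
    frame.keys.zip(other.values).to_h do |name, other_col|
      [name, frame[name].zip(other_col).map { |a, b| a.public_send(op, b) }]
    end
  else
    # scalar operand: broadcast it over every value
    frame.transform_values { |col| col.map { |v| v.public_send(op, other) } }
  end
end

left  = {"a" => [1, 2, 3], "b" => [10, 20, 30]}
right = {"a" => [1, 1, 1], "b" => [2, 2, 2]}

sum_frames = apply_arith(left, right, :+)
scaled     = apply_arith(left, 2, :*)
remainder  = apply_arith(left, 2, :%)
```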

#<(other) ⇒ DataFrame

Less than.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 244

def <(other)
  _comp(other, "lt")
end

#<=(other) ⇒ DataFrame

Less than or equal.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 258

def <=(other)
  _comp(other, "lt_eq")
end

#==(other) ⇒ DataFrame

Equal.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 223

def ==(other)
  _comp(other, "eq")
end

#>(other) ⇒ DataFrame

Greater than.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 237

def >(other)
  _comp(other, "gt")
end

#>=(other) ⇒ DataFrame

Greater than or equal.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 251

def >=(other)
  _comp(other, "gt_eq")
end
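
Each comparison operator above forwards to `_comp` with a short operation name and is declared to return a DataFrame rather than a single Boolean. The sketch below maps those names to plain Ruby operators and applies them to a Hash-of-arrays stand-in. `compare_frame` is hypothetical, and the boolean-mask result shape is an assumption based on the declared return type.

```ruby
# Hypothetical toy: compare every value of a Hash-of-Arrays "frame"
# against a scalar, producing a frame of booleans with the same shape.
def compare_frame(frame, scalar, op_name)
  # Operation names as passed to _comp in the methods above, mapped to
  # the plain Ruby operator each one stands for.
  comparisons = {
    "eq" => :==, "neq" => :!=,
    "lt" => :<,  "lt_eq" => :<=,
    "gt" => :>,  "gt_eq" => :>=
  }
  op = comparisons.fetch(op_name)
  frame.transform_values { |col| col.map { |v| v.public_send(op, scalar) } }
end

frame = {"a" => [1, 2, 3], "b" => [3, 2, 1]}
mask_gt  = compare_frame(frame, 2, "gt")
mask_neq = compare_frame(frame, 2, "neq")
```

Because the result is a frame and not `true` or `false`, whole-frame equality is better checked with `equals`, which is documented further down to return a Boolean.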

#[](*args) ⇒ Object

Returns subset of the DataFrame.

Returns:

  • (Object)

Raises:

  • (ArgumentError)


# File 'lib/polars/data_frame.rb', line 354

def [](*args)
  if args.size == 2
    row_selection, col_selection = args

    # df[.., unknown]
    if row_selection.is_a?(Range)

      # multiple slices
      # df[.., ..]
      if col_selection.is_a?(Range)
        raise Todo
      end
    end

    # df[2, ..] (select row as df)
    if row_selection.is_a?(Integer)
      if col_selection.is_a?(::Array)
        df = self[0.., col_selection]
        return df.slice(row_selection, 1)
      end
      # df[2, "a"]
      if col_selection.is_a?(::String) || col_selection.is_a?(Symbol)
        return self[col_selection][row_selection]
      end
    end

    # column selection can be "a" and ["a", "b"]
    if col_selection.is_a?(::String) || col_selection.is_a?(Symbol)
      col_selection = [col_selection]
    end

    # df[.., 1]
    if col_selection.is_a?(Integer)
      series = to_series(col_selection)
      return series[row_selection]
    end

    if col_selection.is_a?(::Array)
      # df[.., [1, 2]]
      if Utils.is_int_sequence(col_selection)
        series_list = col_selection.map { |i| to_series(i) }
        df = self.class.new(series_list)
        return df[row_selection]
      end
    end

    df = self[col_selection]
    return df[row_selection]
  elsif args.size == 1
    item = args[0]

    # select single column
    # df["foo"]
    if item.is_a?(::String) || item.is_a?(Symbol)
      return Utils.wrap_s(_df.get_column(item.to_s))
    end

    # df[idx]
    if item.is_a?(Integer)
      return slice(_pos_idx(item, 0), 1)
    end

    # df[..]
    if item.is_a?(Range)
      return Slice.new(self).apply(item)
    end

    if item.is_a?(::Array) && item.all? { |v| Utils.strlike?(v) }
      # select multiple columns
      # df[["foo", "bar"]]
      return _from_rbdf(_df.select(item.map(&:to_s)))
    end

    if Utils.is_int_sequence(item)
      item = Series.new("", item)
    end

    if item.is_a?(Series)
      dtype = item.dtype
      if dtype == String
        return _from_rbdf(_df.select(item))
      elsif dtype == UInt32
        return _from_rbdf(_df.take_with_series(item._s))
      elsif [UInt8, UInt16, UInt64, Int8, Int16, Int32, Int64].include?(dtype)
        return _from_rbdf(
          _df.take_with_series(_pos_idxs(item, 0)._s)
        )
      end
    end
  end

  # Ruby-specific
  if item.is_a?(Expr) || item.is_a?(Series)
    return filter(item)
  end

  raise ArgumentError, "Cannot get item of type: #{item.class.name}"
end
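
The single-argument form of `[]` above tries its checks in a fixed order. As a rough guide to which branch a given key reaches, the hypothetical classifier below follows the same order for plain Ruby values. It omits the Series and Expr branches, and it assumes that `Utils.strlike?` means "String or Symbol", which the source does not spell out.

```ruby
# Hypothetical classifier (not library code): names the branch that a
# single-argument df[key] call would take, following the order of the
# checks in the method body above.
def classify_key(key)
  case key
  when String, Symbol then :single_column
  when Integer        then :single_row
  when Range          then :row_slice
  when Array
    if key.all? { |v| v.is_a?(String) || v.is_a?(Symbol) }
      :multiple_columns
    elsif key.all? { |v| v.is_a?(Integer) }
      :rows_by_index
    else
      :unsupported
    end
  else
    :unsupported
  end
end

keys = ["foo", :foo, 1, 0..2, ["foo", "bar"], [0, 2], 1.5]
branches = keys.map { |k| classify_key(k) }
```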

#[]=(*key, value) ⇒ Object

Set item.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 456

def []=(*key, value)
  if key.length == 1
    key = key.first
  elsif key.length != 2
    raise ArgumentError, "wrong number of arguments (given #{key.length + 1}, expected 2..3)"
  end

  if Utils.strlike?(key)
    if value.is_a?(::Array) || (defined?(Numo::NArray) && value.is_a?(Numo::NArray))
      value = Series.new(value)
    elsif !value.is_a?(Series)
      value = Polars.lit(value)
    end
    self._df = with_column(value.alias(key.to_s))._df
  elsif key.is_a?(::Array)
    row_selection, col_selection = key

    if Utils.strlike?(col_selection)
      s = self[col_selection]
    elsif col_selection.is_a?(Integer)
      raise Todo
    else
      raise ArgumentError, "column selection not understood: #{col_selection}"
    end

    s[row_selection] = value

    if col_selection.is_a?(Integer)
      replace_column(col_selection, s)
    elsif Utils.strlike?(col_selection)
      replace(col_selection, s)
    end
  else
    raise Todo
  end
end
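
The `[]=(*key, value)` signature above relies on how Ruby passes index-assignment arguments: the assigned value always arrives last, and everything inside the brackets is collected into `key`. A small stand-in class (hypothetical, unrelated to Polars) makes that visible:

```ruby
# Anonymous stand-in class that records what Ruby passes to a
# `[]=(*key, value)` method. It is illustrative only.
recorder_class = Class.new do
  attr_reader :calls

  def initialize
    @calls = []
  end

  def []=(*key, value)
    # The last argument is always the assigned value; everything before
    # it is collected into the key array.
    @calls << [key, value]
  end
end

rec = recorder_class.new
rec["a"] = [1, 2, 3]   # key arrives as ["a"]
rec[0, "a"] = 99       # key arrives as [0, "a"]
```

This is why the method first unwraps a one-element `key` and treats a two-element `key` as a (row, column) pair.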

#cast(dtypes, strict: true) ⇒ DataFrame

Cast DataFrame column(s) to the specified dtype(s).

Examples:

Cast specific frame columns to the specified dtypes:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => [Date.new(2020, 1, 2), Date.new(2021, 3, 4), Date.new(2022, 5, 6)]
  }
)
df.cast({"foo" => Polars::Float32, "bar" => Polars::UInt8})
# =>
# shape: (3, 3)
# ┌─────┬─────┬────────────┐
# │ foo ┆ bar ┆ ham        │
# │ --- ┆ --- ┆ ---        │
# │ f32 ┆ u8  ┆ date       │
# ╞═════╪═════╪════════════╡
# │ 1.0 ┆ 6   ┆ 2020-01-02 │
# │ 2.0 ┆ 7   ┆ 2021-03-04 │
# │ 3.0 ┆ 8   ┆ 2022-05-06 │
# └─────┴─────┴────────────┘

Cast all frame columns matching one dtype (or dtype group) to another dtype:

df.cast({Polars::Date => Polars::Datetime})
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────────────────────┐
# │ foo ┆ bar ┆ ham                 │
# │ --- ┆ --- ┆ ---                 │
# │ i64 ┆ f64 ┆ datetime[μs]        │
# ╞═════╪═════╪═════════════════════╡
# │ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
# │ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
# │ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
# └─────┴─────┴─────────────────────┘

Cast all frame columns to the specified dtype:

df.cast(Polars::String).to_h(as_series: false)
# => {"foo"=>["1", "2", "3"], "bar"=>["6.0", "7.0", "8.0"], "ham"=>["2020-01-02", "2021-03-04", "2022-05-06"]}

Parameters:

  • dtypes (Object)

    Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.

  • strict (Boolean) (defaults to: true)

    Raise an error if a cast cannot be performed (for instance, due to an overflow).

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2951

def cast(dtypes, strict: true)
  lazy.cast(dtypes, strict: strict).collect(_eager: true)
end

#clear(n = 0) ⇒ DataFrame Also known as: cleared

Create an empty copy of the current DataFrame.

Returns a DataFrame with identical schema but no data.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil]
  }
)
df.clear
# =>
# shape: (0, 3)
# ┌─────┬─────┬──────┐
# │ a   ┆ b   ┆ c    │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ f64 ┆ bool │
# ╞═════╪═════╪══════╡
# └─────┴─────┴──────┘
df.clear(2)
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ a    ┆ b    ┆ c    │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ f64  ┆ bool │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2991

def clear(n = 0)
  if n == 0
    _from_rbdf(_df.clear)
  elsif n > 0 || len > 0
    self.class.new(
      schema.to_h { |nm, tp| [nm, Series.new(nm, [], dtype: tp).extend_constant(nil, n)] }
    )
  else
    clone
  end
end

#columns ⇒ Array

Get column names.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.columns
# => ["foo", "bar", "ham"]

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 140

def columns
  _df.columns
end

#columns=(columns) ⇒ Object

Change the column names of the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.columns = ["apple", "banana", "orange"]
df
# =>
# shape: (3, 3)
# ┌───────┬────────┬────────┐
# │ apple ┆ banana ┆ orange │
# │ ---   ┆ ---    ┆ ---    │
# │ i64   ┆ i64    ┆ str    │
# ╞═══════╪════════╪════════╡
# │ 1     ┆ 6      ┆ a      │
# │ 2     ┆ 7      ┆ b      │
# │ 3     ┆ 8      ┆ c      │
# └───────┴────────┴────────┘

Parameters:

  • columns (Array)

    A list with new names for the DataFrame. The length of the list should be equal to the width of the DataFrame.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 173

def columns=(columns)
  _df.set_column_names(columns)
end

#delete(name) ⇒ Series

Drop a column in place if it exists. Returns the dropped column as a Series, or nil if there is no column with that name.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.delete("ham")
# =>
# shape: (3,)
# Series: 'ham' [str]
# [
#         "a"
#         "b"
#         "c"
# ]
df.delete("missing")
# => nil

Parameters:

  • name (Object)

    Column to drop.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 2898

def delete(name)
  drop_in_place(name) if include?(name)
end

#describe ⇒ DataFrame

Summary statistics for a DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1.0, 2.8, 3.0],
    "b" => [4, 5, nil],
    "c" => [true, false, true],
    "d" => [nil, "b", "c"],
    "e" => ["usd", "eur", nil]
  }
)
df.describe
# =>
# shape: (7, 6)
# ┌────────────┬──────────┬──────────┬──────────┬──────┬──────┐
# │ describe   ┆ a        ┆ b        ┆ c        ┆ d    ┆ e    │
# │ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---  │
# │ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str  │
# ╞════════════╪══════════╪══════════╪══════════╪══════╪══════╡
# │ count      ┆ 3.0      ┆ 3.0      ┆ 3.0      ┆ 3    ┆ 3    │
# │ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 1    ┆ 1    │
# │ mean       ┆ 2.266667 ┆ 4.5      ┆ 0.666667 ┆ null ┆ null │
# │ std        ┆ 1.101514 ┆ 0.707107 ┆ 0.57735  ┆ null ┆ null │
# │ min        ┆ 1.0      ┆ 4.0      ┆ 0.0      ┆ b    ┆ eur  │
# │ max        ┆ 3.0      ┆ 5.0      ┆ 1.0      ┆ c    ┆ usd  │
# │ median     ┆ 2.8      ┆ 4.5      ┆ 1.0      ┆ null ┆ null │
# └────────────┴──────────┴──────────┴──────────┴──────┴──────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1321

def describe
  describe_cast = lambda do |stat|
    columns = []
    self.columns.each_with_index do |s, i|
      if self[s].is_numeric || self[s].is_boolean
        columns << stat[0.., i].cast(:f64)
      else
        # for dates, strings, etc, we cast to string so that all
        # statistics can be shown
        columns << stat[0.., i].cast(:str)
      end
    end
    self.class.new(columns)
  end

  summary = _from_rbdf(
    Polars.concat(
      [
        describe_cast.(
          self.class.new(columns.to_h { |c| [c, [height]] })
        ),
        describe_cast.(null_count),
        describe_cast.(mean),
        describe_cast.(std),
        describe_cast.(min),
        describe_cast.(max),
        describe_cast.(median)
      ]
    )._df
  )
  summary.insert_column(
    0,
    Polars::Series.new(
      "describe",
      ["count", "null_count", "mean", "std", "min", "max", "median"],
    )
  )
  summary
end

#drop(*columns) ⇒ DataFrame

Remove one or more columns from the DataFrame and return the result as a new DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.drop("ham")
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ f64 │
# ╞═════╪═════╡
# │ 1   ┆ 6.0 │
# │ 2   ┆ 7.0 │
# │ 3   ┆ 8.0 │
# └─────┴─────┘

Drop multiple columns by passing a list of column names.

df.drop(["bar", "ham"])
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘

Use positional arguments to drop multiple columns.

df.drop("foo", "ham")
# =>
# shape: (3, 1)
# ┌─────┐
# │ bar │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 6.0 │
# │ 7.0 │
# │ 8.0 │
# └─────┘

Parameters:

  • columns (Object)

    Column(s) to drop.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2838

def drop(*columns)
  lazy.drop(*columns).collect(_eager: true)
end

#drop_in_place(name) ⇒ Series

Drop a single column in place and return the dropped column as a Series.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.drop_in_place("ham")
# =>
# shape: (3,)
# Series: 'ham' [str]
# [
#         "a"
#         "b"
#         "c"
# ]

Parameters:

  • name (Object)

    Column to drop.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 2866

def drop_in_place(name)
  Utils.wrap_s(_df.drop_in_place(name))
end

#drop_nulls(subset: nil) ⇒ DataFrame

Return a new DataFrame where the null values are dropped.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, nil, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.drop_nulls
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • subset (Object) (defaults to: nil)

    Subset of column(s) on which drop_nulls will be applied.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1702

def drop_nulls(subset: nil)
  lazy.drop_nulls(subset: subset).collect(_eager: true)
end

#dtypes ⇒ Array

Get dtypes of columns in DataFrame. Dtypes can also be found in column headers when printing the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 191

def dtypes
  _df.dtypes
end

#each(&block) ⇒ Object

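Iterates over the columns of the DataFrame, yielding each one as a Series. Returns an enumerator when called without a block.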

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 347

def each(&block)
  get_columns.each(&block)
end
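
The method above simply hands its block to `Array#each` on the list of columns. The sketch below shows that delegation pattern with plain arrays; `each_column` is a hypothetical stand-in, and the `[name, values]` pairs replace the Series objects the real method would yield.

```ruby
# Hypothetical stand-in for the delegation above: with a block it yields
# each column, without one Array#each hands back an Enumerator.
def each_column(columns, &block)
  columns.each(&block)
end

columns = [["foo", [1, 2, 3]], ["bar", [4, 5, 6]]]

names = []
each_column(columns) { |name, _values| names << name }

enum = each_column(columns)            # no block given
first_column = enum.next
```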

#each_row(named: true, buffer_size: 500, &block) ⇒ Object

Iterates over the rows of the DataFrame, yielding each row as Ruby-native values.

Parameters:

  • named (Boolean) (defaults to: true)

    Return hashes instead of arrays. The hashes are a mapping of column name to row value. This is more expensive than returning an array, but allows for accessing values by column name.

  • buffer_size (Integer) (defaults to: 500)

    Determines the number of rows that are buffered internally while iterating over the data; you should only modify this in very specific cases where the default value is determined not to be a good fit to your access pattern, as the speedup from using the buffer is significant (~2-4x). Setting this value to zero disables row buffering.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 4864

def each_row(named: true, buffer_size: 500, &block)
  iter_rows(named: named, buffer_size: buffer_size, &block)
end
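
To make the `named` and `buffer_size` parameters concrete, here is a rough sketch of buffered row iteration over column-oriented data in plain Ruby. `each_row_sketch` is hypothetical and says nothing about how the library buffers internally; it only shows rows being materialized in chunks, hashes versus arrays as row values, and a `buffer_size` of zero degrading to one row at a time.

```ruby
# Hypothetical sketch of row iteration over column-oriented data.
# columns: Hash of name => Array. Rows are materialized buffer_size at a
# time; a buffer_size of 0 falls back to building one row per step.
def each_row_sketch(columns, named: true, buffer_size: 500)
  names = columns.keys
  height = columns.values.first&.length || 0
  step = buffer_size.zero? ? 1 : buffer_size

  (0...height).each_slice(step) do |indexes|
    # Build the whole chunk of rows first, then hand them out one by one.
    chunk = indexes.map { |i| names.map { |n| columns[n][i] } }
    chunk.each do |row|
      yield(named ? names.zip(row).to_h : row)
    end
  end
end

data = {"a" => [1, 2, 3], "b" => ["x", "y", "z"]}

hashes = []
each_row_sketch(data, buffer_size: 2) { |row| hashes << row }

arrays = []
each_row_sketch(data, named: false, buffer_size: 0) { |row| arrays << row }
```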

#equals(other, null_equal: true) ⇒ Boolean Also known as: frame_equal

Check if DataFrame is equal to other.

Examples:

df1 = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df2 = Polars::DataFrame.new(
  {
    "foo" => [3, 2, 1],
    "bar" => [8.0, 7.0, 6.0],
    "ham" => ["c", "b", "a"]
  }
)
df1.equals(df1)
# => true
df1.equals(df2)
# => false

Parameters:

  • other (DataFrame)

    DataFrame to compare with.

  • null_equal (Boolean) (defaults to: true)

    Consider null values as equal.

Returns:

  • (Boolean)

# File 'lib/polars/data_frame.rb', line 1514

def equals(other, null_equal: true)
  _df.equals(other._df, null_equal)
end

#estimated_size(unit = "b") ⇒ Numeric

Return an estimation of the total (heap) allocated size of the DataFrame.

Estimated size is given in the specified unit (bytes by default).

This estimate is the sum of the sizes of its buffers and validity bitmaps, including those of nested arrays. Multiple arrays may share buffers and bitmaps, so the size of two arrays is not necessarily the sum of the sizes computed by this function. In particular, the size of a StructArray is an upper bound.

When an array is sliced, its allocated size remains constant because the buffer is unchanged. However, this function will yield a smaller number, because it returns the visible size of the buffer, not its total capacity.

FFI buffers are included in this estimation.

Examples:

df = Polars::DataFrame.new(
  {
    "x" => 1_000_000.times.to_a.reverse,
    "y" => 1_000_000.times.map { |v| v / 1000.0 },
    "z" => 1_000_000.times.map(&:to_s)
  },
  columns: {"x" => :u32, "y" => :f64, "z" => :str}
)
df.estimated_size
# => 25888898
df.estimated_size("mb")
# => 24.689577102661133

Parameters:

  • unit ("b", "kb", "mb", "gb", "tb") (defaults to: "b")

    Scale the returned size to the given unit.

Returns:

  • (Numeric)


# File 'lib/polars/data_frame.rb', line 1064

def estimated_size(unit = "b")
  sz = _df.estimated_size
  Utils.scale_bytes(sz, to: unit)
end
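
The unit conversion is delegated to `Utils.scale_bytes`, whose body is not shown here. As a sketch of what such a scaling step could look like, the hypothetical `scale_bytes_sketch` below assumes binary units (1 kb = 1024 bytes); that factor is an assumption of this example, not something this page confirms.

```ruby
# Hypothetical re-implementation of the unit scaling step; it assumes
# binary units (1 kb = 1024 bytes), which the library may or may not use.
def scale_bytes_sketch(size, unit = "b")
  factors = {"b" => 1, "kb" => 1024, "mb" => 1024**2, "gb" => 1024**3, "tb" => 1024**4}
  factor = factors.fetch(unit) { raise ArgumentError, "unknown unit: #{unit}" }
  # Bytes stay an Integer; every larger unit becomes a Float.
  factor == 1 ? size : size / factor.to_f
end

in_bytes = scale_bytes_sketch(3_145_728)
in_mb    = scale_bytes_sketch(3_145_728, "mb")
in_kb    = scale_bytes_sketch(1536, "kb")
```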

#explode(columns) ⇒ DataFrame

Explode DataFrame to long format by exploding a column with Lists.

Examples:

df = Polars::DataFrame.new(
  {
    "letters" => ["a", "a", "b", "c"],
    "numbers" => [[1], [2, 3], [4, 5], [6, 7, 8]]
  }
)
df.explode("numbers")
# =>
# shape: (8, 2)
# ┌─────────┬─────────┐
# │ letters ┆ numbers │
# │ ---     ┆ ---     │
# │ str     ┆ i64     │
# ╞═════════╪═════════╡
# │ a       ┆ 1       │
# │ a       ┆ 2       │
# │ a       ┆ 3       │
# │ b       ┆ 4       │
# │ b       ┆ 5       │
# │ c       ┆ 6       │
# │ c       ┆ 7       │
# │ c       ┆ 8       │
# └─────────┴─────────┘

Parameters:

  • columns (Object)

    Column of LargeList type.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3240

def explode(columns)
  lazy.explode(columns).collect(no_optimization: true)
end

#extend(other) ⇒ DataFrame

Extend the memory backed by this DataFrame with the values from other.

Unlike vstack, which adds the chunks from other to the chunks of this DataFrame, extend appends the data from other to the underlying memory locations and thus may cause a reallocation.

If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.

Prefer extend over vstack when you want to do a query after a single append, for instance during online operations where you add n rows and rerun a query.

Prefer vstack over extend when you want to append many times before doing a query, for instance when you read in multiple files and want to store them in a single DataFrame. In the latter case, finish the sequence of vstack operations with a rechunk.

Examples:

df1 = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df2 = Polars::DataFrame.new({"foo" => [10, 20, 30], "bar" => [40, 50, 60]})
df1.extend(df2)
# =>
# shape: (6, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 4   │
# │ 2   ┆ 5   │
# │ 3   ┆ 6   │
# │ 10  ┆ 40  │
# │ 20  ┆ 50  │
# │ 30  ┆ 60  │
# └─────┴─────┘

Parameters:

  • other (DataFrame)

    DataFrame to vertically add.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2778

def extend(other)
  _df.extend(other._df)
  self
end
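
The difference between appending chunks and appending into existing memory can be pictured with a toy model in which a column is a list of chunks. This is an illustration of the idea only, not the library's memory layout; `vstack_chunks`, `extend_chunks`, and `rechunk_chunks` are hypothetical names.

```ruby
# Toy model (not the library's memory layout): a column is an Array of
# chunks, and each chunk is an Array of values.

# vstack-style append: keep the other column's chunks as separate chunks.
def vstack_chunks(chunks, other_chunks)
  chunks + other_chunks
end

# extend-style append: copy the other values into the last existing chunk,
# growing it in place.
def extend_chunks(chunks, other_chunks)
  chunks << [] if chunks.empty?
  chunks.last.concat(other_chunks.flatten(1))
  chunks
end

# Merge all chunks into a single one, as a rechunk step would.
def rechunk_chunks(chunks)
  [chunks.flatten(1)]
end

stacked  = vstack_chunks([[1, 2, 3]], [[10, 20, 30]])
extended = extend_chunks([[1, 2, 3]], [[10, 20, 30]])
merged   = rechunk_chunks(stacked)
```

In this model `stacked` holds two chunks until it is rechunked, while `extended` stays a single chunk, which mirrors the trade-off described above.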

#fill_nan(fill_value) ⇒ DataFrame

Note:

Note that floating point NaNs (Not a Number) are not missing values! To replace missing values, use fill_null.

Fill floating point NaN values by an Expression evaluation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1.5, 2, Float::NAN, 4],
    "b" => [0.5, 4, Float::NAN, 13]
  }
)
df.fill_nan(99)
# =>
# shape: (4, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ f64  ┆ f64  │
# ╞══════╪══════╡
# │ 1.5  ┆ 0.5  │
# │ 2.0  ┆ 4.0  │
# │ 99.0 ┆ 99.0 │
# │ 4.0  ┆ 13.0 │
# └──────┴──────┘

Parameters:

  • fill_value (Object)

    Value to fill NaN with.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3205

def fill_nan(fill_value)
  lazy.fill_nan(fill_value).collect(no_optimization: true)
end
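
The note about NaN not being a missing value has a direct counterpart in plain Ruby, where `Float::NAN` and `nil` are unrelated values. The hypothetical helpers below work on ordinary arrays and show that filling one kind leaves the other untouched; they are an analogy, not the library's implementation.

```ruby
# Hypothetical helper: replace NaN entries only, leaving nil (missing)
# entries untouched.
def fill_nan_sketch(values, fill_value)
  values.map { |v| v.is_a?(Float) && v.nan? ? fill_value : v }
end

# Hypothetical helper: replace nil entries only, leaving NaN untouched.
def fill_null_sketch(values, fill_value)
  values.map { |v| v.nil? ? fill_value : v }
end

column = [1.5, Float::NAN, nil, 4.0]
nan_filled  = fill_nan_sketch(column, 99.0)
null_filled = fill_null_sketch(column, 99.0)
```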

#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame

Fill null values using the specified value or strategy.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, nil, 4],
    "b" => [0.5, 4, nil, 13]
  }
)
df.fill_null(99)
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 99  ┆ 99.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
df.fill_null(strategy: "forward")
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
df.fill_null(strategy: "max")
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
df.fill_null(strategy: "zero")
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 0   ┆ 0.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘

Parameters:

  • value (Numeric) (defaults to: nil)

    Value used to fill null values.

  • strategy (nil, "forward", "backward", "min", "max", "mean", "zero", "one") (defaults to: nil)

    Strategy used to fill null values.

  • limit (Integer) (defaults to: nil)

    Number of consecutive null values to fill when using the 'forward' or 'backward' strategy.

  • matches_supertype (Boolean) (defaults to: true)

    Fill every column whose dtype shares a supertype with the fill value, not only columns of the value's exact dtype.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3165

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true)
  _from_rbdf(
    lazy
      .fill_null(value, strategy: strategy, limit: limit, matches_supertype: matches_supertype)
      .collect(no_optimization: true)
      ._df
  )
end
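
The examples above do not exercise `limit`. As a sketch of what "number of consecutive null values to fill" means for the forward strategy, the hypothetical helper below forward-fills a plain Ruby array and stops after `limit` consecutive fills. It models the documented description and is not taken from the library.

```ruby
# Hypothetical helper: forward-fill nil entries, filling at most `limit`
# consecutive nils after each known value (a nil limit means no cap).
def forward_fill_sketch(values, limit: nil)
  last_seen = nil
  run = 0 # consecutive nils seen since the last known value
  values.map do |v|
    if !v.nil?
      last_seen = v
      run = 0
      v
    elsif !last_seen.nil? && (limit.nil? || run < limit)
      run += 1
      last_seen
    else
      run += 1
      nil
    end
  end
end

column = [1, nil, nil, nil, 5, nil]
unlimited = forward_fill_sketch(column)
capped    = forward_fill_sketch(column, limit: 1)
```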

#filter(predicate) ⇒ DataFrame

Filter the rows in the DataFrame based on a predicate expression.

Examples:

Filter on one condition:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.filter(Polars.col("foo") < 3)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘

Filter on multiple conditions:

df.filter((Polars.col("foo") < 3) & (Polars.col("ham") == "a"))
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Parameters:

  • predicate (Expr)

    Expression that evaluates to a boolean Series.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1287

def filter(predicate)
  lazy.filter(predicate).collect
end

#flags ⇒ Hash

Get flags that are set on the columns of this DataFrame.

Returns:

  • (Hash)


# File 'lib/polars/data_frame.rb', line 198

def flags
  columns.to_h { |name| [name, self[name].flags] }
end

#fold ⇒ Series

Apply a horizontal reduction on a DataFrame.

This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercast (cast to a similar parent type).

Examples of the supercast rules when applying an arithmetic operation to two DataTypes:

  • i8 + str = str
  • f32 + i64 = f32
  • f32 + f64 = f64

Examples:

A horizontal sum operation:

df = Polars::DataFrame.new(
  {
    "a" => [2, 1, 3],
    "b" => [1, 2, 3],
    "c" => [1.0, 2.0, 3.0]
  }
)
df.fold { |s1, s2| s1 + s2 }
# =>
# shape: (3,)
# Series: 'a' [f64]
# [
#         4.0
#         5.0
#         9.0
# ]

A horizontal minimum operation:

df = Polars::DataFrame.new({"a" => [2, 1, 3], "b" => [1, 2, 3], "c" => [1.0, 2.0, 3.0]})
df.fold { |s1, s2| s1.zip_with(s1 < s2, s2) }
# =>
# shape: (3,)
# Series: 'a' [f64]
# [
#         1.0
#         1.0
#         3.0
# ]

A horizontal string concatenation:

df = Polars::DataFrame.new(
  {
    "a" => ["foo", "bar", nil],
    "b" => [1, 2, 3],
    "c" => [1.0, 2.0, 3.0]
  }
)
df.fold { |s1, s2| s1 + s2 }
# =>
# shape: (3,)
# Series: 'a' [str]
# [
#         "foo11.0"
#         "bar22.0"
#         null
# ]

A horizontal boolean or, similar to a row-wise .any:

df = Polars::DataFrame.new(
  {
    "a" => [false, false, true],
    "b" => [false, true, false]
  }
)
df.fold { |s1, s2| s1 | s2 }
# =>
# shape: (3,)
# Series: 'a' [bool]
# [
#         false
#         true
#         true
# ]

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 4673

def fold
  acc = to_series(0)

  1.upto(width - 1) do |i|
    acc = yield(acc, to_series(i))
  end
  acc
end

#gather_every(n, offset = 0) ⇒ DataFrame Also known as: take_every

Take every nth row in the DataFrame and return as a new DataFrame.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [5, 6, 7, 8]})
df.gather_every(2)
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 5   │
# │ 3   ┆ 7   │
# └─────┴─────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4901

def gather_every(n, offset = 0)
  select(F.col("*").gather_every(n, offset))
end
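
The `offset` argument is not covered by the example above. A minimal sketch of the selection rule in plain Ruby, using a hypothetical `gather_every_sketch` over an array of rows, assuming that `offset` is the index of the first row taken:

```ruby
# Hypothetical helper: pick every nth element, starting at `offset`.
def gather_every_sketch(values, n, offset = 0)
  raise ArgumentError, "n must be positive" unless n.positive?

  (offset...values.length).step(n).map { |i| values[i] }
end

rows = [[1, 5], [2, 6], [3, 7], [4, 8]]
every_second        = gather_every_sketch(rows, 2)
every_second_from_1 = gather_every_sketch(rows, 2, 1)
```

With no offset this selects the same rows as the documented example (the rows holding 1, 5 and 3, 7); with an offset of 1 it selects the other two.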

#get_column(name) ⇒ Series

Get a single column as Series by name.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.get_column("foo")
# =>
# shape: (3,)
# Series: 'foo' [i64]
# [
#         1
#         2
#         3
# ]

Parameters:

  • name (String)

    Name of the column to retrieve.

Returns:



# File 'lib/polars/data_frame.rb', line 3082

def get_column(name)
  self[name]
end

#get_column_index(name) ⇒ Integer Also known as: find_idx_by_name

Find the index of a column by name.

Examples:

df = Polars::DataFrame.new(
  {"foo" => [1, 2, 3], "bar" => [6, 7, 8], "ham" => ["a", "b", "c"]}
)
df.get_column_index("ham")
# => 2

Parameters:

  • name (String)

    Name of the column to find.

Returns:



# File 'lib/polars/data_frame.rb', line 1374

def get_column_index(name)
  _df.get_column_index(name)
end

#get_columns ⇒ Array

Get the DataFrame as an Array of Series.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.get_columns
# =>
# [shape: (3,)
# Series: 'foo' [i64]
# [
#         1
#         2
#         3
# ], shape: (3,)
# Series: 'bar' [i64]
# [
#         4
#         5
#         6
# ]]
df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
df.get_columns
# =>
# [shape: (4,)
# Series: 'a' [i64]
# [
#         1
#         2
#         3
#         4
# ], shape: (4,)
# Series: 'b' [f64]
# [
#         0.5
#         4.0
#         10.0
#         13.0
# ], shape: (4,)
# Series: 'c' [bool]
# [
#         true
#         true
#         false
#         true
# ]]

Returns:



# File 'lib/polars/data_frame.rb', line 3060

def get_columns
  _df.get_columns.map { |s| Utils.wrap_s(s) }
end

#group_by(by, maintain_order: false) ⇒ GroupBy Also known as: groupby, group

Start a group by operation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
)
df.group_by("a").agg(Polars.col("b").sum).sort("a")
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 4   │
# │ b   ┆ 11  │
# │ c   ┆ 6   │
# └─────┴─────┘

Parameters:

  • by (Object)

    Column(s) to group by.

  • maintain_order (Boolean) (defaults to: false)

    Make sure that the order of the groups remains consistent. This is more expensive than a default group by. Note that this only works in expression aggregations.

Returns:



# File 'lib/polars/data_frame.rb', line 1810

def group_by(by, maintain_order: false)
  if !Utils.bool?(maintain_order)
    raise TypeError, "invalid input for group_by arg `maintain_order`: #{maintain_order}."
  end
  GroupBy.new(
    self,
    by,
    maintain_order: maintain_order
  )
end
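
To make the group-then-aggregate idea concrete, here is a plain-Ruby sketch of the example above using ordinary arrays in place of columns. It illustrates the concept only and says nothing about how GroupBy is implemented.

```ruby
# Plain arrays standing in for columns "a" and "b" of the example frame.
keys   = ["a", "b", "a", "b", "b", "c"]
values = [1, 2, 3, 4, 5, 6]

# Bucket the (key, value) pairs by key; Ruby hashes keep first-seen key order.
groups = keys.zip(values).group_by { |key, _| key }

# Aggregate each bucket, analogous to agg(Polars.col("b").sum).
sums = groups.transform_values { |pairs| pairs.sum { |_, value| value } }
```

Here `sums` maps "a" to 4, "b" to 11 and "c" to 6, the same totals as in the example output.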

#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: true, include_boundaries: false, closed: "left", by: nil, start_by: "window") ⇒ DataFrame Also known as: groupby_dynamic

Group based on a time value (or index value of type :i32, :i64).

Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window can be seen as a rolling window whose size is determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

  • every: interval of the window
  • period: length of the window
  • offset: offset of the window

The every, period and offset arguments are created with the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a group_by_dynamic on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10
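
The interplay of every, period and closed can be sketched in plain Ruby for an integer index. This is a simplified model under stated assumptions: the lower boundary of the first window (`first_lower`) is passed in explicitly, whereas Polars derives it from the data, and the helper names `in_window?` and `assign_windows` are hypothetical.

```ruby
# Does `value` fall inside the window (lower, upper) under the `closed` rule?
def in_window?(value, lower, upper, closed)
  case closed
  when "left"  then lower <= value && value < upper
  when "right" then lower < value && value <= upper
  when "both"  then lower <= value && value <= upper
  when "none"  then lower < value && value < upper
  else raise ArgumentError, "unknown closed: #{closed}"
  end
end

# Windows start every `every` units and span `period` units each.
# `first_lower` is an explicit input here (an assumption of this sketch).
def assign_windows(values, first_lower:, every:, period:, closed:)
  windows = {}
  lower = first_lower
  while lower <= values.max
    upper = lower + period
    members = values.select { |v| in_window?(v, lower, upper, closed) }
    windows[[lower, upper]] = members unless members.empty?
    lower += every
  end
  windows
end

idx = [0, 1, 2, 3, 4, 5]
windows = assign_windows(idx, first_lower: -2, every: 2, period: 3, closed: "right")
```

Because period (3) is larger than every (2), consecutive windows overlap, so the values 1 and 3 each land in two windows. The boundaries and memberships agree with the index-column example further down.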

Examples:

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "n" => 0..6
  }
)
# =>
# shape: (7, 2)
# ┌─────────────────────┬─────┐
# │ time                ┆ n   │
# │ ---                 ┆ --- │
# │ datetime[μs]        ┆ i64 │
# ╞═════════════════════╪═════╡
# │ 2021-12-16 00:00:00 ┆ 0   │
# │ 2021-12-16 00:30:00 ┆ 1   │
# │ 2021-12-16 01:00:00 ┆ 2   │
# │ 2021-12-16 01:30:00 ┆ 3   │
# │ 2021-12-16 02:00:00 ┆ 4   │
# │ 2021-12-16 02:30:00 ┆ 5   │
# │ 2021-12-16 03:00:00 ┆ 6   │
# └─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

df.group_by_dynamic("time", every: "1h", closed: "right").agg(
  [
    Polars.col("time").min.alias("time_min"),
    Polars.col("time").max.alias("time_max")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬─────────────────────┬─────────────────────┐
# │ time                ┆ time_min            ┆ time_max            │
# │ ---                 ┆ ---                 ┆ ---                 │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
# └─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result.

df.group_by_dynamic(
  "time", every: "1h", include_boundaries: true, closed: "right"
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (4, 4)
# ┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# └─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

When closed="left", the right end of each interval is not included.

df.group_by_dynamic("time", every: "1h", closed: "left").agg(
  [
    Polars.col("time").count.alias("time_count"),
    Polars.col("time").alias("time_agg_list")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬─────────────────────────────────┐
# │ time                ┆ time_count ┆ time_agg_list                   │
# │ ---                 ┆ ---        ┆ ---                             │
# │ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]              │
# ╞═════════════════════╪════════════╪═════════════════════════════════╡
# │ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-… │
# │ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-… │
# │ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-… │
# │ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]           │
# └─────────────────────┴────────────┴─────────────────────────────────┘

When closed="both" the time values at the window boundaries belong to 2 groups.

df.group_by_dynamic("time", every: "1h", closed: "both").agg(
  [Polars.col("time").count.alias("time_count")]
)
# =>
# shape: (5, 2)
# ┌─────────────────────┬────────────┐
# │ time                ┆ time_count │
# │ ---                 ┆ ---        │
# │ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 3          │
# │ 2021-12-16 01:00:00 ┆ 3          │
# │ 2021-12-16 02:00:00 ┆ 3          │
# │ 2021-12-16 03:00:00 ┆ 1          │
# └─────────────────────┴────────────┘

Dynamic group bys can also be combined with grouping on normal keys.

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "groups" => ["a", "a", "a", "b", "b", "a", "a"]
  }
)
df.group_by_dynamic(
  "time",
  every: "1h",
  closed: "both",
  by: "groups",
  include_boundaries: true
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (7, 5)
# ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
# │ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# │ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
# │ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
# └────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic group by on an index column.

df = Polars::DataFrame.new(
  {
    "idx" => Polars.arange(0, 6, eager: true),
    "A" => ["A", "A", "B", "B", "B", "C"]
  }
)
df.group_by_dynamic(
  "idx",
  every: "2i",
  period: "3i",
  include_boundaries: true,
  closed: "right"
).agg(Polars.col("A").alias("A_agg_list"))
# =>
# shape: (4, 4)
# ┌─────────────────┬─────────────────┬─────┬─────────────────┐
# │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
# │ ---             ┆ ---             ┆ --- ┆ ---             │
# │ i64             ┆ i64             ┆ i64 ┆ list[str]       │
# ╞═════════════════╪═════════════════╪═════╪═════════════════╡
# │ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
# │ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
# │ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
# │ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
# └─────────────────┴─────────────────┴─────┴─────────────────┘

Parameters:

  • index_column

    Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if it is not, the output will not make sense.

    In case of a dynamic group by on indices, dtype needs to be one of :i32, :i64. Note that :i32 gets temporarily cast to :i64, so if performance matters use an :i64 column.

  • every

    Interval of the window.

  • period (defaults to: nil)

    Length of the window; if nil, it is equal to every.

  • offset (defaults to: nil)

    Offset of the window; if nil and period is nil, it is equal to negative every.

  • truncate (defaults to: true)

    Truncate the time value to the window lower bound.

  • include_boundaries (defaults to: false)

    Add the lower and upper bound of the window to the "_lower_boundary" and "_upper_boundary" columns. This will impact performance because it is harder to parallelize.

  • closed ("right", "left", "both", "none") (defaults to: "left")

    Define whether the temporal window interval is closed or not.

  • by (defaults to: nil)

    Also group by this column or these columns.

Returns:



# File 'lib/polars/data_frame.rb', line 2150

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: true,
  include_boundaries: false,
  closed: "left",
  by: nil,
  start_by: "window"
)
  DynamicGroupBy.new(
    self,
    index_column,
    every,
    period,
    offset,
    truncate,
    include_boundaries,
    closed,
    by,
    start_by
  )
end

#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ Series

Hash and combine the rows in this DataFrame.

The hash value is of type :u64.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 3, 4],
    "ham" => ["a", "b", nil, "d"]
  }
)
df.hash_rows(seed: 42)
# =>
# shape: (4,)
# Series: '' [u64]
# [
#         4238614331852490969
#         17976148875586754089
#         4702262519505526977
#         18144177983981041107
# ]

Parameters:

  • seed (Integer) (defaults to: 0)

    Random seed parameter. Defaults to 0.

  • seed_1 (Integer) (defaults to: nil)

    Random seed parameter. Defaults to seed if not set.

  • seed_2 (Integer) (defaults to: nil)

    Random seed parameter. Defaults to seed if not set.

  • seed_3 (Integer) (defaults to: nil)

    Random seed parameter. Defaults to seed if not set.

Returns:



# File 'lib/polars/data_frame.rb', line 4938

def hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil)
  k0 = seed
  k1 = seed_1.nil? ? seed : seed_1
  k2 = seed_2.nil? ? seed : seed_2
  k3 = seed_3.nil? ? seed : seed_3
  Utils.wrap_s(_df.hash_rows(k0, k1, k2, k3))
end

#head(n = 5) ⇒ DataFrame

Get the first n rows.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.head(3)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:



# File 'lib/polars/data_frame.rb', line 1641

def head(n = 5)
  _from_rbdf(_df.head(n))
end

#height ⇒ Integer Also known as: count, length, size

Get the height of the DataFrame.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3, 4, 5]})
df.height
# => 5

Returns:

  • (Integer)


# File 'lib/polars/data_frame.rb', line 107

def height
  _df.height
end

#hstack(columns, in_place: false) ⇒ DataFrame

Return a new DataFrame grown horizontally by stacking multiple Series to it.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
x = Polars::Series.new("apple", [10, 20, 30])
df.hstack([x])
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ i64 ┆ str ┆ i64   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6   ┆ a   ┆ 10    │
# │ 2   ┆ 7   ┆ b   ┆ 20    │
# │ 3   ┆ 8   ┆ c   ┆ 30    │
# └─────┴─────┴─────┴───────┘

Parameters:

  • columns (Object)

    Series to stack.

  • in_place (Boolean) (defaults to: false)

    Modify in place.

Returns:



# File 'lib/polars/data_frame.rb', line 2680

def hstack(columns, in_place: false)
  if !columns.is_a?(::Array)
    columns = columns.get_columns
  end
  if in_place
    _df.hstack_mut(columns.map(&:_s))
    self
  else
    _from_rbdf(_df.hstack(columns.map(&:_s)))
  end
end

#include?(name) ⇒ Boolean

Check whether the DataFrame includes a column with the given name.

Returns:



# File 'lib/polars/data_frame.rb', line 340

def include?(name)
  columns.include?(name)
end

#insert_column(index, series) ⇒ DataFrame Also known as: insert_at_idx

Insert a Series at a certain column index. This operation is in place.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
s = Polars::Series.new("baz", [97, 98, 99])
df.insert_column(1, s)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ baz ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 97  ┆ 4   │
# │ 2   ┆ 98  ┆ 5   │
# │ 3   ┆ 99  ┆ 6   │
# └─────┴─────┴─────┘
df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
s = Polars::Series.new("d", [-2.5, 15, 20.5, 0])
df.insert_column(3, s)
# =>
# shape: (4, 4)
# ┌─────┬──────┬───────┬──────┐
# │ a   ┆ b    ┆ c     ┆ d    │
# │ --- ┆ ---  ┆ ---   ┆ ---  │
# │ i64 ┆ f64  ┆ bool  ┆ f64  │
# ╞═════╪══════╪═══════╪══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ -2.5 │
# │ 2   ┆ 4.0  ┆ true  ┆ 15.0 │
# │ 3   ┆ 10.0 ┆ false ┆ 20.5 │
# │ 4   ┆ 13.0 ┆ true  ┆ 0.0  │
# └─────┴──────┴───────┴──────┘

Parameters:

  • index (Integer)

    Column index at which to insert the new Series.

  • series (Series)

    Series to insert.

Returns:



# File 'lib/polars/data_frame.rb', line 1240

def insert_column(index, series)
  if index < 0
    index = columns.length + index
  end
  _df.insert_column(index, series._s)
  self
end
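
The negative-index handling in the source above can be illustrated on a plain array of column names. This sketch assumes that the underlying insert places the new column at the normalized position and shifts later columns right; the helper name `insert_name` is hypothetical.

```ruby
# Hypothetical helper mirroring the index normalization in the source above,
# applied to a plain array of column names instead of a DataFrame.
def insert_name(names, index, name)
  index = names.length + index if index < 0  # e.g. -1 becomes length - 1
  names.insert(index, name)                  # mutates `names` in place
  names
end

names = ["foo", "bar"]
insert_name(names, 1, "baz")

# -1 is normalized to 2, so "d" lands before the current last column.
from_end = insert_name(["a", "b", "c"], -1, "d")
```

Note that this differs from calling Array#insert with a negative index directly, where -1 appends at the end.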

#interpolate ⇒ DataFrame

Interpolate intermediate values. The interpolation method is linear.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 9, 10],
    "bar" => [6, 7, 9, nil],
    "baz" => [1, nil, nil, 9]
  }
)
df.interpolate
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────────┐
# │ foo  ┆ bar  ┆ baz      │
# │ ---  ┆ ---  ┆ ---      │
# │ f64  ┆ f64  ┆ f64      │
# ╞══════╪══════╪══════════╡
# │ 1.0  ┆ 6.0  ┆ 1.0      │
# │ 5.0  ┆ 7.0  ┆ 3.666667 │
# │ 9.0  ┆ 9.0  ┆ 6.333333 │
# │ 10.0 ┆ null ┆ 9.0      │
# └──────┴──────┴──────────┘

Returns:



# File 'lib/polars/data_frame.rb', line 4971

def interpolate
  select(F.col("*").interpolate)
end
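
For readers who want to see what linear interpolation does to a single column, the following plain-Ruby sketch fills nil gaps between known values. It is an illustration written for this page, not the Polars implementation; the helper name `interpolate_linear` is hypothetical, and leading or trailing nils are left untouched, as in the example output above.

```ruby
# Hypothetical helper: fill nil gaps that lie between two known values
# by drawing a straight line between the neighbours.
def interpolate_linear(values)
  out = values.map { |v| v&.to_f }
  last = nil # index of the most recent non-nil value
  out.each_index do |i|
    next if out[i].nil?
    if last && i - last > 1
      step = (out[i] - out[last]) / (i - last)
      ((last + 1)...i).each { |j| out[j] = out[last] + step * (j - last) }
    end
    last = i
  end
  out
end

foo = interpolate_linear([1, nil, 9, 10])
bar = interpolate_linear([6, 7, 9, nil])
baz = interpolate_linear([1, nil, nil, 9])
```

The two missing values in `baz` are filled with 1 + 8/3 and 1 + 16/3, which round to the 3.666667 and 6.333333 shown in the example.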

#is_duplicated ⇒ Series

Get a mask of all duplicated rows in this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 1],
    "b" => ["x", "y", "z", "x"],
  }
)
df.is_duplicated
# =>
# shape: (4,)
# Series: '' [bool]
# [
#         true
#         false
#         false
#         true
# ]

Returns:



# File 'lib/polars/data_frame.rb', line 3717

def is_duplicated
  Utils.wrap_s(_df.is_duplicated)
end

#is_empty ⇒ Boolean Also known as: empty?

Check if the dataframe is empty.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.is_empty
# => false
df.filter(Polars.col("foo") > 99).is_empty
# => true

Returns:



# File 'lib/polars/data_frame.rb', line 4985

def is_empty
  height == 0
end

#is_unique ⇒ Series

Get a mask of all unique rows in this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 1],
    "b" => ["x", "y", "z", "x"]
  }
)
df.is_unique
# =>
# shape: (4,)
# Series: '' [bool]
# [
#         false
#         true
#         true
#         false
# ]

Returns:



# File 'lib/polars/data_frame.rb', line 3742

def is_unique
  Utils.wrap_s(_df.is_unique)
end
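
The duplicated and unique masks are complements of each other. A plain-Ruby sketch over an array of rows makes the relationship explicit; it uses Enumerable#tally (Ruby 2.7+) and is only a conceptual model of the two methods.

```ruby
# Rows of the example frame as plain arrays.
rows = [[1, "x"], [2, "y"], [3, "z"], [1, "x"]]

# Count how often each distinct row occurs.
counts = rows.tally

# A row is "duplicated" if it occurs more than once, "unique" if exactly once.
duplicated_mask = rows.map { |row| counts[row] > 1 }
unique_mask     = rows.map { |row| counts[row] == 1 }
```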

#item ⇒ Object

Return the dataframe as a scalar.

Equivalent to df[0,0], with a check that the shape is (1,1).

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => [4, 5, 6]})
result = df.select((Polars.col("a") * Polars.col("b")).sum)
result.item
# => 32

Returns:



# File 'lib/polars/data_frame.rb', line 509

def item
  if shape != [1, 1]
    raise ArgumentError, "Can only call .item if the dataframe is of shape (1,1), dataframe is of shape #{shape}"
  end
  self[0, 0]
end

#iter_rows(named: false, buffer_size: 500, &block) ⇒ Object

Returns an iterator over the rows of the DataFrame as Ruby-native values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.iter_rows.map { |row| row[0] }
# => [1, 3, 5]
df.iter_rows(named: true).map { |row| row["b"] }
# => [2, 4, 6]

Parameters:

  • named (Boolean) (defaults to: false)

    Return hashes instead of arrays. The hashes are a mapping of column name to row value. This is more expensive than returning an array, but allows for accessing values by column name.

  • buffer_size (Integer) (defaults to: 500)

    Determines the number of rows that are buffered internally while iterating over the data. Only modify this in very specific cases where the default value is not a good fit for your access pattern, as the speedup from using the buffer is significant (~2-4x). In the source shown below, row buffering is skipped only when this value is nil; 0 is truthy in Ruby and does not take the unbuffered path.

Returns:



# File 'lib/polars/data_frame.rb', line 4817

def iter_rows(named: false, buffer_size: 500, &block)
  return to_enum(:iter_rows, named: named, buffer_size: buffer_size) unless block_given?

  # load into the local namespace for a modest performance boost in the hot loops
  columns = self.columns

  # note: buffering rows results in a 2-4x speedup over individual calls
  # to ".row(i)", so it should only be disabled in extremely specific cases.
  if buffer_size
    offset = 0
    while offset < height
      zerocopy_slice = slice(offset, buffer_size)
      rows_chunk = zerocopy_slice.rows(named: false)
      if named
        rows_chunk.each do |row|
          yield columns.zip(row).to_h
        end
      else
        rows_chunk.each(&block)
      end
      offset += buffer_size
    end
  elsif named
    height.times do |i|
      yield columns.zip(row(i)).to_h
    end
  else
    height.times do |i|
      yield row(i)
    end
  end
end
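
The two ideas in the source above, chunked iteration and the optional conversion of each row to a hash, can be sketched on plain arrays. The helper name `each_row` is hypothetical, and Array#each_slice stands in for the slice-and-materialize step; it is not how the DataFrame is actually read.

```ruby
# Hypothetical helper: walk `rows` one buffered chunk at a time and yield
# either the raw array or a column-name => value hash.
def each_row(columns, rows, named: false, buffer_size: 2)
  rows.each_slice(buffer_size) do |chunk|
    chunk.each do |row|
      yield(named ? columns.zip(row).to_h : row)
    end
  end
end

columns = ["a", "b"]
rows = [[1, 2], [3, 4], [5, 6]]

plain_rows = []
each_row(columns, rows) { |row| plain_rows << row }

named_rows = []
each_row(columns, rows, named: true) { |row| named_rows << row }
```

Building a hash per row is the extra cost that the named parameter description refers to.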

#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, coalesce: nil) ⇒ DataFrame

Join in SQL-like fashion.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
other_df = Polars::DataFrame.new(
  {
    "apple" => ["x", "y", "z"],
    "ham" => ["a", "b", "d"]
  }
)
df.join(other_df, on: "ham")
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "full")
# =>
# shape: (4, 5)
# ┌──────┬──────┬──────┬───────┬───────────┐
# │ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
# │ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
# │ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
# ╞══════╪══════╪══════╪═══════╪═══════════╡
# │ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
# │ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
# │ null ┆ null ┆ null ┆ z     ┆ d         │
# │ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
# └──────┴──────┴──────┴───────┴───────────┘
df.join(other_df, on: "ham", how: "left")
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# │ 3   ┆ 8.0 ┆ c   ┆ null  │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "semi")
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6.0 ┆ a   │
# │ 2   ┆ 7.0 ┆ b   │
# └─────┴─────┴─────┘
df.join(other_df, on: "ham", how: "anti")
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • other (DataFrame)

    DataFrame to join with.

  • left_on (Object) (defaults to: nil)

    Name(s) of the left join column(s).

  • right_on (Object) (defaults to: nil)

    Name(s) of the right join column(s).

  • on (Object) (defaults to: nil)

    Name(s) of the join columns in both DataFrames.

  • how ("inner", "left", "full", "semi", "anti", "cross") (defaults to: "inner")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • validate ('m:m', 'm:1', '1:m', '1:1') (defaults to: "m:m")

    Checks if join is of specified type.

    • many_to_many - “m:m”: default, does not result in checks
    • one_to_one - “1:1”: check if join keys are unique in both left and right datasets
    • one_to_many - “1:m”: check if join keys are unique in left dataset
    • many_to_one - “m:1”: check if join keys are unique in right dataset
  • join_nulls (Boolean) (defaults to: false)

    Join on null values. By default null values will never produce matches.

  • coalesce (Boolean) (defaults to: nil)

    Coalescing behavior (merging of join columns).

    • nil: -> join specific.
    • true: -> Always coalesce join columns.
    • false: -> Never coalesce join columns. Note that joining on any other expressions than col will turn off coalescing.

Returns:



# File 'lib/polars/data_frame.rb', line 2509

def join(other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  join_nulls: false,
  coalesce: nil
)
  lazy
    .join(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      how: how,
      suffix: suffix,
      validate: validate,
      join_nulls: join_nulls,
      coalesce: coalesce
    )
    .collect(no_optimization: true)
end
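
The join strategies from the examples can be modelled on plain arrays of hashes. This sketch is meant only to clarify what each strategy keeps; it ignores suffixes, null handling and coalescing, and the lambda `unique_keys` is a hypothetical stand-in for a "1:1" style validation.

```ruby
left = [
  {"foo" => 1, "ham" => "a"},
  {"foo" => 2, "ham" => "b"},
  {"foo" => 3, "ham" => "c"}
]
right = [
  {"apple" => "x", "ham" => "a"},
  {"apple" => "y", "ham" => "b"},
  {"apple" => "z", "ham" => "d"}
]

# Index the right-hand rows by join key.
right_by_key = right.group_by { |row| row["ham"] }

# inner: one output row per matching pair.
inner = left.flat_map do |l|
  right_by_key.fetch(l["ham"], []).map { |r| l.merge(r) }
end

# left: keep unmatched left rows and fill the right-hand column with nil.
left_join = left.flat_map do |l|
  matches = right_by_key[l["ham"]]
  matches ? matches.map { |r| l.merge(r) } : [l.merge("apple" => nil)]
end

# semi / anti: filter left rows by whether a match exists.
semi = left.select { |l| right_by_key.key?(l["ham"]) }
anti = left.reject { |l| right_by_key.key?(l["ham"]) }

# "1:1" style check: join keys must be unique on both sides.
unique_keys = ->(rows) { rows.map { |r| r["ham"] }.uniq.length == rows.length }
one_to_one = unique_keys.call(left) && unique_keys.call(right)
```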

#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true) ⇒ DataFrame

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the asof_join key.

For each row in the left DataFrame:

  • A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
  • A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.

The default is "backward".
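
The two search strategies can be sketched in plain Ruby on sorted keys. This is a linear-scan illustration of the matching rule only, not the algorithm Polars uses; the helper name `asof_lookup` is hypothetical, and the right-hand side is represented as sorted [key, value] pairs.

```ruby
require "date"

# Hypothetical helper: for each left key, find the matching right-hand value.
# `right_rows` must be sorted by key, as the join itself requires.
def asof_lookup(left_keys, right_rows, strategy: "backward")
  left_keys.map do |key|
    match =
      case strategy
      when "backward" then right_rows.reverse.find { |rk, _| rk <= key } # last key <= left key
      when "forward"  then right_rows.find { |rk, _| rk >= key }         # first key >= left key
      else raise ArgumentError, "unknown strategy: #{strategy}"
      end
    match && match[1] # nil when no right-hand row qualifies
  end
end

gdp = [
  [Date.new(2016, 1, 1), 4164],
  [Date.new(2017, 1, 1), 4411],
  [Date.new(2018, 1, 1), 4566],
  [Date.new(2019, 1, 1), 4696]
]
population_dates = [
  Date.new(2016, 5, 12),
  Date.new(2017, 5, 12),
  Date.new(2018, 5, 12),
  Date.new(2019, 5, 12)
]

backward = asof_lookup(population_dates, gdp)
forward  = asof_lookup(population_dates, gdp, strategy: "forward")
```

With "backward", each May date picks up the GDP recorded on the preceding 1 January. With "forward", the last date has no later record and yields nil.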

Examples:

gdp = Polars::DataFrame.new(
  {
    "date" => [
      DateTime.new(2016, 1, 1),
      DateTime.new(2017, 1, 1),
      DateTime.new(2018, 1, 1),
      DateTime.new(2019, 1, 1),
    ],  # note record date: Jan 1st (sorted!)
    "gdp" => [4164, 4411, 4566, 4696]
  }
).set_sorted("date")
population = Polars::DataFrame.new(
  {
    "date" => [
      DateTime.new(2016, 5, 12),
      DateTime.new(2017, 5, 12),
      DateTime.new(2018, 5, 12),
      DateTime.new(2019, 5, 12),
    ],  # note record date: May 12th (sorted!)
    "population" => [82.19, 82.66, 83.12, 83.52]
  }
).set_sorted("date")
population.join_asof(
  gdp, left_on: "date", right_on: "date", strategy: "backward"
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬──────┐
# │ date                ┆ population ┆ gdp  │
# │ ---                 ┆ ---        ┆ ---  │
# │ datetime[ns]        ┆ f64        ┆ i64  │
# ╞═════════════════════╪════════════╪══════╡
# │ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
# │ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
# │ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
# │ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
# └─────────────────────┴────────────┴──────┘

Parameters:

  • other (DataFrame)

    DataFrame to join with.

  • left_on (String) (defaults to: nil)

    Join column of the left DataFrame.

  • right_on (String) (defaults to: nil)

    Join column of the right DataFrame.

  • on (String) (defaults to: nil)

    Join column of both DataFrames. If set, left_on and right_on should be nil.

  • by (Object) (defaults to: nil)

    Join on these columns before performing the asof join.

  • by_left (Object) (defaults to: nil)

    Join on these columns of the left DataFrame before performing the asof join.

  • by_right (Object) (defaults to: nil)

    Join on these columns of the right DataFrame before performing the asof join.

  • strategy ("backward", "forward") (defaults to: "backward")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • tolerance (Object) (defaults to: nil)

    Numeric tolerance. When set, the join is only performed if the nearest keys are within this distance. For an asof join on columns of dtype "Date", "Datetime", "Duration" or "Time", use the following string language:

    • 1ns (1 nanosecond)
    • 1us (1 microsecond)
    • 1ms (1 millisecond)
    • 1s (1 second)
    • 1m (1 minute)
    • 1h (1 hour)
    • 1d (1 day)
    • 1w (1 week)
    • 1mo (1 calendar month)
    • 1y (1 calendar year)
    • 1i (1 index count)

    Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

  • allow_parallel (Boolean) (defaults to: true)

    Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

  • force_parallel (Boolean) (defaults to: false)

    Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

  • coalesce (Boolean) (defaults to: true)

    Coalescing behavior (merging of join columns).

    • true: -> Always coalesce join columns.
    • false: -> Never coalesce join columns. Note that joining on any other expressions than col will turn off coalescing.

Returns:



# File 'lib/polars/data_frame.rb', line 2365

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true
)
  lazy
    .join_asof(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      by_left: by_left,
      by_right: by_right,
      by: by,
      strategy: strategy,
      suffix: suffix,
      tolerance: tolerance,
      allow_parallel: allow_parallel,
      force_parallel: force_parallel,
      coalesce: coalesce
    )
    .collect(no_optimization: true)
end

#lazy ⇒ LazyFrame

Start a lazy query from this point.

Returns:



# File 'lib/polars/data_frame.rb', line 3749

def lazy
  wrap_ldf(_df.lazy)
end

#limit(n = 5) ⇒ DataFrame

Get the first n rows.

Alias for #head.

Examples:

df = Polars::DataFrame.new(
  {"foo" => [1, 2, 3, 4, 5, 6], "bar" => ["a", "b", "c", "d", "e", "f"]}
)
df.limit(4)
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# │ 4   ┆ d   │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:



# File 'lib/polars/data_frame.rb', line 1610

def limit(n = 5)
  head(n)
end

#map_rows(return_dtype: nil, inference_size: 256, &f) ⇒ Object Also known as: apply

Note:

The frame-level apply cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level apply syntax instead.

Apply a custom/user-defined function (UDF) over the rows of the DataFrame.

The UDF will receive each row as a tuple of values: udf(row).

Implementing logic using a Ruby function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:

  • The native expression engine runs in Rust; UDFs run in Ruby.
  • Use of Ruby UDFs forces the DataFrame to be materialized in memory.
  • Polars-native expressions can be parallelised (UDFs cannot).
  • Polars-native expressions can be logically optimised (UDFs cannot).

Wherever possible you should strongly prefer the native expression API to achieve the best performance.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [-1, 5, 8]})

Return a DataFrame by mapping each row to a tuple:

df.map_rows { |t| [t[0] * 2, t[1] * 3] }
# =>
# shape: (3, 2)
# ┌──────────┬──────────┐
# │ column_0 ┆ column_1 │
# │ ---      ┆ ---      │
# │ i64      ┆ i64      │
# ╞══════════╪══════════╡
# │ 2        ┆ -3       │
# │ 4        ┆ 15       │
# │ 6        ┆ 24       │
# └──────────┴──────────┘

Return a Series by mapping each row to a scalar:

df.map_rows { |t| t[0] * 2 + t[1] }
# =>
# shape: (3, 1)
# ┌─────┐
# │ map │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 9   │
# │ 14  │
# └─────┘

Parameters:

  • return_dtype (Symbol) (defaults to: nil)

    Output type of the operation. If none given, Polars tries to infer the type.

  • inference_size (Integer) (defaults to: 256)

    Only used when the custom function returns rows. The first n rows are used to determine the output schema.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 2594

def map_rows(return_dtype: nil, inference_size: 256, &f)
  out, is_df = _df.map_rows(f, return_dtype, inference_size)
  if is_df
    _from_rbdf(out)
  else
    _from_rbdf(Utils.wrap_s(out).to_frame._df)
  end
end
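
The way the block's return value decides the output shape of #map_rows can be illustrated without Polars. The following is a plain-Ruby sketch, not the library's implementation: map_rows_sketch is a hypothetical helper that takes an array of row arrays and returns a hash of column name to values, reusing the column_N and map names seen in the examples above.

```ruby
# Hypothetical helper, for illustration only: mimics how the block's return
# value decides the output shape. Rows are plain Ruby arrays.
def map_rows_sketch(rows)
  results = rows.map { |row| yield(row) }

  if results.first.is_a?(Array)
    # The block returned a row: one output column per position, named column_<i>.
    width = results.first.length
    (0...width).map { |i| ["column_#{i}", results.map { |r| r[i] }] }.to_h
  else
    # The block returned a scalar: a single column named "map".
    {"map" => results}
  end
end

rows = [[1, -1], [2, 5], [3, 8]]
map_rows_sketch(rows) { |t| [t[0] * 2, t[1] * 3] }
# => {"column_0"=>[2, 4, 6], "column_1"=>[-3, 15, 24]}
map_rows_sketch(rows) { |t| t[0] * 2 + t[1] }
# => {"map"=>[1, 9, 14]}
```

Every row passes through a Ruby block here, which is the per-row overhead the note above warns about.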

#max ⇒ DataFrame

Aggregate the columns of this DataFrame to their maximum value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.max
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4009

def max
  lazy.max.collect(_eager: true)
end

#max_horizontal ⇒ Series

Get the maximum value horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.max_horizontal
# =>
# shape: (3,)
# Series: 'max' [f64]
# [
#         4.0
#         5.0
#         6.0
# ]

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 4033

def max_horizontal
  select(max: F.max_horizontal(F.all)).to_series
end

#mean ⇒ DataFrame

Aggregate the columns of this DataFrame to their mean value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.mean
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 2.0 ┆ 7.0 ┆ null │
# └─────┴─────┴──────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4165

def mean
  lazy.mean.collect(_eager: true)
end

#mean_horizontal(ignore_nulls: true) ⇒ Series

Take the mean of all values horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.mean_horizontal
# =>
# shape: (3,)
# Series: 'mean' [f64]
# [
#         2.5
#         3.5
#         4.5
# ]

Parameters:

  • ignore_nulls (Boolean) (defaults to: true)

    Ignore null values (default). If set to false, any null value in the input will lead to a null output.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 4193

def mean_horizontal(ignore_nulls: true)
  select(
    mean: F.mean_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end
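
The effect of ignore_nulls on #mean_horizontal can be sketched in plain Ruby. This is an illustration under stated assumptions, not the Polars implementation: mean_horizontal_sketch is a hypothetical helper, columns are equal-length arrays, and nil stands in for null.

```ruby
# Hypothetical helper: row-wise mean over equal-length columns (plain arrays).
def mean_horizontal_sketch(columns, ignore_nulls: true)
  columns.transpose.map do |row|
    # With ignore_nulls: false, a single nil makes the whole row's result nil.
    next nil if !ignore_nulls && row.include?(nil)

    values = row.compact
    values.empty? ? nil : values.sum.to_f / values.length
  end
end

mean_horizontal_sketch([[1, 2, 3], [4.0, 5.0, 6.0]])
# => [2.5, 3.5, 4.5]
mean_horizontal_sketch([[1, nil], [4.0, 6.0]])
# => [2.5, 6.0]
mean_horizontal_sketch([[1, nil], [4.0, 6.0]], ignore_nulls: false)
# => [2.5, nil]
```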

#median ⇒ DataFrame

Aggregate the columns of this DataFrame to their median value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.median
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 2.0 ┆ 7.0 ┆ null │
# └─────┴─────┴──────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4303

def median
  lazy.median.collect(_eager: true)
end

#merge_sorted(other, key) ⇒ DataFrame

Take two sorted DataFrames and merge them by the sorted key.

The output of this operation will also be sorted. It is the caller's responsibility to ensure that both frames are sorted by that key; otherwise the output will not make sense.

The schemas of both DataFrames must be equal.

Examples:

df0 = Polars::DataFrame.new(
  {"name" => ["steve", "elise", "bob"], "age" => [42, 44, 18]}
).sort("age")
df1 = Polars::DataFrame.new(
  {"name" => ["anna", "megan", "steve", "thomas"], "age" => [21, 33, 42, 20]}
).sort("age")
df0.merge_sorted(df1, "age")
# =>
# shape: (7, 2)
# ┌────────┬─────┐
# │ name   ┆ age │
# │ ---    ┆ --- │
# │ str    ┆ i64 │
# ╞════════╪═════╡
# │ bob    ┆ 18  │
# │ thomas ┆ 20  │
# │ anna   ┆ 21  │
# │ megan  ┆ 33  │
# │ steve  ┆ 42  │
# │ steve  ┆ 42  │
# │ elise  ┆ 44  │
# └────────┴─────┘

Parameters:

  • other (DataFrame)

    Other DataFrame that must be merged.

  • key (String)

    Key that is sorted.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5100

def merge_sorted(other, key)
  lazy.merge_sorted(other.lazy, key).collect(_eager: true)
end
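
Why #merge_sorted needs pre-sorted inputs is easiest to see from a two-pointer merge. The sketch below is plain Ruby and only illustrates the idea; merge_sorted_sketch is a hypothetical helper over arrays of row hashes, and taking the left row first on ties is a choice of this sketch, not a statement about Polars.

```ruby
# Hypothetical helper: merge two arrays of row hashes, each already sorted by key.
def merge_sorted_sketch(left, right, key)
  out = []
  i = 0
  j = 0
  while i < left.length && j < right.length
    # Only the current heads are compared, so unsorted input yields unsorted output.
    if left[i][key] <= right[j][key]
      out << left[i]
      i += 1
    else
      out << right[j]
      j += 1
    end
  end
  # One side is exhausted; the remainder of the other is appended as-is.
  out + left.drop(i) + right.drop(j)
end

df0 = [
  {"name" => "bob", "age" => 18},
  {"name" => "steve", "age" => 42},
  {"name" => "elise", "age" => 44}
]
df1 = [
  {"name" => "thomas", "age" => 20},
  {"name" => "anna", "age" => 21},
  {"name" => "megan", "age" => 33},
  {"name" => "steve", "age" => 42}
]
merge_sorted_sketch(df0, df1, "age").map { |row| row["age"] }
# => [18, 20, 21, 33, 42, 42, 44]
```

Passing an unsorted left side such as ages [3, 1] with a right side [2] gives [2, 3, 1], which is the "will not make sense" case described above.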

#min ⇒ DataFrame

Aggregate the columns of this DataFrame to their minimum value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.min
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4059

def min
  lazy.min.collect(_eager: true)
end

#min_horizontal ⇒ Series

Get the minimum value horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.min_horizontal
# =>
# shape: (3,)
# Series: 'min' [f64]
# [
#         1.0
#         2.0
#         3.0
# ]

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 4083

def min_horizontal
  select(min: F.min_horizontal(F.all)).to_series
end

#n_chunks(strategy: "first") ⇒ Object

Get number of chunks used by the ChunkedArrays of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
df.n_chunks
# => 1
df.n_chunks(strategy: "all")
# => [1, 1, 1]

Parameters:

  • strategy ("first", "all") (defaults to: "first")

    Return the number of chunks of the 'first' column, or 'all' columns in this DataFrame.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 3977

def n_chunks(strategy: "first")
  if strategy == "first"
    _df.n_chunks
  elsif strategy == "all"
    get_columns.map(&:n_chunks)
  else
    raise ArgumentError, "Strategy: '#{strategy}' not understood. Choose one of {'first', 'all'}"
  end
end

#n_unique(subset: nil) ⇒ Integer

Return the number of unique rows, or the number of unique row-subsets.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 1, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
    "c" => [true, true, true, false, true, true]
  }
)
df.n_unique
# => 5

Simple columns subset

df.n_unique(subset: ["b", "c"])
# => 4

Expression subset

df.n_unique(
  subset: [
    (Polars.col("a").floordiv(2)),
    (Polars.col("c") | (Polars.col("b") >= 2))
  ]
)
# => 3

Parameters:

  • subset (Object) (defaults to: nil)

    One or more columns/expressions that define what to count; omit to return the count of unique rows.

Returns:

  • (Integer)

# File 'lib/polars/data_frame.rb', line 4476

def n_unique(subset: nil)
  if subset.is_a?(::String)
    subset = [Polars.col(subset)]
  elsif subset.is_a?(Expr)
    subset = [subset]
  end

  if subset.is_a?(::Array) && subset.length == 1
    expr = Utils.wrap_expr(Utils.parse_into_expression(subset[0], str_as_lit: false))
  else
    struct_fields = subset.nil? ? Polars.all : subset
    expr = Polars.struct(struct_fields)
  end

  df = lazy.select(expr.n_unique).collect
  df.is_empty ? 0 : df.row(0)[0]
end
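
What "unique rows" versus "unique row-subsets" means for #n_unique can be shown with plain arrays. This is a sketch for intuition only; n_unique_sketch is a hypothetical helper that accepts column names but not expressions, and it does not reflect how Polars computes the count internally.

```ruby
# Hypothetical helper: count distinct rows, optionally over a subset of columns.
def n_unique_sketch(data, subset: nil)
  names = subset.nil? ? data.keys : Array(subset)
  # Rebuild rows from the chosen columns, then count the distinct ones.
  names.map { |name| data.fetch(name) }.transpose.uniq.length
end

data = {
  "a" => [1, 1, 2, 3, 4, 5],
  "b" => [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
  "c" => [true, true, true, false, true, true]
}
n_unique_sketch(data)                      # => 5
n_unique_sketch(data, subset: ["b", "c"])  # => 4
n_unique_sketch(data, subset: "a")         # => 5
```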

#null_count ⇒ DataFrame

Create a new DataFrame that shows the null counts per column.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 3],
    "bar" => [6, 7, nil],
    "ham" => ["a", "b", "c"]
  }
)
df.null_count
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 1   ┆ 0   │
# └─────┴─────┴─────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4526

def null_count
  _from_rbdf(_df.null_count)
end

#partition_by(groups, maintain_order: true, include_key: true, as_dict: false) ⇒ Object

Split into multiple DataFrames partitioned by groups.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => ["A", "A", "B", "B", "C"],
    "N" => [1, 2, 2, 4, 2],
    "bar" => ["k", "l", "m", "m", "l"]
  }
)
df.partition_by("foo", maintain_order: true)
# =>
# [shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ A   ┆ 1   ┆ k   │
# │ A   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘, shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ B   ┆ 2   ┆ m   │
# │ B   ┆ 4   ┆ m   │
# └─────┴─────┴─────┘, shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ C   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘]
df.partition_by("foo", maintain_order: true, as_dict: true)
# =>
# {"A"=>shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ A   ┆ 1   ┆ k   │
# │ A   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘, "B"=>shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ B   ┆ 2   ┆ m   │
# │ B   ┆ 4   ┆ m   │
# └─────┴─────┴─────┘, "C"=>shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ C   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘}

Parameters:

  • groups (Object)

    Groups to partition by.

  • maintain_order (Boolean) (defaults to: true)

    Keep predictable output order. This is slower as it requires an extra sort operation.

  • as_dict (Boolean) (defaults to: false)

    If true, return the partitions in a hash keyed by the distinct group values instead of an array.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 3590

def partition_by(groups, maintain_order: true, include_key: true, as_dict: false)
  if groups.is_a?(::String)
    groups = [groups]
  elsif !groups.is_a?(::Array)
    groups = Array(groups)
  end

  if as_dict
    out = {}
    if groups.length == 1
      _df.partition_by(groups, maintain_order, include_key).each do |df|
        df = _from_rbdf(df)
        out[df[groups][0, 0]] = df
      end
    else
      _df.partition_by(groups, maintain_order, include_key).each do |df|
        df = _from_rbdf(df)
        out[df[groups].row(0)] = df
      end
    end
    out
  else
    _df.partition_by(groups, maintain_order, include_key).map { |df| _from_rbdf(df) }
  end
end
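
The two return shapes of #partition_by (array of partitions, or a hash keyed by group value) can be mimicked with Enumerable#group_by. The sketch below is plain Ruby over row hashes; partition_by_sketch is a hypothetical helper, and unwrapping a single-column key mirrors the groups.length == 1 branch in the source above.

```ruby
# Hypothetical helper: split row hashes into groups by one or more key columns.
def partition_by_sketch(rows, groups, as_dict: false)
  groups = Array(groups)
  # Ruby hashes preserve insertion order, so groups come out in first-seen order.
  parts = rows.group_by { |row| row.values_at(*groups) }
  return parts.values unless as_dict

  # Single key column: use the bare value as the key instead of a one-element array.
  groups.length == 1 ? parts.transform_keys(&:first) : parts
end

rows = [
  {"foo" => "A", "N" => 1, "bar" => "k"},
  {"foo" => "A", "N" => 2, "bar" => "l"},
  {"foo" => "B", "N" => 2, "bar" => "m"},
  {"foo" => "B", "N" => 4, "bar" => "m"},
  {"foo" => "C", "N" => 2, "bar" => "l"}
]
partition_by_sketch(rows, "foo").map(&:length)
# => [2, 2, 1]
partition_by_sketch(rows, "foo", as_dict: true).keys
# => ["A", "B", "C"]
partition_by_sketch(rows, ["foo", "bar"], as_dict: true).keys
# => [["A", "k"], ["A", "l"], ["B", "m"], ["C", "l"]]
```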

#pipe(func, *args, **kwargs, &block) ⇒ Object

Note:

It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See #lazy.

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Examples:

cast_str_to_int = lambda do |data, col_name:|
  data.with_column(Polars.col(col_name).cast(:i64))
end

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => ["10", "20", "30", "40"]})
df.pipe(cast_str_to_int, col_name: "b")
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 10  │
# │ 2   ┆ 20  │
# │ 3   ┆ 30  │
# │ 4   ┆ 40  │
# └─────┴─────┘

Parameters:

  • func (Object)

    Callable; will receive the frame as the first parameter, followed by any given args/kwargs.

  • args (Object)

    Arguments to pass to the UDF.

  • kwargs (Object)

    Keyword arguments to pass to the UDF.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 1742

def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end
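
Because #pipe simply calls func with the receiver as the first argument, chained calls read top to bottom. The sketch below shows the same forwarding on a stand-in object so it runs without Polars; frame_class, scale, and offset are hypothetical names introduced for this illustration.

```ruby
# Hypothetical stand-in for a frame: a Struct exposing the same #pipe method.
frame_class = Struct.new(:data) do
  def pipe(func, *args, **kwargs, &block)
    func.call(self, *args, **kwargs, &block)
  end
end

# Each step receives the frame first, then any extra positional or keyword arguments.
scale = ->(frame, factor) { frame_class.new(frame.data.map { |v| v * factor }) }
offset = ->(frame, by:) { frame_class.new(frame.data.map { |v| v + by }) }

result = frame_class.new([1, 2, 3]).pipe(scale, 10).pipe(offset, by: 1)
result.data
# => [11, 21, 31]
```

Any object that responds to call (a lambda, a proc, or a Method object) works as func in this arrangement.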

#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ DataFrame

Create a spreadsheet-style pivot table as a DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => ["one", "one", "two", "two", "one", "two"],
    "bar" => ["y", "y", "y", "x", "x", "x"],
    "baz" => [1, 2, 3, 4, 5, 6]
  }
)
df.pivot("bar", index: "foo", values: "baz", aggregate_function: "sum")
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ y   ┆ x   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ one ┆ 3   ┆ 5   │
# │ two ┆ 3   ┆ 10  │
# └─────┴─────┴─────┘

Parameters:

  • values (Object) (defaults to: nil)

    Column values to aggregate. Can be multiple columns if the on argument contains multiple columns as well.

  • index (Object) (defaults to: nil)

    One or multiple keys to group by.

  • on (Object)

    Columns whose values will be used as the header of the output DataFrame.

  • aggregate_function ("first", "sum", "max", "min", "mean", "median", "last", "len") (defaults to: nil)

    A predefined aggregate function string or an expression. "count" is still accepted but deprecated in favor of "len".

  • maintain_order (Object) (defaults to: true)

    Sort the grouped keys so that the output order is predictable.

  • sort_columns (Object) (defaults to: false)

    Sort the transposed columns by name. Default is by order of discovery.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3281

def pivot(
  on,
  index: nil,
  values: nil,
  aggregate_function: nil,
  maintain_order: true,
  sort_columns: false,
  separator: "_"
)
  index = Utils._expand_selectors(self, index)
  on = Utils._expand_selectors(self, on)
  if !values.nil?
    values = Utils._expand_selectors(self, values)
  end

  if aggregate_function.is_a?(::String)
    case aggregate_function
    when "first"
      aggregate_expr = F.element.first._rbexpr
    when "sum"
      aggregate_expr = F.element.sum._rbexpr
    when "max"
      aggregate_expr = F.element.max._rbexpr
    when "min"
      aggregate_expr = F.element.min._rbexpr
    when "mean"
      aggregate_expr = F.element.mean._rbexpr
    when "median"
      aggregate_expr = F.element.median._rbexpr
    when "last"
      aggregate_expr = F.element.last._rbexpr
    when "len"
      aggregate_expr = F.len._rbexpr
    when "count"
      warn "`aggregate_function: \"count\"` input for `pivot` is deprecated. Use `aggregate_function: \"len\"` instead."
      aggregate_expr = F.len._rbexpr
    else
      raise ArgumentError, "Argument aggregate fn: '#{aggregate_function}' was not expected."
    end
  elsif aggregate_function.nil?
    aggregate_expr = nil
  else
    aggregate_expr = aggregate_function._rbexpr
  end

  _from_rbdf(
    _df.pivot_expr(
      on,
      index,
      values,
      maintain_order,
      sort_columns,
      aggregate_expr,
      separator
    )
  )
end
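
The roles of on, index, and values in #pivot can be traced in a few lines of plain Ruby. This sketch covers only the "sum" aggregation over row hashes; pivot_sum_sketch is a hypothetical helper, and returning nil for an empty cell is an assumption of the sketch rather than documented Polars behavior.

```ruby
# Hypothetical helper: pivot row hashes, summing `values` for each (index, on) pair.
def pivot_sum_sketch(rows, on:, index:, values:)
  # Distinct `on` values become output columns, in order of discovery.
  headers = rows.map { |r| r[on] }.uniq

  rows.group_by { |r| r[index] }.map do |key, group|
    cells = headers.map do |header|
      matches = group.select { |r| r[on] == header }
      [header, matches.empty? ? nil : matches.sum { |r| r[values] }]
    end
    {index => key}.merge(cells.to_h)
  end
end

rows = [
  {"foo" => "one", "bar" => "y", "baz" => 1},
  {"foo" => "one", "bar" => "y", "baz" => 2},
  {"foo" => "two", "bar" => "y", "baz" => 3},
  {"foo" => "two", "bar" => "x", "baz" => 4},
  {"foo" => "one", "bar" => "x", "baz" => 5},
  {"foo" => "two", "bar" => "x", "baz" => 6}
]
pivot_sum_sketch(rows, on: "bar", index: "foo", values: "baz")
# => [{"foo"=>"one", "y"=>3, "x"=>5}, {"foo"=>"two", "y"=>3, "x"=>10}]
```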

#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ Vega::LiteChart Originally defined in module Plot

Plot data.

Returns:

  • (Vega::LiteChart)

Raises:

  • (ArgumentError)

#product ⇒ DataFrame

Aggregate the columns of this DataFrame to their product values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3],
    "b" => [0.5, 4, 10],
    "c" => [true, true, false]
  }
)
df.product
# =>
# shape: (1, 3)
# ┌─────┬──────┬─────┐
# │ a   ┆ b    ┆ c   │
# │ --- ┆ ---  ┆ --- │
# │ i64 ┆ f64  ┆ i64 │
# ╞═════╪══════╪═════╡
# │ 6   ┆ 20.0 ┆ 0   │
# └─────┴──────┴─────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4329

def product
  select(Polars.all.product)
end

#quantile(quantile, interpolation: "nearest") ⇒ DataFrame

Aggregate the columns of this DataFrame to their quantile value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.quantile(0.5, interpolation: "nearest")
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 2.0 ┆ 7.0 ┆ null │
# └─────┴─────┴──────┘

Parameters:

  • quantile (Float)

    Quantile between 0.0 and 1.0.

  • interpolation ("nearest", "higher", "lower", "midpoint", "linear") (defaults to: "nearest")

    Interpolation method.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4360

def quantile(quantile, interpolation: "nearest")
  lazy.quantile(quantile, interpolation: interpolation).collect(_eager: true)
end

#rechunk ⇒ DataFrame

Rechunk the data in this DataFrame to a contiguous allocation. This will make sure all subsequent operations have optimal and predictable performance.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4500

def rechunk
  _from_rbdf(_df.rechunk)
end

#rename(mapping, strict: true) ⇒ DataFrame

Rename column names.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.rename({"foo" => "apple"})
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ apple ┆ bar ┆ ham │
# │ ---   ┆ --- ┆ --- │
# │ i64   ┆ i64 ┆ str │
# ╞═══════╪═════╪═════╡
# │ 1     ┆ 6   ┆ a   │
# │ 2     ┆ 7   ┆ b   │
# │ 3     ┆ 8   ┆ c   │
# └───────┴─────┴─────┘

Parameters:

  • mapping (Hash)

    Key value pairs that map from old name to new name.

  • strict (Boolean) (defaults to: true)

    Validate that all column names exist in the current schema, and throw an exception if any do not. (Note that this parameter is a no-op when passing a function to mapping).

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1189

def rename(mapping, strict: true)
  lazy.rename(mapping, strict: strict).collect(no_optimization: true)
end

#replace(column, new_col) ⇒ DataFrame

Replace a column by a new Series.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
s = Polars::Series.new([10, 20, 30])
df.replace("foo", s)
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 10  ┆ 4   │
# │ 20  ┆ 5   │
# │ 30  ┆ 6   │
# └─────┴─────┘

Parameters:

  • column (String)

    Column to replace.

  • new_col (Series)

    New column to insert.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1543

def replace(column, new_col)
  _df.replace(column.to_s, new_col._s)
  self
end

#replace_column(index, series) ⇒ DataFrame Also known as: replace_at_idx

Replace a column at an index location.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
s = Polars::Series.new("apple", [10, 20, 30])
df.replace_column(0, s)
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ apple ┆ bar ┆ ham │
# │ ---   ┆ --- ┆ --- │
# │ i64   ┆ i64 ┆ str │
# ╞═══════╪═════╪═════╡
# │ 10    ┆ 6   ┆ a   │
# │ 20    ┆ 7   ┆ b   │
# │ 30    ┆ 8   ┆ c   │
# └───────┴─────┴─────┘

Parameters:

  • index (Integer)

    Column index.

  • series (Series)

    Series that will replace the column.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1409

def replace_column(index, series)
  if index < 0
    index = columns.length + index
  end
  _df.replace_column(index, series._s)
  self
end

#reverse ⇒ DataFrame

Reverse the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "key" => ["a", "b", "c"],
    "val" => [1, 2, 3]
  }
)
df.reverse
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ key ┆ val │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ c   ┆ 3   │
# │ b   ┆ 2   │
# │ a   ┆ 1   │
# └─────┴─────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1154

def reverse
  select(Polars.col("*").reverse)
end

#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ RollingGroupBy Also known as: groupby_rolling, group_by_rolling

Create rolling groups based on a time column.

Also works for index values of type :i32 or :i64.

Unlike group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals, use group_by_dynamic.

The period and offset arguments are created either from a timedelta, or by using the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a group_by_rolling on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10

Examples:

dates = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
]
df = Polars::DataFrame.new({"dt" => dates, "a" => [3, 7, 5, 9, 2, 1]}).with_column(
  Polars.col("dt").str.strptime(Polars::Datetime).set_sorted
)
df.rolling(index_column: "dt", period: "2d").agg(
  [
    Polars.sum("a").alias("sum_a"),
    Polars.min("a").alias("min_a"),
    Polars.max("a").alias("max_a")
  ]
)
# =>
# shape: (6, 4)
# ┌─────────────────────┬───────┬───────┬───────┐
# │ dt                  ┆ sum_a ┆ min_a ┆ max_a │
# │ ---                 ┆ ---   ┆ ---   ┆ ---   │
# │ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
# ╞═════════════════════╪═══════╪═══════╪═══════╡
# │ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
# │ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
# │ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
# │ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
# │ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
# │ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
# └─────────────────────┴───────┴───────┴───────┘

Parameters:

  • index_column (Object)

    Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; otherwise the output will not make sense.

    In case of a rolling group by on indices, dtype needs to be one of :i32, :i64. Note that :i32 gets temporarily cast to :i64, so if performance matters use an :i64 column.

  • period (Object)

    Length of the window.

  • offset (Object) (defaults to: nil)

    Offset of the window. Default is -period.

  • closed ("right", "left", "both", "none") (defaults to: "right")

    Define whether the temporal window interval is closed or not.

  • by (Object) (defaults to: nil)

    Also group by this column/these columns.

Returns:

  • (RollingGroupBy)

# File 'lib/polars/data_frame.rb', line 1907

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  by: nil
)
  RollingGroupBy.new(self, index_column, period, offset, closed, by)
end
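
How #rolling forms one window per row can be checked by hand. The sketch below is plain Ruby, not the Polars implementation: rolling_sum_sketch is a hypothetical helper, and it assumes the defaults described above (offset of -period and closed: "right"), so the window for a row at time t is (t - period, t]. It reproduces the sum_a column of the example.

```ruby
# Hypothetical helper: rolling sum over a sorted time index, window (t - period, t].
def rolling_sum_sketch(times, values, period_seconds)
  times.map do |t|
    lower = t - period_seconds
    # Each row gets its own window, anchored at that row's timestamp.
    in_window = times.each_index.select { |j| times[j] > lower && times[j] <= t }
    in_window.sum { |j| values[j] }
  end
end

times = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
].map { |s| Time.utc(*s.scan(/\d+/).map(&:to_i)) }

two_days = 2 * 24 * 60 * 60
rolling_sum_sketch(times, [3, 7, 5, 9, 2, 1], two_days)
# => [3, 10, 15, 24, 11, 1]
```

The fifth row sums only 9 and 2 because the three rows from 2020-01-01 fall before the lower bound of its two-day window.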

#row(index = nil, by_predicate: nil, named: false) ⇒ Object

Note:

The index and by_predicate params are mutually exclusive. Additionally, to ensure clarity, the by_predicate parameter must be supplied by keyword.

When using by_predicate it is an error condition if anything other than one row is returned; more than one row raises TooManyRowsReturned, and zero rows will raise NoRowsReturned (both inherit from RowsException).

Get a row as a tuple, either by index or by predicate.

Examples:

Return the row at the given index

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.row(2)
# => [3, 8, "c"]

Get a hash instead with a mapping of column names to row values

df.row(2, named: true)
# => {"foo"=>3, "bar"=>8, "ham"=>"c"}

Return the row that matches the given predicate

df.row(by_predicate: Polars.col("ham") == "b")
# => [2, 7, "b"]

Parameters:

  • index (Object) (defaults to: nil)

    Row index.

  • by_predicate (Object) (defaults to: nil)

    Select the row according to a given expression/predicate.

  • named (Boolean) (defaults to: false)

    Return a hash instead of an array. The hash is a mapping of column name to row value. This is more expensive than returning an array, but allows for accessing values by column name.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 4721

def row(index = nil, by_predicate: nil, named: false)
  if !index.nil? && !by_predicate.nil?
    raise ArgumentError, "Cannot set both 'index' and 'by_predicate'; mutually exclusive"
  elsif index.is_a?(Expr)
    raise TypeError, "Expressions should be passed to the 'by_predicate' param"
  end

  if !index.nil?
    row = _df.row_tuple(index)
    if named
      columns.zip(row).to_h
    else
      row
    end
  elsif !by_predicate.nil?
    if !by_predicate.is_a?(Expr)
      raise TypeError, "Expected by_predicate to be an expression; found #{by_predicate.class.name}"
    end
    rows = filter(by_predicate).rows
    n_rows = rows.length
    if n_rows > 1
      raise TooManyRowsReturned, "Predicate #{by_predicate} returned #{n_rows} rows"
    elsif n_rows == 0
      raise NoRowsReturned, "Predicate #{by_predicate} returned no rows"
    end
    row = rows[0]
    if named
      columns.zip(row).to_h
    else
      row
    end
  else
    raise ArgumentError, "One of 'index' or 'by_predicate' must be set"
  end
end
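
The exactly-one-match rule for by_predicate can be sketched in plain Ruby. row_by_predicate_sketch is a hypothetical helper: the predicate is a block over a name-to-value hash rather than a Polars expression, and it raises a plain RuntimeError where the library raises TooManyRowsReturned or NoRowsReturned.

```ruby
# Hypothetical helper: return the single row for which the block is truthy.
def row_by_predicate_sketch(columns, rows, named: false)
  matches = rows.select { |row| yield(columns.zip(row).to_h) }

  # Anything other than exactly one match is an error.
  raise "predicate returned #{matches.length} rows" if matches.length > 1
  raise "predicate returned no rows" if matches.empty?

  named ? columns.zip(matches[0]).to_h : matches[0]
end

columns = ["foo", "bar", "ham"]
rows = [[1, 6, "a"], [2, 7, "b"], [3, 8, "c"]]

row_by_predicate_sketch(columns, rows) { |r| r["ham"] == "b" }
# => [2, 7, "b"]
row_by_predicate_sketch(columns, rows, named: true) { |r| r["ham"] == "b" }
# => {"foo"=>2, "bar"=>7, "ham"=>"b"}
```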

#rows(named: false) ⇒ Array

Convert columnar data to rows as Ruby arrays.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.rows
# => [[1, 2], [3, 4], [5, 6]]
df.rows(named: true)
# => [{"a"=>1, "b"=>2}, {"a"=>3, "b"=>4}, {"a"=>5, "b"=>6}]

Parameters:

  • named (Boolean) (defaults to: false)

    Return hashes instead of arrays. The hashes are a mapping of column name to row value. This is more expensive than returning an array, but allows for accessing values by column name.

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 4778

def rows(named: false)
  if named
    columns = self.columns
    _df.row_tuples.map do |v|
      columns.zip(v).to_h
    end
  else
    _df.row_tuples
  end
end

#sample(n: nil, frac: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ DataFrame

Sample from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.sample(n: 2, seed: 0)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8   ┆ c   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘

Parameters:

  • n (Integer) (defaults to: nil)

    Number of items to return. Cannot be used with frac. Defaults to 1 if frac is nil.

  • frac (Float) (defaults to: nil)

    Fraction of items to return. Cannot be used with n.

  • with_replacement (Boolean) (defaults to: false)

    Allow values to be sampled more than once.

  • shuffle (Boolean) (defaults to: false)

    Shuffle the order of sampled data points.

  • seed (Integer) (defaults to: nil)

    Seed for the random number generator. If set to nil (default), a random seed is used.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4566

def sample(
  n: nil,
  frac: nil,
  with_replacement: false,
  shuffle: false,
  seed: nil
)
  if !n.nil? && !frac.nil?
    raise ArgumentError, "cannot specify both `n` and `frac`"
  end

  if n.nil? && !frac.nil?
    frac = Series.new("frac", [frac]) unless frac.is_a?(Series)

    return _from_rbdf(
      _df.sample_frac(frac._s, with_replacement, shuffle, seed)
    )
  end

  if n.nil?
    n = 1
  end

  n = Series.new("", [n]) unless n.is_a?(Series)

  _from_rbdf(_df.sample_n(n._s, with_replacement, shuffle, seed))
end
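
The argument handling in #sample (n and frac are mutually exclusive, n defaults to 1, and a seed makes the draw repeatable) can be mirrored with Ruby's own Random. This is a sketch only: sample_sketch is a hypothetical helper over an array of rows, rounding frac * length down is an assumption, and the rows it picks for a given seed will not match those picked by Polars.

```ruby
# Hypothetical helper: sample rows from an array, following the same argument rules.
def sample_sketch(rows, n: nil, frac: nil, with_replacement: false, seed: nil)
  raise ArgumentError, "cannot specify both `n` and `frac`" if n && frac

  rng = seed.nil? ? Random.new : Random.new(seed)
  n = (rows.length * frac).to_i if frac # assumption: round down
  n ||= 1

  if with_replacement
    # Each draw is independent, so the same row may appear more than once.
    Array.new(n) { rows[rng.rand(rows.length)] }
  else
    rows.sample(n, random: rng)
  end
end

rows = [[1, 6, "a"], [2, 7, "b"], [3, 8, "c"]]
sample_sketch(rows, n: 2, seed: 0)                          # two distinct rows
sample_sketch(rows, n: 5, with_replacement: true, seed: 0)  # five rows, repeats allowed
```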

#schema ⇒ Hash

Get the schema.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.schema
# => {"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String}

Returns:

  • (Hash)


# File 'lib/polars/data_frame.rb', line 216

def schema
  columns.zip(dtypes).to_h
end

#select(*exprs, **named_exprs) ⇒ DataFrame

Select columns from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.select("foo")
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘
df.select(["foo", "bar"])
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 6   │
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# └─────┴─────┘
df.select(Polars.col("foo") + 1)
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 2   │
# │ 3   │
# │ 4   │
# └─────┘
df.select([Polars.col("foo") + 1, Polars.col("bar") + 1])
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# │ 4   ┆ 9   │
# └─────┴─────┘
df.select(Polars.when(Polars.col("foo") > 2).then(10).otherwise(0))
# =>
# shape: (3, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ i32     │
# ╞═════════╡
# │ 0       │
# │ 0       │
# │ 10      │
# └─────────┘

Parameters:

  • exprs (Array)

    Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3841

def select(*exprs, **named_exprs)
  lazy.select(*exprs, **named_exprs).collect(_eager: true)
end

#set_sorted(column, descending: false) ⇒ DataFrame

Indicate that one or multiple columns are sorted.

Parameters:

  • column (Object)

    Columns that are sorted.

  • descending (Boolean) (defaults to: false)

    Whether the columns are sorted in descending order.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5112

def set_sorted(
  column,
  descending: false
)
  lazy
    .set_sorted(column, descending: descending)
    .collect(no_optimization: true)
end

#shape ⇒ Array

Get the shape of the DataFrame.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3, 4, 5]})
df.shape
# => [5, 1]

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 95

def shape
  _df.shape
end

#shift(n, fill_value: nil) ⇒ DataFrame

Shift values by the given period.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.shift(1)
# =>
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 1    ┆ 6    ┆ a    │
# │ 2    ┆ 7    ┆ b    │
# └──────┴──────┴──────┘
df.shift(-1)
# =>
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ 2    ┆ 7    ┆ b    │
# │ 3    ┆ 8    ┆ c    │
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Parameters:

  • n (Integer)

    Number of places to shift (may be negative).

  • fill_value (Object) (defaults to: nil)

    Fill the resulting null values with this value.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3659

def shift(n, fill_value: nil)
  lazy.shift(n, fill_value: fill_value).collect(_eager: true)
end
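
To make the direction of the shift concrete, here is a minimal plain-Ruby sketch of the same semantics on a single array. `shift_values` is a hypothetical helper written for illustration, not a Polars method, and capping the shift at the array length is an assumption of this sketch.

```ruby
# Plain-Ruby illustration of shifting one column (hypothetical helper,
# not part of Polars). Vacated positions receive fill_value (nil by default).
def shift_values(values, n, fill_value: nil)
  return values.dup if n.zero?

  len = values.length
  k = [n.abs, len].min              # assumption: never shift further than the length
  pad = Array.new(k, fill_value)

  if n > 0
    pad + values.first(len - k)     # positive n moves values toward the end
  else
    values.last(len - k) + pad      # negative n moves values toward the start
  end
end

shift_values([1, 2, 3], 1)                 # => [nil, 1, 2]
shift_values([1, 2, 3], -1)                # => [2, 3, nil]
shift_values([1, 2, 3], 1, fill_value: 0)  # => [0, 1, 2]
```

The first two calls mirror the `foo` column in the `df.shift(1)` and `df.shift(-1)` examples above.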

#shift_and_fill(periods, fill_value) ⇒ DataFrame

Shift the values by a given period and fill the resulting null values.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.shift_and_fill(1, 0)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 0   ┆ 0   ┆ 0   │
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘

Parameters:

  • periods (Integer)

    Number of places to shift (may be negative).

  • fill_value (Object)

    Fill the resulting null values with this value.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3692

def shift_and_fill(periods, fill_value)
  shift(periods, fill_value: fill_value)
end

#shrink_to_fit(in_place: false) ⇒ DataFrame

Shrink DataFrame memory usage.

Shrinks to fit the exact capacity needed to hold the data.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4873

def shrink_to_fit(in_place: false)
  if in_place
    _df.shrink_to_fit
    self
  else
    df = clone
    df._df.shrink_to_fit
    df
  end
end

#slice(offset, length = nil) ⇒ DataFrame

Get a slice of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.slice(1, 2)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 7.0 ┆ b   │
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • offset (Integer)

    Start index. Negative indexing is supported.

  • length (Integer, nil) (defaults to: nil)

    Length of the slice. If set to nil, all rows starting at the offset will be selected.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1577

def slice(offset, length = nil)
  if !length.nil? && length < 0
    length = height - offset + length
  end
  _from_rbdf(_df.slice(offset, length))
end
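
The method body above rewrites a negative length as `height - offset + length`, so a negative length counts back from the end of the frame. A small plain-Ruby sketch of that arithmetic follows; `resolve_slice_length` is a hypothetical helper name, and a non-negative offset is assumed.

```ruby
# Mirrors the length adjustment in the method body shown above
# (hypothetical helper; assumes offset >= 0).
def resolve_slice_length(height, offset, length)
  return length if length.nil? || length >= 0

  height - offset + length
end

resolve_slice_length(3, 1, 2)    # => 2
resolve_slice_length(3, 1, -1)   # => 1
resolve_slice_length(3, 0, nil)  # => nil
```

Under this arithmetic, a length of -1 on a 3-row frame with offset 1 resolves to a slice of length 1.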

#sort(by, reverse: false, nulls_last: false) ⇒ DataFrame

Sort the DataFrame by column.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.sort("foo", reverse: true)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘

Sort by multiple columns.

df.sort(
  [Polars.col("foo"), Polars.col("bar")**2],
  reverse: [true, false]
)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘

Parameters:

  • by (String)

    Column(s) or expression(s) to sort by.

  • reverse (Boolean) (defaults to: false)

    Reverse/descending sort.

  • nulls_last (Boolean) (defaults to: false)

    Place null values last. Can only be used if sorted by a single column.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1466

def sort(by, reverse: false, nulls_last: false)
  lazy
    .sort(by, reverse: reverse, nulls_last: nulls_last)
    .collect(no_optimization: true)
end

#sort!(by, reverse: false, nulls_last: false) ⇒ DataFrame

Sort the DataFrame by column in-place.

Parameters:

  • by (String)

    Column(s) or expression(s) to sort by.

  • reverse (Boolean) (defaults to: false)

    Reverse/descending sort.

  • nulls_last (Boolean) (defaults to: false)

    Place null values last. Can only be used if sorted by a single column.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1482

def sort!(by, reverse: false, nulls_last: false)
  self._df = sort(by, reverse: reverse, nulls_last: nulls_last)._df
end

#std(ddof: 1) ⇒ DataFrame

Aggregate the columns of this DataFrame to their standard deviation value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.std
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 1.0 ┆ 1.0 ┆ null │
# └─────┴─────┴──────┘
df.std(ddof: 0)
# =>
# shape: (1, 3)
# ┌──────────┬──────────┬──────┐
# │ foo      ┆ bar      ┆ ham  │
# │ ---      ┆ ---      ┆ ---  │
# │ f64      ┆ f64      ┆ str  │
# ╞══════════╪══════════╪══════╡
# │ 0.816497 ┆ 0.816497 ┆ null │
# └──────────┴──────────┴──────┘

Parameters:

  • ddof (Integer) (defaults to: 1)

    Delta degrees of freedom; the divisor used in the calculation is N - ddof, where N is the number of elements.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4236

def std(ddof: 1)
  lazy.std(ddof: ddof).collect(_eager: true)
end
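
The effect of `ddof` can be checked by hand. The following plain-Ruby sketch computes the standard deviation of one numeric column with divisor `n - ddof`; `column_std` is a hypothetical helper, not a Polars call.

```ruby
# Standard deviation with a configurable "delta degrees of freedom"
# (hypothetical helper for illustration only).
def column_std(values, ddof: 1)
  n = values.length
  mean = values.sum.to_f / n
  squared_deviations = values.sum { |v| (v - mean)**2 }
  Math.sqrt(squared_deviations / (n - ddof))
end

column_std([1, 2, 3])                    # => 1.0
column_std([1, 2, 3], ddof: 0).round(6)  # => 0.816497
```

These values match the `foo` column in the two `df.std` examples above.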

#sum ⇒ DataFrame

Aggregate the columns of this DataFrame to their sum value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"],
  }
)
df.sum
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ i64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 6   ┆ 21  ┆ null │
# └─────┴─────┴──────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4109

def sum
  lazy.sum.collect(_eager: true)
end

#sum_horizontal(ignore_nulls: true) ⇒ Series

Sum all values horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.sum_horizontal
# =>
# shape: (3,)
# Series: 'sum' [f64]
# [
#         5.0
#         7.0
#         9.0
# ]

Parameters:

  • ignore_nulls (Boolean) (defaults to: true)

    Ignore null values (default). If set to false, any null value in the input will lead to a null output.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 4137

def sum_horizontal(ignore_nulls: true)
  select(
    sum: F.sum_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end
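
The `ignore_nulls` rule described above can be sketched per row in plain Ruby. `row_sum` is a hypothetical helper that operates on one row given as an array, with `nil` standing in for null.

```ruby
# Per-row sum following the ignore_nulls rule described above
# (hypothetical helper; nil plays the role of a null value).
def row_sum(row, ignore_nulls: true)
  return nil if !ignore_nulls && row.any?(&:nil?)

  row.compact.sum
end

row_sum([1, 4.0])                         # => 5.0
row_sum([2, nil])                         # => 2
row_sum([2, nil], ignore_nulls: false)    # => nil
```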

#tail(n = 5) ⇒ DataFrame

Get the last n rows.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.tail(3)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8   ┆ c   │
# │ 4   ┆ 9   ┆ d   │
# │ 5   ┆ 10  ┆ e   │
# └─────┴─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1672

def tail(n = 5)
  _from_rbdf(_df.tail(n))
end

#to_a ⇒ Array

Returns an array of hashes, one per row, representing the DataFrame.

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 333

def to_a
  rows(named: true)
end

#to_csv(**options) ⇒ String

Write to comma-separated values (CSV) string.

Returns:

  • (String)

# File 'lib/polars/data_frame.rb', line 800

def to_csv(**options)
  write_csv(**options)
end

#to_dummies(columns: nil, separator: "_", drop_first: false) ⇒ DataFrame

Get one-hot encoded dummy variables.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2],
    "bar" => [3, 4],
    "ham" => ["a", "b"]
  }
)
df.to_dummies
# =>
# shape: (2, 6)
# ┌───────┬───────┬───────┬───────┬───────┬───────┐
# │ foo_1 ┆ foo_2 ┆ bar_3 ┆ bar_4 ┆ ham_a ┆ ham_b │
# │ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
# │ u8    ┆ u8    ┆ u8    ┆ u8    ┆ u8    ┆ u8    │
# ╞═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
# │ 1     ┆ 0     ┆ 1     ┆ 0     ┆ 1     ┆ 0     │
# │ 0     ┆ 1     ┆ 0     ┆ 1     ┆ 0     ┆ 1     │
# └───────┴───────┴───────┴───────┴───────┴───────┘

Parameters:

  • columns (defaults to: nil)

    A subset of columns to convert to dummy variables. nil means "all columns".

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4391

def to_dummies(columns: nil, separator: "_", drop_first: false)
  if columns.is_a?(::String)
    columns = [columns]
  end
  _from_rbdf(_df.to_dummies(columns, separator, drop_first))
end

#to_h(as_series: true) ⇒ Hash

Convert DataFrame to a hash mapping column name to values.

Returns:

  • (Hash)


# File 'lib/polars/data_frame.rb', line 521

def to_h(as_series: true)
  if as_series
    get_columns.to_h { |s| [s.name, s] }
  else
    get_columns.to_h { |s| [s.name, s.to_a] }
  end
end

#to_hashes ⇒ Array

Convert every row to a hash.

Note that this is slow.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.to_hashes
# =>
# [{"foo"=>1, "bar"=>4}, {"foo"=>2, "bar"=>5}, {"foo"=>3, "bar"=>6}]

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 540

def to_hashes
  rbdf = _df
  names = columns

  height.times.map do |i|
    names.zip(rbdf.row_tuple(i)).to_h
  end
end
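
The core of the method body above is `names.zip(row).to_h`, applied once per row. The same idiom in plain Ruby, with hand-written row tuples standing in for `row_tuple(i)`:

```ruby
# Pair each column name with the value at the same position in the row tuple.
# The names and tuples here are illustrative stand-ins for a DataFrame.
names = ["foo", "bar"]
row_tuples = [[1, 4], [2, 5], [3, 6]]

hashes = row_tuples.map { |tuple| names.zip(tuple).to_h }
# => [{"foo"=>1, "bar"=>4}, {"foo"=>2, "bar"=>5}, {"foo"=>3, "bar"=>6}]
```

One hash is allocated per row, which is consistent with the note above that this conversion is slow.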

#to_numo ⇒ Numo::NArray

Convert DataFrame to a 2D Numo array.

This operation clones data.

Examples:

df = Polars::DataFrame.new(
  {"foo" => [1, 2, 3], "bar" => [6, 7, 8], "ham" => ["a", "b", "c"]}
)
df.to_numo.class
# => Numo::RObject

Returns:

  • (Numo::NArray)


# File 'lib/polars/data_frame.rb', line 561

def to_numo
  out = _df.to_numo
  if out.nil?
    Numo::NArray.vstack(width.times.map { |i| to_series(i).to_numo }).transpose
  else
    out
  end
end

#to_s ⇒ String Also known as: inspect

Returns a string representing the DataFrame.

Returns:

  • (String)

# File 'lib/polars/data_frame.rb', line 325

def to_s
  _df.to_s
end

#to_series(index = 0) ⇒ Series

Select column as Series at index location.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.to_series(1)
# =>
# shape: (3,)
# Series: 'bar' [i64]
# [
#         6
#         7
#         8
# ]

Parameters:

  • index (Integer) (defaults to: 0)

    Location of selection.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 596

def to_series(index = 0)
  if index < 0
    index = columns.length + index
  end
  Utils.wrap_s(_df.select_at_idx(index))
end

#to_struct(name) ⇒ Series

Convert a DataFrame to a Series of type Struct.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4, 5],
    "b" => ["one", "two", "three", "four", "five"]
  }
)
df.to_struct("nums")
# =>
# shape: (5,)
# Series: 'nums' [struct[2]]
# [
#         {1,"one"}
#         {2,"two"}
#         {3,"three"}
#         {4,"four"}
#         {5,"five"}
# ]

Parameters:

  • name (String)

    Name for the struct Series

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 5015

def to_struct(name)
  Utils.wrap_s(_df.to_struct(name))
end

#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ DataFrame

Note:

This is a very expensive operation. Perhaps you can do it differently.

Transpose a DataFrame over the diagonal.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => [1, 2, 3]})
df.transpose(include_header: true)
# =>
# shape: (2, 4)
# ┌────────┬──────────┬──────────┬──────────┐
# │ column ┆ column_0 ┆ column_1 ┆ column_2 │
# │ ---    ┆ ---      ┆ ---      ┆ ---      │
# │ str    ┆ i64      ┆ i64      ┆ i64      │
# ╞════════╪══════════╪══════════╪══════════╡
# │ a      ┆ 1        ┆ 2        ┆ 3        │
# │ b      ┆ 1        ┆ 2        ┆ 3        │
# └────────┴──────────┴──────────┴──────────┘

Replace the auto-generated column names with a list.

df.transpose(include_header: false, column_names: ["a", "b", "c"])
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 2   ┆ 3   │
# │ 1   ┆ 2   ┆ 3   │
# └─────┴─────┴─────┘

Include the header as a separate column.

df.transpose(
  include_header: true, header_name: "foo", column_names: ["a", "b", "c"]
)
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬─────┐
# │ foo ┆ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╡
# │ a   ┆ 1   ┆ 2   ┆ 3   │
# │ b   ┆ 1   ┆ 2   ┆ 3   │
# └─────┴─────┴─────┴─────┘

Parameters:

  • include_header (Boolean) (defaults to: false)

    If set, the column names will be added as first column.

  • header_name (String) (defaults to: "column")

    If include_header is set, this determines the name of the column that will be inserted.

  • column_names (Array) (defaults to: nil)

    Optional array of column names. Will be used to replace the auto-generated column names in the transposed DataFrame.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1126

def transpose(include_header: false, header_name: "column", column_names: nil)
  keep_names_as = include_header ? header_name : nil
  _from_rbdf(_df.transpose(keep_names_as, column_names))
end
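
How the three options interact can be sketched in plain Ruby on a hash of columns. `transpose_columns` is a hypothetical helper, and the `column_N` default names are taken from the example output above.

```ruby
# Transpose a hash of equally long columns (hypothetical helper).
# Each original column becomes one row of the result.
def transpose_columns(columns, include_header: false, header_name: "column", column_names: nil)
  new_values = columns.values.transpose
  names = column_names || new_values.each_index.map { |i| "column_#{i}" }

  out = {}
  out[header_name] = columns.keys if include_header   # original names as first column
  names.zip(new_values).each { |name, values| out[name] = values }
  out
end

transpose_columns({ "a" => [1, 2, 3], "b" => [1, 2, 3] }, include_header: true)
# => {"column"=>["a", "b"], "column_0"=>[1, 1], "column_1"=>[2, 2], "column_2"=>[3, 3]}
```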

#unique(maintain_order: true, subset: nil, keep: "first") ⇒ DataFrame

Note:

Note that this fails if there is a column of type List in the DataFrame or subset.

Drop duplicate rows from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 1, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
    "c" => [true, true, true, false, true, true]
  }
)
df.unique
# =>
# shape: (5, 3)
# ┌─────┬─────┬───────┐
# │ a   ┆ b   ┆ c     │
# │ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ bool  │
# ╞═════╪═════╪═══════╡
# │ 1   ┆ 0.5 ┆ true  │
# │ 2   ┆ 1.0 ┆ true  │
# │ 3   ┆ 2.0 ┆ false │
# │ 4   ┆ 3.0 ┆ true  │
# │ 5   ┆ 3.0 ┆ true  │
# └─────┴─────┴───────┘

Parameters:

  • maintain_order (Boolean) (defaults to: true)

    Keep the same order as the original DataFrame. This requires more work to compute.

  • subset (Object) (defaults to: nil)

    Subset to use to compare rows.

  • keep ("first", "last") (defaults to: "first")

    Which of the duplicate rows to keep (in conjunction with subset).

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4436

def unique(maintain_order: true, subset: nil, keep: "first")
  self._from_rbdf(
    lazy
      .unique(maintain_order: maintain_order, subset: subset, keep: keep)
      .collect(no_optimization: true)
      ._df
  )
end
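
The interaction between `subset` and `keep` is easiest to see on plain row hashes. The sketch below is a hypothetical helper (`unique_rows`) that always preserves the original order; it is not how Polars implements the operation.

```ruby
# De-duplicate rows, comparing either whole rows or only the subset columns
# (hypothetical helper; rows are hashes keyed by column name).
def unique_rows(rows, subset: nil, keep: "first")
  key = ->(row) { subset ? row.values_at(*subset) : row }

  case keep
  when "first" then rows.uniq(&key)
  when "last"  then rows.reverse.uniq(&key).reverse
  else raise ArgumentError, "unsupported keep: #{keep}"
  end
end

rows = [
  { "a" => 1, "b" => 0.5 },
  { "a" => 1, "b" => 0.5 },
  { "a" => 2, "b" => 1.0 },
  { "a" => 3, "b" => 1.0 }
]

unique_rows(rows).length                                         # => 3
unique_rows(rows, subset: ["b"]).map { |r| r["a"] }                # => [1, 2]
unique_rows(rows, subset: ["b"], keep: "last").map { |r| r["a"] }  # => [1, 3]
```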

#unnest(names) ⇒ DataFrame

Decompose a struct into its fields.

The fields will be inserted into the DataFrame at the location of the struct column.

Examples:

df = Polars::DataFrame.new(
  {
    "before" => ["foo", "bar"],
    "t_a" => [1, 2],
    "t_b" => ["a", "b"],
    "t_c" => [true, nil],
    "t_d" => [[1, 2], [3]],
    "after" => ["baz", "womp"]
  }
).select(["before", Polars.struct(Polars.col("^t_.$")).alias("t_struct"), "after"])
df.unnest("t_struct")
# =>
# shape: (2, 6)
# ┌────────┬─────┬─────┬──────┬───────────┬───────┐
# │ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
# │ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
# │ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
# ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
# │ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
# │ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
# └────────┴─────┴─────┴──────┴───────────┴───────┘

Parameters:

  • names (Object)

    Names of the struct columns that will be decomposed into their fields.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5051

def unnest(names)
  if names.is_a?(::String)
    names = [names]
  end
  _from_rbdf(_df.unnest(names))
end

#unpivot(on, index: nil, variable_name: nil, value_name: nil) ⇒ DataFrame Also known as: melt

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
)
df.unpivot(Polars.cs.numeric, index: "a")
# =>
# shape: (6, 3)
# ┌─────┬──────────┬───────┐
# │ a   ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ x   ┆ b        ┆ 1     │
# │ y   ┆ b        ┆ 3     │
# │ z   ┆ b        ┆ 5     │
# │ x   ┆ c        ┆ 2     │
# │ y   ┆ c        ┆ 4     │
# │ z   ┆ c        ┆ 6     │
# └─────┴──────────┴───────┘

Parameters:

  • on (Object)

    Column(s) or selector(s) to use as value variables; if on is empty, all columns that are not in index will be used.

  • index (Object) (defaults to: nil)

    Column(s) or selector(s) to use as identifier variables.

  • variable_name (Object) (defaults to: nil)

    Name to give to the variable column. Defaults to "variable".

  • value_name (Object) (defaults to: nil)

    Name to give to the value column. Defaults to "value".

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3383

def unpivot(on, index: nil, variable_name: nil, value_name: nil)
  on = on.nil? ? [] : Utils._expand_selectors(self, on)
  index = index.nil? ? [] : Utils._expand_selectors(self, index)

  _from_rbdf(_df.unpivot(on, index, value_name, variable_name))
end
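
The wide-to-long reshaping described above can be sketched in plain Ruby on a hash of columns. `unpivot_columns` is a hypothetical helper that returns row hashes and hard-codes the default "variable" and "value" column names.

```ruby
# Turn each `on` column into (variable, value) rows, carrying the index
# columns along (hypothetical helper for illustration).
def unpivot_columns(columns, on:, index:)
  on.flat_map do |name|
    columns[name].each_with_index.map do |value, i|
      row = index.to_h { |id| [id, columns[id][i]] }
      row.merge("variable" => name, "value" => value)
    end
  end
end

wide = { "a" => ["x", "y", "z"], "b" => [1, 3, 5], "c" => [2, 4, 6] }
long = unpivot_columns(wide, on: ["b", "c"], index: ["a"])

long.length  # => 6
long.first   # => {"a"=>"x", "variable"=>"b", "value"=>1}
long.last    # => {"a"=>"z", "variable"=>"c", "value"=>6}
```

The row order matches the example output above: all rows for `b`, then all rows for `c`.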

#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ DataFrame

Note:

This functionality is experimental and may be subject to changes without it being considered a breaking change.

Unstack a long table to a wide form without doing an aggregation.

This can be much faster than a pivot, because it can skip the grouping phase.

Examples:

df = Polars::DataFrame.new(
  {
    "col1" => "A".."I",
    "col2" => Polars.arange(0, 9, eager: true)
  }
)
# =>
# shape: (9, 2)
# ┌──────┬──────┐
# │ col1 ┆ col2 │
# │ ---  ┆ ---  │
# │ str  ┆ i64  │
# ╞══════╪══════╡
# │ A    ┆ 0    │
# │ B    ┆ 1    │
# │ C    ┆ 2    │
# │ D    ┆ 3    │
# │ E    ┆ 4    │
# │ F    ┆ 5    │
# │ G    ┆ 6    │
# │ H    ┆ 7    │
# │ I    ┆ 8    │
# └──────┴──────┘
df.unstack(step: 3, how: "vertical")
# =>
# shape: (3, 6)
# ┌────────┬────────┬────────┬────────┬────────┬────────┐
# │ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │
# │ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
# │ str    ┆ str    ┆ str    ┆ i64    ┆ i64    ┆ i64    │
# ╞════════╪════════╪════════╪════════╪════════╪════════╡
# │ A      ┆ D      ┆ G      ┆ 0      ┆ 3      ┆ 6      │
# │ B      ┆ E      ┆ H      ┆ 1      ┆ 4      ┆ 7      │
# │ C      ┆ F      ┆ I      ┆ 2      ┆ 5      ┆ 8      │
# └────────┴────────┴────────┴────────┴────────┴────────┘
df.unstack(step: 3, how: "horizontal")
# =>
# shape: (3, 6)
# ┌────────┬────────┬────────┬────────┬────────┬────────┐
# │ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │
# │ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
# │ str    ┆ str    ┆ str    ┆ i64    ┆ i64    ┆ i64    │
# ╞════════╪════════╪════════╪════════╪════════╪════════╡
# │ A      ┆ B      ┆ C      ┆ 0      ┆ 1      ┆ 2      │
# │ D      ┆ E      ┆ F      ┆ 3      ┆ 4      ┆ 5      │
# │ G      ┆ H      ┆ I      ┆ 6      ┆ 7      ┆ 8      │
# └────────┴────────┴────────┴────────┴────────┴────────┘

Parameters:

  • step (Integer)

    Number of rows in the unstacked frame.

  • how ("vertical", "horizontal") (defaults to: "vertical")

    Direction of the unstack.

  • columns (Object) (defaults to: nil)

    Column to include in the operation.

  • fill_values (Object) (defaults to: nil)

    Fill values that don't fit the new size with this value.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3462

def unstack(step:, how: "vertical", columns: nil, fill_values: nil)
  if !columns.nil?
    df = select(columns)
  else
    df = self
  end

  height = df.height
  if how == "vertical"
    n_rows = step
    n_cols = (height / n_rows.to_f).ceil
  else
    n_cols = step
    n_rows = (height / n_cols.to_f).ceil
  end

  n_fill = n_cols * n_rows - height

  if n_fill > 0
    if !fill_values.is_a?(::Array)
      fill_values = [fill_values] * df.width
    end

    df = df.select(
      df.get_columns.zip(fill_values).map do |s, next_fill|
        s.extend_constant(next_fill, n_fill)
      end
    )
  end

  if how == "horizontal"
    df = (
      df.with_column(
        (Polars.arange(0, n_cols * n_rows, eager: true) % n_cols).alias(
          "__sort_order"
        )
      )
      .sort("__sort_order")
      .drop("__sort_order")
    )
  end

  zfill_val = Math.log10(n_cols).floor + 1
  slices =
    df.get_columns.flat_map do |s|
      n_cols.times.map do |slice_nbr|
        s.slice(slice_nbr * n_rows, n_rows).alias("%s_%0#{zfill_val}d" % [s.name, slice_nbr])
      end
    end

  _from_rbdf(DataFrame.new(slices)._df)
end
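
Two pieces of arithmetic in the method body above decide the output shape: the grid dimensions (with the number of fill cells) and the zero-padded column suffixes. The plain-Ruby sketch below isolates both; `unstack_layout` and `slice_names` are hypothetical helper names.

```ruby
# Grid dimensions and padding count, following the method body above
# (hypothetical helper). Returns [n_rows, n_cols, n_fill].
def unstack_layout(height, step, how: "vertical")
  if how == "vertical"
    n_rows = step
    n_cols = (height / n_rows.to_f).ceil
  else
    n_cols = step
    n_rows = (height / n_cols.to_f).ceil
  end
  [n_rows, n_cols, n_cols * n_rows - height]
end

# Output column names with a zero-padded slice number (hypothetical helper).
def slice_names(name, n_cols)
  width = Math.log10(n_cols).floor + 1
  n_cols.times.map { |i| format("%s_%0#{width}d", name, i) }
end

unstack_layout(9, 3)                      # => [3, 3, 0]
unstack_layout(10, 3)                     # => [3, 4, 2]
unstack_layout(10, 3, how: "horizontal")  # => [4, 3, 2]
slice_names("col1", 3)                    # => ["col1_0", "col1_1", "col1_2"]
```

A non-zero third element is the number of cells filled from `fill_values`.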

#upsample(time_column:, every:, by: nil, maintain_order: false) ⇒ DataFrame

Upsample a DataFrame at a regular frequency.

The every argument is specified with the following duration string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

Examples:

Upsample a DataFrame by a certain interval.

df = Polars::DataFrame.new(
  {
    "time" => [
      DateTime.new(2021, 2, 1),
      DateTime.new(2021, 4, 1),
      DateTime.new(2021, 5, 1),
      DateTime.new(2021, 6, 1)
    ],
    "groups" => ["A", "B", "A", "B"],
    "values" => [0, 1, 2, 3]
  }
).set_sorted("time")
df.upsample(
  time_column: "time", every: "1mo", by: "groups", maintain_order: true
).select(Polars.all.forward_fill)
# =>
# shape: (7, 3)
# ┌─────────────────────┬────────┬────────┐
# │ time                ┆ groups ┆ values │
# │ ---                 ┆ ---    ┆ ---    │
# │ datetime[ns]        ┆ str    ┆ i64    │
# ╞═════════════════════╪════════╪════════╡
# │ 2021-02-01 00:00:00 ┆ A      ┆ 0      │
# │ 2021-03-01 00:00:00 ┆ A      ┆ 0      │
# │ 2021-04-01 00:00:00 ┆ A      ┆ 0      │
# │ 2021-05-01 00:00:00 ┆ A      ┆ 2      │
# │ 2021-04-01 00:00:00 ┆ B      ┆ 1      │
# │ 2021-05-01 00:00:00 ┆ B      ┆ 1      │
# │ 2021-06-01 00:00:00 ┆ B      ┆ 3      │
# └─────────────────────┴────────┴────────┘

Parameters:

  • time_column (Object)

    Time column used to determine a date range. Note that this column has to be sorted for the output to make sense.

  • every (String)

    Interval will start 'every' duration.

  • by (Object) (defaults to: nil)

    First group by these columns and then upsample for every group.

  • maintain_order (Boolean) (defaults to: false)

    Keep the ordering predictable. This is slower.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2239

def upsample(
  time_column:,
  every:,
  by: nil,
  maintain_order: false
)
  if by.nil?
    by = []
  end
  if by.is_a?(::String)
    by = [by]
  end

  every = Utils.parse_as_duration_string(every)

  _from_rbdf(
    _df.upsample(by, time_column, every, maintain_order)
  )
end
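
To illustrate how a combined duration string such as "3d12h4m25s" breaks down into the units listed above, here is a small plain-Ruby tokenizer. `split_duration` is a hypothetical helper and is not the library's `Utils.parse_as_duration_string`; it only splits the string and does not handle signs.

```ruby
# Split a duration string into [count, unit] pairs (hypothetical helper).
# Longer unit names are listed before their one-letter prefixes so that
# "ms" and "mo" are not read as "m".
def split_duration(text)
  tokens = text.scan(/(\d+)(ns|us|ms|mo|s|m|h|d|w|y|i)/)
  unless tokens.map(&:join).join == text
    raise ArgumentError, "invalid duration: #{text}"
  end

  tokens.map { |count, unit| [count.to_i, unit] }
end

split_duration("3d12h4m25s")  # => [[3, "d"], [12, "h"], [4, "m"], [25, "s"]]
split_duration("1mo")         # => [[1, "mo"]]
```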

#var(ddof: 1) ⇒ DataFrame

Aggregate the columns of this DataFrame to their variance value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.var
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 1.0 ┆ 1.0 ┆ null │
# └─────┴─────┴──────┘
df.var(ddof: 0)
# =>
# shape: (1, 3)
# ┌──────────┬──────────┬──────┐
# │ foo      ┆ bar      ┆ ham  │
# │ ---      ┆ ---      ┆ ---  │
# │ f64      ┆ f64      ┆ str  │
# ╞══════════╪══════════╪══════╡
# │ 0.666667 ┆ 0.666667 ┆ null │
# └──────────┴──────────┴──────┘

Parameters:

  • ddof (Integer) (defaults to: 1)

    Delta degrees of freedom; the divisor used in the calculation is N - ddof, where N is the number of elements.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4277

def var(ddof: 1)
  lazy.var(ddof: ddof).collect(_eager: true)
end

#vstack(df, in_place: false) ⇒ DataFrame

Grow this DataFrame vertically by stacking a DataFrame to it.

Examples:

df1 = Polars::DataFrame.new(
  {
    "foo" => [1, 2],
    "bar" => [6, 7],
    "ham" => ["a", "b"]
  }
)
df2 = Polars::DataFrame.new(
  {
    "foo" => [3, 4],
    "bar" => [8, 9],
    "ham" => ["c", "d"]
  }
)
df1.vstack(df2)
# =>
# shape: (4, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# │ 3   ┆ 8   ┆ c   │
# │ 4   ┆ 9   ┆ d   │
# └─────┴─────┴─────┘

Parameters:

  • df (DataFrame)

    DataFrame to stack.

  • in_place (Boolean) (defaults to: false)

    Modify in place.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2729

def vstack(df, in_place: false)
  if in_place
    _df.vstack_mut(df._df)
    self
  else
    _from_rbdf(_df.vstack(df._df))
  end
end

#width ⇒ Integer

Get the width of the DataFrame.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3, 4, 5]})
df.width
# => 1

Returns:

  • (Integer)


# File 'lib/polars/data_frame.rb', line 122

def width
  _df.width
end

#with_column(column) ⇒ DataFrame

Return a new DataFrame with the column added or replaced.

Examples:

Added

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.with_column((Polars.col("b") ** 2).alias("b_squared"))
# =>
# shape: (3, 3)
# ┌─────┬─────┬───────────┐
# │ a   ┆ b   ┆ b_squared │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ i64 ┆ i64       │
# ╞═════╪═════╪═══════════╡
# │ 1   ┆ 2   ┆ 4         │
# │ 3   ┆ 4   ┆ 16        │
# │ 5   ┆ 6   ┆ 36        │
# └─────┴─────┴───────────┘

Replaced

df.with_column(Polars.col("a") ** 2)
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# │ 9   ┆ 4   │
# │ 25  ┆ 6   │
# └─────┴─────┘

Parameters:

  • column (Object)

    Series, where the name of the Series refers to the column in the DataFrame.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2644

def with_column(column)
  lazy
    .with_column(column)
    .collect(no_optimization: true, string_cache: false)
end

#with_columns(*exprs, **named_exprs) ⇒ DataFrame

Add columns to this DataFrame.

Added columns will replace existing columns with the same name.

Examples:

Pass an expression to add it as a new column.

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
df.with_columns((Polars.col("a") ** 2).alias("a^2"))
# =>
# shape: (4, 4)
# ┌─────┬──────┬───────┬─────┐
# │ a   ┆ b    ┆ c     ┆ a^2 │
# │ --- ┆ ---  ┆ ---   ┆ --- │
# │ i64 ┆ f64  ┆ bool  ┆ i64 │
# ╞═════╪══════╪═══════╪═════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   │
# │ 3   ┆ 10.0 ┆ false ┆ 9   │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  │
# └─────┴──────┴───────┴─────┘

Added columns will replace existing columns with the same name.

df.with_columns(Polars.col("a").cast(Polars::Float64))
# =>
# shape: (4, 3)
# ┌─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     │
# │ --- ┆ ---  ┆ ---   │
# │ f64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╡
# │ 1.0 ┆ 0.5  ┆ true  │
# │ 2.0 ┆ 4.0  ┆ true  │
# │ 3.0 ┆ 10.0 ┆ false │
# │ 4.0 ┆ 13.0 ┆ true  │
# └─────┴──────┴───────┘

Multiple columns can be added by passing a list of expressions.

df.with_columns(
  [
    (Polars.col("a") ** 2).alias("a^2"),
    (Polars.col("b") / 2).alias("b/2"),
    (Polars.col("c").not_).alias("not c"),
  ]
)
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪═════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴─────┴──────┴───────┘

Multiple columns also can be added using positional arguments instead of a list.

df.with_columns(
  (Polars.col("a") ** 2).alias("a^2"),
  (Polars.col("b") / 2).alias("b/2"),
  (Polars.col("c").not_).alias("not c"),
)
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪═════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴─────┴──────┴───────┘

Use keyword arguments to easily name your expression inputs.

df.with_columns(
  ab: Polars.col("a") * Polars.col("b"),
  not_c: Polars.col("c").not_
)
# =>
# shape: (4, 5)
# ┌─────┬──────┬───────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
# │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
# └─────┴──────┴───────┴──────┴───────┘

Parameters:

  • exprs (Array)

    Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names; other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3953

def with_columns(*exprs, **named_exprs)
  lazy.with_columns(*exprs, **named_exprs).collect(_eager: true)
end

#with_row_index(name: "index", offset: 0) ⇒ DataFrame Also known as: with_row_count

Add a column at index 0 that counts the rows.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.with_row_index
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ index ┆ a   ┆ b   │
# │ ---   ┆ --- ┆ --- │
# │ u32   ┆ i64 ┆ i64 │
# ╞═══════╪═════╪═════╡
# │ 0     ┆ 1   ┆ 2   │
# │ 1     ┆ 3   ┆ 4   │
# │ 2     ┆ 5   ┆ 6   │
# └───────┴─────┴─────┘

Parameters:

  • name (String) (defaults to: "index")

    Name of the column to add.

  • offset (Integer) (defaults to: 0)

    Start the row count at this offset.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1774

def with_row_index(name: "index", offset: 0)
  _from_rbdf(_df.with_row_index(name, offset))
end

#write_avro(file, compression = "uncompressed", name: "") ⇒ nil

Write to Apache Avro file.

Parameters:

  • file (String)

    File path to which the file should be written.

  • compression ("uncompressed", "snappy", "deflate") (defaults to: "uncompressed")

    Compression method. Defaults to "uncompressed".

Returns:

  • (nil)


# File 'lib/polars/data_frame.rb', line 812

def write_avro(file, compression = "uncompressed", name: "")
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  if name.nil?
    name = ""
  end

  _df.write_avro(file, compression, name)
end

#write_csv(file = nil, has_header: true, include_header: nil, sep: ",", quote: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_precision: nil, null_value: nil) ⇒ String?

Write to comma-separated values (CSV) file.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.write_csv("file.csv")

Parameters:

  • file (String, nil) (defaults to: nil)

    File path to which the result should be written. If set to nil (default), the output is returned as a string instead.

  • has_header (Boolean) (defaults to: true)

    Whether to include a header row in the CSV output. Used only when include_header is nil.

  • sep (String) (defaults to: ",")

    Separate CSV fields with this symbol.

  • quote (String) (defaults to: '"')

    Byte to use as quoting character.

  • batch_size (Integer) (defaults to: 1024)

    Number of rows that will be processed per thread.

  • datetime_format (String, nil) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum time unit found in the frame's Datetime columns (if any).

  • date_format (String, nil) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate.

  • time_format (String, nil) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate.

  • float_precision (Integer, nil) (defaults to: nil)

    Number of decimal places to write, applied to both :f32 and :f64 datatypes.

  • null_value (String, nil) (defaults to: nil)

    A string representing null values (defaulting to the empty string).

Returns:

  • (String, nil)


# File 'lib/polars/data_frame.rb', line 737

def write_csv(
  file = nil,
  has_header: true,
  include_header: nil,
  sep: ",",
  quote: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_precision: nil,
  null_value: nil
)
  include_header = has_header if include_header.nil?

  if sep.length > 1
    raise ArgumentError, "only single byte separator is allowed"
  elsif quote.length > 1
    raise ArgumentError, "only single byte quote char is allowed"
  elsif null_value == ""
    null_value = nil
  end

  if file.nil?
    buffer = StringIO.new
    buffer.set_encoding(Encoding::BINARY)
    _df.write_csv(
      buffer,
      include_header,
      sep.ord,
      quote.ord,
      batch_size,
      datetime_format,
      date_format,
      time_format,
      float_precision,
      null_value
    )
    return buffer.string.force_encoding(Encoding::UTF_8)
  end

  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  _df.write_csv(
    file,
    include_header,
    sep.ord,
    quote.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_precision,
    null_value,
  )
  nil
end
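
The option checks at the top of the method can be tried in isolation. The sketch below mirrors them in plain Ruby under the hypothetical name csv_write_options; it does not call into Polars and only shows what the native writer would receive.

```ruby
# Stand-alone sketch of the option checks in write_csv. `csv_write_options`
# is a hypothetical name; nothing here calls into Polars.
def csv_write_options(sep: ",", quote: '"', null_value: nil)
  raise ArgumentError, "only single byte separator is allowed" if sep.length > 1
  raise ArgumentError, "only single byte quote char is allowed" if quote.length > 1

  # An empty null_value is treated the same as "not set".
  null_value = nil if null_value == ""

  # The native writer is handed integer code points rather than strings.
  { sep: sep.ord, quote: quote.ord, null_value: null_value }
end

p csv_write_options(sep: ";")
p csv_write_options(null_value: "")

begin
  csv_write_options(sep: "||")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end

# String#length counts characters, not bytes, so one multi-byte character
# (here the euro sign) passes the length check although it is three bytes.
euro = "\u20AC"
p [euro.length, euro.bytesize]
```

Because the check uses String#length, it appears safest to pass only single-byte (ASCII) characters for sep and quote.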

#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ nil

Write DataFrame as delta table.

Parameters:

  • target (Object)

    URI of a table or a DeltaTable object.

  • mode ("error", "append", "overwrite", "ignore", "merge") (defaults to: "error")

    How to handle existing data.

  • storage_options (Hash) (defaults to: nil)

    Extra options for the storage backends supported by deltalake-rb.

  • delta_write_options (Hash) (defaults to: nil)

    Additional keyword arguments passed through when writing a Delta Lake table.

  • delta_merge_options (Hash) (defaults to: nil)

    Keyword arguments required to MERGE into a Delta Lake table. Must be given when mode is "merge" and should include a :predicate entry.

Returns:

  • (nil)


# File 'lib/polars/data_frame.rb', line 990

def write_delta(
  target,
  mode: "error",
  storage_options: nil,
  delta_write_options: nil,
  delta_merge_options: nil
)
  Polars.send(:_check_if_delta_available)

  if Utils.pathlike?(target)
    target = Polars.send(:_resolve_delta_lake_uri, target.to_s, strict: false)
  end

  data = self

  if mode == "merge"
    if delta_merge_options.nil?
      msg = "You need to pass delta_merge_options with at least a given predicate for `MERGE` to work."
      raise ArgumentError, msg
    end
    if target.is_a?(::String)
      dt = DeltaLake::Table.new(target, storage_options: storage_options)
    else
      dt = target
    end

    predicate = delta_merge_options.delete(:predicate)
    dt.merge(data, predicate, **delta_merge_options)
  else
    delta_write_options ||= {}

    DeltaLake.write(
      target,
      data,
      mode: mode,
      storage_options: storage_options,
      **delta_write_options
    )
  end
end
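
In the merge branch, the :predicate entry is removed from delta_merge_options with Hash#delete, which also removes it from the hash the caller passed in. A caller who wants to reuse the same options hash could split it on a copy first. The sketch below is plain Ruby; split_merge_options is a hypothetical helper, and the option names other than :predicate are placeholders rather than a documented API.

```ruby
# Stand-alone sketch; `split_merge_options` is a hypothetical helper and the
# keys other than :predicate are illustrative placeholders.
def split_merge_options(delta_merge_options)
  if delta_merge_options.nil?
    raise ArgumentError, "delta_merge_options with a predicate is required for merge"
  end

  # Work on a copy so the caller's hash keeps its :predicate entry.
  rest = delta_merge_options.dup
  predicate = rest.delete(:predicate)
  [predicate, rest]
end

options = {
  predicate: "target.id = source.id",
  source_alias: "source",
  target_alias: "target"
}
predicate, rest = split_merge_options(options)
p predicate
p rest
p options.key?(:predicate)
```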

#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, retries: 2) ⇒ nil

Write to Arrow IPC binary stream or Feather file.

Parameters:

  • file (String)

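    File path to which the file should be written. If nil is passed, the output is written to an in-memory buffer and returned as a binary string instead.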

  • compression ("uncompressed", "lz4", "zstd") (defaults to: "uncompressed")

    Compression method. Defaults to "uncompressed".

Returns:

  • (nil)


# File 'lib/polars/data_frame.rb', line 834

def write_ipc(
  file,
  compression: "uncompressed",
  compat_level: nil,
  storage_options: nil,
  retries: 2
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  if storage_options&.any?
    storage_options = storage_options.to_a
  else
    storage_options = nil
  end

  _df.write_ipc(file, compression, compat_level, storage_options, retries)
  return_bytes ? file.string : nil
end
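
The storage_options handling above reduces to a small rule: nil and an empty hash both become nil, and anything else becomes an array of key/value pairs. A plain-Ruby sketch, with normalize_storage_options as a hypothetical name and placeholder keys:

```ruby
# Stand-alone sketch; `normalize_storage_options` is a hypothetical name and
# the keys used below are placeholders, not real backend options.
def normalize_storage_options(storage_options)
  # nil and {} collapse to nil; a non-empty hash becomes [[key, value], ...].
  storage_options&.any? ? storage_options.to_a : nil
end

p normalize_storage_options(nil)
p normalize_storage_options({})
p normalize_storage_options({ "some_key" => "some_value" })
```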

#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ Object

Write to Arrow IPC record batch stream.

See "Streaming format" in https://arrow.apache.org/docs/python/ipc.html.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.write_ipc_stream("new_file.arrow")

Parameters:

  • file (Object)

    Path or writable file-like object to which the IPC record batch data will be written. If set to nil, the output is returned as a binary string.

  • compression ("uncompressed", "lz4", "zstd") (defaults to: "uncompressed")

    Compression method. Defaults to "uncompressed".

Returns:

  • (String, nil)


# File 'lib/polars/data_frame.rb', line 889

def write_ipc_stream(
  file,
  compression: "uncompressed",
  compat_level: nil
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  elsif Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  _df.write_ipc_stream(file, compression, compat_level)
  return_bytes ? file.string : nil
end

#write_json(file = nil, pretty: false, row_oriented: false) ⇒ nil

Serialize to JSON representation.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8]
  }
)
df.write_json
# => "{\"columns\":[{\"name\":\"foo\",\"datatype\":\"Int64\",\"bit_settings\":\"\",\"values\":[1,2,3]},{\"name\":\"bar\",\"datatype\":\"Int64\",\"bit_settings\":\"\",\"values\":[6,7,8]}]}"
df.write_json(row_oriented: true)
# => "[{\"foo\":1,\"bar\":6},{\"foo\":2,\"bar\":7},{\"foo\":3,\"bar\":8}]"

Parameters:

  • file (String, StringIO, nil) (defaults to: nil)

    File path or StringIO to which the result should be written. If set to nil (default), the output is returned as a string instead.

  • pretty (Boolean) (defaults to: false)

    Pretty-print the JSON output.

  • row_oriented (Boolean) (defaults to: false)

    Write row-oriented JSON. This is slower, but the format is more common.

Returns:

  • (String, nil)


# File 'lib/polars/data_frame.rb', line 627

def write_json(
  file = nil,
  pretty: false,
  row_oriented: false
)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_json(buf, pretty, row_oriented)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_json(file, pretty, row_oriented)
  end
  nil
end
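
The method has three outcomes depending on file: with nil the JSON string is returned, with a StringIO the string is copied into it, and with a path the data goes to disk. The sketch below reproduces only that dispatch in plain Ruby. The name emit_json and the payload argument are hypothetical; in the real method the bytes come from the native writer, and File.binwrite here is just a stand-in for the native file write.

```ruby
require "stringio"

# Stand-alone sketch of the three destinations handled by write_json.
# `emit_json` and `payload` are hypothetical and do not call into Polars.
def emit_json(file, payload)
  if file.nil? || file.is_a?(StringIO)
    # Collect raw bytes in a binary buffer, then relabel them as UTF-8.
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    buf.write(payload)
    json_str = buf.string.force_encoding(Encoding::UTF_8)
    return json_str if file.nil? # no destination: hand the string back

    file.write(json_str)         # StringIO destination: copy into it
  else
    File.binwrite(file, payload) # path destination (stand-in for the native write)
  end
  nil
end

payload = '[{"foo":1,"bar":6}]'.b
p emit_json(nil, payload)

io = StringIO.new
p emit_json(io, payload)
p io.string
```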

#write_ndjson(file = nil) ⇒ nil

Serialize to newline delimited JSON representation.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8]
  }
)
df.write_ndjson
# => "{\"foo\":1,\"bar\":6}\n{\"foo\":2,\"bar\":7}\n{\"foo\":3,\"bar\":8}\n"

Parameters:

  • file (String, StringIO, nil) (defaults to: nil)

    File path or StringIO to which the result should be written. If set to nil (default), the output is returned as a string instead.

Returns:

  • (String, nil)


# File 'lib/polars/data_frame.rb', line 670

def write_ndjson(file = nil)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_ndjson(buf)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_ndjson(file)
  end
  nil
end
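
For readers unfamiliar with the format, newline-delimited JSON is one JSON object per row, each terminated by a newline. The plain-Ruby sketch below builds the same shape from a hash of columns using only the standard json library. The name ndjson_model is hypothetical, and the sketch does not reflect how the native writer is implemented.

```ruby
require "json"

# Plain-Ruby model of the NDJSON layout; `ndjson_model` is a hypothetical
# helper and does not call into Polars.
def ndjson_model(columns)
  names = columns.keys
  rows = columns.values.transpose # column arrays -> row arrays
  rows.map { |row| JSON.generate(names.zip(row).to_h) + "\n" }.join
end

puts ndjson_model({ "foo" => [1, 2, 3], "bar" => [6, 7, 8] })
```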

#write_parquet(file, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_page_size: nil) ⇒ nil

Write to Apache Parquet file.

Parameters:

  • file (String, Pathname, StringIO)

    File path to which the file should be written.

  • compression ("lz4", "uncompressed", "snappy", "gzip", "lzo", "brotli", "zstd") (defaults to: "zstd")

    Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression. Choose "snappy" for more backwards compatibility guarantees when you deal with older parquet readers.

  • compression_level (Integer, nil) (defaults to: nil)

    The level of compression to use. Higher compression means smaller files on disk.

    • "gzip" : min-level: 0, max-level: 10.
    • "brotli" : min-level: 0, max-level: 11.
    • "zstd" : min-level: 1, max-level: 22.

  • statistics (Boolean, String) (defaults to: false)

    Write statistics to the parquet headers. This requires extra compute. Pass "full" to also write distinct counts.

  • row_group_size (Integer, nil) (defaults to: nil)

    Size of the row groups in number of rows. Defaults to 512^2 rows.

  • data_page_size (Integer, nil) (defaults to: nil)

    Size of the data page in bytes. Defaults to 1024^2 bytes.

Returns:

  • (nil)


# File 'lib/polars/data_frame.rb', line 938

def write_parquet(
  file,
  compression: "zstd",
  compression_level: nil,
  statistics: false,
  row_group_size: nil,
  data_page_size: nil
)
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if statistics == true
    statistics = {
      min: true,
      max: true,
      distinct_count: false,
      null_count: true
    }
  elsif statistics == false
    statistics = {}
  elsif statistics == "full"
    statistics = {
      min: true,
      max: true,
      distinct_count: true,
      null_count: true
    }
  end

  _df.write_parquet(
    file, compression, compression_level, statistics, row_group_size, data_page_size
  )
end
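
The statistics argument is expanded into a hash of flags before it reaches the native writer. The sketch below isolates that mapping in plain Ruby under the hypothetical name normalize_statistics; it shows that true and "full" differ only in distinct_count.

```ruby
# Stand-alone sketch of the statistics mapping in write_parquet.
# `normalize_statistics` is a hypothetical name; nothing here calls Polars.
def normalize_statistics(statistics)
  case statistics
  when true
    { min: true, max: true, distinct_count: false, null_count: true }
  when false
    {}
  when "full"
    { min: true, max: true, distinct_count: true, null_count: true }
  else
    statistics # any other value is passed through unchanged
  end
end

p normalize_statistics(true)
p normalize_statistics(false)
p normalize_statistics("full")
```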