Class: Polars::LazyFrame

Inherits: Object
Defined in:
lib/polars/lazy_frame.rb

Overview

Representation of a Lazy computation graph/query against a DataFrame.

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame

Create a new LazyFrame.
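
Examples:

A minimal construction sketch; the column names and values below are purely illustrative:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3],
    "b" => ["x", "y", "z"]
  }
)
# Nothing is executed yet; the frame only holds a logical plan.
lf.columns
# => ["a", "b"]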



# File 'lib/polars/lazy_frame.rb', line 8

def initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false)
  self._ldf = (
    DataFrame.new(
      data,
      schema: schema,
      schema_overrides: schema_overrides,
      orient: orient,
      infer_schema_length: infer_schema_length,
      nan_to_null: nan_to_null
    )
    .lazy
    ._ldf
  )
end

Class Method Details

.read_json(file) ⇒ LazyFrame

Read a logical plan from a JSON file to construct a LazyFrame.
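
Examples:

A minimal sketch; "plan.json" is a hypothetical path to a previously serialized logical plan:

lf = Polars::LazyFrame.read_json("plan.json")
# Execute the reconstructed plan.
lf.collect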

Parameters:

  • file (String)

    Path to a file or a file-like object.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 39

def self.read_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  Utils.wrap_ldf(RbLazyFrame.read_json(file))
end

Instance Method Details

#cache ⇒ LazyFrame

Cache the result once the execution of the physical plan hits this node.
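
Examples:

A minimal sketch with an illustrative frame; cache marks this node of the plan so that its result can be reused instead of recomputed when the plan is executed:

lf = Polars::LazyFrame.new({"a" => [1, 2, 3, 4], "key" => ["x", "x", "y", "y"]})
cached = lf.filter(Polars.col("a") > 1).cache
# Both sides of this self-join can share the cached subplan.
cached.join(cached, on: "key", suffix: "_right").collect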

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 847

def cache
  _from_rbldf(_ldf.cache)
end

#cast(dtypes, strict: true) ⇒ LazyFrame

Cast LazyFrame column(s) to the specified dtype(s).

Examples:

Cast specific frame columns to the specified dtypes:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => [Date.new(2020, 1, 2), Date.new(2021, 3, 4), Date.new(2022, 5, 6)]
  }
)
lf.cast({"foo" => Polars::Float32, "bar" => Polars::UInt8}).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬────────────┐
# │ foo ┆ bar ┆ ham        │
# │ --- ┆ --- ┆ ---        │
# │ f32 ┆ u8  ┆ date       │
# ╞═════╪═════╪════════════╡
# │ 1.0 ┆ 6   ┆ 2020-01-02 │
# │ 2.0 ┆ 7   ┆ 2021-03-04 │
# │ 3.0 ┆ 8   ┆ 2022-05-06 │
# └─────┴─────┴────────────┘

Cast all frame columns matching one dtype (or dtype group) to another dtype:

lf.cast({Polars::Date => Polars::Datetime}).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────────────────────┐
# │ foo ┆ bar ┆ ham                 │
# │ --- ┆ --- ┆ ---                 │
# │ i64 ┆ f64 ┆ datetime[μs]        │
# ╞═════╪═════╪═════════════════════╡
# │ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
# │ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
# │ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
# └─────┴─────┴─────────────────────┘

Cast all frame columns to the specified dtype:

lf.cast(Polars::String).collect.to_h(as_series: false)
# => {"foo"=>["1", "2", "3"], "bar"=>["6.0", "7.0", "8.0"], "ham"=>["2020-01-02", "2021-03-04", "2022-05-06"]}

Parameters:

  • dtypes (Hash)

    Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.

  • strict (Boolean) (defaults to: true)

    Throw an error if a cast could not be done (for instance, due to an overflow).

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 900

def cast(dtypes, strict: true)
  if !dtypes.is_a?(Hash)
    return _from_rbldf(_ldf.cast_all(dtypes, strict))
  end

  cast_map = {}
  dtypes.each do |c, dtype|
    dtype = Utils.parse_into_dtype(dtype)
    cast_map.merge!(
      c.is_a?(::String) ? {c => dtype} : Utils.expand_selector(self, c).to_h { |x| [x, dtype] }
    )
  end

  _from_rbldf(_ldf.cast(cast_map, strict))
end

#clear(n = 0) ⇒ LazyFrame Also known as: cleared

Create an empty copy of the current LazyFrame.

The copy has an identical schema but no data.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil],
  }
).lazy
lf.clear.fetch
# =>
# shape: (0, 3)
# ┌─────┬─────┬──────┐
# │ a   ┆ b   ┆ c    │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ f64 ┆ bool │
# ╞═════╪═════╪══════╡
# └─────┴─────┴──────┘
lf.clear(2).fetch
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ a    ┆ b    ┆ c    │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ f64  ┆ bool │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 952

def clear(n = 0)
  DataFrame.new(columns: schema).clear(n).lazy
end

#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false, _eager: false) ⇒ DataFrame

Collect into a DataFrame.

Note: use #fetch if you want to run your query on the first n rows only. This can be a huge time saver in debugging queries.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.all.sum).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ a   ┆ 4   ┆ 10  │
# │ b   ┆ 11  ┆ 10  │
# │ c   ┆ 6   ┆ 1   │
# └─────┴─────┴─────┘

Parameters:

  • type_coercion (Boolean) (defaults to: true)

    Do type coercion optimization.

  • predicate_pushdown (Boolean) (defaults to: true)

    Do predicate pushdown optimization.

  • projection_pushdown (Boolean) (defaults to: true)

    Do projection pushdown optimization.

  • simplify_expression (Boolean) (defaults to: true)

    Run simplify expressions optimization.

  • string_cache (Boolean) (defaults to: false)

    This argument is deprecated. Please set the string cache globally. The argument will be ignored.

  • no_optimization (Boolean) (defaults to: false)

    Turn off (certain) optimizations.

  • slice_pushdown (Boolean) (defaults to: true)

    Slice pushdown optimization.

  • common_subplan_elimination (Boolean) (defaults to: true)

    Will try to cache branching subplans that occur on self-joins or unions.

  • allow_streaming (Boolean) (defaults to: false)

    Run parts of the query in a streaming fashion (this is in an alpha state).

Returns:

  • (DataFrame)

# File 'lib/polars/lazy_frame.rb', line 333

def collect(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false,
  _eager: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
    comm_subexpr_elim = false
  end

  if allow_streaming
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    _eager
  )
  Utils.wrap_df(ldf.collect)
end

#columns ⇒ Array

Get or set column names.

Examples:

df = (
   Polars::DataFrame.new(
     {
       "foo" => [1, 2, 3],
       "bar" => [6, 7, 8],
       "ham" => ["a", "b", "c"]
     }
   )
   .lazy
   .select(["foo", "bar"])
)
df.columns
# => ["foo", "bar"]

Returns:

  • (Array)

# File 'lib/polars/lazy_frame.rb', line 65

def columns
  _ldf.collect_schema.keys
end

#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ String

Create a string representation of the optimized query plan.
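
Examples:

A minimal sketch; the exact plan text depends on the Polars version, so no output is shown:

lf = Polars::LazyFrame.new({"a" => [1, 2, 3], "b" => [4, 5, 6]})
# Print the plan after the default optimizations have been applied.
puts lf.filter(Polars.col("a") > 1).select("a").describe_optimized_plan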

Returns:

  • (String)

# File 'lib/polars/lazy_frame.rb', line 199

def describe_optimized_plan(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false
)
  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    false
  )

  ldf.describe_optimized_plan
end

#describe_plan ⇒ String

Create a string representation of the unoptimized query plan.
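
Examples:

A minimal sketch; the printed plan text depends on the Polars version:

lf = Polars::LazyFrame.new({"a" => [1, 2, 3]})
# Print the logical plan exactly as written, before optimization.
puts lf.filter(Polars.col("a") > 1).describe_plan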

Returns:

  • (String)

# File 'lib/polars/lazy_frame.rb', line 192

def describe_plan
  _ldf.describe_plan
end

#drop(*columns) ⇒ LazyFrame

Remove one or multiple columns from a DataFrame.

Examples:

Drop a single column by passing the name of that column.

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
lf.drop("ham").collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ f64 │
# ╞═════╪═════╡
# │ 1   ┆ 6.0 │
# │ 2   ┆ 7.0 │
# │ 3   ┆ 8.0 │
# └─────┴─────┘

Drop multiple columns by passing a selector.

lf.drop(Polars.cs.numeric).collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ ham │
# │ --- │
# │ str │
# ╞═════╡
# │ a   │
# │ b   │
# │ c   │
# └─────┘

Use positional arguments to drop multiple columns.

lf.drop("foo", "ham").collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ bar │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 6.0 │
# │ 7.0 │
# │ 8.0 │
# └─────┘

Parameters:

  • columns (Object)
    • Name of the column that should be removed.
    • List of column names.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2199

def drop(*columns)
  drop_cols = Utils._expand_selectors(self, *columns)
  _from_rbldf(_ldf.drop(drop_cols))
end

#drop_nulls(subset: nil) ⇒ LazyFrame

Drop rows with null values from this LazyFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, nil, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.lazy.drop_nulls.collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • subset (Object) (defaults to: nil)

    Subset of column(s) on which drop_nulls will be applied.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3101

def drop_nulls(subset: nil)
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.drop_nulls(subset))
end

#dtypes ⇒ Array

Get dtypes of columns in LazyFrame.

Examples:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]

Returns:

  • (Array)

# File 'lib/polars/lazy_frame.rb', line 83

def dtypes
  _ldf.collect_schema.values
end

#explode(columns) ⇒ LazyFrame

Explode lists to long format.

Examples:

df = Polars::DataFrame.new(
  {
    "letters" => ["a", "a", "b", "c"],
    "numbers" => [[1], [2, 3], [4, 5], [6, 7, 8]],
  }
).lazy
df.explode("numbers").collect
# =>
# shape: (8, 2)
# ┌─────────┬─────────┐
# │ letters ┆ numbers │
# │ ---     ┆ ---     │
# │ str     ┆ i64     │
# ╞═════════╪═════════╡
# │ a       ┆ 1       │
# │ a       ┆ 2       │
# │ a       ┆ 3       │
# │ b       ┆ 4       │
# │ b       ┆ 5       │
# │ c       ┆ 6       │
# │ c       ┆ 7       │
# │ c       ┆ 8       │
# └─────────┴─────────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3002

def explode(columns)
  columns = Utils.parse_into_list_of_expressions(columns)
  _from_rbldf(_ldf.explode(columns))
end

#fetch(n_rows = 500, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ DataFrame

Collect a small number of rows for debugging purposes.

Fetch is like a #collect operation, but it overwrites the number of rows read by every scan operation. This is a utility that helps debug a query on a smaller number of rows.

Note that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations, and the number of rows available in the scanned file all influence the final number of rows.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.all.sum).fetch(2)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ a   ┆ 1   ┆ 6   │
# │ b   ┆ 2   ┆ 5   │
# └─────┴─────┴─────┘

Parameters:

  • n_rows (Integer) (defaults to: 500)

    Collect n_rows from the data sources.

  • type_coercion (Boolean) (defaults to: true)

    Run type coercion optimization.

  • predicate_pushdown (Boolean) (defaults to: true)

    Run predicate pushdown optimization.

  • projection_pushdown (Boolean) (defaults to: true)

    Run projection pushdown optimization.

  • simplify_expression (Boolean) (defaults to: true)

    Run simplify expressions optimization.

  • string_cache (Boolean) (defaults to: false)

    This argument is deprecated. Please set the string cache globally. The argument will be ignored.

  • no_optimization (Boolean) (defaults to: false)

    Turn off optimizations.

  • slice_pushdown (Boolean) (defaults to: true)

    Slice pushdown optimization

  • common_subplan_elimination (Boolean) (defaults to: true)

    Will try to cache branching subplans that occur on self-joins or unions.

  • allow_streaming (Boolean) (defaults to: false)

    Run parts of the query in a streaming fashion (this is in an alpha state).

Returns:

  • (DataFrame)

# File 'lib/polars/lazy_frame.rb', line 790

def fetch(
  n_rows = 500,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    false
  )
  Utils.wrap_df(ldf.fetch(n_rows))
end

#fill_nan(fill_value) ⇒ LazyFrame

Note:

Note that floating point NaN (Not a Number) values are not missing values! To replace missing values, use fill_null instead.

Fill floating point NaN values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1.5, 2, Float::NAN, 4],
    "b" => [0.5, 4, Float::NAN, 13],
  }
).lazy
df.fill_nan(99).collect
# =>
# shape: (4, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ f64  ┆ f64  │
# ╞══════╪══════╡
# │ 1.5  ┆ 0.5  │
# │ 2.0  ┆ 4.0  │
# │ 99.0 ┆ 99.0 │
# │ 4.0  ┆ 13.0 │
# └──────┴──────┘

Parameters:

  • fill_value (Object)

    Value to fill the NaN values with.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2777

def fill_nan(fill_value)
  if !fill_value.is_a?(Expr)
    fill_value = F.lit(fill_value)
  end
  _from_rbldf(_ldf.fill_nan(fill_value._rbexpr))
end

#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame

Fill null values using the specified value or strategy.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, nil, 4],
    "b" => [0.5, 4, nil, 13]
  }
)
lf.fill_null(99).collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 99  ┆ 99.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
lf.fill_null(strategy: "forward").collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
lf.fill_null(strategy: "max").collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
lf.fill_null(strategy: "zero").collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 0   ┆ 0.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2742

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil)
  select(Polars.all.fill_null(value, strategy: strategy, limit: limit))
end

#filter(predicate) ⇒ LazyFrame

Filter the rows in the DataFrame based on a predicate expression.

Examples:

Filter on one condition:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.filter(Polars.col("foo") < 3).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘

Filter on multiple conditions:

lf.filter((Polars.col("foo") < 3) & (Polars.col("ham") == "a")).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Parameters:

  • predicate (Object)

    Expression that evaluates to a boolean Series.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 995

def filter(predicate)
  _from_rbldf(
    _ldf.filter(
      Utils.parse_into_expression(predicate, str_as_lit: false)
    )
  )
end

#first ⇒ LazyFrame

Get the first row of the DataFrame.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 5, 3],
    "b" => [2, 4, 6]
  }
)
lf.first.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2611

def first
  slice(0, 1)
end

#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy Also known as: groupby, group

Start a group by operation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.col("b").sum).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 4   │
# │ b   ┆ 11  │
# │ c   ┆ 6   │
# └─────┴─────┘

Parameters:

  • by (Array)

    Column(s) to group by.

  • maintain_order (Boolean) (defaults to: false)

    Make sure that the order of the groups remains consistent. This is more expensive than a default group by.

  • named_by (Hash)

    Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (LazyGroupBy)

# File 'lib/polars/lazy_frame.rb', line 1132

def group_by(*by, maintain_order: false, **named_by)
  exprs = Utils.parse_into_list_of_expressions(*by, **named_by)
  lgb = _ldf.group_by(exprs, maintain_order)
  LazyGroupBy.new(lgb)
end

#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window") ⇒ DataFrame Also known as: groupby_dynamic

Group based on a time value (or index value of type :i32, :i64).

Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

  • every: interval of the window
  • period: length of the window
  • offset: offset of the window

The every, period and offset arguments are created with the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a group_by_dynamic on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10

Examples:

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "n" => 0..6
  }
)
# =>
# shape: (7, 2)
# ┌─────────────────────┬─────┐
# │ time                ┆ n   │
# │ ---                 ┆ --- │
# │ datetime[μs]        ┆ i64 │
# ╞═════════════════════╪═════╡
# │ 2021-12-16 00:00:00 ┆ 0   │
# │ 2021-12-16 00:30:00 ┆ 1   │
# │ 2021-12-16 01:00:00 ┆ 2   │
# │ 2021-12-16 01:30:00 ┆ 3   │
# │ 2021-12-16 02:00:00 ┆ 4   │
# │ 2021-12-16 02:30:00 ┆ 5   │
# │ 2021-12-16 03:00:00 ┆ 6   │
# └─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

df.group_by_dynamic("time", every: "1h", closed: "right").agg(
  [
    Polars.col("time").min.alias("time_min"),
    Polars.col("time").max.alias("time_max")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬─────────────────────┬─────────────────────┐
# │ time                ┆ time_min            ┆ time_max            │
# │ ---                 ┆ ---                 ┆ ---                 │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
# └─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result.

df.group_by_dynamic(
  "time", every: "1h", include_boundaries: true, closed: "right"
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (4, 4)
# ┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# └─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

With closed: "left", the right end of the interval is not included.

df.group_by_dynamic("time", every: "1h", closed: "left").agg(
  [
    Polars.col("time").count.alias("time_count"),
    Polars.col("time").alias("time_agg_list")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬─────────────────────────────────┐
# │ time                ┆ time_count ┆ time_agg_list                   │
# │ ---                 ┆ ---        ┆ ---                             │
# │ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]              │
# ╞═════════════════════╪════════════╪═════════════════════════════════╡
# │ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-… │
# │ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-… │
# │ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-… │
# │ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]           │
# └─────────────────────┴────────────┴─────────────────────────────────┘

With closed: "both", the time values at the window boundaries belong to two groups.

df.group_by_dynamic("time", every: "1h", closed: "both").agg(
  [Polars.col("time").count.alias("time_count")]
)
# =>
# shape: (5, 2)
# ┌─────────────────────┬────────────┐
# │ time                ┆ time_count │
# │ ---                 ┆ ---        │
# │ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 3          │
# │ 2021-12-16 01:00:00 ┆ 3          │
# │ 2021-12-16 02:00:00 ┆ 3          │
# │ 2021-12-16 03:00:00 ┆ 1          │
# └─────────────────────┴────────────┘

Dynamic group bys can also be combined with grouping on normal keys.

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "groups" => ["a", "a", "a", "b", "b", "a", "a"]
  }
)
df.group_by_dynamic(
  "time",
  every: "1h",
  closed: "both",
  by: "groups",
  include_boundaries: true
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (7, 5)
# ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
# │ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# │ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
# │ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
# └────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic group by on an index column.

df = Polars::DataFrame.new(
  {
    "idx" => Polars.arange(0, 6, eager: true),
    "A" => ["A", "A", "B", "B", "B", "C"]
  }
)
df.group_by_dynamic(
  "idx",
  every: "2i",
  period: "3i",
  include_boundaries: true,
  closed: "right"
).agg(Polars.col("A").alias("A_agg_list"))
# =>
# shape: (4, 4)
# ┌─────────────────┬─────────────────┬─────┬─────────────────┐
# │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
# │ ---             ┆ ---             ┆ --- ┆ ---             │
# │ i64             ┆ i64             ┆ i64 ┆ list[str]       │
# ╞═════════════════╪═════════════════╪═════╪═════════════════╡
# │ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
# │ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
# │ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
# │ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
# └─────────────────┴─────────────────┴─────┴─────────────────┘

Parameters:

  • index_column (Object)

    Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if not, the output will not make sense.

    In case of a dynamic group by on indices, dtype needs to be one of :i32, :i64. Note that :i32 gets temporarily cast to :i64, so if performance matters use an :i64 column.

  • every (Object)

    Interval of the window.

  • period (Object) (defaults to: nil)

    Length of the window; if nil, it is equal to 'every'.

  • offset (Object) (defaults to: nil)

    Offset of the window. If nil and period is nil, it will be equal to negative every.

  • truncate (Boolean) (defaults to: nil)

    Truncate the time value to the window lower bound.

  • include_boundaries (Boolean) (defaults to: false)

    Add the lower and upper bound of the window to the "_lower_boundary" and "_upper_boundary" columns. This will impact performance because it's harder to parallelize.

  • closed ("right", "left", "both", "none") (defaults to: "left")

    Define whether the temporal window interval is closed or not.

  • by (Object) (defaults to: nil)

    Also group by this column/these columns.

Returns:



# File 'lib/polars/lazy_frame.rb', line 1479

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  by: nil,
  start_by: "window"
)
  if !truncate.nil?
    label = truncate ? "left" : "datapoint"
  end

  index_column = Utils.parse_into_expression(index_column, str_as_lit: false)
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  if period.nil?
    period = every
  end

  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)
  every = Utils.parse_as_duration_string(every)

  rbexprs_by = by.nil? ? [] : Utils.parse_into_list_of_expressions(by)
  lgb = _ldf.group_by_dynamic(
    index_column,
    every,
    period,
    offset,
    label,
    include_boundaries,
    closed,
    rbexprs_by,
    start_by
  )
  LazyGroupBy.new(lgb)
end

#head(n = 5) ⇒ LazyFrame

Note:

Consider using the #fetch operation if you only want to test your query. The #fetch operation will load the first n rows at the scan level, whereas the #head/#limit are applied at the end.

Get the first n rows.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3, 4, 5, 6],
    "b" => [7, 8, 9, 10, 11, 12]
  }
)
lf.head.collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# │ 3   ┆ 9   │
# │ 4   ┆ 10  │
# │ 5   ┆ 11  │
# └─────┴─────┘
lf.head(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2516

def head(n = 5)
  slice(0, n)
end

#include?(key) ⇒ Boolean

Check if LazyFrame includes key.
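
Examples:

A short usage sketch with an illustrative frame:

lf = Polars::LazyFrame.new({"foo" => [1, 2], "bar" => ["a", "b"]})
lf.include?("foo")
# => true
lf.include?("ham")
# => false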

Returns:

  • (Boolean)

# File 'lib/polars/lazy_frame.rb', line 120

def include?(key)
  columns.include?(key)
end

#interpolate ⇒ LazyFrame

Interpolate intermediate values. The interpolation method is linear.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 9, 10],
    "bar" => [6, 7, 9, nil],
    "baz" => [1, nil, nil, 9]
  }
).lazy
df.interpolate.collect
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────────┐
# │ foo  ┆ bar  ┆ baz      │
# │ ---  ┆ ---  ┆ ---      │
# │ f64  ┆ f64  ┆ f64      │
# ╞══════╪══════╪══════════╡
# │ 1.0  ┆ 6.0  ┆ 1.0      │
# │ 5.0  ┆ 7.0  ┆ 3.666667 │
# │ 9.0  ┆ 9.0  ┆ 6.333333 │
# │ 10.0 ┆ null ┆ 9.0      │
# └──────┴──────┴──────────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3204

def interpolate
  select(F.col("*").interpolate)
end

#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, allow_parallel: true, force_parallel: false, coalesce: nil) ⇒ LazyFrame

Add a join operation to the Logical Plan.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
other_df = Polars::DataFrame.new(
  {
    "apple" => ["x", "y", "z"],
    "ham" => ["a", "b", "d"]
  }
).lazy
df.join(other_df, on: "ham").collect
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "full").collect
# =>
# shape: (4, 5)
# ┌──────┬──────┬──────┬───────┬───────────┐
# │ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
# │ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
# │ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
# ╞══════╪══════╪══════╪═══════╪═══════════╡
# │ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
# │ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
# │ null ┆ null ┆ null ┆ z     ┆ d         │
# │ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
# └──────┴──────┴──────┴───────┴───────────┘
df.join(other_df, on: "ham", how: "left").collect
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# │ 3   ┆ 8.0 ┆ c   ┆ null  │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "semi").collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6.0 ┆ a   │
# │ 2   ┆ 7.0 ┆ b   │
# └─────┴─────┴─────┘
df.join(other_df, on: "ham", how: "anti").collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • other (LazyFrame)

    Lazy DataFrame to join with.

  • left_on (Object) (defaults to: nil)

    Join column of the left DataFrame.

  • right_on (Object) (defaults to: nil)

    Join column of the right DataFrame.

  • on (Object) (defaults to: nil)

    Join column of both DataFrames. If set, left_on and right_on should be nil.

  • how ("inner", "left", "full", "semi", "anti", "cross") (defaults to: "inner")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • validate ('m:m', 'm:1', '1:m', '1:1') (defaults to: "m:m")

    Checks if join is of specified type.

    • many_to_many - “m:m”: default, does not result in checks
    • one_to_one - “1:1”: check if join keys are unique in both left and right datasets
    • one_to_many - “1:m”: check if join keys are unique in left dataset
    • many_to_one - “m:1”: check if join keys are unique in right dataset
  • join_nulls (Boolean) (defaults to: false)

    Join on null values. By default null values will never produce matches.

  • allow_parallel (Boolean) (defaults to: true)

    Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

  • force_parallel (Boolean) (defaults to: false)

    Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

  • coalesce (Boolean) (defaults to: nil)

    Coalescing behavior (merging of join columns).

    • nil: -> join specific.
    • true: -> Always coalesce join columns.
    • false: -> Never coalesce join columns. Note that joining on any other expressions than col will turn off coalescing.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1966

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  join_nulls: false,
  allow_parallel: true,
  force_parallel: false,
  coalesce: nil
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if how == "outer"
    how = "full"
  elsif how == "cross"
    return _from_rbldf(
      _ldf.join(
        other._ldf, [], [], allow_parallel, join_nulls, force_parallel, how, suffix, validate, coalesce
      )
    )
  end

  if !on.nil?
    rbexprs = Utils.parse_into_list_of_expressions(on)
    rbexprs_left = rbexprs
    rbexprs_right = rbexprs
  elsif !left_on.nil? && !right_on.nil?
    rbexprs_left = Utils.parse_into_list_of_expressions(left_on)
    rbexprs_right = Utils.parse_into_list_of_expressions(right_on)
  else
    raise ArgumentError, "must specify `on` OR `left_on` and `right_on`"
  end

  _from_rbldf(
    self._ldf.join(
      other._ldf,
      rbexprs_left,
      rbexprs_right,
      allow_parallel,
      force_parallel,
      join_nulls,
      how,
      suffix,
      validate,
      coalesce
    )
  )
end

#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true) ⇒ LazyFrame

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the join_asof key.

For each row in the left DataFrame:

  • A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
  • A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.

The default is "backward".

Examples:

gdp = Polars::LazyFrame.new(
  {
    "date" => Polars.date_range(
      Date.new(2016, 1, 1),
      Date.new(2020, 1, 1),
      "1y",
      eager: true
    ),
    "gdp" => [4164, 4411, 4566, 4696, 4827]
  }
)
gdp.collect
# =>
# shape: (5, 2)
# ┌────────────┬──────┐
# │ date       ┆ gdp  │
# │ ---        ┆ ---  │
# │ date       ┆ i64  │
# ╞════════════╪══════╡
# │ 2016-01-01 ┆ 4164 │
# │ 2017-01-01 ┆ 4411 │
# │ 2018-01-01 ┆ 4566 │
# │ 2019-01-01 ┆ 4696 │
# │ 2020-01-01 ┆ 4827 │
# └────────────┴──────┘
population = Polars::LazyFrame.new(
  {
    "date" => [Date.new(2016, 3, 1), Date.new(2018, 8, 1), Date.new(2019, 1, 1)],
    "population" => [82.19, 82.66, 83.12]
  }
).sort("date")
population.collect
# =>
# shape: (3, 2)
# ┌────────────┬────────────┐
# │ date       ┆ population │
# │ ---        ┆ ---        │
# │ date       ┆ f64        │
# ╞════════════╪════════════╡
# │ 2016-03-01 ┆ 82.19      │
# │ 2018-08-01 ┆ 82.66      │
# │ 2019-01-01 ┆ 83.12      │
# └────────────┴────────────┘

Note how the dates don't quite match. If we join them using join_asof and strategy: "backward", then each date from population which doesn't have an exact match is matched with the closest earlier date from gdp:

population.join_asof(gdp, on: "date", strategy: "backward").collect
# =>
# shape: (3, 3)
# ┌────────────┬────────────┬──────┐
# │ date       ┆ population ┆ gdp  │
# │ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ i64  │
# ╞════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 4164 │
# │ 2018-08-01 ┆ 82.66      ┆ 4566 │
# │ 2019-01-01 ┆ 83.12      ┆ 4696 │
# └────────────┴────────────┴──────┘
population.join_asof(
  gdp, on: "date", strategy: "backward", coalesce: false
).collect
# =>
# shape: (3, 4)
# ┌────────────┬────────────┬────────────┬──────┐
# │ date       ┆ population ┆ date_right ┆ gdp  │
# │ ---        ┆ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ date       ┆ i64  │
# ╞════════════╪════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 2016-01-01 ┆ 4164 │
# │ 2018-08-01 ┆ 82.66      ┆ 2018-01-01 ┆ 4566 │
# │ 2019-01-01 ┆ 83.12      ┆ 2019-01-01 ┆ 4696 │
# └────────────┴────────────┴────────────┴──────┘

If we instead use strategy: "forward", then each date from population which doesn't have an exact match is matched with the closest later date from gdp:

population.join_asof(gdp, on: "date", strategy: "forward").collect
# =>
# shape: (3, 3)
# ┌────────────┬────────────┬──────┐
# │ date       ┆ population ┆ gdp  │
# │ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ i64  │
# ╞════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 4411 │
# │ 2018-08-01 ┆ 82.66      ┆ 4696 │
# │ 2019-01-01 ┆ 83.12      ┆ 4696 │
# └────────────┴────────────┴──────┘
population.join_asof(gdp, on: "date", strategy: "nearest").collect
# =>
# shape: (3, 3)
# ┌────────────┬────────────┬──────┐
# │ date       ┆ population ┆ gdp  │
# │ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ i64  │
# ╞════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 4164 │
# │ 2018-08-01 ┆ 82.66      ┆ 4696 │
# │ 2019-01-01 ┆ 83.12      ┆ 4696 │
# └────────────┴────────────┴──────┘
gdp_dates = Polars.date_range(
  Date.new(2016, 1, 1), Date.new(2020, 1, 1), "1y", eager: true
)
gdp2 = Polars::LazyFrame.new(
  {
    "country" => ["Germany"] * 5 + ["Netherlands"] * 5,
    "date" => Polars.concat([gdp_dates, gdp_dates]),
    "gdp" => [4164, 4411, 4566, 4696, 4827, 784, 833, 914, 910, 909]
  }
).sort("country", "date")
gdp2.collect
# =>
# shape: (10, 3)
# ┌─────────────┬────────────┬──────┐
# │ country     ┆ date       ┆ gdp  │
# │ ---         ┆ ---        ┆ ---  │
# │ str         ┆ date       ┆ i64  │
# ╞═════════════╪════════════╪══════╡
# │ Germany     ┆ 2016-01-01 ┆ 4164 │
# │ Germany     ┆ 2017-01-01 ┆ 4411 │
# │ Germany     ┆ 2018-01-01 ┆ 4566 │
# │ Germany     ┆ 2019-01-01 ┆ 4696 │
# │ Germany     ┆ 2020-01-01 ┆ 4827 │
# │ Netherlands ┆ 2016-01-01 ┆ 784  │
# │ Netherlands ┆ 2017-01-01 ┆ 833  │
# │ Netherlands ┆ 2018-01-01 ┆ 914  │
# │ Netherlands ┆ 2019-01-01 ┆ 910  │
# │ Netherlands ┆ 2020-01-01 ┆ 909  │
# └─────────────┴────────────┴──────┘
pop2 = Polars::LazyFrame.new(
  {
    "country" => ["Germany"] * 3 + ["Netherlands"] * 3,
    "date" => [
      Date.new(2016, 3, 1),
      Date.new(2018, 8, 1),
      Date.new(2019, 1, 1),
      Date.new(2016, 3, 1),
      Date.new(2018, 8, 1),
      Date.new(2019, 1, 1)
    ],
    "population" => [82.19, 82.66, 83.12, 17.11, 17.32, 17.40]
  }
).sort("country", "date")
pop2.collect
# =>
# shape: (6, 3)
# ┌─────────────┬────────────┬────────────┐
# │ country     ┆ date       ┆ population │
# │ ---         ┆ ---        ┆ ---        │
# │ str         ┆ date       ┆ f64        │
# ╞═════════════╪════════════╪════════════╡
# │ Germany     ┆ 2016-03-01 ┆ 82.19      │
# │ Germany     ┆ 2018-08-01 ┆ 82.66      │
# │ Germany     ┆ 2019-01-01 ┆ 83.12      │
# │ Netherlands ┆ 2016-03-01 ┆ 17.11      │
# │ Netherlands ┆ 2018-08-01 ┆ 17.32      │
# │ Netherlands ┆ 2019-01-01 ┆ 17.4       │
# └─────────────┴────────────┴────────────┘
pop2.join_asof(gdp2, by: "country", on: "date", strategy: "nearest").collect
# =>
# shape: (6, 4)
# ┌─────────────┬────────────┬────────────┬──────┐
# │ country     ┆ date       ┆ population ┆ gdp  │
# │ ---         ┆ ---        ┆ ---        ┆ ---  │
# │ str         ┆ date       ┆ f64        ┆ i64  │
# ╞═════════════╪════════════╪════════════╪══════╡
# │ Germany     ┆ 2016-03-01 ┆ 82.19      ┆ 4164 │
# │ Germany     ┆ 2018-08-01 ┆ 82.66      ┆ 4696 │
# │ Germany     ┆ 2019-01-01 ┆ 83.12      ┆ 4696 │
# │ Netherlands ┆ 2016-03-01 ┆ 17.11      ┆ 784  │
# │ Netherlands ┆ 2018-08-01 ┆ 17.32      ┆ 910  │
# │ Netherlands ┆ 2019-01-01 ┆ 17.4       ┆ 910  │
# └─────────────┴────────────┴────────────┴──────┘

Parameters:

  • other (LazyFrame)

    Lazy DataFrame to join with.

  • left_on (String) (defaults to: nil)

    Join column of the left DataFrame.

  • right_on (String) (defaults to: nil)

    Join column of the right DataFrame.

  • on (String) (defaults to: nil)

    Join column of both DataFrames. If set, left_on and right_on should be nil.

  • by (Object) (defaults to: nil)

    Join on these columns before doing asof join.

  • by_left (Object) (defaults to: nil)

    Join on these columns before doing asof join.

  • by_right (Object) (defaults to: nil)

    Join on these columns before doing asof join.

  • strategy ("backward", "forward") (defaults to: "backward")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • tolerance (Object) (defaults to: nil)

    Numeric tolerance. By setting this, the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype "Date", "Datetime", "Duration" or "Time", you can use the following string language:

    • 1ns (1 nanosecond)
    • 1us (1 microsecond)
    • 1ms (1 millisecond)
    • 1s (1 second)
    • 1m (1 minute)
    • 1h (1 hour)
    • 1d (1 day)
    • 1w (1 week)
    • 1mo (1 calendar month)
    • 1y (1 calendar year)
    • 1i (1 index count)

    Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

  • allow_parallel (Boolean) (defaults to: true)

    Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

  • force_parallel (Boolean) (defaults to: false)

    Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

  • coalesce (Boolean) (defaults to: true)

    Coalescing behavior (merging of join columns).

    • true: -> Always coalesce join columns.
    • false: -> Never coalesce join columns. Note that joining on any other expressions than col will turn off coalescing.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1775

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if on.is_a?(::String)
    left_on = on
    right_on = on
  end

  if left_on.nil? || right_on.nil?
    raise ArgumentError, "You should pass the column to join on as an argument."
  end

  if by_left.is_a?(::String) || by_left.is_a?(Expr)
    by_left_ = [by_left]
  else
    by_left_ = by_left
  end

  if by_right.is_a?(::String) || by_right.is_a?(Expr)
    by_right_ = [by_right]
  else
    by_right_ = by_right
  end

  if by.is_a?(::String)
    by_left_ = [by]
    by_right_ = [by]
  elsif by.is_a?(::Array)
    by_left_ = by
    by_right_ = by
  end

  tolerance_str = nil
  tolerance_num = nil
  if tolerance.is_a?(::String)
    tolerance_str = tolerance
  else
    tolerance_num = tolerance
  end

  _from_rbldf(
    _ldf.join_asof(
      other._ldf,
      Polars.col(left_on)._rbexpr,
      Polars.col(right_on)._rbexpr,
      by_left_,
      by_right_,
      allow_parallel,
      force_parallel,
      suffix,
      strategy,
      tolerance_num,
      tolerance_str,
      coalesce
    )
  )
end

#last ⇒ LazyFrame

Get the last row of the DataFrame.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 5, 3],
    "b" => [2, 4, 6]
  }
)
lf.last.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 3   ┆ 6   │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2586

def last
  tail(1)
end

#lazy ⇒ LazyFrame

Return lazy representation, i.e. itself.

Useful for writing code that expects either a DataFrame or LazyFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil]
  }
)
df.lazy

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 840

def lazy
  self
end

#limit(n = 5) ⇒ LazyFrame

Note:

Consider using the #fetch operation if you only want to test your query. The #fetch operation will load the first n rows at the scan level, whereas the #head/#limit are applied at the end.

Get the first n rows.

Alias for #head.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3, 4, 5, 6],
    "b" => [7, 8, 9, 10, 11, 12]
  }
)
lf.limit.collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# │ 3   ┆ 9   │
# │ 4   ┆ 10  │
# │ 5   ┆ 11  │
# └─────┴─────┘
lf.limit(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2466

def limit(n = 5)
  head(n)
end

#max ⇒ LazyFrame

Aggregate the columns in the DataFrame to their maximum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.max.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 4   ┆ 2   │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2864

def max
  _from_rbldf(_ldf.max)
end

#mean ⇒ LazyFrame

Aggregate the columns in the DataFrame to their mean value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.mean.collect
# =>
# shape: (1, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ f64 ┆ f64  │
# ╞═════╪══════╡
# │ 2.5 ┆ 1.25 │
# └─────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2924

def mean
  _from_rbldf(_ldf.mean)
end

#median ⇒ LazyFrame

Aggregate the columns in the DataFrame to their median value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.median.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ f64 ┆ f64 │
# ╞═════╪═════╡
# │ 2.5 ┆ 1.0 │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2944

def median
  _from_rbldf(_ldf.median)
end

#merge_sorted(other, key) ⇒ LazyFrame

Take two sorted DataFrames and merge them by the sorted key.

The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted by that key; otherwise the output will not make sense.

The schemas of both LazyFrames must be equal.

Examples:

df0 = Polars::LazyFrame.new(
  {"name" => ["steve", "elise", "bob"], "age" => [42, 44, 18]}
).sort("age")
df1 = Polars::LazyFrame.new(
  {"name" => ["anna", "megan", "steve", "thomas"], "age" => [21, 33, 42, 20]}
).sort("age")
df0.merge_sorted(df1, "age").collect
# =>
# shape: (7, 2)
# ┌────────┬─────┐
# │ name   ┆ age │
# │ ---    ┆ --- │
# │ str    ┆ i64 │
# ╞════════╪═════╡
# │ bob    ┆ 18  │
# │ thomas ┆ 20  │
# │ anna   ┆ 21  │
# │ megan  ┆ 33  │
# │ steve  ┆ 42  │
# │ steve  ┆ 42  │
# │ elise  ┆ 44  │
# └────────┴─────┘

Parameters:

  • other (DataFrame)

    Other DataFrame that must be merged.

  • key (String)

    Key that is sorted.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3304

def merge_sorted(other, key)
  _from_rbldf(_ldf.merge_sorted(other._ldf, key))
end

#min ⇒ LazyFrame

Aggregate the columns in the DataFrame to their minimum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.min.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 1   │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2884

def min
  _from_rbldf(_ldf.min)
end

#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Examples:

cast_str_to_int = lambda do |data, col_name:|
  data.with_column(Polars.col(col_name).cast(:i64))
end

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => ["10", "20", "30", "40"]}).lazy
df.pipe(cast_str_to_int, col_name: "b").collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 10  │
# │ 2   ┆ 20  │
# │ 3   ┆ 30  │
# │ 4   ┆ 40  │
# └─────┴─────┘

Parameters:

  • func (Object)

    Callable; will receive the frame as the first parameter, followed by any given args/kwargs.

  • args (Object)

    Arguments to pass to the UDF.

  • kwargs (Object)

    Keyword arguments to pass to the UDF.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 185

def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end

#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame

Aggregate the columns in the DataFrame to their quantile value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.quantile(0.7).collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ f64 ┆ f64 │
# ╞═════╪═════╡
# │ 3.0 ┆ 1.0 │
# └─────┴─────┘

Parameters:

  • quantile (Float)

    Quantile between 0.0 and 1.0.

  • interpolation ("nearest", "higher", "lower", "midpoint", "linear") (defaults to: "nearest")

    Interpolation method.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2969

def quantile(quantile, interpolation: "nearest")
  quantile = Utils.parse_into_expression(quantile, str_as_lit: false)
  _from_rbldf(_ldf.quantile(quantile, interpolation))
end

#rename(mapping, strict: true) ⇒ LazyFrame

Rename column names.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
lf.rename({"foo" => "apple"}).collect
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ apple ┆ bar ┆ ham │
# │ ---   ┆ --- ┆ --- │
# │ i64   ┆ i64 ┆ str │
# ╞═══════╪═════╪═════╡
# │ 1     ┆ 6   ┆ a   │
# │ 2     ┆ 7   ┆ b   │
# │ 3     ┆ 8   ┆ c   │
# └───────┴─────┴─────┘
lf.rename(->(column_name) { "c" + column_name[1..] }).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ coo ┆ car ┆ cam │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • mapping (Hash)

    Key value pairs that map from old name to new name.

  • strict (Boolean) (defaults to: true)

    Validate that all column names exist in the current schema, and throw an exception if any do not. (Note that this parameter is a no-op when passing a function to mapping).

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2249

def rename(mapping, strict: true)
  if mapping.respond_to?(:call)
    select(F.all.name.map(&mapping))
  else
    existing = mapping.keys
    _new = mapping.values
    _from_rbldf(_ldf.rename(existing, _new, strict))
  end
end

#reverse ⇒ LazyFrame

Reverse the DataFrame.

Examples:

lf = Polars::LazyFrame.new(
  {
    "key" => ["a", "b", "c"],
    "val" => [1, 2, 3]
  }
)
lf.reverse.collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ key ┆ val │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ c   ┆ 3   │
# │ b   ┆ 2   │
# │ a   ┆ 1   │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2282

def reverse
  _from_rbldf(_ldf.reverse)
end

#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ LazyFrame Also known as: group_by_rolling, groupby_rolling

Create rolling groups based on a time column.

Also works for index values of type :i32 or :i64.

Different from group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals, use group_by_dynamic.

The period and offset arguments are created either from a timedelta, or by using the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a group_by_rolling on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10

Examples:

dates = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
]
df = Polars::LazyFrame.new({"dt" => dates, "a" => [3, 7, 5, 9, 2, 1]}).with_column(
  Polars.col("dt").str.strptime(Polars::Datetime).set_sorted
)
df.rolling(index_column: "dt", period: "2d").agg(
  [
    Polars.sum("a").alias("sum_a"),
    Polars.min("a").alias("min_a"),
    Polars.max("a").alias("max_a")
  ]
).collect
# =>
# shape: (6, 4)
# ┌─────────────────────┬───────┬───────┬───────┐
# │ dt                  ┆ sum_a ┆ min_a ┆ max_a │
# │ ---                 ┆ ---   ┆ ---   ┆ ---   │
# │ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
# ╞═════════════════════╪═══════╪═══════╪═══════╡
# │ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
# │ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
# │ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
# │ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
# │ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
# │ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
# └─────────────────────┴───────┴───────┴───────┘
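
A hedged sketch of rolling over an integer index column using the "i" windows described above (column names are illustrative; assumed to behave like the temporal example, output omitted):

lf = Polars::LazyFrame.new({"idx" => [0, 1, 2, 3, 4], "a" => [3, 7, 5, 9, 2]}).with_column(
  Polars.col("idx").set_sorted
)
lf.rolling(index_column: "idx", period: "3i").agg(Polars.sum("a").alias("sum_a")).collect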

Parameters:

  • index_column (Object)

    Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if not, the output will not make sense.

    In case of a rolling group by on indices, dtype needs to be one of :i32, :i64. Note that :i32 gets temporarily cast to :i64, so if performance matters use an :i64 column.

  • period (Object)

    Length of the window.

  • offset (Object) (defaults to: nil)

    Offset of the window. Default is -period.

  • closed ("right", "left", "both", "none") (defaults to: "right")

    Define which sides of the temporal interval are closed (inclusive).

  • by (Object) (defaults to: nil)

    Also group by this column/these columns.

Returns:



1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
# File 'lib/polars/lazy_frame.rb', line 1224

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  by: nil
)
  index_column = Utils.parse_into_expression(index_column)
  if offset.nil?
    offset = Utils.negate_duration_string(Utils.parse_as_duration_string(period))
  end

  rbexprs_by = (
    !by.nil? ? Utils.parse_into_list_of_expressions(by) : []
  )
  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)

  lgb = _ldf.rolling(index_column, period, offset, closed, rbexprs_by)
  LazyGroupBy.new(lgb)
end

#schemaHash

Get the schema.

Examples:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.schema
# => {"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String}

Returns:

  • (Hash)


101
102
103
# File 'lib/polars/lazy_frame.rb', line 101

def schema
  _ldf.collect_schema
end

#select(*exprs, **named_exprs) ⇒ LazyFrame

Select columns from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"],
  }
).lazy
df.select("foo").collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘
df.select(["foo", "bar"]).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 6   │
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# └─────┴─────┘
df.select(Polars.col("foo") + 1).collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 2   │
# │ 3   │
# │ 4   │
# └─────┘
df.select([Polars.col("foo") + 1, Polars.col("bar") + 1]).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# │ 4   ┆ 9   │
# └─────┴─────┘
df.select(Polars.when(Polars.col("foo") > 2).then(10).otherwise(0)).collect
# =>
# shape: (3, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ i32     │
# ╞═════════╡
# │ 0       │
# │ 0       │
# │ 10      │
# └─────────┘

Parameters:

  • exprs (Array)

    Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names; other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:



1091
1092
1093
1094
1095
1096
1097
1098
# File 'lib/polars/lazy_frame.rb', line 1091

def select(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.select(rbexprs))
end

#set_sorted(column, descending: false) ⇒ LazyFrame

Indicate that one or multiple columns are sorted.
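
A minimal sketch (column names are illustrative): flag a column as sorted so that downstream operations can rely on the sorted flag without re-checking the data.

lf = Polars::LazyFrame.new({"dt" => [1, 2, 3], "values" => [3.0, 2.5, 4.1]})
lf.set_sorted("dt")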

Parameters:

  • column (Object)

    Columns that are sorted.

  • descending (Boolean) (defaults to: false)

    Whether the columns are sorted in descending order.

Returns:



3316
3317
3318
3319
3320
3321
3322
3323
3324
3325
# File 'lib/polars/lazy_frame.rb', line 3316

def set_sorted(
  column,
  descending: false
)
  if !Utils.strlike?(column)
    msg = "expected a 'str' for argument 'column' in 'set_sorted'"
    raise TypeError, msg
  end
  with_columns(F.col(column).set_sorted(descending: descending))
end

#shift(n, fill_value: nil) ⇒ LazyFrame

Shift the values by a given period.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.shift(1).collect
# =>
# shape: (3, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ i64  ┆ i64  │
# ╞══════╪══════╡
# │ null ┆ null │
# │ 1    ┆ 2    │
# │ 3    ┆ 4    │
# └──────┴──────┘
df.shift(-1).collect
# =>
# shape: (3, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ i64  ┆ i64  │
# ╞══════╪══════╡
# │ 3    ┆ 4    │
# │ 5    ┆ 6    │
# │ null ┆ null │
# └──────┴──────┘

Parameters:

  • n (Integer)

    Number of places to shift (may be negative).

  • fill_value (Object) (defaults to: nil)

    Fill the resulting null values with this value.

Returns:



2328
2329
2330
2331
2332
2333
2334
# File 'lib/polars/lazy_frame.rb', line 2328

def shift(n, fill_value: nil)
  if !fill_value.nil?
    fill_value = Utils.parse_into_expression(fill_value, str_as_lit: true)
  end
  n = Utils.parse_into_expression(n)
  _from_rbldf(_ldf.shift(n, fill_value))
end

#shift_and_fill(periods, fill_value) ⇒ LazyFrame

Shift the values by a given period and fill the resulting null values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.shift_and_fill(1, 0).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 0   ┆ 0   │
# │ 1   ┆ 2   │
# │ 3   ┆ 4   │
# └─────┴─────┘
df.shift_and_fill(-1, 0).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 3   ┆ 4   │
# │ 5   ┆ 6   │
# │ 0   ┆ 0   │
# └─────┴─────┘

Parameters:

  • periods (Integer)

    Number of places to shift (may be negative).

  • fill_value (Object)

    Fill nil values with the result of this expression.

Returns:



2378
2379
2380
# File 'lib/polars/lazy_frame.rb', line 2378

def shift_and_fill(periods, fill_value)
  shift(periods, fill_value: fill_value)
end

#sink_csv(path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame

Evaluate the query in streaming mode and write to a CSV file.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_csv("out.csv")
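
A hedged sketch combining a few of the options documented below (the file path and values are illustrative):

lf.sink_csv("out.csv", separator: ";", null_value: "NA", quote_style: "non_numeric")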

Parameters:

  • path (String)

    File path to which the file should be written.

  • include_bom (Boolean) (defaults to: false)

    Whether to include UTF-8 BOM in the CSV output.

  • include_header (Boolean) (defaults to: true)

    Whether to include header in the CSV output.

  • separator (String) (defaults to: ",")

    Separate CSV fields with this symbol.

  • line_terminator (String) (defaults to: "\n")

    String used to end each row.

  • quote_char (String) (defaults to: '"')

    Byte to use as quoting character.

  • batch_size (Integer) (defaults to: 1024)

    Number of rows that will be processed per thread.

  • datetime_format (String) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate (https://docs.rs/chrono/latest/chrono/format/strftime/index.html). If no format is specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame's Datetime columns (if any).

  • date_format (String) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate (https://docs.rs/chrono/latest/chrono/format/strftime/index.html).

  • time_format (String) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate (https://docs.rs/chrono/latest/chrono/format/strftime/index.html).

  • float_precision (Integer) (defaults to: nil)

    Number of decimal places to write, applied to both Float32 and Float64 datatypes.

  • null_value (String) (defaults to: nil)

    A string representing null values (defaulting to the empty string).

  • quote_style ("necessary", "always", "non_numeric", "never") (defaults to: nil)

    Determines the quoting strategy used.

    • necessary (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
    • always: This puts quotes around every field. Always.
    • never: This never puts quotes around fields, even if that results in invalid CSV data (e.g. by not quoting strings containing the separator).
    • non_numeric: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, quotes will be used even if they aren't strictly necessary.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • type_coercion (Boolean) (defaults to: true)

    Do type coercion optimization.

  • predicate_pushdown (Boolean) (defaults to: true)

    Do predicate pushdown optimization.

  • projection_pushdown (Boolean) (defaults to: true)

    Do projection pushdown optimization.

  • simplify_expression (Boolean) (defaults to: true)

    Run simplify expressions optimization.

  • slice_pushdown (Boolean) (defaults to: true)

    Slice pushdown optimization.

  • no_optimization (Boolean) (defaults to: false)

    Turn off (certain) optimizations.

Returns:



606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
# File 'lib/polars/lazy_frame.rb', line 606

def sink_csv(
  path,
  include_bom: false,
  include_header: true,
  separator: ",",
  line_terminator: "\n",
  quote_char: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_scientific: nil,
  float_precision: nil,
  null_value: nil,
  quote_style: nil,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  Utils._check_arg_is_1byte("separator", separator, false)
  Utils._check_arg_is_1byte("quote_char", quote_char, false)

  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_csv(
    path,
    include_bom,
    include_header,
    separator.ord,
    line_terminator,
    quote_char.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_scientific,
    float_precision,
    null_value,
    quote_style,
    maintain_order
  )
end

#sink_ipc(path, compression: "zstd", maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame

Evaluate the query in streaming mode and write to an IPC file.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_ipc("out.arrow")

Parameters:

  • path (String)

    File path to which the file should be written.

  • compression ("lz4", "zstd") (defaults to: "zstd")

    Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • type_coercion (Boolean) (defaults to: true)

    Do type coercion optimization.

  • predicate_pushdown (Boolean) (defaults to: true)

    Do predicate pushdown optimization.

  • projection_pushdown (Boolean) (defaults to: true)

    Do projection pushdown optimization.

  • simplify_expression (Boolean) (defaults to: true)

    Run simplify expressions optimization.

  • slice_pushdown (Boolean) (defaults to: true)

    Slice pushdown optimization.

  • no_optimization (Boolean) (defaults to: false)

    Turn off (certain) optimizations.

Returns:



504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
# File 'lib/polars/lazy_frame.rb', line 504

def sink_ipc(
  path,
  compression: "zstd",
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_ipc(
    path,
    compression,
    maintain_order
  )
end

#sink_ndjson(path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame

Evaluate the query in streaming mode and write to an NDJSON file.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_ndjson("out.ndjson")

Parameters:

  • path (String)

    File path to which the file should be written.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • type_coercion (Boolean) (defaults to: true)

    Do type coercion optimization.

  • predicate_pushdown (Boolean) (defaults to: true)

    Do predicate pushdown optimization.

  • projection_pushdown (Boolean) (defaults to: true)

    Do projection pushdown optimization.

  • simplify_expression (Boolean) (defaults to: true)

    Run simplify expressions optimization.

  • slice_pushdown (Boolean) (defaults to: true)

    Slice pushdown optimization.

  • no_optimization (Boolean) (defaults to: false)

    Turn off (certain) optimizations.

Returns:



687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
# File 'lib/polars/lazy_frame.rb', line 687

def sink_ndjson(
  path,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_json(path, maintain_order)
end

#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true) ⇒ DataFrame

Evaluate the query in streaming mode and write to a Parquet file at the provided path.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_parquet("out.parquet")
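
A hedged sketch with explicit compression settings (values are illustrative and fall within the ranges listed below):

lf.sink_parquet("out.parquet", compression: "zstd", compression_level: 10, row_group_size: 100_000)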

Parameters:

  • path (String)

    File path to which the file should be written.

  • compression ("lz4", "uncompressed", "snappy", "gzip", "lzo", "brotli", "zstd") (defaults to: "zstd")

    Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression. Choose "snappy" for more backwards compatibility guarantees when you deal with older parquet readers.

  • compression_level (Integer) (defaults to: nil)

    The level of compression to use. Higher compression means smaller files on disk.

    • "gzip" : min-level: 0, max-level: 10.
    • "brotli" : min-level: 0, max-level: 11.
    • "zstd" : min-level: 1, max-level: 22.
  • statistics (Boolean) (defaults to: true)

    Write statistics to the parquet headers. This requires extra compute.

  • row_group_size (Integer) (defaults to: nil)

    Size of the row groups in number of rows. If nil (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.

  • data_pagesize_limit (Integer) (defaults to: nil)

    Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • type_coercion (Boolean) (defaults to: true)

    Do type coercion optimization.

  • predicate_pushdown (Boolean) (defaults to: true)

    Do predicate pushdown optimization.

  • projection_pushdown (Boolean) (defaults to: true)

    Do projection pushdown optimization.

  • simplify_expression (Boolean) (defaults to: true)

    Run simplify expressions optimization.

  • no_optimization (Boolean) (defaults to: false)

    Turn off (certain) optimizations.

  • slice_pushdown (Boolean) (defaults to: true)

    Slice pushdown optimization.

Returns:



421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
# File 'lib/polars/lazy_frame.rb', line 421

def sink_parquet(
  path,
  compression: "zstd",
  compression_level: nil,
  statistics: true,
  row_group_size: nil,
  data_pagesize_limit: nil,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  no_optimization: false,
  slice_pushdown: true
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  if statistics == true
    statistics = {
      min: true,
      max: true,
      distinct_count: false,
      null_count: true
    }
  elsif statistics == false
    statistics = {}
  elsif statistics == "full"
    statistics = {
      min: true,
      max: true,
      distinct_count: true,
      null_count: true
    }
  end

  lf.sink_parquet(
    path,
    compression,
    compression_level,
    statistics,
    row_group_size,
    data_pagesize_limit,
    maintain_order
  )
end

#slice(offset, length = nil) ⇒ LazyFrame

Get a slice of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
).lazy
df.slice(1, 2).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ y   ┆ 3   ┆ 4   │
# │ z   ┆ 5   ┆ 6   │
# └─────┴─────┴─────┘
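
A hedged sketch of the negative indexing mentioned below (output omitted): a negative offset counts back from the end of the frame.

df.slice(-2, 2).collect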

Parameters:

  • offset (Integer)

    Start index. Negative indexing is supported.

  • length (Integer) (defaults to: nil)

    Length of the slice. If set to nil, all rows starting at the offset will be selected.

Returns:



2411
2412
2413
2414
2415
2416
# File 'lib/polars/lazy_frame.rb', line 2411

def slice(offset, length = nil)
  if length && length < 0
    raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame"
  end
  _from_rbldf(_ldf.slice(offset, length))
end

#sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame

Sort the DataFrame.

Sorting can be done by:

  • A single column name
  • An expression
  • Multiple expressions

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
df.sort("foo", reverse: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘
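
A hedged sketch of sorting by multiple columns, passing reverse as an array (assumed to apply per column; output omitted):

df.sort(["foo", "bar"], reverse: [true, false]).collect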

Parameters:

  • by (Object)

    Column(s) or expression(s) to sort by.

  • reverse (Boolean) (defaults to: false)

    Sort in descending order.

  • nulls_last (Boolean) (defaults to: false)

    Place null values last. Can only be used if sorted by a single column.

Returns:



264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
# File 'lib/polars/lazy_frame.rb', line 264

def sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true)
  if by.is_a?(::String) && more_by.empty?
    return _from_rbldf(
      _ldf.sort(
        by, reverse, nulls_last, maintain_order, multithreaded
      )
    )
  end

  by = Utils.parse_into_list_of_expressions(by, *more_by)
  reverse = Utils.extend_bool(reverse, by.length, "reverse", "by")
  nulls_last = Utils.extend_bool(nulls_last, by.length, "nulls_last", "by")
  _from_rbldf(
    _ldf.sort_by_exprs(
      by, reverse, nulls_last, maintain_order, multithreaded
    )
  )
end

#std(ddof: 1) ⇒ LazyFrame

Aggregate the columns in the DataFrame to their standard deviation value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.std.collect
# =>
# shape: (1, 2)
# ┌──────────┬─────┐
# │ a        ┆ b   │
# │ ---      ┆ --- │
# │ f64      ┆ f64 │
# ╞══════════╪═════╡
# │ 1.290994 ┆ 0.5 │
# └──────────┴─────┘
df.std(ddof: 0).collect
# =>
# shape: (1, 2)
# ┌──────────┬──────────┐
# │ a        ┆ b        │
# │ ---      ┆ ---      │
# │ f64      ┆ f64      │
# ╞══════════╪══════════╡
# │ 1.118034 ┆ 0.433013 │
# └──────────┴──────────┘

Returns:



2812
2813
2814
# File 'lib/polars/lazy_frame.rb', line 2812

def std(ddof: 1)
  _from_rbldf(_ldf.std(ddof))
end

#sumLazyFrame

Aggregate the columns in the DataFrame to their sum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.sum.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 10  ┆ 5   │
# └─────┴─────┘

Returns:



2904
2905
2906
# File 'lib/polars/lazy_frame.rb', line 2904

def sum
  _from_rbldf(_ldf.sum)
end

#tail(n = 5) ⇒ LazyFrame

Get the last n rows.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3, 4, 5, 6],
    "b" => [7, 8, 9, 10, 11, 12]
  }
)
lf.tail.collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 8   │
# │ 3   ┆ 9   │
# │ 4   ┆ 10  │
# │ 5   ┆ 11  │
# │ 6   ┆ 12  │
# └─────┴─────┘
lf.tail(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 5   ┆ 11  │
# │ 6   ┆ 12  │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows.

Returns:



2561
2562
2563
# File 'lib/polars/lazy_frame.rb', line 2561

def tail(n = 5)
  _from_rbldf(_ldf.tail(n))
end

#take_every(n) ⇒ LazyFrame

Take every nth row in the LazyFrame and return as a new LazyFrame.

Examples:

s = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [5, 6, 7, 8]}).lazy
s.take_every(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 5   │
# │ 3   ┆ 7   │
# └─────┴─────┘

Returns:



2669
2670
2671
# File 'lib/polars/lazy_frame.rb', line 2669

def take_every(n)
  select(F.col("*").take_every(n))
end

#to_sString

Returns a string representing the LazyFrame.

Returns:



132
133
134
135
136
137
138
# File 'lib/polars/lazy_frame.rb', line 132

def to_s
  <<~EOS
    naive plan: (run LazyFrame#describe_optimized_plan to see the optimized plan)

    #{describe_plan}
  EOS
end

#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame

Drop duplicate rows from this DataFrame.

Note that this fails if there is a column of type List in the DataFrame or subset.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3, 1],
    "bar" => ["a", "a", "a", "a"],
    "ham" => ["b", "b", "b", "b"]
  }
)
lf.unique(maintain_order: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ a   ┆ b   │
# │ 2   ┆ a   ┆ b   │
# │ 3   ┆ a   ┆ b   │
# └─────┴─────┴─────┘
lf.unique(subset: ["bar", "ham"], maintain_order: true).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ a   ┆ b   │
# └─────┴─────┴─────┘
lf.unique(keep: "last", maintain_order: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 2   ┆ a   ┆ b   │
# │ 3   ┆ a   ┆ b   │
# │ 1   ┆ a   ┆ b   │
# └─────┴─────┴─────┘

Parameters:

  • maintain_order (Boolean) (defaults to: true)

    Keep the same order as the original DataFrame. This requires more work to compute.

  • subset (Object) (defaults to: nil)

    Subset to use to compare rows.

  • keep ("first", "last") (defaults to: "first")

    Which of the duplicate rows to keep.

Returns:



3068
3069
3070
3071
3072
3073
# File 'lib/polars/lazy_frame.rb', line 3068

def unique(maintain_order: true, subset: nil, keep: "first")
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.unique(maintain_order, subset, keep))
end

#unnest(names) ⇒ LazyFrame

Decompose a struct into its fields.

The fields will be inserted into the DataFrame at the location of the struct type.

Examples:

df = (
  Polars::DataFrame.new(
    {
      "before" => ["foo", "bar"],
      "t_a" => [1, 2],
      "t_b" => ["a", "b"],
      "t_c" => [true, nil],
      "t_d" => [[1, 2], [3]],
      "after" => ["baz", "womp"]
    }
  )
  .lazy
  .select(
    ["before", Polars.struct(Polars.col("^t_.$")).alias("t_struct"), "after"]
  )
)
df.fetch
# =>
# shape: (2, 3)
# ┌────────┬─────────────────────┬───────┐
# │ before ┆ t_struct            ┆ after │
# │ ---    ┆ ---                 ┆ ---   │
# │ str    ┆ struct[4]           ┆ str   │
# ╞════════╪═════════════════════╪═══════╡
# │ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
# │ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
# └────────┴─────────────────────┴───────┘
df.unnest("t_struct").fetch
# =>
# shape: (2, 6)
# ┌────────┬─────┬─────┬──────┬───────────┬───────┐
# │ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
# │ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
# │ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
# ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
# │ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
# │ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
# └────────┴─────┴─────┴──────┴───────────┴───────┘

Parameters:

  • names (Object)

    Names of the struct columns that will be decomposed into their fields.

Returns:



3259
3260
3261
3262
3263
3264
# File 'lib/polars/lazy_frame.rb', line 3259

def unnest(names)
  if names.is_a?(::String)
    names = [names]
  end
  _from_rbldf(_ldf.unnest(names))
end

#unpivot(on, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame Also known as: melt

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index), while all other columns, considered measured variables (on), are "unpivoted" to the row axis, leaving just two non-identifier columns: 'variable' and 'value'.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
)
lf.unpivot(Polars.cs.numeric, index: "a").collect
# =>
# shape: (6, 3)
# ┌─────┬──────────┬───────┐
# │ a   ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ x   ┆ b        ┆ 1     │
# │ y   ┆ b        ┆ 3     │
# │ z   ┆ b        ┆ 5     │
# │ x   ┆ c        ┆ 2     │
# │ y   ┆ c        ┆ 4     │
# │ z   ┆ c        ┆ 6     │
# └─────┴──────────┴───────┘

Parameters:

  • on (Object)

    Column(s) or selector(s) to use as value variables; if on is empty, all columns that are not in index will be used.

  • index (Object) (defaults to: nil)

    Column(s) or selector(s) to use as identifier variables.

  • variable_name (String) (defaults to: nil)

    Name to give to the variable column. Defaults to "variable".

  • value_name (String) (defaults to: nil)

    Name to give to the value column. Defaults to "value".

  • streamable (Boolean) (defaults to: true)

    Allow this node to run in the streaming engine. If this runs in streaming, the output of the unpivot operation will not have a stable ordering.

Returns:



3156
3157
3158
3159
3160
3161
3162
3163
3164
3165
3166
3167
3168
3169
3170
3171
3172
3173
# File 'lib/polars/lazy_frame.rb', line 3156

def unpivot(
  on,
  index: nil,
  variable_name: nil,
  value_name: nil,
  streamable: true
)
  if !streamable
    warn "The `streamable` parameter for `LazyFrame.unpivot` is deprecated"
  end

  on = on.nil? ? [] : Utils.parse_into_list_of_expressions(on)
  index = index.nil? ? [] : Utils.parse_into_list_of_expressions(index)

  _from_rbldf(
    _ldf.unpivot(on, index, value_name, variable_name)
  )
end

#var(ddof: 1) ⇒ LazyFrame

Aggregate the columns in the DataFrame to their variance value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.var.collect
# =>
# shape: (1, 2)
# ┌──────────┬──────┐
# │ a        ┆ b    │
# │ ---      ┆ ---  │
# │ f64      ┆ f64  │
# ╞══════════╪══════╡
# │ 1.666667 ┆ 0.25 │
# └──────────┴──────┘
df.var(ddof: 0).collect
# =>
# shape: (1, 2)
# ┌──────┬────────┐
# │ a    ┆ b      │
# │ ---  ┆ ---    │
# │ f64  ┆ f64    │
# ╞══════╪════════╡
# │ 1.25 ┆ 0.1875 │
# └──────┴────────┘

Returns:



2844
2845
2846
# File 'lib/polars/lazy_frame.rb', line 2844

def var(ddof: 1)
  _from_rbldf(_ldf.var(ddof))
end

#widthInteger

Get the width of the LazyFrame.

Examples:

lf = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]}).lazy
lf.width
# => 2

Returns:

  • (Integer)


113
114
115
# File 'lib/polars/lazy_frame.rb', line 113

def width
  _ldf.collect_schema.length
end

#with_column(column) ⇒ LazyFrame

Add or overwrite a column in a DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.with_column((Polars.col("b") ** 2).alias("b_squared")).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬───────────┐
# │ a   ┆ b   ┆ b_squared │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ i64 ┆ i64       │
# ╞═════╪═════╪═══════════╡
# │ 1   ┆ 2   ┆ 4         │
# │ 3   ┆ 4   ┆ 16        │
# │ 5   ┆ 6   ┆ 36        │
# └─────┴─────┴───────────┘
df.with_column(Polars.col("a") ** 2).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# │ 9   ┆ 4   │
# │ 25  ┆ 6   │
# └─────┴─────┘

Parameters:

  • column (Object)

    Expression that evaluates to a column, or a Series to use.

Returns:



2139
2140
2141
# File 'lib/polars/lazy_frame.rb', line 2139

def with_column(column)
  with_columns([column])
end

#with_columns(*exprs, **named_exprs) ⇒ LazyFrame

Add or overwrite multiple columns in a DataFrame.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
).lazy
ldf.with_columns(
  [
    (Polars.col("a") ** 2).alias("a^2"),
    (Polars.col("b") / 2).alias("b/2"),
    (Polars.col("c").is_not).alias("not c")
  ]
).collect
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪═════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴─────┴──────┴───────┘
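
A hedged sketch using keyword arguments to name the added columns (assumed to behave like the named expressions of select; output omitted):

ldf.with_columns(
  a_squared: Polars.col("a") ** 2,
  half_b: Polars.col("b") / 2
).collect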

Parameters:

  • exprs (Object)

    List of expressions that evaluate to columns, specified as positional arguments.

  • named_exprs (Hash)

    Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:



2054
2055
2056
2057
2058
2059
2060
# File 'lib/polars/lazy_frame.rb', line 2054

def with_columns(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(*exprs, **named_exprs, __structify: structify)

  _from_rbldf(_ldf.with_columns(rbexprs))
end

#with_context(other) ⇒ LazyFrame

Add an external context to the computation graph.

This allows expressions to also access columns from DataFrames that are not part of this one.

Examples:

df_a = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => ["a", "c", nil]}).lazy
df_other = Polars::DataFrame.new({"c" => ["foo", "ham"]})
(
  df_a.with_context(df_other.lazy).select(
    [Polars.col("b") + Polars.col("c").first]
  )
).collect
# =>
# shape: (3, 1)
# ┌──────┐
# │ b    │
# │ ---  │
# │ str  │
# ╞══════╡
# │ afoo │
# │ cfoo │
# │ null │
# └──────┘

Parameters:

  • other (Object)

    Lazy DataFrame(s) to add as external context.

Returns:



2091
2092
2093
2094
2095
2096
2097
# File 'lib/polars/lazy_frame.rb', line 2091

def with_context(other)
  if !other.is_a?(::Array)
    other = [other]
  end

  _from_rbldf(_ldf.with_context(other.map(&:_ldf)))
end

#with_row_index(name: "index", offset: 0) ⇒ LazyFrame Also known as: with_row_count

Note:

This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.

Add a column at index 0 that counts the rows.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.with_row_index.collect
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ index ┆ a   ┆ b   │
# │ ---   ┆ --- ┆ --- │
# │ u32   ┆ i64 ┆ i64 │
# ╞═══════╪═════╪═════╡
# │ 0     ┆ 1   ┆ 2   │
# │ 1     ┆ 3   ┆ 4   │
# │ 2     ┆ 5   ┆ 6   │
# └───────┴─────┴─────┘
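
A hedged sketch with a custom column name and starting offset (output omitted):

df.with_row_index(name: "id", offset: 1).collect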

Parameters:

  • name (String) (defaults to: "index")

    Name of the column to add.

  • offset (Integer) (defaults to: 0)

    Start the row count at this offset.

Returns:



2647
2648
2649
# File 'lib/polars/lazy_frame.rb', line 2647

def with_row_index(name: "index", offset: 0)
  _from_rbldf(_ldf.with_row_index(name, offset))
end

#write_json(file) ⇒ nil

Write the logical plan of this LazyFrame to a file in JSON format.
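
A minimal sketch (the file path is illustrative): write the plan to disk, then rebuild the LazyFrame with read_json.

lf = Polars::LazyFrame.new({"a" => [1, 2, 3]})
lf.write_json("plan.json")
Polars::LazyFrame.read_json("plan.json")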

Parameters:

  • file (String)

    File path to which the result should be written.

Returns:

  • (nil)


146
147
148
149
150
151
152
# File 'lib/polars/lazy_frame.rb', line 146

def write_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  _ldf.write_json(file)
  nil
end