Class: Polars::LazyFrame

Inherits:
Object
Defined in:
lib/polars/lazy_frame.rb

Overview

Representation of a Lazy computation graph/query against a DataFrame.

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false, height: nil) ⇒ LazyFrame

Create a new LazyFrame.



# File 'lib/polars/lazy_frame.rb', line 8

def initialize(
  data = nil,
  schema: nil,
  schema_overrides: nil,
  strict: true,
  orient: nil,
  infer_schema_length: N_INFER_DEFAULT,
  nan_to_null: false,
  height: nil
)
  self._ldf = (
    DataFrame.new(
      data,
      schema: schema,
      schema_overrides: schema_overrides,
      strict: strict,
      orient: orient,
      infer_schema_length: infer_schema_length,
      nan_to_null: nan_to_null,
      height: height
    )
    .lazy
    ._ldf
  )
end

Class Method Details

.deserialize(source, format: "binary") ⇒ LazyFrame

Note:

This function uses marshaling if the logical plan contains Ruby UDFs, and as such inherits the security implications. Deserializing can execute arbitrary code, so it should only be attempted on trusted data.

Note:

Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.

Read a logical plan from a file to construct a LazyFrame.

Examples:

lf = Polars::LazyFrame.new({"a" => [1, 2, 3]}).sum
bytes = lf.serialize
Polars::LazyFrame.deserialize(StringIO.new(bytes)).collect
# =>
# shape: (1, 1)
# ┌─────┐
# │ a   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 6   │
# └─────┘

Parameters:

  • source (Object)

    Path to a file or a file-like object (by file-like object, we refer to objects that have a read method, such as a file handler or StringIO).

  • format ('binary', 'json') (defaults to: "binary")

    The format with which the LazyFrame was serialized. Options:

    • "binary": Deserialize from binary format (bytes). This is the default.
    • "json": Deserialize from JSON format (string).

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 76

def self.deserialize(source, format: "binary")
  if Utils.pathlike?(source)
    source = Utils.normalize_filepath(source)
  end

  if format == "binary"
    deserializer = RbLazyFrame.method(:deserialize_binary)
  elsif format == "json"
    deserializer = RbLazyFrame.method(:deserialize_json)
  else
    msg = "`format` must be one of {'binary', 'json'}, got #{format.inspect}"
    raise ArgumentError, msg
  end

  _from_rbldf(deserializer.(source))
end

Instance Method Details

#bottom_k(k, by:, reverse: false) ⇒ LazyFrame

Return the k smallest rows.

Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you wish the output to be sorted.

Examples:

Get the rows which contain the 4 smallest values in column b.

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [2, 1, 1, 3, 2, 1]
  }
)
lf.bottom_k(4, by: "b").collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b   ┆ 1   │
# │ a   ┆ 1   │
# │ c   ┆ 1   │
# │ a   ┆ 2   │
# └─────┴─────┘

Get the rows which contain the 4 smallest values when sorting on column a and b.

lf.bottom_k(4, by: ["a", "b"]).collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 1   │
# │ a   ┆ 2   │
# │ b   ┆ 1   │
# │ b   ┆ 2   │
# └─────┴─────┘

Parameters:

  • k (Integer)

    Number of rows to return.

  • by (Object)

    Column(s) used to determine the bottom rows. Accepts expression input. Strings are parsed as column names.

  • reverse (Object) (defaults to: false)

    Consider the k largest elements of the by column(s) (instead of the k smallest). This can be specified per column by passing an array of booleans.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 849

def bottom_k(
  k,
  by:,
  reverse: false
)
  by = Utils.parse_into_list_of_expressions(by)
  reverse = Utils.extend_bool(reverse, by.length, "reverse", "by")
  _from_rbldf(_ldf.bottom_k(k, by, reverse))
end

#cacheLazyFrame

Cache the result once the execution of the physical plan hits this node.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1668

def cache
  _from_rbldf(_ldf.cache)
end

#cast(dtypes, strict: true) ⇒ LazyFrame

Cast LazyFrame column(s) to the specified dtype(s).

Examples:

Cast specific frame columns to the specified dtypes:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => [Date.new(2020, 1, 2), Date.new(2021, 3, 4), Date.new(2022, 5, 6)]
  }
)
lf.cast({"foo" => Polars::Float32, "bar" => Polars::UInt8}).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬────────────┐
# │ foo ┆ bar ┆ ham        │
# │ --- ┆ --- ┆ ---        │
# │ f32 ┆ u8  ┆ date       │
# ╞═════╪═════╪════════════╡
# │ 1.0 ┆ 6   ┆ 2020-01-02 │
# │ 2.0 ┆ 7   ┆ 2021-03-04 │
# │ 3.0 ┆ 8   ┆ 2022-05-06 │
# └─────┴─────┴────────────┘

Cast all frame columns matching one dtype (or dtype group) to another dtype:

lf.cast({Polars::Date => Polars::Datetime}).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────────────────────┐
# │ foo ┆ bar ┆ ham                 │
# │ --- ┆ --- ┆ ---                 │
# │ i64 ┆ f64 ┆ datetime[μs]        │
# ╞═════╪═════╪═════════════════════╡
# │ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
# │ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
# │ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
# └─────┴─────┴─────────────────────┘

Cast all frame columns to the specified dtype:

lf.cast(Polars::String).collect.to_h(as_series: false)
# => {"foo"=>["1", "2", "3"], "bar"=>["6.0", "7.0", "8.0"], "ham"=>["2020-01-02", "2021-03-04", "2022-05-06"]}

Parameters:

  • dtypes (Hash)

    Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.

  • strict (Boolean) (defaults to: true)

    Throw an error if a cast could not be done (for instance, due to an overflow).

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1721

def cast(dtypes, strict: true)
  if !dtypes.is_a?(Hash)
    return _from_rbldf(_ldf.cast_all(dtypes, strict))
  end

  cast_map = {}
  dtypes.each do |c, dtype|
    dtype = Utils.parse_into_dtype(dtype)
    cast_map.merge!(
      c.is_a?(::String) ? {c => dtype} : Utils.expand_selector(self, c).to_h { |x| [x, dtype] }
    )
  end

  _from_rbldf(_ldf.cast(cast_map, strict))
end

#clear(n = 0) ⇒ LazyFrame

Create an empty copy of the current LazyFrame.

The copy has an identical schema but no data.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil],
  }
).lazy
lf.clear.collect
# =>
# shape: (0, 3)
# ┌─────┬─────┬──────┐
# │ a   ┆ b   ┆ c    │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ f64 ┆ bool │
# ╞═════╪═════╪══════╡
# └─────┴─────┴──────┘
lf.clear(2).collect
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ a    ┆ b    ┆ c    │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ f64  ┆ bool │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1773

def clear(n = 0)
  DataFrame.new(schema: schema).clear(n).lazy
end

#collect(engine: "auto", background: false, optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame

Materialize this LazyFrame into a DataFrame.

By default, all query optimizations are enabled. Individual optimizations may be disabled by setting the corresponding parameter to false.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.all.sum).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ a   ┆ 4   ┆ 10  │
# │ b   ┆ 11  ┆ 10  │
# │ c   ┆ 6   ┆ 1   │
# └─────┴─────┴─────┘

Parameters:

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • background (Boolean) (defaults to: false)

    Run the query in the background and get a handle to the query. This handle can be used to fetch the result or cancel the query.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

Returns:

  • (DataFrame)

# File 'lib/polars/lazy_frame.rb', line 966

def collect(
  engine: "auto",
  background: false,
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  if engine == "streaming"
    Utils.issue_unstable_warning("streaming mode is considered unstable.")
  end

  ldf = _ldf.with_optimizations(optimizations._rboptflags)
  if background
    Utils.issue_unstable_warning("background mode is considered unstable.")
    return InProcessQuery.new(ldf.collect_concurrently)
  end

  Utils.wrap_df(ldf.collect(engine))
end

#collect_batches(chunk_size: nil, maintain_order: true, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object

Note:

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Note:

This method is much slower than native sinks. Only use it if you cannot implement your logic otherwise.

Evaluate the query in streaming mode and get a generator that returns chunks.

This allows streaming results that are larger than RAM to be written to disk.

The query will always be fully executed unless stop is called, so you should call next until all chunks have been seen.

Parameters:

  • chunk_size (Integer) (defaults to: nil)

    The number of rows that are buffered before a chunk is given.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • lazy (Boolean) (defaults to: false)

    Start the query when first requesting a batch.

  • engine (String) (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (Object) (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

Returns:

  • (Object)

# File 'lib/polars/lazy_frame.rb', line 1628

def collect_batches(
  chunk_size: nil,
  maintain_order: true,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  ldf = _ldf.with_optimizations(optimizations._rboptflags)
  inner = ldf.collect_batches(
    engine,
    maintain_order,
    chunk_size,
    lazy
  )
  CollectBatches.new(inner)
end

#collect_schemaSchema

Resolve the schema of this LazyFrame.

Examples:

Determine the schema.

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
lf.collect_schema
# => Polars::Schema({"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String})

Access various properties of the schema.

schema = lf.collect_schema
schema["bar"]
# => Polars::Float64
schema.names
# => ["foo", "bar", "ham"]
schema.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]
schema.length
# => 3

Returns:

  • (Schema)

# File 'lib/polars/lazy_frame.rb', line 1017

def collect_schema
  Schema.new(_ldf.collect_schema, check_dtypes: false)
end

#columnsArray

Get or set column names.

Examples:

df = (
   Polars::DataFrame.new(
     {
       "foo" => [1, 2, 3],
       "bar" => [6, 7, 8],
       "ham" => ["a", "b", "c"]
     }
   )
   .lazy
   .select(["foo", "bar"])
)
df.columns
# => ["foo", "bar"]

Returns:

  • (Array)

# File 'lib/polars/lazy_frame.rb', line 112

def columns
  _ldf.collect_schema.keys
end

#countLazyFrame

Return the number of non-null elements for each column.

Examples:

lf = Polars::LazyFrame.new(
  {"a" => [1, 2, 3, 4], "b" => [1, 2, 1, nil], "c" => [nil, nil, nil, nil]}
)
lf.count.collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 4   ┆ 3   ┆ 0   │
# └─────┴─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 5125

def count
  _from_rbldf(_ldf.count)
end

#describe(percentiles: [0.25, 0.5, 0.75], interpolation: "nearest") ⇒ DataFrame

Note:

The median is included by default as the 50% percentile.

Note:

This method does not maintain the laziness of the frame, and will collect the final result. This could potentially be an expensive operation.

Note:

We do not guarantee the output of describe to be stable. It will show statistics that we deem informative, and may be updated in the future. Using describe programmatically (versus interactive exploration) is not recommended for this reason.

Creates a summary of statistics for a LazyFrame, returning a DataFrame.

Examples:

Show default frame statistics:

lf = Polars::LazyFrame.new(
  {
    "float" => [1.0, 2.8, 3.0],
    "int" => [40, 50, nil],
    "bool" => [true, false, true],
    "str" => ["zz", "xx", "yy"],
    "date" => [Date.new(2020, 1, 1), Date.new(2021, 7, 5), Date.new(2022, 12, 31)]
  }
)
lf.describe
# =>
# shape: (9, 6)
# ┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────────┐
# │ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                    │
# │ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                     │
# │ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                     │
# ╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════════╡
# │ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                       │
# │ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                       │
# │ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 UTC │
# │ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                    │
# │ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01              │
# │ 25%        ┆ 2.8      ┆ 40.0     ┆ null     ┆ null ┆ 2021-07-05              │
# │ 50%        ┆ 2.8      ┆ 50.0     ┆ null     ┆ null ┆ 2021-07-05              │
# │ 75%        ┆ 3.0      ┆ 50.0     ┆ null     ┆ null ┆ 2022-12-31              │
# │ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31              │
# └────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────────┘

Customize which percentiles are displayed, applying linear interpolation:

lf.describe(
  percentiles: [0.1, 0.3, 0.5, 0.7, 0.9],
  interpolation: "linear"
)
# =>
# shape: (11, 6)
# ┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────────┐
# │ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                    │
# │ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                     │
# │ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                     │
# ╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════════╡
# │ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                       │
# │ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                       │
# │ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 UTC │
# │ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                    │
# │ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01              │
# │ …          ┆ …        ┆ …        ┆ …        ┆ …    ┆ …                       │
# │ 30%        ┆ 2.08     ┆ 43.0     ┆ null     ┆ null ┆ 2020-11-26              │
# │ 50%        ┆ 2.8      ┆ 45.0     ┆ null     ┆ null ┆ 2021-07-05              │
# │ 70%        ┆ 2.88     ┆ 47.0     ┆ null     ┆ null ┆ 2022-02-07              │
# │ 90%        ┆ 2.96     ┆ 49.0     ┆ null     ┆ null ┆ 2022-09-13              │
# │ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31              │
# └────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────────┘

Parameters:

  • percentiles (Array) (defaults to: [0.25, 0.5, 0.75])

    One or more percentiles to include in the summary statistics. All values must be in the range [0, 1].

  • interpolation ('nearest', 'higher', 'lower', 'midpoint', 'linear', 'equiprobable') (defaults to: "nearest")

    Interpolation method used when calculating percentiles.

Returns:

  • (DataFrame)

# File 'lib/polars/lazy_frame.rb', line 341

def describe(
  percentiles: [0.25, 0.5, 0.75],
  interpolation: "nearest"
)
  schema = collect_schema.to_h

  if schema.empty?
    msg = "cannot describe a LazyFrame that has no columns"
    raise TypeError, msg
  end

  # create list of metrics
  metrics = ["count", "null_count", "mean", "std", "min"]
  if (quantiles = Utils.parse_percentiles(percentiles)).any?
    metrics.concat(quantiles.map { |q| "%g%%" % [q * 100] })
  end
  metrics.append("max")

  skip_minmax = lambda do |dt|
    dt.nested? || [Categorical, Enum, Null, Object, Unknown].include?(dt)
  end

  # determine which columns will produce std/mean/percentile/etc
  # statistics in a single pass over the frame schema
  has_numeric_result, sort_cols = Set.new, Set.new
  metric_exprs = []
  null = F.lit(nil)

  schema.each do |c, dtype|
    is_numeric = dtype.numeric?
    is_temporal = !is_numeric && dtype.temporal?

    # counts
    count_exprs = [
      F.col(c).count.name.prefix("count:"),
      F.col(c).null_count.name.prefix("null_count:")
    ]
    # mean
    mean_expr =
      if is_temporal || is_numeric || dtype == Boolean
        F.col(c).mean
      else
        null
      end

    # standard deviation, min, max
    expr_std = is_numeric ? F.col(c).std : null
    min_expr = !skip_minmax.(dtype) ? F.col(c).min : null
    max_expr = !skip_minmax.(dtype) ? F.col(c).max : null

    # percentiles
    pct_exprs = []
    quantiles.each do |p|
      if is_numeric || is_temporal
        pct_expr =
          if is_temporal
            F.col(c).to_physical.quantile(p, interpolation: interpolation).cast(dtype)
          else
            F.col(c).quantile(p, interpolation: interpolation)
          end
        sort_cols.add(c)
      else
        pct_expr = null
      end
      pct_exprs << pct_expr.alias("#{p}:#{c}")
    end

    if is_numeric || dtype.nested? || [Null, Boolean].include?(dtype)
      has_numeric_result.add(c)
    end

    # add column expressions (in end-state 'metrics' list order)
    metric_exprs.concat(
      [
        *count_exprs,
        mean_expr.alias("mean:#{c}"),
        expr_std.alias("std:#{c}"),
        min_expr.alias("min:#{c}"),
        *pct_exprs,
        max_expr.alias("max:#{c}")
      ]
    )
  end

  # calculate requested metrics in parallel, then collect the result
  df_metrics = (
    (
      # if more than one quantile, sort the relevant columns to make them O(1)
      # TODO: drop sort once we have efficient retrieval of multiple quantiles
      sort_cols.any? ? with_columns(sort_cols.map { |c| F.col(c).sort }) : self
    )
    .select(*metric_exprs)
    .collect
  )

  # reshape wide result
  n_metrics = metrics.length
  column_metrics =
    schema.length.times.map do |n|
      df_metrics.row(0)[(n * n_metrics)...((n + 1) * n_metrics)]
    end

  summary = schema.keys.zip(column_metrics).to_h

  # cast by column type (numeric/bool -> float), (other -> string)
  schema.each_key do |c|
    summary[c] =
      summary[c].map do |v|
        if v.nil? || v.is_a?(Hash)
          nil
        else
          if has_numeric_result.include?(c)
            if v == true
              1.0
            elsif v == false
              0.0
            else
              v.to_f
            end
          else
            "#{v}"
          end
        end
      end
  end

  # return results as a DataFrame
  df_summary = Polars.from_hash(summary)
  df_summary.insert_column(0, Polars::Series.new("statistic", metrics))
  df_summary
end

#drop(*columns, strict: true) ⇒ LazyFrame

Remove one or more columns from the LazyFrame.

Examples:

Drop a single column by passing the name of that column.

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
lf.drop("ham").collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ f64 │
# ╞═════╪═════╡
# │ 1   ┆ 6.0 │
# │ 2   ┆ 7.0 │
# │ 3   ┆ 8.0 │
# └─────┴─────┘

Drop multiple columns by passing a selector.

lf.drop(Polars.cs.numeric).collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ ham │
# │ --- │
# │ str │
# ╞═════╡
# │ a   │
# │ b   │
# │ c   │
# └─────┘

Use positional arguments to drop multiple columns.

lf.drop("foo", "ham").collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ bar │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 6.0 │
# │ 7.0 │
# │ 8.0 │
# └─────┘

Parameters:

  • columns (Object)
    • Name of the column that should be removed.
    • List of column names.
  • strict (Boolean) (defaults to: true)

    Validate that all column names exist in the current schema, and throw an exception if any do not.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3369

def drop(*columns, strict: true)
  selectors = []
  columns.each do |c|
    if c.is_a?(Enumerable)
      selectors += c
    else
      selectors += [c]
    end
  end

  drop_cols = Utils.parse_list_into_selector(selectors, strict: strict)
  _from_rbldf(_ldf.drop(drop_cols._rbselector))
end

#drop_nans(subset: nil) ⇒ LazyFrame

Drop all rows that contain one or more NaN values.

The original order of the remaining rows is preserved.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [-20.5, Float::NAN, 80.0],
    "bar" => [Float::NAN, 110.0, 25.5],
    "ham" => ["xxx", "yyy", nil]
  }
)
lf.drop_nans.collect
# =>
# shape: (1, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ f64  ┆ f64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ 80.0 ┆ 25.5 ┆ null │
# └──────┴──────┴──────┘
lf.drop_nans(subset: ["bar"]).collect
# =>
# shape: (2, 3)
# ┌──────┬───────┬──────┐
# │ foo  ┆ bar   ┆ ham  │
# │ ---  ┆ ---   ┆ ---  │
# │ f64  ┆ f64   ┆ str  │
# ╞══════╪═══════╪══════╡
# │ NaN  ┆ 110.0 ┆ yyy  │
# │ 80.0 ┆ 25.5  ┆ null │
# └──────┴───────┴──────┘

Parameters:

  • subset (Object) (defaults to: nil)

    Column name(s) for which NaN values are considered; if set to nil (default), use all columns (note that only floating-point columns can contain NaNs).

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4314

def drop_nans(subset: nil)
  selector_subset = nil
  if !subset.nil?
    selector_subset = Utils.parse_list_into_selector(subset)._rbselector
  end
  _from_rbldf(_ldf.drop_nans(selector_subset))
end

#drop_nulls(subset: nil) ⇒ LazyFrame

Drop all rows that contain one or more null values.

The original order of the remaining rows is preserved.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, nil, 8],
    "ham" => ["a", "b", nil]
  }
)
lf.drop_nulls.collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘
lf.drop_nulls(subset: Polars.cs.integer).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ i64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 1   ┆ 6   ┆ a    │
# │ 3   ┆ 8   ┆ null │
# └─────┴─────┴──────┘

Parameters:

  • subset (Object) (defaults to: nil)

    Column name(s) for which null values are considered. If set to nil (default), use all columns.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4363

def drop_nulls(subset: nil)
  selector_subset = nil
  if !subset.nil?
    selector_subset = Utils.parse_list_into_selector(subset)._rbselector
  end
  _from_rbldf(_ldf.drop_nulls(selector_subset))
end

#dtypesArray

Get dtypes of columns in LazyFrame.

Examples:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]

Returns:

  • (Array)

# File 'lib/polars/lazy_frame.rb', line 130

def dtypes
  _ldf.collect_schema.values
end

#explain(format: "plain", optimized: true, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ String

Create a string representation of the query plan.

Different optimizations can be turned on or off.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
)
lf.group_by("a", maintain_order: true).agg(Polars.all.sum).sort(
  "a"
).explain

Returns:

  • (String)

# File 'lib/polars/lazy_frame.rb', line 490

def explain(
  format: "plain",
  optimized: true,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  if engine == "streaming"
    Utils.issue_unstable_warning("streaming mode is considered unstable.")
  end

  if optimized
    ldf = _ldf.with_optimizations(optimizations._rboptflags)
    if format == "tree"
      return ldf.describe_optimized_plan_tree
    else
      return ldf.describe_optimized_plan
    end
  end

  if format == "tree"
    _ldf.describe_plan_tree
  else
    _ldf.describe_plan
  end
end

#explode(columns, *more_columns, empty_as_null: true, keep_nulls: true) ⇒ LazyFrame

Explode lists to long format.

Examples:

df = Polars::DataFrame.new(
  {
    "letters" => ["a", "a", "b", "c"],
    "numbers" => [[1], [2, 3], [4, 5], [6, 7, 8]],
  }
).lazy
df.explode("numbers").collect
# =>
# shape: (8, 2)
# ┌─────────┬─────────┐
# │ letters ┆ numbers │
# │ ---     ┆ ---     │
# │ str     ┆ i64     │
# ╞═════════╪═════════╡
# │ a       ┆ 1       │
# │ a       ┆ 2       │
# │ a       ┆ 3       │
# │ b       ┆ 4       │
# │ b       ┆ 5       │
# │ c       ┆ 6       │
# │ c       ┆ 7       │
# │ c       ┆ 8       │
# └─────────┴─────────┘

Parameters:

  • columns (Object)

    Column names, expressions, or a selector defining them. The underlying columns being exploded must be of the List or Array data type.

  • more_columns (Array)

    Additional names of columns to explode, specified as positional arguments.

  • empty_as_null (Boolean) (defaults to: true)

    Explode an empty list/array into a null.

  • keep_nulls (Boolean) (defaults to: true)

    Explode a null list/array into a null.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4191

def explode(
  columns,
  *more_columns,
  empty_as_null: true,
  keep_nulls: true
)
  subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector(
    more_columns
  )
  _from_rbldf(_ldf.explode(subset._rbselector, empty_as_null, keep_nulls))
end

#fill_nan(value) ⇒ LazyFrame

Note:

Note that floating point NaN (Not a Number) values are not missing values. To replace missing values, use fill_null instead.

Fill floating point NaN values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1.5, 2, Float::NAN, 4],
    "b" => [0.5, 4, Float::NAN, 13],
  }
).lazy
df.fill_nan(99).collect
# =>
# shape: (4, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ f64  ┆ f64  │
# ╞══════╪══════╡
# │ 1.5  ┆ 0.5  │
# │ 2.0  ┆ 4.0  │
# │ 99.0 ┆ 99.0 │
# │ 4.0  ┆ 13.0 │
# └──────┴──────┘

Parameters:

  • value (Object)

    Value to fill the NaN values with.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3930

def fill_nan(value)
  if !value.is_a?(Expr)
    value = F.lit(value)
  end
  _from_rbldf(_ldf.fill_nan(value._rbexpr))
end

#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ LazyFrame

Fill null values using the specified value or strategy.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, nil, 4],
    "b" => [0.5, 4, nil, 13]
  }
)
lf.fill_null(99).collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 99  ┆ 99.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
lf.fill_null(strategy: "forward").collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
lf.fill_null(strategy: "max").collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
lf.fill_null(strategy: "zero").collect
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 0   ┆ 0.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3855

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true)
  if !value.nil?
    if value.is_a?(Expr)
      dtypes = nil
    elsif value.is_a?(TrueClass) || value.is_a?(FalseClass)
      dtypes = [Boolean]
    elsif matches_supertype && (value.is_a?(Integer) || value.is_a?(Float))
      dtypes = [
        Int8,
        Int16,
        Int32,
        Int64,
        Int128,
        UInt8,
        UInt16,
        UInt32,
        UInt64,
        Float32,
        Float64,
        Decimal.new
      ]
    elsif value.is_a?(Integer)
      dtypes = [Int64]
    elsif value.is_a?(Float)
      dtypes = [Float64]
    elsif value.is_a?(::Date)
      dtypes = [Date]
    elsif value.is_a?(::String)
      dtypes = [String, Categorical]
    else
      # fallback; anything not explicitly handled above
      dtypes = nil
    end

    if dtypes
      return with_columns(
        F.col(dtypes).fill_null(value, strategy: strategy, limit: limit)
      )
    end
  end

  select(Polars.all.fill_null(value, strategy: strategy, limit: limit))
end

#filter(*predicates, **constraints) ⇒ LazyFrame

Filter the rows in the DataFrame based on a predicate expression.

Examples:

Filter on one condition:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3, nil, 4, nil, 0],
    "bar" => [6, 7, 8, nil, nil, 9, 0],
    "ham" => ["a", "b", "c", nil, "d", "e", "f"]
  }
)
lf.filter(Polars.col("foo") > 1).collect
# =>
# shape: (3, 3)
# ┌─────┬──────┬─────┐
# │ foo ┆ bar  ┆ ham │
# │ --- ┆ ---  ┆ --- │
# │ i64 ┆ i64  ┆ str │
# ╞═════╪══════╪═════╡
# │ 2   ┆ 7    ┆ b   │
# │ 3   ┆ 8    ┆ c   │
# │ 4   ┆ null ┆ d   │
# └─────┴──────┴─────┘

Filter on multiple conditions:

lf.filter((Polars.col("foo") < 3) & (Polars.col("ham") == "a")).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Provide multiple filters using *args syntax:

lf.filter(
  Polars.col("foo") == 1,
  Polars.col("ham") == "a"
).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Provide multiple filters using **kwargs syntax:

lf.filter(foo: 1, ham: "a").collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Filter on an OR condition:

lf.filter(
  (Polars.col("foo") == 1) | (Polars.col("ham") == "c")
).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Filter by comparing two columns against each other

lf.filter(
  Polars.col("foo") == Polars.col("bar")
).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 0   ┆ 0   ┆ f   │
# └─────┴─────┴─────┘
lf.filter(
  Polars.col("foo") != Polars.col("bar")
).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • predicates (Array)

    Expression(s) that evaluate to a boolean Series.

  • constraints (Hash)

    Column filters; use name: value to filter columns by the supplied value. Each constraint behaves the same as Polars.col(name).eq(value), and is implicitly joined with the other filter conditions using &.

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 1892

def filter(*predicates, **constraints)
  if constraints.empty?
    # early-exit conditions (exclude/include all rows)
    if predicates.empty? || (predicates.length == 1 && predicates[0].is_a?(TrueClass))
      return dup
    end
    if predicates.length == 1 && predicates[0].is_a?(FalseClass)
      return clear
    end
  end

  _filter(
    predicates: predicates,
    constraints: constraints,
    invert: false
  )
end

#firstLazyFrame

Get the first row of the DataFrame.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 5, 3],
    "b" => [2, 4, 6]
  }
)
lf.first.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3720

def first
  slice(0, 1)
end

#gather_every(n, offset: 0) ⇒ LazyFrame

Take every nth row in the LazyFrame and return as a new LazyFrame.

Examples:

s = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [5, 6, 7, 8]}).lazy
s.gather_every(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 5   │
# │ 3   ┆ 7   │
# └─────┴─────┘

Parameters:

  • n (Integer)

    Gather every n-th row.

  • offset (Integer) (defaults to: 0)

    Starting index.

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3782

def gather_every(n, offset: 0)
  select(F.col("*").gather_every(n, offset))
end

#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy

Start a group by operation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.col("b").sum).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 4   │
# │ b   ┆ 11  │
# │ c   ┆ 6   │
# └─────┴─────┘

Parameters:

  • by (Array)

    Column(s) to group by.

  • maintain_order (Boolean) (defaults to: false)

    Make sure that the order of the groups remains consistent. This is more expensive than a default group by.

  • named_by (Hash)

    Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (LazyGroupBy)



# File 'lib/polars/lazy_frame.rb', line 2196

def group_by(*by, maintain_order: false, **named_by)
  exprs = Utils.parse_into_list_of_expressions(*by, **named_by)
  lgb = _ldf.group_by(exprs, maintain_order)
  LazyGroupBy.new(lgb)
end

#group_by_dynamic(index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window") ⇒ LazyGroupBy

Group based on a time value (or index value of type Int32, Int64).

Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

  • every: interval of the window
  • period: length of the window
  • offset: offset of the window

The every, period and offset arguments are created with the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a group_by_dynamic on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10

Examples:

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "n" => 0..6
  }
)
# =>
# shape: (7, 2)
# ┌─────────────────────┬─────┐
# │ time                ┆ n   │
# │ ---                 ┆ --- │
# │ datetime[μs]        ┆ i64 │
# ╞═════════════════════╪═════╡
# │ 2021-12-16 00:00:00 ┆ 0   │
# │ 2021-12-16 00:30:00 ┆ 1   │
# │ 2021-12-16 01:00:00 ┆ 2   │
# │ 2021-12-16 01:30:00 ┆ 3   │
# │ 2021-12-16 02:00:00 ┆ 4   │
# │ 2021-12-16 02:30:00 ┆ 5   │
# │ 2021-12-16 03:00:00 ┆ 6   │
# └─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

df.group_by_dynamic("time", every: "1h", closed: "right").agg(
  [
    Polars.col("time").min.alias("time_min"),
    Polars.col("time").max.alias("time_max")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬─────────────────────┬─────────────────────┐
# │ time                ┆ time_min            ┆ time_max            │
# │ ---                 ┆ ---                 ┆ ---                 │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
# └─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result.

df.group_by_dynamic(
  "time", every: "1h", include_boundaries: true, closed: "right"
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (4, 4)
# ┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# └─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

When closed: "left", the right end of the interval is not included.

df.group_by_dynamic("time", every: "1h", closed: "left").agg(
  [
    Polars.col("time").count.alias("time_count"),
    Polars.col("time").alias("time_agg_list")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬─────────────────────────────────┐
# │ time                ┆ time_count ┆ time_agg_list                   │
# │ ---                 ┆ ---        ┆ ---                             │
# │ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]              │
# ╞═════════════════════╪════════════╪═════════════════════════════════╡
# │ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-… │
# │ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-… │
# │ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-… │
# │ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]           │
# └─────────────────────┴────────────┴─────────────────────────────────┘

When closed: "both", the time values at the window boundaries belong to 2 groups.

df.group_by_dynamic("time", every: "1h", closed: "both").agg(
  [Polars.col("time").count.alias("time_count")]
)
# =>
# shape: (5, 2)
# ┌─────────────────────┬────────────┐
# │ time                ┆ time_count │
# │ ---                 ┆ ---        │
# │ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 3          │
# │ 2021-12-16 01:00:00 ┆ 3          │
# │ 2021-12-16 02:00:00 ┆ 3          │
# │ 2021-12-16 03:00:00 ┆ 1          │
# └─────────────────────┴────────────┘

Dynamic group bys can also be combined with grouping on normal keys.

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "groups" => ["a", "a", "a", "b", "b", "a", "a"]
  }
)
df.group_by_dynamic(
  "time",
  every: "1h",
  closed: "both",
  group_by: "groups",
  include_boundaries: true
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (7, 5)
# ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
# │ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# │ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
# │ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
# └────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic group by on an index column.

df = Polars::DataFrame.new(
  {
    "idx" => Polars.arange(0, 6, eager: true),
    "A" => ["A", "A", "B", "B", "B", "C"]
  }
)
df.group_by_dynamic(
  "idx",
  every: "2i",
  period: "3i",
  include_boundaries: true,
  closed: "right"
).agg(Polars.col("A").alias("A_agg_list"))
# =>
# shape: (4, 4)
# ┌─────────────────┬─────────────────┬─────┬─────────────────┐
# │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
# │ ---             ┆ ---             ┆ --- ┆ ---             │
# │ i64             ┆ i64             ┆ i64 ┆ list[str]       │
# ╞═════════════════╪═════════════════╪═════╪═════════════════╡
# │ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
# │ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
# │ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
# │ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
# └─────────────────┴─────────────────┴─────┴─────────────────┘

Parameters:

  • index_column (Object)

    Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if not, the output will not make sense.

    In case of a dynamic group by on indices, the dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

  • every (Object)

    Interval of the window.

  • period (Object) (defaults to: nil)

    Length of the window, if nil it is equal to 'every'.

  • offset (Object) (defaults to: nil)

    Offset of the window. If nil and period is nil, it will be equal to negative every.

  • include_boundaries (Boolean) (defaults to: false)

    Add the lower and upper bound of the window to the "_lower_boundary" and "_upper_boundary" columns. This will impact performance because it's harder to parallelize.

  • closed ("right", "left", "both", "none") (defaults to: "left")

    Define whether the temporal window interval is closed or not.

  • label ('left', 'right', 'datapoint') (defaults to: "left")

    Define which label to use for the window:

    • 'left': lower boundary of the window
    • 'right': upper boundary of the window
    • 'datapoint': the first value of the index column in the given window. If you don't need the label to be at one of the boundaries, choose this option for maximum performance
  • group_by (Object) (defaults to: nil)

    Also group by this column/these columns

  • start_by ('window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday') (defaults to: "window")

    The strategy to determine the start of the first window by.

    • 'window': Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
    • 'datapoint': Start from the first encountered data point.
    • a day of the week (only takes effect if every contains 'w'):

      • 'monday': Start the window on the Monday before the first data point.
      • 'tuesday': Start the window on the Tuesday before the first data point.
      • ...
      • 'sunday': Start the window on the Sunday before the first data point.

    The resulting window is then shifted back until the earliest datapoint is in or in front of it.

Returns:

  • (LazyGroupBy)



# File 'lib/polars/lazy_frame.rb', line 2559

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  group_by: nil,
  start_by: "window"
)
  index_column = Utils.parse_into_expression(index_column, str_as_lit: false)
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  if period.nil?
    period = every
  end

  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)
  every = Utils.parse_as_duration_string(every)

  rbexprs_by = group_by.nil? ? [] : Utils.parse_into_list_of_expressions(group_by)
  lgb = _ldf.group_by_dynamic(
    index_column,
    every,
    period,
    offset,
    label,
    include_boundaries,
    closed,
    rbexprs_by,
    start_by
  )
  LazyGroupBy.new(lgb)
end

#head(n = 5) ⇒ LazyFrame

Get the first n rows.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3, 4, 5, 6],
    "b" => [7, 8, 9, 10, 11, 12]
  }
)
lf.head.collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# │ 3   ┆ 9   │
# │ 4   ┆ 10  │
# │ 5   ┆ 11  │
# └─────┴─────┘
lf.head(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3625

def head(n = 5)
  slice(0, n)
end

#include?(key) ⇒ Boolean

Check if LazyFrame includes key.

Returns:

  • (Boolean)



# File 'lib/polars/lazy_frame.rb', line 167

def include?(key)
  columns.include?(key)
end

#interpolateLazyFrame

Interpolate intermediate values. The interpolation method is linear.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 9, 10],
    "bar" => [6, 7, 9, nil],
    "baz" => [1, nil, nil, 9]
  }
).lazy
df.interpolate.collect
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────────┐
# │ foo  ┆ bar  ┆ baz      │
# │ ---  ┆ ---  ┆ ---      │
# │ f64  ┆ f64  ┆ f64      │
# ╞══════╪══════╪══════════╡
# │ 1.0  ┆ 6.0  ┆ 1.0      │
# │ 5.0  ┆ 7.0  ┆ 3.666667 │
# │ 9.0  ┆ 9.0  ┆ 6.333333 │
# │ 10.0 ┆ null ┆ 9.0      │
# └──────┴──────┴──────────┘

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 4718

def interpolate
  select(F.col("*").interpolate)
end

#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, allow_parallel: true, force_parallel: false, coalesce: nil, maintain_order: nil) ⇒ LazyFrame

Add a join operation to the Logical Plan.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
other_df = Polars::DataFrame.new(
  {
    "apple" => ["x", "y", "z"],
    "ham" => ["a", "b", "d"]
  }
).lazy
df.join(other_df, on: "ham").collect
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "full").collect
# =>
# shape: (4, 5)
# ┌──────┬──────┬──────┬───────┬───────────┐
# │ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
# │ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
# │ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
# ╞══════╪══════╪══════╪═══════╪═══════════╡
# │ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
# │ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
# │ null ┆ null ┆ null ┆ z     ┆ d         │
# │ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
# └──────┴──────┴──────┴───────┴───────────┘
df.join(other_df, on: "ham", how: "left").collect
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# │ 3   ┆ 8.0 ┆ c   ┆ null  │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "semi").collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6.0 ┆ a   │
# │ 2   ┆ 7.0 ┆ b   │
# └─────┴─────┴─────┘
df.join(other_df, on: "ham", how: "anti").collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • other (LazyFrame)

    Lazy DataFrame to join with.

  • left_on (Object) (defaults to: nil)

    Join column of the left DataFrame.

  • right_on (Object) (defaults to: nil)

    Join column of the right DataFrame.

  • on (Object) (defaults to: nil)

    Join column of both DataFrames. If set, left_on and right_on should be nil.

  • how ("inner", "left", "full", "semi", "anti", "cross") (defaults to: "inner")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • validate ('m:m', 'm:1', '1:m', '1:1') (defaults to: "m:m")

    Checks if join is of specified type.

    • many_to_many - “m:m”: default, does not result in checks
    • one_to_one - “1:1”: check if join keys are unique in both left and right datasets
    • one_to_many - “1:m”: check if join keys are unique in left dataset
    • many_to_one - “m:1”: check if join keys are unique in right dataset
  • nulls_equal (Boolean) (defaults to: false)

    Join on null values. By default null values will never produce matches.

  • allow_parallel (Boolean) (defaults to: true)

    Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

  • force_parallel (Boolean) (defaults to: false)

    Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

  • coalesce (Boolean) (defaults to: nil)

    Coalescing behavior (merging of join columns).

    • nil: -> join specific.
    • true: -> Always coalesce join columns.
    • false: -> Never coalesce join columns. Note that joining on expressions other than col will turn off coalescing.
  • maintain_order ('none', 'left', 'right', 'left_right', 'right_left') (defaults to: nil)

    Which DataFrame row order to preserve, if any. Do not rely on any observed ordering without explicitly setting this parameter, as your code may break in a future release. Not specifying any ordering can improve performance. Supported for inner, left, right and full joins.

    • none No specific ordering is desired. The ordering might differ across Polars versions or even between different runs.
    • left Preserves the order of the left DataFrame.
    • right Preserves the order of the right DataFrame.
    • left_right First preserves the order of the left DataFrame, then the right.
    • right_left First preserves the order of the right DataFrame, then the left.

Returns:

  • (LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 3070

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  nulls_equal: false,
  allow_parallel: true,
  force_parallel: false,
  coalesce: nil,
  maintain_order: nil
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if maintain_order.nil?
    maintain_order = "none"
  end

  if how == "outer"
    how = "full"
  elsif how == "cross"
    return _from_rbldf(
      _ldf.join(
        other._ldf,
        [],
        [],
        allow_parallel,
        force_parallel,
        nulls_equal,
        how,
        suffix,
        validate,
        maintain_order,
        coalesce
      )
    )
  end

  if !on.nil?
    rbexprs = Utils.parse_into_list_of_expressions(on)
    rbexprs_left = rbexprs
    rbexprs_right = rbexprs
  elsif !left_on.nil? && !right_on.nil?
    rbexprs_left = Utils.parse_into_list_of_expressions(left_on)
    rbexprs_right = Utils.parse_into_list_of_expressions(right_on)
  else
    raise ArgumentError, "must specify `on` OR `left_on` and `right_on`"
  end

  _from_rbldf(
    self._ldf.join(
      other._ldf,
      rbexprs_left,
      rbexprs_right,
      allow_parallel,
      force_parallel,
      nulls_equal,
      how,
      suffix,
      validate,
      maintain_order,
      coalesce
    )
  )
end

#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ LazyFrame

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the join_asof key.

For each row in the left DataFrame:

  • A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
  • A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
  • A "nearest" search selects the row in the right DataFrame whose 'on' key is closest in absolute distance to the left's key.

The default is "backward".

Examples:

gdp = Polars::LazyFrame.new(
  {
    "date" => Polars.date_range(
      Date.new(2016, 1, 1),
      Date.new(2020, 1, 1),
      "1y",
      eager: true
    ),
    "gdp" => [4164, 4411, 4566, 4696, 4827]
  }
)
gdp.collect
# =>
# shape: (5, 2)
# ┌────────────┬──────┐
# │ date       ┆ gdp  │
# │ ---        ┆ ---  │
# │ date       ┆ i64  │
# ╞════════════╪══════╡
# │ 2016-01-01 ┆ 4164 │
# │ 2017-01-01 ┆ 4411 │
# │ 2018-01-01 ┆ 4566 │
# │ 2019-01-01 ┆ 4696 │
# │ 2020-01-01 ┆ 4827 │
# └────────────┴──────┘
population = Polars::LazyFrame.new(
  {
    "date" => [Date.new(2016, 3, 1), Date.new(2018, 8, 1), Date.new(2019, 1, 1)],
    "population" => [82.19, 82.66, 83.12]
  }
).sort("date")
population.collect
# =>
# shape: (3, 2)
# ┌────────────┬────────────┐
# │ date       ┆ population │
# │ ---        ┆ ---        │
# │ date       ┆ f64        │
# ╞════════════╪════════════╡
# │ 2016-03-01 ┆ 82.19      │
# │ 2018-08-01 ┆ 82.66      │
# │ 2019-01-01 ┆ 83.12      │
# └────────────┴────────────┘

Note how the dates don't quite match. If we join them using join_asof and strategy: "backward", then each date from population which doesn't have an exact match is matched with the closest earlier date from gdp:

population.join_asof(gdp, on: "date", strategy: "backward").collect
# =>
# shape: (3, 3)
# ┌────────────┬────────────┬──────┐
# │ date       ┆ population ┆ gdp  │
# │ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ i64  │
# ╞════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 4164 │
# │ 2018-08-01 ┆ 82.66      ┆ 4566 │
# │ 2019-01-01 ┆ 83.12      ┆ 4696 │
# └────────────┴────────────┴──────┘
population.join_asof(
  gdp, on: "date", strategy: "backward", coalesce: false
).collect
# =>
# shape: (3, 4)
# ┌────────────┬────────────┬────────────┬──────┐
# │ date       ┆ population ┆ date_right ┆ gdp  │
# │ ---        ┆ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ date       ┆ i64  │
# ╞════════════╪════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 2016-01-01 ┆ 4164 │
# │ 2018-08-01 ┆ 82.66      ┆ 2018-01-01 ┆ 4566 │
# │ 2019-01-01 ┆ 83.12      ┆ 2019-01-01 ┆ 4696 │
# └────────────┴────────────┴────────────┴──────┘

If we instead use strategy: "forward", then each date from population which doesn't have an exact match is matched with the closest later date from gdp:

population.join_asof(gdp, on: "date", strategy: "forward").collect
# =>
# shape: (3, 3)
# ┌────────────┬────────────┬──────┐
# │ date       ┆ population ┆ gdp  │
# │ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ i64  │
# ╞════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 4411 │
# │ 2018-08-01 ┆ 82.66      ┆ 4696 │
# │ 2019-01-01 ┆ 83.12      ┆ 4696 │
# └────────────┴────────────┴──────┘
population.join_asof(gdp, on: "date", strategy: "nearest").collect
# =>
# shape: (3, 3)
# ┌────────────┬────────────┬──────┐
# │ date       ┆ population ┆ gdp  │
# │ ---        ┆ ---        ┆ ---  │
# │ date       ┆ f64        ┆ i64  │
# ╞════════════╪════════════╪══════╡
# │ 2016-03-01 ┆ 82.19      ┆ 4164 │
# │ 2018-08-01 ┆ 82.66      ┆ 4696 │
# │ 2019-01-01 ┆ 83.12      ┆ 4696 │
# └────────────┴────────────┴──────┘
gdp_dates = Polars.date_range(
  Date.new(2016, 1, 1), Date.new(2020, 1, 1), "1y", eager: true
)
gdp2 = Polars::LazyFrame.new(
  {
    "country" => ["Germany"] * 5 + ["Netherlands"] * 5,
    "date" => Polars.concat([gdp_dates, gdp_dates]),
    "gdp" => [4164, 4411, 4566, 4696, 4827, 784, 833, 914, 910, 909]
  }
).sort("country", "date")
gdp2.collect
# =>
# shape: (10, 3)
# ┌─────────────┬────────────┬──────┐
# │ country     ┆ date       ┆ gdp  │
# │ ---         ┆ ---        ┆ ---  │
# │ str         ┆ date       ┆ i64  │
# ╞═════════════╪════════════╪══════╡
# │ Germany     ┆ 2016-01-01 ┆ 4164 │
# │ Germany     ┆ 2017-01-01 ┆ 4411 │
# │ Germany     ┆ 2018-01-01 ┆ 4566 │
# │ Germany     ┆ 2019-01-01 ┆ 4696 │
# │ Germany     ┆ 2020-01-01 ┆ 4827 │
# │ Netherlands ┆ 2016-01-01 ┆ 784  │
# │ Netherlands ┆ 2017-01-01 ┆ 833  │
# │ Netherlands ┆ 2018-01-01 ┆ 914  │
# │ Netherlands ┆ 2019-01-01 ┆ 910  │
# │ Netherlands ┆ 2020-01-01 ┆ 909  │
# └─────────────┴────────────┴──────┘
pop2 = Polars::LazyFrame.new(
  {
    "country" => ["Germany"] * 3 + ["Netherlands"] * 3,
    "date" => [
      Date.new(2016, 3, 1),
      Date.new(2018, 8, 1),
      Date.new(2019, 1, 1),
      Date.new(2016, 3, 1),
      Date.new(2018, 8, 1),
      Date.new(2019, 1, 1)
    ],
    "population" => [82.19, 82.66, 83.12, 17.11, 17.32, 17.40]
  }
).sort("country", "date")
pop2.collect
# =>
# shape: (6, 3)
# ┌─────────────┬────────────┬────────────┐
# │ country     ┆ date       ┆ population │
# │ ---         ┆ ---        ┆ ---        │
# │ str         ┆ date       ┆ f64        │
# ╞═════════════╪════════════╪════════════╡
# │ Germany     ┆ 2016-03-01 ┆ 82.19      │
# │ Germany     ┆ 2018-08-01 ┆ 82.66      │
# │ Germany     ┆ 2019-01-01 ┆ 83.12      │
# │ Netherlands ┆ 2016-03-01 ┆ 17.11      │
# │ Netherlands ┆ 2018-08-01 ┆ 17.32      │
# │ Netherlands ┆ 2019-01-01 ┆ 17.4       │
# └─────────────┴────────────┴────────────┘
pop2.join_asof(gdp2, by: "country", on: "date", strategy: "nearest", check_sortedness: false).collect
# =>
# shape: (6, 4)
# ┌─────────────┬────────────┬────────────┬──────┐
# │ country     ┆ date       ┆ population ┆ gdp  │
# │ ---         ┆ ---        ┆ ---        ┆ ---  │
# │ str         ┆ date       ┆ f64        ┆ i64  │
# ╞═════════════╪════════════╪════════════╪══════╡
# │ Germany     ┆ 2016-03-01 ┆ 82.19      ┆ 4164 │
# │ Germany     ┆ 2018-08-01 ┆ 82.66      ┆ 4696 │
# │ Germany     ┆ 2019-01-01 ┆ 83.12      ┆ 4696 │
# │ Netherlands ┆ 2016-03-01 ┆ 17.11      ┆ 784  │
# │ Netherlands ┆ 2018-08-01 ┆ 17.32      ┆ 910  │
# │ Netherlands ┆ 2019-01-01 ┆ 17.4       ┆ 910  │
# └─────────────┴────────────┴────────────┴──────┘

Parameters:

  • other (LazyFrame)

    Lazy DataFrame to join with.

  • left_on (String) (defaults to: nil)

    Join column of the left DataFrame.

  • right_on (String) (defaults to: nil)

    Join column of the right DataFrame.

  • on (String) (defaults to: nil)

    Join column of both DataFrames. If set, left_on and right_on should be nil.

  • by_left (Object) (defaults to: nil)

    Join on these columns before doing asof join.

  • by_right (Object) (defaults to: nil)

    Join on these columns before doing asof join.

  • by (Object) (defaults to: nil)

    Join on these columns before doing asof join.

  • strategy ("backward", "forward", "nearest") (defaults to: "backward")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • tolerance (Object) (defaults to: nil)

    Numeric tolerance. By setting this, the join will only be done if the asof keys are within this distance. If an asof join is done on columns of dtype "Date", "Datetime", "Duration" or "Time", you can use the following string language:

    • 1ns (1 nanosecond)
    • 1us (1 microsecond)
    • 1ms (1 millisecond)
    • 1s (1 second)
    • 1m (1 minute)
    • 1h (1 hour)
    • 1d (1 day)
    • 1w (1 week)
    • 1mo (1 calendar month)
    • 1y (1 calendar year)
    • 1i (1 index count)

    Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

  • allow_parallel (Boolean) (defaults to: true)

    Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

  • force_parallel (Boolean) (defaults to: false)

    Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

  • coalesce (Boolean) (defaults to: true)

    Coalescing behavior (merging of join columns).

    • true: -> Always coalesce join columns.
    • false: -> Never coalesce join columns. Note that joining on expressions other than col will turn off coalescing.
  • allow_exact_matches (Boolean) (defaults to: true)

    Whether exact matches are valid join predicates.

    • If true, allow matching with the same on value (i.e. less-than-or-equal-to / greater-than-or-equal-to).
    • If false, don't match the same on value (i.e., strictly less-than / strictly greater-than).
  • check_sortedness (Boolean) (defaults to: true)

    Check the sortedness of the asof keys. If the keys are not sorted, Polars will raise an error, or, when the 'by' argument is given, a warning. This may become a hard error in the future.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2857

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true,
  allow_exact_matches: true,
  check_sortedness: true
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if on.is_a?(::String)
    left_on = on
    right_on = on
  end

  if left_on.nil? || right_on.nil?
    raise ArgumentError, "You should pass the column to join on as an argument."
  end

  if by_left.is_a?(::String) || by_left.is_a?(Expr)
    by_left_ = [by_left]
  else
    by_left_ = by_left
  end

  if by_right.is_a?(::String) || by_right.is_a?(Expr)
    by_right_ = [by_right]
  else
    by_right_ = by_right
  end

  if by.is_a?(::String)
    by_left_ = [by]
    by_right_ = [by]
  elsif by.is_a?(::Array)
    by_left_ = by
    by_right_ = by
  end

  tolerance_str = nil
  tolerance_num = nil
  if tolerance.is_a?(::String)
    tolerance_str = tolerance
  else
    tolerance_num = tolerance
  end

  _from_rbldf(
    _ldf.join_asof(
      other._ldf,
      Polars.col(left_on)._rbexpr,
      Polars.col(right_on)._rbexpr,
      by_left_,
      by_right_,
      allow_parallel,
      force_parallel,
      suffix,
      strategy,
      tolerance_num,
      tolerance_str,
      coalesce,
      allow_exact_matches,
      check_sortedness
    )
  )
end

#join_where(other, *predicates, suffix: "_right") ⇒ LazyFrame

Note:

The row order of the input DataFrames is not preserved.

Note:

This functionality is experimental. It may be changed at any point without it being considered a breaking change.

Perform a join based on one or multiple (in)equality predicates.

This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.

Examples:

Join two lazyframes together based on two predicates which get AND-ed together.

east = Polars::LazyFrame.new(
  {
    "id" => [100, 101, 102],
    "dur" => [120, 140, 160],
    "rev" => [12, 14, 16],
    "cores" => [2, 8, 4]
  }
)
west = Polars::LazyFrame.new(
  {
    "t_id" => [404, 498, 676, 742],
    "time" => [90, 130, 150, 170],
    "cost" => [9, 13, 15, 16],
    "cores" => [4, 2, 1, 4]
  }
)
east.join_where(
  west,
  Polars.col("dur") < Polars.col("time"),
  Polars.col("rev") < Polars.col("cost")
).collect
# =>
# shape: (5, 8)
# ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐
# │ id  ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │
# │ --- ┆ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  ┆ ---  ┆ ---         │
# │ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  ┆ i64  ┆ i64         │
# ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 498  ┆ 130  ┆ 13   ┆ 2           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘

To OR them together, use a single expression and the | operator.

east.join_where(
  west,
  (Polars.col("dur") < Polars.col("time")) | (Polars.col("rev") < Polars.col("cost"))
).collect
# =>
# shape: (6, 8)
# ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐
# │ id  ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │
# │ --- ┆ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  ┆ ---  ┆ ---         │
# │ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  ┆ i64  ┆ i64         │
# ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 498  ┆ 130  ┆ 13   ┆ 2           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# │ 102 ┆ 160 ┆ 16  ┆ 4     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘

Parameters:

  • other (Object)

    DataFrame to join with.

  • predicates (Object)

    (In)Equality condition to join the two tables on. When a column name occurs in both tables, the proper suffix must be applied in the predicate.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3219

def join_where(
  other,
  *predicates,
  suffix: "_right"
)
  Utils.require_same_type(self, other)

  rbexprs = Utils.parse_into_list_of_expressions(*predicates)

  _from_rbldf(
    _ldf.join_where(
      other._ldf,
      rbexprs,
      suffix
    )
  )
end

#lastLazyFrame

Get the last row of the DataFrame.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 5, 3],
    "b" => [2, 4, 6]
  }
)
lf.last.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 3   ┆ 6   │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3695

def last
  tail(1)
end

#lazyLazyFrame

Return lazy representation, i.e. itself.

Useful for writing code that expects either a DataFrame or LazyFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil]
  }
)
df.lazy

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1661

def lazy
  self
end

#limit(n = 5) ⇒ LazyFrame

Get the first n rows.

Alias for #head.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3, 4, 5, 6],
    "b" => [7, 8, 9, 10, 11, 12]
  }
)
lf.limit.collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# │ 3   ┆ 9   │
# │ 4   ┆ 10  │
# │ 5   ┆ 11  │
# └─────┴─────┘
lf.limit(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 7   │
# │ 2   ┆ 8   │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3580

def limit(n = 5)
  head(n)
end

#map_batches(predicate_pushdown: true, projection_pushdown: true, slice_pushdown: true, no_optimizations: false, schema: nil, validate_output_schema: true, streamable: false, &function) ⇒ LazyFrame

Note:

The schema of a LazyFrame must always be correct. It is up to the caller of this function to ensure that this invariant is upheld.

Note:

It is important that the optimization flags are correct. If the custom function, for instance, performs an aggregation of a column, predicate_pushdown should not be allowed, as this prunes rows and would influence your aggregation results.

Note:

A UDF passed to map_batches must be pure, meaning that it cannot modify or depend on state other than its arguments.

Apply a custom function.

It is important that the function returns a Polars DataFrame.

Examples:

lf = (
  Polars::LazyFrame.new(
    {
      "a" => Polars.int_range(-100_000, 0, eager: true),
      "b" => Polars.int_range(0, 100_000, eager: true)
    }
  )
  .map_batches(streamable: true) { |x| x * 2 }
  .collect(engine: "streaming")
)
# =>
# shape: (100_000, 2)
# ┌─────────┬────────┐
# │ a       ┆ b      │
# │ ---     ┆ ---    │
# │ i64     ┆ i64    │
# ╞═════════╪════════╡
# │ -200000 ┆ 0      │
# │ -199998 ┆ 2      │
# │ -199996 ┆ 4      │
# │ -199994 ┆ 6      │
# │ -199992 ┆ 8      │
# │ …       ┆ …      │
# │ -10     ┆ 199990 │
# │ -8      ┆ 199992 │
# │ -6      ┆ 199994 │
# │ -4      ┆ 199996 │
# │ -2      ┆ 199998 │
# └─────────┴────────┘

Parameters:

  • predicate_pushdown (Boolean) (defaults to: true)

    Allow predicate pushdown optimization to pass this node.

  • projection_pushdown (Boolean) (defaults to: true)

    Allow projection pushdown optimization to pass this node.

  • slice_pushdown (Boolean) (defaults to: true)

    Allow slice pushdown optimization to pass this node.

  • no_optimizations (Boolean) (defaults to: false)

    Turn off all optimizations past this point.

  • schema (Object) (defaults to: nil)

    Output schema of the function; if set to nil, we assume that the schema will remain unchanged by the applied function.

  • validate_output_schema (Boolean) (defaults to: true)

    It is paramount that Polars' schema is correct. This flag ensures that the output schema of this function is checked against the expected schema. Setting this to false skips the check, but may lead to hard-to-debug bugs.

  • streamable (Boolean) (defaults to: false)

    Whether the function that is given is eligible to be run with the streaming engine. That means that the function must produce the same result when it is executed in batches or when it is executed on the full dataset.

Returns:

  • (LazyFrame)

Raises:

  • (Todo)

# File 'lib/polars/lazy_frame.rb', line 4662

def map_batches(
  predicate_pushdown: true,
  projection_pushdown: true,
  slice_pushdown: true,
  no_optimizations: false,
  schema: nil,
  validate_output_schema: true,
  streamable: false,
  &function
)
  raise Todo if !schema.nil?

  if no_optimizations
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
  end

  _from_rbldf(
    _ldf.map_batches(
      function,
      predicate_pushdown,
      projection_pushdown,
      slice_pushdown,
      streamable,
      schema,
      validate_output_schema,
    )
  )
end

#maxLazyFrame

Aggregate the columns in the DataFrame to their maximum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.max.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 4   ┆ 2   │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4017

def max
  _from_rbldf(_ldf.max)
end

#meanLazyFrame

Aggregate the columns in the DataFrame to their mean value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.mean.collect
# =>
# shape: (1, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ f64 ┆ f64  │
# ╞═════╪══════╡
# │ 2.5 ┆ 1.25 │
# └─────┴──────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4077

def mean
  _from_rbldf(_ldf.mean)
end

#medianLazyFrame

Aggregate the columns in the DataFrame to their median value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.median.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ f64 ┆ f64 │
# ╞═════╪═════╡
# │ 2.5 ┆ 1.0 │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4097

def median
  _from_rbldf(_ldf.median)
end

#merge_sorted(other, key) ⇒ LazyFrame

Take two sorted DataFrames and merge them by the sorted key.

The output of this operation will also be sorted. It is the caller's responsibility that the frames are sorted by that key; otherwise the output will not make sense.

The schemas of both LazyFrames must be equal.

Examples:

df0 = Polars::LazyFrame.new(
  {"name" => ["steve", "elise", "bob"], "age" => [42, 44, 18]}
).sort("age")
df1 = Polars::LazyFrame.new(
  {"name" => ["anna", "megan", "steve", "thomas"], "age" => [21, 33, 42, 20]}
).sort("age")
df0.merge_sorted(df1, "age").collect
# =>
# shape: (7, 2)
# ┌────────┬─────┐
# │ name   ┆ age │
# │ ---    ┆ --- │
# │ str    ┆ i64 │
# ╞════════╪═════╡
# │ bob    ┆ 18  │
# │ thomas ┆ 20  │
# │ anna   ┆ 21  │
# │ megan  ┆ 33  │
# │ steve  ┆ 42  │
# │ steve  ┆ 42  │
# │ elise  ┆ 44  │
# └────────┴─────┘

Parameters:

  • other (LazyFrame)

    Other LazyFrame that must be merged.

  • key (String)

    Key that is sorted.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4823

def merge_sorted(other, key)
  _from_rbldf(_ldf.merge_sorted(other._ldf, key))
end

#minLazyFrame

Aggregate the columns in the DataFrame to their minimum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.min.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 1   │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4037

def min
  _from_rbldf(_ldf.min)
end

#null_countLazyFrame

Aggregate the columns in the LazyFrame as the sum of their null value count.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, nil, 3],
    "bar" => [6, 7, nil],
    "ham" => ["a", "b", "c"]
  }
)
lf.null_count.collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 1   ┆ 0   │
# └─────┴─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4123

def null_count
  _from_rbldf(_ldf.null_count)
end

#pipe(function, *args, **kwargs, &block) ⇒ LazyFrame

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Examples:

cast_str_to_int = lambda do |data, col_name:|
  data.with_columns(Polars.col(col_name).cast(Polars::Int64))
end

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => ["10", "20", "30", "40"]}).lazy
df.pipe(cast_str_to_int, col_name: "b").collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 10  │
# │ 2   ┆ 20  │
# │ 3   ┆ 30  │
# │ 4   ┆ 40  │
# └─────┴─────┘

Parameters:

  • function (Object)

    Callable; will receive the frame as the first parameter, followed by any given args/kwargs.

  • args (Object)

    Arguments to pass to the UDF.

  • kwargs (Object)

    Keyword arguments to pass to the UDF.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 261

def pipe(function, *args, **kwargs, &block)
  function.(self, *args, **kwargs, &block)
end

#pivot(on, on_columns:, index: nil, values: nil, aggregate_function: nil, maintain_order: false, separator: "_") ⇒ LazyFrame

Note:

In some other frameworks, you might know this operation as pivot_wider.

Create a spreadsheet-style pivot table as a DataFrame.

Examples:

Using pivot, we can reshape so we have one row per student, with different subjects as columns, and their test_1 scores as values:

df = Polars::DataFrame.new(
  {
    "name" => ["Cady", "Cady", "Karen", "Karen"],
    "subject" => ["maths", "physics", "maths", "physics"],
    "test_1" => [98, 99, 61, 58],
    "test_2" => [100, 100, 60, 60]
  }
)
df.lazy.pivot(
  "subject",
  on_columns: ["maths", "physics"],
  index: "name",
  values: "test_1",
  maintain_order: true
).collect
# =>
# shape: (2, 3)
# ┌───────┬───────┬─────────┐
# │ name  ┆ maths ┆ physics │
# │ ---   ┆ ---   ┆ ---     │
# │ str   ┆ i64   ┆ i64     │
# ╞═══════╪═══════╪═════════╡
# │ Cady  ┆ 98    ┆ 99      │
# │ Karen ┆ 61    ┆ 58      │
# └───────┴───────┴─────────┘

Parameters:

  • on (Object)

    The column(s) whose values will be used as the new columns of the output DataFrame.

  • on_columns (Object)

    What value combinations will be considered for the output table.

  • index (Object) (defaults to: nil)

    The column(s) that remain from the input to the output. The output DataFrame will have one row for each unique combination of the index's values. If nil, all remaining columns not specified in on and values will be used. At least one of index and values must be specified.

  • values (Object) (defaults to: nil)

    The existing column(s) of values which will be moved under the new columns from index. If an aggregation is specified, these are the values on which the aggregation will be computed. If nil, all remaining columns not specified in on and index will be used. At least one of index and values must be specified.

  • aggregate_function (Object) (defaults to: nil)

    Choose from:

    • nil: no aggregation takes place, will raise error if multiple values are in group.
    • A predefined aggregate function string, one of {'min', 'max', 'first', 'last', 'sum', 'mean', 'median', 'len', 'item'}
    • An expression to do the aggregation. The expression can only access data from the respective 'values' columns as generated by pivot, through pl.element().
  • maintain_order (Boolean) (defaults to: false)

    Ensure the values of index are sorted by discovery order.

  • separator (String) (defaults to: "_")

    Used as separator/delimiter in generated column names in case of multiple values columns.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4433

def pivot(
  on,
  on_columns:,
  index: nil,
  values: nil,
  aggregate_function: nil,
  maintain_order: false,
  separator: "_"
)
  if index.nil? && values.nil?
    msg = "`pivot` needs either `index` or `values` to be specified"
    raise InvalidOperationError, msg
  end

  on_selector = Utils.parse_list_into_selector(on)
  if !values.nil?
    values_selector = Utils.parse_list_into_selector(values)
  end
  if !index.nil?
    index_selector = Utils.parse_list_into_selector(index)
  end

  if values.nil?
    values_selector = Polars.cs.all - on_selector - index_selector
  end
  if index.nil?
    index_selector = Polars.cs.all - on_selector - values_selector
  end

  agg = F.element
  if aggregate_function.is_a?(::String)
    if aggregate_function == "first"
      agg = agg.first
    elsif aggregate_function == "item"
      agg = agg.item
    elsif aggregate_function == "sum"
      agg = agg.sum
    elsif aggregate_function == "max"
      agg = agg.max
    elsif aggregate_function == "min"
      agg = agg.min
    elsif aggregate_function == "mean"
      agg = agg.mean
    elsif aggregate_function == "median"
      agg = agg.median
    elsif aggregate_function == "last"
      agg = agg.last
    elsif aggregate_function == "len"
      agg = agg.len
    elsif aggregate_function == "count"
      Utils.issue_deprecation_warning(
        "`aggregate_function='count'` input for `pivot` is deprecated." +
        " Please use `aggregate_function='len'`."
      )
      agg = agg.len
    else
      msg = "invalid input for `aggregate_function` argument: #{aggregate_function.inspect}"
      raise ArgumentError, msg
    end
  elsif aggregate_function.nil?
    agg = agg.item(allow_empty: true)
  else
    agg = aggregate_function
  end

  if on_columns.is_a?(DataFrame)
    on_cols = on_columns
  elsif on_columns.is_a?(Series)
    on_cols = on_columns.to_frame
  else
    on_cols = Series.new(on_columns).to_frame
  end

  _from_rbldf(
    _ldf.pivot(
      on_selector._rbselector,
      on_cols._df,
      index_selector._rbselector,
      values_selector._rbselector,
      agg._rbexpr,
      maintain_order,
      separator
    )
  )
end

#profile(engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Array

Profile a LazyFrame.

This will run the query and return a tuple containing the materialized DataFrame and a DataFrame that contains profiling information of each node that is executed.

The units of the timings are microseconds.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
)
lf.group_by("a", maintain_order: true).agg(Polars.all.sum).sort(
  "a"
).profile
# =>
# [shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ a   ┆ 4   ┆ 10  │
# │ b   ┆ 11  ┆ 10  │
# │ c   ┆ 6   ┆ 1   │
# └─────┴─────┴─────┘,
# shape: (3, 3)
# ┌──────────────┬───────┬─────┐
# │ node         ┆ start ┆ end │
# │ ---          ┆ ---   ┆ --- │
# │ str          ┆ u64   ┆ u64 │
# ╞══════════════╪═══════╪═════╡
# │ optimization ┆ 0     ┆ 67  │
# │ sort(a)      ┆ 67    ┆ 79  │
# └──────────────┴───────┴─────┘]

Parameters:

  • engine (String) (defaults to: "auto")

    Select the engine used to process the query (optional). If set to "auto" (the default), the query is run using the Polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the Polars in-memory engine.

  • optimizations (Object) (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

Returns:

  • (Array)

# File 'lib/polars/lazy_frame.rb', line 911

def profile(
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  ldf = _ldf.with_optimizations(optimizations._rboptflags)

  df_rb, timings_rb = ldf.profile
  [Utils.wrap_df(df_rb), Utils.wrap_df(timings_rb)]
end

#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame

Aggregate the columns in the DataFrame to their quantile value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.quantile(0.7).collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ f64 ┆ f64 │
# ╞═════╪═════╡
# │ 3.0 ┆ 1.0 │
# └─────┴─────┘

Parameters:

  • quantile (Float)

    Quantile between 0.0 and 1.0.

  • interpolation ("nearest", "higher", "lower", "midpoint", "linear") (defaults to: "nearest")

    Interpolation method.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 4148

def quantile(quantile, interpolation: "nearest")
  quantile = Utils.parse_into_expression(quantile, str_as_lit: false)
  _from_rbldf(_ldf.quantile(quantile, interpolation))
end

#remove(*predicates, **constraints) ⇒ LazyFrame

Remove rows, dropping those that match the given predicate expression(s).

The original order of the remaining rows is preserved.

Rows where the filter predicate does not evaluate to true are retained (this includes rows where the predicate evaluates as null).

Examples:

Remove rows matching a condition:

lf = Polars::LazyFrame.new(
  {
    "foo" => [2, 3, nil, 4, 0],
    "bar" => [5, 6, nil, nil, 0],
    "ham" => ["a", "b", nil, "c", "d"]
  }
)
lf.remove(
  Polars.col("bar") >= 5
).collect
# =>
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# │ 0    ┆ 0    ┆ d    │
# └──────┴──────┴──────┘

Discard rows based on multiple conditions, combined with and/or operators:

lf.remove(
  (Polars.col("foo") >= 0) & (Polars.col("bar") >= 0)
).collect
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# └──────┴──────┴──────┘
lf.remove(
  (Polars.col("foo") >= 0) | (Polars.col("bar") >= 0)
).collect
# =>
# shape: (1, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Provide multiple constraints using *args syntax:

lf.remove(
  Polars.col("ham").is_not_null,
  Polars.col("bar") >= 0
).collect
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# └──────┴──────┴──────┘

Provide constraint(s) using **kwargs syntax:

lf.remove(foo: 0, bar: 0).collect
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ 2    ┆ 5    ┆ a    │
# │ 3    ┆ 6    ┆ b    │
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# └──────┴──────┴──────┘

Remove rows by comparing two columns against each other; in this case, we remove rows where the two columns are not equal (using ne_missing to ensure that null values compare equal):

lf.remove(
  Polars.col("foo").ne_missing(Polars.col("bar"))
).collect
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 0    ┆ 0    ┆ d    │
# └──────┴──────┴──────┘

Parameters:

  • predicates (Array)

    Expressions that evaluate to a boolean Series.

  • constraints (Hash)

    Column filters; use name: value to filter columns using the supplied value. Each constraint behaves the same as Polars.col(name).eq(value), and is implicitly joined with the other filter conditions using &.

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2023

def remove(
  *predicates,
  **constraints
)
  if constraints.empty?
    # early-exit conditions (exclude/include all rows)
    if predicates.empty? || (predicates.length == 1 && predicates[0].is_a?(TrueClass))
      return clear
    end
    if predicates.length == 1 && predicates[0].is_a?(FalseClass)
      return dup
    end
  end

  _filter(
    predicates: predicates,
    constraints: constraints,
    invert: true
  )
end

#rename(mapping, strict: true) ⇒ LazyFrame

Rename column names.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
lf.rename({"foo" => "apple"}).collect
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ apple ┆ bar ┆ ham │
# │ ---   ┆ --- ┆ --- │
# │ i64   ┆ i64 ┆ str │
# ╞═══════╪═════╪═════╡
# │ 1     ┆ 6   ┆ a   │
# │ 2     ┆ 7   ┆ b   │
# │ 3     ┆ 8   ┆ c   │
# └───────┴─────┴─────┘

Parameters:

  • mapping (Hash)

    Key value pairs that map from old name to new name, or a function that maps an old name to a new name.

  • strict (Boolean) (defaults to: true)

    Validate that all column names exist in the current schema, and throw an exception if any do not. (Note that this parameter is a no-op when passing a function to mapping).

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3414

def rename(mapping, strict: true)
  if mapping.respond_to?(:call)
    select(F.all.name.map(&mapping))
  else
    existing = mapping.keys
    _new = mapping.values
    _from_rbldf(_ldf.rename(existing, _new, strict))
  end
end

#reverseLazyFrame

Reverse the DataFrame.

Examples:

lf = Polars::LazyFrame.new(
  {
    "key" => ["a", "b", "c"],
    "val" => [1, 2, 3]
  }
)
lf.reverse.collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ key ┆ val │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ c   ┆ 3   │
# │ b   ┆ 2   │
# │ a   ┆ 1   │
# └─────┴─────┘

Returns:

  • (LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 3447

def reverse
  _from_rbldf(_ldf.reverse)
end

#rolling(index_column:, period:, offset: nil, closed: "right", group_by: nil) ⇒ LazyGroupBy

Create rolling groups based on a time column.

Different from group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals use group_by_dynamic.

The period and offset arguments are created either from a timedelta, or by using the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a rolling group by on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10

Examples:

dates = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
]
df = Polars::LazyFrame.new({"dt" => dates, "a" => [3, 7, 5, 9, 2, 1]}).with_columns(
  Polars.col("dt").str.strptime(Polars::Datetime).set_sorted
)
df.rolling(index_column: "dt", period: "2d").agg(
  [
    Polars.sum("a").alias("sum_a"),
    Polars.min("a").alias("min_a"),
    Polars.max("a").alias("max_a")
  ]
).collect
# =>
# shape: (6, 4)
# ┌─────────────────────┬───────┬───────┬───────┐
# │ dt                  ┆ sum_a ┆ min_a ┆ max_a │
# │ ---                 ┆ ---   ┆ ---   ┆ ---   │
# │ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
# ╞═════════════════════╪═══════╪═══════╪═══════╡
# │ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
# │ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
# │ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
# │ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
# │ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
# │ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
# └─────────────────────┴───────┴───────┴───────┘

Parameters:

  • index_column (Object)

    Column used to group based on the time window. Often to type Date/Datetime This column must be sorted in ascending order. If not the output will not make sense.

    In case of a rolling group by on indices, dtype needs to be one of {UInt32, UInt64, Int32, Int64}. Note that the first three get temporarily cast to Int64, so if performance matters use an Int64 column.

  • period (Object)

    Length of the window.

  • offset (Object) (defaults to: nil)

    Offset of the window. Default is -period.

  • closed ("right", "left", "both", "none") (defaults to: "right")

    Define whether the temporal window interval is closed or not.

  • group_by (Object) (defaults to: nil)

    Also group by this column/these columns.

Returns:



# File 'lib/polars/lazy_frame.rb', line 2284

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  group_by: nil
)
  index_column = Utils.parse_into_expression(index_column)
  if offset.nil?
    offset = Utils.negate_duration_string(Utils.parse_as_duration_string(period))
  end

  rbexprs_by = (
    !group_by.nil? ? Utils.parse_into_list_of_expressions(group_by) : []
  )
  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)

  lgb = _ldf.rolling(index_column, period, offset, closed, rbexprs_by)
  LazyGroupBy.new(lgb)
end

#schema ⇒ Hash

Get the schema.

Examples:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.schema
# => {"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String}

Returns:

  • (Hash)


# File 'lib/polars/lazy_frame.rb', line 148

def schema
  _ldf.collect_schema
end

#select(*exprs, **named_exprs) ⇒ LazyFrame

Select columns from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"],
  }
).lazy
df.select("foo").collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘
df.select(["foo", "bar"]).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 6   │
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# └─────┴─────┘
df.select(Polars.col("foo") + 1).collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 2   │
# │ 3   │
# │ 4   │
# └─────┘
df.select([Polars.col("foo") + 1, Polars.col("bar") + 1]).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# │ 4   ┆ 9   │
# └─────┴─────┘
df.select(Polars.when(Polars.col("foo") > 2).then(10).otherwise(0)).collect
# =>
# shape: (3, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ i32     │
# ╞═════════╡
# │ 0       │
# │ 0       │
# │ 10      │
# └─────────┘

Parameters:

  • exprs (Array)

    Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:



# File 'lib/polars/lazy_frame.rb', line 2132

def select(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.select(rbexprs))
end

#select_seq(*exprs, **named_exprs) ⇒ LazyFrame

Select columns from this LazyFrame.

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

  • exprs (Array)

    Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:



# File 'lib/polars/lazy_frame.rb', line 2155

def select_seq(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0

  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.select_seq(rbexprs))
end

#serialize(file = nil, format: "binary") ⇒ Object

Note:

Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.

Serialize the logical plan of this LazyFrame to a file or string.

Examples:

Serialize the logical plan into a binary representation.

lf = Polars::LazyFrame.new({"a" => [1, 2, 3]}).sum
bytes = lf.serialize
Polars::LazyFrame.deserialize(StringIO.new(bytes)).collect
# =>
# shape: (1, 1)
# ┌─────┐
# │ a   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 6   │
# └─────┘

Parameters:

  • file (Object) (defaults to: nil)

    File path to which the result should be written. If set to nil (default), the output is returned as a string instead.

  • format ('binary', 'json') (defaults to: "binary")

    The format in which to serialize. Options:

    • "binary": Serialize to binary format (bytes). This is the default.
    • "json": Serialize to JSON format (string) (deprecated).

Returns:



# File 'lib/polars/lazy_frame.rb', line 214

def serialize(file = nil, format: "binary")
  if format == "binary"
    raise Todo unless _ldf.respond_to?(:serialize_binary)
    serializer = _ldf.method(:serialize_binary)
  elsif format == "json"
    msg = "'json' serialization format of LazyFrame is deprecated"
    warn msg
    serializer = _ldf.method(:serialize_json)
  else
    msg = "`format` must be one of {'binary', 'json'}, got #{format.inspect}"
    raise ArgumentError, msg
  end

  Utils.serialize_polars_object(serializer, file)
end

#set_sorted(column, *more_columns, descending: false, nulls_last: false) ⇒ LazyFrame

Note:

This can lead to incorrect results if the data is NOT sorted! Use with care!

Flag a column as sorted.

This can speed up future operations.

Parameters:

  • column (Object)

    Column that is sorted.

  • more_columns (Array)

    Columns that are sorted over after column.

  • descending (Boolean) (defaults to: false)

    Whether the column is sorted in descending order.

  • nulls_last (Boolean) (defaults to: false)

    Whether the nulls are at the end.

Returns:



# File 'lib/polars/lazy_frame.rb', line 4844

def set_sorted(
  column,
  *more_columns,
  descending: false,
  nulls_last: false
)
  if !Utils.strlike?(column)
    msg = "expected a 'str' for argument 'column' in 'set_sorted'"
    raise TypeError, msg
  end

  if Utils.bool?(descending)
    ds = [descending]
  else
    ds = descending
  end
  if Utils.bool?(nulls_last)
    nl = [nulls_last]
  else
    nl = nulls_last
  end

  _from_rbldf(
    _ldf.hint_sorted(
      [column] + more_columns, ds, nl
    )
  )
end

#shift(n = 1, fill_value: nil) ⇒ LazyFrame

Shift the values by a given period.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.shift(1).collect
# =>
# shape: (3, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ i64  ┆ i64  │
# ╞══════╪══════╡
# │ null ┆ null │
# │ 1    ┆ 2    │
# │ 3    ┆ 4    │
# └──────┴──────┘
df.shift(-1).collect
# =>
# shape: (3, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ i64  ┆ i64  │
# ╞══════╪══════╡
# │ 3    ┆ 4    │
# │ 5    ┆ 6    │
# │ null ┆ null │
# └──────┴──────┘

Parameters:

  • n (Integer) (defaults to: 1)

    Number of places to shift (may be negative).

  • fill_value (Object) (defaults to: nil)

    Fill the resulting null values with this value.

Returns:



# File 'lib/polars/lazy_frame.rb', line 3493

def shift(n = 1, fill_value: nil)
  if !fill_value.nil?
    fill_value = Utils.parse_into_expression(fill_value, str_as_lit: true)
  end
  n = Utils.parse_into_expression(n)
  _from_rbldf(_ldf.shift(n, fill_value))
end

#show_graph(optimized: true, show: true, output_path: nil, raw_output: false, figsize: [16.0, 12.0], engine: "auto", plan_stage: "ir", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ Object

Show a plot of the query plan.

Note that Graphviz must be installed to render the visualization (if not already present, you can download it here: https://graphviz.org/download).

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
)
lf.group_by("a", maintain_order: true).agg(Polars.all.sum).sort(
  "a"
).show_graph

Parameters:

  • optimized (Boolean) (defaults to: true)

    Optimize the query plan.

  • show (Boolean) (defaults to: true)

    Show the figure.

  • output_path (String) (defaults to: nil)

    Write the figure to disk.

  • raw_output (Boolean) (defaults to: false)

    Return dot syntax. This cannot be combined with show and/or output_path.

  • engine (String) (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars in-memory engine.

  • plan_stage ('ir', 'physical') (defaults to: "ir")

    Select the stage to display. Currently only the streaming engine has a separate physical stage, for the other engines both IR and physical are the same.

  • optimizations (Object) (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The set of the optimizations considered during query optimization.

Returns:



# File 'lib/polars/lazy_frame.rb', line 559

def show_graph(
  optimized: true,
  show: true,
  output_path: nil,
  raw_output: false,
  figsize: [16.0, 12.0],
  engine: "auto",
  plan_stage: "ir",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  if engine == "streaming"
    issue_unstable_warning("streaming mode is considered unstable.")
  end

  optimizations = optimizations.dup
  optimizations._rboptflags.streaming = engine == "streaming"
  _ldf = self._ldf.with_optimizations(optimizations._rboptflags)

  if plan_stage == "ir"
    dot = _ldf.to_dot(optimized)
  elsif plan_stage == "physical"
    if engine == "streaming"
      dot = _ldf.to_dot_streaming_phys(optimized)
    else
      dot = _ldf.to_dot(optimized)
    end
  else
    error_msg = "invalid plan stage '#{plan_stage}'"
    raise TypeError, error_msg
  end

  Utils.display_dot_graph(
    dot: dot,
    show: show,
    output_path: output_path,
    raw_output: raw_output
  )
end

#sink_csv(path, include_bom: false, compression: "uncompressed", compression_level: nil, check_extension: true, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame

Evaluate the query in streaming mode and write to a CSV file.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_csv("out.csv")

Parameters:

  • path (String)

    File path to which the file should be written.

  • include_bom (Boolean) (defaults to: false)

    Whether to include UTF-8 BOM in the CSV output.

  • include_header (Boolean) (defaults to: true)

    Whether to include header in the CSV output.

  • separator (String) (defaults to: ",")

    Separate CSV fields with this symbol.

  • line_terminator (String) (defaults to: "\n")

    String used to end each row.

  • quote_char (String) (defaults to: '"')

    Byte to use as quoting character.

  • batch_size (Integer) (defaults to: 1024)

    Number of rows that will be processed per thread.

  • datetime_format (String) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate (https://docs.rs/chrono/latest/chrono/format/strftime/index.html). If no format is specified, the default fractional-second precision is inferred from the maximum time unit found in the frame's Datetime columns (if any).

  • date_format (String) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate (https://docs.rs/chrono/latest/chrono/format/strftime/index.html).

  • time_format (String) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate (https://docs.rs/chrono/latest/chrono/format/strftime/index.html).

  • float_scientific (Boolean) (defaults to: nil)

    Whether to use scientific form always (true), never (false), or automatically (nil) for Float32 and Float64 datatypes.

  • float_precision (Integer) (defaults to: nil)

    Number of decimal places to write, applied to both Float32 and Float64 datatypes.

  • decimal_comma (Boolean) (defaults to: false)

    Use a comma as the decimal separator instead of a point. Floats will be encapsulated in quotes if necessary; set the field separator to override.

  • null_value (String) (defaults to: nil)

    A string representing null values (defaulting to the empty string).

  • quote_style ("necessary", "always", "non_numeric", "never") (defaults to: nil)

    Determines the quoting strategy used.

    • necessary (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter, or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
    • always: This puts quotes around every field. Always.
    • never: This never puts quotes around fields, even if that results in invalid CSV data (e.g. by not quoting strings containing the separator).
    • non_numeric: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, quotes will be used even if they aren't strictly necessary.
  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • storage_options (Object) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

  • retries (Integer) (defaults to: nil)

    Number of retries if accessing a cloud instance fails.

  • sync_on_close ('data', 'all') (defaults to: nil)

    Sync to disk before closing a file.

    • nil does not sync.
    • data syncs the file contents.
    • all syncs the file contents and metadata.
  • mkdir (Boolean) (defaults to: false)

    Recursively create all the directories in the path.

  • lazy (Boolean) (defaults to: false)

    Wait to start execution until collect is called.

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.

Returns:



# File 'lib/polars/lazy_frame.rb', line 1398

def sink_csv(
  path,
  include_bom: false,
  compression: "uncompressed",
  compression_level: nil,
  check_extension: true,
  include_header: true,
  separator: ",",
  line_terminator: "\n",
  quote_char: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_scientific: nil,
  float_precision: nil,
  decimal_comma: false,
  null_value: nil,
  quote_style: nil,
  maintain_order: true,
  storage_options: nil,
  credential_provider: "auto",
  retries: nil,
  sync_on_close: nil,
  mkdir: false,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  Utils._check_arg_is_1byte("separator", separator, false)
  Utils._check_arg_is_1byte("quote_char", quote_char, false)
  engine = _select_engine(engine)

  _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder)

  credential_provider_builder = _init_credential_provider_builder.(
    credential_provider, path, storage_options, "sink_csv"
  )

  target = _to_sink_target(path)

  if !retries.nil?
    msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["max_retries"] = retries
  end

  sink_options = SinkOptions.new(
    mkdir: mkdir,
    maintain_order: maintain_order,
    sync_on_close: sync_on_close,
    storage_options: storage_options,
    credential_provider: credential_provider_builder
  )

  ldf_rb = _ldf.sink_csv(
    target,
    sink_options,
    include_bom,
    compression,
    compression_level,
    check_extension,
    include_header,
    separator.ord,
    line_terminator,
    quote_char.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_scientific,
    float_precision,
    decimal_comma,
    null_value,
    quote_style
  )

  if !lazy
    ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags)
    ldf = LazyFrame._from_rbldf(ldf_rb)
    ldf.collect(engine: engine)
    return nil
  end
  LazyFrame._from_rbldf(ldf_rb)
end

#sink_ipc(path, compression: "uncompressed", compat_level: nil, record_batch_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS, _record_batch_statistics: false) ⇒ DataFrame

Evaluate the query in streaming mode and write to an IPC file.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_ipc("out.arrow")

Parameters:

  • path (String)

    File path to which the file should be written.

  • compression ("lz4", "zstd") (defaults to: "uncompressed")

    Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • storage_options (String) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

    The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

    • aws
    • gcp
    • azure
    • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

    If storage_options is not provided, Polars will try to infer the information from environment variables.

  • retries (Integer) (defaults to: nil)

    Number of retries if accessing a cloud instance fails.

  • sync_on_close ('data', 'all') (defaults to: nil)

    Sync to disk before closing a file.

    • nil does not sync.
    • data syncs the file contents.
    • all syncs the file contents and metadata.
  • mkdir (Boolean) (defaults to: false)

    Recursively create all the directories in the path.

  • lazy (Boolean) (defaults to: false)

    Wait to start execution until collect is called.

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.

Returns:



# File 'lib/polars/lazy_frame.rb', line 1233

def sink_ipc(
  path,
  compression: "uncompressed",
  compat_level: nil,
  record_batch_size: nil,
  maintain_order: true,
  storage_options: nil,
  credential_provider: "auto",
  retries: nil,
  sync_on_close: nil,
  mkdir: false,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS,
  _record_batch_statistics: false
)
  engine = _select_engine(engine)

  _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder)

  if !retries.nil?
    msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["max_retries"] = retries
  end

  credential_provider_builder = _init_credential_provider_builder.(
    credential_provider, path, storage_options, "sink_ipc"
  )

  target = _to_sink_target(path)

  compat_level_rb = nil
  if compat_level.nil?
    compat_level_rb = true
  else
    raise Todo
  end

  if compression.nil?
    compression = "uncompressed"
  end

  sink_options = SinkOptions.new(
    mkdir: mkdir,
    maintain_order: maintain_order,
    sync_on_close: sync_on_close,
    storage_options: storage_options,
    credential_provider: credential_provider_builder
  )

  ldf_rb = _ldf.sink_ipc(
    target,
    sink_options,
    compression,
    compat_level_rb,
    record_batch_size,
    _record_batch_statistics
  )

  if !lazy
    ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags)
    ldf = LazyFrame._from_rbldf(ldf_rb)
    ldf.collect(engine: engine)
    return nil
  end
  LazyFrame._from_rbldf(ldf_rb)
end

#sink_ndjson(path, compression: "uncompressed", compression_level: nil, check_extension: true, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, mkdir: false, lazy: false, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame

Evaluate the query in streaming mode and write to an NDJSON file.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_ndjson("out.ndjson")

Parameters:

  • path (String)

    File path to which the file should be written.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • storage_options (String) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

    The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

    • aws
    • gcp
    • azure
    • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

    If storage_options is not provided, Polars will try to infer the information from environment variables.

  • retries (Integer) (defaults to: nil)

    Number of retries if accessing a cloud instance fails.

  • sync_on_close ('data', 'all') (defaults to: nil)

    Sync to disk before closing a file.

    • nil does not sync.
    • data syncs the file contents.
    • all syncs the file contents and metadata.
  • mkdir (Boolean) (defaults to: false)

    Recursively create all the directories in the path.

  • lazy (Boolean) (defaults to: false)

    Wait to start execution until collect is called.

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.

Returns:



# File 'lib/polars/lazy_frame.rb', line 1537

def sink_ndjson(
  path,
  compression: "uncompressed",
  compression_level: nil,
  check_extension: true,
  maintain_order: true,
  storage_options: nil,
  credential_provider: "auto",
  retries: nil,
  sync_on_close: nil,
  mkdir: false,
  lazy: false,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  if !retries.nil?
    msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["max_retries"] = retries
  end

  _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder)

  credential_provider_builder = _init_credential_provider_builder.(
    credential_provider, path, storage_options, "sink_ndjson"
  )

  target = _to_sink_target(path)

  sink_options = SinkOptions.new(
    mkdir: mkdir,
    maintain_order: maintain_order,
    sync_on_close: sync_on_close,
    storage_options: storage_options,
    credential_provider: credential_provider_builder
  )

  ldf_rb = _ldf.sink_ndjson(
    target,
    compression,
    compression_level,
    check_extension,
    sink_options
  )

  if !lazy
    ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags)
    ldf = LazyFrame._from_rbldf(ldf_rb)
    ldf.collect(engine: engine)
    return nil
  end
  LazyFrame._from_rbldf(ldf_rb)
end

#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, maintain_order: true, storage_options: nil, credential_provider: "auto", retries: nil, sync_on_close: nil, metadata: nil, mkdir: false, lazy: false, arrow_schema: nil, engine: "auto", optimizations: DEFAULT_QUERY_OPT_FLAGS) ⇒ DataFrame

Persists a LazyFrame at the provided path.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_parquet("out.parquet")

Parameters:

  • path (String)

    File path to which the file should be written.

  • compression ("lz4", "uncompressed", "snappy", "gzip", "lzo", "brotli", "zstd") (defaults to: "zstd")

    Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression. Choose "snappy" for more backwards compatibility guarantees when you deal with older parquet readers.

  • compression_level (Integer) (defaults to: nil)

    The level of compression to use. Higher compression means smaller files on disk.

    • "gzip" : min-level: 0, max-level: 10.
    • "brotli" : min-level: 0, max-level: 11.
    • "zstd" : min-level: 1, max-level: 22.
  • statistics (Boolean) (defaults to: true)

    Write statistics to the parquet headers. This requires extra compute.

  • row_group_size (Integer) (defaults to: nil)

    Size of the row groups in number of rows. If nil (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.

  • data_page_size (Integer) (defaults to: nil)

    Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.

  • maintain_order (Boolean) (defaults to: true)

    Maintain the order in which data is processed. Setting this to false will be slightly faster.

  • storage_options (String) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

    The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

    • aws
    • gcp
    • azure
    • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

    If storage_options is not provided, Polars will try to infer the information from environment variables.

  • retries (Integer) (defaults to: nil)

    Number of retries if accessing a cloud instance fails.

  • sync_on_close ('data', 'all') (defaults to: nil)

    Sync to disk when before closing a file.

    • nil does not sync.
    • data syncs the file contents.
    • all syncs the file contents and metadata.
  • mkdir (Boolean) (defaults to: false)

    Recursively create all the directories in the path.

  • lazy (Boolean) (defaults to: false)

    Wait to start execution until collect is called.

  • engine (defaults to: "auto")

    Select the engine used to process the query, optional. At the moment, if set to "auto" (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.

  • optimizations (defaults to: DEFAULT_QUERY_OPT_FLAGS)

    The optimization passes done during query optimization.

    This has no effect if lazy is set to true.

Returns:



# File 'lib/polars/lazy_frame.rb', line 1095

def sink_parquet(
  path,
  compression: "zstd",
  compression_level: nil,
  statistics: true,
  row_group_size: nil,
  data_page_size: nil,
  maintain_order: true,
  storage_options: nil,
  credential_provider: "auto",
  retries: nil,
  sync_on_close: nil,
  metadata: nil,
  mkdir: false,
  lazy: false,
  arrow_schema: nil,
  engine: "auto",
  optimizations: DEFAULT_QUERY_OPT_FLAGS
)
  engine = _select_engine(engine)

  if statistics == true
    statistics = {
      min: true,
      max: true,
      distinct_count: false,
      null_count: true
    }
  elsif statistics == false
    statistics = {}
  elsif statistics == "full"
    statistics = {
      min: true,
      max: true,
      distinct_count: true,
      null_count: true
    }
  end

  _init_credential_provider_builder = Polars.method(:_init_credential_provider_builder)

  if !retries.nil?
    msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["max_retries"] = retries
  end

  credential_provider_builder = _init_credential_provider_builder.(
    credential_provider, path, storage_options, "sink_parquet"
  )

  target = _to_sink_target(path)

  sink_options = SinkOptions.new(
    mkdir: mkdir,
    maintain_order: maintain_order,
    sync_on_close: sync_on_close,
    storage_options: storage_options,
    credential_provider: credential_provider_builder
  )

  ldf_rb = _ldf.sink_parquet(
    target,
    sink_options,
    compression,
    compression_level,
    statistics,
    row_group_size,
    data_page_size,
    metadata,
    arrow_schema
  )

  if !lazy
    ldf_rb = ldf_rb.with_optimizations(optimizations._rboptflags)
    ldf = LazyFrame._from_rbldf(ldf_rb)
    ldf.collect(engine: engine)
    return nil
  end
  LazyFrame._from_rbldf(ldf_rb)
end
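The `statistics` argument accepts `true`, `false`, `"full"`, or a hash of per-statistic flags; the normalization performed in the method body above can be sketched in plain Ruby (the helper name is illustrative, not part of the gem):

```ruby
# Map the user-facing `statistics` argument onto the per-statistic hash
# that the Parquet sink expects, mirroring the branches in sink_parquet.
def normalize_statistics(statistics)
  case statistics
  when true
    {min: true, max: true, distinct_count: false, null_count: true}
  when false
    {}
  when "full"
    {min: true, max: true, distinct_count: true, null_count: true}
  else
    statistics # already a hash of per-statistic flags
  end
end

normalize_statistics(true)    # distinct_count stays off by default
normalize_statistics("full")  # enables every statistic
```

Note that `statistics: true` deliberately leaves `distinct_count` disabled, since computing it is comparatively expensive.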

#slice(offset, length = nil) ⇒ LazyFrame

Get a slice of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
).lazy
df.slice(1, 2).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ y   ┆ 3   ┆ 4   │
# │ z   ┆ 5   ┆ 6   │
# └─────┴─────┴─────┘

Parameters:

  • offset (Integer)

    Start index. Negative indexing is supported.

  • length (Integer) (defaults to: nil)

    Length of the slice. If set to nil, all rows starting at the offset will be selected.

Returns:



# File 'lib/polars/lazy_frame.rb', line 3530

def slice(offset, length = nil)
  if length && length < 0
    raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame"
  end
  _from_rbldf(_ldf.slice(offset, length))
end
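The offset/length semantics, including negative offsets, behave like ordinary index arithmetic on the row count; a plain-Ruby sketch (not using Polars, helper name illustrative) of which row indices a slice selects:

```ruby
# Resolve a (possibly negative) offset and optional length against n rows,
# returning the selected row indices, as in the slice example above.
def resolve_slice(n_rows, offset, length = nil)
  start = offset.negative? ? [n_rows + offset, 0].max : offset
  stop = length.nil? ? n_rows : [start + length, n_rows].min
  (start...stop).to_a
end

resolve_slice(3, 1, 2)  # => [1, 2], matching df.slice(1, 2)
resolve_slice(3, -2)    # => [1, 2], via negative indexing
```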

#sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame

Sort the DataFrame.

Sorting can be done by:

  • A single column name
  • An expression
  • Multiple expressions

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
df.sort("foo", descending: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘

Parameters:

  • by (Object)

    Column (expressions) to sort by.

  • more_by (Array)

    Additional columns to sort by, specified as positional arguments.

  • descending (Boolean) (defaults to: false)

    Sort in descending order.

  • nulls_last (Boolean) (defaults to: false)

    Place null values last. Can only be used if sorted by a single column.

  • maintain_order (Boolean) (defaults to: false)

    Whether the order should be maintained if elements are equal. Note that if true streaming is not possible and performance might be worse since this requires a stable search.

  • multithreaded (Boolean) (defaults to: true)

    Sort using multiple threads.

Returns:



# File 'lib/polars/lazy_frame.rb', line 645

def sort(by, *more_by, descending: false, nulls_last: false, maintain_order: false, multithreaded: true)
  if by.is_a?(::String) && more_by.empty?
    return _from_rbldf(
      _ldf.sort(
        by, descending, nulls_last, maintain_order, multithreaded
      )
    )
  end

  by = Utils.parse_into_list_of_expressions(by, *more_by)
  descending = Utils.extend_bool(descending, by.length, "descending", "by")
  nulls_last = Utils.extend_bool(nulls_last, by.length, "nulls_last", "by")
  _from_rbldf(
    _ldf.sort_by_exprs(
      by, descending, nulls_last, maintain_order, multithreaded
    )
  )
end
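When sorting by several columns, a single boolean for `descending` or `nulls_last` is broadcast to one flag per sort key, while a mismatched per-column array is rejected. A hedged plain-Ruby sketch of that broadcasting (the real helper is `Utils.extend_bool`, whose internals are not shown here; the sketch name is illustrative):

```ruby
# Broadcast a boolean (or validate a per-column array) to match the
# number of sort keys, as sort does for descending/nulls_last.
def extend_bool_sketch(value, n_keys, name)
  flags = value.is_a?(Array) ? value : Array.new(n_keys, value)
  if flags.length != n_keys
    raise ArgumentError,
      "the length of `#{name}` (#{flags.length}) does not match the number of sort keys (#{n_keys})"
  end
  flags
end

extend_bool_sketch(true, 3, "descending")          # => [true, true, true]
extend_bool_sketch([true, false], 2, "descending") # => [true, false]
```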

#sql(query, table_name: "self") ⇒ Expr

Note:

This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.

Note:
  • The calling frame is automatically registered as a table in the SQL context under the name "self". If you want access to the DataFrames and LazyFrames found in the current globals, use the top-level Polars.sql.
  • More control over registration and execution behaviour is available by using the SQLContext object.

Execute a SQL query against the LazyFrame.

Examples:

Query the LazyFrame using SQL:

lf1 = Polars::LazyFrame.new({"a" => [1, 2, 3], "b" => [6, 7, 8], "c" => ["z", "y", "x"]})
lf2 = Polars::LazyFrame.new({"a" => [3, 2, 1], "d" => [125, -654, 888]})
lf1.sql("SELECT c, b FROM self WHERE a > 1").collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ c   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ y   ┆ 7   │
# │ x   ┆ 8   │
# └─────┴─────┘

Apply SQL transforms (aliasing "self" to "frame") then filter natively (you can freely mix SQL and native operations):

lf1.sql(
  "
    SELECT
        a,
        (a % 2 == 0) AS a_is_even,
        (b::float4 / 2) AS \"b/2\",
        CONCAT_WS(':', c, c, c) AS c_c_c
    FROM frame
    ORDER BY a
  ",
  table_name: "frame",
).filter(~Polars.col("c_c_c").str.starts_with("x")).collect
# =>
# shape: (2, 4)
# ┌─────┬───────────┬─────┬───────┐
# │ a   ┆ a_is_even ┆ b/2 ┆ c_c_c │
# │ --- ┆ ---       ┆ --- ┆ ---   │
# │ i64 ┆ bool      ┆ f32 ┆ str   │
# ╞═════╪═══════════╪═════╪═══════╡
# │ 1   ┆ false     ┆ 3.0 ┆ z:z:z │
# │ 2   ┆ true      ┆ 3.5 ┆ y:y:y │
# └─────┴───────────┴─────┴───────┘

Parameters:

  • query (String)

    SQL query to execute.

  • table_name (String) (defaults to: "self")

    Optionally provide an explicit name for the table that represents the calling frame (defaults to "self").

Returns:



# File 'lib/polars/lazy_frame.rb', line 724

def sql(query, table_name: "self")
  ctx = Polars::SQLContext.new
  name = table_name || "self"
  ctx.register(name, self)
  ctx.execute(query)
end

#std(ddof: 1) ⇒ LazyFrame

Aggregate the columns in the DataFrame to their standard deviation value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.std.collect
# =>
# shape: (1, 2)
# ┌──────────┬─────┐
# │ a        ┆ b   │
# │ ---      ┆ --- │
# │ f64      ┆ f64 │
# ╞══════════╪═════╡
# │ 1.290994 ┆ 0.5 │
# └──────────┴─────┘
df.std(ddof: 0).collect
# =>
# shape: (1, 2)
# ┌──────────┬──────────┐
# │ a        ┆ b        │
# │ ---      ┆ ---      │
# │ f64      ┆ f64      │
# ╞══════════╪══════════╡
# │ 1.118034 ┆ 0.433013 │
# └──────────┴──────────┘

Returns:



# File 'lib/polars/lazy_frame.rb', line 3965

def std(ddof: 1)
  _from_rbldf(_ldf.std(ddof))
end
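The `ddof` parameter sets the divisor of the squared deviations to `n - ddof`; the two results shown in the example can be reproduced with plain-Ruby arithmetic (helper name illustrative):

```ruby
# Standard deviation of a sample with divisor n - ddof,
# reproducing the std example for column "a" = [1, 2, 3, 4].
def std_sketch(values, ddof)
  n = values.length.to_f
  mean = values.sum / n
  sum_sq = values.sum { |v| (v - mean)**2 }
  Math.sqrt(sum_sq / (n - ddof))
end

std_sketch([1, 2, 3, 4], 1).round(6) # => 1.290994, matching df.std
std_sketch([1, 2, 3, 4], 0).round(6) # => 1.118034, matching ddof: 0
```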

#sumLazyFrame

Aggregate the columns in the DataFrame to their sum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.sum.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 10  ┆ 5   │
# └─────┴─────┘

Returns:



# File 'lib/polars/lazy_frame.rb', line 4057

def sum
  _from_rbldf(_ldf.sum)
end

#tail(n = 5) ⇒ LazyFrame

Get the last n rows.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => [1, 2, 3, 4, 5, 6],
    "b" => [7, 8, 9, 10, 11, 12]
  }
)
lf.tail.collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 8   │
# │ 3   ┆ 9   │
# │ 4   ┆ 10  │
# │ 5   ┆ 11  │
# │ 6   ┆ 12  │
# └─────┴─────┘
lf.tail(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 5   ┆ 11  │
# │ 6   ┆ 12  │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows.

Returns:



# File 'lib/polars/lazy_frame.rb', line 3670

def tail(n = 5)
  _from_rbldf(_ldf.tail(n))
end

#to_sString

Returns a string representing the LazyFrame.

Returns:



# File 'lib/polars/lazy_frame.rb', line 176

def to_s
  <<~EOS
    naive plan: (run LazyFrame#explain(optimized: true) to see the optimized plan)

    #{explain(optimized: false)}
  EOS
end

#top_k(k, by:, reverse: false) ⇒ LazyFrame

Return the k largest rows.

Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you wish the output to be sorted.

Examples:

Get the rows which contain the 4 largest values in column b.

lf = Polars::LazyFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [2, 1, 1, 3, 2, 1]
  }
)
lf.top_k(4, by: "b").collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b   ┆ 3   │
# │ a   ┆ 2   │
# │ b   ┆ 2   │
# │ b   ┆ 1   │
# └─────┴─────┘

Get the rows which contain the 4 largest values when sorting on column b and a.

lf.top_k(4, by: ["b", "a"]).collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b   ┆ 3   │
# │ b   ┆ 2   │
# │ a   ┆ 2   │
# │ c   ┆ 1   │
# └─────┴─────┘

Parameters:

  • k (Integer)

    Number of rows to return.

  • by (Object)

    Column(s) used to determine the top rows. Accepts expression input. Strings are parsed as column names.

  • reverse (Object) (defaults to: false)

    Consider the k smallest elements of the by column(s) (instead of the k largest). This can be specified per column by passing an array of booleans.

Returns:



# File 'lib/polars/lazy_frame.rb', line 785

def top_k(
  k,
  by:,
  reverse: false
)
  by = Utils.parse_into_list_of_expressions(by)
  reverse = Utils.extend_bool(reverse, by.length, "reverse", "by")
  _from_rbldf(_ldf.top_k(k, by, reverse))
end
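The null-preference rule (non-null always beats null, regardless of `reverse`) can be sketched in plain Ruby by partitioning before selecting; this is an illustration of the semantics, not the engine's algorithm, and the helper name is made up:

```ruby
# Take the k "largest" values, always preferring non-nil over nil,
# even when reverse selects the k smallest.
def top_k_sketch(values, k, reverse: false)
  non_null, nulls = values.partition { |v| !v.nil? }
  ordered = reverse ? non_null.sort : non_null.sort.reverse
  (ordered + nulls).first(k)
end

top_k_sketch([2, 1, nil, 3, 2], 4)             # => [3, 2, 2, 1]
top_k_sketch([2, 1, nil, 3], 4, reverse: true) # => [1, 2, 3, nil]
```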

#unique(maintain_order: false, subset: nil, keep: "any") ⇒ LazyFrame

Drop duplicate rows from this DataFrame.

Note that this fails if there is a column of type List in the DataFrame or subset.

Examples:

lf = Polars::LazyFrame.new(
  {
    "foo" => [1, 2, 3, 1],
    "bar" => ["a", "a", "a", "a"],
    "ham" => ["b", "b", "b", "b"]
  }
)
lf.unique(maintain_order: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ a   ┆ b   │
# │ 2   ┆ a   ┆ b   │
# │ 3   ┆ a   ┆ b   │
# └─────┴─────┴─────┘
lf.unique(subset: ["bar", "ham"], maintain_order: true).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ a   ┆ b   │
# └─────┴─────┴─────┘
lf.unique(keep: "last", maintain_order: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 2   ┆ a   ┆ b   │
# │ 3   ┆ a   ┆ b   │
# │ 1   ┆ a   ┆ b   │
# └─────┴─────┴─────┘

Parameters:

  • maintain_order (Boolean) (defaults to: false)

    Keep the same order as the original DataFrame. This requires more work to compute.

  • subset (Object) (defaults to: nil)

    Subset to use to compare rows.

  • keep ("first", "last", "any") (defaults to: "any")

    Which of the duplicate rows to keep.

Returns:



# File 'lib/polars/lazy_frame.rb', line 4264

def unique(maintain_order: false, subset: nil, keep: "any")
  parsed_subset = nil
  if !subset.nil?
    parsed_subset = Utils.parse_into_list_of_expressions(subset, __require_selectors: true)
  end
  _from_rbldf(_ldf.unique(maintain_order, parsed_subset, keep))
end
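The `keep` strategies can be sketched over plain-Ruby rows by grouping on the subset key; this illustrates the semantics of the examples above (it is not the engine's implementation, and the helper name is made up):

```ruby
# Deduplicate rows on `subset` keys, keeping the first or last
# occurrence of each duplicate group (order-preserving).
def unique_sketch(rows, subset:, keep:)
  groups = rows.group_by { |row| row.values_at(*subset) }
  groups.values.map { |dupes| keep == "last" ? dupes.last : dupes.first }
end

rows = [
  {"foo" => 1, "bar" => "a"},
  {"foo" => 2, "bar" => "a"},
  {"foo" => 3, "bar" => "a"}
]
unique_sketch(rows, subset: ["bar"], keep: "first") # keeps foo => 1
unique_sketch(rows, subset: ["bar"], keep: "last")  # keeps foo => 3
```

With `keep: "any"` the engine is free to keep whichever duplicate is cheapest, which is why `maintain_order: true` is needed in the examples to get deterministic output.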

#unnest(columns, *more_columns, separator: nil) ⇒ LazyFrame

Decompose a struct into its fields.

The fields will be inserted into the DataFrame on the location of the struct type.

Examples:

df = (
  Polars::DataFrame.new(
    {
      "before" => ["foo", "bar"],
      "t_a" => [1, 2],
      "t_b" => ["a", "b"],
      "t_c" => [true, nil],
      "t_d" => [[1, 2], [3]],
      "after" => ["baz", "womp"]
    }
  )
  .lazy
  .select(
    ["before", Polars.struct(Polars.col("^t_.$")).alias("t_struct"), "after"]
  )
)
df.collect
# =>
# shape: (2, 3)
# ┌────────┬─────────────────────┬───────┐
# │ before ┆ t_struct            ┆ after │
# │ ---    ┆ ---                 ┆ ---   │
# │ str    ┆ struct[4]           ┆ str   │
# ╞════════╪═════════════════════╪═══════╡
# │ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
# │ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
# └────────┴─────────────────────┴───────┘
df.unnest("t_struct").collect
# =>
# shape: (2, 6)
# ┌────────┬─────┬─────┬──────┬───────────┬───────┐
# │ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
# │ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
# │ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
# ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
# │ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
# │ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
# └────────┴─────┴─────┴──────┴───────────┴───────┘

Parameters:

  • columns (Object)

    Names of the struct columns that will be decomposed into their fields.

  • more_columns (Array)

    Additional columns to unnest, specified as positional arguments.

  • separator (String) (defaults to: nil)

    Rename output column names as combination of the struct column name, name separator and field name.

Returns:



# File 'lib/polars/lazy_frame.rb', line 4778

def unnest(columns, *more_columns, separator: nil)
  subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector(
    more_columns
  )
  _from_rbldf(_ldf.unnest(subset._rbselector, separator))
end

#unpivot(on = nil, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.

Examples:

lf = Polars::LazyFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
)
lf.unpivot(Polars.cs.numeric, index: "a").collect
# =>
# shape: (6, 3)
# ┌─────┬──────────┬───────┐
# │ a   ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ x   ┆ b        ┆ 1     │
# │ y   ┆ b        ┆ 3     │
# │ z   ┆ b        ┆ 5     │
# │ x   ┆ c        ┆ 2     │
# │ y   ┆ c        ┆ 4     │
# │ z   ┆ c        ┆ 6     │
# └─────┴──────────┴───────┘

Parameters:

  • on (Object) (defaults to: nil)

    Column(s) or selector(s) to use as values variables; if on is empty all columns that are not in index will be used.

  • index (Object) (defaults to: nil)

    Column(s) or selector(s) to use as identifier variables.

  • variable_name (String) (defaults to: nil)

    Name to give to the variable column. Defaults to "variable".

  • value_name (String) (defaults to: nil)

    Name to give to the value column. Defaults to "value".

  • streamable (Boolean) (defaults to: true)

    Allow this node to run in the streaming engine. If this runs in streaming, the output of the unpivot operation will not have a stable ordering.

Returns:



# File 'lib/polars/lazy_frame.rb', line 4567

def unpivot(
  on = nil,
  index: nil,
  variable_name: nil,
  value_name: nil,
  streamable: true
)
  if !streamable
    warn "The `streamable` parameter for `LazyFrame.unpivot` is deprecated"
  end

  selector_on = on.nil? ? Selectors.empty : Utils.parse_list_into_selector(on)
  selector_index = index.nil? ? Selectors.empty : Utils.parse_list_into_selector(index)

  _from_rbldf(
    _ldf.unpivot(
      selector_on._rbselector,
      selector_index._rbselector,
      value_name,
      variable_name
    )
  )
end
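The wide-to-long reshape can be sketched in plain Ruby over hashes, reproducing the shape and ordering of the example output (illustrative only; the helper name is made up):

```ruby
# Unpivot the `on` columns into (variable, value) rows,
# carrying the `index` columns along unchanged.
def unpivot_sketch(rows, on:, index:)
  on.flat_map do |col|
    rows.map do |row|
      idx = index.to_h { |i| [i, row[i]] }
      idx.merge("variable" => col, "value" => row[col])
    end
  end
end

rows = [
  {"a" => "x", "b" => 1, "c" => 2},
  {"a" => "y", "b" => 3, "c" => 4}
]
unpivot_sketch(rows, on: ["b", "c"], index: ["a"])
# 4 rows: all of "b" first, then all of "c", as in the example
```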

#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ LazyFrame

Note:

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Note:

This is syntactic sugar for a left/inner join that preserves the order of the left DataFrame by default, with an optional coalesce when include_nulls: false.

Update the values in this LazyFrame with the values in other.

Examples:

Update df values with the non-null values in new_df, by row index:

lf = Polars::LazyFrame.new(
  {
    "A" => [1, 2, 3, 4],
    "B" => [400, 500, 600, 700]
  }
)
new_lf = Polars::LazyFrame.new(
  {
    "B" => [-66, nil, -99],
    "C" => [5, 3, 1]
  }
)
lf.update(new_lf).collect
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ -66 │
# │ 2   ┆ 500 │
# │ 3   ┆ -99 │
# │ 4   ┆ 700 │
# └─────┴─────┘

Update df values with the non-null values in new_df, by row index, but only keeping those rows that are common to both frames:

lf.update(new_lf, how: "inner").collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ -66 │
# │ 2   ┆ 500 │
# │ 3   ┆ -99 │
# └─────┴─────┘

Update df values with the non-null values in new_df, using a full outer join strategy that defines explicit join columns in each frame:

lf.update(new_lf, left_on: ["A"], right_on: ["C"], how: "full").collect
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ -99 │
# │ 2   ┆ 500 │
# │ 3   ┆ 600 │
# │ 4   ┆ 700 │
# │ 5   ┆ -66 │
# └─────┴─────┘

Update df values including null values in new_df, using a full outer join strategy that defines explicit join columns in each frame:

lf.update(
  new_lf, left_on: "A", right_on: "C", how: "full", include_nulls: true
).collect
# =>
# shape: (5, 2)
# ┌─────┬──────┐
# │ A   ┆ B    │
# │ --- ┆ ---  │
# │ i64 ┆ i64  │
# ╞═════╪══════╡
# │ 1   ┆ -99  │
# │ 2   ┆ 500  │
# │ 3   ┆ null │
# │ 4   ┆ 700  │
# │ 5   ┆ -66  │
# └─────┴──────┘

Parameters:

  • other (LazyFrame)

    LazyFrame that will be used to update the values

  • on (Object) (defaults to: nil)

    Column names that will be joined on. If set to nil (default), the implicit row index of each frame is used as a join key.

  • how ('left', 'inner', 'full') (defaults to: "left")
    • 'left' will keep all rows from the left table; rows may be duplicated if multiple rows in the right frame match the left row's key.
    • 'inner' keeps only those rows where the key exists in both frames.
    • 'full' will update existing rows where the key matches while also adding any new rows contained in the given frame.
  • left_on (Object) (defaults to: nil)

    Join column(s) of the left DataFrame.

  • right_on (Object) (defaults to: nil)

    Join column(s) of the right DataFrame.

  • include_nulls (Boolean) (defaults to: false)

    Overwrite values in the left frame with null values from the right frame. If set to false (default), null values in the right frame are ignored.

  • maintain_order ('none', 'left', 'right', 'left_right', 'right_left') (defaults to: "left")

    Which order of rows from the inputs to preserve. See LazyFrame.join for details. Unlike join this function preserves the left order by default.

Returns:



# File 'lib/polars/lazy_frame.rb', line 4983

def update(
  other,
  on: nil,
  how: "left",
  left_on: nil,
  right_on: nil,
  include_nulls: false,
  maintain_order: "left"
)
  Utils.require_same_type(self, other)
  if ["outer", "outer_coalesce"].include?(how)
    how = "full"
  end

  if !["left", "inner", "full"].include?(how)
    msg = "`how` must be one of {'left', 'inner', 'full'}; found #{how.inspect}"
    raise ArgumentError, msg
  end

  slf = self
  row_index_used = false
  if on.nil?
    if left_on.nil? && right_on.nil?
      # no keys provided--use row index
      row_index_used = true
      row_index_name = "__POLARS_ROW_INDEX"
      slf = slf.with_row_index(name: row_index_name)
      other = other.with_row_index(name: row_index_name)
      left_on = right_on = [row_index_name]
    else
      # one of left or right is missing, raise error
      if left_on.nil?
        msg = "missing join columns for left frame"
        raise ArgumentError, msg
      end
      if right_on.nil?
        msg = "missing join columns for right frame"
        raise ArgumentError, msg
      end
    end
  else
    # move on into left/right_on to simplify logic
    left_on = right_on = on
  end

  if left_on.is_a?(::String)
    left_on = [left_on]
  end
  if right_on.is_a?(::String)
    right_on = [right_on]
  end

  left_schema = slf.collect_schema
  left_on.each do |name|
    if !left_schema.include?(name)
      msg = "left join column #{name.inspect} not found"
      raise ArgumentError, msg
    end
  end
  right_schema = other.collect_schema
  right_on.each do |name|
    if !right_schema.include?(name)
      msg = "right join column #{name.inspect} not found"
      raise ArgumentError, msg
    end
  end

  # no need to join if *only* join columns are in other (inner/left update only)
  if how != "full" && right_schema.length == right_on.length
    if row_index_used
      return slf.drop(row_index_name)
    end
    return slf
  end

  # only use non-idx right columns present in left frame
  right_other = Set.new(right_schema.to_h.keys).intersection(left_schema.to_h.keys) - Set.new(right_on)

  # When include_nulls is true, we need to distinguish records after the join that
  # were originally null in the right frame, as opposed to records that were null
  # because the key was missing from the right frame.
  # Add a validity column to track whether row was matched or not.
  if include_nulls
    validity = ["__POLARS_VALIDITY"]
    other = other.with_columns(F.lit(true).alias(validity[0]))
  else
    validity = []
  end

  tmp_name = "__POLARS_RIGHT"
  drop_columns = right_other.map { |name| "#{name}#{tmp_name}" } + validity
  result = (
    slf.join(
      other.select(*right_on, *right_other, *validity),
      left_on: left_on,
      right_on: right_on,
      how: how,
      suffix: tmp_name,
      coalesce: true,
      maintain_order: maintain_order
    )
    .with_columns(
      right_other.map do |name|
        (
          if include_nulls
            # use left value only when right value failed to join
            F.when(F.col(validity).is_null)
            .then(F.col(name))
            .otherwise(F.col("#{name}#{tmp_name}"))
          else
            F.coalesce(["#{name}#{tmp_name}", F.col(name)])
          end
        ).alias(name)
      end
    )
    .drop(drop_columns)
  )
  if row_index_used
    result = result.drop(row_index_name)
  end

  _from_rbldf(result._ldf)
end
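The default behaviour (row-index left join, coalescing non-null right values over left) can be sketched in plain Ruby on a single column, matching the result of the first example above (helper name illustrative):

```ruby
# Update `left` values positionally with non-nil values from `right`,
# as update does by default (include_nulls: false, keyed by row index).
def update_sketch(left, right)
  left.each_with_index.map do |v, i|
    r = i < right.length ? right[i] : nil
    r.nil? ? v : r
  end
end

update_sketch([400, 500, 600, 700], [-66, nil, -99])
# => [-66, 500, -99, 700], matching lf.update(new_lf) on column B
```

With `include_nulls: true`, the nil at index 1 would overwrite 500 instead of being skipped, which is why the real implementation adds a validity column to tell "right value was null" apart from "key had no match".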

#var(ddof: 1) ⇒ LazyFrame

Aggregate the columns in the DataFrame to their variance value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.var.collect
# =>
# shape: (1, 2)
# ┌──────────┬──────┐
# │ a        ┆ b    │
# │ ---      ┆ ---  │
# │ f64      ┆ f64  │
# ╞══════════╪══════╡
# │ 1.666667 ┆ 0.25 │
# └──────────┴──────┘
df.var(ddof: 0).collect
# =>
# shape: (1, 2)
# ┌──────┬────────┐
# │ a    ┆ b      │
# │ ---  ┆ ---    │
# │ f64  ┆ f64    │
# ╞══════╪════════╡
# │ 1.25 ┆ 0.1875 │
# └──────┴────────┘

Returns:



# File 'lib/polars/lazy_frame.rb', line 3997

def var(ddof: 1)
  _from_rbldf(_ldf.var(ddof))
end

#widthInteger

Get the width of the LazyFrame.

Examples:

lf = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]}).lazy
lf.width
# => 2

Returns:

  • (Integer)


# File 'lib/polars/lazy_frame.rb', line 160

def width
  _ldf.collect_schema.length
end

#with_columns(*exprs, **named_exprs) ⇒ LazyFrame

Add or overwrite multiple columns in a DataFrame.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
).lazy
ldf.with_columns(
  [
    (Polars.col("a") ** 2).alias("a^2"),
    (Polars.col("b") / 2).alias("b/2"),
    (Polars.col("c").is_not).alias("not c")
  ]
).collect
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪═════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴─────┴──────┴───────┘

Parameters:

  • exprs (Object)

    List of Expressions that evaluate to columns.

  • named_exprs (Hash)

    Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:



# File 'lib/polars/lazy_frame.rb', line 3274

def with_columns(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_into_list_of_expressions(*exprs, **named_exprs, __structify: structify)

  _from_rbldf(_ldf.with_columns(rbexprs))
end

#with_columns_seq(*exprs, **named_exprs) ⇒ LazyFrame

Add columns to this LazyFrame.

Added columns will replace existing columns with the same name.

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

  • exprs (Array)

    Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:



# File 'lib/polars/lazy_frame.rb', line 3298

def with_columns_seq(
  *exprs,
  **named_exprs
)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0

  rbexprs = Utils.parse_into_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.with_columns_seq(rbexprs))
end

#with_row_index(name: "index", offset: 0) ⇒ LazyFrame

Note:

This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.

Add a column at index 0 that counts the rows.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.with_row_index.collect
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ index ┆ a   ┆ b   │
# │ ---   ┆ --- ┆ --- │
# │ u32   ┆ i64 ┆ i64 │
# ╞═══════╪═════╪═════╡
# │ 0     ┆ 1   ┆ 2   │
# │ 1     ┆ 3   ┆ 4   │
# │ 2     ┆ 5   ┆ 6   │
# └───────┴─────┴─────┘

Parameters:

  • name (String) (defaults to: "index")

    Name of the column to add.

  • offset (Integer) (defaults to: 0)

    Start the row count at this offset.

Returns:



# File 'lib/polars/lazy_frame.rb', line 3756

def with_row_index(name: "index", offset: 0)
  _from_rbldf(_ldf.with_row_index(name, offset))
end